Fix recall computation for fewer than k groundtruth results#1069

Open
magdalendobson wants to merge 13 commits into main from users/magdalen/patch_recall_calculation

Conversation

@magdalendobson
Contributor

Groundtruth for filtered datasets uses the .rangeres format, since a filtered query may have few or no results. However, recall computation for top-k search currently errors when a query has fewer than k groundtruth results. This PR patches that error and computes recall for each point using the following paradigm:

  1. If the point has 0 < k' <= k groundtruth results, recall is the number of correctly reported points divided by k'.
  2. If the point has no groundtruth results, it is completely excluded from the recall calculation.

The average recall is then the sum of all per-point recalls divided by the number of points with non-empty groundtruth, so points with zero results do not affect recall. The reasoning is that in a dataset where filtered queries have majority-null results, we want recall to capture our performance on the queries that actually have results, rather than being artificially inflated because we always correctly report that these points have no results.
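The two rules above can be sketched as follows. This is an illustrative sketch, not the actual diskann-benchmark-core API: the function names, ID type, and tuple layout are stand-ins.

```rust
use std::collections::HashSet;

/// Recall for one query: |reported ∩ groundtruth| / k', where k' = groundtruth.len().
/// Returns None when the query has no groundtruth results, so it can be
/// excluded from the average entirely (rule 2).
fn recall_for_query(reported: &[u32], groundtruth: &[u32]) -> Option<f64> {
    if groundtruth.is_empty() {
        return None;
    }
    let gt: HashSet<u32> = groundtruth.iter().copied().collect();
    let hits = reported.iter().filter(|id| gt.contains(id)).count();
    Some(hits as f64 / groundtruth.len() as f64)
}

/// Average recall over queries with non-empty groundtruth; 0.0 if all are empty.
/// Each element pairs the reported neighbors with that query's groundtruth.
fn average_recall(per_query: &[(Vec<u32>, Vec<u32>)]) -> f64 {
    let recalls: Vec<f64> = per_query
        .iter()
        .filter_map(|(reported, gt)| recall_for_query(reported, gt))
        .collect();
    if recalls.is_empty() {
        0.0
    } else {
        recalls.iter().sum::<f64>() / recalls.len() as f64
    }
}
```

Note that a query with groundtruth size k' < k is scored out of k', so returning k candidates when only k' exist cannot be penalized for the missing slots.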

Since this error will no longer trigger in recall computation even for a regular top-k search, I have moved the not-enough-groundtruth error to the point where groundtruth in the regular bin format is actually read. This is an improvement: the failure now triggers before an index build and search, which is a better experience for the user. I leave some instances of NotEnoughGroundTruth inside recall computation as an extra failsafe, but they could be removed entirely if reviewers prefer.

Along the way, I remove the minimum and maximum values from the recall stats, since after discussion with the team it appears that no one uses them.

@magdalendobson
Contributor Author

@microsoft-github-policy-service agree company="Microsoft"

@magdalendobson magdalendobson marked this pull request as ready for review May 14, 2026 20:00
@magdalendobson magdalendobson requested review from a team and Copilot May 14, 2026 20:00
Contributor

Copilot AI left a comment


Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Fixes recall computation when filtered/variable-length groundtruth contains fewer than k results by computing per-query recall against the available groundtruth and excluding zero-result queries from the average.

Changes:

  • Update recall computation to handle per-row k' <= k groundtruth sizes and ignore empty groundtruth rows in the average.
  • Move “insufficient groundtruth for top-k” validation to groundtruth loading for fixed-size .bin groundtruth.
  • Remove minimum / maximum from reported recall metrics.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 9 comments.

File Description
diskann-benchmark/src/utils/recall.rs Removes min/max fields from the benchmark-facing recall metrics wrapper.
diskann-benchmark/src/utils/datafiles.rs Adds optional k validation when loading groundtruth from .bin.
diskann-benchmark/src/backend/index/benchmarks.rs Computes max recall_k across runs and passes it into groundtruth loading.
diskann-benchmark-core/src/recall.rs Implements variable-length groundtruth recall + removes min/max metrics and updates tests accordingly.


Comment on lines +118 to +126
if let Some(expected_k) = k {
    if groundtruth.ncols() != expected_k {
        return Err(anyhow::anyhow!(
            "Each row of groundtruth must have length {} (got {})",
            expected_k,
            groundtruth.ncols()
        ));
    }
}
Contributor Author


@copilot apply changes based on this feedback

Contributor


Fixed in f0b954d. Changed the check from ncols() != expected_k to ncols() < expected_k and updated the error message to "at least N neighbors" so .bin groundtruth files with more columns than the requested recall_k are accepted. Also updated the related test that was incorrectly expecting an error for dynamic groundtruth with fewer entries than recall_k — that's now valid behavior per the PR (filtered queries can have fewer results). Added test cases with non-zero values verifying correct recall values (full match → 1.0, partial match → 0.6) in 3de6550.
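The relaxed check described above can be sketched as follows. The `GroundTruth` struct and `validate_groundtruth` function here are hypothetical stand-ins for the actual types in diskann-benchmark/src/utils/datafiles.rs; only the `<` comparison and the "at least" wording reflect the described fix.

```rust
// Hypothetical stand-in for the loaded fixed-size .bin groundtruth.
struct GroundTruth {
    ncols: usize,
}

fn validate_groundtruth(gt: &GroundTruth, k: Option<usize>) -> Result<(), String> {
    if let Some(expected_k) = k {
        // `<` rather than `!=`: files with extra neighbors per row are accepted;
        // only rows with fewer neighbors than the requested recall_k fail.
        if gt.ncols < expected_k {
            return Err(format!(
                "Each row of groundtruth must have at least {} neighbors (got {})",
                expected_k, gt.ncols
            ));
        }
    }
    Ok(())
}
```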

Comment thread diskann-benchmark-core/src/recall.rs Outdated
Comment on lines 23 to 24
/// The average recall across all queries.
pub average: f64,
Contributor Author


@copilot apply changes based on this feedback

Contributor


Updated in fd8f772. The doc comment for the average field now clarifies that it's "The average recall across queries with non-empty groundtruth" and that "Queries with zero groundtruth results are excluded from the average."

Comment thread diskann-benchmark-core/src/recall.rs
Comment thread diskann-benchmark-core/src/recall.rs Outdated
Comment on lines +245 to +247
// Compute the average recall
let total: f64 = recall_values.iter().sum();
let average = (total) / (num_nonzero as f64);
Contributor Author


@copilot apply changes based on this feedback

Contributor


Added in fd8f772. Two new test cases:

  1. Mixed zero/non-zero groundtruth rows: 5 queries with groundtruth [1,2,3,4,5] (all match → recall=1.0 each) and 5 queries with empty groundtruth (excluded). Verifies average = 1.0 (denominator uses only 5 non-zero rows).
  2. All-zero groundtruth: All 10 queries have empty groundtruth. Verifies average = 0.0 (not NaN/inf).
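The second test case exercises the zero-denominator guard: with the quoted snippet's plain division, `num_nonzero == 0` would produce NaN. A minimal sketch of that guard, with variable names mirroring the quoted snippet but a hypothetical surrounding function:

```rust
// Average of per-query recalls over queries with non-empty groundtruth.
// Defined as 0.0 when every query had empty groundtruth, avoiding NaN
// from dividing by a zero `num_nonzero`.
fn average_of(recall_values: &[f64], num_nonzero: usize) -> f64 {
    if num_nonzero == 0 {
        return 0.0;
    }
    let total: f64 = recall_values.iter().sum();
    total / num_nonzero as f64
}
```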

Comment thread diskann-benchmark/src/backend/index/benchmarks.rs Outdated
Comment thread diskann-benchmark/src/utils/datafiles.rs
magdalendobson and others added 3 commits May 14, 2026 17:55
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
…more neighbors than needed

Agent-Logs-Url: https://github.com/microsoft/DiskANN/sessions/5f7cd8dc-8e4f-4c69-aea5-ad2c61baee52

Co-authored-by: magdalendobson <58752279+magdalendobson@users.noreply.github.com>
…oundtruth tests

Agent-Logs-Url: https://github.com/microsoft/DiskANN/sessions/5f7cd8dc-8e4f-4c69-aea5-ad2c61baee52

Co-authored-by: magdalendobson <58752279+magdalendobson@users.noreply.github.com>
Copilot AI and others added 2 commits May 14, 2026 22:09
…groundtruth mix

Agent-Logs-Url: https://github.com/microsoft/DiskANN/sessions/75b8623f-538e-4a4c-9762-12aac534c708

Co-authored-by: magdalendobson <58752279+magdalendobson@users.noreply.github.com>
Agent-Logs-Url: https://github.com/microsoft/DiskANN/sessions/75b8623f-538e-4a4c-9762-12aac534c708

Co-authored-by: magdalendobson <58752279+magdalendobson@users.noreply.github.com>
@codecov-commenter

codecov-commenter commented May 15, 2026

Codecov Report

❌ Patch coverage is 93.16239% with 8 lines in your changes missing coverage. Please review.
✅ Project coverage is 89.51%. Comparing base (f4757a3) to head (bdee1db).
⚠️ Report is 3 commits behind head on main.

Files with missing lines Patch % Lines
diskann-benchmark/src/backend/index/benchmarks.rs 30.00% 7 Missing ⚠️
diskann-benchmark/src/utils/datafiles.rs 96.55% 1 Missing ⚠️
Additional details and impacted files

Impacted file tree graph

@@           Coverage Diff           @@
##             main    #1069   +/-   ##
=======================================
  Coverage   89.51%   89.51%           
=======================================
  Files         459      459           
  Lines       85646    85707   +61     
=======================================
+ Hits        76663    76723   +60     
- Misses       8983     8984    +1     
Flag Coverage Δ
miri 89.51% <93.16%> (+<0.01%) ⬆️
unittests 89.14% <93.16%> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown.

Files with missing lines Coverage Δ
diskann-benchmark-core/src/recall.rs 97.99% <100.00%> (+0.18%) ⬆️
diskann-benchmark/src/backend/exhaustive/minmax.rs 100.00% <ø> (ø)
...iskann-benchmark/src/backend/exhaustive/product.rs 100.00% <ø> (ø)
...kann-benchmark/src/backend/exhaustive/spherical.rs 100.00% <ø> (ø)
diskann-benchmark/src/backend/index/spherical.rs 100.00% <ø> (ø)
diskann-benchmark/src/inputs/graph_index.rs 37.33% <100.00%> (+0.31%) ⬆️
diskann-benchmark/src/utils/recall.rs 57.14% <ø> (-5.36%) ⬇️
diskann-benchmark/src/utils/datafiles.rs 87.36% <96.55%> (+3.78%) ⬆️
diskann-benchmark/src/backend/index/benchmarks.rs 48.16% <30.00%> (+0.27%) ⬆️



Development

Successfully merging this pull request may close these issues.

Fix recall computation when fewer than k results are present

4 participants