Fix recall computation for fewer than k groundtruth results#1069
magdalendobson wants to merge 13 commits into
@microsoft-github-policy-service agree company="Microsoft"
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
Fixes recall computation when filtered/variable-length groundtruth contains fewer than k results by computing per-query recall against the available groundtruth and excluding zero-result queries from the average.
Changes:
- Update recall computation to handle per-row `k' <= k` groundtruth sizes and ignore empty groundtruth rows in the average.
- Move “insufficient groundtruth for top-k” validation to groundtruth loading for fixed-size `.bin` groundtruth.
- Remove `minimum`/`maximum` from reported recall metrics.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 9 comments.
| File | Description |
|---|---|
| diskann-benchmark/src/utils/recall.rs | Removes min/max fields from the benchmark-facing recall metrics wrapper. |
| diskann-benchmark/src/utils/datafiles.rs | Adds optional k validation when loading groundtruth from .bin. |
| diskann-benchmark/src/backend/index/benchmarks.rs | Computes max recall_k across runs and passes it into groundtruth loading. |
| diskann-benchmark-core/src/recall.rs | Implements variable-length groundtruth recall + removes min/max metrics and updates tests accordingly. |
if let Some(expected_k) = k {
    if groundtruth.ncols() != expected_k {
        return Err(anyhow::anyhow!(
            "Each row of groundtruth must have length {} (got {})",
            expected_k,
            groundtruth.ncols()
        ));
    }
}
@copilot apply changes based on this feedback
Fixed in f0b954d. Changed the check from ncols() != expected_k to ncols() < expected_k and updated the error message to "at least N neighbors" so .bin groundtruth files with more columns than the requested recall_k are accepted. Also updated the related test that was incorrectly expecting an error for dynamic groundtruth with fewer entries than recall_k — that's now valid behavior per the PR (filtered queries can have fewer results). Added test cases with non-zero values verifying correct recall values (full match → 1.0, partial match → 0.6) in 3de6550.
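The relaxed check described above can be sketched as follows. This is a minimal illustration, not the code from the commit: the function name `check_groundtruth_width` is hypothetical, `ncols` stands in for `groundtruth.ncols()`, a plain `String` error replaces `anyhow`, and the exact error wording is an assumption.

```rust
/// Hypothetical helper: verifies that a fixed-width groundtruth matrix has
/// at least `expected_k` columns. `ncols` stands in for `groundtruth.ncols()`.
fn check_groundtruth_width(ncols: usize, expected_k: Option<usize>) -> Result<(), String> {
    match expected_k {
        // Extra columns are acceptable; only too few neighbors is an error.
        Some(k) if ncols < k => Err(format!(
            "Each row of groundtruth must have at least {} neighbors (got {})",
            k, ncols
        )),
        // Either no k was requested or the file is wide enough.
        _ => Ok(()),
    }
}
```

With this shape, a `.bin` file holding 100 neighbors per row passes a check for `recall_k = 10`, while a file holding only 5 neighbors per row is rejected before any index build starts.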
/// The average recall across all queries.
pub average: f64,
@copilot apply changes based on this feedback
Updated in fd8f772. The doc comment for the average field now clarifies that it's "The average recall across queries with non-empty groundtruth" and that "Queries with zero groundtruth results are excluded from the average."
// Compute the average recall
let total: f64 = recall_values.iter().sum();
let average = (total) / (num_nonzero as f64);
@copilot apply changes based on this feedback
Added in fd8f772. Two new test cases:
- Mixed zero/non-zero groundtruth rows: 5 queries with groundtruth [1,2,3,4,5] (all match → recall=1.0 each) and 5 queries with empty groundtruth (excluded). Verifies average = 1.0 (denominator uses only 5 non-zero rows).
- All-zero groundtruth: All 10 queries have empty groundtruth. Verifies average = 0.0 (not NaN/inf).
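The behavior these two tests verify can be sketched as below. This is an illustrative reduction rather than the PR's implementation: the name `average_recall` and the use of `None` to mark a query with empty groundtruth are assumptions made for the example.

```rust
/// Averages per-query recall, skipping queries whose groundtruth is empty.
/// `None` marks a query with zero groundtruth results (an assumed encoding).
fn average_recall(per_query: &[Option<f64>]) -> f64 {
    let mut total = 0.0;
    let mut num_nonzero = 0usize;
    // `flatten` yields only the `Some` entries, so empty rows never reach
    // the numerator or the denominator.
    for recall in per_query.iter().flatten() {
        total += recall;
        num_nonzero += 1;
    }
    // Guard the all-empty case so the result is 0.0 rather than NaN.
    if num_nonzero == 0 {
        0.0
    } else {
        total / num_nonzero as f64
    }
}
```

The explicit guard on `num_nonzero` is what keeps the all-zero case from dividing by zero.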
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
…more neighbors than needed Agent-Logs-Url: https://github.com/microsoft/DiskANN/sessions/5f7cd8dc-8e4f-4c69-aea5-ad2c61baee52 Co-authored-by: magdalendobson <58752279+magdalendobson@users.noreply.github.com>
…oundtruth tests Agent-Logs-Url: https://github.com/microsoft/DiskANN/sessions/5f7cd8dc-8e4f-4c69-aea5-ad2c61baee52 Co-authored-by: magdalendobson <58752279+magdalendobson@users.noreply.github.com>
…groundtruth mix Agent-Logs-Url: https://github.com/microsoft/DiskANN/sessions/75b8623f-538e-4a4c-9762-12aac534c708 Co-authored-by: magdalendobson <58752279+magdalendobson@users.noreply.github.com>
Agent-Logs-Url: https://github.com/microsoft/DiskANN/sessions/75b8623f-538e-4a4c-9762-12aac534c708 Co-authored-by: magdalendobson <58752279+magdalendobson@users.noreply.github.com>
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## main #1069 +/- ##
=======================================
Coverage 89.51% 89.51%
=======================================
Files 459 459
Lines 85646 85707 +61
=======================================
+ Hits 76663 76723 +60
- Misses 8983 8984 +1
Groundtruth for filtered datasets uses the `.rangeres` format, since a filtered query may have few or no results. However, at the time of recall computation for top-k search, having fewer than `k` results causes an error. This PR patches that error and computes recall for each point using the following paradigm:

- For a point with `0 < k' <= k` groundtruth results, recall is the number of correctly reported points divided by `k'`.

The average recall is then the sum of all recalls divided by the number of points with non-zero groundtruth results. Points with zero results do not affect recall. The reasoning is that in a dataset where most filtered queries have null results, we want recall to capture our performance on the queries that actually have results, rather than being artificially high because we always correctly report that the other points have no results.

Since this error will no longer trigger in recall computation even for a regular top-k search, I have moved the not-enough-groundtruth error to the point where groundtruth in the regular bin format is actually read. This is an improvement: a failure now triggers before an index build and search, which is a better experience for the user. I have left some instances of `NotEnoughGroundTruth` inside recall computation as an extra failsafe, but they could be removed if reviewers feel we should drop them completely.

Along the way I removed the `minimum` and `maximum` values from the recall stats, since after some discussion with the team it appears that no one uses them.
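As a rough illustration of the per-point rule in this description, here is a self-contained sketch. It is not the code in `diskann-benchmark-core/src/recall.rs`: the function name `query_recall`, the `u32` id type, and returning `None` for a point with no groundtruth are all assumptions chosen for the example.

```rust
use std::collections::HashSet;

/// Recall for one query against variable-length groundtruth.
/// Returns `None` when the query has no groundtruth (excluded from the
/// average), otherwise hits / k' where k' = min(k, groundtruth.len()).
fn query_recall(results: &[u32], groundtruth: &[u32], k: usize) -> Option<f64> {
    let k_prime = groundtruth.len().min(k);
    if k_prime == 0 {
        return None;
    }
    // Only the first k' groundtruth ids count as correct answers.
    let truth: HashSet<u32> = groundtruth[..k_prime].iter().copied().collect();
    // Deduplicate the top-k reported ids so a repeated id is counted once.
    let reported: HashSet<u32> = results.iter().take(k).copied().collect();
    let hits = reported.intersection(&truth).count();
    Some(hits as f64 / k_prime as f64)
}
```

Under these assumptions, a query with only two groundtruth neighbors that finds both scores 1.0 at `k = 5`, instead of raising an error or being capped at 0.4.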