Add counterfactual dataset #615
Draft
Jiawen-CS wants to merge 48 commits into
added 7 commits
April 14, 2026 22:08
- CounterfactualEntry model: resolves base fields from bcbench.jsonl at load time
- Register COUNTERFACTUAL_EVALUATION in EvaluationCategory enum
- Reuse BugFixPipeline, BugFixResult, ExecutionBasedEvaluationResultSummary
- Add counterfactual-evaluation prompt template (same as bug-fix)
- Add leaderboard placeholder (docs/_data/counterfactual-evaluation.json)
- Add COUNTERFACTUAL.md documentation
- Add 8 tests for CF entry loading, schema, and category properties
Structure (restructured from old thesis modules):
- src/bcbench/analysis/family.py: FamilyOutcome, FamilyType, InstanceResult
- src/bcbench/analysis/aggregator.py: build_families() (was family_aggregator.py)
- src/bcbench/analysis/metrics.py: fragility_rate, severity, layer distribution (was evaluator/thesis_metrics.py)
- src/bcbench/analysis/annotation.py: failure sampling + CSV export (was sample_failures.py)
- src/bcbench/types.py: FailureLayer enum (L1-L5)
- evaluator/counterfactual_scores.py: Braintrust scorers (was thesis_scores.py)
- notebooks updated to use proper imports instead of inline stubs
- Removed stale notebooks/thesis/ folder
- 20 new tests (444 total)
…sis/counterfactual-dataset
Replace single COUNTERFACTUAL_EVALUATION category with CF_1, CF_2, CF_3, CF_4 to enable batched GitHub Actions runs per variant number.
- Add is_counterfactual, cf_variant, prompt_template_key properties
- Filter entries by __cf-N suffix in dataset list command
- Share counterfactual-template across all CF categories
- Create per-category leaderboard files (cf-1.json..cf-4.json)
- Update copilot-instructions.md and COUNTERFACTUAL.md
- Update tests for new category structure
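The `__cf-N` suffix convention above could be parsed roughly as follows. This is a minimal sketch, not the code in this PR: the standalone functions `cf_variant` and `is_counterfactual` only borrow the names of the properties listed in the commit, and the regular expression is an assumption based on instance IDs such as `NAV-226875__cf-1`.

```python
import re
from typing import Optional

# Assumed ID shape: "<base instance id>__cf-<variant number>".
_CF_SUFFIX = re.compile(r"__cf-(\d+)$")


def cf_variant(instance_id: str) -> Optional[int]:
    """Return the variant number N for an ID ending in __cf-N, else None."""
    match = _CF_SUFFIX.search(instance_id)
    return int(match.group(1)) if match else None


def is_counterfactual(instance_id: str) -> bool:
    """True when the ID carries a __cf-N suffix."""
    return cf_variant(instance_id) is not None


def filter_by_variant(instance_ids: list[str], variant: int) -> list[str]:
    """Keep only the IDs belonging to one variant number (e.g. a CF_2 batch)."""
    return [i for i in instance_ids if cf_variant(i) == variant]
```

Filtering by variant number is what would let each CF category run as its own batch, since every batch then sees at most one variant per base instance.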
haoranpb
reviewed
Apr 16, 2026
…sis/counterfactual-dataset
…crosoft/BC-Bench into thesis/counterfactual-dataset
Fix 11 CF dataset entries with compile errors:
- NAV-226875__cf-1: fix GetVendorName() proc nested inside UpdateBalance()
- NAV-226223__cf-1: move if-exit from var section to begin block
- NAV-220452__cf-1: fix missing end for FindFirst/begin block
- NAV-215972__cf-1: fix CreateTransferHeader -> use LibraryInventory
- NAV-213629__cf-1: fix Item ref -> use Resource.FindFirst()
- NAV-215645__cf-1: fix LibraryInventory ref -> use ContractTestLibrary
- NAV-224447__cf-1: fix test proc inserted inside another proc body
- NAV-227219, NAV-188438, NAV-216057, NAV-222488: verified already fixed
All patches verified to apply cleanly at their base commits. All CF entries confirmed to differ semantically from their base entries.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- NAV-174794: Move posting group override from InitGenJnlLine to OnRun call site, using the already-loaded Customer local variable. This avoids potential issues with a local Customer.Get() inside InitGenJnlLine while keeping the conditional Allow Multiple check.
- NAV-205825: Fix test to use non-empty Ship-to Country scenario where alt VAT reg SHOULD be applied. The old test used empty ship-to country which produces the same result with or without the code patch.
- NAV-201169: Add SkippedRecord logging in the codeunit OnRun instead of just exiting silently when Item.Description is empty. The test expects a skipped record entry which the old code patch never created.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Fix 4 counterfactual entries whose CF patches were semantically identical to base patches (comment/whitespace-only), causing tests to pass when expected to fail (TEST_3b).
Entries fixed:
- NAV-203923__cf-1: Wrap exit(0) with Full VAT type guard at 3 locations (nested if instead of adding to in-list)
- NAV-174794__cf-1: Conditional posting group assignment from memo header (if <> '' guard instead of unconditional assignment)
- NAV-209737__cf-1: Guard on Line Discount Amount instead of Line Discount % (different field, functionally equivalent)
- NAV-224009__cf-1: Use exclusion filter (<>Surplus) instead of inclusion filter (Reservation|Tracking); test uses 200+240 lot split
Each entry also required minimal test_patch assertion changes so the test detects the bug at base_commit (required for step 2 of validation).
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…sis/counterfactual-dataset
Recompute hunk header line counts via fix_patch_hunks() for entries whose patch/test_patch fields had stale counts after re-serialization.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
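For reviewers unfamiliar with the failure mode, here is a minimal sketch of what recomputing hunk header counts involves. It is not the repository's `fix_patch_hunks()` (which, per a later commit, returns a `(str, list)` tuple); the name `recount_hunks` is hypothetical, and the sketch assumes git-style unified diffs where each file section starts with a `diff --git` line.

```python
import re

# Matches "@@ -old_start[,old_count] +new_start[,new_count] @@ optional text".
_HUNK_HEADER = re.compile(r"^@@ -(\d+)(?:,\d+)? \+(\d+)(?:,\d+)? @@(.*)$")


def recount_hunks(patch: str) -> str:
    """Rewrite every hunk header so its line counts match the hunk body."""
    lines = patch.splitlines()
    out: list[str] = []
    i = 0
    while i < len(lines):
        header = _HUNK_HEADER.match(lines[i])
        if header is None:
            out.append(lines[i])  # file headers and other metadata pass through
            i += 1
            continue
        old_start, new_start, tail = header.groups()
        # Collect the hunk body: everything up to the next hunk or file section.
        body: list[str] = []
        j = i + 1
        while j < len(lines) and not lines[j].startswith(("@@ ", "diff --git ")):
            body.append(lines[j])
            j += 1
        # Context lines (" " or an empty line) count on both sides; "-" lines
        # only on the old side; "+" lines only on the new side. Lines starting
        # with "\" ("No newline at end of file") count on neither side.
        old_count = sum(1 for line in body if line[:1] in (" ", "-", ""))
        new_count = sum(1 for line in body if line[:1] in (" ", "+", ""))
        out.append(f"@@ -{old_start},{old_count} +{new_start},{new_count} @@{tail}")
        out.extend(body)
        i = j
    return "\n".join(out) + ("\n" if patch.endswith("\n") else "")
```

A stale header such as `@@ -1,9 +1,9 @@` over a five-line body is the kind of mismatch that makes `git apply` reject an otherwise valid patch; the start offsets are left untouched because only the counts go stale when a body is edited.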
TEST_3a fixes (tests expected to pass but failed):
- NAV-205825: Change guard from Ship-to Country/Region Code to Ship-to Code
- NAV-207236: Add missing SalesBatchPostMgt changes (keep no-recalculate variant)
- NAV-208649: Remove QtyRoundPrecision override to preserve fractional values
- NAV-208320: Add same-customer guard in OnLookup triggers
- NAV-216572: Change Percent > 0 to >= 0 to not block zero allocations
- NAV-208851: Nest Status=Active check inside Source Type check
- NAV-218323: Make bypass conditional on preview mode (not PostSettlement)
- NAV-222092: Use >= 0 guard on Dimension Set ID (effectively non-blocking)
- NAV-218995: Add Job Line Type <> blank condition for exit bypass
- NAV-226448: Add < 100 condition for partial non-deductible VAT only
- NAV-209737: Change Line Discount Amount to Inv. Discount Amount
TEST_3b fixes (tests expected to fail but passed - designed semantic deviations):
- NAV-207878: Remove AvailableInventory fallback in AbleToAssemble calculation
- NAV-223493: Use External Document No. instead of Your Reference (wrong field)
- NAV-221877: Remove empty-check guard, validate Repair Status Code unconditionally
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- NAV-208320: Replace non-existent 'Bill-to Contact E-Mail' field with Contact record email preservation using OldContact/NewContact variables
- NAV-207247: Replace ::Released enum (doesn't exist at base_commit) with ::"Firm Planned" in test_patch helpers
- NAV-214557: Rebuild test_patch from base to fix syntax error (procedures were inserted inside existing procedure bodies)
- NAV-217974: Replace no-op integration event with direct ToSalesHeader.UpdateShipToAddress() call
- NAV-216572: Rebuild code patch with extended hunk and inline logic to avoid cross-procedure hunk application failure
- NAV-217797: Replace early exit with CurrentJnlBatchName assignment from existing filter to maintain FilterGroup setup
- NAV-221877: Reorder Modify(true) before Validate to ensure service item line is committed before status update triggers
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Fixes for counterfactual entries failing validation:
COMPILE fixes (code patch):
- NAV-188438: Replace non-existent field 'Emission Factor CO2' with valid 'CO2' boolean
- NAV-216057: Replace undefined RequisitionLine with OldReservEntry.Planning Flexibility
- NAV-222488: Replace undefined ConfirmFullRegistration() with Confirm() builtin
- NAV-226223: Declare IsTriggeredByVATLoop as local Boolean; fix var/begin structure
- NAV-226875: Fix GetVendorName procedure placement; add CurrFieldNo condition
- NAV-227219: Qualify Quantity as TransLine.Quantity
- NAV-220452: Fix begin/end mismatch in FindFirst block
- NAV-176426: Add Item: Record Item var declaration to UpdateNodeCosts procedure
- NAV-178045: Change ItemLedgEntry.Count to ItemLedgEntry.Count() (style match)
COMPILE fixes (test_patch):
- NAV-215972: Replace undefined CreateTransferHeader with proper library calls
- NAV-215645: Replace unavailable LibraryInventory with ContractTestLibrary
- NAV-224447: Remove duplicate procedure definitions in test patch
TEST_3b redesigns (CF was semantically identical to base):
- NAV-174794: Use customer primary posting group instead of alternate
- NAV-180484: Change condition to preserve Created status when linked
- NAV-185792: Invert Job Quantity gate to trigger on zero quantity only
Structural fixes:
- NAV-176082: Remove unused FindPurchRcptHeader/GetPurchRcptHeader procedures
- NAV-174087: Reorder Modify before CopyFromSegment for persistence semantics
- NAV-175765: Remove UndoTransferShipment wiring, keep procedure definition
- NAV-176194: Fix context lines for both ItemChargeAssignment table files
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Includes 60 cf-1 entries with prior fixes and 6 cf-2 PATCH entries auto-fixed by fix_patch_hunks() (run 24894481285).
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Fix compile errors in 14 cf-2 entries from GitHub Actions run 24894481285.
Patch fixes (7 entries):
- NAV-208320: AL0124 CurrPage in wrong context - regenerate OnLookup trigger placement
- NAV-218323: AL0175 Text/Integer mismatch - wrap Report enum in Format()
- NAV-216057: AL0132 IsTrackingRequired missing - use GetWhseItemTrkgSetup instead
- NAV-213524: AL0132 Status missing on IC Outbox Transaction - use Line Action field
- NAV-218856: AL0104 code between var/begin - move executable code after begin
- NAV-220984: AL0118 DeleteExchangedComponent wrong name - fix to DeleteExcComp
- NAV-226223: AL0104 code between var/begin - move exit check after VATEntryNoTotal
Patch + var fix (1 entry):
- NAV-213741: AL0118 ItemLedgerEntry not in scope - add local var + Get call
Test_patch fixes (6 entries):
- NAV-226448: test helpers inserted inside existing procedure - move to end of codeunit
- NAV-223493: new test inserted inside existing procedure - fix insertion point
- NAV-224447: new test inserted inside existing procedure - fix insertion point
- NAV-220452: new test inserted inside existing procedure - fix insertion point
- NAV-226004: test code scattered in wrong procedures - regenerate with base structure
- BCApps-4822: missing end; on new test procedure - regenerate from base
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Fix CF-2 patches that caused PASS_TO_PASS test failures (TEST_3a) from GitHub Actions run 24894481285.
Entries fixed:
- NAV-181900: Swap approval entry lookup priority (general first, user fallback)
- NAV-207236: Add SalesBatchPostMgt ManualReopen infrastructure for batch posting
- NAV-182354: Include OnInit date filter change (WorkDate->Today) for cue value
- NAV-207878: Two-level positive inventory check with proportional fallback
- NAV-205825: Add Validate before direct field assignments for VAT propagation
- NAV-218786: Move Action Type/Source Document checks after FindFirst()
- NAV-213683: Replace inverted RunTrigger guard with Item No. equality check
- NAV-217797: Replace Count=1 check with GetFilter check for template name
- NAV-222488: Use activity line Source Document instead of header
- NAV-227358: Gate only field clear (not Modify) on positive Amount check
- NAV-218253: Narrow ShipmentExists exit to Invoice document type only
- NAV-218062: Use else-if structure instead of or for availability check
- NAV-224009: Use <=Tracking filter instead of %1|%2 enumeration filter
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
… to strings
fix_patch_hunks() returns (str, list) but the auto-fixer stored the tuple directly. Extract [0] for each affected field.
Entries: NAV-215645, NAV-218995, NAV-219082, NAV-220314, NAV-223819, NAV-227240
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
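The repair described in this commit can be illustrated with a short sketch. The helper name `unwrap_patch_fields` and the default field names are assumptions for illustration; the only fact taken from the commit is that the affected fields held a `(str, list)` pair where a string was expected. After a JSON round trip such a tuple appears as a two-element list, so the sketch accepts both.

```python
from typing import Any


def unwrap_patch_fields(
    entry: dict[str, Any],
    fields: tuple[str, ...] = ("patch", "test_patch"),  # assumed field names
) -> dict[str, Any]:
    """Return a copy of entry where tuple/list-valued patch fields become strings."""
    fixed = dict(entry)
    for name in fields:
        value = fixed.get(name)
        # A (patch_text, fixes) pair was stored by mistake: keep only the text.
        if isinstance(value, (tuple, list)):
            fixed[name] = value[0]
    return fixed
```

Fields that already hold a string are left alone, so running the helper over the whole dataset would be safe to repeat.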
Fix 3 categories of patch issues in counterfactual.jsonl:
Category 1 - Hunk count mismatches (7 entries):
- NAV-185696__cf-1: test_patch hunk 3 header count
- NAV-206135__cf-1: test_patch hunk 1 header count
- NAV-176426__cf-1: patch hunk 2 header count
- NAV-177493__cf-1: test_patch hunk 2 header count
- NAV-179733__cf-1: test_patch hunk 1 header count
- NAV-185488__cf-1: test_patch hunk 1 header count
- NAV-182354__cf-1: test_patch hunk 3 header count
Category 2 - Patch doesn't apply (2 entries):
- NAV-181900__cf-1: test_patch hunk 2 had wrong context line
(PendingApprovalLbl instead of ImposedRestrictionLbl)
- NAV-191624__cf-1: patch hunk 1 had blank lines instead of
correct closing brace context
Category 3 - Syntax error (1 entry):
- NAV-214557__cf-3: test_patch hunks inserted code inside
existing procedures instead of after them; rebuilt hunks
using CF-1 structure as template
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…sis/counterfactual-dataset
Remove 82 counterfactual entries that failed dataset validation across three CI runs (cf-1, cf-2, cf-3):
- CF-1 run #24959405870: 47 failures removed
- CF-2 run #24959409994: 31 failures removed
- CF-3 run #24959414223: 4 failures removed
Failure categories:
- Corrupt/malformed patches (cannot apply)
- Compile errors (patch applies but code doesn't build)
- Tests should pass but failed (gold patch doesn't fix tests)
- Tests should fail but passed (CF not diagnostic)
Renumbered 42 remaining CF variants to be sequential (cf-1, cf-2, cf-3) per base instance, updating both counterfactual.jsonl and problem statement directories.
Result: 204 -> 122 CF entries across 71 base instances.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
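The renumbering step can be sketched as a pure mapping from old IDs to new ones. This is an illustrative sketch rather than the script used for this commit: `renumber_variants` is a hypothetical name and the `<base>__cf-<N>` ID shape is assumed from the entries listed in this PR.

```python
import re
from collections import defaultdict

# Assumed ID shape: "<base instance id>__cf-<variant number>".
_CF_ID = re.compile(r"^(?P<base>.+)__cf-(?P<n>\d+)$")


def renumber_variants(instance_ids: list[str]) -> dict[str, str]:
    """Map each surviving CF ID to a gap-free ID (cf-1, cf-2, ...) per base instance."""
    by_base: dict[str, list[tuple[int, str]]] = defaultdict(list)
    for instance_id in instance_ids:
        match = _CF_ID.match(instance_id)
        if match is None:
            raise ValueError(f"not a counterfactual ID: {instance_id!r}")
        by_base[match["base"]].append((int(match["n"]), instance_id))

    mapping: dict[str, str] = {}
    for base, variants in by_base.items():
        # Preserve the original relative order of the surviving variants.
        for new_n, (_, old_id) in enumerate(sorted(variants), start=1):
            mapping[old_id] = f"{base}__cf-{new_n}"
    return mapping
```

Because new numbers are never larger than the old ones, applying the mapping to on-disk directories in ascending order of the old variant number avoids renaming onto a directory that still exists.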
…sis/counterfactual-dataset
…sis/counterfactual-dataset
This project investigates transfer failures in large language models (LLMs) when generating code for niche programming languages, using AL as a case study.
We design BC-Bench-CF, a benchmark suite that includes realistic AL development tasks and minimal counterfactual variants. The goal is to evaluate not only functional correctness, but also robustness to small specification changes and sensitivity to AL-specific execution semantics.
Our analysis is grounded in a layered failure framework, which attributes model errors to different abstraction levels, including syntax, validation semantics, event-driven paradigms, workflow composition, and ecosystem constraints.