
Add counterfactual dataset #615

Draft
Jiawen-CS wants to merge 48 commits into main from thesis/counterfactual-dataset

@Jiawen-CS Jiawen-CS commented Apr 16, 2026

This project investigates transfer failures in large language models (LLMs) when generating code for niche programming languages, using AL as a case study.

We design BC-Bench-CF, a benchmark suite that includes realistic AL development tasks and minimal counterfactual variants. The goal is to evaluate not only functional correctness, but also robustness to small specification changes and sensitivity to AL-specific execution semantics.

Our analysis is grounded in a layered failure framework that attributes each model error to one of five abstraction layers: syntax, validation semantics, event-driven paradigms, workflow composition, and ecosystem constraints.

Jiawen Sun added 7 commits April 14, 2026 22:08
- CounterfactualEntry model: resolves base fields from bcbench.jsonl at load time
- Register COUNTERFACTUAL_EVALUATION in EvaluationCategory enum
- Reuse BugFixPipeline, BugFixResult, ExecutionBasedEvaluationResultSummary
- Add counterfactual-evaluation prompt template (same as bug-fix)
- Add leaderboard placeholder (docs/_data/counterfactual-evaluation.json)
- Add COUNTERFACTUAL.md documentation
- Add 8 tests for CF entry loading, schema, and category properties
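
A minimal sketch of how a counterfactual entry might resolve its base fields at load time. This is not the repository's `CounterfactualEntry`: the field name `instance_id`, the merge rule (CF values override base values), and the helper `load_counterfactuals` are assumptions for illustration. Only `base_commit`, `patch`, `test_patch`, and the `__cf-N` suffix come from the commit messages on this page.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class CounterfactualEntry:
    instance_id: str       # e.g. "NAV-226875__cf-1"
    base_instance_id: str  # e.g. "NAV-226875"
    base_commit: str       # inherited from the base entry unless overridden
    patch: str
    test_patch: str


def load_counterfactuals(base_lines, cf_lines):
    """Join CF rows to base rows (both given as JSONL lines).

    Assumption: a field missing from the CF row is inherited from the base row.
    """
    base_by_id = {row["instance_id"]: row for row in map(json.loads, base_lines)}
    entries = []
    for line in cf_lines:
        cf = json.loads(line)
        base_id = cf["instance_id"].rsplit("__cf-", 1)[0]
        base = base_by_id[base_id]  # raises KeyError for an orphaned CF row
        merged = {**base, **cf}     # CF values win over base values
        entries.append(
            CounterfactualEntry(
                instance_id=cf["instance_id"],
                base_instance_id=base_id,
                base_commit=merged["base_commit"],
                patch=merged["patch"],
                test_patch=merged["test_patch"],
            )
        )
    return entries
```

Resolving at load time keeps the CF file small and means a corrected `base_commit` in the base dataset propagates to every variant without a second edit.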

Structure (restructured from old thesis modules):
- src/bcbench/analysis/family.py: FamilyOutcome, FamilyType, InstanceResult
- src/bcbench/analysis/aggregator.py: build_families() (was family_aggregator.py)
- src/bcbench/analysis/metrics.py: fragility_rate, severity, layer distribution (was evaluator/thesis_metrics.py)
- src/bcbench/analysis/annotation.py: failure sampling + CSV export (was sample_failures.py)
- src/bcbench/types.py: FailureLayer enum (L1-L5)
- evaluator/counterfactual_scores.py: Braintrust scorers (was thesis_scores.py)
- notebooks updated to use proper imports instead of inline stubs
- Removed stale notebooks/thesis/ folder
- 20 new tests (444 total)
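
The analysis modules above group each base instance with its counterfactual variants and compute a fragility rate. The commit message does not define that metric, so the sketch below shows only one plausible reading: among families whose base instance was resolved, the share with at least one unresolved variant. The `Family` class, the `(instance_id, resolved)` input shape, and both function bodies are hypothetical and are not taken from `aggregator.py` or `metrics.py`.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Family:
    base_id: str
    base_resolved: Optional[bool] = None
    variant_resolved: dict = field(default_factory=dict)  # instance_id -> bool


def build_families(results):
    """Group (instance_id, resolved) pairs by base instance."""
    families = {}
    for instance_id, resolved in results:
        base_id, sep, _ = instance_id.partition("__cf-")
        fam = families.setdefault(base_id, Family(base_id))
        if sep:  # the id carried a __cf-N suffix
            fam.variant_resolved[instance_id] = resolved
        else:
            fam.base_resolved = resolved
    return families


def fragility_rate(families):
    """Assumed definition: fragile families / families with a resolved base."""
    eligible = [
        f for f in families.values() if f.base_resolved and f.variant_resolved
    ]
    if not eligible:
        return 0.0
    fragile = sum(1 for f in eligible if not all(f.variant_resolved.values()))
    return fragile / len(eligible)
```

Conditioning on a resolved base separates "the model could not do the task at all" from "the model did the task but broke under a small specification change", which is the distinction the PR description emphasizes.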

Replace single COUNTERFACTUAL_EVALUATION category with CF_1, CF_2, CF_3, CF_4
to enable batched GitHub Actions runs per variant number.

- Add is_counterfactual, cf_variant, prompt_template_key properties
- Filter entries by __cf-N suffix in dataset list command
- Share counterfactual-template across all CF categories
- Create per-category leaderboard files (cf-1.json..cf-4.json)
- Update copilot-instructions.md and COUNTERFACTUAL.md
- Update tests for new category structure
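
As a rough illustration of the category split, the sketch below shows how the three properties and the suffix filter could fit together. The enum values, the `BUG_FIX` member, and the `filter_entries` helper are assumptions; only the property names, the `CF_1`..`CF_4` members, the `__cf-N` suffix, and the shared `counterfactual-template` key come from the commit message.

```python
import re
from enum import Enum

_CF_SUFFIX = re.compile(r"__cf-(\d+)$")


class EvaluationCategory(Enum):
    BUG_FIX = "bug-fix"  # assumed value, for contrast with the CF members
    CF_1 = "cf-1"
    CF_2 = "cf-2"
    CF_3 = "cf-3"
    CF_4 = "cf-4"

    @property
    def is_counterfactual(self):
        return self.value.startswith("cf-")

    @property
    def cf_variant(self):
        """Variant number for CF categories, None otherwise."""
        return int(self.value.split("-")[1]) if self.is_counterfactual else None

    @property
    def prompt_template_key(self):
        # All CF categories share a single template.
        return "counterfactual-template" if self.is_counterfactual else self.value


def filter_entries(instance_ids, category):
    """Keep only the ids that belong to the given category."""
    if not category.is_counterfactual:
        return [i for i in instance_ids if not _CF_SUFFIX.search(i)]
    return [
        i
        for i in instance_ids
        if (m := _CF_SUFFIX.search(i)) and int(m.group(1)) == category.cf_variant
    ]
```

Keying the batch on the variant number means each GitHub Actions run touches at most one variant per base instance.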
Comment thread evaluator/counterfactual_scores.py
Jiawen Sun and others added 22 commits April 16, 2026 15:28
Fix 11 CF dataset entries with compile errors:
- NAV-226875__cf-1: fix GetVendorName() proc nested inside UpdateBalance()
- NAV-226223__cf-1: move if-exit from var section to begin block
- NAV-220452__cf-1: fix missing end for FindFirst/begin block
- NAV-215972__cf-1: fix CreateTransferHeader -> use LibraryInventory
- NAV-213629__cf-1: fix Item ref -> use Resource.FindFirst()
- NAV-215645__cf-1: fix LibraryInventory ref -> use ContractTestLibrary
- NAV-224447__cf-1: fix test proc inserted inside another proc body
- NAV-227219, NAV-188438, NAV-216057, NAV-222488: verified already fixed

All patches verified to apply cleanly at their base commits.
All CF entries confirmed to differ semantically from their base entries.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- NAV-174794: Move posting group override from InitGenJnlLine to OnRun
  call site, using the already-loaded Customer local variable. This
  avoids potential issues with a local Customer.Get() inside
  InitGenJnlLine while keeping the conditional Allow Multiple check.

- NAV-205825: Fix test to use non-empty Ship-to Country scenario where
  alt VAT reg SHOULD be applied. The old test used empty ship-to
  country which produces the same result with or without the code patch.

- NAV-201169: Add SkippedRecord logging in the codeunit OnRun instead
  of just exiting silently when Item.Description is empty. The test
  expects a skipped record entry which the old code patch never created.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Fix 4 counterfactual entries whose CF patches were semantically identical
to base patches (comment/whitespace-only), causing tests to pass when
expected to fail (TEST_3b).

Entries fixed:
- NAV-203923__cf-1: Wrap exit(0) with Full VAT type guard at 3 locations
  (nested if instead of adding to in-list)
- NAV-174794__cf-1: Conditional posting group assignment from memo header
  (if <> '' guard instead of unconditional assignment)
- NAV-209737__cf-1: Guard on Line Discount Amount instead of Line Discount %
  (different field, functionally equivalent)
- NAV-224009__cf-1: Use exclusion filter (<>Surplus) instead of inclusion
  filter (Reservation|Tracking); test uses 200+240 lot split

Each entry also required minimal test_patch assertion changes so the test
detects the bug at base_commit (required for step 2 of validation).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Jiawen Sun and others added 19 commits April 24, 2026 16:21
Recompute hunk header line counts via fix_patch_hunks() for entries
whose patch/test_patch fields had stale counts after re-serialization.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
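
For readers unfamiliar with stale hunk counts: a unified-diff header `@@ -a,b +c,d @@` must state how many old lines (`b`) and new lines (`d`) the hunk body contains, and a hand-edited body leaves those counts wrong. The sketch below recomputes them. It is not the repository's `fix_patch_hunks()`; the name `recount_hunk_headers` is hypothetical, and it assumes git-style patches in which every file header is preceded by a `diff --git` line.

```python
import re

_HUNK = re.compile(
    r"^@@ -(?P<old_start>\d+)(?:,\d+)? \+(?P<new_start>\d+)(?:,\d+)? @@(?P<tail>.*)$"
)
_BODY_PREFIXES = (" ", "+", "-", "\\")


def recount_hunk_headers(patch):
    """Return (fixed_patch, rewritten_headers) with b and d recomputed per hunk."""
    lines = patch.split("\n")
    out, rewritten = [], []
    i = 0
    while i < len(lines):
        m = _HUNK.match(lines[i])
        if m is None:
            out.append(lines[i])
            i += 1
            continue
        # Scan the hunk body: it ends at the first line that is not a
        # context (" "), added ("+"), removed ("-"), or "\ No newline" line.
        j = i + 1
        old = new = 0
        while j < len(lines) and lines[j][:1] in _BODY_PREFIXES:
            tag = lines[j][0]
            if tag in (" ", "-"):
                old += 1
            if tag in (" ", "+"):
                new += 1
            j += 1
        header = (
            f"@@ -{m['old_start']},{old} +{m['new_start']},{new} @@{m['tail']}"
        )
        if header != lines[i]:
            rewritten.append(header)
        out.append(header)
        out.extend(lines[i + 1 : j])
        i = j
    return "\n".join(out), rewritten
```

Returning the rewritten headers alongside the patch text makes it easy to log which entries actually changed, and running the function on its own output should report nothing.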
TEST_3a fixes (tests expected to pass but failed):
- NAV-205825: Change guard from Ship-to Country/Region Code to Ship-to Code
- NAV-207236: Add missing SalesBatchPostMgt changes (keep no-recalculate variant)
- NAV-208649: Remove QtyRoundPrecision override to preserve fractional values
- NAV-208320: Add same-customer guard in OnLookup triggers
- NAV-216572: Change Percent > 0 to >= 0 to not block zero allocations
- NAV-208851: Nest Status=Active check inside Source Type check
- NAV-218323: Make bypass conditional on preview mode (not PostSettlement)
- NAV-222092: Use >= 0 guard on Dimension Set ID (effectively non-blocking)
- NAV-218995: Add Job Line Type <> blank condition for exit bypass
- NAV-226448: Add < 100 condition for partial non-deductible VAT only
- NAV-209737: Change Line Discount Amount to Inv. Discount Amount

TEST_3b fixes (tests expected to fail but passed - designed semantic deviations):
- NAV-207878: Remove AvailableInventory fallback in AbleToAssemble calculation
- NAV-223493: Use External Document No. instead of Your Reference (wrong field)
- NAV-221877: Remove empty-check guard, validate Repair Status Code unconditionally

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- NAV-208320: Replace non-existent 'Bill-to Contact E-Mail' field with
  Contact record email preservation using OldContact/NewContact variables
- NAV-207247: Replace ::Released enum (doesn't exist at base_commit) with
  ::"Firm Planned" in test_patch helpers
- NAV-214557: Rebuild test_patch from base to fix syntax error (procedures
  were inserted inside existing procedure bodies)
- NAV-217974: Replace no-op integration event with direct
  ToSalesHeader.UpdateShipToAddress() call
- NAV-216572: Rebuild code patch with extended hunk and inline logic to
  avoid cross-procedure hunk application failure
- NAV-217797: Replace early exit with CurrentJnlBatchName assignment from
  existing filter to maintain FilterGroup setup
- NAV-221877: Reorder Modify(true) before Validate to ensure service item
  line is committed before status update triggers

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Fixes for counterfactual entries failing validation:

COMPILE fixes (code patch):
- NAV-188438: Replace non-existent field 'Emission Factor CO2' with valid 'CO2' boolean
- NAV-216057: Replace undefined RequisitionLine with OldReservEntry.Planning Flexibility
- NAV-222488: Replace undefined ConfirmFullRegistration() with Confirm() builtin
- NAV-226223: Declare IsTriggeredByVATLoop as local Boolean; fix var/begin structure
- NAV-226875: Fix GetVendorName procedure placement; add CurrFieldNo condition
- NAV-227219: Qualify Quantity as TransLine.Quantity
- NAV-220452: Fix begin/end mismatch in FindFirst block
- NAV-176426: Add Item: Record Item var declaration to UpdateNodeCosts procedure
- NAV-178045: Change ItemLedgEntry.Count to ItemLedgEntry.Count() (style match)

COMPILE fixes (test_patch):
- NAV-215972: Replace undefined CreateTransferHeader with proper library calls
- NAV-215645: Replace unavailable LibraryInventory with ContractTestLibrary
- NAV-224447: Remove duplicate procedure definitions in test patch

TEST_3b redesigns (CF was semantically identical to base):
- NAV-174794: Use customer primary posting group instead of alternate
- NAV-180484: Change condition to preserve Created status when linked
- NAV-185792: Invert Job Quantity gate to trigger on zero quantity only

Structural fixes:
- NAV-176082: Remove unused FindPurchRcptHeader/GetPurchRcptHeader procedures
- NAV-174087: Reorder Modify before CopyFromSegment for persistence semantics
- NAV-175765: Remove UndoTransferShipment wiring, keep procedure definition
- NAV-176194: Fix context lines for both ItemChargeAssignment table files

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Includes 60 cf-1 entries with prior fixes and 6 cf-2 PATCH entries
auto-fixed by fix_patch_hunks() (run 24894481285).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Fix compile errors in 14 cf-2 entries from GitHub Actions run 24894481285.

Patch fixes (7 entries):
- NAV-208320: AL0124 CurrPage in wrong context - regenerate OnLookup trigger placement
- NAV-218323: AL0175 Text/Integer mismatch - wrap Report enum in Format()
- NAV-216057: AL0132 IsTrackingRequired missing - use GetWhseItemTrkgSetup instead
- NAV-213524: AL0132 Status missing on IC Outbox Transaction - use Line Action field
- NAV-218856: AL0104 code between var/begin - move executable code after begin
- NAV-220984: AL0118 DeleteExchangedComponent wrong name - fix to DeleteExcComp
- NAV-226223: AL0104 code between var/begin - move exit check after VATEntryNoTotal

Patch + var fix (1 entry):
- NAV-213741: AL0118 ItemLedgerEntry not in scope - add local var + Get call

Test_patch fixes (6 entries):
- NAV-226448: test helpers inserted inside existing procedure - move to end of codeunit
- NAV-223493: new test inserted inside existing procedure - fix insertion point
- NAV-224447: new test inserted inside existing procedure - fix insertion point
- NAV-220452: new test inserted inside existing procedure - fix insertion point
- NAV-226004: test code scattered in wrong procedures - regenerate with base structure
- BCApps-4822: missing end; on new test procedure - regenerate from base

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Fix CF-2 patches that caused PASS_TO_PASS test failures (TEST_3a) from
GitHub Actions run 24894481285.

Entries fixed:
- NAV-181900: Swap approval entry lookup priority (general first, user fallback)
- NAV-207236: Add SalesBatchPostMgt ManualReopen infrastructure for batch posting
- NAV-182354: Include OnInit date filter change (WorkDate->Today) for cue value
- NAV-207878: Two-level positive inventory check with proportional fallback
- NAV-205825: Add Validate before direct field assignments for VAT propagation
- NAV-218786: Move Action Type/Source Document checks after FindFirst()
- NAV-213683: Replace inverted RunTrigger guard with Item No. equality check
- NAV-217797: Replace Count=1 check with GetFilter check for template name
- NAV-222488: Use activity line Source Document instead of header
- NAV-227358: Gate only field clear (not Modify) on positive Amount check
- NAV-218253: Narrow ShipmentExists exit to Invoice document type only
- NAV-218062: Use else-if structure instead of or for availability check
- NAV-224009: Use <=Tracking filter instead of %1|%2 enumeration filter

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
… to strings

fix_patch_hunks() returns (str, list) but the auto-fixer stored the
tuple directly. Extract [0] for each affected field.

Entries: NAV-215645, NAV-218995, NAV-219082, NAV-220314, NAV-223819, NAV-227240

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
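
The bug described above is a common one when a helper returns a pair: storing the return value directly puts a tuple in the field, and JSON serialization turns that tuple into a two-element list. A small sketch of the repair, assuming the affected fields are `patch` and `test_patch` and that entries are plain dicts (the function name is hypothetical):

```python
PATCH_FIELDS = ("patch", "test_patch")


def unwrap_patch_fields(entry):
    """Return a copy of entry where any (str, list) pair is reduced to the str."""
    fixed = dict(entry)
    for key in PATCH_FIELDS:
        value = fixed.get(key)
        # After a JSON round trip the stored tuple appears as [str, [...]].
        if isinstance(value, (list, tuple)) and value and isinstance(value[0], str):
            fixed[key] = value[0]
    return fixed
```

Unpacking at the call site, for example `fixed_text, _notes = fix_patch_hunks(...)`, avoids the problem in the first place because a forgotten element then fails loudly.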
Fix 3 categories of patch issues in counterfactual.jsonl:

Category 1 - Hunk count mismatches (7 entries):
  - NAV-185696__cf-1: test_patch hunk 3 header count
  - NAV-206135__cf-1: test_patch hunk 1 header count
  - NAV-176426__cf-1: patch hunk 2 header count
  - NAV-177493__cf-1: test_patch hunk 2 header count
  - NAV-179733__cf-1: test_patch hunk 1 header count
  - NAV-185488__cf-1: test_patch hunk 1 header count
  - NAV-182354__cf-1: test_patch hunk 3 header count

Category 2 - Patch doesn't apply (2 entries):
  - NAV-181900__cf-1: test_patch hunk 2 had wrong context line
    (PendingApprovalLbl instead of ImposedRestrictionLbl)
  - NAV-191624__cf-1: patch hunk 1 had blank lines instead of
    correct closing brace context

Category 3 - Syntax error (1 entry):
  - NAV-214557__cf-3: test_patch hunks inserted code inside
    existing procedures instead of after them; rebuilt hunks
    using CF-1 structure as template

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Remove 82 counterfactual entries that failed dataset validation across
three CI runs (cf-1, cf-2, cf-3):

- CF-1 run #24959405870: 47 failures removed
- CF-2 run #24959409994: 31 failures removed
- CF-3 run #24959414223: 4 failures removed

Failure categories:
- Corrupt/malformed patches (cannot apply)
- Compile errors (patch applies but code doesn't build)
- Tests should pass but failed (gold patch doesn't fix tests)
- Tests should fail but passed (CF not diagnostic)
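
The four categories read naturally as an ordered check in which the first failing stage determines the label. The sketch below encodes that reading. The stage order, the boolean inputs, and the function names are assumptions; the labels `PATCH`, `COMPILE`, `TEST_3a`, and `TEST_3b` are the ones used in the commit messages on this page.

```python
from collections import Counter
from enum import Enum


class ValidationFailure(Enum):
    PATCH = "patch cannot be applied"
    COMPILE = "patch applies but code does not build"
    TEST_3A = "tests expected to pass but failed"
    TEST_3B = "tests expected to fail but passed"


def classify(applied, compiled, expected_pass_ok, expected_fail_ok):
    """Return the first failing stage, or None if the entry is valid."""
    if not applied:
        return ValidationFailure.PATCH
    if not compiled:
        return ValidationFailure.COMPILE
    if not expected_pass_ok:
        return ValidationFailure.TEST_3A
    if not expected_fail_ok:
        return ValidationFailure.TEST_3B
    return None


def summarize(outcomes):
    """Count failures per category; outcomes maps instance id -> classify() result."""
    return Counter(f for f in outcomes.values() if f is not None)
```

Later stages are meaningless when an earlier one fails (a patch that does not apply cannot be compiled), which is why the check short-circuits.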

Renumbered 42 remaining CF variants to be sequential (cf-1, cf-2, cf-3)
per base instance, updating both counterfactual.jsonl and problem
statement directories.

Result: 204 -> 122 CF entries across 71 base instances.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
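
A sketch of the renumbering step, assuming instance ids have the form `<base>__cf-<N>` and that surviving variants keep their relative order. The function name is hypothetical, and applying the mapping to `counterfactual.jsonl` and the problem statement directories is left out.

```python
import re
from collections import defaultdict

_CF_ID = re.compile(r"^(?P<base>.+)__cf-(?P<n>\d+)$")


def renumber_variants(instance_ids):
    """Map each surviving CF id to a gap-free id (cf-1, cf-2, ...) per base."""
    numbers_by_base = defaultdict(list)
    for instance_id in instance_ids:
        m = _CF_ID.match(instance_id)
        if m:  # ids without a __cf-N suffix are base entries and are skipped
            numbers_by_base[m["base"]].append(int(m["n"]))
    mapping = {}
    for base, numbers in numbers_by_base.items():
        for new_n, old_n in enumerate(sorted(numbers), start=1):
            mapping[f"{base}__cf-{old_n}"] = f"{base}__cf-{new_n}"
    return mapping
```

Because the new number is never larger than the old one, renaming directories in ascending order of the old number cannot overwrite a variant that has not been moved yet.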

2 participants