Add crashes_after_fix rule to flag fixed crash bugs still crashing on Nightly #2881

Open
spohlMozilla wants to merge 4 commits into mozilla:master from spohlMozilla:crashes-after-fix

@spohlMozilla

The macOS and Windows Spotlight teams have repeatedly hit the same
workflow gap: a crash gets a speculative fix, the patch lands, the bug
is marked RESOLVED FIXED -- and the signature keeps firing on Nightly
after the build containing the fix has shipped. With nothing prompting
us to re-check crash-stats a few days post-landing, this verification
step gets skipped, and we have ended up discovering only much later
(in some cases weeks or months) that the speculative fix didn't
actually move the crash numbers.

This rule plugs that gap. Once a day it picks RESOLVED FIXED bugs where
cf_status_firefox_nightly is "fixed" and cf_last_resolved falls between
min_days_since_fix (default 4) and max_days_since_fix (default 10) ago,
runs a faceted Socorro SuperSearch over Nightly for the bug's
signature(s) starting the day after the fix landed, and -- if
min_crash_count (default 5) or more crashes have been recorded in that
window -- needinfos the assignee asking whether the fix was incomplete,
whether the signature is shared with a different underlying crash, or
whether a follow-up is needed.

The four-day floor gives the Nightly build containing the fix time to
roll out and accumulate user exposure before the bot will fire. The
rule skips bugs that already have any open needinfo, and also skips
bugs whose comment history contains the rule's marker phrase, so it
only pings the assignee once per fix.
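
The selection logic described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the function name `should_needinfo`, its arguments, and the `MARKER` string are hypothetical, and the crash count is assumed to have already been fetched from Socorro.

```python
from datetime import date

# Hypothetical marker text; the actual phrase used by the rule is not shown here.
MARKER = "crashes_after_fix marker"


def should_needinfo(
    last_resolved: date,
    today: date,
    crash_count: int,
    has_open_needinfo: bool,
    comments: list[str],
    min_days_since_fix: int = 4,
    max_days_since_fix: int = 10,
    min_crash_count: int = 5,
) -> bool:
    """Return True if the assignee should be pinged for this bug."""
    days_since_fix = (today - last_resolved).days
    # Only consider fixes that landed inside the configured window.
    if not (min_days_since_fix <= days_since_fix <= max_days_since_fix):
        return False
    # Do not pile on when someone is already being asked for information.
    if has_open_needinfo:
        return False
    # Ping at most once per fix: skip bugs already carrying the marker comment.
    if any(MARKER in comment for comment in comments):
        return False
    # Finally, require enough post-fix crashes to be worth a ping.
    return crash_count >= min_crash_count
```

Under these assumptions, a bug resolved six days ago with five post-fix crashes, no open needinfo, and no marker comment would be flagged; the same bug with four crashes would not.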

@spohlMozilla
Author

@suhaibmujahid would you mind taking a look when you have a moment? I couldn't add you as a formal reviewer (external-contributor permissions). Thanks!

@marco-c marco-c requested a review from suhaibmujahid May 18, 2026 08:52
@marco-c
Contributor

marco-c commented May 18, 2026

Given that we already have min_crash_count and can look specifically at Nightly builds after the fix, we don't really need the 4-day delay.

Per marco-c's review feedback: min_crash_count plus the
"date >= fix_date + 1 day" Socorro filter already gate pings, so the
4-day floor was redundant. Removing it means the rule fires as soon as
the threshold is crossed -- fast-burning regressions get caught earlier,
slow-burning ones are still gated by min_crash_count. max_days_since_fix
is kept as the upper bound on how long we keep polling a bug whose
crash count is still below the threshold.
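
With the floor removed, each daily run has roughly three possible outcomes per bug. The sketch below illustrates that; `next_action` and its return strings are hypothetical names, and the exact boundary handling at max_days_since_fix is an assumption rather than something taken from the PR.

```python
from datetime import date


def next_action(
    last_resolved: date,
    today: date,
    crash_count: int,
    max_days_since_fix: int = 10,
    min_crash_count: int = 5,
) -> str:
    """Classify a bug for one daily run, with no minimum-age floor."""
    # Past the upper bound: the bug drops out of the candidate set.
    if (today - last_resolved).days > max_days_since_fix:
        return "stop"
    # Threshold crossed: ping now, however recently the fix landed.
    if crash_count >= min_crash_count:
        return "needinfo"
    # Still quiet: check again on the next run.
    return "keep_polling"
```

A fast-burning regression with seven crashes one day after the fix would yield "needinfo" immediately, while a bug with two crashes would stay in "keep_polling" until it either crosses the threshold or ages out.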
Member

@suhaibmujahid suhaibmujahid left a comment

Did you do a dry-run? Do the results match what you expect? Could you please share examples from the dry-run?

Comment thread bugbot/rules/crashes_after_fix.py Outdated
Comment on lines +80 to +96
# Has a non-empty crash signature.
"f1": "cf_crash_signature",
"o1": "isnotempty",
# The fix is in Nightly.
"f2": "cf_status_firefox_nightly",
"o2": "equals",
"v2": "fixed",
# cf_last_resolved > today - max_days (recent enough).
"f3": "cf_last_resolved",
"o3": "greaterthan",
"v3": oldest_fix,
# Skip bugs that already have an open needinfo so we don't pile on.
"f4": "flagtypes.name",
"o4": "notsubstring",
"v4": "needinfo?",
# Skip bugs where we've already left a needinfo comment for this
# rule (idempotency across daily runs).
Member

The query is clear; I would suggest dropping the comments.


params = {
"product": "Firefox",
"release_channel": "nightly",
Member

Older versions of Firefox that are still crashing would be included here, so a nonzero count does not mean the bug is not fixed. We need to consider only versions that were built after the fix was released.

@suhaibmujahid
Member

The new rule hasn't been added to any of the cron schedules yet; I'd add it to cron_run_weekdays.py.

@spohlMozilla
Author

Thanks for the review!

For point 2 (cron schedule): pushed 2879dfd adding python -m bugbot.rules.crashes_after_fix --production to scripts/cron_run_weekdays.sh (Suhaib's comment said .py but it's actually .sh). Slotted it next to crash_small_volume since they're conceptually adjacent.

For point 1 (dry-run): I'll set that up locally and post the output here next.

Per suhaibmujahid's inline review:

- Dropped the inline comments inside get_bz_params (the query is
  self-explanatory).
- _query_socorro now filters by build_id >= midnight of (fix_date + 1
  day) instead of by crash date. Filtering by crash date would also
  count crashes from Nightly users still running pre-fix builds, which
  don't say anything about whether the fix worked. Nightly build IDs
  are timestamps in YYYYMMDDHHMMSS format so any build_id at or above
  the cutoff is from a Nightly built after the fix landed.
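
The cutoff computation described in this commit message can be illustrated as follows. This is a sketch under stated assumptions: `build_id_cutoff` and `includes_fix` are hypothetical helper names, not the functions in the PR, and the fix date is taken to be a plain calendar date.

```python
from datetime import date, datetime, time, timedelta


def build_id_cutoff(fix_date: date) -> str:
    """Return midnight of (fix_date + 1 day) as a YYYYMMDDHHMMSS string."""
    cutoff = datetime.combine(fix_date + timedelta(days=1), time.min)
    return cutoff.strftime("%Y%m%d%H%M%S")


def includes_fix(build_id: str, fix_date: date) -> bool:
    """Check whether a Nightly build ID is at or above the cutoff."""
    # Build IDs are fixed-width digit strings, so comparing them as strings
    # orders them the same way as comparing the timestamps they encode.
    return build_id >= build_id_cutoff(fix_date)
```

For a bug resolved on 2026-05-18, the cutoff is `20260519000000`: a build with ID `20260518214512` is excluded, while `20260519093000` is counted.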
@spohlMozilla
Author

Addressed the inline review comments in 350500d:

  1. get_bz_params: dropped the inline comments inside the params dict.
  2. _query_socorro: switched from date >= to build_id >= so we only count crashes on Nightly builds that actually include the fix. Build IDs use YYYYMMDDHHMMSS format, so the cutoff is midnight of the day after the bug was resolved. Good catch — pre-fix builds still in the wild were a real false-positive source.

@spohlMozilla
Author

Dry-run on 2026-05-20:

  • Bugzilla query returned 3 candidate bugs (resolved in last 10 days, non-empty crash signature, fixed on Nightly, no open needinfo, no prior marker comment).
  • Socorro queried per bug for crashes on Nightly with build_id >= midnight(fix_date + 1). Sample signatures it queried:
    • NSC_OpenSession, sftk_SessionFromHandle
    • core::option::expect_failed | glean_core::core::with_glean
    • libsystem_notify.dylib, notify_register_check
  • None crossed the default min_crash_count = 5 threshold, so no needinfo would have fired. The rule correctly stays quiet.

Behavior verified to match expectations: candidate set looks like recently-fixed crash bugs from the right components; per-bug queries fire and respect the build-id cutoff; threshold gating prevents noise.
