Add crashes_after_fix rule to flag fixed crash bugs still crashing on Nightly #2881
spohlMozilla wants to merge 4 commits into
Conversation
The macOS and Windows Spotlight teams have repeatedly hit the same workflow gap: a crash gets a speculative fix, the patch lands, the bug is marked RESOLVED FIXED -- and the signature keeps firing on Nightly after the build containing the fix has shipped. With nothing prompting us to re-check crash-stats a few days post-landing, this verification step gets skipped, and we have ended up discovering only much later (in some cases weeks or months) that the speculative fix didn't actually move the crash numbers.
This rule plugs that gap. Once a day it picks RESOLVED FIXED bugs where cf_status_firefox_nightly is "fixed" and cf_last_resolved falls between min_days_since_fix (default 4) and max_days_since_fix (default 10) ago, runs a faceted Socorro SuperSearch over Nightly for the bug's signature(s) starting the day after the fix landed, and -- if min_crash_count (default 5) or more crashes have been recorded in that window -- needinfos the assignee asking whether the fix was incomplete, whether the signature is shared with a different underlying crash, or whether a follow-up is needed.
The four-day floor gives the Nightly build containing the fix time to roll out and accumulate user exposure before the bot will fire. The rule skips bugs that already have any open needinfo, and also skips bugs whose comment history contains the rule's marker phrase, so it only pings the assignee once per fix.
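For readers who want the selection rules in one place, here is a minimal sketch of the candidate check described above. It is not the code in this PR: the `Bug` class, `is_candidate`, and `MARKER_PHRASE` are hypothetical names, the real marker phrase is not shown on this page, and treating both day bounds as inclusive is an assumption.

```python
from dataclasses import dataclass, field
from datetime import date

# Hypothetical placeholder; the rule's actual marker phrase is not shown here.
MARKER_PHRASE = "crashes_after_fix marker (hypothetical)"


@dataclass
class Bug:
    """Hypothetical stand-in for the Bugzilla fields the rule looks at."""

    last_resolved: date
    open_needinfo: bool = False
    comments: list[str] = field(default_factory=list)


def is_candidate(bug, today, min_days_since_fix=4, max_days_since_fix=10):
    """Return True if the bug should be checked against crash-stats today."""
    age_days = (today - bug.last_resolved).days
    # Assumption: both ends of the window are inclusive.
    if not (min_days_since_fix <= age_days <= max_days_since_fix):
        return False
    # Don't pile on when someone is already being asked something.
    if bug.open_needinfo:
        return False
    # A previous ping from this rule leaves the marker phrase in a comment.
    return not any(MARKER_PHRASE in comment for comment in bug.comments)
```

In the PR itself most of these conditions are expressed as Bugzilla query parameters rather than evaluated in Python; the function form is only meant to make the logic easy to read.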
|
@suhaibmujahid would you mind taking a look when you have a moment? I couldn't add you as a formal reviewer (external-contributor permissions). Thanks!
|
Given we already have |
Per marco-c's review feedback: min_crash_count plus the "date >= fix_date + 1 day" Socorro filter already gate pings, so the 4-day floor was redundant. Removing it means the rule fires as soon as the threshold is crossed -- fast-burning regressions get caught earlier, slow-burning ones are still gated by min_crash_count. max_days_since_fix is kept as the upper bound on how long we keep polling a bug whose crash count is still below the threshold.
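A minimal sketch of the gating after this change, under stated assumptions: `should_needinfo` is a hypothetical helper, not a function from this PR, and treating `max_days_since_fix` as an inclusive bound is a guess. The point it illustrates is that there is no lower bound on days any more, so only the crash threshold and the polling cutoff decide whether the bot pings.

```python
def should_needinfo(days_since_fix, crash_count,
                    min_crash_count=5, max_days_since_fix=10):
    """Decide whether to ping the assignee (hypothetical helper)."""
    # Past the upper bound we stop polling the bug altogether.
    if days_since_fix > max_days_since_fix:
        return False
    # No floor on days: fire as soon as the threshold is crossed.
    return crash_count >= min_crash_count
```

With the defaults, a regression that produces 5 crashes on the first day after the fix is flagged immediately, while a bug with 4 crashes after 10 days is never flagged.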
suhaibmujahid left a comment
Did you do a dry run? Do the results match what you expect? Could you please share some examples from the dry run?
    # Has a non-empty crash signature.
    "f1": "cf_crash_signature",
    "o1": "isnotempty",
    # The fix is in Nightly.
    "f2": "cf_status_firefox_nightly",
    "o2": "equals",
    "v2": "fixed",
    # cf_last_resolved > today - max_days (recent enough).
    "f3": "cf_last_resolved",
    "o3": "greaterthan",
    "v3": oldest_fix,
    # Skip bugs that already have an open needinfo so we don't pile on.
    "f4": "flagtypes.name",
    "o4": "notsubstring",
    "v4": "needinfo?",
    # Skip bugs where we've already left a needinfo comment for this
    # rule (idempotency across daily runs).
The query is clear; I would suggest dropping the comments.
    params = {
        "product": "Firefox",
        "release_channel": "nightly",
Crashes from older Firefox versions that are still in use would be included here, so a nonzero count does not mean the bug is not fixed. We need to consider only versions that were built after the fix was released.
|
The new rule hasn't been added to any of the cron schedules yet; I'd add it to |
|
Thanks for the review! For point 2 (cron schedule): pushed 2879dfd adding For point 1 (dry-run): I'll set that up locally and post the output here next. |
Per suhaibmujahid's inline review:
- Dropped the inline comments inside get_bz_params (the query is self-explanatory).
- _query_socorro now filters by build_id >= midnight of (fix_date + 1 day) instead of by crash date. Filtering by crash date would also count crashes from Nightly users still running pre-fix builds, which don't say anything about whether the fix worked. Nightly build IDs are timestamps in YYYYMMDDHHMMSS format so any build_id at or above the cutoff is from a Nightly built after the fix landed.
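The build-id cutoff described above can be sketched as follows. This is an illustration, not the PR's `_query_socorro`: `build_id_cutoff` and `socorro_params` are hypothetical names, and the operator prefixes (`=` for an exact signature match, `>=` on `build_id`) and the `_facets` / `_results_number` parameters reflect my understanding of the SuperSearch API, which should be checked against the Socorro documentation.

```python
from datetime import date, datetime, time, timedelta


def build_id_cutoff(fix_date: date) -> str:
    """Midnight of the day after the fix, as a YYYYMMDDHHMMSS build ID."""
    cutoff = datetime.combine(fix_date + timedelta(days=1), time.min)
    return cutoff.strftime("%Y%m%d%H%M%S")


def socorro_params(signature: str, fix_date: date) -> dict:
    """SuperSearch parameters restricted to post-fix Nightly builds."""
    return {
        "product": "Firefox",
        "release_channel": "nightly",
        # Assumed operator prefix: "=" requests an exact signature match.
        "signature": "=" + signature,
        # Only builds at or after the cutoff can contain the fix.
        "build_id": ">=" + build_id_cutoff(fix_date),
        # Only the facet counts are needed, not individual crash reports.
        "_results_number": 0,
        "_facets": "signature",
    }
```

For a fix dated 2026-05-14 the cutoff is `20260515000000`. Because build IDs are fixed-width timestamps, they sort the same way as strings and as integers, so a build such as `20260515031500` passes the filter and one from the afternoon of the fix date does not.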
|
Addressed the inline review comments in 350500d:
|
|
Dry-run on 2026-05-20:
Behavior verified to match expectations: candidate set looks like recently-fixed crash bugs from the right components; per-bug queries fire and respect the build-id cutoff; threshold gating prevents noise.