The metric that tells you whether your incident process is actually working or just generating documentation.
Repeat incident rate is the percentage of incidents that recur after being marked resolved. It measures whether the underlying cause of a failure was genuinely addressed or whether the team resolved the symptoms, closed the ticket, and moved on.
What Repeat Incident Rate Measures (And What It Doesn’t)
Repeat incident rate tracks how often a previously resolved incident recurs within a defined time window. The definition of “repeat” matters significantly here.
Most teams use one of two approaches:
- Incident type matching, where a new incident is flagged as a repeat if it shares the same root cause as a prior incident.
- Signature matching, where incidents are compared based on service, failure mode, and alert pattern.
Incident type matching produces more accurate data but requires consistent root cause classification in your incident records.
Signature matching is easier to automate but can miss genuine recurrences where the same underlying problem manifests differently, and can over-flag incidents that look similar but have distinct causes.
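To make the trade-off concrete, here is a minimal Python sketch of both approaches. The `Incident` fields, the incident IDs, and the root cause labels are all hypothetical and stand in for whatever your incident records actually store. It illustrates the idea and is not a production matcher.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Incident:
    # All field names are illustrative assumptions, not a real schema.
    incident_id: str
    service: str
    failure_mode: str
    alert_pattern: str
    root_cause: str  # category assigned during root cause classification


def type_match(new: Incident, prior: Incident) -> bool:
    """Incident type matching: same classified root cause."""
    return new.root_cause == prior.root_cause


def signature_match(new: Incident, prior: Incident) -> bool:
    """Signature matching: same service, failure mode, and alert pattern."""
    return (new.service, new.failure_mode, new.alert_pattern) == (
        prior.service,
        prior.failure_mode,
        prior.alert_pattern,
    )


prior = Incident("INC-1", "orders-db", "timeout", "db_latency_high", "schema_change")
# Same underlying cause, different symptoms: signature matching misses it.
recurrence = Incident("INC-2", "orders-api", "query_failure", "http_5xx_spike", "schema_change")
# Same symptoms, different cause: signature matching over-flags it.
lookalike = Incident("INC-3", "orders-db", "timeout", "db_latency_high", "connection_leak")

print(type_match(recurrence, prior), signature_match(recurrence, prior))  # True False
print(type_match(lookalike, prior), signature_match(lookalike, prior))    # False True
```

The sketch assumes root cause categories are filled in reliably, and that is exactly the condition incident type matching depends on.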
What repeat incident rate does not measure is the quality of the investigation that preceded the fix. A team that correctly identifies the root cause but cannot get the fix prioritized and resourced will still show a high repeat rate, even though the analysis was sound.
Conversely, a team that gets lucky with a one-off fix, or whose system changes in a way that prevents recurrence for unrelated reasons, will show a low repeat rate without having done meaningful reliability work. The metric reflects outcomes, not process quality.
Repeat incident rate also does not distinguish between incidents that recur quickly and those that recur after a long gap. An incident that comes back three days after resolution is a different problem than one that recurs six months later, but both count the same way in the metric. Teams that want this distinction track a companion metric: time to recurrence, which measures the average interval between an incident’s resolution and its next occurrence.
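As a rough illustration, time to recurrence can be computed from pairs of timestamps: when the original incident was resolved and when it next occurred. The function name and the sample data below are assumptions made for this sketch.

```python
from datetime import datetime, timedelta
from statistics import mean


def mean_time_to_recurrence_days(pairs):
    """Average gap in days between a resolution and the next occurrence.

    pairs: list of (resolved_at, recurred_at) datetimes.
    Returns None when there are no recurrences to average.
    """
    gaps = [(recurred - resolved) / timedelta(days=1) for resolved, recurred in pairs]
    return mean(gaps) if gaps else None


resolved = datetime(2024, 1, 1, 9, 0)
pairs = [
    (resolved, resolved + timedelta(days=3)),    # came back within days
    (resolved, resolved + timedelta(days=180)),  # came back about six months later
]

print(mean_time_to_recurrence_days(pairs))  # 91.5
```

The average hides the spread between the two cases in this sample, so it can help to look at the individual gaps as well as the mean.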
How Repeat Incident Rate Gets Calculated
The formula is simple:
Repeat Incident Rate = (Number of repeat incidents / Total number of incidents) × 100
A team that had 40 incidents in a quarter, of which 8 were identified as repeats of previously resolved incidents, has a repeat incident rate of 20%.
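The same arithmetic as a small Python helper, shown only to make the inputs explicit. The function name is an assumption made for this sketch.

```python
def repeat_incident_rate(repeat_count: int, total_count: int) -> float:
    """Repeat incidents as a percentage of all incidents in the period."""
    if total_count <= 0:
        raise ValueError("total_count must be positive")
    if not 0 <= repeat_count <= total_count:
        raise ValueError("repeat_count must be between 0 and total_count")
    return 100 * repeat_count / total_count


print(repeat_incident_rate(8, 40))  # 20.0
```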
The choice of time window for “repeat” requires a decision. Most teams define a repeat as any recurrence within 30 to 90 days of the original resolution. A window that is too short misses incidents with longer failure cycles.
A window that is too long may count genuinely unrelated incidents as repeats if root cause classification is imprecise. Teams with longer release and change cycles may find a 60- or 90-day window more accurate than a 30-day one.
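The effect of the window is easy to see in a small sketch. The code below assumes each incident carries a root cause category and both a declared and a resolved timestamp. The `Incident` shape and the `count_repeats` helper are hypothetical. An incident counts as a repeat when it is declared within the window after the most recent resolution of an earlier incident with the same root cause.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    # Illustrative fields only; adapt to your own incident records.
    incident_id: str
    root_cause: str
    declared_at: datetime
    resolved_at: datetime


def count_repeats(incidents, window_days):
    """Count incidents declared within window_days of the latest resolution
    of an earlier incident that shares the same root cause."""
    window = timedelta(days=window_days)
    last_resolved = {}  # root cause -> most recent resolution time seen so far
    repeats = 0
    for inc in sorted(incidents, key=lambda i: i.declared_at):
        prev = last_resolved.get(inc.root_cause)
        if prev is not None and timedelta(0) <= inc.declared_at - prev <= window:
            repeats += 1
        if prev is None or inc.resolved_at > prev:
            last_resolved[inc.root_cause] = inc.resolved_at
    return repeats


start = datetime(2024, 1, 1)
incidents = [
    Incident("INC-1", "connection_leak", start, start + timedelta(hours=1)),
    Incident("INC-2", "dns_misconfig", start + timedelta(days=10),
             start + timedelta(days=10, hours=2)),
    Incident("INC-3", "connection_leak", start + timedelta(days=45),
             start + timedelta(days=45, hours=1)),
]

print(count_repeats(incidents, window_days=30))  # 0
print(count_repeats(incidents, window_days=90))  # 1
```

With the same data, the 30-day window reports no repeats and the 90-day window reports one, which is why the window should be chosen deliberately and then kept stable across reporting periods.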
The denominator also requires careful handling. Whether to count only incidents that had a completed postmortem, or all declared incidents regardless of review depth, changes the number significantly.
Teams that only require postmortems for high-severity incidents and use a different bar for lower-severity ones often calculate repeat incident rate separately by severity tier rather than as a single aggregate, which produces more useful data.
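A per-tier calculation might look like the following sketch. The severity labels and the `(severity, is_repeat)` input shape are assumptions chosen for illustration.

```python
from collections import defaultdict


def repeat_rate_by_severity(incidents):
    """incidents: iterable of (severity, is_repeat) pairs.

    Returns {severity: repeat rate in percent} for each tier present.
    """
    totals = defaultdict(int)
    repeats = defaultdict(int)
    for severity, is_repeat in incidents:
        totals[severity] += 1
        if is_repeat:
            repeats[severity] += 1
    return {sev: 100 * repeats[sev] / totals[sev] for sev in totals}


# Hypothetical quarter: 4 SEV-1 incidents (2 repeats), 16 SEV-4 incidents (1 repeat).
incidents = (
    [("SEV-1", True)] * 2 + [("SEV-1", False)] * 2
    + [("SEV-4", True)] * 1 + [("SEV-4", False)] * 15
)

print(repeat_rate_by_severity(incidents))  # {'SEV-1': 50.0, 'SEV-4': 6.25}
```

In this made-up data the aggregate rate is 15% (3 of 20), which gives no hint that half of the SEV-1 incidents were repeats.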
The Root Cause Problem: Why Incidents Keep Coming Back
Most repeat incidents are not caused by bad engineering or inadequate monitoring. They are caused by three specific failure modes in the process that follows an incident’s resolution.
Reason #1: Fixing symptoms instead of causes
The most common driver of repeat incidents is a resolution that addresses the visible failure without touching the underlying condition that produced it. A database runs out of connections and the fix is to restart the service. The service restarts, the connection count drops, and the incident is closed.
The root cause, whether that is a connection leak, insufficient pool sizing, or a query pattern that changed with a recent deployment, remains intact. The same incident returns in days or weeks. This pattern is often rational in the moment: a fast mitigation reduces user impact, and a deeper fix requires engineering time that has to compete with the rest of the backlog.
The problem is that teams frequently stop at the mitigation and never return to the root cause, which is exactly what repeat incident rate is designed to surface.
Reason #2: Action items that get written but not executed
Google’s SRE documentation is explicit on this point: incomplete postmortem action items make recurrence far more likely.
A well-run postmortem produces a clear root cause analysis and a set of concrete action items with owners and deadlines. The failure mode is not writing bad action items. It is that action items get deprioritized in the sprint planning process, assigned to engineers who are also carrying other work, and quietly dropped when no tracking mechanism enforces completion.
Atlassian’s incident management practice addresses this by assigning SLOs to postmortem action items, either four or eight weeks depending on severity, with automated reminders and manager review. Teams without a comparable enforcement mechanism will see high repeat rates even when their postmortem analysis is accurate.
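A deadline check of this kind is simple to automate. The sketch below is an assumption-heavy illustration: the mapping of severity tier to a four- or eight-week deadline, the tier names, and the function name are all hypothetical and not taken from Atlassian's documentation.

```python
from datetime import date, timedelta

# ASSUMPTION: which tier gets four weeks and which gets eight is illustrative only.
SLO_WEEKS = {"high": 4, "low": 8}


def breached_slo(opened_on, severity, today, completed_on=None):
    """True if the action item was finished late, or is still open past its deadline."""
    due = opened_on + timedelta(weeks=SLO_WEEKS[severity])
    reference = completed_on if completed_on is not None else today
    return reference > due


opened = date(2024, 1, 1)
print(breached_slo(opened, "high", today=date(2024, 2, 1)))  # True: open past 4 weeks
print(breached_slo(opened, "low", today=date(2024, 2, 1)))   # False: 8 weeks not yet up
print(breached_slo(opened, "high", today=date(2024, 3, 1),
                   completed_on=date(2024, 1, 20)))           # False: finished on time
```

A check like this could feed the reminders and manager review described above, though how it is wired into a ticketing system depends entirely on the tools in use.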
Reason #3: Systemic issues treated as one-off fixes
Some incidents recur not because the initial fix was wrong, but because the fix addressed one instance of a systemic problem without addressing the broader pattern. A misconfigured timeout value gets corrected in one service.
The same misconfiguration exists in twelve other services, but the postmortem scope was limited to the affected service. The same failure class repeats elsewhere. This pattern is particularly common in distributed systems with many independently owned services.
Repeat incidents that cluster around the same root cause type, rather than around the same specific incident, are the signal that a systemic issue is going unaddressed.
Where Teams Go Wrong With Repeat Incident Rate
Repeat incident rate has its own distinct set of failure modes, most of which appear after teams start tracking it.
- Defining “repeat” too narrowly. Teams that only flag an incident as a repeat when it matches the previous incident exactly in every dimension will consistently undercount recurrences. A database timeout incident and a subsequent query failure caused by the same schema change are functionally the same problem, but they may not match on signature. Narrow definitions produce artificially low repeat rates that create false confidence.
- Tracking the rate without tracking action item completion. Repeat incident rate is a lagging indicator. By the time it rises, the recurrence has already happened. Teams that want to manage repeat incidents proactively need to track a leading indicator alongside it: the percentage of postmortem action items completed on schedule. A team whose action item completion rate is falling will see its repeat incident rate rise 30 to 90 days later.
- Treating all repeats as equivalent. A SEV-1 incident that recurs is materially different from a SEV-4 incident that recurs, but both count the same in an aggregate repeat incident rate. Teams that do not segment repeat incident rate by severity cannot distinguish between manageable background noise from minor recurring issues and a genuine reliability problem at the top of the severity spectrum.
- Conflating reopen rate with repeat incident rate. These are related but distinct metrics. Reopen rate measures incidents that are reopened because they were not fully resolved at close, typically within hours or a few days. Repeat incident rate measures genuine recurrences after a successful resolution. Tracking only reopen rate misses the broader class of repeat incidents where the original fix held temporarily before the underlying cause reasserted itself.
- Not reviewing repeat incidents as a group. Individual repeat incidents look like reliability problems. A cluster of repeat incidents with a shared root cause type looks like a process or architectural problem. Teams that review repeat incidents one at a time miss the pattern. A monthly review of all repeat incidents grouped by root cause category is one of the most effective ways to identify systemic issues that individual postmortems will not surface.
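The grouped review in the last point can start from something as simple as a count per root cause category. In the sketch below, the incident IDs, category names, and the `min_count` threshold are hypothetical.

```python
from collections import Counter


def repeat_clusters(repeat_incidents, min_count=2):
    """repeat_incidents: iterable of (incident_id, root_cause_category).

    Returns (category, count) pairs, most frequent first, for categories
    that appear at least min_count times.
    """
    counts = Counter(category for _, category in repeat_incidents)
    return [(cat, n) for cat, n in counts.most_common() if n >= min_count]


this_month = [
    ("INC-101", "timeout_misconfig"),
    ("INC-107", "timeout_misconfig"),
    ("INC-112", "connection_leak"),
    ("INC-118", "timeout_misconfig"),
]

print(repeat_clusters(this_month))  # [('timeout_misconfig', 3)]
```

Three repeats in one category across different incidents is the kind of cluster that an incident-by-incident review would be unlikely to surface.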
How Repeat Incident Rate Connects To The Metrics You Already Track
Repeat incident rate sits at the intersection of incident management quality and long-term reliability investment.
| Metric | Relationship to Repeat Incident Rate | What It Reveals |
| --- | --- | --- |
| Incident Frequency | High repeat incident rate inflates total frequency; separating repeat from novel incidents shows whether frequency is rising due to new failure modes or recurring ones | Whether incident volume reflects system fragility or resolution quality |
| Postmortem Action Item Completion Rate | The most direct leading indicator of repeat rate; declining completion rates predict rising repeat rates 30-90 days out | Whether the team is following through on fixes, not just documenting them |
| Mean Time to Recover (MTTR) | Repeat incidents often resolve faster than novel ones because the failure mode is known; stable or falling MTTR on repeat incidents alongside a high repeat rate may indicate the team is optimizing for speed of mitigation rather than permanence of fix | Whether fast resolution is substituting for durable resolution |
| Change Failure Rate | Some repeat incidents are caused by fixes that introduce new failures, which then get re-fixed; high change failure rate and high repeat rate together suggest the fix-and-break cycle | Whether the fix process is generating additional incidents |
| Error Budget Consumption | Repeat incidents consume error budget just as novel incidents do; a high repeat rate accelerating budget burn indicates that fixable failures are being allowed to recur | Whether budget is being consumed by preventable recurrences |
| Toil | Repeat incidents are a primary source of toil; known failure modes that recur regularly become normalized as operational overhead rather than reliability problems | How much team capacity is being consumed by preventable work |
The relationship between repeat incident rate and postmortem action item completion rate is where the most direct intervention opportunity exists. These two metrics should be tracked and reviewed together.
A rise in action item completion rate predicts a future improvement in repeat incident rate. A fall in completion rate, even when current repeat rate looks acceptable, is an early warning that recurrences are accumulating.
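One way to watch the leading indicator is to compute the on-time completion rate per review period and flag a meaningful drop. The sketch below assumes each action item has a due date and an optional completion date. The 10-point drop threshold is an arbitrary illustration, not a recommended value.

```python
from datetime import date


def on_time_completion_rate(items):
    """items: list of (due_date, completed_on or None) for items due in the period.

    Returns the percentage completed on or before the due date, or None if empty.
    """
    if not items:
        return None
    on_time = sum(1 for due, done in items if done is not None and done <= due)
    return 100 * on_time / len(items)


def completion_warning(previous_rate, current_rate, drop_threshold=10.0):
    """True when the rate fell by at least drop_threshold percentage points."""
    return previous_rate - current_rate >= drop_threshold


march_items = [
    (date(2024, 3, 8), date(2024, 3, 5)),    # on time
    (date(2024, 3, 15), date(2024, 3, 15)),  # on time (same day)
    (date(2024, 3, 22), date(2024, 3, 30)),  # late
    (date(2024, 3, 29), None),               # still open
]

rate = on_time_completion_rate(march_items)
print(rate)                            # 50.0
print(completion_warning(90.0, rate))  # True
```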
The Infrastructure Challenges Nobody Warns You About
Collecting accurate repeat incident rate data is harder than it looks, and the measurement problems are distinct from those affecting other incident metrics.
Incident identity is difficult to establish consistently
Determining whether a new incident is a repeat of a previous one requires a definition of identity that is applied consistently across every incident declaration. In practice, this definition is often applied by different people with different context. Two engineers looking at the same pair of incidents may disagree on whether they share a root cause, particularly when the symptoms differ even if the underlying cause is the same. Without a structured root cause taxonomy and a formal linking process, incident identity is a judgment call, and repeat rate data reflects those individual judgments rather than a consistent standard.
Root cause fields are poorly maintained
Repeat incident rate depends on root cause data being populated accurately and consistently in incident records. Root cause fields are among the most commonly skipped or vaguely completed fields in incident management systems. Engineers close incidents under time pressure, fill in root cause with a surface-level description, and move on. When the same failure recurs, the new incident may not be linkable to the prior one because the root cause fields do not match precisely enough to establish the relationship.
Action item tracking lives in a different system than incident data
In most engineering organizations, incident records live in an incident management system, while action items from postmortems get tracked in a project management or ticketing system. These systems are rarely integrated in a way that allows automatic computation of action item completion rates or automatic flagging of incidents as repeats when a relevant action item was left open. The data needed to understand repeat incident rate fully is distributed across systems that do not talk to each other.
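If both systems can export records that share an incident identifier, the join itself is small. The sketch below assumes two hypothetical exports: a mapping from each repeat incident to the prior incident it repeats, and a list of action items that each carry a `source_incident` and a `status`. None of these field names come from a real tool.

```python
def repeats_with_open_actions(repeat_links, action_items):
    """Find repeat incidents whose prior incident still has unfinished action items.

    repeat_links: {repeat_incident_id: prior_incident_id}
    action_items: list of dicts with 'key', 'source_incident', and 'status'
    Returns {repeat_incident_id: [open action item keys]}.
    """
    open_by_incident = {}
    for item in action_items:
        if item["status"] != "done":
            open_by_incident.setdefault(item["source_incident"], []).append(item["key"])
    return {
        repeat_id: open_by_incident[prior_id]
        for repeat_id, prior_id in repeat_links.items()
        if prior_id in open_by_incident
    }


repeat_links = {"INC-210": "INC-150", "INC-215": "INC-160"}
action_items = [
    {"key": "REL-1", "source_incident": "INC-150", "status": "open"},
    {"key": "REL-2", "source_incident": "INC-150", "status": "done"},
    {"key": "REL-3", "source_incident": "INC-160", "status": "done"},
]

print(repeats_with_open_actions(repeat_links, action_items))  # {'INC-210': ['REL-1']}
```

The hard part in practice is usually not the join but getting the shared identifier recorded consistently in both systems.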
Recurrence windows create artificial boundaries
Any fixed recurrence window, whether 30, 60, or 90 days, will produce edge cases where genuine repeats fall just outside the window and are not counted. A failure mode that recurs every 95 days will not be captured by a 90-day window, even though it is clearly a systematic recurrence. Teams using fixed windows should periodically review incidents that fall just outside the window to check whether there are patterns being missed.
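That periodic review can be supported by listing recurrences that fall just past the cutoff. The sketch below assumes the gap in days between resolution and next occurrence is already known for each candidate pair. The 15-day grace band is an illustrative choice, not a recommendation.

```python
def near_misses(recurrences, window_days=90, grace_days=15):
    """recurrences: list of (incident_id, gap_days since prior resolution).

    Returns the entries that fall just outside the window: more than
    window_days but no more than window_days + grace_days.
    """
    return [
        (incident_id, gap)
        for incident_id, gap in recurrences
        if window_days < gap <= window_days + grace_days
    ]


candidates = [
    ("INC-301", 12),   # inside the window, already counted as a repeat
    ("INC-302", 95),   # just outside: worth a manual look
    ("INC-303", 104),  # just outside: worth a manual look
    ("INC-304", 200),  # well outside
]

print(near_misses(candidates))  # [('INC-302', 95), ('INC-303', 104)]
```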
How to Reduce Repeat Incidents
Reducing repeat incident rate requires addressing the failure modes in the resolution process, not just the technical causes of individual incidents.
| Improvement Lever | What This Means in Practice | Primary Impact | Typical Effect |
| --- | --- | --- | --- |
| Enforce postmortem action item SLOs | Assign explicit completion deadlines to all action items, with owner accountability and manager review on a defined cadence | Directly reduces recurrence from incomplete fixes | High impact; the single highest-leverage intervention for most teams |
| Standardize root cause taxonomy | Define a shared set of root cause categories that all engineers use when closing incidents, enabling accurate recurrence detection | Improves measurement accuracy and surfaces systemic patterns | High impact on data quality; medium impact on actual recurrence |
| Conduct cross-incident root cause reviews | Monthly review of all repeat incidents grouped by root cause type, not individual incident review | Surfaces systemic issues that individual postmortems miss | High impact for teams with clustered repeat patterns |
| Separate mitigation from resolution in incident process | Explicitly distinguish between the short-term mitigation that stops the bleeding and the long-term fix that addresses root cause; track both | Reduces incidents closed on mitigation alone | Medium impact; requires process change and cultural reinforcement |
| Invest in automated detection of known failure modes | Build automated responses or runbooks for failure modes that have recurred, so they are detected and mitigated faster | Reduces user impact from repeats while permanent fixes are in progress | Medium impact on rate, high impact on severity of repeat incidents |
| Link action items to incidents at creation | Require that every postmortem action item be linked to its source incident in the tracking system | Enables automatic measurement of action item completion rates and repeat detection | High impact on measurement quality; enables proactive management |
The highest-leverage single intervention for most teams is enforcing completion of postmortem action items.
Google’s SRE workbook documents this explicitly: rewarding engineers for writing postmortems without equally rewarding the completion of action items creates a cycle of unclosed follow-ups. The fix requires both tracking and accountability, not just documentation.
What Good Repeat Incident Rate Actually Looks Like in Practice
Repeat incident rate benchmarks are context-dependent and sensitive to how teams define recurrence, but the directional targets are consistent across the literature.
| Team Context | Typical Repeat Incident Rate | Notes |
| --- | --- | --- |
| Early-stage team, immature incident process | High and variable | Definition of “repeat” often inconsistent; data quality limits usefulness of the metric |
| Team with active postmortem practice, completing action items | Low; below 10% is achievable | Consistent follow-through on root cause fixes produces durable improvements |
| Team with postmortems but poor action item completion | Moderate to high; often 20-30%+ | Analysis quality does not translate to outcome quality without execution discipline |
| Team under high technical debt pressure | Elevated regardless of process quality | Systemic fragility produces recurring failures faster than individual fixes can address them |
| Mature SRE team with systemic review practice | Low, with a declining trend | Systemic root cause reviews catch recurring patterns before they accumulate |
The trajectory matters as much as the absolute number. A team improving from 30% to 20% over two quarters is demonstrating that its process changes are working, even if the rate is not yet at target. A rate that stays above 20% is a signal that the follow-through process is broken, regardless of how thorough the analysis is.
Segment the metric by severity before drawing conclusions. A 15% aggregate repeat rate concentrated in low-severity incidents suggests a manageable background of known minor recurring issues. The same rate concentrated in high-severity incidents is a reliability problem that warrants immediate attention.