Repeat Incident Rate

The metric that tells you whether your incident process is actually working or just generating documentation.

Repeat incident rate is the percentage of incidents that recur after being marked resolved. It measures whether the underlying cause of a failure was genuinely addressed or whether the team resolved the symptoms, closed the ticket, and moved on.

What Repeat Incident Rate Measures (And What It Doesn’t)

Repeat incident rate tracks how often a previously resolved incident recurs within a defined time window. The definition of “repeat” matters significantly here. 

Most teams use one of two approaches:

  1. Incident type matching, where a new incident is flagged as a repeat if it shares the same root cause as a prior incident.
  2. Signature matching, where incidents are compared based on service, failure mode, and alert pattern.

Type matching produces more accurate data but requires consistent root cause classification in your incident records. 

Signature matching is easier to automate but can miss genuine recurrences where the same underlying problem manifests differently, and can over-flag incidents that look similar but have distinct causes.
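The difference between the two approaches can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the `Incident` fields, the category names, and the sample incidents are all hypothetical, and type matching here assumes a shared root cause taxonomy already exists.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Incident:
    # All field names are illustrative assumptions, not a real tool's schema.
    incident_id: str
    service: str
    failure_mode: str
    alert_pattern: str
    root_cause: str  # category from a shared taxonomy (assumed to exist)


def signature(incident: Incident) -> tuple[str, str, str]:
    """Key used for signature matching: service, failure mode, alert pattern."""
    return (incident.service, incident.failure_mode, incident.alert_pattern)


def is_repeat_by_signature(new: Incident, prior: Incident) -> bool:
    return signature(new) == signature(prior)


def is_repeat_by_type(new: Incident, prior: Incident) -> bool:
    return new.root_cause == prior.root_cause


# One underlying cause (a schema change) that manifests in two different ways.
first = Incident("INC-1", "orders-db", "timeout", "db_latency_high", "schema_change")
second = Incident("INC-2", "orders-api", "query_failure", "api_5xx_rate", "schema_change")
```

With these two sample incidents, type matching links them while signature matching does not, which is the missed-recurrence case described above.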

What repeat incident rate does not measure is the quality of the investigation that preceded the fix. A team that correctly identifies the root cause but cannot get the fix prioritized and resourced will still show a high repeat rate, even though the analysis was sound. 

Conversely, a team that gets lucky with a one-off fix, or whose system changes in a way that prevents recurrence for unrelated reasons, will show a low repeat rate without having done meaningful reliability work. The metric reflects outcomes, not process quality.

Repeat incident rate also does not distinguish between incidents that recur quickly and those that recur after a long gap. An incident that comes back three days after resolution is a different problem than one that recurs six months later, but both count the same way in the metric. Teams that want this distinction track a companion metric: time to recurrence, which measures the average interval between an incident’s resolution and its next occurrence.
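As a rough sketch of the companion metric, time to recurrence can be computed from pairs of timestamps. The function name and the sample dates below are assumptions for illustration; the input is a list of (resolution time of the original, start time of the recurrence) pairs.

```python
from datetime import datetime, timedelta
from statistics import mean


def mean_time_to_recurrence(pairs: list[tuple[datetime, datetime]]) -> timedelta:
    """Average gap between an incident's resolution and its next occurrence.

    Each pair is (resolved_at of the original, started_at of the recurrence).
    """
    gaps_in_seconds = [
        (recurred - resolved).total_seconds() for resolved, recurred in pairs
    ]
    return timedelta(seconds=mean(gaps_in_seconds))


resolved_a = datetime(2024, 1, 1)
resolved_b = datetime(2024, 1, 10)
pairs = [
    (resolved_a, resolved_a + timedelta(days=3)),    # came back after 3 days
    (resolved_b, resolved_b + timedelta(days=180)),  # came back after ~6 months
]
```

A mean hides exactly the distinction the paragraph above describes, so teams may prefer to look at the distribution of gaps (or a median) rather than a single average.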

How Repeat Incident Rate Gets Calculated

The formula is simple:

Repeat Incident Rate = (Number of repeat incidents / Total number of incidents) × 100

A team that had 40 incidents in a quarter, of which 8 were identified as repeats of previously resolved incidents, has a repeat incident rate of 20%.

The choice of time window for “repeat” requires a decision. Most teams define a repeat as any recurrence within 30 to 90 days of the original resolution. A window that is too short misses incidents with longer failure cycles. 

A window that is too long may count genuinely unrelated incidents as repeats if root cause classification is imprecise. Teams with longer release and change cycles may find 60 or 90 days more accurate than the 30-day default.
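The effect of the window choice can be seen in a small Python sketch. It assumes repeats are identified by type matching on a root cause category, and that a repeat is any incident starting within the window after the most recent resolution of an incident with the same root cause. The record layout, the helper `_inc`, and the sample data are hypothetical.

```python
from datetime import datetime, timedelta
from typing import NamedTuple


class Incident(NamedTuple):
    incident_id: str
    root_cause: str
    started_at: datetime
    resolved_at: datetime


def repeat_incident_rate(incidents: list[Incident], window_days: int = 30) -> float:
    """Repeat incidents as a percentage of all incidents, for a given window."""
    if not incidents:
        return 0.0
    window = timedelta(days=window_days)
    last_resolved: dict[str, datetime] = {}
    repeats = 0
    for inc in sorted(incidents, key=lambda i: i.started_at):
        prior = last_resolved.get(inc.root_cause)
        # A repeat starts after a prior resolution and within the window.
        if prior is not None and timedelta(0) <= inc.started_at - prior <= window:
            repeats += 1
        if prior is None or inc.resolved_at > prior:
            last_resolved[inc.root_cause] = inc.resolved_at
    return 100.0 * repeats / len(incidents)


def _inc(incident_id: str, root_cause: str, day: int) -> Incident:
    """Build a sample incident that starts on `day` and lasts one hour."""
    start = datetime(2024, 1, 1) + timedelta(days=day)
    return Incident(incident_id, root_cause, start, start + timedelta(hours=1))


incidents = [
    _inc("INC-1", "connection_leak", 0),
    _inc("INC-2", "connection_leak", 10),  # repeat under any window
    _inc("INC-3", "bad_deploy", 12),
    _inc("INC-4", "connection_leak", 70),  # about 60 days after INC-2
    _inc("INC-5", "dns_misconfig", 80),
]
```

On this sample data, a 30-day window counts one repeat out of five incidents (20%), while a 90-day window also counts INC-4 and reports 40%. The same incident history produces a different rate purely from the window definition.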

The denominator also requires careful handling. Whether to count only incidents that had a completed postmortem, or all declared incidents regardless of review depth, changes the number significantly. 

Teams that only require postmortems for high-severity incidents and use a different bar for lower-severity ones often calculate repeat incident rate separately by severity tier rather than as a single aggregate, which produces more useful data.
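A per-tier calculation might look like the following sketch. It assumes each incident has already been labelled as a repeat or not; the severity labels and counts are made up for illustration.

```python
from collections import defaultdict


def repeat_rate_by_severity(incidents: list[tuple[str, bool]]) -> dict[str, float]:
    """Repeat incident rate per severity tier.

    Each item is (severity label, whether the incident was flagged as a repeat).
    """
    totals: dict[str, int] = defaultdict(int)
    repeats: dict[str, int] = defaultdict(int)
    for severity, is_repeat in incidents:
        totals[severity] += 1
        if is_repeat:
            repeats[severity] += 1
    return {sev: 100.0 * repeats[sev] / totals[sev] for sev in totals}


# Hypothetical quarter: 20 incidents, 3 repeats, so the aggregate rate is 15%.
sample = (
    [("SEV-1", True)] * 2
    + [("SEV-1", False)] * 2
    + [("SEV-3", True)] * 1
    + [("SEV-3", False)] * 15
)
rates = repeat_rate_by_severity(sample)
```

In this sample the aggregate rate is 15%, but the per-tier view shows 50% for SEV-1 and 6.25% for SEV-3, which a single number would hide.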

The Root Cause Problem: Why Incidents Keep Coming Back

Most repeat incidents are not caused by bad engineering or inadequate monitoring. They are caused by three specific failure modes in the process that follows an incident’s resolution.

Reason #1: Fixing symptoms instead of causes

The most common driver of repeat incidents is a resolution that addresses the visible failure without touching the underlying condition that produced it. A database runs out of connections and the fix is to restart the service. The service restarts, the connection count drops, and the incident is closed. 

The root cause, whether that is a connection leak, insufficient pool sizing, or a query pattern that changed with a recent deployment, remains intact. The same incident returns in days or weeks. This pattern is often rational in the moment: a fast mitigation reduces user impact, and a deeper fix requires engineering time that has to compete with the rest of the backlog. 

The problem is that teams frequently stop at the mitigation and never return to the root cause, which is exactly what repeat incident rate is designed to surface.

Reason #2: Action items that get written but not executed

Google’s SRE documentation is explicit on this point: incomplete postmortem action items make recurrence far more likely. 

A well-run postmortem produces a clear root cause analysis and a set of concrete action items with owners and deadlines. The failure mode is not writing bad action items. It is that action items get deprioritized in the sprint planning process, assigned to engineers who are also carrying other work, and quietly dropped when no tracking mechanism enforces completion. 

Atlassian’s incident management practice addresses this by assigning SLOs to postmortem action items, either four or eight weeks depending on severity, with automated reminders and manager review. Teams without a comparable enforcement mechanism will see high repeat rates even when their postmortem analysis is accurate.

Reason #3: Systemic issues treated as one-off fixes

Some incidents recur not because the initial fix was wrong, but because the fix addressed one instance of a systemic problem without addressing the broader pattern. A misconfigured timeout value gets corrected in one service. 

The same misconfiguration exists in twelve other services, but the postmortem scope was limited to the affected service. The same failure class repeats elsewhere. This pattern is particularly common in distributed systems with many independently owned services. 

Repeat incident rate clustered around the same root cause type, rather than the same specific incident, is the signal that a systemic issue is going unaddressed.
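One way to look for that signal is to group repeat incidents by root cause category and check how many distinct services each category touches. The sketch below assumes incident records are dictionaries with hypothetical `root_cause_category` and `service` keys; the threshold of two services is an arbitrary starting point.

```python
from collections import defaultdict


def systemic_candidates(
    repeat_incidents: list[dict], min_services: int = 2
) -> dict[str, list[str]]:
    """Root cause categories that recur across several distinct services."""
    services_by_category: dict[str, set[str]] = defaultdict(set)
    for inc in repeat_incidents:
        services_by_category[inc["root_cause_category"]].add(inc["service"])
    return {
        category: sorted(services)
        for category, services in services_by_category.items()
        if len(services) >= min_services
    }


repeat_incidents = [
    {"root_cause_category": "timeout_misconfig", "service": "checkout"},
    {"root_cause_category": "timeout_misconfig", "service": "search"},
    {"root_cause_category": "timeout_misconfig", "service": "billing"},
    {"root_cause_category": "connection_leak", "service": "orders"},
    {"root_cause_category": "connection_leak", "service": "orders"},
]
candidates = systemic_candidates(repeat_incidents)
```

Here the timeout misconfiguration shows up in three services and is flagged as a candidate systemic issue, while the connection leak is confined to one service and looks like a single unresolved fix instead.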

Where Teams Go Wrong With Repeat Incident Rate

Repeat incident rate has its own distinct set of failure modes, most of which appear after teams start tracking it.

  • Defining “repeat” too narrowly. Teams that only flag an incident as a repeat when it matches the previous incident exactly in every dimension will consistently undercount recurrences. A database timeout incident and a subsequent query failure caused by the same schema change are functionally the same problem, but they may not match on signature. Narrow definitions produce artificially low repeat rates that create false confidence.
  • Tracking the rate without tracking action item completion. Repeat incident rate is a lagging indicator. By the time it rises, the recurrence has already happened. Teams that want to manage repeat incidents proactively need to track a leading indicator alongside it: the percentage of postmortem action items completed on schedule. A team whose action item completion rate is falling will see its repeat incident rate rise 30 to 90 days later.
  • Treating all repeats as equivalent. A SEV-1 incident that recurs is materially different from a SEV-4 incident that recurs, but both count the same in an aggregate repeat incident rate. Teams that do not segment repeat incident rate by severity cannot distinguish between a manageable background noise of minor recurring issues and a genuine reliability problem at the top of the severity spectrum.
  • Conflating reopen rate with repeat incident rate. These are related but distinct metrics. Reopen rate measures incidents that are reopened because they were not fully resolved at close, typically within hours or a few days. Repeat incident rate measures genuine recurrences after a successful resolution. Tracking only reopen rate misses the broader class of repeat incidents where the original fix held temporarily before the underlying cause reasserted itself.
  • Not reviewing repeat incidents as a group. Individual repeat incidents look like reliability problems. A cluster of repeat incidents with a shared root cause type looks like a process or architectural problem. Teams that review repeat incidents one at a time miss the pattern. A monthly review of all repeat incidents grouped by root cause category is one of the most effective ways to identify systemic issues that individual postmortems will not surface.

How Repeat Incident Rate Connects To The Metrics You Already Track

Repeat incident rate sits at the intersection of incident management quality and long-term reliability investment.

| Metric | Relationship to Repeat Incident Rate | What It Reveals |
| --- | --- | --- |
| Incident Frequency | High repeat incident rate inflates total frequency; separating repeat from novel incidents shows whether frequency is rising due to new failure modes or recurring ones | Whether incident volume reflects system fragility or resolution quality |
| Postmortem Action Item Completion Rate | The most direct leading indicator of repeat rate; declining completion rates predict rising repeat rates 30-90 days out | Whether the team is following through on fixes, not just documenting them |
| Mean Time to Recover (MTTR) | Repeat incidents often resolve faster than novel ones because the failure mode is known; stable or falling MTTR on repeat incidents alongside a high repeat rate may indicate the team is optimizing for speed of mitigation rather than permanence of fix | Whether fast resolution is substituting for durable resolution |
| Change Failure Rate | Some repeat incidents are caused by fixes that introduce new failures, which then get re-fixed; high change failure rate and high repeat rate together suggest the fix-and-break cycle | Whether the fix process is generating additional incidents |
| Error Budget Consumption | Repeat incidents consume error budget just as novel incidents do; a high repeat rate accelerating budget burn indicates that fixable failures are being allowed to recur | Whether budget is being consumed by preventable recurrences |
| Toil | Repeat incidents are a primary source of toil; known failure modes that recur regularly become normalized as operational overhead rather than reliability problems | How much team capacity is being consumed by preventable work |

The relationship between repeat incident rate and postmortem action item completion rate is where the most direct intervention opportunity exists. These two metrics should be tracked and reviewed together. 

A rise in action item completion rate predicts a future improvement in repeat incident rate. A fall in completion rate, even when current repeat rate looks acceptable, is an early warning that recurrences are accumulating.
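The leading indicator itself is straightforward to compute once action items carry due dates. The following is a sketch under assumptions: the `ActionItem` fields are hypothetical, and "on schedule" is taken to mean completed on or before the due date.

```python
from datetime import date
from typing import NamedTuple, Optional


class ActionItem(NamedTuple):
    item_id: str
    due: date
    completed: Optional[date]  # None while the item is still open


def on_time_completion_rate(items: list[ActionItem], as_of: date) -> Optional[float]:
    """Percentage of action items due by `as_of` that were completed on schedule.

    Returns None when no items have come due yet.
    """
    due_items = [item for item in items if item.due <= as_of]
    if not due_items:
        return None
    on_time = sum(
        1
        for item in due_items
        if item.completed is not None and item.completed <= item.due
    )
    return 100.0 * on_time / len(due_items)


items = [
    ActionItem("A", date(2024, 3, 1), date(2024, 2, 20)),   # on time
    ActionItem("B", date(2024, 3, 1), date(2024, 3, 15)),   # completed late
    ActionItem("C", date(2024, 3, 10), None),               # overdue, still open
    ActionItem("D", date(2024, 3, 12), date(2024, 3, 12)),  # on time
    ActionItem("E", date(2024, 4, 30), None),               # not yet due
]
rate = on_time_completion_rate(items, as_of=date(2024, 3, 31))
```

Items that are not yet due are excluded from the denominator so that a batch of freshly created action items does not depress the rate. In this sample, two of the four items due by the end of March were completed on schedule.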

The Infrastructure Challenges Nobody Warns You About

Collecting accurate repeat incident rate data is harder than it looks, and the measurement problems are distinct from those affecting other incident metrics.

Incident identity is difficult to establish consistently

Determining whether a new incident is a repeat of a previous one requires a definition of identity that is applied consistently across every incident declaration. In practice, this definition is often applied by different people with different context. Two engineers looking at the same pair of incidents may disagree on whether they share a root cause, particularly when the symptoms differ even if the underlying cause is the same. Without a structured root cause taxonomy and a formal linking process, incident identity is a judgment call, and repeat rate data reflects those individual judgments rather than a consistent standard.

Root cause fields are poorly maintained

Repeat incident rate depends on root cause data being populated accurately and consistently in incident records. Root cause fields are among the most commonly skipped or vaguely completed fields in incident management systems. Engineers close incidents under time pressure, fill in root cause with a surface-level description, and move on. When the same failure recurs, the new incident may not be linkable to the prior one because the root cause fields do not match precisely enough to establish the relationship.

Action item tracking lives in a different system than incident data

In most engineering organizations, incident records live in an incident management system, while action items from postmortems get tracked in a project management or ticketing system. These systems are rarely integrated in a way that allows automatic computation of action item completion rates or automatic flagging of incidents as repeats when a relevant action item was left open. The data needed to understand repeat incident rate fully is distributed across systems that do not talk to each other.
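If both systems can export records, the join can be done outside either tool. The sketch below assumes incident exports carry a `repeat_of` link to the prior incident and action item exports carry a `source_incident_id`; these field names, the status values, and the sample IDs are all hypothetical and will differ by tool.

```python
from collections import defaultdict


def repeats_with_open_action_items(
    incidents: list[dict], action_items: list[dict]
) -> dict[str, list[str]]:
    """Map each repeat incident to the open action items of the incident it repeats."""
    open_by_source: dict[str, list[str]] = defaultdict(list)
    for item in action_items:
        if item["status"] != "done":
            open_by_source[item["source_incident_id"]].append(item["ticket_id"])
    return {
        inc["incident_id"]: open_by_source[inc["repeat_of"]]
        for inc in incidents
        if inc.get("repeat_of") in open_by_source
    }


# Export from the incident management system (assumed shape).
incidents = [
    {"incident_id": "INC-7", "repeat_of": "INC-3"},
    {"incident_id": "INC-8", "repeat_of": None},
    {"incident_id": "INC-9", "repeat_of": "INC-5"},
]
# Export from the ticketing system (assumed shape).
action_items = [
    {"ticket_id": "OPS-11", "source_incident_id": "INC-3", "status": "open"},
    {"ticket_id": "OPS-12", "source_incident_id": "INC-3", "status": "done"},
    {"ticket_id": "OPS-13", "source_incident_id": "INC-5", "status": "done"},
]
flagged = repeats_with_open_action_items(incidents, action_items)
```

In this sample only INC-7 is flagged, because its predecessor INC-3 still has an open follow-up. The join only works if action items are linked to their source incident when they are created.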

Recurrence windows create artificial boundaries

Any fixed recurrence window, whether 30, 60, or 90 days, will produce edge cases where genuine repeats fall just outside the window and are not counted. A failure mode that recurs every 95 days will not be captured by a 90-day window, even though it is clearly a systematic recurrence. Teams using fixed windows should periodically review incidents that fall just outside the window to check whether there are patterns being missed.
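That periodic review can be supported with a simple filter over recurrence gaps. In this sketch, the input is a hypothetical mapping from incident ID to the number of days since the prior resolution of an incident with the same root cause, and the 30-day margin is an arbitrary choice.

```python
def near_window_misses(
    recurrence_gaps_days: dict[str, int],
    window_days: int = 90,
    margin_days: int = 30,
) -> dict[str, int]:
    """Recurrences that fall just outside the window and so are not counted."""
    upper_bound = window_days + margin_days
    return {
        incident_id: gap
        for incident_id, gap in recurrence_gaps_days.items()
        if window_days < gap <= upper_bound
    }


gaps = {
    "INC-20": 12,   # inside the window, already counted as a repeat
    "INC-21": 95,   # just outside a 90-day window
    "INC-22": 200,  # far outside, likely unrelated
}
misses = near_window_misses(gaps)
```

Incidents returned by this filter are candidates for manual review, not automatic reclassification.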

How to Reduce Repeat Incidents

Reducing repeat incident rate requires addressing the failure modes in the resolution process, not just the technical causes of individual incidents.

| Improvement Lever | What This Means in Practice | Primary Impact | Typical Effect |
| --- | --- | --- | --- |
| Enforce postmortem action item SLOs | Assign explicit completion deadlines to all action items, with owner accountability and manager review on a defined cadence | Directly reduces recurrence from incomplete fixes | High impact; the single highest-leverage intervention for most teams |
| Standardize root cause taxonomy | Define a shared set of root cause categories that all engineers use when closing incidents, enabling accurate recurrence detection | Improves measurement accuracy and surfaces systemic patterns | High impact on data quality; medium impact on actual recurrence |
| Conduct cross-incident root cause reviews | Monthly review of all repeat incidents grouped by root cause type, not individual incident review | Surfaces systemic issues that individual postmortems miss | High impact for teams with clustered repeat patterns |
| Separate mitigation from resolution in incident process | Explicitly distinguish between the short-term mitigation that stops the bleeding and the long-term fix that addresses root cause; track both | Reduces incidents closed on mitigation alone | Medium impact; requires process change and cultural reinforcement |
| Invest in automated detection of known failure modes | Build automated responses or runbooks for failure modes that have recurred, so they are detected and mitigated faster | Reduces user impact from repeats while permanent fixes are in progress | Medium impact on rate, high impact on severity of repeat incidents |
| Link action items to incidents at creation | Require that every postmortem action item be linked to its source incident in the tracking system | Enables automatic measurement of action item completion rates and repeat detection | High impact on measurement quality; enables proactive management |

The highest-leverage single intervention for most teams is enforcing completion of postmortem action items. 

Google’s SRE workbook documents this explicitly: rewarding engineers for writing postmortems without equally rewarding the completion of action items creates a cycle of unclosed follow-ups. The fix requires both tracking and accountability, not just documentation.

What Good Repeat Incident Rate Actually Looks Like in Practice

Repeat incident rate benchmarks are context-dependent and sensitive to how teams define recurrence, but the directional targets are consistent across the literature.

| Team Context | Typical Repeat Incident Rate | Notes |
| --- | --- | --- |
| Early-stage team, immature incident process | High and variable | Definition of “repeat” often inconsistent; data quality limits usefulness of the metric |
| Team with active postmortem practice, completing action items | Low; below 10% is achievable | Consistent follow-through on root cause fixes produces durable improvements |
| Team with postmortems but poor action item completion | Moderate to high; often 20-30%+ | Analysis quality does not translate to outcome quality without execution discipline |
| Team under high technical debt pressure | Elevated regardless of process quality | Systemic fragility produces recurring failures faster than individual fixes can address them |
| Mature SRE team with systemic review practice | Low, with a declining trend | Systemic root cause reviews catch recurring patterns before they accumulate |

The trajectory matters as much as the absolute number. A team improving from 30% to 20% over two quarters is demonstrating that its process changes are working, even if the rate is not yet at target. A rate that stays above 20% signals that the follow-through process is broken, regardless of how thorough the analysis is.

Segment the metric by severity before drawing conclusions. A 15% aggregate repeat rate concentrated in low-severity incidents suggests a manageable background of known minor recurring issues. The same rate concentrated in high-severity incidents is a reliability problem that warrants immediate attention.

Frequently Asked Questions

How is repeat incident rate different from reopen rate? 

Reopen rate measures incidents reopened because they were not fully resolved at close, typically within hours or a few days. Repeat incident rate measures genuine recurrences after a successful resolution, often days, weeks, or months later. A low reopen rate combined with a high repeat rate is common in teams that close incidents quickly on mitigation but do not address root causes.

What time window should be used to define a repeat incident?

Thirty days is a practical default for most teams. Teams with longer release and change cycles may find 60 or 90 days more appropriate. Any fixed window will produce edge cases; periodic review of incidents falling just outside the window helps catch patterns the metric misses.

Does a low repeat incident rate mean the incident process is working?

Not necessarily. A low rate can reflect genuine improvement, a narrow definition of recurrence that undercounts repeats, or changes in the system that prevented recurrence for unrelated reasons. Repeat incident rate is most meaningful when tracked alongside postmortem action item completion rate, a leading indicator that helps confirm the low rate reflects real process discipline.

Should repeat incident rate be segmented by severity?

Yes. An aggregate rate conflates low-severity recurring issues with high-severity ones that have meaningfully different operational and business impact. Segmenting by severity reveals whether the team’s repeat problem is concentrated in manageable minor incidents or in failures that cause significant user impact.

How does repeat incident rate relate to toil?

Repeat incidents are a direct source of toil. When a known failure mode keeps recurring, the response becomes routine operational work rather than engineering problem-solving. Teams that track repeat incident rate alongside toil often find that a small number of recurring failure modes account for a disproportionate share of total on-call burden. Addressing those failure modes permanently produces the largest reduction in toil.
