Flaky tests are tests that pass and fail intermittently without any changes to the code under test. Google's engineering research found that 84% of test transitions from passing to failing involved a flaky test rather than a real regression, and by some industry estimates the average engineering team loses 6-8 hours per developer per week investigating false failures.
The first step to fixing flaky tests is detecting them. But most teams rely on gut feel -- "that test has been failing randomly for weeks" -- rather than systematic detection. This guide covers proven techniques from simple to sophisticated.
Why Flaky Tests Are Hard to Detect
Flaky tests are deceptive. A test that fails once out of every 20 runs looks like a legitimate failure each time it happens. Without tracking history across runs, you cannot distinguish a flaky failure from a real regression. This leads to two costly outcomes:
- Wasted investigation time: Developers spend 28 minutes on average investigating each flaky failure before concluding "it was just flaky." Across a large engineering organization, this adds up to millions of dollars in lost productivity per year.
- Normalized failure: Teams start ignoring test failures entirely, which means real regressions slip through.
Technique 1: Retry-Based Detection
The simplest approach is re-running failed tests. If a test passes on retry without code changes, it is flaky by definition.
How it works
- A test fails in CI.
- The CI system automatically retries the failed test (typically 2-3 times).
- If the test passes on any retry, flag it as flaky.
Pros
- Simple to implement -- most CI systems support retries natively.
- Zero false positives: if it passes on retry without code changes, it is flaky.
- Works with any test framework.
Cons
- Increases CI time -- every retry adds minutes to your pipeline.
- Misses low-frequency flaky tests -- a test that fails once per 50 runs might pass all 3 retries.
- No root cause insight -- you know it is flaky, but not why.
```yaml
# GitHub Actions example with retries
jobs:
  test:
    runs-on: ubuntu-latest
    strategy:
      fail-fast: false
    steps:
      - uses: actions/checkout@v4
      - run: npm test || npm test || npm test
```
Technique 2: Historical Flip-Rate Analysis
Instead of relying on retries, track every test result across commits and compute a flip rate: how often a test changes from pass to fail (or vice versa) without corresponding code changes.
How it works
- Ingest test results from every CI run into a database.
- For each test, group results by commit SHA.
- Count the number of "flips" -- transitions between pass and fail on the same code.
- Compute a flakiness score: flips / total_transitions.
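The steps above can be sketched in a few lines of Python, assuming test results arrive as a chronological list of `(commit_sha, passed)` tuples (the data shape and the `flakiness_score` name are illustrative):

```python
from collections import defaultdict

def flakiness_score(results):
    """Compute flips / total_transitions for one test.

    `results` is a chronological list of (commit_sha, passed) tuples.
    Only transitions between runs on the SAME commit count, because
    only then did the result change without a code change.
    """
    by_sha = defaultdict(list)
    for sha, passed in results:
        by_sha[sha].append(passed)

    flips = transitions = 0
    for outcomes in by_sha.values():
        # Compare each run with the next run on the same commit.
        for prev, cur in zip(outcomes, outcomes[1:]):
            transitions += 1
            flips += prev != cur
    return flips / transitions if transitions else 0.0
```

A real pipeline would run this per test over a sliding window (say, the last 30 days) so that fixed tests eventually decay back toward 0.0.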
The flakiness score
A test with a flakiness score of 0.0 always produces the same result for the same code. A score of 1.0 means the result is essentially random. Most teams set a quarantine threshold between 0.2 and 0.4 -- tests above this threshold are flagged as flaky.
Why this is better than retries
- Catches low-frequency flakiness. Even a test that fails once per 100 runs accumulates a score over time.
- No CI time overhead. You analyze existing results rather than re-running tests.
- Quantifiable. A score of 0.35 vs 0.85 tells you which tests to fix first.
Technique 3: Quarantine-Then-Investigate
Once you detect a flaky test, the next question is: what do you do with it? The quarantine pattern separates flaky tests from your critical path so they stop blocking merges. For a deep dive on this approach, see our guide on how to quarantine flaky tests without losing CI trust.
How quarantine works
- A test exceeds the flakiness threshold.
- It is moved to a "quarantine" lane -- it still runs, but failures do not block PRs.
- GitHub PR checks show: "3 tests failed (all quarantined -- not blocking)."
- The test is investigated and fixed, then moved back to the critical path.
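The gating logic in step 3 can be sketched in a few lines of Python (the function name and message format here are illustrative, not a GitHub API):

```python
def evaluate_pr_check(failures, quarantined):
    """Split failing tests into blocking and quarantined sets.

    Only non-quarantined failures block the merge; quarantined
    failures are reported but do not fail the PR check.
    """
    blocking = [t for t in failures if t not in quarantined]
    ignored = [t for t in failures if t in quarantined]
    if blocking:
        message = f"{len(blocking)} tests failed (blocking)"
    elif ignored:
        message = f"{len(ignored)} tests failed (all quarantined -- not blocking)"
    else:
        message = "all tests passed"
    return blocking, message
```

The key design choice is that quarantined tests still run on every PR: you keep collecting results (and flip-rate data) without letting them block anyone.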
This approach keeps your CI green and trustworthy while giving engineers time to properly fix root causes rather than slapping @retry annotations everywhere.
Technique 4: AI-Powered Root Cause Analysis
Detecting flaky tests is only half the battle. The other half is understanding why they are flaky. We cover each category in detail in our systematic guide to flaky test root cause analysis. Common root causes include:
- Timing issues: Tests that depend on setTimeout, sleep(), or race conditions between async operations.
- Test ordering: Tests that pass in isolation but fail when run after another test that mutates shared state.
- Resource contention: Tests fighting over ports, files, database connections, or external APIs.
- Environment differences: Tests that depend on system time, locale, timezone, or OS-specific behavior.
- Network flakiness: Tests that make real HTTP calls to external services.
- Concurrency bugs: Tests that expose race conditions in the code under test.
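Timing issues tend to be the most common category, and the standard fix is the same across languages: replace fixed sleeps with a bounded polling wait. Here is a minimal Python sketch of such a helper (`wait_for` is an illustrative function of our own, not a library API):

```python
import time

def wait_for(condition, timeout=5.0, interval=0.05):
    """Poll `condition` until it returns True or `timeout` elapses.

    Unlike a fixed sleep, this passes as soon as the condition
    holds and only fails after the full timeout -- so slow CI
    machines get extra time without slowing down the happy path.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if condition():
            return True
        time.sleep(interval)
    return condition()  # one final check at the deadline
```

A test would then call `wait_for(lambda: server.is_ready())` instead of sleeping for a hard-coded 100ms.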
AI-powered analysis can classify failures into these categories by examining error messages, stack traces, and test code patterns. Instead of spending 30 minutes reading logs, you get an immediate diagnosis: "This test is flaky due to a timing issue -- it uses setTimeout(100) which is not reliable under CI load. Suggested fix: use waitFor() or increase timeout to 500ms."
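A crude approximation of this classification can be built from keyword heuristics alone; real tooling would also weigh stack traces and test source code. A Python sketch, with entirely illustrative category names and keyword lists:

```python
# Hypothetical keyword heuristics -- a real classifier would also
# inspect stack traces and test code, not just the error message.
CATEGORY_PATTERNS = {
    "timing": ["timeout", "timed out", "setTimeout", "sleep", "deadline"],
    "ordering": ["already exists", "shared state", "fixture"],
    "resource contention": ["address in use", "port", "locked", "EADDRINUSE"],
    "network": ["connection refused", "ECONNRESET", "502", "503", "dns"],
}

def classify_failure(error_message):
    """Return the first root-cause category whose keywords match."""
    msg = error_message.lower()
    for category, keywords in CATEGORY_PATTERNS.items():
        if any(k.lower() in msg for k in keywords):
            return category
    return "unknown"
```

Even this blunt version can triage the obvious cases ("Error: listen EADDRINUSE" is resource contention, not a regression) and leave only the genuinely ambiguous failures for a human or a model to examine.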
Putting It All Together
The most effective flaky test strategy combines all four techniques:
- Retries for immediate unblocking -- your CI stays green while you investigate.
- Flip-rate analysis for systematic detection -- you know exactly which tests are flaky and how flaky they are.
- Quarantine for workflow management -- flaky tests are separated from real failures so your team trusts CI again.
- AI root cause analysis for faster fixes -- each flaky test gets a diagnosis and suggested fix, cutting triage time from 28 minutes to under 5.
Getting Started
If you are starting from scratch, here is the minimum viable approach:
- Enable CI retries (2 retries is usually enough).
- Start logging test results to a database or service.
- Compute flakiness scores weekly and share with the team.
- Quarantine the top offenders.
- Fix them one by one, starting with the highest-score tests. See our 7 proven fix patterns for specific techniques.
Or, if you want to skip the build-it-yourself phase, tools like FlakyGuard automate the entire pipeline: detection, quarantine, root cause analysis, and team analytics out of the box.
Stop wasting hours on flaky tests
FlakyGuard automatically detects, quarantines, and diagnoses flaky tests in your CI pipeline. Join the waitlist to be among the first teams to try it.
Join the Waitlist