Flaky Test Root Cause Analysis: A Systematic Guide

You know you have flaky tests. You have detected them, maybe even quarantined them. But they are still there, consuming CI minutes, eroding trust, and nobody wants to touch them because nobody knows why they flake.

Root cause analysis is where most teams get stuck. Flaky tests fail intermittently by definition, which means you cannot just read the error message and fix the bug. The failure might not reproduce locally. It might only happen under load, on a specific CI runner, or when tests run in a particular order.

This guide gives you a systematic framework for diagnosing flaky test root causes. Every flaky test falls into one of six categories, and each category has distinct patterns you can identify from error messages, stack traces, and test code.

The Six Root Cause Categories

Analysis of thousands of flaky test failures across engineering organizations reveals a clear taxonomy. Every flaky test traces back to one of these root causes:

  1. Timing and async issues -- the most common category
  2. Test ordering dependencies -- shared mutable state between tests
  3. Resource contention -- ports, files, databases, memory
  4. Concurrency bugs -- race conditions in the code under test
  5. Environment sensitivity -- timezone, locale, OS, CI runner differences
  6. Network dependencies -- external APIs, DNS, TLS handshakes

Category 1: Timing and Async Issues

Timing issues account for roughly 40% of all flaky tests. They happen when a test makes assumptions about how long an operation takes rather than waiting for it to complete.

How to identify

Look for these patterns in the error output and test code:

  • Error messages containing timeout, ETIMEDOUT, or deadline exceeded
  • setTimeout, sleep(), or Thread.sleep() in test code
  • Assertions that check for a state change without an explicit wait
  • waitFor calls with timeouts that are too short for CI environments
  • Tests that pass locally but fail in CI (CI runners have shared CPU, so operations take longer)

Common examples

// BAD: Assumes the DOM updates within 100ms
await sleep(100);
expect(screen.getByText('Loaded')).toBeInTheDocument();

// GOOD: Waits for the actual condition
await waitFor(() => {
  expect(screen.getByText('Loaded')).toBeInTheDocument();
}, { timeout: 5000 });

# BAD: Fixed delay before asserting
time.sleep(2)
assert response.status == 'complete'

# GOOD: Poll for the condition instead of sleeping a fixed amount
# (await_condition is a small polling helper: it re-checks the lambda
# every `interval` seconds until it returns True or `timeout` is reached)
await_condition(
    lambda: client.get_status() == 'complete',
    timeout=30,
    interval=0.5
)

Fix patterns

  • Replace all sleep() / setTimeout() with explicit waitFor() or polling
  • Increase timeouts for CI environments (2-3x local values)
  • Use {timeout: process.env.CI ? 10000 : 5000} for environment-aware timeouts (sketched after this list)
  • For UI tests, use framework-provided waiters (waitFor, findByText, Playwright's auto-waiting)
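
Here is a minimal sketch of the environment-aware timeout idea, assuming a Jest setup file (the 30s/10s values are illustrative, not prescriptive):

// jest.setup.js -- give every test more headroom on shared CI runners
// Most CI providers set process.env.CI automatically
jest.setTimeout(process.env.CI ? 30000 : 10000);

Wire the file in through Jest's setupFilesAfterEnv option so the timeout applies to the whole suite rather than being repeated in every test.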

Category 2: Test Ordering Dependencies

Ordering issues happen when tests share state and depend on running in a specific sequence. Test A creates some data, test B reads it, and when the test runner randomizes order or runs them in parallel, test B fails.

How to identify

  • Tests that pass when run individually (it.only) but fail in the full suite
  • Tests that fail only when run in a specific order
  • Global variables, module-level state, or singleton patterns in test setup
  • Missing beforeEach / afterEach cleanup
  • Database tests without transaction rollback or per-test isolation

Common examples

// BAD: Tests share a module-level variable
let userCount = 0;

test('creates a user', () => {
  userCount++;
  expect(userCount).toBe(1); // Fails if another test incremented it first
});

// GOOD: Each test has its own state
test('creates a user', () => {
  const counter = new UserCounter();
  counter.increment();
  expect(counter.count).toBe(1);
});

Fix patterns

  • Ensure each test creates and tears down its own state
  • Use beforeEach to reset shared fixtures
  • For database tests, wrap each test in a transaction that rolls back (see the sketch after this list)
  • Avoid global mutable state -- prefer dependency injection
  • Run tests with randomized ordering to catch these early (jest --randomize, pytest -p randomly)
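
To make the transaction-rollback pattern concrete, here is a hedged sketch assuming a knex-style db client -- the client, table, and data are stand-ins for whatever your project uses:

// Assumed: `db` is a knex-style client with a promise-based transaction API.
// Each test runs inside its own transaction, rolled back afterwards,
// so no test can observe another test's writes.
let trx;

beforeEach(async () => {
  trx = await db.transaction();   // fresh transaction per test
});

afterEach(async () => {
  await trx.rollback();           // discard everything the test wrote
});

test('creates a user', async () => {
  await trx('users').insert({ name: 'Ada' });
  const rows = await trx('users').where({ name: 'Ada' });
  expect(rows).toHaveLength(1);
});

The same shape works with most ORMs that expose explicit transactions; the important part is that the rollback lives in afterEach so it runs even when the test fails.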

Category 3: Resource Contention

Resource contention happens when tests compete for shared system resources: ports, files, database connections, or memory. This is especially common in integration tests that spin up servers or database containers.

How to identify

  • Errors containing EADDRINUSE, port already in use, or address already bound
  • File system errors: ENOENT, EACCES, EEXIST on temp files
  • Database errors: deadlock detected, lock timeout, too many connections
  • Out-of-memory errors that only happen when the full test suite runs
  • Tests that pass in isolation but fail when run in parallel

Fix patterns

  • Use dynamic port allocation (port: 0) instead of hardcoded ports, as sketched after this list
  • Create unique temp directories per test (fs.mkdtempSync())
  • Use per-test database schemas or isolated database instances
  • Mock external resources in unit tests; reserve real resources for integration tests only
  • Set up proper resource cleanup in afterEach / afterAll hooks
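
As a sketch of the first two patterns, using only Node built-ins plus Jest (the tiny HTTP server and the state.json file are stand-ins for whatever your integration test actually starts and writes):

const http = require('http');
const fs = require('fs');
const os = require('os');
const path = require('path');

let server;
let tmpDir;

beforeEach((done) => {
  // Unique temp directory per test -- no collisions between parallel workers
  tmpDir = fs.mkdtempSync(path.join(os.tmpdir(), 'mytest-'));
  // Listen on port 0 so the OS picks a free port instead of a hardcoded one
  server = http.createServer((req, res) => res.end('ok'));
  server.listen(0, done);
});

afterEach((done) => {
  fs.rmSync(tmpDir, { recursive: true, force: true });
  server.close(done);
});

test('each test gets its own port and scratch dir', () => {
  const { port } = server.address();   // the port the OS picked for this test
  expect(port).toBeGreaterThan(0);
  fs.writeFileSync(path.join(tmpDir, 'state.json'), JSON.stringify({ port }));
  expect(fs.existsSync(path.join(tmpDir, 'state.json'))).toBe(true);
});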

Category 4: Concurrency Bugs

Sometimes the test itself is fine, but the code under test has a race condition that surfaces intermittently. The test is flaky because the code is buggy -- the flaky test is doing its job by catching a real issue, just not reliably.

How to identify

  • Errors involving race condition, concurrent modification, or ConcurrentModificationException
  • Data corruption or unexpected values in multi-threaded operations
  • Tests involving async operations, parallel workers, or event-driven code
  • Failures that seem "random" with no pattern -- different assertions fail on different runs
  • Shared mutable state accessed from multiple goroutines, threads, or async callbacks

Fix patterns

  • Add proper synchronization (mutexes, locks, semaphores) to shared state
  • Use immutable data structures where possible
  • For Go code, run tests with -race flag to detect data races
  • In JavaScript, ensure promises are properly awaited and not left dangling
  • Consider using stress testing tools to reproduce: run the flaky test 100 times in a loop (a Jest sketch follows)
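
A small sketch of that last point in Jest -- transferFunds is a hypothetical stand-in for whatever operation you suspect is racy:

// Repeat the scenario many times in one run to surface intermittent races.
test('concurrent transfers never lose money', async () => {
  for (let i = 0; i < 100; i++) {
    const account = { balance: 100 };
    // Fire the operations concurrently and await them all --
    // a dangling, unawaited promise here would hide the race entirely.
    await Promise.all([
      transferFunds(account, -30),   // hypothetical code under test
      transferFunds(account, -30),
    ]);
    expect(account.balance).toBe(40);
  }
});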

Category 5: Environment Sensitivity

Environment-sensitive tests depend on aspects of the runtime environment that differ between machines: timezone, locale, OS, CPU architecture, available memory, or installed system libraries.

How to identify

  • Tests that pass on macOS but fail on Linux CI runners (or vice versa)
  • Date/time assertions that fail in different timezones
  • String comparison failures due to locale-dependent sorting
  • File path assertions using \ vs /
  • Snapshot tests that differ between environments (font rendering, floating point precision)

Common examples

// BAD: Depends on the system timezone
const date = new Date('2026-03-13');
expect(date.getDate()).toBe(13); // Fails in any timezone behind UTC

// GOOD: Pin the timezone
process.env.TZ = 'UTC';
const date = new Date('2026-03-13T00:00:00Z');
expect(date.getUTCDate()).toBe(13);

Fix patterns

  • Pin timezone in tests: process.env.TZ = 'UTC' or TZ=UTC in CI config
  • Pin locale for sorting: use explicit comparators instead of default sort()
  • Use path.join() instead of string concatenation for file paths
  • Mock Date.now() for time-dependent tests (sketched after this list)
  • Use platform-agnostic assertions (regex matchers instead of exact string matches)
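
Here is a minimal sketch of pinning time with Jest's modern fake timers (the pinned date is arbitrary):

beforeEach(() => {
  // Freeze "now" so Date.now() and new Date() are deterministic
  jest.useFakeTimers();
  jest.setSystemTime(new Date('2026-03-13T00:00:00Z'));
});

afterEach(() => {
  jest.useRealTimers();
});

test('formats today as an ISO date', () => {
  expect(new Date().toISOString().slice(0, 10)).toBe('2026-03-13');
});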

Category 6: Network Dependencies

Tests that depend on external network resources -- third-party APIs, DNS resolution, or services running on other hosts -- are inherently flaky. Networks are unreliable, and external services have downtime, rate limits, and latency spikes.

How to identify

  • Errors containing ECONNREFUSED, ENOTFOUND, ECONNRESET
  • HTTP status codes like 429 Too Many Requests, 502 Bad Gateway, 503 Service Unavailable
  • TLS/SSL handshake failures
  • Tests that call real external APIs (Stripe, GitHub, Twilio) without mocking
  • DNS resolution failures in CI environments with restricted network access

Fix patterns

  • Mock all external HTTP calls in unit tests (use nock, msw, responses, or wiremock -- a nock sketch follows this list)
  • Use contract tests (Pact) for integration verification instead of live API calls
  • Record and replay HTTP interactions for deterministic tests (vcr, polly)
  • Add retry logic with exponential backoff for integration tests that genuinely need network
  • Set explicit timeouts on all HTTP clients used in tests
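
As a hedged sketch of the first pattern with nock: the host, endpoint, and fetchCharges client below are illustrative stand-ins, and the example assumes the client issues requests through Node's http module, which nock intercepts.

const nock = require('nock');

afterEach(() => nock.cleanAll());   // never leak interceptors between tests

test('handles a rate-limited upstream', async () => {
  // Intercept the outbound call -- no real network traffic in unit tests
  nock('https://api.example.com')
    .get('/v1/charges')
    .reply(429, { error: 'rate_limited' });

  const result = await fetchCharges();   // hypothetical client under test
  expect(result.retryable).toBe(true);
});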

Pattern-Matching Approach to Root Cause Analysis

When you have a flaky test, you do not need to guess which category it falls into. You can systematically narrow it down by examining three pieces of evidence:

  1. The error message and stack trace. Each category has distinct error signatures. A timeout exceeded error points to timing, EADDRINUSE to resource contention, and ECONNREFUSED to network.
  2. The test code. Look for sleep() calls (timing), shared state (ordering), hardcoded ports (resource), multi-threaded operations (concurrency), system-dependent values (environment), or HTTP calls (network).
  3. The failure pattern. Does it fail more in CI than locally (timing or environment)? Only when run with other tests (ordering or resource)? At specific times of day (network -- the external service has peak hours)?

This pattern-matching approach scales. Instead of spending an hour manually debugging each flaky test, you can classify the root cause in minutes by checking error patterns against known signatures.
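
Here is a sketch of what those error signatures look like as code. The pattern lists are deliberately abbreviated and illustrative; a real classifier would carry many more patterns with per-category weights.

// Abbreviated signature table -- not exhaustive
const SIGNATURES = {
  timing:      [/timeout/i, /ETIMEDOUT/, /deadline exceeded/i],
  ordering:    [/duplicate key/i, /already exists/i],
  resource:    [/EADDRINUSE/, /too many connections/i, /deadlock detected/i],
  concurrency: [/ConcurrentModificationException/, /race/i],
  environment: [/Invalid Date/, /unknown time ?zone/i],
  network:     [/ECONNREFUSED/, /ENOTFOUND/, /ECONNRESET/, /\b(429|502|503)\b/],
};

// Score every category against the error text and return the best match
function classify(errorText) {
  const scores = Object.entries(SIGNATURES).map(([category, patterns]) => ({
    category,
    score: patterns.filter((p) => p.test(errorText)).length,
  }));
  scores.sort((a, b) => b.score - a.score);
  return scores[0].score > 0 ? scores[0] : { category: 'unknown', score: 0 };
}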

Automating Root Cause Analysis with AI

Manual pattern matching works but does not scale to hundreds of flaky tests. The pattern signatures are well-defined enough to be automated -- and AI makes this even more powerful.

An AI-powered root cause analyzer can:

  • Classify the error. Match the error message and stack trace against pattern libraries for all six categories simultaneously. A human might miss that a connection reset buried in a 200-line stack trace points to network flakiness; pattern matching catches it instantly.
  • Read the test code. Analyze the test source for anti-patterns: sleep() calls, missing cleanup, hardcoded ports, unmocked HTTP clients.
  • Score confidence. Rank the likely root cause by confidence. Some failures map cleanly to one category; others are ambiguous (a timeout in a network call could be timing or network). A confidence score helps you prioritize investigation.
  • Suggest specific fixes. Instead of "fix the timing issue," an AI analyzer can say "replace the sleep(2000) on line 47 with await waitFor(() => ...)."

The key advantage is speed. Manually diagnosing a single flaky test takes 30-60 minutes of an engineer's time. Automated analysis takes seconds and can run on every quarantined test, every day, without human involvement.

Building a Root Cause Analysis Pipeline

Here is a practical architecture for automated root cause analysis:

  1. Ingest test results. Collect JUnit XML or equivalent from every CI run. Store results per test case with error messages and stack traces.
  2. Identify flaky tests. Compute flip rates (pass/fail transitions on the same code). Flag tests above your threshold (typically 0.2-0.3); a sketch follows this list.
  3. Classify root causes. Run pattern matching against the error corpus. Use weighted pattern signatures per category. Score each category and pick the highest-confidence match.
  4. Surface results. Show the root cause, confidence, matched patterns, and suggested fix in your dashboard and PR comments.
  5. Track resolution. When a test is fixed and unquarantined, record which category it was and how it was resolved. This builds your organization's flaky test knowledge base.
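
For step 2, a flip rate is just the fraction of adjacent runs on unchanged code whose outcomes differ. A minimal sketch, using the 0.2-0.3 threshold mentioned above:

// results: ordered outcomes for one test case on the same code,
// e.g. ['pass', 'fail', 'pass', 'pass', 'fail']
function flipRate(results) {
  if (results.length < 2) return 0;
  let flips = 0;
  for (let i = 1; i < results.length; i++) {
    if (results[i] !== results[i - 1]) flips++;
  }
  return flips / (results.length - 1);   // normalize by adjacent pairs
}

const isFlaky = flipRate(['pass', 'fail', 'pass', 'pass', 'fail']) >= 0.25;   // 0.75 -> flagged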

Prioritizing Which Flaky Tests to Fix

Not all flaky tests are equally worth fixing. Prioritize by impact:

  • CI minutes wasted. A test that runs on every PR and flakes 20% of the time wastes far more time than a test that flakes 50% but only runs in nightly builds.
  • Team frustration. Flaky tests in frequently-touched code paths cause the most developer friction.
  • Root cause confidence. If the automated analysis shows 95% confidence it is a timing issue with a clear fix, that is a quick win. If the confidence is 40% across two categories, it will take longer to investigate.
  • Quarantine age. Tests that have been quarantined for 30+ days without attention need escalation, not more delay.

Getting Started

If you are starting root cause analysis for the first time:

  1. Pick your top 5 flaky tests by failure frequency or CI time wasted.
  2. Read the error messages and classify each into one of the six categories using the patterns above.
  3. Fix the easiest one first. Timing issues are usually the quickest to fix (replace sleep with waitFor). See our 7 proven fix patterns for step-by-step techniques. Get a win.
  4. Document what you find. Track root cause category, fix applied, and time to fix. This data helps your team improve test quality systematically.

Or, automate the entire pipeline. FlakyGuard analyzes every flaky test automatically: it ingests CI results, detects flaky tests via flip-rate analysis, classifies root causes using AI pattern matching across all six categories, and surfaces the diagnosis with confidence scores and specific fix suggestions -- directly in your GitHub PR checks and Slack notifications.

Stop guessing why tests flake

FlakyGuard uses AI to classify flaky test root causes across six categories, scores confidence, and suggests specific fixes. Get automated root cause analysis on every quarantined test, every day.

Join the Waitlist