You know you have flaky tests. You have detected them, maybe even quarantined them. But they are still there, consuming CI minutes, eroding trust, and nobody wants to touch them because nobody knows why they flake.
Root cause analysis is where most teams get stuck. Flaky tests fail intermittently by definition, which means you cannot just read the error message and fix the bug. The failure might not reproduce locally. It might only happen under load, on a specific CI runner, or when tests run in a particular order.
This guide gives you a systematic framework for diagnosing flaky test root causes. Every flaky test falls into one of six categories, and each category has distinct patterns you can identify from error messages, stack traces, and test code.
The Six Root Cause Categories
Analysis of thousands of flaky test failures across engineering organizations yields a clear taxonomy. Every flaky test traces back to one of these root causes:
- Timing and async issues -- the most common category
- Test ordering dependencies -- shared mutable state between tests
- Resource contention -- ports, files, databases, memory
- Concurrency bugs -- race conditions in the code under test
- Environment sensitivity -- timezone, locale, OS, CI runner differences
- Network dependencies -- external APIs, DNS, TLS handshakes
Category 1: Timing and Async Issues
Timing issues account for roughly 40% of all flaky tests. They happen when a test makes assumptions about how long an operation takes rather than waiting for it to complete.
How to identify
Look for these patterns in the error output and test code:
- Error messages containing `timeout`, `ETIMEDOUT`, or `deadline exceeded`
- `setTimeout`, `sleep()`, or `Thread.sleep()` in test code
- Assertions that check for a state change without an explicit wait
- `waitFor` calls with timeouts that are too short for CI environments
- Tests that pass locally but fail in CI (CI runners have shared CPU, so operations take longer)
Common examples
```js
// BAD: Assumes the DOM updates within 100ms
await sleep(100);
expect(screen.getByText('Loaded')).toBeInTheDocument();

// GOOD: Waits for the actual condition
await waitFor(() => {
  expect(screen.getByText('Loaded')).toBeInTheDocument();
}, { timeout: 5000 });
```

```python
# BAD: Fixed delay before asserting
time.sleep(2)
assert response.status == 'complete'

# GOOD: Poll with backoff
await_condition(
    lambda: client.get_status() == 'complete',
    timeout=30,
    interval=0.5
)
```

Fix patterns
- Replace all `sleep()`/`setTimeout()` calls with explicit `waitFor()` or polling
- Increase timeouts for CI environments (2-3x local values)
- Use `{timeout: process.env.CI ? 10000 : 5000}` for environment-aware timeouts
- For UI tests, use framework-provided waiters (`waitFor`, `findByText`, Playwright's auto-waiting)
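The `await_condition` helper shown in the Python example is not a standard library function. A minimal sketch of such a polling helper might look like this (the name, defaults, and `backoff` parameter are illustrative):

```python
import time

def await_condition(predicate, timeout=30, interval=0.5, backoff=1.0):
    """Poll `predicate` until it returns True or `timeout` seconds elapse.

    Raising TimeoutError gives a clearer failure than an assertion that
    fired too early. A `backoff` > 1.0 stretches the interval each retry.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if predicate():
            return
        time.sleep(interval)
        interval *= backoff
    raise TimeoutError(f"condition not met within {timeout}s")
```

Because it waits for the actual condition rather than a fixed delay, the same call works unchanged on a fast laptop and a slow shared CI runner; only the worst-case timeout differs.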
Category 2: Test Ordering Dependencies
Ordering issues happen when tests share state and depend on running in a specific sequence. Test A creates some data, test B reads it, and when the test runner randomizes order or runs them in parallel, test B fails.
How to identify
- Tests that pass when run individually (`it.only`) but fail in the full suite
- Tests that fail only when run in a specific order
- Global variables, module-level state, or singleton patterns in test setup
- Missing `beforeEach`/`afterEach` cleanup
- Database tests without transaction rollback or per-test isolation
Common examples
```js
// BAD: Tests share a module-level variable
let userCount = 0;

test('creates a user', () => {
  userCount++;
  expect(userCount).toBe(1); // Fails if another test incremented it first
});

// GOOD: Each test has its own state
test('creates a user', () => {
  const counter = new UserCounter();
  counter.increment();
  expect(counter.count).toBe(1);
});
```

Fix patterns
- Ensure each test creates and tears down its own state
- Use `beforeEach` to reset shared fixtures
- For database tests, wrap each test in a transaction that rolls back
- Avoid global mutable state -- prefer dependency injection
- Run tests with randomized ordering to catch these early (`jest --randomize`, `pytest -p randomly`)
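The transaction-rollback pattern can be sketched with Python's built-in `sqlite3`; the same idea applies to any database with transactions, and a real suite would wire it into a pytest fixture rather than a bare context manager:

```python
import sqlite3
from contextlib import contextmanager

@contextmanager
def rollback_transaction(conn):
    """Run a test body inside a transaction that is always rolled back,
    so no test can observe another test's writes."""
    conn.execute("BEGIN")
    try:
        yield conn
    finally:
        conn.rollback()

# Seed a database once, outside any test.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT)")
conn.commit()

# Inside the "test", the write is visible...
with rollback_transaction(conn) as tx:
    tx.execute("INSERT INTO users VALUES ('alice')")
    assert tx.execute("SELECT COUNT(*) FROM users").fetchone()[0] == 1

# ...but after rollback the next test starts from the seeded state.
assert conn.execute("SELECT COUNT(*) FROM users").fetchone()[0] == 0
```

Because every test sees only the seed data, the suite passes in any order and in parallel (given one connection per worker).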
Category 3: Resource Contention
Resource contention happens when tests compete for shared system resources: ports, files, database connections, or memory. This is especially common in integration tests that spin up servers or database containers.
How to identify
- Errors containing `EADDRINUSE`, `port already in use`, or `address already bound`
- File system errors: `ENOENT`, `EACCES`, `EEXIST` on temp files
- Database errors: `deadlock detected`, `lock timeout`, `too many connections`
- Out-of-memory errors that only happen when the full test suite runs
- Tests that pass in isolation but fail when run in parallel
Fix patterns
- Use dynamic port allocation (`port: 0`) instead of hardcoded ports
- Create unique temp directories per test (`fs.mkdtempSync()`)
- Use per-test database schemas or isolated database instances
- Mock external resources in unit tests; reserve real resources for integration tests only
- Set up proper resource cleanup in `afterEach`/`afterAll` hooks
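Dynamic port allocation is one line in most languages: bind to port 0 and let the OS choose. A Python sketch:

```python
import socket

def get_free_port():
    """Ask the OS for an unused TCP port by binding to port 0.
    Each parallel test worker gets its own port, avoiding EADDRINUSE."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("127.0.0.1", 0))  # port 0 means "any free port"
        return s.getsockname()[1]
```

Note there is a small race window between closing this socket and the server binding the port. Where possible, have the server under test bind port 0 itself and report the port it actually received.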
Category 4: Concurrency Bugs
Sometimes the test itself is fine, but the code under test has a race condition that surfaces intermittently. The test is flaky because the code is buggy -- the flaky test is doing its job by catching a real issue, just not reliably.
How to identify
- Errors involving `race condition`, `concurrent modification`, or `ConcurrentModificationException`
- Data corruption or unexpected values in multi-threaded operations
- Tests involving async operations, parallel workers, or event-driven code
- Failures that seem "random" with no pattern -- different assertions fail on different runs
- Shared mutable state accessed from multiple goroutines, threads, or async callbacks
Fix patterns
- Add proper synchronization (mutexes, locks, semaphores) to shared state
- Use immutable data structures where possible
- For Go code, run tests with the `-race` flag to detect data races
- In JavaScript, ensure promises are properly awaited and not left dangling
- Consider using stress testing tools to reproduce: run the flaky test 100 times in a loop
Category 5: Environment Sensitivity
Environment-sensitive tests depend on aspects of the runtime environment that differ between machines: timezone, locale, OS, CPU architecture, available memory, or installed system libraries.
How to identify
- Tests that pass on macOS but fail on Linux CI runners (or vice versa)
- Date/time assertions that fail in different timezones
- String comparison failures due to locale-dependent sorting
- File path assertions using `\` vs `/`
- Snapshot tests that differ between environments (font rendering, floating point precision)
Common examples
```js
// BAD: Depends on the system timezone
const date = new Date('2026-03-13');
expect(date.getDate()).toBe(13); // Fails in UTC-12

// GOOD: Pin the timezone
process.env.TZ = 'UTC';
const date = new Date('2026-03-13T00:00:00Z');
expect(date.getUTCDate()).toBe(13);
```

Fix patterns
- Pin timezone in tests: `process.env.TZ = 'UTC'` or `TZ=UTC` in CI config
- Pin locale for sorting: use explicit comparators instead of default `sort()`
- Use `path.join()` instead of string concatenation for file paths
- Mock `Date.now()` for time-dependent tests
- Use platform-agnostic assertions (regex matchers instead of exact string matches)
Category 6: Network Dependencies
Tests that depend on external network resources -- third-party APIs, DNS resolution, or services running on other hosts -- are inherently flaky. Networks are unreliable, and external services have downtime, rate limits, and latency spikes.
How to identify
- Errors containing `ECONNREFUSED`, `ENOTFOUND`, `ECONNRESET`
- HTTP status codes like `429 Too Many Requests`, `502 Bad Gateway`, `503 Service Unavailable`
- TLS/SSL handshake failures
- Tests that call real external APIs (Stripe, GitHub, Twilio) without mocking
- DNS resolution failures in CI environments with restricted network access
Fix patterns
- Mock all external HTTP calls in unit tests (use `nock`, `msw`, `responses`, or `wiremock`)
- Use contract tests (Pact) for integration verification instead of live API calls
- Record and replay HTTP interactions for deterministic tests (`vcr`, `polly`)
- Add retry logic with exponential backoff for integration tests that genuinely need network
- Set explicit timeouts on all HTTP clients used in tests
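For integration tests that genuinely need the network, a retry helper with exponential backoff and jitter might look like the sketch below. The exception types to retry on depend on your HTTP client; `ConnectionError` and `TimeoutError` here are stand-ins:

```python
import random
import time

def retry_with_backoff(fn, retries=4, base_delay=0.5, max_delay=8.0):
    """Call fn(), retrying transient failures with exponential backoff
    plus jitter. Unit tests should mock the call instead of retrying."""
    for attempt in range(retries):
        try:
            return fn()
        except (ConnectionError, TimeoutError):
            if attempt == retries - 1:
                raise  # out of retries: surface the real error
            delay = min(base_delay * (2 ** attempt), max_delay)
            time.sleep(delay + random.uniform(0, delay / 2))
```

The jitter matters: without it, parallel test workers that all hit a rate limit retry in lockstep and hit it again.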
Pattern-Matching Approach to Root Cause Analysis
When you have a flaky test, you do not need to guess which category it falls into. You can systematically narrow it down by examining three pieces of evidence:
- The error message and stack trace. Each category has distinct error signatures. A `timeout exceeded` error is timing. An `EADDRINUSE` is resource contention. An `ECONNREFUSED` is network.
- The test code. Look for `sleep()` calls (timing), shared state (ordering), hardcoded ports (resource), multi-threaded operations (concurrency), system-dependent values (environment), or HTTP calls (network).
- The failure pattern. Does it fail more on CI than locally (timing or environment)? Does it fail only when run with other tests (ordering or resource)? Does it fail at specific times of day (network -- the external service has peak hours)?
This pattern-matching approach scales. Instead of spending an hour manually debugging each flaky test, you can classify the root cause in minutes by checking error patterns against known signatures.
Automating Root Cause Analysis with AI
Manual pattern matching works but does not scale to hundreds of flaky tests. The pattern signatures are well-defined enough to be automated -- and AI makes this even more powerful.
An AI-powered root cause analyzer can:
- Classify the error. Match the error message and stack trace against pattern libraries for all six categories simultaneously. A human might miss that a `connection reset` buried in a 200-line stack trace points to network flakiness; pattern matching catches it instantly.
- Read the test code. Analyze the test source for anti-patterns: `sleep()` calls, missing cleanup, hardcoded ports, unmocked HTTP clients.
- Score confidence. Rank the likely root cause by confidence. Some failures map cleanly to one category; others are ambiguous (a timeout in a network call could be timing or network). A confidence score helps you prioritize investigation.
- Suggest specific fixes. Instead of "fix the timing issue," an AI analyzer can say "replace the `sleep(2000)` on line 47 with `await waitFor(() => ...)`."
The key advantage is speed. Manually diagnosing a single flaky test takes 30-60 minutes of an engineer's time. Automated analysis takes seconds and can run on every quarantined test, every day, without human involvement.
Building a Root Cause Analysis Pipeline
Here is a practical architecture for automated root cause analysis:
- Ingest test results. Collect JUnit XML or equivalent from every CI run. Store results per test case with error messages and stack traces.
- Identify flaky tests. Compute flip rates (pass/fail transitions on the same code). Flag tests above your threshold (typically 0.2-0.3).
- Classify root causes. Run pattern matching against the error corpus. Use weighted pattern signatures per category. Score each category and pick the highest-confidence match.
- Surface results. Show the root cause, confidence, matched patterns, and suggested fix in your dashboard and PR comments.
- Track resolution. When a test is fixed and unquarantined, record which category it was and how it was resolved. This builds your organization's flaky test knowledge base.
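The flip rate in step 2 is simple to compute: count pass/fail transitions between consecutive runs of the same test on the same code. A sketch:

```python
def flip_rate(outcomes):
    """Fraction of consecutive same-code runs where the result flipped
    between pass and fail. 0.0 is stable; higher means flakier."""
    if len(outcomes) < 2:
        return 0.0
    flips = sum(1 for a, b in zip(outcomes, outcomes[1:]) if a != b)
    return flips / (len(outcomes) - 1)

runs = ["pass", "pass", "fail", "pass", "fail", "pass"]
assert flip_rate(runs) == 0.8  # 4 flips across 5 transitions
assert flip_rate(["pass"] * 10) == 0.0
```

Note that a test which always fails has a flip rate of 0.0: it is broken, not flaky, and belongs in a different bucket.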
Prioritizing Which Flaky Tests to Fix
Not all flaky tests are equally worth fixing. Prioritize by impact:
- CI minutes wasted. A test that runs on every PR and flakes 20% of the time wastes far more time than a test that flakes 50% but only runs in nightly builds.
- Team frustration. Flaky tests in frequently-touched code paths cause the most developer friction.
- Root cause confidence. If the automated analysis shows 95% confidence it is a timing issue with a clear fix, that is a quick win. If the confidence is 40% across two categories, it will take longer to investigate.
- Quarantine age. Tests that have been quarantined for 30+ days without attention need escalation, not more delay.
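One way to turn these factors into a sortable number is a small scoring function. This is an illustrative formula, not a standard metric; every input and weight below is an assumption to tune for your team:

```python
def priority_score(flake_rate, runs_per_day, minutes_per_failure, confidence):
    """Illustrative priority: expected engineer-minutes wasted per day,
    weighted toward diagnoses the analyzer is confident about (quick
    wins first). All weights here are assumptions, not a standard."""
    expected_waste = flake_rate * runs_per_day * minutes_per_failure
    return expected_waste * (0.5 + 0.5 * confidence)

# A PR-blocking test flaking 20% of the time outranks a nightly-only
# test flaking 50%, matching the CI-minutes argument above.
pr_test = priority_score(0.2, runs_per_day=40, minutes_per_failure=15,
                         confidence=0.95)
nightly = priority_score(0.5, runs_per_day=1, minutes_per_failure=15,
                         confidence=0.95)
assert pr_test > nightly
```

Add a quarantine-age term (or a hard escalation rule) if old quarantined tests keep losing to newer, noisier ones.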
Getting Started
If you are starting root cause analysis for the first time:
- Pick your top 5 flaky tests by failure frequency or CI time wasted.
- Read the error messages and classify each into one of the six categories using the patterns above.
- Fix the easiest one first. Timing issues are usually the quickest to fix (replace sleep with waitFor). See our 7 proven fix patterns for step-by-step techniques. Get a win.
- Document what you find. Track root cause category, fix applied, and time to fix. This data helps your team improve test quality systematically.
Or, automate the entire pipeline. FlakyGuard analyzes every flaky test automatically: it ingests CI results, detects flaky tests via flip-rate analysis, classifies root causes using AI pattern matching across all six categories, and surfaces the diagnosis with confidence scores and specific fix suggestions -- directly in your GitHub PR checks and Slack notifications.
Stop guessing why tests flake
FlakyGuard uses AI to classify flaky test root causes across six categories, scores confidence, and suggests specific fixes. Get automated root cause analysis on every quarantined test, every day.
Join the Waitlist