Eliminating Flaky Tests: A Systematic Approach
A flaky test is a test that sometimes passes and sometimes fails without any code changes. At a 10% flake rate, developers stop trusting the test suite. At 20%, they stop running it.
I've taken suites from 10% flaky to under 1%. Here's the systematic approach.
Step 1: Measure the Flake Rate
You can't fix what you don't measure. Track flakiness over time:
```python
# Simple flake tracker in CI
import json
from datetime import datetime

def record_test_result(test_name, passed, run_id):
    # Append one JSON record per test result so history survives across runs
    with open('test_history.jsonl', 'a') as f:
        json.dump({
            'test': test_name,
            'passed': passed,
            'run_id': run_id,
            'timestamp': datetime.utcnow().isoformat()
        }, f)
        f.write('\n')
```
Run this for two weeks. Any test that fails more than twice without code changes is flaky.
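Once you have that history, a short script can turn `test_history.jsonl` into a per-test report. A minimal sketch; the threshold mirrors the rule above, and everything beyond the tracker's field names is my own:

```python
# Summarize flake candidates from the history written by record_test_result
import json
from collections import defaultdict

def flake_report(path='test_history.jsonl', min_failures=3):
    results = defaultdict(list)
    with open(path) as f:
        for line in f:
            rec = json.loads(line)
            results[rec['test']].append(rec['passed'])
    for test, outcomes in sorted(results.items()):
        failures = outcomes.count(False)
        # Flaky = failed more than twice AND also passed at least once
        if failures >= min_failures and failures < len(outcomes):
            rate = failures / len(outcomes)
            print(f'{test}: {failures}/{len(outcomes)} failed ({rate:.0%})')

if __name__ == '__main__':
    flake_report()
```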
Step 2: Categorize the Flakes
In my experience, flaky tests fall into 5 categories:
| Category | % of Flakes | Example |
|---|---|---|
| Timing/async | 40% | Test checks element before it renders |
| Shared state | 25% | Test A writes data that breaks Test B |
| Network | 15% | External API times out |
| Randomness | 10% | Test uses random data that triggers edge cases |
| Environment | 10% | Different behavior on CI vs local |
Step 3: Fix by Category
Timing: Use Explicit Waits, Not Sleep
```python
# Bad: arbitrary sleep -- too short on a slow CI box, wasted time on a fast one
time.sleep(3)
assert element.is_displayed()

# Good: explicit wait with a condition, polling for up to 10 seconds
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

element = WebDriverWait(driver, 10).until(
    EC.visibility_of_element_located((By.ID, "result"))
)
```
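The same pattern works outside the browser: poll a condition with a deadline instead of sleeping. Here is a generic helper sketch; the name `wait_until` and its defaults are assumptions of mine, not a library API:

```python
import time

def wait_until(condition, timeout=10.0, interval=0.1):
    """Poll `condition` until it returns a truthy value or `timeout` expires."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() > deadline:
            raise TimeoutError(f'condition not met within {timeout}s')
        time.sleep(interval)

# Usage: replaces time.sleep(3) before checking a background job
# wait_until(lambda: job.status == 'done', timeout=5)
```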
Shared State: Isolate Every Test
```python
import pytest
from sqlalchemy.orm import Session

# Each test gets its own database transaction that rolls back
@pytest.fixture(autouse=True)
def db_session(db):
    connection = db.engine.connect()
    transaction = connection.begin()
    session = Session(bind=connection)
    yield session
    session.close()
    transaction.rollback()  # undo everything the test wrote
    connection.close()
```
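To see the isolation in action, here is what a pair of otherwise order-dependent tests looks like with the fixture in place (the `User` model is a hypothetical example):

```python
# Both tests run against a clean table because each test's writes
# happen inside a transaction the fixture rolls back.
def test_create_user(db_session):
    db_session.add(User(email='a@example.com'))
    db_session.flush()  # visible inside this test's transaction only
    assert db_session.query(User).count() == 1

def test_table_starts_empty(db_session):
    # Passes in any order: the previous test's insert was rolled back
    assert db_session.query(User).count() == 0
```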
Network: Mock External Services
```python
import pytest

# Mock external APIs in tests (uses the pytest-mock plugin's `mocker` fixture)
@pytest.fixture
def mock_market_data(mocker):
    return mocker.patch(
        'services.alpaca.get_quote',
        return_value={'price': 150.00, 'volume': 1000000}
    )
```
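A test that consumes the fixture never touches the network, so a slow or down upstream can't fail it. A sketch; `calculate_position_value` and its signature are hypothetical:

```python
def test_position_value_is_deterministic(mock_market_data):
    # get_quote is patched, so the quote is always the canned one above
    value = calculate_position_value('AAPL', shares=10)
    assert value == 1500.00  # 10 shares * $150.00
    mock_market_data.assert_called_once()
```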
Randomness: Use Seeds
```python
import pytest
from faker import Faker

# Deterministic "random" data in tests
@pytest.fixture
def fake():
    f = Faker()
    f.seed_instance(12345)  # Same data every run
    return f
```
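The same rule applies to every other randomness source the test touches, including the standard library. A small sketch; the fixture name `seeded_random` is my own:

```python
import random

import pytest

@pytest.fixture
def seeded_random():
    random.seed(12345)  # stdlib random is now reproducible for this test
    yield
    random.seed()       # restore entropy-based seeding afterwards
```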
Step 4: Quarantine, Don't Delete
Don't delete flaky tests — quarantine them. They still catch real bugs sometimes:
```python
# Requires the pytest-rerunfailures plugin
@pytest.mark.flaky(reruns=3, reruns_delay=2)
def test_websocket_reconnection():
    # This test is flaky due to WebSocket timing.
    # It reruns up to 3 times with a 2-second delay between attempts.
    ...
```
Track quarantined tests separately. Fix them when you have time. But don't let them block deployments.
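One way to track them separately is a custom pytest marker; the `quarantine` name is a convention I'm assuming here, not a pytest built-in:

```python
# pytest.ini (registering the marker avoids "unknown marker" warnings):
# [pytest]
# markers =
#     quarantine: known-flaky test, tracked but non-blocking

import pytest

@pytest.mark.quarantine
def test_websocket_reconnection():
    ...
```

The deploy gate then runs `pytest -m "not quarantine"`, while a separate non-blocking job runs `pytest -m quarantine` and only reports results.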
Step 5: Prevent New Flakes
Add a CI check that detects new flaky tests:
```yaml
- name: Detect flaky tests
  run: |
    # Run the test suite 3 times
    for i in 1 2 3; do
      pytest tests/ --tb=line -q > results_$i.txt 2>&1 || true
    done
    # Compare results: any test that passed in one run
    # but failed in another is flaky
    python scripts/detect_flakes.py results_*.txt
```
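Here is one possible shape for `scripts/detect_flakes.py`, assuming the result files contain pytest's short-summary lines like `FAILED tests/test_x.py::test_y` (pass `-ra` to pytest if they don't appear):

```python
# scripts/detect_flakes.py -- a sketch, not the exact script referenced above
import sys

def failed_tests(path):
    # Collect node IDs from pytest's "FAILED <nodeid> - ..." summary lines
    with open(path) as f:
        return {line.split()[1] for line in f if line.startswith('FAILED')}

def main(paths):
    runs = [failed_tests(p) for p in paths]
    always_failed = set.intersection(*runs)   # consistent failures: real bugs
    flaky = set.union(*runs) - always_failed  # failed sometimes, passed sometimes
    for test in sorted(flaky):
        print(f'FLAKY: {test}')
    return 1 if flaky else 0

if __name__ == '__main__':
    sys.exit(main(sys.argv[1:]))
```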
Results
| Metric | Before | After |
|---|---|---|
| Flaky rate | 10% | 0.8% |
| CI pass rate | 72% | 97% |
| Developer trust | "CI is broken again" | "If CI fails, there's a real bug" |
| Time to fix a flake | 2-4 hours | 15 minutes (categorized approach) |
The biggest win isn't the number — it's developer trust. When engineers trust the test suite, they run it. When they run it, they catch bugs before production.