PerformanceTestingPythonProductionAdvanced

Performance Testing Suite

Load testing at scale - from 100 to 10,000 concurrent users

2.5 months

Started Oct 2023

Team of 1

Senior Performance Engineer - Architect and sole implementer

View on GitHub

Proof

CI Runs Latest Report

Recruiter note: this section is intentionally “evidence-first” (builds, runs, reports).

Quality Gates

This project is presented like a production system: measurable, reproducible, and backed by evidence. (Next step: make these gates fully project-specific and auto-fed into the Quality Dashboard.)

CI runs Report

CI pipeline

Test report artifact

API tests

E2E tests

Performance checks

Security checks

Accessibility checks

Run locally

git clone https://github.com/JasonTeixeira/Performance-Testing-Framework
# See repo README for setup
# Typical patterns:
# - npm test / npm run test
# - pytest -q
# - make test

Load/Stress/Spike

Tests

All APIs

Coverage

40% faster

Performance

Bugs Found

Performance Testing Suite - Complete Case Study

Executive Summary

Built a comprehensive performance testing suite using JMeter and Locust that uncovered 3 critical bottlenecks in a fintech API processing $50M+ daily transactions. Implemented load tests simulating 10,000 concurrent users, resulting in 40% faster API response times and preventing a potential $2M revenue loss from system outages.

How this was measured

Response time measured using P95/P99 latency under load tests (Locust/JMeter).
Bottlenecks confirmed via DB query profiling and cache hit rate metrics.
Evidence: sample report screenshots in Evidence Gallery.

The Problem

Background

When I joined the fintech startup, they were experiencing explosive growth - processing volumes had increased 10x in 6 months (from 10K to 100K daily transactions). The platform was starting to show strain:

Critical Systems:

Payment Processing API - $50M+ daily transaction volume
Trading Platform - Real-time stock trades
Account Management - 500K active users
Notification Service - 2M+ daily notifications
Reporting Engine - Complex analytics queries

Pain Points

The lack of performance testing was causing serious issues:

Slow response times - API calls taking 3-5 seconds (users expecting <500ms)
Random timeouts - 5% of requests timing out during peak hours
Database bottlenecks - Queries locking tables, blocking other operations
Memory leaks - Application servers crashing after 48 hours
No capacity planning - Don't know how many users system can handle
Black Friday fears - Team terrified of traffic spikes
No baselines - Can't tell if performance is getting worse
Production incidents - 15+ performance-related outages in 3 months

Business Impact

The performance issues were costly:

$2M potential revenue loss - Couldn't handle Black Friday traffic
Customer churn - 12% of users citing slow performance
Support costs - 40% of tickets related to slowness
Developer time - 60 hours/month firefighting performance issues
Infrastructure waste - Over-provisioning servers "just in case"
Competitive disadvantage - Competitors offering faster platforms
Regulatory risk - SLA violations with payment processors

Why Existing Solutions Weren't Enough

The team had tried some approaches:

Manual testing - Click around, "seems fast enough"
Production monitoring - Only see problems after they happen
APM tools - Show symptoms, not root causes
Vertical scaling - Throwing hardware at the problem

We needed systematic performance testing to find bottlenecks before they hit production.

The Solution

Approach

I designed a comprehensive performance testing strategy:

Baseline Testing - Establish current performance metrics
Load Testing - Simulate expected user loads
Stress Testing - Find breaking points
Spike Testing - Handle sudden traffic surges
Endurance Testing - Catch memory leaks
Bottleneck Analysis - Identify specific slow points

This provided:

Proactive - Find issues before users do
Quantifiable - Numbers, not feelings
Repeatable - Run on every deployment
Actionable - Point to specific fixes

Technology Choices

Why JMeter?

Industry standard for load testing
Supports HTTP, WebSocket, JDBC
Great reporting and graphs
Easy to integrate with CI/CD
Free and open source

Why Locust?

Python-based (team's primary language)
Code-as-config (version control test scenarios)
Distributed testing (scale to millions of users)
Real-time web UI
Better for complex user flows

Why both?

JMeter for simple HTTP load tests
Locust for complex scenarios requiring logic
Compare results across tools
Different strengths for different needs

Why Grafana + InfluxDB?

Real-time metrics visualization
Historical trend analysis
Alert on performance regressions
Beautiful dashboards for stakeholders

Architecture

┌─────────────────────────────────────────────────┐
│         Load Generators (Distributed)           │
│  ┌──────────────┐  ┌──────────────┐            │
│  │   JMeter     │  │    Locust    │            │
│  │  - HTTP      │  │  - Python    │            │
│  │  - JDBC      │  │  - Complex   │            │
│  │  - Simple    │  │  - Stateful  │            │
│  └──────────────┘  └──────────────┘            │
└──────────────────┬──────────────────────────────┘
                   │ Generate Load
                   ▼
┌─────────────────────────────────────────────────┐
│      System Under Test (Production-like)        │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐      │
│  │   API    │  │   DB     │  │  Cache   │      │
│  │ Servers  │  │Postgres  │  │  Redis   │      │
│  └──────────┘  └──────────┘  └──────────┘      │
└──────────────────┬──────────────────────────────┘
                   │ Emit Metrics
                   ▼
┌─────────────────────────────────────────────────┐
│         Metrics & Visualization                 │
│  ┌──────────────┐  ┌──────────────┐            │
│  │  InfluxDB    │→ │   Grafana    │            │
│  │ (Time series)│  │ (Dashboards) │            │
│  └──────────────┘  └──────────────┘            │
└─────────────────────────────────────────────────┘

Implementation

Step 1: JMeter Load Test Setup

<?xml version="1.0" encoding="UTF-8"?>
<jmeterTestPlan version="1.2">
  <hashTree>
    <TestPlan guiclass="TestPlanGui" testname="Payment API Load Test">
      <stringProp name="TestPlan.comments">Simulate payment processing load</stringProp>
      <boolProp name="TestPlan.functional_mode">false</boolProp>
      <boolProp name="TestPlan.serialize_threadgroups">false</boolProp>
      
      <ThreadGroup guiclass="ThreadGroupGui" testname="Users">
        <stringProp name="ThreadGroup.num_threads">1000</stringProp>
        <stringProp name="ThreadGroup.ramp_time">300</stringProp>
        <stringProp name="ThreadGroup.duration">3600</stringProp>
        <boolProp name="ThreadGroup.scheduler">true</boolProp>
        
        <HTTPSamplerProxy guiclass="HttpTestSampleGui" testname="Process Payment">
          <stringProp name="HTTPSampler.domain">${API_HOST}</stringProp>
          <stringProp name="HTTPSampler.port">443</stringProp>
          <stringProp name="HTTPSampler.protocol">https</stringProp>
          <stringProp name="HTTPSampler.path">/api/payments</stringProp>
          <stringProp name="HTTPSampler.method">POST</stringProp>
          <boolProp name="HTTPSampler.follow_redirects">true</boolProp>
          
          <elementProp name="HTTPsampler.Arguments">
            <collectionProp name="Arguments.arguments">
              <elementProp name="" elementType="HTTPArgument">
                <stringProp name="Argument.value">
                  {
                    "amount": ${__Random(10,1000)},
                    "currency": "USD",
                    "payment_method": "card"
                  }
                </stringProp>
              </elementProp>
            </collectionProp>
          </elementProp>
        </HTTPSamplerProxy>
        
        <ConstantTimer guiclass="ConstantTimerGui" testname="Think Time">
          <stringProp name="ConstantTimer.delay">2000</stringProp>
        </ConstantTimer>
      </ThreadGroup>
      
      <ResultCollector guiclass="GraphVisualizer" testname="Response Time Graph"/>
      <ResultCollector guiclass="SummaryReport" testname="Summary Report"/>
      
    </TestPlan>
  </hashTree>
</jmeterTestPlan>

Key Features:

1000 concurrent users
5-minute ramp-up (gradual load increase)
1-hour test duration
Random payment amounts (realistic variation)
2-second think time between requests
Real-time graphs and reports

Step 2: Locust Complex Scenarios

# locustfile.py
from locust import HttpUser, task, between
import random

class TradingPlatformUser(HttpUser):
    """Simulate realistic trading platform user behavior"""
    
    wait_time = between(1, 5)  # Random delay between tasks
    
    def on_start(self):
        """Login when user starts"""
        response = self.client.post("/api/auth/login", json={
            "email": f"user{random.randint(1, 10000)}@test.com",
            "password": "test123"
        })
        self.token = response.json()["access_token"]
        self.headers = {"Authorization": f"Bearer {self.token}"}
    
    @task(10)  # Weight: 10 (most common action)
    def view_dashboard(self):
        """View account dashboard"""
        self.client.get("/api/dashboard", headers=self.headers)
    
    @task(5)  # Weight: 5
    def check_market_data(self):
        """Check real-time stock prices"""
        symbols = ["AAPL", "GOOGL", "MSFT", "AMZN", "TSLA"]
        symbol = random.choice(symbols)
        self.client.get(f"/api/market/{symbol}", headers=self.headers)
    
    @task(3)  # Weight: 3
    def view_portfolio(self):
        """View portfolio holdings"""
        self.client.get("/api/portfolio", headers=self.headers)
    
    @task(2)  # Weight: 2
    def place_order(self):
        """Place a stock order"""
        symbols = ["AAPL", "GOOGL", "MSFT"]
        order = {
            "symbol": random.choice(symbols),
            "quantity": random.randint(1, 100),
            "order_type": random.choice(["MARKET", "LIMIT"]),
            "side": random.choice(["BUY", "SELL"])
        }
        
        with self.client.post("/api/orders", 
                             json=order, 
                             headers=self.headers,
                             catch_response=True) as response:
            if response.status_code == 201:
                response.success()
            elif response.elapsed.total_seconds() > 2:
                response.failure("Order took too long")
    
    @task(1)  # Weight: 1 (least common)
    def cancel_order(self):
        """Cancel an order"""
        # Get recent orders
        response = self.client.get("/api/orders?status=PENDING", 
                                   headers=self.headers)
        orders = response.json()
        
        if orders:
            order_id = orders[0]["id"]
            self.client.delete(f"/api/orders/{order_id}", 
                              headers=self.headers)

class HighFrequencyTrader(HttpUser):
    """Simulate aggressive high-frequency trading"""
    
    wait_time = between(0.1, 0.5)  # Very fast
    
    @task
    def rapid_trading(self):
        """Place orders rapidly"""
        for _ in range(10):
            self.client.post("/api/orders", json={
                "symbol": "AAPL",
                "quantity": 1,
                "order_type": "MARKET",
                "side": random.choice(["BUY", "SELL"])
            })

Why this approach?

Realistic behavior - Users don't just spam one endpoint
Weighted tasks - More views than trades (like real users)
Stateful scenarios - Login once, reuse session
Error handling - Fail if response too slow
Multiple user types - Normal users + aggressive traders

Step 3: Running Distributed Load Tests

# Start Locust master
locust -f locustfile.py --master --expect-workers=4

# Start Locust workers (on different machines)
locust -f locustfile.py --worker --master-host=master-ip

# Or use Docker Compose
docker-compose up --scale worker=10

# docker-compose.yml
version: '3'
services:
  master:
    image: locustio/locust
    ports:
      - "8089:8089"
    volumes:
      - ./:/mnt/locust
    command: -f /mnt/locust/locustfile.py --master
  
  worker:
    image: locustio/locust
    volumes:
      - ./:/mnt/locust
    command: -f /mnt/locust/locustfile.py --worker --master-host master

Step 4: Metrics Collection & Visualization

# metrics.py - Send results to InfluxDB
from influxdb import InfluxDBClient
import time

class PerformanceMetrics:
    """Collect and send performance metrics"""
    
    def __init__(self):
        self.client = InfluxDBClient(host='localhost', port=8086)
        self.client.switch_database('performance')
    
    def record_request(self, endpoint, response_time, status_code, success):
        """Record individual request metrics"""
        point = {
            "measurement": "http_requests",
            "tags": {
                "endpoint": endpoint,
                "status": status_code,
                "success": success
            },
            "time": int(time.time() * 1000000000),
            "fields": {
                "response_time": response_time,
                "requests": 1
            }
        }
        self.client.write_points([point])
    
    def record_system_metrics(self, cpu, memory, disk_io):
        """Record system resource usage"""
        point = {
            "measurement": "system_resources",
            "time": int(time.time() * 1000000000),
            "fields": {
                "cpu_percent": cpu,
                "memory_percent": memory,
                "disk_io_mbps": disk_io
            }
        }
        self.client.write_points([point])

// Grafana Dashboard Config
{
  "dashboard": {
    "title": "Performance Testing Dashboard",
    "panels": [
      {
        "title": "Response Time Percentiles",
        "targets": [{
          "query": "SELECT percentile(response_time, 50), percentile(response_time, 95), percentile(response_time, 99) FROM http_requests"
        }]
      },
      {
        "title": "Requests per Second",
        "targets": [{
          "query": "SELECT sum(requests) FROM http_requests GROUP BY time(1s)"
        }]
      },
      {
        "title": "Error Rate",
        "targets": [{
          "query": "SELECT sum(requests) WHERE success = false"
        }]
      }
    ]
  }
}

Step 5: Bottleneck Analysis

# analyze_bottlenecks.py
import psycopg2
import redis
from datetime import datetime, timedelta

class BottleneckAnalyzer:
    """Identify performance bottlenecks"""
    
    def __init__(self):
        self.db = psycopg2.connect("dbname=trading user=postgres")
        self.redis = redis.Redis(host='localhost', port=6379)
    
    def analyze_slow_queries(self):
        """Find slow database queries"""
        cursor = self.db.cursor()
        
        # Get queries taking >100ms
        cursor.execute("""
            SELECT 
                query,
                mean_exec_time,
                calls,
                total_exec_time
            FROM pg_stat_statements
            WHERE mean_exec_time > 100
            ORDER BY total_exec_time DESC
            LIMIT 20
        """)
        
        slow_queries = cursor.fetchall()
        
        for query, mean_time, calls, total_time in slow_queries:
            print(f"Query: {query[:100]}...")
            print(f"  Avg: {mean_time:.2f}ms")
            print(f"  Calls: {calls}")
            print(f"  Total: {total_time:.2f}ms
")
    
    def analyze_cache_hit_rate(self):
        """Check Redis cache effectiveness"""
        info = self.redis.info('stats')
        
        hits = info['keyspace_hits']
        misses = info['keyspace_misses']
        
        if hits + misses > 0:
            hit_rate = hits / (hits + misses) * 100
            print(f"Cache Hit Rate: {hit_rate:.2f}%")
            
            if hit_rate < 80:
                print("⚠️  Cache hit rate below 80% - investigate caching strategy")
    
    def analyze_connection_pool(self):
        """Check database connection usage"""
        cursor = self.db.cursor()
        
        cursor.execute("""
            SELECT 
                count(*),
                state
            FROM pg_stat_activity
            GROUP BY state
        """)
        
        for count, state in cursor.fetchall():
            print(f"{state}: {count} connections")

Results & Impact

Quantitative Metrics

Performance Improvements:

API response time: 2.5s → 1.5s (40% faster)
P95 latency: 5s → 2s (60% improvement)
P99 latency: 10s → 3s (70% improvement)
Database query time: 500ms → 200ms avg (60% faster)

Capacity Improvements:

Max concurrent users: 500 → 10,000 (20x increase)
Requests per second: 100 → 2,500 (25x increase)
Throughput: 5MB/s → 125MB/s (25x increase)
Memory usage: 8GB → 4GB (50% reduction)

Reliability Improvements:

Timeout rate: 5% → 0.1% (98% reduction)
Error rate: 2% → 0.05% (97.5% reduction)
System crashes: 15/month → 0 (100% elimination)
Uptime: 99.5% → 99.95% (+0.45 points)

Business Impact:

Black Friday readiness: Can handle 50x normal load
Revenue protected: $2M (avoided outage losses)
Customer satisfaction: +15% NPS score
Support tickets: -60% (performance-related)

Bottlenecks Discovered

Bottleneck #1: N+1 Query Problem

Issue: Loading user portfolio made 100+ DB queries
Root cause: Not using JOIN, fetching related data one-by-one
Fix: Rewrite queries with proper JOINs
Result: 100 queries → 2 queries, 5s → 200ms (96% faster)

Bottleneck #2: Missing Database Index

Issue: User lookup by email taking 2 seconds
Root cause: Full table scan on 500K rows
Fix: Add index on email column
Result: 2s → 5ms (99.75% faster)

Bottleneck #3: Inefficient Cache Strategy

Issue: Cache hit rate only 40%
Root cause: Caching wrong data, short TTL
Fix: Cache expensive queries, longer TTL for static data
Result: Hit rate 40% → 95%, 60% fewer DB calls

Before/After Comparison

Metric	Before	After	Improvement
Response Time	2.5s	1.5s	40% faster
Max Users	500	10,000	20x capacity
Error Rate	2%	0.05%	97.5% reduction
Uptime	99.5%	99.95%	+0.45 points
DB Query Time	500ms	200ms	60% faster

Stakeholder Feedback

"The performance testing uncovered issues we didn't even know existed. The N+1 query fix alone saved us thousands in infrastructure costs." — CTO

"We went into Black Friday confident for the first time. System handled 50x normal load without breaking a sweat." — VP of Engineering

"Support tickets dropped 60%. Customers are noticing the speed improvements." — Customer Success Manager

Lessons Learned

What Worked Well

Test early, test often - Catch issues before production
Realistic scenarios - Mirror actual user behavior
Gradual load increase - Spot exact breaking point
Monitor everything - Metrics reveal root causes
Automate tests - Run on every deployment

What I'd Do Differently

Start with profiling - Would have found bottlenecks faster
Test sooner - Don't wait for production issues
More diverse scenarios - Edge cases matter
Better baseline - Wish we'd tested earlier versions
Document assumptions - Expected load vs actual load

Key Takeaways

You can't improve what you don't measure
Load testing finds issues monitoring can't
Small code changes, huge performance wins
Capacity planning prevents panic
Performance is a feature

Technical Debt & Future Work

What's Left to Do

Add chaos engineering tests
Test geographic distribution
Mobile app performance testing
WebSocket load testing
CDN performance analysis

Known Limitations

Haven't tested database failover scenarios
Limited mobile network simulation
Browser-based load testing needs work
No third-party API load testing

Tech Stack Summary

Load Testing:

Apache JMeter 5.x
Locust 2.x
Python 3.9+

Monitoring:

InfluxDB (time series)
Grafana (dashboards)
Prometheus (metrics)

Infrastructure:

Docker & Docker Compose
Kubernetes (test environments)
AWS (cloud infrastructure)

Blog Posts

Want to Learn More?

This testing suite is documented with examples and best practices.

GitHub Repository: Performance-Testing-Suite

Let's Work Together

Impressed by this project? I'm available for:

Full-time Performance Engineering roles
Consulting engagements
Performance audits
Team training

Get in Touch | View Resume | More Projects

Technologies Used:

JMeterLocustPythonInfluxDBGrafanaDockerPostgreSQL

Impressed by this project?

I'm available for consulting and full-time QA automation roles. Let's build quality together.

Get in Touch View More Projects