How to Fail CI Builds on Load Test Performance Regressions

Introduction

Modern teams rely on CI/CD pipelines to catch problems before they reach production. Unit tests, integration tests, linting, and security scans are now standard—but performance regressions often still slip through because load testing happens manually or too late in the release process.

That’s where automated load testing in CI/CD becomes powerful. If a pull request makes an API 40% slower, increases error rates, or causes timeouts under concurrency, your pipeline should be able to detect it and fail the build automatically. This guide shows you how to use LoadForge to set pass/fail thresholds and stop deployments when performance degrades.

In this guide, you’ll learn how to build realistic Locust-based load tests for CI/CD workflows, define performance expectations, and use LoadForge as a gate in your delivery pipeline. We’ll cover practical examples including authenticated API testing, regression checks for critical endpoints, and environment-specific performance validation.

If you want to implement performance testing, stress testing, and load testing as part of your DevOps workflow, this is the pattern to follow.

Prerequisites

Before you start, make sure you have:

  • A LoadForge account
  • A web application or API deployed in a test, staging, or ephemeral CI environment
  • Access to the application’s authentication method, such as:
    • Bearer tokens
    • OAuth client credentials
    • Session login endpoints
  • A CI/CD platform such as:
    • GitHub Actions
    • GitLab CI
    • Jenkins
    • CircleCI
  • A set of performance expectations for your application, such as:
    • 95th percentile response time under 500ms
    • Error rate below 1%
    • Specific endpoint latency thresholds
  • Basic familiarity with Python and Locust

You’ll also want to identify your most important user journeys. For CI/CD load testing, focus on a small number of critical flows that are likely to regress:

  • User login
  • Product search
  • Checkout or order creation
  • Dashboard or reporting APIs
  • File upload or export endpoints
  • Internal service-to-service APIs

LoadForge is especially useful here because it gives you cloud-based infrastructure, distributed testing, global test locations, real-time reporting, and CI/CD integration, making it practical to run automated performance checks on every release candidate.

Understanding CI/CD & DevOps Under Load

When teams talk about load testing in DevOps, they usually mean one of two things:

  1. Running performance tests as part of the pipeline
  2. Using performance results to make deployment decisions

The second part is what turns load testing into a real quality gate.

Why performance regressions happen in CI/CD

Even small changes can introduce measurable slowdowns:

  • A new database join adds 100ms to every request
  • An external API call is now made synchronously
  • Caching headers are removed accidentally
  • A search query becomes unindexed
  • Authentication middleware performs extra lookups
  • Serialization logic increases CPU usage

These changes may not break functionality, so traditional tests pass. But under load, they can create queueing, timeouts, and cascading failures.

Common bottlenecks you’ll catch with automated load testing

In CI/CD and DevOps workflows, performance testing often reveals:

  • Slow application startup after deployment
  • Database contention in newly added endpoints
  • CPU spikes from expensive business logic
  • Memory pressure causing increased response times
  • Authentication bottlenecks under concurrent login bursts
  • Rate limiting or upstream dependency saturation
  • Misconfigured autoscaling thresholds

What makes CI-based load testing different?

A pipeline load test is usually:

  • Shorter than a full-scale performance test
  • Focused on critical endpoints
  • Threshold-driven
  • Repeatable across commits
  • Designed to detect regressions rather than find absolute max capacity

For example, your nightly stress testing may run for 30 minutes across multiple regions, while your CI load testing job may run for 3–5 minutes with a smaller user count and strict pass/fail criteria.

Writing Your First Load Test

Let’s start with a realistic regression test for a common SaaS API. Imagine your CI pipeline deploys a staging build of an application with these endpoints:

  • POST /api/v1/auth/login
  • GET /api/v1/projects
  • GET /api/v1/projects/{id}/builds
  • POST /api/v1/builds/{id}/retry

This first Locust script logs in, fetches projects, inspects recent builds, and retries a build occasionally. These are realistic actions for a DevOps platform or internal CI dashboard.

```python
from locust import HttpUser, task, between
import random

class DevOpsApiUser(HttpUser):
    wait_time = between(1, 3)

    def on_start(self):
        response = self.client.post(
            "/api/v1/auth/login",
            json={
                "email": "ci-bot@example.com",
                "password": "SuperSecurePassword123!"
            },
            headers={"Content-Type": "application/json"},
            name="POST /api/v1/auth/login"
        )

        if response.status_code == 200:
            data = response.json()
            self.token = data["access_token"]
            self.headers = {
                "Authorization": f"Bearer {self.token}",
                "Accept": "application/json"
            }
        else:
            self.token = None
            self.headers = {}

    @task(3)
    def list_projects(self):
        self.client.get(
            "/api/v1/projects?limit=20&sort=updated_at:desc",
            headers=self.headers,
            name="GET /api/v1/projects"
        )

    @task(2)
    def list_project_builds(self):
        project_id = random.choice([101, 102, 103, 104])
        self.client.get(
            f"/api/v1/projects/{project_id}/builds?status=failed&limit=10",
            headers=self.headers,
            name="GET /api/v1/projects/:id/builds"
        )

    @task(1)
    def retry_build(self):
        build_id = random.choice([9001, 9002, 9003])
        self.client.post(
            f"/api/v1/builds/{build_id}/retry",
            json={
                "reason": "Regression validation from CI pipeline",
                "triggered_by": "loadforge-ci-check"
            },
            headers={**self.headers, "Content-Type": "application/json"},
            name="POST /api/v1/builds/:id/retry"
        )
```

Why this script works well in CI/CD

This test is a good first step because it:

  • Authenticates the same way a real client would
  • Exercises business-critical API paths
  • Uses realistic payloads
  • Mixes read-heavy and write-heavy operations
  • Produces endpoint-level metrics you can threshold in LoadForge

How to use this in a CI pipeline

In LoadForge, you would configure this test to run against your staging or preview environment after deployment. Then set pass/fail conditions such as:

  • Overall error rate less than 1%
  • POST /api/v1/auth/login p95 less than 800ms
  • GET /api/v1/projects p95 less than 500ms
  • GET /api/v1/projects/:id/builds p95 less than 700ms

If any threshold is exceeded, the test fails and your CI build can be marked as failed.

Advanced Load Testing Scenarios

Once you have a basic smoke-level performance gate, you can add more realistic regression scenarios.

Scenario 1: Authenticated pipeline health checks with token refresh

Many CI/CD systems use short-lived access tokens. If your application refreshes tokens during active sessions, you should test that flow too.

This example simulates a user session that logs in, refreshes the token, and accesses deployment endpoints.

```python
from locust import HttpUser, task, between

class DeploymentApiUser(HttpUser):
    wait_time = between(1, 2)

    def on_start(self):
        self.login()

    def login(self):
        response = self.client.post(
            "/api/v1/auth/login",
            json={
                "email": "release-manager@example.com",
                "password": "ReleasePassword456!"
            },
            headers={"Content-Type": "application/json"},
            name="POST /api/v1/auth/login"
        )
        response.raise_for_status()

        data = response.json()
        self.access_token = data["access_token"]
        self.refresh_token = data["refresh_token"]
        self.headers = {
            "Authorization": f"Bearer {self.access_token}",
            "Accept": "application/json"
        }

    @task(4)
    def list_deployments(self):
        self.client.get(
            "/api/v1/deployments?environment=staging&status=in_progress",
            headers=self.headers,
            name="GET /api/v1/deployments"
        )

    @task(2)
    def get_deployment_details(self):
        deployment_id = 55021
        self.client.get(
            f"/api/v1/deployments/{deployment_id}",
            headers=self.headers,
            name="GET /api/v1/deployments/:id"
        )

    @task(1)
    def refresh_session(self):
        response = self.client.post(
            "/api/v1/auth/refresh",
            json={"refresh_token": self.refresh_token},
            headers={"Content-Type": "application/json"},
            name="POST /api/v1/auth/refresh"
        )

        if response.status_code == 200:
            data = response.json()
            self.access_token = data["access_token"]
            self.headers["Authorization"] = f"Bearer {self.access_token}"
```

This is useful when performance regressions affect session handling, auth middleware, or token storage systems like Redis or database-backed session tables.

Scenario 2: Regression testing a build trigger and artifact workflow

A common DevOps use case is validating build and artifact endpoints. These APIs often become slower over time because they touch databases, queues, object storage, and audit logging systems.

```python
from locust import HttpUser, task, between
import random
import uuid

class BuildPipelineUser(HttpUser):
    wait_time = between(2, 4)

    def on_start(self):
        response = self.client.post(
            "/api/v1/auth/token",
            json={
                "client_id": "ci-runner",
                "client_secret": "ci-runner-secret",
                "grant_type": "client_credentials",
                "scope": "builds:read builds:write artifacts:read"
            },
            headers={"Content-Type": "application/json"},
            name="POST /api/v1/auth/token"
        )
        response.raise_for_status()
        token = response.json()["access_token"]
        self.headers = {
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
            "Accept": "application/json"
        }

    @task(3)
    def trigger_build(self):
        branch = random.choice(["main", "develop", "release/2026.04"])
        payload = {
            "project_id": 2001,
            "branch": branch,
            "commit_sha": str(uuid.uuid4()).replace("-", "")[:12],
            "pipeline_source": "ci_regression_test",
            "variables": {
                "RUN_E2E": "false",
                "RUN_PERF_SMOKE": "true"
            }
        }

        self.client.post(
            "/api/v1/builds",
            json=payload,
            headers=self.headers,
            name="POST /api/v1/builds"
        )

    @task(2)
    def get_build_status(self):
        build_id = random.choice([78110, 78111, 78112, 78113])
        self.client.get(
            f"/api/v1/builds/{build_id}/status",
            headers=self.headers,
            name="GET /api/v1/builds/:id/status"
        )

    @task(1)
    def list_artifacts(self):
        build_id = random.choice([78110, 78111, 78112])
        self.client.get(
            f"/api/v1/builds/{build_id}/artifacts?type=report",
            headers=self.headers,
            name="GET /api/v1/builds/:id/artifacts"
        )
```

This script is ideal for catching regressions in:

  • Build creation latency
  • Queue-backed status lookups
  • Artifact metadata retrieval
  • Auth and permission checks
  • Audit/event publishing overhead

Scenario 3: Testing a database-heavy reporting endpoint in CI

Reporting endpoints are frequent regression hotspots because they aggregate data across builds, deployments, and environments. These APIs may still return 200 responses while becoming unacceptably slow.

```python
from locust import HttpUser, task, between

class ReportingUser(HttpUser):
    wait_time = between(1, 3)

    def on_start(self):
        response = self.client.post(
            "/api/v1/session/login",
            json={
                "username": "analytics-bot",
                "password": "AnalyticsPassword789!"
            },
            headers={"Content-Type": "application/json"},
            name="POST /api/v1/session/login"
        )
        response.raise_for_status()
        session_cookie = response.cookies.get("session_id")
        self.headers = {
            "Accept": "application/json",
            "X-Requested-With": "XMLHttpRequest"
        }
        self.cookies = {"session_id": session_cookie}

    @task(3)
    def deployment_frequency_report(self):
        self.client.get(
            "/api/v1/reports/deployment-frequency?team=platform&window=30d",
            headers=self.headers,
            cookies=self.cookies,
            name="GET /api/v1/reports/deployment-frequency"
        )

    @task(2)
    def change_failure_rate_report(self):
        self.client.get(
            "/api/v1/reports/change-failure-rate?service=payments&window=90d",
            headers=self.headers,
            cookies=self.cookies,
            name="GET /api/v1/reports/change-failure-rate"
        )

    @task(1)
    def lead_time_report(self):
        self.client.get(
            "/api/v1/reports/lead-time?repository=checkout-service&branch=main",
            headers=self.headers,
            cookies=self.cookies,
            name="GET /api/v1/reports/lead-time"
        )
```

This scenario helps identify:

  • Slow SQL queries
  • Missing indexes
  • Expensive report generation logic
  • Cache misses
  • N+1 query issues
  • Data warehouse or analytics backend latency

These database-heavy endpoints are perfect candidates for threshold-based build failure because they often degrade gradually over time.

Analyzing Your Results

After running your load test in LoadForge, the next step is deciding whether the build should pass or fail.

Key metrics to watch

For CI/CD performance testing, focus on these metrics:

  • Response time percentiles:
    • Median can hide problems
    • p95 and p99 are better indicators of user experience under load
  • Error rate:
    • Watch both HTTP failures and application-level failures
  • Requests per second:
    • Useful for checking throughput consistency
  • Endpoint-specific latency:
    • Critical for identifying which route regressed
  • Response distribution over time:
    • Reveals warm-up issues, memory leaks, or resource exhaustion

Good threshold examples

A practical set of pass/fail conditions might look like:

  • Total error rate < 1%
  • POST /api/v1/auth/login p95 < 800ms
  • GET /api/v1/projects p95 < 500ms
  • POST /api/v1/builds p95 < 1200ms
  • GET /api/v1/reports/deployment-frequency p95 < 1500ms

You can also set stricter rules for core business flows and looser ones for secondary endpoints.
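If you ever need to enforce rules like these yourself, for example against exported results rather than LoadForge's built-in pass/fail conditions, the gate logic boils down to a small comparison. This is a hedged sketch: the result structure, metric names, and numbers are illustrative, not a LoadForge API.

```python
# Hypothetical threshold gate: compare endpoint-level results to limits.
# Structure and numbers are illustrative only, not a LoadForge API.
THRESHOLDS = {
    "POST /api/v1/auth/login": {"p95_ms": 800},
    "GET /api/v1/projects": {"p95_ms": 500},
    "POST /api/v1/builds": {"p95_ms": 1200},
}
MAX_ERROR_RATE = 0.01  # 1% overall

def evaluate(results, total_error_rate):
    """Return a list of human-readable threshold violations (empty = pass)."""
    violations = []
    if total_error_rate > MAX_ERROR_RATE:
        violations.append(
            f"error rate {total_error_rate:.2%} > {MAX_ERROR_RATE:.0%}"
        )
    for endpoint, limits in THRESHOLDS.items():
        p95 = results.get(endpoint, {}).get("p95_ms")
        if p95 is not None and p95 > limits["p95_ms"]:
            violations.append(f"{endpoint} p95 {p95}ms > {limits['p95_ms']}ms")
    return violations
```

An empty list means the gate passes; anything else gives you a concrete message to surface in the CI log before failing the build.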

Comparing results against previous runs

The real value of CI/CD load testing is regression detection. A single test result matters less than the trend.

For example:

  • Last week: GET /api/v1/builds/:id/status p95 = 220ms
  • Current branch: p95 = 680ms

Even if 680ms is technically “acceptable,” that jump may indicate a serious regression. LoadForge’s real-time reporting and historical test visibility make these changes easier to spot before they hit production.
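Ratio-based checks catch exactly this kind of jump even when absolute thresholds still pass. A minimal sketch, using the endpoint and numbers from the example above (this is plain comparison logic, not a LoadForge API):

```python
# Flag endpoints whose p95 grew more than an allowed ratio vs. a baseline run.
REGRESSION_RATIO = 1.5  # fail if p95 is more than 50% worse than baseline

def regressions(baseline_p95, current_p95, ratio=REGRESSION_RATIO):
    """Return {endpoint: (baseline_ms, current_ms)} for regressed endpoints."""
    flagged = {}
    for endpoint, base in baseline_p95.items():
        cur = current_p95.get(endpoint)
        if cur is not None and cur > base * ratio:
            flagged[endpoint] = (base, cur)
    return flagged

baseline = {"GET /api/v1/builds/:id/status": 220}  # last week's p95 (ms)
current = {"GET /api/v1/builds/:id/status": 680}   # current branch's p95 (ms)
# regressions(baseline, current) flags this endpoint: 680 > 220 * 1.5
```

The 1.5x ratio is a starting point; tighten it once your baselines are stable run-to-run.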

Using LoadForge in CI/CD gates

A typical workflow looks like this:

  1. Deploy application to staging or preview environment
  2. Trigger LoadForge test via CI job
  3. Wait for test completion
  4. Check pass/fail status from LoadForge thresholds
  5. Fail the pipeline if thresholds are breached

Here is an example GitHub Actions step pattern for a CI gate:

```yaml
name: performance-regression-check

on:
  pull_request:
    branches: [main]

jobs:
  load-test:
    runs-on: ubuntu-latest
    steps:
      - name: Deploy preview environment
        run: ./scripts/deploy-preview.sh

      - name: Trigger LoadForge test
        id: trigger
        run: |
          RESPONSE=$(curl -s -X POST "https://api.loadforge.com/v1/tests/123456/run" \
            -H "Authorization: Bearer ${{ secrets.LOADFORGE_API_TOKEN }}" \
            -H "Content-Type: application/json" \
            -d '{
              "environment_url": "https://pr-482.staging.example.com",
              "notes": "PR #482 performance regression check"
            }')
          # capture the run id so the polling step can look up this run
          # (the exact field name may differ in your LoadForge setup)
          echo "run_id=$(echo "$RESPONSE" | jq -r '.id')" >> "$GITHUB_OUTPUT"

      - name: Poll for result
        run: ./scripts/check-loadforge-result.sh "${{ steps.trigger.outputs.run_id }}"
```

And an example shell script that exits non-zero if the test failed:

```bash
#!/usr/bin/env bash
set -euo pipefail

TEST_RUN_ID="$1"
API_TOKEN="$LOADFORGE_API_TOKEN"
MAX_ATTEMPTS=60  # give up after ~15 minutes instead of polling forever

for ((attempt = 1; attempt <= MAX_ATTEMPTS; attempt++)); do
  RESPONSE=$(curl -s "https://api.loadforge.com/v1/test-runs/${TEST_RUN_ID}" \
    -H "Authorization: Bearer ${API_TOKEN}")

  STATUS=$(echo "$RESPONSE" | jq -r '.status')
  RESULT=$(echo "$RESPONSE" | jq -r '.result')

  if [[ "$STATUS" == "completed" ]]; then
    if [[ "$RESULT" == "passed" ]]; then
      echo "Load test passed"
      exit 0
    else
      echo "Load test failed"
      exit 1
    fi
  fi

  echo "Waiting for test completion..."
  sleep 15
done

echo "Timed out waiting for test result"
exit 1
```

The exact API details may vary based on your LoadForge setup, but the pattern remains the same: trigger, poll, evaluate, fail the build if performance regressed.

Performance Optimization Tips

When your CI build fails due to load test thresholds, use that as a signal to investigate systematically.

Optimize the slowest endpoints first

Look at the endpoint-level breakdown in LoadForge and prioritize:

  • Highest p95 latency
  • Largest regression from baseline
  • Highest error-producing routes

Review database access patterns

For database-heavy APIs:

  • Add indexes for new filters or joins
  • Reduce query count per request
  • Avoid N+1 ORM behavior
  • Cache frequently requested aggregates
  • Paginate large result sets
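The N+1 pattern in particular is easiest to see as query counts. This toy sketch (no real ORM or database, just a query counter) shows why fetching builds one project at a time multiplies round trips, while a batched query keeps the count flat:

```python
# Toy illustration of N+1 query behavior; CountingDb stands in for a real DB.
class CountingDb:
    def __init__(self):
        self.query_count = 0

    def query(self, sql):
        self.query_count += 1  # each call models one database round trip
        return []

def list_builds_n_plus_1(db, project_ids):
    db.query("SELECT * FROM projects")  # 1 query
    for pid in project_ids:             # then N more, one per project
        db.query(f"SELECT * FROM builds WHERE project_id = {pid}")

def list_builds_batched(db, project_ids):
    db.query("SELECT * FROM projects")  # 1 query
    ids = ",".join(map(str, project_ids))
    db.query(f"SELECT * FROM builds WHERE project_id IN ({ids})")  # 1 query
```

With 20 projects, the first version issues 21 queries and the second issues 2; real ORMs fix this with eager loading (for example, join- or select-related features).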

Improve authentication performance

If login or token refresh is slow:

  • Cache user/session lookups
  • Reduce repeated permission checks
  • Optimize JWT validation or key fetching
  • Move expensive auth hooks off the request path where possible
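As one concrete example of caching session lookups, a small TTL cache in front of the session store turns a per-request backend hit into one hit per TTL window. A generic sketch with illustrative names and TTL:

```python
# Minimal TTL cache sketch for session/user lookups (names are illustrative).
import time

class TtlCache:
    def __init__(self, ttl_seconds=60):
        self.ttl = ttl_seconds
        self._store = {}

    def get(self, key, loader):
        entry = self._store.get(key)
        now = time.monotonic()
        if entry and now - entry[1] < self.ttl:
            return entry[0]             # cache hit: no backend lookup
        value = loader(key)             # cache miss: hit the session store once
        self._store[key] = (value, now)
        return value
```

In production you would more likely use Redis or your framework's cache layer, and you must invalidate on logout or permission changes; the point is simply that repeated auth lookups are a cheap win to eliminate.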

Reduce payload and serialization overhead

Large JSON responses can become a hidden bottleneck:

  • Return only fields needed by the client
  • Compress responses
  • Avoid deeply nested objects
  • Stream large exports instead of generating them synchronously
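Streaming is the usual fix for the last point: yield the export in chunks instead of building the whole payload in memory. A framework-agnostic sketch with fabricated rows:

```python
# Stream a CSV export chunk by chunk instead of materializing it in memory.
import csv
import io

def export_rows():
    # stand-in for a database cursor yielding rows one at a time
    for i in range(3):
        yield {"build_id": i, "status": "passed"}

def stream_csv(rows):
    """Yield CSV text one line at a time; memory use stays constant."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["build_id", "status"])
    writer.writeheader()
    yield buf.getvalue()
    for row in rows:
        buf.seek(0)
        buf.truncate(0)
        writer.writerow(row)
        yield buf.getvalue()
```

Most web frameworks accept a generator like this as a response body, so a multi-gigabyte export never has to exist as one string.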

Use environment-appropriate test sizes

Your CI performance test should be small enough to run quickly, but large enough to expose regressions. Then use broader LoadForge distributed testing in nightly or pre-release stages for deeper stress testing across cloud-based infrastructure and global test locations.

Common Pitfalls to Avoid

Testing too many endpoints in CI

Your pipeline should not run a full-scale performance testing suite on every commit. Keep CI tests focused on critical paths and known regression hotspots.

Using unrealistic user behavior

If your load test only hammers one endpoint with no auth, no session handling, and no realistic pacing, the results may not reflect real application behavior. Use actual login flows, realistic headers, and production-like payloads.

Ignoring percentiles

Average response time is not enough. A build can have a good average and still produce terrible tail latency. Always track p95 and p99.
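A tiny arithmetic example makes the point. Here 5% of requests take 3 seconds, yet the mean barely moves (the sample numbers are invented):

```python
# Why averages hide tail latency: 5 slow requests out of 100 barely move the mean.
import statistics

latencies_ms = [120] * 95 + [3000] * 5  # 95 fast requests, 5 very slow ones

mean_ms = statistics.mean(latencies_ms)                       # 264 ms: looks fine
p95_ms = sorted(latencies_ms)[int(0.95 * len(latencies_ms))]  # 3000 ms: the real story
```

A 264ms average would sail past most thresholds while one in twenty users waits 3 seconds, which is exactly why the thresholds earlier in this guide are written against p95.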

Running against unstable environments

If your staging environment is noisy, underpowered, or shared with unrelated work, your CI load testing results may be inconsistent. Try to test against predictable environments or ephemeral deployments where possible.

Failing builds on overly aggressive thresholds

If your thresholds are too strict, teams will start ignoring failures. Start with realistic baselines and tighten over time.

Not separating smoke, load, and stress testing

These are related but different:

  • Smoke performance test: quick regression check in CI
  • Load testing: validate expected concurrency
  • Stress testing: push beyond expected capacity

Use CI for regression-oriented checks and reserve larger tests for scheduled or pre-release pipelines.

Forgetting warm-up effects

Some applications perform poorly just after deployment due to cold caches, JIT compilation, or lazy initialization. Decide whether your threshold should include or exclude warm-up behavior based on your production expectations.
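If you decide to exclude warm-up, one simple option is to discard the warm-up window before computing percentiles. This pure-Python sketch (no Locust or LoadForge API involved; the sample data is fabricated) drops the first 30 seconds of samples:

```python
# Exclude a warm-up window from p95 calculation; sample data is fabricated.
WARMUP_SECONDS = 30

# (seconds since test start, latency in ms) pairs, as you might export from
# a run; the first 30s are slow because caches are cold
samples = [(t, 900 if t < WARMUP_SECONDS else 180) for t in range(0, 240, 5)]

def p95_excluding_warmup(samples, warmup=WARMUP_SECONDS):
    latencies = sorted(ms for t, ms in samples if t >= warmup)
    if not latencies:
        return None
    return latencies[int(0.95 * len(latencies)) - 1]
```

In this fabricated data the full-run p95 is 900ms while the post-warm-up p95 is 180ms. Whether you include the window should match production reality: if real users hit cold instances right after deployment, the warm-up numbers count.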

Conclusion

Automating load testing in CI/CD is one of the most effective ways to prevent performance regressions from reaching production. By defining pass/fail thresholds in LoadForge, you can turn performance testing into a real deployment gate instead of a manual afterthought.

Start with a focused Locust script that covers your most important endpoints, add realistic authentication and payloads, and define thresholds around p95 latency and error rate. As your process matures, expand into more advanced scenarios like token refresh, build orchestration, and reporting APIs. With LoadForge, you can run these tests on cloud-based infrastructure, scale them with distributed testing, view real-time reporting, and integrate them directly into your CI/CD pipeline.

If you’re ready to make load testing a standard part of your DevOps workflow, try LoadForge and start failing builds before performance regressions reach your users.

Try LoadForge free for 7 days

Set up your first load test in under 2 minutes. No commitment.