
How to Load Test an AI Gateway


Introduction

AI gateways sit between your applications and one or more large language model providers, handling concerns like authentication, request routing, caching, retries, observability, rate limiting, and failover. If your product depends on an AI gateway, load testing is not optional. A gateway that works perfectly for a handful of requests can become a bottleneck under real production traffic, especially when prompt sizes grow, streaming is enabled, or multiple upstream models are involved.

Learning how to load test an AI gateway helps you validate far more than raw throughput. You can confirm that routing rules behave correctly under concurrency, cache hits actually reduce latency, rate limiting protects upstream providers without breaking user experience, and fallback logic works when one model becomes slow or unavailable. This is especially important in AI and LLM systems, where response times are often variable and provider-side limits can change without warning.

In this guide, you’ll learn how to perform realistic load testing, performance testing, and stress testing for an AI gateway using Locust on LoadForge. We’ll cover basic chat completion traffic, authenticated requests, streaming and multi-model routing, cache validation, and rate-limit behavior. Along the way, you’ll see practical Locust scripts with realistic AI gateway endpoints, headers, and payloads that mirror production usage.

LoadForge is especially useful here because AI gateway testing often requires distributed testing from multiple regions, real-time reporting to spot latency spikes, and cloud-based infrastructure to generate enough concurrent traffic to simulate actual user demand.

Prerequisites

Before you begin load testing your AI gateway, make sure you have the following:

  • A working AI gateway deployment or staging environment
  • The base URL for your gateway, such as:
    • https://gateway.example.com
    • https://ai-gateway.staging.example.net
  • One or more valid API tokens or service account credentials
  • Knowledge of the gateway’s supported endpoints, for example:
    • POST /v1/chat/completions
    • POST /v1/embeddings
    • GET /v1/models
    • POST /v1/responses
  • A clear understanding of gateway features you want to validate:
    • model routing
    • prompt/response caching
    • tenant isolation
    • rate limiting
    • failover across providers
    • streaming responses
  • Test data such as:
    • realistic prompts
    • conversation history
    • tenant IDs
    • model names
  • A LoadForge account if you want to run the tests at scale using distributed cloud load generators

You should also define success criteria before starting. For example:

  • p95 latency under 2.5 seconds for cached prompts
  • less than 1% error rate under 500 concurrent users
  • proper 429 responses during rate limiting
  • successful failover from one provider-backed model route to another
  • no increase in authentication failures under load
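
Criteria like these are easiest to enforce when they are encoded as an automated post-run check rather than eyeballed on a dashboard. The sketch below is a plain-Python helper with illustrative threshold values matching the examples above; in a Locust script you could call it from an `events.test_stop` listener using aggregate stats such as `environment.stats.total`.

```python
# Sketch: evaluate aggregate load test results against the success criteria
# above. Threshold values are the illustrative examples from this guide;
# replace them with your own targets.

def check_success_criteria(p95_ms, error_rate, rate_limited_ok, auth_failures):
    """Return a list of human-readable violations (empty list means pass)."""
    violations = []
    if p95_ms > 2500:                      # p95 under 2.5s for cached prompts
        violations.append(f"p95 latency {p95_ms}ms exceeds 2500ms")
    if error_rate > 0.01:                  # less than 1% error rate
        violations.append(f"error rate {error_rate:.2%} exceeds 1%")
    if not rate_limited_ok:                # gateway must emit proper 429s
        violations.append("rate limiting did not return proper 429 responses")
    if auth_failures > 0:                  # no new auth failures under load
        violations.append(f"{auth_failures} authentication failures under load")
    return violations
```

Failing a CI job when this returns a non-empty list turns your success criteria into a regression gate rather than a one-off observation.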

Understanding AI Gateway Behavior Under Load

An AI gateway behaves differently from a typical REST API because it often performs work beyond simple request forwarding. Under load, several layers can become bottlenecks.

Request Routing

Many AI gateways route requests based on model name, tenant, cost policy, region, or prompt characteristics. Under concurrent traffic, routing logic can add latency if rules are complex or require external lookups. If routing metadata is stored in Redis or a database, those dependencies can become hot spots.

Authentication and Authorization

AI gateways commonly validate bearer tokens, API keys, workspace IDs, and tenant quotas. If every request triggers expensive auth checks or policy evaluation, throughput can drop quickly. This is especially noticeable in multi-tenant AI platforms.

Caching

Prompt caching is one of the most valuable gateway features for AI workloads. But caching needs to be tested carefully. Under load, you want to verify:

  • cache hit rates improve latency
  • cache keys are constructed correctly
  • duplicate requests do not overwhelm upstream providers before cache entries are written
  • cache storage remains stable under high concurrency

Rate Limiting and Quotas

Gateways often enforce per-user, per-tenant, or per-model request limits. During stress testing, you need to confirm that rate limiting is accurate and predictable. A good AI gateway should reject excess traffic gracefully without causing cascading failures.

Streaming Responses

Streaming token output can keep connections open much longer than standard JSON APIs. This changes concurrency behavior dramatically. A gateway may handle 1,000 short requests well but struggle with 300 long-lived streaming connections.
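
For streaming workloads, measure time-to-first-token (TTFT) separately from total response time: a gateway can have acceptable totals while still delaying the first token badly under load. The helper below is a minimal sketch that assumes OpenAI-style SSE framing (`data: {...}` lines terminated by `data: [DONE]`), which may differ for your gateway. In a Locust task, you could feed it `response.iter_lines(decode_unicode=True)` from a request sent with `"stream": True` in the payload and `stream=True` on the HTTP call.

```python
import time

def consume_token_stream(lines, clock=time.monotonic):
    """Consume SSE-style streaming lines and measure latency characteristics.

    Returns (ttft_seconds, total_seconds, chunk_count). Assumes OpenAI-style
    framing: each token chunk arrives as a 'data: {...}' line and the stream
    ends with 'data: [DONE]'. Adjust the framing checks for your gateway.
    """
    start = clock()
    ttft = None
    chunks = 0
    for line in lines:
        if not line or not line.startswith("data: "):
            continue              # skip blank SSE separators and keep-alives
        if line.strip() == "data: [DONE]":
            break                 # end-of-stream sentinel
        if ttft is None:
            ttft = clock() - start    # time to first token
        chunks += 1
    return ttft, clock() - start, chunks
```

Reporting TTFT as its own metric (for example via a custom Locust request name) lets you see streaming degradation that aggregate latency hides.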

Upstream Provider Variability

Even if your gateway is efficient, upstream LLM providers may introduce latency, throttling, or intermittent failures. Your performance testing should distinguish between gateway bottlenecks and provider-side instability. This is where LoadForge’s real-time reporting is helpful, because you can correlate response times, error codes, and throughput during a test.

Writing Your First Load Test

Let’s start with a basic AI gateway load test that exercises a chat completions endpoint using realistic authentication and payloads.

This example simulates users sending short prompts to a gateway route that proxies requests to a fast general-purpose model.

```python
from locust import HttpUser, task, between
import random
import uuid

PROMPTS = [
    "Summarize the benefits of using an API gateway for AI applications.",
    "Write a short onboarding email for a new SaaS customer.",
    "Explain rate limiting in simple terms for a product manager.",
    "Generate three FAQ entries for an AI chatbot pricing page."
]

class AIGatewayUser(HttpUser):
    wait_time = between(1, 3)

    def on_start(self):
        self.api_key = "lgw_test_sk_prodlike_123456"
        self.tenant_id = random.choice(["team-alpha", "team-beta", "team-gamma"])

    @task
    def chat_completion(self):
        prompt = random.choice(PROMPTS)

        headers = {
            "Authorization": f"Bearer {self.api_key}",
            "Content-Type": "application/json",
            "X-Tenant-ID": self.tenant_id,
            "X-Request-ID": str(uuid.uuid4())
        }

        payload = {
            "model": "gpt-4o-mini",
            "messages": [
                {"role": "system", "content": "You are a concise enterprise AI assistant."},
                {"role": "user", "content": prompt}
            ],
            "temperature": 0.3,
            "max_tokens": 180,
            "stream": False,
            "metadata": {
                "app": "customer-support-portal",
                "environment": "staging"
            }
        }

        with self.client.post(
            "/v1/chat/completions",
            json=payload,
            headers=headers,
            name="/v1/chat/completions [basic]",
            catch_response=True
        ) as response:
            if response.status_code != 200:
                response.failure(f"Unexpected status code: {response.status_code}")
                return

            try:
                data = response.json()
                if "choices" not in data or not data["choices"]:
                    response.failure("Missing choices in response")
            except Exception as e:
                response.failure(f"Invalid JSON response: {e}")
```

What this test validates

This first script is useful for baseline load testing because it confirms:

  • the AI gateway accepts authenticated traffic
  • the POST /v1/chat/completions route works under concurrency
  • the response schema remains valid
  • per-request tenant headers are handled correctly

Why this matters

Even a simple baseline test can reveal important performance issues:

  • auth middleware slowing down as concurrency rises
  • request queues forming at the gateway
  • connection pooling misconfiguration
  • high latency caused by default model routing

For your first performance testing run, start with a modest load shape, such as 25 to 50 concurrent users, and observe median and p95 response times. Then increase gradually.
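
A gradual ramp like this can be expressed as a step schedule. The helper below is a plain-Python sketch mapping elapsed run time to a target user count; in Locust you could call it from a `LoadTestShape.tick()` implementation, returning `(users, spawn_rate)` while a step is active and `None` once the schedule is exhausted. The step values are illustrative.

```python
# Sketch: a step-ramp schedule for gradually increasing load. A Locust
# LoadTestShape subclass could call target_users() from tick(). The step
# boundaries and user counts below are examples, not recommendations.

RAMP_STEPS = [
    (60, 25),    # first minute: 25 users
    (180, 50),   # minutes 1-3: 50 users
    (360, 100),  # minutes 3-6: 100 users
    (600, 200),  # minutes 6-10: 200 users
]

def target_users(run_time_s, steps=RAMP_STEPS):
    """Return the target user count for the elapsed time, or None when done."""
    for until, users in steps:
        if run_time_s < until:
            return users
    return None  # schedule exhausted: stop the test
```

Holding each step long enough to observe steady-state p95 latency makes it much easier to pinpoint the user count at which the gateway starts to degrade.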

Advanced Load Testing Scenarios

Once the basic path is validated, move on to more realistic AI gateway scenarios. These are the tests that uncover routing bugs, rate-limit problems, and multi-model reliability issues.

Scenario 1: Testing authenticated multi-tenant routing and model selection

Many AI gateways route traffic differently depending on tenant plan, model access policy, or region. This example simulates different tenants using different models and endpoint patterns.

```python
from locust import HttpUser, task, between
import random
import uuid

TENANTS = [
    {"id": "free-tier", "model": "gpt-4o-mini", "workspace": "ws_free_001"},
    {"id": "pro-tier", "model": "claude-3-5-sonnet", "workspace": "ws_pro_002"},
    {"id": "enterprise", "model": "gpt-4.1", "workspace": "ws_ent_003"},
]

USER_QUERIES = [
    "Draft a product release note for a new analytics dashboard.",
    "Summarize this support issue into a Jira ticket.",
    "Rewrite this message to sound more professional and concise.",
    "Extract action items from the following meeting summary."
]

class MultiTenantGatewayUser(HttpUser):
    wait_time = between(1, 2)

    def on_start(self):
        self.api_key = "lgw_test_sk_multitenant_789"
        self.tenant = random.choice(TENANTS)

    @task(3)
    def routed_chat_request(self):
        headers = {
            "Authorization": f"Bearer {self.api_key}",
            "Content-Type": "application/json",
            "X-Tenant-ID": self.tenant["id"],
            "X-Workspace-ID": self.tenant["workspace"],
            "Idempotency-Key": str(uuid.uuid4())
        }

        payload = {
            "model": self.tenant["model"],
            "messages": [
                {"role": "system", "content": "You are an assistant for internal business operations."},
                {"role": "user", "content": random.choice(USER_QUERIES)}
            ],
            "temperature": 0.2,
            "max_tokens": 220,
            "stream": False
        }

        with self.client.post(
            "/v1/chat/completions",
            json=payload,
            headers=headers,
            name=f"/v1/chat/completions [{self.tenant['id']}]",
            catch_response=True
        ) as response:
            if response.status_code not in [200, 429]:
                response.failure(f"Unexpected status code {response.status_code}")
                return

            if response.status_code == 429:
                response.success()
                return

            try:
                body = response.json()
                returned_model = body.get("model", "")
                if not returned_model:
                    response.failure("Missing model in response")
            except Exception as e:
                response.failure(f"JSON parsing failed: {e}")

    @task(1)
    def list_available_models(self):
        headers = {
            "Authorization": f"Bearer {self.api_key}",
            "X-Tenant-ID": self.tenant["id"]
        }

        with self.client.get(
            "/v1/models",
            headers=headers,
            name="/v1/models",
            catch_response=True
        ) as response:
            if response.status_code != 200:
                response.failure(f"Failed to list models: {response.status_code}")

This test is ideal for validating:

  • multi-tenant request isolation
  • model-specific routing behavior
  • workspace-based authorization
  • rate limiting per tenant
  • model catalog endpoint stability under load

If your AI gateway supports policy-based routing, you can extend this script to assert that certain tenants are never routed to premium models.
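
One way to express that assertion is to compare the model name the gateway reports back against a per-tenant allowlist. The policy table below is hypothetical and mirrors the tenants from the script above; adapt it to your gateway's actual plan rules.

```python
# Sketch: assert that the model the gateway actually served is permitted
# for the tenant. The allowlist below is hypothetical; mirror your real
# routing policy.

ALLOWED_MODELS = {
    "free-tier": {"gpt-4o-mini"},
    "pro-tier": {"gpt-4o-mini", "claude-3-5-sonnet"},
    "enterprise": {"gpt-4o-mini", "claude-3-5-sonnet", "gpt-4.1"},
}

def model_allowed(tenant_id, returned_model, policy=ALLOWED_MODELS):
    """True if the tenant is permitted to receive this model."""
    return returned_model in policy.get(tenant_id, set())
```

Inside `routed_chat_request`, after parsing the body, a check such as `if not model_allowed(self.tenant["id"], returned_model): response.failure("policy violation")` turns a silent routing bug into a visible test failure.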

Scenario 2: Testing cache effectiveness with repeated prompts

Caching is a core AI gateway feature, especially for repeated prompts such as FAQs, support templates, and internal knowledge lookups. This script intentionally mixes cacheable and non-cacheable requests to measure the impact of prompt caching.

```python
from locust import HttpUser, task, between
import random
import uuid

CACHEABLE_PROMPTS = [
    "What are your business hours?",
    "How do I reset my password?",
    "What is your refund policy?",
    "How can I contact support?"
]

DYNAMIC_PROMPTS = [
    lambda: f"Summarize support ticket #{random.randint(10000, 99999)}",
    lambda: f"Generate follow-up email for lead ID {random.randint(1000, 9999)}",
    lambda: f"Classify sentiment for review batch {random.randint(1, 500)}"
]

class CachedAIGatewayUser(HttpUser):
    wait_time = between(0.5, 1.5)

    def on_start(self):
        self.api_key = "lgw_test_sk_cache_456"
        self.tenant_id = "support-automation"

    @task(4)
    def cacheable_chat_request(self):
        prompt = random.choice(CACHEABLE_PROMPTS)

        headers = {
            "Authorization": f"Bearer {self.api_key}",
            "Content-Type": "application/json",
            "X-Tenant-ID": self.tenant_id,
            "X-Cache-Enabled": "true",
            "X-Request-ID": str(uuid.uuid4())
        }

        payload = {
            "model": "gpt-4o-mini",
            "messages": [
                {"role": "system", "content": "Answer customer support questions accurately and briefly."},
                {"role": "user", "content": prompt}
            ],
            "temperature": 0,
            "max_tokens": 80,
            "stream": False
        }

        with self.client.post(
            "/v1/chat/completions",
            json=payload,
            headers=headers,
            name="/v1/chat/completions [cacheable]",
            catch_response=True
        ) as response:
            if response.status_code != 200:
                response.failure(f"Status code {response.status_code}")
                return

            cache_status = response.headers.get("X-Cache", "").lower()
            if cache_status not in ["hit", "miss", "bypass"]:
                response.failure("Missing or invalid X-Cache header")

    @task(2)
    def dynamic_chat_request(self):
        prompt = random.choice(DYNAMIC_PROMPTS)()

        headers = {
            "Authorization": f"Bearer {self.api_key}",
            "Content-Type": "application/json",
            "X-Tenant-ID": self.tenant_id,
            "X-Cache-Enabled": "true"
        }

        payload = {
            "model": "gpt-4o-mini",
            "messages": [
                {"role": "system", "content": "You process unique operational requests."},
                {"role": "user", "content": prompt}
            ],
            "temperature": 0.4,
            "max_tokens": 120,
            "stream": False
        }

        with self.client.post(
            "/v1/chat/completions",
            json=payload,
            headers=headers,
            name="/v1/chat/completions [dynamic]",
            catch_response=True
        ) as response:
            if response.status_code != 200:
                response.failure(f"Dynamic request failed with {response.status_code}")
```

This kind of performance testing helps answer critical questions:

  • Does caching actually improve p95 latency?
  • Are repeated prompts consistently served from cache?
  • Does cache overhead hurt uncached requests?
  • Are there cache stampede issues under concurrent demand?

On LoadForge, you can compare the response time distributions for the cacheable and dynamic request groups separately.
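
You can also quantify cache behavior inside the test itself by tallying latencies per `X-Cache` status. The sketch below assumes the gateway emits `hit`/`miss` values in that header; record each response with `stats.record(response.headers.get("X-Cache", "miss"), response.elapsed.total_seconds() * 1000)` inside a task.

```python
import statistics
from collections import defaultdict

class CacheStats:
    """Tally response latencies by cache status to compare hit vs miss."""

    def __init__(self):
        self.latencies = defaultdict(list)  # status -> [latency_ms, ...]

    def record(self, cache_status, latency_ms):
        self.latencies[cache_status.lower()].append(latency_ms)

    def hit_rate(self):
        """Fraction of recorded requests that were cache hits."""
        hits = len(self.latencies["hit"])
        total = sum(len(v) for v in self.latencies.values())
        return hits / total if total else 0.0

    def median(self, status):
        """Median latency in ms for a given cache status, or None."""
        values = self.latencies.get(status, [])
        return statistics.median(values) if values else None
```

A healthy result is a rising hit rate over the course of the run and a clearly lower median for hits than for misses; a hit rate that stays near zero usually points to a cache-key construction bug.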

Scenario 3: Testing failover, rate limiting, and embeddings traffic

Real AI gateway traffic often includes more than chat completions. Embeddings endpoints are common in retrieval-augmented generation pipelines, and gateways may also fail over between providers when one route is degraded.

```python
from locust import HttpUser, task, between
import random
import uuid

DOCUMENTS = [
    "Load testing verifies whether an AI gateway can route requests reliably under concurrent traffic.",
    "Embeddings are used to convert text into vectors for semantic search and retrieval.",
    "Rate limiting protects upstream model providers from excessive request volume.",
    "Failover allows traffic to shift to a secondary provider when the primary route is unhealthy."
]

class AIGatewayMixedWorkloadUser(HttpUser):
    wait_time = between(1, 2)

    def on_start(self):
        self.api_key = "lgw_test_sk_mixed_321"
        self.tenant_id = random.choice(["rag-app", "search-api", "analytics-bot"])

    @task(3)
    def create_embeddings(self):
        headers = {
            "Authorization": f"Bearer {self.api_key}",
            "Content-Type": "application/json",
            "X-Tenant-ID": self.tenant_id,
            "X-Provider-Preference": "primary"
        }

        payload = {
            "model": "text-embedding-3-large",
            "input": random.choice(DOCUMENTS),
            "encoding_format": "float"
        }

        with self.client.post(
            "/v1/embeddings",
            json=payload,
            headers=headers,
            name="/v1/embeddings",
            catch_response=True
        ) as response:
            if response.status_code not in [200, 429]:
                response.failure(f"Unexpected embeddings status: {response.status_code}")
                return

            if response.status_code == 200:
                try:
                    body = response.json()
                    if "data" not in body:
                        response.failure("Missing embeddings data")
                except Exception as e:
                    response.failure(f"Embeddings JSON invalid: {e}")

    @task(2)
    def failover_chat_request(self):
        headers = {
            "Authorization": f"Bearer {self.api_key}",
            "Content-Type": "application/json",
            "X-Tenant-ID": self.tenant_id,
            "X-Route-Policy": "prefer-primary-with-fallback",
            "X-Request-ID": str(uuid.uuid4())
        }

        payload = {
            "model": "enterprise-chat",
            "messages": [
                {"role": "system", "content": "You are a resilient assistant designed for enterprise workflows."},
                {"role": "user", "content": "Summarize the latest weekly incident report in five bullet points."}
            ],
            "temperature": 0.1,
            "max_tokens": 200,
            "stream": False
        }

        with self.client.post(
            "/v1/chat/completions",
            json=payload,
            headers=headers,
            name="/v1/chat/completions [failover]",
            catch_response=True
        ) as response:
            if response.status_code not in [200, 429, 503]:
                response.failure(f"Unexpected failover response: {response.status_code}")
                return

            if response.status_code == 200:
                upstream = response.headers.get("X-Upstream-Provider", "")
                if not upstream:
                    response.failure("Missing X-Upstream-Provider header")
```

This mixed workload script is valuable because it simulates a more production-like AI and LLM environment where multiple endpoint types compete for gateway resources.

It can reveal:

  • whether embeddings traffic starves chat traffic
  • whether failover logic adds excessive latency
  • whether rate limiting is enforced consistently across endpoint types
  • whether upstream provider metadata remains visible for debugging

Analyzing Your Results

After running your AI gateway load test, focus on metrics that reflect both gateway behavior and end-user experience.

Key metrics to review

Response time percentiles

Average latency is not enough. For AI gateway performance testing, pay close attention to:

  • p50 for normal user experience
  • p95 for tail latency under load
  • p99 for severe slowdowns and routing issues

If p95 rises sharply while p50 stays stable, that often points to queueing, cache misses, or upstream provider variability.

Throughput

Measure requests per second by endpoint:

  • /v1/chat/completions
  • /v1/embeddings
  • /v1/models

A gateway may handle lightweight metadata endpoints easily while struggling with chat or embedding workloads.

Error rates

Break errors down by status code:

  • 401 or 403: auth or tenant policy problems
  • 429: expected under rate limiting, but should be controlled
  • 500: gateway application issues
  • 502 or 503: upstream provider or failover issues

Cache hit behavior

If your gateway emits headers like X-Cache: hit, compare latency for cached and uncached requests. A healthy cache should produce a noticeable reduction in response time.

Upstream provider distribution

If your gateway includes headers such as X-Upstream-Provider or X-Route-Decision, examine whether traffic is distributed according to policy. Unexpected skew may indicate a routing bug.

What good results look like

For a well-performing AI gateway, you generally want to see:

  • stable latency as user count increases gradually
  • predictable 429s instead of random 500s when limits are exceeded
  • lower latency for cached prompts
  • successful failover without widespread request loss
  • no cross-tenant auth anomalies under concurrency

LoadForge’s real-time reporting makes it easier to spot inflection points during a stress testing run. If latency spikes after 200 concurrent users, for example, you can correlate that with error rates or throughput drops immediately. And because LoadForge supports distributed testing and global test locations, you can also measure how the AI gateway behaves for geographically diverse clients.

Performance Optimization Tips

When load testing exposes AI gateway bottlenecks, these are the most common areas to optimize.

Reduce auth overhead

If bearer token validation is expensive, use token caching or lighter-weight policy checks where appropriate. Authentication should not dominate request latency.

Tune connection pooling

Your gateway likely communicates with one or more upstream model providers. Make sure HTTP client connection pools, keep-alive settings, and timeout values are tuned for concurrent traffic.

Improve cache strategy

For repeated prompts or deterministic system workflows:

  • normalize prompt formatting
  • cache by tenant and model where needed
  • prevent cache stampedes with request coalescing
  • set sensible TTLs
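
A normalized, tenant- and model-scoped cache key makes the first two points concrete. The sketch below lowercases and collapses whitespace in message content before hashing; the exact normalization rules are assumptions you should adapt to how your prompts actually vary.

```python
import hashlib
import json

def cache_key(tenant_id, model, messages):
    """Build a deterministic cache key scoped by tenant and model.

    Normalization here (lowercase, collapsed whitespace) is illustrative;
    choose rules that match the real variation in your prompts so that
    equivalent requests share a key without colliding across tenants.
    """
    normalized = [
        {"role": m["role"], "content": " ".join(m["content"].split()).lower()}
        for m in messages
    ]
    blob = json.dumps({"t": tenant_id, "m": model, "msgs": normalized},
                      sort_keys=True)
    return hashlib.sha256(blob.encode()).hexdigest()
```

Scoping the key by tenant prevents one tenant's cached answers from leaking to another, which is also a tenant-isolation property worth asserting in your load tests.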

Separate workloads

Embeddings, chat completions, and streaming requests can behave very differently. Consider separate worker pools, route classes, or concurrency limits for each.

Optimize routing logic

If routing decisions depend on database lookups or external config services, cache those decisions in memory or a fast store. Every millisecond counts at high request volumes.

Set clear rate-limit policies

Rate limiting should be transparent and deterministic. Return consistent 429 responses with useful headers such as:

  • X-RateLimit-Limit
  • X-RateLimit-Remaining
  • Retry-After
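
On the client side, those headers are what make polite retry behavior possible. The sketch below honors `Retry-After` when present and falls back to capped exponential backoff otherwise; the fallback parameters are illustrative assumptions.

```python
def backoff_seconds(headers, attempt, base=1.0, cap=30.0):
    """Choose how long to wait after a 429 response.

    Honors a numeric Retry-After header when the gateway provides one;
    otherwise falls back to capped exponential backoff (base * 2^attempt).
    The base and cap defaults are illustrative.
    """
    retry_after = headers.get("Retry-After")
    if retry_after is not None:
        try:
            return max(0.0, float(retry_after))
        except ValueError:
            pass  # e.g. an HTTP-date value; fall through to exponential
    return min(cap, base * (2 ** attempt))
```

If load test clients respect these hints, you can distinguish a gateway that sheds load gracefully from one whose clients hammer it into cascading failure.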

Validate failover paths regularly

Do not assume failover works because it exists in configuration. Include failover scenarios in recurring load testing and CI/CD integration pipelines so regressions are caught early.

Common Pitfalls to Avoid

AI gateway load testing has a few traps that can produce misleading or incomplete results.

Using unrealistic prompts

Tiny prompts with low token counts may make the gateway look faster than it will be in production. Use realistic message sizes and response limits.

Ignoring upstream variability

If the upstream LLM provider is slow, your gateway metrics will reflect that. Try to separate gateway overhead from provider latency by using provider-specific headers, tracing, or test routes.

Not testing cache and non-cache traffic separately

A mixed test without segmentation can hide important differences. Always compare cacheable and dynamic workloads independently.

Forgetting tenant-specific behavior

Multi-tenant AI gateways often apply different quotas, models, and routing rules. If you only test one tenant, you may miss the most important bottlenecks.

Treating 429s as generic failures

In stress testing, 429 responses can be healthy if the gateway is intentionally enforcing limits. The real question is whether they happen at the right thresholds and with the correct headers.

Skipping long-lived connection testing

Streaming requests and large completions hold connections open longer. If your production traffic uses streaming, include it in your load testing plan.

Running only from one region

Latency and routing behavior can change by geography. LoadForge’s cloud-based infrastructure and global test locations are useful for understanding regional differences in AI gateway performance.

Conclusion

Load testing an AI gateway is about much more than checking whether an endpoint returns 200 OK. You need to validate routing logic, cache effectiveness, rate limiting, multi-tenant behavior, failover reliability, and endpoint performance across chat and embeddings workloads. With realistic Locust scripts and a structured approach to performance testing and stress testing, you can identify bottlenecks before they affect your users.

LoadForge makes this process easier with distributed testing, real-time reporting, CI/CD integration, and scalable cloud-based infrastructure for generating meaningful AI and LLM traffic patterns. If you want to confidently validate your AI gateway at scale, try running these tests on LoadForge and turn your gateway into a reliable foundation for production AI applications.
