
Introduction
Load testing LLM inference endpoints is no longer optional if your application depends on generative AI for chat, summarization, search augmentation, classification, or content generation. Unlike traditional REST APIs, large language model endpoints introduce unique performance characteristics: long-running requests, variable output sizes, token-based billing, streaming responses, and highly non-linear latency under concurrency.
If you are serving traffic to OpenAI-compatible APIs, self-hosted vLLM or TGI deployments, Azure OpenAI, or custom inference gateways, you need to understand how your LLM stack behaves under real user load. A small increase in concurrent requests can dramatically affect time to first token, total response time, token throughput, and error rates.
In this guide, you will learn how to load test LLM inference endpoints using Locust on LoadForge. We will cover realistic scenarios including authenticated chat completions, streaming inference, mixed prompt sizes, and multi-endpoint workloads. Along the way, we will discuss how to measure performance testing metrics that matter for AI systems, including concurrency, throughput, latency percentiles, and failures. With LoadForge’s distributed testing, real-time reporting, cloud-based infrastructure, and CI/CD integration, you can benchmark LLM inference endpoints from multiple regions and at meaningful scale.
Prerequisites
Before you start load testing LLM inference endpoints, make sure you have:
- A working LLM inference API endpoint
- API authentication credentials such as a bearer token or API key
- Knowledge of the request schema your model expects
- A test environment with safe quotas and rate limits
- Sample prompts that reflect realistic production usage
- A LoadForge account for running distributed load tests
Common endpoint types you may want to test include:
- /v1/chat/completions
- /v1/completions
- /v1/embeddings
- /generate
- /api/generate
- /v1/responses
You should also define what success looks like for your performance testing effort. For LLM inference endpoints, that usually includes:
- P50, P95, and P99 response time
- Time to first token for streaming APIs
- Requests per second under sustained concurrency
- Tokens generated per second
- Error rate under load
- Rate limit behavior
- Queueing delays during stress testing
If your endpoint is OpenAI-compatible, the examples below will feel familiar. If you are using a custom gateway or self-hosted model server, you can adapt the same Locust patterns.
Understanding LLM Inference Endpoints Under Load
LLM inference endpoints behave differently from standard CRUD APIs because request cost varies significantly based on prompt length, output length, model size, and decoding parameters.
Key performance factors
Prompt and completion token volume
A request with a 50-token prompt and 100-token output is much cheaper than one with a 5,000-token context and 1,000-token response. During load testing, token volume often matters more than request count.
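When planning test workloads, it helps to reason in tokens rather than requests. As a minimal sketch, you can budget token volume per request with a rough heuristic of about four characters per token for English text; the function names and the 4-chars-per-token ratio here are illustrative assumptions, and a real tokenizer will give model-specific counts:

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate for English text (~4 characters per token).
    A planning heuristic only; real tokenizers give model-specific counts."""
    return max(1, round(len(text) / chars_per_token))


def estimate_request_cost(prompt: str, max_tokens: int) -> int:
    """Approximate worst-case token volume for one request:
    estimated prompt tokens plus the completion budget (max_tokens)."""
    return estimate_tokens(prompt) + max_tokens


# A short question vs. a long pasted context differ enormously in cost,
# even though each counts as exactly one request in an RPS metric.
short_cost = estimate_request_cost("What is tokenization?", max_tokens=100)
long_cost = estimate_request_cost("word " * 5000, max_tokens=1000)
```

Summing these estimates across your task mix gives a token budget for a test run, which is usually a better predictor of GPU load than request count.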
Model size and hardware
A 7B model running on a single GPU will behave very differently from a 70B model distributed across multiple GPUs. Inference latency can increase sharply once request queues form.
Streaming vs non-streaming responses
Streaming may improve perceived responsiveness because users receive tokens earlier, but it does not necessarily reduce backend compute time. You should measure both total request duration and time to first token.
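Time to first token should be measured at the first chunk that actually carries generated text, not at the HTTP response headers, because servers may emit empty role or keep-alive chunks first. As a hedged sketch, assuming OpenAI-style server-sent-event chunks of the form `data: {"choices": [{"delta": {"content": "..."}}]}`, you can locate that first content chunk like this:

```python
import json


def first_content_index(sse_lines):
    """Return the index of the first SSE line that carries generated text,
    assuming OpenAI-style streaming chunks. Empty delta chunks (e.g. the
    initial role-only chunk) and the [DONE] sentinel are skipped."""
    for i, line in enumerate(sse_lines):
        if not line.startswith("data: ") or line == "data: [DONE]":
            continue
        try:
            chunk = json.loads(line[len("data: "):])
        except json.JSONDecodeError:
            continue
        delta = (chunk.get("choices") or [{}])[0].get("delta", {})
        if delta.get("content"):
            return i
    return None


# Hypothetical stream: a role-only chunk arrives before any text does.
stream = [
    'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    'data: {"choices": [{"delta": {"content": "Hello"}}]}',
    "data: [DONE]",
]
```

Recording the timestamp at that index, rather than at the first byte, keeps your time-to-first-token numbers honest when gateways or model servers send preamble chunks.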
Concurrency limits and batching
Some inference engines batch requests internally to improve throughput. This can help at moderate concurrency, but once GPU memory or scheduler capacity is exhausted, latency may spike.
Authentication and gateway overhead
Many production LLM APIs sit behind API gateways, service meshes, WAFs, or usage metering layers. Your load testing should include this real path whenever possible.
Common bottlenecks
When stress testing LLM inference endpoints, the most common bottlenecks are:
- GPU saturation
- Request queue buildup
- Tokenizer overhead
- Large prompt serialization costs
- Rate limiting at the gateway
- Slow downstream retrieval or tool-calling dependencies
- Connection exhaustion during streaming
- Regional latency for globally distributed users
This is why distributed load testing with LoadForge is useful. You can generate traffic from multiple global test locations and see whether latency is caused by model inference itself or by network and edge infrastructure.
Writing Your First Load Test
Let’s start with a basic load test for a chat completion endpoint using bearer token authentication. This example targets an OpenAI-compatible endpoint at /v1/chat/completions.
Basic chat completions load test
```python
from locust import HttpUser, task, between
import os


class LLMChatUser(HttpUser):
    wait_time = between(1, 3)
    host = os.getenv("LLM_API_BASE", "https://llm-api.example.com")

    def on_start(self):
        self.headers = {
            "Authorization": f"Bearer {os.getenv('LLM_API_KEY', 'replace-me')}",
            "Content-Type": "application/json"
        }

    @task
    def chat_completion(self):
        payload = {
            "model": "meta-llama/Meta-Llama-3-8B-Instruct",
            "messages": [
                {"role": "system", "content": "You are a concise support assistant."},
                {"role": "user", "content": "Explain what tokenization means in large language models in 3 short bullet points."}
            ],
            "temperature": 0.2,
            "max_tokens": 120,
            "stream": False
        }

        with self.client.post(
            "/v1/chat/completions",
            headers=self.headers,
            json=payload,
            catch_response=True,
            name="POST /v1/chat/completions"
        ) as response:
            if response.status_code != 200:
                response.failure(f"Unexpected status code: {response.status_code}")
                return

            try:
                data = response.json()
                content = data["choices"][0]["message"]["content"]
                usage = data.get("usage", {})
                completion_tokens = usage.get("completion_tokens", 0)

                if not content:
                    response.failure("Empty completion content")
                elif completion_tokens == 0:
                    response.failure("No completion tokens returned")
                else:
                    response.success()
            except Exception as e:
                response.failure(f"Invalid JSON response: {e}")
```

What this test does
This script simulates a user sending a realistic chat request to an LLM inference endpoint. It validates:
- The endpoint returns HTTP 200
- The JSON structure is valid
- The model generated content
- Token usage information is present
This is a good starting point for baseline load testing, but it is still simplistic. Real workloads usually involve mixed prompt sizes, authentication refresh flows, and streaming.
Running the test
If you are running Locust locally before uploading to LoadForge, you can use environment variables:
```bash
export LLM_API_BASE="https://llm-api.example.com"
export LLM_API_KEY="your-api-key"

locust -f locustfile.py
```

In LoadForge, you can store these values as environment variables or secrets and run the same script using cloud-based infrastructure at larger scale.
Advanced Load Testing Scenarios
Once your basic test works, you should move to more realistic performance testing scenarios. The following examples model common production patterns for LLM inference endpoints.
Scenario 1: Mixed prompt sizes and endpoint weighting
Real applications rarely send identical prompts. Some users ask short questions, while others send long contexts, support transcripts, or retrieved documents. This example simulates mixed workloads with weighted tasks.
```python
from locust import HttpUser, task, between
import os
import random

SHORT_PROMPTS = [
    "Summarize the benefits of horizontal scaling in one sentence.",
    "What is the difference between latency and throughput?",
    "Write a short explanation of vector embeddings."
]

MEDIUM_PROMPTS = [
    """A customer reports that our chatbot is responding slowly during peak hours.
Suggest 5 likely causes and 5 actions the engineering team should take.""",
    """We are deploying an inference service behind an API gateway.
Provide a checklist for monitoring latency, error rates, token throughput, and rate limits."""
]

LONG_CONTEXT = """
You are assisting with a production incident review for an AI platform.
The platform serves chat completions for customer support, internal knowledge search,
and summarization workflows. During a weekday traffic spike, p95 latency increased
from 2.1 seconds to 14.8 seconds. GPU utilization reached 96 percent, request queues
grew rapidly, and some users received 429 and 503 errors. The service uses an
OpenAI-compatible API gateway, a retrieval layer for knowledge augmentation, and
streaming responses for chat clients.

Analyze the likely bottlenecks, explain how concurrency affects inference performance,
and provide a prioritized remediation plan. Include recommendations for autoscaling,
prompt size controls, caching, batching, and observability.
"""


class MixedLLMUser(HttpUser):
    wait_time = between(1, 2)
    host = os.getenv("LLM_API_BASE", "https://llm-api.example.com")

    def on_start(self):
        self.headers = {
            "Authorization": f"Bearer {os.getenv('LLM_API_KEY', 'replace-me')}",
            "Content-Type": "application/json"
        }

    @task(5)
    def short_chat_request(self):
        payload = {
            "model": "mistralai/Mistral-7B-Instruct-v0.2",
            "messages": [
                {"role": "system", "content": "You are a concise technical assistant."},
                {"role": "user", "content": random.choice(SHORT_PROMPTS)}
            ],
            "temperature": 0.3,
            "max_tokens": 80
        }
        self._send_chat(payload, "short prompts")

    @task(3)
    def medium_chat_request(self):
        payload = {
            "model": "mistralai/Mistral-7B-Instruct-v0.2",
            "messages": [
                {"role": "system", "content": "You are a production AI platform advisor."},
                {"role": "user", "content": random.choice(MEDIUM_PROMPTS)}
            ],
            "temperature": 0.4,
            "max_tokens": 220
        }
        self._send_chat(payload, "medium prompts")

    @task(1)
    def long_context_request(self):
        payload = {
            "model": "mistralai/Mistral-7B-Instruct-v0.2",
            "messages": [
                {"role": "system", "content": "You are a senior SRE for AI systems."},
                {"role": "user", "content": LONG_CONTEXT}
            ],
            "temperature": 0.2,
            "max_tokens": 400
        }
        self._send_chat(payload, "long prompts")

    def _send_chat(self, payload, request_type):
        with self.client.post(
            "/v1/chat/completions",
            headers=self.headers,
            json=payload,
            catch_response=True,
            name=f"POST /v1/chat/completions [{request_type}]"
        ) as response:
            if response.status_code == 429:
                response.failure("Rate limited")
                return
            if response.status_code >= 500:
                response.failure(f"Server error: {response.status_code}")
                return
            if response.status_code != 200:
                response.failure(f"Unexpected status: {response.status_code}")
                return

            try:
                data = response.json()
            except Exception as e:
                response.failure(f"Invalid JSON response: {e}")
                return

            usage = data.get("usage", {})
            prompt_tokens = usage.get("prompt_tokens", 0)
            completion_tokens = usage.get("completion_tokens", 0)

            if prompt_tokens <= 0 or completion_tokens <= 0:
                response.failure("Missing token usage data")
            else:
                response.success()
```

This test is much more realistic because it reflects variable request cost. It helps you identify whether your LLM inference endpoint degrades gracefully when prompt sizes increase.
Scenario 2: Streaming inference and time to first token
For chat applications, streaming matters. Users care about how quickly the model starts responding, not just when the request completes. This example tests a streaming endpoint and validates that data begins arriving.
```python
from locust import HttpUser, task, between
import os
import time


class StreamingLLMUser(HttpUser):
    wait_time = between(2, 4)
    host = os.getenv("LLM_API_BASE", "https://llm-api.example.com")

    def on_start(self):
        self.headers = {
            "Authorization": f"Bearer {os.getenv('LLM_API_KEY', 'replace-me')}",
            "Content-Type": "application/json"
        }

    @task
    def streaming_chat_completion(self):
        payload = {
            "model": "meta-llama/Meta-Llama-3-70B-Instruct",
            "messages": [
                {"role": "system", "content": "You are a helpful enterprise AI assistant."},
                {"role": "user", "content": "Draft a customer-facing explanation of why AI responses may be slower during peak demand, keeping the tone professional and reassuring."}
            ],
            "temperature": 0.5,
            "max_tokens": 180,
            "stream": True
        }

        start_time = time.time()

        with self.client.post(
            "/v1/chat/completions",
            headers=self.headers,
            json=payload,
            stream=True,
            catch_response=True,
            name="POST /v1/chat/completions [streaming]"
        ) as response:
            if response.status_code != 200:
                response.failure(f"Unexpected status code: {response.status_code}")
                return

            first_chunk_time = None
            chunk_count = 0

            try:
                for line in response.iter_lines(decode_unicode=True):
                    if line:
                        chunk_count += 1
                        if first_chunk_time is None:
                            first_chunk_time = time.time() - start_time

                if chunk_count == 0:
                    response.failure("No streaming chunks received")
                elif first_chunk_time is None or first_chunk_time > 5:
                    response.failure(f"Slow time to first token: {first_chunk_time}")
                else:
                    response.success()
            except Exception as e:
                response.failure(f"Streaming read error: {e}")
```

This script is useful for stress testing chat-style user experiences. While Locust’s default metrics focus on total request time, this pattern lets you add custom validation for time to first token behavior.
Scenario 3: Authenticated multi-endpoint AI workload
Many AI applications do more than chat completions. They may generate embeddings for retrieval, then call a chat endpoint using retrieved context. The following example simulates a multi-step AI workflow with API key authentication and per-request metadata.
```python
from locust import HttpUser, task, between
import os
import random

DOCUMENTS = [
    "Load testing helps teams understand latency, throughput, and failure patterns before production incidents occur.",
    "Embeddings convert text into dense vectors that can be compared for semantic similarity in retrieval systems.",
    "Streaming responses improve perceived responsiveness in chat applications by returning tokens incrementally."
]

QUESTIONS = [
    "How does load testing improve AI reliability?",
    "What are embeddings used for in semantic search?",
    "Why do streaming responses matter for chat applications?"
]


class AIWorkflowUser(HttpUser):
    wait_time = between(1, 3)
    host = os.getenv("LLM_API_BASE", "https://llm-api.example.com")

    def on_start(self):
        self.headers = {
            "Authorization": f"Bearer {os.getenv('LLM_API_KEY', 'replace-me')}",
            "Content-Type": "application/json",
            "X-Client-Id": "loadforge-llm-benchmark",
            "X-Request-Source": "performance-test"
        }

    @task
    def embedding_then_chat(self):
        document = random.choice(DOCUMENTS)
        question = random.choice(QUESTIONS)

        embedding_payload = {
            "model": "text-embedding-3-small",
            "input": document
        }

        with self.client.post(
            "/v1/embeddings",
            headers=self.headers,
            json=embedding_payload,
            catch_response=True,
            name="POST /v1/embeddings"
        ) as embedding_response:
            if embedding_response.status_code != 200:
                embedding_response.failure(f"Embedding failed: {embedding_response.status_code}")
                return

            try:
                embedding_data = embedding_response.json()
                vector = embedding_data["data"][0]["embedding"]
                if len(vector) < 100:
                    embedding_response.failure("Embedding vector too short")
                    return
                embedding_response.success()
            except Exception as e:
                embedding_response.failure(f"Invalid embedding response: {e}")
                return

        chat_payload = {
            "model": "gpt-4o-mini",
            "messages": [
                {"role": "system", "content": "Answer using the provided context only."},
                {"role": "user", "content": f"Context: {document}\n\nQuestion: {question}"}
            ],
            "temperature": 0.1,
            "max_tokens": 120,
            "metadata": {
                "tenant_id": "acme-support",
                "test_run": "loadforge-llm-inference"
            }
        }

        with self.client.post(
            "/v1/chat/completions",
            headers=self.headers,
            json=chat_payload,
            catch_response=True,
            name="POST /v1/chat/completions [RAG workflow]"
        ) as chat_response:
            if chat_response.status_code != 200:
                chat_response.failure(f"Chat failed: {chat_response.status_code}")
                return

            try:
                data = chat_response.json()
                answer = data["choices"][0]["message"]["content"]
                if not answer or len(answer.strip()) < 20:
                    chat_response.failure("Answer too short or empty")
                else:
                    chat_response.success()
            except Exception as e:
                chat_response.failure(f"Invalid chat response: {e}")
```

This is the kind of load testing scenario that reveals bottlenecks across a realistic AI pipeline rather than a single endpoint in isolation.
Analyzing Your Results
Once your tests are running in LoadForge, focus on the metrics that matter for LLM inference endpoints.
Response time percentiles
Average latency is not enough. Watch:
- P50 for typical user experience
- P95 for degraded but common peak behavior
- P99 for severe outliers
LLM systems often show large variance, especially with long prompts or large outputs.
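LoadForge reports these percentiles for you, but if you export raw latencies for offline analysis, a minimal nearest-rank percentile sketch looks like this (the sample latencies are hypothetical):

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


# Hypothetical per-request latencies (ms) from an LLM load test.
latencies_ms = [820, 910, 980, 1050, 1200, 1400, 2100, 3800, 9500, 14200]

p50 = percentile(latencies_ms, 50)  # typical user experience
p95 = percentile(latencies_ms, 95)  # degraded but common peak behavior
p99 = percentile(latencies_ms, 99)  # severe outliers
```

Notice how a long tail like this one can leave the average far above the median, which is exactly why averages alone mislead for LLM workloads.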
Failure rate
Track how often requests fail and why:
- 429 Too Many Requests indicates rate limiting
- 500-level errors often point to overloaded model servers
- Timeouts may indicate queue buildup or gateway issues
- Connection errors can appear during streaming overload
Throughput
Measure both:
- Requests per second
- Effective token throughput if your API returns usage data
A system may sustain a stable request rate but still show declining token throughput under load.
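If your API returns `usage` data, you can derive an effective token throughput from per-request records. This is a minimal sketch using hypothetical sample values; each record pairs the reported completion tokens with the measured request duration:

```python
def token_throughput(records):
    """Effective completion-token rate across a set of requests.

    Each record is (completion_tokens, duration_seconds). Total tokens
    divided by total per-request time gives a comparative rate that can
    fall under load even while requests per second stays flat."""
    total_tokens = sum(tokens for tokens, _ in records)
    total_seconds = sum(duration for _, duration in records)
    return total_tokens / total_seconds if total_seconds else 0.0


# Hypothetical samples: identical request mix, slower per-request times.
baseline = token_throughput([(120, 1.5), (110, 1.4), (130, 1.6)])
under_load = token_throughput([(120, 6.0), (110, 5.5), (130, 6.5)])
```

Comparing the two values makes the degradation visible even when a dashboard focused on RPS alone would look healthy.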
Concurrency behavior
Increase concurrent users gradually and observe where latency begins to climb sharply. This is often the practical saturation point for your inference stack.
With LoadForge, you can run distributed tests across regions to see whether concurrency issues are global or isolated to a specific deployment location.
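One hedged way to automate finding that saturation point from a ramped test is to flag the first concurrency level whose p95 latency exceeds a multiple of the lowest-level baseline. The factor of 2x and the ramp data below are illustrative assumptions, not a universal definition of saturation:

```python
def find_saturation_point(results, factor=2.0):
    """Given (concurrency, p95_latency_seconds) pairs from a ramped test,
    return the first concurrency level where p95 exceeds `factor` times
    the lowest-concurrency baseline, or None if latency stays controlled."""
    ordered = sorted(results)
    baseline = ordered[0][1]
    for concurrency, p95 in ordered:
        if p95 > factor * baseline:
            return concurrency
    return None


# Hypothetical ramp: latency holds, then climbs sharply past 50 users.
ramp = [(10, 1.8), (25, 2.0), (50, 2.4), (100, 6.9), (200, 15.2)]
```

A rule like this is also useful as a CI/CD gate: fail the pipeline if the saturation point drops below your expected production concurrency.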
Endpoint-specific insights
Separate metrics by request type:
- Short prompts
- Long prompts
- Streaming requests
- Embeddings
- Multi-step AI workflows
This helps you avoid misleading averages. A fast embeddings endpoint can hide poor chat completion performance if you lump everything together.
Performance Optimization Tips
If your load testing results show slow or unstable LLM inference endpoints, these optimizations are often effective:
Control prompt size
Large prompts increase tokenization cost, memory usage, and generation latency. Trim unnecessary context and enforce input limits.
Tune max_tokens
Overly generous output limits can inflate latency and cost. Set realistic max_tokens values based on the use case.
Use streaming for interactive UX
Streaming improves perceived responsiveness for chat applications, even when total compute time stays similar.
Scale inference capacity
If GPU utilization is consistently high, add capacity or distribute requests across more replicas. Stress testing helps determine the concurrency threshold where scaling becomes necessary.
Cache deterministic responses
For low-temperature or repeated prompts, response caching can reduce load significantly.
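A minimal sketch of such a cache, with hypothetical function names, keys responses on the full decoding context. It is only safe when decoding is effectively deterministic (temperature at or near zero) and the prompt carries no per-user private state:

```python
import hashlib


def cache_key(model, messages, temperature):
    """Deterministic key over everything that affects the output:
    model, full message list, and temperature."""
    canonical = repr((model, tuple((m["role"], m["content"]) for m in messages), temperature))
    return hashlib.sha256(canonical.encode()).hexdigest()


_cache = {}


def cached_completion(model, messages, temperature, call_model):
    """Return a cached answer when the identical request was seen before;
    otherwise invoke `call_model` (the real API call) and store the result."""
    key = cache_key(model, messages, temperature)
    if key not in _cache:
        _cache[key] = call_model(model, messages, temperature)
    return _cache[key]
```

In production you would back this with Redis or a gateway-level cache with TTLs rather than an in-process dict, but the keying logic is the important part.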
Optimize retrieval pipelines
If your LLM endpoint depends on RAG, measure embedding generation, vector search, and prompt assembly separately. The model may not be the only bottleneck.
Apply rate limiting and backpressure
Protect your service from overload by enforcing quotas and queue limits rather than letting latency spiral indefinitely.
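A common building block for this is a token bucket that rejects excess requests fast (for example with HTTP 429) instead of queueing them until latency spirals. This is a minimal in-process sketch, not a production limiter (which would typically live in the gateway and be shared across replicas):

```python
import time


class TokenBucket:
    """Minimal token-bucket admission control: each request consumes one
    token; tokens refill at a fixed rate up to a burst capacity."""

    def __init__(self, rate_per_sec, burst):
        self.rate = rate_per_sec
        self.capacity = burst
        self.tokens = float(burst)
        self.last = time.monotonic()

    def allow(self):
        """Admit the request if a token is available, else reject it."""
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

Load testing against a limiter like this tells you whether your clients see clean 429s under overload, which is far easier to handle than multi-minute queued responses.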
Benchmark model choices
Smaller models may deliver better latency and throughput for high-volume workloads. Test multiple models to find the right latency-quality tradeoff.
Common Pitfalls to Avoid
Load testing LLM inference endpoints has several traps that can lead to misleading results.
Using unrealistic prompts
If your test prompts are too short or too simple, your benchmark will not reflect production behavior. Include realistic context lengths and output expectations.
Ignoring token usage
Request count alone is not enough for AI systems. Two tests with the same RPS can have radically different token throughput and hardware impact.
Testing only non-streaming endpoints
If your users consume streamed responses, you need to test streaming. Otherwise, you are missing a critical part of the user experience.
Forgetting authentication overhead
Production inference often includes API key validation, tenant routing, and logging. Test the real path, not a bypassed internal endpoint.
Overlooking warm-up effects
Model servers may behave differently after startup, cache misses, or autoscaling events. Include warm-up time before evaluating steady-state performance.
Running only from one region
LLM applications often serve global users. Use LoadForge’s global test locations to identify regional latency and edge routing issues.
Not separating workloads
Embeddings, chat completions, and long-context generation should not always be mixed into one metric bucket. Keep them distinguishable in your reports.
Conclusion
Load testing LLM inference endpoints is essential for understanding how your AI application performs under real traffic. By benchmarking chat completions, streaming responses, embeddings, and multi-step AI workflows, you can uncover latency spikes, concurrency limits, token throughput bottlenecks, and failure patterns before they affect users.
Using Locust scripts on LoadForge gives you a practical way to run realistic performance testing and stress testing scenarios at scale. With distributed testing, real-time reporting, cloud-based infrastructure, CI/CD integration, and global test locations, LoadForge makes it easier to validate your LLM inference endpoints under meaningful load.
If you are preparing an AI feature for production, now is the time to test it. Try LoadForge and start benchmarking your LLM inference endpoints with realistic, scalable load tests.
LoadForge Team
LoadForge is a load and performance testing platform built on Locust. Our team has been shipping load tests against production systems since 2018, and we write these guides from real customer engagements.