
Introduction
Load testing token throughput for LLM applications is one of the most practical ways to understand how your AI system behaves in production. Unlike traditional web applications, large language model workloads are shaped not just by request volume, but by prompt size, output length, streaming behavior, model latency, rate limits, and token generation speed. If you only measure requests per second, you can miss the real bottleneck: how many input and output tokens your stack can process reliably under concurrent load.
For teams building AI chatbots, retrieval-augmented generation (RAG) systems, copilots, summarization services, and internal LLM APIs, token throughput directly impacts user experience and infrastructure cost. Slow token generation can make a chat interface feel broken. Poor concurrency planning can trigger provider throttling. Oversized prompts can explode costs and reduce throughput. And if you’re proxying requests through your own API gateway, vector store, and orchestration layer, bottlenecks can appear far from the model itself.
This guide shows how to load test token throughput for LLM applications using LoadForge and Locust. You’ll learn how to simulate realistic AI traffic, measure prompt and completion token behavior, test streaming and non-streaming endpoints, and identify performance issues before they affect production. Because LoadForge is cloud-based and built on Locust, you can scale these tests across distributed infrastructure, use global test locations, view real-time reporting, and integrate performance testing into CI/CD workflows.
Prerequisites
Before you begin load testing your LLM application, make sure you have the following:
- A LoadForge account
- A deployed LLM application or API endpoint to test
- API credentials such as bearer tokens, service keys, or tenant-specific auth headers
- Knowledge of your application’s request paths and expected payloads
- A list of realistic prompts, conversation sizes, and output expectations
- An understanding of any upstream provider limits, such as requests per minute or tokens per minute
You should also identify what layer you are testing:
- Your own AI gateway or backend API
- A chat completion endpoint exposed to clients
- A RAG service that includes retrieval and generation
- A streaming inference endpoint
- A batch summarization or document-processing API
For meaningful performance testing, define clear goals before starting. Common goals include:
- Maximum concurrent users before latency spikes
- Sustainable input/output token throughput
- Time to first token for streaming responses
- P95 or P99 response time under realistic prompt sizes
- Error rate during stress testing
- Cost efficiency at different concurrency levels
Understanding AI & LLM Workloads Under Load
LLM systems behave differently from conventional REST APIs. A simple CRUD endpoint usually has predictable request cost. An LLM endpoint does not. Two requests to the same path can vary dramatically in CPU usage, provider latency, memory footprint, and token generation time.
What token throughput actually means
For LLM applications, throughput is often best measured as:
- Input tokens processed per second
- Output tokens generated per second
- Total tokens handled per second across all concurrent users
This is more useful than raw request counts because one request may contain 100 tokens while another contains 20,000.
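The arithmetic is simple but worth making concrete. A minimal sketch of the aggregation, using illustrative token counts rather than measured values:

```python
# Hedged sketch: turning per-request token usage into tokens-per-second
# figures. The sample numbers below are illustrative, not measured values.

def token_throughput(samples, duration_seconds):
    """Aggregate per-request usage dicts into tokens-per-second figures."""
    input_tokens = sum(s["prompt_tokens"] for s in samples)
    output_tokens = sum(s["completion_tokens"] for s in samples)
    return {
        "input_tps": input_tokens / duration_seconds,
        "output_tps": output_tokens / duration_seconds,
        "total_tps": (input_tokens + output_tokens) / duration_seconds,
    }

# Two requests count the same in RPS terms, but their token cost
# differs by two orders of magnitude:
samples = [
    {"prompt_tokens": 100, "completion_tokens": 150},
    {"prompt_tokens": 20000, "completion_tokens": 400},
]
print(token_throughput(samples, duration_seconds=60))
```

Both requests contribute one unit to requests-per-second, yet the second consumes roughly a hundred times more capacity, which is exactly why request counts alone mislead.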
Common bottlenecks in LLM applications
When load testing AI & LLM systems, you’ll often encounter these bottlenecks:
Model provider limits
If you use OpenAI-compatible APIs, Anthropic, Azure OpenAI, or a self-hosted inference service, you may hit:
- Requests per minute limits
- Tokens per minute limits
- Concurrent request limits
- Model-specific queueing delays
Application-layer orchestration
Your app may do more than forward a prompt. It might:
- Validate sessions
- Load conversation history
- Retrieve documents from a vector database
- Re-rank search results
- Build prompts dynamically
- Post-process model output
- Store completions and analytics
Each of these steps can add latency under load.
Streaming overhead
Streaming improves perceived responsiveness, but it introduces different performance concerns:
- Time to first token
- Chunk delivery consistency
- Long-lived HTTP connections
- Reverse proxy buffering issues
- Client disconnect handling
Prompt growth and context window pressure
As chat sessions grow, token counts increase. This can reduce throughput dramatically and increase costs. A conversation that performs well with 1,000 total tokens may degrade badly at 16,000 tokens.
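To see why the degradation is so sharp, consider a naive implementation that replays the full history on every turn. The per-message token counts below are assumptions chosen for illustration:

```python
# Hedged sketch: if the full history is replayed each turn, cumulative
# prompt tokens grow roughly quadratically over the session. The
# per-message token counts are illustrative assumptions.

def cumulative_prompt_tokens(turns, tokens_per_user_msg=80, tokens_per_reply=200):
    """Total prompt tokens sent across a session that replays full history."""
    total = 0
    history = 0
    for _ in range(turns):
        history += tokens_per_user_msg  # new user message joins the history
        total += history                # the full history is sent as the prompt
        history += tokens_per_reply     # assistant reply joins the history
    return total

# 5 turns vs 50 turns: cost grows far faster than linearly.
print(cumulative_prompt_tokens(5), cumulative_prompt_tokens(50))
```

With these assumed sizes, ten times as many turns costs over a hundred times as many prompt tokens, which is why unbounded history replay is usually the first thing to fix.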
What to measure during load testing
For realistic LLM performance testing, track:
- Average and P95 response time
- Time to first token for streaming
- Requests per second
- Input tokens per second
- Output tokens per second
- Total token usage per user journey
- Error rate and timeout rate
- HTTP 429 and 5xx responses
- Cost per scenario, if token pricing is known
LoadForge’s real-time reporting is especially useful here because you can observe how response times and failures evolve as concurrency increases across distributed test workers.
Writing Your First Load Test
Let’s start with a basic non-streaming chat completion test. This example assumes your application exposes an OpenAI-style endpoint behind your own API:
POST /v1/chat/completions
The script sends realistic prompts, includes bearer authentication, and captures token usage from the response.
```python
from locust import HttpUser, task, between
import random
import json

PROMPTS = [
    "Summarize the following customer support ticket in 3 bullet points: The user reports intermittent login failures after resetting their password.",
    "Write a concise release note for a new feature that allows exporting dashboard reports to CSV.",
    "Explain the difference between horizontal scaling and vertical scaling for a SaaS engineering team.",
    "Draft a polite email response to a customer asking for an update on a delayed shipment."
]


class LLMChatUser(HttpUser):
    wait_time = between(1, 3)
    host = "https://api.example-ai-app.com"

    def on_start(self):
        self.headers = {
            "Authorization": "Bearer YOUR_API_TOKEN",
            "Content-Type": "application/json",
            "X-Tenant-ID": "acme-prod"
        }

    @task
    def chat_completion(self):
        prompt = random.choice(PROMPTS)
        payload = {
            "model": "gpt-4o-mini",
            "messages": [
                {"role": "system", "content": "You are a concise enterprise AI assistant."},
                {"role": "user", "content": prompt}
            ],
            "temperature": 0.3,
            "max_tokens": 180,
            "stream": False
        }

        with self.client.post(
            "/v1/chat/completions",
            headers=self.headers,
            json=payload,
            catch_response=True,
            name="/v1/chat/completions [basic]"
        ) as response:
            if response.status_code != 200:
                response.failure(f"Unexpected status code: {response.status_code}")
                return

            try:
                data = response.json()
                usage = data.get("usage", {})
                prompt_tokens = usage.get("prompt_tokens", 0)
                completion_tokens = usage.get("completion_tokens", 0)

                if "choices" not in data or not data["choices"]:
                    response.failure("Missing choices in response")
                    return

                generated_text = data["choices"][0]["message"]["content"]
                if not generated_text.strip():
                    response.failure("Empty completion returned")
                    return

                response.success()
                print(
                    f"prompt_tokens={prompt_tokens}, completion_tokens={completion_tokens}"
                )
            except json.JSONDecodeError:
                response.failure("Response was not valid JSON")
```

What this test does
This first script is useful for baseline load testing because it simulates a standard chat completion flow with realistic prompts and output limits. It helps you answer:
- How quickly does your API respond under moderate concurrency?
- How many tokens are consumed per request?
- Does latency increase when prompt size grows slightly?
- Are there any immediate auth, routing, or provider stability issues?
Why this matters for token throughput
Even this simple test can reveal whether your service is constrained by:
- Slow model generation
- API gateway overhead
- Serialization or response formatting delays
- Upstream rate limiting
Once this baseline works, you can move to more realistic user journeys.
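Printing per-request usage is fine for a first run, but under real concurrency you will want an aggregate view. A minimal sketch of a thread-safe counter you could update from each task after parsing the `usage` block, then read out at test stop to report tokens per second alongside LoadForge's request metrics (wiring it into Locust's event hooks is assumed, not shown):

```python
# Hedged sketch: a tiny thread-safe aggregator for token usage. Tasks call
# counter.record(usage) after each response; a test-stop hook (not shown)
# would call counter.tokens_per_second(elapsed) to report throughput.

import threading

class TokenCounter:
    def __init__(self):
        self._lock = threading.Lock()
        self.prompt_tokens = 0
        self.completion_tokens = 0

    def record(self, usage):
        """Accumulate one response's usage dict, e.g. {"prompt_tokens": 120, ...}."""
        with self._lock:
            self.prompt_tokens += usage.get("prompt_tokens", 0)
            self.completion_tokens += usage.get("completion_tokens", 0)

    def tokens_per_second(self, elapsed_seconds):
        with self._lock:
            total = self.prompt_tokens + self.completion_tokens
        return total / elapsed_seconds

counter = TokenCounter()
counter.record({"prompt_tokens": 120, "completion_tokens": 60})
counter.record({"prompt_tokens": 300, "completion_tokens": 90})
print(counter.tokens_per_second(10))  # 570 tokens over a 10s window -> 57.0
```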
Advanced Load Testing Scenarios
Basic completions are a good start, but most AI applications are more complex. Below are three advanced scenarios that better reflect real-world LLM workloads.
Scenario 1: Authenticated chat sessions with conversation history
Many LLM apps maintain session context. As the conversation grows, token usage increases and throughput can drop. This test simulates a user authenticating, creating a chat session, and sending multiple messages to the same thread.
```python
from locust import HttpUser, task, between
import random
import uuid

USER_MESSAGES = [
    "Can you summarize the Q4 sales performance by region?",
    "What are the top three churn risks from this customer health report?",
    "Rewrite this paragraph in a more professional tone.",
    "Give me a short action plan based on the following meeting notes."
]


class ChatSessionUser(HttpUser):
    wait_time = between(2, 5)
    host = "https://api.example-ai-app.com"

    def on_start(self):
        self.session_id = None
        login_payload = {
            "email": "loadtest.user@example.com",
            "password": "SuperSecurePassword123!"
        }

        with self.client.post(
            "/api/auth/login",
            json=login_payload,
            name="/api/auth/login",
            catch_response=True
        ) as response:
            if response.status_code != 200:
                response.failure("Login failed")
                return
            data = response.json()
            self.access_token = data["access_token"]

        self.headers = {
            "Authorization": f"Bearer {self.access_token}",
            "Content-Type": "application/json",
            "X-Request-ID": str(uuid.uuid4())
        }

        with self.client.post(
            "/api/chat/sessions",
            headers=self.headers,
            json={"title": "Load Test Conversation"},
            name="/api/chat/sessions",
            catch_response=True
        ) as response:
            if response.status_code != 201:
                response.failure("Failed to create chat session")
                return
            self.session_id = response.json()["session_id"]

    @task
    def send_message(self):
        if self.session_id is None:
            # Session setup failed in on_start; skip instead of raising.
            return

        message = random.choice(USER_MESSAGES)
        payload = {
            "model": "gpt-4o-mini",
            "message": message,
            "include_history": True,
            "temperature": 0.4,
            "max_tokens": 250,
            "metadata": {
                "workspace_id": "ws_enterprise_001",
                "feature": "assistant_chat"
            }
        }

        with self.client.post(
            f"/api/chat/sessions/{self.session_id}/messages",
            headers=self.headers,
            json=payload,
            name="/api/chat/sessions/:id/messages",
            catch_response=True
        ) as response:
            if response.status_code != 200:
                response.failure(f"Message send failed: {response.status_code}")
                return

            data = response.json()
            usage = data.get("usage", {})
            total_tokens = usage.get("total_tokens", 0)

            if total_tokens <= 0:
                response.failure("Token usage missing or invalid")
                return

            assistant_reply = data.get("reply", "")
            if len(assistant_reply.strip()) < 10:
                response.failure("Assistant reply too short")
                return

            response.success()
```

Why this scenario matters
This test helps expose problems caused by long-running conversations:
- Context windows growing too large
- Session storage overhead
- Database lookups for prior messages
- Token explosion from unbounded history replay
If your response times climb steadily over the life of a session, your conversation management strategy may need optimization.
Scenario 2: RAG endpoint with retrieval and generation
RAG systems often combine vector search, prompt assembly, and LLM generation. This makes them ideal candidates for performance testing because the bottleneck may be retrieval rather than generation.
Assume your application exposes:
POST /api/rag/query
```python
from locust import HttpUser, task, between
import random

RAG_QUERIES = [
    "What does our employee handbook say about parental leave?",
    "Summarize the SOC 2 access control policy.",
    "What are the refund terms in the enterprise customer agreement?",
    "Find the onboarding steps for new engineering hires."
]


class RAGUser(HttpUser):
    wait_time = between(1, 4)
    host = "https://api.example-ai-app.com"

    def on_start(self):
        self.headers = {
            "Authorization": "Bearer YOUR_API_TOKEN",
            "Content-Type": "application/json",
            "X-Org-ID": "org_42"
        }

    @task
    def rag_query(self):
        query = random.choice(RAG_QUERIES)
        payload = {
            "query": query,
            "model": "gpt-4o-mini",
            "top_k": 5,
            "temperature": 0.2,
            "max_tokens": 300,
            "filters": {
                "document_type": ["policy", "handbook", "contract"],
                "language": "en"
            },
            "return_citations": True
        }

        with self.client.post(
            "/api/rag/query",
            headers=self.headers,
            json=payload,
            name="/api/rag/query",
            catch_response=True,
            timeout=90
        ) as response:
            if response.status_code != 200:
                response.failure(f"RAG query failed: {response.status_code}")
                return

            data = response.json()
            citations = data.get("citations", [])
            answer = data.get("answer", "")
            usage = data.get("usage", {})

            if not citations:
                response.failure("No citations returned")
                return

            if len(answer.strip()) < 20:
                response.failure("Answer too short")
                return

            if usage.get("prompt_tokens", 0) == 0:
                response.failure("Missing token usage")
                return

            response.success()
```

What this reveals
A RAG load test can show:
- Vector database latency under concurrency
- Slow document filtering or metadata queries
- Prompt construction overhead
- Increased token usage due to retrieved context stuffing
If retrieval latency dominates, scaling the LLM provider alone won’t help. You may need to optimize embedding search, caching, or chunk selection.
Scenario 3: Streaming token generation and time to first token
Streaming is common in AI chat interfaces because users perceive it as faster. But streaming performance testing is different from testing a standard JSON response. You want to know not only total response time, but how quickly the first chunk arrives and whether streams remain stable under load.
Assume your endpoint is:
POST /v1/chat/completions with "stream": true
```python
from locust import HttpUser, task, between
import json
import time


class StreamingLLMUser(HttpUser):
    wait_time = between(1, 2)
    host = "https://api.example-ai-app.com"

    def on_start(self):
        self.headers = {
            "Authorization": "Bearer YOUR_API_TOKEN",
            "Content-Type": "application/json"
        }

    @task
    def streaming_chat(self):
        payload = {
            "model": "gpt-4o-mini",
            "messages": [
                {"role": "system", "content": "You are a helpful support AI assistant."},
                {"role": "user", "content": "Explain how SSO login works in our platform in simple terms."}
            ],
            "temperature": 0.3,
            "max_tokens": 220,
            "stream": True
        }

        start_time = time.time()
        first_token_time = None
        chunks_received = 0

        with self.client.post(
            "/v1/chat/completions",
            headers=self.headers,
            json=payload,
            name="/v1/chat/completions [stream]",
            catch_response=True,
            stream=True,
            timeout=120
        ) as response:
            if response.status_code != 200:
                response.failure(f"Streaming request failed: {response.status_code}")
                return

            try:
                for line in response.iter_lines():
                    if not line:
                        continue
                    decoded = line.decode("utf-8")
                    if decoded.startswith("data: "):
                        content = decoded[6:].strip()
                        if content == "[DONE]":
                            break
                        chunk = json.loads(content)
                        delta = chunk["choices"][0].get("delta", {})
                        if "content" in delta:
                            chunks_received += 1
                            if first_token_time is None:
                                first_token_time = time.time() - start_time

                if chunks_received == 0:
                    response.failure("No streaming content chunks received")
                    return

                if first_token_time is None or first_token_time > 5:
                    response.failure(f"Slow time to first token: {first_token_time}")
                    return

                response.success()
            except Exception as exc:
                response.failure(f"Streaming parse error: {exc}")
```

Why streaming tests are important
A non-streaming endpoint may appear acceptable while your actual chat UI feels slow. That’s because users experience time to first token, not just total completion time. Streaming load tests help identify:
- Delays before generation begins
- Proxy buffering problems
- Connection exhaustion
- Instability in long-lived responses
This is especially valuable when running distributed testing in LoadForge across multiple regions to see whether latency varies by geography.
Analyzing Your Results
Once your tests are running in LoadForge, focus on patterns rather than just averages.
Key metrics to review
Response time percentiles
Look at median, P95, and P99 response times. LLM systems often have long-tail latency, especially under stress testing. Averages can hide serious user-facing problems.
Failure rates
Pay close attention to:
- 429 Too Many Requests
- 500 or 502 upstream failures
- 504 timeouts
- Connection resets during streaming
These often indicate provider throttling, gateway saturation, or backend instability.
Throughput versus concurrency
As you increase user count, check whether token throughput scales linearly. If not, identify where degradation begins. This is often your practical concurrency ceiling.
Token usage trends
If your app returns usage metadata, correlate token counts with latency. You may find that:
- Prompt-heavy requests are much slower
- Output token generation is the main bottleneck
- Certain workflows are disproportionately expensive
Interpreting common result patterns
Latency rises sharply with stable request volume
This often suggests queueing at the model provider or overloaded orchestration services.
Error rate spikes before CPU or memory saturation
This usually points to rate limits, connection pool exhaustion, or misconfigured timeouts rather than raw infrastructure limits.
Streaming first token is fast, but total completion is slow
Your model starts responding quickly, but token generation rate may be too slow for long outputs.
RAG endpoints degrade more than chat endpoints
Your vector store, retrieval logic, or prompt construction pipeline may be the bottleneck.
LoadForge’s real-time reporting makes it easier to spot these inflection points as they happen rather than waiting for a post-test summary.
Performance Optimization Tips
After load testing your AI & LLM application, these optimizations often have the biggest impact:
Reduce prompt size
Shorter prompts usually improve throughput and reduce cost. Trim unnecessary system instructions, duplicate context, and excessive conversation history.
Cap output length
Set realistic max_tokens values. Overly large completion limits reduce concurrency and increase tail latency.
Use conversation summarization
Instead of replaying the full chat history every time, summarize older turns and keep only the most relevant recent context.
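One simple policy is a sliding window: keep the most recent turns verbatim and collapse everything older into a single summary message. A minimal sketch, where `summarize()` is a placeholder for whatever cheap summarization call you actually use:

```python
# Hedged sketch of a sliding-window history policy. summarize() is a
# stand-in for a real summarization step (a cheap model call, extractive
# summary, etc.); only the windowing logic is shown.

def summarize(messages):
    # Placeholder: a real implementation would produce an actual summary.
    return {"role": "system", "content": f"Summary of {len(messages)} earlier messages."}

def trim_history(messages, keep_recent=6):
    """Return history with all but the last `keep_recent` turns summarized."""
    if len(messages) <= keep_recent:
        return messages
    older, recent = messages[:-keep_recent], messages[-keep_recent:]
    return [summarize(older)] + recent

history = [{"role": "user", "content": f"message {i}"} for i in range(20)]
trimmed = trim_history(history)
print(len(trimmed))  # 7: one summary message plus the 6 most recent turns
```

This bounds prompt growth to the window size plus one summary, regardless of session length.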
Cache retrieval results
For RAG systems, cache common searches, embeddings, or document snippets to reduce repeated vector lookups.
Tune connection pools and timeouts
Streaming and long-running inference requests can exhaust connection pools quickly. Make sure your gateway, app server, and client stack are configured for sustained concurrency.
Separate hot paths
If your app handles both lightweight and heavyweight prompts, isolate them by queue, worker pool, or endpoint to prevent one class of request from starving the other.
Monitor provider limits
Track requests per minute and tokens per minute. If you’re consistently hitting limits, distribute traffic across models, tenants, or provisioned capacity where available.
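Tracking this in your client or gateway can be as simple as a sliding one-minute window of token spend checked before each dispatch. A sketch, where the 200,000 TPM figure is an assumed example quota rather than any real provider's limit:

```python
# Hedged sketch: spend tracking against a tokens-per-minute quota using a
# sliding one-minute window. The 200_000 TPM limit is an assumed example.

import collections
import time

class TpmBudget:
    def __init__(self, tokens_per_minute=200_000):
        self.limit = tokens_per_minute
        self.events = collections.deque()  # (timestamp, tokens) pairs

    def record(self, tokens, now=None):
        now = time.monotonic() if now is None else now
        self.events.append((now, tokens))

    def used_last_minute(self, now=None):
        now = time.monotonic() if now is None else now
        while self.events and now - self.events[0][0] > 60:
            self.events.popleft()  # drop samples older than 60 seconds
        return sum(tokens for _, tokens in self.events)

    def would_exceed(self, tokens, now=None):
        """Check before dispatch whether a request's estimated tokens would bust the quota."""
        now = time.monotonic() if now is None else now
        return self.used_last_minute(now) + tokens > self.limit

budget = TpmBudget()
budget.record(150_000, now=0.0)
print(budget.would_exceed(60_000, now=30.0))  # True: 210k would exceed 200k
print(budget.would_exceed(60_000, now=61.0))  # False: the old sample expired
```

Checking the budget before dispatch lets you queue or shed load yourself instead of discovering the limit as a burst of 429s mid-test.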
Test from multiple regions
If your users are global, use LoadForge’s global test locations to understand how network distance affects time to first token and total completion time.
Common Pitfalls to Avoid
Load testing LLM applications can go wrong if the scenarios are unrealistic.
Testing only tiny prompts
Short prompts may make your system look fast, but they rarely reflect production traffic. Use realistic prompt lengths and conversation histories.
Ignoring token-based limits
Many teams focus only on request rate and forget that token throughput is often the real limit. A small number of very large prompts can overwhelm the system.
Not validating response quality
A 200 response does not always mean success. Make sure your test checks for actual generated content, citations, or structured output fields.
Skipping streaming tests
If your UI uses streaming, non-streaming benchmarks are incomplete. Measure time to first token and stream stability.
Load testing the provider directly instead of your full stack
If your production architecture includes auth, retrieval, prompt assembly, observability, and storage, test the full application path whenever possible.
Forgetting warm-up behavior
LLM systems may behave differently during cold starts, model spin-up, or cache warm-up. Include ramp-up periods in your tests.
Running unrealistic concurrency immediately
A sudden spike can be useful for stress testing, but start with gradual ramps to find sustainable throughput first.
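In Locust, gradual ramps are usually expressed as a `LoadTestShape` subclass whose `tick()` returns `(user_count, spawn_rate)` for the current run time, or `None` to stop. The schedule below is written as a plain function so the ramp itself is easy to inspect; the step boundaries and user counts are example values:

```python
# Hedged sketch: a step-ramp schedule of the kind a Locust LoadTestShape
# subclass would return from tick(). The step boundaries and user counts
# are illustrative example values.

STEPS = [
    (60, 10),    # first minute: 10 users
    (120, 25),   # second minute: 25 users
    (180, 50),   # third minute: 50 users
    (240, 100),  # fourth minute: 100 users, then stop
]

def tick(run_time_seconds, spawn_rate=10):
    """Return (user_count, spawn_rate) for the current time, or None to stop."""
    for until, users in STEPS:
        if run_time_seconds < until:
            return (users, spawn_rate)
    return None  # in Locust, returning None from tick() ends the test

print(tick(30), tick(200), tick(300))
```

Ramping in steps like this lets you see which concurrency level first degrades latency or token throughput, which a single spike to peak load would hide.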
Not integrating tests into CI/CD
Performance regressions in prompt logic, retrieval pipelines, or response formatting can appear after normal code changes. LoadForge’s CI/CD integration helps catch these issues before release.
Conclusion
Load testing token throughput for LLM applications gives you a much clearer picture of real AI performance than request counts alone. By measuring how your system handles prompt size, output generation, conversation history, retrieval workloads, and streaming behavior under concurrency, you can make better decisions about scaling, latency targets, and model cost control.
With LoadForge, you can run realistic Locust-based load testing for AI & LLM workloads using cloud-based infrastructure, distributed testing, global test locations, and real-time reporting. Whether you’re validating a chat API, a RAG platform, or a streaming assistant, the right performance testing strategy will help you deliver faster and more reliable AI experiences at scale.
If you’re ready to see how your LLM application performs under real-world load, try LoadForge and start building token-aware performance tests today.
LoadForge Team
LoadForge is a load and performance testing platform built on Locust. Our team has been shipping load tests against production systems since 2018, and we write these guides from real customer engagements.