
Introduction
Load testing vLLM inference servers is essential if you want reliable, cost-efficient, and scalable AI application performance in production. vLLM has become a popular choice for serving large language models because of its high-throughput architecture, efficient KV cache management, and continuous batching capabilities. But even with those advantages, real-world performance depends heavily on request patterns, prompt sizes, output token lengths, concurrency levels, GPU memory limits, and sampling parameters.
If you’re building AI products on top of vLLM, you need more than a simple benchmark. You need realistic load testing and performance testing that reflects how users actually interact with your inference APIs. That means testing chat completions, text completions, embeddings, authenticated requests, streaming-like workloads, and mixed prompt sizes under concurrent traffic.
In this guide, you’ll learn how to load test vLLM inference servers using LoadForge and Locust. We’ll cover how vLLM behaves under load, how to write practical Locust scripts against real vLLM-compatible endpoints, and how to analyze throughput, latency, and failure patterns. We’ll also look at common bottlenecks that affect batching efficiency and GPU utilization so you can optimize your inference stack before it impacts users.
LoadForge makes this process easier with cloud-based infrastructure, distributed testing, real-time reporting, CI/CD integration, and global test locations, which is especially helpful when validating AI workloads from multiple regions.
Prerequisites
Before you start load testing vLLM inference servers, make sure you have the following:
- A running vLLM server
- The base URL for the server, such as:
  - `http://localhost:8000`
  - `https://llm-api.example.com`
- A deployed model, such as:
  - `meta-llama/Llama-3-8B-Instruct`
  - `mistralai/Mistral-7B-Instruct-v0.2`
- Knowledge of which API surface you are exposing:
  - OpenAI-compatible endpoints like `/v1/chat/completions`, `/v1/completions`, and `/v1/embeddings`
  - Optional health or model endpoints like `/health` and `/v1/models`
- Authentication details if your gateway or proxy requires them:
- Bearer token
- API key in a custom header
- LoadForge account if you want to run the tests in distributed cloud environments
- Basic familiarity with:
- Python
- Locust
- LLM inference concepts like prompt tokens, output tokens, temperature, and max tokens
A typical vLLM server might be started with a command like this:
```bash
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3-8B-Instruct \
  --host 0.0.0.0 \
  --port 8000 \
  --tensor-parallel-size 1 \
  --max-model-len 8192
```

Before load testing, verify that the service is reachable:

```bash
curl http://localhost:8000/v1/models
```

If you’re fronting vLLM with NGINX, an API gateway, or a service mesh, test the exact production-like endpoint path. For accurate performance testing, you should measure the full request path your application actually uses.
Understanding vLLM Under Load
vLLM is optimized for high-throughput inference, but its performance characteristics differ from traditional REST APIs. When you load test vLLM, you’re not just testing HTTP responsiveness. You’re testing the interaction between incoming requests, batching behavior, token generation speed, and GPU resource constraints.
Continuous batching changes concurrency behavior
Unlike simple request-per-thread architectures, vLLM can combine multiple requests into batches dynamically. This improves throughput, but it also means latency is influenced by:
- Number of concurrent users
- Prompt length distribution
- Requested output length
- Sampling parameters
- GPU memory pressure
- Model size and quantization strategy
A server may perform extremely well for many short prompts, then degrade sharply when a smaller number of users send long-context prompts with large max_tokens.
Common bottlenecks in vLLM inference servers
When load testing vLLM, you’ll often encounter bottlenecks such as:
- GPU saturation: Compute becomes the limiting factor during token generation
- KV cache memory pressure: Long contexts and many active sessions consume memory quickly
- Queueing delays: Requests wait before being batched and executed
- Gateway overhead: Authentication, rate limiting, and logging layers add latency
- Uneven prompt distribution: A few very large prompts can impact tail latency for everyone
- Streaming overhead: If your architecture proxies streaming responses, upstream and downstream buffering can affect performance
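To see why long contexts create KV cache pressure so quickly, it helps to estimate the per-token footprint. This back-of-the-envelope sketch uses the published Llama-3-8B dimensions (32 layers, 8 KV heads with grouped-query attention, head dimension 128, fp16 weights); substitute your own model's values and dtype:

```python
def kv_cache_bytes_per_token(num_layers: int, num_kv_heads: int,
                             head_dim: int, dtype_bytes: int = 2) -> int:
    """KV cache footprint per token: keys plus values, across all layers."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


# Llama-3-8B: 32 layers, 8 KV heads (GQA), head_dim 128, fp16 (2 bytes)
per_token = kv_cache_bytes_per_token(32, 8, 128)
print(per_token)                 # 131072 bytes, i.e. 128 KiB per token
print(per_token * 8192 / 2**30)  # 1.0 GiB for a single full 8192-token context
```

At roughly 1 GiB per full-context sequence, a few dozen long-context sessions can consume most of a GPU's free memory, which is exactly when queueing delays and tail latency spikes appear.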
Key metrics to watch
For vLLM performance testing, focus on:
- Requests per second
- P50, P95, and P99 latency
- Error rate
- Time to first token, if measured separately by your stack
- Tokens generated per second
- Throughput per GPU
- Queue wait time
- GPU utilization
- GPU memory utilization
A good load testing plan should measure both user-facing latency and backend efficiency. High throughput is meaningless if tail latency becomes unacceptable.
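If your responses include an OpenAI-style `usage` block (vLLM's OpenAI-compatible server returns one for non-streaming requests), tokens per second can be derived per request. A minimal sketch, assuming you record wall-clock duration alongside each response; the helper name is illustrative:

```python
def generation_tps(response_json: dict, duration_s: float) -> float:
    """Tokens generated per second for a single non-streaming request."""
    completion_tokens = response_json.get("usage", {}).get("completion_tokens", 0)
    return completion_tokens / duration_s if duration_s > 0 else 0.0


# Hypothetical response fragment, measured at 2.5 s wall-clock
resp = {"usage": {"prompt_tokens": 54, "completion_tokens": 120, "total_tokens": 174}}
print(generation_tps(resp, 2.5))  # 48.0
```

Aggregating this across all requests gives you tokens generated per second for the whole server, which is the backend-efficiency counterpart to user-facing latency.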
Writing Your First Load Test
Let’s start with a basic load test against the OpenAI-compatible chat completions endpoint exposed by vLLM. This script simulates users sending short instruction prompts to /v1/chat/completions.
Basic chat completions load test
```python
from locust import HttpUser, task, between
import os
import random

MODEL_NAME = os.getenv("VLLM_MODEL", "meta-llama/Llama-3-8B-Instruct")
API_KEY = os.getenv("VLLM_API_KEY", "test-token")

PROMPTS = [
    "Summarize the benefits of using Redis for caching in web applications.",
    "Write a short product description for a wireless mechanical keyboard.",
    "Explain the difference between horizontal scaling and vertical scaling.",
    "Generate three bullet points about the advantages of Kubernetes.",
    "Draft a polite customer support response for a delayed shipment."
]


class VLLMChatUser(HttpUser):
    wait_time = between(1, 3)

    def on_start(self):
        self.headers = {
            "Authorization": f"Bearer {API_KEY}",
            "Content-Type": "application/json"
        }

    @task
    def chat_completion(self):
        prompt = random.choice(PROMPTS)
        payload = {
            "model": MODEL_NAME,
            "messages": [
                {"role": "system", "content": "You are a concise assistant for enterprise software users."},
                {"role": "user", "content": prompt}
            ],
            "temperature": 0.7,
            "max_tokens": 120
        }
        with self.client.post(
            "/v1/chat/completions",
            json=payload,
            headers=self.headers,
            name="/v1/chat/completions",
            catch_response=True
        ) as response:
            if response.status_code != 200:
                response.failure(f"Unexpected status {response.status_code}: {response.text}")
                return
            try:
                data = response.json()
                content = data["choices"][0]["message"]["content"]
                if not content.strip():
                    response.failure("Empty completion returned")
            except Exception as e:
                response.failure(f"Invalid JSON response: {e}")
```

What this test does
This basic script is useful for initial benchmarking because it:
- Hits the standard vLLM OpenAI-compatible endpoint
- Uses realistic instruction-style prompts
- Includes bearer token authentication
- Validates the response body rather than only checking status code
- Simulates think time with `between(1, 3)`
This is a good starting point for measuring baseline latency and throughput. In LoadForge, you can scale this to hundreds or thousands of concurrent users across distributed generators to understand how your inference server performs under realistic traffic.
Running locally with Locust
```bash
locust -f locustfile.py --host=http://localhost:8000
```

If you’re using LoadForge, upload the script, set environment variables like `VLLM_MODEL` and `VLLM_API_KEY`, and launch the test from your preferred cloud regions.
Advanced Load Testing Scenarios
Basic chat completion tests are useful, but production AI workloads are usually more complex. Below are several advanced vLLM load testing scenarios that better reflect real-world usage.
Scenario 1: Mixed prompt sizes to evaluate batching efficiency
One of the most important things to test with vLLM is how it handles a mix of short and long prompts. Continuous batching can improve throughput, but mixed context lengths often reveal queueing and tail latency issues.
```python
from locust import HttpUser, task, between
import os
import random

MODEL_NAME = os.getenv("VLLM_MODEL", "meta-llama/Llama-3-8B-Instruct")
API_KEY = os.getenv("VLLM_API_KEY", "test-token")

SHORT_PROMPTS = [
    "What is autoscaling?",
    "List three Python web frameworks.",
    "Define observability in one paragraph."
]

LONG_PROMPTS = [
    """You are reviewing an architecture proposal for a SaaS analytics platform.
    The system includes API gateways, Kafka ingestion, Spark batch jobs, Redis caching,
    PostgreSQL for transactional workloads, and ClickHouse for analytics.
    Provide a detailed review of scalability risks, likely bottlenecks, and recommendations
    for improving resilience, deployment automation, and cost efficiency.""",
    """Analyze the following product requirements for a customer support chatbot:
    multilingual support, CRM integration, ticket summarization, agent handoff,
    audit logging, and role-based access controls. Explain implementation tradeoffs,
    security concerns, and a phased delivery plan."""
]


class MixedPromptVLLMUser(HttpUser):
    wait_time = between(0.5, 2.0)

    def on_start(self):
        self.headers = {
            "Authorization": f"Bearer {API_KEY}",
            "Content-Type": "application/json"
        }

    @task(3)
    def short_prompt_chat(self):
        payload = {
            "model": MODEL_NAME,
            "messages": [
                {"role": "system", "content": "You are a helpful infrastructure assistant."},
                {"role": "user", "content": random.choice(SHORT_PROMPTS)}
            ],
            "temperature": 0.3,
            "max_tokens": 80
        }
        self.client.post(
            "/v1/chat/completions",
            json=payload,
            headers=self.headers,
            name="chat_short"
        )

    @task(1)
    def long_prompt_chat(self):
        payload = {
            "model": MODEL_NAME,
            "messages": [
                {"role": "system", "content": "You are a senior solutions architect."},
                {"role": "user", "content": random.choice(LONG_PROMPTS)}
            ],
            "temperature": 0.2,
            "max_tokens": 300
        }
        self.client.post(
            "/v1/chat/completions",
            json=payload,
            headers=self.headers,
            name="chat_long"
        )
```

Why this matters
This test separates short and long prompts into named request groups, which helps you compare:
- Latency differences by prompt type
- How long requests affect short-request performance
- Whether batching remains efficient under mixed workloads
- Whether GPU memory pressure causes spikes in failures or latency
In LoadForge’s real-time reporting, this segmentation makes it easier to identify whether your server is optimized for the traffic profile you actually expect.
Scenario 2: Authenticated multi-endpoint test for model discovery and text generation
Many production deployments put vLLM behind an API gateway. Users may first call /v1/models, then submit either chat completions or classic completions. This scenario simulates a more realistic application workflow.
```python
from locust import HttpUser, task, between
import os
import random

API_KEY = os.getenv("VLLM_API_KEY", "test-token")
DEFAULT_MODEL = os.getenv("VLLM_MODEL", "mistralai/Mistral-7B-Instruct-v0.2")

COMPLETION_PROMPTS = [
    "Write a release note for a new feature that adds single sign-on support.",
    "Generate a SQL query to find the top 10 customers by revenue in the last 30 days.",
    "Complete this sentence: Effective incident response requires"
]


class VLLMWorkflowUser(HttpUser):
    wait_time = between(1, 2)

    def on_start(self):
        self.headers = {
            "Authorization": f"Bearer {API_KEY}",
            "Content-Type": "application/json",
            "X-Client-Name": "loadforge-locust"
        }
        self.model_name = DEFAULT_MODEL

    @task(1)
    def list_models(self):
        with self.client.get(
            "/v1/models",
            headers=self.headers,
            name="/v1/models",
            catch_response=True
        ) as response:
            if response.status_code != 200:
                response.failure(f"Model listing failed: {response.status_code}")
                return
            try:
                data = response.json()
                models = data.get("data", [])
                if models:
                    self.model_name = models[0]["id"]
            except Exception as e:
                response.failure(f"Invalid model list response: {e}")

    @task(4)
    def text_completion(self):
        payload = {
            "model": self.model_name,
            "prompt": random.choice(COMPLETION_PROMPTS),
            "temperature": 0.5,
            "max_tokens": 100,
            "top_p": 0.95
        }
        with self.client.post(
            "/v1/completions",
            json=payload,
            headers=self.headers,
            name="/v1/completions",
            catch_response=True
        ) as response:
            if response.status_code != 200:
                response.failure(f"Completion failed: {response.status_code} {response.text}")
                return
            try:
                data = response.json()
                text = data["choices"][0]["text"]
                if len(text.strip()) < 10:
                    response.failure("Completion too short")
            except Exception as e:
                response.failure(f"Invalid completion response: {e}")
```

What this reveals
This scenario helps you test:
- Authentication overhead
- Gateway behavior under concurrent API traffic
- Metadata endpoint performance
- Differences between `/v1/completions` and `/v1/chat/completions`
- End-to-end application workflow realism
This is especially useful if your frontend or SDK first discovers available models before invoking inference.
Scenario 3: Embeddings and chat mix for RAG-style workloads
Many AI systems use vLLM in retrieval-augmented generation pipelines. A common pattern is generating embeddings for search or reranking, then sending a prompt to a chat completion endpoint. Even if embeddings are served by a different backend in some architectures, many teams still want to test the combined API layer behavior.
```python
from locust import SequentialTaskSet, HttpUser, task, between
import os
import random

API_KEY = os.getenv("VLLM_API_KEY", "test-token")
CHAT_MODEL = os.getenv("VLLM_CHAT_MODEL", "meta-llama/Llama-3-8B-Instruct")
EMBED_MODEL = os.getenv("VLLM_EMBED_MODEL", "BAAI/bge-small-en-v1.5")

DOCUMENT_QUERIES = [
    "How do I rotate database credentials in production?",
    "What are the SLO targets for the payments API?",
    "Explain the rollback process for Kubernetes deployments.",
    "How is customer data encrypted at rest?"
]


class RAGWorkflow(SequentialTaskSet):
    def on_start(self):
        self.headers = {
            "Authorization": f"Bearer {API_KEY}",
            "Content-Type": "application/json"
        }
        self.query = random.choice(DOCUMENT_QUERIES)

    @task
    def generate_embedding(self):
        payload = {
            "model": EMBED_MODEL,
            "input": self.query
        }
        with self.client.post(
            "/v1/embeddings",
            json=payload,
            headers=self.headers,
            name="/v1/embeddings",
            catch_response=True
        ) as response:
            if response.status_code != 200:
                response.failure(f"Embedding request failed: {response.status_code}")
                return
            try:
                data = response.json()
                embedding = data["data"][0]["embedding"]
                if not embedding or len(embedding) < 10:
                    response.failure("Invalid embedding vector")
            except Exception as e:
                response.failure(f"Invalid embeddings response: {e}")

    @task
    def answer_question(self):
        context = """
        Runbook excerpt:
        Database credentials are rotated using the secrets management pipeline every 30 days.
        Applications retrieve credentials through sidecar injection and must support hot reload.
        Emergency rotation can be triggered by the platform team during incidents.
        """
        payload = {
            "model": CHAT_MODEL,
            "messages": [
                {"role": "system", "content": "Answer using the provided context only."},
                {
                    "role": "user",
                    "content": f"Context:\n{context}\n\nQuestion: {self.query}"
                }
            ],
            "temperature": 0.1,
            "max_tokens": 150
        }
        with self.client.post(
            "/v1/chat/completions",
            json=payload,
            headers=self.headers,
            name="rag_chat_completion",
            catch_response=True
        ) as response:
            if response.status_code != 200:
                response.failure(f"RAG answer failed: {response.status_code}")
                return
            try:
                data = response.json()
                content = data["choices"][0]["message"]["content"]
                if not content.strip():
                    response.failure("Empty RAG answer returned")
                elif "I don't know" in content:
                    # A grounded "I don't know" is acceptable when answering from context only
                    response.success()
            except Exception as e:
                response.failure(f"Invalid chat response: {e}")


class VLLMRAGUser(HttpUser):
    wait_time = between(1, 3)
    tasks = [RAGWorkflow]
```

Why this is useful
This is a strong performance testing scenario for AI applications because it simulates a realistic RAG workflow with:
- Embeddings generation
- Context-enriched chat completion
- Sequential dependency between steps
- Distinct endpoint names for easier analysis
If you run this in LoadForge with distributed testing, you can observe how well your AI stack handles geographically diverse traffic and mixed endpoint pressure.
Analyzing Your Results
Once your vLLM load testing run finishes, the next step is understanding what the numbers mean.
Look beyond average latency
Average response time can be misleading for inference servers. Instead, focus on:
- P50 latency: Typical user experience
- P95 latency: Performance for most users under load
- P99 latency: Tail latency, often where batching and queueing problems show up
- Error rate: HTTP 429, 500, 502, 503, timeouts, and malformed responses
For vLLM, P95 and P99 are especially important because long prompts and high output token counts can create queueing effects that don’t show up in averages.
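As a quick sanity check outside your load testing tool, these percentiles can be computed directly from raw latency samples with the standard library. A sketch using `statistics.quantiles` (the helper name and the synthetic samples are illustrative):

```python
import statistics


def latency_percentiles(samples_ms: list[float]) -> dict:
    """P50, P95, and P99 from raw latency samples, using 100 quantile cut points."""
    qs = statistics.quantiles(samples_ms, n=100)  # returns 99 cut points
    return {"p50": qs[49], "p95": qs[94], "p99": qs[98]}


# Synthetic latencies: 100..299 ms, just to exercise the function
samples = [100.0 + i for i in range(200)]
print(latency_percentiles(samples))
```

Running the same computation over per-endpoint sample sets makes it easy to confirm whether the tail belongs to one request group or to the whole server.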
Compare endpoint groups
If you named requests clearly in Locust, compare:
chat_shortvschat_long/v1/completionsvs/v1/chat/completions/v1/embeddingsvsrag_chat_completion
This helps you identify whether performance issues are tied to:
- Prompt size
- Endpoint type
- Authentication layers
- Mixed workload contention
Correlate with GPU metrics
Your HTTP load testing results should be correlated with infrastructure telemetry such as:
- GPU utilization
- GPU memory usage
- CPU usage
- Container memory
- Queue depth
- Token throughput
If latency rises while GPU utilization remains low, the bottleneck might be outside the model runtime, such as the gateway, request validation, or network overhead. If GPU utilization is near maximum and latency climbs steadily, you may be compute-bound.
Watch for throughput collapse
A common pattern in stress testing vLLM is that throughput increases predictably up to a point, then degrades sharply. This usually indicates:
- GPU memory exhaustion
- Excessive context lengths
- Too many high-`max_tokens` requests
- Inefficient batching under mixed workloads
LoadForge’s real-time reporting is useful here because you can see when response times start to diverge during ramp-up, rather than only after the test ends.
Performance Optimization Tips
After load testing vLLM inference servers, you’ll usually find opportunities to improve both latency and throughput.
Keep prompt sizes under control
Long prompts consume KV cache memory and reduce batching efficiency. If possible:
- Trim unnecessary system prompt text
- Summarize conversation history
- Limit retrieved context in RAG pipelines
Set realistic max_tokens
Overly generous max_tokens values can hurt throughput significantly. Many clients request more output than they actually need. Lowering this value often improves both latency and GPU efficiency.
Separate workload classes
If you have very different traffic types, consider routing them separately:
- Short interactive chat requests
- Long-form generation
- Embeddings
- Batch offline inference
This can reduce interference between workloads and improve predictability.
Tune concurrency carefully
More concurrency does not always mean better performance. With vLLM, there is usually an optimal range where batching is efficient without causing excessive queueing.
Test production-like authentication and gateways
Don’t benchmark only the raw vLLM container if production traffic goes through:
- API gateways
- Ingress controllers
- WAFs
- Service meshes
- Observability middleware
Your true bottleneck may be outside the model server.
Use distributed load generation
AI applications often serve global users. LoadForge’s cloud-based infrastructure and global test locations let you simulate traffic from multiple regions, which is useful for validating CDN, gateway, and edge routing behavior in front of vLLM.
Common Pitfalls to Avoid
Load testing vLLM inference servers is different from testing conventional APIs. Here are some of the most common mistakes.
Using unrealistically small prompts
If your production users send large contexts, testing only tiny prompts will overestimate throughput and underestimate latency.
Ignoring output token length
Two requests with the same prompt can have very different performance profiles if one generates 50 tokens and the other generates 1000. Always test realistic max_tokens values.
Not validating responses
A 200 response does not always mean success. Validate that:
- JSON is well-formed
- `choices` exists
- Generated text is non-empty
- Embedding vectors are present
Overlooking warm-up behavior
Model servers may behave differently during startup, cache warm-up, or the first burst of traffic. Include a warm-up phase before drawing conclusions.
Failing to segment request types
If all requests are grouped under one generic name, it becomes difficult to identify what caused latency spikes. Use clear names in Locust for each endpoint and workload type.
Stress testing without infrastructure telemetry
HTTP metrics alone won’t tell you whether the issue is batching, GPU saturation, memory pressure, or gateway overhead. Always pair load testing with system monitoring.
Testing only a single region
If your users are global, a single-region benchmark may hide network and edge-layer issues. Distributed testing is a better representation of real-world AI traffic.
Conclusion
Load testing vLLM inference servers is one of the most effective ways to improve AI application reliability, control infrastructure costs, and deliver better user experience. Because vLLM performance depends on batching behavior, prompt sizes, output lengths, and GPU constraints, realistic load testing and stress testing are critical before you go to production.
With Locust-based scripts and LoadForge, you can benchmark chat completions, text completions, embeddings, and RAG-style workflows using realistic payloads and authentication patterns. You can also scale tests with distributed cloud load generators, monitor results in real time, and integrate performance testing into your CI/CD pipeline.
If you’re ready to optimize throughput, latency, batching efficiency, and GPU utilization for your vLLM deployment, try LoadForge and start building production-grade inference benchmarks today.
LoadForge Team
LoadForge is a load and performance testing platform built on Locust. Our team has been shipping load tests against production systems since 2018, and we write these guides from real customer engagements.