
Introduction
Retrieval-Augmented Generation (RAG) applications are quickly becoming a core part of modern AI products. Whether you are building an internal knowledge assistant, a customer support chatbot, a document Q&A platform, or an enterprise search experience, your users expect fast, accurate, and reliable responses. That makes load testing RAG applications essential.
Unlike a traditional web app, a RAG system usually includes multiple performance-sensitive stages in a single request path: query preprocessing, embeddings generation, vector database retrieval, reranking, prompt construction, and LLM response generation. Under load, bottlenecks can appear in any of these layers. A system that works fine with a few test users may degrade quickly when dozens or hundreds of concurrent users start asking long, retrieval-heavy questions.
In this guide, you will learn how to load test RAG applications end to end using LoadForge and Locust. We will cover realistic performance testing scenarios for AI & LLM systems, including authenticated chat requests, document ingestion pipelines, retrieval-only testing, and mixed user traffic patterns. You will also see how to evaluate latency, throughput, token-heavy responses, and failure modes specific to RAG architectures.
Because LoadForge is built on Locust, every example uses practical Python scripts you can run in a cloud-based, distributed load testing environment. That makes it easier to simulate real-world traffic from global test locations, monitor results in real time, and integrate performance testing into CI/CD pipelines.
Prerequisites
Before you start load testing your RAG application, make sure you have:
- A deployed RAG application with accessible HTTP endpoints
- API authentication credentials such as:
- Bearer token
- OAuth client credentials
- API key
- Knowledge of your application’s request flow, including:
- chat or ask endpoint
- document ingestion endpoint
- retrieval/search endpoint
- embeddings endpoint, if exposed
- A representative test dataset, such as:
- realistic user questions
- sample document IDs
- tenant IDs or workspace IDs
- uploaded files or source URLs
- Expected performance baselines, for example:
- p95 latency under 50 concurrent users
- max acceptable error rate below 1%
- target throughput for document ingestion jobs
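Baselines like these are easiest to enforce when they live in code rather than a wiki page. As an illustrative sketch (the field names and threshold values below are examples, not a LoadForge or Locust API), a small checker that a CI step runs against exported results might look like this:

```python
# Illustrative baseline check. Field names and thresholds are example
# values you would adapt to your own exported test results.
BASELINES = {
    "p95_latency_ms": 3000,   # p95 latency budget under 50 concurrent users
    "max_error_rate": 0.01,   # error rate must stay below 1%
    "min_ingest_rps": 2.0,    # target document ingestion throughput
}

def check_baselines(results: dict, baselines: dict = BASELINES) -> list:
    """Return a list of human-readable baseline violations (empty = pass)."""
    violations = []
    if results["p95_latency_ms"] > baselines["p95_latency_ms"]:
        violations.append(
            f"p95 latency {results['p95_latency_ms']}ms over budget"
        )
    if results["error_rate"] > baselines["max_error_rate"]:
        violations.append(
            f"error rate {results['error_rate']:.2%} over budget"
        )
    if results["ingest_rps"] < baselines["min_ingest_rps"]:
        violations.append(
            f"ingestion throughput {results['ingest_rps']} rps under target"
        )
    return violations
```

A CI job can fail the build whenever the returned list is non-empty, which turns your baselines into an enforced contract instead of a guideline.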
It also helps to understand whether your RAG stack uses components such as:
- OpenAI, Anthropic, Azure OpenAI, or self-hosted LLM inference
- Pinecone, Weaviate, Qdrant, Milvus, pgvector, or Elasticsearch for vector search
- Redis or another cache for retrieval or prompt caching
- Background workers for chunking, embeddings, and indexing
For the examples below, we will assume the application exposes these realistic endpoints:
- POST /api/v1/auth/token
- POST /api/v1/chat/completions
- POST /api/v1/retrieval/search
- POST /api/v1/documents/ingest
- GET /api/v1/documents/{job_id}/status
Understanding RAG Applications Under Load
RAG systems behave differently from standard CRUD applications during load testing and stress testing. A single user request can trigger several expensive operations:
- Query normalization or guardrail checks
- Embedding generation for the user query
- Vector similarity search in a vector database
- Metadata filtering and reranking
- Prompt assembly with retrieved context
- LLM inference or API call
- Streaming or full-response delivery
Each stage introduces its own latency profile. Under concurrent load, common bottlenecks include:
Embedding Throughput Limits
If your application generates embeddings on every query, your embedding service may become a constraint before the vector database or LLM does. This is especially common when using external API providers with rate limits.
Vector Database Saturation
Vector search performance can degrade under high concurrency, especially with:
- large indexes
- complex metadata filters
- hybrid search
- reranking on top of initial retrieval
LLM Token Generation Delays
The generation step often dominates end-to-end latency. Long prompts, large retrieved contexts, and high max_tokens settings can significantly increase response times.
Background Ingestion Contention
If document ingestion and query traffic share the same infrastructure, indexing jobs may compete for CPU, memory, I/O, or provider quotas.
Authentication and Multi-Tenant Overhead
Enterprise RAG systems often add:
- per-tenant retrieval filters
- RBAC checks
- audit logging
- conversation memory lookup
These can become hidden latency contributors during performance testing.
That is why effective load testing for RAG applications should not focus only on a single chat endpoint. You should test the full system behavior, including read-heavy query traffic, write-heavy ingestion traffic, and mixed workloads.
Writing Your First Load Test
Let’s start with a basic end-to-end RAG chat load test. This script authenticates a user, sends realistic questions to the chat endpoint, and validates that the response includes an answer and retrieved sources.
```python
from locust import HttpUser, task, between
import random

QUESTIONS = [
    "What is our refund policy for annual enterprise subscriptions?",
    "How do I configure SSO with Okta for the admin portal?",
    "Summarize the security controls for SOC 2 compliance.",
    "What are the API rate limits for the premium plan?",
    "How do I rotate service account credentials safely?"
]

class RAGChatUser(HttpUser):
    wait_time = between(1, 3)

    def on_start(self):
        response = self.client.post(
            "/api/v1/auth/token",
            json={
                "client_id": "loadtest-client",
                "client_secret": "super-secret-value",
                "audience": "rag-api",
                "grant_type": "client_credentials"
            },
            name="/api/v1/auth/token"
        )
        data = response.json()
        self.token = data["access_token"]
        self.headers = {
            "Authorization": f"Bearer {self.token}",
            "Content-Type": "application/json",
            "X-Tenant-Id": "acme-corp"
        }

    @task
    def ask_question(self):
        question = random.choice(QUESTIONS)
        with self.client.post(
            "/api/v1/chat/completions",
            headers=self.headers,
            json={
                "conversation_id": None,
                "messages": [
                    {"role": "user", "content": question}
                ],
                "retrieval": {
                    "top_k": 5,
                    "filters": {
                        "department": "support",
                        "visibility": "internal"
                    }
                },
                "generation": {
                    "model": "gpt-4o-mini",
                    "temperature": 0.2,
                    "max_tokens": 400
                },
                "include_sources": True,
                "stream": False
            },
            catch_response=True,
            name="/api/v1/chat/completions"
        ) as response:
            if response.status_code != 200:
                response.failure(f"Unexpected status: {response.status_code}")
                return
            body = response.json()
            answer = body.get("answer", "")
            sources = body.get("sources", [])
            if not answer:
                response.failure("Missing answer in response")
            elif len(sources) == 0:
                response.failure("No retrieval sources returned")
            else:
                response.success()
```

What this test covers
This first script is useful because it measures the full request path for a common RAG query:
- authentication
- query handling
- retrieval
- prompt building
- LLM generation
- response serialization
It also validates application correctness, not just status codes. In AI & LLM load testing, a 200 OK response is not enough if the answer is empty or retrieval failed silently.
Why this matters for RAG performance testing
A simple chat test helps you establish:
- average and p95 response times
- error rates under concurrent user traffic
- whether retrieval remains functional at scale
- whether LLM latency dominates total request time
In LoadForge, you can run this script with distributed testing workers to simulate traffic from multiple cloud regions and observe real-time reporting as concurrency increases.
Advanced Load Testing Scenarios
Basic chat traffic is only the beginning. Real RAG systems need more targeted load testing scenarios to uncover bottlenecks in retrieval, ingestion, and mixed workloads.
Scenario 1: Retrieval-Heavy Search Without Full Generation
Sometimes you want to isolate the retrieval layer from LLM generation. This is especially useful when diagnosing whether latency comes from the vector database or the model itself.
```python
from locust import HttpUser, task, between
import random

SEARCH_QUERIES = [
    "PCI DSS logging requirements for payment processing",
    "Incident response runbook for database failover",
    "Employee onboarding checklist for engineering",
    "Data retention policy for customer support transcripts",
    "VPN access troubleshooting steps for remote workers"
]

class RetrievalOnlyUser(HttpUser):
    wait_time = between(0.5, 2)

    def on_start(self):
        self.headers = {
            "Authorization": "Bearer YOUR_STATIC_LOAD_TEST_TOKEN",
            "Content-Type": "application/json",
            "X-Tenant-Id": "acme-corp"
        }

    @task
    def semantic_search(self):
        query = random.choice(SEARCH_QUERIES)
        with self.client.post(
            "/api/v1/retrieval/search",
            headers=self.headers,
            json={
                "query": query,
                "top_k": 8,
                "namespace": "knowledge-base-prod",
                "filters": {
                    "doc_type": ["policy", "runbook", "guide"],
                    "region": "us",
                    "published": True
                },
                "rerank": {
                    "enabled": True,
                    "model": "cross-encoder-v2",
                    "top_n": 5
                },
                "include_chunks": True
            },
            catch_response=True,
            name="/api/v1/retrieval/search"
        ) as response:
            if response.status_code != 200:
                response.failure(f"Search failed with {response.status_code}")
                return
            data = response.json()
            results = data.get("results", [])
            if len(results) < 3:
                response.failure("Too few retrieval results returned")
            else:
                response.success()
```

When to use this test
Use retrieval-only testing when you want to:
- benchmark vector database performance
- compare reranking enabled vs disabled
- test metadata filtering overhead
- isolate retrieval bottlenecks from LLM latency
This is one of the most effective forms of performance testing for RAG applications because it narrows the problem scope. If retrieval is fast but chat is slow, the issue likely sits in prompt assembly or generation.
Scenario 2: Authenticated Multi-Turn Chat Sessions
Real users do not always send one isolated question. They often continue a conversation, which means your system may store session state, retrieve prior messages, and expand prompts over time.
```python
from locust import HttpUser, task, between
import random
import uuid

CONVERSATION_STARTERS = [
    "Can you explain our disaster recovery plan?",
    "What is the process for requesting production access?",
    "Summarize the SLA for enterprise customers."
]

FOLLOW_UPS = [
    "Can you give me the key steps as a checklist?",
    "Which section mentions approval requirements?",
    "Summarize that in simpler language.",
    "What are the exceptions to this policy?"
]

class MultiTurnRAGUser(HttpUser):
    wait_time = between(2, 5)

    def on_start(self):
        self.conversation_id = str(uuid.uuid4())
        self.headers = {
            "Authorization": "Bearer YOUR_STATIC_LOAD_TEST_TOKEN",
            "Content-Type": "application/json",
            "X-Tenant-Id": "acme-corp",
            "X-User-Id": f"user-{random.randint(1000, 9999)}"
        }

    @task
    def multi_turn_chat(self):
        starter = random.choice(CONVERSATION_STARTERS)
        follow_up = random.choice(FOLLOW_UPS)
        first_response = self.client.post(
            "/api/v1/chat/completions",
            headers=self.headers,
            json={
                "conversation_id": self.conversation_id,
                "messages": [
                    {"role": "user", "content": starter}
                ],
                "retrieval": {
                    "top_k": 4,
                    "filters": {"visibility": "internal"}
                },
                "generation": {
                    "model": "claude-3-5-sonnet",
                    "temperature": 0.3,
                    "max_tokens": 300
                },
                "include_sources": True,
                "stream": False
            },
            name="/api/v1/chat/completions [turn1]"
        )
        if first_response.status_code != 200:
            return
        with self.client.post(
            "/api/v1/chat/completions",
            headers=self.headers,
            json={
                "conversation_id": self.conversation_id,
                "messages": [
                    {"role": "user", "content": follow_up}
                ],
                "retrieval": {
                    "top_k": 4,
                    "filters": {"visibility": "internal"}
                },
                "generation": {
                    "model": "claude-3-5-sonnet",
                    "temperature": 0.3,
                    "max_tokens": 300
                },
                "include_sources": True,
                "stream": False
            },
            catch_response=True,
            name="/api/v1/chat/completions [turn2]"
        ) as response:
            if response.status_code != 200:
                response.failure(f"Follow-up failed with {response.status_code}")
                return
            body = response.json()
            if not body.get("answer"):
                response.failure("Missing answer in follow-up response")
            else:
                response.success()
```

What this uncovers
This test is valuable for stress testing session-aware RAG systems because it reveals:
- memory growth across conversation turns
- prompt expansion problems
- session store latency
- degradation in follow-up response times
- token usage spikes caused by conversation history
If your p95 latency climbs sharply on second or third turns, you may need prompt truncation, memory summarization, or caching.
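A minimal truncation sketch illustrates the idea. It assumes the common rough heuristic of about four characters per token for English text; swap in your model's real tokenizer (for example, a library like tiktoken) when you need accurate budgets:

```python
def approx_tokens(text: str) -> int:
    # Rough heuristic: ~4 characters per token for English text.
    # Replace with a real tokenizer for accurate budgeting.
    return max(1, len(text) // 4)

def trim_history(messages: list, budget: int) -> list:
    """Keep the most recent messages that fit within a token budget.

    Walks the history newest-first and always keeps the latest message,
    so the user's current question survives even a tiny budget.
    """
    kept, used = [], 0
    for msg in reversed(messages):
        cost = approx_tokens(msg["content"])
        if kept and used + cost > budget:
            break
        kept.append(msg)
        used += cost
    return list(reversed(kept))
```

Applying a helper like this before each turn keeps prompt size (and therefore generation latency) roughly constant instead of growing with every exchange.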
Scenario 3: Document Ingestion and Indexing Pipeline
A RAG application is not only about query-time performance. Ingestion matters too. If your team uploads documents during business hours, indexing jobs can affect query latency and overall system stability.
```python
from locust import HttpUser, task, between
import random
import time

DOCUMENTS = [
    {
        "source_url": "https://docs.acme-corp.com/security/soc2-controls.pdf",
        "title": "SOC 2 Security Controls",
        "metadata": {"department": "security", "doc_type": "policy", "region": "us"}
    },
    {
        "source_url": "https://docs.acme-corp.com/hr/remote-work-policy.pdf",
        "title": "Remote Work Policy",
        "metadata": {"department": "hr", "doc_type": "policy", "region": "global"}
    },
    {
        "source_url": "https://docs.acme-corp.com/engineering/api-runbook.pdf",
        "title": "API Incident Runbook",
        "metadata": {"department": "engineering", "doc_type": "runbook", "region": "us"}
    }
]

class DocumentIngestionUser(HttpUser):
    wait_time = between(5, 10)

    def on_start(self):
        self.headers = {
            "Authorization": "Bearer YOUR_STATIC_LOAD_TEST_TOKEN",
            "Content-Type": "application/json",
            "X-Tenant-Id": "acme-corp"
        }

    @task
    def ingest_document(self):
        doc = random.choice(DOCUMENTS)
        with self.client.post(
            "/api/v1/documents/ingest",
            headers=self.headers,
            json={
                "source_type": "url",
                "source_url": doc["source_url"],
                "title": doc["title"],
                "chunking": {
                    "strategy": "recursive",
                    "chunk_size": 800,
                    "chunk_overlap": 120
                },
                "embedding": {"model": "text-embedding-3-large"},
                "index": {"namespace": "knowledge-base-prod"},
                "metadata": doc["metadata"]
            },
            catch_response=True,
            name="/api/v1/documents/ingest"
        ) as response:
            if response.status_code not in (200, 202):
                response.failure(f"Ingestion start failed with {response.status_code}")
                return
            job_id = response.json().get("job_id")
            if not job_id:
                response.failure("Missing ingestion job_id")
                return
            time.sleep(1)
            status_response = self.client.get(
                f"/api/v1/documents/{job_id}/status",
                headers=self.headers,
                name="/api/v1/documents/{job_id}/status"
            )
            if status_response.status_code != 200:
                response.failure("Could not fetch ingestion job status")
                return
            status = status_response.json().get("status")
            if status not in ("queued", "processing", "completed"):
                response.failure(f"Unexpected ingestion status: {status}")
            else:
                response.success()
```

Why ingestion load testing matters
This scenario helps you evaluate:
- background job queue performance
- chunking and embedding throughput
- vector indexing latency
- contention between write-heavy and read-heavy workloads
A common mistake in RAG performance testing is to test only chat queries. In production, ingestion pipelines can create noisy-neighbor effects that slow down retrieval and generation.
Analyzing Your Results
Once your tests are running in LoadForge, focus on more than just average response time. AI & LLM systems often have wide latency variation, so percentile analysis is critical.
Key metrics to watch
p95 and p99 latency
RAG applications often have long-tail latency due to:
- slow retrieval queries
- provider-side LLM delays
- retries on upstream APIs
- oversized prompts
Average latency can hide these issues. Always look at p95 and p99.
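LoadForge reports these percentiles for you in real time, but if you export raw latency samples for offline analysis, the math is straightforward with the Python standard library, and a small example shows why the mean is misleading:

```python
import statistics

def latency_summary(samples_ms: list) -> dict:
    """Summarize raw latency samples with mean, p50, p95, and p99."""
    ordered = sorted(samples_ms)
    # quantiles() with n=100 returns the 99 cut points p1..p99.
    cuts = statistics.quantiles(ordered, n=100)
    return {
        "mean": statistics.fmean(ordered),
        "p50": cuts[49],
        "p95": cuts[94],
        "p99": cuts[98],
    }
```

For instance, 95 samples at 100 ms plus 5 samples at 5000 ms average out to 345 ms, which sounds acceptable, while the p95 lands above 4 seconds, which your users will certainly notice.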
Error rate
Track:
- HTTP 429 rate limit responses
- HTTP 5xx server errors
- timeout failures
- malformed or incomplete AI responses
A low average latency is meaningless if your error rate spikes under load.
Requests per second
For RAG systems, throughput should be interpreted alongside complexity. Ten requests per second for long-context generation may be excellent, while the same throughput for retrieval-only traffic may be poor.
Endpoint-level breakdown
Separate metrics for:
- auth
- retrieval
- chat generation
- ingestion
- job status polling
This helps pinpoint the bottleneck quickly.
Questions to ask when reviewing LoadForge results
- Does retrieval latency increase linearly or sharply with concurrency?
- Does generation time dominate total response time?
- Are follow-up chat turns slower than first-turn requests?
- Does ingestion traffic impact chat p95 latency?
- Are provider rate limits causing failures before your own infrastructure saturates?
LoadForge’s real-time reporting makes these patterns easier to spot while the test is still running, and its distributed testing model lets you validate whether latency differs across regions or cloud locations.
Performance Optimization Tips
After load testing your RAG application, you will usually find opportunities to improve performance. Here are some of the most common optimizations.
Cache embeddings and retrieval where possible
If users ask repeated or similar questions, caching query embeddings or retrieval results can reduce pressure on your embeddings service and vector database.
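One way to apply this is a small LRU cache keyed by a normalized query string. The sketch below is illustrative rather than a specific library's API; `fetch` stands in for whatever call performs embedding and vector search in your stack:

```python
import hashlib
from collections import OrderedDict

class RetrievalCache:
    """Tiny LRU cache for retrieval results, keyed by a normalized query.

    `fetch` is a hypothetical hook standing in for your real retrieval
    call (query embedding plus vector search).
    """
    def __init__(self, max_size: int = 1024):
        self.max_size = max_size
        self._store = OrderedDict()
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(query: str) -> str:
        # Normalize casing and whitespace so trivially different
        # phrasings of the same question share a cache entry.
        normalized = " ".join(query.lower().split())
        return hashlib.sha256(normalized.encode()).hexdigest()

    def get_or_fetch(self, query: str, fetch):
        key = self._key(query)
        if key in self._store:
            self.hits += 1
            self._store.move_to_end(key)
            return self._store[key]
        self.misses += 1
        result = fetch(query)
        self._store[key] = result
        if len(self._store) > self.max_size:
            self._store.popitem(last=False)  # evict least recently used
        return result
```

Tracking the hit/miss counters during a load test also tells you how cache-friendly your real query distribution actually is.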
Limit retrieved context size
Fetching too many chunks increases:
- prompt assembly time
- token count
- LLM generation latency
- cost
Test different top_k values and chunk sizes to find the best tradeoff.
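One lightweight way to run that comparison inside a Locust script is to give each configuration its own request name, so every top_k and reranking variant shows up as a separate row in the results. A hedged sketch, reusing the payload shape from this guide's example API:

```python
def build_search_variant(query: str, top_k: int, rerank: bool):
    """Build a retrieval payload plus a per-variant request name.

    The payload fields mirror the example /api/v1/retrieval/search API
    used throughout this guide; adapt them to your own endpoint.
    """
    payload = {
        "query": query,
        "top_k": top_k,
        "namespace": "knowledge-base-prod",
        "rerank": {"enabled": rerank, "top_n": min(5, top_k)},
    }
    # Distinct names make LoadForge/Locust report each variant separately.
    name = (
        f"/api/v1/retrieval/search "
        f"[top_k={top_k}, rerank={'on' if rerank else 'off'}]"
    )
    return payload, name
```

Inside a task you would pick a variant, then pass `json=payload, name=name` to `self.client.post`, letting one test run produce a side-by-side latency comparison across configurations.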
Use reranking selectively
Reranking improves relevance, but it also adds latency. Consider enabling it only for high-value queries or after an initial confidence check.
Trim conversation history
Multi-turn chat can become slow if every prior message is included. Use summarization, token budgeting, or memory windows.
Separate ingestion from query infrastructure
If possible, isolate indexing and embedding jobs from user-facing query traffic to avoid resource contention.
Add circuit breakers and graceful degradation
If the LLM provider or vector database slows down, your application should degrade gracefully:
- return partial retrieval results
- reduce top_k
- switch to a smaller model
- disable reranking temporarily
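A circuit breaker can be as simple as a failure counter with a cooldown. The sketch below is illustrative, not a specific library's API; in practice you might reach for an existing resilience library instead:

```python
import time

class CircuitBreaker:
    """Minimal circuit-breaker sketch for a slow upstream dependency.

    After `threshold` consecutive failures the circuit opens and calls
    are short-circuited for `cooldown` seconds, after which one trial
    call is allowed through (the half-open state).
    """
    def __init__(self, threshold: int = 5, cooldown: float = 30.0):
        self.threshold = threshold
        self.cooldown = cooldown
        self.failures = 0
        self.opened_at = None

    def allow(self) -> bool:
        if self.opened_at is None:
            return True
        if time.monotonic() - self.opened_at >= self.cooldown:
            return True  # half-open: permit one trial request
        return False

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.threshold:
            self.opened_at = time.monotonic()
```

When `allow()` returns False, the request handler can skip the slow dependency entirely and fall back to one of the degraded modes listed above, rather than queueing requests behind a struggling provider.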
Load test with realistic prompts
Short synthetic prompts can produce misleadingly good results. Use real question lengths, real metadata filters, and realistic token budgets.
Common Pitfalls to Avoid
Load testing RAG applications requires more care than standard API load testing. Avoid these common mistakes.
Testing only one endpoint
A single /chat test does not tell the full story. Include retrieval-only and ingestion scenarios for complete performance testing.
Ignoring provider rate limits
Your LLM or embedding provider may throttle traffic long before your own app is saturated. Include rate-limit monitoring in your analysis.
Using unrealistic payloads
Tiny prompts, no filters, and low token counts do not reflect production. Use realistic payloads that match actual user behavior.
Not validating response quality
A 200 OK with an empty answer, missing sources, or partial retrieval is still a failure in many RAG applications. Add response validation in your Locust scripts.
Overlooking warm caches
If you run the same queries repeatedly, you may benchmark the cache rather than the system. Mix repeated and unique queries to get a more realistic picture.
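In a Locust script, one way to control that mix is a helper that returns a repeated question some fraction of the time and a unique variant otherwise. The pool below reuses sample questions from earlier in this guide; the helper itself is a sketch you would adapt:

```python
import random
import uuid

REPEATED_POOL = [
    "What is our refund policy for annual enterprise subscriptions?",
    "How do I configure SSO with Okta for the admin portal?",
]

def pick_query(repeat_ratio: float = 0.3, rng=None) -> str:
    """Return a repeated query `repeat_ratio` of the time, otherwise a
    unique variant that should miss any exact-match query or
    embedding cache."""
    rng = rng or random.Random()
    if rng.random() < repeat_ratio:
        return rng.choice(REPEATED_POOL)
    # A unique suffix defeats exact-match caching while keeping the
    # question semantically realistic.
    return f"{rng.choice(REPEATED_POOL)} (ref {uuid.uuid4().hex[:8]})"
```

Tuning `repeat_ratio` toward your production query-repetition rate keeps the benchmark honest: high enough to exercise the cache, low enough that you still measure cold-path retrieval.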
Forgetting mixed workloads
Real systems often handle both user queries and document ingestion at the same time. Test them together, not in isolation only.
Skipping distributed tests
RAG applications often serve global users. LoadForge’s cloud-based infrastructure and global test locations help you simulate geographically distributed traffic patterns that local testing cannot reproduce.
Conclusion
Load testing RAG applications is about much more than sending requests to an LLM endpoint. To understand real-world performance, you need to test the entire pipeline: authentication, embeddings, vector retrieval, reranking, prompt construction, response generation, and ingestion workflows. That is the only way to uncover the true bottlenecks affecting user experience.
With Locust-based scripting and LoadForge’s distributed testing platform, you can build realistic AI & LLM load testing scenarios, run them at scale, analyze results in real time, and integrate performance testing into your CI/CD process. Whether you are validating a new vector database configuration, stress testing multi-turn chat, or benchmarking document indexing throughput, LoadForge gives you the flexibility to test your RAG system the way users actually use it.
If you are ready to load test your RAG application end to end, try LoadForge and start building realistic performance tests today.
LoadForge Team
LoadForge is a load and performance testing platform built on Locust. Our team has been shipping load tests against production systems since 2018, and we write these guides from real customer engagements.