
Introduction
Retrieval-Augmented Generation (RAG) applications are quickly becoming a core part of modern AI products. Whether you are building an internal knowledge assistant, a customer support chatbot, a document Q&A platform, or an enterprise search experience, your users expect fast, accurate, and reliable responses. That makes load testing RAG applications essential.
Unlike a traditional web app, a RAG system usually includes multiple performance-sensitive stages in a single request path: query preprocessing, embeddings generation, vector database retrieval, reranking, prompt construction, and LLM response generation. Under load, bottlenecks can appear in any of these layers. A system that works fine with a few test users may degrade quickly when dozens or hundreds of concurrent users start asking long, retrieval-heavy questions.
In this guide, you will learn how to load test RAG applications end to end using LoadForge and Locust. We will cover realistic performance testing scenarios for AI & LLM systems, including authenticated chat requests, document ingestion pipelines, retrieval-only testing, and mixed user traffic patterns. You will also see how to evaluate latency, throughput, token-heavy responses, and failure modes specific to RAG architectures.
Because LoadForge is built on Locust, every example uses practical Python scripts you can run in a cloud-based, distributed load testing environment. That makes it easier to simulate real-world traffic from global test locations, monitor results in real time, and integrate performance testing into CI/CD pipelines.
Prerequisites
Before you start load testing your RAG application, make sure you have:
- A deployed RAG application with accessible HTTP endpoints
- API authentication credentials such as:
- Bearer token
- OAuth client credentials
- API key
- Knowledge of your application’s request flow, including:
- chat or ask endpoint
- document ingestion endpoint
- retrieval/search endpoint
- embeddings endpoint, if exposed
- A representative test dataset, such as:
- realistic user questions
- sample document IDs
- tenant IDs or workspace IDs
- uploaded files or source URLs
- Expected performance baselines, for example:
- p95 latency under 50 concurrent users
- max acceptable error rate below 1%
- target throughput for document ingestion jobs
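Baselines like these are easiest to enforce when they live in code rather than a wiki page. As an illustrative sketch (the field names and threshold values below are examples, not a LoadForge or Locust API), a small checker that a CI step runs against exported results might look like this:

```python
# Illustrative baseline check. Field names and thresholds are example
# values you would adapt to your own exported test results.
BASELINES = {
    "p95_latency_ms": 3000,   # p95 latency budget under 50 concurrent users
    "max_error_rate": 0.01,   # error rate must stay below 1%
    "min_ingest_rps": 2.0,    # target document ingestion throughput
}

def check_baselines(results: dict, baselines: dict = BASELINES) -> list:
    """Return a list of human-readable baseline violations (empty = pass)."""
    violations = []
    if results["p95_latency_ms"] > baselines["p95_latency_ms"]:
        violations.append(
            f"p95 latency {results['p95_latency_ms']}ms over budget"
        )
    if results["error_rate"] > baselines["max_error_rate"]:
        violations.append(
            f"error rate {results['error_rate']:.2%} over budget"
        )
    if results["ingest_rps"] < baselines["min_ingest_rps"]:
        violations.append(
            f"ingestion throughput {results['ingest_rps']} rps under target"
        )
    return violations
```

A CI job can fail the build whenever the returned list is non-empty, which turns your baselines into an enforced contract instead of a guideline.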
It also helps to understand whether your RAG stack uses components such as:
- OpenAI, Anthropic, Azure OpenAI, or self-hosted LLM inference
- Pinecone, Weaviate, Qdrant, Milvus, pgvector, or Elasticsearch for vector search
- Redis or another cache for retrieval or prompt caching
- Background workers for chunking, embeddings, and indexing
For the examples below, we will assume the application exposes these realistic endpoints:
- POST /api/v1/auth/token
- POST /api/v1/chat/completions
- POST /api/v1/retrieval/search
- POST /api/v1/documents/ingest
- GET /api/v1/documents/{job_id}/status
Understanding RAG Applications Under Load
RAG systems behave differently from standard CRUD applications during load testing and stress testing. A single user request can trigger several expensive operations:
- Query normalization or guardrail checks
- Embedding generation for the user query
- Vector similarity search in a vector database
- Metadata filtering and reranking
- Prompt assembly with retrieved context
- LLM inference or API call
- Streaming or full-response delivery
Each stage introduces its own latency profile. Under concurrent load, common bottlenecks include:
Embedding Throughput Limits
If your application generates embeddings on every query, your embedding service may become a constraint before the vector database or LLM does. This is especially common when using external API providers with rate limits.
Vector Database Saturation
Vector search performance can degrade under high concurrency, especially with:
- large indexes
- complex metadata filters
- hybrid search
- reranking on top of initial retrieval
LLM Token Generation Delays
The generation step often dominates end-to-end latency. Long prompts, large retrieved contexts, and high max_tokens settings can significantly increase response times.
Background Ingestion Contention
If document ingestion and query traffic share the same infrastructure, indexing jobs may compete for CPU, memory, I/O, or provider quotas.
Authentication and Multi-Tenant Overhead
Enterprise RAG systems often add:
- per-tenant retrieval filters
- RBAC checks
- audit logging
- conversation memory lookup
These can become hidden latency contributors during performance testing.
That is why effective load testing for RAG applications should not focus only on a single chat endpoint. You should test the full system behavior, including read-heavy query traffic, write-heavy ingestion traffic, and mixed workloads.
Writing Your First Load Test
Let’s start with a basic end-to-end RAG chat load test. This script authenticates a user, sends realistic questions to the chat endpoint, and validates that the response includes an answer and retrieved sources.
```python
from locust import HttpUser, task, between
import random

QUESTIONS = [
    "What is our refund policy for annual enterprise subscriptions?",
    "How do I configure SSO with Okta for the admin portal?",
    "Summarize the security controls for SOC 2 compliance.",
    "What are the API rate limits for the premium plan?",
    "How do I rotate service account credentials safely?"
]

class RAGChatUser(HttpUser):
    wait_time = between(1, 3)

    def on_start(self):
        response = self.client.post(
            "/api/v1/auth/token",
            json={
                "client_id": "loadtest-client",
                "client_secret": "super-secret-value",
                "audience": "rag-api",
                "grant_type": "client_credentials"
            },
            name="/api/v1/auth/token"
        )
        data = response.json()
        self.token = data["access_token"]
        self.headers = {
            "Authorization": f"Bearer {self.token}",
            "Content-Type": "application/json",
            "X-Tenant-Id": "acme-corp"
        }

    @task
    def ask_question(self):
        question = random.choice(QUESTIONS)
        with self.client.post(
            "/api/v1/chat/completions",
            headers=self.headers,
            json={
                "conversation_id": None,
                "messages": [
                    {"role": "user", "content": question}
                ],
                "retrieval": {
                    "top_k": 5,
                    "filters": {
                        "department": "support",
                        "visibility": "internal"
                    }
                },
                "generation": {
                    "model": "gpt-4o-mini",
                    "temperature": 0.2,
                    "max_tokens": 400
                },
                "include_sources": True,
                "stream": False
            },
            catch_response=True,
            name="/api/v1/chat/completions"
        ) as response:
            if response.status_code != 200:
                response.failure(f"Unexpected status: {response.status_code}")
                return
            body = response.json()
            answer = body.get("answer", "")
            sources = body.get("sources", [])
            if not answer:
                response.failure("Missing answer in response")
            elif len(sources) == 0:
                response.failure("No retrieval sources returned")
            else:
                response.success()
```

What this test covers
This first script is useful because it measures the full request path for a common RAG query:
- authentication
- query handling
- retrieval
- prompt building
- LLM generation
- response serialization
It also validates application correctness, not just status codes. In AI & LLM load testing, a 200 OK response is not enough if the answer is empty or retrieval failed silently.
Why this matters for RAG performance testing
A simple chat test helps you establish:
- average and p95 response times
- error rates under concurrent user traffic
- whether retrieval remains functional at scale
- whether LLM latency dominates total request time
In LoadForge, you can run this script with distributed testing workers to simulate traffic from multiple cloud regions and observe real-time reporting as concurrency increases.
Advanced Load Testing Scenarios
Basic chat traffic is only the beginning. Real RAG systems need more targeted load testing scenarios to uncover bottlenecks in retrieval, ingestion, and mixed workloads.
Scenario 1: Retrieval-Heavy Search Without Full Generation
Sometimes you want to isolate the retrieval layer from LLM generation. This is especially useful when diagnosing whether latency comes from the vector database or the model itself.
```python
from locust import HttpUser, task, between
import random

SEARCH_QUERIES = [
    "PCI DSS logging requirements for payment processing",
    "Incident response runbook for database failover",
    "Employee onboarding checklist for engineering",
    "Data retention policy for customer support transcripts",
    "VPN access troubleshooting steps for remote workers"
]

class RetrievalOnlyUser(HttpUser):
    wait_time = between(0.5, 2)

    def on_start(self):
        self.headers = {
            "Authorization": "Bearer YOUR_STATIC_LOAD_TEST_TOKEN",
            "Content-Type": "application/json",
            "X-Tenant-Id": "acme-corp"
        }

    @task
    def semantic_search(self):
        query = random.choice(SEARCH_QUERIES)
        with self.client.post(
            "/api/v1/retrieval/search",
            headers=self.headers,
            json={
                "query": query,
                "top_k": 8,
                "namespace": "knowledge-base-prod",
                "filters": {
                    "doc_type": ["policy", "runbook", "guide"],
                    "region": "us",
                    "published": True
                },
                "rerank": {
                    "enabled": True,
                    "model": "cross-encoder-v2",
                    "top_n": 5
                },
                "include_chunks": True
            },
            catch_response=True,
            name="/api/v1/retrieval/search"
        ) as response:
            if response.status_code != 200:
                response.failure(f"Search failed with {response.status_code}")
                return
            data = response.json()
            results = data.get("results", [])
            if len(results) < 3:
                response.failure("Too few retrieval results returned")
            else:
                response.success()
```

When to use this test
Use retrieval-only testing when you want to:
- benchmark vector database performance
- compare reranking enabled vs disabled
- test metadata filtering overhead
- isolate retrieval bottlenecks from LLM latency
This is one of the most effective forms of performance testing for RAG applications because it narrows the problem scope. If retrieval is fast but chat is slow, the issue likely sits in prompt assembly or generation.
Scenario 2: Authenticated Multi-Turn Chat Sessions
Real users do not always send one isolated question. They often continue a conversation, which means your system may store session state, retrieve prior messages, and expand prompts over time.
```python
from locust import HttpUser, task, between
import random
import uuid

CONVERSATION_STARTERS = [
    "Can you explain our disaster recovery plan?",
    "What is the process for requesting production access?",
    "Summarize the SLA for enterprise customers."
]

FOLLOW_UPS = [
    "Can you give me the key steps as a checklist?",
    "Which section mentions approval requirements?",
    "Summarize that in simpler language.",
    "What are the exceptions to this policy?"
]

class MultiTurnRAGUser(HttpUser):
    wait_time = between(2, 5)

    def on_start(self):
        self.conversation_id = str(uuid.uuid4())
        self.headers = {
            "Authorization": "Bearer YOUR_STATIC_LOAD_TEST_TOKEN",
            "Content-Type": "application/json",
            "X-Tenant-Id": "acme-corp",
            "X-User-Id": f"user-{random.randint(1000, 9999)}"
        }

    @task
    def multi_turn_chat(self):
        starter = random.choice(CONVERSATION_STARTERS)
        follow_up = random.choice(FOLLOW_UPS)
        first_response = self.client.post(
            "/api/v1/chat/completions",
            headers=self.headers,
            json={
                "conversation_id": self.conversation_id,
                "messages": [
                    {"role": "user", "content": starter}
                ],
                "retrieval": {
                    "top_k": 4,
                    "filters": {"visibility": "internal"}
                },
                "generation": {
                    "model": "claude-3-5-sonnet",
                    "temperature": 0.3,
                    "max_tokens": 300
                },
                "include_sources": True,
                "stream": False
            },
            name="/api/v1/chat/completions [turn1]"
        )
        if first_response.status_code != 200:
            return
        with self.client.post(
            "/api/v1/chat/completions",
            headers=self.headers,
            json={
                "conversation_id": self.conversation_id,
                "messages": [
                    {"role": "user", "content": follow_up}
                ],
                "retrieval": {
                    "top_k": 4,
                    "filters": {"visibility": "internal"}
                },
                "generation": {
                    "model": "claude-3-5-sonnet",
                    "temperature": 0.3,
                    "max_tokens": 300
                },
                "include_sources": True,
                "stream": False
            },
            catch_response=True,
            name="/api/v1/chat/completions [turn2]"
        ) as response:
            if response.status_code != 200:
                response.failure(f"Follow-up failed with {response.status_code}")
                return
            body = response.json()
            if not body.get("answer"):
                response.failure("Missing answer in follow-up response")
            else:
                response.success()
```

What this uncovers
This test is valuable for stress testing session-aware RAG systems because it reveals:
- memory growth across conversation turns
- prompt expansion problems
- session store latency
- degradation in follow-up response times
- token usage spikes caused by conversation history
If your p95 latency climbs sharply on second or third turns, you may need prompt truncation, memory summarization, or caching.
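A minimal truncation sketch illustrates the idea. It assumes the common rough heuristic of about four characters per token for English text; swap in your model's real tokenizer (for example, a library like tiktoken) when you need accurate budgets:

```python
def approx_tokens(text: str) -> int:
    # Rough heuristic: ~4 characters per token for English text.
    # Replace with a real tokenizer for accurate budgeting.
    return max(1, len(text) // 4)

def trim_history(messages: list, budget: int) -> list:
    """Keep the most recent messages that fit within a token budget.

    Walks the history newest-first and always keeps the latest message,
    so the user's current question survives even a tiny budget.
    """
    kept, used = [], 0
    for msg in reversed(messages):
        cost = approx_tokens(msg["content"])
        if kept and used + cost > budget:
            break
        kept.append(msg)
        used += cost
    return list(reversed(kept))
```

Applying a helper like this before each turn keeps prompt size (and therefore generation latency) roughly constant instead of growing with every exchange.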
Scenario 3: Document Ingestion and Indexing Pipeline
A RAG application is not only about query-time performance. Ingestion matters too. If your team uploads documents during business hours, indexing jobs can affect query latency and overall system stability.
```python
from locust import HttpUser, task, between
import random
import time

DOCUMENTS = [
    {
        "source_url": "https://docs.acme-corp.com/security/soc2-controls.pdf",
        "title": "SOC 2 Security Controls",
        "metadata": {"department": "security", "doc_type": "policy", "region": "us"}
    },
    {
        "source_url": "https://docs.acme-corp.com/hr/remote-work-policy.pdf",
        "title": "Remote Work Policy",
        "metadata": {"department": "hr", "doc_type": "policy", "region": "global"}
    },
    {
        "source_url": "https://docs.acme-corp.com/engineering/api-runbook.pdf",
        "title": "API Incident Runbook",
        "metadata": {"department": "engineering", "doc_type": "runbook", "region": "us"}
    }
]

class DocumentIngestionUser(HttpUser):
    wait_time = between(5, 10)

    def on_start(self):
        self.headers = {
            "Authorization": "Bearer YOUR_STATIC_LOAD_TEST_TOKEN",
            "Content-Type": "application/json",
            "X-Tenant-Id": "acme-corp"
        }

    @task
    def ingest_document(self):
        doc = random.choice(DOCUMENTS)
        with self.client.post(
            "/api/v1/documents/ingest",
            headers=self.headers,
            json={
                "source_type": "url",
                "source_url": doc["source_url"],
                "title": doc["title"],
                "chunking": {
                    "strategy": "recursive",
                    "chunk_size": 800,
                    "chunk_overlap": 120
                },
                "embedding": {"model": "text-embedding-3-large"},
                "index": {"namespace": "knowledge-base-prod"},
                "metadata": doc["metadata"]
            },
            catch_response=True,
            name="/api/v1/documents/ingest"
        ) as response:
            if response.status_code not in (200, 202):
                response.failure(f"Ingestion start failed with {response.status_code}")
                return
            job_id = response.json().get("job_id")
            if not job_id:
                response.failure("Missing ingestion job_id")
                return
            time.sleep(1)
            status_response = self.client.get(
                f"/api/v1/documents/{job_id}/status",
                headers=self.headers,
                name="/api/v1/documents/{job_id}/status"
            )
            if status_response.status_code != 200:
                response.failure("Could not fetch ingestion job status")
                return
            status = status_response.json().get("status")
            if status not in ("queued", "processing", "completed"):
                response.failure(f"Unexpected ingestion status: {status}")
            else:
                response.success()
```

Why ingestion load testing matters
This scenario helps you evaluate:
- background job queue performance
- chunking and embedding throughput
- vector indexing latency
- contention between write-heavy and read-heavy workloads
A common mistake in RAG performance testing is to test only chat queries. In production, ingestion pipelines can create noisy-neighbor effects that slow down retrieval and generation.
Analyzing Your Results
Once your tests are running in LoadForge, focus on more than just average response time. AI & LLM systems often have wide latency variation, so percentile analysis is critical.
Key metrics to watch
p95 and p99 latency
RAG applications often have long-tail latency due to:
- slow retrieval queries
- provider-side LLM delays
- retries on upstream APIs
- oversized prompts
Average latency can hide these issues. Always look at p95 and p99.
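LoadForge reports these percentiles for you in real time, but if you export raw latency samples for offline analysis, the math is straightforward with the Python standard library, and a small example shows why the mean is misleading:

```python
import statistics

def latency_summary(samples_ms: list) -> dict:
    """Summarize raw latency samples with mean, p50, p95, and p99."""
    ordered = sorted(samples_ms)
    # quantiles() with n=100 returns the 99 cut points p1..p99.
    cuts = statistics.quantiles(ordered, n=100)
    return {
        "mean": statistics.fmean(ordered),
        "p50": cuts[49],
        "p95": cuts[94],
        "p99": cuts[98],
    }
```

For instance, 95 samples at 100 ms plus 5 samples at 5000 ms average out to 345 ms, which sounds acceptable, while the p95 lands above 4 seconds, which your users will certainly notice.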
Error rate
Track:
- HTTP 429 rate limit responses
- HTTP 5xx server errors
- timeout failures
- malformed or incomplete AI responses
A low average latency is meaningless if your error rate spikes under load.
Requests per second
For RAG systems, throughput should be interpreted alongside complexity. Ten requests per second for long-context generation may be excellent, while the same throughput for retrieval-only traffic may be poor.
Endpoint-level breakdown
Separate metrics for:
- auth
- retrieval
- chat generation
- ingestion
- job status polling
This helps pinpoint the bottleneck quickly.
Questions to ask when reviewing LoadForge results
- Does retrieval latency increase linearly or sharply with concurrency?
- Does generation time dominate total response time?
- Are follow-up chat turns slower than first-turn requests?
- Does ingestion traffic impact chat p95 latency?
- Are provider rate limits causing failures before your own infrastructure saturates?
LoadForge’s real-time reporting makes these patterns easier to spot while the test is still running, and its distributed testing model lets you validate whether latency differs across regions or cloud locations.
Performance Optimization Tips
After load testing your RAG application, you will usually find opportunities to improve performance. Here are some of the most common optimizations.
Cache embeddings and retrieval where possible
If users ask repeated or similar questions, caching query embeddings or retrieval results can reduce pressure on your embeddings service and vector database.
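One way to apply this is a small LRU cache keyed by a normalized query string. The sketch below is illustrative rather than a specific library's API; `fetch` stands in for whatever call performs embedding and vector search in your stack:

```python
import hashlib
from collections import OrderedDict

class RetrievalCache:
    """Tiny LRU cache for retrieval results, keyed by a normalized query.

    `fetch` is a hypothetical hook standing in for your real retrieval
    call (query embedding plus vector search).
    """
    def __init__(self, max_size: int = 1024):
        self.max_size = max_size
        self._store = OrderedDict()
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(query: str) -> str:
        # Normalize casing and whitespace so trivially different
        # phrasings of the same question share a cache entry.
        normalized = " ".join(query.lower().split())
        return hashlib.sha256(normalized.encode()).hexdigest()

    def get_or_fetch(self, query: str, fetch):
        key = self._key(query)
        if key in self._store:
            self.hits += 1
            self._store.move_to_end(key)
            return self._store[key]
        self.misses += 1
        result = fetch(query)
        self._store[key] = result
        if len(self._store) > self.max_size:
            self._store.popitem(last=False)  # evict least recently used
        return result
```

Tracking the hit/miss counters during a load test also tells you how cache-friendly your real query distribution actually is.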
Limit retrieved context size
Fetching too many chunks increases:
- prompt assembly time
- token count
- LLM generation latency
- cost
Test different top_k values and chunk sizes to find the best tradeoff.
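One lightweight way to run that comparison inside a Locust script is to give each configuration its own request name, so every top_k and reranking variant shows up as a separate row in the results. A hedged sketch, reusing the payload shape from this guide's example API:

```python
def build_search_variant(query: str, top_k: int, rerank: bool):
    """Build a retrieval payload plus a per-variant request name.

    The payload fields mirror the example /api/v1/retrieval/search API
    used throughout this guide; adapt them to your own endpoint.
    """
    payload = {
        "query": query,
        "top_k": top_k,
        "namespace": "knowledge-base-prod",
        "rerank": {"enabled": rerank, "top_n": min(5, top_k)},
    }
    # Distinct names make LoadForge/Locust report each variant separately.
    name = (
        f"/api/v1/retrieval/search "
        f"[top_k={top_k}, rerank={'on' if rerank else 'off'}]"
    )
    return payload, name
```

Inside a task you would pick a variant, then pass `json=payload, name=name` to `self.client.post`, letting one test run produce a side-by-side latency comparison across configurations.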
Use reranking selectively
Reranking improves relevance, but it also adds latency. Consider enabling it only for high-value queries or after an initial confidence check.
Trim conversation history
Multi-turn chat can become slow if every prior message is included. Use summarization, token budgeting, or memory windows.
Separate ingestion from query infrastructure
If possible, isolate indexing and embedding jobs from user-facing query traffic to avoid resource contention.
Add circuit breakers and graceful degradation
If the LLM provider or vector database slows down, your application should degrade gracefully:
- return partial retrieval results
- reduce top_k
- switch to a smaller model
- disable reranking temporarily
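A circuit breaker can be as simple as a failure counter with a cooldown. The sketch below is illustrative, not a specific library's API; in practice you might reach for an existing resilience library instead:

```python
import time

class CircuitBreaker:
    """Minimal circuit-breaker sketch for a slow upstream dependency.

    After `threshold` consecutive failures the circuit opens and calls
    are short-circuited for `cooldown` seconds, after which one trial
    call is allowed through (the half-open state).
    """
    def __init__(self, threshold: int = 5, cooldown: float = 30.0):
        self.threshold = threshold
        self.cooldown = cooldown
        self.failures = 0
        self.opened_at = None

    def allow(self) -> bool:
        if self.opened_at is None:
            return True
        if time.monotonic() - self.opened_at >= self.cooldown:
            return True  # half-open: permit one trial request
        return False

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.threshold:
            self.opened_at = time.monotonic()
```

When `allow()` returns False, the request handler can skip the slow dependency entirely and fall back to one of the degraded modes listed above, rather than queueing requests behind a struggling provider.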
Load test with realistic prompts
Short synthetic prompts can produce misleadingly good results. Use real question lengths, real metadata filters, and realistic token budgets.
Common Pitfalls to Avoid
Load testing RAG applications requires more care than standard API load testing. Avoid these common mistakes.
Testing only one endpoint
A single /chat test does not tell the full story. Include retrieval-only and ingestion scenarios for complete performance testing.
Ignoring provider rate limits
Your LLM or embedding provider may throttle traffic long before your own app is saturated. Include rate-limit monitoring in your analysis.
Using unrealistic payloads
Tiny prompts, no filters, and low token counts do not reflect production. Use realistic payloads that match actual user behavior.
Not validating response quality
A 200 OK with an empty answer, missing sources, or partial retrieval is still a failure in many RAG applications. Add response validation in your Locust scripts.
Overlooking warm caches
If you run the same queries repeatedly, you may benchmark the cache rather than the system. Mix repeated and unique queries to get a more realistic picture.
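In a Locust script, one way to control that mix is a helper that returns a repeated question some fraction of the time and a unique variant otherwise. The pool below reuses sample questions from earlier in this guide; the helper itself is a sketch you would adapt:

```python
import random
import uuid

REPEATED_POOL = [
    "What is our refund policy for annual enterprise subscriptions?",
    "How do I configure SSO with Okta for the admin portal?",
]

def pick_query(repeat_ratio: float = 0.3, rng=None) -> str:
    """Return a repeated query `repeat_ratio` of the time, otherwise a
    unique variant that should miss any exact-match query or
    embedding cache."""
    rng = rng or random.Random()
    if rng.random() < repeat_ratio:
        return rng.choice(REPEATED_POOL)
    # A unique suffix defeats exact-match caching while keeping the
    # question semantically realistic.
    return f"{rng.choice(REPEATED_POOL)} (ref {uuid.uuid4().hex[:8]})"
```

Tuning `repeat_ratio` toward your production query-repetition rate keeps the benchmark honest: high enough to exercise the cache, low enough that you still measure cold-path retrieval.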
Forgetting mixed workloads
Real systems often handle both user queries and document ingestion at the same time. Test them together, not in isolation only.
Skipping distributed tests
RAG applications often serve global users. LoadForge’s cloud-based infrastructure and global test locations help you simulate geographically distributed traffic patterns that local testing cannot reproduce.
Conclusion
Load testing RAG applications is about much more than sending requests to an LLM endpoint. To understand real-world performance, you need to test the entire pipeline: authentication, embeddings, vector retrieval, reranking, prompt construction, response generation, and ingestion workflows. That is the only way to uncover the true bottlenecks affecting user experience.
With Locust-based scripting and LoadForge’s distributed testing platform, you can build realistic AI & LLM load testing scenarios, run them at scale, analyze results in real time, and integrate performance testing into your CI/CD process. Whether you are validating a new vector database configuration, stress testing multi-turn chat, or benchmarking document indexing throughput, LoadForge gives you the flexibility to test your RAG system the way users actually use it.
If you are ready to load test your RAG application end to end, try LoadForge and start building realistic performance tests today.
LoadForge Team
LoadForge is a load and performance testing platform built on Locust. Our team has been shipping load tests against production systems since 2018, and we write these guides from real customer engagements.