
Introduction
Running large language models locally with Ollama gives teams more control over privacy, cost, and deployment flexibility—but it also introduces a new performance challenge: how well does your Ollama server behave under real user load?
If you are serving models like Llama 3, Mistral, Code Llama, or other self-hosted LLMs through Ollama, load testing is essential. A single prompt may work perfectly in development, but production traffic introduces concurrency, token generation pressure, GPU or CPU saturation, model loading delays, and memory bottlenecks. Without proper performance testing, users may experience long response times, failed generations, or inconsistent throughput.
In this guide, you’ll learn how to load test Ollama using Locust on LoadForge. We’ll cover realistic Ollama API endpoints, streaming and non-streaming generation patterns, authentication headers commonly used behind reverse proxies, and advanced scenarios like concurrent chat requests, model warm-up behavior, and embeddings workloads. You’ll also learn how to interpret results such as latency, requests per second, and token throughput so you can identify where your self-hosted LLM infrastructure starts to break down.
Because LoadForge uses Locust under the hood, every example here is practical Python code you can adapt directly to your environment. And when you need to scale beyond a single machine, LoadForge’s cloud-based infrastructure, distributed testing, real-time reporting, CI/CD integration, and global test locations make it much easier to simulate realistic AI traffic patterns.
Prerequisites
Before you start load testing Ollama, make sure you have the following:
- An Ollama server running and reachable over HTTP
- At least one model already pulled, such as:
  - `llama3`
  - `mistral`
  - `codellama`
  - `nomic-embed-text`
- Familiarity with the Ollama HTTP API
- A LoadForge account for running distributed load tests
- A clear test goal, such as:
- Measuring maximum concurrent generations
- Benchmarking token throughput
- Finding hardware bottlenecks
- Stress testing a reverse-proxied Ollama deployment
You should also know your Ollama base URL. Typical examples include:
- `http://localhost:11434`
- `http://ollama.internal:11434`
- `https://llm.example.com/ollama`

Useful Ollama API endpoints you’ll likely test include:

- `POST /api/generate` — text generation
- `POST /api/chat` — chat-style completion
- `POST /api/embeddings` — embedding generation
- `GET /api/tags` — list installed models
- `POST /api/show` — inspect model info
In many production deployments, Ollama is placed behind Nginx, Traefik, or an API gateway. In that case, you may also need:
- `Authorization: Bearer <token>` headers
- API key headers such as `X-API-Key`
- Rate limiting or WAF rules to account for during testing
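Before launching a full load test, it is worth confirming that the base URL is reachable and the models you plan to test are actually installed. A minimal smoke check against `GET /api/tags` might look like the following sketch (the helper names are illustrative, not part of any library):

```python
import json
import os
import urllib.request


def tags_url(base_url: str) -> str:
    """Build the /api/tags URL from an Ollama base URL."""
    return base_url.rstrip("/") + "/api/tags"


def installed_models(base_url: str) -> list:
    """Return the names of models installed on the Ollama server."""
    with urllib.request.urlopen(tags_url(base_url), timeout=10) as resp:
        data = json.load(resp)
    return [m["name"] for m in data.get("models", [])]


if __name__ == "__main__":
    base = os.getenv("OLLAMA_HOST", "http://localhost:11434")
    # Fails fast if the server is unreachable or no models are pulled
    print(installed_models(base))
```

Running this once before a test run catches misconfigured base URLs and missing models early, instead of discovering them as a wall of failures mid-test.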
Understanding Ollama Under Load
Ollama is different from a typical stateless REST API. When you load test Ollama, you are not just testing HTTP handling—you are testing the full model inference pipeline.
What happens during an Ollama request
For a request to /api/generate or /api/chat, Ollama typically needs to:
- Accept the HTTP request
- Parse the prompt or chat messages
- Ensure the model is loaded into memory
- Run token generation on CPU or GPU
- Stream or return the generated response
- Potentially unload or swap models depending on memory pressure
This means performance depends heavily on:
- Model size
- Prompt length
- Output token count
- Hardware acceleration
- Available VRAM or RAM
- Number of concurrent requests
- Whether the model is already warm in memory
Common Ollama bottlenecks
When load testing Ollama, the most common bottlenecks are:
Model loading latency
The first request after model startup can be much slower than subsequent requests. If models are swapped in and out under pressure, this can happen repeatedly.
GPU or CPU saturation
Even if Ollama responds correctly, token generation speed may collapse when concurrency rises.
Memory pressure
Large models can consume substantial RAM or VRAM. Multiple concurrent users may trigger swapping, degraded throughput, or failures.
Reverse proxy constraints
If Ollama sits behind a proxy, timeouts, buffering, max body sizes, or authentication middleware may become the real bottleneck.
Streaming overhead
Streaming responses are user-friendly, but they can alter connection behavior and impact concurrency differently from non-streaming requests.
A good load test should measure not only whether requests succeed, but also how latency and throughput change as prompt complexity and user concurrency increase.
Writing Your First Load Test
Let’s start with a basic non-streaming generation test against Ollama’s /api/generate endpoint. This is a great first benchmark because it gives you a simple way to measure average response time for a realistic prompt.
Basic Ollama generation load test
```python
from locust import HttpUser, task, between
import os


class OllamaGenerateUser(HttpUser):
    wait_time = between(1, 3)
    host = os.getenv("OLLAMA_HOST", "http://localhost:11434")

    common_headers = {
        "Content-Type": "application/json",
    }

    @task
    def generate_summary(self):
        payload = {
            "model": "llama3",
            "prompt": (
                "Summarize the following support ticket in 3 bullet points:\n\n"
                "Customer reports intermittent timeout errors when uploading PDF files "
                "larger than 10MB through the web dashboard. The issue began after the "
                "latest deployment and affects users in multiple regions."
            ),
            "stream": False,
            "options": {
                "temperature": 0.2,
                "num_predict": 120
            }
        }
        with self.client.post(
            "/api/generate",
            json=payload,
            headers=self.common_headers,
            catch_response=True,
            name="/api/generate summary"
        ) as response:
            if response.status_code != 200:
                response.failure(f"Unexpected status: {response.status_code}")
                return
            data = response.json()
            if "response" not in data or not data["response"].strip():
                response.failure("Missing generated response")
                return
            response.success()
```

What this test does
This script simulates users sending a summarization prompt to Ollama. It uses:
- `model: llama3` for a realistic local LLM deployment
- `stream: False` so Locust measures full response latency
- `num_predict` to limit output size and keep the test controlled
- `catch_response=True` so we can validate the response body
This is a good starting point for performance testing because it answers a basic question:
“How long does Ollama take to complete a typical generation request under concurrent load?”
Why non-streaming is useful first
Although many LLM applications use streaming, non-streaming tests are easier to interpret. They give you a single end-to-end response time that includes prompt processing and token generation. Once you understand this baseline, you can move on to more advanced scenarios.
Advanced Load Testing Scenarios
Real Ollama deployments rarely serve just one simple prompt type. In production, you may have authenticated traffic, multiple endpoints, chat-style conversations, and embedding generation running side by side. Below are several more realistic load testing scenarios.
Scenario 1: Authenticated chat requests behind a reverse proxy
Many teams expose Ollama through an internal API gateway or reverse proxy with bearer token authentication. This example uses /api/chat to simulate a chatbot assistant workload.
```python
from locust import HttpUser, task, between
import os


class OllamaChatUser(HttpUser):
    wait_time = between(2, 5)
    host = os.getenv("OLLAMA_HOST", "https://llm.example.com")

    def on_start(self):
        self.headers = {
            "Content-Type": "application/json",
            "Authorization": f"Bearer {os.getenv('OLLAMA_API_TOKEN', 'dev-token')}",
            "X-Forwarded-For": "203.0.113.10"
        }

    @task(3)
    def customer_support_chat(self):
        payload = {
            "model": "llama3",
            "messages": [
                {
                    "role": "system",
                    "content": "You are a concise customer support assistant for a SaaS platform."
                },
                {
                    "role": "user",
                    "content": (
                        "A customer says their dashboard loads slowly after login. "
                        "Provide a troubleshooting checklist with no more than 6 steps."
                    )
                }
            ],
            "stream": False,
            "options": {
                "temperature": 0.3,
                "num_predict": 180
            }
        }
        with self.client.post(
            "/api/chat",
            json=payload,
            headers=self.headers,
            catch_response=True,
            name="/api/chat support"
        ) as response:
            if response.status_code != 200:
                response.failure(f"Chat request failed: {response.status_code}")
                return
            data = response.json()
            message = data.get("message", {})
            content = message.get("content", "")
            if not content.strip():
                response.failure("Empty chat response")
                return
            response.success()

    @task(1)
    def list_models_healthcheck(self):
        with self.client.get(
            "/api/tags",
            headers=self.headers,
            catch_response=True,
            name="/api/tags"
        ) as response:
            if response.status_code != 200:
                response.failure("Model listing failed")
                return
            response.success()
```

Why this scenario matters
This test is more production-like because it includes:
- Bearer token authentication
- Chat-style prompts
- Weighted tasks
- A lightweight health-style endpoint to compare against generation-heavy traffic
If `/api/tags` remains fast while `/api/chat` slows dramatically, your bottleneck is likely inference capacity rather than basic HTTP routing.
Scenario 2: Mixed workload with embeddings and generation
Many AI applications use embeddings for semantic search alongside text generation. If your Ollama instance serves both, you need to understand how these workloads interact.
```python
from locust import HttpUser, task, between
import os
import random

DOCUMENTS = [
    "Load testing helps identify latency spikes before users are impacted.",
    "Ollama allows teams to run open models locally with simple HTTP APIs.",
    "Embedding vectors are commonly used for semantic search and retrieval.",
    "GPU memory pressure can reduce throughput during concurrent inference."
]


class OllamaMixedWorkloadUser(HttpUser):
    wait_time = between(1, 2)
    host = os.getenv("OLLAMA_HOST", "http://localhost:11434")

    headers = {
        "Content-Type": "application/json",
        "X-API-Key": os.getenv("OLLAMA_API_KEY", "local-dev-key")
    }

    @task(2)
    def create_embedding(self):
        payload = {
            "model": "nomic-embed-text",
            "prompt": random.choice(DOCUMENTS)
        }
        with self.client.post(
            "/api/embeddings",
            json=payload,
            headers=self.headers,
            catch_response=True,
            name="/api/embeddings"
        ) as response:
            if response.status_code != 200:
                response.failure(f"Embedding failed: {response.status_code}")
                return
            data = response.json()
            embedding = data.get("embedding")
            if not embedding or not isinstance(embedding, list):
                response.failure("Missing embedding vector")
                return
            response.success()

    @task(1)
    def retrieval_augmented_generation(self):
        payload = {
            "model": "llama3",
            "prompt": (
                "Use the following retrieved context to answer the question.\n\n"
                "Context:\n"
                "- Load testing reveals concurrency bottlenecks in self-hosted LLMs.\n"
                "- Token throughput often drops when GPU memory becomes constrained.\n"
                "- Embeddings can support semantic retrieval for RAG applications.\n\n"
                "Question: Why should teams load test Ollama before production rollout?"
            ),
            "stream": False,
            "options": {
                "temperature": 0.1,
                "num_predict": 140
            }
        }
        with self.client.post(
            "/api/generate",
            json=payload,
            headers=self.headers,
            catch_response=True,
            name="/api/generate rag"
        ) as response:
            if response.status_code != 200:
                response.failure(f"RAG generation failed: {response.status_code}")
                return
            data = response.json()
            if not data.get("response", "").strip():
                response.failure("Empty RAG response")
                return
            response.success()
```

Why this scenario matters
This mixed workload helps you answer questions like:
- Does embedding traffic starve generation requests?
- Can one Ollama instance support both semantic search and chat?
- At what concurrency does latency become unacceptable?
This is especially useful for AI applications that combine retrieval and generation in a single user flow.
Scenario 3: Measuring warm vs cold model behavior
One of the most important performance testing scenarios for self-hosted LLMs is model warm-up. The first request may be significantly slower if the model is not already loaded.
```python
from locust import HttpUser, task, between, events
import os
import time


class OllamaWarmColdUser(HttpUser):
    wait_time = between(5, 10)
    host = os.getenv("OLLAMA_HOST", "http://localhost:11434")

    headers = {
        "Content-Type": "application/json"
    }

    @task
    def code_generation_request(self):
        payload = {
            "model": "codellama",
            "prompt": (
                "Write a Python function that validates an email address using regex "
                "and returns True or False. Include a short docstring."
            ),
            "stream": False,
            "options": {
                "temperature": 0.2,
                "num_predict": 200
            }
        }
        start = time.time()
        with self.client.post(
            "/api/generate",
            json=payload,
            headers=self.headers,
            catch_response=True,
            name="/api/generate code"
        ) as response:
            elapsed = time.time() - start
            if response.status_code != 200:
                response.failure(f"Code generation failed: {response.status_code}")
                return
            data = response.json()
            output = data.get("response", "")
            if "def " not in output:
                response.failure("Generated code missing function definition")
                return
            response.success()
            events.request.fire(
                request_type="CUSTOM",
                name="ollama_generation_over_10s",
                response_time=elapsed * 1000,
                response_length=len(output),
                exception=None if elapsed <= 10 else Exception("Generation exceeded 10 seconds"),
                context={}
            )
```

Why this scenario matters
This test helps surface issues such as:
- Long cold-start delays
- Model eviction under memory pressure
- Large differences between average and tail latency
- Poor performance for code generation compared with basic text prompts
By adding a custom metric for requests over 10 seconds, you can track whether your Ollama environment meets your service-level expectations.
Analyzing Your Results
After running your Ollama load test in LoadForge, focus on more than just pass/fail results. LLM performance testing requires interpreting both HTTP behavior and inference behavior.
Key metrics to watch
Response time percentiles
For Ollama, p50 is useful, but p95 and p99 are often more important. LLM workloads can show severe tail latency under concurrency.
Look for:
- Stable p50 but rising p95/p99: likely resource contention
- Sudden latency spikes: possible model loading or swapping
- Consistently high latency: hardware may be undersized for the model
Requests per second
Traditional APIs often aim for very high RPS, but LLM endpoints are different. A lower RPS may still be acceptable if each request is computationally expensive.
Compare:
- `/api/tags` RPS vs `/api/generate` RPS
- `/api/embeddings` throughput vs `/api/chat` throughput
Error rate
Watch for:
- HTTP 500 or 502 from reverse proxies
- Timeouts
- Connection resets
- Empty or malformed model responses
A rising error rate under stress testing often indicates the system has exceeded practical concurrency limits.
Token throughput
Ollama users often care more about tokens per second than raw request volume. While Ollama’s API responses may not always expose token timing in the same way as hosted LLM APIs, you can still estimate throughput by:
- Standardizing prompt sizes
- Standardizing `num_predict`
- Comparing completion latency across concurrency levels
If a 200-token response takes 2 seconds at low load and 12 seconds at high load, token throughput is collapsing.
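In practice, recent Ollama versions include timing fields in the final non-streaming `/api/generate` response, notably `eval_count` (generated tokens) and `eval_duration` (generation time in nanoseconds). Assuming those fields are present on your version, a tiny helper can turn them into a tokens-per-second estimate:

```python
def tokens_per_second(data: dict) -> float:
    """Estimate generation throughput from Ollama's timing fields.

    Assumes the final /api/generate response contains eval_count
    (generated token count) and eval_duration (nanoseconds); verify
    these fields exist on your Ollama version before relying on them.
    """
    return data["eval_count"] / (data["eval_duration"] / 1e9)


# Example: 200 tokens generated in 2 seconds is 100 tokens/sec
rate = tokens_per_second({"eval_count": 200, "eval_duration": 2_000_000_000})
```

Logging this value per request (for instance via a custom Locust event, as in the warm/cold scenario above) lets you plot throughput directly against concurrency instead of inferring it from latency alone.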
Using LoadForge effectively
LoadForge makes Ollama performance testing easier because you can:
- Run distributed testing from multiple cloud regions
- Watch real-time reporting during the test
- Compare multiple test runs over time
- Integrate tests into CI/CD pipelines before deploying model or infrastructure changes
- Simulate realistic traffic ramps instead of a single spike
For example, you might run one test from a region close to your Ollama host and another from a distant region to separate network latency from inference latency.
Performance Optimization Tips
If your Ollama load testing results show slowdowns or failures, here are the most common optimization opportunities.
Keep models warm
If cold starts are hurting latency, consider keeping critical models loaded in memory and avoiding frequent model switching.
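One way to do this, based on Ollama's documented behavior of loading a model when a generate request arrives with no prompt, is to send a warm-up request with `keep_alive` set to `-1` so the model stays resident. This is a sketch under that assumption (verify `keep_alive` semantics on your Ollama version); the function names are illustrative:

```python
import json
import os
import urllib.request


def warmup_payload(model: str) -> dict:
    """Payload that asks Ollama to load a model without generating text.

    A generate request with no prompt loads the model; keep_alive=-1
    requests that it stay in memory indefinitely (assumed behavior,
    check your Ollama version's API docs).
    """
    return {"model": model, "keep_alive": -1}


def warm_model(base_url: str, model: str) -> int:
    """Send the warm-up request and return the HTTP status code."""
    req = urllib.request.Request(
        base_url.rstrip("/") + "/api/generate",
        data=json.dumps(warmup_payload(model)).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=60) as resp:
        return resp.status


if __name__ == "__main__":
    warm_model(os.getenv("OLLAMA_HOST", "http://localhost:11434"), "llama3")
```

Running a warm-up like this before the load test starts (or from a Locust `on_start` hook) keeps cold-start latency out of your steady-state measurements; omit it when you explicitly want to measure cold behavior.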
Reduce prompt and output size
Long prompts and large completions dramatically increase compute time. Set sensible limits on:
- Prompt length
- Context size
- `num_predict`
Separate workloads
If embeddings and generation compete for the same hardware, consider isolating them onto different Ollama instances.
Use the right model size
A larger model may improve quality, but if it destroys concurrency and token throughput, it may not be practical for production.
Tune reverse proxy settings
If Ollama is behind Nginx or Traefik, review:
- Read timeouts
- Proxy buffering
- Maximum request size
- Keepalive settings
Scale horizontally where possible
Self-hosted LLM serving can be hard to scale, but if your architecture allows it, multiple Ollama instances behind a load balancer can improve resilience and throughput.
Benchmark with realistic prompts
Tiny prompts can make performance look better than it really is. Use representative workloads from your actual application.
Common Pitfalls to Avoid
Load testing Ollama is not the same as load testing a CRUD API. Here are some mistakes teams commonly make.
Testing only a single short prompt
A short prompt with a tiny response does not represent real user behavior. Include realistic prompt lengths and output sizes.
Ignoring model warm-up effects
If you only measure warm requests, you may miss startup or eviction delays that real users will encounter.
Using too much client-side think time
Large wait times can hide concurrency issues. Make sure your test profile reflects actual usage.
Not validating response content
A 200 response is not enough. Confirm that Ollama actually returned meaningful text, chat content, or embeddings.
Overlooking infrastructure bottlenecks
The problem may not be Ollama itself. Reverse proxies, container limits, GPU drivers, and host memory can all affect performance.
Comparing different tests with different prompt sizes
To compare runs meaningfully, keep prompts, models, and generation settings consistent.
Not testing mixed workloads
Many real applications combine chat, generation, and embeddings. Testing just one endpoint may miss critical contention issues.
Conclusion
Ollama makes self-hosted LLM serving accessible, but production readiness depends on more than getting a model to respond. You need to understand concurrency limits, token throughput, model warm-up behavior, and infrastructure bottlenecks before your users run into them.
With realistic Locust scripts and LoadForge’s cloud-based load testing platform, you can run meaningful performance testing and stress testing against Ollama APIs, whether you are benchmarking a single local deployment or validating a reverse-proxied, authenticated AI service. From distributed testing and real-time reporting to CI/CD integration and global test locations, LoadForge gives you the tools to test self-hosted LLMs with confidence.
If you’re ready to measure how Ollama performs under real-world load, try LoadForge and start building reliable AI infrastructure before performance becomes a production incident.
LoadForge Team
LoadForge is a load and performance testing platform built on Locust. Our team has been shipping load tests against production systems since 2018, and we write these guides from real customer engagements.