
How to Load Test Ollama

Introduction

Running large language models locally with Ollama gives teams more control over privacy, cost, and deployment flexibility—but it also introduces a new performance challenge: how well does your Ollama server behave under real user load?

If you are serving models like Llama 3, Mistral, Code Llama, or other self-hosted LLMs through Ollama, load testing is essential. A single prompt may work perfectly in development, but production traffic introduces concurrency, token generation pressure, GPU or CPU saturation, model loading delays, and memory bottlenecks. Without proper performance testing, users may experience long response times, failed generations, or inconsistent throughput.

In this guide, you’ll learn how to load test Ollama using Locust on LoadForge. We’ll cover realistic Ollama API endpoints, streaming and non-streaming generation patterns, authentication headers commonly used behind reverse proxies, and advanced scenarios like concurrent chat requests, model warm-up behavior, and embeddings workloads. You’ll also learn how to interpret results such as latency, requests per second, and token throughput so you can identify where your self-hosted LLM infrastructure starts to break down.

Because LoadForge uses Locust under the hood, every example here is practical Python code you can adapt directly to your environment. And when you need to scale beyond a single machine, LoadForge’s cloud-based infrastructure, distributed testing, real-time reporting, CI/CD integration, and global test locations make it much easier to simulate realistic AI traffic patterns.

Prerequisites

Before you start load testing Ollama, make sure you have the following:

  • An Ollama server running and reachable over HTTP
  • At least one model already pulled, such as:
    • llama3
    • mistral
    • codellama
    • nomic-embed-text
  • Familiarity with the Ollama HTTP API
  • A LoadForge account for running distributed load tests
  • A clear test goal, such as:
    • Measuring maximum concurrent generations
    • Benchmarking token throughput
    • Finding hardware bottlenecks
    • Stress testing a reverse-proxied Ollama deployment

You should also know your Ollama base URL. Typical examples include:

```text
http://localhost:11434
http://ollama.internal:11434
https://llm.example.com/ollama
```

Useful Ollama API endpoints you’ll likely test include:

  • POST /api/generate — text generation
  • POST /api/chat — chat-style completion
  • POST /api/embeddings — embedding generation
  • GET /api/tags — list installed models
  • POST /api/show — inspect model info
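Before writing any load test logic, it helps to confirm what the /api/tags response looks like. A small parsing helper is sketched below; it assumes the response shape `{"models": [{"name": "..."}]}` that recent Ollama versions return.

```python
import json

def list_model_names(tags_json: dict) -> list:
    """Extract model names from an /api/tags response body.

    Assumes the shape {"models": [{"name": "..."}]}; verify against
    your Ollama version before relying on it.
    """
    return [m.get("name", "") for m in tags_json.get("models", [])]

# Shape-only example; the values are illustrative:
sample = json.loads('{"models": [{"name": "llama3:latest"}, {"name": "mistral:latest"}]}')
print(list_model_names(sample))  # ['llama3:latest', 'mistral:latest']
```

Checking this endpoint by hand first also gives you the exact model tags (for example `llama3:latest`) to use in your test payloads.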

In many production deployments, Ollama is placed behind Nginx, Traefik, or an API gateway. In that case, you may also need:

  • Authorization: Bearer <token>
  • API key headers such as X-API-Key
  • Rate limiting or WAF rules to account for during testing
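To keep every Locust task sending consistent authentication, you can centralize header construction. This is a minimal sketch; `OLLAMA_API_TOKEN` and `OLLAMA_API_KEY` are hypothetical environment variable names, so substitute whatever your gateway actually expects.

```python
import os

def build_headers() -> dict:
    """Assemble shared request headers for a proxied Ollama deployment.

    OLLAMA_API_TOKEN and OLLAMA_API_KEY are hypothetical variable
    names; adjust to match your reverse proxy or gateway.
    """
    headers = {"Content-Type": "application/json"}
    token = os.getenv("OLLAMA_API_TOKEN")
    if token:
        headers["Authorization"] = f"Bearer {token}"
    api_key = os.getenv("OLLAMA_API_KEY")
    if api_key:
        headers["X-API-Key"] = api_key
    return headers
```

Reading credentials from the environment also keeps secrets out of the test script itself, which matters once tests run in CI/CD.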

Understanding Ollama Under Load

Ollama is different from a typical stateless REST API. When you load test Ollama, you are not just testing HTTP handling—you are testing the full model inference pipeline.

What happens during an Ollama request

For a request to /api/generate or /api/chat, Ollama typically needs to:

  1. Accept the HTTP request
  2. Parse the prompt or chat messages
  3. Ensure the model is loaded into memory
  4. Run token generation on CPU or GPU
  5. Stream or return the generated response
  6. Potentially unload or swap models depending on memory pressure

This means performance depends heavily on:

  • Model size
  • Prompt length
  • Output token count
  • Hardware acceleration
  • Available VRAM or RAM
  • Number of concurrent requests
  • Whether the model is already warm in memory

Common Ollama bottlenecks

When load testing Ollama, the most common bottlenecks are:

Model loading latency

The first request after model startup can be much slower than subsequent requests. If models are swapped in and out under pressure, this can happen repeatedly.

GPU or CPU saturation

Even if Ollama responds correctly, token generation speed may collapse when concurrency rises.

Memory pressure

Large models can consume substantial RAM or VRAM. Multiple concurrent users may trigger swapping, degraded throughput, or failures.

Reverse proxy constraints

If Ollama sits behind a proxy, timeouts, buffering, max body sizes, or authentication middleware may become the real bottleneck.

Streaming overhead

Streaming responses are user-friendly, but they can alter connection behavior and impact concurrency differently from non-streaming requests.

A good load test should measure not only whether requests succeed, but also how latency and throughput change as prompt complexity and user concurrency increase.

Writing Your First Load Test

Let’s start with a basic non-streaming generation test against Ollama’s /api/generate endpoint. This is a great first benchmark because it gives you a simple way to measure average response time for a realistic prompt.

Basic Ollama generation load test

```python
from locust import HttpUser, task, between
import os

class OllamaGenerateUser(HttpUser):
    wait_time = between(1, 3)

    host = os.getenv("OLLAMA_HOST", "http://localhost:11434")

    common_headers = {
        "Content-Type": "application/json",
    }

    @task
    def generate_summary(self):
        payload = {
            "model": "llama3",
            "prompt": (
                "Summarize the following support ticket in 3 bullet points:\n\n"
                "Customer reports intermittent timeout errors when uploading PDF files "
                "larger than 10MB through the web dashboard. The issue began after the "
                "latest deployment and affects users in multiple regions."
            ),
            "stream": False,
            "options": {
                "temperature": 0.2,
                "num_predict": 120
            }
        }

        with self.client.post(
            "/api/generate",
            json=payload,
            headers=self.common_headers,
            catch_response=True,
            name="/api/generate summary"
        ) as response:
            if response.status_code != 200:
                response.failure(f"Unexpected status: {response.status_code}")
                return

            # Guard against proxies returning HTML error pages with a 200 status.
            try:
                data = response.json()
            except ValueError:
                response.failure("Response body was not valid JSON")
                return

            if "response" not in data or not data["response"].strip():
                response.failure("Missing generated response")
                return

            response.success()
```

What this test does

This script simulates users sending a summarization prompt to Ollama. It uses:

  • model: llama3 for a realistic local LLM deployment
  • stream: False so Locust measures full response latency
  • num_predict to limit output size and keep the test controlled
  • catch_response=True so we can validate the response body

This is a good starting point for performance testing because it answers a basic question:

“How long does Ollama take to complete a typical generation request under concurrent load?”

Why non-streaming is useful first

Although many LLM applications use streaming, non-streaming tests are easier to interpret. They give you a single end-to-end response time that includes prompt processing and token generation. Once you understand this baseline, you can move on to more advanced scenarios.
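When you do move to streaming, Ollama returns newline-delimited JSON chunks, each carrying a partial "response" string, with the final chunk marked "done": true. A helper to reassemble them is sketched below, assuming /api/generate-style chunks (note that /api/chat streams a "message" object instead, so the field access would differ).

```python
import json

def assemble_streamed_response(lines):
    """Reassemble an Ollama streaming response (newline-delimited JSON).

    Each chunk carries a partial "response" string; the final chunk has
    "done": true and, typically, timing fields.
    Returns (full_text, final_chunk).
    """
    parts = []
    final = {}
    for line in lines:
        if not line.strip():
            continue  # skip keep-alive blank lines
        chunk = json.loads(line)
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):
            final = chunk
            break
    return "".join(parts), final

# Illustrative chunks in the shape /api/generate streams:
stream = [
    '{"response": "Hello", "done": false}',
    '{"response": " world", "done": false}',
    '{"response": "", "done": true, "eval_count": 2}',
]
text, final = assemble_streamed_response(stream)
print(text)  # Hello world
```

In a Locust task you would pass `stream=True` to `self.client.post` and feed `response.iter_lines()` into a helper like this, optionally timing the first chunk separately if perceived latency matters to you.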

Advanced Load Testing Scenarios

Real Ollama deployments rarely serve just one simple prompt type. In production, you may have authenticated traffic, multiple endpoints, chat-style conversations, and embedding generation running side by side. Below are several more realistic load testing scenarios.

Scenario 1: Authenticated chat requests behind a reverse proxy

Many teams expose Ollama through an internal API gateway or reverse proxy with bearer token authentication. This example uses /api/chat to simulate a chatbot assistant workload.

```python
from locust import HttpUser, task, between
import os

class OllamaChatUser(HttpUser):
    wait_time = between(2, 5)
    host = os.getenv("OLLAMA_HOST", "https://llm.example.com")

    def on_start(self):
        self.headers = {
            "Content-Type": "application/json",
            "Authorization": f"Bearer {os.getenv('OLLAMA_API_TOKEN', 'dev-token')}",
            "X-Forwarded-For": "203.0.113.10"
        }

    @task(3)
    def customer_support_chat(self):
        payload = {
            "model": "llama3",
            "messages": [
                {
                    "role": "system",
                    "content": "You are a concise customer support assistant for a SaaS platform."
                },
                {
                    "role": "user",
                    "content": (
                        "A customer says their dashboard loads slowly after login. "
                        "Provide a troubleshooting checklist with no more than 6 steps."
                    )
                }
            ],
            "stream": False,
            "options": {
                "temperature": 0.3,
                "num_predict": 180
            }
        }

        with self.client.post(
            "/api/chat",
            json=payload,
            headers=self.headers,
            catch_response=True,
            name="/api/chat support"
        ) as response:
            if response.status_code != 200:
                response.failure(f"Chat request failed: {response.status_code}")
                return

            data = response.json()
            message = data.get("message", {})
            content = message.get("content", "")

            if not content.strip():
                response.failure("Empty chat response")
                return

            response.success()

    @task(1)
    def list_models_healthcheck(self):
        with self.client.get(
            "/api/tags",
            headers=self.headers,
            catch_response=True,
            name="/api/tags"
        ) as response:
            if response.status_code != 200:
                response.failure("Model listing failed")
                return
            response.success()
```

Why this scenario matters

This test is more production-like because it includes:

  • Bearer token authentication
  • Chat-style prompts
  • Weighted tasks
  • A lightweight health-style endpoint to compare against generation-heavy traffic

If /api/tags remains fast while /api/chat slows dramatically, your bottleneck is likely inference capacity rather than basic HTTP routing.

Scenario 2: Mixed workload with embeddings and generation

Many AI applications use embeddings for semantic search alongside text generation. If your Ollama instance serves both, you need to understand how these workloads interact.

```python
from locust import HttpUser, task, between
import os
import random

DOCUMENTS = [
    "Load testing helps identify latency spikes before users are impacted.",
    "Ollama allows teams to run open models locally with simple HTTP APIs.",
    "Embedding vectors are commonly used for semantic search and retrieval.",
    "GPU memory pressure can reduce throughput during concurrent inference."
]

class OllamaMixedWorkloadUser(HttpUser):
    wait_time = between(1, 2)
    host = os.getenv("OLLAMA_HOST", "http://localhost:11434")

    headers = {
        "Content-Type": "application/json",
        "X-API-Key": os.getenv("OLLAMA_API_KEY", "local-dev-key")
    }

    @task(2)
    def create_embedding(self):
        payload = {
            "model": "nomic-embed-text",
            "prompt": random.choice(DOCUMENTS)
        }

        with self.client.post(
            "/api/embeddings",
            json=payload,
            headers=self.headers,
            catch_response=True,
            name="/api/embeddings"
        ) as response:
            if response.status_code != 200:
                response.failure(f"Embedding failed: {response.status_code}")
                return

            data = response.json()
            embedding = data.get("embedding")
            if not embedding or not isinstance(embedding, list):
                response.failure("Missing embedding vector")
                return

            response.success()

    @task(1)
    def retrieval_augmented_generation(self):
        payload = {
            "model": "llama3",
            "prompt": (
                "Use the following retrieved context to answer the question.\n\n"
                "Context:\n"
                "- Load testing reveals concurrency bottlenecks in self-hosted LLMs.\n"
                "- Token throughput often drops when GPU memory becomes constrained.\n"
                "- Embeddings can support semantic retrieval for RAG applications.\n\n"
                "Question: Why should teams load test Ollama before production rollout?"
            ),
            "stream": False,
            "options": {
                "temperature": 0.1,
                "num_predict": 140
            }
        }

        with self.client.post(
            "/api/generate",
            json=payload,
            headers=self.headers,
            catch_response=True,
            name="/api/generate rag"
        ) as response:
            if response.status_code != 200:
                response.failure(f"RAG generation failed: {response.status_code}")
                return

            data = response.json()
            if not data.get("response", "").strip():
                response.failure("Empty RAG response")
                return

            response.success()
```

Why this scenario matters

This mixed workload helps you answer questions like:

  • Does embedding traffic starve generation requests?
  • Can one Ollama instance support both semantic search and chat?
  • At what concurrency does latency become unacceptable?

This is especially useful for AI applications that combine retrieval and generation in a single user flow.

Scenario 3: Measuring warm vs cold model behavior

One of the most important performance testing scenarios for self-hosted LLMs is model warm-up. The first request may be significantly slower if the model is not already loaded.

```python
from locust import HttpUser, task, between, events
import os
import time

class OllamaWarmColdUser(HttpUser):
    wait_time = between(5, 10)
    host = os.getenv("OLLAMA_HOST", "http://localhost:11434")

    headers = {
        "Content-Type": "application/json"
    }

    @task
    def code_generation_request(self):
        payload = {
            "model": "codellama",
            "prompt": (
                "Write a Python function that validates an email address using regex "
                "and returns True or False. Include a short docstring."
            ),
            "stream": False,
            "options": {
                "temperature": 0.2,
                "num_predict": 200
            }
        }

        start = time.time()
        with self.client.post(
            "/api/generate",
            json=payload,
            headers=self.headers,
            catch_response=True,
            name="/api/generate code"
        ) as response:
            elapsed = time.time() - start

            if response.status_code != 200:
                response.failure(f"Code generation failed: {response.status_code}")
                return

            data = response.json()
            output = data.get("response", "")

            if "def " not in output:
                response.failure("Generated code missing function definition")
                return

            response.success()

            # Record a separate custom metric only when generation exceeds
            # the 10-second budget, so slow-but-successful requests stand out.
            if elapsed > 10:
                events.request.fire(
                    request_type="CUSTOM",
                    name="ollama_generation_over_10s",
                    response_time=elapsed * 1000,
                    response_length=len(output),
                    exception=Exception("Generation exceeded 10 seconds"),
                    context={}
                )
```

Why this scenario matters

This test helps surface issues such as:

  • Long cold-start delays
  • Model eviction under memory pressure
  • Large differences between average and tail latency
  • Poor performance for code generation compared with basic text prompts

By adding a custom metric for requests over 10 seconds, you can track whether your Ollama environment meets your service-level expectations.
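You can also distinguish warm from cold requests directly, because Ollama's non-streaming responses include timing fields reported in nanoseconds, such as load_duration. A sketch assuming that field is present:

```python
def looks_like_cold_start(body: dict, threshold_s: float = 1.0) -> bool:
    """Flag responses that spent significant time loading the model.

    Ollama reports durations in nanoseconds; load_duration is close to
    zero when the model was already resident in memory. The 1-second
    threshold is an assumption; tune it for your hardware.
    """
    return body.get("load_duration", 0) / 1e9 > threshold_s

warm = {"load_duration": 3_000_000}        # ~3 ms: model already loaded
cold = {"load_duration": 4_500_000_000}    # ~4.5 s: loaded on demand
print(looks_like_cold_start(warm), looks_like_cold_start(cold))  # False True
```

Counting cold-start responses separately lets you see whether latency spikes line up with model loading rather than inference itself.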

Analyzing Your Results

After running your Ollama load test in LoadForge, focus on more than just pass/fail results. LLM performance testing requires interpreting both HTTP behavior and inference behavior.

Key metrics to watch

Response time percentiles

For Ollama, p50 is useful, but p95 and p99 are often more important. LLM workloads can show severe tail latency under concurrency.

Look for:

  • Stable p50 but rising p95/p99: likely resource contention
  • Sudden latency spikes: possible model loading or swapping
  • Consistently high latency: hardware may be undersized for the model
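If you export raw latency samples for offline analysis, percentiles are straightforward to compute yourself. A minimal sketch using the nearest-rank method (the sample values below are illustrative):

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile; p in (0, 100], samples in milliseconds."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(p / 100 * len(ordered))  # 1-indexed nearest rank
    return ordered[rank - 1]

# A healthy median can hide a collapsing tail:
latencies = [820, 910, 980, 1100, 1240, 1420, 2600, 9800]
for p in (50, 95, 99):
    print(f"p{p}: {percentile(latencies, p)} ms")
```

Here p50 looks fine while p95 and p99 reveal requests taking many times longer, which is exactly the pattern resource contention produces.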

Requests per second

Traditional APIs often aim for very high RPS, but LLM endpoints are different. A lower RPS may still be acceptable if each request is computationally expensive.

Compare:

  • /api/tags RPS vs /api/generate RPS
  • /api/embeddings throughput vs /api/chat throughput

Error rate

Watch for:

  • HTTP 500 or 502 from reverse proxies
  • Timeouts
  • Connection resets
  • Empty or malformed model responses

A rising error rate under stress testing often indicates the system has exceeded practical concurrency limits.

Token throughput

Ollama users often care more about tokens per second than raw request volume. Ollama's non-streaming responses include timing fields such as eval_count and eval_duration, which let you compute tokens per second directly; you can also estimate relative throughput by:

  • Standardizing prompt sizes
  • Standardizing num_predict
  • Comparing completion latency across concurrency levels

If a 200-token response takes 2 seconds at low load and 12 seconds at high load, token throughput is collapsing.
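When the response body includes Ollama's timing fields, throughput can be computed per request rather than estimated. A sketch assuming eval_count (generated tokens) and eval_duration (nanoseconds) are present, as they are in current /api/generate responses:

```python
def tokens_per_second(body: dict) -> float:
    """Compute generation throughput from Ollama's timing fields.

    eval_count is the number of generated tokens; eval_duration is the
    generation time in nanoseconds. Returns 0.0 if fields are absent.
    """
    count = body.get("eval_count", 0)
    duration_ns = body.get("eval_duration", 0)
    if not count or not duration_ns:
        return 0.0
    return count / (duration_ns / 1e9)

body = {"eval_count": 200, "eval_duration": 2_000_000_000}  # 200 tokens in 2 s
print(tokens_per_second(body))  # 100.0
```

Tracking this value across concurrency levels makes throughput collapse visible even when individual requests still succeed.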

Using LoadForge effectively

LoadForge makes Ollama performance testing easier because you can:

  • Run distributed testing from multiple cloud regions
  • Watch real-time reporting during the test
  • Compare multiple test runs over time
  • Integrate tests into CI/CD pipelines before deploying model or infrastructure changes
  • Simulate realistic traffic ramps instead of a single spike

For example, you might run one test from a region close to your Ollama host and another from a distant region to separate network latency from inference latency.

Performance Optimization Tips

If your Ollama load testing results show slowdowns or failures, here are the most common optimization opportunities.

Keep models warm

If cold starts are hurting latency, consider keeping critical models loaded in memory and avoiding frequent model switching.
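Ollama's generate and chat endpoints accept a keep_alive field that controls how long a model stays resident after a request (for example "30m", or -1 to keep it loaded indefinitely), and sending an empty prompt loads the model without generating any tokens. A minimal warm-up payload sketch:

```python
def warmup_payload(model: str, keep_alive: str = "30m") -> dict:
    """Build a /api/generate payload that loads a model and keeps it
    resident without generating tokens. The "30m" default is an
    arbitrary choice; pick a duration that fits your traffic pattern."""
    return {"model": model, "prompt": "", "keep_alive": keep_alive}

print(warmup_payload("llama3"))  # {'model': 'llama3', 'prompt': '', 'keep_alive': '30m'}
```

Sending a payload like this at the start of a load test (or on a schedule in production) separates cold-start cost from steady-state inference cost in your results.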

Reduce prompt and output size

Long prompts and large completions dramatically increase compute time. Set sensible limits on:

  • Prompt length
  • Context size
  • num_predict

Separate workloads

If embeddings and generation compete for the same hardware, consider isolating them onto different Ollama instances.

Use the right model size

A larger model may improve quality, but if it destroys concurrency and token throughput, it may not be practical for production.

Tune reverse proxy settings

If Ollama is behind Nginx or Traefik, review:

  • Read timeouts
  • Proxy buffering
  • Maximum request size
  • Keepalive settings
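As an illustrative sketch only (the values are assumptions, not recommendations), an Nginx location block fronting Ollama might tune these settings like so:

```nginx
location /ollama/ {
    proxy_pass http://127.0.0.1:11434/;

    # Generations can take minutes; short read timeouts will cut them off.
    proxy_read_timeout 300s;
    proxy_send_timeout 300s;

    # Disable buffering so streamed tokens reach clients immediately.
    proxy_buffering off;

    # Allow large prompts and long chat histories in request bodies.
    client_max_body_size 10m;

    # Reuse upstream connections under load.
    proxy_http_version 1.1;
    proxy_set_header Connection "";
}
```

If load test errors disappear after raising these limits, the proxy, not Ollama, was your bottleneck.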

Scale horizontally where possible

Self-hosted LLM serving can be hard to scale, but if your architecture allows it, multiple Ollama instances behind a load balancer can improve resilience and throughput.

Benchmark with realistic prompts

Tiny prompts can make performance look better than it really is. Use representative workloads from your actual application.

Common Pitfalls to Avoid

Load testing Ollama is not the same as load testing a CRUD API. Here are some mistakes teams commonly make.

Testing only a single short prompt

A short prompt with a tiny response does not represent real user behavior. Include realistic prompt lengths and output sizes.

Ignoring model warm-up effects

If you only measure warm requests, you may miss startup or eviction delays that real users will encounter.

Using too much client-side think time

Large wait times can hide concurrency issues. Make sure your test profile reflects actual usage.

Not validating response content

A 200 response is not enough. Confirm that Ollama actually returned meaningful text, chat content, or embeddings.

Overlooking infrastructure bottlenecks

The problem may not be Ollama itself. Reverse proxies, container limits, GPU drivers, and host memory can all affect performance.

Comparing different tests with different prompt sizes

To compare runs meaningfully, keep prompts, models, and generation settings consistent.

Not testing mixed workloads

Many real applications combine chat, generation, and embeddings. Testing just one endpoint may miss critical contention issues.

Conclusion

Ollama makes self-hosted LLM serving accessible, but production readiness depends on more than getting a model to respond. You need to understand concurrency limits, token throughput, model warm-up behavior, and infrastructure bottlenecks before users hit them first.

With realistic Locust scripts and LoadForge’s cloud-based load testing platform, you can run meaningful performance testing and stress testing against Ollama APIs, whether you are benchmarking a single local deployment or validating a reverse-proxied, authenticated AI service. From distributed testing and real-time reporting to CI/CD integration and global test locations, LoadForge gives you the tools to test self-hosted LLMs with confidence.

If you’re ready to measure how Ollama performs under real-world load, try LoadForge and start building reliable AI infrastructure before performance becomes a production incident.
