
How to Load Test the OpenAI API

Introduction

As more teams build AI-powered products, the OpenAI API often becomes a critical dependency in production workflows. Whether you are generating chat responses, summarizing documents, classifying text, or powering internal copilots, your application's performance is now closely tied to the responsiveness and reliability of the OpenAI API.

That makes load testing the OpenAI API an essential part of performance testing, stress testing, and capacity planning. If your application suddenly receives a spike in user traffic, can your backend handle concurrent requests to OpenAI without unacceptable latency, retries, or failures? Will you hit rate limits? How does response time change when prompts get larger or when you use different models?

In this guide, you will learn how to load test the OpenAI API using LoadForge and Locust. We will cover realistic test scenarios, including chat completions, authentication headers, variable prompt sizes, rate-limit behavior, and multi-step AI workflows. By the end, you will have practical Locust scripts you can run in LoadForge’s cloud-based infrastructure, with distributed testing, real-time reporting, and CI/CD integration to validate your OpenAI-backed applications at scale.

Prerequisites

Before you begin load testing the OpenAI API, make sure you have the following:

  • An OpenAI API key
  • Access to LoadForge
  • A clear understanding of the OpenAI endpoints your application uses
  • Expected traffic patterns, including peak concurrency and request frequency
  • A test environment or use case that avoids generating unnecessary production costs

You should also be familiar with:

  • Basic Python
  • HTTP APIs and JSON payloads
  • Locust task structure
  • OpenAI authentication using the Authorization: Bearer <API_KEY> header

For security, never hardcode your API key directly into your test script. In LoadForge, you should store it as an environment variable or secret and reference it in your Locust code.

A typical OpenAI request uses:

  • Base URL: https://api.openai.com
  • Header: Authorization: Bearer YOUR_API_KEY
  • Header: Content-Type: application/json
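Put together, a minimal sketch of that header setup might look like the helper below. It assumes the key is stored in an `OPENAI_API_KEY` environment variable, which is how the scripts later in this guide read it:

```python
import os

def openai_headers():
    # Read the key from the environment so it never appears in the script itself.
    # KeyError on a missing variable makes a misconfigured test fail fast.
    api_key = os.environ["OPENAI_API_KEY"]
    return {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }
```

In LoadForge, `OPENAI_API_KEY` is populated from the test's environment variables or secrets, so the same function works unchanged.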

If you are testing through your own backend rather than directly against OpenAI, that is often even better because it captures the real performance of your full application stack.

Understanding OpenAI API Under Load

The OpenAI API behaves differently from a traditional CRUD API. When load testing AI and LLM services, you need to think beyond simple request-per-second metrics.

Key performance characteristics

Latency varies by prompt and output size

A short classification request may complete quickly, while a long reasoning or generation request can take significantly longer. Token count matters. Larger prompts and larger completions increase latency and cost.

Throughput is affected by model behavior

Different models have different performance characteristics. A lightweight model may support much higher throughput than a more capable reasoning model. When performance testing the OpenAI API, always test with the exact model and prompt shape you plan to use in production.

Rate limits matter

OpenAI enforces usage limits based on your account tier, model, and token volume. Under load, your application may begin receiving:

  • 429 Too Many Requests
  • intermittent latency spikes
  • retries that amplify traffic

A good load test should measure not only successful response times, but also how your application behaves when rate limiting occurs.

Reliability is part of the test

AI-powered systems are often integrated into user-facing flows. That means reliability under concurrency matters just as much as average latency. You should track:

  • error rate
  • timeout rate
  • rate limit frequency
  • percentile latency, especially p95 and p99

Common bottlenecks when using OpenAI API

When teams load test OpenAI integrations, the bottleneck is not always OpenAI itself. Common issues include:

  • application-side retry storms
  • synchronous request handling that blocks worker threads
  • poor connection pooling
  • oversized prompts
  • lack of caching for repeated queries
  • hitting account-level rate limits before application limits
  • queue buildup in your own backend

This is why load testing and stress testing should reflect realistic user behavior, not just fire identical requests as fast as possible.

Writing Your First Load Test

Let’s start with a simple Locust script that sends chat completion requests directly to the OpenAI API. This gives you a baseline for latency and success rates.

Basic OpenAI chat completions load test

python
from locust import HttpUser, task, between
import os
import random
 
class OpenAIChatUser(HttpUser):
    host = "https://api.openai.com"
    wait_time = between(1, 3)
 
    def on_start(self):
        # update() preserves the session's default headers instead of replacing them
        self.client.headers.update({
            "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
            "Content-Type": "application/json"
        })
 
    @task
    def chat_completion_basic(self):
        prompts = [
            "Summarize the benefits of load testing an API in two sentences.",
            "Explain what rate limiting means for API consumers.",
            "Write a short response describing why latency matters in AI applications."
        ]
 
        payload = {
            "model": "gpt-4o-mini",
            "messages": [
                {"role": "system", "content": "You are a concise technical assistant."},
                {"role": "user", "content": random.choice(prompts)}
            ],
            "max_tokens": 120,
            "temperature": 0.3
        }
 
        with self.client.post(
            "/v1/chat/completions",
            json=payload,
            name="/v1/chat/completions basic",
            catch_response=True,
            timeout=60
        ) as response:
            if response.status_code == 200:
                data = response.json()
                if "choices" in data and len(data["choices"]) > 0:
                    response.success()
                else:
                    response.failure("Missing choices in response")
            else:
                response.failure(f"Unexpected status code: {response.status_code} - {response.text}")

What this test does

This script simulates users sending lightweight chat requests to the OpenAI API. It includes:

  • realistic authentication headers
  • a real endpoint path: /v1/chat/completions
  • multiple prompt variations
  • response validation
  • request naming for easier LoadForge reporting

This is a good starting point for baseline load testing. In LoadForge, you can scale this from a few virtual users to a distributed test across multiple cloud regions to understand how the OpenAI API performs for your expected traffic.

How to run it in LoadForge

  1. Create a new test in LoadForge
  2. Paste in the Locust script
  3. Add OPENAI_API_KEY as an environment variable or secret
  4. Configure user count and spawn rate
  5. Run the test and monitor real-time reporting

For an initial performance testing run, try:

  • 5 to 10 users
  • spawn rate of 1 to 2 users per second
  • test duration of 5 to 10 minutes

This gives you a stable baseline before moving into more aggressive stress testing.

Advanced Load Testing Scenarios

Once you have a baseline, you should test more realistic OpenAI API usage patterns. Most production applications do more than send one simple prompt repeatedly.

Scenario 1: Mixed prompt sizes and realistic content complexity

This test simulates a production workload where some users submit short prompts and others submit larger document-style inputs. This is important because token size has a direct impact on latency and throughput.

python
from locust import HttpUser, task, between
import os
import random
 
SHORT_PROMPTS = [
    "Classify this message as positive, neutral, or negative: 'The onboarding flow was smooth and fast.'",
    "Rewrite this sentence to sound more professional: 'Can you fix this ASAP?'",
    "Extract the main topic from this text: 'Our API performance degraded during peak traffic.'"
]
 
LONG_PROMPTS = [
    """Summarize the following incident report and identify the likely root cause:
 
    During a traffic spike between 09:00 and 09:20 UTC, API response times increased from 300ms to over 4 seconds.
    Error rates rose to 8 percent. Database CPU reached 95 percent utilization. Several retries were triggered
    by upstream services, increasing traffic further. The system recovered after autoscaling added more application
    instances, but database contention remained high until read-heavy endpoints were cached.
    """,
    """Review the following product feedback and provide:
    1. A one paragraph summary
    2. Three recurring themes
    3. Two recommended actions
 
    Feedback:
    - Search is fast, but results are sometimes irrelevant.
    - Mobile experience is smoother after the redesign.
    - Notifications are useful but too frequent.
    - Uploading files occasionally fails without a clear error.
    - The chatbot gives helpful answers, but sometimes takes too long.
    """
]
 
class OpenAIMixedWorkloadUser(HttpUser):
    host = "https://api.openai.com"
    wait_time = between(1, 2)
 
    def on_start(self):
        self.client.headers.update({
            "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
            "Content-Type": "application/json"
        })
 
    @task(3)
    def short_prompt_request(self):
        payload = {
            "model": "gpt-4o-mini",
            "messages": [
                {"role": "system", "content": "You are a precise NLP assistant."},
                {"role": "user", "content": random.choice(SHORT_PROMPTS)}
            ],
            "max_tokens": 100,
            "temperature": 0.2
        }
 
        self.client.post(
            "/v1/chat/completions",
            json=payload,
            name="/v1/chat/completions short",
            timeout=60
        )
 
    @task(1)
    def long_prompt_request(self):
        payload = {
            "model": "gpt-4o-mini",
            "messages": [
                {"role": "system", "content": "You are an expert technical analyst."},
                {"role": "user", "content": random.choice(LONG_PROMPTS)}
            ],
            "max_tokens": 300,
            "temperature": 0.4
        }
 
        self.client.post(
            "/v1/chat/completions",
            json=payload,
            name="/v1/chat/completions long",
            timeout=90
        )

This script uses Locust task weighting to simulate a realistic traffic mix:

  • 75% short requests
  • 25% long requests

That helps you understand how prompt size affects p95 latency, request throughput, and failure rates.

Scenario 2: Rate limit detection and graceful failure tracking

One of the most important OpenAI API load testing scenarios is understanding rate limits. You should know when 429 responses begin and how often they occur under stress.

python
from locust import HttpUser, task, between
import os
 
class OpenAIRateLimitUser(HttpUser):
    host = "https://api.openai.com"
    wait_time = between(0.1, 0.5)
 
    def on_start(self):
        self.client.headers.update({
            "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
            "Content-Type": "application/json"
        })
 
    @task
    def aggressive_chat_requests(self):
        payload = {
            "model": "gpt-4o-mini",
            "messages": [
                {"role": "system", "content": "You are a concise assistant."},
                {"role": "user", "content": "Respond with a one sentence explanation of why API backpressure matters."}
            ],
            "max_tokens": 60,
            "temperature": 0
        }
 
        with self.client.post(
            "/v1/chat/completions",
            json=payload,
            name="/v1/chat/completions rate-limit-test",
            catch_response=True,
            timeout=45
        ) as response:
            if response.status_code == 200:
                response.success()
            elif response.status_code == 429:
                response.failure("Rate limit reached (429)")
            else:
                response.failure(f"Unexpected status: {response.status_code} - {response.text}")

This is a deliberate stress testing scenario. It uses a very short wait time to push concurrency and request frequency higher. The goal is not just to break the system, but to identify:

  • when rate limits begin
  • how quickly 429s increase
  • whether latency degrades before failures appear
  • what safe operating range exists below the limit

In LoadForge, you can run this as a stepped ramp test using distributed testing across regions to understand how your OpenAI usage behaves under increasing pressure.
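LoadForge manages the ramp configuration for you, but the step logic itself is easy to sketch. The timings and user counts below are illustrative, and in a raw Locust script the same function could drive a `LoadTestShape` subclass:

```python
# Illustrative stepped ramp: (end_of_step_in_seconds, concurrent_users)
STEPS = [(120, 5), (240, 10), (360, 20), (480, 40)]

def stepped_users(elapsed_s):
    """Target user count for the current step, or None when the test should stop."""
    for step_end, users in STEPS:
        if elapsed_s < step_end:
            return users
    return None

# In plain Locust this plugs into a load shape, roughly:
#
#   from locust import LoadTestShape
#
#   class SteppedRamp(LoadTestShape):
#       def tick(self):
#           users = stepped_users(self.get_run_time())
#           return (users, users) if users is not None else None
```

Holding each step for a couple of minutes gives latency and 429 rates time to stabilize, so you can attribute any degradation to a specific concurrency level.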

Scenario 3: Multi-step AI workflow with embeddings and chat

Many real applications do not just call one endpoint. They might generate embeddings for retrieval, then pass context into a chat completion. This scenario better reflects AI application load.

python
from locust import HttpUser, task, between
import os
import random
 
DOCUMENTS = [
    "Load testing helps identify latency bottlenecks, rate limits, and reliability issues before production traffic spikes.",
    "Distributed performance testing allows teams to simulate real-world traffic from multiple geographic regions.",
    "Embeddings can be used to power semantic search, recommendation systems, and retrieval-augmented generation workflows."
]
 
QUESTIONS = [
    "What are the benefits of distributed load testing?",
    "How do embeddings support semantic search?",
    "Why is stress testing important before launch?"
]
 
class OpenAIWorkflowUser(HttpUser):
    host = "https://api.openai.com"
    wait_time = between(2, 4)
 
    def on_start(self):
        self.client.headers.update({
            "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
            "Content-Type": "application/json"
        })
 
    @task
    def embedding_then_chat_workflow(self):
        document = random.choice(DOCUMENTS)
        question = random.choice(QUESTIONS)
 
        embedding_payload = {
            "model": "text-embedding-3-small",
            "input": document
        }
 
        with self.client.post(
            "/v1/embeddings",
            json=embedding_payload,
            name="/v1/embeddings",
            catch_response=True,
            timeout=45
        ) as embedding_response:
            if embedding_response.status_code != 200:
                embedding_response.failure(
                    f"Embedding failed: {embedding_response.status_code} - {embedding_response.text}"
                )
                return
            embedding_response.success()
 
        chat_payload = {
            "model": "gpt-4o-mini",
            "messages": [
                {
                    "role": "system",
                    "content": "Answer using the provided context only."
                },
                {
                    "role": "user",
                    "content": f"Context: {document}\n\nQuestion: {question}"
                }
            ],
            "max_tokens": 150,
            "temperature": 0.2
        }
 
        with self.client.post(
            "/v1/chat/completions",
            json=chat_payload,
            name="/v1/chat/completions rag-workflow",
            catch_response=True,
            timeout=60
        ) as chat_response:
            if chat_response.status_code == 200:
                data = chat_response.json()
                if data.get("choices"):
                    chat_response.success()
                else:
                    chat_response.failure("No choices returned from chat completion")
            else:
                chat_response.failure(
                    f"Chat failed: {chat_response.status_code} - {chat_response.text}"
                )

This workflow is useful for performance testing retrieval-augmented applications, semantic search systems, and AI assistants that rely on both embeddings and generation.

It also reveals an important truth about load testing AI systems: end-to-end latency is often the sum of multiple API calls, not a single request.

Analyzing Your Results

After running your OpenAI API load test in LoadForge, focus on a few key metrics.

Response time percentiles

Average latency is helpful, but p95 and p99 are more important. AI workloads often have long-tail latency, especially for larger prompts. If your average is 1.5 seconds but p95 is 8 seconds, many real users will feel the slowdown.

Requests per second

Track how many successful requests per second your system can sustain. For OpenAI API load testing, higher throughput is not always better if it comes at the cost of increased 429s or degraded user experience.

Error rate

Pay close attention to:

  • 429 Too Many Requests
  • 5xx upstream failures such as 500 or 502
  • timeouts
  • malformed responses
  • connection errors

Segmenting request names in Locust, such as /v1/chat/completions short versus /v1/chat/completions long, makes this analysis much easier in LoadForge’s real-time reporting dashboard.

Latency by scenario

Compare different request types:

  • short prompts vs long prompts
  • embeddings vs chat completions
  • single-step vs multi-step workflows

This helps you identify which operations are safe for synchronous user flows and which may need background processing or queueing.

Concurrency thresholds

A good stress testing outcome is identifying the point where performance begins to degrade. For example:

  • up to 20 users: stable
  • 20 to 40 users: latency rises sharply
  • above 40 users: 429 errors appear frequently

That gives you a practical operating range and informs autoscaling, retry policies, and traffic shaping.

Performance Optimization Tips

If your load testing reveals bottlenecks, these optimizations often help with OpenAI API integrations.

Reduce token volume

Prompt size and output size directly affect latency. Keep prompts concise, remove unnecessary context, and set reasonable max_tokens values.

Cache repeated results

If users frequently ask similar questions or request repeated summaries, caching can reduce API traffic and improve performance.
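A minimal in-memory sketch of this idea keys the cache on the exact model and messages. Here `call_api` is a stand-in for whatever function actually performs the OpenAI request:

```python
import hashlib
import json

_cache = {}

def cached_completion(model, messages, call_api, max_entries=1000):
    """Reuse the stored reply for an identical (model, messages) pair; call the API otherwise."""
    key = hashlib.sha256(
        json.dumps([model, messages], sort_keys=True).encode()
    ).hexdigest()
    if key in _cache:
        return _cache[key]
    result = call_api(model, messages)
    if len(_cache) < max_entries:  # crude size bound; a real cache would evict (e.g. LRU)
        _cache[key] = result
    return result
```

In production you would typically add a TTL and an eviction policy, and only cache deterministic requests (for example, temperature 0), since sampled outputs vary between calls.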

Use the right model

Not every workflow needs the most advanced model. For many classification, summarization, or extraction tasks, a smaller and faster model can significantly improve throughput.

Implement backoff and retry carefully

Retries should use exponential backoff and jitter. Aggressive retries during rate limiting can make performance worse and increase costs.
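As a sketch of that pattern, the wrapper below retries only on 429 and sleeps with full jitter. `send` is a stand-in for whatever function issues the request, and the tuning values are illustrative:

```python
import random
import time

def call_with_backoff(send, max_retries=5, base=0.5, cap=30.0, sleep=time.sleep):
    """Retry `send()` on 429 with exponential backoff and full jitter."""
    for attempt in range(max_retries):
        response = send()
        if response.status_code != 429:
            return response
        # Full jitter: a random delay in [0, min(cap, base * 2**attempt)]
        # avoids synchronized retry storms across many clients.
        sleep(random.uniform(0, min(cap, base * 2 ** attempt)))
    return send()  # final attempt; let the caller handle a persistent 429
```

Capping both the retry count and the maximum delay keeps a rate-limited client from amplifying traffic, which is exactly the failure mode the stress test in Scenario 2 is designed to surface.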

Separate synchronous and asynchronous flows

For user-facing interactions, keep requests small and fast. For large document processing or batch workflows, move work into background jobs.

Test from multiple regions

If your users are global, use LoadForge’s global test locations and distributed testing to understand regional latency differences and validate real-world performance.

Common Pitfalls to Avoid

Load testing the OpenAI API has some unique pitfalls.

Testing with unrealistic prompts

If your production prompts are large and complex, a tiny demo prompt will not give meaningful performance data. Match real prompt structure as closely as possible.

Ignoring rate limits

Many teams run a load test, see failures, and assume the API is unstable. In reality, they may simply be hitting account-level rate limits. Always interpret results in the context of your plan and usage tier.

Hardcoding API keys

Never put secrets directly in Locust scripts. Use environment variables in LoadForge.

Measuring only average latency

AI systems often have wide latency variation. Always inspect p95 and p99, not just averages.

Overlooking cost during stress testing

OpenAI API stress testing can generate real usage charges. Start small, validate scripts, and scale carefully.

Testing OpenAI directly when your real bottleneck is elsewhere

If your application calls OpenAI through your own API gateway, backend, or queueing layer, you should also load test that full path. Otherwise, you may miss the actual bottleneck in your architecture.

Not validating response content

A 200 OK does not always mean the response is usable. Add basic checks for expected fields like choices or embedding data.

Conclusion

Load testing the OpenAI API is about more than just sending requests at scale. You need to understand latency by prompt size, throughput under concurrency, rate-limit behavior, and the reliability of real AI workflows such as chat completions and embeddings.

With LoadForge, you can build realistic Locust-based performance testing and stress testing scenarios for the OpenAI API, run them across cloud-based infrastructure, distribute traffic globally, monitor results in real time, and integrate testing into your CI/CD pipeline. That makes it much easier to validate your AI application before users experience slowdowns or failures in production.

If you are building on the OpenAI API, now is the time to test how it performs under real-world load. Try LoadForge and start measuring your OpenAI API performance with confidence.
