
How to Load Test Azure OpenAI

Introduction

Azure OpenAI powers many enterprise AI applications, from internal copilots and customer support assistants to document analysis workflows and content generation pipelines. But building with large language models is only half the challenge. The other half is making sure your Azure OpenAI deployment can handle real-world traffic without running into latency spikes, quota bottlenecks, throttling, or reliability issues.

That is why load testing Azure OpenAI is essential. Unlike traditional API performance testing, AI & LLM workloads introduce unique variables: prompt size, output token volume, model latency, streaming behavior, content filtering, and regional capacity constraints. A deployment that performs well for a handful of users may degrade quickly under concurrent traffic, especially when requests are large or when multiple application features share the same Azure OpenAI resource.

In this guide, you will learn how to load test Azure OpenAI using LoadForge and Locust. We will cover realistic Azure OpenAI API scenarios, including chat completions, embeddings, authenticated requests, and mixed workload testing. You will also see how to interpret results so you can validate throughput, response times, quotas, and reliability before your users do.

Prerequisites

Before you begin load testing Azure OpenAI, make sure you have the following:

  • An Azure subscription
  • An Azure OpenAI resource
  • At least one deployed model, such as:
    • gpt-4o-mini
    • gpt-4.1-mini
    • text-embedding-3-large
  • Your Azure OpenAI endpoint, for example:
    • https://my-openai-resource.openai.azure.com
  • An API key for the Azure OpenAI resource
  • The deployment names for the models you want to test
  • A LoadForge account if you want to run distributed testing from cloud-based infrastructure across global test locations

You should also know which API version your application uses. Azure OpenAI requests typically include an api-version query parameter such as:

bash
2024-02-15-preview

or a newer supported version depending on the endpoint and model family.

For LoadForge, it is best to store secrets such as your Azure OpenAI API key in environment variables or LoadForge configuration rather than hardcoding them into your script.

A typical Azure OpenAI request uses:

  • Base URL: https://<resource-name>.openai.azure.com
  • Path: /openai/deployments/<deployment-name>/chat/completions
  • Authentication header: api-key: <your-key>
  • Query parameter: api-version=<version>
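
These pieces can be combined into the full request URL. A minimal sketch of that composition as a helper function; the resource name, deployment name, and API version below are placeholders:

```python
def build_chat_url(resource: str, deployment: str, api_version: str) -> str:
    """Compose a full Azure OpenAI chat completions URL from its parts."""
    base = f"https://{resource}.openai.azure.com"
    path = f"/openai/deployments/{deployment}/chat/completions"
    return f"{base}{path}?api-version={api_version}"

print(build_chat_url("my-openai-resource", "gpt-4o-mini", "2024-02-15-preview"))
# → https://my-openai-resource.openai.azure.com/openai/deployments/gpt-4o-mini/chat/completions?api-version=2024-02-15-preview
```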

Understanding Azure OpenAI Under Load

Azure OpenAI behaves differently from many standard REST APIs during load testing. The main reason is that response time is influenced not just by infrastructure performance, but also by model inference time and token generation.

Key factors that affect Azure OpenAI performance

Model choice

Larger, more capable models typically have higher latency and lower throughput than smaller models. A deployment using a compact model may handle significantly more concurrent requests than one using a frontier model.

Input and output token size

Prompt length matters. A short classification prompt may complete in under a second, while a long retrieval-augmented generation request with a large context window can take much longer. Likewise, setting max_tokens too high can inflate latency and cost.

Concurrency and quotas

Azure OpenAI enforces rate limits and quota constraints. Under load, you may encounter:

  • HTTP 429 Too Many Requests
  • Increased queueing delay
  • Inconsistent latency across requests
  • Reduced throughput when capacity is saturated

This makes stress testing and sustained load testing especially important.

Streaming vs non-streaming responses

Streaming can improve perceived responsiveness for end users, but it changes how you measure timing. For many backend load testing scenarios, non-streaming requests are easier to benchmark consistently.
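
If you do benchmark streaming, the key metric is usually time to first token: how long until the first content delta arrives on the server-sent events stream. A minimal sketch of the parsing side, assuming the standard `data: {...}` SSE framing used by Azure OpenAI streaming responses; recording a timestamp when the first delta appears would give you time to first token:

```python
import json

def parse_sse_deltas(lines):
    """Extract content deltas from Azure OpenAI streaming (SSE) lines.

    Each event arrives as 'data: {...json...}' and the stream ends with
    'data: [DONE]'. Events may contain empty choices, which are skipped.
    """
    deltas = []
    for line in lines:
        if not line.startswith("data: "):
            continue
        body = line[len("data: "):]
        if body == "[DONE]":
            break
        chunk = json.loads(body)
        for choice in chunk.get("choices", []):
            content = choice.get("delta", {}).get("content")
            if content:
                deltas.append(content)
    return deltas
```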

Content filtering and safety checks

Azure OpenAI may apply moderation and content safety checks that add processing overhead or occasionally alter response behavior. Your load test should use realistic prompts that reflect production usage.

Common bottlenecks in Azure OpenAI applications

When performance testing Azure OpenAI, the bottleneck is not always the model itself. You may also see issues in:

  • Application servers that orchestrate prompts
  • Retrieval pipelines calling vector databases
  • Authentication and token issuance layers
  • Network latency between your app and Azure region
  • Client retry logic that amplifies traffic during throttling

A good load testing strategy isolates Azure OpenAI performance first, then validates the full application workflow.

Writing Your First Load Test

Let’s start with a simple Azure OpenAI chat completions load test using Locust. This example sends realistic prompts to a deployed chat model and validates both status codes and response structure.

Basic Azure OpenAI chat completions test

python
from locust import HttpUser, task, between
import os
import random
 
class AzureOpenAIUser(HttpUser):
    wait_time = between(1, 3)
 
    host = os.getenv("AZURE_OPENAI_ENDPOINT", "https://my-openai-resource.openai.azure.com")
 
    deployment_name = os.getenv("AZURE_OPENAI_CHAT_DEPLOYMENT", "gpt-4o-mini")
    api_version = os.getenv("AZURE_OPENAI_API_VERSION", "2024-02-15-preview")
    api_key = os.getenv("AZURE_OPENAI_API_KEY", "")
 
    prompts = [
        "Summarize the benefits of load testing an AI application in 3 bullet points.",
        "Write a short customer support reply apologizing for a delayed shipment.",
        "Explain the difference between throughput and latency in simple terms."
    ]
 
    def on_start(self):
        self.headers = {
            "Content-Type": "application/json",
            "api-key": self.api_key
        }
 
    @task
    def chat_completion(self):
        payload = {
            "messages": [
                {"role": "system", "content": "You are a concise enterprise AI assistant."},
                {"role": "user", "content": random.choice(self.prompts)}
            ],
            "temperature": 0.3,
            "max_tokens": 120
        }
 
        url = f"/openai/deployments/{self.deployment_name}/chat/completions?api-version={self.api_version}"
 
        with self.client.post(
            url,
            json=payload,
            headers=self.headers,
            name="Azure OpenAI Chat Completions",
            catch_response=True
        ) as response:
            if response.status_code != 200:
                response.failure(f"Unexpected status code: {response.status_code} - {response.text}")
                return
 
            try:
                data = response.json()
                choices = data.get("choices", [])
                if not choices or "message" not in choices[0]:
                    response.failure("Missing expected chat completion response structure")
                else:
                    response.success()
            except Exception as e:
                response.failure(f"Invalid JSON response: {e}")

What this test does

This script simulates users sending chat completion requests to Azure OpenAI. It uses:

  • The real Azure OpenAI endpoint path
  • API key authentication
  • A realistic request body with messages, temperature, and max_tokens
  • Validation that checks for a successful and correctly structured response

This is a good starting point for baseline performance testing. Run it first at low concurrency to verify:

  • Your endpoint and deployment name are correct
  • Your API version is supported
  • Your model deployment is healthy
  • Your authentication is working

Once the baseline is stable, you can scale up users in LoadForge and observe latency, requests per second, and error rates in real time.

Advanced Load Testing Scenarios

Real Azure OpenAI applications rarely send one simple prompt repeatedly. They often combine multiple AI operations, use different prompt sizes, and rely on authentication or downstream workflows. Below are more advanced and realistic load testing scenarios.

Scenario 1: Mixed prompt sizes to simulate real enterprise chat traffic

A production chatbot often receives a mix of short and long prompts. This matters because token size directly affects Azure OpenAI performance.

python
from locust import HttpUser, task, between
import os
import random
 
class AzureOpenAIMixedChatUser(HttpUser):
    wait_time = between(1, 2)
    host = os.getenv("AZURE_OPENAI_ENDPOINT", "https://my-openai-resource.openai.azure.com")
 
    deployment_name = os.getenv("AZURE_OPENAI_CHAT_DEPLOYMENT", "gpt-4o-mini")
    api_version = os.getenv("AZURE_OPENAI_API_VERSION", "2024-02-15-preview")
    api_key = os.getenv("AZURE_OPENAI_API_KEY", "")
 
    short_prompts = [
        "Classify this sentiment: 'The onboarding process was smooth and fast.'",
        "Rewrite this sentence to sound more professional: 'Can you send that over ASAP?'"
    ]
 
    long_prompts = [
        """You are assisting an operations manager. Summarize the following incident report and provide
        three recommended next steps:
        On March 12, the payment processing service experienced intermittent failures across the EU region.
        Error rates increased from 0.2% to 8.5% over a 47-minute period. Initial investigation showed elevated
        database connection pool exhaustion after a deployment. The rollback reduced errors, but some queued
        transactions required manual reconciliation. Customer support received 182 tickets during the incident.""",
        """Review this customer feedback and produce a concise executive summary with top themes:
        Customers appreciate the product's reporting features and dashboard customization, but several enterprise
        accounts reported slow response times during peak usage, especially when exporting large datasets.
        Users also requested better SSO documentation and more granular role-based access controls."""
    ]
 
    def on_start(self):
        self.headers = {
            "Content-Type": "application/json",
            "api-key": self.api_key
        }
 
    @task(3)
    def short_request(self):
        self.send_chat_request(random.choice(self.short_prompts), 80, "Chat Short Prompt")
 
    @task(1)
    def long_request(self):
        self.send_chat_request(random.choice(self.long_prompts), 250, "Chat Long Prompt")
 
    def send_chat_request(self, prompt, max_tokens, request_name):
        payload = {
            "messages": [
                {"role": "system", "content": "You are a helpful enterprise assistant."},
                {"role": "user", "content": prompt}
            ],
            "temperature": 0.2,
            "max_tokens": max_tokens
        }
 
        url = f"/openai/deployments/{self.deployment_name}/chat/completions?api-version={self.api_version}"
 
        with self.client.post(
            url,
            json=payload,
            headers=self.headers,
            name=request_name,
            catch_response=True
        ) as response:
            if response.status_code == 200:
                response.success()
            elif response.status_code == 429:
                response.failure("Rate limited by Azure OpenAI")
            else:
                response.failure(f"Unexpected error {response.status_code}: {response.text}")

This scenario helps you answer important questions:

  • How much slower are long-context prompts?
  • At what concurrency level do 429 errors begin?
  • Can your deployment support mixed enterprise traffic patterns?

This is especially useful when validating throughput for AI & LLM applications with varied prompt complexity.

Scenario 2: Testing embeddings for retrieval and vector search workloads

Many Azure OpenAI applications use embeddings for semantic search, retrieval-augmented generation, document matching, and recommendation systems. Embeddings endpoints have different performance characteristics than chat completions, so they should be tested separately.

python
from locust import HttpUser, task, between
import os
import random
 
class AzureOpenAIEmbeddingsUser(HttpUser):
    wait_time = between(0.5, 1.5)
    host = os.getenv("AZURE_OPENAI_ENDPOINT", "https://my-openai-resource.openai.azure.com")
 
    embedding_deployment = os.getenv("AZURE_OPENAI_EMBEDDING_DEPLOYMENT", "text-embedding-3-large")
    api_version = os.getenv("AZURE_OPENAI_API_VERSION", "2024-02-15-preview")
    api_key = os.getenv("AZURE_OPENAI_API_KEY", "")
 
    documents = [
        "Azure OpenAI provides enterprise-grade access to advanced language models.",
        "Load testing helps identify latency regressions before production releases.",
        "Vector search uses embeddings to find semantically similar content.",
        "Performance testing AI APIs requires realistic prompts and concurrency levels."
    ]
 
    def on_start(self):
        self.headers = {
            "Content-Type": "application/json",
            "api-key": self.api_key
        }
 
    @task
    def create_embedding(self):
        payload = {
            "input": random.choice(self.documents)
        }
 
        url = f"/openai/deployments/{self.embedding_deployment}/embeddings?api-version={self.api_version}"
 
        with self.client.post(
            url,
            json=payload,
            headers=self.headers,
            name="Azure OpenAI Embeddings",
            catch_response=True
        ) as response:
            if response.status_code != 200:
                response.failure(f"Embedding request failed: {response.status_code} - {response.text}")
                return
 
            try:
                data = response.json()
                embeddings = data.get("data", [])
                if not embeddings or "embedding" not in embeddings[0]:
                    response.failure("Missing embedding vector in response")
                else:
                    response.success()
            except Exception as e:
                response.failure(f"Invalid embedding response: {e}")

This test is valuable if your application depends on high-throughput embedding generation. In many systems, embeddings are generated in bursts during ingestion or indexing jobs, so you may also want to run stress testing with rapid ramp-up.
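
One way to model those ingestion bursts is a fast step ramp toward a peak. Sketching the ramp as a pure function keeps it easy to reason about; in Locust you could return these values from a custom `LoadTestShape.tick()`, and in LoadForge you can configure an equivalent ramp-up pattern directly. The peak and ramp durations here are illustrative:

```python
def users_at(elapsed_seconds: int, peak_users: int = 200, ramp_seconds: int = 60) -> int:
    """Ramp linearly from 1 user to peak_users over ramp_seconds, then hold."""
    if elapsed_seconds >= ramp_seconds:
        return peak_users
    return max(1, int(peak_users * elapsed_seconds / ramp_seconds))
```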

Scenario 3: Azure Active Directory token-based authentication and multi-endpoint workflow

Some enterprise environments avoid API keys and use Azure AD bearer tokens through managed identities or service principals. You may also want to test a realistic workflow that includes embeddings followed by chat generation.

python
from locust import HttpUser, task, between
import os
import time
import random
import requests
 
class AzureOpenAIWorkflowUser(HttpUser):
    wait_time = between(1, 2)
    host = os.getenv("AZURE_OPENAI_ENDPOINT", "https://my-openai-resource.openai.azure.com")
 
    chat_deployment = os.getenv("AZURE_OPENAI_CHAT_DEPLOYMENT", "gpt-4o-mini")
    embedding_deployment = os.getenv("AZURE_OPENAI_EMBEDDING_DEPLOYMENT", "text-embedding-3-large")
    api_version = os.getenv("AZURE_OPENAI_API_VERSION", "2024-02-15-preview")
 
    tenant_id = os.getenv("AZURE_TENANT_ID", "")
    client_id = os.getenv("AZURE_CLIENT_ID", "")
    client_secret = os.getenv("AZURE_CLIENT_SECRET", "")
 
    token = None
    token_expiry = 0
 
    knowledge_snippets = [
        "LoadForge provides distributed load testing and real-time reporting for API performance testing.",
        "Azure OpenAI quotas can affect throughput and trigger 429 responses under heavy concurrency.",
        "Embeddings are commonly used to support retrieval-augmented generation architectures."
    ]
 
    user_questions = [
        "How can I reduce latency in an enterprise AI application?",
        "What causes throttling in Azure OpenAI?",
        "Why should I load test embeddings separately from chat completions?"
    ]
 
    def get_bearer_token(self):
        if self.token and time.time() < self.token_expiry - 60:
            return self.token
 
        token_url = f"https://login.microsoftonline.com/{self.tenant_id}/oauth2/v2.0/token"
        payload = {
            "grant_type": "client_credentials",
            "client_id": self.client_id,
            "client_secret": self.client_secret,
            "scope": "https://cognitiveservices.azure.com/.default"
        }
 
        response = requests.post(token_url, data=payload, timeout=10)
        response.raise_for_status()
        token_data = response.json()
 
        self.token = token_data["access_token"]
        self.token_expiry = time.time() + int(token_data.get("expires_in", 3600))
        return self.token
 
    def auth_headers(self):
        return {
            "Content-Type": "application/json",
            "Authorization": f"Bearer {self.get_bearer_token()}"
        }
 
    @task
    def embedding_then_chat(self):
        snippet = random.choice(self.knowledge_snippets)
        question = random.choice(self.user_questions)
 
        embedding_payload = {
            "input": snippet
        }
 
        embedding_url = f"/openai/deployments/{self.embedding_deployment}/embeddings?api-version={self.api_version}"
 
        with self.client.post(
            embedding_url,
            json=embedding_payload,
            headers=self.auth_headers(),
            name="Workflow - Embedding Step",
            catch_response=True
        ) as embedding_response:
            if embedding_response.status_code != 200:
                embedding_response.failure(
                    f"Embedding step failed: {embedding_response.status_code} - {embedding_response.text}"
                )
                return
            # Mark the step explicitly rather than relying on default validation
            embedding_response.success()
 
        chat_payload = {
            "messages": [
                {
                    "role": "system",
                    "content": "Answer using the provided context. If the context is insufficient, say so."
                },
                {
                    "role": "user",
                    "content": f"Context: {snippet}\n\nQuestion: {question}"
                }
            ],
            "temperature": 0.1,
            "max_tokens": 180
        }
 
        chat_url = f"/openai/deployments/{self.chat_deployment}/chat/completions?api-version={self.api_version}"
 
        with self.client.post(
            chat_url,
            json=chat_payload,
            headers=self.auth_headers(),
            name="Workflow - Chat Step",
            catch_response=True
        ) as chat_response:
            if chat_response.status_code == 200:
                chat_response.success()
            elif chat_response.status_code == 429:
                chat_response.failure("Workflow chat step rate limited")
            else:
                chat_response.failure(
                    f"Workflow chat step failed: {chat_response.status_code} - {chat_response.text}"
                )

This example is closer to a production AI & LLM workflow. It tests:

  • Azure AD authentication
  • Token refresh behavior
  • Multi-step request chains
  • Separate latency for embeddings and chat completions

In LoadForge, you can run this script across multiple cloud regions to measure how global user traffic affects Azure OpenAI response times.

Analyzing Your Results

Once your Azure OpenAI load test is running, the next step is understanding what the metrics actually mean.

Key metrics to watch

Response time percentiles

Average latency is not enough. Focus on:

  • P50 for typical experience
  • P95 for most users
  • P99 for worst-case behavior

AI APIs often have long-tail latency, especially with larger prompts or under throttling.
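
If you export raw latency samples for offline analysis, these percentiles are easy to compute with the standard library; the uniform toy distribution below is only for illustration:

```python
import statistics

def latency_percentiles(samples_ms):
    """Return P50/P95/P99 from a list of latency samples in milliseconds."""
    qs = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {"p50": qs[49], "p95": qs[94], "p99": qs[98]}

print(latency_percentiles(list(range(1, 101))))
```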

Requests per second

This tells you how much throughput your deployment can sustain. Compare this against your expected production traffic and Azure quota limits.

Error rate

Pay close attention to:

  • 429 Too Many Requests
  • 500-series errors
  • Timeouts
  • Authentication failures

A rising 429 rate often means you have reached a practical throughput limit for your deployment or quota allocation.

Latency by endpoint type

If you test chat and embeddings together, separate them in your reporting. Embeddings may remain stable while chat completions degrade significantly, or vice versa.

How to interpret Azure OpenAI-specific patterns

Gradual latency increase with stable success rate

This often means the model deployment is approaching saturation but has not started throttling yet.

Sudden increase in 429 errors

This usually indicates you hit quota or rate limits. In this case, retry logic in your application may make things worse unless it includes backoff and jitter.

High variability in response times

This may be caused by:

  • Large differences in prompt size
  • Mixed workload types
  • Regional capacity variation
  • Shared deployments used by multiple services

Using LoadForge effectively

LoadForge helps here by offering:

  • Real-time reporting during the test
  • Distributed testing from multiple regions
  • Easy ramp-up patterns for stress testing
  • CI/CD integration so Azure OpenAI performance testing can be automated in release pipelines

For enterprise teams, these features make it much easier to validate Azure OpenAI reliability before production rollouts.

Performance Optimization Tips

If your Azure OpenAI load testing reveals bottlenecks, these are the first areas to optimize.

Reduce prompt size

Shorter prompts usually mean lower latency and better throughput. Remove unnecessary system instructions, repeated context, and verbose formatting.

Tune max_tokens

Do not request more output than you need. A lower max_tokens value can substantially improve response times and reduce costs.

Split workloads by deployment

If embeddings and chat completions share the same resource strategy, consider separating them so one workload does not starve the other.

Use smaller models where appropriate

Not every workflow needs the most capable model. Classification, extraction, and short summarization tasks often perform well on smaller, faster deployments.

Implement retry logic carefully

Use exponential backoff with jitter for 429 responses. Avoid aggressive retries that create retry storms during peak load.
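
A minimal sketch of that policy, using "full jitter": each retry sleeps a random amount up to an exponentially growing cap, which spreads retries out instead of synchronizing them into a thundering herd:

```python
import random

def backoff_delay(attempt: int, base: float = 1.0, cap: float = 30.0) -> float:
    """Delay before retry N: random in [0, min(cap, base * 2**attempt)] seconds."""
    return random.uniform(0, min(cap, base * (2 ** attempt)))

# Usage sketch: time.sleep(backoff_delay(attempt)) before retrying a 429 response
```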

Test by region

Azure OpenAI performance can vary by region. Use LoadForge global test locations to compare latency from the same geographies your users will use.

Validate quota planning

Make sure your Azure quota aligns with expected concurrency, token volume, and peak traffic windows. Load testing is one of the best ways to justify quota increase requests with real data.
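
Azure OpenAI quota is commonly expressed in tokens per minute (TPM), so a back-of-the-envelope check multiplies your expected request rate by average tokens per request. The traffic numbers below are illustrative assumptions:

```python
def required_tpm(requests_per_minute: float, avg_prompt_tokens: int, avg_output_tokens: int) -> int:
    """Rough tokens-per-minute demand: each request consumes prompt plus output tokens."""
    return int(requests_per_minute * (avg_prompt_tokens + avg_output_tokens))

# 120 requests/min with ~800 prompt tokens and ~200 output tokens each
print(required_tpm(120, 800, 200))  # → 120000
```

Compare the result against your deployment's TPM allocation; if projected demand exceeds it, sustained 429s are likely.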

Common Pitfalls to Avoid

Load testing Azure OpenAI is not the same as load testing a simple CRUD API. Here are common mistakes to avoid.

Using unrealistic prompts

If your test prompts are too short or too simple, your results will underestimate production latency. Use realistic message sizes and response expectations.

Ignoring token volume

Two requests per second can have very different infrastructure impact depending on token count. Always consider both request rate and token usage.
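
A rough rule of thumb for English text is about four characters per token; a real tokenizer (such as tiktoken) gives accurate counts, but even a crude estimate makes the request-rate-versus-token-volume distinction concrete:

```python
def estimate_tokens(text: str) -> int:
    """Very rough heuristic: ~4 characters per token for English text."""
    return max(1, len(text) // 4)

short_prompt = "Classify this sentiment: 'Great product.'"
long_prompt = "Summarize the following incident report:" + " context" * 500
# Both could be sent at 2 requests/second, with very different token load:
print(estimate_tokens(short_prompt), estimate_tokens(long_prompt))
```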

Hardcoding secrets

Do not embed API keys or client secrets directly in your Locust scripts. Use environment variables or secure secret management in LoadForge.

Not separating endpoint types

Chat completions, embeddings, and image or audio endpoints all behave differently. Test them independently before combining them into mixed workloads.

Treating 429s as random failures

In Azure OpenAI, 429 responses are often a key signal that you have reached capacity or quota limits. They are not just noise; they are part of the performance story.

Overlooking warm-up behavior

Model deployments and application layers may behave differently at the start of a test. Include a warm-up phase before evaluating steady-state metrics.
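
When computing steady-state metrics offline, one approach is simply to discard samples recorded during the warm-up window. The 60-second window below is an assumption to tune for your system:

```python
def steady_state(samples, warmup_seconds=60):
    """Keep only latencies from (timestamp, latency_ms) samples after warm-up."""
    start = min(t for t, _ in samples)
    return [lat for t, lat in samples if t - start >= warmup_seconds]

samples = [(0, 500), (30, 400), (90, 200), (120, 210)]
print(steady_state(samples))  # → [200, 210]
```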

Testing only from one location

If your users are global, a single-region test may hide real network latency and routing issues. Distributed load testing gives a more accurate picture.

Conclusion

Azure OpenAI can power impressive enterprise AI applications, but success in production depends on more than model quality. You need confidence that your deployments can handle concurrency, maintain acceptable response times, stay within quota boundaries, and recover gracefully under stress.

By using Locust-based scripts and realistic Azure OpenAI request patterns, you can build meaningful load testing and performance testing scenarios for chat completions, embeddings, and full AI workflows. With LoadForge, you can scale those tests using cloud-based infrastructure, monitor results in real time, run distributed testing from global locations, and integrate performance validation into your CI/CD pipeline.

If you are preparing an Azure OpenAI application for production, now is the time to test it properly. Try LoadForge and start validating your Azure OpenAI throughput, latency, and reliability before your users feel the impact.

Try LoadForge free for 7 days

Set up your first load test in under 2 minutes. No commitment.