
Introduction
Load testing AWS Bedrock is essential if you plan to serve generative AI features reliably in production. Whether you are building chat assistants, document summarization pipelines, retrieval-augmented generation workflows, or classification services, Bedrock performance under load directly affects user experience, cost, and system stability.
Unlike traditional REST APIs, AI and LLM workloads introduce unique performance testing concerns. Response times vary by prompt size, model choice, output token length, guardrails, and downstream integrations. A single request to AWS Bedrock might complete in under a second for a lightweight inference task, while a larger generation request can take several seconds or longer. Under concurrent traffic, these differences become even more important.
In this guide, you’ll learn how to load test AWS Bedrock using Locust on LoadForge. We’ll cover realistic authentication patterns, actual Bedrock runtime endpoints, model invocation payloads, streaming-like scenarios, and how to compare model latency across different workloads. By the end, you’ll have practical scripts for load testing, performance testing, and stress testing AWS Bedrock before your users hit production traffic.
LoadForge is especially useful here because distributed testing, real-time reporting, cloud-based infrastructure, CI/CD integration, and global test locations make it much easier to simulate realistic AI application traffic at scale.
Prerequisites
Before you start load testing AWS Bedrock, make sure you have the following:
- An AWS account with access to Amazon Bedrock
- At least one enabled foundation model in your target AWS region
- IAM credentials with permission to invoke Bedrock models
- The Bedrock Runtime endpoint for your region
- A LoadForge account
- Basic familiarity with Python and Locust
You’ll typically need IAM permissions such as:
- bedrock:InvokeModel
- bedrock:InvokeModelWithResponseStream if you plan to test streaming
- bedrock:Converse or related permissions if you use the newer conversation APIs
- Optional CloudWatch permissions if correlating server-side metrics
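As one concrete sketch, a minimal IAM policy granting the invoke permissions might look like the following. The region and wildcard in the Resource ARN are assumptions for illustration; scope them down to specific model IDs in production:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "bedrock:InvokeModel",
        "bedrock:InvokeModelWithResponseStream"
      ],
      "Resource": "arn:aws:bedrock:us-east-1::foundation-model/*"
    }
  ]
}
```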
A realistic AWS setup also includes:
- Region-specific Bedrock endpoint, such as:
- https://bedrock-runtime.us-east-1.amazonaws.com
- https://bedrock-runtime.us-west-2.amazonaws.com
- Environment variables for credentials:
- AWS_ACCESS_KEY_ID
- AWS_SECRET_ACCESS_KEY
- AWS_SESSION_TOKEN if using temporary credentials
- AWS_REGION
If you are running tests from LoadForge, store credentials securely in environment variables or secret configuration rather than hardcoding them into your Locust scripts.
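As a small safeguard, you can fail fast at startup when credentials are missing instead of debugging a confusing SigV4 error mid-test. A minimal sketch, assuming the environment variables listed above (add AWS_SESSION_TOKEN to the list if you rely on temporary credentials):

```python
import os

# Credentials the Locust script expects to find in the environment.
REQUIRED_VARS = ["AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY", "AWS_REGION"]


def check_aws_env(required=REQUIRED_VARS):
    """Raise early if any required AWS environment variable is unset or empty."""
    missing = [name for name in required if not os.getenv(name)]
    if missing:
        raise RuntimeError(f"Missing AWS environment variables: {', '.join(missing)}")
```

Calling check_aws_env() from on_start turns an opaque signing failure into an immediate, readable error.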
Understanding AWS Bedrock Under Load
AWS Bedrock is a managed service for foundation model inference, but managed does not mean limitless. Under load, you still need to understand where bottlenecks appear.
Key performance characteristics of Bedrock
When load testing AWS Bedrock, latency is influenced by:
- Model family and size
- Input token count
- Output token count
- Request concurrency
- Region selection
- Prompt complexity
- Guardrail or moderation overhead
- Application-side preprocessing and postprocessing
For example, a short prompt sent to an efficient model may respond quickly, while a multi-turn prompt with a high max_tokens value can create significantly longer tail latency.
Common bottlenecks in Bedrock workloads
Typical bottlenecks include:
- Client-side request signing overhead with AWS Signature Version 4
- Large JSON payload serialization
- Long generation times from high token limits
- Rate limiting or throughput constraints
- Network latency between your app and the Bedrock region
- Downstream application bottlenecks in RAG pipelines, such as vector search or document retrieval
Metrics that matter
For AWS Bedrock load testing, focus on:
- Average response time
- P95 and P99 latency
- Requests per second
- Failure rate
- Timeouts
- Error codes such as 429, 5xx, and auth failures
- Latency by model ID
- Latency by prompt type
- Throughput under concurrent users
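LoadForge and Locust report latency percentiles for you, but if you export raw response times for offline analysis, P95/P99 can be computed with the nearest-rank method. A minimal sketch for custom post-processing:

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of latency samples, e.g. pct=95 for P95."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # Index of the smallest sample that covers pct percent of all samples.
    rank = max(0, math.ceil(pct / 100 * len(ordered)) - 1)
    return ordered[rank]
```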
This is where LoadForge helps a lot. You can distribute traffic from multiple regions, observe real-time reporting, and compare workloads across models and prompt types to spot performance regressions early.
Writing Your First Load Test
Let’s start with a basic Bedrock Runtime test using the InvokeModel API. This example sends a simple prompt to Anthropic Claude through AWS Bedrock. It uses SigV4 signing so the request is authenticated the same way a real production client would be.
Basic AWS Bedrock invoke model test
```python
from locust import HttpUser, task, between
import os
import json
import boto3
from botocore.auth import SigV4Auth
from botocore.awsrequest import AWSRequest
from botocore.credentials import Credentials

# Locust builds the HTTP client from `host` before on_start runs,
# so the base URL must be resolved at class-definition time.
REGION = os.getenv("AWS_REGION", "us-east-1")


class BedrockUser(HttpUser):
    host = f"https://bedrock-runtime.{REGION}.amazonaws.com"
    wait_time = between(1, 3)

    def on_start(self):
        self.region = REGION
        self.model_id = os.getenv("BEDROCK_MODEL_ID", "anthropic.claude-3-haiku-20240307-v1:0")
        session = boto3.Session()
        creds = session.get_credentials().get_frozen_credentials()
        self.credentials = Credentials(
            creds.access_key,
            creds.secret_key,
            creds.token
        )

    def signed_headers(self, method, path, body):
        url = self.host + path
        headers = {
            "Content-Type": "application/json",
            "Accept": "application/json"
        }
        request = AWSRequest(method=method, url=url, data=body, headers=headers)
        SigV4Auth(self.credentials, "bedrock", self.region).add_auth(request)
        return dict(request.headers.items())

    @task
    def invoke_claude(self):
        path = f"/model/{self.model_id}/invoke"
        payload = {
            "anthropic_version": "bedrock-2023-05-31",
            "max_tokens": 200,
            "temperature": 0.3,
            "messages": [
                {
                    "role": "user",
                    "content": [
                        {
                            "type": "text",
                            "text": "Summarize the benefits of load testing an AI chatbot in 3 bullet points."
                        }
                    ]
                }
            ]
        }
        body = json.dumps(payload)
        headers = self.signed_headers("POST", path, body)
        with self.client.post(
            path,
            data=body,
            headers=headers,
            name="Bedrock Invoke Claude Haiku",
            catch_response=True
        ) as response:
            if response.status_code == 200:
                try:
                    data = response.json()
                    if "content" in data:
                        response.success()
                    else:
                        response.failure("Missing expected content field")
                except Exception as e:
                    response.failure(f"Invalid JSON response: {e}")
            else:
                response.failure(f"Unexpected status code: {response.status_code}, body: {response.text}")
```

What this script does
This test:
- Connects to the regional Bedrock Runtime endpoint
- Signs requests using AWS SigV4
- Calls the real Bedrock path:
/model/{modelId}/invoke
- Sends a realistic Claude messages payload
- Validates that the response contains generated content
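If you want deeper checks than key presence, you can pull the generated text and token counts out of the response body. A sketch assuming the Claude-on-Bedrock response shape (a content list of typed blocks plus a usage object):

```python
def extract_claude_output(data):
    """Return (generated_text, input_tokens, output_tokens) from an invoke response dict."""
    text = "".join(
        block.get("text", "")
        for block in data.get("content", [])
        if block.get("type") == "text"
    )
    usage = data.get("usage", {})
    return text, usage.get("input_tokens"), usage.get("output_tokens")
```

In the response handler you could then fail requests whose generated text is empty, not just those missing the content key.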
This is a good starting point for baseline performance testing. Run it first with low concurrency to establish normal latency, then gradually increase user count to see where response times and error rates begin to climb.
Advanced Load Testing Scenarios
Once your basic test works, you should move on to more representative Bedrock workloads. Real applications rarely send a single static prompt. They involve different prompt sizes, model comparisons, and often multiple request types.
Scenario 1: Compare latency across multiple Bedrock models
One of the most valuable AWS Bedrock load testing strategies is comparing models under the same traffic pattern. Maybe Claude Haiku is fast enough for chat, while a larger model is better reserved for premium workflows.
```python
from locust import HttpUser, task, between
import os
import json
import random
import boto3
from botocore.auth import SigV4Auth
from botocore.awsrequest import AWSRequest
from botocore.credentials import Credentials

REGION = os.getenv("AWS_REGION", "us-east-1")


class BedrockModelComparisonUser(HttpUser):
    # Locust reads `host` when it builds the HTTP client, before on_start runs.
    host = f"https://bedrock-runtime.{REGION}.amazonaws.com"
    wait_time = between(1, 2)

    def on_start(self):
        self.region = REGION
        self.models = [
            "anthropic.claude-3-haiku-20240307-v1:0",
            "amazon.titan-text-premier-v1:0"
        ]
        session = boto3.Session()
        creds = session.get_credentials().get_frozen_credentials()
        self.credentials = Credentials(creds.access_key, creds.secret_key, creds.token)

    def signed_headers(self, method, path, body):
        url = self.host + path
        headers = {
            "Content-Type": "application/json",
            "Accept": "application/json"
        }
        request = AWSRequest(method=method, url=url, data=body, headers=headers)
        SigV4Auth(self.credentials, "bedrock", self.region).add_auth(request)
        return dict(request.headers.items())

    @task(3)
    def test_short_prompt(self):
        model_id = random.choice(self.models)
        path = f"/model/{model_id}/invoke"
        if model_id.startswith("anthropic."):
            payload = {
                "anthropic_version": "bedrock-2023-05-31",
                "max_tokens": 150,
                "temperature": 0.2,
                "messages": [
                    {
                        "role": "user",
                        "content": [
                            {
                                "type": "text",
                                "text": "Write a concise customer support reply explaining a delayed shipment."
                            }
                        ]
                    }
                ]
            }
        else:
            payload = {
                "inputText": "Write a concise customer support reply explaining a delayed shipment.",
                "textGenerationConfig": {
                    "maxTokenCount": 150,
                    "temperature": 0.2,
                    "topP": 0.9
                }
            }
        body = json.dumps(payload)
        headers = self.signed_headers("POST", path, body)
        self.client.post(
            path,
            data=body,
            headers=headers,
            name=f"Invoke {model_id} Short Prompt"
        )

    @task(1)
    def test_long_prompt(self):
        model_id = random.choice(self.models)
        path = f"/model/{model_id}/invoke"
        long_context = """
You are assisting an operations team. Analyze the following incident summary and produce:
1. A one-paragraph summary
2. Three likely root causes
3. Two immediate remediation steps

Incident details:
Over the last 24 hours, API response times increased from 250ms to 2.5s during peak traffic.
Error rates rose from 0.2% to 4.8%. The issue appears correlated with increased document ingestion,
background embedding jobs, and elevated cache miss rates in the recommendation service.
"""
        if model_id.startswith("anthropic."):
            payload = {
                "anthropic_version": "bedrock-2023-05-31",
                "max_tokens": 300,
                "temperature": 0.4,
                "messages": [
                    {
                        "role": "user",
                        "content": [
                            {"type": "text", "text": long_context}
                        ]
                    }
                ]
            }
        else:
            payload = {
                "inputText": long_context,
                "textGenerationConfig": {
                    "maxTokenCount": 300,
                    "temperature": 0.4,
                    "topP": 0.95
                }
            }
        body = json.dumps(payload)
        headers = self.signed_headers("POST", path, body)
        self.client.post(
            path,
            data=body,
            headers=headers,
            name=f"Invoke {model_id} Long Prompt"
        )
```

This test helps you compare:
- Short vs long prompt latency
- Model-specific throughput
- Performance differences between providers
- Tail latency under mixed workloads
In LoadForge, you can quickly see whether one model introduces significantly higher P95 latency and decide whether to route different traffic classes differently.
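One way to act on such a finding is a simple routing rule in your application. A hypothetical sketch where short prompts go to the faster model; the character threshold and the model pairing are assumptions to tune from your own test results:

```python
# Assumed model pairing — substitute whatever your comparison test favored.
FAST_MODEL = "anthropic.claude-3-haiku-20240307-v1:0"
LARGE_MODEL = "amazon.titan-text-premier-v1:0"


def route_model(prompt, threshold_chars=500):
    """Route short prompts to the fast model, longer ones to the larger model."""
    return FAST_MODEL if len(prompt) < threshold_chars else LARGE_MODEL
```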
Scenario 2: Simulate a chat application with conversation context
Many AI applications send conversational history, not isolated prompts. This increases payload size and often reveals latency growth as context accumulates.
```python
from locust import HttpUser, task, between
import os
import json
import random
import boto3
from botocore.auth import SigV4Auth
from botocore.awsrequest import AWSRequest
from botocore.credentials import Credentials

REGION = os.getenv("AWS_REGION", "us-east-1")


class BedrockChatUser(HttpUser):
    # Locust reads `host` when it builds the HTTP client, before on_start runs.
    host = f"https://bedrock-runtime.{REGION}.amazonaws.com"
    wait_time = between(2, 5)

    def on_start(self):
        self.region = REGION
        self.model_id = os.getenv("BEDROCK_MODEL_ID", "anthropic.claude-3-sonnet-20240229-v1:0")
        session = boto3.Session()
        creds = session.get_credentials().get_frozen_credentials()
        self.credentials = Credentials(creds.access_key, creds.secret_key, creds.token)
        # Each conversation ends on a user turn, which is what prompts the next reply.
        self.conversations = [
            [
                {"role": "user", "content": [{"type": "text", "text": "I need help planning a 3-day trip to Seattle."}]},
                {"role": "assistant", "content": [{"type": "text", "text": "Sure, what kind of activities do you enjoy?"}]},
                {"role": "user", "content": [{"type": "text", "text": "Coffee shops, museums, and seafood."}]}
            ],
            [
                {"role": "user", "content": [{"type": "text", "text": "Can you help me draft a performance review?"}]},
                {"role": "assistant", "content": [{"type": "text", "text": "Absolutely. What is the employee's role and key contributions?"}]},
                {"role": "user", "content": [{"type": "text", "text": "Senior backend engineer, improved API latency by 35%."}]}
            ]
        ]

    def signed_headers(self, method, path, body):
        url = self.host + path
        headers = {
            "Content-Type": "application/json",
            "Accept": "application/json"
        }
        request = AWSRequest(method=method, url=url, data=body, headers=headers)
        SigV4Auth(self.credentials, "bedrock", self.region).add_auth(request)
        return dict(request.headers.items())

    @task
    def continue_chat(self):
        path = f"/model/{self.model_id}/invoke"
        conversation = random.choice(self.conversations)
        # Anthropic messages must alternate user/assistant roles, so we send the
        # history as-is (it already ends on a user turn) rather than appending
        # another user message.
        payload = {
            "anthropic_version": "bedrock-2023-05-31",
            "max_tokens": 250,
            "temperature": 0.6,
            "system": "You are a helpful assistant for a consumer app. Keep responses practical and concise. Continue the conversation with a useful next response.",
            "messages": conversation
        }
        body = json.dumps(payload)
        headers = self.signed_headers("POST", path, body)
        with self.client.post(
            path,
            data=body,
            headers=headers,
            name="Bedrock Chat Continuation",
            catch_response=True,
            timeout=90
        ) as response:
            if response.status_code == 200:
                response.success()
            elif response.status_code == 429:
                response.failure("Rate limited by Bedrock")
            else:
                response.failure(f"Status {response.status_code}: {response.text}")
```

This scenario is useful for:
- Chatbot load testing
- Measuring the impact of growing conversation history
- Detecting rate limits during realistic user sessions
- Validating timeout settings for longer generations
Scenario 3: Test a document summarization workload with large prompts
A common AWS Bedrock production use case is summarizing uploaded reports, support tickets, or internal documents. These requests are larger and often stress both prompt handling and output generation.
```python
from locust import HttpUser, task, between
import os
import json
import boto3
from botocore.auth import SigV4Auth
from botocore.awsrequest import AWSRequest
from botocore.credentials import Credentials

REGION = os.getenv("AWS_REGION", "us-east-1")


class BedrockDocumentUser(HttpUser):
    # Locust reads `host` when it builds the HTTP client, before on_start runs.
    host = f"https://bedrock-runtime.{REGION}.amazonaws.com"
    wait_time = between(3, 6)

    def on_start(self):
        self.region = REGION
        self.model_id = os.getenv("BEDROCK_MODEL_ID", "anthropic.claude-3-haiku-20240307-v1:0")
        session = boto3.Session()
        creds = session.get_credentials().get_frozen_credentials()
        self.credentials = Credentials(creds.access_key, creds.secret_key, creds.token)
        self.document_text = """
Q2 Operational Review:
Revenue grew 18% year-over-year, but support ticket volume increased 42%.
The infrastructure team migrated 60% of workloads to Graviton instances, reducing compute costs by 19%.
However, customer-facing search latency degraded during regional failover tests.
Engineering launched a new recommendation engine backed by vector embeddings, but indexing delays caused stale results
for up to 45 minutes after catalog updates. Security completed SSO rollout for all internal tools, while compliance
identified logging retention gaps in one payment-related subsystem. Product launched AI-assisted support summaries,
reducing average handle time by 14%, though hallucination rates were elevated for edge-case refund scenarios.
"""

    def signed_headers(self, method, path, body):
        url = self.host + path
        headers = {
            "Content-Type": "application/json",
            "Accept": "application/json"
        }
        request = AWSRequest(method=method, url=url, data=body, headers=headers)
        SigV4Auth(self.credentials, "bedrock", self.region).add_auth(request)
        return dict(request.headers.items())

    @task
    def summarize_document(self):
        path = f"/model/{self.model_id}/invoke"
        payload = {
            "anthropic_version": "bedrock-2023-05-31",
            "max_tokens": 400,
            "temperature": 0.2,
            "messages": [
                {
                    "role": "user",
                    "content": [
                        {
                            "type": "text",
                            "text": f"""
Summarize the following operational review.
Return JSON with keys: executive_summary, risks, wins, recommended_actions.

Document:
{self.document_text}
"""
                        }
                    ]
                }
            ]
        }
        body = json.dumps(payload)
        headers = self.signed_headers("POST", path, body)
        with self.client.post(
            path,
            data=body,
            headers=headers,
            name="Bedrock Document Summarization",
            catch_response=True,
            timeout=120
        ) as response:
            if response.status_code == 200:
                try:
                    data = response.json()
                    if "content" in data:
                        response.success()
                    else:
                        response.failure("No generated content returned")
                except Exception as e:
                    response.failure(f"Response parsing failed: {e}")
            else:
                response.failure(f"Status {response.status_code}: {response.text}")
```

This is an effective stress testing pattern for:
- Longer prompts
- Structured output generation
- Higher token usage
- Realistic enterprise AI workflows
Analyzing Your Results
After running your AWS Bedrock load test, the next step is interpreting the data correctly.
Focus on latency distribution, not just averages
Average response time can hide serious issues. In AI & LLM systems, P95 and P99 latency are often more important because some prompts naturally take longer. If your average latency is 2 seconds but P99 is 18 seconds, your users will feel that inconsistency.
Look for:
- Stable median latency under normal load
- Gradual or sudden increases in P95/P99
- Error spikes as concurrency rises
- Differences between prompt classes
- Differences between model IDs
Watch for rate limiting and throttling
If you see 429 errors, you may be hitting Bedrock throughput limits, account quotas, or model-specific service constraints. This is exactly why performance testing before production matters.
Segment these issues by:
- Region
- Model
- Prompt size
- Concurrent users
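When production clients hit these limits, most retry with exponential backoff and jitter. A minimal sketch, where the base delay and cap are assumed tuning values:

```python
import random


def backoff_delay(attempt, base=0.5, cap=8.0):
    """Delay in seconds before retry number `attempt`, with full jitter."""
    return random.uniform(0, min(cap, base * (2 ** attempt)))
```

Mirroring your application's retry behavior in the load test tells you whether retries absorb throttling gracefully or just amplify the load.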
Compare models by workload type
A model that performs well for short prompts may not scale as well for summarization or multi-turn chat. Use named requests in Locust, like:
- Bedrock Invoke Claude Haiku
- Bedrock Chat Continuation
- Bedrock Document Summarization
These names make LoadForge reports easier to analyze.
Correlate with AWS metrics
For deeper troubleshooting, compare your LoadForge results with AWS-side telemetry:
- CloudWatch request metrics
- Bedrock service quotas
- Application logs
- Client-side timeout logs
- Network path latency
LoadForge’s real-time reporting helps you spot issues as they happen, and distributed testing from multiple global test locations can reveal region-specific latency differences.
Performance Optimization Tips
If your AWS Bedrock load testing reveals bottlenecks, these are the first areas to optimize.
Reduce token usage
The biggest lever is often prompt and response size. Lower:
- max_tokens
- context length
- unnecessary system instructions
- repeated conversation history
Shorter prompts usually mean lower latency and lower cost.
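For the conversation-history item, a common tactic is bounding how many turns you resend on each request. A sketch where max_turns is an assumed knob; if older context matters, pair this with a rolling summary of the dropped turns:

```python
def trim_history(messages, max_turns=6):
    """Keep only the most recent turns of a chat history to bound prompt size."""
    return messages[-max_turns:]
```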
Match the model to the use case
Use smaller, faster models for:
- classification
- extraction
- short chat replies
- routing tasks
Reserve larger or slower models for:
- complex reasoning
- premium user flows
- offline batch generation
Test by region
If your application runs in us-west-2 but your Bedrock endpoint is in us-east-1, network latency alone can skew performance. Keep your app and model inference as close as possible.
Tune client timeouts carefully
LLM workloads can vary widely. Set realistic timeouts in Locust and in your application. Too short, and you’ll create false failures. Too long, and you may hide poor performance.
Separate workload classes
Do not mix every prompt type into one generic test. Create separate scenarios for:
- chat
- summarization
- extraction
- long-context analysis
This makes performance bottlenecks much easier to identify.
Use distributed load testing
AI traffic can come from global users. LoadForge’s cloud-based infrastructure and global test locations let you simulate geographically distributed demand patterns that are hard to reproduce from a single machine.
Common Pitfalls to Avoid
Load testing AWS Bedrock has a few traps that can invalidate your results if you’re not careful.
Hardcoding credentials
Never embed AWS keys directly in your Locust script. Use environment variables or secure secret management.
Ignoring SigV4 signing overhead
A toy script that skips real authentication does not reflect production behavior. Signing requests adds real client-side work, so include it in your test.
Using unrealistic prompts
If your production application sends 2 KB prompts with conversation history, don’t benchmark with a 20-word prompt and assume the numbers will hold.
Forgetting model-specific payload formats
Different Bedrock models may require different request schemas. Anthropic payloads differ from Titan payloads. Make sure your scripts match the actual provider format.
Testing only one concurrency level
A single test at 20 users tells you very little. Run:
- baseline load tests
- ramp-up tests
- spike tests
- stress tests
This reveals where Bedrock latency begins to degrade and where failures start.
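On LoadForge you configure ramp and spike patterns in the test settings; in plain Locust, a staged profile can be driven from a LoadTestShape, and the staging logic itself is just a lookup. A sketch with illustrative stage values, not recommendations:

```python
# (end_time_seconds, target_users, spawn_rate) — illustrative values only.
STAGES = [
    (60, 10, 5),     # 0-60s: gentle ramp to a baseline
    (180, 50, 10),   # 60-180s: sustained load
    (240, 200, 50),  # 180-240s: spike
    (300, 20, 50),   # 240-300s: recovery
]


def stage_for(run_time):
    """Return (user_count, spawn_rate) for the current run time, or None to stop."""
    for end_time, users, spawn_rate in STAGES:
        if run_time < end_time:
            return (users, spawn_rate)
    return None
```

Returning this tuple from a LoadTestShape.tick() implementation makes Locust follow the staged profile, with None ending the test.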
Not validating responses
A 200 status code does not guarantee a useful result. Validate response structure so you know successful requests are actually returning model output.
Overlooking downstream architecture
If your app uses AWS Bedrock as part of a larger RAG workflow, the model may not be the bottleneck. Retrieval, embeddings, caching, and document chunking can all dominate latency. Test those paths separately when possible.
Conclusion
Load testing AWS Bedrock is one of the best ways to prepare AI features for real-world traffic. By testing realistic prompts, comparing models, simulating chat sessions, and measuring document-heavy workloads, you can identify latency problems, throttling risks, and scaling bottlenecks before users do.
With Locust-based scripting on LoadForge, you can build practical AWS Bedrock performance testing scenarios using real authentication patterns and actual Bedrock endpoints. Combined with distributed testing, real-time reporting, CI/CD integration, cloud-based infrastructure, and global test locations, LoadForge gives you a powerful way to validate AI & LLM workloads at scale.
If you’re planning to launch Bedrock-powered features in production, now is the time to test them properly. Try LoadForge and start load testing AWS Bedrock with realistic, scalable scenarios today.
LoadForge Team
LoadForge is a load and performance testing platform built on Locust. Our team has been shipping load tests against production systems since 2018, and we write these guides from real customer engagements.