
Load Testing AWS Bedrock

Introduction

Load testing AWS Bedrock is essential if you plan to serve generative AI features reliably in production. Whether you are building chat assistants, document summarization pipelines, retrieval-augmented generation workflows, or classification services, Bedrock performance under load directly affects user experience, cost, and system stability.

Unlike traditional REST APIs, AI and LLM workloads introduce unique performance testing concerns. Response times vary by prompt size, model choice, output token length, guardrails, and downstream integrations. A single request to AWS Bedrock might complete in under a second for a lightweight inference task, while a larger generation request can take several seconds or longer. Under concurrent traffic, these differences become even more important.

In this guide, you’ll learn how to load test AWS Bedrock using Locust on LoadForge. We’ll cover realistic authentication patterns, actual Bedrock runtime endpoints, model invocation payloads, streaming-like scenarios, and how to compare model latency across different workloads. By the end, you’ll have practical scripts for load testing, performance testing, and stress testing AWS Bedrock before your users hit production traffic.

LoadForge is especially useful here because distributed testing, real-time reporting, cloud-based infrastructure, CI/CD integration, and global test locations make it much easier to simulate realistic AI application traffic at scale.

Prerequisites

Before you start load testing AWS Bedrock, make sure you have the following:

  • An AWS account with access to Amazon Bedrock
  • At least one enabled foundation model in your target AWS region
  • IAM credentials with permission to invoke Bedrock models
  • The Bedrock Runtime endpoint for your region
  • A LoadForge account
  • Basic familiarity with Python and Locust

You’ll typically need IAM permissions such as:

  • bedrock:InvokeModel
  • bedrock:InvokeModelWithResponseStream if you plan to test streaming
  • Note that the newer Converse and ConverseStream APIs are authorized through the same bedrock:InvokeModel permission
  • Optional CloudWatch permissions if correlating server-side metrics
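
As a reference, the permission set above can be expressed as a minimal IAM policy. The sketch below is illustrative only; before using anything like it, scope the Resource ARNs to the specific models and region you actually invoke.

```python
import json

# Illustrative minimal policy for the Bedrock permissions listed above.
# The wildcard Resource grants access to all foundation models in the region;
# tighten it to specific model ARNs in a real deployment.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "bedrock:InvokeModel",
                "bedrock:InvokeModelWithResponseStream",
            ],
            "Resource": "arn:aws:bedrock:us-east-1::foundation-model/*",
        }
    ],
}

print(json.dumps(policy, indent=2))
```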

A realistic AWS setup also includes:

  • Region-specific Bedrock endpoint, such as:
    • https://bedrock-runtime.us-east-1.amazonaws.com
    • https://bedrock-runtime.us-west-2.amazonaws.com
  • Environment variables for credentials:
    • AWS_ACCESS_KEY_ID
    • AWS_SECRET_ACCESS_KEY
    • AWS_SESSION_TOKEN if using temporary credentials
    • AWS_REGION

If you are running tests from LoadForge, store credentials securely in environment variables or secret configuration rather than hardcoding them into your Locust scripts.
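
A small helper keeps credential loading out of your test logic. This is a standard-library sketch; the function name is ours, not part of any AWS SDK:

```python
import os

def load_aws_env_credentials():
    """Read AWS credentials from environment variables.

    AWS_SESSION_TOKEN is optional and only present when using
    temporary credentials (e.g. from STS or SSO).
    """
    return {
        "access_key": os.environ["AWS_ACCESS_KEY_ID"],
        "secret_key": os.environ["AWS_SECRET_ACCESS_KEY"],
        "token": os.environ.get("AWS_SESSION_TOKEN"),
        "region": os.environ.get("AWS_REGION", "us-east-1"),
    }
```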

Understanding AWS Bedrock Under Load

AWS Bedrock is a managed service for foundation model inference, but managed does not mean limitless. Under load, you still need to understand where bottlenecks appear.

Key performance characteristics of Bedrock

When load testing AWS Bedrock, latency is influenced by:

  • Model family and size
  • Input token count
  • Output token count
  • Request concurrency
  • Region selection
  • Prompt complexity
  • Guardrail or moderation overhead
  • Application-side preprocessing and postprocessing

For example, a short prompt sent to an efficient model may respond quickly, while a multi-turn prompt with a high max_tokens value can create significantly longer tail latency.

Common bottlenecks in Bedrock workloads

Typical bottlenecks include:

  • Client-side request signing overhead with AWS Signature Version 4
  • Large JSON payload serialization
  • Long generation times from high token limits
  • Rate limiting or throughput constraints
  • Network latency between your app and the Bedrock region
  • Downstream application bottlenecks in RAG pipelines, such as vector search or document retrieval

Metrics that matter

For AWS Bedrock load testing, focus on:

  • Average response time
  • P95 and P99 latency
  • Requests per second
  • Failure rate
  • Timeouts
  • Error codes such as 429, 5xx, and auth failures
  • Latency by model ID
  • Latency by prompt type
  • Throughput under concurrent users

This is where LoadForge helps a lot. You can distribute traffic from multiple regions, observe real-time reporting, and compare workloads across models and prompt types to spot performance regression early.

Writing Your First Load Test

Let’s start with a basic Bedrock Runtime test using the InvokeModel API. This example sends a simple prompt to Anthropic Claude through AWS Bedrock. It uses SigV4 signing so the request is authenticated the same way a real production client would be.

Basic AWS Bedrock invoke model test

python
from locust import HttpUser, task, between
import os
import json
import boto3
from botocore.auth import SigV4Auth
from botocore.awsrequest import AWSRequest
from botocore.credentials import Credentials
 
class BedrockUser(HttpUser):
    wait_time = between(1, 3)
 
    def on_start(self):
        self.region = os.getenv("AWS_REGION", "us-east-1")
        self.model_id = os.getenv("BEDROCK_MODEL_ID", "anthropic.claude-3-haiku-20240307-v1:0")
        self.host = f"https://bedrock-runtime.{self.region}.amazonaws.com"
        # Locust creates self.client before on_start runs, so setting self.host
        # here alone is too late; point the client at the regional endpoint directly.
        self.client.base_url = self.host
 
        session = boto3.Session()
        creds = session.get_credentials().get_frozen_credentials()
        self.credentials = Credentials(
            creds.access_key,
            creds.secret_key,
            creds.token
        )
 
    def signed_headers(self, method, path, body):
        url = self.host + path
        headers = {
            "Content-Type": "application/json",
            "Accept": "application/json"
        }
 
        request = AWSRequest(method=method, url=url, data=body, headers=headers)
        SigV4Auth(self.credentials, "bedrock", self.region).add_auth(request)
        return dict(request.headers.items())
 
    @task
    def invoke_claude(self):
        path = f"/model/{self.model_id}/invoke"
 
        payload = {
            "anthropic_version": "bedrock-2023-05-31",
            "max_tokens": 200,
            "temperature": 0.3,
            "messages": [
                {
                    "role": "user",
                    "content": [
                        {
                            "type": "text",
                            "text": "Summarize the benefits of load testing an AI chatbot in 3 bullet points."
                        }
                    ]
                }
            ]
        }
 
        body = json.dumps(payload)
        headers = self.signed_headers("POST", path, body)
 
        with self.client.post(
            path,
            data=body,
            headers=headers,
            name="Bedrock Invoke Claude Haiku",
            catch_response=True
        ) as response:
            if response.status_code == 200:
                try:
                    data = response.json()
                    if "content" in data:
                        response.success()
                    else:
                        response.failure("Missing expected content field")
                except Exception as e:
                    response.failure(f"Invalid JSON response: {e}")
            else:
                response.failure(f"Unexpected status code: {response.status_code}, body: {response.text}")

What this script does

This test:

  • Connects to the regional Bedrock Runtime endpoint
  • Signs requests using AWS SigV4
  • Calls the real Bedrock path:
    • /model/{modelId}/invoke
  • Sends a realistic Claude messages payload
  • Validates that the response contains generated content

This is a good starting point for baseline performance testing. Run it first with low concurrency to establish normal latency, then gradually increase user count to see where response times and error rates begin to climb.

Advanced Load Testing Scenarios

Once your basic test works, you should move on to more representative Bedrock workloads. Real applications rarely send a single static prompt. They involve different prompt sizes, model comparisons, and often multiple request types.

Scenario 1: Compare latency across multiple Bedrock models

One of the most valuable AWS Bedrock load testing strategies is comparing models under the same traffic pattern. Maybe Claude Haiku is fast enough for chat, while a larger model is better reserved for premium workflows.

python
from locust import HttpUser, task, between
import os
import json
import random
import boto3
from botocore.auth import SigV4Auth
from botocore.awsrequest import AWSRequest
from botocore.credentials import Credentials
 
class BedrockModelComparisonUser(HttpUser):
    wait_time = between(1, 2)
 
    def on_start(self):
        self.region = os.getenv("AWS_REGION", "us-east-1")
        self.host = f"https://bedrock-runtime.{self.region}.amazonaws.com"
        # Locust creates self.client before on_start runs, so setting self.host
        # here alone is too late; point the client at the regional endpoint directly.
        self.client.base_url = self.host
        self.models = [
            "anthropic.claude-3-haiku-20240307-v1:0",
            "amazon.titan-text-premier-v1:0"
        ]
 
        session = boto3.Session()
        creds = session.get_credentials().get_frozen_credentials()
        self.credentials = Credentials(creds.access_key, creds.secret_key, creds.token)
 
    def signed_headers(self, method, path, body):
        url = self.host + path
        headers = {
            "Content-Type": "application/json",
            "Accept": "application/json"
        }
        request = AWSRequest(method=method, url=url, data=body, headers=headers)
        SigV4Auth(self.credentials, "bedrock", self.region).add_auth(request)
        return dict(request.headers.items())
 
    @task(3)
    def test_short_prompt(self):
        model_id = random.choice(self.models)
        path = f"/model/{model_id}/invoke"
 
        if model_id.startswith("anthropic."):
            payload = {
                "anthropic_version": "bedrock-2023-05-31",
                "max_tokens": 150,
                "temperature": 0.2,
                "messages": [
                    {
                        "role": "user",
                        "content": [
                            {
                                "type": "text",
                                "text": "Write a concise customer support reply explaining a delayed shipment."
                            }
                        ]
                    }
                ]
            }
        else:
            payload = {
                "inputText": "Write a concise customer support reply explaining a delayed shipment.",
                "textGenerationConfig": {
                    "maxTokenCount": 150,
                    "temperature": 0.2,
                    "topP": 0.9
                }
            }
 
        body = json.dumps(payload)
        headers = self.signed_headers("POST", path, body)
 
        self.client.post(
            path,
            data=body,
            headers=headers,
            name=f"Invoke {model_id} Short Prompt"
        )
 
    @task(1)
    def test_long_prompt(self):
        model_id = random.choice(self.models)
        path = f"/model/{model_id}/invoke"
 
        long_context = """
        You are assisting an operations team. Analyze the following incident summary and produce:
        1. A one-paragraph summary
        2. Three likely root causes
        3. Two immediate remediation steps
        Incident details:
        Over the last 24 hours, API response times increased from 250ms to 2.5s during peak traffic.
        Error rates rose from 0.2% to 4.8%. The issue appears correlated with increased document ingestion,
        background embedding jobs, and elevated cache miss rates in the recommendation service.
        """
 
        if model_id.startswith("anthropic."):
            payload = {
                "anthropic_version": "bedrock-2023-05-31",
                "max_tokens": 300,
                "temperature": 0.4,
                "messages": [
                    {
                        "role": "user",
                        "content": [
                            {"type": "text", "text": long_context}
                        ]
                    }
                ]
            }
        else:
            payload = {
                "inputText": long_context,
                "textGenerationConfig": {
                    "maxTokenCount": 300,
                    "temperature": 0.4,
                    "topP": 0.95
                }
            }
 
        body = json.dumps(payload)
        headers = self.signed_headers("POST", path, body)
 
        self.client.post(
            path,
            data=body,
            headers=headers,
            name=f"Invoke {model_id} Long Prompt"
        )

This test helps you compare:

  • Short vs long prompt latency
  • Model-specific throughput
  • Performance differences between providers
  • Tail latency under mixed workloads

In LoadForge, you can quickly see whether one model introduces significantly higher P95 latency and decide whether to route different traffic classes differently.

Scenario 2: Simulate a chat application with conversation context

Many AI applications send conversational history, not isolated prompts. This increases payload size and often reveals latency growth as context accumulates.

python
from locust import HttpUser, task, between
import os
import json
import random
import boto3
from botocore.auth import SigV4Auth
from botocore.awsrequest import AWSRequest
from botocore.credentials import Credentials
 
class BedrockChatUser(HttpUser):
    wait_time = between(2, 5)
 
    def on_start(self):
        self.region = os.getenv("AWS_REGION", "us-east-1")
        self.model_id = os.getenv("BEDROCK_MODEL_ID", "anthropic.claude-3-sonnet-20240229-v1:0")
        self.host = f"https://bedrock-runtime.{self.region}.amazonaws.com"
        # Locust creates self.client before on_start runs, so setting self.host
        # here alone is too late; point the client at the regional endpoint directly.
        self.client.base_url = self.host
 
        session = boto3.Session()
        creds = session.get_credentials().get_frozen_credentials()
        self.credentials = Credentials(creds.access_key, creds.secret_key, creds.token)
 
        self.conversations = [
            [
                {"role": "user", "content": [{"type": "text", "text": "I need help planning a 3-day trip to Seattle."}]},
                {"role": "assistant", "content": [{"type": "text", "text": "Sure, what kind of activities do you enjoy?"}]},
                {"role": "user", "content": [{"type": "text", "text": "Coffee shops, museums, and seafood."}]}
            ],
            [
                {"role": "user", "content": [{"type": "text", "text": "Can you help me draft a performance review?"}]},
                {"role": "assistant", "content": [{"type": "text", "text": "Absolutely. What is the employee's role and key contributions?"}]},
                {"role": "user", "content": [{"type": "text", "text": "Senior backend engineer, improved API latency by 35%."}]}
            ]
        ]
 
    def signed_headers(self, method, path, body):
        url = self.host + path
        headers = {
            "Content-Type": "application/json",
            "Accept": "application/json"
        }
        request = AWSRequest(method=method, url=url, data=body, headers=headers)
        SigV4Auth(self.credentials, "bedrock", self.region).add_auth(request)
        return dict(request.headers.items())
 
    @task
    def continue_chat(self):
        path = f"/model/{self.model_id}/invoke"
        conversation = random.choice(self.conversations)
 
        payload = {
            "anthropic_version": "bedrock-2023-05-31",
            "max_tokens": 250,
            "temperature": 0.6,
            "system": "You are a helpful assistant for a consumer app. Keep responses practical and concise.",
            "messages": conversation + [
                {
                    "role": "user",
                    "content": [
                        {
                            "type": "text",
                            "text": "Please continue the conversation and provide a useful next response."
                        }
                    ]
                }
            ]
        }
 
        body = json.dumps(payload)
        headers = self.signed_headers("POST", path, body)
 
        with self.client.post(
            path,
            data=body,
            headers=headers,
            name="Bedrock Chat Continuation",
            catch_response=True,
            timeout=90
        ) as response:
            if response.status_code == 200:
                response.success()
            elif response.status_code == 429:
                response.failure("Rate limited by Bedrock")
            else:
                response.failure(f"Status {response.status_code}: {response.text}")

This scenario is useful for:

  • Chatbot load testing
  • Measuring the impact of growing conversation history
  • Detecting rate limits during realistic user sessions
  • Validating timeout settings for longer generations

Scenario 3: Test a document summarization workload with large prompts

A common AWS Bedrock production use case is summarizing uploaded reports, support tickets, or internal documents. These requests are larger and often stress both prompt handling and output generation.

python
from locust import HttpUser, task, between
import os
import json
import boto3
from botocore.auth import SigV4Auth
from botocore.awsrequest import AWSRequest
from botocore.credentials import Credentials
 
class BedrockDocumentUser(HttpUser):
    wait_time = between(3, 6)
 
    def on_start(self):
        self.region = os.getenv("AWS_REGION", "us-east-1")
        self.model_id = os.getenv("BEDROCK_MODEL_ID", "anthropic.claude-3-haiku-20240307-v1:0")
        self.host = f"https://bedrock-runtime.{self.region}.amazonaws.com"
        # Locust creates self.client before on_start runs, so setting self.host
        # here alone is too late; point the client at the regional endpoint directly.
        self.client.base_url = self.host
 
        session = boto3.Session()
        creds = session.get_credentials().get_frozen_credentials()
        self.credentials = Credentials(creds.access_key, creds.secret_key, creds.token)
 
        self.document_text = """
        Q2 Operational Review:
        Revenue grew 18% year-over-year, but support ticket volume increased 42%.
        The infrastructure team migrated 60% of workloads to Graviton instances, reducing compute costs by 19%.
        However, customer-facing search latency degraded during regional failover tests.
        Engineering launched a new recommendation engine backed by vector embeddings, but indexing delays caused stale results
        for up to 45 minutes after catalog updates. Security completed SSO rollout for all internal tools, while compliance
        identified logging retention gaps in one payment-related subsystem. Product launched AI-assisted support summaries,
        reducing average handle time by 14%, though hallucination rates were elevated for edge-case refund scenarios.
        """
 
    def signed_headers(self, method, path, body):
        url = self.host + path
        headers = {
            "Content-Type": "application/json",
            "Accept": "application/json"
        }
        request = AWSRequest(method=method, url=url, data=body, headers=headers)
        SigV4Auth(self.credentials, "bedrock", self.region).add_auth(request)
        return dict(request.headers.items())
 
    @task
    def summarize_document(self):
        path = f"/model/{self.model_id}/invoke"
 
        payload = {
            "anthropic_version": "bedrock-2023-05-31",
            "max_tokens": 400,
            "temperature": 0.2,
            "messages": [
                {
                    "role": "user",
                    "content": [
                        {
                            "type": "text",
                            "text": f"""
Summarize the following operational review.
Return JSON with keys: executive_summary, risks, wins, recommended_actions.
 
Document:
{self.document_text}
"""
                        }
                    ]
                }
            ]
        }
 
        body = json.dumps(payload)
        headers = self.signed_headers("POST", path, body)
 
        with self.client.post(
            path,
            data=body,
            headers=headers,
            name="Bedrock Document Summarization",
            catch_response=True,
            timeout=120
        ) as response:
            if response.status_code == 200:
                try:
                    data = response.json()
                    if "content" in data:
                        response.success()
                    else:
                        response.failure("No generated content returned")
                except Exception as e:
                    response.failure(f"Response parsing failed: {e}")
            else:
                response.failure(f"Status {response.status_code}: {response.text}")

This is an effective stress testing pattern for:

  • Longer prompts
  • Structured output generation
  • Higher token usage
  • Realistic enterprise AI workflows

Analyzing Your Results

After running your AWS Bedrock load test, the next step is interpreting the data correctly.

Focus on latency distribution, not just averages

Average response time can hide serious issues. In AI & LLM systems, P95 and P99 latency are often more important because some prompts naturally take longer. If your average latency is 2 seconds but P99 is 18 seconds, your users will feel that inconsistency.

Look for:

  • Stable median latency under normal load
  • Gradual or sudden increases in P95/P99
  • Error spikes as concurrency rises
  • Differences between prompt classes
  • Differences between model IDs
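
If you export raw response times for offline analysis, a nearest-rank percentile calculation makes the tail visible. This is a standard-library sketch with made-up sample values, included to show how a single very slow generation dominates P95 and P99:

```python
def latency_percentiles(samples_ms):
    """Nearest-rank percentiles over a list of response times in milliseconds."""
    ordered = sorted(samples_ms)

    def pct(p):
        # nearest-rank index, clamped to the valid range
        k = max(0, min(len(ordered) - 1, round(p / 100 * len(ordered)) - 1))
        return ordered[k]

    return {"p50": pct(50), "p95": pct(95), "p99": pct(99)}

# Illustrative samples: one 18-second generation dominates the tail metrics
# even though the median stays low.
samples = [220, 240, 260, 300, 310, 350, 400, 900, 2500, 18000]
print(latency_percentiles(samples))
```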

Watch for rate limiting and throttling

If you see 429 errors, you may be hitting Bedrock throughput limits, account quotas, or model-specific service constraints. This is exactly why performance testing before production matters.

Segment these issues by:

  • Region
  • Model
  • Prompt size
  • Concurrent users
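
In production clients (though usually not in the load test itself, where you want throttling to surface in your results), 429s are typically handled with capped exponential backoff and jitter. A sketch, where `invoke` stands in for your Bedrock call and is assumed to return an object with a `status_code` attribute:

```python
import random
import time

def invoke_with_backoff(invoke, max_retries=5, base_delay=0.5):
    """Retry `invoke` on HTTP 429 with capped exponential backoff and full jitter."""
    for attempt in range(max_retries + 1):
        response = invoke()
        if response.status_code != 429:
            return response
        if attempt == max_retries:
            break
        # full jitter: sleep a random fraction of the capped exponential delay
        time.sleep(random.uniform(0, min(base_delay * 2 ** attempt, 20)))
    return response
```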

Compare models by workload type

A model that performs well for short prompts may not scale as well for summarization or multi-turn chat. Use named requests in Locust, like:

  • Bedrock Invoke Claude Haiku
  • Bedrock Chat Continuation
  • Bedrock Document Summarization

These names make LoadForge reports easier to analyze.

Correlate with AWS metrics

For deeper troubleshooting, compare your LoadForge results with AWS-side telemetry:

  • CloudWatch request metrics
  • Bedrock service quotas
  • Application logs
  • Client-side timeout logs
  • Network path latency

LoadForge’s real-time reporting helps you spot issues as they happen, and distributed testing from multiple global test locations can reveal region-specific latency differences.

Performance Optimization Tips

If your AWS Bedrock load testing reveals bottlenecks, these are the first areas to optimize.

Reduce token usage

The biggest lever is often prompt and response size. Lower:

  • max_tokens
  • context length
  • unnecessary system instructions
  • repeated conversation history

Shorter prompts usually mean lower latency and lower cost.
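
One easy win is capping conversation history before each request. The helpers below are a sketch; the names and the four-characters-per-token heuristic are ours, not part of the Bedrock API:

```python
def trim_history(messages, max_turns=6):
    """Keep only the most recent `max_turns` messages before invoking the model."""
    return messages[-max_turns:]

def estimate_tokens(text):
    """Crude size check: roughly 4 characters per token for English text."""
    return max(1, len(text) // 4)
```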

Match the model to the use case

Use smaller, faster models for:

  • classification
  • extraction
  • short chat replies
  • routing tasks

Reserve larger or slower models for:

  • complex reasoning
  • premium user flows
  • offline batch generation

Test by region

If your application runs in us-west-2 but your Bedrock endpoint is in us-east-1, network latency alone can skew performance. Keep your app and model inference as close as possible.

Tune client timeouts carefully

LLM workloads can vary widely. Set realistic timeouts in Locust and in your application. Too short, and you’ll create false failures. Too long, and you may hide poor performance.

Separate workload classes

Do not mix every prompt type into one generic test. Create separate scenarios for:

  • chat
  • summarization
  • extraction
  • long-context analysis

This makes performance bottlenecks much easier to identify.

Use distributed load testing

AI traffic can come from global users. LoadForge’s cloud-based infrastructure and global test locations let you simulate geographically distributed demand patterns that are hard to reproduce from a single machine.

Common Pitfalls to Avoid

Load testing AWS Bedrock has a few traps that can invalidate your results if you’re not careful.

Hardcoding credentials

Never embed AWS keys directly in your Locust script. Use environment variables or secure secret management.

Ignoring SigV4 signing overhead

A toy script that skips real authentication does not reflect production behavior. Signing requests adds real client-side work, so include it in your test.

Using unrealistic prompts

If your production application sends 2 KB prompts with conversation history, don’t benchmark with a 20-word prompt and assume the numbers will hold.

Forgetting model-specific payload formats

Different Bedrock models may require different request schemas. Anthropic payloads differ from Titan payloads. Make sure your scripts match the actual provider format.

Testing only one concurrency level

A single test at 20 users tells you very little. Run:

  • baseline load tests
  • ramp-up tests
  • spike tests
  • stress tests

This reveals where Bedrock latency begins to degrade and where failures start.
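
These profiles can be driven from a single staged schedule. The stage logic below is a plain-Python sketch of what would go inside a Locust LoadTestShape's tick() method; the user counts and durations are illustrative, not recommendations for your quotas:

```python
# Stage boundaries in seconds from test start; values are illustrative.
STAGES = [
    {"until": 120, "users": 10, "spawn_rate": 2},    # baseline
    {"until": 420, "users": 50, "spawn_rate": 5},    # ramp-up
    {"until": 480, "users": 150, "spawn_rate": 50},  # spike
    {"until": 720, "users": 50, "spawn_rate": 10},   # sustained stress / recovery
]

def tick(run_time):
    """Return (user_count, spawn_rate) for the current run time, or None to stop."""
    for stage in STAGES:
        if run_time < stage["until"]:
            return stage["users"], stage["spawn_rate"]
    return None
```

Inside an actual LoadTestShape subclass, the same logic lives in tick() and reads the elapsed time from self.get_run_time().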

Not validating responses

A 200 status code does not guarantee a useful result. Validate response structure so you know successful requests are actually returning model output.
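
A minimal structural check for the Anthropic-style responses used throughout this guide might look like the following sketch, which only accepts responses containing at least one non-empty text block:

```python
def has_generated_text(data):
    """Return True if an Anthropic-style InvokeModel response contains non-empty text."""
    content = data.get("content") or []
    return any(
        block.get("type") == "text" and block.get("text", "").strip()
        for block in content
    )
```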

Overlooking downstream architecture

If your app uses AWS Bedrock as part of a larger RAG workflow, the model may not be the bottleneck. Retrieval, embeddings, caching, and document chunking can all dominate latency. Test those paths separately when possible.

Conclusion

Load testing AWS Bedrock is one of the best ways to prepare AI features for real-world traffic. By testing realistic prompts, comparing models, simulating chat sessions, and measuring document-heavy workloads, you can identify latency problems, throttling risks, and scaling bottlenecks before users do.

With Locust-based scripting on LoadForge, you can build practical AWS Bedrock performance testing scenarios using real authentication patterns and actual Bedrock endpoints. Combined with distributed testing, real-time reporting, CI/CD integration, cloud-based infrastructure, and global test locations, LoadForge gives you a powerful way to validate AI & LLM workloads at scale.

If you’re planning to launch Bedrock-powered features in production, now is the time to test them properly. Try LoadForge and start load testing AWS Bedrock with realistic, scalable scenarios today.
