
What Is Latency? A Developer's Guide to Measuring and Reducing It


What Is Latency?

Latency is the time it takes for a request to travel from its source to its destination and receive a response. In web development, it most commonly refers to the delay between a user's action — clicking a link, submitting a form, calling an API — and the moment they see the result. Latency is measured in milliseconds (ms) and is one of the most fundamental metrics in application performance.

Latency is often confused with two related but distinct concepts: bandwidth and throughput. A useful analogy is to think of your network connection as a pipe. Latency is how long the pipe is — the time it takes for water to travel from one end to the other. Bandwidth is how wide the pipe is — how much water can flow through it simultaneously. Throughput is the actual volume of water that makes it through the pipe in a given period, which depends on both the pipe's length and its width.

You can have enormous bandwidth (a very wide pipe) and still suffer from high latency (a very long pipe). This is why a satellite internet connection with generous bandwidth still feels sluggish for interactive applications — the signal has to travel to orbit and back, adding hundreds of milliseconds of latency that no amount of bandwidth can eliminate.

Understanding this distinction is critical for diagnosing performance problems. If your application is slow because of high latency, throwing more bandwidth at it will not help. Conversely, if your bottleneck is bandwidth (too much data being transferred), reducing latency alone will not solve the problem. The fix depends on correctly identifying which constraint you are hitting.

Why Latency Matters

Latency is not an abstract metric that only infrastructure engineers care about. It directly shapes how users perceive your application, how search engines rank your pages, and how much revenue your business generates.

User Perception and Experience

Human perception of latency follows well-studied thresholds that have remained remarkably consistent across decades of research:

| Latency | User Perception |
| --- | --- |
| 0-100ms | Feels instantaneous. The user perceives the response as immediate. |
| 100-300ms | Noticeable but acceptable. The user senses a slight delay but the experience still feels responsive. |
| 300-1000ms | Perceptible lag. The user is aware of waiting. Concentration may begin to break. |
| 1000ms+ | Frustrating. The user's mental flow is interrupted. They may question whether the action worked. |
| 10,000ms+ | Abandonment. Most users will leave or retry, often compounding the problem. |

These thresholds matter because users do not think about your application in terms of milliseconds. They think in terms of feeling. An application that consistently responds in under 100ms feels snappy and trustworthy. One that routinely takes 2-3 seconds feels sluggish, even if it is functionally identical.

Search Rankings and SEO

Google has explicitly confirmed that page speed is a ranking factor, and latency is a core component of page speed. Core Web Vitals — Google's set of metrics for measuring real-world user experience — are directly influenced by server-side latency. A slow server response time inflates your Largest Contentful Paint (LCP) score, which can push your pages down in search results. For more on this, see our guide on what are Core Web Vitals.

Conversion and Revenue

The relationship between latency and revenue is well documented. Amazon found that every 100ms of additional latency cost them 1% in sales. Google discovered that a 500ms increase in search results latency reduced traffic by 20%. For e-commerce sites, even small latency improvements translate directly into higher conversion rates and lower cart abandonment.

Mobile Users

A significant portion of global web traffic comes from mobile devices on cellular networks. These connections often carry inherent latency of 50-200ms or more, on top of whatever latency your server and application add. Designing for low latency is not just an optimization for your fastest users — it is essential for making your application usable for the millions of people on constrained connections.

Types of Latency

Latency is not a single number with a single cause. The total latency a user experiences is the sum of multiple components, each of which can be optimized independently.

Network Latency

Network latency is the time spent transmitting data between the client and the server across the network. It includes several distinct stages:

DNS Lookup. Before any connection happens, the client must resolve your domain name to an IP address. This requires querying a DNS server, which can take anywhere from 1ms (cached locally) to 100ms or more (cold lookup hitting authoritative nameservers). Slow DNS can add surprising latency to the first request.

TCP Handshake. Once the IP address is known, the client and server perform a three-way handshake to establish a TCP connection. This requires one full round trip — the client sends a SYN, the server responds with SYN-ACK, and the client sends ACK. For a server 50ms away, this adds 50ms of latency before any application data is exchanged.

TLS Negotiation. For HTTPS connections (which should be all of them in modern applications), an additional TLS handshake follows TCP. With TLS 1.2, this adds two more round trips. TLS 1.3 reduces this to one round trip, and supports 0-RTT resumption for repeat connections — a meaningful latency improvement.

Physical Distance. Data travels through fiber optic cables at roughly two-thirds the speed of light. A request from New York to a server in London covers about 5,500 kilometers and takes roughly 28ms one way. A round trip is at least 56ms, and in practice is often 70-90ms due to routing inefficiencies. Geography imposes a hard floor on network latency that no software optimization can overcome.
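The floor that geography imposes is easy to estimate. The sketch below computes the theoretical minimum round trip from distance alone, assuming signals travel through fiber at roughly two-thirds the speed of light; real routes add more on top:

```python
# Theoretical minimum round-trip time imposed by distance alone,
# assuming light travels through fiber at roughly 2/3 of c.

SPEED_OF_LIGHT_KM_S = 300_000                       # approximate, in vacuum
FIBER_SPEED_KM_S = SPEED_OF_LIGHT_KM_S * 2 / 3      # ~200,000 km/s in fiber

def min_rtt_ms(distance_km: float) -> float:
    """Best-case round trip in milliseconds over a straight fiber path."""
    one_way_s = distance_km / FIBER_SPEED_KM_S
    return one_way_s * 2 * 1000

print(f"New York -> London: {min_rtt_ms(5_500):.0f} ms")   # 55 ms floor
```

Real-world round trips between these cities run 70-90ms because cables do not follow great-circle paths and routers add per-hop delay, but no optimization can get below this physical floor.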

Server Latency

Server latency (sometimes called processing latency or backend latency) is the time your server spends actually handling the request once it arrives. This includes:

Application logic. The time your code takes to process the request — running business rules, transforming data, rendering templates, or computing results. Inefficient algorithms, excessive computation, or blocking synchronous operations all increase server latency.

Database queries. For most web applications, database interactions dominate server latency. A poorly indexed query, a missing join optimization, or a query that scans millions of rows can add hundreds of milliseconds or even seconds to a single request. Multiple sequential database queries compound the problem.

External service calls. If your application calls third-party APIs, payment processors, email services, or microservices during request processing, the latency of those external calls is added to your server latency. A single slow dependency can bottleneck your entire response time.

Client Latency

Client latency is the time the user's browser or device spends processing the response after it arrives. This includes:

HTML parsing and DOM construction. The browser must parse the HTML document and construct the Document Object Model before it can display anything.

JavaScript execution. JavaScript that blocks rendering or runs synchronously on the main thread directly increases the time between receiving the response and the user seeing a usable interface. Heavy frameworks and unoptimized bundles are common culprits.

Rendering and layout. The browser must calculate styles, compute layout, paint pixels, and composite layers. Complex CSS, large DOM trees, and frequent layout recalculations (layout thrashing) all add client-side latency.

While client latency is less directly related to server performance, it contributes to the total latency the user experiences. And critically, client latency is often amplified by server latency — a slow server response means rendering begins later, pushing everything else later in the pipeline.

How to Measure Latency

You cannot improve what you do not measure. Several tools and techniques exist for measuring latency at different levels.

Time to First Byte (TTFB)

Time to First Byte measures the duration from the client sending a request to receiving the first byte of the server's response. It captures network latency plus server processing time and is one of the most widely used latency benchmarks. A good TTFB for web pages is under 200ms; under 100ms is excellent.

Round-Trip Time (RTT)

Round-trip time measures the time for a packet to travel from client to server and back. It captures pure network latency without any server processing. You can measure RTT with a simple ping command, though application-level round trips are usually slower than ICMP ping suggests because of TCP and TLS overhead.

Percentile Metrics

Raw averages are misleading for latency measurement because latency distributions are typically skewed — most requests are fast, but a long tail of slow requests pulls the average up. Percentile metrics give a far more accurate picture:

| Percentile | Meaning | Typical Use |
| --- | --- | --- |
| p50 (Median) | 50% of requests are faster than this value | Represents the "typical" user experience |
| p95 | 95% of requests are faster | The experience for 1-in-20 users; often used in SLAs |
| p99 | 99% of requests are faster | Tail latency; reveals hidden performance problems |

A system with a p50 of 80ms and a p99 of 3,000ms has a very different character than one with a p50 of 150ms and a p99 of 400ms, even though the averages might be similar. Always measure and report percentiles. For a deeper dive into interpreting these numbers, see our post on how to read a load test report.
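To see why averages mislead, here is a small sketch using Python's standard library on a synthetic latency sample where 5% of requests are slow:

```python
# Computing latency percentiles from raw response times (milliseconds).
# A small tail of slow requests barely moves p50 but dominates p99 and the mean.

from statistics import mean, quantiles

latencies = [80] * 95 + [3_000] * 5    # 95 fast requests, 5 slow ones

pct = quantiles(latencies, n=100)      # pct[k-1] is the k-th percentile
p50, p95, p99 = pct[49], pct[94], pct[98]

print(f"mean: {mean(latencies):.0f} ms")   # 226 ms: pulled up by the tail
print(f"p50:  {p50:.0f} ms")               # 80 ms: the typical request is fast
print(f"p95:  {p95:.0f} ms")               # interpolated into the slow group
print(f"p99:  {p99:.0f} ms")               # 3000 ms: the tail tells the real story
```

A dashboard showing only the 226ms mean would hide the fact that one request in twenty takes three seconds.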

Measurement Tools

Browser DevTools. The Network tab in Chrome or Firefox DevTools breaks down every request into DNS, connection, TLS, waiting (TTFB), and content download phases. This is the fastest way to diagnose latency for individual requests.

curl with timing. The curl command-line tool can output detailed timing with the -w flag:

curl -o /dev/null -s -w "DNS: %{time_namelookup}s\nConnect: %{time_connect}s\nTLS: %{time_appconnect}s\nTTFB: %{time_starttransfer}s\nTotal: %{time_total}s\n" https://example.com

Load testing tools. Tools like Locust and LoadForge measure latency across thousands of concurrent requests, giving you percentile distributions under realistic conditions. Single-request measurements tell you about latency in isolation; load testing tells you about latency at scale.

Latency Under Load

Here is the critical insight that separates performance-aware engineers from the rest: latency is not a fixed property of your system. It changes — often dramatically — under load.

When your application serves a single user, requests flow through your server with minimal contention. There is no queue. Database connections are immediately available. CPU cycles are abundant. Latency is at its theoretical minimum.

As concurrent users increase, requests begin competing for shared resources — CPU time, memory, database connections, network bandwidth, disk I/O. This competition introduces queueing delay. Even if each individual request takes the same amount of processing time, the time spent waiting in line before processing begins grows with load.

This relationship follows a pattern from queueing theory known as the hockey stick curve. As resource utilization approaches capacity, latency increases gradually at first, then rockets upward in a near-vertical spike. A server running at 50% CPU utilization might show negligible queueing delay. At 80%, latency begins to climb noticeably. At 95%, it can increase by an order of magnitude or more. The curve looks like a hockey stick lying on its side — flat for a long time, then suddenly vertical.
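The simplest queueing model, M/M/1, makes the hockey stick concrete: mean response time is service time divided by (1 - utilization). This is an idealized illustration, not a prediction for any real server, but the shape of the curve is the same:

```python
# The "hockey stick" from queueing theory: in an M/M/1 queue, mean time
# in the system is service_time / (1 - utilization). Latency barely moves
# at low load, then explodes as utilization approaches 100%.

def mm1_response_ms(service_ms: float, utilization: float) -> float:
    """Mean response time (queueing delay + service) for an M/M/1 queue."""
    if not 0 <= utilization < 1:
        raise ValueError("utilization must be in [0, 1)")
    return service_ms / (1 - utilization)

# A request that takes 50ms of actual work, at rising utilization:
for rho in (0.50, 0.80, 0.95, 0.99):
    print(f"{rho:.0%} utilized -> {mm1_response_ms(50, rho):,.0f} ms")
```

At 50% utilization the 50ms request takes 100ms; at 99% it takes 5,000ms, even though the server is doing exactly the same 50ms of work per request.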

This is exactly why load testing matters. Your application's latency profile at idle tells you almost nothing about its latency under real traffic. A system that responds in 50ms to a single request might respond in 5,000ms when 500 users hit it simultaneously. The only way to discover this is to test under load. For a comprehensive introduction, see our guide on what is load testing.

Practical Strategies to Reduce Latency

Reducing latency requires a systematic approach that addresses each contributing layer. Here are the most effective strategies, ordered roughly by impact.

CDN and Edge Caching

A Content Delivery Network (CDN) caches your content on servers distributed around the world, serving requests from the location closest to the user. This directly reduces network latency by eliminating the physical distance between client and server. For static assets like images, CSS, and JavaScript, a CDN can reduce latency from hundreds of milliseconds to single digits. Modern CDNs can also cache dynamic content and run compute at the edge, further reducing round trips to your origin server.

Database Optimization

Since database queries dominate server latency in most applications, optimizing your database is often the highest-leverage improvement:

  • Indexing. Ensure queries hit indexes rather than performing full table scans. A missing index on a frequently queried column can turn a 2ms query into a 2-second query.
  • Query optimization. Use EXPLAIN to analyze query execution plans. Look for sequential scans, nested loops on large tables, and unnecessary joins.
  • Connection pooling. Establishing a new database connection can take 20-50ms. Connection poolers like PgBouncer (for PostgreSQL) maintain a pool of reusable connections, eliminating this overhead for each request.

Caching Layers

In-memory caches like Redis or Memcached store frequently accessed data in RAM, where lookups take microseconds instead of the milliseconds required for a database query. Caching hot data — user sessions, product catalog entries, computed aggregations — can reduce server latency by 10x or more for cache-hit requests. The key is choosing the right cache invalidation strategy so users see fresh data without sacrificing speed.
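The cache-aside pattern behind this is simple: check the cache first, fall back to the database on a miss, and populate the cache for the next request. A minimal sketch, with a plain dict standing in for Redis and a hypothetical slow_product_lookup standing in for a database query:

```python
# Minimal cache-aside sketch. A dict stands in for Redis/Memcached;
# slow_product_lookup is a hypothetical stand-in for a database query.

import time

cache: dict = {}
db_hits = 0

def slow_product_lookup(product_id: str) -> dict:
    """Pretend database query (hypothetical, ~50ms)."""
    global db_hits
    db_hits += 1
    time.sleep(0.05)
    return {"id": product_id, "name": f"Product {product_id}"}

def get_product(product_id: str) -> dict:
    if product_id in cache:                       # cache hit: microseconds
        return cache[product_id]
    product = slow_product_lookup(product_id)     # cache miss: pay the full cost
    cache[product_id] = product                   # populate for the next request
    return product

get_product("42")   # miss: hits the "database"
get_product("42")   # hit: served from memory
print(db_hits)      # 1
```

In production you would also set a TTL or explicitly invalidate entries on writes; the hard part of caching is not the lookup, it is keeping the cache honest.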

Keep-Alive Connections

Each new TCP connection requires a handshake (one RTT), and each new TLS connection requires negotiation (one or two more RTTs). HTTP keep-alive (persistent connections) reuses existing connections for multiple requests, eliminating this overhead for all but the first request. HTTP/2 goes further with multiplexing, allowing multiple requests and responses to share a single connection simultaneously. Ensure your server, load balancer, and CDN all support and enable keep-alive.
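A back-of-envelope sketch shows how quickly this overhead compounds, assuming one RTT for the TCP handshake and one for TLS 1.3 setup on each new connection:

```python
# Round-trip setup overhead for N HTTPS requests, with and without
# keep-alive, assuming 1 RTT for TCP plus 1 RTT for TLS 1.3 per connection.

def total_setup_ms(requests: int, rtt_ms: float, keep_alive: bool) -> float:
    handshakes = 1 if keep_alive else requests   # connections established
    return handshakes * 2 * rtt_ms               # 2 RTTs (TCP + TLS) each

rtt = 50  # ms to the server
print(f"no keep-alive: {total_setup_ms(20, rtt, False):,.0f} ms of setup")  # 2,000
print(f"keep-alive:    {total_setup_ms(20, rtt, True):,.0f} ms of setup")   # 100
```

Twenty requests to a server 50ms away spend two full seconds on handshakes alone without connection reuse; with keep-alive, the cost is paid once.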

Code Optimization

Server-side code optimization directly reduces processing latency:

  • Async processing. Offload slow operations (email sending, image processing, report generation) to background workers instead of handling them synchronously within the request lifecycle.
  • Reduce blocking operations. Use asynchronous I/O for database queries, HTTP calls, and file operations so that waiting for one resource does not block others.
  • Algorithmic improvements. An O(n^2) operation that is imperceptible with 100 items becomes a bottleneck with 10,000. Profile your code and optimize hot paths.
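The payoff of non-blocking I/O is that independent waits overlap instead of adding up. A minimal asyncio sketch, using asyncio.sleep as a stand-in for any awaitable I/O (a database query, an HTTP call):

```python
# Sequential awaits pay for each wait in turn; asyncio.gather overlaps them.

import asyncio
import time

async def fake_io(delay_s: float) -> None:
    await asyncio.sleep(delay_s)   # stand-in for non-blocking I/O

async def sequential() -> float:
    start = time.perf_counter()
    await fake_io(0.1)
    await fake_io(0.1)
    return time.perf_counter() - start   # ~0.2s: the waits add up

async def concurrent() -> float:
    start = time.perf_counter()
    await asyncio.gather(fake_io(0.1), fake_io(0.1))
    return time.perf_counter() - start   # ~0.1s: the waits overlap

print(f"sequential: {asyncio.run(sequential()):.2f}s")
print(f"concurrent: {asyncio.run(concurrent()):.2f}s")
```

The same principle applies whether the waits are two database queries or a dozen calls to downstream services: if they do not depend on each other, do not wait for them one at a time.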

Geographic Distribution

For applications serving a global audience, deploying your application in multiple regions is one of the most effective latency reduction strategies. A user in Tokyo connecting to a server in Virginia faces a round-trip network latency of 150ms or more on every request. Deploying a server in Tokyo or Singapore reduces that to under 30ms. Multi-region deployment is more complex operationally, but for latency-sensitive applications, it can be the difference between a responsive and an unusable experience.

Measuring Latency with Load Testing

Understanding latency in isolation is useful, but the real insight comes from measuring latency under realistic load. This is where load testing becomes indispensable.

Here is a basic Locust test that measures how response times change as you ramp up concurrent users:

from locust import HttpUser, task, between

class LatencyTestUser(HttpUser):
    wait_time = between(1, 3)

    @task(3)
    def homepage(self):
        self.client.get("/", name="Homepage")

    @task(2)
    def api_endpoint(self):
        self.client.get("/api/products", name="Product API")

    @task(1)
    def search(self):
        self.client.get("/search?q=test", name="Search")

When you run this test with LoadForge, gradually increasing virtual users from 10 to 500 over a 15-minute period, the results reveal your application's latency curve. You will typically see response times hold steady during the early ramp, then begin to climb as you approach your server's capacity, and eventually spike sharply if you exceed it.

LoadForge captures this entire progression in real-time charts, showing response time percentiles (p50, p95, p99) alongside throughput and error rates. This gives you a precise picture of where your latency inflection point lives — the concurrency level at which performance begins to degrade. Armed with that data, you can optimize proactively rather than discovering your limits during a traffic spike in production.

The pattern to watch for is the hockey stick curve described earlier. If your p95 latency remains under 500ms at 200 concurrent users but jumps to 4,000ms at 300 users, you know your current infrastructure comfortably handles 200 users but needs optimization or scaling before it can handle 300. That is actionable intelligence that no amount of single-request testing can provide.

Conclusion

Latency is one of the most important metrics in application performance, and one of the most misunderstood. It is not a single number — it is a composite of network, server, and client delays that shifts dramatically under load. Measuring latency correctly means measuring percentiles under realistic traffic conditions, not just averages against an idle server.

The strategies for reducing latency are well-established: use CDNs to eliminate distance, optimize your database to reduce processing time, cache aggressively, reuse connections, and deploy closer to your users. But the most important step is to actually measure your latency under load, identify your inflection points, and validate that your optimizations work at the traffic levels you expect.

For a broader perspective on how latency fits into overall application performance, see our complete guide to load testing and our performance testing guide. Both cover the metrics, methodologies, and workflows that will help you build applications that stay fast under pressure.
