Microservices Re-Explained · 107 Rate Limiting: Token Buckets, Leaky Buckets, and Sliding Windows Are Not the Same Thing
What the three core algorithms actually do, where each one breaks under real traffic, and how you choose the right one before a burst attack forces the decision for you.
A single enterprise client had automated their integration tests to run against the production API. Not staging. Production. Every test run fired 4,000 requests in under 90 seconds. The third test run of the day took down the recommendation service for everyone else.
The on-call engineer pulled up the dashboard. CPU: fine. Memory: fine. Database connections: fine. But request queue depth on the recommendation service was at 11,000 and climbing.
“We have rate limiting,” someone said on the incident call.
They did have rate limiting. They had a fixed window counter — the simplest possible implementation. The client had discovered, entirely by accident, that 4,000 requests in the last 10 seconds of one window plus 4,000 in the first 10 seconds of the next added up to 8,000 requests in 20 seconds. The limit was 5,000 per minute. Technically, neither window had been exceeded.
They had rate limiting. What they didn’t have was a rate limiting algorithm suited to their traffic shape.
There’s a difference. Most teams learn it the hard way.
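The exploit is easy to reproduce. Here is a minimal in-memory fixed-window counter (illustrative names and numbers, not the incident's actual code) showing how two 4,000-request bursts straddling a window boundary both pass a 5,000-per-minute limit:

```java
import java.util.HashMap;
import java.util.Map;

public class FixedWindowCounter {
    private final long windowMs;
    private final int limit;
    private final Map<Long, Integer> counts = new HashMap<>();

    public FixedWindowCounter(long windowMs, int limit) {
        this.windowMs = windowMs;
        this.limit = limit;
    }

    // Returns true if the request at the given timestamp is allowed.
    public boolean allow(long nowMs) {
        long epoch = nowMs / windowMs;            // which fixed window we're in
        int count = counts.merge(epoch, 1, Integer::sum);
        return count <= limit;
    }

    public static void main(String[] args) {
        // Limit: 5,000 per minute. Client fires 4,000 requests in the
        // last 10s of window 0, then 4,000 in the first 10s of window 1.
        FixedWindowCounter limiter = new FixedWindowCounter(60_000, 5_000);
        int allowed = 0;
        for (int i = 0; i < 4_000; i++) if (limiter.allow(55_000)) allowed++;
        for (int i = 0; i < 4_000; i++) if (limiter.allow(65_000)) allowed++;
        // All 8,000 pass: neither window individually exceeded 5,000.
        System.out.println(allowed); // prints 8000
    }
}
```

Each window's counter stays under 5,000, so the limiter sees nothing wrong, even though the service absorbed 8,000 requests in 20 seconds.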
The Modern Backend Interview Prep Kit
Java · Spring Boot · Microservices · DevOps · SRE · System Design
Why the Algorithm Actually Matters
Most engineers implement rate limiting once — usually by reaching for whatever their Redis library ships with — and never revisit it. The result is a system that technically enforces a limit but fails to protect against the specific traffic patterns that actually hurt production services.
Each algorithm enforces limits in a fundamentally different way. Each is well-suited to some traffic shapes and dangerously blind to others.
Token Bucket — Can I send now?
Allows controlled bursting up to a capacity. Tokens refill at a fixed rate. Idle clients accumulate credit.
Leaky Bucket — Can I queue this?
Enforces a strict output rate. Absorbs bursts but adds latency, not throughput. Output is always constant.
Sliding Window — How many recently?
Counts requests in a rolling time window. No burst allowance. Precise. Eliminates boundary exploits.
Token buckets let you burst. Leaky buckets smooth your output. Sliding windows enforce precision. These are not three implementations of the same idea. They protect against different threats — and they fail differently when those threats arrive.
1. What Rate Limiting Actually Is
Before the algorithms, get the goal right — because most teams confuse rate limiting with throttling, and the difference determines what you build.
Throttling means slowing a caller down — queuing requests, adding delay, degrading service quality gradually. Rate limiting means enforcing a hard ceiling and rejecting requests that exceed it, immediately, with a clear signal — HTTP 429, Retry-After header.
Rate limiting serves two purposes that are easy to conflate. Protection stops one client from degrading service for every other client — that’s the incident at the top of this post. Monetization enforces tier-based usage caps so a free-tier user can’t consume resources priced for enterprise. The algorithm you choose should match the problem you’re actually solving.
The classic framing: you are not limiting requests per se. You are protecting the downstream cost of those requests — CPU cycles, database queries, third-party API calls — from being imposed by one caller on everyone else.
2. Token Bucket: Controlled Bursting With a Refill Rate
What it actually is
Imagine a bucket that holds a maximum of N tokens. Tokens are added at a fixed rate — say, 100 per second. Every request costs one token. If the bucket has tokens, the request proceeds. If it’s empty, the request is rejected.
The critical property: if a client has been idle, tokens accumulate up to the bucket capacity. A burst of requests can consume those accumulated tokens in a short window without hitting the limit — as long as the burst doesn’t exceed capacity.
token-bucket — Redis implementation (Lua script)

-- Parameters: KEYS[1] = bucket key; ARGV[1] = max_tokens,
-- ARGV[2] = refill_rate_per_sec, ARGV[3] = cost, ARGV[4] = now_ms
local max_tokens = tonumber(ARGV[1])
local refill_rate = tonumber(ARGV[2])
local cost = tonumber(ARGV[3])
local now_ms = tonumber(ARGV[4])
-- New clients start with a full bucket; a missing timestamp means no refill yet
local tokens = tonumber(redis.call("GET", KEYS[1])) or max_tokens
local last = tonumber(redis.call("GET", KEYS[1]..":ts")) or now_ms
local delta = (now_ms - last) / 1000
tokens = math.min(max_tokens, tokens + delta * refill_rate)
if tokens >= cost then
  tokens = tokens - cost
  redis.call("SET", KEYS[1], tokens, "PX", 60000)
  redis.call("SET", KEYS[1]..":ts", now_ms, "PX", 60000)
  return 1 -- allowed
else
  return 0 -- rejected
end

What token bucket is good for
APIs where legitimate clients batch naturally. A client processing a nightly report fires 200 requests in 10 seconds, then nothing for an hour. Token bucket handles this gracefully — the bucket refills during the idle period and the burst is absorbed without a 429.
Client-side rate limiting. SDK libraries use token buckets to self-limit outbound calls. The bucket absorbs queue spikes internally before they hit the network.
Rewarding well-behaved clients. A client that stays under limit accumulates capacity. Token bucket is the only algorithm that provides this positive-feedback property.
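A minimal in-memory sketch of the algorithm, of the kind an SDK might embed for client-side self-limiting. Names, capacity, and refill rate are illustrative; production code would read the clock itself rather than take a timestamp parameter:

```java
public class TokenBucket {
    private final double capacity;      // max tokens (burst size)
    private final double refillPerSec;  // tokens added per second
    private double tokens;
    private long lastRefillMs;

    public TokenBucket(double capacity, double refillPerSec, long nowMs) {
        this.capacity = capacity;
        this.refillPerSec = refillPerSec;
        this.tokens = capacity;          // a new client starts with a full bucket
        this.lastRefillMs = nowMs;
    }

    public synchronized boolean tryAcquire(long nowMs) {
        // Refill proportionally to elapsed time, capped at capacity.
        double elapsedSec = (nowMs - lastRefillMs) / 1000.0;
        tokens = Math.min(capacity, tokens + elapsedSec * refillPerSec);
        lastRefillMs = nowMs;
        if (tokens >= 1.0) {
            tokens -= 1.0;
            return true;
        }
        return false;
    }
}
```

With capacity 200 and a refill of 10 tokens per second, a client that has been idle for 20 seconds can fire a 200-request batch without a single rejection, which is exactly the nightly-report behavior described above.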
Where token bucket breaks down
Two clients can exploit independent buckets simultaneously. A bucket with capacity 500 allows a full burst right now. Two well-timed clients can each have a full bucket and fire at once — your service sees 1,000 requests simultaneously with neither technically exceeding their individual limit.
Capacity tuning is non-obvious. Set max tokens too high and burst protection is meaningless. Set it too low and legitimate batch clients get rejected unnecessarily.
Clock skew matters. The refill calculation depends on accurate timestamps. In distributed systems with multiple rate-limiter nodes, refill arithmetic can diverge subtly without shared clock sync.
Token bucket is the most permissive algorithm. It protects against sustained overload, but not against a coordinated burst from many clients whose buckets are all full. If your threat is the former, it’s excellent. If it’s the latter, look elsewhere.
3. Leaky Bucket: Strict Output Rate, Regardless of Input
What it actually is
Imagine a bucket with a small hole at the bottom. Water pours in at whatever rate the client sends. Water drains at a fixed, constant rate. If the bucket overflows, the excess is discarded.
In software: incoming requests are queued. A processor drains the queue at a fixed rate — say, 50 requests per second. If the queue fills beyond its depth limit, new requests are rejected. The output rate is constant regardless of input spikes.
leaky-bucket — queue-backed implementation (Spring Boot)

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import org.springframework.stereotype.Component;

@Component
public class LeakyBucketLimiter {

    private final BlockingQueue<Runnable> queue =
        new LinkedBlockingQueue<>(200); // max depth

    public LeakyBucketLimiter() {
        var scheduler = Executors.newSingleThreadScheduledExecutor();
        // Drain: 50 req/sec → 1 req every 20 ms
        scheduler.scheduleAtFixedRate(
            this::drain, 0, 20, TimeUnit.MILLISECONDS);
    }

    public boolean submit(Runnable task) {
        return queue.offer(task); // false = full = reject
    }

    private void drain() {
        Runnable task = queue.poll();
        if (task != null) task.run();
    }
}

What leaky bucket is good for
Protecting downstream services with a hard throughput ceiling. A payment gateway that handles exactly 30 transactions per second needs a leaky bucket upstream. The output rate is guaranteed constant. No burst can exceed what the downstream can absorb.
Preventing thundering herd on downstream APIs. After a cache miss storm, a token bucket might allow thousands of requests to hit the origin simultaneously. A leaky bucket queues them and drains steadily.
Where leaky bucket breaks down
Latency, not just throughput. A 200-deep queue draining at 50 req/sec adds up to 4 seconds of latency for requests at the back. That 429 from a token bucket was kinder.
No burst allowance for legitimate clients. A valid client firing 200 requests in 2 seconds hits the queue depth limit just as fast as a malicious one. Leaky bucket cannot distinguish intent — it only sees rate.
Queue state is hard to observe. Unlike a counter or token count, queue depth is runtime state that’s harder to export as a metric and alert on.
Leaky bucket is the most protective algorithm for downstream services — but it trades burst tolerance for queue latency. It’s the right choice when your constraint is “the downstream can handle exactly X req/sec, no more, ever.”
4. Sliding Window: Precision Without Burst Allowance
What it actually is
A fixed window counter divides time into discrete epochs, counts requests per epoch, and rejects if over the limit. The problem: a client can straddle the boundary and send 2N requests across two windows without triggering either limit individually. That is exactly the incident in the introduction.
A sliding window fixes it. Instead of counting in a fixed epoch, it counts requests that occurred within the last N seconds from now — a window that moves forward with every request. The count is always relative to the current moment, not a shared clock tick.
sliding-window-counter — Redis (approximate, production-grade)

local now = tonumber(ARGV[1])        -- current Unix ms
local window_ms = tonumber(ARGV[2])  -- e.g. 60000 (1 min)
local limit = tonumber(ARGV[3])
local curr_epoch = math.floor(now / window_ms)
local curr_key = KEYS[1]..":"..curr_epoch
local prev_key = KEYS[1]..":"..(curr_epoch - 1)
local curr = tonumber(redis.call("GET", curr_key)) or 0
local prev = tonumber(redis.call("GET", prev_key)) or 0
-- Weight the previous window by how much of it still overlaps
-- the rolling window, then add the current window's count.
local elapsed_ratio = (now % window_ms) / window_ms
local weighted = prev * (1 - elapsed_ratio) + curr
if weighted + 1 <= limit then
  redis.call("INCR", curr_key)
  redis.call("EXPIRE", curr_key, math.ceil(window_ms / 1000) * 2)
  return 1 -- allowed
else
  return 0 -- rejected
end
What sliding window is good for
Eliminating the boundary exploit. The count of requests in the last 60 seconds is always accurate, regardless of when the client times their burst relative to your clock. The incident in this post’s introduction doesn’t happen with a sliding window.
Precise monetization enforcement. Billing-period limits — 10,000 API calls per month, 500 per day — require accurate counting. Sliding window gives you that precision without the memory overhead of storing every individual timestamp.
Per-user, per-endpoint granularity. Because the sliding window is cheap per-key in Redis, you can run it independently per user, per IP, per API key, and per endpoint simultaneously without significant overhead.
Where sliding window breaks down
No burst allowance by design. A user who sends 100 requests in the first second of a 100-req/min limit is immediately at the ceiling for the next 59 seconds. This surprises legitimate clients who expect burst tolerance from a token-bucket style limiter.
Distributed coordination is required. Local in-memory sliding windows per pod will undercount if traffic is distributed across instances. You need shared Redis state to make the count accurate.
Sliding window is the most precise algorithm. It gives you an accurate view of request rate at any moment, without window-boundary exploits. But precision comes at the cost of burst tolerance — and you need shared state to make it accurate across a distributed deployment.
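For contrast, the exact variant, a sliding log, stores one timestamp per request instead of two counters. It has no approximation error at all, but memory grows with the limit, which is why the weighted counter above is usually preferred. A minimal in-memory sketch with illustrative names:

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class SlidingLogLimiter {
    private final long windowMs;
    private final int limit;
    private final Deque<Long> timestamps = new ArrayDeque<>();

    public SlidingLogLimiter(long windowMs, int limit) {
        this.windowMs = windowMs;
        this.limit = limit;
    }

    public synchronized boolean allow(long nowMs) {
        // Evict timestamps that have fallen out of the rolling window.
        while (!timestamps.isEmpty()
                && timestamps.peekFirst() <= nowMs - windowMs) {
            timestamps.pollFirst();
        }
        if (timestamps.size() < limit) {
            timestamps.addLast(nowMs);
            return true;
        }
        return false;
    }
}
```

Replaying the boundary burst from the introduction against this limiter (5,000 per rolling minute), the second 4,000-request burst is cut off after 1,000 requests: exactly 5,000 succeed, no matter where the bursts fall relative to any clock epoch.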
5. How They Actually Work Together
In practice, most production systems don’t pick one algorithm and apply it everywhere. They layer all three — each serving a different constraint at a different level of the stack.
Token bucket at the ingress gateway for burst protection per API key. Legitimate batch clients can burst without getting rejected, while no single client can send an unbounded spike that overwhelms the system.
Sliding window per user per endpoint for quota enforcement. Accurate counting against billing-period limits. Runs simultaneously with the token bucket using different Redis key namespaces — the two don’t conflict.
Leaky bucket on outbound calls to third-party APIs with hard per-second limits. Your payment service calling Stripe at exactly 30 req/sec, never more, even under a traffic spike from your own users.
The failure mode to avoid: applying a single algorithm globally, at a single granularity, for all purposes. The incident in this post is a fixed-window counter doing the job a sliding window should have been doing. A sliding window per client per minute would have counted 8,000 attempted requests in 60 seconds — over the 5,000 limit — regardless of where they fell relative to the epoch boundary. The 5,001st request in the rolling window would have returned a 429. The recommendation service would have been fine.
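A sketch of how the layers compose. Each layer is reduced here to a predicate over a namespaced state key; the interfaces and key formats ("tb:", "sw:") are illustrative, not a real gateway API:

```java
import java.util.function.Predicate;

public class LayeredLimiter {
    // Each layer is a check against its own state key for this sketch.
    private final Predicate<String> burstLayer; // e.g. token bucket per API key
    private final Predicate<String> quotaLayer; // e.g. sliding window per user+endpoint

    public LayeredLimiter(Predicate<String> burstLayer,
                          Predicate<String> quotaLayer) {
        this.burstLayer = burstLayer;
        this.quotaLayer = quotaLayer;
    }

    public boolean allow(String apiKey, String userId, String endpoint) {
        // Distinct namespaces: the burst layer never reads quota state
        // and vice versa, so the two limiters cannot conflict.
        String burstKey = "tb:" + apiKey;
        String quotaKey = "sw:" + userId + ":" + endpoint;
        return burstLayer.test(burstKey) && quotaLayer.test(quotaKey);
    }
}
```

A request must pass every layer; rejecting at the cheapest layer first (the gateway token bucket) keeps the more expensive quota check off the hot path for abusive traffic.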
6. Building a Rate Limiter That Tells the Truth
A rate limiter is only as useful as its response. HTTP 429 is the right status code — but the response must include Retry-After, and ideally X-RateLimit-Limit, X-RateLimit-Remaining, and X-RateLimit-Reset. Without these headers, your clients cannot implement correct backoff. Well-meaning clients will hammer your retry logic instead — making the overload worse, not better.
rate-limit response — Spring Boot filter

@Component
public class RateLimitFilter extends OncePerRequestFilter {

    private final RateLimiter limiter; // your limiter abstraction

    public RateLimitFilter(RateLimiter limiter) {
        this.limiter = limiter;
    }

    @Override
    protected void doFilterInternal(
            HttpServletRequest req,
            HttpServletResponse res,
            FilterChain chain) throws ServletException, IOException {
        RateLimitResult result = limiter.check(req.getHeader("X-API-Key"));
        res.setHeader("X-RateLimit-Limit", String.valueOf(result.limit()));
        res.setHeader("X-RateLimit-Remaining", String.valueOf(result.remaining()));
        res.setHeader("X-RateLimit-Reset", String.valueOf(result.resetEpoch()));
        if (!result.allowed()) {
            res.setHeader("Retry-After", String.valueOf(result.retryAfterSeconds()));
            res.setStatus(429);
            return;
        }
        chain.doFilter(req, res);
    }
}

Rate limiting must also be observable. Emit a counter on every 429, tagged by client tier, endpoint, and algorithm. Without this metric you have no idea whether your limits are tuned correctly, whether a client is being incorrectly rejected, or whether a new attack pattern has started.
observability — Micrometer counter

Counter.builder("ratelimit.decisions")
.tag("outcome", result.allowed() ? "allowed" : "rejected")
.tag("algorithm", "sliding_window")
.tag("client_tier", clientTier)
.tag("endpoint", normalizedPath)
.register(meterRegistry)
.increment();
// Alert: rejected rate > 50/min for any single client_tier
// → investigate before assuming it's an attack
7. Common Mistakes
Implementing rate limiting in-process, per pod
If each of your twelve application instances runs its own in-memory limiter, a client can multiply their effective limit by twelve by round-robin-ing across instances. Rate limiters must share state. Redis with a Lua script for atomicity is the standard solution. Don’t build this at the pod level.
Returning 429 without Retry-After
A 429 without Retry-After is an instruction to retry immediately — which is exactly what well-meaning clients will do, making your overload worse. Every 429 must tell the client when it can try again. This is part of the protocol, not a nice-to-have.
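On the client side, honoring Retry-After means handling both forms RFC 9110 allows: delta-seconds and an HTTP-date. A small sketch (the class name is made up) that turns the header value into a wait in milliseconds:

```java
import java.time.Instant;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;

public class RetryAfter {
    public static long waitMillis(String headerValue, Instant now) {
        try {
            // Form 1: delta-seconds, e.g. "Retry-After: 30"
            return Long.parseLong(headerValue.trim()) * 1000L;
        } catch (NumberFormatException e) {
            // Form 2: HTTP-date, e.g. "Retry-After: Wed, 21 Oct 2015 07:28:00 GMT"
            Instant until = ZonedDateTime
                .parse(headerValue.trim(), DateTimeFormatter.RFC_1123_DATE_TIME)
                .toInstant();
            // Never return a negative wait if the date is already past.
            return Math.max(0, until.toEpochMilli() - now.toEpochMilli());
        }
    }
}
```

A client that sleeps for this duration (plus jitter) before retrying converges instead of hammering; a client that ignores the header amplifies the overload.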
Applying a single global limit instead of layered limits
An endpoint that triggers a database aggregation query costs ten times more than one that reads from cache. Apply limits per-endpoint in addition to per-client. A single limit per API key doesn’t protect against endpoint-specific abuse.
Using fixed windows without understanding the boundary problem
Fixed window is fine for coarse limits where precision doesn’t matter. It is not fine when clients — intentionally or not — can straddle the epoch boundary to send twice the intended volume in a short window. The incident at the start of this post is exactly this failure.
Not testing your rate limiter under load
Race conditions in counter increments, clock drift in window calculations, Redis timeouts during the check itself — rate limiters are surprisingly easy to get wrong at high concurrency. A rate limiter that fails open under pressure is not a rate limiter.
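One way to keep that failure mode explicit is to wrap the backend call so the fail-open versus fail-closed choice is a named parameter rather than an accident of exception handling. A sketch, assuming the real limiter check can be represented as a Supplier:

```java
import java.util.function.Supplier;

public class FailSafeLimiter {
    private final Supplier<Boolean> check; // the real limiter call (e.g. Redis)
    private final boolean allowOnError;    // fail open (true) or fail closed (false)

    public FailSafeLimiter(Supplier<Boolean> check, boolean allowOnError) {
        this.check = check;
        this.allowOnError = allowOnError;
    }

    public boolean allow() {
        try {
            return check.get();
        } catch (RuntimeException backendDown) {
            // A deliberate, observable decision; in practice, emit a
            // metric here so an outage of the limiter itself pages someone.
            return allowOnError;
        }
    }
}
```

Internal traffic usually fails open (availability over enforcement); abuse-sensitive or billable endpoints usually fail closed. Either way, the decision should be visible in code review and in metrics, not buried in a catch block.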
8. Interview Prep: Rate Limiting Questions
8.1 — Conceptual Questions
In your own words, what is the difference between rate limiting and throttling? When would you choose one over the other?
Explain the token bucket algorithm. What happens when a client has been idle for 60 seconds and then fires 500 requests simultaneously?
What is the fixed window boundary problem? Give a concrete example of how a client could exploit it to send 2× the intended limit.
Why is a distributed, shared-state rate limiter required in a horizontally scaled microservice deployment? What goes wrong with per-pod local limiters?
What HTTP headers should a rate-limited response include, and why does each one matter to the client?
What is the difference between head-based and tail-based decisions in a rate limiter?
8.2 — Practical / Scenario Questions
A single enterprise client is flooding your recommendation service with bursts of 5,000 requests every 90 seconds. Walk me through which algorithm you’d choose, how you’d implement it in Redis, and what response you’d return to the client.
Design a rate limiting system for a public REST API with three tiers: free (100 req/min), pro (1,000 req/min), enterprise (10,000 req/min). What algorithm, what granularity, and what Redis data structure would you use?
Your payment service calls a third-party gateway that enforces 30 transactions per second. How would you implement a rate limiter on your outbound calls to ensure you never exceed this — even under a traffic spike from your own users?
A client reports they’re receiving 429s even though they believe they’re under the stated limit. How do you investigate? What metrics, logs, and Redis keys would you inspect?
You need to apply rate limits per user, per endpoint, and per IP simultaneously. How do you structure your Redis keys to support this without conflicts?
8.3 — Trade-Off Questions
When would you choose leaky bucket over token bucket? What does your downstream architecture need to look like for leaky bucket to be the clearly correct choice?
The approximate sliding window counter has a small error bound. When is this acceptable, and when do you need the exact sliding log despite its higher memory cost?
Your team is debating rate limiting at the API gateway vs at the individual microservice level. Walk me through the trade-offs.
A Redis outage takes down your rate limiter. Do you fail open or fail closed? What factors drive this decision, and how do you implement either safely?
How does your rate limiting strategy change when a significant portion of traffic is machine-to-machine vs human-driven?
8.4 — SRE / Production Questions
What metrics would you emit from your rate limiter, and what alert thresholds would you set? How do you avoid false positive pages when a legitimate client hits a temporary spike?
How do you tune rate limit values for a new endpoint before you have production traffic data? What’s your process for revisiting them after launch?
Design a runbook for an on-call engineer to diagnose a rate limit misconfiguration — specifically the case where legitimate clients are being rejected unexpectedly.
How would your rate limiting architecture change as you scale from 500 req/sec to 50,000 req/sec? What does the Redis topology look like at each stage?
A DDoS-style attack generates 200,000 requests per second from thousands of IP addresses. Your per-IP token bucket is working correctly — but the volume alone is overwhelming your rate limiter infrastructure. What layers of defense do you add above the application-level rate limiter?