How to Implement API Rate Limiting — Token Bucket, Sliding Window, and the Edge Cases Nobody Warns You About

By banditz

Friday, April 17, 2026 • 6 min read

It happens to every API eventually. You launch without rate limiting because “we’ll add it later.” Three weeks later, someone discovers your API and hits it 50,000 times in a minute. Maybe it’s a bot scraping your data. Maybe it’s a customer’s broken retry loop. Maybe it’s a competitor stress-testing your infrastructure. The result is the same: your servers are drowning, your database is on fire, and legitimate users can’t get a response.

Rate limiting isn’t a nice-to-have. It’s the bouncer at the door that keeps your API from being abused into oblivion.

The Algorithms — Pick the Right One

There are four common rate limiting algorithms. Each has tradeoffs.

Fixed Window Counter

The simplest approach. Divide time into fixed windows (e.g., 1-minute intervals). Count requests per window. Reject when the count exceeds the limit.

Window 12:00-12:01: 98/100 requests used
Window 12:01-12:02: 0/100 requests used

The problem: a burst at the window boundary. If a client sends 100 requests at 12:00:59 and another 100 at 12:01:01, they’ve sent 200 requests in 2 seconds while technically staying within the per-minute limit. This can overload your server even though the rate limit is “working.”
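The fixed window logic is small enough to sketch in plain JavaScript. This is an in-memory, single-process illustration (function names are mine); production code would back the counter with a shared store such as Redis.

```javascript
// Minimal in-memory fixed window counter (sketch; names illustrative).
function createFixedWindow(limit, windowMs) {
  const counts = new Map(); // "clientId:windowStart" -> request count
  return function allow(clientId, nowMs) {
    // Snap the timestamp to the start of its fixed window
    const windowStart = Math.floor(nowMs / windowMs) * windowMs;
    const key = `${clientId}:${windowStart}`;
    const used = counts.get(key) || 0;
    if (used >= limit) return false; // window budget exhausted
    counts.set(key, used + 1);
    return true;
  };
}
```

Note that this sketch inherits the boundary problem described above: a client can spend one window's full budget just before the boundary and another full budget just after it.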

Sliding Window Log

Stores the timestamp of every request. When a new request arrives, count timestamps within the last window duration and reject if over the limit. Most accurate, but stores every timestamp — memory grows linearly with traffic.
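An in-memory sketch of the idea (names are illustrative); the per-request timestamp array is exactly where the linear memory growth comes from.

```javascript
// Sliding window log (sketch). Exact, but stores one timestamp per request.
function createSlidingLog(limit, windowMs) {
  const logs = new Map(); // clientId -> array of request timestamps
  return function allow(clientId, nowMs) {
    const log = logs.get(clientId) || [];
    // Drop timestamps that have aged out of the sliding window
    const fresh = log.filter((t) => t > nowMs - windowMs);
    if (fresh.length >= limit) {
      logs.set(clientId, fresh);
      return false;
    }
    fresh.push(nowMs);
    logs.set(clientId, fresh);
    return true;
  };
}
```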

Sliding Window Counter

A hybrid. Uses fixed windows but weights the previous window’s count based on overlap. If 70% of the current window has elapsed, count 30% of the previous window’s requests plus 100% of the current window’s. Good accuracy without per-request storage.
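The weighting can be sketched in-memory like this (names are illustrative; only two counters per client are kept, which is the whole point of the hybrid):

```javascript
// Sliding window counter (sketch): current count plus a weighted
// share of the previous window's count.
function createSlidingCounter(limit, windowMs) {
  const counts = new Map(); // clientId -> { windowStart, current, previous }
  return function allow(clientId, nowMs) {
    const windowStart = Math.floor(nowMs / windowMs) * windowMs;
    let s = counts.get(clientId);
    if (!s || s.windowStart !== windowStart) {
      // Roll over: the old window's count becomes "previous" if adjacent
      const previous =
        s && s.windowStart === windowStart - windowMs ? s.current : 0;
      s = { windowStart, current: 0, previous };
      counts.set(clientId, s);
    }
    // Fraction of the previous window still inside the sliding window
    const prevWeight = 1 - (nowMs - windowStart) / windowMs;
    const estimated = s.current + s.previous * prevWeight;
    if (estimated >= limit) return false;
    s.current += 1;
    return true;
  };
}
```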

Token Bucket

A bucket holds tokens, refilled at a fixed rate. Each request consumes a token. If the bucket is empty, the request is rejected. The bucket has a maximum capacity, allowing controlled bursts up to that capacity while enforcing the average rate over time.
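The refill arithmetic, sketched in-memory (names are illustrative; the Redis implementation in the next section performs the same math atomically):

```javascript
// Token bucket (sketch). Tokens refill continuously based on elapsed
// time, capped at the bucket's capacity.
function createTokenBucket(maxTokens, refillPerSec) {
  const buckets = new Map(); // clientId -> { tokens, lastRefill } (seconds)
  return function allow(clientId, nowSec) {
    const b =
      buckets.get(clientId) || { tokens: maxTokens, lastRefill: nowSec };
    // Refill for the time elapsed since the last request, capped at capacity
    b.tokens = Math.min(
      maxTokens,
      b.tokens + (nowSec - b.lastRefill) * refillPerSec
    );
    b.lastRefill = nowSec;
    buckets.set(clientId, b);
    if (b.tokens < 1) return false; // bucket empty
    b.tokens -= 1;
    return true;
  };
}
```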

For most APIs, token bucket or sliding window counter is the right choice. Token bucket if you want to allow controlled bursts (most APIs should). Sliding window counter if you want strict per-window enforcement.

Implementing Token Bucket with Redis

Redis is the standard backing store for rate limiting because it’s fast, atomic, and shared across application servers.

The token bucket needs two pieces of data per client: the number of remaining tokens and the last refill timestamp.

Here’s the logic in a Redis Lua script (atomic execution, no race conditions):

-- Token bucket rate limiter (Lua script for Redis)
local key = KEYS[1]
local max_tokens = tonumber(ARGV[1])   -- bucket capacity
local refill_rate = tonumber(ARGV[2])  -- tokens per second
local now = tonumber(ARGV[3])          -- current timestamp (seconds)

local data = redis.call('HMGET', key, 'tokens', 'last_refill')
local tokens = tonumber(data[1]) or max_tokens
local last_refill = tonumber(data[2]) or now

-- Calculate tokens to add since the last refill
local elapsed = now - last_refill
local new_tokens = elapsed * refill_rate
tokens = math.min(max_tokens, tokens + new_tokens)

local allowed = 0
if tokens >= 1 then
    tokens = tokens - 1
    allowed = 1
end

redis.call('HMSET', key, 'tokens', tokens, 'last_refill', now)
-- Clean up idle clients: expire after twice a full-refill interval
redis.call('EXPIRE', key, math.ceil(max_tokens / refill_rate) * 2)
return { allowed, math.floor(tokens) }

This runs atomically in Redis. No race conditions even with 50 application servers hitting the same key simultaneously. The EXPIRE ensures cleanup of inactive clients.

Call it from your application:

const result = await redis.eval(luaScript, 1,
    `ratelimit:${clientId}`,
    100,               // max 100 tokens (burst capacity)
    1.67,              // refill at 1.67/sec ≈ 100/min
    Date.now() / 1000  // current time in seconds
);

const [allowed, remaining] = result;

Setting Rate Limit Headers

Good APIs tell clients their rate limit status on every response. This lets well-behaved clients self-regulate instead of blindly hitting the limit.

The de facto standard headers (the IETF draft standardizes unprefixed RateLimit-* names, but the X- prefixed forms below remain the most widely deployed):

X-RateLimit-Limit: 100
X-RateLimit-Remaining: 67
X-RateLimit-Reset: 1712345678

When the limit is exceeded, return 429 Too Many Requests with a Retry-After header:

HTTP/1.1 429 Too Many Requests
Retry-After: 30
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1712345678

{
    "error": "rate_limit_exceeded",
    "message": "Too many requests. Retry after 30 seconds.",
    "retry_after": 30
}

Include the Retry-After value in both the header and body. Some HTTP clients check headers, some parse the body.
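As a sketch, this logic fits in one framework-agnostic helper. The function name and the shape of the limiter result (`allowed`, `limit`, `remaining`, `resetEpochSec`) are my assumptions, not a standard API:

```javascript
// Build rate limit response metadata from a limiter result (sketch;
// the result shape here is an assumption for illustration).
function rateLimitResponse({ allowed, limit, remaining, resetEpochSec }, nowEpochSec) {
  const headers = {
    'X-RateLimit-Limit': String(limit),
    'X-RateLimit-Remaining': String(Math.max(0, remaining)),
    'X-RateLimit-Reset': String(resetEpochSec),
  };
  if (allowed) {
    return { status: 200, headers };
  }
  // Mirror Retry-After in both the header and the body
  const retryAfter = Math.max(1, Math.ceil(resetEpochSec - nowEpochSec));
  headers['Retry-After'] = String(retryAfter);
  return {
    status: 429,
    headers,
    body: {
      error: 'rate_limit_exceeded',
      message: `Too many requests. Retry after ${retryAfter} seconds.`,
      retry_after: retryAfter,
    },
  };
}
```

A web framework's middleware would then copy `headers` onto the response and, on 429, serialize `body` as JSON.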

Edge Cases That Break Naive Implementations

Client identification.

Most tutorials rate limit by IP address. This breaks immediately in production because:

  • Corporate NATs: thousands of users share one IP. Rate limiting by IP blocks an entire company.
  • VPNs and proxies: same problem.
  • Mobile carriers: CGNAT means millions of users share IP pools.

Rate limit by API key for authenticated endpoints. For unauthenticated endpoints (login, registration), rate limit by a combination of IP + other signals (user agent, fingerprint).

For login endpoints specifically, rate limit by both IP and target username. This prevents credential stuffing attacks where an attacker tries common passwords against thousands of usernames from a single IP.
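One way to sketch this is to derive two keys per login attempt and require both limiters to pass. The key format below is illustrative, not a convention:

```javascript
// Composite rate limit keys for a login attempt (sketch; key format
// is illustrative). Both keys must pass their own limiter: the IP key
// stops one source spraying many usernames, and the username key stops
// one account being hammered from many IPs.
function loginRateLimitKeys(ip, username) {
  return [
    `ratelimit:login:ip:${ip}`,
    `ratelimit:login:user:${username.toLowerCase()}`, // normalize case
  ];
}
```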

Per-endpoint limits.

Not all endpoints are equal. Your /search endpoint might hit the database hard and need aggressive limiting. Your /health endpoint should probably never be limited. Your /login endpoint needs very tight limits to prevent brute force.

Apply different limits per endpoint or endpoint group:

/api/search     → 30 requests/minute
/api/users      → 100 requests/minute
/api/login      → 5 requests/minute
/api/health     → no limit
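A sketch of a limit lookup mirroring that table (the default for unlisted endpoints is my assumption; `null` means unlimited):

```javascript
// Per-endpoint limit table (sketch; values mirror the table above).
const ENDPOINT_LIMITS = [
  { prefix: '/api/search', perMinute: 30 },
  { prefix: '/api/users',  perMinute: 100 },
  { prefix: '/api/login',  perMinute: 5 },
  { prefix: '/api/health', perMinute: null }, // null = no limit
];
const DEFAULT_PER_MINUTE = 60; // fallback for unlisted endpoints (assumption)

// Longest-prefix-wins is not needed here; first match suffices for
// this flat table.
function limitFor(path) {
  const match = ENDPOINT_LIMITS.find((e) => path.startsWith(e.prefix));
  return match ? match.perMinute : DEFAULT_PER_MINUTE;
}
```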

Distributed systems.

If you have multiple application servers behind a load balancer, each server needs to check the same rate limit counter. This is why Redis (or any shared store) is necessary. In-memory rate limiting on each server means a client can send N × server_count requests per window by hitting each server once.

Time synchronization.

If your application servers have slightly different clocks, timestamp-based algorithms can behave inconsistently. Use the Redis server’s time (redis.call('TIME')) instead of the application server’s clock.
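A hedged sketch of that, assuming an ioredis-style client where `redis.time()` wraps the TIME command (which returns seconds and microseconds as strings). The resulting value can be passed as the timestamp argument to the Lua script:

```javascript
// Read the Redis server's clock instead of the app server's (sketch;
// assumes an ioredis-style client exposing TIME as redis.time()).
async function redisNowSeconds(redis) {
  const [seconds, microseconds] = await redis.time();
  return Number(seconds) + Number(microseconds) / 1e6;
}
```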

Graceful Degradation

If Redis goes down, what happens to your rate limiter?

Bad answer: all requests are blocked because the rate limiter can’t check counts.

Good answer: fall back to a local in-memory rate limiter with generous limits, or allow requests through with logging and alerting.

async function checkRateLimit(clientId) {
    try {
        return await redisRateLimiter.check(clientId);
    } catch (error) {
        logger.warn('Redis rate limiter unavailable, falling back');
        return localRateLimiter.check(clientId);
    }
}

Rate limiting should protect your API, not become a single point of failure. Design it to fail open (allow requests) rather than fail closed (block everything), with monitoring to alert you when the fallback activates.

The Rate Limiting Checklist

  1. Pick the algorithm — token bucket for most APIs
  2. Use Redis — shared state across servers, atomic operations
  3. Identify clients properly — API key, not just IP
  4. Set per-endpoint limits — tight for auth, generous for reads
  5. Return proper headers — X-RateLimit-Limit, Remaining, Reset
  6. Return 429 with Retry-After — don’t just drop connections
  7. Handle Redis failure — fall back, don’t fail closed
  8. Monitor — track 429 rates, identify abusive patterns
  9. Document your limits — developers need to know them upfront

Rate limiting is one of those things that’s boring until you need it. And when you need it, you need it ten minutes ago. Build it before the bot finds your API. It’s a lot less stressful that way.



Step-by-Step Guide

1. Choose between rate limiting algorithms. Fixed window is simplest but allows bursts at window boundaries. Sliding window log is most accurate but memory intensive. Sliding window counter balances accuracy and performance. Token bucket allows controlled bursts while enforcing an average rate. For most APIs, token bucket or sliding window counter is the right choice.

2. Implement with Redis for distributed systems. Use Redis INCR with EXPIRE for fixed window. For token bucket, store the remaining tokens and the last refill timestamp. Redis's atomic operations prevent race conditions across multiple app servers; use Lua scripts for atomic multi-step operations.

3. Set proper rate limit headers. Return X-RateLimit-Limit, X-RateLimit-Remaining, and X-RateLimit-Reset headers on every response. Return 429 Too Many Requests with a Retry-After header when the limit is exceeded. These headers let clients self-regulate.

4. Handle edge cases. Identify clients by API key, not just IP, because multiple users share IPs behind NAT. Apply different limits per endpoint, because login and search have different abuse profiles. In distributed systems, have every app server check the same Redis counter.

5. Add graceful degradation. If Redis is down, do not block all requests. Fall back to in-memory rate limiting, or allow requests through with logging. Rate limiting should protect your API, not become a single point of failure.

Frequently Asked Questions

What is the difference between rate limiting and throttling?
Rate limiting rejects requests that exceed the limit with a 429. Throttling slows requests down by adding delays. Rate limiting is more common for APIs because it gives clients clear feedback; throttling is used when you want to degrade gracefully rather than reject.
Should I rate limit by IP or by API key?
By API key when possible, because multiple users share IPs behind corporate NATs and VPNs. By IP as a fallback for unauthenticated endpoints. For login endpoints, rate limit by both IP and username to prevent credential stuffing.
What rate limits should I set?
It depends on your API. General guidelines: 60-120 requests per minute for standard endpoints, 5-10 per minute for login, 1,000 per minute for read-heavy APIs. Monitor actual usage patterns before tightening limits; start generous and reduce based on data.
How do I handle rate limiting in microservices?
Use a shared Redis instance for counters, so every service checks the same key. With API gateways like Kong or Nginx, rate limiting happens at the gateway before requests reach your services, which centralizes the logic and reduces per-service complexity.
banditz

Bug bounty research at javahack team. Freelance bug bounty research.