Web APIs · Lesson 23

Rate Limiting

Master API traffic control by implementing request limits that protect your system from overload while ensuring fair access.

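Every major API enforces boundaries on how fast and how often clients can make requests. Twitter's v1.1 API capped posting at 300 tweets per 3-hour window. GitHub's search API allows authenticated clients 30 requests per minute. Stripe's default limit is 100 requests per second in live mode and 25 in test mode. Published limits change over time, so check each provider's current documentation before relying on a number.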

Without these guardrails, a single misbehaving client could overwhelm your servers with thousands of requests per second. A bot scraping your endpoints could consume bandwidth meant for legitimate users. An infinite loop in someone's code could drain your database connections.

Rate limiting acts as a traffic control system for your API. It measures incoming requests against predefined rules and blocks or delays requests that exceed those limits.

The technique protects your infrastructure while maintaining predictable performance for well-behaved clients. When implemented correctly, most users never notice the limits exist.

The Cost of No Limits
DDoS attacks routinely send millions of requests per minute to overwhelm targets. But even legitimate traffic spikes can crash an unprotected API. Black Friday sales, viral social media posts, and bot traffic can generate request volumes that exceed server capacity in minutes.

How Rate Limiting Works

Rate limiting operates on a simple principle: count requests from each client and reject new requests when limits are exceeded.
1. Request arrives with client identifier
2. System checks request count for this client
3. Compare against configured rate limit rules
4. Allow request or return 429 Too Many Requests

The system needs three pieces of information: who is making the request, what limits apply to them, and how many requests they have made recently. Client identification typically uses IP addresses, API keys, or user authentication tokens.

Request counting requires temporary storage that tracks usage over specific time windows. Redis and Memcached excel at this because they offer fast read/write operations and automatic key expiration.

When limits are exceeded, the API returns HTTP status code 429 along with headers indicating when the client can retry. Well-designed clients respect these signals and implement backoff strategies.
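The four steps above can be sketched in a few lines. This is a minimal in-memory illustration, not a production design: the `FixedWindowLimiter` class, its `check` method, and the injectable `clock` parameter are names invented for this example, and a plain dictionary stands in for Redis (so old windows are never cleaned up here).

```python
import time


class FixedWindowLimiter:
    """In-memory fixed-window counter; the dict is a stand-in for Redis."""

    def __init__(self, limit, window_seconds=60, clock=time.time):
        self.limit = limit
        self.window_seconds = window_seconds
        self.clock = clock   # injectable so the example can be tested
        self.counts = {}     # (client_id, window_id) -> request count

    def check(self, client_id):
        """Return (status_code, headers) for one incoming request."""
        now = int(self.clock())
        window_id = now // self.window_seconds
        reset = (window_id + 1) * self.window_seconds  # Unix time the window ends

        # Steps 1-2: identify the client and count this request
        key = (client_id, window_id)
        self.counts[key] = self.counts.get(key, 0) + 1
        count = self.counts[key]

        headers = {
            "X-RateLimit-Limit": str(self.limit),
            "X-RateLimit-Remaining": str(max(0, self.limit - count)),
            "X-RateLimit-Reset": str(reset),
        }

        # Steps 3-4: compare against the limit, then allow or reject
        if count > self.limit:
            headers["Retry-After"] = str(reset - now)  # seconds until the reset
            return 429, headers
        return 200, headers
```

A caller would run `status, headers = limiter.check(api_key)` before handling each request and short-circuit with the 429 response when the status is not 200.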

Concept: Traffic Control · Standard: RFC 6585 (defines 429 Too Many Requests) · Production Ready

Rate limiting combines request counting, time window management, and conditional request blocking to prevent API abuse and ensure service availability under load.

Common Rate Limiting Algorithms

Different algorithms handle request timing and limit enforcement in distinct ways, each with specific advantages for different use cases.
Algorithm | How It Works | Best For
--- | --- | ---
Fixed Window | Counts requests in fixed time periods (e.g., per minute) | Simple quotas, billing periods
Sliding Window | Tracks requests over a rolling time window | Smooth traffic distribution
Token Bucket | Refills tokens at a steady rate; requests consume tokens | Allowing bursts with sustained limits
Leaky Bucket | Processes requests at a fixed rate; queues overflow | Consistent processing rate

Fixed window counting divides time into discrete periods. A limit of 100 requests per minute means clients can make 100 requests between 14:00:00 and 14:00:59, then another 100 between 14:01:00 and 14:01:59. The approach is simple but allows traffic spikes at window boundaries: a client can send 100 requests at 14:00:59 and another 100 at 14:01:00, pushing 200 requests through in about a second.

Sliding windows track requests over rolling time periods. Instead of fixed minute boundaries, the system counts requests made in the last 60 seconds from any given moment. This smooths traffic patterns but requires more complex counting logic.
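One way to implement a rolling window is a "sliding window log" that stores the timestamp of each accepted request. The sketch below is illustrative only: `SlidingWindowLimiter`, `allow`, and the `clock` parameter are hypothetical names, and the log lives in process memory rather than in a shared store.

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Sliding-window log: keeps one timestamp per accepted request."""

    def __init__(self, limit, window_seconds=60.0, clock=time.monotonic):
        self.limit = limit
        self.window_seconds = window_seconds
        self.clock = clock
        self.log = defaultdict(deque)  # client_id -> timestamps, oldest first

    def allow(self, client_id):
        now = self.clock()
        timestamps = self.log[client_id]

        # Drop timestamps that have slid out of the rolling window
        while timestamps and timestamps[0] <= now - self.window_seconds:
            timestamps.popleft()

        if len(timestamps) >= self.limit:
            return False  # rejected requests are not recorded

        timestamps.append(now)
        return True
```

The trade-off is memory: the log holds up to `limit` timestamps per client, which is why high-volume systems often approximate the sliding window with weighted counters instead.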

Token bucket systems maintain a bucket of tokens that refill at a steady rate. Each request consumes one token. When tokens run out, requests are rejected. This allows short bursts while maintaining average rate limits over time.
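A token bucket needs only two pieces of state per client: the current token count and the time of the last refill. The following is a minimal single-client sketch under those assumptions; the `TokenBucket` class, its parameter names, and the optional `cost` argument are invented for illustration.

```python
import time


class TokenBucket:
    """Token bucket for a single client, refilled lazily on each call."""

    def __init__(self, capacity, refill_per_second, clock=time.monotonic):
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self.clock = clock
        self.tokens = float(capacity)  # start with a full bucket
        self.last_refill = clock()

    def allow(self, cost=1):
        now = self.clock()
        elapsed = now - self.last_refill

        # Add the tokens earned since the last call, never exceeding capacity
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_second)
        self.last_refill = now

        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

The capacity sets the largest burst a client can send at once, while the refill rate sets the sustained average, so the two can be tuned independently.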

Burst Traffic

A token bucket with a capacity of 50 and a refill rate of 10 tokens per minute allows 50 requests instantly when the bucket is full, then sustains 10 requests per minute.

Steady Traffic

Fixed window works well for predictable loads: 1000 requests per hour, reset at the top of each hour.

Implementing Rate Limiting

The APIForge Security team needs to protect their new webhook delivery system from abuse while ensuring legitimate integrations work smoothly.

They decide on tiered limits: 60 requests per minute for free accounts, 600 for paid plans, and 6000 for enterprise customers. The system uses API keys to identify clients and Redis to track request counts.

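// APIForge rate limiting middleware (fixed window, Redis-backed)
const redis = require('redis');
const client = redis.createClient();
client.connect().catch(console.error); // node-redis v4+ requires an explicit connect

const LIMITS = { free: 60, paid: 600, enterprise: 6000 }; // requests per window
const WINDOW_SIZE = 60; // seconds

async function rateLimitMiddleware(req, res, next) {
  const apiKey = req.headers['x-api-key'];
  if (!apiKey) {
    return res.status(401).json({ error: 'Missing API key' });
  }

  // getClientTier is an application-specific lookup (not shown here).
  // Unknown tiers fall back to the free limit instead of bypassing the check.
  const clientTier = await getClientTier(apiKey);
  const limit = LIMITS[clientTier] ?? LIMITS.free;

  const nowSeconds = Math.floor(Date.now() / 1000);
  const windowId = Math.floor(nowSeconds / WINDOW_SIZE);
  const reset = (windowId + 1) * WINDOW_SIZE; // Unix time when this window ends
  const key = `rate_limit:${apiKey}:${windowId}`;

  // The key name includes the window id, so each window gets a fresh counter
  // and the TTL only needs to let old windows age out.
  const current = await client.incr(key);
  await client.expire(key, WINDOW_SIZE);

  // Send quota headers on every response, including 429s
  res.set({
    'X-RateLimit-Limit': String(limit),
    'X-RateLimit-Remaining': String(Math.max(0, limit - current)),
    'X-RateLimit-Reset': String(reset)
  });

  if (current > limit) {
    res.set('Retry-After', String(reset - nowSeconds)); // seconds until reset
    return res.status(429).json({ error: 'Rate limit exceeded', limit, reset });
  }

  next();
}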
HTTP/1.1 200 OK
X-RateLimit-Limit: 600
X-RateLimit-Remaining: 587
X-RateLimit-Reset: 1698234060
Content-Type: application/json

{
  "webhook_id": "wh_1NvzQ2Lkdj",
  "status": "queued",
  "delivery_time": "2023-10-25T14:01:23Z"
}

-- After limit exceeded --

HTTP/1.1 429 Too Many Requests
X-RateLimit-Limit: 600
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1698234120
Retry-After: 45

{
  "error": "Rate limit exceeded",
  "limit": 600,
  "reset": 1698234120
}

What just happened?

The middleware extracts the API key, looks up the client's subscription tier, then increments a Redis counter for the current time window. If the count exceeds the limit, it returns 429 with reset timing.

Response headers tell clients their current limit, remaining quota, and when limits reset. The Retry-After header suggests how long to wait before trying again.

Try this: Implement exponential backoff in your API clients that doubles wait time after each 429 response until requests succeed again.

Rate Limiting Strategies

Effective rate limiting requires thoughtful decisions about what to limit, how to identify clients, and how to communicate limits.

Client identification determines who shares limits. IP-based limiting is simple but problematic for users behind corporate NATs or mobile networks where hundreds of people share the same external IP. API key limiting provides better granularity but requires authentication.

User-based limiting offers the most precise control, allowing different limits for different user tiers. However, it only works for authenticated endpoints and requires user session management.

Some APIs combine approaches: strict IP limits for unauthenticated requests, generous authenticated limits, and premium limits for paid users. This encourages authentication while preventing anonymous abuse.
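The combined approach comes down to one decision per request: which counter the request is charged to, and how large that counter's limit is. The sketch below illustrates the idea; the `Request` dataclass, the `limit_for` function, and the three per-minute limits are all hypothetical values chosen for this example.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical per-minute limits for each class of caller
ANONYMOUS_LIMIT = 20
AUTHENTICATED_LIMIT = 600
PREMIUM_LIMIT = 6000


@dataclass
class Request:
    ip: str
    api_key: Optional[str] = None
    plan: Optional[str] = None  # "paid" for premium accounts, None otherwise


def limit_for(request):
    """Return (bucket_key, limit): who shares a counter and how large it is."""
    if request.api_key is None:
        # Unauthenticated traffic shares one strict counter per IP address
        return f"ip:{request.ip}", ANONYMOUS_LIMIT
    if request.plan == "paid":
        return f"key:{request.api_key}", PREMIUM_LIMIT
    return f"key:{request.api_key}", AUTHENTICATED_LIMIT
```

Because authenticated requests are keyed by API key rather than IP, users behind a shared NAT stop competing for the same quota as soon as they authenticate.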

Strategy | What It Does | APIForge Use Case
--- | --- | ---
Tiered Limits | Different limits based on subscription level | Free users get 100 API calls/day, paid get 10,000
Endpoint-Specific | Different limits for different API operations | Search: 30/min, Create: 100/hour, Read: 1000/hour
Burst Allowance | Short-term higher limits for traffic spikes | Normal 10/sec, allow 100/sec for 10 seconds max
Geographic | Regional limits based on server capacity | US region: 1000/min, EU region: 500/min
Expensive operations deserve stricter limits than cheap ones. Database writes, complex searches, and file uploads consume more resources than simple reads. Shopify's GraphQL Admin API, for example, assigns each query a calculated cost in points, so an expensive query uses up more of a client's budget than a cheap one.
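Endpoint-specific limits can be expressed as a lookup table that feeds both the counter key and the limit. The sketch below reuses the search, create, and read figures from the strategy table; the `counter_key` function, the key format, and the fallback limit are assumptions made for illustration.

```python
# (limit, window_seconds) per operation, using the figures from the strategy table
ENDPOINT_LIMITS = {
    "search": (30, 60),      # 30 per minute
    "create": (100, 3600),   # 100 per hour
    "read": (1000, 3600),    # 1000 per hour
}
DEFAULT_LIMIT = (60, 60)     # assumed fallback for operations not listed


def counter_key(client_id, operation, now):
    """Return (key, limit, window_seconds) for one client and operation."""
    limit, window = ENDPOINT_LIMITS.get(operation, DEFAULT_LIMIT)
    window_id = int(now) // window
    # Including the operation in the key gives each endpoint its own counter
    return f"rate_limit:{client_id}:{operation}:{window_id}", limit, window
```

Keeping the counters separate means a client that exhausts its search quota can still read and create resources.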

Time-based limits can reflect business needs. A stock trading API might allow 1000 requests during market hours but only 100 after close. Social media APIs often have daily posting limits but no limits on reading content.

Header Standards
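Most APIs follow similar header conventions: X-RateLimit-Limit shows the limit, X-RateLimit-Remaining shows remaining quota, X-RateLimit-Reset indicates when limits reset. These names are a widely used convention rather than a formal standard. GitHub uses them as written, and Twitter uses the close variant x-rate-limit-limit, x-rate-limit-remaining, and x-rate-limit-reset, so clients should check each API's documentation for the exact spelling.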

Advanced Rate Limiting

Production systems require sophisticated limiting that handles edge cases, provides fair access, and scales with traffic.

Distributed rate limiting becomes necessary when APIs run across multiple servers. Redis clustering or dedicated rate limiting services like Kong or Ambassador provide shared counters that work across server instances.

Fair queuing prevents a single client from monopolizing resources. Instead of rejecting excess requests immediately, the system queues them and processes at a controlled rate. This smooths traffic while maintaining throughput.

Adaptive limits adjust automatically based on system load. When CPU usage or response times increase, limits tighten to preserve stability. When resources are abundant, limits can relax to allow higher throughput.

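# APIForge adaptive rate limiting
import time
import psutil
from redis import Redis

class AdaptiveRateLimiter:
    def __init__(self, base_limit=1000, load_check_interval=5):
        self.redis = Redis()
        self.base_limit = base_limit  # requests per minute under normal load
        self.current_limit = base_limit
        self.load_check_interval = load_check_interval  # seconds between load samples
        self.last_load_check = time.monotonic()
        psutil.cpu_percent(interval=None)  # prime the counter; the first reading is meaningless

    def check_system_load(self):
        # Reuse the cached limit between samples so requests never wait on a load check
        now = time.monotonic()
        if now - self.last_load_check < self.load_check_interval:
            return self.current_limit
        self.last_load_check = now

        # interval=None is non-blocking: it reports CPU usage since the previous call.
        # (interval=1 would block every request for a full second.)
        cpu_usage = psutil.cpu_percent(interval=None)
        memory_usage = psutil.virtual_memory().percent

        if cpu_usage > 80 or memory_usage > 85:
            # Reduce limits under high load
            self.current_limit = int(self.base_limit * 0.5)
        elif cpu_usage < 30 and memory_usage < 50:
            # Increase limits when resources are available
            self.current_limit = int(self.base_limit * 1.5)
        else:
            self.current_limit = self.base_limit

        return self.current_limit

    def is_allowed(self, client_id):
        current_limit = self.check_system_load()
        window = int(time.time() // 60)  # 1-minute windows
        key = f"adaptive:{client_id}:{window}"

        # Increment and set the TTL in one transaction so the key always expires
        pipe = self.redis.pipeline()
        pipe.incr(key)
        pipe.expire(key, 60)
        current_count, _ = pipe.execute()

        return current_count <= current_limit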
System Load Check Results:
CPU: 45%, Memory: 62%
Current Rate Limit: 1000 req/min (normal load)

Client Request: api_key_abc123
Window: 28303901
Current Count: 847
Status: ALLOWED (847/1000)

-- Under high load --

CPU: 85%, Memory: 78%
Current Rate Limit: 500 req/min (reduced)

Client Request: api_key_xyz789
Window: 28303902
Current Count: 501
Status: REJECTED (501/500 exceeded)

What just happened?

The adaptive limiter monitors CPU and memory usage, then adjusts rate limits in real-time. Under normal load, it allows the full 1000 requests per minute. When resources are constrained, it automatically reduces limits to preserve system stability.

This approach prevents cascading failures during traffic spikes and optimizes throughput when resources are available.

Try this: Add alert thresholds that notify ops teams when adaptive limiting activates, indicating sustained high load conditions.

Circuit breaker patterns complement rate limiting by cutting off traffic to failing services. When error rates exceed thresholds, the circuit opens and rejects all requests immediately rather than overwhelming struggling services.
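A circuit breaker can be sketched as a small state machine. The version below is a simplification: it opens after a number of consecutive failures rather than tracking an error rate, and the `CircuitBreaker` class, its method names, and the default thresholds are all assumptions for illustration.

```python
import time


class CircuitBreaker:
    """Opens after N consecutive failures; allows trial requests after a timeout."""

    def __init__(self, failure_threshold=5, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def allow_request(self):
        if self.opened_at is None:
            return True
        # Once the timeout has passed, let requests through again (half-open)
        return self.clock() - self.opened_at >= self.reset_timeout

    def record_success(self):
        # A success closes the circuit and clears the failure count
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            # Open (or re-open) the circuit and restart the timeout
            self.opened_at = self.clock()
```

Callers check `allow_request()` before contacting the downstream service and report the outcome with `record_success()` or `record_failure()`, so a failed trial request re-opens the circuit immediately.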

Priority queues can exempt critical traffic from normal rate limits. Health checks, authentication requests, and administrative operations might bypass the standard per-client limits to ensure core functionality remains available.

Storage Considerations
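Rate limiting data accumulates quickly. With one counter key per client per minute, a busy API serving 1,000 clients can create up to 1.44 million Redis keys a day (1,000 clients × 1,440 minutes). Set a TTL on every key so expired windows are evicted automatically, add cleanup jobs for any keys written without one, and monitor storage usage to prevent Redis memory exhaustion.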

Client-Side Rate Limiting

Smart clients implement their own rate limiting to avoid hitting server limits and provide better user experiences.

Client-side limiting prevents applications from sending requests faster than servers can handle them. Instead of getting 429 responses, clients queue requests internally and send them at appropriate intervals.

Exponential backoff algorithms help clients recover from rate limit violations gracefully. When a request returns 429, the client waits before retrying. Each subsequent failure doubles the wait time until requests succeed again.
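The doubling schedule can be sketched as follows. The `send` callable is a hypothetical stand-in for an HTTP call that returns a status code and an optional Retry-After value in seconds. The random jitter is an addition not described above; it is commonly used so that many clients do not retry at the same instant.

```python
import random
import time


def backoff_delays(base=1.0, cap=30.0, attempts=5):
    """Yield the maximum wait before each retry: base, 2*base, 4*base, ... capped."""
    for attempt in range(attempts):
        yield min(cap, base * (2 ** attempt))


def call_with_backoff(send, attempts=5, sleep=time.sleep, rng=random.random):
    """Call send() until it returns a non-429 status or attempts run out.

    `send` is a hypothetical callable returning
    (status_code, retry_after_seconds_or_None).
    """
    delays = list(backoff_delays(attempts=attempts))
    status = None
    for i, delay in enumerate(delays):
        status, retry_after = send()
        if status != 429 or i == len(delays) - 1:
            return status  # success, other error, or out of attempts
        # Prefer the server's Retry-After; otherwise wait a random
        # fraction of the current backoff ceiling (jitter)
        sleep(retry_after if retry_after is not None else delay * rng())
    return status
```

Capping the delay (30 seconds here) keeps a long outage from producing waits of many minutes, and limiting the number of attempts lets the caller surface the failure instead of retrying forever.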

Request batching reduces the total number of API calls by combining multiple operations into single requests. Instead of creating 100 users with 100 API calls, batch creation might accomplish the same work with 5 calls.
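On the client side, batching usually starts with splitting the work into fixed-size groups. The helper below is a small sketch; `chunked` is a name chosen for this example, and it assumes the API offers a bulk endpoint that accepts a list of records per call.

```python
def chunked(items, batch_size):
    """Split items into consecutive lists of at most batch_size elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


users = [{"name": f"user{i}"} for i in range(100)]
batches = chunked(users, 20)  # 5 batches of 20: 5 API calls instead of 100
```

Each batch would then be sent as the body of a single bulk-create request.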

// APIForge client SDK with built-in rate limiting
class APIForgeClient {
    constructor(apiKey, rateLimit = 10) {
        this.apiKey = apiKey;
        this.requestQueue = [];
        this.rateLimit = rateLimit; // requests per second
        this.lastRequest = 0;
        this.retryDelay = 1000; // initial backoff: 1 second
        this.processing = false; // true while the queue loop is running
    }

    makeRequest(url, options = {}) {
        return new Promise((resolve, reject) => {
            this.requestQueue.push({ url, options, resolve, reject });
            this.processQueue();
        });
    }

    sleep(ms) {
        return new Promise((resolve) => setTimeout(resolve, ms));
    }

    async processQueue() {
        if (this.processing) return; // only one loop drains the queue at a time
        this.processing = true;
        const minInterval = 1000 / this.rateLimit;

        while (this.requestQueue.length > 0) {
            // Space requests at least minInterval milliseconds apart
            const wait = minInterval - (Date.now() - this.lastRequest);
            if (wait > 0) await this.sleep(wait);

            const { url, options, resolve, reject } = this.requestQueue[0];
            this.lastRequest = Date.now();

            try {
                const response = await fetch(url, {
                    ...options,
                    headers: {
                        'X-API-Key': this.apiKey,
                        'Content-Type': 'application/json',
                        ...options.headers
                    }
                });

                if (response.status === 429) {
                    // Server limit hit: prefer Retry-After (seconds), else use our own backoff
                    const retryAfter = Number(response.headers.get('retry-after')) * 1000;
                    await this.sleep(retryAfter > 0 ? retryAfter : this.retryDelay);
                    this.retryDelay = Math.min(this.retryDelay * 2, 30000);
                    continue; // the request stays at the head of the queue and is retried
                }

                this.retryDelay = 1000; // reset on success
                this.requestQueue.shift();
                resolve(response);
            } catch (error) {
                this.requestQueue.shift();
                reject(error);
            }
        }

        this.processing = false;
    }
}
APIForge Client Request Log:
[14:01:00] Request 1: POST /webhooks → 200 OK (processed immediately)
[14:01:10] Request 2: GET /webhooks/list → 200 OK (100ms delay)
[14:01:20] Request 3: PUT /webhooks/wh_123 → 200 OK (100ms delay)
[14:01:30] Requests 4-15: Queued (rate limit: 10/sec)
[14:01:31] Request 4: DELETE /webhooks/wh_456 → 429 Too Many Requests
[14:01:31] Retry-After: 30 seconds, backing off...
[14:02:01] Request 4: DELETE /webhooks/wh_456 → 200 OK (retry succeeded)
[14:02:01] Processing remaining queue (11 requests)

Rate Limiter Status:
- Client Limit: 10 req/sec
- Queue Size: 0
- Current Backoff: 1000ms (reset after success)
- Last Request: 14:02:01.847

What just happened?

The client SDK queues requests and processes them at a controlled rate to avoid overwhelming the server. When it receives a 429 response, it waits for the server's Retry-After interval if one is provided. Otherwise it falls back to its own delay, which doubles after each 429 (up to 30 seconds) and resets to 1 second after a success.

This approach provides a smooth experience for application developers while respecting server limits and handling failures gracefully.

Try this: Add request priority levels so critical operations can jump ahead in the queue while background tasks wait longer.

Caching frequently accessed data reduces API calls significantly. Instead of fetching user profiles on every page load, cache them locally and refresh periodically. This pattern works especially well for reference data that changes infrequently.
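A small time-based cache is often enough for this pattern. The sketch below is illustrative: `TTLCache`, `get_or_fetch`, and the `fetch` callback are hypothetical names, and `fetch` stands in for whatever function performs the real API call.

```python
import time


class TTLCache:
    """Caches fetched values for a fixed number of seconds."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl_seconds = ttl_seconds
        self.clock = clock
        self.entries = {}  # key -> (expires_at, value)

    def get_or_fetch(self, key, fetch):
        """Return the cached value for key, calling fetch(key) only when stale."""
        now = self.clock()
        entry = self.entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]  # still fresh: no API call needed

        value = fetch(key)   # missing or expired: hit the API once
        self.entries[key] = (now + self.ttl_seconds, value)
        return value
```

Choosing the TTL is the main design decision: a longer TTL saves more API calls but lets the client show data that is further out of date.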

Request deduplication prevents sending identical requests multiple times. If a user clicks a button rapidly, the client should recognize duplicate requests and only send one to the server.

Modern browsers provide the Intersection Observer API for implementing "infinite scroll" patterns efficiently. Instead of making API calls on every scroll event, observe when users approach the bottom of content and batch load more data. This reduces API usage while maintaining responsive interfaces.

Rate limiting protects APIs from overload while ensuring fair resource access. The technique combines request counting, time windows, and conditional blocking to manage traffic flow. Whether you are building APIs or consuming them, understanding rate limiting helps create more reliable and efficient systems.

Effective rate limiting requires balancing protection with usability, choosing appropriate algorithms for your traffic patterns, and providing clear feedback to clients about limits and violations.

Quiz

1. APIForge needs to allow mobile apps to sync data in bursts when users open the app, but maintain steady limits over time. Which algorithm works best?

2. When APIForge's webhook API hits rate limits, what should the server response include to help clients retry successfully?

3. APIForge wants their rate limiting to adapt automatically when their servers experience high load. What implementation approach works best?
