Design a Rate Limiter
This article covers:
- What is a rate limiter?
- Where to put the rate limiter
- Algorithms for rate limiting
- High-level architecture of rate limiting
What is a rate limiter?
In the HTTP world, a rate limiter limits the number of client requests allowed to be sent over a specified period. If the API request count exceeds the threshold defined by the rate limiter, all the excess calls are blocked.
Where to put the rate limiter?
You can implement a rate limiter on either the client side or the server side.
• Client-side implementation. Generally speaking, the client is an unreliable place to enforce rate limiting because client requests can easily be forged by malicious actors. Moreover, we might not have control over the client implementation.
• Server-side implementation. Placing the rate limiter on the server side avoids these problems, since we fully control the enforcement logic.

Besides the client and server-side implementations, there is an alternative: instead of putting a rate limiter at the API servers, we create a rate limiter middleware that sits in front of them.

Algorithms for rate limiting
• Token bucket
• Leaking bucket
• Fixed window counter
• Sliding window log
• Sliding window counter
Below is a brief explanation of each.
Token bucket
Both Amazon and Stripe use this algorithm to throttle their API requests.
The token bucket algorithm works as follows (a minimal sketch follows the list):
• A token bucket is a container with a predefined capacity. Tokens are put in the bucket at a preset rate periodically. Once the bucket is full, no more tokens are added.
• Each request consumes one token. When a request arrives, we check if there are enough tokens in the bucket. If there are, we take one token out and the request goes through. If there are not enough tokens, the request is dropped.
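As a rough illustration, here is a minimal Python sketch of the idea; the class name, method names, and the use of `time.monotonic` for refilling are assumptions made for this example, not a reference implementation:

```python
import time

class TokenBucket:
    """Minimal token bucket. `capacity` is the bucket size and
    `refill_rate` is the number of tokens added per second."""

    def __init__(self, capacity: int, refill_rate: float):
        self.capacity = capacity
        self.refill_rate = refill_rate
        self.tokens = float(capacity)        # the bucket starts full
        self.last_refill = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill in proportion to elapsed time, never beyond capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last_refill) * self.refill_rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1                 # each request consumes one token
            return True
        return False                         # not enough tokens: drop the request
```

For example, `TokenBucket(capacity=4, refill_rate=2)` allows bursts of up to 4 requests while sustaining an average of 2 requests per second.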

Pros:
• The algorithm is easy to implement.
• Memory efficient.
• Token bucket allows a burst of traffic for short periods. A request can go through as long as there are tokens left.
Cons:
• The algorithm has two parameters, bucket size and token refill rate, and it can be challenging to tune them properly.
Leaking bucket algorithm
The leaking bucket algorithm is usually implemented with a first-in-first-out (FIFO) queue. The algorithm works as follows (see the sketch after this list):
• When a request arrives, the system checks if the queue is full. If it is not full, the request is added to the queue.
• Otherwise, the request is dropped.
• Requests are pulled from the queue and processed at regular intervals.
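Here is a minimal sketch of the bounded FIFO queue and fixed outflow; the names and the assumption that `drain` runs once per interval are illustrative:

```python
from collections import deque

class LeakingBucket:
    """Minimal leaking bucket. `queue_size` bounds the FIFO queue and
    `outflow_per_tick` is how many requests are processed per interval."""

    def __init__(self, queue_size: int, outflow_per_tick: int):
        self.queue = deque()
        self.queue_size = queue_size
        self.outflow_per_tick = outflow_per_tick

    def enqueue(self, request) -> bool:
        if len(self.queue) >= self.queue_size:
            return False                 # queue full: drop the request
        self.queue.append(request)       # not full: admit the request
        return True

    def drain(self) -> list:
        """Meant to run at regular intervals, e.g. once per second."""
        batch = []
        for _ in range(min(self.outflow_per_tick, len(self.queue))):
            batch.append(self.queue.popleft())  # pull requests in FIFO order
        return batch
```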

Pros:
• Memory efficient given the limited queue size.
• Requests are processed at a fixed rate, so the algorithm suits use cases where a stable outflow rate is needed.
Cons:
• A burst of traffic fills up the queue with old requests, and if they are not processed in time, recent requests will be rate limited.
• There are two parameters in the algorithm, queue size and outflow rate, and it might not be easy to tune them properly.
Fixed window counter algorithm
The algorithm works as follows (a sketch follows the example below):
• The algorithm divides the timeline into fixed-size time windows and assigns a counter to each window.
• Each request increments the counter by one.
• Once the counter reaches the predefined threshold, new requests are dropped until a new time window starts.
For example, if the time unit is 1 second and the system allows a maximum of 3 requests per second, then within each one-second window any request beyond the third is dropped.
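A minimal sketch of the counter logic, assuming integer window indices derived from the wall clock; a real implementation would also evict counters for past windows:

```python
import time
from collections import defaultdict

class FixedWindowCounter:
    """Minimal fixed window counter: one counter per discrete time window."""

    def __init__(self, limit: int, window_seconds: int = 1):
        self.limit = limit
        self.window_seconds = window_seconds
        self.counters = defaultdict(int)   # window index -> request count

    def allow(self) -> bool:
        window = int(time.time() // self.window_seconds)  # current window index
        self.counters[window] += 1         # each request increments the counter
        # Drop the request once the counter exceeds the threshold.
        return self.counters[window] <= self.limit
```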

Pros:
• Memory efficient.
• Easy to understand.
• Available quota resets at the end of each time window, which fits certain use cases.
Cons:
• Spike in traffic at the edges of a window could cause more requests than the allowed quota to go through.
Sliding window counter algorithm
The sliding window counter algorithm is a hybrid approach that combines the fixed window counter and sliding window log.
Assume the rate limiter allows a maximum of 7 requests per minute, there are 5 requests in the previous minute, and 3 in the current minute. For a new request that arrives 30% of the way into the current minute, the number of requests in the rolling window is calculated with the following formula:
• Requests in the current window + requests in the previous window × overlap percentage of the rolling window and the previous window.
• Using this formula, we get 3 + 5 × 0.7 = 6.5 requests. Depending on the use case, the number can be rounded up or down; in our example, it is rounded down to 6, as the sketch below verifies.
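The arithmetic can be checked with a few lines of Python; the function name and the floor rounding are choices made for this example:

```python
import math

def rolling_count(current: int, previous: int, position: float) -> int:
    """Weighted request count for the sliding window counter.
    `position` is how far into the current window the request arrives (0..1)."""
    overlap = 1.0 - position             # overlap with the previous window
    return math.floor(current + previous * overlap)

# The example from the text: 3 requests in the current minute,
# 5 in the previous minute, request arriving 30% into the minute.
print(rolling_count(3, 5, 0.30))         # 3 + 5 * 0.7 = 6.5, floored to 6
```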
Pros
• It smooths out spikes in traffic because the rate is based on the average rate of the previous window.
• Memory efficient.
Cons
• It only works for a not-so-strict look-back window, since it assumes requests in the previous window are evenly distributed.
High-level architecture
The basic idea of rate limiting algorithms is simple. At a high level, we need a counter to keep track of how many requests are sent from the same user, IP address, etc. If the counter exceeds the limit, the request is disallowed.
Where shall we store the counters? Using a database is not a good idea due to the slowness of disk access. An in-memory cache is a better choice because it is fast and supports a time-based expiration strategy. For instance, Redis is a popular option for implementing rate limiting: it is an in-memory store that offers two commands useful here, INCR and EXPIRE.
• INCR: It increases the stored counter by 1.
• EXPIRE: It sets a timeout for the counter. If the timeout expires, the counter is automatically deleted.

• The client sends a request to the rate limiting middleware.
• The rate limiting middleware fetches the counter from the corresponding bucket in Redis and checks whether the limit has been reached, as sketched below.
• If the limit is reached, the request is rejected.
• If the limit is not reached, the request is sent to API servers. Meanwhile, the system increments the counter and saves it back to Redis.
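A minimal sketch of this flow using the redis-py client; the key naming scheme and the 3-requests-per-second limit are assumptions for illustration:

```python
import redis  # assumes the redis-py client is installed

r = redis.Redis(host="localhost", port=6379)

def is_allowed(client_id: str, limit: int = 3, window_seconds: int = 1) -> bool:
    """Fixed-window check backed by Redis INCR and EXPIRE."""
    key = f"rate:{client_id}"            # illustrative key scheme
    count = r.incr(key)                  # INCR: create the counter at 1, or increment it
    if count == 1:
        # EXPIRE: the counter is deleted automatically when the window ends.
        r.expire(key, window_seconds)
    return count <= limit                # reject once the limit is exceeded
```

Note that `INCR` and `EXPIRE` here are two separate round trips; a production implementation would typically make them atomic, for example with a Lua script or a MULTI/EXEC transaction.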
For more detail, refer to System Design Interview: An Insider's Guide by Alex Xu, 2nd edition.