System Design and Development of a URL Shortener Service (1 of 2)
Article Goals
The intent of this article is to take a holistic approach to systems design, using a URL shortening service as a case study: an overview of baseline considerations, a workable solution, and a template for future articles.
This first installment of a two-part series explores the system design of a URL shortener service; the follow-up article will shift toward the practical aspects of development and testing.
Introduction
Designing a URL shortener is a great introduction to systems design as a whole, since it is a common tool on the web for truncating URLs. A URL shortening service offers businesses valuable insight through data analytics and affiliate marketing, and customers can use the service to post otherwise cumbersome links on social media and enhance brand visibility. Over time the underlying system will need to accommodate increasing demand for instant access, and the sheer volume of data being processed necessitates both efficiency and scalability.
Initial Considerations for a URL Shortener Service
To create our URL shortener service we will store key-value pairs, allowing O(1) read and write operations on the data: the short URL is the key, and the long URL is its respective value. Making the short URL the key supports the case where the client service needs a quick lookup of the long URL for a given short URL. Should a URL be re-entered into the system, we treat the request as non-idempotent and generate a new unique short URL.
//pseudo-code
DB = {[shortURL]: [longURL]}
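As a minimal illustration of this key-value model (the short keys below are hypothetical placeholders, not output of our eventual encoder), note how re-submitting the same long URL simply creates a second entry under a new key:
# Minimal sketch of the key-value model; the short keys are placeholders.
db = {}
db["abc123"] = "https://example.com/some/very/long/path"

# Re-entry is non-idempotent: the same long URL receives a brand new key
# instead of reusing "abc123".
db["xyz789"] = "https://example.com/some/very/long/path"

# O(1) lookup of the long URL for a given short URL.
print(db["abc123"])  # https://example.com/some/very/long/path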
Functional Requirements
This service could be built as a typical client-server application; however, the capacity estimates below highlight why we may need to opt for a distributed system.
Non-Functional/Capacity Requirements
Calculation for the Short URL generation via the encoder function
There are an estimated 2-3 Billion short URLs created per year:
(Number of requests made per year) / (Number of seconds in a year) = Number of queries per second (qps)
(2.5 x 10^9) / (365 days x 24 hours x 60 mins x 60 seconds) ~= 80 qps
Calculation for the Long URL retrieval via the decoder function
There are an estimated 20 Billion clicks to obtain the redirect URL from this service per month:
(20 x 10^9) / (30 days x 24 hours x 60 mins x 60 seconds) ~= 7700 qps
QPS CALCULATIONS SUMMARY
Short URL generation: ~80 qps for Write
Long URL retrieval: ~7700 qps for Read
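These back-of-envelope figures are easy to sanity check in a few lines of Python (the traffic numbers are the estimates stated above):
# Sanity check of the QPS estimates above.
SECONDS_PER_YEAR = 365 * 24 * 60 * 60   # 31,536,000
SECONDS_PER_MONTH = 30 * 24 * 60 * 60   # 2,592,000

writes_per_year = 2.5e9   # ~2-3 billion short URLs created per year
reads_per_month = 20e9    # ~20 billion redirect clicks per month

write_qps = writes_per_year / SECONDS_PER_YEAR   # ~79 qps
read_qps = reads_per_month / SECONDS_PER_MONTH   # ~7716 qps
print(f"Write: ~{write_qps:.0f} qps, Read: ~{read_qps:.0f} qps")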
Initial System Diagram, Redundancy Considerations, and Caching Strategy
Accessing our data on disk is significantly slower than accessing it in memory, where RAM provides near-instant data access. RAM is volatile, however, so we may lose our data if the system is powered off; keeping a record of the database on disk provides better fault tolerance should the system go offline due to, say, a power outage. Disk also offers far more storage capacity per dollar than RAM.
Given these tradeoffs, our architecture will use a hybrid solution that stores our database both on disk and in memory. We require a few services for our initial architecture: a compute instance, a cache to allow high-speed data retrieval, and a database to serve as long-term, non-volatile, and fairly low-cost storage.
For our initial architecture we are also employing a cache-aside (lazy-loading) strategy over a write-through approach. This reflects the design's preference for availability over consistency, since strict freshness of the data in such a service is not paramount.
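A minimal sketch of the cache-aside read path (the cache and database here are plain dictionaries standing in for, say, Redis and the on-disk key-value store):
# Cache-aside (lazy loading): check the cache first, fall back to the
# database on a miss, then populate the cache for subsequent reads.
cache = {}     # stand-in for an in-memory cache such as Redis
database = {}  # stand-in for the on-disk key-value store

def get_long_url(short_url):
    long_url = cache.get(short_url)
    if long_url is not None:
        return long_url               # cache hit: served from memory
    long_url = database.get(short_url)
    if long_url is not None:
        cache[short_url] = long_url   # lazily populate the cache
    return long_url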
System Tradeoffs
No distributed system is safe from network failures, so we need to build in fault tolerance. Moving forward we will create the system with two replicas per tier as a baseline for redundancy. We will also assume an arbitrary 3-year overall data retention policy to ground further system optimizations pertaining to data load and throughput. Afterwards the service will need to be reevaluated and rebuilt around the newfound system analytics.
The designed system is read-heavy, as projected earlier in the long URL retrieval QPS calculation, so opting for a hot-warm architecture will optimize the system around availability. The overall system is therefore formulated around the availability and partition tolerance of the CAP Theorem.
In a hot-warm architecture, the client of the service reads and writes data through the lead replica, while reads are also served from the secondary replica. The tradeoff for the added redundancy is potentially stale data in the secondary. Finally, a third replica, copied from the secondary, serves as a failover backup. This pattern allows the client to continue reading data should the primary be downed, with the tertiary replica providing an additional layer of durability to recover data.
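To make the hot-warm read path concrete, here is a minimal sketch in which reads prefer the lead replica and fail over down the chain (the replica clients and their get() method are hypothetical stand-ins):
# Sketch of a hot-warm read path: prefer the lead replica, then fail over.
# Each replica is a hypothetical client exposing get(); reads served by the
# secondary or tertiary may be slightly stale.
class ReplicaUnavailable(Exception):
    pass

def read_with_failover(key, primary, secondary, tertiary):
    for replica in (primary, secondary, tertiary):
        try:
            return replica.get(key)
        except ReplicaUnavailable:
            continue  # this replica is down; try the next layer
    raise ReplicaUnavailable("all replicas are unreachable")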
Calculation of Database Size
In order to calculate data size we may refer back to our QPS for creating the short URLs, which was about 80 qps, and use this alongside the projected life of our system (the three years stated earlier) to help determine the overall expected data.
Total Number of Queries = (Number of seconds in three years) x W, where W is the write QPS:
(3 x 365 days x 24 hours x 60 mins x 60 seconds) x 80 qps ~= (9.5 x 10^7 seconds) x 80 qps ~= 7.6 x 10^9 queries, which we round up to 10^10, or 10 billion.
Using this Total Number of Queries of 10 billion we may calculate the Total Data Size, which is the Total Number of Queries x Size of Data.
The Size of Data must cover the record of one key-value pair. Since the short and long URLs are text-based, assigning an arbitrarily generous length of 2048 bytes, or 2 KB, is a sufficient expectation for a single key-value pair.
Total Data Size = (10^10 queries) x 2 KB
Total Data Size = 2 x 10^10 KB
Total Data Size = 20 TB
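The same estimate, reproduced in Python under the 2 KB-per-record assumption:
# Storage estimate: total records over three years x 2 KB per record.
total_records = 10**10    # ~10 billion key-value pairs (rounded up)
record_size_kb = 2        # assumed size of a single key-value pair

total_kb = total_records * record_size_kb   # 2 x 10^10 KB
total_tb = total_kb / 10**9                 # 1 TB = 10^9 KB in decimal units
print(f"Total data size: ~{total_tb:.0f} TB")  # ~20 TB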
Designing the length of the Short URL
In the previous section, Calculation of Database Size, a key detail is that the number of write queries for short URL generation corresponds to the number of key-value pairs to keep on record in our database. Since the unique keys in the system are short URLs, there are approximately 10 billion queries, and thus 10 billion key-value pairs, to record. We can use this information to determine the length of the short URLs, because the encoding needs to generate sufficiently many unique URLs.
Initial Considerations for Short URL Generation
Regarding the generation of short URLs, we may consider applying an existing cryptographic hash function such as SHA-1, SHA-2, SHA-3, or MD5 to our long URL. However, these functions are deterministic: the same long URL always yields the same digest, which conflicts with our requirement to issue a new unique short URL on re-entry. We are therefore better off creating our own encode and decode methods to have full control of the system's functionality.
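A quick demonstration of that determinism, using Python's standard hashlib (the URL is a placeholder):
import hashlib

# A cryptographic hash is deterministic: hashing the same long URL twice
# yields the same digest, so a re-entered URL could never receive a new
# unique short URL this way.
url = "https://example.com/some/very/long/path"
digest_1 = hashlib.sha256(url.encode()).hexdigest()
digest_2 = hashlib.sha256(url.encode()).hexdigest()
assert digest_1 == digest_2  # identical every time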
Faulty Approach: Store an index as the short URL by maintaining a counter
We need to come up with a solution that is both short and unpredictable. As an initial approach we could generate our shortened URLs by storing an index as the short URL via an incrementing counter. Although this implementation generates unique values for the short URL, it is a predictable solution that can be trivially enumerated or reverse engineered. The generated values would also not stay short, as the index inevitably grows into a very long value.
Sophisticated Approach: Hash the URL and update the counter to avoid data collisions
To make our encoding unique it seems natural to generate a random value. However, typical pseudo-random generators cannot guarantee uniqueness, so we will recycle the previous approach of handling an index via a counter. This counter is special because it serves as a Primary Key: it prevents data collisions by being unique and never null. Alternatively, we may opt to use another unique, non-repeating value such as a system timestamp or epoch time.
//pseudo-code
shortURL = encode(longURL, primaryKey)
Ultimately we will convert a number n to a base x such that a unique, short URL is generated for a given long URL. Suppose the short URLs were binary strings of 0's and 1's: each character would then be a base-2 digit, giving only two symbols to work with. This may suffice, but we can certainly do better by simply choosing a higher base, so let's up it to base-16. This means we can use 0-9 and the characters A-F as the digits of the short URL.
We can take it even further, up to base-62, by using 0-9, a-z, and A-Z. Adding two more characters, hyphen "-" and underscore "_", brings the digit set to base-64, and the math tends to be cleaner when we work with powers of 2.
Calculate the length of the short URLs
There are 10 billion key-value pairs of data, so at the very least we need 10^10 unique keys. Since we have 64 characters available, we can calculate the minimum length of the short URL needed to cover that key space.
Let L be the length of the short URL; let's calculate the value of L:
64^L >= 10^10
L x log10(64) >= 10
L >= 10 / log10(64) ~= 5.54
Rounding up to the nearest whole number gives L = 6, meaning a 6-character short URL yields 64^6 possible key values.
Let's verify how many unique short URLs we can generate with values that are more human readable:
64^6 = (2^6)^6 = 2^36 = (2^30) x (2^6) = (2^10)^3 x 2^6 ~= (1 Billion) x 64 ~= 64 Billion unique short URLs (exactly about 68.7 billion), meeting and exceeding our expected requirement of 10 billion keys.
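The same bound can be verified directly in Python:
import math

# Smallest L with 64^L >= 10^10, i.e. L >= 10 / log10(64).
required_keys = 10**10
L = math.ceil(10 / math.log10(64))   # 6
print(L, 64**L)                      # 6 68719476736 (~68.7 billion)
assert 64**L >= required_keys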
Encode and Decode Methods
Here's a walkthrough solution for the short URL generation using an encode function, written in Python for readability and based on what we've gathered in previous sections. The comments in the following code are present to guide learners. See the Appendix for Python, TypeScript, and Go specific code.
counter = 0  # Counter needed to guarantee uniqueness
base_url = "https://url.shorten/"
# Simulated database using a global dictionary (consider replacing with a real database in development)
db = {}

def encoder(long_url):
    global counter
    counter += 1
    digits = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ-_"
    base = len(digits)  # Base-64
    digit_list = []
    curr_val = counter
    while curr_val >= base:
        quotient, remainder = divmod(curr_val, base)
        digit_list.append(digits[remainder])
        curr_val = quotient
    # Append the most significant digit
    digit_list.append(digits[curr_val])
    # Reverse the list and join to form the short URL path
    short_url = ''.join(reversed(digit_list))
    # Store the mapping in the dictionary
    db[short_url] = long_url
    # Return the complete short URL
    return base_url + short_url
And the decode will be a simple lookup in the database:
def decode(short_url):
    key = short_url.replace(base_url, "")
    return db.get(key, "URL not found")
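A quick round trip through these two functions:
# Example round trip using the encoder/decoder above.
short = encoder("https://example.com/some/very/long/path")
print(short)          # e.g. https://url.shorten/1 on the first call
print(decode(short))  # https://example.com/some/very/long/path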
Scaling for Data Size
Before delving into the optimization strategies for data size within our URL shortener service, it's essential to establish a few heuristics related to caching and database management. These heuristics will guide our approach to efficiently managing and accessing the vast amounts of data our service will handle.
Heuristics for Caching and Data Access:
- Follow the 80/20 rule of thumb: a small fraction of URLs receives most of the traffic, so we will keep roughly 20% of the database in the cache (applied in Cache Partitioning below).
- Assume commodity hardware: roughly 1-2 TB of disk and about 128 GB of RAM per machine (applied in the partitioning calculations below).
Database Partitioning
Referring back to the calculation of the data size present in our system over its lifecycle, it's projected that there will be 20 TB worth of data in the system. Assuming that a typical commodity machine has roughly 1-2 TB of disk space, we would need 10 partitions of 2 TB each. Following our initial system design, each partition will have 2 replicas, necessitating a total of 30 machines for database storage.
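As a hedged sketch of how a key could be routed to one of those 10 primary partitions, here is a simple hash-modulo scheme (production systems often prefer consistent hashing so partitions can be added without remapping most keys):
import hashlib

NUM_DB_PARTITIONS = 10  # 20 TB total at ~2 TB of disk per machine

def partition_for(short_url):
    # Use a stable hash (not Python's per-process randomized hash())
    # so routing stays consistent across application instances.
    digest = hashlib.md5(short_url.encode()).hexdigest()
    return int(digest, 16) % NUM_DB_PARTITIONS

print(partition_for("1"))  # deterministic partition index in [0, 9]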
Cache Partitioning
Opting to store 20% of the database in our cache, as specified in our heuristics, means that 4 TB of data, or approximately 2 billion key-value pairs, will be present in the cache at all times. Given that a standard commodity machine comes with about 128 GB of RAM, dividing 4 TB by 128 GB yields roughly 30 partitions. As per our design, each partition will have two replicas, requiring about 90 machines in total for the cache.
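The cache-sizing arithmetic, reproduced in Python with the figures assumed above:
# Cache sizing: 20% of the 20 TB data set, split across 128 GB machines.
cache_fraction = 0.20
total_data_tb = 20
ram_per_machine_gb = 128

cache_data_gb = total_data_tb * 1000 * cache_fraction   # 4,000 GB (4 TB)
partitions = cache_data_gb / ram_per_machine_gb         # 31.25, call it ~30
machines = 30 * 3                                       # primary + 2 replicas = 90
print(round(partitions), machines)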
Final Systems Architecture
In the context of our distributed URL shortener service, the architecture's resilience and efficiency are paramount. This necessitates an effective routing policy at the application tier, typically managed by an API Gateway or a Load Balancer. These components play a pivotal role in distributing incoming traffic to various service instances, ensuring that no single point becomes a bottleneck or a point of failure.
Load Balancer and Redundancy
To mitigate risks associated with a single point of failure, it's crucial to implement redundancy for the Load Balancer. This can be achieved by deploying multiple Load Balancers and using a Domain Name System (DNS) strategy to distribute traffic among them. Such a setup not only enhances system reliability but also enables seamless handling of load spikes and failover scenarios.
DNS Strategy
A sophisticated DNS strategy can significantly contribute to system resilience and load distribution. By directing traffic across multiple Load Balancers, the DNS serves as an additional layer of load balancing. This approach also facilitates geographical load distribution, directing users to the nearest or most available service instance, thus minimizing latency and improving the overall user experience.
Considerations
The final architecture of our URL shortener service demands a robust, scalable, and resilient setup, with a special focus on effective traffic distribution and failover management. By leveraging a combination of API Gateways, Load Balancers, and a strategic DNS configuration, we can achieve a highly available and efficient system capable of handling the demands of a distributed environment.
Building a Practical Demo: A Step Toward Realization (Personal Takeaway)
While the comprehensive system architecture outlined represents an ideal, large-scale deployment, the reality of development, especially for individuals or smaller teams, often necessitates starting on a more manageable scale. To bridge this gap between theory and practice, I aim to create a simplified yet functional demo of our URL shortener service. This demo will serve as a tangible example of applying the architectural principles discussed, scaled down to a level that's feasible for solo developers or small teams.
Why a Demo?
Demo Overview
For the upcoming demo, we'll employ a tech stack that's both robust and accessible; the specific choices will be covered in the follow-up article.
Building the Demo
The development process will be walked through step by step in the follow-up article.
Future Steps
This demo will not only demonstrate our system architecture but also lay the groundwork for scaling up. As my development and data engineering skills evolve, I plan to incrementally incorporate more elements of the full system architecture, tackling challenges such as distributed caching, database sharding, and advanced load balancing techniques.
Stay tuned for a follow-up article where I'll share the development journey, challenges encountered, code snippets, and insights gained from building this demo. Together, we'll take a step closer to turning the theoretical architecture of our URL shortener service into a tangible, functioning reality.
Appendix: Code Implementations
This appendix provides code samples for the URL shortener service's encode function in Go, TypeScript, and Python. Additionally, a simple decode function is provided for each language, demonstrating how to retrieve the original URL from the shortened version.
Go Implementation
Encoding
package main

import (
    "sync"
)

var (
    counter      int
    base64Digits = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ-_"
    base         = len(base64Digits)
    db           = make(map[string]string)
    lock         = sync.Mutex{}
    baseURL      = "https://url.shorten/"
)

func encode(longURL string) string {
    lock.Lock()
    defer lock.Unlock()
    counter++
    var digitList []byte
    currVal := counter
    for currVal > 0 {
        remainder := currVal % base
        digitList = append([]byte{base64Digits[remainder]}, digitList...)
        currVal /= base
    }
    shortURL := string(digitList)
    db[shortURL] = longURL
    return baseURL + shortURL
}
Decoding
func decode(shortURL string) string {
    lock.Lock()
    defer lock.Unlock()
    shortURL = shortURL[len(baseURL):] // Remove the base URL part
    return db[shortURL]
}
TypeScript Implementation
Encoding
let counter = 0;
const base64Digits = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ-_";
const base = base64Digits.length;
const db = new Map<string, string>();
const baseURL = "https://url.shorten/";

function encode(longURL: string): string {
    counter++;
    const digitList: string[] = [];
    let currVal = counter;
    while (currVal > 0) {
        const remainder = currVal % base;
        digitList.unshift(base64Digits[remainder]); // Prepend to make reversing unnecessary
        currVal = Math.floor(currVal / base);
    }
    const shortURL = digitList.join('');
    db.set(shortURL, longURL);
    return baseURL + shortURL;
}
Decoding
function decode(shortURL: string): string {
    const key = shortURL.replace(baseURL, "");
    return db.get(key) || "URL not found";
}
Python Implementation
Encoding
counter = 0
base64_digits = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ-_"
base = len(base64_digits)
db = {}
base_url = "https://url.shorten/"
def encode(long_url):
    global counter
    counter += 1
    digit_list = []
    curr_val = counter
    while curr_val > 0:
        remainder = curr_val % base
        digit_list.insert(0, base64_digits[remainder])  # Prepend to make reversing unnecessary
        curr_val //= base
    short_url = ''.join(digit_list)
    db[short_url] = long_url
    return base_url + short_url
Decoding
def decode(short_url):
    key = short_url.replace(base_url, "")
    return db.get(key, "URL not found")
This appendix provides a concise, clear presentation of the encode and decode functionalities across three different programming languages, intended for readers to easily adapt or extend based on their specific requirements or to explore further optimizations.