System Design for URL Shortener
Rahul Arram
Data Engineering @Tesco | 6+ YOE | Polyglot Engineer | Ex- Wayfair, Turtlemint, Manhattan Associates, Hopscotch | Spark, Kafka, Airflow, Python, Superset
Pre-check:
- Long URL - 2 KB (2,048 chars)
- Short URL - 17 bytes (17 chars)
- Created-at timestamp - 7 bytes
- Expiry timestamp - 7 bytes
Total: ~2.03 KB per record.
If we assume 30M users/month (one URL each), that is at most ~60.9 GB/month ==> ~0.73 TB/year ==> ~3.65 TB of data over 5 years.
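A quick back-of-envelope check of these numbers (decimal units, matching the estimate above):

```python
# Back-of-envelope storage check: 30M records/month at ~2.03 KB each
record_kb = 2 + (17 + 7 + 7) / 1000              # long URL + short URL + two timestamps
monthly_gb = 30_000_000 * record_kb / 1_000_000  # ~60.9 GB/month
print(f"{monthly_gb:.1f} GB/month")
print(f"{monthly_gb * 12 / 1000:.2f} TB/year")          # ~0.73 TB/year
print(f"{monthly_gb * 60 / 1000:.2f} TB over 5 years")  # ~3.65 TB
```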
Note: URLs will be created for a short duration only, so expired records can be cleaned up.
The main idea is to create a 7-character ID, e.g., www.shortto.com/ABC12qb.
Approach 1:
Using the Base-62 Algorithm:
Why base 62? 62^7 ≈ 3.5 trillion possible IDs, whereas 7 decimal digits (10^7) give only 10 million.
The 62-character set is built from:
- A-Z - 26
- a-z - 26
- 0-9 - 10
Worried about the base-62 conversion code? It is only a few lines.
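Here is a minimal sketch in Python; the alphabet ordering and function names are assumptions, since any fixed ordering works:

```python
import string

# Base-62 alphabet: 0-9, a-z, A-Z (the ordering is a convention; any fixed one works)
ALPHABET = string.digits + string.ascii_lowercase + string.ascii_uppercase
BASE = len(ALPHABET)  # 62

def encode_base62(num: int) -> str:
    """Convert a non-negative integer to its base-62 string."""
    if num == 0:
        return ALPHABET[0]
    chars = []
    while num > 0:
        num, rem = divmod(num, BASE)
        chars.append(ALPHABET[rem])
    return "".join(reversed(chars))

def decode_base62(s: str) -> int:
    """Convert a base-62 string back to the integer it encodes."""
    num = 0
    for ch in s:
        num = num * BASE + ALPHABET.index(ch)
    return num

# Round-trip example
assert decode_base62(encode_base62(3_000_000)) == 3_000_000
```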
Deciding which DB to use:
RDBMS:
- ACID transactions
NoSQL:
- Eventual consistency
- Highly available
- Scalable
Suppose we have n app servers A1, A2, ..., An. A long URL arrives at one of these servers, which generates a random 7-character base-62 ID and pairs it with the long URL as a key-value record.
These records are stored in the DB, and the short URL is returned.
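A minimal sketch of that random-ID step, assuming the same 62-character alphabet as above:

```python
import secrets
import string

ALPHABET = string.digits + string.ascii_lowercase + string.ascii_uppercase

def random_short_id(length: int = 7) -> str:
    # secrets gives unbiased, cryptographically strong choices;
    # collisions are still possible and must be handled at the DB layer
    return "".join(secrets.choice(ALPHABET) for _ in range(length))

print(random_short_id())  # e.g. 'x9Bq2Lm'
```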
What if multiple long URLs generate the same short URL? That is the worst case: a collision.
Solution: use a conditional insert (write only if the key is absent) so the DB rejects duplicates; many NoSQL stores do not offer this kind of atomic check.
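As a sketch of "insert only if absent", here using SQLite (the table and column names are assumptions):

```python
import sqlite3

conn = sqlite3.connect("urls.db")
conn.execute("CREATE TABLE IF NOT EXISTS urls (short_id TEXT PRIMARY KEY, long_url TEXT)")

def save_if_absent(short_id: str, long_url: str) -> bool:
    # INSERT OR IGNORE is atomic: if short_id already exists, the row is skipped
    cur = conn.execute(
        "INSERT OR IGNORE INTO urls (short_id, long_url) VALUES (?, ?)",
        (short_id, long_url),
    )
    conn.commit()
    return cur.rowcount == 1  # True if we claimed the key, False on collision
```

On a False return, the app server simply generates a fresh random ID and retries.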
For high scalability, we need to find an even better approach.
Approach 2:
MD5 Hashing Algorithm:
The MD5 algorithm produces a 128-bit digest, written as 32 hex characters. We pick only the first 7 characters as the short ID, so two different long URLs can end up with the same prefix: there is a real chance of collision.
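A minimal sketch of this truncated-hash approach:

```python
import hashlib

def md5_short_id(long_url: str, length: int = 7) -> str:
    # MD5 yields a 128-bit digest (32 hex chars); keeping only a 7-char
    # prefix shrinks the ID space, so a collision check is still required
    digest = hashlib.md5(long_url.encode("utf-8")).hexdigest()
    return digest[:length]

print(md5_short_id("https://example.com/some/very/long/path"))  # 7 hex chars
```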
Approach 3:
Counter-Based Approach:
Maintain a counter server that hands out the next sequence number whenever an app server asks for one.
What if the counter server goes down? We could shard it into multiple counters, one per block of one million IDs, but then what happens when a counter hits its maximum?
We would need an admin server to keep track of all these counters, which finally increases the cost.
So this approach ultimately leaves us in a mess as well.
Final Approach:
ZooKeeper service:
ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services. All of these kinds of services are used in some form or another by distributed applications.
Features (a short code sketch follows this list):
- Distributed Configuration Management
- Leader election / consensus building
- Coordination and locks
- Key-value store.
- Each Node is called a zNode.
- Each zNode has a path.
- A zNode can be persistent or ephemeral
- Each zNode can store data
- Each zNode can be watched for changes
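A sketch of these features using kazoo, a Python ZooKeeper client; the paths and payload are illustrative assumptions:

```python
from kazoo.client import KazooClient

zk = KazooClient(hosts="127.0.0.1:2181")
zk.start()
zk.ensure_path("/shortener/ranges")

# Ephemeral + sequential: the zNode disappears if this server dies,
# and ZooKeeper appends a monotonically increasing suffix to the path
path = zk.create(
    "/shortener/ranges/server-",
    b"3000000-3999999",  # the 1M ID range this server claims
    ephemeral=True,
    sequence=True,
)
print("registered at", path)

# Any client can watch a zNode for data changes
@zk.DataWatch(path)
def on_change(data, stat):
    print("range node updated:", data)

zk.stop()  # a real server keeps the session open; stopping deletes the ephemeral node
```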
User requests from the client side go through a load balancer to the respective services, which in turn hit the database.
ZooKeeper hands out the ID space in ranges of one million; since 62^7 ≈ 3.5 trillion IDs are available, there are roughly 3.5 million such ordered buckets to allocate.
As soon as a server is added, ZooKeeper allocates an unused range to it.
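A hypothetical sketch of how an app server mints IDs inside its allocated range (the range boundaries are assumptions; encode_base62 is repeated from the earlier sketch for completeness):

```python
import string

ALPHABET = string.digits + string.ascii_lowercase + string.ascii_uppercase

def encode_base62(num: int) -> str:
    out = []
    while num:
        num, rem = divmod(num, 62)
        out.append(ALPHABET[rem])
    return "".join(reversed(out)) or ALPHABET[0]

class RangeAllocator:
    """Mints collision-free IDs inside a range handed out by ZooKeeper."""
    def __init__(self, start: int, end: int):
        self.next_id, self.end = start, end

    def next_short_id(self) -> str:
        if self.next_id >= self.end:
            raise RuntimeError("range exhausted: request a fresh range from ZooKeeper")
        short_id = encode_base62(self.next_id)
        self.next_id += 1
        return short_id

alloc = RangeAllocator(3_000_000, 4_000_000)
print(alloc.next_short_id())  # 'cAr6' (a production system would pad to 7 chars)
```

Because every server draws from a disjoint range, no two servers can ever mint the same ID.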
We can also maintain a cache for faster retrieval of data. The final key-value pairs are stored in the DB.
The service exposes two functions:
- CreateLongUrl()
- GetshortUrl()
Note: Redis can be used for caching.
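A minimal cache-aside sketch with the redis-py client (the host, TTL, and db_lookup callback are assumptions):

```python
import redis  # redis-py client; assumes a Redis instance on localhost

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def get_long_url(short_id: str, db_lookup) -> str | None:
    # Cache-aside: check Redis first, fall back to the DB, then populate the cache
    long_url = r.get(short_id)
    if long_url is None:
        long_url = db_lookup(short_id)  # hypothetical DB fetch, e.g. a SELECT by key
        if long_url is not None:
            r.set(short_id, long_url, ex=24 * 3600)  # TTL, since URLs are short-lived
    return long_url
```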
If some server eventually goes down, only its particular range is affected. Since ZooKeeper monitors which servers are alive (their ephemeral zNodes vanish on failure), it can reclaim the range that is no longer needed.
Because every server mints IDs from a disjoint range, there is no collision issue irrespective of the DB used, and the system still gives us eventual consistency.