Uber System Design Demysified
Rajesh Srinivasan
CGI Partner | Vice President Consulting Delivery | Emerging Tech Learner, Global Delivery Leader
This article is capturing of pointers and insights for the video about Uber System design.
DISCO - The spine of the system
This area of the system helps in the location and map related services which aids in each and every operation between the riders and driver (supply vs demand) but for the payment.
Google S2 library is used to take the map and divide them as 1 KM * 1 KM cell and store the same easily. Each cell is given a id and hence based on location, we can easily locate the cell using the id. This also aids by enabling the drawing of circle of radius around your location and identify the list of drivers (supply) available for rider (demand) location. This enables for the system to locate the cabs available.
As the system finds the cabs available within the nearby radius of rider location, it needs to calculate the ETA by calculating shortest distance. But unfortunately this is not straight forward, but we need to understand the road / path instead of plain geometry line. So the system needs to send notifications to drivers of cabs found and then depending on the acknowledgement from driver, it can show the availability and enable the allocation for rider booking.
How does system track cab location
Every cab will send the location through Web application gateway / firewall , load balancer to the Kafka REST Api which gets consumed to different places as required, database and DISCO to keep the latest cab location. We have different servers in a ring which are assigned responsibilities of different hashed cell index and hence the data is stored accordingly. These servers are in architecture called Ring Pop and they are equally distributed with loads when new servers are added or servers are taken down.
Web Sockets
Web Sockets have been kept for ease of communication between the clients and the demand/supply system management (DISCO) or any component in the server. This is written in Node Js as this is good in asynchronous messaging, small messaging and event driven enabled.
How this dance is choreographed.
This contains list of servers that are expected to manage the entire service of mapping demand to supply.The system uses consistent hashing and makes RPC calls from one server to another server. SWIM protocol for each server responsibility to be understood by each other.
- The rider makes a request via websocket and this is landed to the demand service. This request will contain the type of cab and service requests.
- Demand Service now passes this to the Supply service type of cab and nature of drive requested and using Google S2 library passes the cell id of the rider.
- Supply service based on hashing the cell index, finds the server that will have the data related to cabs in this range of cell index. (ie) If user makes request from cell index 5, it finds the server which holds the data for this cell index and makes the call. If there are multiple indexes obtained, supply service talks to one server which functions as master and makes calls to respective all other servers and manages the communication between the servers in the ring via RPC calls.
- The server now draws a circle to find all the cells from where cabs can be figured out. Then based on cabs found, it uses Map Service to find ETA and responds to Supply Service.
- The supply service now sends the request back to the drivers and depending on notification / acknowledgement from driver, allocate the same to the rider.
Geo Spatial Design - How this is achieved.
- Google S2 libraries - This service is used to identify the location for cab and rider
- Building Maps - Earlier Uber used map-box but now Google Map Framework has been used and also Google Maps API to calculate ETA by collaboration with Google while it was doing on its own earlier.
- Preferred Access points. - For large communities and also university / campus the system cannot come to exact location of rider. So based on the trend of bookings, it learns specific access points which are being used by riders and learns to recommend the same back to the rider.
ETA - Managing the customer experience
When rider requests a cab, the demand service based on rider location requests supply service which finds cab nearby the rider which are free and calculates ETA for different cabs. This does not work in ideal scenario as there might be cabs in transit which will complete the trip shortly and might reach the rider faster. So Uber has several costs for turns, U-turns and ensure serving cabs and not only free cabs to serve rider / calculate ETA.
Where it stores
Moved from RDBMS to NoSQL DB. This is to ensure horizontal scaling, manage different regions on which they are establishing the services. As each cab sends request every 4 seconds, this will be write heavy application and read heavy application to manage rider request. So in order to enable intense operations with nearly zero downtime irrespective of adding nodes, taking backup or system goes down. They build data center in nearest place to the region they establish to ensure faster response.
Leveraging Analytics
Core purpose is to improve customer satisfaction better by making sense of data it has about customer needs, behaviors of customer. This will also help to drive down costs by enabling to skim through the tons of data flowing in which are stored in RDBMS / No SQL / HDFS systems to store large volumes of data.
Hadoop Hive - has lot of different tools to build the analytics.
Component Maps - Consume the historical data with real time data to back track and build new maps or improve the data. This also enables system to track the congestion, traffic, cab speed etc to calculate ETA.
Machine Learning / Fraud Detection
Payment frauds with stolen credit cards , incentive abuse by cab drivers using simulation of trips using fake GPS apps. This component uses historical data with the current trips to identify the kind of frauds on the abuse by cab drivers. Hackers uses compromise accounts to withdraw money in wallet and hence the system uses the historical trend of riders historical position to identify.
Maintain the stability of system
All systems publish logs to separate Kafka cluster which can be analyzed and dashboards to enable the monitoring and control of large volume of system in this eco system.
Handling Data Center Failures
While the data center location is one of the key strategy for Uber faster response to riders, it is essential that it manages the data center failures and also manage switch over to backup data centers. While general strategy is to copy data always form primary to secondary, Uber uses alternative way.
The driver app when it makes call to send location to the server, it also sends state digest which is unique encrypted indicator which indicates level of knowledge available with driver app. When app makes a call to DC and it identifies that primary DC is not available, it will call the Backup DC. So Backup DC does not have trips in transit, so it uses the state digest from driver app, to retrieve required information from driver app to finish the trips in transit. So this switch happens quiet transparently to rider / driver.
For additional information you can refer the architecture information available in engineering.uber.com or watch the above video.
Lead Data Engineer | Streaming and BigData Infrastructure | Realtime and Batch Data Analytics
3 年Thank you for sharing. I have a question regarding the last step of DISCO [The supply service now sends the request back to the drivers and depending on notification / acknowledgement from driver, allocate the same to the rider] if the system is distributed, where we have n instances of WebSocket server, how SIDCO will know WebSocket server to which the Driver is connected? bests
? Lead Engineer, Intelligent Automation | Ex Yahoo
6 年Awesome, thanks for sharing. Comprehensive!!