Architecture of the School Bus Tracking System
Momen Negm
Chief Technology Officer @ T-Vencubator | Data Scientist, Generative AI | Tech entrepreneur - Engineering leader
In this article, we will explore the process of building a vehicle tracking system, with a specific focus on school buses. The design of this system will facilitate the monitoring of school buses across a wide range of educational institutions. Furthermore, we will enhance our system with a notification feature.
This feature will notify parents when a bus is approaching their child's home or when the child is being dropped off. Throughout this article, I will discuss various technical aspects related to the development and implementation of this system.
Understanding the Functional Requirements
Our goal is to create a system that monitors the locations of school buses, catering to multiple schools, each with numerous students. The following are the specific requirements that need to be met for each school within the system:
Non-functional requirements
The business objectives establish a foundation for identifying several key non-functional requirements for the school bus tracking system:
Back-of-the-Envelope Estimation
To gain a clearer understanding of the scale and demands of our proposed school bus tracking system, let's perform some preliminary calculations based on the anticipated usage:
Frequency of Location Update Requests (RPS)
Fleet Size: With a projected 4,000 schools and an estimated 10 buses per school, our system would need to track a total of 40,000 buses.
Update Rate: To maintain accurate real-time tracking, each bus would transmit its location data every 2 seconds.
Calculation of RPS: with 40,000 buses each sending one update every 2 seconds, the system must handle 40,000 / 2 = 20,000 location update requests per second.
This results in a significant influx of data, requiring substantial backend resources to manage the load efficiently.
How Parents Use the App
Although the number of data reads per second is high, modern databases are designed to handle this volume and should not be a significant bottleneck.
Proposed High-Level System Design
The high-level design of our school bus tracking system centers around several key components. Here’s an overview of each:
API Design
We will implement a RESTful API to create a simple yet robust backend capable of managing bus location updates and views.
GET /v1/buses/locations
Response Body
{
  "buses": [
    {
      "bus_id": "e3d54bd9-bbbc-4d8e-a653-a937c7c4a215",
      "driver": {
        "id": "a23e80d5-679d-4041-96bd-20d3954099ff",
        "name": "bus"
      },
      "children": [
        {
          "id": "0d95d921-97ba-48a9-8f5d-73b54c4318f5",
          "name": "children"
        }
      ],
      "location": {
        "latitude": 29.933751883038916,
        "longitude": 31.197616132406225
      }
    }
  ]
}
WebSocket /v1/buses/{bus_id}/location
Message Payload:
{
  "latitude": ...,
  "longitude": ...,
  "timestamp": ...
}
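For illustration, here is a minimal sketch of the bus-side client, assuming the `ws` npm package and a hypothetical gateway URL; the payload shape matches the message above:

const WebSocket = require('ws');

// Hypothetical gateway URL and bus ID, for illustration only
const BUS_ID = 'e3d54bd9-bbbc-4d8e-a653-a937c7c4a215';
const socket = new WebSocket(`wss://api.example.com/v1/buses/${BUS_ID}/location`);

socket.on('open', () => {
  // Send a location update every 2 seconds over the persistent connection
  setInterval(() => {
    socket.send(JSON.stringify({
      latitude: 29.933751,   // sample reading; a real device would poll its GPS module
      longitude: 31.197616,
      timestamp: Date.now()
    }));
  }, 2000);
});

socket.on('error', (err) => console.error('WebSocket error:', err));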
Algorithms To Find Student Houses
We aim to ensure that our database contains a comprehensive list of student house locations. These locations should be organized such that we can notify parents approximately two minutes before the bus arrives at each location.
Geospatial indexing approaches fall into two broad categories, as shown in the following figure:
Evenly Divided Grid
A simple method is to split the globe into a network of small, uniformly sized squares. Each square could contain several houses, and every house on the map would be associated with a particular square.
However, the evenly spaced grid method may not be ideal for our business needs. The main issue is that house distribution varies greatly across different areas. We need a more flexible way to control the size of the grids to ensure timely notifications.
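For illustration, here is a minimal sketch of the evenly divided grid idea: each coordinate is mapped to a fixed-size square, with the cell size being an arbitrary value chosen for this example.

// Map a coordinate to a fixed-size grid cell ID
function gridCellId(latitude, longitude, cellSizeDegrees = 0.01) {
  const row = Math.floor(latitude / cellSizeDegrees);
  const col = Math.floor(longitude / cellSizeDegrees);
  return `${row}:${col}`;
}

console.log(gridCellId(29.933751, 31.197616)); // "2993:3119"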
Geohash
Geohash presents a superior alternative to the evenly divided grid. It functions by converting two-dimensional latitude and longitude data into a one-dimensional string of characters. This is achieved by recursively splitting the world into smaller and smaller grids with each additional bit. Let's briefly outline how Geohash works.
The first step involves dividing the globe into four quadrants using the prime meridian and equator.
Second, divide each grid into four smaller grids. Each grid can be represented by alternating bits of longitude and latitude.
Repeat this subdivision until the grid size is within the precision desired. Geohash usually uses base32 representation. Let’s take a look at two examples.
Geohash of the Google headquarters (length = 6):
01001 10110 01001 10000 11011 11010 (base32 in binary) → 9q9hvu (base32)
Geohash of the Facebook headquarters (length = 6):
01001 10110 01001 10001 10000 10111 (base32 in binary) → 9q9jhr (base32)
Geohash has 12 precision levels, as shown in the table below. The precision determines the size of the grid. We are only interested in Geohashes with lengths of 6 or 7: when the length is greater than 7, the grid is too small, while anything shorter than 6 yields grids too large for our notification approach to work efficiently.
| Geohash Length | Grid Width x Height |
|----------------|----------------------------|
| 1 | 5,009.4 km x 4,992.6 km |
| 2 | 1,252.3 km x 624.1 km |
| 3 | 156.5 km x 156 km |
| 4 | 39.1 km x 19.5 km |
| 5 | 4.9 km x 4.9 km |
| 6 | 1.2 km x 609.4 m |
| 7 | 152.9 m x 152.4 m |
| 8 | 38.2 m x 19 m |
| 9 | 4.8 m x 4.8 m |
| 10 | 1.2 m x 59.5 cm |
| 11 | 14.9 cm x 14.9 cm |
| 12 | 3.7 cm x 1.9 cm |
How do we choose the right precision? We want to find the minimal Geohash length that covers the whole circle drawn by the user-defined radius. The corresponding relationship between the radius and the length of Geohash is shown in the table below.
| Radius (Kilometers) | Geohash Length |
|------------------------|----------------|
| 2 km (1.24 mile) | 5 |
| 1 km (0.62 mile) | 5 |
| 0.5 km (0.31 mile) | 6 |
| 0.086 km (0.05 mile) | 7 |
| 0.015 km (0.009 mile) | 8 |
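As a quick illustration, the mapping in this table can be expressed as a small lookup helper (a sketch; the thresholds come directly from the table, in kilometers):

function geohashLengthForRadius(radiusKm) {
  if (radiusKm <= 0.015) return 8;
  if (radiusKm <= 0.086) return 7;
  if (radiusKm <= 0.5) return 6;
  return 5; // the table maps radii up to roughly 2 km to length 5
}

console.log(geohashLengthForRadius(0.5)); // 6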
For more information about how Geohash works, you can experiment with one of the online Geohash explorers. The snippet below uses the `ngeohash` npm package to compute a Geohash:
const geohash = require('ngeohash');
// Example coordinates for a location
const latitude = 40.6892; // Latitude for Statue of Liberty
const longitude = -74.0445; // Longitude for Statue of Liberty
// Generate the GeoHash
const geohashCode = geohash.encode(latitude, longitude, 7);
console.log("GeoHash code:", geohashCode);
This approach is highly effective for our current setup. There are minor boundary issues: two locations can sit close together yet fall into different Geohash cells on either side of a grid edge. However, these will not impact our existing framework.
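Should the boundary issue ever matter, a common mitigation is to search the target cell together with its eight neighbors; ngeohash exposes a neighbors helper for exactly this:

const geohash = require('ngeohash');

// Query the target cell plus its 8 surrounding cells so a house sitting
// just across a grid edge is still picked up
const cell = geohash.encode(29.933751, 31.197616, 6);
const cellsToQuery = [cell, ...geohash.neighbors(cell)];
console.log(cellsToQuery); // 9 geohash strings covering the area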
Quadtree
A quadtree is a data structure that is commonly used to partition a two-dimensional space by recursively subdividing it into four quadrants (grids) until the contents of the grids meet certain criteria. For example, the criterion can be to keep subdividing until the number of houses in a grid is not more than 100. This number is arbitrary; the actual threshold can be determined by business needs.
With a quadtree, we build an in-memory tree structure to answer queries. Note that quadtree is an in-memory data structure and it is not a database solution. It runs on each server, and the data structure is built at server start-up time.
The following figure visualizes the conceptual process of subdividing the world into a quadtree. Let's assume the world contains 200 million houses (an arbitrary number chosen for illustration).
The root node represents the whole world map. It is recursively broken down into 4 quadrants until no leaf node holds more than 100 houses.
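To make the structure concrete, here is a minimal in-memory quadtree sketch, assuming houses are simple {lat, lng} points and using the 100-houses-per-leaf criterion described above:

class QuadtreeNode {
  constructor(bounds, capacity = 100) {
    this.bounds = bounds;   // { minLat, maxLat, minLng, maxLng }
    this.capacity = capacity;
    this.houses = [];
    this.children = null;   // four child nodes once subdivided
  }

  insert(house) {
    if (this.children) {
      this.childFor(house).insert(house);
      return;
    }
    this.houses.push(house);
    if (this.houses.length > this.capacity) this.subdivide();
  }

  subdivide() {
    const { minLat, maxLat, minLng, maxLng } = this.bounds;
    const midLat = (minLat + maxLat) / 2;
    const midLng = (minLng + maxLng) / 2;
    this.children = [
      new QuadtreeNode({ minLat, maxLat: midLat, minLng, maxLng: midLng }, this.capacity),
      new QuadtreeNode({ minLat, maxLat: midLat, minLng: midLng, maxLng }, this.capacity),
      new QuadtreeNode({ minLat: midLat, maxLat, minLng, maxLng: midLng }, this.capacity),
      new QuadtreeNode({ minLat: midLat, maxLat, minLng: midLng, maxLng }, this.capacity),
    ];
    // Push existing houses down into the appropriate child
    for (const h of this.houses) this.childFor(h).insert(h);
    this.houses = [];
  }

  childFor(house) {
    const midLat = (this.bounds.minLat + this.bounds.maxLat) / 2;
    const midLng = (this.bounds.minLng + this.bounds.maxLng) / 2;
    const i = (house.lat >= midLat ? 2 : 0) + (house.lng >= midLng ? 1 : 0);
    return this.children[i];
  }
}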
How do we find nearby houses with a quadtree?
Real-world quadtree example
Yext provided an image that shows a constructed quadtree near Denver. We want smaller, more granular grids for dense areas and larger grids for sparse areas.
Quadtrees come with operational considerations, however. Adding a new house means updating the in-memory tree, so changes can take time to propagate to the map. We also have no direct control over grid size: subdivision is driven by the number of houses in a grid rather than its physical area, which can be another drawback of using a quadtree.
Google S2
The Google S2 geometry library is another big player in this field. Similar to the quadtree, it is an in-memory solution. It maps a sphere to a 1D index based on the Hilbert curve (a space-filling curve). The Hilbert curve has a very important property: two points that are close to each other in 2D space remain close on the 1D curve, and searching 1D space is much more efficient than searching 2D space. Interested readers can experiment with one of the online Hilbert curve visualizers.
S2 is a complicated library, but it is battle-tested and widely used at companies such as Google and Tinder.
Recommendation
As a concluding recommendation, it is advisable to weigh the merits of GeoHash against Google S2, as both offer functionality suited to our application.
However, it is important to note that GeoHash values can be stored and indexed directly in a database, while S2 operates as an in-memory solution. For simplicity and the scope of this article, we will proceed with the GeoHash solution.
Bus Location Update
Each bus transmits a location update approximately every two seconds. Given the write-heavy nature of location updates, a direct write to the database could be feasible but may not be optimal. An alternative approach involves writing the most recent locations to a Redis Cache while maintaining a historical log of bus locations in the database. This method enhances the scalability of our write operations.
Additionally, it’s important to consider how the Parents app receives these updates. Instead of querying the database repeatedly to refresh the bus’s location, we can leverage Redis Pub/Sub to provide real-time notifications.
Understanding Redis Pub/Sub:
Redis Pub/Sub functions as an efficient, lightweight message bus. Channels, also known as topics, are cost-effective and easy to create. A modern Redis server, equipped with gigabytes of memory, can support millions of channels.
In our design, location updates received via a WebSocket server are published to a specific bus’s channel on the Redis Pub/Sub server. New updates are then instantly republished to all subscribers of that bus, ensuring real-time location updates on the map.
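To make this concrete, here is a hedged sketch of the server-side handler, assuming node-redis v4, a bus:last_location:<bus_id> cache key, and a bus#<bus_id> channel naming convention (the same convention used in the subscription snippet later in this article):

const redis = require('redis');

// One Redis connection for both caching and publishing
const publisher = redis.createClient({ url: 'redis://localhost:6379' });
publisher.on('error', (err) => console.log('Redis Client Error', err));

async function handleLocationUpdate(busId, location) {
  const payload = JSON.stringify(location);
  // Keep only the most recent location in the cache for quick reads
  await publisher.set(`bus:last_location:${busId}`, payload);
  // Fan the update out to every subscriber of this bus's channel
  await publisher.publish(`bus#${busId}`, payload);
}

async function main() {
  await publisher.connect();
  await handleLocationUpdate('uuid1', {
    latitude: 29.933751,
    longitude: 31.197616,
    timestamp: Date.now()
  });
}

main();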
Location Update Delivery
Considering the high frequency of updates from each bus, using a RESTful API to receive these updates would be inefficient. A more effective solution is to use WebSockets, which maintain a persistent connection and allow for continuous data transmission without necessitating a full TCP handshake every few seconds.
Data model
In this section, we discuss the read/write ratio and the schema design for our application.
Read/write ratio
The system exhibits a high write volume, as numerous buses emit updates every two seconds. Consequently, the number of writes substantially exceeds the queries from parents tracking their students’ locations. Given this write-heavy nature, a NoSQL database is preferable. For the purposes of this blog, we will utilize DynamoDB.
DynamoDB as a Database
DynamoDB is a scalable and highly available database designed to manage large volumes of reads and writes. It requires thoughtful schema design but offers robust features that are beneficial for our use case.
We propose the creation of three primary tables:
1. Parent Table
2. Student Table
3. Bus Location History Table
In DynamoDB, a table consists of a Primary Key (PK) and an optional Sort Key (SK). A unique combination of PK and SK (or just PK if SK is not used) is essential to prevent duplicate records. Querying is typically constrained to these keys; for queries using other attributes, a Local Secondary Index (LSI) or Global Secondary Index (GSI) may be employed.
Given our prior discussions, GeoHash will be employed for location queries. It is crucial to define our queries precisely before schema creation to ensure the database is optimized for our operational requirements.
Parent Table
This table denormalizes student IDs to minimize the number of queries needed to retrieve the locations of a parent's children. The schema also allows caching the bus IDs associated with each parent to enhance performance.
Student Table
Includes a GeoHash field that determines the geographic block to which a student belongs; the Latitude and Longitude fields record where the house is located and are used to calculate the estimated distance to arrival. In addition, it's imperative to establish a Global Secondary Index (GSI) with geohash as the partition key (PK) and bus_id as the sort key (SK). This GSI facilitates queries targeting a specific geographical area for a specific bus, and it is used to notify parents about bus arrivals, as we will see in the "Storing Bus Location History" section.
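As an illustration, the following sketch queries that GSI with the AWS SDK v3 Document Client; the table name Students and index name geohash-bus_id-index are assumptions for this example:

const { DynamoDBClient } = require('@aws-sdk/client-dynamodb');
const { DynamoDBDocumentClient, QueryCommand } = require('@aws-sdk/lib-dynamodb');

const ddb = DynamoDBDocumentClient.from(new DynamoDBClient({}));

// Fetch the students in one geohash cell who are assigned to a given bus
async function studentsNearBus(geohashCell, busId) {
  const result = await ddb.send(new QueryCommand({
    TableName: 'Students',
    IndexName: 'geohash-bus_id-index',
    KeyConditionExpression: 'geohash = :g AND bus_id = :b',
    ExpressionAttributeValues: { ':g': geohashCell, ':b': busId },
  }));
  return result.Items;
}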
Bus Location History Table
Separates latitude and longitude attributes to accurately display the bus’s location on the map, catering to the needs of parents.
While it is possible to consolidate all data into a single table — a common practice in DynamoDB for simplification — this blog will illustrate using separate tables to clarify the concepts for readers.
System Design (Bus Flow)
Upon reviewing each component of our system, we are now ready to develop the initial diagram of our application. Here are key points to consider while designing our system architecture:
Bus Location Report
The preceding architecture effectively addressed our initial objectives. Now, let’s shift our focus to the third and fourth points outlined in our design: managing the current location and facilitating parent notifications.
Bus Location Update & Notify Users
When pushing data to the queue, it’s essential to include both the newly received location and the latest location stored in the Redis Cache. Additionally, maintaining an in-memory HashMap can help minimize read calls to Redis, enhancing efficiency.
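Below is a small sketch of that pattern; sendToQueue is a hypothetical stand-in for the actual SQS/Kinesis producer:

// A per-instance Map caches the last known location so we rarely need a
// Redis read before enqueueing
const lastKnownLocation = new Map();

async function enqueueLocationUpdate(busId, newLocation, sendToQueue) {
  const previous = lastKnownLocation.get(busId) || null;
  // Ship both locations so consumers can detect geohash-cell transitions
  // without another lookup
  await sendToQueue({ busId, previous, current: newLocation });
  lastKnownLocation.set(busId, newLocation);
}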
Thus far, we have successfully managed to receive location updates from the bus and store both the current location and live updates in the Redis Cache. These updates are then seamlessly published to the parent app for all users to access.
Storing Bus Location History
Notification Flow Design
This system design ensures real-time update dissemination and effective data management for location-based notifications. Please review the end-to-end architecture diagram carefully to fully grasp the interactions between each component.
This architecture might initially seem complex, so please take your time to thoroughly review and understand each component of the system.
System Design (Parent App Flow)
The architecture of the Parent App is simpler compared to the Bus Flow and can be illustrated in a single diagram.
Upon opening the app, it performs three main actions:
Server-Side Logic For Handling Connection
It’s crucial to understand the behavior of ECS when a new connection is established, as this forms the basis of the Parent App architecture.
When a user connects to the WebSocket Gateway, the gateway invokes a designated handler running on ECS. This handler retrieves the list of student IDs from DynamoDB and subscribes the current ECS instance to their channels.
Below is a sample code snippet demonstrating this process:
const redis = require('redis');

// Create a Redis client dedicated to subscriptions
const subscriber = redis.createClient({
  url: 'redis://localhost:6379'
});

subscriber.on('error', (err) => {
  console.log('Redis Client Error', err);
});

// Subscribe to a channel and log every message it delivers
async function subscribeToChannel(channelName) {
  await subscriber.subscribe(channelName, (message, channel) => {
    console.log(`Received message: ${message} from channel: ${channel}`);
  });
  console.log(`Subscribed to ${channelName}`);
}

async function main() {
  // The client must be connected before any subscription is issued
  await subscriber.connect();

  // Initially subscribe to a default channel
  await subscribeToChannel('initialChannel');

  // Example of subscribing to a new channel dynamically based on some event or condition
  setTimeout(() => {
    // Simulating a new channel subscription trigger after 10 seconds
    subscribeToChannel('bus#uuid1');
  }, 10000);
}

main();
Every new update to subscribed channels is instantly reflected in the ECS instance. Using the AWS SDK, the ECS publishes live updates through the WebSocket Gateway to the mobile application.
To optimize performance, we maintain an in-memory HashMap for subscribed channels for each ECS instance. This allows efficient management of channel subscriptions, ensuring that duplicate subscriptions are discarded. Additionally, we store WebSocket Gateway IDs of connected users along with their corresponding Channel IDs in memory. This approach streamlines the notification process, eliminating the need for frequent database queries.
Moreover, the client app incorporates logic to automatically reconnect to the WebSocket Gateway in case of connection loss, ensuring uninterrupted communication between the server and the mobile application.
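A minimal sketch of such reconnect logic, assuming the standard WebSocket API available in browsers and React Native and a hypothetical gateway URL:

function connectWithRetry(url, retryDelayMs = 3000) {
  const socket = new WebSocket(url);
  socket.onclose = () => {
    // Re-establish the connection whenever it drops
    setTimeout(() => connectWithRetry(url, retryDelayMs), retryDelayMs);
  };
  return socket;
}

const socket = connectWithRetry('wss://api.example.com/parent-app');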
Scalability Deep Dive
Now that we have finalized the design for each application component, let’s explore strategies for independent scalability. It’s crucial to consider that some components are managed by AWS, which provides scalability features out of the box.
WebSocket servers
WebSocket servers can be auto-scaled based on usage, particularly with managed services like API Gateway WebSocket, where AWS handles scaling automatically. ECS and Lambda are responsible for executing specific logic in response to WebSocket messages, and they can be scaled easily. Lambda functions can be scaled by configuring the concurrency settings, while ECS instances can be scaled based on CPU and memory usage.
DynamoDB
DynamoDB is highly scalable and available, offering two scaling options: On-Demand and Provisioned Capacity. While On-Demand scales automatically based on usage, Provisioned Capacity allows for predictable performance and cost. Monitoring is essential to ensure optimal performance, especially given our usage pattern.
Cache & Pub/Sub (Redis)
AWS ElastiCache provides a hosted Redis service with auto-scaling capabilities. ElastiCache Serverless offers automatic workload traffic accommodation by continuously monitoring resource utilization and scaling out as needed, adding new shards, and redistributing data without downtime.
Data Streaming (Kafka/Kinesis)
AWS offers managed versions of both Kafka (Amazon MSK) and Kinesis, and each scales well with the right configuration. Regardless of the chosen platform, AWS handles the operational side of scaling, relieving us of those concerns.
SQS (Simple Queue Service)
AWS offers two types of SQS queues: FIFO and standard (unordered), each with its own quotas and limitations.
FIFO Queues: FIFO queues support a quota of 300 transactions per second per API action (SendMessage, ReceiveMessage, and DeleteMessage). With batching, FIFO queues can handle up to 3,000 messages per second per API action, representing 300 API calls, each containing a batch of 10 messages.
Given our back-of-the-envelope estimation, which indicates an expected write rate of roughly 20,000 requests per second (RPS), utilizing FIFO queues, even with batching, may not be optimal. In such cases, we have several alternatives:
Alternative to Redis pub/sub
Erlang is presented as a viable alternative to Redis pub/sub for routing tasks, particularly praised for its suitability in highly distributed and concurrent applications. This recommendation stems from Erlang’s ecosystem, which includes the Erlang or Elixir programming languages, the BEAM virtual machine, and OTP runtime libraries. The advantage of Erlang lies in its efficient handling of lightweight processes that are inexpensive to create and manage, allowing millions to run concurrently on a single server with minimal resource use.
For practical application, using Erlang would involve replacing the Redis pub/sub setup with a distributed Erlang system where each user is represented by an individual process. These processes could efficiently manage real-time updates, such as location changes, and facilitate a network of updates among a user’s friends, making Erlang a strong candidate for systems requiring scalable, real-time data distribution, assuming the availability of Erlang programming expertise.