Architecture of the School Bus Tracking System

In this article, we will explore the process of building a vehicle tracking system, with a specific focus on school buses. The design of this system will facilitate the monitoring of school buses across a wide range of educational institutions. Furthermore, we will enhance our system with a notification feature.

This feature will notify parents when a bus is approaching their child's home or when the child is being dropped off. Throughout this article, I will discuss various technical aspects related to the development and implementation of this system.

Understanding the Functional Requirements

Our goal is to create a system that monitors the locations of school buses, catering to multiple schools, each with numerous students. The following are the specific requirements that need to be met for each school within the system:

  • Real-time Tracking: The system must frequently update and display the current positions of buses for accurate tracking.
  • Accessible Visualization: A real-time map displaying bus locations should be readily available to both parents and school administrators.
  • Proximity Alerts: Parents should receive automatic notifications when a bus nears their child's home, primarily through push notifications, but the system should be adaptable to other methods like SMS.
  • Data Privacy: Each school's data should be isolated, ensuring they can only access information about their own buses to maintain privacy and system integrity.

Non-Functional Requirements

The business objectives establish a foundation for identifying several key non-functional requirements for the school bus tracking system:

  • Scalability: The system must efficiently manage scaling, accommodating both the increasing number of buses transmitting location updates and the number of parents tracking their students' locations simultaneously.
  • High Availability: The system must maintain high availability to ensure continuous service and real-time tracking without interruptions.
  • Reliability: The system should ensure that no notifications are lost. Every alert, especially those indicating a bus is nearing a student's home, must reach the intended recipients without fail.
  • Security: Due to the sensitive nature of location data, the system must incorporate robust security measures to protect against unauthorized access and potential data breaches.
  • Privacy: Strict privacy controls must be enforced to ensure users can only access the location information of buses they are directly associated with, preventing unauthorized viewing of a vehicle’s location.

Back-of-the-Envelope Estimation

To gain a clearer understanding of the scale and demands of our proposed school bus tracking system, let's perform some preliminary calculations based on the anticipated usage:


Frequency of Location Update Requests (RPS)

Fleet Size: With a projected 4,000 schools and an estimated 10 buses per school, our system would need to track a total of 40,000 buses.

Update Rate: To maintain accurate real-time tracking, each bus would transmit its location data every 2 seconds.

Calculation of RPS:

  • Total requests over a 2-second window: 40,000 (one update per bus)
  • RPS (Requests Per Second): 40,000 requests / 2 seconds = 20,000 RPS


This results in a significant influx of data, requiring substantial backend resources to manage the load efficiently.

How Parents Use the App

  • Total Students: 4,000 schools * 300 students/school = 1,200,000 students
  • Checks Per Minute: Assuming roughly 15% of parents (about 180,000) check every two minutes, 180,000 parents / 2 minutes = 90,000 checks per minute
  • Checks Per Second: 90,000 checks per minute / 60 seconds = 1,500 checks per second (RPS)

Although the number of data reads per second is high, modern databases are designed to handle this volume and should not be a significant bottleneck.

Proposed High-Level System Design

The high-level design of our school bus tracking system centers around several key components. Here’s an overview of each:

  • API Design
  • Algorithms for Locating Student Houses
  • Bus Location Updates
  • Data Model


API Design

We will implement a RESTful API to create a simple yet robust backend capable of managing bus location updates and views.

GET /v1/buses/locations

  • Authentication: Utilize an “X-Authorization” header with a JWT to validate the logged-in user.
  • Function: The endpoint identifies the buses associated with the user’s children and returns the current locations.

Response Body

{
  "buses": [
    {
      "bus_id": "e3d54bd9-bbbc-4d8e-a653-a937c7c4a215",
      "driver": {
        "id": "a23e80d5-679d-4041-96bd-20d3954099ff",
        "name": "bus"
      },
      "children": [
        {
          "id": "0d95d921-97ba-48a9-8f5d-73b54c4318f5",
          "name": "children"
        }
      ],
      "location": {
        "latitude": 29.933751883038916,
        "longitude": 31.197616132406225
      }
    }
  ]
}        
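
To illustrate how a client might call this endpoint, here is a minimal sketch; the base URL and token handling are placeholders for illustration, not part of the actual design.

const jwtToken = '<parent JWT>'; // placeholder token obtained after login

async function fetchBusLocations() {
  // Hypothetical base URL; the real gateway address would come from configuration
  const response = await fetch('https://api.example.com/v1/buses/locations', {
    headers: { 'X-Authorization': jwtToken }
  });
  if (!response.ok) {
    throw new Error(`Failed to fetch bus locations: ${response.status}`);
  }
  const { buses } = await response.json();
  return buses; // each entry carries bus_id, driver, children, and location
}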

WebSocket /v1/buses/{bus_id}/location

  • Function: Each bus sends its current location through the WebSocket connection, allowing for continuous data flow without needing to open new HTTP connections.

Request Parameters:

{
  "latitude": ...,
  "longitude": ...,
  "timestamp": ...
}        
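
As a rough sketch of the bus side of this connection (using the Node.js ws package; the gateway URL and the readGps helper are assumptions for illustration):

const WebSocket = require('ws');

// Placeholder for the on-board GPS reader
function readGps() {
  return { latitude: 29.9337, longitude: 31.1976 };
}

// Hypothetical gateway URL; in practice this comes from the device configuration
const socket = new WebSocket('wss://gateway.example.com/v1/buses/e3d54bd9-bbbc-4d8e-a653-a937c7c4a215/location');

socket.on('open', () => {
  // Reuse the same connection to push a reading every 2 seconds
  setInterval(() => {
    const { latitude, longitude } = readGps();
    socket.send(JSON.stringify({ latitude, longitude, timestamp: Date.now() }));
  }, 2000);
});

socket.on('error', (err) => console.error('WebSocket error', err));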

Algorithms To Find Student Houses

We aim to ensure that our database contains a comprehensive list of student house locations. These locations should be organized such that we can notify parents approximately two minutes before the bus arrives at each location.

Geospatial indexing approaches fall broadly into two families: hash-based (such as an evenly divided grid or Geohash) and tree-based (such as quadtree and Google S2), as shown in the following figure:

Evenly Divided Grid

A simple method is to split the globe into a network of small, uniformly sized squares. Each square could contain several houses, and every house on the map would be associated with a particular square.

However, the evenly spaced grid method may not be ideal for our business needs. The main issue is that house distribution varies greatly across different areas. We need a more flexible way to control the size of the grids to ensure timely notifications.

Geohash

Geohash presents a superior alternative to the evenly divided grid. It works by converting two-dimensional latitude and longitude data into a one-dimensional string of characters. This is achieved by recursively splitting the world into smaller and smaller grids with each additional bit. Let's briefly outline how Geohash works.

The first step involves dividing the globe into four quadrants using the prime meridian and equator.


  • Latitude range [-90, 0] is represented by 0
  • Latitude range [0, 90] is represented by 1
  • Longitude range [-180, 0] is represented by 0
  • Longitude range [0, 180] is represented by 1

Second, divide each grid into four smaller grids. Each grid can be represented by alternating between longitude bit and latitude bit.

Repeat this subdivision until the grid size is within the precision desired. Geohash usually uses base32 representation. Let’s take a look at two examples.

Geohash of the Google headquarters (length = 6)

01001 10110 01001 10000 11011 11010 (base32 in binary) → 9q9hvu (base32)

Geohash of the Facebook headquarters (length = 6):

01001 10110 01001 10001 10000 10111 (base32 in binary) → 9q9jhr (base32)
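
To make the bit-interleaving concrete, here is a minimal encoder sketch that follows the subdivision steps described above (longitude bit first, then latitude, emitting one base32 character per 5 bits). It is meant purely for illustration, not as a replacement for a tested library such as ngeohash, which we use later in this article.

const BASE32 = '0123456789bcdefghjkmnpqrstuvwxyz';

function encodeGeohash(latitude, longitude, length) {
  const latRange = [-90, 90];
  const lonRange = [-180, 180];
  let bits = 0;
  let bitCount = 0;
  let useLongitude = true; // bits alternate: longitude, latitude, longitude, ...
  let hash = '';

  while (hash.length < length) {
    const range = useLongitude ? lonRange : latRange;
    const value = useLongitude ? longitude : latitude;
    const mid = (range[0] + range[1]) / 2;
    if (value >= mid) {
      bits = (bits << 1) | 1; // upper half -> bit 1
      range[0] = mid;
    } else {
      bits = bits << 1;       // lower half -> bit 0
      range[1] = mid;
    }
    useLongitude = !useLongitude;
    if (++bitCount === 5) {
      hash += BASE32[bits];   // every 5 bits become one base32 character
      bits = 0;
      bitCount = 0;
    }
  }
  return hash;
}

console.log(encodeGeohash(37.4220, -122.0841, 6)); // prints 9q9hvu, matching the Google headquarters example above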

Geohash has 12 precision levels, as shown in the table below. The precision determines the size of the grid. We are only interested in Geohashes with lengths of 6 or 7: when the length exceeds 7, the grid becomes too small, while a length below 6 makes the grid too large, and our notification approach would be inefficient in either case.

| Geohash Length | Grid Width x Height        |
|----------------|----------------------------|
| 1              | 5,009.4 km x 4,992.6 km    |
| 2              | 1,252.3 km x 624.1 km      |
| 3              | 156.5 km x 156 km          |
| 4              | 39.1 km x 19.5 km          |
| 5              | 4.9 km x 4.9 km            |
| 6              | 1.2 km x 609.4 m           |
| 7              | 152.9 m x 152.4 m          |
| 8              | 38.2 m x 19 m              |
| 9              | 4.8 m x 4.8 m              |
| 10             | 1.2 m x 59.5 cm            |
| 11             | 14.9 cm x 14.9 cm          |
| 12             | 3.7 cm x 1.9 cm            |        

How do we choose the right precision? We want to find the minimal Geohash length that covers the whole circle drawn by the user-defined radius. The corresponding relationship between the radius and the length of Geohash is shown in the table below.

| Radius (Kilometers)    | Geohash Length  |
|------------------------|-----------------|
| 2 km (1.24 mile)       | 5               |
| 1 km (0.62 mile)       | 5               |
| 0.5 km (0.31 mile)     | 6               |
| 0.086 km (0.05 mile)   | 7               |
| 0.015 km (0.009 mile)  | 8               |        
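
As a small illustrative helper, the mapping above can be expressed as a lookup; the thresholds come directly from the table, and any radius smaller than the last row simply falls back to length 8.

// Minimal geohash length whose grid still covers a circle of the given radius (km)
function geohashLengthForRadius(radiusKm) {
  if (radiusKm >= 1) return 5;       // 1 km to 2 km
  if (radiusKm >= 0.5) return 6;     // 0.5 km
  if (radiusKm >= 0.086) return 7;   // ~86 m
  return 8;                          // ~15 m and below
}

console.log(geohashLengthForRadius(0.5)); // 6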

For more information about how Geohash works, you can use this site.

const geohash = require('ngeohash');

// Example coordinates for a location
const latitude = 40.6892;  // Latitude for Statue of Liberty
const longitude = -74.0445;  // Longitude for Statue of Liberty

// Generate the GeoHash
const geohashCode = geohash.encode(latitude, longitude, 7);
console.log("GeoHash code:", geohashCode);        

This approach is highly effective for our current setup, although there are minor issues related to boundary definitions. However, these will not impact our existing framework.

Quadtree

A quadtree is a data structure that is commonly used to partition a two-dimensional space by recursively subdividing it into four quadrants (grids) until the contents of the grids meet certain criteria. For example, the criterion can be to keep subdividing until the number of businesses in the grid is not more than 100. This number is arbitrary as the actual number can be determined by business needs (houses in each grid).

With a quadtree, we build an in-memory tree structure to answer queries. Note that quadtree is an in-memory data structure and it is not a database solution. It runs on each server, and the data structure is built at server start-up time.

The following figure visualizes the conceptual process of subdividing the world into a quadtree. Let's assume the world contains 200 million houses (an arbitrary number chosen purely for illustration).

The root node represents the whole world map. The root node is recursively broken down into four quadrants until no node is left with more than 100 houses.

How do we find nearby houses with a quadtree?

  1. Build the quadtree in memory.
  2. After the quadtree is built, start searching from the root and traverse the tree until we find the leaf node that contains the search origin. If that leaf node holds enough houses, return them. Otherwise, add houses from its neighboring nodes until enough results are returned. A sketch of the tree-building step follows this list.
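
Below is a rough, in-memory sketch of the tree-building step. The 100-house leaf capacity and the simple lat/lon bounding boxes are assumptions for illustration; a production implementation would also handle houses on grid boundaries and the neighbor search.

class QuadTreeNode {
  constructor(bounds, capacity = 100) {
    this.bounds = bounds;     // { minLat, maxLat, minLon, maxLon }
    this.capacity = capacity; // maximum houses per leaf before subdividing
    this.houses = [];
    this.children = null;     // four child nodes once subdivided
  }

  contains(point) {
    const b = this.bounds;
    return point.lat >= b.minLat && point.lat < b.maxLat &&
           point.lon >= b.minLon && point.lon < b.maxLon;
  }

  insert(house) {
    if (!this.contains(house)) return false;
    if (!this.children && this.houses.length < this.capacity) {
      this.houses.push(house);
      return true;
    }
    if (!this.children) this.subdivide();
    return this.children.some((child) => child.insert(house));
  }

  subdivide() {
    const { minLat, maxLat, minLon, maxLon } = this.bounds;
    const midLat = (minLat + maxLat) / 2;
    const midLon = (minLon + maxLon) / 2;
    this.children = [
      new QuadTreeNode({ minLat, maxLat: midLat, minLon, maxLon: midLon }, this.capacity),
      new QuadTreeNode({ minLat, maxLat: midLat, minLon: midLon, maxLon }, this.capacity),
      new QuadTreeNode({ minLat: midLat, maxLat, minLon, maxLon: midLon }, this.capacity),
      new QuadTreeNode({ minLat: midLat, maxLat, minLon: midLon, maxLon }, this.capacity),
    ];
    // Re-distribute the houses already stored on this node into its children
    this.houses.forEach((house) => this.children.some((child) => child.insert(house)));
    this.houses = [];
  }
}

// The root covers the whole map; a search walks from the root to the leaf containing the origin
const root = new QuadTreeNode({ minLat: -90, maxLat: 90, minLon: -180, maxLon: 180 });
root.insert({ lat: 29.9337, lon: 31.1976, studentId: '0d95d921-97ba-48a9-8f5d-73b54c4318f5' });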

Real-world quadtree example

Yext provided an image that shows a constructed quadtree near Denver. We want smaller, more granular grids for dense areas and larger grids for sparse areas.

Quadtree does come with operational drawbacks. Adding new houses means the in-memory tree has to be rebuilt or incrementally updated, so changes take time to appear on the map. We also have little control over the size of each grid, because grids are split by the number of houses they contain rather than by physical area, which can be another problem when using a quadtree.

Google S2

Google S2 geometry library is another big player in this field. Similar to a quadtree, it is an in-memory solution. It maps a sphere to a 1D index based on the Hilbert curve (a space-filling curve). The Hilbert curve has a very important property: two points that are close to each other on the Hilbert curve are close in 1D space, and searching in 1D space is much more efficient than in 2D. Interested readers can play with an online tool for the Hilbert curve.

S2 is a complicated library, but it is worth considering because it is widely used at companies such as Google and Tinder:

  • S2 is great for geofencing because it can cover arbitrary areas with varying levels. Geofencing allows us to define perimeters that surround the areas of interest and to send notifications to users who are in those areas.

  • Another advantage of S2 is its Region Cover algorithm. Instead of having a fixed level (precision) as in Geohash, we can specify a min level, max level, and max cells in S2. The result returned by S2 is more granular because the cell sizes are flexible. If you want to learn more, take a look at the S2 tool.

Recommendation

As a concluding recommendation, it is advisable to weigh the merits of using GeoHash versus Google S2, as both offer the functionality suited for our application.

However, it is important to note that while Geohash fits naturally into a database (the hash is simply stored and indexed as a column), S2 operates as an in-memory solution. For the purposes of simplicity and the scope of this article, we will proceed with the Geohash solution.

Bus Location Update

Each bus transmits a location update approximately every two seconds. Given the write-heavy nature of location updates, a direct write to the database could be feasible but may not be optimal. An alternative approach involves writing the most recent locations to a Redis Cache while maintaining a historical log of bus locations in the database. This method enhances the scalability of our write operations.

Additionally, it’s important to consider how the Parents app receives these updates. Instead of querying the database repeatedly to refresh the bus’s location, we can leverage Redis Pub/Sub to provide real-time notifications.

Understanding Redis Pub/Sub:

Redis Pub/Sub functions as an efficient, lightweight message bus. Channels, also known as topics, are cost-effective and easy to create. A modern Redis server, equipped with gigabytes of memory, can support millions of channels.

In our design, location updates received via a WebSocket server are published to a specific bus’s channel on the Redis Pub/Sub server. New updates are then instantly republished to all subscribers of that bus, ensuring real-time location updates on the map.
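
A minimal publishing sketch, assuming the bus#<bus_id> channel-naming convention used throughout this design and a local Redis instance:

const redis = require('redis');

const publisher = redis.createClient({ url: 'redis://localhost:6379' });

async function publishLocation(busId, location) {
  if (!publisher.isOpen) await publisher.connect();
  // Every client subscribed to this bus's channel receives the update immediately
  await publisher.publish(`bus#${busId}`, JSON.stringify(location));
}

// Example: called by the WebSocket server when a bus reports a new position
publishLocation('e3d54bd9-bbbc-4d8e-a653-a937c7c4a215', {
  latitude: 29.933751883038916,
  longitude: 31.197616132406225,
  timestamp: Date.now()
}).catch(console.error);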

Location Update Delivery

Considering the high frequency of updates from each bus, using a RESTful API to receive these updates would be inefficient. A more effective solution is to use WebSockets, which maintain a persistent connection and allow for continuous data transmission without necessitating a full TCP handshake every few seconds.

Data model

In this section, we discuss the read/write ratio and the schema design for our application.

Read/write ratio

The system exhibits a high write volume, as numerous buses emit updates every two seconds. Consequently, the number of writes substantially exceeds the queries from parents tracking their students’ locations. Given this write-heavy nature, a NoSQL database is preferable. For the purposes of this blog, we will utilize DynamoDB.

DynamoDB as a Database

DynamoDB is a scalable and highly available database designed to manage large volumes of reads and writes. It requires thoughtful schema design but offers robust features that are beneficial for our use case.

We propose the creation of three primary tables:

  1. Parent Table
  2. Students Table
  3. Bus Location History Table

In DynamoDB, a table consists of a Primary Key (PK) and an optional Sort Key (SK). A unique combination of PK and SK (or just PK if SK is not used) is essential to prevent duplicate records. Querying is typically constrained to these keys; for queries using other attributes, a Local Secondary Index (LSI) or Global Secondary Index (GSI) may be employed.

Given our prior discussions, GeoHash will be employed for location queries. It is crucial to define our queries precisely before schema creation to ensure the database is optimized for our operational requirements.

Parent Table

This table stores, for each parent, the bus IDs associated with their children, so retrieving the locations of a parent's children requires a single lookup rather than several queries. Caching the bus IDs against each parent in this way improves performance.

  • Key: parent_id
  • Value: bus_ids

Student Table

The Student table includes a GeoHash field that identifies the geographic block to which a student's house belongs, plus Latitude and Longitude fields that record where the house is located and are used to calculate the estimated time of arrival. In addition, it is important to create a Global Secondary Index (GSI) with geohash as the partition key (PK) and bus_id as the sort key (SK). This GSI supports queries that target a specific geographical area for a specific bus and is used to notify parents about bus arrival, as we will see in the "Storing Bus Location History" section.
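
For illustration, a query against this GSI might look like the sketch below. The table name Students and the index name geohash-bus_id-index are assumptions; any naming works as long as the keys match.

const { DynamoDBClient } = require('@aws-sdk/client-dynamodb');
const { DynamoDBDocumentClient, QueryCommand } = require('@aws-sdk/lib-dynamodb');

const documentClient = DynamoDBDocumentClient.from(new DynamoDBClient({}));

// Find every student whose house falls inside a geohash cell and who rides a given bus
async function findStudentsInGeohash(geohashCell, busId) {
  const result = await documentClient.send(new QueryCommand({
    TableName: 'Students',
    IndexName: 'geohash-bus_id-index',
    KeyConditionExpression: 'geohash = :g AND bus_id = :b',
    ExpressionAttributeValues: { ':g': geohashCell, ':b': busId }
  }));
  return result.Items;
}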

Bus Location History Table

This table stores separate latitude and longitude attributes so that the bus's location can be displayed accurately on the map for parents.

While it is possible to consolidate all data into a single table — a common practice in DynamoDB for simplification — this blog will illustrate using separate tables to clarify the concepts for readers.

System Design (Bus Flow)

Upon reviewing each component of our system, we are now ready to develop the initial diagram of our application. Here are key points to consider while designing our system architecture:

  1. Bus Notification: Buses will send location updates every two seconds using a WebSocket connection to our server.
  2. Data Handling: Our server will push these data updates to a durable storage platform (not directly to a database) for asynchronous processing.
  3. Location Caching: The most recent bus location is cached using Redis.
  4. Real-time Publishing: The updated location is then published to parents subscribed to the bus channel.
  5. Database Updates: Finally, the location data is saved in the database.
  6. Geo-Location Awareness: When a bus enters a new GeoHash location, all parents residing in that area are notified.

Bus Location Report


  1. WebSocket Gateway: We use an AWS-managed WebSocket gateway (Amazon API Gateway) so each bus can establish a WebSocket connection and send location updates every two seconds.
  2. API Gateway and Lambda: The API Gateway triggers a Lambda function after every message, which forwards the data to a data stream platform such as Kafka or Kinesis.
  3. Data Durability: The data remains in the stream until it is processed.

The preceding architecture effectively addresses our initial objectives. Now, let's shift our focus to the third and fourth points outlined in our design: managing the current location and facilitating parent notifications.

Bus Location Update & Notify Users


  • Data Stream to Redis: Data from Kafka/Kinesis is processed by a worker, which then saves it to a Redis instance. This process involves setting a TTL (Time to Live) for the data.
  • Redis Pub/Sub: Upon successful caching, the new location is published on a Redis Pub/Sub channel dedicated to that specific bus (e.g., bus#uuid). This enables real-time notification to all connected clients.
  • Queue for Asynchronous Processing: After publishing, the data is queued into SQS for asynchronous processing and database storage.

When pushing data to the queue, it’s essential to include both the newly received location and the latest location stored in the Redis Cache. Additionally, maintaining an in-memory HashMap can help minimize read calls to Redis, enhancing efficiency.
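
Putting these steps together, a stream-consumer worker could look roughly like the sketch below. The cache key names, the 60-second TTL, and the queue URL are assumptions for illustration.

const redis = require('redis');
const { SQSClient, SendMessageCommand } = require('@aws-sdk/client-sqs');

const cache = redis.createClient({ url: 'redis://localhost:6379' });
const sqs = new SQSClient({});
const lastKnown = new Map(); // in-memory copy of the latest location per bus to avoid extra Redis reads

async function handleLocationRecord(record) {
  if (!cache.isOpen) await cache.connect();
  const { busId, latitude, longitude, timestamp } = record;

  const previous = lastKnown.get(busId) || null;
  const current = { latitude, longitude, timestamp };

  // 1. Cache the latest location with a TTL so stale entries expire automatically
  await cache.set(`bus:location:${busId}`, JSON.stringify(current), { EX: 60 });
  lastKnown.set(busId, current);

  // 2. Publish on the bus channel so connected parents see the update in real time
  await cache.publish(`bus#${busId}`, JSON.stringify(current));

  // 3. Queue both the new and the previous locations for asynchronous persistence
  await sqs.send(new SendMessageCommand({
    QueueUrl: process.env.LOCATION_QUEUE_URL,
    MessageBody: JSON.stringify({ busId, current, previous })
  }));
}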

Thus far, we have successfully managed to receive location updates from the bus and store both the current location and live updates in the Redis Cache. These updates are then seamlessly published to the parent app for all users to access.

Storing Bus Location History

  • Data Processing from SQS: Data pulled from the queue via a Lambda worker is saved in the database with the bus ID and its latitude-longitude coordinates.
  • Conditional Notifications: To optimize notification efficiency, a check is performed to determine whether the GeoHash of the last observed location has changed. If so, parents associated with the new GeoHash cell are identified, and notifications are dispatched to SQS, as sketched below.
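
A sketch of that check, reusing the ngeohash library from earlier and the GSI query from the Student Table section; the precision of 7 and the queue URL are assumptions.

const geohash = require('ngeohash');
const { SQSClient, SendMessageCommand } = require('@aws-sdk/client-sqs');

const sqs = new SQSClient({});

async function notifyIfCellChanged(busId, current, previous) {
  const currentCell = geohash.encode(current.latitude, current.longitude, 7);
  const previousCell = previous
    ? geohash.encode(previous.latitude, previous.longitude, 7)
    : null;

  // Only fan out notifications when the bus crosses into a new geohash cell
  if (currentCell === previousCell) return;

  // findStudentsInGeohash is the GSI query sketched in the Student Table section
  const students = await findStudentsInGeohash(currentCell, busId);
  for (const student of students) {
    await sqs.send(new SendMessageCommand({
      QueueUrl: process.env.NOTIFICATION_QUEUE_URL,
      MessageBody: JSON.stringify({ studentId: student.id, busId, cell: currentCell })
    }));
  }
}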

Notification Flow Design

  • Data Retrieval: When a notification is needed, the system first attempts to retrieve customer information from the cache. If unavailable, it accesses the database.
  • SNS for Fan-Out: SNS is utilized to distribute notifications based on user preferences.
  • End Processing: Messages are queued in another SQS and processed by Lambda functions to deliver the final notification to the end user.

This system design ensures real-time update dissemination and effective data management for location-based notifications. Please review the end-to-end architecture diagram carefully to fully grasp the interactions between each component.


This architecture might initially seem complex, so please take your time to thoroughly review and understand each component of the system.

System Design (Parent App Flow)

The architecture of the Parent App is simpler compared to the Bus Flow and can be illustrated in a single diagram.

Upon opening the app, it performs three main actions:

  1. Initial WebSocket Connection: The app establishes an initial connection with the WebSocket Gateway.
  2. Query Current Bus IDs: It queries the current bus IDs from the Redis Cache to fetch the latest location information and display the buses on the map.
  3. Listen to Bus Channels: Using the retrieved bus IDs, the app subscribes to their respective channels on the ECS side. This ensures that any new location updates are received through Redis Pub/Sub and delivered via the WebSocket Gateway to the user's phone. A simplified client-side sketch follows below.
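
The following client-side sketch walks through these three steps. The gateway URL, the subscribe message shape, and the showBusOnMap helper are placeholders for illustration, not part of the actual protocol.

const jwtToken = '<parent JWT>'; // placeholder; obtained after login

function showBusOnMap(busId, location) {
  // Placeholder for the map UI update
  console.log(`Bus ${busId} is now at`, location);
}

// 1. Establish the initial WebSocket connection
const socket = new WebSocket('wss://gateway.example.com/parents');

socket.onopen = async () => {
  // 2. Fetch the buses and their latest cached locations (see the GET /v1/buses/locations example)
  const response = await fetch('https://api.example.com/v1/buses/locations', {
    headers: { 'X-Authorization': jwtToken }
  });
  const { buses } = await response.json();
  buses.forEach((bus) => showBusOnMap(bus.bus_id, bus.location));

  // 3. Ask the backend to subscribe this connection to each bus channel
  socket.send(JSON.stringify({ action: 'subscribe', busIds: buses.map((bus) => bus.bus_id) }));
};

// Live updates forwarded from Redis Pub/Sub arrive on the same socket
socket.onmessage = (event) => {
  const { busId, location } = JSON.parse(event.data);
  showBusOnMap(busId, location);
};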

Server-Side Logic For Handling Connection

It’s crucial to understand the behavior of ECS when a new connection is established, as this forms the basis of the Parent App architecture.

When a user connects to the WebSocket Gateway, the gateway notifies our ECS service and invokes a designated function. This function retrieves the list of the parent's student IDs from DynamoDB and allows the current ECS instance to listen to the corresponding bus channels.

Below is a sample code snippet demonstrating this process:

const redis = require('redis');

// Create a dedicated Redis client for subscriptions
const subscriber = redis.createClient({
    url: 'redis://localhost:6379'
});

subscriber.on('error', (err) => {
    console.log('Redis Client Error', err);
});

// Subscribe to a channel and log every message it delivers
async function subscribeToChannel(channelName) {
    await subscriber.subscribe(channelName, (message, channel) => {
        console.log(`Received message: ${message} from channel: ${channel}`);
    });
    console.log(`Subscribed to ${channelName}`);
}

(async () => {
    // Connect before issuing any subscribe commands
    await subscriber.connect();

    // Initially subscribe to a default channel
    await subscribeToChannel('initialChannel');

    // Example of subscribing to a new channel dynamically based on some event or condition
    setTimeout(() => {
        // Simulating a new channel subscription trigger after 10 seconds
        subscribeToChannel('bus#uuid1');
    }, 10000);
})();

Every new update to subscribed channels is instantly reflected in the ECS instance. Using the AWS SDK, the ECS publishes live updates through the WebSocket Gateway to the mobile application.

To optimize performance, we maintain an in-memory HashMap for subscribed channels for each ECS instance. This allows efficient management of channel subscriptions, ensuring that duplicate subscriptions are discarded. Additionally, we store WebSocket Gateway IDs of connected users along with their corresponding Channel IDs in memory. This approach streamlines the notification process, eliminating the need for frequent database queries.
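
A rough sketch of that bookkeeping and of pushing updates back through the WebSocket Gateway follows. The endpoint variable and message shape are assumptions, and subscribeToChannel is the function from the previous snippet.

const { ApiGatewayManagementApiClient, PostToConnectionCommand } = require('@aws-sdk/client-apigatewaymanagementapi');

const gateway = new ApiGatewayManagementApiClient({ endpoint: process.env.WS_GATEWAY_ENDPOINT });

// channel name -> set of WebSocket Gateway connection IDs listening to it
const channelConnections = new Map();

function registerListener(channel, connectionId) {
  if (!channelConnections.has(channel)) {
    channelConnections.set(channel, new Set());
    subscribeToChannel(channel); // subscribe to Redis only once per channel (see the snippet above)
  }
  channelConnections.get(channel).add(connectionId); // duplicate connection IDs are discarded by the Set
}

async function forwardUpdate(channel, message) {
  const connections = channelConnections.get(channel) || new Set();
  for (const connectionId of connections) {
    await gateway.send(new PostToConnectionCommand({
      ConnectionId: connectionId,
      Data: Buffer.from(message)
    }));
  }
}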

Moreover, the client app incorporates logic to automatically reconnect to the WebSocket Gateway in case of connection loss, ensuring uninterrupted communication between the server and the mobile application.

Scalability Deep Dive

Now that we have finalized the design for each application component, let’s explore strategies for independent scalability. It’s crucial to consider that some components are managed by AWS, which provides scalability features out of the box.

WebSocket servers

WebSocket servers can be auto-scaled based on usage, particularly with managed services like API Gateway WebSocket, where AWS handles scaling automatically. ECS and Lambda are responsible for executing specific logic in response to WebSocket messages, and they can be scaled easily. Lambda functions can be scaled by configuring the concurrency settings, while ECS instances can be scaled based on CPU and memory usage.

DynamoDB

DynamoDB is highly scalable and available, offering two scaling options: On-Demand and Provisioned Capacity. While On-Demand scales automatically based on usage, Provisioned Capacity allows for predictable performance and cost. Monitoring is essential to ensure optimal performance, especially given our usage pattern.

Cache & Pub/Sub (Redis)

AWS ElastiCache provides a hosted Redis service with auto-scaling capabilities. ElastiCache Serverless offers automatic workload traffic accommodation by continuously monitoring resource utilization and scaling out as needed, adding new shards, and redistributing data without downtime.

Data Streaming (Kafka/Kinesis)

Both Kafka (via Amazon MSK) and Kinesis are available as managed AWS services and can scale with the right configuration. Whichever platform we choose, AWS handles the heavy lifting of scaling, relieving us of most operational concerns.

SQS (Simple Queue Service)

AWS offers two types of SQS queues: FIFO and Standard (unordered) queues, each with its own quotas and limitations.

FIFO Queues: FIFO queues support a quota of 300 transactions per second per API action (SendMessage, ReceiveMessage, and DeleteMessage). With batching, FIFO queues can handle up to 3,000 messages per second per API action, representing 300 API calls, each containing a batch of 10 messages.

Given our back-of-the-envelope estimation, which indicates an expected write rate of roughly 20,000 requests per second (RPS), utilizing FIFO queues, even with batching, may not be optimal. In such cases, we have several alternatives:

  1. Standard Queue: Utilize a Standard (unordered) queue and manage potential duplication in the application logic.
  2. Data Streaming Platform: Replace SQS with a data streaming platform such as Kafka or Kinesis for the location pipeline, and keep queues only for the notification process with SNS.
  3. Over-Provisioning Queues: Over-provision queues and implement consistent hashing logic to distribute the load among them, as sketched below. This approach can help handle higher write rates efficiently.
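
As a very rough illustration of the third option, the sketch below distributes buses across a fixed set of queues with a simple hash; a real consistent-hashing ring would additionally minimize re-mapping when queues are added or removed. The queue URLs are placeholders.

const crypto = require('crypto');

// Pick one of N over-provisioned SQS queues for a bus so its updates stay on a single queue
function pickQueue(busId, queueUrls) {
  const digest = crypto.createHash('md5').update(busId).digest();
  const index = digest.readUInt32BE(0) % queueUrls.length;
  return queueUrls[index];
}

const queues = ['<queue-url-1>', '<queue-url-2>', '<queue-url-3>'];
console.log(pickQueue('e3d54bd9-bbbc-4d8e-a653-a937c7c4a215', queues));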

Alternative to Redis pub/sub

Erlang is presented as a viable alternative to Redis pub/sub for routing tasks, particularly praised for its suitability in highly distributed and concurrent applications. This recommendation stems from Erlang’s ecosystem, which includes the Erlang or Elixir programming languages, the BEAM virtual machine, and OTP runtime libraries. The advantage of Erlang lies in its efficient handling of lightweight processes that are inexpensive to create and manage, allowing millions to run concurrently on a single server with minimal resource use.

For practical application, using Erlang would involve replacing the Redis pub/sub setup with a distributed Erlang system where each user is represented by an individual process. These processes could efficiently manage real-time updates, such as location changes, and facilitate a network of updates among a user’s friends, making Erlang a strong candidate for systems requiring scalable, real-time data distribution, assuming the availability of Erlang programming expertise.
