Alert Today – Alive Tomorrow: Predicting Road Safety
Sebastiano D.
Analytics Product Manager at Vincere, an Access Group Company | Data Analytics, Data Science, and Machine Learning expert
I was not predicting the future, I was trying to prevent it. - Ray Bradbury
1. The project
“Tomorrow evening I have to go visit my brother's family. He lives in another neighborhood, on the other side of town. Shall I drive there or take the metro? Driving is certainly more convenient; will I run into an accident if I take the car? Shall I move the visit to another day?”
This is a typical everyday situation. To drive or not to drive? Take the car or find another way? Answering this question is the goal of this project: to provide people with a predictive tool able to warn them about the likelihood of car accidents and their severity, allowing them to make more informed decisions about their future travel plans and choice of transportation.
1.1 Scope
In this project, I predicted the risk level of getting into a car accident, and the severity of a potential collision, in the city of Seattle, based on historical data collected since 2004 by the Seattle Police Department (SPD) and the Seattle Department of Transportation (SDOT).
1.2 Target market
- Anyone with a driving license and a means of transport ("the drivers"), interested in knowing the probability of getting into a car accident and how severe it would be, given factors such as weather, time of the year, location, light conditions..., so that they can drive more carefully or even change their travel plans, if possible;
- Public authorities such as the police, departments of transportation, road authorities, healthcare providers..., so that they can be better prepared to deal with such events, knowing the risk level and the scale of the problem.
1.3 Classification Models
In order to deliver a product able to satisfy all relevant stakeholders' needs, I used an analytic, machine-learning-driven approach to build two predictive classification models, as follows:
- Model 1 - Classification of Risk. Is the risk of getting into a car accident higher than usual? The model will return a binary-class prediction, as follows:
- Low risk: input feature data match conditions whose “number of collisions” (or frequency distribution) is less than the average number of collisions.
- High risk: input feature data match conditions whose "number of collisions" (or frequency distribution) is greater than or equal to the average number of collisions.
- Model 2 - Classification of Collision Severity. How severe would the potential collision be? The model will return a binary-class prediction, as follows:
- Not severe (property damage only).
- Severe (injuries and/or fatalities).
2. Data
2.1 Data Requirements
In order to build a predictive model able to forecast the risk and severity of vehicle collisions, I needed historical data, as comprehensive as possible, on road accidents in the region/city of interest, in this case Seattle. For better predictive results, the dataset should include attributes such as:
- Time of the year
- Day of the week
- Time of the day
- Incident location
- Weather condition
- Light condition
- Road condition
- The collision severity classification
2.2. Data Source
For the scope of this project, I used a dataset containing information about all types of collisions that occurred in the city of Seattle, collected by the Seattle Police Department (SPD) and recorded by Traffic Records.
2.3. Data Understanding
The dataset was built with data on all types of collisions from January 1st, 2004, to May 20th, 2020, and contains 194,673 records of collisions with 37 attributes (severity label column not included). The data is a mix of types, including integers, floats, and text. We are dealing with quite an imbalanced dataset, with 136,485 collisions (70% of the total) belonging to Severity Class 1 (Property Damage Only).
3. Methodology
In order to build 2 different classification models (Classification of Risk and Classification of Severity), I had to build 2 different datasets, with different feature sets. The following steps were taken to prepare the needed data/feature sets and develop efficient classification models:
- Data Cleaning: searching for missing values, duplicates, wrong entries…
- Data preparation: feature selection, and data pre-processing. Features will be selected/discarded according to criteria such as: availability at the time of prediction, relevancy, redundancy…
- Data Processing: feature/data engineering, data normalization…
- Model Development: different classification algorithms will be trained, tested, and compared.
- Model Testing: to evaluate the models' performance, I used a mix of metrics such as Accuracy Score, Confusion Matrix, and F1-score.
3.1. Data Cleaning
I performed some Exploratory Data Analysis, checking for duplicates, anomalies, missing data, attribute relevancy…
The database had a small number of missing values (5,334; 2.74% of the whole dataset) in the X and Y columns (the geo-coordinates, longitude and latitude, of the collisions). Since the number of missing values is small, I decided to drop these entries. Some anomalous data, such as incidents with 0 people involved and incidents with 0 vehicles involved (each less than 3% of the whole dataset), were discarded. Duplicates with the same REPORTNO key were dropped. Finally, I dropped the attributes which were either not relevant to the scope of the project, redundant, or not available at the time of attempting a prediction.
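As a minimal sketch of this cleaning step in Python (pandas); the X, Y, and REPORTNO column names come from the dataset itself, while the file name and the people/vehicle count column names (PERSONCOUNT, VEHCOUNT) are assumptions:

import pandas as pd

# Load the raw collision data (file name is illustrative)
df = pd.read_csv("Data-Collisions.csv", low_memory=False)

# Drop records with missing geo-coordinates (X = longitude, Y = latitude)
df = df.dropna(subset=["X", "Y"])

# Drop duplicate reports sharing the same REPORTNO key
df = df.drop_duplicates(subset="REPORTNO")

# Discard anomalous records with 0 people or 0 vehicles involved
# (the PERSONCOUNT and VEHCOUNT column names are an assumption)
df = df[(df["PERSONCOUNT"] > 0) & (df["VEHCOUNT"] > 0)]

print(df.shape)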
3.2. Data Preparation.
After the cleaning, the dataset contained 178,965 records of collisions, with 7 attributes (severity label column not included), a mix of data types (integers, floats, and text), and the same severity class imbalance (70% vs 30%). I grouped the attributes, for further analysis/processing, into 3 clusters:
- Geo-Information/Location attributes: longitude and latitude of the incident location.
- Date and Time attributes: the date of the incident, and the combined date and time of the incident.
- Environmental attributes: weather, condition of the road during the collision, light conditions.
There was a strong redundancy between the "WEATHER" and "ROADCOND" attributes, so I decided to keep only the "WEATHER" attribute, both to avoid redundancy and because predicting what the road conditions might be during, for example, an "overcast" day is simply not feasible.
I then applied a few changes, as follows:
- Categorized the missing values in the attributes WEATHER and LIGHTCOND as "Unknown".
- In the "LIGHTCOND" attribute, I grouped the "Dark - No Street Lights" and "Dark - Street Lights Off" categories.
- Date and Time attributes. After converting the object date/time attributes to datetime objects, I created "month", "day_of_week", "hour_of_day", "day_period", and "weekend" columns.
For the "day_period" attribute, I grouped the 24 hours into 4 categories:
- 1: night -> from midnight to 6 am
- 2: morning -> from 6 am to 12 pm
- 3: afternoon -> from 12 pm to 6 pm
- 4: evening -> from 6 pm to midnight
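A minimal sketch of this date/time processing in pandas, assuming the raw timestamp lives in a column named INCDTTM (not stated above):

import pandas as pd

# Convert the raw date/time strings to datetime objects
df["INCDTTM"] = pd.to_datetime(df["INCDTTM"])

# Derive the date/time features
df["month"] = df["INCDTTM"].dt.month
df["day_of_week"] = df["INCDTTM"].dt.dayofweek   # 0 = Monday ... 6 = Sunday
df["hour_of_day"] = df["INCDTTM"].dt.hour
df["weekend"] = (df["day_of_week"] >= 5).astype(int)   # 1 if Saturday or Sunday

# Bin the 24 hours into the 4 day periods listed above
df["day_period"] = pd.cut(
    df["hour_of_day"],
    bins=[-1, 5, 11, 17, 23],
    labels=[1, 2, 3, 4],   # night, morning, afternoon, evening
).astype(int)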
3.2.A Model 1 - Classification of Risk. Data Processing and Feature Selection
Geo-Information/Location attributes. In order to develop Model 1 - Classification of Risk, I had to build a new dataset from the SDOT historical data. To prepare this dataset, I assigned the collisions' geo-coordinates (latitude and longitude) to different clusters, using the partition-based clustering algorithm K-Means. Since K-Means requires pre-setting the number of clusters (k), I ran the algorithm with different k values (from 2 to 50), measured the mean Euclidean distance between data points and centroids for each run, and then compared the mean distance for each k.
Following the "elbow rule", I set the number of clusters to 5, and assigned each collision to its geo-cluster ("neighborhood").
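A sketch of this clustering step with scikit-learn; the random seed and the use of KMeans.transform to obtain point-to-centroid distances are my own choices:

from sklearn.cluster import KMeans

coords = df[["X", "Y"]].to_numpy()

# Mean distance from each point to its nearest centroid, for k = 2 ... 50
mean_distances = {}
for k in range(2, 51):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(coords)
    mean_distances[k] = km.transform(coords).min(axis=1).mean()

# After inspecting the elbow in mean_distances, fit the final model with k = 5
km_final = KMeans(n_clusters=5, n_init=10, random_state=0).fit(coords)
df["neighborhood"] = km_final.labels_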
Date and Time attributes. As the plots below show, the number of collisions was actually quite evenly distributed over the 12 months and the 7 days of the week.
The distribution of collisions over the 24 hours was more interesting for the project scope. The midnight hour experiences a much higher number of collisions than all other hours of the day, whilst the other "night hours" have a much lower frequency.
Environmental attributes. After checking the frequency distribution of the "WEATHER" and “LIGHTCOND” attributes, I grouped some of the categories together, in order to reduce the noise.
Feature selection. The following are the selected features for "Model 1 - Classification of Risk": Neighborhood, hour_of_day, Light Conditions, Weather. To build the dataset, I applied the "groupby" method on the selected attributes, computed the frequency (= number of collisions), and assigned the Risk Classes as defined in the scope of the project (a sketch follows the class definitions below):
- Class 1: Lower risk (less than the average number of collisions).
- Class 2: Higher risk (greater than or equal to the average number of collisions).
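A sketch of how such a dataset can be built with pandas, assuming the cleaned dataframe already carries the "neighborhood", "hour_of_day", "LIGHTCOND", and "WEATHER" columns:

import numpy as np

features = ["neighborhood", "hour_of_day", "LIGHTCOND", "WEATHER"]

# Frequency (number of collisions) per combination of feature values
risk_df = df.groupby(features).size().reset_index(name="n_collisions")

# Class 1 = lower risk, Class 2 = higher risk, relative to the mean frequency
avg = risk_df["n_collisions"].mean()
risk_df["risk_class"] = np.where(risk_df["n_collisions"] >= avg, 2, 1)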
3.2.B Model 2 - Classification of Collision Severity. Data Processing and Feature Selection
Date and Time attributes. I used the ANOVA (Analysis of Variance) technique to calculate F-test scores for the different date/time attributes, using the SEVERITYCODE classification as the target variable, and compared them to check which attribute had the highest score, that is, the highest variation between category means relative to the variation within the category groups. Using ANOVA with a categorical (rather than continuous) target variable may be unorthodox, but the purpose here was not to find a correlation coefficient between variables. Rather, it was to compare the variation across each attribute's categories to check whether a particular attribute had a higher impact on the classification (a higher distance between category means, combined with a lower variance within categories, indicates a categorization better able to explain the target variable, because it is more impactful in establishing the target classes), and to verify whether redundant/irrelevant attributes could be discarded.
F-test score per attribute:
- month: 9.09
- day_of_week: 13.64
- hour_of_day: 31.84
- day_period: 103.00
- weekend: 56.24
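A sketch of how these scores can be computed with SciPy's one-way ANOVA, grouping the SEVERITYCODE values by each attribute's categories:

from scipy.stats import f_oneway

attributes = ["month", "day_of_week", "hour_of_day", "day_period", "weekend"]

f_scores = {}
for attr in attributes:
    # One group of SEVERITYCODE values per category of the attribute
    groups = [g["SEVERITYCODE"].to_numpy() for _, g in df.groupby(attr)]
    f_stat, p_value = f_oneway(*groups)
    f_scores[attr] = f_stat

print(f_scores)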
The F-test scores showed that the categorization of the daily 24 hours into 4 day periods ("1: night", "2: morning", "3: afternoon", "4: evening") had, by far, the biggest impact on the Collision-Severity classification. The "weekend" attribute was the second most impactful of the 5, and the "month" attribute was the least impactful.
Environmental attributes. After verifying that "WEATHER" and "LIGHTCOND" were relevant attributes for the Collision-Severity classification, I performed the Analysis of Variance (ANOVA) to check which of the two categorizations of each attribute ("before grouping" vs "after grouping"; see the 1st Model) was more impactful for predicting the severity classes. The new categorization ("after grouping") had greater F-test scores for both attributes (WEATHER, old categories: 463.68, new categories: 1142.97; LIGHTCOND, old categories: 637.19, new categories: 1465.72).
Feature selection. The following are the selected features for "Model 2 - Classification of Collision Severity": x (longitude), y (latitude), weather, light_conditions, day_period, weekend.
3.3 Data Processing
Both feature sets (Model 1 and Model 2) were processed by turning categorical values into numeric values (the get_dummies method), converting the dataframes into NumPy arrays, and standardizing the data with StandardScaler.
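A minimal sketch of this processing with pandas and scikit-learn; feature_df and target are placeholders for either model's feature set and label column:

import pandas as pd
from sklearn.preprocessing import StandardScaler

# One-hot encode the categorical attributes and standardize the result
X = pd.get_dummies(feature_df, dtype=float).to_numpy()
y = target.to_numpy()
X = StandardScaler().fit_transform(X)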
3.4. Model Development and Testing
3.4.A Model 1 - Classification of Risk
For this model, I used the K-Nearest-Neighbors (KNN) algorithm. After splitting the dataset into train (75%) and test (25%) sets, I tested the algorithm with several values of K (the number of neighbors) to find the best K value. It turned out that K = 8 gave the highest accuracy score.
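A sketch of this search with scikit-learn; the range of K values and the random seed are assumptions:

from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=4)

# Try several values of K and keep the one with the highest accuracy
scores = {}
for k in range(1, 16):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    scores[k] = accuracy_score(y_test, knn.predict(X_test))

best_k = max(scores, key=scores.get)   # K = 8 in the run described above
print(best_k, scores[best_k])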
Confusion Matrix analysis and Classification Report:
The model worked very well, with a very high accuracy score, and a high F1-score (both 96%), despite the fact that the dataset is quite imbalanced.
3.4.B. Model 2 - Classification of Collision Severity
For the second model, I used the Decision Tree, Logistic Regression, and K-Nearest-Neighbors (KNN) algorithms. Considering the size of the dataset (178,965 rows), I decided not to use algorithms that are best suited to small datasets, such as Support Vector Machines. After splitting the dataset into train (75%) and test (25%) sets, I tested the different algorithms (for KNN, I tried several values of K, from 1 to 20, and found that K = 7 gave the greatest F1-score) and recorded each algorithm's performance in the Report Table:
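As a minimal sketch of how such a comparison can be set up with scikit-learn (hyperparameters such as the tree depth are illustrative, not the author's exact settings):

from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report

models = {
    "Decision Tree": DecisionTreeClassifier(max_depth=6),   # depth is illustrative
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "KNN (k=7)": KNeighborsClassifier(n_neighbors=7),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    print(name)
    print(classification_report(y_test, model.predict(X_test)))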
Logistic Regression performed very poorly on the minority class prediction; therefore, despite having the highest accuracy, it was definitely the algorithm that performed worst. The Decision Tree and KNN algorithms had very similar F1-scores, but KNN had a slightly greater accuracy, although the Decision Tree seemed to perform better at predicting cases belonging to the minority class. However, the minority class had quite low precision, recall, and F1-score values for all 3 algorithms. I then decided to try some resampling techniques (see the sketch after the list):
- Oversample minority class
- Undersample majority class
- Generate synthetic samples (SMOTE)
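A sketch of these resampling approaches using the imbalanced-learn library; resampling is applied to the training set only, and KNN with K = 7 is used here as the downstream classifier:

from imblearn.over_sampling import RandomOverSampler, SMOTE
from imblearn.under_sampling import RandomUnderSampler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report

samplers = {
    "oversample minority": RandomOverSampler(random_state=0),
    "undersample majority": RandomUnderSampler(random_state=0),
    "SMOTE": SMOTE(random_state=0),
}

for name, sampler in samplers.items():
    X_res, y_res = sampler.fit_resample(X_train, y_train)
    model = KNeighborsClassifier(n_neighbors=7).fit(X_res, y_res)
    print(name)
    print(classification_report(y_test, model.predict(X_test)))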
Report Table 2:
All resampled datasets performed worse than the original one, for each algorithm.
4. Results - Cross Validation
4.1. Model 1 - Classification of Risk
I performed Cross Validation as an additional out-of-sample metric, and obtained an average F1-score and accuracy score of 90%. Model 1 – Classification of Risk of getting into a car accident has performed very well.
Model 1 - "Classification of Risk" average F1-score: 0.9
Model 1 - "Classification of Risk" average accuracy score: 0.9
4.2. Model 2 – Classification of Collision Severity
Since the Decision Tree and KNN algorithms' performance was very close, I performed Cross Validation analysis on both algorithms as a further (out-of-sample) metric and checked which model works better.
Model 2 - "Classification of Collision Severity" – Decision Tree average F1-score: 0.62
Model 2 - "Classification of Collision Severity" – Decision Tree average accuracy score: 0.62
Model 2 - "Classification of Collision Severity" – KNN average F1-score: 0.62
Model 2 - "Classification of Collision Severity" – KNN average accuracy score: 0.65
The Decision Tree worked slightly better with regard to the minority class, but overall KNN had a greater accuracy. See the Classification Reports:
Decision Tree – Classification Report:
KNN – Classification Report:
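A minimal sketch of this cross-validation comparison with scikit-learn; the number of folds and the tree depth are assumptions:

from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

for name, model in [
    ("Decision Tree", DecisionTreeClassifier(max_depth=6)),
    ("KNN (k=7)", KNeighborsClassifier(n_neighbors=7)),
]:
    f1 = cross_val_score(model, X, y, cv=5, scoring="f1_weighted").mean()
    acc = cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
    print(f"{name}: average F1 = {f1:.2f}, average accuracy = {acc:.2f}")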
5. Discussion
For the first model, I was able to achieve 90% accuracy (with an F1-score of 0.9), using the K-Nearest-Neighbors algorithm (number of neighbors = 8). It is important to note that Model 1 – Classification of Risk doesn't predict the likelihood of getting into a car accident; rather, it returns a binary classification predicting the class of risk based on the average number of total collisions that took place in the considered time frame (a high risk means that the input conditions are, historically, those under which the number of accidents was greater than the mean). This approach might be considered naive; however, I reckon this kind of model can give a "warning" to a driver who is planning a car trip under conditions considered dangerous from an objective point of view (that is, conditions under which the majority of accidents happen).
For the second model, I achieved 65% accuracy (with an F1-score of 0.62), again using the K-Nearest-Neighbors algorithm (number of neighbors = 7), so there is still room for improvement. Resampling techniques didn't return any better results; I tried the same algorithms with geo-clusters (instead of latitude and longitude), and even a different feature selection (choosing only features with ANOVA F-test scores above a certain threshold), but, again, the outputs weren't any better (actually even worse). My conclusion is that the severity of a potential collision is a difficult prediction to make; features such as "vehicle speed", "total people involved", or "driver under alcohol/drug influence" might significantly help, but they are not known at the time of attempting a prediction, so they can't be used. However, "individual/per-vehicle" features of this type might also contribute to a collision's severity and, therefore, to the model's performance. For instance, knowing the number of people in each vehicle during a collision might be helpful: in this case, a potential user knows (or at least has an idea of) how many people will be in his/her own vehicle, so the feature can be used to attempt a prediction. These types of data are obviously more difficult to obtain, but they could bring significant improvements to the model.
6. Conclusion
In this project, I built 2 classification models to enable potential users to make more informed decisions about their future travel plans. I selected K-Nearest-Neighbors (KNN) as the algorithm for both models. The classification models predict whether the risk of getting into a car accident is high (that is, higher than average) or low (Model 1), and whether the severity of a potential collision will be low (property damage only) or high (injuries and/or fatalities) (Model 2).
The first model (Classification of Risk) actually consists of 2 machine learning models: a clustering algorithm (K-Means) and a classification one. Users will input the destination of their trip (amongst other input features), the model will cluster the geo-coordinates, and it will return the Class of Risk.
For both models, I identified location, time, and environmental (weather and light-conditions) features as the most important features affecting the risk of getting into a car accident and its severity. The models can be very useful in warning people about a potentially dangerous travel decision.