Car accident severity and its prediction
1. Introduction
1.1 Background and definition of the problem
According to the WHO[1], even though vehicles have become much safer in the last decades, every year around 1.35 million people still die because of a road traffic crash and between 20 and 50 million more people suffer non-fatal injuries with many incurring a disability.
If we take a different perspective and consider the economic impact at the national level, road traffic accidents also cost most countries around 3% of their gross domestic product[2].
Therefore, there is a great interest in different parts of society (such as governments, decision-makers, carmakers, drivers, insurance companies) in changing and decreasing this trend.
A solution that would reduce the number of incidents could be the chance to warn a driver about the possibility of getting into a car accident and how severe that incident would be, given the weather, light, and road conditions. In this way, people would drive more carefully or even stay at home.
To turn this idea into a machine learning problem, I used a dataset provided by a city and its police department (in this case the City of Seattle and the SPD, the Seattle Police Department) to predict the severity of an accident (and its probability) based on the weather, light and road conditions.
1.2 Data understanding
In this project, I used the data provided by the SPD (Seattle Police Department) and recorded by Traffic Records. This dataset, called Data-Collisions.csv, includes around 200,000 collisions of all types (involving cars, bikes, pedestrians and others) from 2004 to the present.
- I chose the severity of the accidents as the dependent variable. SEVERITYCODE is a categorical variable and follows a code that corresponds to the severity of the collision: 2 (injury) and 1 (property damage).
- I chose 7 attributes from the 37 available in the dataset: junctions, weather, road condition, light condition, speeding, number of people involved and number of vehicles involved.
- The junctions, weather, road condition and light condition are the main attributes since they are directly connected to the project’s objective. The number of people involved and the number of vehicles involved indicate how large an accident can be: an accident can involve a lot of vehicles and people and still have nobody injured or no property damage. Speeding is often considered to have a direct impact on accidents and is the only attribute in this list that is a choice of the driver.
- I cleaned the data and checked that they had the proper format. Then I used the one-hot encoding technique to convert categorical variables to binary variables and appended them to the feature data frame. Lastly, I defined the feature set X and the labels y and normalised the data.
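As a rough illustration, the preparation steps above could look like the sketch below. It is not the notebook's code: the sample rows are made up, every column name other than SEVERITYCODE is an assumption about the CSV layout, and StandardScaler is only one possible way to normalise the data.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up rows standing in for Data-Collisions.csv.
# Column names other than SEVERITYCODE are assumptions.
df = pd.DataFrame({
    "SEVERITYCODE": [1, 2, 1, 1, 2, 1],
    "WEATHER": ["Clear", "Raining", "Clear", "Overcast", "Clear", "Raining"],
    "ROADCOND": ["Dry", "Wet", "Dry", "Dry", "Dry", "Wet"],
    "LIGHTCOND": ["Daylight", "Dark - Street Lights On", "Daylight",
                  "Daylight", "Dusk", "Daylight"],
    "PERSONCOUNT": [2, 3, 2, 1, 4, 2],
    "VEHCOUNT": [2, 2, 1, 1, 3, 2],
})

categorical = ["WEATHER", "ROADCOND", "LIGHTCOND"]
numeric = ["PERSONCOUNT", "VEHCOUNT"]

# One-hot encode the categorical columns and append them to the numeric ones.
dummies = pd.get_dummies(df[categorical], dtype=int)
features = pd.concat([df[numeric], dummies], axis=1)

# Feature set X (zero mean, unit variance per column) and labels y.
X = StandardScaler().fit_transform(features)
y = df["SEVERITYCODE"].to_numpy()

print(features.columns.tolist())
print(X.shape, y.shape)
```

Each category becomes its own 0/1 column, so the width of X grows with the number of distinct values in each attribute.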
2. Methodology
2.1 Exploratory Data Analysis
My first step was to examine the value counts of speeding and consider what impact speeding had on accidents. Here came my first surprise: this attribute would not be a good predictor of the severity of an incident, since only 5.17% of the accidents were associated with driving too fast.
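A percentage like this can be read off normalised value counts. The sketch below uses a made-up series and assumes that the speeding flag is stored as "Y" when speeding was recorded and left empty otherwise; the actual encoding in the dataset may differ.

```python
import pandas as pd

# Made-up flag column; the "Y"/missing encoding is an assumption.
speeding = pd.Series(
    ["Y", None, None, None, "Y", None, None, None, None, None],
    name="SPEEDING",
)

# Treat a missing flag as "no speeding recorded".
flag = speeding.fillna("N")

# Share of each value as a percentage of all accidents.
share = flag.value_counts(normalize=True).mul(100).round(2)
print(share)
```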
My next step was to compare the different types of junctions. I discovered that the largest share of accidents (47.16%) takes place in mid-block crossings. Accidents at intersections (related to the intersection) are the second most common (33.47%) and occur more than twice as often as incidents in mid-block crossings related to the intersection (12.40%).
Accidents in mid-block crossings (not related to the intersection) result far more often in property damage (65,293) than in injuries (17,993). However, even though the number of incidents with injuries in mid-block crossings (not related to the intersection) is still high (17,993), it is smaller than the number of incidents with injuries at intersections (25,769). This makes intersections the most dangerous junctions - see the chart below.
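Counts of this kind come from a cross-tabulation of junction type against severity. The following is a minimal sketch on made-up rows; JUNCTIONTYPE and the shortened category labels are assumptions, not the dataset's exact values.

```python
import pandas as pd

# Made-up rows; JUNCTIONTYPE and its labels are assumptions.
df = pd.DataFrame({
    "JUNCTIONTYPE": ["Mid-Block"] * 6 + ["Intersection"] * 4,
    "SEVERITYCODE": [1, 1, 1, 1, 1, 2, 1, 2, 2, 2],
})

# Rows: junction type. Columns: severity (1 = property damage, 2 = injury).
counts = pd.crosstab(df["JUNCTIONTYPE"], df["SEVERITYCODE"])

# Share of all accidents per junction type, in percent.
share = counts.sum(axis=1).div(len(df)).mul(100).round(2)

# Junction type with the largest number of injury accidents.
most_injuries = counts[2].idxmax()

print(counts)
print(share)
print(most_injuries)
```

In this toy sample the mid-block rows are the most frequent overall, while the intersection rows carry the most injuries, which mirrors the pattern described above.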
Weather conditions were the next attribute that I examined. Even though we may think that severe crosswind, hail, snow or fog are the main causes of accidents, this is not the case. Most accidents take place in clear weather (66.09%), rain (18%) or overcast conditions (15.01%), and together these account for 99.10% of all the incidents.
The chart below shows even more clearly that the majority of the accidents, both with property damage and with injuries, occur under a clear sky.
After the weather, I analysed the road conditions as they may put drivers in complicated and stressful situations. There may be circumstances where ice, snow, oil or water on the road is not properly managed.
Surprisingly, the vast majority of the accidents (72.90%) take place when the roads are dry. Wet roads have the second-highest percentage of incidents (25.79%). These two conditions together account for 98.69% of the accidents, while the other, apparently more difficult conditions are associated with only 1.31% of the incidents.
In the chart below, we can easily see that the largest number of accidents (with both property damage and injuries) occurs on dry roads.
Light conditions were the fifth attribute. I studied them since situations with difficult light conditions (such as in the dark with no street lights) may be dangerous and stressful for a driver.
However, the vast majority of the accidents take place not in the dark or at dusk or dawn, but in daylight (67.84% of the cases). Dark with street lights on is the condition with the second-highest percentage of incidents (26.22%). Together they cover 94.06% of the accidents.
In the chart below, we can see that the majority of the accidents, both with property damage and with injuries, occur in daylight and in darkness with street lights on. The chart also shows their proportions relative to the other light conditions.
At this point, I realised that junctions, weather, road and light conditions have one trend in common: the number of accidents with property damage is always larger than the number with injuries, in every category (such as intersections, clear sky, rain or dawn).
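A trend like this can be checked mechanically by comparing the two severity columns of a cross-tabulation. The sketch below uses made-up rows, and the WEATHER column name is an assumption.

```python
import pandas as pd

# Made-up rows; WEATHER is an assumed column name.
df = pd.DataFrame({
    "WEATHER": ["Clear"] * 5 + ["Raining"] * 3 + ["Overcast"] * 3,
    "SEVERITYCODE": [1, 1, 1, 2, 2, 1, 1, 2, 1, 1, 2],
})

counts = pd.crosstab(df["WEATHER"], df["SEVERITYCODE"])

# True only if property damage (1) outnumbers injuries (2) in every category.
trend_holds = bool((counts[1] > counts[2]).all())
print(counts)
print(trend_holds)
```

The same comparison can be repeated for each of the other attributes by swapping the column name.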
Then, I analysed the number of people involved in an incident. The majority of the accidents involve 2 people (59.48%). Incidents involving 3 people have the second-highest percentage (19.67%). Together they cover 79.15% of the accidents, and the percentage increases to 98.74% if we consider all the incidents that involve between 1 and 6 people. Above 10 people, the percentage of accidents becomes so small that it is rounded to zero. The mean per accident is 2.55 people, the mode is 2 people and 75% of the incidents involve between 1 and 3 people. This means that even though the highest number of people involved in an accident is 81, the vast majority of the incidents involve few people. This is true for both severities (i.e. property damage and injuries).
Lastly, I examined the number of vehicles involved in an incident. 91.23% of the accidents involve 1 or 2 vehicles and 98.29% involve between 1 and 3 vehicles. The percentage of incidents with over 8 vehicles is so low that it is rounded to zero. The mean per accident is 1.97 vehicles, the mode is 2 vehicles and 75% of the accidents involve 1 or 2 vehicles. This means that even though the highest number of vehicles involved in an accident is 12, the majority of incidents involve very few vehicles. This is also true for both severities (i.e. property damage and injuries).
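The summary figures quoted for people and vehicles (mean, mode, 75th percentile, maximum and the share of small accidents) can be computed along these lines. The values below are made up and VEHCOUNT is an assumed column name, so the printed numbers will not match the ones in the text.

```python
import pandas as pd

# Made-up vehicle counts with one large outlier; VEHCOUNT is an assumed name.
veh = pd.Series([2, 2, 1, 2, 3, 2, 1, 2, 2, 12], name="VEHCOUNT")

summary = {
    "mean": veh.mean(),
    "mode": int(veh.mode().iloc[0]),
    "q75": veh.quantile(0.75),
    "max": int(veh.max()),
    # Percentage of accidents involving 1 or 2 vehicles.
    "share_1_or_2": veh.isin([1, 2]).mean() * 100,
}
print(summary)
```

A single extreme value pulls the mean up while the mode and the 75th percentile stay at 2, which is why the text reports all three.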
2.2 Modelling
Since the target (i.e. the dependent variable) is categorical, predicting the severity of an accident is a classification problem. The application of classification models follows a specific path. I divided the samples into two sets (80% training data, 20% test data, with a random state of 4). Then I used three approaches to create three different predictive models:
1. Decision Tree
2. Support Vector Machine (SVM)
3. Logistic Regression
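The split and the three model families listed above can be sketched with scikit-learn as follows. Only the 80/20 split and the random state of 4 come from the text. The synthetic data and all hyperparameters (tree depth, kernel, iteration limit) are assumptions chosen to keep the example small.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the encoded, normalised features.
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
y = y + 1  # shift labels from 0/1 to 1/2 to mimic SEVERITYCODE

# 80% training data, 20% test data, random state of 4 (as in the text).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=4
)

# Hyperparameters are illustrative assumptions, not the notebook's settings.
models = {
    "Decision Tree": DecisionTreeClassifier(max_depth=4, random_state=0),
    "SVM": SVC(kernel="rbf"),
    "Logistic Regression": LogisticRegression(max_iter=1000),
}

# Fit each model on the training set and predict the held-out test set.
predictions = {
    name: model.fit(X_train, y_train).predict(X_test)
    for name, model in models.items()
}

for name, y_hat in predictions.items():
    print(name, y_hat[:10])
```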
Among the three models, logistic regression performed worst on every measure that I used, as shown in the table below (best results in red).
Even though the decision tree has a lower Jaccard score than the SVM, its F1-score and its accuracy (using metrics.accuracy_score) are slightly higher. This situation is shown by their confusion matrices below (1 = property damage; 2 = injuries).
The three confusion matrices show that the logistic regression model makes more errors on both severity 1 (i.e. property damage) and severity 2 (i.e. injuries). However, the decision tree has the fewest false negatives while the SVM has the fewest false positives (see table below, best results in red).
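For reference, the three scores and the confusion matrix can be computed with scikit-learn as in the sketch below. The label vectors are made up and follow the SEVERITYCODE convention (1 = property damage, 2 = injury). Treating class 1 as the positive class and using a weighted F1 average are assumptions, not confirmed settings from the notebook.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    jaccard_score,
)

# Made-up labels: 1 = property damage, 2 = injury.
y_true = [1, 1, 1, 1, 1, 1, 2, 2, 2, 2]
y_pred = [1, 1, 1, 1, 1, 2, 1, 1, 2, 2]

scores = {
    # Jaccard for the assumed positive class (1).
    "jaccard": jaccard_score(y_true, y_pred, pos_label=1),
    # F1 averaged over both classes, weighted by class frequency (assumption).
    "f1": f1_score(y_true, y_pred, average="weighted"),
    "accuracy": accuracy_score(y_true, y_pred),
}

# Rows are true labels, columns are predicted labels, both ordered [1, 2].
cm = confusion_matrix(y_true, y_pred, labels=[1, 2])
tp, fn = cm[0]  # true 1 predicted as 1, true 1 predicted as 2
fp, tn = cm[1]  # true 2 predicted as 1, true 2 predicted as 2

print(scores)
print(cm)
```

With class 1 as the positive class, correctly predicted injuries sit in the bottom-right cell of the matrix and count as true negatives.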
Both the Decision Tree (best F1-score and accuracy) and the SVM (best Jaccard score) have good results, but considering the objective of the study, we are more interested in saving lives and avoiding injuries than in avoiding property damage. Thus, correctly predicting injuries (i.e. 2, true negatives) is more important than correctly predicting property damage (i.e. 1).
Therefore, the Decision Tree is the best model to use for two reasons:
- It has a better F1-score and accuracy
- It has more true negatives (it predicts accidents with injuries with greater accuracy)
3. Discussion
In this study I showed that accidents occur much more often in situations that are – at least in theory – safer, such as clear sky, dry road, and daylight.
It looks as if drivers are more careful when situations are more stressful, and underestimate the risk of accidents when the weather, light and road conditions are good.
4. Conclusion
Most people drive a vehicle every day to move within and between cities, whether for commuting, going on holiday, visiting someone or something else.
Having a reminder showing that most accidents happen in the easiest conditions (such as clear sky or daylight) could be useful. It would keep drivers in the same state of alert that they have when the conditions are worse (such as hail, snow or fog).
Lastly, the models developed could also be useful for a municipality that wants to decrease the number of accidents in its district.
Notes
For the code, here is the link to the Jupyter notebook
[1] "Road traffic injuries", World Health Organisation (WHO), 07/02/2020, https://www.who.int/news-room/fact-sheets/detail/road-traffic-injuries
[2] Ibid.