Collision Severity Prediction with Machine Learning Algorithms
Ramin Ferdos
Data Scientist | AI, ML | Python Developer | Pushing Boundaries with Data-Driven Solutions
Introduction
Traffic-related collisions are estimated to cost the US economy roughly $810 billion per year in property damage, medical costs, legal bills, and lost earnings. With the proper use of data and technique, we can reduce this cost by identifying risk factors and controlling them.
There are a number of factors related to the severity of road traffic collisions. To name a few:
- environmental conditions such as weather, road surface, and lighting
- time of day and day of the week
- driver-related conditions such as speeding, driving under the influence, and lack of attention
- the speed and angle of the collision
While a number of these factors may individually be good predictors, some of them rarely occur together; for instance, head-on collisions on freeways are quite rare because of the barriers and dividers between directions. By studying and analyzing historical collision data, we can extract valuable insights that will eventually help authorities and activists reduce the severity of collisions. Another helpful application of this prediction is the ability to foresee the outcome of a collision based on observations and dispatch emergency units accordingly. For instance, when a call center receives a call about a collision at a specific intersection, the operator could acquire the important factors from the caller and feed them to this application; the result would be the probability and severity of injury, so the dispatcher could allocate units accordingly. This insight could significantly reduce the fatal casualties of a collision.
The two main beneficiaries of building this kind of model are (1) town/city planners, who may be able to use the model to inform their road planning and traffic calming strategies, and (2) emergency service responders, who may be able to use the model to predict the severity of an accident based on information that’s provided at the time the accident is reported in order to optimally allocate resources across the city.
In this article, we are going to analyze historical data about accidents and their severities, and then use machine learning algorithms to predict the severity of newly entered collisions. We approach this problem with the CRISP-DM methodology, and the sections below follow its steps.
First Step - Business Understanding
Collisions come with many levels of severity: a collision may result in a fatality, a serious injury, an injury, property damage, and so on. We reduce these types to two classes: 1. property damage only collisions and 2. injury collisions.
A dataset of more than 221,000 accidents occurring from 2014 to now (September 2020) in the Seattle city area was obtained from the Seattle Data Portal. The dataset has 40 columns describing the details of each accident including the weather conditions, collision type, road type, date/time of the accident, and location.
Second Step - Analytic Approach
With appropriate data, we can extract valuable knowledge from this prediction. To name just a few possibilities: we can estimate the probability of injury for each collision type (for example, do rear-end collisions or sideswipes result in more injuries?); we can mark dangerous spots where injury collisions occur significantly often; and we can identify the weather conditions under which collisions, especially injury collisions, occur more frequently. So our problem is predicting the severity type of a collision based on a set of features, which we will discuss further below.
Third Step - Data Requirements
As discussed previously, we need data in order to predict our dependent variable. The data should meet some requirements and contain sufficient features to make this prediction possible. Fortunately, we have a rich dataset containing the required features and labels. We will use the effective features of this dataset as independent variables (X) to make the prediction.
Fourth Step - Data Collection
In this step, data scientists gather the required data as an initial dataset to process, clean, and predict from. Fortunately, IBM hosts our dataset, which you can download using this link: Downloading Dataset
Also, this dataset is available at this project's GitHub Repository. In this section, we are going to load our data and explore it a little.
Please keep in mind that, to keep this article clean, I don't include the full code here; to check the code you can visit this project's GitHub repository at https://github.com/RaminFerdos/Coursera_Capstone
Our initial data looks like this:
Obviously, we have to clean our dataset and prepare it for machine learning.
Fifth Step - Data Preparation
This is one of the most important steps in our methodology: preparing the data for processing and prediction. This step usually takes most of a project's time and should be approached with caution. Fortunately, since our dataset is already fairly clean and processed, this step is not a major pain point in our project. Regardless, we should go through it carefully and completely.
Looking at our dataset, we can see that the severity codes signify the following:
- 1: Property Damage Only Collision
- 2: Injury Collision
According to our dataset, out of 194,673 records, 136,485 collisions were 'property damage only' and 58,188 were 'injury' collisions. There are some features that won't help us predict, and we can delete them; these features include:
- REPORT NO
- STATUS
- INCKEY
- etc.
Also, since logistic regression can't work directly with date-time variables, we need to drop those as well.
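As a minimal sketch (assuming the data has been loaded into a pandas DataFrame called df, and that the identifier and date-time column names below, beyond those listed above, are assumptions about the schema):

```python
import pandas as pd

# Load the collision dataset (the file name is an assumption).
df = pd.read_csv('Data-Collisions.csv')

# Identifier/status columns with no predictive signal, plus date-time fields.
drop_cols = ['REPORTNO', 'STATUS', 'INCKEY', 'INCDATE', 'INCDTTM']
df = df.drop(columns=drop_cols, errors='ignore')
```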
After dropping these columns, the result looks like this:
Now we have to make our dataset consistent. This means that every variant of 'No' ('No', 'no', 'n', '0', null, NaN) should be mapped to 0 and every variant of 'Yes' ('Yes', 'yes', 'y', '1') should be mapped to 1.
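A minimal sketch of this mapping, assuming df is the working DataFrame and that the flag-like columns include UNDERINFL, INATTENTIONIND, and SPEEDING (the exact set of columns is an assumption):

```python
# Hypothetical list of yes/no-style columns; adjust to the actual dataset.
flag_cols = ['UNDERINFL', 'INATTENTIONIND', 'SPEEDING']

yes_no_map = {'Y': 1, 'y': 1, 'Yes': 1, 'yes': 1, '1': 1, 1: 1,
              'N': 0, 'n': 0, 'No': 0, 'no': 0, '0': 0, 0: 0}

for col in flag_cols:
    # Map the textual variants, then treat anything unmapped (null/NaN) as 0.
    df[col] = df[col].map(yes_no_map).fillna(0).astype(int)
```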
After this normalization, the result looks like this:
Data Exploration:
In this section, we are going to explore our data and extract some observations from it.
The first thing we do is check the frequency of severity code = 1 (the collision involved only property damage) and severity code = 2 (the collision involved injuries).
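A minimal sketch of how this distribution might be computed, assuming the label column is named SEVERITYCODE:

```python
# Share of each severity class, as a percentage of all records.
severity_share = df['SEVERITYCODE'].value_counts(normalize=True) * 100
print(severity_share.round(2))
```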
70.11% of collisions were "Property Damage Only" and 29.89% of collisions were "Injury".
On average, collisions that led to injuries had more people involved in them. Also, collisions involving pedestrians led to more injury severities.
On average:
- Intersection collisions are more severe than alley and block collisions.
- More people are involved in intersection and block collisions.
- More vehicles are involved in block collisions.
- People who are under the influence are more likely to have collisions in alleys and blocks than at intersections.
- Speeding leads to collisions in blocks more than at intersections and in alleys.
The frequencies of pedestrian count, vehicle count, and person count can be seen below:
Please look at this graph:
As you can see in the graph above, the vast majority of people drive responsibly: 185,552 drivers drove responsibly, compared with only 9,121 who did not.
95.31% of drivers drive responsibly, but 4.69% still drive irresponsibly.
Another fact to keep in mind is that if you drive irresponsibly, you are more likely to injure yourself and/or others:
29.44% of collisions where the driver was not under the influence resulted in injury, while 39.05% of collisions involving irresponsible driving resulted in injury.
As you can see, collisions are much more common at mid-blocks: there are 89,800 mid-block incidents compared with only 62,810 at intersections.
Also, mid-block incidents are more likely to involve property damage only than injury: 78.39% are property damage only and 21.61% are injury incidents.
Another useful insight is that intersection incidents are more severe: 30.26% of them resulted in injuries, while only 21.61% of mid-block incidents did.
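A minimal sketch of how such per-category injury rates might be computed, assuming columns named ADDRTYPE and WEATHER (an assumption about the schema):

```python
def injury_rate_by(df, col):
    """Percentage of injury collisions (SEVERITYCODE == 2) per category of `col`."""
    return (df.groupby(col)['SEVERITYCODE']
              .apply(lambda s: (s == 2).mean() * 100)
              .round(2))

print(injury_rate_by(df, 'ADDRTYPE'))  # block vs. intersection vs. alley
print(injury_rate_by(df, 'WEATHER'))   # overcast, raining, clear, ...
```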
Overcast injury incident percentage: 31.55%.
Raining injury incident percentage: 33.72%.
Clear injury incident percentage: 32.25%.
As you can see above, weather is not a strong deciding factor in collision severity: collisions are roughly equal in severity across different weather conditions. This is probably due to new technologies in vehicles and tires and to city infrastructure, and it is a good indicator that mitigation measures for bad weather are working in the areas mentioned.
Since the Road Condition and Weather Condition fields are correlated, we can extract the same insights from Road Condition that we did from Weather Condition.
Percentage of daylight incidents that resulted in injuries: 33.19%.
Percentage of incidents in the dark with street lights on that resulted in injuries: 29.84%.
Percentage of streets with lights on, relative to all streets: 94.64%.
Also, light conditions are not a strong deciding factor in the severity of accidents, since the percentage of injuries is actually a bit higher in daylight. This indicates that streets without lighting are far fewer than streets with proper lighting, and that vehicle lighting systems are effective at illuminating drivers' paths.
Above you can see the correlation between the features.
Above you can also see a linear model plot of the number of vehicles and persons in a collision, with the collision outcome as markers.
Another problem we have to address is:
- Some records have missing data in certain columns; we should remove them from our dataset.
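A minimal sketch of this step, assuming df is the working DataFrame:

```python
print('Records before:', len(df))
df = df.dropna()  # remove any row with at least one missing value
print('Records after:', len(df))
```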
After addressing this issue, our result is:
Before dropping null records we have 194,673 records.
After dropping null records we have 182,895 records.
Also, some records have 'Unknown' values in some columns; before going further, we have to remove those records too.
Before dropping unknown records, the state is:
There are 11,637 records with "Unknown" as the weather.
There are 11,519 records with "Unknown" as the road condition.
There are 10,448 records with "Unknown" as the light condition.
There are 5 records with "Unknown" as the junction type.
There are 3,111 records with "NOT ENOUGH INFORMATION / NOT APPLICABLE" as SDOT_COLDESC.
There are 38 records with "Not stated" as ST_COLDESC.
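A minimal sketch of this filtering; the column names and placeholder values below are taken from the counts above, but treat them as assumptions about the actual schema:

```python
unknown_filters = {
    'WEATHER': 'Unknown',
    'ROADCOND': 'Unknown',
    'LIGHTCOND': 'Unknown',
    'JUNCTIONTYPE': 'Unknown',
    'SDOT_COLDESC': 'NOT ENOUGH INFORMATION / NOT APPLICABLE',
    'ST_COLDESC': 'Not stated',
}

# Keep only rows whose categorical fields carry usable information.
for col, bad_value in unknown_filters.items():
    df = df[df[col] != bad_value]
```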
After addressing this issue, our result is:
Before dropping unknown records we have 182,895 records.
After dropping unknown records we have 166,532 records.
Everything seems fine. At this stage we need to address one more issue:
- There are some categorical variables, and since logistic regression doesn't understand them directly, we have to convert them into dummy variables.
As a first step, we have to identify our categorical variables and their possible values; in the picture below you can see some of them as examples:
After converting the categorical variables into dummies, our data frame looks like this:
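A minimal sketch of this encoding with pandas, assuming the categorical columns listed below (an assumption; the real list is longer):

```python
import pandas as pd

# One-hot encode the categorical columns; each category becomes its own 0/1 column.
categorical_cols = ['WEATHER', 'ROADCOND', 'LIGHTCOND', 'JUNCTIONTYPE', 'COLLISIONTYPE']
df = pd.get_dummies(df, columns=categorical_cols)
```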
At last, our data is ready to be modeled.
Sixth Step - Data Modeling
In this step, we are going to create our model from the dataset and train it. Before training, we split the dataset into train and test parts in order to evaluate our predictions.
For the first part of this step, we need to identify our y (dependent variable) and X (independent variables).
Our dependent variable, the severity code, will be stored in a NumPy array and looks like this:
Our independent variables, the remaining features, will be stored in a NumPy array and look like this:
Please keep in mind that the two pictures above show the first 5 items of each array.
It's also advisable to normalize the independent variables. The result after normalizing looks like this:
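A minimal sketch of defining y and X and normalizing with scikit-learn, assuming the label column is SEVERITYCODE:

```python
from sklearn import preprocessing

# Dependent variable (severity code) and independent variables (all other columns).
y = df['SEVERITYCODE'].values
X = df.drop(columns=['SEVERITYCODE']).values

# Scale features to zero mean and unit variance.
X = preprocessing.StandardScaler().fit(X).transform(X)
```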
Now we need to split our data into train and test parts so we can evaluate the algorithm's accuracy. As is standard, 70% of the data will be used for training and 30% for testing:
Our independent variables in the train set have 116,572 rows and 153 features (columns), and our dependent variable in the train set has 116,572 rows.
Our independent variables in the test set have 49,960 rows and 153 features (columns), and our dependent variable in the test set has 49,960 rows.
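A minimal sketch of the split with scikit-learn; the fixed random_state is an assumption added for reproducibility:

```python
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=4)

print('Train set:', X_train.shape, y_train.shape)
print('Test set:', X_test.shape, y_test.shape)
```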
Now everything is set for our model to be trained and tested.
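A minimal sketch of training and scoring; C=0.003 (scikit-learn's inverse regularization strength) and the liblinear solver are assumptions based on the regularization value mentioned in the Discussion section:

```python
from sklearn.linear_model import LogisticRegression

# Logistic regression with a strong L2 penalty (small C = strong regularization).
model = LogisticRegression(C=0.003, solver='liblinear')
model.fit(X_train, y_train)

y_pred = model.predict(X_test)               # predicted class labels
y_pred_proba = model.predict_proba(X_test)   # per-class probabilities
```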
After fitting the model on the training data, the predicted probabilities for the test set look like this:
Seventh Step - Evaluation
Now that our model is trained, in this step we evaluate the performance of our machine learning algorithm with several metrics.
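A minimal sketch of how these metrics might be computed with scikit-learn; treating severity code 1 as the positive label for the F1 and Jaccard scores, and code 2 (injury) as the positive label for the ROC curve, are assumptions inferred from the reported values:

```python
from sklearn import metrics

print(metrics.classification_report(y_test, y_pred))
print('Train set Accuracy:', metrics.accuracy_score(y_train, model.predict(X_train)))
print('Test set Accuracy:', metrics.accuracy_score(y_test, y_pred))
print('F1-score:', metrics.f1_score(y_test, y_pred, pos_label=1))
print('Logloss:', metrics.log_loss(y_test, y_pred_proba))
print('Jaccard score:', metrics.jaccard_score(y_test, y_pred, pos_label=1))
print('Confusion matrix:\n', metrics.confusion_matrix(y_test, y_pred))

# ROC curve, treating the injury class (code 2) as the positive label.
fpr, tpr, _ = metrics.roc_curve(y_test, y_pred_proba[:, 1], pos_label=2)
print('ROC AUC:', metrics.auc(fpr, tpr))
```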
Our classification report is:
Other evaluation metrics are:
Train set Accuracy: 0.7351851216415606
Test set Accuracy: 0.7357686148919136
Our Algorithm F1-score is 0.8261676828063893.
Our Algorithm Logloss Score is 0.5138035158443474.
Our Algorithm Jaccard score is 0.7038208700724686.
The confusion matrix is:
And finally, our receiver operating characteristic (ROC) curve is:
Conclusion
In order to analyze the coefficients of our variables, we create a dictionary whose keys are the column names and whose values are the corresponding coefficients; you can see a few sample columns of this dictionary in the picture below:
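A minimal sketch of building such a dictionary, assuming feature_names holds the column names of the dummied feature DataFrame:

```python
# Map each feature name to its learned logistic regression coefficient.
feature_names = df.drop(columns=['SEVERITYCODE']).columns
coefficients = dict(zip(feature_names, model.coef_[0]))

# A positive coefficient pushes the prediction toward the injury class.
print(dict(list(coefficients.items())[:5]))
```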
Results
Our main objective in this project was to discover and extract insight from the data. First, we explored the data and extracted some valuable information, as you saw above; for instance, we found that intersection collisions are more severe than block collisions, but collisions occur more often in blocks than at intersections. This gives operators and emergency units a heads-up that if they are called to an incident at an intersection, there will probably be injuries. It also informs city planners that they should focus on mid-block roads and try to control the parameters that lead to collisions there.
After that, we trained our logistic regression model on the given data (which we had cleaned and whose categorical features we had converted to dummies) in order to make predictions for new records; you can see its performance, including its ROC curve, in the section above.
For the final part, we made a dictionary representing the impact of every single parameter in our multivariate regression equation, which you can see above. With this dictionary, we can see whether each parameter pushes the prediction toward injury or property damage.
Discussion
As mentioned before, this project was implemented using Seattle data and a fixed set of features; by improving the dataset and features, the project can be extended and improved. In the data exploration section we also provided valuable information for users of this project. If they use those insights wisely, they can reduce the severity of collisions; for example, we found that speeding leads to collisions in blocks more than at intersections and in alleys, which indicates that control measures in block sections could effectively reduce the number of collisions.
I used logistic regression with a regularization strength of 0.003. For future studies, it could be compared against other algorithms such as k-nearest neighbors or support vector machines, and a feature selection step could be added to make the project more effective. Other approaches for increasing the accuracy of this model are also appreciated.
Conclusion
Collisions take a huge toll both on the economy and on families, and the injuries they cause are often irreversible. With the proper data and the use of machine learning algorithms, we can extract insights to reduce not only the number of collisions but also their severity, and save lives! I highly recommend starting to gather similar data in other countries, since the biggest setback for this kind of project is the lack of data.
Modern Problems Require Modern Solutions.
Ramin Ferdos.