Using Machine Learning to Predict Crash Severity
INTRODUCTION
Imagine you are a 911 dispatcher. A call comes in: a frantic man on the other end of the line says there has been a car accident and they need help, and then the line drops. You know your department has limited resources, and you need to triage the severity of the accident so you know who to send to the scene. It wouldn't be smart to send the entire fire, police, and EMS response for what's likely just a fender bender, but if it's a major accident, seconds matter. What do you do?
What if machine learning could help you make that decision? That is the reality of our new data-driven world.
As a former firefighter, I know first-hand how important it is to allocate appropriate resources to an emergency. Personnel shortages, equipment availability, and budget constraints can all play a role in those decisions. If there is a major fire going on and a call comes in for a wreck, it's not always practical, or even possible, to send all of your resources, especially for a small fender bender at noon on a Tuesday. But if there is an accident on a twisty backroad at 2 AM on a rainy Saturday, you should expect the worst and send whoever you can.
And that's the goal of this project: to build a model that predicts the severity of an accident, so that it can help triage crashes based on historical data.
*** Disclaimer - I am positive that I did some (read: many) things wrong here, but I learn best by doing. And besides, the best way to get the right answer is to post the wrong one.
DATA
I'm a resident of the great state of Texas, so I chose to build my model using data from my state. While researching data I could use, I found that public crash records do indeed exist for my state.
https://data.world/spatialaustin/texas-crash-records-information-system-cris-extract
I downloaded the 'field lookup' file to see if "crash severity" was a field I could target.
Success! It looks like this data has what we need to build a model to predict the severity of a crash. And there are many good potential predictor columns in the data; for example, the names of the roads, the date and time, the road’s surface material, the weather conditions, and much, much more. We should certainly be able to build a good classification model based on this data.
Unfortunately, the data there was fairly stale, with records only up to 2016. The latest data can be requested from the Texas Department of Transportation for statistical purposes; accessing it requires signing up for the CRIS database.
https://www.txdot.gov/government/enforcement/data-access.html
https://ftp.dot.state.tx.us/pub/txdot-info/trf/crash_statistics/automated/cris-guide.pdf
https://cris.dot.state.tx.us/secure/Share/app/home/welcome
You can only request one year of data at a time. I will be using data from the year 2019 (1/1-12/31) to build my model.
It took about a day to get the data from the CRIS team, but once I did, it arrived as an encrypted zip file. Inside that were several more zip files and, judging by the filenames, the data is broken up into smaller date ranges.
I unzipped the first archive and poked around the data. The file 'extract_public_2018_20200825003448_crash_20190101-20190301Texas.csv' had the target field, CRASH_SEV_ID.
My first task was to extract all of the data and pull it into a pandas dataframe.
This took me a while to figure out! The glob module was something I had not personally used before, but it allowed me to import all of the relevant files at once.
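A minimal sketch of that import step, assuming the extracts were all unzipped into a single folder (the path pattern is illustrative; adjust it to wherever your files live):

```python
import glob
import pandas as pd

# Gather every crash CSV that came out of the CRIS zip files.
crash_files = glob.glob('data/extract_public_*_crash_*.csv')

# Read each file and stack them into a single dataframe.
df = pd.concat((pd.read_csv(f, low_memory=False) for f in crash_files),
               ignore_index=True)

print(df.shape)
```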
I now have a dataframe with 648,337 rows and 171 columns. Let's discuss the columns.
An extremely useful file for understanding all of the data is available here:
https://www.txdot.gov/government/enforcement/data-access.html
This file has detailed descriptions of all of the columns, and what each value represents.
The vast, vast majority of this data is going to be irrelevant to our problem. We need to predict the severity of a crash, and the factors that contribute to the crash are all we are concerned with. We need to dig through the list and make sure we understand each column.
I suspect that what we need to do is break this list into two groups: columns that describe conditions leading up to the crash, and columns that describe effects of the crash. For example, road conditions are likely contributors to the crash, while the damage to the vehicle is a result of it. Let's call these 'contributors' and 'results'.
There are also going to be completely irrelevant factors, like the name of the investigating service, that we can remove entirely.
METHODOLOGY
FEATURE SELECTION
First, I wanted to find which columns were the most highly correlated with the crash severity. This proved to be less interesting than I thought it would be. The most highly correlated items were what I would consider 'results' of a crash, not 'contributors'.
The most highly correlated fields were all of the injury count fields, and that is not information you will necessarily have until you get resources to the crash scene. You can't predict how bad a crash is GOING to be with data that isn't generated until AFTER the crash. Typically you are looking for fields with at least some correlation to the target, but we aren't going to have much luck here: all of the contributor fields have an absolute correlation of less than 0.15.
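For reference, a quick way to rank columns by their correlation with the target looks something like this (I'm using 'Crash_Sev_ID' as the target column name here; the casing in your extract may differ):

```python
# Correlation of every numeric column with the crash severity target,
# sorted by absolute value so the strongest relationships float to the top.
correlations = (df.corr(numeric_only=True)['Crash_Sev_ID']
                  .drop('Crash_Sev_ID')
                  .sort_values(key=abs, ascending=False))
print(correlations.head(20))
```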
The fact that most of the fields are not highly correlated with the crash severity directly tells me that we won't be able to use a linear regression model or similar; we should use a decision tree, SVM, or random forest for our model. That makes sense, since this is a classification problem.
After working through all of the fields, and doing lots of manual analysis of the 171 columns, I found the columns that seemed most relevant, at least in my mind. However, I wanted to make sure I didn't have any redundant fields, so to further reduce dimensionality, I plotted a correlation heat map of the data.
There are indeed several fields that are highly correlated. I ran a quick loop to pull out exactly which fields were the most highly correlated.
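The heat map and the pair-extraction loop were roughly along these lines (candidate_columns stands in for the list of predictors I kept after the manual pass, and the 0.8 threshold matches what I discuss next):

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Correlation matrix of the candidate predictor columns.
corr = df[candidate_columns].corr(numeric_only=True)

# Heat map to eyeball redundant features.
sns.heatmap(corr, cmap='coolwarm', center=0)
plt.show()

# Walk the upper triangle and report any pair correlated above 0.8.
threshold = 0.8
for i, col_a in enumerate(corr.columns):
    for col_b in corr.columns[i + 1:]:
        if abs(corr.loc[col_a, col_b]) > threshold:
            print(f'{col_a} <-> {col_b}: {corr.loc[col_a, col_b]:.2f}')
```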
Pretty interesting. The fields City_ID, Rural_Fl, and Pop_Group_ID are all correlated above 0.8 with each other. Let's look more closely at these fields.
We can see that Rural_Fl may not be that useful; it's really only a True/False field that does not narrow down our data very much. City_ID is very interesting, but there is so much variability in the data (and a lot of '9 - Not Reported' values) that it may yield inaccurate results.
Pop_Group_ID, however, can be thought of as a 'binned' version of City_ID. It should provide enough granular detail to make a good prediction, but be generalized enough not to overfit. We'll drop the other two columns.
FEATURE ENGINEERING
From here, much of the data is going to need to be cleaned/normalized.
Let's first remove all rows where the target value is missing. We can't very well make predictions on these, and if we can remove rows up front, that will save a little time processing the rest of the data.
Next, we need to deal with null values. In some cases the schema's existing 'not reported' value was 0, so we'll fill those nulls with zero. In other cases, we'll just drop the row entirely. Originally I had filled these with a mean value, but I found dropping them worked better for this model.
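A sketch of that cleanup; the target name follows the earlier sketches, and the zero-fill column names are placeholders rather than the exact fields I used:

```python
# Drop rows with no target value: we can't learn from or score these.
df = df.dropna(subset=['Crash_Sev_ID'])

# Columns where the documented schema already uses 0 to mean 'not reported':
# filling NaN with 0 keeps them consistent with the existing codes.
zero_fill_cols = ['Wthr_Cond_ID', 'Surf_Cond_ID']  # illustrative names
df[zero_fill_cols] = df[zero_fill_cols].fillna(0)

# Everything else with a missing value simply gets dropped.
df = df.dropna()
```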
One big issue I ran into while iterating on this model was that the time column had entirely too many distinct values. I found that binning the values into their respective hour helped build a model that wasn't as overfit.
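Assuming the crash time comes in as an 'HH:MM' string (Crash_Time is a placeholder name; adjust the format, or use integer division if it is stored as HHMM), the binning can be as simple as keeping only the hour:

```python
# Reduce the crash time to its hour (0-23) so the model sees 24 bins
# instead of thousands of distinct minute-level values.
df['Crash_Hour'] = pd.to_datetime(df['Crash_Time'], format='%H:%M',
                                  errors='coerce').dt.hour
df = df.drop(columns=['Crash_Time']).dropna(subset=['Crash_Hour'])
```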
I also originally replaced the day-of-week values with a numeric value. In hindsight, this may not have been strictly necessary, because I ended up using one-hot encoding, but it was useful early on.
Same with any Y/N values: I converted these to 1/0, but could have left them alone.
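Those mappings were simple dictionary replacements, roughly like this (the column name and the '_Fl' suffix convention are assumptions about the extract, not guaranteed):

```python
# Day of week as a number (one-hot encoding later makes this optional).
day_map = {'SUN': 0, 'MON': 1, 'TUE': 2, 'WED': 3,
           'THU': 4, 'FRI': 5, 'SAT': 6}
df['Day_of_Week'] = df['Day_of_Week'].map(day_map)

# Flag columns: Y/N to 1/0.
flag_cols = [c for c in df.columns if c.endswith('_Fl')]
df[flag_cols] = df[flag_cols].replace({'Y': 1, 'N': 0})
```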
We use an extremely simple one-hot encoding method; it's just one line of code.
We also split the data into training and test sets.
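Together, the encoding and the split look something like this (categorical_cols stands in for the list of categorical predictors I kept; the 80/20 split ratio is just a reasonable default):

```python
from sklearn.model_selection import train_test_split

# One-hot encode the categorical predictors in a single call.
# The target column is numeric, so it passes through untouched.
encoded = pd.get_dummies(df, columns=categorical_cols)

# Hold out 20% of the rows for testing BEFORE any balancing is done.
train_df, test_df = train_test_split(encoded, test_size=0.2,
                                     random_state=42,
                                     stratify=encoded['Crash_Sev_ID'])
```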
Finally, we can focus on balancing the training data. Just how unbalanced is it?
This data is clearly skewed towards severity value 5. Left as-is, that imbalance will heavily bias the model toward the majority class.
We'll use scikit-learn's resample utility to create two new dataframes, one over-sampled and one under-sampled. The over-sampled frame takes our under-represented classes and randomly up-samples them to match the size of the largest class, and the under-sampled frame randomly removes rows from the over-represented classes. Both techniques have their shortfalls (potential overfitting and loss of information, respectively), but either will give us much more accurate results than not balancing at all. We'll train models with both and evaluate them.
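A sketch of both resampling approaches, applied only to the training rows from the split above:

```python
from sklearn.utils import resample

# Split the training rows by class.
groups = [g for _, g in train_df.groupby('Crash_Sev_ID')]

# Over-sampling: draw each class up to the size of the largest class,
# sampling with replacement.
target_size = max(len(g) for g in groups)
oversampled = pd.concat(
    [resample(g, replace=True, n_samples=target_size, random_state=42)
     for g in groups])

# Under-sampling is the mirror image: shrink every class to the smallest one.
floor_size = min(len(g) for g in groups)
undersampled = pd.concat(
    [resample(g, replace=False, n_samples=floor_size, random_state=42)
     for g in groups])
```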
With the data balanced, we can create our X/y sets.
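Using the over-sampled frame here (which, as discussed later, performed better for me), the X/y split is straightforward:

```python
# Features and target from the balanced training frame.
X_train = oversampled.drop(columns=['Crash_Sev_ID'])
y_train = oversampled['Crash_Sev_ID']

# The test set stays untouched and unbalanced.
X_test = test_df.drop(columns=['Crash_Sev_ID'])
y_test = test_df['Crash_Sev_ID']
```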
We should be ready for modeling.
DECISION TREE
I was pretty confident that a decision tree was going to be my best option, but I had mixed results.
I spent quite a bit of time evaluating which depth would work best. Early on, before settling on my feature engineering, I used a function that measured the accuracy of the model at different depths using k-fold cross-validation. It was very interesting to see how the accuracy leveled off past a certain depth.
I also ran a simpler version of this later on. It took less time because it didn't do the cross-validation, but it was still useful.
Training the model from there is very straightforward, but as you can see, it resulted in a fairly low F1 score.
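The depth sweep and the final fit looked roughly like this (the depth range and the final max_depth of 10 are illustrative, not the exact values I settled on):

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score
from sklearn.metrics import f1_score

# Measure cross-validated accuracy at a range of depths to see where it levels off.
for depth in range(2, 21):
    tree = DecisionTreeClassifier(max_depth=depth, random_state=42)
    scores = cross_val_score(tree, X_train, y_train, cv=5)
    print(f'depth={depth:2d}  mean accuracy={scores.mean():.3f}')

# Fit at the chosen depth and score on the held-out test set.
tree = DecisionTreeClassifier(max_depth=10, random_state=42)
tree.fit(X_train, y_train)
print(f1_score(y_test, tree.predict(X_test), average='weighted'))
```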
RANDOM FOREST
I suspected that a random forest would actually perform the best, because it's effectively an ensemble of decision trees. We could spend a lot of time here tuning the model, but even with just this simple implementation, the F1 score was reasonably high compared to many of the other tests I did.
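A near-default random forest along these lines is what I mean by a simple implementation (n_estimators here is just scikit-learn's default, not a tuned value):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score

# Mostly-default forest; tuning n_estimators and max_depth is left for later.
forest = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=42)
forest.fit(X_train, y_train)
print(f1_score(y_test, forest.predict(X_test), average='weighted'))
```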
XGBOOST
This was a very interesting model. From what I understand, it is used very often in Kaggle competitions to great success, so I wanted to give it a try for this project.
First, convert the training and test sets to a DMatrix.
We can set the hyperparameters here. Since this is a multi-class classification problem, we use the multi:softprob objective. I used several different values for the hyperparameters while experimenting with this model.
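A sketch of the whole XGBoost pass; the max_depth, eta, and round count shown here are just starting points rather than my final settings:

```python
import numpy as np
import xgboost as xgb
from sklearn.metrics import f1_score

# XGBoost expects class labels 0..K-1, so re-index the severity codes first.
classes = np.sort(np.unique(y_train))
y_train_idx = np.searchsorted(classes, y_train)
y_test_idx = np.searchsorted(classes, y_test)

# XGBoost's native API wants DMatrix objects rather than raw dataframes.
dtrain = xgb.DMatrix(X_train, label=y_train_idx)
dtest = xgb.DMatrix(X_test, label=y_test_idx)

# multi:softprob returns a probability per class; num_class must match the
# number of severity codes.
params = {
    'objective': 'multi:softprob',
    'num_class': len(classes),
    'max_depth': 6,
    'eta': 0.3,
}
booster = xgb.train(params, dtrain, num_boost_round=50)

# Convert per-class probabilities back to hard predictions for scoring.
preds = np.argmax(booster.predict(dtest), axis=1)
print(f1_score(y_test_idx, preds, average='weighted'))
```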
This version of the model worked reasonably well compared to some of my past experiments, but the F1 score was still pretty low. Increasing the max depth and the number of boosting rounds increased the scores.
RESULTS
In a classification problem, your most important metric is usually your F1 score. I found that the random forest classifier performed best out of the gate with minimal tuning, with an F1 score of 0.889. This is far from what I would consider production-ready, and it certainly needs tuning, likely with GridSearchCV to automate the process, but this seems to be our winning model.
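For reference, a GridSearchCV pass over the random forest might look like this (the grid values are illustrative starting points, and a grid this size is already expensive on ~650k rows):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Small, illustrative grid; expand it once you know which knobs matter.
param_grid = {
    'n_estimators': [100, 300],
    'max_depth': [10, 20, None],
    'min_samples_leaf': [1, 5],
}
search = GridSearchCV(RandomForestClassifier(random_state=42),
                      param_grid, scoring='f1_weighted', cv=3, n_jobs=-1)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)
```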
DISCUSSION
· One-hot encoding ended up being the change that made my model actually work. Lesson learned: if you are doing a classification problem with categorical data, you NEED to encode it. My models performed significantly worse (0.3-0.4 F1 scores) before I one-hot encoded my categorical values. I would want to spend more time engineering the encoding here, to make the column labels more consistent.
· I found while training multiple models that under-sampled data performed significantly worse than over-sampled data. I would want to spend more time determining how best to balance the data. I experimented with SMOTE, but found scikit-learn's resample was simple enough to implement.
· You have to balance your training data or you will skew your model significantly. For example, here is a confusion matrix of a model that was trained on unbalanced data:
· The other huge lesson I learned while iterating on this model was regarding the train/test split and how it interacts with balancing the data. At first, I balanced my dataset and then split it into training and test sets. When I did that, I got deceptively good F1 scores, but it was an illusion. You need to split your data first, then balance your training set only. Your validation set needs to represent the real class proportions; otherwise duplicated rows leak from training into validation and your scores are inflated. I didn't catch this until I attempted to use data from a previous year to validate my results and got something like 30% accuracy. That was obviously a huge issue.
· Something I wanted to try was using principal component analysis (PCA) to reduce the dimensionality of my data. I had promising results early on, with high accuracy, but the F1 scores were always pretty rough.
CONCLUSION
· It’s possible to train a model to predict the severity of a crash.
· Since most of your data will be categorical in nature, you need to use one-hot encoding.
· It’s extremely important to balance your data, but only after splitting it into training and test sets.
· Random forest classifiers should perform the best, though XGBoost could rival it given enough time for hyperparameter tuning.