Applied Time Series Forecasts for Traffic Accidents, Injuries, and Fatalities in Lebanon
Image by Clark Van Der Beken from Unsplash

Introduction

Located on the eastern coast of the Mediterranean, Lebanon is the second smallest country in the Middle East and the Arab World. Its total surface area is around 10,452 km².

Road Traffic Accidents (RTA) are one of the top 10 leading causes of death in Lebanon. RTA is a significant but neglected global public health problem, requiring concerted efforts for effective and sustainable prevention. Road transport is the most complex and dangerous of all the systems people must handle daily.

This article aims to produce short- and medium-term forecasts of traffic accidents, injuries, and fatalities using a popular time series forecasting model called ARIMA (Autoregressive Integrated Moving Average). The forecasts are based on data for 2007-2018 obtained from the Lebanese Central Administration of Statistics, which collected the data from the Internal Security Forces.

Before introducing ARIMA to model the traffic accident, injury, and fatality data, I will briefly define a time series and its components.

Definition

A time series is a sequence of data points collected regularly over time. In other words, it is a set of observations ordered chronologically and taken at fixed intervals, such as hourly, daily, weekly, or monthly.

Time series can be used to track changes or patterns in a phenomenon over time, such as a company's stock prices, the number of website visitors per day, or the monthly sales figures of a product.

Time is often the independent variable in a time series, and the goal is usually to forecast the future.

Time series analysis can help identify trends, seasonality, and irregularity patterns in the data, which can then be used to make forecasts or informed decisions.

Trend

The trend is the long-term movement in a time series once calendar-related and irregular effects are set aside; it reflects the underlying level of the series. It shows the general tendency of the data to increase or decrease over a long period.

Seasonal

A seasonal pattern exists when a series is influenced by seasonal factors (e.g., the quarter of the year, the month, or the day of the week). Seasonality always has a fixed and known period.

Irregular

The irregular component (sometimes known as the residual) is what remains after the seasonal and trend components of a time series have been estimated and removed. It results from short-term fluctuations in the series, which are neither systematic nor predictable. In a highly irregular series, these fluctuations can dominate movements and mask the trend and seasonality. Along with trend and seasonal variations, a time series can have other variations, such as a cyclic pattern.
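As a rough illustration of how the three components can be separated, the sketch below applies a classical additive decomposition (series = trend + seasonal + irregular) to a small synthetic monthly series. The function name `decompose_additive`, the synthetic data, and the period of 12 are assumptions made for this example; they are not taken from the study, which was carried out in R.

```python
from statistics import mean


def decompose_additive(series, period=12):
    """Classical additive decomposition: series = trend + seasonal + irregular."""
    n = len(series)
    half = period // 2
    trend = [None] * n
    # A centered moving average estimates the trend; for an even period the
    # two end points of the window get half weight (a "2 x period" average).
    for t in range(half, n - half):
        window = series[t - half:t + half + 1]
        if period % 2 == 0:
            total = 0.5 * window[0] + sum(window[1:-1]) + 0.5 * window[-1]
        else:
            total = sum(window)
        trend[t] = total / period

    # Average the detrended values for each position in the cycle,
    # then center them so the seasonal effects sum to zero.
    detrended = [(t % period, series[t] - trend[t])
                 for t in range(n) if trend[t] is not None]
    raw = [mean(v for pos, v in detrended if pos == s) for s in range(period)]
    offset = mean(raw)
    seasonal = [r - offset for r in raw]

    # Whatever is left after removing trend and seasonality is the irregular part.
    irregular = [series[t] - trend[t] - seasonal[t % period]
                 if trend[t] is not None else None
                 for t in range(n)]
    return trend, seasonal, irregular


# Synthetic example (not real accident data): linear trend plus a fixed
# monthly pattern that sums to zero, with no noise.
pattern = [5, 3, 0, -2, -4, -6, -5, -3, 0, 2, 4, 6]
series = [50 + 0.5 * t + pattern[t % 12] for t in range(36)]
trend, seasonal, irregular = decompose_additive(series)
print([round(s, 2) for s in seasonal])
```

Because the synthetic series contains no noise, the recovered seasonal effects match the pattern that was built in and the irregular component is essentially zero. On real data the irregular component would carry the unexplained fluctuations.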

Visualizing the series

Fig 1. Traffic Accidents, Injuries, and Fatalities (2007-2018)

Looking closely at the traffic accident, injury, and fatality rates shown in Fig. 1, we can see a decrease in these rates from 2016 to 2017, attributed to traffic safety measures and law enforcement, followed by a noticeable upward trend from 2017 onward.

For 2018, the data report a total of 6,085 injuries and 500 deaths in 4,549 reported accidents. The highest figures were in 2014, with 6,463 injuries and 657 deaths in 4,907 reported accidents.

For the second smallest country in the Middle East and the Arab world, with a population of 6.85 million (2018), these figures are relatively large. Forecasting traffic accidents, injuries, and fatalities is therefore essential for safety planners: it provides a better understanding of accident trends and of the effectiveness of existing safety countermeasures, helps anticipate future needs, and provides benchmarks for the proper design and efficient operation of the transportation system.

Method Summary

I completed this study using the R language (version 3.6.1) and the Box-Jenkins method to develop the three ARIMA models.

This method systematically identifies, fits, checks, and uses ARIMA time series models. The technique is appropriate for time series of medium to long lengths (at least 50 observations). It is divided into three stages, as shown in Fig. 2 below.

Fig 2. Box-Jenkins methodology

The first step is to use the data and all related information to help select a sub-class of the model that may best summarize the data.

Then, use the data to train the model's parameters, evaluate the fitted model in the context of the available data, and check for areas where the model may be improved.

It is an iterative process so that as new information is gained during diagnostics, I can reevaluate stage 1 and incorporate that into new model classes.

After choosing a model and checking its fit and forecasting ability, I can use it to forecast over a future time horizon.

I will follow the same methodology to develop all three models. Still, only one model (the traffic fatality model) is worked through in detail below for illustration purposes. I will present all three models at the end of this article.

Basic Concept

ARIMA is widely used to analyze time series data and to forecast future values. ARIMA models are characterized by three essential values (p, d, q) and written as ARIMA (p, d, q), where p is the order of the autoregressive component, d is the number of differences needed to arrive at a stationary process, and q is the order of the moving average component.

The order q refers to the number of lagged forecast errors that should go into the ARIMA model. If the data contains a seasonal component, the notation is extended to ARIMA (p, d, q) (P, D, Q), where P is the seasonal autoregressive order, D is the seasonal difference order, and Q is the seasonal moving average order.

One way to measure how the observations within a time series are related is the Autocorrelation Function (ACF). Given the measurements $Y_1, Y_2, \dots, Y_n$, the autocorrelation coefficient at lag $k$ can be computed using the following formula, where $\bar{Y}$ is the sample mean:

$$r_k = \frac{\sum_{t=1}^{n-k} (Y_t - \bar{Y})(Y_{t+k} - \bar{Y})}{\sum_{t=1}^{n} (Y_t - \bar{Y})^2}$$

If $r_k$ equals zero, there is no autocorrelation at lag k in the available data. The Partial Autocorrelation Function (PACF) summarizes the relationship between an observation in a time series and observations at previous time steps, with the relationships of intervening observations removed. In other words, the partial autocorrelation at lag k is the correlation that remains after removing the effect of any correlations due to the terms at shorter lags.
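To make these two definitions concrete, here is a minimal Python sketch that computes the sample autocorrelation and, from it, the partial autocorrelation using the Durbin-Levinson recursion. The function names `acf` and `pacf` and the toy series are assumptions for illustration; in R, the built-in `acf()` and `pacf()` functions serve the same purpose.

```python
def acf(y, max_lag):
    """Sample autocorrelations r_0..r_max_lag of the series y."""
    n = len(y)
    y_bar = sum(y) / n
    denom = sum((v - y_bar) ** 2 for v in y)
    return [
        sum((y[t] - y_bar) * (y[t + k] - y_bar) for t in range(n - k)) / denom
        for k in range(max_lag + 1)
    ]


def pacf(y, max_lag):
    """Partial autocorrelations via the Durbin-Levinson recursion."""
    r = acf(y, max_lag)
    phi_prev = []   # coefficients of the AR(k-1) fit from the previous step
    out = [1.0]     # the lag-0 value is 1 by convention
    for k in range(1, max_lag + 1):
        num = r[k] - sum(phi_prev[j] * r[k - 1 - j] for j in range(k - 1))
        den = 1.0 - sum(phi_prev[j] * r[j + 1] for j in range(k - 1))
        phi_kk = num / den
        # Update the lower-order coefficients, then append the new one.
        phi_prev = [phi_prev[j] - phi_kk * phi_prev[k - 2 - j]
                    for j in range(k - 1)] + [phi_kk]
        out.append(phi_kk)
    return out


# Toy series (not real accident data).
toy = [1, 2, 3, 4, 5]
print(acf(toy, 2))
print(pacf(toy, 2))
```

For the toy series, the lag-1 autocorrelation is 0.4 and the lag-2 autocorrelation is -0.1. The lag-2 partial autocorrelation differs from the lag-2 autocorrelation because the lag-1 effect has been removed.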

An essential requirement of ARIMA models is that the series of interest is stationary, meaning it has a constant mean and variance over time. If a series is not stationary to begin with, stationarity can often be achieved by a process called differencing, represented by the d component of the model. To difference the series, we consider a new variable, $\Delta Y_t$, which is the change in $Y_t$:

$$\Delta Y_t = Y_t - Y_{t-1}$$
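Differencing is simple enough to show directly. The sketch below, with a hypothetical helper `difference` and made-up numbers, replaces each observation by its change from the previous one; applying it twice corresponds to d = 2, and a lag of 12 would give a seasonal difference for monthly data.

```python
def difference(y, lag=1):
    """Return the differenced series y[t] - y[t - lag]."""
    return [y[t] - y[t - lag] for t in range(lag, len(y))]


# Made-up series with an accelerating upward trend (not real accident data).
y = [10, 12, 15, 19, 24]
first = difference(y)        # d = 1
second = difference(first)   # d = 2
print(first)   # [2, 3, 4, 5]
print(second)  # [1, 1, 1]
```

Each round of differencing shortens the series by `lag` observations, which is worth keeping in mind with short series.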

Analysis

Fig. 3 shows monthly road fatalities in Lebanon from 2007 to 2018. The series is visibly non-stationary: its mean varies over time, which is evidence of a trend. A decrease of 7.5% in traffic fatalities is observed from 2010 to 2011, followed by an increase of 17.1% the following year. One can also notice another reduction in traffic fatalities from 2014 to the end of 2016, mainly attributed to the enforcement of traffic law and safety measures in Lebanon. Still, the numbers increased in 2017 and remained irregular until the end of the series.

Testing for stationarity and converting a series into a stationary one are among the most crucial steps in time series modeling. Looking at Fig. 3, it is visually clear that the data do not behave like a stationary series. This agrees with the ACF and PACF plots in Figures 4 and 5: the ACF plot shows positive correlation persisting at higher lags, which indicates that differencing is needed to make the series stationary. After differencing the data (d = 1), the augmented Dickey-Fuller test was run to check for stationarity. The time plot and the ACF and PACF plots of the differenced series show stationary behavior.

After checking the ACF and PACF plots of the first-differenced series, shown in Fig. 6 and Fig. 7 respectively, we identify the parameters p of the autoregressive (AR) process, q of the moving average (MA) process, P of the seasonal autoregressive process, and Q of the seasonal moving average process. The plots for the first-differenced data suggest p = 0,1,2; q = 0,1; P = 0,1; Q = 0,1,2.

I estimated and tested more than thirty model specifications separately in RStudio before selecting ARIMA (1, 1, 1) (1, 0, 1) as the most appropriate model for the traffic fatality data.
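The size of that search is easy to see by enumerating the candidate orders. The sketch below assumes the ranges are read as p in {0, 1, 2}, q in {0, 1}, P in {0, 1}, and Q in {0, 1, 2}, and, purely for illustration, holds d = 1 and D = 0 fixed at the values of the selected model. It only lists the specifications; fitting them is what the author did in R.

```python
from itertools import product

# Candidate orders suggested by the ACF/PACF plots (ranges are an assumption
# about how the four lists map onto p, q, P, Q).
p_values = (0, 1, 2)
q_values = (0, 1)
P_values = (0, 1)
Q_values = (0, 1, 2)
d, D = 1, 0  # assumption: held fixed at the selected model's values

candidates = [
    ((p, d, q), (P, D, Q))
    for p, q, P, Q in product(p_values, q_values, P_values, Q_values)
]

print(len(candidates))  # 36
print(((1, 1, 1), (1, 0, 1)) in candidates)  # True
```

The grid contains 36 specifications, which is consistent with "more than thirty" models being tried.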

The selected model, ARIMA (1, 1, 1) (1, 0, 1), can be written in backshift form.
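As a sketch of what the backshift form looks like for these orders, assume a seasonal period of 12 (the data are monthly) and the sign convention in which moving-average terms enter with a plus sign, as in R's `arima()`. The author's own notation and sign convention may differ. Here $B$ is the backshift operator ($B Y_t = Y_{t-1}$), $\phi_1$ and $\Phi_1$ are the non-seasonal and seasonal autoregressive coefficients, $\theta_1$ and $\Theta_1$ are the non-seasonal and seasonal moving-average coefficients, and $\varepsilon_t$ is white noise.

```latex
(1 - \phi_1 B)\,(1 - \Phi_1 B^{12})\,(1 - B)\,Y_t
  = (1 + \theta_1 B)\,(1 + \Theta_1 B^{12})\,\varepsilon_t
```

The factor $(1 - B)$ is the single non-seasonal difference (d = 1); there is no seasonal difference factor because D = 0.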

A few of the candidate ARIMA models showed a slightly better fit. Several criteria exist for comparing fit quality across multiple models; the Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC) are the most widely used. These criteria are closely related and can be interpreted as estimates of how much information would be lost if a given model were chosen. When comparing models, one wants to minimize AIC and BIC.

Table 1 and Table 2 summarize the selected model and the model suggested by the auto.arima() function in R, giving the estimate of each coefficient, the standard errors of the parameters, and the p-value that determines whether each coefficient is statistically significant. Below is a description of the significance codes with their percentage levels. If the p-value is less than 5%, the coefficient is considered statistically significant at that level.
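Both criteria are simple functions of a fitted model's maximized log-likelihood, so they can be sketched in a few lines. In the example below, the log-likelihoods and parameter counts are hypothetical numbers chosen for illustration, not results from this study, and `n_obs = 144` assumes twelve years of monthly observations.

```python
import math


def aic(log_likelihood, n_params):
    """Akaike Information Criterion: 2k - 2 ln(L)."""
    return 2 * n_params - 2 * log_likelihood


def bic(log_likelihood, n_params, n_obs):
    """Bayesian Information Criterion: k ln(n) - 2 ln(L)."""
    return n_params * math.log(n_obs) - 2 * log_likelihood


# Hypothetical fits: (log-likelihood, number of estimated parameters).
fits = {"model_a": (-520.4, 5), "model_b": (-522.1, 3)}
n_obs = 144  # assumption: 12 years x 12 months

for name, (loglik, k) in fits.items():
    print(name, round(aic(loglik, k), 1), round(bic(loglik, k, n_obs), 1))

best_by_bic = min(fits, key=lambda m: bic(*fits[m], n_obs))
print(best_by_bic)
```

BIC penalizes each extra parameter by ln(n) rather than 2, so with more than about eight observations it favors smaller models more strongly than AIC does.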

Diagnostic Check

The most appropriate model should not only deliver accurate forecasts; its residuals must also show no sign of autocorrelation. Diagnosing the model requires plotting the residuals over time to check for systematic trends, plotting the ACF and PACF of the residuals, and running normality and randomness tests.

The randomness of the residuals can be examined through the time plot of residuals, as shown in Fig. 8. The Ljung-Box test is a statistical test of whether any of a group of autocorrelations of a time series is different from zero. The results of the Ljung-Box test are shown in Table 3.

Fig 8. Plot of ARIMA (1,1,1) (1,0,1) residuals

The figure above shows the diagnostics of the residuals from the model selected, ARIMA (1,1,1) (1,0,1). First, the plot of the standardized residuals (Fig. 8) shows no obvious trend.

Table 3 reports the Ljung-Box test results, which are consistent with residuals that are independent and identically distributed.

The ACF plot of residuals (Fig. 9) shows no evidence of a significant correlation in the residuals.
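For readers who want to see what the Ljung-Box test computes, the statistic is Q = n(n + 2) Σ r_k² / (n − k), summed over lags k = 1..h, where r_k is the lag-k autocorrelation of the residuals. The sketch below implements only the statistic, with a hypothetical function name and toy input. Turning Q into a p-value requires a chi-square distribution, which the Python standard library does not provide; in R, `Box.test(..., type = "Ljung-Box")` reports both.

```python
def ljung_box_q(residuals, max_lag):
    """Ljung-Box Q statistic over lags 1..max_lag."""
    n = len(residuals)
    mean_res = sum(residuals) / n
    denom = sum((e - mean_res) ** 2 for e in residuals)
    total = 0.0
    for k in range(1, max_lag + 1):
        # Lag-k sample autocorrelation of the residuals.
        r_k = sum((residuals[t] - mean_res) * (residuals[t + k] - mean_res)
                  for t in range(n - k)) / denom
        total += r_k ** 2 / (n - k)
    return n * (n + 2) * total


# Toy input (not the model's actual residuals).
q_stat = ljung_box_q([1, 2, 3, 4, 5], max_lag=2)
print(round(q_stat, 4))
```

A small Q relative to the chi-square reference value means the residual autocorrelations are jointly indistinguishable from zero, which is the desired outcome for a well-specified model.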

Finally, normality can be assessed by plotting a histogram of the residuals, plotting the Q-Q plot of the residuals, and conducting formal normality tests (the Shapiro-Wilk test and the Jarque-Bera test).

Fig. 8 displays a bell shape in the histogram of the residuals, Fig. 10 illustrates the linearity of the normal-scores plot, and Table 3 reports the results of the Shapiro-Wilk and Jarque-Bera tests. All these observations and tests indicate that the residuals are normally distributed.
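Of the two normality tests, the Jarque-Bera statistic is the easier one to write out: JB = (n / 6) · (S² + (K − 3)² / 4), where S is the sample skewness and K the sample kurtosis of the residuals. The sketch below uses a hypothetical function name and toy input rather than the model's residuals.

```python
def jarque_bera(x):
    """Jarque-Bera statistic based on sample skewness and kurtosis."""
    n = len(x)
    m = sum(x) / n
    # Central moments, each divided by n.
    m2 = sum((v - m) ** 2 for v in x) / n
    m3 = sum((v - m) ** 3 for v in x) / n
    m4 = sum((v - m) ** 4 for v in x) / n
    skewness = m3 / m2 ** 1.5
    kurtosis = m4 / m2 ** 2
    return n / 6 * (skewness ** 2 + (kurtosis - 3) ** 2 / 4)


# Toy input (not the model's actual residuals).
jb = jarque_bera([-2, -1, 0, 1, 2])
print(round(jb, 4))
```

Under normality the statistic approximately follows a chi-square distribution with 2 degrees of freedom, so values below about 5.99 do not reject normality at the 5% level. The approximation is poor for very small samples like the toy input.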

Conclusion

This study emphasizes the importance of time series analysis in forecasting traffic accidents, injuries, and fatalities. Using the Box-Jenkins methodology, three models were derived for forecasting purposes: ARIMA (1,1,1) (1,0,1) for the number of road traffic fatalities, ARIMA (1,1,1) (1,1,1) for the number of road accidents, and ARIMA (1,0,2) (2,1,0) for the number of traffic injuries.

In scientific research, time series of road accidents are valuable data for road safety analysis. One of the purposes of modeling time series is to predict future values, which is of interest to safety planners. In terms of visualization, these series can be used to display and describe events related to road safety and the evolution of accidents or road victims over time. This technique lets traffic analysts anticipate the future trend in traffic accidents, injuries, and fatalities. Therefore, time series analysis of accidents is essential in the decision-making process to improve road safety.

In this study, we can conclude from the descriptive analysis of the road accident data in Lebanon over 12 years, from 2007 to the end of 2018, that accidents show a slight decrease, although the series is generally characterized by an irregular evolution. After applying the Box-Jenkins methodology, ARIMA (1,1,1) (1,0,1) is identified as the most appropriate model for the traffic fatality data. Moreover, the forecast produced by this model shows no sign of a decrease in the number of accidents, injuries, and deaths in Lebanon for the next three years. This result draws attention to the importance of applying time series analysis to get a clear view of the actual level of road safety and to improve prevention strategy based on scientific studies and research.

The predictions show a slight increase in traffic accidents, injuries, and fatalities. Given the global pandemic in 2020, one could assume that the actual numbers were lower than predicted. This assumption must be verified against the latest data, which are unfortunately unavailable or hard to retrieve. Also, this is a univariate time series analysis, meaning that the series has a single time-dependent variable. It is the most commonly used forecasting approach.
