Time-series regression with sktime
Chidi Akurunwa
AI Technology Leader at SCSK & Sumitomo Europe | ex-CGI | AI Strategist | Helping business leaders and organisations transform AI concepts into business solutions
One of the things that got me so interested in data science was the idea that the future could be predicted by machines programmed with data! This was just so fascinating to me and, of course, as I began learning data science, it made sense that this was possible. Why? Because in data science, whenever there is a sensible pattern in the data, it is possible to use it to determine outcomes. Time-series regression is an important application of machine learning and is widely used in many industries. The ability to take observations from historical data and train a machine on them, so that the machine can predict future observations, is what makes time-series regression fun, but also useful!
If you have read any of my previous articles, you will know about my deep passion for using AI to analyse core economic and social issues in Africa. There is perhaps no more noble use of digital technology than when it is used to advocate for change. This time, however, I decided to analyse an issue that the entire world has been dealing with in the past couple of months: the COVID-19 virus. If you have ever worked with a COVID-19 dataset, you will most likely have had to work with time-series data. In the past, data scientists have had to use various encoding methods or recurrent neural networks to work with time-series data. Neither approach is terrible, but one-hot encoding time-series data can be computationally expensive, particularly when your dataset is large, and neural networks are like black boxes (their results can be difficult to interpret).
This is where sktime comes in and saves the day. It is a recently released Python machine learning toolbox for time series with a unified interface for multiple learning tasks. It currently supports forecasting, time-series regression and classification. While it comes with algorithms dedicated to time-series forecasting, it is also compatible with scikit-learn models. So let’s see what this bad boy can do! The dataset used for this project is the daily reported cases of COVID-19 in Nigeria from February 28th to July 20th and it can be downloaded here. Shall we?
After some minor data cleaning, our dataset is ready to use and we can install sktime with pip install sktime. We will be using the ReducedRegressionForecaster, which is based on tabular regression. Reduction simply means turning the task of time-series forecasting (extrapolation) into simpler tabular regression (interpolation) tasks: a sliding window moves over the series, each window of past values becomes a row of features with the value that follows it as the target, and the solutions to these regression tasks are combined into a solution for the original problem. Before we start building our model, we need to split the data into training and testing sets and then set the forecasting horizon, which is the period you are forecasting for. In our case, the forecasting horizon will be the period covered by the testing data.
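To make the sliding-window idea concrete, here is a minimal sketch in plain Python. It is not sktime's implementation: the names make_windows, WindowMeanRegressor and recursive_forecast are made up for illustration, the stand-in regressor simply averages each window, and the recursive strategy (feeding each prediction back into the window) is one common way of combining the regression solutions, assumed here.

```python
def make_windows(series, window_length):
    """Turn a series into tabular (X, y): each window of past values predicts the next one."""
    X, y = [], []
    for start in range(len(series) - window_length):
        X.append(series[start:start + window_length])
        y.append(series[start + window_length])
    return X, y


class WindowMeanRegressor:
    """Hypothetical stand-in regressor: predicts the mean of each row.

    Any object with scikit-learn-style fit(X, y) and predict(X) could be used instead.
    """

    def fit(self, X, y):
        return self  # nothing to learn for this toy model

    def predict(self, X):
        return [sum(row) / len(row) for row in X]


def recursive_forecast(regressor, series, window_length, n_steps):
    """Forecast n_steps ahead, feeding each prediction back into the window."""
    X, y = make_windows(series, window_length)
    regressor.fit(X, y)
    window = list(series[-window_length:])
    forecasts = []
    for _ in range(n_steps):
        next_value = regressor.predict([window])[0]
        forecasts.append(next_value)
        window = window[1:] + [next_value]  # slide the window forward by one step
    return forecasts


toy_series = [10, 20, 30, 40, 50]
X, y = make_windows(toy_series, window_length=2)
# X is [[10, 20], [20, 30], [30, 40]] and y is [30, 40, 50]
forecasts = recursive_forecast(WindowMeanRegressor(), toy_series, window_length=2, n_steps=3)
```

Swapping the toy regressor for a real tabular model is the whole point of reduction: the forecasting code does not change, only the object that is fitted on the windows.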
The temporal_train_test_split method splits the data into training and testing sets without the danger of data leakage, because the test set always comes after the training set in time, and it does so in one line of code! Pretty good compared to what you would have had to do otherwise. With its default settings, the method gives us a test set of 36 points, with each point in time representing a day in our case, specifically from June 15th to July 20th. In other words, we will be attempting to forecast for this period. The forecasting horizon is therefore defined as an array of integers from 1 to 36, using the size of the testing data as a reference. As mentioned earlier, sktime is compatible with scikit-learn models, so all we need to do is define a standard RandomForestRegressor model and pass it to the ReducedRegressionForecaster. The forecaster then fits the training data using a sliding window of a specified length. Once fitted, the model can predict over the forecasting horizon.
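As a rough illustration of what a temporal split and a forecasting horizon look like, here is a sketch in plain Python. It does not call sktime: the temporal_split helper and the placeholder series are made up for illustration, and the 25% test fraction is an assumption chosen because it yields 36 test points for the 144 days between February 28th and July 20th.

```python
import math


def temporal_split(series, test_fraction=0.25):
    """Hold out the last part of an ordered series; the order is never shuffled."""
    n_test = math.ceil(len(series) * test_fraction)
    return series[:-n_test], series[-n_test:]


# Placeholder values standing in for 144 daily case counts (not real data).
series = list(range(144))
y_train, y_test = temporal_split(series)

# Forecasting horizon: steps ahead, counted from the end of the training data.
fh = list(range(1, len(y_test) + 1))

print(len(y_train), len(y_test), fh[0], fh[-1])
```

Because the test points are simply the last 36 days, nothing from the future can leak into training, which is the property that matters when splitting time-series data.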
As always, you can’t read this far without receiving a reward for your efforts, and I know just what you need! A good Python joke!
Now that you have enjoyed that, let’s crack on! You probably noticed smape_loss in the piece of code earlier. sMAPE stands for symmetric mean absolute percentage error. It is used to quantify the accuracy of our forecasts: the lower it is, the higher the accuracy. With that in mind, you can see that our forecasts are good, and thus our model is performing well. Let us visualise our training data, testing data and predictions to get a different picture of our model’s performance:
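For readers curious about what sits behind that number, here is a small sketch of one common definition of sMAPE in plain Python. Several variants of the metric exist, so treat the scaling (a range of 0 to 2) and the handling of zero denominators as assumptions rather than a description of sktime's smape_loss; the function name smape is made up for illustration.

```python
def smape(y_true, y_pred):
    """Mean of |actual - predicted| / ((|actual| + |predicted|) / 2); ranges from 0 to 2."""
    terms = []
    for actual, predicted in zip(y_true, y_pred):
        denominator = (abs(actual) + abs(predicted)) / 2
        if denominator == 0:
            # Both values are zero: count this as a perfect forecast (assumed convention).
            terms.append(0.0)
        else:
            terms.append(abs(actual - predicted) / denominator)
    return sum(terms) / len(terms)


perfect = smape([100, 200], [100, 200])   # identical forecasts give 0.0
off = smape([100], [300])                 # |100 - 300| / ((100 + 300) / 2) = 1.0
```

Because the error is divided by the average size of the actual and predicted values, a miss of 50 cases counts for more on a quiet day than on a busy one, which is useful when daily counts vary a lot.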
We see that our model stays fairly accurate by predicting the average number of COVID-19 cases over the forecasting horizon. So there you go! With just a few lines of code, sktime can make time-series data a joy to work with! You can learn more about the sktime toolbox here: https://sktime.org/index.html.
With regard to the COVID-19 situation in Nigeria, it is clear that once testing capability ramped up in the second month of the pandemic, the daily number of cases went up. It is imperative that the government have an economic safety net in place and I will talk about this more in my next article.
If you read this far, you are amazing! Please like, share, comment and check out my code for this project here: https://www.kaggle.com/chizzi25/data-science-project-3-covid-19-forecasting-ng?scriptVersionId=39772568#Data-cleaning-and-exploration.
You can also follow me on Medium and connect with me on LinkedIn. Thank you for reading! :)