
Time-series regression with sktime

Chidi Akurunwa

AI Technology Leader at SCSK & Sumitomo Europe | ex-CGI | AI Strategist | Helping business leaders and organisations transform AI concepts into business solutions

Published: July 29, 2020

One of the things that got me so interested in data science was the idea that the future could be predicted by machines programmed with data! This was fascinating to me and, of course, as I began learning data science, it made sense that this was possible. Why? Because in data science, whenever there is a sensible pattern in the data, it can be used to predict outcomes. Time-series regression is an important application of machine learning and is widely used in many industries. The ability to train a machine on observations from historical data, so that it can predict future observations, is what makes time-series regression fun, but also useful!

If you have read any of my previous articles, you will know about my deep passion for using AI to analyse core economic and social issues in Africa. There is perhaps no more noble use of digital technology than advocating for change. This time, however, I decided to analyse an issue that the entire world has been dealing with over the past couple of months: the COVID-19 virus. If you have ever worked with a COVID-19 dataset, you will most likely have had to work with time-series data. In the past, data scientists have had to use different encoding methods or recurrent neural networks to work with time-series data. Neither approach is terrible, but one-hot encoding time-series data can be computationally expensive, particularly when your dataset is large, and neural networks are like black boxes (their results can be difficult to interpret).

This is where sktime comes in and saves the day. sktime is a recently released Python machine learning toolbox for time series with a unified interface for multiple learning tasks. It currently supports forecasting, time-series regression and classification. While it comes with algorithms dedicated to time-series forecasting, it is also compatible with scikit-learn models. So let’s see what this bad boy can do! The dataset used for this project is the daily reported cases of COVID-19 in Nigeria from February 28th to July 20th, 2020, and it can be downloaded here. Shall we?

After some minor data cleaning, our dataset is ready to use and we can install sktime with the pip install command. We will be using the ReducedRegressionForecaster, which is based on tabular regression. Reduction here means breaking the task of time-series forecasting (extrapolation) down into simpler tabular regression (interpolation) tasks built with a sliding window, then combining the solutions to those tasks into a solution for the original problem. Before we start building our model, we need to split the data into training and testing data, then set the forecasting horizon, which is the period you are forecasting for. In our case, the forecasting horizon will be the period covered by the testing data.
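To make the sliding-window idea more concrete, here is a minimal sketch in plain Python of how a series can be turned into a table for regression. The helper name sliding_window_table, the window length of 3 and the sample numbers are all made up for illustration; this is not sktime's internal code, only the general idea behind the reduction.

```python
def sliding_window_table(series, window_length):
    """Turn a 1-D series into (X, y) rows for tabular regression.

    Each row of X holds `window_length` consecutive observations and the
    matching entry of y is the observation that comes right after them.
    """
    X, y = [], []
    for start in range(len(series) - window_length):
        X.append(list(series[start:start + window_length]))
        y.append(series[start + window_length])
    return X, y


# Made-up daily counts, purely illustrative (not the Nigeria dataset).
daily_cases = [1, 0, 2, 3, 5, 8, 13]
X, y = sliding_window_table(daily_cases, window_length=3)

for features, target in zip(X, y):
    print(features, "->", target)
```

Any ordinary regressor can then be trained on X and y, which is what makes it possible to plug a tabular model into a forecasting task.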

The temporal_train_test_split function splits the series into training and testing data without the danger of data leakage, in one line of code! Pretty good compared to what you would have had to do otherwise. By default, the function holds out the last 25% of the series as test data, which here is 36 points, with each point in time representing a day, specifically June 15th to July 20th. In other words, we will be attempting to forecast for this period. The forecasting horizon is therefore defined as an array of integers from 1 to 36, using the size of the testing data as a reference. As mentioned earlier, sktime is compatible with scikit-learn models, so all we need to do is define a standard RandomForestRegressor model and pass it to the ReducedRegressionForecaster. This will then use a specific sliding window length and fit the training data. Once fitted, the model can predict using the forecasting horizon.
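The split, the forecasting horizon and the predict step can be sketched in plain Python as follows. This is only an illustration under stated assumptions: temporal_split, recursive_forecast and mean_of_window are hypothetical helpers, not sktime functions; the series is a placeholder of 144 values standing in for the daily counts; the window length of 10 is arbitrary; and a simple window average stands in for the fitted random forest. The recursive strategy shown (feeding each prediction back into the window) is one common way to forecast several steps ahead with a one-step model.

```python
import math


def temporal_split(series, test_fraction=0.25):
    """Hold out the final share of the series, keeping time order intact."""
    n_test = math.ceil(len(series) * test_fraction)
    return series[:-n_test], series[-n_test:]


def recursive_forecast(history, predict_one_step, window_length, n_steps):
    """Forecast n_steps ahead by feeding each prediction back into the window."""
    window = list(history[-window_length:])
    forecasts = []
    for _ in range(n_steps):
        next_value = predict_one_step(window)
        forecasts.append(next_value)
        # Slide the window forward by one step, including the new prediction.
        window = window[1:] + [next_value]
    return forecasts


def mean_of_window(window):
    """Stand-in for a fitted regressor: predicts the average of the window."""
    return sum(window) / len(window)


# Placeholder series: 144 daily observations (not the real case counts).
series = [float(day) for day in range(144)]

y_train, y_test = temporal_split(series)

# Forecasting horizon: steps 1..36 ahead of the end of the training data.
fh = list(range(1, len(y_test) + 1))

y_pred = recursive_forecast(
    y_train, mean_of_window, window_length=10, n_steps=len(fh)
)

print(len(y_train), len(y_test), fh[0], fh[-1], len(y_pred))
```

Because the test data is always the most recent slice, nothing from the forecast period can leak into training.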

As always, you can’t read this far without receiving a reward for your efforts, and I know just what you need! A good Python joke!

Now that you have enjoyed that, let’s crack on! You probably noticed smape_loss in the piece of code earlier. This stands for symmetric mean absolute percentage error. It quantifies the accuracy of our forecasts: the lower it is, the higher the accuracy. With that in mind, you can see that our forecasts are good, and thus our model is performing well. Let us visualise our training data, testing data and predictions to get a different picture of our model’s performance:
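For readers who want to see what the metric computes, here is a minimal sketch of one common definition of symmetric mean absolute percentage error. The function name smape and the sample numbers are made up for illustration, and scaling conventions differ between libraries, so treat this as an assumption rather than the exact formula behind smape_loss.

```python
def smape(actual, predicted):
    """Symmetric mean absolute percentage error (one common definition).

    Averages 2 * |a - p| / (|a| + |p|) over all points, so the result lies
    between 0 (perfect forecast) and 2. Assumes no pair of actual and
    predicted values is zero at the same time.
    """
    terms = [
        abs(a - p) / (abs(a) + abs(p))
        for a, p in zip(actual, predicted)
    ]
    return 2.0 * sum(terms) / len(terms)


# Made-up values, purely illustrative.
actual = [100, 200]
predicted = [110, 180]
print(smape(actual, predicted))
```

The measure is symmetric in the sense that swapping the actual and predicted values gives the same score.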

We see that our model stays fairly accurate by predicting the average number of COVID-19 cases over the forecasting horizon. So there you go! With just a few lines of code, sktime can make time-series data a joy to work with! You can learn more about the sktime toolbox here: https://sktime.org/index.html.

With regard to the COVID-19 situation in Nigeria, it is clear that once testing capability ramped up in the second month of the pandemic, the daily number of cases went up. It is imperative that the government have an economic safety net in place, and I will talk about this more in my next article.

If you read this far, you are amazing! Please like, share, comment and check out my code for this project here: https://www.kaggle.com/chizzi25/data-science-project-3-covid-19-forecasting-ng?scriptVersionId=39772568#Data-cleaning-and-exploration.

You can also follow me on Medium and connect with me on LinkedIn. Thank you for reading! :)

