Lessons Learned from My Energy Demand Forecasting Project
I recently concluded my biggest data science portfolio project. The goal of this project was to develop forecasting models for energy demand in Spain. I found this dataset on Kaggle with four years of hourly energy data: https://www.kaggle.com/datasets/nicholasjhana/energy-consumption-generation-prices-and-weather
As part of this project, I made a quasi-weekly series of posts on LinkedIn documenting my progress. Whenever I do these projects, I like to finish by writing a summary article. This demand forecasting project is my favorite one I've done over the years. Demand forecasting (in general, not necessarily for electricity) is the area of data science I hope to work in over the long term. I also think I achieved the most through this project in terms of the skills I used and the lessons I learned. In this final article, I want to discuss the challenges I faced, the lessons I learned, the ultimate outcome of the project, and the resources I used to put it all together.
The Challenges
From the beginning, I knew that seasonality would be a major factor in developing an accurate forecast. It's fairly intuitive that energy demand will be higher at midday than in the middle of the night when everyone is asleep. The challenge is that this data has multiple kinds of seasonality. Demand is likely to be higher during certain hours of the day, but it also varies by day of the week and by month of the year. I initially did not have a good solution for this problem. The forecasting techniques I started with could only account for one type of seasonality. They could account for hour of the day, day of the week, or month of the year, but not all three at the same time.
To demonstrate that all three kinds of seasonality are at play, I created these three box plots to show the variation in demand over each time interval.
Looking at these three box plots, it looks like the hourly seasonal effect is strongest. Without a way to account for all three kinds of seasonality, I thought that the best I could do at the beginning was to develop a forecast that just used the hourly seasonality, and perhaps also include an adjustment for weekends.
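To make that starting point concrete, here is a minimal sketch of an hour-of-day baseline with a single weekend adjustment. It is written in Python with NumPy rather than the R I used for the project, it runs on made-up synthetic data instead of the Kaggle dataset, and names like `weekday_profile`, `weekend_factor`, and `baseline_forecast` are my own illustrative choices, not functions from any library.

```python
import numpy as np

rng = np.random.default_rng(0)

# Four weeks of synthetic hourly data (illustrative values only).
n_hours = 24 * 7 * 4
t = np.arange(n_hours)
hour_of_day = t % 24
is_weekend = (t // 24) % 7 >= 5  # treat days 5 and 6 of each week as the weekend

# Assumed daily shape peaking mid-afternoon, with 10% lower demand on weekends.
daily_shape = 28000 + 4000 * np.sin(2 * np.pi * (hour_of_day - 8) / 24)
demand = daily_shape * np.where(is_weekend, 0.9, 1.0) + rng.normal(0, 300, n_hours)

# Average weekday demand for each hour of the day.
weekday_profile = np.array(
    [demand[(hour_of_day == h) & ~is_weekend].mean() for h in range(24)]
)

# One multiplicative adjustment for weekends.
weekend_factor = demand[is_weekend].mean() / demand[~is_weekend].mean()


def baseline_forecast(hour, weekend):
    """Forecast demand for one hour from the hourly profile and weekend factor."""
    scale = weekend_factor if weekend else 1.0
    return weekday_profile[hour] * scale


print(f"Weekend factor: {weekend_factor:.3f}")
print(f"Weekday 14:00 forecast: {baseline_forecast(14, False):.0f}")
print(f"Weekend 14:00 forecast: {baseline_forecast(14, True):.0f}")
```

A baseline like this ignores the month of the year entirely, which is exactly the limitation described above.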
The other major challenge was how to present this project, especially since I had committed to making a weekly series of posts on it. Figuring out the right amount of detail to share with a larger audience was always tricky. My more technical posts generally didn't do as well as posts that just cut to the chase. That's something I think I should have anticipated, and the experience lines up well with what an actual job would look like. Stakeholders are ultimately much more interested in the key idea and results than in all the technical details.
The last major challenge with this project was more personal. I became discouraged about the project after a while due to a lack of progress. After I settled on an ARIMA model for forecasting demand, I struggled to improve on it. Other more complex models tended to either have worse results, take much longer to run, or both. It became really hard to keep up the weekly posts too once it felt like I had hit a wall.
Lessons Learned
This project was an incredible learning experience above all else. I got to learn a bunch of forecasting techniques in R. Putting these techniques into practice on this kind of open-ended problem is very different from just following a textbook. It required me to find my own solutions to issues and figure out the best course of action for myself rather than just follow someone else's instructions. This project also helped me learn what works and doesn't work when presenting intermediate steps on LinkedIn, as I discussed earlier. But the biggest lesson from this project came when I finally made a breakthrough that significantly improved my forecasts.
Domain Expertise is Critical
I found my best forecasting model by implementing a method from an academic journal article that discussed established methods for forecasting electricity demand. The article's model was a significant improvement over any of the forecasts I had done previously. It should go without saying that I'm not at all an expert in this industry. Therefore, it was necessary for me to find out what the experts had already been doing. Doing so gave much better results than trying to develop increasingly complex and slower models. As someone who hopes to go into demand forecasting, I am probably not going to be a domain expert in whichever subject area I go into, so speaking with other data scientists and with people who know the subject well will be far more productive than building increasingly complex and unexplainable models. I have seen this kind of sentiment before on LinkedIn, and it reflects my experience in my current position. Demand planners and forecasters should probably spend most of their time speaking with other experts and stakeholders to better understand the industry's conventions and what forecasting techniques have already been shown to work.
The Impact of the Project
Even though I wrote earlier that audience members and stakeholders generally are not receptive to the technical details of a project, I still want to discuss how my final forecast model works. I think it's important for me and any key stakeholders to understand why the model I come up with does what it does rather than treat it as a black box. This model is a regression of hourly demand on the corresponding hourly temperature. The regression is rather unusual. Instead of fitting demand to a straight line, it fits demand to a cubic function of temperature, because electricity demand is not linearly related to temperature. Demand instead increases at the extremes. Both colder and warmer temperatures lead to higher demand. A cubic function helps account for this nonlinear relationship. Additionally, the regression includes terms for hour of the day, day of the week, and month of the year, which addresses the challenge I introduced earlier of multiple kinds of seasonality.
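For readers who want to see the shape of such a model, here is a rough sketch of a cubic temperature regression with hour, weekday, and month indicator terms. It is not my actual R code or the exact specification from the journal article. It is a Python/NumPy illustration fitted by ordinary least squares on synthetic data, and every number and name in it (for example `dummies` and `fit_rmse`) is an assumption made for the example.

```python
import numpy as np

rng = np.random.default_rng(42)

# Two years of synthetic hourly timestamps; all numbers below are made up.
n = 24 * 730
stamps = np.datetime64("2015-01-01T00") + np.arange(n).astype("timedelta64[h]")
hour = stamps.astype("int64") % 24                                   # 0..23
weekday = (stamps.astype("datetime64[D]").astype("int64") + 3) % 7   # Monday = 0
month = stamps.astype("datetime64[M]").astype("int64") % 12          # January = 0

# Assumed temperature pattern (deg C) and a U-shaped demand response to it.
day_of_year = (np.arange(n) / 24) % 365
temp = (16 - 10 * np.cos(2 * np.pi * day_of_year / 365)
        + 4 * np.sin(2 * np.pi * (hour - 9) / 24)
        + rng.normal(0, 2, n))
demand = (24000 + 25 * (temp - 18) ** 2
          + 3000 * np.sin(2 * np.pi * (hour - 8) / 24)
          - 1500 * (weekday >= 5)
          + rng.normal(0, 400, n))


def dummies(values, n_levels):
    """One 0/1 column per level, dropping level 0 as the reference level."""
    return (values[:, None] == np.arange(1, n_levels)).astype(float)


def fit_rmse(X, y):
    """Ordinary least squares fit; returns coefficients and in-sample RMSE."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef, np.sqrt(np.mean((y - X @ coef) ** 2))


# Standardize temperature so its powers stay on a comparable scale.
z = (temp - temp.mean()) / temp.std()
seasonal = np.column_stack([dummies(hour, 24), dummies(weekday, 7), dummies(month, 12)])

X_linear = np.column_stack([np.ones(n), z, seasonal])
X_cubic = np.column_stack([np.ones(n), z, z ** 2, z ** 3, seasonal])

_, rmse_linear = fit_rmse(X_linear, demand)
coef, rmse_cubic = fit_rmse(X_cubic, demand)

print(f"In-sample RMSE, linear temperature term: {rmse_linear:.0f}")
print(f"In-sample RMSE, cubic temperature terms: {rmse_cubic:.0f}")
```

Because the synthetic demand was built with a U-shaped temperature response, the cubic version fits it more closely than the straight-line version. The comparison only illustrates the idea and says nothing about the accuracy I measured on the real data.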
As I show in this table below, the regression model is much more accurate than any other model I tried. Additionally, the model feels very intuitive. It describes very well how we should expect electricity demand to behave in response to hotter and colder temperatures and at different time intervals.
Is this the best possible model for this data? Absolutely not. There is still a lot of room for improvement, and I might come back to this project at a later date if I come up with new ideas. As I discussed in a LinkedIn post, this project was meant as a learning exercise. It was not meant to develop some revolutionary forecast that's better than what industry experts are already doing.
There is one issue with the model I want to discuss in more detail. The model relies on temperatures measured during the hour I am trying to forecast, but in a real-world setting, I'm not going to have access to that information. Imagine trying to forecast demand for February 11th at midnight, which hasn't happened yet at the time I am writing this article. We are not going to know the exact temperature beforehand. Therefore, in order to put this model into practice, I think it needs to be combined with a separate weather forecasting model, which is way beyond the scope of this project. The good news on this front, however, is that weather forecasts have become much more accurate over time. This article from Our World in Data goes into more detail. Since weather forecasts have inherent uncertainty and error, I expect that this combined model would be less accurate than what I showed in my table above.
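A small simulation can illustrate why I expect that loss of accuracy. This is a Python/NumPy sketch on synthetic data, not a result from my project. The demand curve, the 2 degree forecast error, and the use of a temperature-only cubic fit as a stand-in for the full regression are all assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(7)

# Synthetic hourly temperatures (deg C) and a U-shaped demand response (made up).
temp_actual = rng.uniform(0, 35, 5000)
demand = 24000 + 25 * (temp_actual - 18) ** 2 + rng.normal(0, 400, temp_actual.size)

# Cubic fit of demand on temperature, standing in for the full regression.
coefs = np.polyfit(temp_actual, demand, deg=3)


def mape(actual, predicted):
    """Mean absolute percentage error, in percent."""
    return np.mean(np.abs((actual - predicted) / actual)) * 100


# Hypothetical weather forecast: the actual temperature plus random error
# with a standard deviation of 2 deg C.
temp_forecast = temp_actual + rng.normal(0, 2.0, temp_actual.size)

mape_known_temp = mape(demand, np.polyval(coefs, temp_actual))
mape_forecast_temp = mape(demand, np.polyval(coefs, temp_forecast))

print(f"MAPE with actual temperatures:   {mape_known_temp:.2f}%")
print(f"MAPE with forecast temperatures: {mape_forecast_temp:.2f}%")
```

In this toy setup the error grows once the model is fed imperfect temperatures, and it grows most at the temperature extremes where the demand curve is steepest.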
My Resources for This Project
To wrap up, I wanted to list the books and journal article that I used to put this project together.
First, the book Forecasting: Principles and Practice was where I got my initial ideas for forecasting models. It was also an excellent resource for learning the R programming code.
Second, this is the journal article I mentioned where I learned about the regression model.
Third, I wanted to mention the book An Introduction to Statistical Learning (ISLR). This book does a great job teaching a lot of statistical models and showing the corresponding R code. I could not have implemented the regression model without consulting this book first.
Conclusion
Thanks so much for reading. I have published several other project articles before, but this one is my personal favorite. I hope that the next stage of my career is in a demand forecasting role, though not specifically for energy demand. I'm not 100% done with this project yet. Like I said before, I might revisit it if I come up with new ideas. I also want to see if I can create a presentation in Quarto that summarizes this article. I think that setting this kind of goal for myself is a better motivator for learning a new skill than just taking some online course. To end this article, I have linked my R code and my GitHub.