Electricity Load Forecasting Using Machine Learning applied to New York City
1. Foreword
I want to thank Dr. Liping Ma and Dr. Lisa Ponti for reviewing drafts of this paper. Their feedback is much appreciated.
2. Introduction
This project seeks to provide a method to forecast electricity loads for a defined homogeneous region, for example, the city of New York. A homogeneous region is an area that is affected by the same variables with about the same values. An area whose population is growing rapidly and which is still being developed would be hard to track. If we chose to model a full state, e.g. the state of California, conditions would differ widely within the state. It follows that to obtain a forecast for a full state, we would need to add up the forecasts of its smaller components.
The scientific literature has documented well that power consumption is correlated with many factors, and many of those factors were considered in this study.
3. Overview of electricity markets in the USA
In a deregulated electricity market, competitors buy and sell electricity by investing in power plants and transmission lines. Generation owners sell electricity to retail suppliers. Retail suppliers set prices for consumers, and consumers may choose between available suppliers.
Figure 1. Map of Deregulated Energy States and Markets, 2018, www.electricchoice.com
As shown in Figure 1, the electricity and gas markets in the USA vary from state to state. Some states, like CA and NY, have deregulated gas and electricity markets. In others, like TX, only the electricity market is deregulated. In FL only the gas market is deregulated, while in OK both the electricity and the gas markets are fully regulated.
Under 89 FERC 61,285, 18 CFR Part 35, Docket No. RM99-2-000, Regional Transmission Organizations, the Federal Energy Regulatory Commission (FERC) proposed a non-profit ISO system which would direct the operation of the transmission system and run day-ahead and real-time power markets, coupled with a grid entity that owns and maintains the transmission in the area operated by the different ISOs.
The intraday and real-time markets are managed and operated by independent system operators (ISOs). There are seven ISOs in the USA (see Figure 2 below). They are non-profit and their service areas vary: the New York ISO (NYISO) covers mainly one state, while the Midcontinent ISO (MISO) covers several. Figure 2 also shows that the regions covered by a single ISO are not always attached to one another.
Figure 2. USA Energy Power Markets, FERC.GOV
At the wholesale level, electricity is constantly balanced in real time. The lack of storage and other more complex factors lead to very high volatility of spot prices.
All FERC-jurisdictional RTOs and ISOs in deregulated markets, including NYISO, have what is called a dual settlement system, which includes a combination of the following:
Day Ahead Energy Market, DAM
In order to hedge some of the volatility, generators and load-serving entities sign contracts for delivery of energy at a later date, usually one day out, in the Day-Ahead Market. This market produces one financial settlement. If market participants adhere to day-ahead schedules, they need not participate in the real-time markets. The day-ahead market process balances supply offers (physical and virtual) against demands (physical and virtual).
Real Time Energy Market
The real time energy market is run by real time traders and allows market participants to buy and sell wholesale electricity during the course of the operating day, typically in five minute intervals. Prices are determined by real time dispatch quantities. It balances the differences between day ahead commitments and the actual real time demand for and production of electricity. The real time energy market produces a second and separate financial settlement.
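As a rough illustration of how the two settlements combine for one hour, the sketch below assumes the usual convention that the day-ahead schedule settles at the day-ahead price and only the real-time deviation settles at the real-time price. The function name, quantities and prices are hypothetical and are not taken from NYISO data.

```python
def two_settlement_cost(da_mwh, da_price, rt_mwh, rt_price):
    """Cost to a load-serving entity for one hour under a dual settlement.

    Assumption: the day-ahead quantity settles at the day-ahead price and the
    real-time deviation (actual minus scheduled) settles at the real-time price.
    """
    day_ahead = da_mwh * da_price
    real_time = (rt_mwh - da_mwh) * rt_price
    return day_ahead, real_time, day_ahead + real_time


# Hypothetical hour: 100 MWh scheduled at $30/MWh, 110 MWh consumed, RT price $50/MWh
da, rt, total = two_settlement_cost(100.0, 30.0, 110.0, 50.0)
print(da, rt, total)  # 3000.0 500.0 3500.0
```

A participant that consumes exactly its day-ahead schedule has a zero real-time settlement, which is why a more accurate load forecast reduces exposure to volatile real-time prices.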
In February 2018, members of the Midcontinent ISO (MISO) recommended the development of a multi-day market forecast with a target implementation date of 2021. The reasoning is that the current day-ahead market construct is not designed to forecast economic commitments beyond the next day. This results in the inability to economically commit long-lead (or high-startup-cost) units and could cause uneconomic cycling of certain units across a certain period. In their presentation they quantified significant economic benefits of changing to a multi-day market forecast.
4. Estimation Methods
Several different methods are compared with the goal of determining which is best suited to this specific case. First, traditional forecasting models, namely a multiple linear regression model and an ARIMA model, are used; then machine learning algorithms, including a Random Forest regressor, Gradient Boosting and a Neural Network/Deep Learning method with 2 layers, are applied. Electricity companies generally use proprietary models to create their own forecasts.
There is an important difference between these methods. Time series methods, such as ARIMA, only require the historical series of the variable being forecast; their output is a mean value plus a range. They take no other inputs, so they cannot adapt to new conditions. Models such as multiple linear regression and all the machine learning models require input variables and an output series to mimic. If the data is varied enough, including extreme situations, the models will take all those points into account in their regression. In this study we used several years of continuous data.
4.1 Linear Regression Model
Multiple linear regression models the relationship between several explanatory variables and a response variable by fitting a linear equation. Every set of values of the independent variables $x_1, \dots, x_p$ is associated with a value of the dependent variable y. The population regression line, which gives the mean response $\mu_y$, is
$$\mu_y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p \qquad (1)$$
The model for multiple linear regression, given n observations, is:
$$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_p x_{ip} + \epsilon_i, \quad i = 1, 2, \dots, n \qquad (2)$$
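To make the multiple regression model concrete, here is a minimal sketch that estimates the coefficients by ordinary least squares with NumPy. The data is synthetic and the coefficient values are hypothetical; it is not the dataset or the code used in this study.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (hypothetical): n observations, p = 2 predictors
n = 200
X = rng.normal(size=(n, 2))
true_beta = np.array([5.0, 2.0, -1.0])  # intercept and two slopes
y = true_beta[0] + X @ true_beta[1:] + rng.normal(scale=0.1, size=n)

# Add a column of ones so that the intercept is estimated with the slopes
A = np.column_stack([np.ones(n), X])
beta_hat, *_ = np.linalg.lstsq(A, y, rcond=None)

residuals = y - A @ beta_hat  # estimates of the error terms
print(beta_hat)
```

Plotting `residuals` against the fitted values `A @ beta_hat` gives the kind of "Residuals vs Fitted" diagnostic discussed in the results section.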
4.2 ARIMA model
The ARIMA model is a time series model that examines the dependence of the predicted variable on its own past values. In the model, AR is the autoregressive term, which represents how lagged values of the dependent variable affect its current value, and MA is the moving average term, which captures the effects of lagged error terms. An important assumption in time series models is that the series is covariance stationary. If the data has a time trend or is integrated over time, we need to take differences between the current and lagged values to stabilize the data. The ARIMA model only uses the load history, so no other input data is used when fitting an ARIMA model.
An ARIMA(p,d,q) model (Auto Regressive Integrated Moving Average with orders p, d, q) is a discrete-time linear equation with noise, of the form
$$\left(1 - \sum_{k=1}^{p} \alpha_k L^k\right)(1 - L)^d X_t = \left(1 + \sum_{k=1}^{q} \beta_k L^k\right)\epsilon_t$$
where p and q are the orders of the autoregressive and moving average parts, d is the number of differencing operations, $\alpha_1, \dots, \alpha_p$ and $\beta_1, \dots, \beta_q$ are the (real-valued) autoregressive and moving average coefficients respectively, $\epsilon_t$ is an error term (usually white noise), L is the time lag (backward shift) operator and $X_t$ is the value of the series at time t.
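The differencing part of an ARIMA model, the operator (1 - L)^d, can be sketched in a few lines. The example below only illustrates how differencing removes a polynomial trend; it does not fit an ARIMA model, and the series are hypothetical.

```python
import numpy as np

def difference(x, d=1):
    """Apply the operator (1 - L)^d to the series x."""
    x = np.asarray(x, dtype=float)
    for _ in range(d):
        x = x[1:] - x[:-1]  # X_t - X_{t-1}
    return x

t = np.arange(10)
trend = 3.0 + 2.0 * t            # hypothetical series with a linear trend
print(difference(trend, d=1))    # constant 2.0: the linear trend is removed
print(difference(t ** 2, d=2))   # constant 2.0: a quadratic trend needs d = 2
```

Each pass shortens the series by one point, so a model with d = 1 is fitted on one fewer observation than the original series.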
4.3 Random Forest
Random Forest is a supervised learning algorithm. It builds multiple randomized decision trees and merges them together to get a more accurate and stable prediction. An in-depth explanation of the model can be found in the PhD thesis by Louppe and a paper by Klusowski. In essence, Random Forest is a bagging model of trees where each tree is trained independently on a group of randomly sampled instances with randomly selected features.
The training of a Random Forest is as follows:
For t = 1, …, T:
1. Sample ntry instances from the dataset with replacement.
2. Train an unpruned decision or regression tree f_t on the sampled instances with the following modification: at each node, choose the best split among mtry randomly selected features rather than among all features.
Both ntry and mtry are predefined constants.
The Random Forest estimate comes from averaging the predictions of all the trees:
$$m_{M,n}(X; \Theta_1, \dots, \Theta_M) = \frac{1}{M} \sum_{m=1}^{M} m_n(X; \Theta_m)$$
where $m_n(X; \Theta_m)$ denotes the prediction of the m-th tree. When M (the number of trees, written T in the training loop above) is large, the law of large numbers justifies using
$$m_{\infty,n}(X) = \mathbb{E}_{\Theta}\left[m_n(X; \Theta)\right]$$
where $\mathbb{E}_{\Theta}$ is the expected value with respect to the random parameter $\Theta$, conditionally on X (the point at which a prediction is desired) and the dataset $D_n$. The sequence $(\Theta_m)_{1 \le m \le M}$ consists of i.i.d. realizations of a random variable $\Theta$, which governs the probabilistic mechanism that builds each tree.
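The training loop and the averaging step can be sketched with NumPy as follows. To keep the example short, each tree is a single split (a depth-1 tree, or stump), so the per-node feature sampling reduces to one random feature subset per tree. This is a simplified illustration under that assumption, not the implementation used in this study; the function names, the parameters `n_try` and `m_try`, and the data are hypothetical.

```python
import numpy as np

def fit_stump(X, y, features):
    """Best single split (a depth-1 regression tree) among the given features."""
    # Fallback: no split, predict the mean everywhere
    best = (np.inf, features[0], np.inf, y.mean(), y.mean())
    for j in features:
        for thr in np.unique(X[:, j])[:-1]:
            left = X[:, j] <= thr
            pred_l, pred_r = y[left].mean(), y[~left].mean()
            sse = ((y[left] - pred_l) ** 2).sum() + ((y[~left] - pred_r) ** 2).sum()
            if sse < best[0]:
                best = (sse, j, thr, pred_l, pred_r)
    return best[1:]

def predict_stump(stump, X):
    j, thr, pred_l, pred_r = stump
    return np.where(X[:, j] <= thr, pred_l, pred_r)

def fit_forest(X, y, n_trees=50, n_try=None, m_try=1, seed=0):
    rng = np.random.default_rng(seed)
    n, p = X.shape
    n_try = n if n_try is None else n_try
    forest = []
    for _ in range(n_trees):
        rows = rng.integers(0, n, size=n_try)             # sample with replacement
        feats = rng.choice(p, size=m_try, replace=False)  # random feature subset
        forest.append(fit_stump(X[rows], y[rows], feats))
    return forest

def predict_forest(forest, X):
    # The forest prediction is the average of the individual tree predictions
    return np.mean([predict_stump(s, X) for s in forest], axis=0)

# Hypothetical data: a step in the first feature plus a slope in the second
rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(200, 2))
y = np.where(X[:, 0] > 0, 1.0, -1.0) + 0.5 * X[:, 1]
forest = fit_forest(X, y, n_trees=50, m_try=1)
pred = predict_forest(forest, X)
print(np.mean((pred - y) ** 2), np.var(y))
```

A practical forest grows deep trees and samples features at every node; a library implementation such as scikit-learn's `RandomForestRegressor` handles both.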
4.4 Gradient Boosting Algorithm
The Gradient Boosting algorithm is based on gradient descent plus boosting, where the predictors are built sequentially rather than independently. Each subsequent predictor learns from the mistakes of the previous ones. Therefore, the observations have an unequal probability of appearing in subsequent models and the ones with the highest error appear most.
Gradient Boosting is an ensemble method: it combines different predictors in a sequential manner, with some shrinkage applied to them, and also provides variable selection.
Gradient Boosting, as an ensemble method, may be described as follows:
$$y = \mu + \sum_{m=1}^{M} \nu\, h_m(y; X) + e$$
where y is the vector of observed responses (here, electricity loads), $\mu$ is a population mean, $\nu$ is a shrinkage factor, $h_m$ is a predictor model as described earlier, X is the matrix of corresponding input variables and e is the vector of residuals. Each predictor is added in a sequential manner, and is applied consecutively to the residuals from the committee formed by the previous ones, weighted by the shrinkage factor $\nu$.
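The sequential fitting to residuals can be sketched as follows, using squared-error loss and one-split regression trees (stumps) as the predictors. The choice of stumps, the function names and the sine-wave data are assumptions made for illustration; this is not the configuration used in this study.

```python
import numpy as np

def fit_stump(x, r):
    """Depth-1 regression tree on a single feature, fitted to residuals r."""
    best = (np.inf, np.inf, r.mean(), r.mean())  # fallback: no split
    for thr in np.unique(x)[:-1]:
        left = x <= thr
        pl, pr = r[left].mean(), r[~left].mean()
        sse = ((r[left] - pl) ** 2).sum() + ((r[~left] - pr) ** 2).sum()
        if sse < best[0]:
            best = (sse, thr, pl, pr)
    return best[1:]

def boost(x, y, n_rounds=100, nu=0.1):
    mu = y.mean()                 # start from the mean
    pred = np.full_like(y, mu)
    stumps = []
    for _ in range(n_rounds):
        residual = y - pred       # what the committee so far gets wrong
        thr, pl, pr = fit_stump(x, residual)
        pred = pred + nu * np.where(x <= thr, pl, pr)  # shrunken update
        stumps.append((thr, pl, pr))
    return mu, stumps, pred

x = np.linspace(0.0, 2.0 * np.pi, 200)
y = np.sin(x)                     # hypothetical non-linear target
mu, stumps, pred = boost(x, y, n_rounds=100, nu=0.1)
print(np.mean((y - pred) ** 2), np.var(y))
```

A smaller shrinkage factor `nu` makes each step more cautious and usually calls for more rounds.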
4.5 Neural Networks/Deep Learning
Neural network/deep learning algorithms are designed to recognize patterns and are loosely modeled on biological neurons. They are networks composed of several layers. The layers are made of nodes, which are where the computation happens. Nodes combine input from the data with a set of coefficients, or weights, that either amplify or dampen that input. The network combines all the nodes, and the final output is determined using activation functions.
A neural network is usually made of several units, also known as neurons, of the form:
$$h_j(x) = \sigma\left(w_j + \sum_{i=1}^{p} w_{ij} x_i\right)$$
Figure 3. Neural Network
where $\sigma$ is a non-linear activation function, such as the sigmoid function, and $w_j$ and $w_{ij}$ are the bias and weights of unit j. These units are typically structured into successive layers, where the outputs of a layer are directed through weighted connections, or synapses, to the inputs of the next layer. The figure shows a 3-layer neural network. The first layer is the input layer, which transmits the input values $x = (x_1, \dots, x_p)$ to the second layer. The second layer is made of activation units $h_j$, taking as inputs the weighted values of the input layer and producing non-linear transformations as outputs. The third layer is made of a single activation unit. It takes as inputs the weighted outputs of the second layer and produces the predicted value y.
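A forward pass through the 3-layer network described above can be sketched with NumPy. The weights are hypothetical and set to simple values so the result is easy to check by hand; the sigmoid on the output unit follows the description above, although a regression network often uses a linear output instead.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W1, b1, w2, b2):
    """Forward pass of a 3-layer network: input -> hidden units h_j -> output."""
    h = sigmoid(W1 @ x + b1)     # hidden layer: one activation unit per row of W1
    return sigmoid(w2 @ h + b2)  # single output activation unit

# Hypothetical weights: p = 3 inputs, 2 hidden units
x = np.array([0.5, -1.0, 2.0])
W1 = np.zeros((2, 3))
b1 = np.zeros(2)
w2 = np.array([1.0, -1.0])
b2 = 0.0
print(forward(x, W1, b1, w2, b2))  # 0.5
```

With all hidden weights at zero both hidden units output 0.5, their weighted sum cancels, and the output unit returns sigmoid(0) = 0.5. Training consists of adjusting the weights so that the output matches the measured load.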
5. Model results
Below are the results of the linear regression, ARIMA model and the three machine learning models: Random Forest, Gradient Boosting and Neural Networks/Deep Learning.
5.1 Linear Regression Model
In Figure 4 we can observe that the hourly data is quite noisy, and that there is a positive relationship between the model variables and consumption as residuals are kept within a certain range.
Figure 4. Linear Regression model graphs
The “Residuals vs Fitted” graph in Figure 4 shows that the residuals of the linear model increase in magnitude as the electricity values become higher. The scale-location plot shows whether the residuals are spread equally along the predictor range; the line moves up and deviates from the horizontal, which means the residuals are not homoscedastic. The QQ plot shows a non-normal distribution, as the end quantiles diverge from the line. The Cook’s distance plot shows that only a few points could be influential outliers.
In conclusion, a linear regression model is not able to simulate the changes in electricity demand with precision, as the underlying system is highly non-linear, so a different method is required. This same conclusion could be drawn by looking at the original data.
5.2 ARIMA model
The ARIMA model and forecast were generated using R.
We used the electricity load data to estimate the ARIMA coefficients, then computed the autocorrelation function (ACF) of the residuals to review the fitted model.
The selected model, ARIMA(5,1,4), has d = 1, which means the data had some time trend and was not stationary, as expected.
The ACF of the residuals of the ARIMA(5,1,4) model is reported in Figure 6.
Figure 6 shows that some correlation remains in the residuals, so a Ljung-Box test on the residuals was performed to corroborate the graph. The test gave a p-value below 2.2E-16. As the p-value is lower than the significance level of 0.05, we reject the hypothesis that the residuals are uncorrelated. The ARIMA model therefore does not capture all the structure in the data and is not a good way to forecast electricity load in this case.
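For readers who want to see what the test measures, the Ljung-Box statistic Q = n(n+2) Σ ρ_k² / (n − k) can be computed directly from the sample autocorrelations. The sketch below uses NumPy on synthetic series; it is not the R code used in this study, and it stops at the statistic rather than computing a p-value.

```python
import numpy as np

def acf(x, max_lag):
    """Sample autocorrelations at lags 1..max_lag."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    denom = np.sum(x ** 2)
    return np.array([np.sum(x[k:] * x[:-k]) / denom for k in range(1, max_lag + 1)])

def ljung_box_q(x, max_lag):
    """Ljung-Box Q statistic over lags 1..max_lag."""
    n = len(x)
    rho = acf(x, max_lag)
    lags = np.arange(1, max_lag + 1)
    return n * (n + 2) * np.sum(rho ** 2 / (n - lags))

rng = np.random.default_rng(0)
white = rng.normal(size=2000)   # uncorrelated series
correlated = np.cumsum(white)   # random walk: strongly autocorrelated
print(ljung_box_q(white, 10), ljung_box_q(correlated, 10))
```

Under the hypothesis of no autocorrelation, Q is compared against a chi-squared distribution; a very large Q, as for the correlated series here, corresponds to a very small p-value.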
For illustration purposes, a forecast from the ARIMA(5,1,4) model was obtained and is shown in Figure 7.
Figure 7. Forecasts from ARIMA (5,1,4)
ARIMA forecasts provide a central value and a range within which the next points may fall. The inner range is the 80% confidence band and the outer range is the 95% confidence band. As can be seen, the interval around the central value expands with time. The forecast does not vary according to inputs, so it is not a high-precision forecast.
5.3 Machine Learning Algorithms General Procedure and Numerical Results
Good machine learning algorithms, such as the ones used in this paper, function like high-powered regression models and can handle non-linear relationships if enough data is provided. The training data needs to encompass the extreme situations we want to predict, for instance, very hot and humid summers and cold winters.
The algorithms were programmed in Python using machine learning libraries such as scikit-learn (sklearn), TensorFlow, and Keras.
The data is then randomly divided into two datasets, a training dataset and a testing dataset. The training data is used by the algorithm to fit the model parameters. The testing data is used to compare the output of the algorithm against data it has never seen before. This guards against what is known as overfitting, in which the algorithm fits the training points very closely but does not perform well when new data is given.
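A random split of this kind can be sketched as follows. The function name and the 70% share are illustrative; scikit-learn's `train_test_split` offers the same functionality.

```python
import numpy as np

def train_test_split_indices(n, train_fraction, seed=0):
    """Randomly partition the row indices 0..n-1 into training and testing sets."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n)
    n_train = int(round(train_fraction * n))
    return order[:n_train], order[n_train:]

train_idx, test_idx = train_test_split_indices(1000, 0.7)
print(len(train_idx), len(test_idx))  # 700 300
```

Repeating the split with training fractions from 0.1 to 0.8 gives the kind of sensitivity comparison reported in this section.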
The mean square error (MSE) and mean absolute percentage error (MAPE) were calculated for all three algorithms in order to compare their results. Testing and training scores were calculated for both Random Forest and Gradient Boosting algorithms.
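The two error measures can be written in a few lines of plain Python. The load values below are hypothetical and only serve to show the arithmetic.

```python
def mse(actual, predicted):
    """Mean squared error."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def mape(actual, predicted):
    """Mean absolute percentage error, in percent (actual values must be non-zero)."""
    return 100.0 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)

actual = [5000.0, 6000.0, 8000.0]     # hypothetical loads in MW
predicted = [5100.0, 5900.0, 8400.0]
print(mse(actual, predicted))   # 60000.0
print(mape(actual, predicted))  # about 2.89
```

MSE is expressed in squared load units and penalizes large misses heavily, while MAPE is scale-free, which makes it easier to compare across load levels.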
Figure 12 shows the results of running the different machine learning algorithms using 10% to 80% of the available data for training. Random Forest and Gradient Boosting were found to perform best with approximately 70% of the data used for training, as overfitting started at 80%. Neural Networks performed best with about 40% of the data. The MSE and MAPE scores reflected the difference between the predicted and measured values and showed that the error in the predictions was diminishing.
The three algorithms, which are among the best in machine learning, gave very similar results. Gradient Boosting and Random Forest were found to be almost equivalent and showed slightly better results than Neural Networks.
The resulting graphs of the three algorithms were also found to be almost identical. The graphs are shown in order of numerical accuracy on the train and test data and on MSE/MAPE: first Gradient Boosting, then Random Forest and last Neural Networks.
The small sensitivity of the MSE and MAPE numbers when the training share varies from 10% to 80% shows that the amount of data provided, even at a 10% split, is enough to train the different algorithms. If the data provided were not enough, we would see high sensitivity in these indicators.
5.4 Graphs Comparison among Gradient Boosting, Random Forest and Neural Networks
Several graphs were created for all three machine learning algorithms. All of them compare the predicted values against the original test values, and all look quite similar.
Figure 9. NYC measured vs. predicted electrical loads using Gradient Boosting, Random Forest and Neural Networks
Figure 9 shows an XY plot of the relationship between the measured load and the electricity load predicted by the model. It shows the models are effective within the range. It also shows that between 4000 and 8000 MW the forecast is more accurate, while predictions are less accurate for values higher than 8000 MW, as the difference between the predicted and the measured values increases.
Figure 10. Actual Loads vs Gradient Boosting, Random Forest and Neural Networks Results.
Figure 10 illustrates that the results of all algorithms resemble the behavior of the actual load over the 7 year period, going up and down, behaving just like the original data does. Neither time series nor multiple linear regression can predict the data with such precision.
Figure 11. Residuals Fraction in Gradient Boosting, Random Forest and Neural Networks
Figure 11 shows how large the fractional difference between the calculated results and the actual loads is. The tendency towards negative values shows that calculated loads are typically higher than the actual load when there is a difference. For a power company that is helpful, as it is better to have energy available than to fall short. The tendency to overpredict is probably because there are other variables that need to be taken into consideration.
5.5 Cumulative Differences among Machine Learning Algorithms
We reviewed all the data pairs and found that the difference between the calculated and the original values was quite small most of the time. The values found are shown in Figure 12.
Figure 12. Percentage of population, maximum difference among machine learning algorithms
While this table would help traders figure out how different the values are when there is a difference, it also shows that all three methods perform well and deliver very similar results.
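A cumulative comparison of this kind reduces to counting how many predictions fall within a given relative tolerance of the measured load. The sketch below uses hypothetical values and a hypothetical function name; it does not reproduce the figures of this study.

```python
def share_within(actual, predicted, tolerance):
    """Percentage of points whose relative error is below the tolerance."""
    hits = sum(abs(p - a) / a < tolerance for a, p in zip(actual, predicted))
    return 100.0 * hits / len(actual)

actual = [5000.0, 6000.0, 8000.0, 10000.0]     # hypothetical loads in MW
predicted = [5050.0, 6200.0, 8700.0, 11500.0]
for tol in (0.025, 0.05, 0.10):
    print(tol, share_within(actual, predicted, tol))
```

With these four points the shares are 25%, 50% and 75% at the 2.5%, 5% and 10% tolerances.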
6. Conclusions
Energy forecasts based on multiple linear regression are not accurate, as the underlying phenomenon is not linear. Multiple linear regression is typically used in data science as a first approach to model outcomes, but it is important to verify whether the underlying phenomenon is linear or not.
Typical time series forecasting methods such as ARIMA are not well suited here, as some variables in electricity demand are not purely random phenomena and follow a trend, which violates the randomness (stationarity) requirement for the use of these time series models. Differencing was tried to improve results, but did not work. In addition, forecasts based on time series give a central value plus a range; they are not meant to give accurate numbers, as they only receive the past series as an input, and they do not adapt to new conditions.
While I cannot compare my results against the methods used by the industry, as they are proprietary, the model used is a viable tool to produce accurate hour-by-hour forecasts. It would allow power companies to propose to FERC the use of a multi-day market forecast to improve their economic commitment decisions, just like the proposal made in 2018 by MISO, to expand their energy hedging contracts from 1 to maybe 3 days or more, and to supply most of their energy based on these hedging contracts. Real-time traders will still be required, but most energy sourcing could be done through contracts negotiated days in advance. Longer contracts mean less risk and therefore lower prices, as units with longer lead times, high startup costs or other constraints may be taken into consideration in the mix. All these factors, as shown in the MISO study, will improve the profits of electricity companies.
Update:
Machine Learning Model to predict Electricity Demand for the city of New York
I've updated my model to predict electricity demand. The data below shows how a machine learning algorithm predicts the electricity demand for the city of New York during the period between 2013 and 2017. Why NYC? Well, the data was public! The first graph shows the predicted load against the original data falling along a 45-degree line, a sign the model is behaving correctly. The second graph shows that during summer (the peaks) the model follows the increase in demand, which is what it should do. The third graph shows the residuals. While the error level increases during summer, 66.9% of the values had less than 2.5% difference, 89.9% had less than 5% difference and 98.5% had less than 10% difference.
References
[1] US Energy Information Administration website https://www.eia.gov/tools/faqs/faq.php?id=96&t=3
[2] Map of deregulated energy states and markets, 2018
[3] 89 FERC 61,285, 18 CFR Part 35, Regional Transmission Organizations, Federal Energy Regulatory Commission
[4] Price Formation in Organized Wholesale Electricity Markets, Docket No. AD14-14-000, Operator Initiated Commitments in RTO and ISO Markets, FERC, December 2014
[5] Using Market Optimization Software to Develop a MISO Multi Day Market Forecast, Chuck Hansen et al., MISO, FERC Technical Conference, AD10-12-009, June 26th 2018
[6] 2017 State of the Market Report for the New York ISO Markets
[7] 2017 State of the Market Report for the New York ISO Markets
[8] Electricity Power Markets: National Overview