How can LSTMs help chemical engineers in predicting strategic risks?

Introduction

Although Industry 4.0 advancements have provided numerous benefits for the process industry in recent years, it is essential to acknowledge that the range of applications is continually expanding. New AI and Machine Learning technologies are being invented and applied every day to assist us in overcoming our business challenges.

As engineers and scientists, we must keep an eye out for these new applications because they can enhance our ability to predict risks in our company, thus providing the opportunity to implement mitigating actions. Every business decision-making process involves weighing risks and profits, and having tools that support our analysis systematically and quantitatively makes this process less uncertain.

No other recent event has presented more risk and uncertainty than the COVID-19 global pandemic. With countries implementing lockdowns to prevent the spread of the virus, businesses faced significant challenges in continuing their operations, especially those dependent on external feedstock.

Take the Brent Oil Prices dataset shown in Figure 1 as an example. It spans values since 1987, with the red dashed line marking January 10th, 2020, the date used in this study as the onset of the COVID-19 pandemic (the WHO's formal pandemic declaration came later, on March 11th, 2020). Before this date, a barrel of oil traded at approximately USD 60; it then plummeted to less than USD 20.

Figure 1. Evolution of the Brent Oil price from 1987 to 2022

However, it is noteworthy that the price rebounded to over USD 120 after a few months, all while the pandemic was still ongoing (remember that the WHO only declared the end of the pandemic in May of 2023).

These two "moves" represented a variation of -67% initially, followed by a staggering 500% increase. Such extreme fluctuations signify significant volatility for an asset like oil. Oil refineries, processing units, and downstream chemical industries undoubtedly faced severe challenges in planning their operations during these periods: the drastic, unpredictable price swings made it highly challenging to make informed decisions and manage operations effectively.

However, let's mark another significant moment in our plot: January 1st, 2009, represented by the orange line in Figure 2. This date marks approximately the end of the 2007-2008 "Oil Shock" crisis. Among the causes of that crisis, as discussed by Hamilton (2009), was growth in global oil demand that outpaced oil production. Saudi Arabia, the main oil supplier, saw its production decline in 2007, leading to a remarkable depletion of the world's oil supply. As a consequence, oil prices surged significantly during that period.

Figure 2. Evolution of the Brent Oil price from 1987 to 2022 – second crisis period.

Indeed, apart from the political and economic motivations that influence the oil market, the observed variation during the "Oil Shock" crisis was staggering. At the beginning of the crisis, there was an almost 133% increase in oil prices, reflecting the severity of the supply-demand imbalance and the impact of geopolitical factors. However, as the crisis unfolded, there was a drastic turnaround, with prices plummeting by approximately 71%, marking a dramatic shift in market dynamics.

Nevertheless, one question remains: if we had today's processing power back in 2010-2015, could we have used the data collected during the "Oil Shock" crisis to prepare ourselves for the effects of the COVID-19 pandemic?

We must consider the differences between the motivations of the two events. Additionally, many other events beyond our control have significant effects on oil prices. However, this research aims to answer the question through a purely data-driven approach. This means that we will investigate whether an artificial intelligence algorithm can predict the values of the oil price using only its historical evolution, independent of other political, economic, and environmental factors.


Methodology

Data Source

The data used in this study was obtained from the Brent Oil Prices repository on Kaggle (Kaggle, 2023), which in turn sources it from the U.S. Energy Information Administration database.


Algorithm Chosen

To model a time series problem with such a degree of complexity, we chose the recurrent artificial neural networks approach, mainly because of their ability to deal with hidden correlations between input variables. More specifically, we’ll show an application of the Long Short-Term Memory (LSTM) algorithm.

Recurrent neural networks differ from traditional multi-layer perceptrons in that they have connections between neurons of the same hidden layer, giving the algorithm a "short-term memory". However, this introduces an additional obstacle to convergence during gradient-based training.

If the temporal distance between neurons is large, propagating the error through them can lead to either the Vanishing or the Exploding Gradient phenomenon. LSTMs solve this problem by regulating how much of each prediction is composed of "memories" (information from past timesteps) and how much of "new information" (the observation at the present timestep). Figure 3 shows an example of a basic LSTM cell.

Figure 3. Example of LSTM cell

The LSTM cell controls the "composition" of its outputs by adjusting the weights of three main gates. The "Forget gate" regulates how much past information is used in the predictions; the "Input gate" regulates whether new information from the present observation is added to the memories to form the outputs; and the "Output gate" (mistakenly labeled as "Forget gate" next to o(t) in Figure 3) regulates whether the memory of the present cell is passed on to the next cell.
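To make the gate mechanics concrete, a single LSTM cell step can be sketched in plain NumPy. This is a framework-free illustration of the standard formulation; the weight layout and function names are my own, not taken from any library or from the study's code:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_cell_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM timestep. W, U, b stack the four gate parameter sets
    (input i, forget f, candidate g, output o) along their first axis."""
    n = h_prev.shape[0]
    z = W @ x_t + U @ h_prev + b        # all gate pre-activations, shape (4n,)
    i = sigmoid(z[0:n])                 # input gate: admit new information?
    f = sigmoid(z[n:2*n])               # forget gate: keep past memory?
    g = np.tanh(z[2*n:3*n])             # candidate new cell content
    o = sigmoid(z[3*n:4*n])             # output gate: expose the memory?
    c_t = f * c_prev + i * g            # blend old memory with new information
    h_t = o * np.tanh(c_t)              # hidden state passed to the next cell
    return h_t, c_t
```

The cell state c_t is where "memories" accumulate; the forget and input gates decide its composition, and the output gate decides how much of it the next cell sees.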


Data Preprocessing

The first preprocessing step was to split the dataset into train, validation, and test sets. Instead of the traditional approach of assigning a fixed percentage of the original dataset to each set, this study split the data at specific dates.

The training set kept most of the data, from the first available date (May 20th, 1987) until the end of the "Oil Shock" crisis on January 1st, 2009. The validation set spanned January 1st, 2009 until the onset of the COVID-19 pandemic on January 10th, 2020. The remainder of the dataset formed the test set.

The final proportions of each dataset were: 61 % for the training set; 31 % for the validation set, and 8 % for the test set.

After the split, the feature lag transformation was performed. This was done to allow the algorithm to use information from past observations to predict the next period's value. This study used the 30 most recent values (roughly the last month of observations) to predict the next day's value.

Finally, the last preprocessing step was feature scaling. Normalization was chosen as the scaling approach because artificial neural networks work well when features vary within a limited range.
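The three preprocessing steps above can be sketched as follows. This is a minimal sketch under stated assumptions: the column names `Date` and `Price` are assumed, and min-max normalization to [0, 1] is assumed as the "normalization approach" (the article does not name the exact scaler):

```python
import numpy as np
import pandas as pd

N_LAGS = 30  # the last 30 daily prices predict the next day

def split_by_date(df, train_end="2009-01-01", val_end="2020-01-10"):
    """Chronological split at the two dates used in the study."""
    train = df[df["Date"] < train_end]
    val = df[(df["Date"] >= train_end) & (df["Date"] < val_end)]
    test = df[df["Date"] >= val_end]
    return train, val, test

def make_lagged(prices, n_lags=N_LAGS):
    """Turn a 1-D price series into (samples, n_lags) inputs
    and the corresponding next-day targets."""
    X = np.stack([prices[i:i + n_lags] for i in range(len(prices) - n_lags)])
    y = prices[n_lags:]
    return X, y

def minmax_scale(x, lo, hi):
    """Min-max normalization; lo/hi should be fitted on the training set only,
    to avoid leaking validation/test information into training."""
    return (x - lo) / (hi - lo)
```

Note the split is purely chronological, never random: shuffling before splitting would leak future information into the training set.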


Model Architecture and Training

Since this study aimed to explore the abilities of neural networks to understand hidden and non-trivial correlations between the data, a deep architecture of four LSTM layers was proposed to model the problem (Figure 4). Moreover, we wanted to predict two different and non-correlated extreme events in oil prices using only the available data, and hence, the algorithm should be complex enough to understand such correlations.

Dropout layers were applied to regularize training and avoid model overfitting. All dropout probabilities were kept at a low value (p = 0.1) to avoid underfitting.

Figure 4. LSTM architecture used in this study

The model was trained on the training set, with concomitant validation on the validation set. No early stopping criterion was set; the model was first trained for 1,500 epochs, using random batch training (batch size of 8). The loss function was the mean squared error, and the weights were optimized with the Adam algorithm.
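A minimal Keras sketch of this setup, assuming the preprocessing above: four stacked LSTM layers each followed by dropout (p = 0.1), MSE loss, and the Adam optimizer. The layer widths (64 units) are my own assumption, since the article does not report neuron counts:

```python
from tensorflow import keras
from tensorflow.keras import layers

N_LAGS = 30  # 30 past daily prices as input

def build_model(n_lags=N_LAGS, units=64):
    """Four-layer LSTM with dropout regularization, as in Figure 4."""
    model = keras.Sequential([
        keras.Input(shape=(n_lags, 1)),
        layers.LSTM(units, return_sequences=True),
        layers.Dropout(0.1),
        layers.LSTM(units, return_sequences=True),
        layers.Dropout(0.1),
        layers.LSTM(units, return_sequences=True),
        layers.Dropout(0.1),
        layers.LSTM(units),       # final LSTM returns only the last state
        layers.Dropout(0.1),
        layers.Dense(1),          # next-day price (in scaled units)
    ])
    model.compile(optimizer="adam", loss="mse")  # Adam + MSE, as in the study
    return model

# Training as described (1,500 epochs first, then retrained for 75):
# model = build_model()
# model.fit(X_train, y_train, validation_data=(X_val, y_val),
#           epochs=1500, batch_size=8, shuffle=True)
```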

After this first training, a visual analysis of the learning curve was performed to identify overfitting, signaled by the validation loss curve gradually diverging from the training loss. The final number of epochs chosen was 75, and the neural network was retrained with it.

No other hyperparameter, such as the number of neurons or the architecture, was explored during training because this is only a preliminary study.


Model Validation

The validation of the model was performed in two steps. The first used the entire dataset to validate the predictions, with performance evaluated by two metrics: Mean Absolute Error (a measure of bias in regression problems) and the R2 score (a measure of explained variance).

In this case, we're assuming the real values would always be available as new inputs for the model, and forecasts would be made only one day ahead. Under this scheme, a 30-day forecast would take 29 days to complete.

The second form of validation made the forecasts in a feedback fashion. After day (t+1) was predicted, its value was fed back as a new input to the model. Here we assume no access to the actual data after the predictions; the same 30-day forecast could then be made today, at the risk of using the model's own predictions (and their uncertainties) as new inputs.

The performance of the second case was assessed against new real data for the 30 days following the last timestamp of the dataset (November 14th, 2022); hence, data up to December 14th, 2022 was used. The validation was performed by inspecting the predictions' 95% confidence interval bands.
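The feedback (recursive) forecasting loop described above can be sketched as follows. Here `predict_next` is a stand-in for the trained model's one-step prediction, not a function from the study:

```python
import numpy as np

def feedback_forecast(predict_next, last_window, horizon=30):
    """Recursive multi-step forecast: each prediction is appended to the
    input window and the oldest value is dropped, so later steps are fed
    partly -- and eventually entirely -- by the model's own outputs."""
    window = np.asarray(last_window, dtype=float).copy()
    preds = []
    for _ in range(horizon):
        y_hat = predict_next(window)            # one-step-ahead prediction
        preds.append(y_hat)
        window = np.append(window[1:], y_hat)   # slide the window forward
    return np.array(preds)
```

After 30 steps with a 30-value window, every input is a previous prediction, which is exactly why uncertainty compounds in this mode.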


Results and Discussion

Learning Curves

The training set learned the time correlations very quickly, as its loss function decreased sharply after only a few epochs (Figure 5). The validation loss, however, was noisier, showing large variations in its behavior.

Figure 5. Learning curves for the training and the validation sets – 1,500 epochs.

A deeper exploration of the reasons for the noisy validation loss showed that it stems from the significant difference between the distributions underlying the two sets.

Note that the validation set covers the period between the two great oil-price crises, which is significantly different in nature from the preceding period, when oil prices were rising gradually and more smoothly.

Nevertheless, there is no sign of model overfitting. Taking the moving average of the validation loss reveals a decreasing trend, indicating the model was still learning the correlations.

Similar behavior was observed for a neural network trained for fewer epochs (Figure 6), which means 1,500 epochs were more than enough for the model's learning.

Figure 6. Learning curves for the training and the validation sets – 75 epochs.

Note that other possible reasons for the validation loss instability are a training batch that is too small (not verified, since the effect of the batch-size hyperparameter was not explored) or a model that is too complex for the problem (unlikely, since the learning curves showed no evidence of overfitting).


Prediction of the Next-Day Value

After training the final 75-epoch neural network, it was validated against the entire dataset, which preserves the time dependence between samples across all sets. The proposed RNN structure can predict the next day's oil barrel price with good accuracy and low bias (Figure 7).

Figure 7. Real and Predicted time series for Oil Barrel prices using the proposed RNN

The model presented very good metrics for a regression scenario: the R2 score was 0.9971, meaning the model explained over 99% of the target variable's variance.

The x-axis of Figure 7 is expressed in working days after the first date of the dataset (May 20th, 1987). Note that even the sharp drops in oil prices in 2008 (the Oil Shock, at approximately 5,500 working days) and in 2020 (the COVID-19 pandemic, at approximately 8,500 working days) were successfully predicted by the proposed algorithm, which starts to answer our hypothesis question.

In addition, the mean absolute error (MAE) was USD 1.36, which indicates the expected magnitude of a future prediction error. The error analysis of this algorithm (data not shown) found that 95% of the prediction errors fall between USD -1.19 and USD +4.29. The error distribution was approximately symmetric and can be treated as approximately normal. It also revealed a slight bias of USD 0.95, meaning the algorithm tends to overestimate oil prices by almost USD 1. At a 99% confidence level, the errors fall between USD -2.97 and USD +6.14.

Considering the entire range of variation of oil prices (USD 134.85), the maximum error observed within the confidence boundaries (USD 6.14) represents 4.5 % in relative terms, which is considered a significantly good result.
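The bias, MAE, and empirical confidence bounds reported above can be computed from the residuals as sketched below; the synthetic numbers in the usage are illustrative, not the study's actual residuals:

```python
import numpy as np

def error_summary(y_true, y_pred, conf=0.95):
    """Bias, MAE, and an empirical confidence interval of the residuals.
    Positive errors mean the model overestimates the price."""
    errors = np.asarray(y_pred) - np.asarray(y_true)
    alpha = (1.0 - conf) / 2.0
    lo, hi = np.quantile(errors, [alpha, 1.0 - alpha])
    return {
        "bias": errors.mean(),          # systematic over/underestimation
        "mae": np.abs(errors).mean(),   # expected error magnitude
        "ci": (lo, hi),                 # empirical (conf*100)% error band
    }
```

Using empirical quantiles rather than a normal approximation keeps the bounds honest even when the residual distribution is only approximately symmetric.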

The results indicate that if we continuously collect real data from the oil prices, the algorithm is capable of predicting the next day reliably. This is technically feasible through the implementation of a data collection and preprocessing pipeline, which is a very common situation in the present analytics engineering field.


Feedback Forecast

If the algorithm is applied in a feedback form to forecast longer times, then it is limited to about 13 days (Figure 8) – for a 95 % confidence level.

Figure 8. Application of feedback forecast for the next 30 days – confidence band of 95 %

Note in Figure 8 how the real oil prices followed a fairly constant trend over the 30 days after the last timestamp of the dataset (November 15th to December 14th, 2022), while the LSTM's predictions followed a decreasing line. After day 13, the real data points fall outside the 95% confidence band. Raising the confidence level to 99% makes the predictions reliable for about 20 days ahead (Figure 9).

Figure 9. Application of feedback forecast for the next 30 days – confidence band of 99 %

Additionally, note that the prediction errors for future days are significantly higher than those observed for training, validation, and test data under single-day forecasting. For instance, on day 12 of Figure 8, the predicted value was about USD 75 while the true value was close to USD 85, an offset of about 13%. The farther we go into the future, the worse it gets: on day 20 of Figure 9, the prediction was approximately USD 58, while the true value was almost USD 80, a 38% offset.

The reason behind this behavior is error propagation. Since each prediction replaces real data as input for the next prediction, its associated uncertainty is carried along as well. The model becomes biased toward its own predictions as predicted values make up an ever-larger share of the input vector. This also smooths the predictions, diminishing the model's ability to capture the real variance of the data.

This means the model should be retrained periodically, so it can learn new correlations as new data arrives in storage. This is a very common issue in machine learning: data evolves over time in ways the historical dataset could not anticipate, a phenomenon known as data drift.

Fortunately, there are ways to overcome such challenges. In modern machine-learning deployment architectures, engineers can create triggered retraining actions, based on a fixed schedule or on a particular event such as the relative prediction error exceeding a threshold. This helps the model stay up to date as new data is generated.

It's important to highlight that the observed results depend heavily on the neural network architecture, training plan, and preprocessing steps. However, whenever predictions are used as new inputs, error propagation occurs, and the recommendations above apply regardless.


Conclusions and Future Works

This study showed an example of how to apply LSTM models to forecast the Brent Oil Prices time series. LSTM neural networks can be used to forecast critical changes in strategic asset valuations, provided the right data conditions and quality are met. The study explored two possibilities. The first assumes online access to real data, so the input variables are always composed of real values; in this case, prediction performance was very good, even through the substantial variations of the crisis periods. The second assumes no immediate access to real data, so the model must feed on its own predictions to forecast longer periods; in this case, the model's applicability is limited by error propagation. In either case, present-day machine learning and data engineering tools are enough to provide the necessary updates and retraining.

Future works include the exploration of new architectures of LSTM nets, the exploration of the effects of each hyperparameter, and the analysis of different types of forecasting procedures. Nevertheless, the study showed there’s potential for applying neural networks in the strategic planning of chemical process operations.


References

Kaggle. Brent Oil Prices dataset. https://www.kaggle.com/datasets/mabusalah/brent-oil-prices. Accessed July-August 2023.

Hamilton, James D. Causes and Consequences of the Oil Shock of 2007-08. National Bureau of Economic Research, Cambridge, MA, 2009.

World Health Organization website. https://www.who.int/. Accessed July-August 2023.
