Prediction of Oil Production by Applying Machine Learning on Volve Field Production Data
Jaiyesh Chahar
Machine Learning Specialist | AI Products and Solutions, Deep Learning | IIT(ISM) Dhanbad
For a production engineer, one of the most important tasks is to estimate oil production from operational parameters such as bottomhole pressure, tubing differential pressure, and wellhead pressure, either using nodal analysis or empirical correlations. But most of these workflows assume some form of the physics, which itself involves many assumptions. So, in this article we will try a totally different approach, the data-driven approach, where the computer learns the physics purely from data. No assumptions, no complex physics, just pure data. Using daily production data of the Volve field, we will apply linear and polynomial regression to build models and predict oil production.
Volve is a hydrocarbon reservoir located in the Norwegian North Sea; it produced from 2008 to 2016 and achieved a good 54% recovery factor. The Volve dataset is the most complete open-source exploration and production dataset available: Equinor disclosed all field data so that students and researchers can study it and use it for new perspectives. You can find the Volve production data in my repository (linked in the references below).
Workflow
The workflow followed here, importing the data, exploratory data analysis, preprocessing, scaling, feature selection, and model building, is not specific to this project: you can follow the same steps for any end-to-end machine learning model.
1. Importing Data
First, we import three important libraries: NumPy, Pandas, and Matplotlib. We use the Pandas library to import the production data and create a DataFrame.
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt

newdf = pd.read_excel('Volve production data.xlsx')
The dataset contains the date of production, average downhole pressure, average downhole temperature, average tubing differential pressure, average annulus pressure, average choke size, average wellhead pressure, average wellhead temperature, oil volume, gas volume, water volume, type of flow (production or injection), and well type (oil producer or water injector). The production data was recorded on a daily basis.
2. Exploratory Data Analysis
The most important and time-consuming part of any machine learning project is exploratory data analysis. It is our first direct encounter with the data and our first step towards understanding it: we perform initial investigations to discover patterns, spot anomalies, and check assumptions with the help of summary statistics and visualization.
Start by getting information about the dataset using the info() function of pandas:
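newdf.info()  # column names, non-null counts per column, and dtypes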
From this we learn that some columns contain null values, which we need to impute before training our model.
Next, we will do the analysis per well, so we get the count of observations per well by using the value_counts() function on the NPD_WELL_BORE_CODE column:
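newdf['NPD_WELL_BORE_CODE'].value_counts()  # number of daily records per wellbore code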
From this we learn that the data contains wells with codes 5693, 5769, 5599, 5351, 7078, 7289, and 7405. Next, we make a dataframe for each well by slicing the data, and then get the information per well, along the lines of the sketch below:
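A minimal sketch of the per-well slicing (the well_codes and wells variable names are illustrative, not the article's original code):

# Illustrative sketch: slice the data into one dataframe per well
well_codes = [5693, 5769, 5599, 5351, 7078, 7289, 7405]
wells = {code: newdf[newdf['NPD_WELL_BORE_CODE'] == code] for code in well_codes}
for code, well_df in wells.items():
    print(code)
    well_df.info()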
We will use the Empirical Cumulative Distribution Function (ECDF) to plot the oil production of every well in order from least to greatest, and see the distribution of oil production per well.
def ecdf(data):
    n = len(data)
    x = np.sort(data)
    y = np.arange(1, n + 1) / n
    return x, y
The code block above defines the ECDF. Now we compute the ECDF of oil production per well and plot them, as sketched below.
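A minimal plotting sketch, assuming the wells dictionary from the slicing sketch above:

# Illustrative sketch: ECDF of oil production for each well
plt.figure(figsize=(10, 6))
for code, well_df in wells.items():
    x, y = ecdf(well_df['BORE_OIL_VOL'].dropna())
    plt.plot(x, y, marker='.', linestyle='none', label=str(code))
plt.xlabel('Bore oil volume')
plt.ylabel('ECDF')
plt.legend()
plt.show()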
From the ECDF plots of oil production, it can be seen that in Well 1 about 40% of the data points have zero bore oil production, and in Well 5 about 20% have zero bore oil production. These are not NA values; they are genuine zero-production points (the well is not flowing). So we will not use these two wells for model training. It can also be seen that the ECDF of Well 6 is empty, and Well 7 has missing values as well, which suggests that these wells are injectors. So we will plot the ECDF of the water injection volume for these wells.
By the above analysis, it is clear that Well 6 is an injector well, and Well 7 is a producer as well as an injector.
The next step is to draw boxplots of the features for each wellbore, to properly understand the statistics of the dataset and the distribution of the features; a sketch follows below.
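A minimal sketch of the per-wellbore boxplots (the feature_cols list is illustrative; the article's own plotting code is in the linked repository):

# Illustrative sketch: boxplot of each feature grouped by wellbore code
feature_cols = ['AVG_DOWNHOLE_PRESSURE', 'AVG_DOWNHOLE_TEMPERATURE', 'AVG_DP_TUBING',
                'AVG_ANNULUS_PRESS', 'AVG_CHOKE_SIZE_P', 'AVG_WHP_P',
                'AVG_WHT_P', 'DP_CHOKE_SIZE', 'BORE_OIL_VOL']
for col in feature_cols:
    newdf.boxplot(column=col, by='NPD_WELL_BORE_CODE', figsize=(10, 4))
    plt.show()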
From the boxplots it is clear that the data is highly skewed and varies from well to well, so we cannot use the mean method when dealing with the null values.
Note: Not all boxplots are shown above; you can find the code for the whole exploratory data analysis in my GitHub repository (linked in the references below).
3. Data Preprocessing
After the data analysis, we need to preprocess the data. Some common issues to handle are:
- Missing values (NaN: null or Not-a-Number entries).
- Strings stored in a numerical column (and vice versa).
- Negative values stored in porosity or permeability columns, etc.
- Theoretically incorrect values.
Here we only have to deal with NaN/missing data, and as seen from the boxplots, the data is highly skewed and depends on the well, so forward filling is used. With skewed data we cannot fill the missing values with the mean. In forward filling, a null value is filled with the last valid value above it.
In the next article we will use interpolation and fill the missing values ourselves, but for now let us use forward filling.
fill_cols = ['ON_STREAM_HRS', 'AVG_DOWNHOLE_PRESSURE', 'AVG_DOWNHOLE_TEMPERATURE',
             'AVG_DP_TUBING', 'AVG_ANNULUS_PRESS', 'AVG_CHOKE_SIZE_P',
             'AVG_WHP_P', 'AVG_WHT_P', 'DP_CHOKE_SIZE',
             'BORE_OIL_VOL', 'BORE_GAS_VOL', 'BORE_WAT_VOL']
# Forward fill: each null value takes the last valid value above it
newdf[fill_cols] = newdf[fill_cols].fillna(method='pad')
4. Data Scaling
Data Scaling/Normalization is applied as part of data preparation for machine learning. The goal of normalization is to change the values of numeric columns in the dataset to use a common scale, without distorting differences in the ranges of values or losing information. We transform the data such that the features are within a specific range [0, 1].
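Min-max scaling maps each feature to the [0, 1] range using that feature's minimum and maximum:

x_scaled = (x - x_min) / (x_max - x_min)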
# Scaling the dataset to remove differences in scale between columns
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
scale_cols = ['ON_STREAM_HRS', 'AVG_DOWNHOLE_TEMPERATURE', 'AVG_ANNULUS_PRESS',
              'AVG_CHOKE_SIZE_P', 'AVG_WHP_P', 'AVG_WHT_P']
newdf[scale_cols] = scaler.fit_transform(newdf[scale_cols])
5. Feature Selection
In this step, called feature selection, the input features are inspected and checked for their importance and their contribution to the result. Feature selection is the process of choosing, automatically or manually, the features that are correlated with the desired output and contribute to getting the prediction right. Using irrelevant features may decrease the accuracy of the model and make it learn from irrelevant parameters. Highly correlated input parameters, on the other hand, add little new information to the training process; selecting them as features may reduce model accuracy due to the lack of variation in the input data, or may even leak information about the target into the model, making it perform unrealistically well. Correlation heat-maps were generated in this work to identify the highly correlated parameters; a sketch follows below.
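A minimal sketch of the heat-map (the article does not show this code; seaborn is an assumption here):

# Illustrative sketch: Pearson correlation heat-map of the numeric columns
import seaborn as sns
plt.figure(figsize=(12, 10))
sns.heatmap(newdf.corr(), annot=True, fmt='.2f', cmap='coolwarm')  # on newer pandas, use newdf.corr(numeric_only=True)
plt.show()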
It can be seen that BORE_OIL_VOL and BORE_GAS_VOL are highly correlated, with a Pearson correlation coefficient of 0.99. Including both of these parameters would add no new information to the model, since the two are essentially linearly related.
Note: Including BORE_GAS_VOL as an input would lead to data leakage, as it is physically impossible to know the gas production before knowing the oil production volume.
ON_STREAM_HRS, AVG_DOWNHOLE_TEMPERATURE, AVG_ANNULUS_PRESS, AVG_CHOKE_SIZE_P, AVG_WHP_P, AVG_WHT_P, DP_CHOKE_SIZE, and BORE_WAT_VOL are selected as training features (eight features in total, matching the coefficients printed below).
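The train/test split itself is not shown in the article; here is a minimal sketch consistent with the variable names used below. The test_size and random_state values are illustrative assumptions; DATEPRD and NPD_WELL_BORE_CODE are kept in X and y so the test data can be sliced per well later, then dropped before training:

from sklearn.model_selection import train_test_split

feature_list = ['ON_STREAM_HRS', 'AVG_DOWNHOLE_TEMPERATURE', 'AVG_ANNULUS_PRESS',
                'AVG_CHOKE_SIZE_P', 'AVG_WHP_P', 'AVG_WHT_P',
                'DP_CHOKE_SIZE', 'BORE_WAT_VOL']
X = newdf[['DATEPRD', 'NPD_WELL_BORE_CODE'] + feature_list]
y = newdf[['NPD_WELL_BORE_CODE', 'BORE_OIL_VOL']]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

x_train_final = X_train.drop(['DATEPRD', 'NPD_WELL_BORE_CODE'], axis=1)
x_test_final = X_test.drop(['DATEPRD', 'NPD_WELL_BORE_CODE'], axis=1)
y_train_final = y_train['BORE_OIL_VOL']
y_test_final = y_test['BORE_OIL_VOL']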
6. Applying Machine Learning Algorithms
1. Linear Regression: One of the simplest kinds of supervised models is linear regression. Linear regression tries to build a linear relationship between oil production and the training features: the linear equation assigns a coefficient to each training feature, and an intercept is added to the equation.
Input instance (feature vector): x = (x0, x1, ..., xn)
Predicted output: y = w0*x0 + w1*x1 + ... + wn*xn + b
So, using LinearRegression from the scikit-learn library, we train our model on the training data and then get the R2 score by applying the trained model to the test dataset.
from sklearn.linear_model import LinearRegression

reg_all = LinearRegression()
reg_all.fit(x_train_final, y_train_final)
y_pred = reg_all.predict(x_test_final)
The R2 score is a statistical measure of how close the real data are to the fitted regression model. A score of 0 means the model explains none of the variance around the mean (it does no better than always predicting the mean), while 1 means the model explains all of the variability of the response data around its mean (it predicts the data exactly). On test data, R2 can even be negative when the model fits worse than simply predicting the mean, as we will see for one well below.
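For reference, a minimal sketch of what the score computes (this is the standard definition, not the author's code):

# R2 = 1 - SS_res / SS_tot
def r2_manual(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    ss_res = np.sum((y_true - y_pred) ** 2)           # residual sum of squares
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)  # total sum of squares
    return 1 - ss_res / ss_tot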
print("The R2 value for linear regression for oil volume production is", reg_all.score(x_test_final, y_test_final))
From the linear regression we get an R2 score of 0.554, which is fine for setting a baseline. From this linear regression model we can now write out a correlation for oil production for this field. This correlation is purely data based: no physics, no assumptions, only data.
print("The Correlation from linear model is: BORE_OIL_VOL = {:.5} + {:.5}*ON_STREAM_HRS + {:.5}*AVG_DOWNHOLE_TEMPERATURE {:.5}*AVG_ANNULUS_PRESS + {:.5}*AVG_CHOKE_SIZE_P + {:.5}*AVG_WHP_P + {:.5}*AVG_WHT_P + {:.5}*DP_CHOKE_SIZE {:.5}*BORE_WAT_".format(reg_all.intercept_, reg_all.coef_[0], reg_all.coef_[1], reg_all.coef_[2],reg_all.coef_[3], reg_all.coef_[4], reg_all.coef_[5], reg_all.coef_[6], reg_all.coef_[7]))
Now we will try to improve the model by using polynomial regression.
2. Polynomial Regression: The linear regression model assumes a linear relationship between oil production and the features. But we know the relationship is not that simple, and the linear model is underfitting. To solve this problem and get a more accurate correlation, polynomial regression comes into the picture: we convert our features into their higher-order terms and then apply linear regression to these higher-order features, as illustrated below.
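To make the expansion concrete, a tiny illustrative example (not part of the original workflow): at degree 2, two features (a, b) expand into [1, a, b, a^2, a*b, b^2].

# Illustrative example only: degree-2 expansion of two features
from sklearn.preprocessing import PolynomialFeatures
demo = PolynomialFeatures(degree=2).fit_transform(np.array([[2.0, 3.0]]))
print(demo)  # [[1. 2. 3. 4. 6. 9.]]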
We will use the PolynomialFeatures class from the scikit-learn library to convert our features into higher-order features.
from sklearn.preprocessing import PolynomialFeatures

poly_reg = PolynomialFeatures(degree=4)
X_poly = poly_reg.fit_transform(x_train_final)
# transform (not fit_transform) is enough for the test set once fitted on training data
x_pol_test = poly_reg.transform(x_test_final)
Here we converted the training and testing features to 4th-order polynomial features, since we get the highest R2 for our dataset with 4th-order polynomials; below and above 4th order, R2 decreases. PolynomialFeatures converts our 8 features into 495 higher-order features (all monomials of total degree at most 4 in 8 variables, including the bias term: C(12, 4) = 495).
Now we apply linear regression to our higher-order features, exactly as we did for the first-order features, and get the R2 score on the test data.
from sklearn.linear_model import LinearRegression

lin_reg = LinearRegression()
lin_reg.fit(X_poly, y_train_final)
y_pred = lin_reg.predict(x_pol_test)
print("The R2 value for polynomial regression (4th order) for oil volume production is",
      lin_reg.score(x_pol_test, y_test_final))
As we can see, using polynomial features has improved our R2 to 0.96, which represents a very good fit.
Next we retrieve the names of the higher-order features we used, together with their coefficients, to form a correlation for oil production. Since 495 features are used, instead of writing out a single correlation we build a dataframe of feature names and coefficients.
# Note: on newer scikit-learn versions, use get_feature_names_out instead
a = np.array(poly_reg.get_feature_names(
    ['ON_STREAM_HRS', 'AVG_DOWNHOLE_TEMPERATURE', 'AVG_ANNULUS_PRESS',
     'AVG_CHOKE_SIZE_P', 'AVG_WHP_P', 'AVG_WHT_P',
     'DP_CHOKE_SIZE', 'BORE_WAT_VOL']))
b = np.array(lin_reg.coef_)
Correlation_Poly = pd.DataFrame({'Coefficients': b, 'Feature Name': a})
Correlation_Poly
So, the dataframe of all 495 features is made with their coefficients.
Comparison between Actual Oil Production and our Models' Predicted Oil Production
We compare the actual oil production with each model's predicted oil production on each well's test data separately, using Matplotlib scatter plots.
1. Well No. - 5599
X_test_5599 = X_test[X_test["NPD_WELL_BORE_CODE"] == 5599]
y_test_5599 = y_test[y_test["NPD_WELL_BORE_CODE"] == 5599]
x_test_5599final = X_test_5599.drop(['DATEPRD', "NPD_WELL_BORE_CODE"], axis=1)
y_test_5599_final = y_test_5599['BORE_OIL_VOL']  # target column only, for scoring

y_linear = reg_all.predict(x_test_5599final)
x_pol_test_5599 = poly_reg.transform(x_test_5599final)
y_poly = lin_reg.predict(x_pol_test_5599)

plt.style.use('fivethirtyeight')
plt.figure(figsize=(14, 8))
plt.scatter(X_test_5599["DATEPRD"].tolist(), y_linear, label='Linear Regression Model predicted')
plt.scatter(X_test_5599["DATEPRD"].tolist(), y_poly, label='Polynomial Regression (degree 4) predicted', color='green')
plt.scatter(X_test_5599["DATEPRD"].tolist(), y_test_5599['BORE_OIL_VOL'], label='actual', color='orange')
plt.legend()
plt.xlabel("Year")
plt.ylabel("Bore oil volume")
plt.title('Actual v/s Model prediction for Bore Oil Volume for Well No. - 5599')

print("The R2 value for linear regression for oil volume production in well 5599 is",
      reg_all.score(x_test_5599final, y_test_5599_final))
print("The R2 value for Polynomial regression (degree 4) for oil volume production in well 5599 is",
      lin_reg.score(x_pol_test_5599, y_test_5599_final))
2. Well No. - 5351
X_test_5351 = X_test[X_test["NPD_WELL_BORE_CODE"] == 5351]
y_test_5351 = y_test[y_test["NPD_WELL_BORE_CODE"] == 5351]
x_test_5351final = X_test_5351.drop(['DATEPRD', "NPD_WELL_BORE_CODE"], axis=1)
y_test_5351_final = y_test_5351['BORE_OIL_VOL']  # target column only, for scoring

y_linear = reg_all.predict(x_test_5351final)
x_pol_test_5351 = poly_reg.transform(x_test_5351final)
y_poly = lin_reg.predict(x_pol_test_5351)

plt.style.use('fivethirtyeight')
plt.figure(figsize=(14, 8))
plt.scatter(X_test_5351["DATEPRD"].tolist(), y_linear, label='Linear Regression Model predicted')
plt.scatter(X_test_5351["DATEPRD"].tolist(), y_poly, label='Polynomial Regression (degree 4) predicted', color='green')
plt.scatter(X_test_5351["DATEPRD"].tolist(), y_test_5351['BORE_OIL_VOL'], label='actual', color='orange')
plt.legend()
plt.xlabel("Year")
plt.ylabel("Bore oil volume")
plt.title('Actual v/s Model prediction for Bore Oil Volume for Well No. - 5351')

print("The R2 value for linear regression for oil volume production in well 5351 is",
      reg_all.score(x_test_5351final, y_test_5351_final))
print("The R2 value for Polynomial regression (degree 4) for oil volume production in well 5351 is",
      lin_reg.score(x_pol_test_5351, y_test_5351_final))
3. Well No. - 7078
X_test_7078 = X_test[X_test["NPD_WELL_BORE_CODE"] == 7078]
y_test_7078 = y_test[y_test["NPD_WELL_BORE_CODE"] == 7078]
x_test_7078final = X_test_7078.drop(['DATEPRD', "NPD_WELL_BORE_CODE"], axis=1)
y_test_7078_final = y_test_7078['BORE_OIL_VOL']  # target column only, for scoring

y_linear = reg_all.predict(x_test_7078final)
x_pol_test_7078 = poly_reg.transform(x_test_7078final)
y_poly = lin_reg.predict(x_pol_test_7078)

plt.style.use('fivethirtyeight')
plt.figure(figsize=(14, 8))
plt.scatter(X_test_7078["DATEPRD"].tolist(), y_linear, label='Linear Regression Model predicted')
plt.scatter(X_test_7078["DATEPRD"].tolist(), y_poly, label='Polynomial Regression (degree 4) predicted', color='green')
plt.scatter(X_test_7078["DATEPRD"].tolist(), y_test_7078['BORE_OIL_VOL'], label='actual', color='orange')
plt.legend()
plt.xlabel("Year")
plt.ylabel("Bore oil volume")
plt.title('Actual v/s Model prediction for Bore Oil Volume for Well No. - 7078')

print("The R2 value for linear regression for oil volume production in well 7078 is",
      reg_all.score(x_test_7078final, y_test_7078_final))
print("The R2 value for Polynomial regression (degree 4) for oil volume production in well 7078 is",
      lin_reg.score(x_pol_test_7078, y_test_7078_final))
As you can see, the R2 score for linear regression in Well 7078 is negative, which means the linear regression model fits worse than a horizontal line at the mean value. But even in this case, after the polynomial transformation of the features, we get a good R2 score of 0.70.
Conclusion
Prediction of bore oil volume from production parameters is one of the most important tasks for a production engineer. The conventional way to do this is to use empirical correlations, which carry many assumptions and are derived from specific fields, so they may not be valid for other fields. If we have sufficient production data from a field, we can instead derive DATA BASED CORRELATIONS for that field using machine learning algorithms. These correlations are helpful for predicting the bore oil volume of a new well in that field at various production parameters, and they carry no physical assumptions: just PURE DATA and STATS.
The models can be improved further with more feature engineering and by trying other algorithms such as neural networks and support vector machines.
Reference and Links
- Masoud Safari Zanjani, Mohammad Abdus Salam, and Osman Kandara, "Data-Driven Hydrocarbon Production Forecasting Using Machine Learning Techniques," Department of Computer Science, Southern University, Baton Rouge, Louisiana, USA.
- Links to code:
- Link to code of Exploratory Data Analysis - https://github.com/jaiyesh/Machine-Learning/blob/main/firstnb_scratchpafinald.ipynb
- Link to code of the Machine Learning Models - https://github.com/jaiyesh/Machine-Learning/blob/main/Prediction%20of%20Bore%20Oil%20Volume%20.ipynb
- Link to Volve Production Data - https://github.com/jaiyesh/M.Tech-Project/blob/master/Volve-2/Volve%20production%20data.xlsx
- Link to my Repository on Oil and Gas Data Analysis Projects - https://github.com/jaiyesh/Gold-from-Jurrasic