Data science and machine learning in petroleum geostatistics - Part 1
"A journey of a thousand miles begins with a single step"
Lao Tzu
Introduction
We start this topic with a simple problem taken from the book Applied Geostatistics for Reservoir Characterization, written by Mohan Kelkar and Godofredo Perez. It is on page 18 in Chapter 2, Principles of Statistics. By solving this problem, you will learn the most elementary lesson about how to interact with data in Microsoft Excel and Python. At the same time, you will also understand and know how to calculate frequency and plot a histogram. These are the most fundamental concepts in statistics and data science. In addition, although it is assumed that you already have some knowledge of descriptive statistics, I will also review the basic concepts and how to calculate them in the problem-solving method section. Now let's begin. ?
Problem 1: Analyze porosity values using a frequency-distribution analysis
????The following porosity samples are measured in a wellbore: 0.141, 0.124, 0.152, 0.156, 0.113, 0.167, 0.194, 0.142, 0.133, 0.149, 0.106, 0.137, 0.147, 0.159, 0.174, 0.129, 0.153, 0.173, 0.189, 0.16, 0.193, 0.156, 0.149, 0.135, 0.145, 0.171, 0.101, 0.151, 0.176, 0.191, 0.121, 0.148, 0.153, 0.171, 0.183, 0.108, 0.123, 0.169, 0.185, 0.153, 0.117, 0.127, 0.145, 0.141, 0.165, 0.14, 0.143, 0.178, 0.179, 0.157.
??? Analyze these porosity, ?, values using a frequency-distribution analysis.
Problem-solving method
For the convenience of comparing the calculation results, I will follow the problem-solving method in the book cited above. We also divide the values into 5 classes with the following range:
1. 0.10 < ? ≤ 0.12
2. 0.12 < ? ≤ 0.14
3. 0.14 < ? ≤ 0.16
4. 0.16 < ? ≤ 0.18
5. 0.18 < ? ≤ 0.2
Next, we summarize the basic concepts related to frequency and how to calculate them.
??Class intervals: the range of the data is divided into intervals. For this problem, we have five class intervals.
??Class frequency, fi: the number of measurements falling within a particular class, i. For example, the class 5 has six measurements falling within a range of 0.18 < ? ≤ 0.2.
??Relative frequency, fRi
where,
n = total number of samples. In this problem, we have 50 samples.
For example, the class 2 has nine measurements falling within a range of 0.12 < ? ≤ 0.14 (fi = 9). As a result, the relative frequency of class 2 is 9/50, or 0.18.?
Note that the sum of all relative frequencies is one.
??Cumulative relative class frequency, Fj
where,
Fj = cumulative frequency for class j that results from the sum of all the relative frequencies from class 1 to class j.
Problem-solving by using Microsoft Excel
In this section, we will learn how to use Microsoft Excel, a world-famous spreadsheet developed by Microsoft, to solve this problem. Through this work, we'll learn some useful little tips and tricks on how to work with data.
The first is data entry. With the usual way of thinking, we will enter each porosity value into each cell according to a certain column. But this is quite time-consuming. The fastest way is to take advantage of some tricks in Excel. Here, I would like to give you step-by-step instructions so that you can input data quickly and solve the problem.
Step 1: Enter the data in column form.
Copy all porosity values. Open a new Excel file, in cell A1, right-click and select Match Destination Formatting (M) in Paste Special.
The result in Excel is shown as follows:
In the Data tab, select Text to Columns. Click Delimited.
Then, in the box, click Comma to separate values into cells. Click Next and Finish.
The result in Excel is shown as follows:
Use the key combination Ctrl+A and Ctrl+C to select and copy all porosity values in cells. In cell A2, right-click and select Paste Special.
In Paste Special box, click Transpose.
All porosity values from Row 1 are transposed to one column. The result in Excel is shown as follows:
Right-click Row 1 and select Delete.
Now, all porosity values are in column A.
领英推荐
Step 2: Prepare the spreadsheet and use “Define Name”.
Before starting to demonstrate the implementation of Step 2, I have a few words about a way of thinking and working in Excel. That is naming a cell or array as a variable in Excel with Define Name. With this approach, we can save a lot of time checking cell references and almost completely eliminate the use of the F4 key to fix cells or arrays. For example, we can count the total number of samples, n, by using “Define Name”.
Select all porosity values in cells, right-click, and select Define Name.
Enter the text "Data" into the Name cell and click OK.
The result in Excel is shown as follows. You can see “Data” in the “Name Box”.
Enter n in B1 cell. In C1 cell, enter the function:
=count(Data)
The result is shown to be that n, the total number of samples, is equal to 50.
In C1 cell, right-click, and select Define Name. Click OK.
?You can see “n” in the “Name Box”.
Note that we have 5 classes divided into intervals of values as mentioned above. Prepare the spreadsheet as follows:
Step 3: Calculate the parameters related to frequency.
In F2 cell, enter the function:
=COUNTIFS(Data,">0.1",Data,"<=0.12")
The result is shown to be that fi1, the class 1 frequency, is equal to 5.
Do the same for cells F3, F4, F5, and F6, we have the values fi2, fi3, fi4, and fi5 of 9, 20, 10, and 6, respectively.
Set the variable names from cells F2 to F6, respectively, fi1_, fi2_, fi3_, fi4_, and fi5_. Now you can calculate the relative class frequency.
In G2 cell, enter the function:
=fi1_/n
The result is shown to be that fRi1, the class 1 relative frequency, is equal to 0.1.
Do the same for cells G3, G4, G5, and G6, we have the values fRi2, fRi3, fRi4, and fRi5 of 0.18, 0.4, 0.2, and 0.12, respectively.
Set the variable names from cells G2 to G6, respectively, fRi1_, fRi2_, fRi3_, fRi4_, and fRi5_. Now you can calculate the cumulative relative class frequency.
In H2 cell, enter the function:
=SUM(fRi1_)
The result is shown to be that Fi1, the class 1 cumulative relative frequency, is equal to 0.1.
In H3 cell, enter the function:
=SUM(fRi1_,fRi2_)
The result is shown to be that Fi2, the class 1 to class 2 cumulative relative frequency, is equal to 0.28.
Do the same for cells H4, H5, and H6. By adding cumulatively, we have the values Fi3, Fi4, and Fi5 of 0.68, 0.88, and 1.0, respectively.
Step 4: Plot the relative class frequency and cumulative relative class frequency.
For convenient plotting of relative class frequency and cumulative relative class frequency, you can copy the Class and fRi?columns into columns J and K and the Class and Fi?columns into columns M and N.
Select array J2:K6. In the Insert tab, select Clustered column.
You can edit the title plot name with “Relative class frequency”. The result is shown as follows:
Do the same for the cumulative relative class frequency plot, and the result is shown as follows:
Part 1 ends here. I hope you enjoy it, and the first drum is not so low. I will upload this spreadsheet to my personal Github page. You can download it to practice if you want. If the article has errors or omissions, I hope to receive feedback and suggestions to make it better. I'll see you in the next post with a solution to Problem 1 using Python.
Note: You are free to share the content, but please cite the source. Thank you so much for this kindness.