Application of Data Mining in Oil & Gas Facilities
Khaled AlHamouri, Ph.D.
Authors: Dr. Vassilina Demetracopoulou & Dr. Khaled AlHamouri, Ph.D.
1.0 Problem Description & Background
Labor productivity is an area of great importance in the construction industry. Although many researchers have attempted to identify labor productivity trends over the past years and the principal factors behind low productivity, it remains a highly disputed area. The construction industry serves many sectors, such as infrastructure, industrial, and oil and gas, and accounts for a significant part of most countries’ GDP. Thus, optimizing productivity and efficiency is a priority in every market. The projects analyzed here are oil and gas facilities located in Brazil, where the oil and gas sector represents 4.6% of the Gross Domestic Product according to IBGE and the Brazilian Central Bank. In view of this, our scope of work consists of analyzing productivity data from the construction of such facilities. More specifically, our data concern pile installation in oil and gas facilities and represent the productivity per pile. The activities whose productivity is measured are, in sequence, drilling, pouring concrete and inserting the steel cage. The main objective of the data collection is to estimate the schedule accurately from the respective productivity. To estimate the schedule, the data that need to be accurately measured include:
· Number of piles
· Length of piles
· Diameter of piles
· Number of equipment
· Resources
· Productivity unit rate [man-hours/meter of pile]
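As a rough illustration of how the inputs listed above could combine into a schedule estimate, the sketch below converts a unit rate into a duration. The formula, the function name `estimate_duration_days`, the 8-hour working day and all sample values are assumptions made for illustration; the source does not state how the schedule is actually computed.

```python
def estimate_duration_days(n_piles, pile_length_m, unit_rate_mh_per_m,
                           crew_size, n_rigs, hours_per_day=8.0):
    """Hypothetical schedule estimate from a productivity unit rate.

    Assumes each piece of equipment (rig) is staffed by one crew of
    `crew_size` workers and that all rigs work in parallel.
    """
    # Total effort: piles x meters per pile x man-hours per meter.
    total_man_hours = n_piles * pile_length_m * unit_rate_mh_per_m
    # Effort the site can deliver in one working day.
    man_hours_per_day = crew_size * n_rigs * hours_per_day
    return total_man_hours / man_hours_per_day


# Illustrative values only, not taken from the data set.
days = estimate_duration_days(n_piles=200, pile_length_m=20,
                              unit_rate_mh_per_m=0.05,
                              crew_size=4, n_rigs=2)
print(round(days, 2))
```

Under these assumptions a lower unit rate shortens the estimate proportionally, which is why an accurate unit rate matters for the schedule.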
The scope is limited to the productivity of pile installation, but the analysis can provide insights that help in modeling and predicting overall construction productivity and, ultimately, an accurate schedule estimate. As mentioned, predicting labor productivity is directly linked to accurate schedule estimates and consequently to timely turnover to the stakeholders and earlier cash flows from facility operation. The problem becomes increasingly significant in a market like Brazil, where investments in the oil and gas industry from 2015 to 2019 are estimated at $130 billion.
2.0 Framework of Analysis
To efficiently analyse the data and acquire meaningful results, a framework of analysis has been established. The analysis will consist of 4 main steps:
· Data Exploration
· Data Pre-processing
· Data Mining
· Conclusions and recommendations
3.0 Data Exploration
3.1 Needs and Objectives
Understanding the numerous factors that affect craft productivity is typically seen as a difficult task. Although the current body of knowledge offers various ways to quantify the loss of labor efficiency, the proposed methods and procedures have fallen short in different respects. A comprehensive understanding of the relationships between the productivity unit rate and the numerous factors that affect it is a must. This report explores the data set at hand and applies different data mining techniques, which may improve understanding of the interconnections among the variables. That understanding can then be used to develop more sophisticated techniques and to provide recommendations for project controls and productivity measurement.
The main goals of the analysis are to (a) recognize patterns and provide valuable insights on pile productivity, (b) evaluate the effectiveness of the different data mining algorithms run in the data mining software Weka 3.8, and (c) identify the most meaningful relationships among the data and suggest improvement strategies based on the findings and the existing literature.
3.2 Target Data Set
The data set at hand was collected over several years in Brazil. Labor productivity related to certain piling activities was measured at 26 different construction sites. The construction sites have very similar characteristics (e.g. phase, size and typology). The projects were controlled by the same owner organization, with identical execution procedures and similar quality, health, environmental and safety demands. The acquired data initially comprised 91608 instances; after several data preprocessing tasks (refer to the Data Preprocessing section for a detailed description), the data set size was reduced significantly. The data set details are provided in the following table.
Table 1. Data Set Details
It must be noted that the acquired data set could not be mined immediately. It required significant pre-processing effort to prepare the data for the suggested techniques and for an analysis that can generate useful conclusions and recommendations.
3.3 Summary Statistics
To better understand a data set of that size, we extracted basic information about the attribute types and the attributes’ means and ranges. These basic summary statistics allowed us to familiarize ourselves with the data and to observe patterns and relationships that could be analyzed with the appropriate data mining techniques. As discussed, the analysis is limited to direct work: drilling and pouring concrete. These activities have different properties, so even in this exploratory phase we examined them separately.
One of the first tasks was to calculate the Productivity Unit Rate (Work Hours / Pile Length). This ratio implies that the two variables are proportional, and we wanted to check the validity of that assumption.
As seen in the figures, representing productivity as a numeric unit rate entails some inaccuracy, since a single metric for productivity is not easily assigned. When productivity is expressed as a class rather than as a numeric figure, the results better reflect reality and can be handled more easily.
3.4 Visualization
This section presents the initial ideas and graphs we used to identify meaningful patterns and explore potential interrelations between attributes. Some of the observations made at this stage were confirmed by the association rule mining.
Some conclusions that can be drawn from this graph:
· The productivity unit rate for drilling is higher than the productivity unit rate for concrete at all times.
· The productivity unit rate has some ups and downs during the day, which are more evident for drilling.
· The mean unit rate per hour can indicate a more suitable way to categorize the working times within a day.
Of course, more thorough analysis is needed. Some of the challenges include:
· Not all data will be used after the pre-processing and cleaning, so these trends may be somewhat different.
· Each hour had a different number of measured instances.
· This graph doesn’t differentiate between other attributes like the pile diameter and crew size.
This graph was created to investigate the relationship between the crew size (1-5) and the productivity unit rate. Some conclusions that can be drawn are:
· The smallest crew has an overall higher productivity than larger crews (for the productivity unit rate, the smaller the number, the higher the productivity).
· When the crew consists of 2 craft workers, the productivity has less variability and is maintained at high levels.
· The trends identified are similar for both the concrete and the drilling activities.
Similar limitations apply as before since:
· The number of instances for crews of 3 and 4 is significantly greater than for crews of 1, 2 and 5.
· Although the highest mean productivity is observed for crews of 2, for concrete both the crews of 3 and 4 have a number of instances with higher productivity.
3.5 Selection
From the initial data set, some attributes were disregarded from the beginning, e.g. the “Comments” attribute, which could provide some valuable information but wouldn’t be useful in the data mining analysis. Other attributes, like the “Pile ID”, were used during the pre-processing phase to fill in missing values and were then deleted, since they didn’t provide any meaningful information for the analysis.
In this stage we didn’t delete any attribute; through the exploration and the basic statistics alone we could identify which attributes wouldn’t contribute to the understanding of the problem. For example, the proximity factors, like the proximity to the entrance or cafeteria, seemed likely to correlate with productivity, but in some cases their values were so unevenly distributed that one value dominated the analysis and provided no valuable insights.
4.0 Data Preprocessing
The acquired data set had a variety of quality problems and was not in a form suitable for data mining, which mandated several data preprocessing tasks. The following section provides an overview of the preprocessing tasks conducted.
4.1. Data Integration: Different sources of data
The data were collected from 26 different sites. A database that includes the data from the 26 sites was created. The total number of records after the integration is 91608 instances.
4.2. Data Reduction
Sampling
The original data set amounts to 91608 records. To reduce the time and effort needed for data preprocessing and analysis, a representative sample was selected. First, the data set was reduced through a filtering approach: only the instances related to direct work were selected (i.e. the instances whose work status is either Drilling or Concrete Pouring). This reduced the data set size from 91608 to 33032 instances.
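The direct-work filter described above can be sketched in a few lines. The field names (`pile_id`, `work_status`) and the non-direct statuses in the sample rows are hypothetical; only the two direct-work status values come from the text.

```python
# Work statuses that count as direct work, as named in the text.
DIRECT_WORK = {"Drilling", "Concrete Pouring"}


def select_direct_work(records):
    """Keep only the records whose work status is a direct-work activity."""
    return [r for r in records if r["work_status"] in DIRECT_WORK]


# Tiny made-up sample; "Waiting" and "Travel" are hypothetical statuses.
records = [
    {"pile_id": "P1", "work_status": "Drilling"},
    {"pile_id": "P1", "work_status": "Waiting"},
    {"pile_id": "P1", "work_status": "Concrete Pouring"},
    {"pile_id": "P2", "work_status": "Travel"},
]
direct = select_direct_work(records)
print(len(records), "->", len(direct))
```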
Second, a further sample was drawn from this reduced set to cut the time and effort needed to preprocess and analyze the data. This sampling took into consideration the distribution of the data with regard to several factors to ensure that the sample is representative (i.e. it has approximately the same properties of interest as the original data set). The second sampling ensured that the data: represent the 26 different construction sites equally; include sufficient instances from each month of the year; include sufficient instances from each year of the study (i.e. 2011 through 2015); include sufficient instances for each diameter size (i.e. 40, 50, and 60 cm); include sufficient instances for each pile length category; and include sufficient data for all contractors and subcontractors involved in the 26 projects (i.e. 21 contractors and 12 subcontractors). This eliminates bias and avoids domination of the analysis by specific attribute categories.
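One common way to obtain a sample that covers every group is stratified sampling. The sketch below draws the same number of records from each stratum; the equal allocation, the choice of strata (site and diameter only), the field names and the sample data are all assumptions, since the text does not describe the exact sampling procedure.

```python
import random
from collections import defaultdict


def stratified_sample(records, key, per_stratum, seed=0):
    """Draw up to `per_stratum` records at random from each stratum.

    `key` maps a record to its stratum; a fixed seed keeps the draw repeatable.
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for record in records:
        strata[key(record)].append(record)
    sample = []
    for stratum in sorted(strata):
        members = strata[stratum]
        k = min(per_stratum, len(members))  # small strata contribute all rows
        sample.extend(rng.sample(members, k))
    return sample


# Made-up data: 3 sites x 3 diameters x 10 rows each.
records = [{"site": s, "diameter_cm": d, "row": i}
           for s in ("S01", "S02", "S03")
           for d in (40, 50, 60)
           for i in range(10)]
sample = stratified_sample(records,
                           key=lambda r: (r["site"], r["diameter_cm"]),
                           per_stratum=4)
print(len(sample))
```

Further stratification variables (month, year, length category, contractor) could be added to the key in the same way, at the cost of smaller strata.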
Finally, a total of 6 data subsets were generated to reflect the three diameter sizes (i.e. 40, 50, and 60 cm). Each diameter has two subsets (one for concrete and one for drilling) to avoid handling the productivity of two different activities (drilling and concrete casting) with the same parameters (i.e. quartiles and outliers), which would lead to misleading conclusions.
4.3. Data Transformation:
Aggregation
As seen in the following tables, the data were collected individually for each craft worker, so the productivity was initially calculated for each worker separately. However, this does not represent the productivity of each pile, because several craft workers work together to install each pile. The crew productivity must therefore be calculated rather than the productivity of each individual worker. Thus, the workers’ productivity values were aggregated for each crew per activity. Rows 1, 2, and 3 in Table 2 were aggregated into row 1 in Table 3. Similarly, rows 4, 5, and 6 in Table 2 were aggregated into row 2 in Table 3. This aggregation not only expressed the productivity values correctly but also reduced the size of the data set further, which reduces the time and effort required to preprocess and analyze the data.
Table 2 Aggregation Example 1
Table 3 Aggregation Example 1
Similarly, rows 1 through 6 in Table 4 were aggregated into row 1 in Table 5, and rows 7, 8, and 9 in Table 4 were aggregated into row 2 in Table 5.
Table 4 Aggregation Example 2
Table 5 Aggregation Example 2
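The aggregation of worker rows into crew rows can be sketched as a group-by on pile and activity. The field names and sample values are hypothetical, and the sketch assumes one input row per worker per activity, so that counting rows gives the crew size.

```python
from collections import defaultdict


def aggregate_crews(worker_rows):
    """Sum per-worker hours into one row per (pile, activity) pair."""
    totals = defaultdict(lambda: {"work_hours": 0.0, "crew_size": 0})
    for row in worker_rows:
        key = (row["pile_id"], row["activity"])
        totals[key]["work_hours"] += row["hours"]
        totals[key]["crew_size"] += 1  # one row per worker (assumption)
    return {key: dict(value) for key, value in totals.items()}


# Made-up example: three workers drill pile P1, then pour its concrete.
worker_rows = [
    {"pile_id": "P1", "activity": "Drilling", "worker": "W1", "hours": 0.5},
    {"pile_id": "P1", "activity": "Drilling", "worker": "W2", "hours": 0.5},
    {"pile_id": "P1", "activity": "Drilling", "worker": "W3", "hours": 0.5},
    {"pile_id": "P1", "activity": "Concrete Pouring", "worker": "W1", "hours": 0.25},
    {"pile_id": "P1", "activity": "Concrete Pouring", "worker": "W2", "hours": 0.25},
    {"pile_id": "P1", "activity": "Concrete Pouring", "worker": "W3", "hours": 0.25},
]
crew_rows = aggregate_crews(worker_rows)
for key, value in crew_rows.items():
    print(key, value)
```

Six worker rows collapse into two crew rows here, which mirrors the reduction in data set size noted in the text.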
Transforming Attributes and Attributes Construction
Date Attribute
The acquired data set had the date in mm/dd/yyyy format. The authors transformed the format and created two different attributes (i.e. Month and Year). The Month attribute values were entered as January, February, etc. The Year attribute values were entered as Y2011, Y2012, etc.
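A minimal sketch of this date transformation follows. The function name `split_date` is hypothetical, and the month names are hard-coded so that the output does not depend on the system locale.

```python
from datetime import datetime

MONTH_NAMES = ("January", "February", "March", "April", "May", "June",
               "July", "August", "September", "October", "November",
               "December")


def split_date(date_text):
    """Turn an 'mm/dd/yyyy' string into nominal Month and Year values."""
    parsed = datetime.strptime(date_text, "%m/%d/%Y")
    return MONTH_NAMES[parsed.month - 1], f"Y{parsed.year}"


month, year = split_date("03/15/2012")
print(month, year)
```

Prefixing the year with "Y" keeps tools from treating it as a number, which matters for techniques that expect nominal attributes.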
Work Hours Attribute
The Work Hours attribute was created by combining two different attributes (i.e. Starting Time and Ending Time). The starting and ending times were in Hours:Minutes format. For example, a starting time of 13:10 and an ending time of 13:16 indicate 6 minutes of work, which is equivalent to 0.1 hours. The value entered in the Work Hours attribute is this duration multiplied by the number of craft workers involved in that timeframe. Assuming a crew of 3 craft workers, this results in a total value of 0.3 hours.
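The calculation in this example can be written as a short function. The function name is hypothetical, and the sketch assumes that both times fall on the same day.

```python
from datetime import datetime


def crew_work_hours(start, end, crew_size):
    """Elapsed time between two 'HH:MM' stamps, in hours, times crew size."""
    fmt = "%H:%M"
    elapsed = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return elapsed.total_seconds() / 3600 * crew_size


work_hours = crew_work_hours("13:10", "13:16", crew_size=3)
print(round(work_hours, 2))
```

Rounded to two decimals, this reproduces the 0.3 hours from the example in the text.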
Moreover, an additional attribute (i.e. Work Hours Category) was constructed, which divides the work hours into 4 distinct categories based on the quartiles (i.e. Q1, Q2, and Q3). Each data set had different splitting points, but in every case the values were categorized into: (1) A (values less than Q1); (2) B (values from Q1 to less than Q2); (3) C (values from Q2 to less than Q3); and finally (4) D (values more than Q3). It must be noted that this transformation might only help when using association rule mining, which does not accept numerical values.
Crew Size and Pile Length Attributes
The crews were mainly formed of 3, 4, or 5 craft workers. To use this attribute in association rule mining, the numerical values 3, 4, and 5 were transformed to C3, C4, and C5, respectively, and stored in a new attribute named Crew Size Category.
The pile length was handled in a similar manner. The lengths varied from 8 meters to 24 meters and were transformed to values from L8 to L24. However, the original length attribute was used in computing the productivity.
Working Time Attribute
According to the literature review and personal experience, the productivity of employees and craft workers differs according to the time of day. The authors hypothesize, based on experience, that the least productive periods are the first and last hours of a working day, in addition to the 30 minutes before and after lunchtime. To test this hypothesis, the working day was divided into 7 categories based on the activity starting time. The 7 categories are as follows: (1) First hour (8:00-9:00); (2) 9:00-11:30; (3) 30 mins before lunch (11:30-12:00); (4) Lunch (12:00-13:00); (5) 30 mins after lunch (13:00-13:30); (6) 13:30-17:00; and finally (7) Last hour (17:00-18:00).
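The seven categories above can be assigned from the activity starting time with a simple lookup. The labels follow the text; treating each interval as start-inclusive and end-exclusive, and returning `None` outside 8:00-18:00, are assumptions, since the text does not say how boundary times were handled.

```python
from datetime import time

# (label, start inclusive, end exclusive); boundary handling is an assumption.
WORKING_TIME_BINS = [
    ("First hour (8:00-9:00)", time(8, 0), time(9, 0)),
    ("9:00-11:30", time(9, 0), time(11, 30)),
    ("30 mins before lunch (11:30-12:00)", time(11, 30), time(12, 0)),
    ("Lunch (12:00-13:00)", time(12, 0), time(13, 0)),
    ("30 mins after lunch (13:00-13:30)", time(13, 0), time(13, 30)),
    ("13:30-17:00", time(13, 30), time(17, 0)),
    ("Last hour (17:00-18:00)", time(17, 0), time(18, 0)),
]


def working_time_category(start):
    """Return the working-time label for an activity starting time."""
    for label, lower, upper in WORKING_TIME_BINS:
        if lower <= start < upper:
            return label
    return None  # starting time outside the 8:00-18:00 working day


print(working_time_category(time(13, 10)))
```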
Productivity Unit Rate and Class Attributes
The productivity unit rate attribute was created by combining two different attributes (i.e. Work Hours and Length). The productivity is computed by dividing work hours by the length in meters which results in unit rate of wh/m. This indicates that the higher the unit rate is, the lower the productivity gets.
Moreover, another productivity attribute (i.e. Productivity Class) was created. This attribute transforms the productivity unit rates into letter grades based on the quartiles. The letter grades adopted are similar to the letter grades given for school courses (i.e. A, B, etc.). The productivity values were divided into 4 categories as follows: (1) A (values less than Q1); (2) B (values from Q1 to less than Q2); (3) C (values from Q2 to less than Q3); and finally (4) D (values more than Q3).
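The unit rate and the quartile-based class can be sketched as follows. The sample rows are made up, values exactly equal to Q3 are placed in class D by assumption, and `statistics.quantiles` with its default method stands in for whatever quartile convention the authors used.

```python
import statistics


def unit_rate(work_hours, length_m):
    """Productivity unit rate in work hours per meter (lower is better)."""
    return work_hours / length_m


def productivity_class(rate, q1, q2, q3):
    """Map a unit rate to a letter grade using the quartile cut points."""
    if rate < q1:
        return "A"
    if rate < q2:
        return "B"
    if rate < q3:
        return "C"
    return "D"


# Made-up (work hours, pile length in m) pairs for one data set.
rows = [(0.8, 20), (1.2, 20), (1.0, 20), (2.0, 20),
        (1.6, 20), (0.6, 20), (2.4, 20), (1.4, 20)]
rates = [unit_rate(wh, m) for wh, m in rows]
q1, q2, q3 = statistics.quantiles(rates, n=4)
classes = [productivity_class(r, q1, q2, q3) for r in rates]
print(classes)
```

Because the cut points are computed per data set, the same unit rate can receive different grades in different data sets, which matches the separate treatment of each diameter and activity.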
4.4. Data Cleaning:
Handling Duplicate Data
As part of the aggregation described earlier, several data instances were deleted from the data set since they were regarded as redundant/duplicate data points. As shown in Table 6, instances 1 through 4 are already aggregated in instance 5; however, both the individual and the aggregated instances were recorded, which created duplicate data. Therefore, instances 1 through 4 were regarded as duplicates of instance 5 and were deleted, as shown in Table 7. Instances 6 through 9 were dealt with in a similar manner.
Table 6 Duplicate Data Example
Table 7 Duplicate Data Example
Handling Missing Values
Several data instances had missing values for pile length and pile diameter. Moreover, some data instances had missing values for the starting time of the concrete pouring activity. Each type of missing value was handled differently. Missing pile length and diameter values were inferred from the pile type if it was available; if the pile type was also missing, the instance was deleted from the data set. Missing starting times for the concrete pouring activity were inferred to be the ending time of the drilling activity for the pile under examination.
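The two inference rules described above can be sketched as follows. The lookup table `PILE_TYPE_SPECS`, its pile type names and all field names are hypothetical; the text does not list the actual pile types or their dimensions.

```python
# Hypothetical lookup table: pile type -> (length in m, diameter in cm).
PILE_TYPE_SPECS = {"TYPE-A": (20, 40), "TYPE-B": (21, 60)}


def fill_pile_geometry(record):
    """Return a completed copy of the record, or None if it must be dropped."""
    filled = dict(record)
    if filled.get("length_m") is not None and filled.get("diameter_cm") is not None:
        return filled
    specs = PILE_TYPE_SPECS.get(filled.get("pile_type"))
    if specs is None:
        return None  # pile type missing or unknown: nothing to infer from
    if filled.get("length_m") is None:
        filled["length_m"] = specs[0]
    if filled.get("diameter_cm") is None:
        filled["diameter_cm"] = specs[1]
    return filled


def fill_concrete_start(concrete_record, drilling_end_by_pile):
    """Use the drilling end time of the same pile as a missing pouring start."""
    filled = dict(concrete_record)
    if filled.get("start") is None:
        filled["start"] = drilling_end_by_pile.get(filled["pile_id"])
    return filled


complete = fill_pile_geometry(
    {"pile_id": "P1", "pile_type": "TYPE-A", "length_m": None, "diameter_cm": 40})
dropped = fill_pile_geometry(
    {"pile_id": "P2", "pile_type": None, "length_m": None, "diameter_cm": None})
pour = fill_concrete_start(
    {"pile_id": "P1", "start": None, "end": "10:40"}, {"P1": "10:05"})
print(complete, dropped, pour)
```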
Furthermore, several data points had missing values for the proximity and land factors (i.e. proximity to site entrance, proximity to cafeteria, proximity to concrete supplier, and land type). Because these values could not be inferred, the records that had them missing were deleted.
Handling Outliers
In order to avoid bias due to unusual job conditions, outliers had to be removed. In piling craft productivity, many unusual factors can lead to extremely poor productivity that cannot be justified. Cases that involve such unusual factors are considered abnormal; it is therefore necessary to analyze the data without them to maintain consistency and to increase the generalizability of the conclusions and recommendations drawn.
For the data sets at hand, productivity outliers were identified separately for each of the 6 developed data sets. This is because piling productivity differs among pile diameter sizes (i.e. 40, 50, and 60 cm) and between the two types of piling operations (i.e. drilling and pouring concrete). For each of the 6 data sets, box plots were generated to identify the outliers based on the quartiles and the interquartile range (IQR = Q3 - Q1). Outliers were defined as values larger than Q3 + 1.5*IQR or smaller than Q1 - 1.5*IQR. The identified outliers were removed prior to the start of the analysis.
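The outlier rule above translates directly into code. The sample unit rates are made up, and the quartile method of `statistics.quantiles` may differ slightly from the one used by the authors' box plots, so the fences on real data could shift a little.

```python
import statistics


def remove_outliers(values):
    """Drop values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if lower <= v <= upper]


# Made-up unit rates (wh/m) with one abnormal observation.
rates = [0.04, 0.05, 0.05, 0.06, 0.06, 0.07, 0.07, 0.08, 0.09, 0.60]
kept = remove_outliers(rates)
print(len(rates), "->", len(kept))
```

In line with the text, the function would be applied to each of the six data sets separately, so that each set gets its own fences.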
4.5. Finalizing Datasets
After the removal of the outliers, which is the last step of the data preprocessing, the 6 data sets were finalized as shown in Table 8 below. The letters C and D denote Concrete and Drilling, while the numbers 40, 50, and 60 denote the pile diameter in cm. The 6 data sets contain a total of 3398 records.
Table 8. Final Datasets Details
5.0 Data Mining:
For the data mining tasks, three methods were used. First, Decision Trees were used to classify the productivity class (i.e. A, B, C, and D). Second, Naïve Bayes was used to classify the same productivity class. Finally, Association Rule Mining was used to identify meaningful patterns.
5.1 Decision Trees
5.2 Naïve Bayes
Naïve Bayes was applied to the six data sets using the same combination of attributes that was used in the Decision Trees for the first six runs. The following table summarizes the accuracy of the six runs and compares it to the accuracy achieved using the Decision Trees.
Table 9. Naïve Bayes Accuracy vs. Decision Trees Accuracy
As shown in the above table, the accuracy of Naïve Bayes is relatively lower than that of Decision Trees. Moreover, Naïve Bayes assumes independence among the variables, which does not hold in our case, as several variables are interdependent (e.g. proximity to site entrance and proximity to cafeteria).
5.3 Association Rule Mining
This data mining technique can identify meaningful relationships that validate our observations from the data exploration as well as from the decision trees. For the purpose of the analysis, the data for both concrete and drilling were analyzed per diameter. Incorporating all the data in a single analysis didn’t produce meaningful relationships, so a new approach had to be investigated.
As described in the attribute construction section, new attributes had to be created to perform the analysis. The Apriori algorithm was used in Weka, and for each case 60 rules were generated with a minimum confidence level of 70%. Presented below are the rules that conveyed a meaningful message and aligned with (or contradicted) previous observations. All the rules can be found in Appendix B.
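To help read the rule listings, the sketch below computes the standard textbook versions of the four metrics that appear next to each rule (confidence, lift, leverage and conviction) from instance counts. The counts in the example are made up. Weka reports finite conviction values even for rules with confidence 1, so its conviction evidently follows a different formula from the textbook one used here.

```python
def rule_metrics(n_total, n_premise, n_consequent, n_both):
    """Textbook association rule metrics for 'premise ==> consequent'.

    n_total:      number of instances in the data set
    n_premise:    instances matching the left-hand side
    n_consequent: instances matching the right-hand side
    n_both:       instances matching both sides
    """
    confidence = n_both / n_premise
    p_consequent = n_consequent / n_total
    lift = confidence / p_consequent
    leverage = n_both / n_total - (n_premise / n_total) * p_consequent
    if confidence == 1.0:
        conviction = float("inf")  # the rule is never violated
    else:
        conviction = (1 - p_consequent) / (1 - confidence)
    return {"confidence": confidence, "lift": lift,
            "leverage": leverage, "conviction": conviction}


# Hypothetical counts: 1000 instances, premise 100, consequent 250, both 80.
metrics = rule_metrics(n_total=1000, n_premise=100, n_consequent=250, n_both=80)
print({name: round(value, 3) for name, value in metrics.items()})
```

A lift above 1 means the premise makes the consequent more likely than its base rate, which is why rules with high confidence but lift close to 1 carry little information.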
5.3.1 Concrete – Diameter 40
7. WH Category=C Length Category=L20 79 ==> Productivity Class=C 79    <conf:(1)> lift:(2.4) lev:(0.11) [46] conv:(46.11)
9. WH Category=C Crew Size Category=C4 Length Category=L20 79 ==> Productivity Class=C 79    <conf:(1)> lift:(2.4) lev:(0.11) [46] conv:(46.11)
10. WH Category=C Length Category=L20 79 ==> Crew Size Category=C4 Productivity Class=C 79    <conf:(1)> lift:(2.7) lev:(0.12) [49] conv:(49.79)
19. WH Category=C Length Category=L20 Proximity to Concrete Supplier=On-site 68 ==> Productivity Class=C 68    <conf:(1)> lift:(2.4) lev:(0.09) [39] conv:(39.69)
21. WH Category=C Crew Size Category=C4 Length Category=L20 Proximity to Concrete Supplier=On-site 68 ==> Productivity Class=C 68    <conf:(1)> lift:(2.4) lev:(0.09) [39] conv:(39.69)
22. WH Category=C Length Category=L20 Proximity to Concrete Supplier=On-site 68 ==> Crew Size Category=C4 Productivity Class=C 68    <conf:(1)> lift:(2.7) lev:(0.1) [42] conv:(42.86)
52. WH Category=B Proximity to Concrete Supplier=On-site 73 ==> Productivity Class=A 64    <conf:(0.88)> lift:(3.12) lev:(0.1) [43] conv:(5.25)
Comments:
· Productivity classes A and C have more instances and thus dominate the analysis
· The results align with observations made in the exploratory phase
5.3.2 Concrete – Diameter 50
19. WH Category=D Crew Size Category=C4 Land=Dry Productivity Class=D 53 ==> Proximity to Concrete Supplier=On-site 53    <conf:(1)> lift:(1.85) lev:(0.05) [24] conv:(24.33)
27. Crew Size Category=C4 Productivity Class=B 50 ==> Proximity to Concrete Supplier=On-site 49    <conf:(0.98)> lift:(1.81) lev:(0.05) [21] conv:(11.48)
41. WH Category=D Proximity to Concrete Supplier=On-site Land=Dry Productivity Class=D 56 ==> Crew Size Category=C4 53    <conf:(0.95)> lift:(2.35) lev:(0.06) [30] conv:(8.36)
Comments:
· The relationships extracted in this case were more like observations and not meaningful results
5.3.3 Concrete – Diameter 60
1. WH Category=A 101 ==> Productivity Class=A 101    <conf:(1)> lift:(4.2) lev:(0.1) [76] conv:(76.98)
4. Length Category=L21 Productivity Class=C 84 ==> Crew Size Category=C4 79    <conf:(0.94)> lift:(1.08) lev:(0.01) [5] conv:(1.83)
12. Productivity Class=D 170 ==> Crew Size Category=C4 155    <conf:(0.91)> lift:(1.05) lev:(0.01) [7] conv:(1.39)
Comments:
· The best rule for the concrete activity with pile diameter 60 is that when the activity duration is short, the productivity is high. This can be explained by the fact that a short activity doesn’t bring fatigue, so the crew can maintain its pace of work more easily.
· Other rules correlated lower productivity with bigger crews, which can also be explained by a lack of coordination and the misalignment of the individual productivity of crew members.
5.3.4 Drilling – Diameter 40
4. WH Category=D Length Category=L20 73 ==> Productivity Class=D 73    <conf:(1)> lift:(3.74) lev:(0.12) [53] conv:(53.49)
8. WH Category=D Crew Size Category=C4 Length Category=L20 73 ==> Productivity Class=D 73    <conf:(1)> lift:(3.74) lev:(0.12) [53] conv:(53.49)
12. Working Time=13:30-17:00 WH Category=A 53 ==> Productivity Class=A 53    <conf:(1)> lift:(3.97) lev:(0.09) [39] conv:(39.64)
15. WH Category=C Length Category=L20 51 ==> Productivity Class=C 51    <conf:(1)> lift:(4.26) lev:(0.08) [39] conv:(39.02)
19. WH Category=C Crew Size Category=C4 Length Category=L20 51 ==> Productivity Class=C 51    <conf:(1)> lift:(4.26) lev:(0.08) [39] conv:(39.02)
23. Working Time=13:30-17:00 WH Category=A Crew Size Category=C3 46 ==> Productivity Class=A 46    <conf:(1)> lift:(3.97) lev:(0.07) [34] conv:(34.4)
35. WH Category=D Crew Size Category=C4 120 ==> Productivity Class=D 109    <conf:(0.91)> lift:(3.4) lev:(0.17) [76] conv:(7.33)
Comments:
· For drilling, the working time appears in the association rules for the first time. The rules indicate that between 13:30 and 17:00, and especially for short activities, the productivity is consistently high. The appearance of the working time for drilling and not for concrete is consistent with the initial graphs, since the drilling productivity varied more throughout the day than the concrete productivity.
5.3.5 Drilling – Diameter 50
14. WH Category=D Land=Dry 108 ==> Productivity Class=D 94    <conf:(0.87)> lift:(3.51) lev:(0.14) [67] conv:(5.42)
34. WH Category=D 131 ==> Productivity Class=D 104    <conf:(0.79)> lift:(3.2) lev:(0.15) [71] conv:(3.52)
35. WH Category=A Crew Size Category=C3 100 ==> Productivity Class=A 79    <conf:(0.79)> lift:(3.39) lev:(0.12) [55] conv:(3.48)
Comments:
· In this case the land type was incorporated in the analysis, since it was a rare case where the types “muddy” and “dry” had almost equal numbers of instances.
· Here the productivity is clearly correlated with the work hours, which is expected given how the productivity unit rate is defined (work hours in the numerator).
5.3.6 Drilling – Diameter 60
3. WH Category=A Crew Size Category=C4 Length Category=L19 85 ==> Productivity Class=A 85    <conf:(1)> lift:(4.05) lev:(0.08) [63] conv:(64)
12. WH Category=A Crew Size Category=C4 144 ==> Productivity Class=A 138    <conf:(0.96)> lift:(3.88) lev:(0.13) [102] conv:(15.49)
33. WH Category=B Length Category=L21 96 ==> Productivity Class=B 81    <conf:(0.84)> lift:(3.51) lev:(0.08) [57] conv:(4.56)
50. WH Category=B Crew Size Category=C4 172 ==> Productivity Class=B 123    <conf:(0.72)> lift:(2.97) lev:(0.11) [81] conv:(2.61)
51. WH Category=B 202 ==> Productivity Class=B 143    <conf:(0.71)> lift:(2.94) lev:(0.12) [94] conv:(2.56)
Comments:
· These rules partly contradict the prior observations. First, they correlate high productivity with a crew size of 4, which is relatively large and contradicts the observation that smaller crews have higher productivity. Of course, drilling a pile of diameter 60 has certain crew requirements.
· In addition, the majority of the rules concern Productivity Classes A and B, even though class C has the highest number of instances.
6.0 Conclusions and Recommendations
After conducting different data mining tasks and techniques, the following conclusions are noted:
Moreover, the following recommendations are proposed:
Appendix A – Decision Trees
1. C40
2. D40
3. C50
4. D50
5. C60
6. D60
7. C40-50-60
8. D40-50-60