Artificial Intelligence in Commodities Estimation for Construction Projects.

1. Introduction:

Artificial intelligence (AI) is a general term for machines that mimic human cognitive functions such as problem-solving, pattern recognition, and learning. Machine learning is a subset of AI that uses statistical techniques to give computer systems the ability to "learn" from large volumes of historical data without being explicitly programmed. A machine generally becomes better at understanding and providing insights as it is exposed to more data.

1.1 Applicability of Machine Learning and AI in Construction:

· Prevent cost overruns

Artificial neural networks are used on projects to predict cost overruns based on factors such as project size, contract type, crew composition, labor supervision, material availability, drawings availability, subcontractor status, project location, political issues, and the competence level of project teams and project managers. Predictive models also use historical data to estimate realistic timelines for future projects.

· Risk Mitigation

Every construction project carries risk in many forms, such as quality, safety, time, and cost risk. The larger the project, the greater the risk, as multiple subcontractors work on different trades in parallel on job sites. AI and machine learning solutions that rely on historical data can monitor, prioritize, and categorize risk, helping project managers focus on and work closely with high-risk items.

· AI for Construction Safety

Construction workers are killed on the job five times more often than other workers. According to OSHA, the leading causes of private sector deaths (excluding highway collisions) in the construction industry were falls, followed by being struck by an object, electrocution, and caught-in/between incidents.

A typical construction project can have thousands of open issues, hundreds of RFIs, and numerous change orders that are open on any given day. Imagine a smart assistant that can analyze this mountain of project data and alert you to the top 10 critical things that need your attention today. Machine learning can be that smart assistant, helping teams identify the most critical risk factors from a construction safety and quality perspective that need immediate attention.

AI algorithms such as real-time object detection can be used to identify and analyze safety hazards, categorize and tag site photography, and send notifications when PPE is not being properly used on the job site. They can even be used to identify who is violating safety standards and tag them and/or their supervisors to address the problem.

· Commodities Estimation and Productivity Improvement

At a time when an overwhelming amount of data is being created every day, AI Systems are exposed to an endless amount of data to learn from and improve every day. This presents an opportunity for construction industry professionals to analyze and benefit from the insights generated from the data with the help of AI and machine learning systems.

Machine learning and AI are becoming a cornerstone in estimating commodities and productivity norms for activities such as concrete, steel fixing, shuttering, piping fabrication, piping erection, equipment installation, steel structure fabrication and erection, façade systems, and finishing works. Our goal is to identify the importance of the features that impact productivity norms for different construction activities and to establish standard rates for different commodities.

1.2 Problem Statement:

In this research, I used machine learning and AI techniques such as artificial neural networks, boosting, regression, and natural language processing to predict the daily production of piping spool erection, and to identify the features that impact piping erection productivity in industrial and oil and gas projects, based on historical data. The 23 selected features are listed below:

1-Country: The data was collected from oil and gas and petrochemical construction projects in five countries (Egypt, Saudi Arabia, Qatar, UAE, and Oman) over the period from 2005 to 2017, with an average of 35 projects distributed as below:

·        Egypt: average of 14 projects.

·        Saudi Arabia: average of 10 projects.

·        Qatar: average of 3 projects.

·        UAE: average of 4 projects.

·        Oman: average of 4 projects.

2-Drawings Availability: Two categories (Low and High).

3-Fabricated Spools Availability: Two categories (Low and High).

4-Working at Heights: Two categories (Yes if the piping erection is on pipe racks higher than 4.60 meters, otherwise No).

5-HSE and Security Restrictions: Two categories (Yes where HSE and security restrictions apply, which were found to be most severe in Gulf countries and in live oil and gas areas in Egypt, otherwise No).

6-Heat Index and Temperature: Two categories (High in June, July, and August in Gulf countries, otherwise Low).

7-Political Issues: Two categories (Yes during the 2008 financial crisis and the 2011 Arab Spring, a series of anti-government protests, as many projects were impacted; otherwise No).

8-Crew Nationality: Two categories (Arab and others).

9-Number of Pipe Fitters.

10-Number of Argon Welders.

11-Number of CS Welders.

12-Number of cranes.

13-Number of Riggers.

14-Number of Grinders.

15-Holidays: Two categories (Yes and No).

16-Distance between the spool fabrication workshop and the site: Two categories (Low if the workshop is within the site boundary, otherwise High).

17-Crew Experience: Two categories (Low if the average experience is less than 5 years, otherwise High).

18-Material of Pipes: Four categories (Carbon Steel, Stainless Steel, Low Temp, Duplex).

19-Pipe Diameter: Three categories (Low if the pipe diameter is less than 10 D.I, Medium if it is between 10 and 22 D.I, and Large if it is more than 22 D.I).

20-Material Availability: Two categories (Yes and No).

21-Number of NDT test inspectors.

22-Crew Supervision: Three categories (Normal if the ratio of supervision to direct labor is within 10% to 15%, Low if it is less than 10%, otherwise High).

23-Work Front Availability: Two categories (Yes and No).
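The threshold-based categories in the list above can be expressed as small helper functions. The sketch below is only an illustration: the function names are hypothetical, and the handling of values that fall exactly on a boundary (10, 22, 15%) is an assumption, since the text does not specify it.

```python
def diameter_category(diameter):
    """Map a pipe diameter (same unit as in the text) to Low / Medium / Large."""
    if diameter < 10:
        return "Low"
    if diameter <= 22:        # assumption: the 10 to 22 band is inclusive
        return "Medium"
    return "Large"


def supervision_category(supervisors, direct_labor):
    """Classify the supervision ratio: Low (<10%), Normal (10% to 15%), High (>15%)."""
    ratio = supervisors / direct_labor
    if ratio < 0.10:
        return "Low"
    if ratio <= 0.15:         # assumption: 15% itself still counts as Normal
        return "Normal"
    return "High"


def experience_category(average_years):
    """Low if the crew's average experience is under 5 years, otherwise High."""
    return "Low" if average_years < 5 else "High"
```

For example, `supervision_category(12, 100)` gives `"Normal"` because 12 supervisors for 100 direct workers is a 12% ratio.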

Next, we study the impact of these features on piping erection production, train and test our models on the available data, and select the best model for predicting daily production.

[Figure: our process from feature selection through model training, testing, and evaluation to prediction]

1.3 Solution statement:

We follow the process below to solve the problem:

NOTE: The programming language used in this research is Python (3.7).

·      Fetching the Data:

The data was collected from daily reports, which cover the daily production of piping erection, manpower status, equipment status, heat index/temperature, and reasons for delay.

·      Cleaning and Preparing the Data:

  1. Wrangle the data and prepare it for training.
  2. Use web scraping to collect the official holidays in Egypt, Qatar, Saudi Arabia, UAE, and Oman for the period from 2005 to 2017.
  3. Remove duplicates and outliers, handle missing data, convert categorical data, and normalize the float and integer values.
  4. Use natural language processing techniques to extract from the daily reports the areas of concern that impact the production rate.
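Step 3 of the list above could look roughly like the following pandas sketch. It is not the code used in the research: the column names, the median fill for missing values, and the 1.5 x IQR outlier rule are all assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd


def clean_daily_reports(df, target="Erection", iqr_factor=1.5):
    """Drop duplicates, fill numeric gaps, and remove target outliers (IQR rule)."""
    df = df.drop_duplicates().copy()

    # Assumption: missing numeric values are filled with the column median.
    for col in df.select_dtypes(include="number").columns:
        df[col] = df[col].fillna(df[col].median())

    # Keep only rows whose target lies inside the IQR fences.
    q1 = df[target].quantile(0.25)
    q3 = df[target].quantile(0.75)
    spread = q3 - q1
    in_range = df[target].between(q1 - iqr_factor * spread, q3 + iqr_factor * spread)
    return df[in_range].reset_index(drop=True)


# Tiny made-up sample with hypothetical column names, for illustration only.
sample = pd.DataFrame({
    "Pipe_Fitters": [10, 10, 12, np.nan, 11, 9],
    "Erection": [5000, 5000, 5200, 4800, 5100, 90000],
})
cleaned = clean_daily_reports(sample)
```

In this sample the duplicated first row is dropped, the missing crew count is filled, and the 90,000 record is removed as an outlier.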

·      Data Visualization and Analysis:

  1. Visualize the data to help detect relevant relationships between variables.
  2. Split the data into training and evaluation sets.

·      Training the Model:

The goal of training is to make correct predictions as often as possible; the model generally improves as it is trained on more data.

·      Evaluating the Model:

  1. Use a metric or combination of metrics to measure the performance of the model.
  2. Shuffle the data and select a 15/85 ratio for the test/train data sets.
  3. Tune the hyper-parameters, which is a cornerstone of model efficiency and performance improvement.
  4. Use the test set data to predict the output.

1.4 Evaluation Metrics:

Our problem is a regression problem, which leads us to use the following metric:

Root Mean Squared Error: The root mean squared error (RMSE) is the square root of the mean of the squared errors. RMSE is very commonly used and is considered an excellent general-purpose error metric for numerical predictions.

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(S_i - O_i\right)^2}$$

where $O_i$ are the observations, $S_i$ are the predicted values of a variable, and $n$ is the number of observations available for analysis. RMSE is a good measure of accuracy, but only for comparing the prediction errors of different models or model configurations for a particular variable, not between variables, as it is scale-dependent.
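The RMSE definition translates directly into a few lines of NumPy. The function name and the sample values below are made up for illustration.

```python
import numpy as np


def rmse(observed, predicted):
    """Root mean squared error between observations and predicted values."""
    observed = np.asarray(observed, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((predicted - observed) ** 2)))


# Made-up daily production values, for illustration only.
print(rmse([5000, 6000, 7000], [5100, 5900, 7200]))  # about 141.42
```

Because the errors are squared before averaging, a single large miss (200 here) weighs more than several small ones, and the result is in the same unit as the target (D.I/Day).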

2. Data Analysis:

2.1 Data Exploration:

We now dig deeper into the data and compute summary statistics to understand the nature of our historical data:

The shape of our data is 15,868 rows (inputs) and 29 columns (features), for a total of 460,172 values.

[Table: feature names and types]

We will drop some features, such as project name, month, year, and day, as they would be misleading in model training.

Our Target will be the Erection Column.

[Table: summary statistics of the data]

The maximum piping erection per day is 32,945 D.I/Day.

The minimum piping erection per day is 466 D.I/Day.

The mean piping erection per day is 6,938 D.I/Day.

75% of the records have a piping erection of 9,216 D.I/Day or less.

[Table: first five rows of the dataset, a sample of erection production in Egypt in 2005]

[Table: sample of erection production in Qatar in 2005]

[Table: sample of erection production in Saudi Arabia in 2015]

[Table: sample of erection production in Oman in 2011]

[Table: sample of erection production in UAE in 2017]

2.2 Exploratory Visualization:

Pair plot between pipe fitters, argon welders, and C.S welders per country

The pair plot creates a grid of axes such that each numeric variable in the data is shared on the y-axis across a single row and on the x-axis across a single column. The diagonal axes are treated differently: they show the univariate distribution of the data for the variable in that column.

Pair Plot between Cranes, Riggers, Grinders per Country

Pair Plot between Erection and Daily rate per Country

The pair plot and the Gaussian distribution show a large standard deviation for the erection records in Egypt, which vary from zero to 35,000. Erection production above 20,000 D.I/Day accounts for around 5% of the overall records for Egypt, and production above 10,000 D.I/Day for around 45%. The maximum is 32,945 D.I/Day, the minimum is 465 D.I/Day, and the average is 9,775 D.I/Day.

The daily rate of erection gives the same picture: Egypt has a large standard deviation, with an average per crew of 31 D.I/Day, a maximum of 55 D.I/Day, and a minimum of 15 D.I/Day.

By contrast, the Gaussian distribution in the Gulf countries has a small standard deviation, with values ranging from 0 to 18,000. Erection production above 10,000 D.I/Day accounts for around 13% of the overall records for the Gulf countries, and production above 5,000 D.I/Day for around 89%. The maximum is 18,463 D.I/Day, the minimum is 610 D.I/Day, and the average is 5,352 D.I/Day.

The daily rate of erection gives the same picture: the Gulf countries have a smaller standard deviation, with an average per crew of 22 D.I/Day, a maximum of 37 D.I/Day, and a minimum of 11.5 D.I/Day.

Average Erection production per Year for Each Country 

There is a clear drop in piping erection production in Egypt from 2011 to 2017, although the average number of projects was almost the same within that period.

Average Daily production Rate per crew per Year for Each Country 

The average daily production rate per crew is around 31 D.I/Crew in Egypt, 21 D.I/Crew in Oman, 23.5 D.I/Crew in Saudi Arabia, 20 D.I/Crew in Qatar, and 18.5 D.I/Crew in UAE.

Crew productivity drops in the summer season in all countries.

Average Daily production Rate per crew per Year for Each Country 

Crew productivity in Egypt dropped from 2011 to 2017.

Average Daily production Rate per crew per Year for Each Country Excluding Unusual Circumstances

When unusual circumstances are excluded (high heat index, low material availability, low availability of shop drawings, political issues, and working at heights), the productivity rate per crew increases by an average of 30% across all countries.
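A comparison of this kind can be sketched with a boolean mask in pandas. The column names and category labels below are hypothetical stand-ins for the features described earlier, and the sample rows are made up; this is not the code behind the figures.

```python
import pandas as pd


def rate_with_and_without_unusual_days(df, rate_col="Daily_Rate"):
    """Mean crew rate per country for all days and for 'usual' days only."""
    usual = (
        (df["Heat_Index"] == "Low")
        & (df["Material_Availability"] == "Yes")
        & (df["Drawings_Availability"] == "High")
        & (df["Political_Issues"] == "No")
        & (df["Working_at_Heights"] == "No")
    )
    summary = pd.DataFrame({
        "all_days": df.groupby("Country")[rate_col].mean(),
        "usual_days": df[usual].groupby("Country")[rate_col].mean(),
    })
    summary["change_pct"] = 100 * (summary["usual_days"] / summary["all_days"] - 1)
    return summary


# Made-up rows, for illustration only.
sample = pd.DataFrame({
    "Country": ["Egypt", "Egypt", "Qatar", "Qatar"],
    "Heat_Index": ["High", "Low", "Low", "Low"],
    "Material_Availability": ["Yes", "Yes", "No", "Yes"],
    "Drawings_Availability": ["High"] * 4,
    "Political_Issues": ["No"] * 4,
    "Working_at_Heights": ["No"] * 4,
    "Daily_Rate": [20.0, 30.0, 15.0, 25.0],
})
summary = rate_with_and_without_unusual_days(sample)
```

The `change_pct` column then holds the percentage increase in the mean rate once the unusual days are filtered out.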

The average number of pipe fitters per year for each country

The average number of cranes per year for each country

The average number of riggers per year for each country

The average number of grinders per year for each country

The average number of pipe welders (C.S) per year for each country

The average number of pipe welders (argon) per year for each country

Heat Map Features

A heatmap is a two-dimensional graphical representation of data in which the individual values contained in a matrix are represented as colors. Here it reflects the relationships between the different features of the data.
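The matrix behind such a heatmap is typically a correlation matrix. A minimal sketch, assuming Pearson correlation over the numeric columns and hypothetical column names, shows how features can be ranked by their correlation with the target:

```python
import pandas as pd


def correlation_with_target(df, target="Erection"):
    """Pearson correlation of each numeric column with the target, sorted descending."""
    corr_matrix = df.select_dtypes(include="number").corr()
    return corr_matrix[target].drop(target).sort_values(ascending=False)


# Made-up rows, for illustration only.
sample = pd.DataFrame({
    "Pipe_Fitters": [10, 12, 14, 16],
    "Holiday": [1, 1, 0, 0],
    "Erection": [5000, 6000, 7000, 8000],
})
ranked = correlation_with_target(sample)
```

Positive entries at the top of `ranked` correspond to features that move with production, and negative entries at the bottom to features that move against it. The full `corr_matrix` is what a plotting library would color as the heatmap.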

[Figure: features that impact erection production positively, listed in descending order]

Skilled manpower (especially pipe fitters), equipment such as cranes, a location in Egypt, the absence of HSE restrictions, a low heat index, the absence of holidays, and the availability of material and drawings all impact the erection production rate positively.

[Figure: features that impact erection production negatively, listed in ascending order]

3.  Benchmark Model:

Now, we will train different models on our data and compare their results with the benchmark model.

We will use the linear regressor as the benchmark against which to compare our models' performance, because it is fast and simple to implement.

We will use the RMSE (root mean squared error) as the metric for comparing the models' results.

4. Algorithms and Techniques:

As this is a regression problem, our strategy is to implement the models below and compare their results against our benchmark model using our evaluation metric. This lets us assess which model is best suited to our problem.

We will concentrate on ANN models such as KerasRegressor and on gradient boosting models. Gradient boosting models often provide very high predictive accuracy and a lot of flexibility: they can optimize different loss functions and offer several hyperparameter tuning options that make the fit very flexible. They also need little data pre-processing, since they often work well with categorical and numerical values as-is, and some implementations handle missing data.

4.1 Linear Regressor:

Linear regression is probably one of the most important and widely used regression techniques. It’s among the simplest regression methods. One of its main advantages is the ease of interpreting results.

When implementing linear regression of some dependent variable $y$ on the set of independent variables $\mathbf{x} = (x_1, \ldots, x_r)$, where $r$ is the number of predictors, you assume a linear relationship between $y$ and $\mathbf{x}$: $y = \beta_0 + \beta_1 x_1 + \cdots + \beta_r x_r + \varepsilon$. This equation is the regression equation. $\beta_0, \beta_1, \ldots, \beta_r$ are the regression coefficients, and $\varepsilon$ is the random error.
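To make the regression equation concrete, the coefficients can be estimated by ordinary least squares. The NumPy sketch below uses a hypothetical helper name and made-up, noise-free data; library estimators perform the same fit with more safeguards.

```python
import numpy as np


def fit_linear_regression(X, y):
    """Ordinary least squares: returns (intercept, coefficients)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Prepend a column of ones so the first coefficient is the intercept.
    design = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return beta[0], beta[1:]


# Made-up data generated from y = 500 + 30*x1 + 100*x2 with no random error.
X = np.array([[10, 1], [12, 2], [15, 1], [20, 3], [8, 2]])
y = 500 + 30 * X[:, 0] + 100 * X[:, 1]
intercept, coefs = fit_linear_regression(X, y)
```

With noise-free data the fit recovers the generating values (intercept close to 500, coefficients close to 30 and 100); with real data the random error term keeps the fit from being exact.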

4.2 Elastic Regressor:

Elastic Net is a linear regression with combined L1 and L2 priors as regularizer. The penalty has the form:

  a * L1 + b * L2

  where:

  alpha = a + b and l1_ratio = a / (a + b)
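The relationship between (a, b) and (alpha, l1_ratio) can be checked numerically. The sketch below follows the convention documented for scikit-learn's ElasticNet, where the L2 term carries a factor of 0.5; the function name is hypothetical.

```python
import numpy as np


def elastic_net_penalty(weights, alpha=1.0, l1_ratio=0.5):
    """Penalty a*||w||_1 + 0.5*b*||w||_2^2 with a = alpha*l1_ratio, b = alpha*(1 - l1_ratio)."""
    w = np.asarray(weights, dtype=float)
    a = alpha * l1_ratio
    b = alpha * (1.0 - l1_ratio)
    return float(a * np.abs(w).sum() + 0.5 * b * np.square(w).sum())


print(elastic_net_penalty([3, -4], alpha=1.0, l1_ratio=0.5))  # 9.75
print(elastic_net_penalty([3, -4], alpha=1.0, l1_ratio=1.0))  # 7.0 (pure L1)
```

Setting `l1_ratio=1.0` leaves only the L1 term and `l1_ratio=0.0` leaves only the L2 term, which is how the two pure penalties appear as special cases.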

4.3 Ridge Regressor

Ridge is linear least squares with L2 regularization.

This model solves a regression problem where the loss function is the linear least squares function and the regularization is given by the L2-norm.
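For intuition, the L2-regularized least squares problem has a closed-form solution. The sketch below omits the intercept for brevity, which is a simplification and not how library estimators behave by default; the function name and data are made up.

```python
import numpy as np


def fit_ridge(X, y, alpha=1.0):
    """Closed-form ridge weights w = (X^T X + alpha*I)^-1 X^T y (no intercept)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)


X = np.array([[1, 0], [0, 1], [1, 1]])
y = np.array([1, 2, 3])
w_plain = fit_ridge(X, y, alpha=0.0)    # ordinary least squares
w_shrunk = fit_ridge(X, y, alpha=10.0)  # weights pulled toward zero
```

Increasing `alpha` shrinks the weight vector, which is the mechanism by which the L2 penalty controls over-fitting.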

4.4 Lasso Regressor

Lasso is a linear model trained with an L1 prior as regularizer.

Technically the Lasso model is optimizing the same objective function as the Elastic Net with l1_ratio=1.0 (no L2 penalty).

4.5 KerasRegressor

The basic architecture of the deep learning neural network, which we will be following, consists of three main components.

1) Input Layer: This is where the training observations are fed. The number of predictor variables is also specified here through the neurons.

2) Hidden Layers: These are the intermediate layers between the input and output layers. The deep neural network learns about the relationships involved in data in this component.

3) Output Layer: This is the layer where the final output is extracted from what's happening in the previous two layers. In the case of regression problems, the output layer will have one neuron.
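The three components can be illustrated with a plain NumPy forward pass. This is not the KerasRegressor used in the research: the layer sizes, the ReLU activation, and the random weights are assumptions picked only to show how data flows from the input layer to a single output neuron.

```python
import numpy as np


def forward_pass(x, params):
    """Input layer -> one ReLU hidden layer -> single linear output neuron."""
    hidden = np.maximum(0.0, x @ params["W1"] + params["b1"])  # hidden layer
    return hidden @ params["W2"] + params["b2"]                # output layer


rng = np.random.default_rng(0)
n_features, n_hidden = 4, 8  # illustrative sizes, not the article's architecture
params = {
    "W1": rng.normal(size=(n_features, n_hidden)),
    "b1": np.zeros(n_hidden),
    "W2": rng.normal(size=(n_hidden, 1)),
    "b2": np.zeros(1),
}
batch = rng.normal(size=(5, n_features))  # 5 observations, 4 predictors
predictions = forward_pass(batch, params)
```

The output has one column per observation row, matching the single output neuron of a regression network. Training would then adjust `W1`, `b1`, `W2`, and `b2` to reduce the loss.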

4.6 RandomForestRegressor

A random forest is a meta estimator that fits a number of decision trees on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting. By default, each sub-sample has the same size as the original input sample and is drawn with replacement.
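The sub-sampling step can be illustrated on its own. The sketch assumes bootstrap sampling (drawing with replacement), which is the default behaviour of scikit-learn's random forests; the helper name is hypothetical and no trees are fitted here.

```python
import numpy as np


def bootstrap_indices(n_samples, rng):
    """Draw n_samples row indices with replacement (one tree's training sample)."""
    return rng.integers(0, n_samples, size=n_samples)


rng = np.random.default_rng(42)
n_samples = 10_000
idx = bootstrap_indices(n_samples, rng)
unique_fraction = np.unique(idx).size / n_samples  # close to 1 - 1/e, about 0.632
```

Although each sample has the same size as the original data, only around 63% of the distinct rows appear in it. Each tree therefore sees a slightly different dataset, and averaging their predictions reduces variance.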


4.7 CatBoostRegressor

CatBoost is a powerful machine learning algorithm that is widely applied to many types of business challenges, such as fraud detection, item recommendation, and forecasting, and it performs well on them. It can also return very good results with relatively little data, unlike deep learning models that need to learn from a massive amount of data.

4.8 LGBMRegressor

LightGBM is a fast, distributed, high-performance gradient boosting framework based on decision tree algorithms, used for ranking, classification, and many other machine learning tasks.

It grows trees leaf-wise, splitting the leaf with the best fit, whereas many other boosting algorithms grow trees depth-wise (level-wise). When growing the same number of leaves, the leaf-wise algorithm can reduce more loss than the level-wise algorithm, which can result in better accuracy.

4.9 XGBRegressor

XGBoost is an implementation of gradient boosted decision trees designed for speed and performance.


3.1 Data Pre-processing:

·      We will normalize our numerical data for computing speed.

[Table: numerical features to be normalized]

·      We will create dummy variables (0's and 1's) for our categorical data.

[Table: list of dummy variables]

·      Preparation of our data for model training and testing

We will shuffle our data and then split it into 90% training data and 10% test data.
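The three pre-processing steps can be sketched together in pandas. The column names and sample rows are hypothetical, and min-max scaling is an assumption, since the text does not say which normalization was used.

```python
import pandas as pd


def prepare_data(df, numeric_cols, categorical_cols, target="Erection",
                 test_fraction=0.10, seed=0):
    """Min-max scale numeric columns, create dummies, shuffle, and split."""
    data = df.copy()

    # Assumption: min-max normalization to the [0, 1] range.
    for col in numeric_cols:
        values = data[col].astype(float)
        span = values.max() - values.min()
        data[col] = (values - values.min()) / (span if span else 1.0)

    # Dummy (0/1) columns for the categorical features.
    data = pd.get_dummies(data, columns=categorical_cols, dtype=int)

    # Shuffle, then hold out the last test_fraction of rows for testing.
    data = data.sample(frac=1.0, random_state=seed).reset_index(drop=True)
    n_train = len(data) - int(round(len(data) * test_fraction))
    train, test = data.iloc[:n_train], data.iloc[n_train:]
    return (train.drop(columns=target), train[target],
            test.drop(columns=target), test[target])


# Made-up rows with hypothetical column names, for illustration only.
sample = pd.DataFrame({
    "Country": ["Egypt", "Qatar"] * 5,
    "Pipe_Fitters": [10, 12, 14, 16, 18, 20, 22, 24, 26, 28],
    "Erection": [5000, 4200, 6100, 4900, 7000, 5600, 8100, 6300, 9000, 7100],
})
X_train, y_train, X_test, y_test = prepare_data(
    sample, numeric_cols=["Pipe_Fitters"], categorical_cols=["Country"]
)
```

The sketch scales before splitting, following the order described here. A common alternative is to compute the scaling statistics on the training split only, so that no information from the test rows reaches the model.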

3.2 Implementation:

First, after preparing our training and testing data sets, we will implement our benchmark model (the linear regression model) and calculate the metric discussed earlier.
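A benchmark of this kind might be set up with scikit-learn roughly as follows. The data here is synthetic and stands in for the real features, so the resulting RMSE says nothing about the article's results.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: production driven by two crew-size features plus noise.
rng = np.random.default_rng(0)
X = rng.integers(5, 50, size=(500, 2)).astype(float)
y = 400 + 120 * X[:, 0] + 60 * X[:, 1] + rng.normal(0, 200, size=500)

# Shuffle and hold out 10% of the rows for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.10, shuffle=True, random_state=0
)

benchmark = LinearRegression().fit(X_train, y_train)
benchmark_rmse = float(np.sqrt(mean_squared_error(y_test, benchmark.predict(X_test))))
```

Every other model can then be fitted on the same `X_train` and scored on the same `X_test`, so that its RMSE is directly comparable with `benchmark_rmse`.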


3.2.1 Linear Regressor:

3.2.2 Elastic Regressor:

3.2.3 Ridge Regressor:

3.2.4 Lasso Regressor:

Feature importance for the Linear, Ridge, Lasso, and Elastic regressors:

The most important features are the skilled labor categories, such as pipe fitters, grinders, and welders.

3.2.5 Keras Regressor:

As the number of epochs increases, the loss decreases until it reaches its optimum at 50 epochs.

Feature importance for the Keras regressor:

The most important features are again the skilled labor categories, such as pipe fitters, grinders, and welders.

3.2.6 RandomForestRegressor:

Feature importance for RandomForestRegressor:

3.2.7 CatBoostRegressor:

3.2.8 LGBMRegressor:

3.2.9 XGBRegressor:

Feature importance for the XGB, LGBM, and CatBoost regressors:

The most important features are the skilled labor categories, such as pipe fitters, grinders, and welders.

4. Results:

4.1 Models Evaluation:

First, we combine the results of all the models and sort them according to our metric (RMSE).
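Combining and sorting the results can be as simple as the pandas sketch below. The helper name is hypothetical, and only the two RMSE values quoted in this article are used.

```python
import pandas as pd


def rank_models(rmse_by_model):
    """Return a table of models sorted from lowest to highest RMSE."""
    table = pd.DataFrame(
        {"Model": list(rmse_by_model), "RMSE": list(rmse_by_model.values())}
    )
    return table.sort_values("RMSE").reset_index(drop=True)


# Only the two RMSE values quoted in this article are used here.
ranking = rank_models({"Boosting (Cat/LGB/XGB)": 1341, "KerasRegressor": 1330})
```

The first row of `ranking` is the model with the lowest RMSE, which is the best model under this metric.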

The Keras regressor is the best model, with the lowest RMSE (1,330). This RMSE is acceptable given that the target values are in the thousands. The boosting models (CatBoost, LGBM, and XGB) rank second with an RMSE of 1,341, which is very close to that of the Keras regressor.

Second, we test our model with new data and observe the output.

For the first selection, we choose the country "Egypt" with a fixed set of feature values.

For the second selection, we change the country to Qatar, set HSE restrictions to Yes, and keep the other features fixed.

For the third selection, we keep the same features as in the second selection, except that we change the heat index to High.

The model predicts a drop in production due to the high heat index in the summer season.

For the fourth selection, we keep the same features as in the third selection, except that we change working at heights to Yes.

The model predicts a further drop in piping production when working at heights.

4.2 Justification and Conclusion:

·        Skilled labor categories are the most important features according to most of the models, followed by project location, HSE restrictions, temperature/heat index, political issues, type of material, and so on, as shown in the feature importance results.

·        Artificial intelligence and machine learning are considered a key to future success for any company. The pioneers of the future will be those who take artificial intelligence and machine learning into account in their decisions and in developing management techniques.

·        Historical data is the cornerstone of building a good model with good results. Companies have to keep collecting data and records of every kind: daily reports, areas of concern, manpower reports, equipment reports, accidents, incidents, daily temperatures, etc.

·        The research can be applied to different trades such as shuttering, steel fixing, piping fabrication, equipment installation, concrete pouring, and steel erection, so that standard rates can be established for all trades.

