From Pit to Profit: Data Analytics in Mining with Python
Carlos Braschi
Animal Scientist / Data Analyst / Driving business growth through data-driven insights / Experience analyzing large datasets with Excel, SQL, Tableau, Power BI, and Python / Precision Livestock Farming Enthusiast
Introduction
It comes as no surprise that iron ore is considered the strength behind the world's leading industries. As the essential ingredient in steelmaking, iron is the world's most commonly used metal and is critical to nearly every industry, from energy and construction to transportation and equipment manufacturing.
Beyond its applications, iron ore production is an economic stimulant essential to U.S. growth. The United States produced an estimated 44 million metric tons of iron ore in 2023, a 13% increase from the 39 million tons produced in 2022. The majority of U.S. iron ore production is concentrated in Michigan and Minnesota, and the iron ore industry contributed approximately $7.5 billion to the U.S. economy in 2023. While iron ore production increased, overall steel production remained relatively stable at around 80 million tons.
As part of the Data Analytics Accelerator program mentored by Avery Smith at Data Career Jumpstart, I collaborated with Metals R' Us to optimize flotation processes through data analysis. This project provided hands-on experience applying Python to manufacturing challenges.
The Dataset
This is real plant data collected from March 2017 to September 2017. The data is a bit messy in some ways, with some columns sampled every 20 seconds and others sampled every hour. Click here to access the dataset. The data was downloaded as a CSV file and then imported into Python through Deepnote, which is a cloud-based platform and interactive development environment.
Understanding the Iron Extraction
The mining company digs a big pit and collects large clumps of ore from it. Those clumps contain iron, the product they ultimately want to sell, but the iron is surrounded by impurities like dirt, sand, and silica.
To obtain cleaner iron, the ore is run through a flotation plant. The clumps flow through as a pulp, mixed with starch and amine (called "Amina" in the dataset), which strip the impurities away from the iron; then air bubbles are shot through the liquid mixture so the valuable material rises to the top while the waste minerals remain at the bottom.
Here's a short YouTube video explaining the process. For additional details, please refer to this Mining Process Data Dictionary. It explains the important columns, and it was very helpful to reference it to decode what data was collected and what it means for the overall goal.
To begin this analysis, I imported the Python libraries Pandas, Matplotlib, Seaborn, and NumPy, as shown in the code below:
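The original code screenshots are not reproduced here; a minimal sketch of those imports would look like this:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; only needed outside a notebook

import numpy as np               # numeric operations
import pandas as pd              # tabular data handling
import matplotlib.pyplot as plt  # base plotting
import seaborn as sns            # statistical visualizations
```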
To initiate the analysis, I imported the CSV dataset into a Pandas DataFrame named 'df' for efficient data manipulation.
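A sketch of that import step (the filename is an assumption; in Deepnote the file sits alongside the notebook). A tiny inline sample stands in for the full file here, and it reproduces the formatting quirk discussed next:

```python
import io
import pandas as pd

# In the notebook this would simply be:
# df = pd.read_csv("MiningProcess_Flotation_Plant_Database.csv")  # filename assumed
# Inline sample standing in for the real file (values use decimal commas):
sample_csv = 'date,"% Iron Feed","% Silica Feed"\n2017-03-10 01:00:00,"55,2","16,98"\n'
df = pd.read_csv(io.StringIO(sample_csv))
print(df["% Iron Feed"].iloc[0])  # "55,2" -- read as a string, not a number
```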
?
To get a preliminary understanding of my data, I inspected the initial rows using df.head(). I quickly noticed that numeric values were formatted with commas instead of decimal points, indicating a potential data cleaning issue. To fix this, we simply tell pandas that the CSV uses commas as decimal separators.
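Concretely, pandas' read_csv accepts a decimal parameter for exactly this case (shown on the same inline sample, since the real file is not bundled here):

```python
import io
import pandas as pd

raw = 'date,"% Iron Feed","% Silica Feed"\n2017-03-10 01:00:00,"55,2","16,98"\n'

# decimal=',' tells pandas that "55,2" means 55.2
df = pd.read_csv(io.StringIO(raw), decimal=",")
print(df["% Iron Feed"].iloc[0])  # 55.2 as a proper float
```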
Great! Our data now has the correct format!
Now, I would like to know the number of rows and columns in our dataset.
With df.shape we discover that our dataset has 737,453 rows, each a data point collected during the flotation process, and 24 columns: variables capturing input parameters such as iron and silica content, chemical reagents, airflow settings, and froth conditions, as well as output metrics like the final iron and silica concentrate percentages.
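df.shape returns a (rows, columns) tuple; on a small stand-in frame (the full file is not bundled here) it works the same way:

```python
import pandas as pd

# Toy frame; on the full dataset df.shape returns (737453, 24)
df = pd.DataFrame({
    "% Iron Concentrate": [65.1, 64.9, 65.3],
    "% Silica Concentrate": [1.9, 2.1, 2.0],
})
n_rows, n_cols = df.shape
print(n_rows, n_cols)  # 3 2
```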
?
I also practiced some basic selection functions: df['% Iron Concentrate'] to look at just a particular column, or df.iloc[100:105] to look at the data from rows 100 to 104. In this slice, the first number is inclusive and the second number is exclusive.
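On a stand-in frame, those two selections look like this:

```python
import pandas as pd

df = pd.DataFrame({"% Iron Concentrate": range(110)})  # stand-in data

iron_col = df["% Iron Concentrate"]  # a single column, returned as a Series
subset = df.iloc[100:105]            # rows 100-104: start inclusive, stop exclusive
print(len(subset))                   # 5
```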
To ensure data integrity and accuracy in analysis, I first examined the data types within the dataset. I discovered that the date column was represented as strings, which can hinder data manipulation and analysis.
'str' means the data within this column is categorized as text. Since dates must correspond to a timestamp, I wrote the following code to make this adjustment:
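The screenshot is missing; the conversion most likely relied on pd.to_datetime, sketched here on sample values (the column name 'date' matches the dataset's data dictionary):

```python
import pandas as pd

df = pd.DataFrame({"date": ["2017-03-10 01:00:00", "2017-03-10 02:00:00"]})
print(df["date"].dtype)                  # object: the dates are plain strings

df["date"] = pd.to_datetime(df["date"])  # parse the strings into timestamps
print(df["date"].dtype)                  # datetime64[ns]
```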
And now we just need to make sure that the date is in the correct format, so we run this code:
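A quick way to verify the conversion is to inspect df.dtypes, sketched here on a small frame:

```python
import pandas as pd

df = pd.DataFrame({
    "date": pd.to_datetime(["2017-03-10 01:00:00"]),
    "% Iron Concentrate": [65.1],
})
print(df.dtypes)  # lists each column's dtype; 'date' now shows datetime64[ns]
```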
As we can see, the dates are now in the correct format ("timestamp").
With df.describe() we can see the average and median, as well as the min and max, for every column, and all of that can be done with just one line! It returns a full set of descriptive statistics for us.
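For example, on a toy column (the real output covers all 24 columns at once):

```python
import pandas as pd

df = pd.DataFrame({"% Iron Concentrate": [63.0, 65.0, 67.0]})  # toy values
stats = df.describe()  # count, mean, std, min, 25%, 50% (median), 75%, max
print(stats)
```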
The target pH for optimal flotation performance is between 7 and 11. Analysis of the 'Ore Pulp pH' column reveals consistent values within this range (8.8 to 10.8), indicating stable flotation plant conditions during the data collection period.

With a mean iron concentrate of 65% and a standard deviation of 1.12%, the flotation process demonstrates consistent and effective iron ore purification, achieving roughly a 10 percentage-point increase in iron content compared to the feed material.
Our manager has flagged June 1, 2017, as a date of interest. To begin our investigation, we'll determine the dataset's timeframe by identifying the earliest and latest dates.
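With the 'date' column converted to timestamps, .min() and .max() give the bounds directly (shown on sample dates, since the full file is not bundled here):

```python
import pandas as pd

df = pd.DataFrame({"date": pd.to_datetime([
    "2017-03-10 01:00:00", "2017-06-01 12:00:00", "2017-09-09 23:00:00",
])})  # sample dates spanning the dataset's March-September window

start, end = df["date"].min(), df["date"].max()
print(start, end)
```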
To isolate the period of interest, I filtered the data to include only the first week of June.
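One way to express that filter is a boolean mask on the timestamp column (a sketch; the exact cutoffs used in the original are not shown):

```python
import pandas as pd

df = pd.DataFrame({"date": pd.to_datetime([
    "2017-05-31 23:00:00", "2017-06-03 12:00:00", "2017-06-08 00:00:00",
])})  # sample rows around the boundary

# Keep timestamps from June 1 (inclusive) up to June 8 (exclusive)
mask = (df["date"] >= "2017-06-01") & (df["date"] < "2017-06-08")
june_week = df[mask]
print(len(june_week))  # 1 -- only the June 3 row survives
```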
We need to know how these variables all relate, and the best way to find this out is with a scatterplot; but since we want to look at all of these variables and their pairwise relationships, that would require six different plots.
Initial analysis of the data revealed no apparent trends or patterns. Typically, visual representations would highlight correlations or anomalies, but in this case the data points are dispersed without a clear pattern.
A correlation matrix revealed minimal relationships between variables, as indicated by the low correlation coefficients.
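df.corr() computes the coefficients, and a seaborn heatmap makes them easy to scan; sketched on stand-in data (the real matrix covers the process variables from the filtered frame):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend
import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(1)  # random stand-in for the plant data
df = pd.DataFrame(rng.normal(size=(100, 3)),
                  columns=["% Iron Feed", "% Silica Feed", "% Iron Concentrate"])

corr = df.corr()  # pairwise Pearson correlation coefficients
ax = sns.heatmap(corr, annot=True, vmin=-1, vmax=1, cmap="coolwarm")
```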
Now, to better explain our findings to my manager, I want to see how the % Iron Concentrate changes throughout that day. I made a line plot in seaborn, and the result is this graph.
While the iron concentration readings on that particular day exhibit some variability, they fall within the dataset's historical range of 62.05% to 68.01% (mean 65.05%), indicating normal operating conditions.
While no strong relationships have yet been found on the specific date we are investigating, no news is also news: there is as much value in finding that a relationship does not exist as in finding that one does.
Insights
There was only a very weak correlation between the five main variables in the process, and I was able to track the % Iron Concentrate throughout the day of interest.
Consistent ore pulp pH levels, ranging from 8.8 to 10.8, indicate optimal flotation plant operating conditions throughout the monitored period.
Through this data analysis project, I applied Python to explore iron ore processing data. While predictive modeling for iron concentrate and silica content proved challenging with the current dataset, I gained valuable insights into operational trends and potential areas for improvement.
Thank you for taking the time to review my work. Your feedback and insights are valuable. Please feel free to leave a comment below or connect with me on LinkedIn for further discussion.