Mining for Mining Data with Python
Cynthia Clifford
Strategic Energy Management Data Analyst at CLEAResult -- Creative Problem Solver | Data-Driven Insights | Client-Centric Solutions Specialist
Introduction
In this project, I imagine working as a data analyst for a mining company called Metals R' Us. I’ve been given data from their flotation plant, where they separate iron from impurities like dirt, sand, and silica. I've been asked to mine for insights (pun intended) and to answer a few specific questions for my boss.
The Data
This is real data collected from March 2017 to September 2017. More information about the data set can be found on Kaggle. The CSV file contains 737,453 rows and 24 columns.
The data is a bit messy in some ways: some columns are sampled every 20 seconds, while others are sampled every hour. The date column has the day, month, year, and hour, but doesn't show minutes.
To work effectively with the data, it is important to understand how the flotation process extracts iron. Mined iron is surrounded by impurities like dirt, sand, and silica that need to be separated out. This is done at a flotation plant: water is mixed into the iron/dirt clumps to create a slurry, and the slurry is then mixed with starch and amina to separate the dirt from the iron. The mixture is then aerated so the metals rise to the top while the minerals remain at the bottom.
In the columns, a flow measures how fast material is moving, and a level represents the height of the froth created by all the bubbles.
The second-to-last variable, "% Iron Concentrate," is the one to focus on: it measures how pure the iron is.
Here's a YouTube video explaining the process of separating iron at a flotation plant.
Key Insights
Getting to Know the Data
I began the analysis by importing the CSV file into Deepnote, the browser-based Integrated Development Environment (IDE) I used for the project. I then installed three libraries (pandas, seaborn, and matplotlib) using !pip install commands.
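In a notebook environment like Deepnote, the installs look something like this:

```python
# Install the three libraries used in this project (notebook shell magic)
!pip install pandas seaborn matplotlib
```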
Next, I loaded the data into a pandas DataFrame. I needed to add the decimal="," argument because the CSV file uses commas as decimal separators, and I wanted the values read in as numbers with decimal points.
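A minimal sketch of the load step; the file name here is the one the Kaggle download uses:

```python
import pandas as pd

# decimal="," tells pandas the file uses commas as decimal separators
df = pd.read_csv("MiningProcess_Flotation_Plant_Database.csv", decimal=",")
df.head()
```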
The output looked like this:
After importing the data and the appropriate libraries into Python and getting a preview displayed, I did some exploratory data analysis. I used df.shape to find the number of rows and columns in the data set.
There are 737,453 rows and 24 columns; however, this was also easily discernible from the CSV file.
As this is my first foray into Python, I also practiced some basic functions that let me look at just a specific column, e.g. df['% Iron Concentrate'], or at specific rows, e.g. df.iloc[100:105, :] to look at rows 100 to 104. In this command, the first number is inclusive and the second number is exclusive.
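For example:

```python
# A single column
df['% Iron Concentrate']

# Rows 100 through 104 (start inclusive, stop exclusive)
df.iloc[100:105, :]
```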
The outputs looked like this:
and like this:
Most of the 24 columns of data are numerical (float) columns, but there is also a date column. The code below (a sketch of the check) lets us determine the format of the date column:
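```python
# Inspect the type of the first value in the date column
type(df['date'][0])
```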
This tells us that the date is stored as a string. We need to convert it to a date-and-time type, which I did using the pandas to_datetime function, roughly as follows:
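```python
# Convert the date column from strings to pandas Timestamps
df['date'] = pd.to_datetime(df['date'])

# Verify the conversion took effect
type(df['date'][0])
```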
The second part of the output verifies that the date column has been successfully converted to a Timestamp format.
After this, I performed some introductory descriptive analysis of the data using the df.describe() function.
It returned the count, mean, standard deviation, min, Q1, median, Q3, and the max for each column in the data set as shown below:
From this, we can see that there is a fair amount of variability in the purity of the iron ore being processed at the plant and in the percent silica feed. Starch flow also varies quite a bit, as does amina flow, which makes sense since silica feed, starch flow, and amina flow are all largely determined by the purity of the ore being processed. However, by the end of the extraction process there is not a lot of variability in the % Iron Concentrate, indicating that the processes at the plant are working fairly well.
The Scenario: Context and Analysis
The % Iron Concentrate is the most important variable: factory quality is determined by the consistency and quality of this output. But an engineering colleague tells us that the % Silica Concentrate, Ore Pulp pH, and Flotation Column 05 Level are all important to monitor as well. Of course, the date is also important.
Our boss says something weird happened on June 1, 2017, and wants us to investigate.
Our df.describe() command did not provide summary statistics of the date column. So we will need to look at that column first. We want the earliest and latest dates in that column.
The following command, sketched below, returns the earliest and latest dates in that column:
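```python
# Earliest and latest timestamps in the data set
print(df['date'].min())
print(df['date'].max())
```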
It tells us that the earliest date is 2017-03-10 01:00:00 and the latest date is 2017-09-09 23:00:00.
Now we will filter to the data near June 1 by building a boolean mask and creating a new data frame, df_june. A sketch, assuming the window of interest is June 1 itself:
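```python
# Boolean mask: keep only the rows stamped June 1, 2017
mask = (df['date'] >= '2017-06-01') & (df['date'] < '2017-06-02')
df_june = df[mask]
```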
After this, we will create a list of the important columns: date, % Iron Concentrate, % Silica Concentrate, Ore Pulp pH, and Flotation Column 05 Level. Then we will create a new data frame with only these columns so we can take a deep dive into the data around the potential incident on June 1. In sketch form:
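```python
# The columns flagged as important
important_cols = ['date', '% Iron Concentrate', '% Silica Concentrate',
                  'Ore Pulp pH', 'Flotation Column 05 Level']

# New data frame restricted to those columns
df_june_important = df_june[important_cols]
df_june_important
```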
We now have 432 pages of data to sort through to determine what happened on June 1 and why. Our results at this point look like this:
But 4320 rows on 432 pages are still too many to scroll through to find our anomaly. My boss wants to know how these variables all relate.
Using the Seaborn library, I can pair each of these variables and create all 16 scatter plots at once. (Due to the nature of the pairplot command, each pair of variables is presented twice, with the axes reversed.) A sketch of the call:
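```python
import seaborn as sns

# Pairwise plots of the four numeric columns
# (pairplot ignores the datetime column automatically)
sns.pairplot(df_june_important)
```

A portion of the resulting output looks like this: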
In the scatter plots above, there does appear to be something “off” with the percent of silica. There is an outlier point where the percentage of silica is over 4%. This merits further investigation. This has been conveyed to my boss and to the engineering department.
There do not appear to be any specific patterns or discernible relationships, however. I decided to confirm this by calculating the correlation coefficients: I created a correlation matrix and verified that there were no significant correlations among the important columns around June 1. This is good information for my boss to have. The check, in sketch form (rendering it as a heatmap is my choice of presentation):
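```python
import matplotlib.pyplot as plt

# Correlation matrix of the numeric columns around June 1
corr = df_june_important.drop(columns='date').corr()
sns.heatmap(corr, annot=True, cmap='coolwarm')
plt.show()
```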
I also looked for significant correlations in the original data frame, using all columns, and the only one found was self-evident: a strong negative correlation between percent iron feed and percent silica feed. Clearly, when one goes up, the other goes down. There are thus no meaningful correlations in the full data set either. This can be seen below.
However, the boss is still interested in June 1, so we need to dig more deeply into the data on that day. I decided to make a line graph showing how the percent iron concentrate changes throughout that day, using the seaborn library. In sketch form:
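```python
# % Iron Concentrate over the course of June 1
sns.lineplot(data=df_june, x='date', y='% Iron Concentrate')
plt.xticks(rotation=45)
plt.show()
```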
The graph looked like this.
Despite the spikes visible on this day, the full dataset's % Iron Concentrate has a minimum of 62.05, a maximum of 68.01, and a mean of 65.05, so the spikes in this day's percent iron concentrate are well within normal limits.
The engineers found this graph useful and asked me to create time-series graphs for the other variables.
I used code along these lines:
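```python
# One time-series plot for each of the other important variables
for col in ['% Silica Concentrate', 'Ore Pulp pH', 'Flotation Column 05 Level']:
    sns.lineplot(data=df_june, x='date', y=col)
    plt.title(col)
    plt.xticks(rotation=45)
    plt.show()
```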
The results did not bring any new relationships to light but merely confirmed known relationships for the engineers, such as the inverse relationship between the percent iron concentrate and the percent silica concentrate.
While no specific relationships have been found on the date the boss is focused on, no news is also news: there is as much value in finding that a relationship does not exist as in finding that one does. The boss should be reassured.
Because I didn't find many insights for the engineers, I persisted and looked at the interplay of key reagents to see if anything important jumped out. I used the following code (sketched here) to create a new data frame for the important reagents (starch flow, amina flow) and their interplay with ore pulp flow and ore pulp pH:
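```python
# Key reagent flows plus the ore pulp columns they interact with
reagent_cols = ['date', 'Starch Flow', 'Amina Flow', 'Ore Pulp Flow', 'Ore Pulp pH']
df_reagents = df[reagent_cols]
```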
I then created pair plots and a correlation matrix for these reagents and factors and once again did not find any significant relationships.
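A sketch of those two steps:

```python
# Pairwise views and a correlation matrix for the reagent columns
sns.pairplot(df_reagents)

df_reagents.drop(columns='date').corr()
```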
This was confirmed by the correlation matrix
and by the line plots, an example being shown below. There is variability but no distinct association.
Still trying to find something significant for my boss, I continued with a few further explorations, using ChatGPT to help me generate the code.
I created histograms of Ore Pulp pH levels using Seaborn, with code along these lines:
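```python
# Distribution of Ore Pulp pH across the entire data set
# (the kde overlay is optional styling)
sns.histplot(data=df, x='Ore Pulp pH', kde=True)
plt.show()
```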
The resulting histogram shows that Ore Pulp pH levels for the entire data set are roughly normally distributed around a mean of approximately 9.75.
I then looked at Ore Pulp pH just for the month of June, using code along these lines:
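```python
# Restrict to the month of June, then plot the same distribution
june_mask = (df['date'] >= '2017-06-01') & (df['date'] < '2017-07-01')
sns.histplot(data=df[june_mask], x='Ore Pulp pH', kde=True)
plt.show()
```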
The histogram again shows that Ore Pulp pH levels in June are approximately normal, centered near 10. The June data is somewhat left-skewed and less clearly normal than the full data set. The variability in June is lower than that of the entire set, however, again lending credence to the idea that there was no incident around June 1.
I decided to make one more attempt at relationship finding and used code along these lines:
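```python
# Bubble plot: % Iron Feed vs. % Iron Concentrate, with bubble size
# mapped to Ore Pulp pH. Sampling keeps the plot responsive; the
# sample size is my choice, not part of the original analysis.
sample = df.sample(5000, random_state=42)
sns.scatterplot(data=sample, x='% Iron Feed', y='% Iron Concentrate',
                size='Ore Pulp pH', alpha=0.3)
plt.show()
```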
The resulting bubble plot shows that there is little variation in the percent iron feed and a bit more variation in the resulting percent iron concentrate, but the variation in percent iron concentrate does not appear to be connected to the ore pulp pH, so it is presumably attributable to other characteristics of the raw ore or to characteristics of flow.
Recap and Insights
Overall, it appears extraction and processing are happening consistently well at this plant. By the end of the extraction process, there is not a lot of variability in the % Iron Concentrate, although there is significant variability in the % Iron Feed.
Very few relationships exist between the various columns of data, but there is an outlier point where the percentage of silica is over 4%. This merits further investigation.
Overall, no news is good news. This data analyst was unable to discern any real disturbance on June 1. My boss will be relieved to know the plant is operating consistently without any aberrant events.
I really enjoyed my first foray into Python and look forward to continuing to use it and to improve in my use of it.
Thank you so much for reading my project. Suggestions are welcome. If you have any questions, feel free to comment below, reach out to me via email at [email protected], connect with me on LinkedIn, or check out my portfolio to see my other projects.
I'm currently looking for opportunities as a data analyst. If you know of any opportunities in your network, please reach out. Thank you!