Mining for Mining Data with Python

Mining for Mining Data with Python

Introduction

In this project, I imagine working as a data analyst for a mining company called Metals R' Us. I’ve been given data from their flotation plant, where they separate iron from impurities like dirt, sand, and silica. I've been asked to mine for insights (pun intended) and to answer a few specific questions for my boss.

The Data

This data is real data taken from March 2017 to September 2017. More information about the data set can be found on Kaggle. There are 737,454 rows in the Excel csv file, and there are 24 columns.

The data is a bit messy in some ways with some columns sampled every 20 seconds, and other sampled every hour.?The date column has the day, month, year, and hour, but doesn't show minutes. There are 24 columns.

To work effectively with the data, it is important to understand how the flotation process works to extract iron. Iron is surrounded by impurities like dirt, sand, and silica that need to be separated from the iron after it is mined. This is done at a flotation plant, mixing water into the iron/dirt clumps to create a slurry.?The slurry is mixed with starch and Amina to separate the dirt from the iron. This is then aerated to get metals to rise to the top while the minerals remain at the bottom.

In the columns, flow is how fast something is moving. The level represents how tall the frothing is from all the bubbles.

The second to last variable "% Iron Concentrate" is the one to focus on. That is how pure the iron is.

Here's a YouTube video explaining the process of separating iron at a flotation plant.??

Key Insights

  • There is a fair amount of variability in the purity of the iron ore being processed at the plant and in the percent of silica feed.
  • By the end of the extraction process, there is not a lot of variability in the % Iron Concentrate, indicating the processes at the plant are working fairly well.
  • The pair plots indicate that there appears to be something “off” with the percent of silica. There is an outlier point where the percentage of silica is over 4%. This merits further investigation.
  • There were no significant correlations within the important data columns around June 1, the day the boss is most interested in. There were no meaningful correlations in the full data set either.
  • The spikes in June 1st's percent iron concentration are well within normal limits.
  • The variation in the percent Iron Concentrate does not appear to be connected to the ore pulp pH.

Getting to Know the Data

No alt text provided for this image

I began the analysis by importing the CSV file into DeepNote, the browser-based Interactive Development Environment (IDE) I used for the project. I then installed 3 libraries-pandas, seaborn, and matplotlib using !ipip commands.





No alt text provided for this image

Next I “connected” the data to the panda library. I needed to add the decimal ="," command because the CSV file had commas and I wanted to present the data as numbers with decimal points.

The output looked like this:

No alt text provided for this image

After importing the data and the appropriate libraries into Python and getting a preview displayed, I did some exploratory data analysis. I used df.shape to find the number of rows and columns in the data set.

No alt text provided for this image

There are 737453 rows and 24 columns, however, this was easily discernible from the CSV file as well.




As this is my first foray into Python, I also practiced some basic coding functions to allow me to look at just a specific column, e.g. df['% Iron Concentrate'] or to look at specific rows, e.g. df[100:105,:] to look at the data from rows 100 to 104. In this command, the first number is inclusive and the second number is exclusive.

The outputs looked like this:

No alt text provided for this image

and like this:

No alt text provided for this image
No alt text provided for this image

Most of the 24 columns of data are numerical (float) columns. But there is a date column.?The code on the left lets us determine the format of the date column. It returned the result below:



No alt text provided for this image

This tells us that the date is a string. We need to convert it to date and time. I did this by using the pandas library ?to.datetime command as follows:

No alt text provided for this image

The second part of the output verifies that the date column has been successfully converted to a Timestamp format.

After this, I performed some introductory descriptive analysis of the data using the df.describe() function.

It returned the count, mean, standard deviation, min, Q1, median, Q3, and the max for each column in the data set as shown below:

No alt text provided for this image

From this, we can see that there is a fair amount of variability in the purity of the iron ore being processed at the plant and in the percent of silica feed. Starch flow also varies quite a bit, as does Amina Flow, which makes sense since the latter 3 are largely determined by the purity of the iron ore being processed. However, by the end of the extraction process, there is not a lot of variability in the % Iron Concentrate, indicating the processes at the plant are working fairly well.

The Scenario --- Context and Analysis

The % Iron Concentrate is the most important variable and factory quality is determined by the consistency and quality of this output. But an engineering colleague tells us that the % Silica Concentrate, Ore Pulp pH, & Flotation Column 05 Level are all really important to monitor as well. Of course, the date is also important.

Our boss says something weird happened on June 1, 2017, and wants us to investigate.?

Our df.describe() command did not provide summary statistics of the date column. So we will need to look at that column first. We want the earliest and latest dates in that column.

The following command

No alt text provided for this image

tells us that the earliest date is 2017-03-10 01:00:00 and the latest date is 2017-09-09 23:00:00.

Now we will filter this to get a date near June 1.?Next, let's filter the rows with a boolean mask & create a new data frame?df_june.

No alt text provided for this image

After this, we will create a variable which is just the important columns, which are: date, % Iron Concentrate, % Silica Concentrate, Ore Pulp pH, and Flotation Column 05 Level. Then we will create a new data frame with only the important columns so we can take a deep dive into the data around the potential incident on June 1.

No alt text provided for this image

We now have 432 pages of data to sort through to determine what happened and why on June 1. Our results, at this point, look like this:

No alt text provided for this image

But 4320 rows on 432 pages are still too many to scroll through to find our anomaly. My boss wants to know how these variables all relate.

Using the Seaborn library, I can pair each of these variables and create all 16 scatter plots at once (Due to the nature of the pair-plot command the same two variables are presented as a pair twice, with the axes reversed). A portion of the resulting output looks like this:

No alt text provided for this image

In the scatter plots above, there does appear to be something “off” with the percent of silica. There is an outlier point where the percentage of silica is over 4%. This merits further investigation. This has been conveyed to my boss and to the engineering department.

There do not appear to be any specific patterns or discernible relationships, however. I decided to confirm this by calculating the correlation coefficients. I created a correlation matrix and verified that there were no significant correlations within the important data columns around June 1 as shown below. This is good information for my boss to have.

No alt text provided for this image

I also looked to see if there are any significant correlations in the original data frame, using all columns and the only one found was self-evident---a strong negative correlation between percent iron feed and percent silica feed. Clearly when one goes up, the other goes down. There are, thus, no meaningful correlations in the full data set either. This can be seen below.

No alt text provided for this image

However, the boss is still interested in June 1 so we need to dig more deeply into the data on that day. I decide to make a line graph to show how the percent of iron concentrate changes throughout that day using the seaborn library.

The graph looked like this.

No alt text provided for this image

Despite the spikes in this day's % Iron Concentration, in the full dataset there is a minimum of 62.05 and a maximum of 68.01, and a mean of 65.05, so the spikes in this day's percent iron concentration are well within normal limits.

The engineers found this graph useful and asked me to create time-series graphs for the other variables.

I used this code:

No alt text provided for this image

The results did not bring any important relationships to light but merely confirmed relationships for the engineers, such as the inverse relationship between the percent iron concentrate and the percent silica concentration.

No alt text provided for this image

While very specific relationships have not been determined yet on the date the boss is focused on, no news is also news. There is as much value in finding that there is not a relationship as in finding that there is one. The boss should be even more reassured.

Because I didn’t find too many insights for the engineers, I persisted and tried to look at the interplay of key reagents to see if anything important jumped out.?I used the following code to create a new data frame for important reagents (starch flow, amina flow) and their interplay with ore pulp flow and ore pulp pH.

No alt text provided for this image

I then created pair plots and a correlation matrix for these reagents and factors and once again did not find any significant relationships.

No alt text provided for this image

This was confirmed by the correlation matrix

No alt text provided for this image

and by the line plots, an example being shown below. There is variability but no distinct association.

No alt text provided for this image

Still trying to find something significant for my boss, I continued with a few further explorations, using chatgpt to help me generate the code.

I created histograms of ore pulp ph levels using Seaborn. I used the following code:

No alt text provided for this image

The resulting histogram shows that ore pulp levels for the entire data set are normally distributed around a mean of approximately 9.75.

No alt text provided for this image

I then looked at ore pulp pH just for June using this code:

No alt text provided for this image

The histogram again shows that ore pulp pH levels in June are approximately normal, and centered at 10. The data for June is somewhat left skewed and less clearly normal than the full set of data. The variability in June is less than the variability of the entire set, however, again lending credence to the idea that perhaps there was not an incident around June 1.

No alt text provided for this image

I decided to make one more attempt at relationship finding and used the following code:

No alt text provided for this image

The resulting bubble plot shows that there is little variation in the percent of iron feed and a bit more variation in the resulting percent Iron Concentrate, but that the variation in percent Iron Concentrate does not appear to be connected to the ore pulp pH, so it must be attributable to other characteristics of the raw ore or to characteristics of flow.

Recap and Insights

Overall, it appears extraction and processing are happening consistently well at this plant. By the end of the extraction process, there is not a lot of variability in the % Iron Concentrate, although there is significant variability in the % Iron Feed.

Very few relationships exist between the various columns of data but there is an outlier point where the percentage of silica is over 4%. This merits further investigation.

Overall, no news is good news. This data analyst was unable to discern any real disturbance on June 1. My boss will be relieved to know the plant is operating consistently without any aberrant events.

I really enjoyed my first foray into Python and look forward to continuing to use it and to improve in my use of it.

Thank you so much for reading my project. Suggestions are welcome. If you have any questions, feel free to comment below or reach out to me via email at [email protected] or connect with me on?LinkedIn?or check out?my portfolio?to see my other projects.

I'm currently looking for opportunities as a data analyst. If you know of any opportunities in your network, please reach out. Thank you!

James Omirambe

Professional Mining Engineer. Mine Planning Engineer EXPERIENCED AND SKILLED; Mine Geologist, Mine surveyor Mine Production Supervisor Wagagai Gold Mine Certified Data Analyst (Python,SQL and Excel)

10 个月

And I would like to connect with you for more link ups as far as data analysis is concerned

回复
James Omirambe

Professional Mining Engineer. Mine Planning Engineer EXPERIENCED AND SKILLED; Mine Geologist, Mine surveyor Mine Production Supervisor Wagagai Gold Mine Certified Data Analyst (Python,SQL and Excel)

10 个月

Impressive. High Cynthia am actually a mining Engineering graduate and a data analyst certified. I would like you to share with me the data if you don't mind for my practice

回复
Tarannom Haque

Data Analyst| | Data Visualization | SQL | Tableau | Excel

1 年

Great work, Cynthia! This looks amazing!

回复
Charles Ojukwu

Data Analyst | Power BI Consultant | Transforming Data into Strategic Business Decisions | SQL | Excel

1 年

Great work Cynthia! Its amazing that this was your first foray into Python! ??

回复
Trevor Maxwell

Technical Business Analyst | Data Nerd | (SQL : Python : Tableau : PowerBI)

1 年

Very well done looks good.

回复

要查看或添加评论,请登录

Cynthia Clifford的更多文章

  • Should I Stay or Should I Go?

    Should I Stay or Should I Go?

    Introduction We’ve all had jobs and we’ve all left jobs. When thinking about leaving, we often draw up pros and cons…

    20 条评论
  • The NBA 2021-22-- What a Year!

    The NBA 2021-22-- What a Year!

    Introduction In this project, I’m pretending to be interviewed by the Boston Celtics of the NBA for a Junior Data…

    3 条评论
  • Patient Profiles Healthcare Analysis

    Patient Profiles Healthcare Analysis

    Introduction I have two brothers but I took care of all the hospital visits for the three of us. You name it--broken…

    23 条评论
  • Who Gets the Money?

    Who Gets the Money?

    Introduction I've lived in 12 countries outside of the United States. All but one has been a "developing" country.

    12 条评论
  • What's Up With Schools?

    What's Up With Schools?

    "Education is not preparation for life; education is life itself." John Dewey An Analysis of Massachusetts Public…

    6 条评论
  • Door Dash Market Analysis

    Door Dash Market Analysis

    Door Dash Market Analysis using Excel--Who's ordering? by Cynthia Clifford I work in Vietnam and use the Grab app…

    13 条评论

社区洞察