Iron Mining Python Data Analysis

For this project, I analyzed data for a hypothetical mining company called Metals R’ Us. One of the metals they mine is iron. In the mining process, Metals R’ Us digs up large clumps of ore that contain iron along with impurities such as dirt, sand, and silica. The ore is turned into a pulp and run through a flotation plant. Starch and amine are mixed in to strip the impurities away from the iron. Air bubbles are added to the liquid mixture so that the silica-rich impurities attach to the bubbles and rise to the top as froth while the iron minerals remain at the bottom, which increases the purity of the iron concentrate. This video explains the flotation process.

The boss at Metals R’ Us was concerned that something unusual happened on June 1, 2017. The data was cleaned, manipulated, and analyzed to determine if this was the case.

Insights

  • The pairplot shows no clear relationships between % Iron Concentrate, % Silica Concentrate, Ore Pulp pH, and Flotation Column 05 Level.
  • The correlation is weak between % Iron Concentrate, % Silica Concentrate, Ore Pulp pH, and Flotation Column 05 Level.
  • Line plots for % Iron Concentrate, % Silica Concentrate, and Ore Pulp pH show nothing concerning for June 1, 2017.
  • The line plot for Flotation Column 05 Level shows a sharp drop-off around 7:00pm on June 1, 2017.

Dataset

The dataset used for this project came from Kaggle. It is real data collected from March to September 2017. The data was downloaded as a csv file and analyzed using Python, a free open-source programming language used by many large companies.

Important columns for this analysis were:

  • Date = date & time stamp of sample reading
  • % Iron Concentrate = % of iron at end of flotation process
  • % Silica Concentrate = % of silica at end of flotation process
  • Ore Pulp pH = pH on scale from 0 - 14
  • Flotation Column 05 Level = froth level in the flotation cell; the lower the level, the higher the grade of the concentrate

Analysis

I used Python to analyze the dataset. There are thousands of packages available to use in Python. Common packages used for data analysis include:

  • NumPy for mathematical operations
  • Pandas for data manipulation and analysis (built on top of NumPy)
  • Matplotlib for data visualization
  • Seaborn for data visualization (built on top of Matplotlib)

For this project, I used DeepNote to run Python. DeepNote is a browser-based IDE (integrated development environment) used to write and run the code for a project in a notebook.

To begin, I imported the packages I needed. Note that the packages are given an abbreviation (such as pd for Pandas) so it is easier to refer to them later in the code.
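
As a rough sketch, the import step usually looks like the block below. The abbreviations np, pd, plt, and sns are the common community conventions; they are assumed here rather than copied from the original notebook.

```python
# Conventional aliases; assumed here, not taken from the original notebook.
import numpy as np               # mathematical operations
import pandas as pd              # data manipulation and analysis
import matplotlib.pyplot as plt  # data visualization
import seaborn as sns            # data visualization built on Matplotlib

# Printing the versions is a quick way to confirm the imports worked.
print("NumPy", np.__version__)
print("Pandas", pd.__version__)
print("Seaborn", sns.__version__)
```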

The csv file of the dataset was read into a Pandas dataframe using the pd.read_csv function. The function df.head() gives a preview of the dataframe.
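
Here is a minimal sketch of that step. Since the real file is not included in this article, a three-row stand-in with made-up values is read from memory instead; the column names follow the ones listed earlier, and the lowercase "date" header is an assumption.

```python
import io
import pandas as pd

# Hypothetical three-row stand-in for the downloaded file; the values are made up.
csv_text = (
    "date,% Iron Concentrate,% Silica Concentrate\n"
    '2017-03-10 01:00:00,"66,91","1,31"\n'
    '2017-03-10 02:00:00,"67,06","1,11"\n'
    '2017-03-10 03:00:00,"66,97","1,27"\n'
)

# With a real file this would be pd.read_csv("path/to/file.csv").
df = pd.read_csv(io.StringIO(csv_text))

# head() previews the first rows (five by default).
print(df.head())
```

With default settings, values such as "66,91" are read as text rather than numbers, which is the problem addressed next.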

The preview shows that the numbers use commas instead of periods as the decimal separator. To fix this, I re-read the csv file into the dataframe, this time specifying that commas signify decimals. Re-running df.head() shows that this is now fixed.
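
The fix relies on the decimal parameter of pd.read_csv. The sketch below uses a made-up in-memory sample rather than the real file, so the values are illustrative only.

```python
import io
import pandas as pd

# Hypothetical sample with comma decimals; the values are made up.
csv_text = (
    "date,% Iron Concentrate,% Silica Concentrate\n"
    '2017-03-10 01:00:00,"66,91","1,31"\n'
    '2017-03-10 02:00:00,"67,06","1,11"\n'
)

# decimal="," tells Pandas to treat the comma as the decimal point.
df = pd.read_csv(io.StringIO(csv_text), decimal=",")

print(df.head())
print(df.dtypes)  # the two concentrate columns should now be floats
```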

To get the number of rows and columns in the dataframe, I used df.shape. There are 737,453 rows of data and 24 columns.
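
Note that df.shape is an attribute, not a function, so it is written without parentheses. A small illustration on a made-up dataframe:

```python
import pandas as pd

# Tiny made-up dataframe, only to show what df.shape returns.
df = pd.DataFrame(
    {
        "% Iron Concentrate": [66.9, 67.1, 66.8, 67.0],
        "% Silica Concentrate": [1.3, 1.1, 1.4, 1.2],
        "Ore Pulp pH": [9.7, 9.8, 9.9, 9.8],
    }
)

# shape is a (rows, columns) tuple, so it can be unpacked directly.
n_rows, n_cols = df.shape
print(f"{n_rows:,} rows and {n_cols} columns")
```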

Dates are sometimes not imported in the format best suited for analysis. I used print() together with type() to check the type of the values in the date column (type(df) on its own only reports that df is a dataframe). This shows that dates were imported as strings instead of datetime. This was corrected using pd.to_datetime.
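
One way to run this check and conversion is sketched below on a made-up two-row dataframe. The column name "date" is an assumption, and the original notebook may have checked the type differently.

```python
import pandas as pd

# Made-up sample where the timestamps are still plain strings.
df = pd.DataFrame(
    {
        "date": ["2017-06-01 19:00:00", "2017-06-01 20:00:00"],
        "Ore Pulp pH": [9.8, 9.9],
    }
)

# Check a single value from the column: before conversion it is a Python str.
print(type(df["date"].iloc[0]))

# Convert the whole column to datetime values.
df["date"] = pd.to_datetime(df["date"])

# After conversion each value is a Pandas Timestamp.
print(type(df["date"].iloc[0]))
print(df.dtypes)
```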

The df.describe() function gives summary descriptive statistics for each numeric column:

  • Count = number of non-empty values
  • Mean = average value
  • Std = standard deviation
  • Min = minimum value
  • 25% = 25th percentile
  • 50% = 50th percentile
  • 75% = 75th percentile
  • Max = maximum value
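
To make these rows concrete, here is a small sketch on made-up pH values with one missing reading. It shows that the count skips empty values and that the 50% row is the median.

```python
import pandas as pd

# Made-up pH readings; None stands for a missing value.
df = pd.DataFrame({"Ore Pulp pH": [9.5, 9.7, 9.9, 10.1, None]})

summary = df.describe()
print(summary)

# count ignores the missing reading; the 50% row equals the median.
print("count :", summary.loc["count", "Ore Pulp pH"])
print("median:", df["Ore Pulp pH"].median())
```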

Now that the preliminary work is done, I can investigate the boss's concern that something weird happened on June 1, 2017. First, I checked the date range of the data by finding the minimum and maximum dates, which are 3/10/17 and 9/9/17, respectively. Then, I filtered the data by creating a new dataframe called df_june1 that contains only June 1st.
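
A possible version of the date-range check and the June 1st filter is sketched below on a made-up four-row dataframe. The boolean-mask approach is one option among several and may differ from the original notebook.

```python
import pandas as pd

# Made-up readings that straddle June 1, 2017.
df = pd.DataFrame(
    {
        "date": pd.to_datetime(
            [
                "2017-05-31 23:00:00",
                "2017-06-01 00:00:00",
                "2017-06-01 19:00:00",
                "2017-06-02 00:00:00",
            ]
        ),
        "Ore Pulp pH": [9.6, 9.7, 9.8, 9.9],
    }
)

# Date range of the data.
print("min:", df["date"].min())
print("max:", df["date"].max())

# Keep rows from midnight on June 1 up to, but not including, midnight on June 2.
june1_mask = (df["date"] >= "2017-06-01") & (df["date"] < "2017-06-02")
df_june1 = df[june1_mask]
print(df_june1)
```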

According to the engineering department, the most important variables are % Iron Concentrate, % Silica Concentrate, Ore Pulp pH, and Flotation Column 05 Level. I created another dataframe called df_june_important which has just these columns along with the datetime stamp for only June 1st.
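
Selecting a subset of columns can be done by passing a list of column names. In the sketch below, the df_june1 rows are made up, and "Starch Flow" is included only as an example of a column that gets left out.

```python
import pandas as pd

# Made-up stand-in for df_june1 with one extra column that is not needed.
df_june1 = pd.DataFrame(
    {
        "date": pd.to_datetime(["2017-06-01 00:00:00", "2017-06-01 01:00:00"]),
        "% Iron Concentrate": [65.2, 65.4],
        "% Silica Concentrate": [2.1, 2.0],
        "Ore Pulp pH": [9.7, 9.8],
        "Flotation Column 05 Level": [500.0, 498.0],
        "Starch Flow": [3000.0, 3100.0],  # example of a column to leave out
    }
)

important_cols = [
    "date",
    "% Iron Concentrate",
    "% Silica Concentrate",
    "Ore Pulp pH",
    "Flotation Column 05 Level",
]

# Passing a list of names keeps only those columns, in that order.
df_june_important = df_june1[important_cols]
print(df_june_important)
```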

A pairplot was run using Seaborn to determine if any of the variables correlate with each other. The pairplot results show no clear relationships.

I also ran a correlation matrix. All of the correlation coefficients are low, which confirms weak correlation. As a rule of thumb, absolute values over 0.3 indicate slight correlation, over 0.6 moderate correlation, and over 0.8 strong correlation. The sign of the correlation coefficient indicates the direction of the relationship: positive values indicate positive correlation (as one variable increases, the other increases), while negative values indicate negative correlation (as one increases, the other decreases). The highest correlation in this matrix is 0.30 between % Iron Concentrate and Ore Pulp pH.
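
The sketch below shows how such a matrix can be computed with df.corr() and how the rule of thumb above could be applied. The data is made up, and describe_strength is a hypothetical helper written for this illustration, not part of Pandas or of the original notebook.

```python
import pandas as pd

# Made-up values; iron and silica are constructed to move in opposite directions.
df = pd.DataFrame(
    {
        "% Iron Concentrate": [64.0, 65.0, 66.0, 67.0],
        "% Silica Concentrate": [4.0, 3.0, 2.0, 1.0],
        "Ore Pulp pH": [9.5, 9.9, 9.6, 10.0],
    }
)

# Pearson correlation between every pair of numeric columns.
corr = df.corr()
print(corr.round(2))


def describe_strength(r):
    """Hypothetical helper applying the rule of thumb from the text."""
    size = abs(r)
    if size >= 0.8:
        return "strong"
    if size >= 0.6:
        return "moderate"
    if size >= 0.3:
        return "slight"
    return "weak"


r = corr.loc["% Iron Concentrate", "% Silica Concentrate"]
print(describe_strength(r), "negative" if r < 0 else "positive")
```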

The boss also wants to see line plots for each of the important variables to see how they changed throughout the day on June 1st. I used a for loop, which iterates over each variable and creates a plot. The plots for % Iron Concentrate, % Silica Concentrate, and Ore Pulp pH show variation throughout the day but nothing concerning. The plot for Flotation Column 05 Level shows fairly steady values around 500 for most of the day, with a sharp drop-off to about 300 around 19:00 (7:00pm) that recovered within a couple of hours. It is unknown whether this is unusual, so it will be brought to the boss's attention for possible further investigation.
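
A for loop of this kind might look like the sketch below. The hourly values are made up to mimic the pattern described above (a level near 500 with a dip to 300 around 19:00), so they are not the real readings, and only two of the four variables are plotted to keep the example short.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed in a notebook

import matplotlib.pyplot as plt
import pandas as pd

# Made-up hourly readings for June 1, 2017.
hours = list(range(24))
df_june_important = pd.DataFrame(
    {
        "date": pd.Timestamp("2017-06-01") + pd.to_timedelta(hours, unit="h"),
        "Ore Pulp pH": [9.7 + 0.01 * h for h in hours],
        "Flotation Column 05 Level": [
            300.0 if h in (19, 20) else 500.0 for h in hours
        ],
    }
)

variables = ["Ore Pulp pH", "Flotation Column 05 Level"]
figures = {}

# One line plot per variable, with time on the x-axis.
for column in variables:
    fig, ax = plt.subplots(figsize=(10, 3))
    ax.plot(df_june_important["date"], df_june_important[column])
    ax.set_title(column)
    ax.set_xlabel("Time")
    ax.set_ylabel(column)
    figures[column] = fig

# Time of the lowest level reading in the made-up data.
lowest = df_june_important["Flotation Column 05 Level"].idxmin()
print(df_june_important.loc[lowest, "date"])
```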

Key Takeaways

  • The pairplot shows no clear relationships between % Iron Concentrate, % Silica Concentrate, Ore Pulp pH, and Flotation Column 05 Level.
  • The correlation is weak between % Iron Concentrate, % Silica Concentrate, Ore Pulp pH, and Flotation Column 05 Level.
  • Line plots for % Iron Concentrate, % Silica Concentrate, and Ore Pulp pH show nothing concerning for June 1, 2017.
  • The line plot for Flotation Column 05 Level shows a sharp drop-off around 7:00pm on June 1, 2017.

Conclusion

Thank you for reading my article! Leave a comment below or connect with me. You can also check out my data analysis project portfolio website here.

Caroline J.

Data Analyst | Business Intelligence | I help companies drive data informed decision making | Remote

1 year ago

Way to go, Christy! Did you have a favorite part of the project?

Stuart Walker

Fraud Prevention Analyst @ M&G PLC | Data Analyst | Data Scientist | Python | SQL | Machine Learning | Data Analytics | Excel | Tableau | Power BI | R

1 year ago

Good job Christy, another great project

More articles by Christy Ehlert-Mackie