Iron Mining Python Data Analysis
Christy Ehlert-Mackie
Data Analyst | Bridging Business and Technical Sides to Power Data-Driven Decisions | MSBA, MBA | Excel, SQL, Power BI, Tableau | Background in Accounting and Finance
For this project, I analyzed data for a hypothetical mining company called Metals R’ Us. One of the metals they mine is iron. In the mining process, Metals R’ Us digs up large clumps of earth that contain iron along with impurities such as dirt, sand, and silica. The resulting iron ore pulp is run through a flotation plant, where starch and amina are mixed in to strip the impurities away from the iron. Air bubbles are added to the liquid mixture so the metal rises to the top and minerals remain at the bottom, which increases the purity of the iron concentrate. This video explains the flotation process.
The boss at Metals R’ Us was concerned that something unusual happened on June 1, 2017. The data was cleaned, manipulated, and analyzed to determine if this was the case.
Insights
Dataset
The dataset used for this project came from Kaggle. It is real data collected from March to September 2017. The data was downloaded as a CSV file and analyzed using Python, a free, open-source programming language used by many large companies.
Important columns for this analysis were:
Analysis
I used Python to analyze the dataset. There are thousands of packages available to use in Python. Common packages used for data analysis include:
For this project, I used DeepNote to run Python. DeepNote is a browser-based IDE (integrated development environment) used to write and run a project's code in a notebook.
To begin, I imported the packages I needed. Note that the packages are given an abbreviation (such as pd for Pandas) so it is easier to refer to them later in the code.
The csv file of the dataset was read into a Pandas dataframe using the pd.read_csv function. The function df.head() gives a preview of the dataframe.
The preview shows that the numbers in the data use commas as the decimal separator instead of periods. To fix this, the CSV file is re-read into the dataframe, this time specifying that commas signify decimals. Re-running df.head() shows that this is now fixed.
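The re-read step can be sketched as follows. This is a minimal illustration, not the original notebook code: the two-row sample and its column names are hypothetical stand-ins for the Kaggle file, and it assumes the fix was made with the decimal parameter of pd.read_csv.

```python
import io

import pandas as pd

# Hypothetical two-row sample standing in for the Kaggle CSV file.
# Numbers use a comma as the decimal separator, so they are quoted.
csv_text = (
    'date,% Iron Feed,% Silica Feed\n'
    '2017-03-10 01:00:00,"55,2","16,98"\n'
    '2017-03-10 02:00:00,"57,46","10,8"\n'
)

# Default read: pandas cannot parse "55,2" as a number, so the column stays text.
df_raw = pd.read_csv(io.StringIO(csv_text))

# Re-read, telling pandas that a comma marks the decimal point.
df = pd.read_csv(io.StringIO(csv_text), decimal=",")

print(df_raw.dtypes)
print(df.dtypes)
print(df.head())
```

With a real file, the io.StringIO(...) argument would be replaced by the path to the downloaded CSV.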
To get the number of rows and columns in the dataframe, I used df.shape. There are 737,453 rows of data and 24 columns.
Dates are sometimes not imported in the format best suited for analysis. I checked the data type of the date column (print(type(df)) only reports that df is a dataframe, so the column types come from something like df.dtypes). This shows that dates were imported as strings instead of datetime. This was corrected using pd.to_datetime.
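A minimal sketch of the date conversion, using a small made-up dataframe. The column name date and the sample values are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical sample: timestamps arrive as plain strings.
df = pd.DataFrame(
    {
        "date": ["2017-06-01 00:00:00", "2017-06-01 01:00:00"],
        "Ore Pulp pH": [9.8, 9.9],
    }
)

# Check the column type before converting (text, not datetime).
is_datetime_before = pd.api.types.is_datetime64_any_dtype(df["date"])
print(df.dtypes)

# Convert the strings to proper datetime values.
df["date"] = pd.to_datetime(df["date"])

is_datetime_after = pd.api.types.is_datetime64_any_dtype(df["date"])
print(df.dtypes)
```

Once the column is a datetime type, comparisons such as filtering to a single day work on real timestamps rather than on text.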
The df.describe() function gives summary descriptive statistics for each numeric column:
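As a small illustration of what df.describe() returns, here is a sketch on a made-up three-row dataframe. The values are hypothetical and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical values, for illustration only.
df = pd.DataFrame(
    {
        "% Iron Concentrate": [64.0, 65.0, 66.0],
        "Ore Pulp pH": [9.5, 10.0, 10.5],
    }
)

# One row per statistic (count, mean, std, min, quartiles, max),
# one column per numeric column in the dataframe.
summary = df.describe()
print(summary)
```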
With the preliminary work done, I turned to the boss's concern that something weird happened on June 1, 2017. First, I checked the date range of the data by finding the maximum and minimum dates, which are 9/9/17 and 3/10/17, respectively. Then, I filtered the data by creating a new dataframe called df_june1 that contains only June 1st.
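The date-range check and the June 1st filter could look something like the sketch below. The five sample rows and the column name date are assumptions for illustration, and a half-open date interval is only one of several ways to select a single day.

```python
import pandas as pd

# Hypothetical sample rows spanning the date range described above.
df = pd.DataFrame(
    {
        "date": pd.to_datetime(
            [
                "2017-03-10 01:00:00",
                "2017-06-01 00:00:00",
                "2017-06-01 23:00:00",
                "2017-06-02 00:00:00",
                "2017-09-09 23:00:00",
            ]
        ),
        "% Iron Concentrate": [66.9, 65.1, 64.8, 65.5, 64.3],
    }
)

# Earliest and latest timestamps in the data.
earliest = df["date"].min()
latest = df["date"].max()
print(earliest, latest)

# Keep rows from midnight on June 1st up to (but not including) midnight on June 2nd.
start = pd.Timestamp("2017-06-01")
end = pd.Timestamp("2017-06-02")
df_june1 = df[(df["date"] >= start) & (df["date"] < end)]
print(df_june1)
```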
According to the engineering department, the most important variables are % Iron Concentrate, % Silica Concentrate, Ore Pulp pH, and Flotation Column 05 Level. I created another dataframe called df_june_important which has just these columns along with the datetime stamp for only June 1st.
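Selecting just those columns can be sketched as below. The sample values, the date column name, and the extra % Iron Feed column are hypothetical; only the four variable names come from the text above.

```python
import pandas as pd

# Hypothetical June 1st data with one extra column that is not needed.
df_june1 = pd.DataFrame(
    {
        "date": pd.to_datetime(["2017-06-01 00:00:00", "2017-06-01 01:00:00"]),
        "% Iron Feed": [55.2, 55.2],
        "% Iron Concentrate": [65.1, 64.8],
        "% Silica Concentrate": [2.1, 2.4],
        "Ore Pulp pH": [9.8, 9.9],
        "Flotation Column 05 Level": [501.0, 498.5],
    }
)

# Columns to keep: the timestamp plus the four variables named by engineering.
important_columns = [
    "date",
    "% Iron Concentrate",
    "% Silica Concentrate",
    "Ore Pulp pH",
    "Flotation Column 05 Level",
]

# Passing a list of column names returns a dataframe with only those columns.
df_june_important = df_june1[important_columns]
print(df_june_important)
```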
A pairplot was run using Seaborn to determine if any of the variables correlate with each other. The pairplot results show no clear relationships.
I also ran a correlation matrix. All of the correlation coefficients are low, which confirms that the variables are only weakly correlated. As a rule of thumb, coefficients with an absolute value above 0.3 indicate slight correlation, above 0.6 moderate correlation, and above 0.8 strong correlation. The sign of the coefficient indicates the direction of the relationship: positive values indicate positive correlation (as one variable increases, the other increases), while negative values indicate negative correlation (as one increases, the other decreases). The highest correlation in this matrix is 0.30, between % Iron Concentrate and Ore Pulp pH.
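A minimal sketch of a correlation matrix in pandas, together with a hypothetical helper, describe_strength, that applies the rule-of-thumb thresholds above. The sample values are made up, and treating a coefficient exactly at a threshold as belonging to the higher band is an assumption of this sketch.

```python
import pandas as pd

# Hypothetical values, for illustration only.
df = pd.DataFrame(
    {
        "% Iron Concentrate": [65.0, 65.5, 66.0, 66.5, 67.0],
        "% Silica Concentrate": [3.0, 2.5, 2.0, 1.5, 1.0],
        "Ore Pulp pH": [9.8, 10.1, 9.7, 10.0, 9.9],
    }
)

# Pairwise Pearson correlation coefficients between all numeric columns.
corr_matrix = df.corr()
print(corr_matrix.round(2))


def describe_strength(r):
    """Label a correlation coefficient using the rule-of-thumb thresholds."""
    size = abs(r)  # the sign only gives the direction, not the strength
    if size >= 0.8:
        return "strong"
    if size >= 0.6:
        return "moderate"
    if size >= 0.3:
        return "slight"
    return "weak"


iron_vs_silica = corr_matrix.loc["% Iron Concentrate", "% Silica Concentrate"]
print(describe_strength(iron_vs_silica))
```

In this made-up sample the two concentrate columns move in exactly opposite directions, so their coefficient is -1; the real June 1st data showed nothing nearly that strong.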
The boss also wanted to see line plots for each of the important variables to see how they changed throughout the day on June 1st. I used a for loop, which iterates over each variable and creates a plot. The plots for % Iron Concentrate, % Silica Concentrate, and Ore Pulp pH show variation throughout the day but nothing concerning. The plot for Flotation Column 05 Level shows fairly steady values around 500 for most of the day, with a sharp drop to about 300 around 19:00 (7:00 pm) that recovered within a couple of hours. It is unknown whether this is unusual, so it will be brought to the boss's attention for possible further investigation.
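The plotting loop might look roughly like this sketch, which uses Matplotlib (an assumption; the plotting library used for this step is not stated above). The hourly values are invented to mimic the shapes described, not taken from the dataset, and the date column name is hypothetical.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display

import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical hourly data for June 1st: level holds near 500, dips to 300 around 19:00.
hours = pd.Timestamp("2017-06-01") + pd.to_timedelta(list(range(24)), unit="h")
df_june_important = pd.DataFrame(
    {
        "date": hours,
        "Ore Pulp pH": [9.8 + 0.01 * (h % 5) for h in range(24)],
        "Flotation Column 05 Level": [500.0] * 19 + [300.0] * 2 + [500.0] * 3,
    }
)

# Every column except the timestamp gets its own line plot.
variables = [col for col in df_june_important.columns if col != "date"]

figures = []
for col in variables:
    fig, ax = plt.subplots(figsize=(8, 3))
    ax.plot(df_june_important["date"], df_june_important[col])
    ax.set_title(col)
    ax.set_xlabel("Time")
    ax.set_ylabel(col)
    fig.autofmt_xdate()  # tilt the time labels so they do not overlap
    figures.append(fig)
```

In a notebook, the backend line can be dropped and the figures display inline after the loop.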
Key Takeaways