Python extracted Iron Ore from the Earth's Core: Mining Analysis using Python
Aksha Hrudhai K
Introduction:
Metals are used in almost all the products we use in our daily lives. Iron in particular is one of the most widely traded metals on the market, as it goes into many steel products. Iron ore is mined and then processed to extract the iron, which occurs in the form of iron oxides such as hematite and magnetite. Once extracted, the iron ore undergoes a series of processes, including crushing, grinding, magnetic separation, and sometimes flotation, to produce a concentrated form suitable for steelmaking. This concentrated form, often referred to as iron ore pellets or sinter, is then melted in a blast furnace along with other materials like coke (a form of carbon) and limestone. This process produces molten iron, which is further refined in a basic oxygen furnace or electric arc furnace to produce steel.
In this study I acted as a Data Analyst for a mining company that digs large pits and extracts iron ore from the ground, where it is surrounded by impurities like dirt, sand, and silica. The company runs the ore through a flotation plant to come up with cleaner iron. As a Data Analyst I used Python to examine the dataset from their flotation plant and came up with valuable insights.
Key Findings:
Let’s Dive into Data:
This is real data collected from March 2017 to September 2017. Every row is a time point at 20-second intervals. The date column has the day, month, year, and hour, but doesn't show minutes. There are 24 columns and 75k+ rows. Among the columns, flow is how fast something is moving, and level is how tall the froth from all the bubbles is. The second-to-last variable, "% Iron Concentrate", is the one to focus on: that is how pure the iron is. If you wish to dive further into the data set, here's the link for the DATASET.
Here is the data dictionary explaining what every column represents.
Let’s Get started with Python Analysis:
For this study, a notebook-style Integrated Development Environment called Deepnote is used. Deepnote is one of the most user-friendly IDEs out there, and the best part is that it stores the dataset in its cloud.
I made use of Python libraries to analyze the data from the flotation plant: Pandas is installed for data manipulation, and seaborn and matplotlib are installed for data visualization.
As mentioned earlier, first the dataset (CSV file) is imported into Deepnote. Then the dataset is read using the command below in Python:
df = pd.read_csv('MiningProcess_Flotation_Plant_Database.csv')
Previewed the first 5 and last 5 rows of the dataset using df.head() and df.tail().
Displayed the number of rows and columns in the dataset using df.shape (note that shape is an attribute, not a method, so it takes no parentheses).
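As a quick sketch of these three inspection steps, here is what they look like on a tiny made-up dataframe standing in for the real 75k+ row dataset (the column names are borrowed from the data dictionary; the values are invented for illustration):

```python
import pandas as pd

# Tiny synthetic stand-in for the real flotation-plant dataframe
df = pd.DataFrame({
    "date": ["2017-03-10 01:00:00"] * 3,
    "% Iron Feed": [55.2, 55.2, 55.3],
    "% Iron Concentrate": [66.9, 66.8, 67.0],
})

print(df.head())   # first 5 rows (all 3 here, since the toy frame is small)
print(df.tail())   # last 5 rows
print(df.shape)    # (rows, columns) tuple -- shape is an attribute, not a method
```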
The dataset is not displaying the numerical values properly: they are supposed to be decimal values, but a comma (,) appears in place of the dot (.). This can be fixed simply by letting pandas know that the CSV uses a comma (,) in place of a dot (.). Swap the updated read command in for the original read command and rerun the notebook, and the dataset is clean.
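A minimal sketch of that fix, using the `decimal` parameter of `pd.read_csv` (a small inline CSV stands in for the real file here; with the real file you would keep the original filename and just add `decimal=','`):

```python
import io
import pandas as pd

# Sample rows that use a comma as the decimal separator, as in the raw file
raw = (
    "date,% Iron Concentrate\n"
    '2017-06-01 00:00:00,"66,91"\n'
    '2017-06-01 00:00:00,"66,88"\n'
)

# decimal=',' tells pandas to treat the comma as the decimal point,
# so the column parses as float64 instead of a string ("object") column
df = pd.read_csv(io.StringIO(raw), decimal=",")
print(df["% Iron Concentrate"].dtype)  # float64
```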
In the dataset from the mining flotation plant, the data type of the date column is string, and this needs to be changed to datetime format, because the dates are very important for extracting valuable insights from the dataset.
First, I checked the data type of the dataframe, specifically the date column and a date value, using the code below.
Second, I converted the date column's data type to datetime format using the code below.
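These two steps can be sketched as follows on a hypothetical two-row frame (the real dataframe would already be loaded as df):

```python
import pandas as pd

# Toy frame standing in for the real one: date starts out as a string column
df = pd.DataFrame({"date": ["2017-06-01 00:00:00", "2017-06-01 01:00:00"]})

# Step 1: inspect the current types -- the column dtype is "object",
# and each value is a plain Python string
print(df["date"].dtype)
print(type(df["date"].iloc[0]))

# Step 2: convert the column to pandas datetime
df["date"] = pd.to_datetime(df["date"])
print(df["date"].dtype)  # datetime64[ns]
```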
Now the date column is in datetime/timestamp format. Let's move on to further analysis.
In Python, specifically in the context of data analysis using the Pandas library, df.describe() is a method used to generate descriptive statistics of a DataFrame. The DataFrame is a two-dimensional, tabular data structure provided by Pandas. This method provides a summary of the central tendency, dispersion, and shape of the distribution of a DataFrame's numerical columns.
Here's a breakdown of what df.describe() provides:
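For a numerical column, df.describe() reports the count, mean, standard deviation, minimum, the 25%/50%/75% quartiles, and the maximum. A small illustration on a made-up column:

```python
import pandas as pd

df = pd.DataFrame({"% Iron Concentrate": [66.9, 66.8, 67.0, 65.5]})

# describe() returns a DataFrame whose index holds the statistic names:
# count, mean, std, min, 25%, 50%, 75%, max
stats = df.describe()
print(stats)
```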
The manager asked me to look at a few important columns (% Iron Concentrate, % Silica Concentrate, Ore Pulp pH, & Flotation Column 05 Level) from the dataset on a particular date (1st June) at different hours, expecting that something unusual happened on that day.
First, I listed the date range covered by the dataset using the functions below:
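One simple way to get the date range is min() and max() on the datetime column; a sketch on a toy frame (the timestamps are invented, but the real dataset's range is March to September 2017):

```python
import pandas as pd

df = pd.DataFrame({"date": pd.to_datetime([
    "2017-03-10 01:00:00",
    "2017-06-01 05:00:00",
    "2017-09-09 23:00:00",
])})

# Earliest and latest timestamps in the dataset
print(df["date"].min())
print(df["date"].max())
```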
Second, I created a new dataframe for that particular date (1st June).
The command above says: let's create a new dataframe called df_june that is just the old dataframe, but only where the date is later than May 31, 2017 at midnight and earlier than June 2, 2017. The '&' sign requires both conditions to be met, and the individual conditions are enclosed in round parentheses and then in square brackets to signify the filtering of rows.
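A sketch of this boolean-mask filter (the toy frame stands in for the real one; the lower bound here is written as on-or-after 1st June midnight so that only 1st June rows survive):

```python
import pandas as pd

df = pd.DataFrame({
    "date": pd.to_datetime([
        "2017-05-31 23:00:00",
        "2017-06-01 00:00:00",
        "2017-06-01 12:00:00",
        "2017-06-02 00:00:00",
    ]),
    "% Iron Concentrate": [66.9, 66.8, 67.0, 66.5],
})

# Each comparison yields a boolean Series; '&' combines them element-wise,
# and the square brackets keep only the rows where the combined mask is True
df_june = df[(df["date"] >= "2017-06-01") & (df["date"] < "2017-06-02")]
print(len(df_june))
```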
Third, I created one more dataframe based on the df_june dataframe, including only the above-mentioned columns for that particular date (1st June).
The new dataframe df_june_important consists of only the 5 important columns and 4,320 rows, filtered down to that one particular date (1st June).
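Selecting a subset of columns is done by indexing with a list of column names; a sketch on a toy df_june (values invented, with one extra column to show it being dropped):

```python
import pandas as pd

df_june = pd.DataFrame({
    "date": pd.to_datetime(["2017-06-01 00:00:00", "2017-06-01 01:00:00"]),
    "% Iron Concentrate": [66.9, 66.8],
    "% Silica Concentrate": [1.2, 1.3],
    "Ore Pulp pH": [9.8, 9.9],
    "Flotation Column 05 Level": [450.0, 452.0],
    "% Iron Feed": [55.2, 55.3],  # extra column that gets dropped below
})

# Indexing with a list of names returns a new dataframe with just those columns
important_cols = ["date", "% Iron Concentrate", "% Silica Concentrate",
                  "Ore Pulp pH", "Flotation Column 05 Level"]
df_june_important = df_june[important_cols]
print(df_june_important.shape)
```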
Finally, to see the correlation among all the important columns, I designed a scatter plot for each pair of the above-mentioned columns using the pair plot function in Python, to see if there was anything unusual among them on the given date, 1st June.
I personally do not see any correlation between the important columns given by the manager on that particular date (1st June). We can confirm this with a correlation matrix. Python is a powerful tool that can do much more, and the correlation matrix is one example: it can be computed in pandas using the corr() function.
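The correlation matrix is one line of pandas; a sketch on the same kind of toy frame (invented values):

```python
import pandas as pd

df_june_important = pd.DataFrame({
    "% Iron Concentrate": [66.9, 66.8, 67.0, 66.5],
    "% Silica Concentrate": [1.2, 1.3, 1.1, 1.5],
    "Ore Pulp pH": [9.8, 9.9, 9.7, 10.0],
    "Flotation Column 05 Level": [450.0, 452.0, 449.0, 455.0],
})

# Pairwise Pearson correlation of every numeric column with every other;
# the diagonal is always 1.0 (each column correlates perfectly with itself)
corr = df_june_important.corr()
print(corr.round(2))
```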
A visualization gives an even better understanding of the correlation above, so for further confirmation a heat map is plotted using the code below.
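A sketch of the heat map with seaborn, feeding it the corr() matrix from a toy frame (values invented; annot=True writes the coefficients inside the cells):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; not needed inside Deepnote
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

df_june_important = pd.DataFrame({
    "% Iron Concentrate": [66.9, 66.8, 67.0, 66.5],
    "% Silica Concentrate": [1.2, 1.3, 1.1, 1.5],
    "Ore Pulp pH": [9.8, 9.9, 9.7, 10.0],
    "Flotation Column 05 Level": [450.0, 452.0, 449.0, 455.0],
})
corr = df_june_important.corr()

# Color-code the correlation matrix: red = positive, blue = negative
fig, ax = plt.subplots(figsize=(6, 5))
sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1, ax=ax)
fig.savefig("corr_heatmap.png", bbox_inches="tight")
```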
The manager then wanted to see how % Iron Concentrate, % Silica Concentrate, Ore Pulp pH, & Flotation Column 05 Level vary throughout the whole day (1st June alone) at different hours, as he expected something unusual on that particular day. A line graph is perfect when it comes to time series, so I plotted line graphs for all the above parameters at once on 1st June using a for loop in Python. Python is a great tool when it comes to visualizing data.
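The for-loop idea can be sketched like this: one line chart per column, date on the x-axis. The 24 hourly values below are invented stand-ins for the real 1st June readings:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; not needed inside Deepnote
import matplotlib.pyplot as plt
import pandas as pd

# Toy 1st-June frame: one invented reading per hour
df_june_important = pd.DataFrame({
    "date": pd.date_range("2017-06-01", periods=24, freq="h"),
    "% Iron Concentrate": [66.0 + 0.05 * i for i in range(24)],
    "% Silica Concentrate": [1.0 + 0.02 * i for i in range(24)],
    "Ore Pulp pH": [9.5 + 0.01 * i for i in range(24)],
    "Flotation Column 05 Level": [450.0 + i for i in range(24)],
})

cols = ["% Iron Concentrate", "% Silica Concentrate",
        "Ore Pulp pH", "Flotation Column 05 Level"]

# One time-series line graph per parameter
axes = []
for col in cols:
    fig, ax = plt.subplots(figsize=(8, 3))
    ax.plot(df_june_important["date"], df_june_important[col])
    ax.set_title(f"{col} on 1st June")
    fig.savefig(f"{col.replace('%', 'pct').replace(' ', '_')}.png",
                bbox_inches="tight")
    plt.close(fig)  # free the figure after saving
    axes.append(ax)
```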
My notebook can be accessed here.
Conclusion/Recommendations:
Call to Action:
My analysis of this dataset is business driven. Any questions or suggestions about the analysis? Want to discuss more about data analytics? Feel free to reach out to me on LinkedIn, write an email to [email protected], or catch me at MyData Portfolio.
Thank you for reading this article. You have a good day now!
I would like to extend my sincere thanks to Avery Smith for helping me on my data journey and guiding me in the right direction on this project.