Processing Plant Data with Python
Jon Ekroth
Data Analyst · Excel · Tableau · SQL · Data Visualization · Problem Solving · Troubleshooting
In this project I play the role of a data analyst recently "hired" by a manufacturing/engineering/science company. More specifically, I've been hired by a mining company called Metals R' Us and given data from their froth flotation processing plant. The main goal of this analysis is to investigate a possible issue that occurred on June 1, 2017; the plant manager wants to know whether there is a problem that needs to be addressed. First, let's get an idea of what the froth flotation process encompasses.
The Froth Flotation Process (Wiki Explanation)
The froth flotation process is widely used in mineral processing. It separates the desired material, or concentrate, from unwanted material by bubbling air or nitrogen through large water-filled tanks, floating the concentrate to the surface, as seen in Diagram 1. The pulp is the mixture of water and ore that is brought in to be processed. The process is important because it makes it possible to extract desired metals from large amounts of lower-grade material. It is also used in wastewater treatment plants, where water is separated from solids or oils.
The Data
The data used for this analysis is real: it was taken from Kaggle, where it is used to predict quality in the froth flotation process. The dataset covers March 2017 through September 2017. Sampling rates are uneven across columns, as some readings are taken every 20 seconds and others every hour. There are 24 columns and 737,453 rows in this dataset.
Three libraries need to be loaded into Python: Pandas, Seaborn, and Matplotlib. Pandas is used for data upload and manipulation, while Seaborn and Matplotlib are used for data visualization.
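For reference, the imports look like this:

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
```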
To get an idea of what the dataset contains, df.head() and df.shape are used to preview the first five rows of data and the number of rows and columns, respectively.
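A minimal sketch of that first look (the filename is taken from the read_csv call shown later in the article):

```python
df = pd.read_csv('MiningProcess_Flotation_Plant_Database.csv')

print(df.head())   # preview the first five rows
print(df.shape)    # (rows, columns); the article reports (737453, 24)
```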
The dates in this dataset are stored as text, so they had to be converted to a datetime column before they could be aggregated. The Python code used for this task is df['date'] = pd.to_datetime(df['date']). The data dictionary shown below describes the columns used in the dataset.
Another issue with the dataset is that commas were used as the decimal separator in the numeric data. To make the numbers parse consistently, I told Pandas to treat those commas as decimal points when reading the file. The code I used to address this is df = pd.read_csv('MiningProcess_Flotation_Plant_Database.csv', decimal=",").
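Putting the loading and cleanup steps together, based on the snippets above:

```python
# Read the CSV, telling Pandas that ',' is the decimal separator
df = pd.read_csv('MiningProcess_Flotation_Plant_Database.csv', decimal=',')

# Convert the text dates into a proper datetime column for aggregation
df['date'] = pd.to_datetime(df['date'])
```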
Getting the Results
To get a handle on some statistics of the data, I used df.describe() to find the mean, max, min, and other information about the different columns.
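For example:

```python
# Summary statistics (count, mean, std, min, quartiles, max) per numeric column
df.describe()
```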
I'm going to filter my data to the month of June by creating a new dataframe called df_june; this is done to speed up any searches that need to be made. I only want to concentrate on a few columns, so I will create a new variable called important_cols. I then created a new dataframe, df_june_important, and set it equal to the older dataframe (df_june) restricted to the columns in important_cols, as sketched below.
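A sketch of that filtering step. The exact column names ('% Iron Concentrate', '% Silica Concentrate', 'Ore Pulp pH', 'Flotation Column 05 Level') and the date-range comparison are assumptions based on the columns named later in the article:

```python
# Keep only June 2017 rows
df_june = df[(df['date'] >= '2017-06-01') & (df['date'] < '2017-07-01')]

# The handful of columns this analysis concentrates on
important_cols = ['date', '% Iron Concentrate', '% Silica Concentrate',
                  'Ore Pulp pH', 'Flotation Column 05 Level']

df_june_important = df_june[important_cols]
```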
The result of the code is shown below for June 1st.
Next, I called on the Seaborn library to simultaneously compare % Iron Concentrate, % Silica Concentrate, Ore Pulp pH, and Flotation Column 05 Level. There were some suspicions that Column 05 might be having issues.
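A pairplot is one way Seaborn can compare several columns simultaneously; this is a sketch, since the article does not show which Seaborn function was used:

```python
# Pairwise scatter plots (with distributions on the diagonal)
# for the four columns under suspicion
sns.pairplot(df_june_important[['% Iron Concentrate', '% Silica Concentrate',
                                'Ore Pulp pH', 'Flotation Column 05 Level']])
plt.show()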
There doesn't appear to be any correlation between these columns. Even so, this is still valuable information to keep on hand for future reference. To be sure I am reading the above data correctly, I can run the .corr() method on the df_june_important dataframe; its output makes it easy to see that the correlation values are very low.
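For example:

```python
# Correlation matrix of the four measures (drop the date column first)
df_june_important.drop(columns='date').corr()
```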
It can also be useful to view the same information in a line chart. Seaborn will be used again to create this graph. Separate plots had to be used because the units of measure are too different to share a single axis.
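One way to draw those line charts, as a sketch: one subplot per measure so each series keeps its own y-axis scale, using the same assumed column names as above.

```python
cols = ['% Iron Concentrate', '% Silica Concentrate',
        'Ore Pulp pH', 'Flotation Column 05 Level']

# One subplot per column, sharing the time axis
fig, axes = plt.subplots(len(cols), 1, figsize=(10, 12), sharex=True)
for ax, col in zip(axes, cols):
    sns.lineplot(data=df_june_important, x='date', y=col, ax=ax)
    ax.set_title(col)
plt.tight_layout()
plt.show()
```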
Conclusion
There does not appear to be anything troubling happening on June 1, as all readings are running within normal ranges. The one outlier that needs to be researched further is the large drop in Flotation Column 05 Level, shown above, which fell to a reading of 167.36. Not all data analysis is going to produce obvious results, but even when it doesn't, the data can be useful for future comparisons. Python and its libraries can produce charts fairly easily to show the information you are trying to analyze and can help keep businesses running smoothly.
Thank you for taking the time to read my analysis. Feel free to reach out if you have any questions or would like to talk analytics.