Engineering Success with Python: A Data Analysis Journey in Manufacturing

I have been assigned by a manufacturing, engineering, and science company to analyze data from their flotation plant. To extract and analyze the data from the CSV file, I chose to use the Python programming language within Deepnote. This project presents an exciting opportunity to explore and uncover insights that can improve operational efficiency and drive profitability for the company.

The Manufacturing Data Set

The data set is real, covering March 2017 to September 2017, and can be found on Kaggle. The primary objective is to use the available data to forecast the level of impurities in the ore concentrate. Because the silica (impurity) level is measured hourly, accurate predictions can help engineers take swift, proactive action. This predictive approach allows engineers to take preemptive measures against potential contamination, which reduces impurities in the ore concentrate, lowers the amount of ore sent to tailings (as a result of reduced silica), and goes a long way toward preserving the environment.

The sampling intervals for the columns vary, with some being taken every 20 seconds and others hourly. Among the columns, the second and third represent quality measures of the iron ore pulp just prior to entering the flotation plant. Columns 4 through 8 are the most influential variables impacting the final ore quality. Columns 9 through 22 display process data such as column level and airflow, which also have an impact on the ore quality. Finally, the last two columns represent the lab's final measurement of the iron ore pulp quality.

For the data that will be analyzed, I have included a visual of the data dictionary.

[Image: Data Dictionary and Explanation of the Manufacturing and Engineering Data Fields]

The Mining Process

The engineering and science company employs an efficient mining technique to extract iron from ore clumps dug out of the ground. These clumps are surrounded by impurities like dirt, sand, and silica, so to obtain cleaner iron, they send the clumps through a flotation plant.

Here, ore pulp, starch, and amina are combined, which causes the dirt to strip away from the iron. Air bubbles are blasted through the liquid mixture to encourage the metal to move to the top, while the unwanted minerals remain at the bottom.

The flow of liquid through the flotation columns determines how quickly the pulp moves, while the froth level reflects how high the bubbles rise in each column. The key variable to consider is the "% Iron Concentrate," which represents the iron's purity. With this method, they can effectively extract iron for commercial purposes.

The Data Analysis

For the successful completion of this project, I utilized Deepnote and imported essential Python libraries including Pandas, Seaborn, and Matplotlib through the following code snippets:

# Data Manipulation
!pip install pandas
import pandas as pd

# Data Visualization
!pip install seaborn
import seaborn as sns

# Data Visualization
!pip install matplotlib
import matplotlib.pyplot as plt

To find out how many rows and columns are in the CSV file, I used df.shape; the file has 737,453 rows and 24 columns.

[Image: df.shape code snippet from the pandas library to determine file size]
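
For reference, here is a minimal sketch of that step; the CSV filename is an assumption for illustration, since the original notebook is only shown as a screenshot:

import pandas as pd

# Load the flotation plant data (filename assumed for illustration)
df = pd.read_csv('MiningProcess_Flotation_Plant_Database.csv')

# df.shape returns a (rows, columns) tuple
rows, cols = df.shape
print(f'The file has {rows:,} rows and {cols} columns')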

I utilized the DataFrame.head method from the Pandas library to check the accuracy of the information displayed in the columns. The preview showed that commas had been placed where decimal points belonged.

[Image: df.head code snippet to view data in the fields]
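
A quick sketch of that preview step (the column name shown is taken from the data dictionary above):

# Preview the first five rows to check how the values were parsed
print(df.head())

# Inspect a single column to spot the comma-as-decimal issue
print(df['% Silica Concentrate'].head())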

To clean the data by replacing the commas with decimal points, I used the following command:

[Image: df code snippet to replace commas with decimal points]
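
The exact command is only visible in the screenshot, so here is a hedged sketch of one common approach: re-reading the CSV and telling pandas to treat the comma as the decimal separator (an alternative would be to str.replace the commas and cast each affected column to float):

# Re-read the CSV with commas treated as decimal separators
# (filename assumed for illustration)
df = pd.read_csv('MiningProcess_Flotation_Plant_Database.csv', decimal=',')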

To spot-check a sample of the cleaned data, I used df.iloc to view rows 100-104:

[Image: df.iloc output showing a sample of rows after replacing commas with decimal points]
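
A minimal sketch of that selection; note that iloc's end position is exclusive, so 100:105 returns rows 100 through 104:

# Select rows 100-104 by position
sample = df.iloc[100:105]
print(sample)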

After reviewing the date field format, I wanted to confirm whether it had been read as text or as a date value. To check, I used the following code snippet:

print(type(df))
print(type(df['date']))
print(type(df['date'][0]))

I discovered that the date field had been read as a string of text, so I redefined it as a datetime field.
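
A sketch of that conversion, assuming the standard pandas approach:

# Convert the 'date' column from strings to datetime values
df['date'] = pd.to_datetime(df['date'])

# Confirm the new type: a pandas Timestamp rather than str
print(type(df['date'][0]))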


Descriptive Analytics

I have been asked to provide summary statistics for each of the columns, including the average, median, minimum, and maximum. To view that data, I used the df.describe() code snippet. This allowed me to quickly see the summary statistics for every column, including any calculated fields I had added to the dataframe.
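
A minimal sketch of that step:

# Summary statistics for every numeric column: count, mean, std, min,
# quartiles (the 50% row is the median), and max
print(df.describe())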


Masking

While investigating a strange occurrence on June 1, 2017, the team identified key variables to consider: % Iron Concentrate, % Silica Concentrate, Ore Pulp pH, Flotation Column 05 Level, and date. To streamline the analysis, I needed to filter out unnecessary data and focus solely on the relevant columns and on rows between midnight on May 31, 2017 and June 2, 2017.

To do this, I first determined the date range of our dataset by finding the earliest and latest dates using Python code:

max_date = df['date'].max()
print('The max date is ' + str(max_date))

min_date = df['date'].min()
print('The min date is ' + str(min_date))

Next, I filtered the rows that fall within the specified date range and created a new dataframe, df_june, using a boolean mask:

df_june = df[(df['date'] > "2017-05-31 23:59:59") & (df['date'] < "2017-06-02")].reset_index(drop=True)

But even after filtering rows, I still had all the columns. To focus only on the important ones, I created a list of the relevant columns and built a new dataframe, df_june_important, by selecting only those columns from df_june:

important_cols = ['% Iron Concentrate', '% Silica Concentrate', 'Ore Pulp pH', 'Flotation Column 05 Level', 'date']
df_june_important = df_june[important_cols]

When I viewed this final dataframe, I had a clear and concise snapshot of the important data points from the specified time frame.


Pair Plot and Correlation

The supervisor was interested in understanding the interrelationships among the variables, so I set out to visualize these connections with scatterplots. However, with this many variables, it would take six separate plots to capture every pairwise relationship. Fortunately, the Seaborn data visualization library let me achieve this with minimal effort in a single line of code: the pair plot. By importing Seaborn as sns and supplying the dataframe as an argument, I was able to promptly execute the pair plot.
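
A hedged sketch of that call, assuming the filtered dataframe built in the masking step:

import seaborn as sns
import matplotlib.pyplot as plt

# One call produces a scatterplot for every pair of numeric columns
sns.pairplot(df_june_important)
plt.show()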


After creating the pair plot, I did not see much correlation among the variables. Usually, I would expect to see some sort of shape in these plots, but nothing stood out.

To further confirm any correlation, I created a correlation matrix that showed all the correlation values to be low.
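
A sketch of how that matrix might be produced; the heatmap is an optional extra, not necessarily part of the original analysis:

# Correlation matrix of the selected variables (numeric columns only)
corr = df_june_important.corr(numeric_only=True)
print(corr)

# Optional: visualize the matrix as an annotated heatmap
sns.heatmap(corr, annot=True, cmap='coolwarm')
plt.show()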


Line Charts

The supervisor wanted clarity and requested data to aid their comprehension. Specifically, they were interested in observing how % Iron Concentrate fluctuated across the day. To fulfill this request, I created line charts using the Seaborn library.
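
A minimal sketch of that chart, assuming the filtered June dataframe:

# % Iron Concentrate over the filtered time window
sns.lineplot(x='date', y='% Iron Concentrate', data=df_june)
plt.xticks(rotation=45)
plt.show()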


The graph proved to be a valuable tool, prompting interest in exploring additional variables over the same time period. However, combining these variables on one graph is not feasible because the units of measurement are vastly different. For instance, iron and silica are measured in percentages ranging from 0 to 100, whereas pH sits on a much lower scale and the flotation column level on a much higher one. Therefore, I proposed creating separate graphs to display these variables effectively using a for loop. With this in mind, I utilized sns.lineplot with x='date', y=i, data=df_june, and also incorporated matplotlib.pyplot to produce an individual graph for each of the important_cols.
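
A hedged sketch of that loop, assuming the important_cols list defined earlier (the date column is skipped because it serves as the x-axis):

# One chart per variable, since the units differ too much to share an axis
for i in important_cols:
    if i == 'date':
        continue
    sns.lineplot(x='date', y=i, data=df_june)
    plt.title(i)
    plt.xticks(rotation=45)
    plt.show()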


The Insights Uncovered

  • The pair plot showed that the variables are not closely related, and the correlation matrix further confirmed it. In other words, changes in one variable are not strongly associated with changes in the others, which tells us that the process being studied is complex and cannot be explained by any single variable.
  • There is a weak positive correlation among % Silica Concentrate, Ore Pulp pH, and Flotation Column 05 Level, while each of these has a negative relationship with % Iron Concentrate. This indicates that % Iron Concentrate is impacted by changes in the other variables, such as % Silica Concentrate, Ore Pulp pH, and Flotation Column 05 Level.

Thank you for reading my Python manufacturing article. If you have any additional insights, or ideas on how I could have further analyzed the data using Python, please let me know, and let's connect!
