Engineering Success with Python: A Data Analysis Journey in Manufacturing

I have been assigned by a manufacturing, engineering, and science company to analyze data from their flotation plant. To extract and analyze the data from the CSV file, I chose to use the Python programming language within Deepnote. This project presents an exciting opportunity to explore and uncover insights that can improve operational efficiency and drive profitability for the company.

The Manufacturing Data Set

The data set is real, covering March 2017 to September 2017, and can be found on Kaggle. The primary objective is to use the available data to forecast the level of impurities in the ore concentrate. Because the silica (impurity) level is measured hourly, accurate predictions can help engineers take swift, proactive action. This predictive approach allows engineers to take preemptive measures against potential contamination, which reduces impurities in the ore concentrate, lowers the amount of ore sent to tailings (as a result of reduced silica), and goes a long way toward preserving the environment.

The sampling intervals for the columns vary, with some being taken every 20 seconds and others hourly. Among the columns, the second and third represent quality measures of the iron ore pulp just prior to entering the flotation plant. Columns 4 through 8 are the most influential variables impacting the final ore quality. Columns 9 through 22 display process data such as column level and airflow, which also have an impact on the ore quality. Finally, the last two columns represent the lab's final measurement of the iron ore pulp quality.

For the data that will be analyzed, I have included a visual of the data dictionary.

[Image: Data Dictionary and Explanation of the Manufacturing and Engineering Data Fields]

The Mining Process

The engineering and science company employs an efficient mining technique to extract iron from ore clumps dug out of the ground. These clumps are surrounded by impurities like dirt, sand, and silica, so to obtain cleaner iron, they send the clumps through a flotation plant.

Here, ore pulp, starch, and amina are combined, which causes the dirt to strip away from the iron. Air bubbles are blasted through the liquid mixture to encourage the metal to move to the top, while the unwanted minerals remain at the bottom.

The flow of liquid through the flotation columns determines how quickly the pulp moves, while the froth level reflects how high the bubbles rise in each column. The key variable to consider is the "% Iron Concentrate," which represents the iron's purity. With this method, they can effectively extract iron for commercial purposes.

The Data Analysis

For the successful completion of this project, I utilized Deepnote and imported essential Python libraries including Pandas, Seaborn, and Matplotlib through the following code snippets:

# Data Manipulation
!pip install pandas
import pandas as pd

# Data Visualization
!pip install seaborn
import seaborn as sns

# Data Visualization
!pip install matplotlib
import matplotlib.pyplot as plt

To find out how many rows and columns are in the CSV file, I used df.shape; the file has 737,453 rows and 24 columns.

[Image: df.shape code snippet from the pandas library to determine file size]
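
For reference, here is a minimal sketch of that step; the CSV filename is an assumption for illustration, since the original notebook is only shown as a screenshot:

import pandas as pd

# Load the flotation plant data (filename assumed for illustration)
df = pd.read_csv('MiningProcess_Flotation_Plant_Database.csv')

# df.shape returns a (rows, columns) tuple
rows, cols = df.shape
print(f'The file has {rows:,} rows and {cols} columns')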

I utilized the DataFrame.head method from the Pandas library to check the accuracy of the information displayed in the columns. The preview showed that commas had been placed where decimal points belonged.

[Image: df.head code snippet to view data in the fields]
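
A quick sketch of that preview step (the column name shown is taken from the data dictionary above):

# Preview the first five rows to check how the values were parsed
print(df.head())

# Inspect a single column to spot the comma-as-decimal issue
print(df['% Silica Concentrate'].head())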

To clean the data by replacing the commas with decimal points, I used the following command:

[Image: df code snippet to replace commas with decimal points]
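
The exact command is only visible in the screenshot, so here is a hedged sketch of one common approach: re-reading the CSV and telling pandas to treat the comma as the decimal separator (an alternative would be to str.replace the commas and cast each affected column to float):

# Re-read the CSV with commas treated as decimal separators
# (filename assumed for illustration)
df = pd.read_csv('MiningProcess_Flotation_Plant_Database.csv', decimal=',')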

To spot-check a sample of the cleaned data, I used df.iloc to view rows 100-104:

[Image: df.iloc output showing a sample of rows after replacing commas with decimal points]
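
A minimal sketch of that selection; note that iloc's end position is exclusive, so 100:105 returns rows 100 through 104:

# Select rows 100-104 by position
sample = df.iloc[100:105]
print(sample)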

After reviewing the date field format, I wanted to confirm whether it had been read as text or as a date value. To check, I used the following code snippet:

print(type(df))
print(type(df['date']))
print(type(df['date'][0]))

I discovered that the date field had been read as a string of text, so I redefined it as a datetime field.
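
A sketch of that conversion, assuming the standard pandas approach:

# Convert the 'date' column from strings to datetime values
df['date'] = pd.to_datetime(df['date'])

# Confirm the new type: a pandas Timestamp rather than str
print(type(df['date'][0]))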


Descriptive Analytics

I have been asked to provide summary statistics for each of the columns, including the average, median, minimum, and maximum. To view that data, I used the df.describe() code snippet. This allowed me to quickly see the summary statistics for every column, including any calculated fields I had added to the dataframe.
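
A minimal sketch of that step:

# Summary statistics for every numeric column: count, mean, std, min,
# quartiles (the 50% row is the median), and max
print(df.describe())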


Masking

While investigating a strange occurrence on June 1, 2017, the team identified key variables to consider: % Iron Concentrate, % Silica Concentrate, Ore Pulp pH, Flotation Column 05 Level, and date. To streamline the analysis, I needed to filter out unnecessary data and focus solely on the relevant columns and on rows between midnight on May 31, 2017 and June 2, 2017.

To do this, I first determined the date range of our dataset by finding the earliest and latest dates using Python code:

max_date = df['date'].max()
print('The max date is ' + str(max_date))

min_date = df['date'].min()
print('The min date is ' + str(min_date))

Next, I filtered the rows that fall within the specified date range and created a new dataframe, df_june, using a boolean mask:

df_june = df[(df['date'] > "2017-05-31 23:59:59") & (df['date'] < "2017-06-02")].reset_index(drop=True)

But even after filtering rows, I still had all the columns. To focus only on the important ones, I created a list of the relevant columns and built a new dataframe, df_june_important, by selecting only those columns from df_june:

important_cols = ['% Iron Concentrate', '% Silica Concentrate', 'Ore Pulp pH', 'Flotation Column 05 Level', 'date']
df_june_important = df_june[important_cols]

When I viewed this final dataframe, I had a clear and concise snapshot of the important data points from the specified time frame.


Pair Plot and Correlation

The supervisor was interested in understanding the interrelationships among the variables, so I set out to visualize these connections with scatterplots. However, with this many variables, it would take six separate plots to capture every pairwise relationship. Fortunately, the Seaborn data visualization library let me achieve this with minimal effort in a single line of code: the pair plot. By importing Seaborn as sns and supplying the dataframe as an argument, I was able to promptly execute the pair plot.
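
A hedged sketch of that call, assuming the filtered dataframe built in the masking step:

import seaborn as sns
import matplotlib.pyplot as plt

# One call produces a scatterplot for every pair of numeric columns
sns.pairplot(df_june_important)
plt.show()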


After creating the pair plot, I did not see much correlation among the variables. Usually, I would expect to see some sort of shape in these plots, but nothing stood out.

To further confirm any correlation, I created a correlation matrix that showed all the correlation values to be low.
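
A sketch of how that matrix might be produced; the heatmap is an optional extra, not necessarily part of the original analysis:

# Correlation matrix of the selected variables (numeric columns only)
corr = df_june_important.corr(numeric_only=True)
print(corr)

# Optional: visualize the matrix as an annotated heatmap
sns.heatmap(corr, annot=True, cmap='coolwarm')
plt.show()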


Line Charts

The supervisor wanted clarity and requested data to aid their comprehension. Specifically, they were interested in observing how % Iron Concentrate fluctuated across the day. To fulfill this request, I created line charts using the Seaborn library.
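
A minimal sketch of that chart, assuming the filtered June dataframe:

# % Iron Concentrate over the filtered time window
sns.lineplot(x='date', y='% Iron Concentrate', data=df_june)
plt.xticks(rotation=45)
plt.show()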


The graph proved to be a valuable tool, prompting interest in exploring additional variables over the same time period. However, combining these variables on one graph is not feasible because the units of measurement are vastly different. For instance, iron and silica are measured in percentages ranging from 0 to 100, whereas pH sits on a much lower scale and the flotation column level on a much higher one. Therefore, I proposed creating separate graphs to display these variables effectively using a for loop. With this in mind, I utilized sns.lineplot with x='date', y=i, data=df_june, and also incorporated matplotlib.pyplot to produce an individual graph for each of the important_cols.
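
A hedged sketch of that loop, assuming the important_cols list defined earlier (the date column is skipped because it serves as the x-axis):

# One chart per variable, since the units differ too much to share an axis
for i in important_cols:
    if i == 'date':
        continue
    sns.lineplot(x='date', y=i, data=df_june)
    plt.title(i)
    plt.xticks(rotation=45)
    plt.show()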


The Insights Uncovered

  • The pair plot showed that the variables are not closely related, and the correlation matrix further confirmed it. In other words, changes in one variable are not strongly associated with changes in the others, which tells us that the process being studied is complex and cannot be explained by any single variable.
  • There is a weak positive correlation among % Silica Concentrate, Ore Pulp pH, and Flotation Column 05 Level, while each of these has a negative relationship with % Iron Concentrate. This indicates that % Iron Concentrate is impacted by changes in the other variables, such as % Silica Concentrate, Ore Pulp pH, and Flotation Column 05 Level.

Thank you for reading my Python manufacturing article. If you have any additional insights, or ideas on how I could have further analyzed the data using Python, please let me know, and let's connect!
