Data Science Project Steps
It is important to have a roadmap to follow when you are working on a Data Science project; otherwise, the project will take time you cannot afford to waste when you are under pressure. The first step in solving any problem is to break it into smaller problems. So, what is the Data Science project roadmap?
Problem Statement
This is where it all begins. In this step you must define the problem to solve, whatever it is. For example: what is the correlation between sales and other variables?
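To make the example question concrete, here is a minimal sketch of how a correlation between sales and other variables could be computed. It uses the Pearson coefficient, which is one common choice and an assumption here, and the figures and the names `sales`, `ad_spend`, and `discount_pct` are invented for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical monthly figures, invented for illustration only.
sales = [120, 135, 150, 160, 180, 210]
ad_spend = [10, 12, 15, 15, 19, 24]
discount_pct = [5, 5, 0, 10, 0, 5]

for name, values in {"ad_spend": ad_spend, "discount_pct": discount_pct}.items():
    print(f"sales vs {name}: r = {pearson(sales, values):+.2f}")
```

A coefficient near +1 or -1 suggests a strong linear relationship, while a value near 0 suggests little linear relationship. It does not say anything about causation.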
Understanding the Business
Now you have to understand how you are going to solve this problem: whether it can be solved by Data Analysis alone or whether it needs a Machine Learning model. You have to translate the problem into the Data Science 'language'.
Data Collection
After that, you must start looking for the data you need to solve the problem. You can use the company's database, public databases, or even web scraping to extract data from websites.
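As a rough illustration of pulling data from a database, the sketch below queries an in-memory SQLite database that stands in for a company database. The `sales` table, its columns, and the helper `load_monthly_totals` are all hypothetical. A real project would connect to whatever system the company actually uses.

```python
import sqlite3

# In-memory database standing in for a company database (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (month TEXT, region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("2024-01", "north", 120.0),
        ("2024-01", "south", 95.0),
        ("2024-02", "north", 135.0),
        ("2024-02", "south", 110.0),
    ],
)
conn.commit()


def load_monthly_totals(connection):
    """Return a list of (month, total_amount) rows ordered by month."""
    query = "SELECT month, SUM(amount) FROM sales GROUP BY month ORDER BY month"
    return connection.execute(query).fetchall()


rows = load_monthly_totals(conn)
print(rows)
```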
Data Cleaning
It is often said that about 80% of the time in a Data Science project is spent cleaning the data, which shows how important this step is. Missing values can distort the information you get from the analysis or from the model you are building. Here, you will treat missing values, anomalies, and outliers.
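Two of these treatments can be sketched in a few lines. The example below fills missing values with the median and flags outliers with the interquartile range rule. Both are common choices and assumptions here, not the only options, and the `raw` values are invented for illustration.

```python
import statistics


def impute_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute a column with no observed values")
    median = statistics.median(observed)
    return [median if v is None else v for v in values]


def iqr_outlier_flags(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v < low or v > high for v in values]


# Hypothetical column with two missing values and one suspicious entry.
raw = [120, 135, None, 160, 180, 5000, 150, None, 145]

filled = impute_median(raw)
flags = iqr_outlier_flags(filled)
cleaned = [v for v, is_outlier in zip(filled, flags) if not is_outlier]
print(cleaned)
```

Dropping flagged values is only one way to handle them. Depending on the problem, it may be better to cap them or to investigate whether they are genuine observations.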
Exploratory Data Analysis
This is one of the most important steps because the problem can sometimes be solved here. It is possible to extract useful and valuable information and gain insights. Every data scientist should take the time to do a thorough data analysis. Here, you will generate plots, create tables, test your hypotheses, and do whatever else can answer your questions.
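As one small example of the tables this step produces, the sketch below builds a summary per group. The `records` data and the helper name `summarize_by_group` are hypothetical, and a real analysis would usually cover many more variables.

```python
import statistics
from collections import defaultdict

# Hypothetical records invented for illustration: (region, weekly sales).
records = [
    ("north", 120), ("north", 135), ("north", 150),
    ("south", 95), ("south", 110), ("south", 100),
]


def summarize_by_group(rows):
    """Group values by key and compute basic descriptive statistics."""
    groups = defaultdict(list)
    for key, value in rows:
        groups[key].append(value)
    return {
        key: {
            "count": len(vals),
            "mean": statistics.fmean(vals),
            "median": statistics.median(vals),
            # The sample standard deviation needs at least two values.
            "stdev": statistics.stdev(vals) if len(vals) > 1 else 0.0,
        }
        for key, vals in groups.items()
    }


summary = summarize_by_group(records)
for region, stats in sorted(summary.items()):
    print(
        f"{region:<6} n={stats['count']} mean={stats['mean']:.1f} "
        f"median={stats['median']} stdev={stats['stdev']:.1f}"
    )
```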
Data Preparation
If the problem could not be solved in the previous step, it is time to prepare the data for building the Machine Learning model. Here, you will preprocess the data and, when necessary, apply techniques such as Normalization or Scaling.
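The two techniques named above can be sketched as follows, assuming that Normalization means min-max rescaling to the range [0, 1] and that Scaling means standardization to zero mean and unit variance. The terms are used loosely in practice, so treat these definitions as one reading. The function names are illustrative.

```python
import statistics


def min_max_scale(values):
    """Rescale values linearly so the smallest maps to 0 and the largest to 1."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]  # a constant column carries no spread
    return [(v - low) / (high - low) for v in values]


def standardize(values):
    """Shift and scale values to zero mean and unit (population) variance."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


ages = [18, 25, 40, 60]  # hypothetical feature column
print(min_max_scale(ages))
print(standardize(ages))
```

When the data is split into training and test sets, the minimum, maximum, mean, and standard deviation should be computed on the training set only and then reused on the test set, so that no information leaks from the test data.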
Machine Learning Algorithm
Here, you will compare Machine Learning algorithms and choose the best one for your problem. You will also train the model and perform hyperparameter tuning.
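Hyperparameter tuning can be illustrated with a small grid search. The sketch below assumes a k-nearest-neighbours classifier written from scratch, synthetic two-class data, and a single hold-out validation split. All of these are choices made for the example, and the same loop could compare different algorithms instead of different values of `k`.

```python
import math
import random
from collections import Counter


def knn_predict(train, point, k):
    """Predict the majority label among the k nearest training points."""
    nearest = sorted(train, key=lambda row: math.dist(row[0], point))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def accuracy(train, validation, k):
    """Fraction of validation points classified correctly."""
    hits = sum(knn_predict(train, x, k) == y for x, y in validation)
    return hits / len(validation)


def make_toy_data(n_per_class, rng):
    """Two Gaussian blobs; synthetic data invented for this sketch."""
    data = []
    for label, center in ((0, (0.0, 0.0)), (1, (3.0, 3.0))):
        for _ in range(n_per_class):
            point = (rng.gauss(center[0], 1.0), rng.gauss(center[1], 1.0))
            data.append((point, label))
    return data


rng = random.Random(0)  # fixed seed so the run is repeatable
data = make_toy_data(50, rng)
rng.shuffle(data)
train, validation = data[:70], data[70:]

# Grid search: score every candidate k on the validation set, keep the best.
scores = {k: accuracy(train, validation, k) for k in (1, 3, 5, 7, 9)}
best_k = max(scores, key=scores.get)
print(scores, "best k:", best_k)
```

With a single validation split, the chosen value can depend on how the data happened to be divided. Cross-validation is a common way to reduce that dependence.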
Evaluating the Model
Here, you will choose the right metrics to evaluate the model's performance. You will often use several metrics, because a single one can hide important weaknesses.
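A minimal sketch of why several metrics are useful, assuming a binary classification problem: on imbalanced data, a model that always predicts the majority class can reach high accuracy while finding none of the positive cases. The labels below are invented, and the function names are illustrative.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (true positives, false positives, false negatives, true negatives)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = len(pairs) - tp - fp - fn
    return tp, fp, fn, tn


def classification_report(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall, and F1 for one positive class."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred, positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical imbalanced labels: nine negatives, one positive.
y_true = [0] * 9 + [1]
always_negative = [0] * 10  # a "model" that never predicts the positive class

report = classification_report(y_true, always_negative)
print(report)
```

Here the accuracy is 0.9 while the recall is 0.0, so accuracy alone would give a misleading picture of the model.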
Deploying the Model
After the model has been built and evaluated, it is time to deploy the model into production. There are many tools you can use to do it, such as Flask, FastAPI and MLflow.
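To show the general shape of a prediction service, the sketch below uses Python's built-in `http.server` module rather than any of the tools named above, so that it runs without extra dependencies. It is a stand-in and not a production setup. The `predict` rule, the `ad_spend` field, and the port are hypothetical.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def predict(features):
    """Stand-in for a trained model: a hypothetical linear rule."""
    return 50.0 + 8.0 * features["ad_spend"]


def handle_payload(body):
    """Turn a JSON request body into (status_code, response_dict)."""
    try:
        features = json.loads(body)
        return 200, {"prediction": predict(features)}
    except (ValueError, KeyError, TypeError) as exc:
        # Invalid JSON, a missing field, or a wrongly typed value.
        return 400, {"error": str(exc)}


class PredictHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        status, payload = handle_payload(self.rfile.read(length))
        data = json.dumps(payload).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)


def serve(port=8000):
    """Start a blocking local server; call this manually to try the sketch."""
    HTTPServer(("127.0.0.1", port), PredictHandler).serve_forever()
```

Keeping the request logic in `handle_payload`, separate from the HTTP handler, makes it possible to test the prediction path without starting a server. The same separation carries over when the handler is replaced by a Flask or FastAPI route.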
End Notes
In real-world situations, a Data Science project is not linear. You will have to repeat the steps many times until you reach the best model.