ML Systems for Business: A Step-by-Step Guide
Ivan Reznikov
PhD, Principal Data Scientist || TEDx/PyCon/GITEX Speaker || University Lecturer || LangChain, Large Language Models (LLMs) and Generative AI || 30K+ followers
Machine learning has rapidly transformed the business world in recent years, offering new opportunities for companies to improve efficiency, streamline operations, and gain a competitive edge. As a result, there has been a growing demand for organizations to develop their own custom machine learning systems tailored to their specific business needs. However, creating a machine learning system from scratch can be a complex and intimidating process, requiring a deep understanding of both the technical and business aspects of the project. In this article, we will explore the key considerations and steps involved in building a machine learning system from the ground up to help organizations leverage the full potential of this cutting-edge technology.
CRISP-DM and OSEMN
CRISP-DM and OSEMN frameworks are popular methodologies for organizing and conducting data science projects. They provide structured and standardized approaches for developing data science projects. Their aim is to help data science teams stay organized and focused, ensuring high-quality solutions are delivered on time to meet the business's needs.
The CRISP-DM framework is a six-step process for data science projects:
1. Business understanding
2. Data Understanding
3. Data Preparation
4. Modeling
5. Evaluation
6. Deployment
CRISP-DM focuses on ensuring that a project is well-structured and that the different stages of the project are completed systematically and iteratively.
The OSEMN framework, on the other hand, is a five-step process:
1. Obtaining
2. Scrubbing
3. Exploring
4. Modeling
5. Interpreting
OSEMN focuses on the technical aspects of data science, such as cleaning and preparing data, building models, and interpreting results.
One key difference between CRISP-DM and OSEMN is that CRISP-DM strongly emphasizes the business understanding and deployment stages, while OSEMN is more focused on the technical aspects of data science. CRISP-DM is more suited to large and complex data science projects, while OSEMN is better suited to smaller and more focused projects.
While CRISP-DM and OSEMN frameworks can provide a solid foundation for a general data science project, businesses must assess whether these frameworks suit their specific needs before adopting them. Often, these frameworks are not enough to meet the goals and requirements of the business.
Below is my vision, based on experience, of what a general data science framework should look like.
ML Project from Scratch
Developing a machine learning system from scratch for business needs requires a systematic approach: understanding the business needs, building a proof of concept, developing the model, testing and fine-tuning it, and finally deploying and supporting it.
Below is a step-by-step guide on how I attempt to build production ML systems.
Step 1. Building a Proof of Concept (PoC)
Step 1a. Understanding the Business Needs
Before starting with the development, it is vital to understand the business requirements and objectives of the machine learning system. This includes identifying the problem that needs to be solved, the data available, and the expected outcomes. It helps to ensure that the right questions are being addressed and that resources are not being wasted.
Step 1b. Understanding Data
Understanding data is a crucial aspect of any data science project. It is the foundation of any data-driven decision-making process and helps uncover meaningful insights and patterns. Data understanding helps identify suitable data sources and ensures that the data collected is relevant and accurate. Moreover, it also helps identify any potential biases or outliers that may affect the analysis results.
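To make this concrete, below is a minimal data-understanding sketch in Python. It assumes a hypothetical customer_orders.csv file with an order_value column; the file name, columns, and thresholds are illustrative, not part of any specific project.

```python
import pandas as pd

# Hypothetical raw extract; replace with your own data source
df = pd.read_csv("customer_orders.csv")

# Basic shape and schema overview
print(df.shape)
print(df.dtypes)

# Missing values and duplicates that may bias later analysis
print(df.isna().mean().sort_values(ascending=False))
print("duplicate rows:", df.duplicated().sum())

# Simple outlier screen on a numeric column using the IQR rule
q1, q3 = df["order_value"].quantile([0.25, 0.75])
iqr = q3 - q1
mask = (df["order_value"] < q1 - 1.5 * iqr) | (df["order_value"] > q3 + 1.5 * iqr)
print("potential outliers:", mask.sum())
```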
Step 1c. Exploratory Data Analysis (EDA)
Once the business questions are defined, initial data can be collected, roughly cleaned, and analyzed using various statistical and machine learning techniques to extract meaningful insights and patterns. These insights can be used to inform business decisions, optimize processes, and drive growth. It is also important to validate the findings by comparing them with historical data and industry benchmarks.
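Here is a minimal EDA sketch along those lines, again assuming the hypothetical customer_orders.csv extract with numeric columns and a binary churned target; adapt the column names to your own data.

```python
import pandas as pd

df = pd.read_csv("customer_orders.csv")  # same hypothetical extract as above

# Summary statistics for numeric features
print(df.describe())

# Correlation of numeric features with a binary target (assumed column "churned")
numeric = df.select_dtypes("number")
print(numeric.corr()["churned"].sort_values(ascending=False))

# How a key feature differs between the two target groups
print(df.groupby("churned")["order_value"].agg(["mean", "median", "std"]))
```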
Step 1d. Building a Proof of Concept (PoC)
PoC helps validate the machine learning solution's feasibility and business requirements on a small scale. This stage is crucial as it helps validate the idea and assess whether the solution is worth pursuing. Usually, experiments are run with a subset of the data, using different algorithms and sometimes fine-tuning their parameters. A successful PoC can be a basis for further development. Success is measured by an appropriate metric that relates directly to the business requirements. Keep in mind that a PoC is just a preliminary evaluation of the solution and may not reflect the final performance of a fully developed product.
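A PoC experiment can be as simple as comparing a few candidate algorithms on a data subset with cross-validation. The sketch below uses synthetic data and illustrative scikit-learn models; ROC-AUC stands in for whatever metric maps to your business requirement.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for a small subset of real data
X, y = make_classification(n_samples=2000, n_features=20, random_state=42)

candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "gradient_boosting": GradientBoostingClassifier(random_state=42),
}

# Compare candidates with 5-fold cross-validation on the chosen metric
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```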
Step 2. Building a Minimum Viable Product (MVP)
Building a Minimum Viable Product (MVP) is a crucial step in developing any data-driven solution. The MVP is a simplified version of the final product that includes only the essential features that solve the core problem. The objective of an MVP is to validate the hypothesis and gather feedback from stakeholders: users and customers, partners, and investors.
An MVP in data science is different from a PoC in that it is a market-ready product, while a PoC is a test of the feasibility of a solution. A PoC typically focuses on evaluating the technical aspects of the solution: accuracy of the algorithms, processing speed, etc. In contrast, an MVP focuses on solving a business problem and delivering value to the end users. While building the MVP, take into account all possible data and ML issues. This way, it can be easily scaled as the business grows, reducing the risk of technical debt and the need for significant rework in the future.
Step 2a. Data Preparation
The machine learning system's quality depends on the quality of the data. Teams such as sales, marketing, and customer support can provide valuable insights into customer behavior and preferences. Collecting data from these teams helps in understanding the target audience, their needs, and how the product or service can be improved to meet those needs. After cleaning the data, which might take more time than expected, comes the fun part. Feature engineering is where data scientists use their domain knowledge and creativity to extract meaningful information from the data. They manipulate and transform the data into relevant features, which can then be used for building models.
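As a small illustration of feature engineering, the sketch below aggregates a hypothetical orders table into per-customer RFM-style features; the table, columns, and snapshot date are assumptions made for the example.

```python
import pandas as pd

# Hypothetical transactional data
orders = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 2, 3],
    "order_date": pd.to_datetime([
        "2024-01-05", "2024-02-01", "2024-01-10",
        "2024-01-20", "2024-03-01", "2024-02-15",
    ]),
    "order_value": [120.0, 80.0, 40.0, 60.0, 55.0, 300.0],
})
snapshot = pd.Timestamp("2024-03-31")  # reference date for recency

# Aggregate raw transactions into per-customer features
features = orders.groupby("customer_id").agg(
    recency_days=("order_date", lambda d: (snapshot - d.max()).days),
    frequency=("order_date", "count"),
    monetary=("order_value", "sum"),
    avg_order_value=("order_value", "mean"),
).reset_index()
print(features)
```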
Step 2b. Model Development
In this step, various machine learning algorithms are tested and compared to determine the most suitable one for the business case. The selected algorithm is then further developed and optimized, with its performance verified on held-out test data.
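One common way to do this is a grid search over hyperparameters with cross-validation, followed by a check on held-out test data. The sketch below uses synthetic data and an illustrative random forest and parameter grid.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the prepared features and target
X, y = make_classification(n_samples=2000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Illustrative parameter grid for the shortlisted algorithm
param_grid = {"n_estimators": [100, 300], "max_depth": [None, 10, 20]}
search = GridSearchCV(
    RandomForestClassifier(random_state=42), param_grid, cv=5, scoring="roc_auc"
)
search.fit(X_train, y_train)

print("best params:", search.best_params_)
print("held-out test score:", search.score(X_test, y_test))
```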
Step 2c. Model Evaluation
Translating the desired business outcomes into specific metrics that can be measured within the machine learning model is crucial. This helps ensure that the model is aligned with the goals and objectives of the project and that it will ultimately deliver value to the business. Regardless of the metric chosen, it is necessary to establish a baseline measure of performance to track progress and judge the rate of return from increasing the complexity of the modeling solution.
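The sketch below shows one way to establish such a baseline: compare the candidate model against a naive DummyClassifier on the same held-out data. The data and models are illustrative placeholders.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Naive baseline vs. the candidate model on the same held-out data
baseline = DummyClassifier(strategy="stratified", random_state=42).fit(X_train, y_train)
model = RandomForestClassifier(random_state=42).fit(X_train, y_train)

print("baseline ROC-AUC:", roc_auc_score(y_test, baseline.predict_proba(X_test)[:, 1]))
print("model ROC-AUC:", roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))
```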
Step 2d. Model Deployment
The deployed model should be able to handle incoming data, make predictions, and sometimes even return results in real time. The deployment process involves integrating the model into the business, setting up the infrastructure for hosting the model, and testing the deployment. The pipelines are triggered, results are stored, and notifications and alerts are sent.
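As a minimal illustration, the sketch below wraps a pickled model in a FastAPI prediction endpoint. The model artifact, feature names, and endpoint are hypothetical; real deployments depend on your infrastructure (batch pipelines, streaming, managed services, etc.).

```python
import pickle

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

# Hypothetical model artifact saved during model development
with open("model.pkl", "rb") as f:
    model = pickle.load(f)


class Features(BaseModel):
    recency_days: float
    frequency: float
    monetary: float


@app.post("/predict")
def predict(payload: Features):
    # Order of features must match the order used at training time
    row = [[payload.recency_days, payload.frequency, payload.monetary]]
    return {"churn_probability": float(model.predict_proba(row)[0][1])}
```

If this lives in main.py, it can be served locally with uvicorn main:app; a production setup would add authentication, logging, and the alerting mentioned above.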
Step 3. Next Steps
The PoC was a success, and the MVP showed good traction. Now what? Let's concentrate on how we can improve the solution.
Step 3a. Monitoring results
Monitoring the performance of a machine learning model helps in determining its effectiveness. To make informed decisions, track and evaluate the results regularly. This is where monitoring tools such as Power BI, Tableau, or even Python visualization packages come into play, providing a clear and easy-to-understand visual representation of the data. This can have a significant impact on the success of the project and overall business outcomes.
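A lightweight starting point is to compute model quality over time from a prediction log and feed the result into your dashboarding tool. The sketch below assumes a hypothetical prediction_log.csv with timestamp, predicted, and actual columns.

```python
import pandas as pd

# Hypothetical prediction log with timestamp, predicted label, and ground truth
log = pd.read_csv("prediction_log.csv", parse_dates=["timestamp"])

# Weekly accuracy of the deployed model over time
weekly_accuracy = log.groupby(pd.Grouper(key="timestamp", freq="W")).apply(
    lambda g: (g["predicted"] == g["actual"]).mean()
)
print(weekly_accuracy)  # feed this into Power BI, Tableau, or a plotting library
```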
Step 3b. Solution performance
The model's performance can usually be improved by adding new data or features, changing the algorithm, fine-tuning the parameters, or through other ML-related techniques. The improved model is then retrained and evaluated, with the process repeated until satisfactory performance is achieved. After an MVP is deployed, one can start iterating right away using agile frameworks. Machine learning engineers usually prioritize sorting out infrastructure issues and keep the first model simple. A reliable pipeline allows further testing of more complex models.
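One simple iteration pattern is a champion/challenger check: retrain a candidate model and promote it only if it clearly beats the current one on the same evaluation data. The sketch below uses synthetic data, illustrative models, and an arbitrary promotion threshold.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=42)
X_train, X_eval, y_train, y_eval = train_test_split(X, y, test_size=0.2, random_state=42)

# Current simple model (champion) vs. a more complex candidate (challenger)
champion = LogisticRegression(max_iter=1000).fit(X_train, y_train)
challenger = RandomForestClassifier(random_state=42).fit(X_train, y_train)

champion_auc = roc_auc_score(y_eval, champion.predict_proba(X_eval)[:, 1])
challenger_auc = roc_auc_score(y_eval, challenger.predict_proba(X_eval)[:, 1])

# Promote only if the challenger clearly beats the current model
margin = 0.01  # illustrative threshold
print("promote challenger" if challenger_auc > champion_auc + margin else "keep champion")
```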
Step 3c. Model documentation
Documenting the model and the process of developing it is important for future reference, maintenance, and updating of the model. The documentation should include the data used, the algorithm, the parameters, and the performance of the model.
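A lightweight way to do this is to persist a small "model card" alongside the trained artifact. The fields and values below are purely illustrative; fill them in from your actual training run.

```python
import json
from datetime import date

# Illustrative model card; replace the values with those from your training run
model_card = {
    "model_name": "churn_random_forest",
    "version": "0.3.0",
    "trained_on": str(date.today()),
    "data": {"source": "customer_orders.csv", "rows": 20000, "period": "2023-01 to 2024-03"},
    "algorithm": "RandomForestClassifier",
    "parameters": {"n_estimators": 300, "max_depth": 20},
    "metrics": {"roc_auc_holdout": 0.87, "baseline_roc_auc": 0.50},
    "owner": "data-science-team",
}

with open("model_card.json", "w") as f:
    json.dump(model_card, f, indent=2)
```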
Step 3d. User support and training
Providing user support and training is crucial for the successful implementation of the machine learning system. This includes providing training on how to use the system, answering user queries, and providing technical support in case of any issues.
Overview
As you might have spotted, fragments of CRISP-DM and OSEMN were used to create a generalized framework for delivering a data science solution from idea to product. The suggested framework can be further customized to meet an organization's specific needs.
I hope this article has provided valuable insights into how to start an ML project, move from a PoC to an MVP, or continue improving existing ML solutions.
If you have found this information helpful, or have additional ideas, cases, or materials to share, please let me know in the comments section below.