ML Systems for Business: A Step-by-Step Guide

Ivan Reznikov

PhD, Principal Data Scientist || O'Reilly Book Author || TEDx/PyCon/GITEX Speaker || University Lecturer || LangChain, Large Language Models (LLMs) and Generative AI || 30K+ followers

Published: February 7, 2023

Machine learning has rapidly transformed the business world in recent years, offering companies new opportunities to improve efficiency, streamline operations, and gain a competitive edge. As a result, more organizations want to develop custom machine learning systems tailored to their specific business needs. However, creating a machine learning system from scratch can be a complex and intimidating process that requires a deep understanding of both the technical and the business aspects of the project. In this article, we will explore the key considerations and steps involved in building a machine learning system from the ground up.

CRISP-DM and OSEMN

CRISP-DM and OSEMN frameworks are popular methodologies for organizing and conducting data science projects. They provide structured and standardized approaches for developing data science projects. Their aim is to help data science teams stay organized and focused, ensuring high-quality solutions are delivered on time to meet the business's needs.

The CRISP-DM framework is a six-step process for data science projects:

Figure: CRISP-DM framework (image generated by the author).

1. Business Understanding

2. Data Understanding

3. Data Preparation

4. Modeling

5. Evaluation

6. Deployment

CRISP-DM focuses on ensuring that a project is well-structured and that the different stages of the project are completed systematically and iteratively.

The OSEMN framework, on the other hand, is a five-step process:

1. Obtaining

2. Scrubbing

3. Exploring

4. Modeling

5. Interpreting

OSEMN focuses on the technical aspects of data science, such as cleaning and preparing data, building models, and interpreting results.

One key difference between CRISP-DM and OSEMN is that CRISP-DM strongly emphasizes the business understanding and deployment stages, whereas OSEMN concentrates on the technical aspects of data science. CRISP-DM is better suited to large and complex data science projects, while OSEMN is better suited to smaller and more focused ones.

While CRISP-DM and OSEMN frameworks can provide a solid foundation for a general data science project, businesses must assess whether these frameworks suit their specific needs before adopting them. Often, these frameworks are not enough to meet the goals and requirements of the business.

Below is my vision, based on experience, of what a general-purpose data science framework should look like.

ML Project from Scratch

Developing a machine learning system from scratch for business needs requires a systematic approach, starting with understanding business needs, building a proof of concept, developing the model, testing it with fine-tuning, and finally, deployment and support.

Below is a step-by-step guide on how I attempt to build production ML systems.

Step 1. Building a Proof of Concept (PoC)

Step 1a. Understanding the Business Needs

Before starting development, it is vital to understand the business requirements and objectives of the machine learning system. This includes identifying the problem that needs to be solved, the data available, and the expected outcomes. It helps to ensure that the right questions are being addressed and that resources are not being wasted.

Step 1b. Understanding Data

Understanding data is a crucial aspect of any data science project. It is the foundation of any data-driven decision-making process and helps uncover meaningful insights and patterns. Data understanding helps identify suitable data sources and ensures that the data collected is relevant and accurate. Moreover, it also helps identify any potential biases or outliers that may affect the analysis results.

Step 1c. Exploratory Data Analysis (EDA)

Once the business questions are defined, initial data can be collected, roughly cleaned, and analyzed using various statistical and machine learning techniques to extract meaningful insights and patterns. These insights can be used to inform business decisions, optimize processes, and drive growth. It is also important to validate the findings by comparing them with historical data and industry benchmarks.
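A first pass over a numeric column can be as simple as the sketch below. It is only an illustration: the `order_values` data is made up, missing values are assumed to be encoded as `None`, and the 1.5 × IQR rule is one common convention for flagging outliers, not the only one.

```python
import statistics

def summarize_column(values):
    """Return basic EDA statistics for one numeric column (None = missing)."""
    present = [v for v in values if v is not None]
    # Quartiles with the statistics module's default ("exclusive") method.
    q1, median, q3 = statistics.quantiles(present, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return {
        "count": len(present),
        "missing": len(values) - len(present),
        "mean": statistics.fmean(present),
        "median": median,
        # Values outside the 1.5 * IQR fences are flagged, not removed.
        "outliers": [v for v in present if v < low or v > high],
    }

# Hypothetical daily order values with one gap and one suspicious spike.
order_values = [20, 22, 19, None, 21, 23, 20, 250, 22, 21]
summary = summarize_column(order_values)
print(summary)
```

Flagged values are worth discussing with the business before dropping them: a spike may be a data error or a real event such as a promotion.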

Step 1d. Building a Proof of Concept (PoC)

A PoC helps validate, on a small scale, that the machine learning solution is feasible and can meet the business requirements. This stage is crucial because it shows whether the solution is worth pursuing. Usually, experiments are run on a subset of the data, using different algorithms and sometimes tuning their parameters. A successful PoC can serve as a basis for further development. Success is measured by an appropriate metric that relates directly to the business requirements. Keep in mind that a PoC is only a preliminary evaluation of the solution and may not reflect the final performance of a fully developed product.
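A PoC experiment of this kind can be sketched in a few lines. Everything here is an assumption made for illustration: the synthetic ad-spend data, the two candidate models (`fit_mean` and `fit_line`), and mean absolute error as the agreed metric. In a real project the candidates would usually come from an ML library.

```python
import random
import statistics

def mean_absolute_error(actual, predicted):
    return statistics.fmean(abs(a - p) for a, p in zip(actual, predicted))

def fit_mean(xs, ys):
    """Candidate 1: always predict the average of the training targets."""
    avg = statistics.fmean(ys)
    return lambda x: avg

def fit_line(xs, ys):
    """Candidate 2: one-feature least-squares line."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return lambda x: my + slope * (x - mx)

# Hypothetical data: ad spend (x) versus weekly sales (y), with noise.
rng = random.Random(42)
rows = [(x, 3.0 * x + 10 + rng.gauss(0, 1)) for x in range(200)]

# A PoC works on a subset: sample 60 rows, hold out the last 20 for scoring.
subset = rng.sample(rows, 60)
train, holdout = subset[:40], subset[40:]

scores = {}
for name, fit in {"mean": fit_mean, "line": fit_line}.items():
    model = fit([x for x, _ in train], [y for _, y in train])
    scores[name] = mean_absolute_error(
        [y for _, y in holdout], [model(x) for x, _ in holdout]
    )

best = min(scores, key=scores.get)  # lower error is better
print(scores, best)
```

The point of the sketch is the shape of the experiment: one subset, several candidates, one metric agreed with the business.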

Step 2. Building a Minimum Viable Product (MVP)

Building a Minimum Viable Product (MVP) is a crucial stage in developing any data-driven solution. The MVP is a simplified version of the final product that includes only the essential features needed to solve the core problem. The objective of an MVP is to validate the hypothesis and gather feedback from stakeholders: users and customers, partners, and investors.

An MVP in data science differs from a PoC in that it is a market-ready product, while a PoC tests the feasibility of a solution. A PoC typically focuses on evaluating the technical aspects of the solution: accuracy of the algorithms, processing speed, etc. In contrast, an MVP focuses on solving a business problem and delivering value to the end users. While building the MVP, take into account the data and ML issues you can anticipate. This way, it can be scaled more easily as the business grows, reducing the risk of technical debt and the need for significant rework in the future.

Step 2a. Data Preparation

The quality of a machine learning system depends on the quality of its data. Teams such as sales, marketing, and customer support can provide valuable insights into customer behavior and preferences. Collecting data from these teams helps in understanding the target audience, their needs, and how the product or service can be improved to meet those needs. After cleaning the data, which might take more time than expected, comes the fun part. Feature engineering is where data scientists use their domain knowledge and creativity to extract meaningful information from the data. They manipulate and transform the data to derive relevant features, which can then be used for building models.
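As a small illustration of feature engineering, the sketch below turns raw transactions into one feature row per customer. The transaction layout, the feature names, and the `as_of` reference date are all hypothetical choices for this example.

```python
from collections import defaultdict
from datetime import date

# Hypothetical raw transactions: (customer_id, order_date, amount).
transactions = [
    ("c1", date(2023, 1, 5), 40.0),
    ("c1", date(2023, 1, 20), 60.0),
    ("c2", date(2023, 1, 10), 15.0),
]

def build_customer_features(rows, as_of):
    """Aggregate transactions into one feature row per customer."""
    grouped = defaultdict(list)
    for customer_id, order_date, amount in rows:
        grouped[customer_id].append((order_date, amount))

    features = {}
    for customer_id, orders in grouped.items():
        amounts = [amount for _, amount in orders]
        last_order = max(order_date for order_date, _ in orders)
        features[customer_id] = {
            "order_count": len(orders),
            "total_spend": sum(amounts),
            "avg_order_value": sum(amounts) / len(amounts),
            # Recency is computed against a fixed date to stay reproducible.
            "days_since_last_order": (as_of - last_order).days,
        }
    return features

features = build_customer_features(transactions, as_of=date(2023, 2, 1))
print(features["c1"])
```

Features like recency and average order value often come straight from conversations with sales or support teams, which is where domain knowledge enters the process.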

Step 2b. Model Development

In this step, various machine learning algorithms are tested and compared to determine the most suitable one for the business case. The selected algorithm is then further developed and tuned on validation data, with a held-out test set reserved for the final, unbiased performance estimate.
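One common way to keep this optimization honest is to split the data three ways: fit on the training part, tune on the validation part, and touch the test part only once. The sketch below assumes a toy "model" that is just a score threshold, and the data and candidate thresholds are invented for the example.

```python
import random

def three_way_split(rows, seed=0, train=0.6, validation=0.2):
    """Shuffle and split rows into train, validation, and test lists."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train)
    n_val = round(len(shuffled) * validation)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])

def accuracy(rows, threshold):
    """Share of rows where 'score >= threshold' matches the label."""
    return sum((score >= threshold) == label for score, label in rows) / len(rows)

# Hypothetical scored examples: the label is True when the score exceeds 0.5.
rows = [(i / 100, i > 50) for i in range(100)]
train_rows, val_rows, test_rows = three_way_split(rows)
# train_rows would be used to fit a real model; the toy threshold needs no fitting.

# Tune the threshold on validation data only.
candidates = [0.3, 0.4, 0.505, 0.6, 0.7]
best_threshold = max(candidates, key=lambda t: accuracy(val_rows, t))

# Touch the test set once, for the final estimate.
test_accuracy = accuracy(test_rows, best_threshold)
print(best_threshold, test_accuracy)
```

Tuning directly on the test set would make the reported performance look better than what the model will deliver on new data.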

Step 2c. Model Evaluation

Translating the desired business outcomes into specific metrics that can be measured within the machine learning model is crucial. This helps ensure that the model is aligned with the goals and objectives of the project and that it will ultimately deliver value to the business. Regardless of the metric chosen, it is necessary to establish a baseline measure of performance to track progress and judge the rate of return from increasing the complexity of the modeling solution.
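To make this concrete, here is a sketch of a business metric and a baseline for a hypothetical churn campaign. The cost figures, the customer outcomes, and the "do nothing" baseline are assumptions chosen for illustration.

```python
def campaign_cost(actual, predicted, missed_churn_cost=100.0, wasted_offer_cost=10.0):
    """Business cost: missed churners are expensive, unnecessary offers are cheap."""
    cost = 0.0
    for churned, flagged in zip(actual, predicted):
        if churned and not flagged:
            cost += missed_churn_cost
        elif flagged and not churned:
            cost += wasted_offer_cost
    return cost

# Hypothetical outcomes for eight customers (True = churned / flagged).
actual = [True, False, False, True, False, False, True, False]
baseline = [False] * len(actual)  # baseline: never flag anyone
model = [True, False, True, True, False, False, False, False]

baseline_cost = campaign_cost(actual, baseline)
model_cost = campaign_cost(actual, model)
savings = baseline_cost - model_cost
print(baseline_cost, model_cost, savings)
```

Expressing results as savings over the baseline makes it easier to judge whether a more complex model is worth its extra cost.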

Step 2d. Model Deployment

The deployed model should be able to handle incoming data, make predictions, and sometimes even return results in real time. The deployment process involves integrating the model into the business, setting up the infrastructure for hosting the model, and testing the deployment. The pipelines are triggered, results are stored, and notifications and alerts are sent.
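The flow described above (score incoming records, store results, send alerts) can be sketched as follows. The `predict` rule, the `send_alert` stand-in, the table layout, and the alert threshold are all hypothetical. A real deployment would call the trained model and a real notification service.

```python
import json
import sqlite3

def predict(record):
    """Stand-in for the trained model: a hypothetical scoring rule."""
    return min(1.0, record["days_inactive"] / 90)

def send_alert(message, outbox):
    """Stand-in for an email or chat notification; here it appends to a list."""
    outbox.append(message)

def run_pipeline(records, connection, outbox, alert_threshold=0.8):
    """Score each record, store the result, and alert on high scores."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS predictions "
        "(customer_id TEXT, score REAL, payload TEXT)"
    )
    for record in records:
        score = predict(record)
        connection.execute(
            "INSERT INTO predictions VALUES (?, ?, ?)",
            (record["customer_id"], score, json.dumps(record)),
        )
        if score >= alert_threshold:
            send_alert(f"High churn risk for {record['customer_id']}: {score:.2f}", outbox)
    connection.commit()

connection = sqlite3.connect(":memory:")
outbox = []
incoming = [
    {"customer_id": "c1", "days_inactive": 9},
    {"customer_id": "c2", "days_inactive": 81},
]
run_pipeline(incoming, connection, outbox)
stored = connection.execute(
    "SELECT customer_id, score FROM predictions ORDER BY customer_id"
).fetchall()
print(stored, outbox)
```

Storing the input payload next to each prediction makes later debugging and monitoring much easier.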

Step 3. Next Steps

The PoC was a success, and the MVP showed good traction. Now what? Let's concentrate on how we can improve the solution.

Step 3a. Monitoring results

Monitoring the performance of a machine learning model helps determine how effective the model is. To make informed decisions, track and evaluate the results regularly. This is where monitoring tools such as Power BI, Tableau, or even Python visualization packages come into play, providing a clear and easy-to-understand visual representation of the data. Regular monitoring can have a significant impact on the success of the project and on overall business outcomes.
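Before anything reaches a dashboard, a simple automated check can already flag problems. The sketch below assumes a weekly accuracy figure is available and uses made-up numbers, a made-up baseline, and an arbitrary tolerance.

```python
def flag_degradation(weekly_metric, baseline, tolerance=0.05):
    """Return the weeks whose metric fell more than `tolerance` below the baseline."""
    return [week for week, value in weekly_metric.items()
            if value < baseline - tolerance]

# Hypothetical weekly accuracy of the deployed model.
weekly_accuracy = {
    "2023-W01": 0.91,
    "2023-W02": 0.90,
    "2023-W03": 0.84,
    "2023-W04": 0.79,
}
alerts = flag_degradation(weekly_accuracy, baseline=0.90)
print(alerts)
```

The same series can then be plotted in whichever visualization tool the team already uses.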

Step 3b. Solution performance

The model's performance can usually be improved by adding new data or features, changing the algorithm, fine-tuning the parameters, or in other ML-related ways. The improved model is then retrained and evaluated, and the process is repeated until satisfactory performance is achieved. After an MVP is deployed, one can start iterating right away using agile frameworks. Machine learning engineers usually prioritize sorting out infrastructure issues and keep the first model simple. A reliable pipeline then makes it possible to test more complex models.

Step 3c. Model documentation

Documenting the model and the process of developing it is important for future reference, maintenance, and updating of the model. The documentation should include the data used, the algorithm, the parameters, and the performance of the model.
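One lightweight option is to keep this documentation as a structured record stored next to the model artifact. The `ModelCard` class, its field names, and every value below are placeholders for illustration, not results from a real project.

```python
import json
from dataclasses import asdict, dataclass, field

@dataclass
class ModelCard:
    """Minimal model documentation: data, algorithm, parameters, performance."""
    name: str
    training_data: str
    algorithm: str
    parameters: dict = field(default_factory=dict)
    metrics: dict = field(default_factory=dict)

# All values are placeholders, not real results.
card = ModelCard(
    name="churn-model-v1",
    training_data="CRM export, 2022-01 to 2022-12",
    algorithm="gradient boosted trees",
    parameters={"n_estimators": 200, "learning_rate": 0.1},
    metrics={"validation_auc": 0.87},
)

card_json = json.dumps(asdict(card), indent=2, sort_keys=True)
print(card_json)
```

Writing the card from the training script itself keeps the documentation in step with the model that was actually deployed.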

Step 3d. User support and training

Providing user support and training is crucial for the successful implementation of the machine learning system. This includes providing training on how to use the system, answering user queries, and providing technical support in case of any issues.

Overview

As you may have noticed, fragments of CRISP-DM and OSEMN were used to create a generalized framework for delivering a data science solution from idea to product. The suggested framework can be further customized to meet an organization's specific needs.

I hope this article has provided valuable insights into how to start an ML project, move from PoC to MVP, or continue improving existing ML solutions.

If you have found this information helpful, or have additional ideas, cases, or materials to share, please let me know in the comments section below.

Author's note: a #retail example has been added to the Medium version: https://medium.com/@ivanreznikov/business-ml-systems-from-scratch-to-product-d35c7cd8490e
