Automate your machine learning models to maximise business benefits
Aisha Ekundayo, PhD
Data Analytics Consulting | AI Consultant | Data Product Management
As a data scientist, your working life revolves around making optimal use of data: deriving value from it to improve decision making, helping the product team understand their end users better, or forecasting different aspects of the business to gain a competitive advantage. All in all, working with data and extracting valuable information from it will always be your top priority.
You want the machine learning models you build to be used repeatedly for decision making by various stakeholders. In other words, to maximise return on investment (ROI), the insights from the data should be available not just once but at all times, and they should become a vital component of business planning and strategy formulation.
With the modern problem being more data but less usable information, it is critical to have a strategy for making the information extracted from data available to end users at all times. In other words, you want the delivery of insights from your company’s data to be automated.
Otherwise, the effort and investment put into exploring the data, building models and making predictions may not realise the intended business benefits. This sounds counter-intuitive, but it is the reality of some data science work, where a PowerPoint presentation or a one-off dashboard in Power BI or Tableau is the only outcome. This can change with technologies that are more accessible and easier to deploy, at a reduced cost and in less time.
After spending a considerable amount of time preparing your data, training the models and making predictions, you need to put all these into a pipeline so that the output from the models can be automated for use in applications and other downstream processes.
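As a simple illustration of what "putting it all into a pipeline" can look like in code, here is a minimal sketch using scikit-learn, where preparation and modelling are chained into a single object that can be re-fitted and re-scored as new data arrives. The column names and file path are hypothetical placeholders, not from any particular project.

```python
# Minimal sketch: chaining data preparation and modelling into one reusable pipeline.
# The CSV path and column names are hypothetical placeholders.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.read_csv("customer_data.csv")              # new data can be dropped in here
X, y = df.drop(columns=["churned"]), df["churned"]

prep = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer()), ("scale", StandardScaler())]),
     ["tenure_months", "monthly_spend"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["segment", "region"]),
])

model = Pipeline([("prep", prep), ("clf", GradientBoostingClassifier())])
model.fit(X, y)                                    # retraining is a single call
predictions = model.predict_proba(X)[:, 1]         # scores for downstream use
```

The point is that once everything is wrapped in one pipeline object, retraining on fresh data and producing new predictions becomes a repeatable operation rather than a manual, step-by-step exercise.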
Deploying and automating your data science pipelines using DevOps practices also enables the pipeline to ingest new data automatically as it arrives and incorporate it into the dataset, so the models pick up new trends and patterns for more accurate predictions.
There are a few options on the market: AWS, Google Cloud, Microsoft Azure or IBM Cloud. In this blog, I will be writing about Microsoft Azure. Cloud computing brings numerous advantages, including flexibility, automated software updates, limited capital outlay, version control, collaboration, scalability and high availability.
Before we look at model operationalisation, I would like to outline the main steps involved in a typical data science project after business understanding and ingestion of the relevant datasets.
These are:
- Data Exploration
- Data Cleaning
- Feature Engineering
- Feature Selection
- Model Training
- Model Testing
- Model Scoring
Each of the components listed above will be part of the automated end-to-end (E2E) pipeline.
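One way to wire these steps into an automated E2E pipeline is with the Azure Machine Learning SDK, where each stage becomes a pipeline step running its own script. The sketch below assumes an existing workspace configuration, a compute cluster named "cpu-cluster" and script files such as clean.py, features.py and train.py; all of these names are placeholders for illustration.

```python
# Sketch of an Azure ML (Python SDK v1) pipeline where each project stage is a step.
# The workspace config, compute target name and script file names are assumptions.
from azureml.core import Workspace
from azureml.pipeline.core import Pipeline, PipelineData
from azureml.pipeline.steps import PythonScriptStep

ws = Workspace.from_config()                      # reads config.json for the workspace
compute = ws.compute_targets["cpu-cluster"]       # assumed existing compute cluster

cleaned = PipelineData("cleaned_data", datastore=ws.get_default_datastore())
features = PipelineData("features", datastore=ws.get_default_datastore())

clean_step = PythonScriptStep(
    name="data_cleaning", script_name="clean.py",
    arguments=["--output", cleaned], outputs=[cleaned], compute_target=compute)

feature_step = PythonScriptStep(
    name="feature_engineering", script_name="features.py",
    arguments=["--input", cleaned, "--output", features],
    inputs=[cleaned], outputs=[features], compute_target=compute)

train_step = PythonScriptStep(
    name="train_and_score", script_name="train.py",
    arguments=["--input", features], inputs=[features], compute_target=compute)

pipeline = Pipeline(workspace=ws, steps=[clean_step, feature_step, train_step])
pipeline.validate()
# Publishing the pipeline exposes a REST endpoint that can be triggered on a schedule.
```

Keeping each stage as its own step makes it easier to rerun only what changed and to monitor each part of the pipeline independently.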
To start building your pipeline, set up an Azure subscription to gain access to all the resources required. In your subscription, spin up an Azure Databricks workspace, an Azure Machine Learning service workspace or a Data Science Virtual Machine. Write all the code for your data science work in notebooks within the provisioned application.
If you are using Databricks, it has a Spark backend and uses parallelisation and a distributed DataFrame structure to reduce execution time. It is particularly useful if you are working with big data and IoT data. The Azure Machine Learning (AML) service also offers this capability and reduces time to production.
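To give a feel for the Spark backend, below is a small PySpark sketch that reads a large dataset into a distributed DataFrame and aggregates it in parallel. The mount paths and column names are placeholders; in a Databricks notebook the spark session already exists, but it is created explicitly here so the snippet is self-contained.

```python
# Sketch of distributed processing with PySpark; paths and columns are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("iot-prep").getOrCreate()

readings = spark.read.parquet("/mnt/raw/iot_readings")   # large, partitioned source

daily = (readings
         .withColumn("day", F.to_date("event_time"))
         .groupBy("device_id", "day")
         .agg(F.avg("temperature").alias("avg_temp"),
              F.count("*").alias("n_readings")))

daily.write.mode("overwrite").parquet("/mnt/curated/daily_device_stats")
```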
Next, operationalise your machine learning models using Azure Data Factory. It takes care of data ingestion, preparation and transformation by running scripts written in your preferred programming language (Python, Scala, etc.) in your provisioned application, such as Azure ML or Databricks. Azure Data Factory has a drag-and-drop UI, which makes it user-friendly, and it is the main orchestration centre for model operationalisation. It has three main components: Connections, used to connect to your data sources; Datasets and Pipelines, which provide the ETL functionality; and Triggers, used to schedule activities at pre-defined times.
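The same pipeline and trigger can also be defined in code rather than through the UI. The rough sketch below uses the azure-mgmt-datafactory SDK to register a pipeline that runs a Databricks notebook and schedule it daily; the resource group, factory, linked service and notebook path are placeholders, and exact model arguments can differ between SDK versions, so treat this as an outline rather than a drop-in script.

```python
# Rough sketch (azure-mgmt-datafactory SDK): schedule an ADF pipeline that runs a
# Databricks notebook daily. All names, IDs and paths below are placeholders, and
# exact model arguments may vary between SDK versions.
from datetime import datetime, timezone

from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    DatabricksNotebookActivity, LinkedServiceReference, PipelineReference,
    PipelineResource, ScheduleTrigger, ScheduleTriggerRecurrence,
    TriggerPipelineReference, TriggerResource)

adf = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")
rg, factory = "my-rg", "my-adf"

run_notebook = DatabricksNotebookActivity(
    name="score_models",
    notebook_path="/Repos/ml/score",
    linked_service_name=LinkedServiceReference(reference_name="DatabricksLS"))

adf.pipelines.create_or_update(rg, factory, "ml_scoring",
                               PipelineResource(activities=[run_notebook]))

trigger = ScheduleTrigger(
    recurrence=ScheduleTriggerRecurrence(
        frequency="Day", interval=1,
        start_time=datetime(2024, 1, 1, 6, tzinfo=timezone.utc)),
    pipelines=[TriggerPipelineReference(
        pipeline_reference=PipelineReference(reference_name="ml_scoring"))])

adf.triggers.create_or_update(rg, factory, "daily_6am",
                              TriggerResource(properties=trigger))
```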
Lastly, use Application Insights to monitor the performance of your pipeline through telemetry, as it reports all exceptions and events occurring across the pipeline. Continuous Integration/Continuous Delivery (CI/CD) can be set up in Azure DevOps to automatically integrate new features committed to your Git repository, so updates to your code flow through to your models without manual intervention.
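To make the monitoring side concrete, here is a small sketch using the opencensus-ext-azure package to push pipeline events and exceptions into Application Insights. The connection string is a placeholder from your App Insights resource, and run_scoring_step is a hypothetical function standing in for one stage of your pipeline.

```python
# Sketch: sending pipeline telemetry to Application Insights via opencensus-ext-azure.
# The connection string is a placeholder; run_scoring_step is a hypothetical stage.
import logging

from opencensus.ext.azure.log_exporter import AzureLogHandler

logger = logging.getLogger("ml_pipeline")
logger.setLevel(logging.INFO)
logger.addHandler(AzureLogHandler(
    connection_string="InstrumentationKey=00000000-0000-0000-0000-000000000000"))

logger.info("scoring step started", extra={"custom_dimensions": {"rows": 125000}})

try:
    run_scoring_step()                              # hypothetical pipeline stage
except Exception:
    logger.exception("scoring step failed")         # shows up as an exception in App Insights
    raise
```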
To summarise, following these steps will enable you to operationalise your models quickly and efficiently, and to scale the solution with ease. Scaling is straightforward because clusters can be configured to scale up or down automatically depending on the compute required.
Technology is always changing and data science is an iterative process, so an agile mindset and methodology are important for the continuous improvement of your applications or solutions. These applications will continue to evolve and improve, but the intuition and principles will largely remain the same.
To share the outputs from your models, publish a REST API through Azure API Management to serve your predictions, or provision an Azure Cosmos DB where analysts and data scientists can query the results from the models.
Enjoy deploying your machine learning models, and please share any hacks or tips that can help other data scientists trying to automate their ML models.
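As a quick illustration of the consumption side, the sketch below first calls a scoring endpoint published behind Azure API Management and then reads stored predictions from Cosmos DB. The endpoint URL, subscription key, database and container names, and the field names in the results are all placeholders.

```python
# Sketch of two ways to consume model output; URLs, keys and names are placeholders.
import requests
from azure.cosmos import CosmosClient

# 1) Call the scoring REST API exposed through Azure API Management
response = requests.post(
    "https://my-apim.azure-api.net/churn/score",
    headers={"Ocp-Apim-Subscription-Key": "<apim-key>"},
    json={"data": [{"tenure_months": 14, "monthly_spend": 52.3}]})
print(response.json())

# 2) Query stored predictions from Cosmos DB for ad-hoc analysis
cosmos = CosmosClient("https://my-cosmos.documents.azure.com:443/",
                      credential="<account-key>")
container = cosmos.get_database_client("ml").get_container_client("predictions")
latest = container.query_items(
    query="SELECT TOP 10 * FROM c ORDER BY c.scored_at DESC",
    enable_cross_partition_query=True)
for item in latest:
    print(item["customer_id"], item["churn_probability"])
```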