Automated Machine Learning: Just How Much?

Consumers everywhere depend on Alexa, Siri, or Google’s Assistant for all sorts of things -- answering obscure trivia questions, checking the weather, ordering groceries, getting driving directions, turning on the lights, or even inspiring a dance party in the kitchen. These are wonderfully useful (often fun) AI-based devices that have enhanced people’s lives. However, humans are not actually partaking in deep, meaningful conversations with these devices. Rather, automated assistants answer the specific requests that are made of them.

If you’re exploring AI and machine learning in your enterprise, you may have encountered the claim that, if fully automated, these technologies can replace data scientists altogether. There is currently a lot of talk about the next wave of machine learning: automated machine learning (AutoML). There is also a high level of skepticism.

The term AutoML is being used extensively in data science conferences, publications, discussions, applications, and systems as an aid to developing better machine learning models. According to Gartner, more than 40% of data science tasks will be automated by 2020. This automation will boost the productivity of data scientists. Most machine learning and data science tasks require expert data engineers, data scientists, and researchers, and there is currently a huge talent shortage. The ability to automate repetitive data science tasks like choosing data sources, feature selection, and data preparation will compensate for the dearth of skilled data science experts. It will also help data scientists build more machine learning models in less time, improve prediction accuracy and model quality, and fine-tune more new ML algorithms. Data scientists can focus on the solution instead of spending time on the process of creating data science workflows.

A few companies, such as Facebook and Google, have already started using AutoML for internal processes. Facebook trains and tests approximately 300,000 ML models every month. It has created its own AutoML engine, referred to as Asimo, which automatically produces improved versions of existing machine learning models. Google is also developing AutoML techniques to automate the design of various machine learning models. AutoML is an exciting trend in the spotlight of the data science space, with big strides of progress anticipated in the near future.

What is automated machine learning?

Automated machine learning is about building a system, process or application able to automatically create, train and test machine learning models with as little human input as possible. The CRISP-DM cycle was introduced almost 20 years ago and is now an established process with standard steps, such as data preparation, feature engineering and feature optimization, model training, model optimization, model testing, and deployment, which are common to most data science projects. The aim of automation is to remove human interaction as much as possible from these steps.


There are different algorithms and strategies to do this, which vary in complexity and performance, but the main idea is empowering business analysts to train a great number of models and deliver the best one with just a small amount of configuration.
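As a rough illustration of that idea, here is a minimal sketch in Python with scikit-learn: a handful of candidate model families and small parameter grids are searched automatically, and the best cross-validated performer is kept. The candidate list, grids, and dataset are illustrative assumptions, not any particular AutoML product's search space.

```python
# Minimal sketch of automated model selection: try several candidates,
# keep the best cross-validated performer. All choices here are illustrative.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

candidates = {
    "logistic_regression": (LogisticRegression(max_iter=5000), {"C": [0.1, 1.0, 10.0]}),
    "decision_tree": (DecisionTreeClassifier(), {"max_depth": [3, 5, 10]}),
}

best_name, best_model, best_score = None, None, -1.0
for name, (estimator, grid) in candidates.items():
    search = GridSearchCV(estimator, grid, cv=5)  # cross-validated grid search
    search.fit(X_train, y_train)
    if search.best_score_ > best_score:
        best_name, best_model, best_score = name, search.best_estimator_, search.best_score_

print(best_name, best_score, best_model.score(X_test, y_test))
```

A business analyst only needs to supply the data and, at most, the candidate list; the loop handles training and selection.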

We usually only talk about automating machine learning, but it’s really about fitting automation into as many steps of the cycle as possible — not just the training/selection of models. For example, applications to automate data wrangling or data visualization are starting to appear as well.

AutoML Process


AutoML tools streamline the data science process, doing the best they can with the information they have available. There are three main stages to the process:

The first phase involves information ‘mining’ to improve the performance of the generated models by creating more information for them to learn from. This is very time-consuming to do manually, as the data scientist needs to uncover relationships between the data elements and devise ways of exposing those insights as additional data fields for the machine to pick up on during training.

This is an important phase, as this additional data very often means the difference between an unsuitable and an excellent model. AutoML is programmed to try a limited range of data discovery techniques, usually in a way that caters to the ‘average’ data problem, limiting the eventual performance of the model, as it is unable to use subject matter expert (SME) knowledge that can be essential to success – something which a data scientist can use to their advantage.
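To make the feature 'mining' idea concrete, here is a hypothetical example of the kind of derived fields a data scientist (or an automated feature-discovery step) might add. The column names and the night-time signal are invented for illustration.

```python
# Hypothetical derived features: relationships between existing fields
# exposed as new columns for the model to learn from.
import pandas as pd

transactions = pd.DataFrame({
    "amount": [120.0, 15.5, 980.0],
    "account_avg_amount": [100.0, 20.0, 150.0],
    "timestamp": pd.to_datetime(["2023-01-05 02:14", "2023-01-05 14:30", "2023-01-06 23:55"]),
})

# Relationship between fields: how unusual is this amount for the account?
transactions["amount_vs_avg"] = transactions["amount"] / transactions["account_avg_amount"]

# Exploit the timestamp: night-time activity is often a useful fraud signal.
transactions["hour"] = transactions["timestamp"].dt.hour
transactions["is_night"] = transactions["hour"].isin(range(0, 6)).astype(int)

print(transactions)
```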

Many data science problems start with significant manual effort going into selecting the data to present to an algorithm. Throwing all the data you have at the system will result in a sub-par model, as there are usually many different, often conflicting signals in the data, which need to be targeted and modelled individually.

This is especially true with fraud, where different geographical regions, payment channels, etc. have vastly differing fraud problems. The manual effort to discover these patterns and design appropriate data sets to allow for accurate detection is still largely un-automated. Taking a multipurpose automated approach to this problem is currently not possible due to the enormous complexity of such an undertaking.
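As a sketch of that manual data-selection step, the snippet below trains a separate model per geographical region instead of one model on everything. The column names and tiny sample data are invented for illustration.

```python
# Sketch of manual data selection: one model per region, so each model
# targets its own fraud pattern. Data and columns are illustrative.
import pandas as pd
from sklearn.linear_model import LogisticRegression

df = pd.DataFrame({
    "region": ["EU", "EU", "US", "US", "EU", "US"],
    "amount": [10.0, 500.0, 30.0, 900.0, 20.0, 15.0],
    "is_night": [0, 1, 0, 1, 0, 0],
    "is_fraud": [0, 1, 0, 1, 0, 0],
})

models = {}
for region, group in df.groupby("region"):
    model = LogisticRegression()
    model.fit(group[["amount", "is_night"]], group["is_fraud"])
    models[region] = model  # each geography gets its own, targeted model
```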

The next phase is model generation. Models with various configurations are created and trained using the data from the previous stage. This is critical as it is almost impossible to use a default configuration for every problem and get the best results.

AutoML has the edge over data scientists here, as it is capable of producing an enormous number of test models in a very short period of time. The majority of AutoML systems aim to be general purpose and produce only deep neural networks, which can be overkill for many problems where a simpler model, such as logistic regression or a decision tree, may be more suitable but would still benefit from hyperparameter optimisation.
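The hyperparameter optimisation mentioned here can be sketched with a simple model and a random search; the parameter distribution and number of iterations below are illustrative assumptions, and a real AutoML system would run many such searches in parallel.

```python
# Minimal sketch of hyperparameter optimisation for a simple model.
from scipy.stats import loguniform
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

X, y = load_breast_cancer(return_X_y=True)

search = RandomizedSearchCV(
    LogisticRegression(max_iter=5000),
    param_distributions={"C": loguniform(1e-3, 1e3)},  # regularisation strength
    n_iter=20,      # 20 randomly sampled configurations
    cv=5,
    random_state=0,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```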

The final phase is bulk performance testing and selection of the best performer. It is at this stage that some manual input is required, not least because it is imperative that the user selects the right model for the task. It’s no use having a fraud risk model that detects 100% of fraud but challenges every authorisation.
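One hedged way to express that trade-off in code is to select the candidate whose fraud detection rate is highest subject to a budget on how many genuine transactions it is allowed to challenge; the 1% budget and the helper names below are illustrative assumptions.

```python
# Business-aware model selection: maximise fraud detection while challenging
# no more than a fixed share of genuine transactions. Budget is illustrative.
from sklearn.metrics import roc_curve

def detection_at_challenge_budget(y_true, scores, max_false_positive_rate=0.01):
    # roc_curve returns the false positive rate (genuine transactions challenged)
    # and true positive rate (fraud detected) across score thresholds.
    fpr, tpr, _ = roc_curve(y_true, scores)
    allowed = tpr[fpr <= max_false_positive_rate]
    return float(allowed.max()) if allowed.size else 0.0

def pick_best(y_true, candidate_scores):
    # candidate_scores: {model_name: predicted fraud scores on a hold-out set}
    return max(candidate_scores,
               key=lambda name: detection_at_challenge_budget(y_true, candidate_scores[name]))
```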

In the current manual process, data scientists work with SMEs to understand the data and develop effective descriptive data features. This essential link between SME and data scientist is something that is missing from general purpose AutoML. As described earlier, the process attempts to generate these models automatically from what the tool can discover in the data, which may not be appropriate, leading to poorly performing models. Future AutoML systems should be designed around this, and other constraints, in order to produce models of the quality a data scientist would deliver.

"AutoML created an AI “child” that outperformed all of its human counterparts in many aspects but to replace human Data Scientists is still in the distant future." 

Will AutoML make Data Scientists obsolete?

Most definitely not, though I can understand why that’s a concern. AI is currently automating a fair number of tasks that data scientists do, but those tasks are relatively low value. The new features are time savers for data scientists, but cannot do what data scientists do. One of the key areas where automated machine learning falls short, and will continue to fall short for the foreseeable future, is feature engineering.

Recall that there are 5 key types of feature engineering:

  • Feature extraction – machines can easily do stuff like one-hot encoding or transforming existing variables (see the sketch after this list)
  • Feature estimation and selection – machines very easily do variable/predictor importance
  • Feature correction – fixing anomalies and errors which machines can partly do, but may not recognize all the errors (especially bias!)
  • Feature creation – the addition of net new data to the dataset – is still largely a creative task
  • Feature imputation – knowing what’s missing from a dataset – is far, far away from automation
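A small sketch of the first two items, using invented column names: one-hot encoding for feature extraction, and impurity-based importances for feature estimation and selection.

```python
# Feature extraction (one-hot encoding) and feature estimation/selection
# (importance ranking). Column names and data are illustrative.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

df = pd.DataFrame({
    "channel": ["web", "pos", "web", "atm", "pos", "atm"],
    "amount": [10.0, 250.0, 35.0, 600.0, 20.0, 900.0],
    "is_fraud": [0, 0, 0, 1, 0, 1],
})

# Feature extraction: one-hot encode the categorical channel column.
X = pd.get_dummies(df[["channel", "amount"]], columns=["channel"])
y = df["is_fraud"]

# Feature estimation/selection: rank predictors by importance.
model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
print(dict(zip(X.columns, model.feature_importances_)))
```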

The last two are nearly impossible for automated machine learning to accomplish; they require vast domain knowledge.

Can automated machine learning really fully automate the Data Science cycle with no expert intervention?

This is a tricky question! Some say it can, some say it can’t.

In my opinion, automated machine learning can fully automate the Data Science cycle for standard data science problems. You know the scenario: You have some data, the data is quite general and describes the problem well, no unbalanced classes. You choose a model, you train it on the training set, and you evaluate it on the test set. If performance is acceptable, you deploy it. No major surprises. In this case, the whole cycle can be automated, even introducing a few additional optimization steps.
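For such a standard problem, the fully automated cycle described above might look like the sketch below; the dataset, model choice, and the 0.9 accuracy threshold for "acceptable performance" are illustrative assumptions, not a recommendation.

```python
# Sketch of a fully automated "standard" cycle: split, train, evaluate,
# and persist the model for deployment if performance is acceptable.
import joblib
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)
accuracy = model.score(X_test, y_test)

if accuracy >= 0.9:                      # acceptable performance: deploy
    joblib.dump(model, "model.joblib")
else:                                    # otherwise a human would step in
    print(f"Accuracy {accuracy:.2f} below threshold; needs expert review")
```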

However, more complex Data Science problems are likely to call for some degree of human — or expert — input.

For example, the domain expert can add some unique knowledge about the data treatment and filtering before continuing with the machine learning process. Also, when the data domain becomes more complex than simple tabular data, e.g., including text, images or time series, the domain expert can contribute with custom techniques for data preparation, data partitioning, and feature engineering.

Basically, the answer to this question is “sometimes.” You can run a fully automated cycle or you can decide to intervene at select points along the way. This is the functionality offered through a feature called Guided Analytics. Guided Analytics allows you to intersperse your workflow with interaction points and thus steer the data science application in different directions, if this is needed.

Guided Analytics is about flexibly adding interaction points in the data pipeline — that is, in between the sequence of steps the data go through during the analysis. When you develop a data processing or data analytics application, you don’t just develop it for yourself; you develop it for other people to use too. So, in order to give anyone the opportunity to tweak how the analysis proceeds, you should add some interaction points at strategic locations throughout the pipeline.
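Guided Analytics is a KNIME feature, but the general idea of an interaction point can be sketched in plain Python as a hook where the analyst is asked for input before an otherwise automated pipeline continues. Everything in this snippet is an illustrative analogy, not the KNIME implementation.

```python
# Analogy for an interaction point: pause the automated pipeline and
# let the analyst steer one step before continuing.
import pandas as pd

def choose_columns(df: pd.DataFrame) -> pd.DataFrame:
    # Interaction point: let the analyst drop columns they consider useless.
    print("Available columns:", list(df.columns))
    to_drop = input("Columns to drop (comma-separated, or blank): ").strip()
    return df.drop(columns=[c.strip() for c in to_drop.split(",")]) if to_drop else df

def run_pipeline(df: pd.DataFrame) -> pd.DataFrame:
    df = choose_columns(df)   # interaction point: data selection/cleaning
    df = df.dropna()          # automated step: basic cleaning
    return df                 # further automated analysis would follow here
```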

[Figure: example Guided Analytics workflow with interaction points]

The figure above shows an illustrative example of such a workflow:

Several of the gray metanodes represent the “interaction points” of the workflow: the data scientist who built it designed them so that, when the workflow is executed on KNIME Server, fellow business analysts can interact at these points in the analysis. In the example workflow, the first interaction point allows the business analyst to select the data set(s) to analyze (“Choose Data”). After the data have been loaded, a second interaction point (“Data Cleaning”) displays a data overview and allows the business analyst to interact: remove useless columns, deal with outliers, fix skewed distributions – whatever the data scientist deemed interesting and relevant at this point. The part in the middle then runs through an analysis and allows the business analyst to provide feedback until a satisfactory result is reached. The workflow concludes by allowing the analyst to either deploy the model directly (in this case, into a database) or inspect the result in an interactive dashboard.

The future of AutoML

AutoML continues to be developed, and there have been some large improvements driven by the main current AutoML providers, Google and Microsoft. Those developments have focused mainly on improving the speed of generating production-ready models, rather than exploring how the technology can be improved for more difficult problems (fraud and network intrusion detection, for example), where AutoML can only go so far before data scientist input is required.

As AutoML solutions continue to develop and expand, more complex manual processes will become possible to automate. Current AutoML systems work extremely well on image and speech processing because the SME knowledge needed for those tasks is already embedded within them. Future AutoML systems will allow business users to input their own knowledge to aid the machine in generating very accurate models automatically.

On top of this, complex data science pipelines will become increasingly streamlined, and adding a larger variety of algorithms to optimise will further expand the range of problems citizen data scientists will be able to tackle.


Although many data science tasks will become automated, this will enable data scientists to perform bespoke tasks for the business, further driving innovation and allowing businesses to focus on more important revenue generation and growth activities.

"The first rule of any technology used in a business is that automation applied to an efficient operation will magnify the efficiency. The second is that automation applied to an inefficient operation will magnify the inefficiency."
