Automating Machine Learning (AutoML) Selection Criteria and Theoretical Principles

Kai R. Larsen

Published: October 2, 2018

[This is an excerpt from Kai R. Larsen and Daniel Becker, Automated Machine Learning for Business. Oxford University Press, 2019. The entire chapter is available here.]

Davenport and Patil (2012) suggested that coding was the most basic and universal skill of a data scientist but that this would be less true in five years. Impressively, they predicted that the more enduring skill would be to “communicate in the language that all stakeholders understand” and to demonstrate the special skills involved in storytelling with data. The next step for us, then, is to develop these abilities, which are vital for the data scientist or for a subject matter expert capable of doing some of the work of a data scientist.

True to Davenport and Patil’s vision, this book does not require coding skills unless your data is stored in complex databases or across multiple files. (We will provide some exercises allowing you to test your data skills if you are so inclined.) With that in mind, let’s review the content of the book laid out visually now in Figure 2.1. The machine learning life-cycle has its roots in extensive research and practitioner experience (Shearer, 2000) and is designed to be helpful to everyone from novices to machine learning experts.

The life cycle model, while figured linearly here, is not an altogether linear process. For every step in the process, lessons learned may require a return to a previous step, even multiple steps back. Unfortunately, it is not uncommon to get to the Interpret & Communicate stage and find a problem requiring a return to Define Project Objectives, but careful adherence to our suggestions should minimize such problems. In this book, each of the five stages is broken down into actionable steps, each examined in the context of the hospital readmission project.

This book takes advantage of Automated Machine Learning (AutoML) to illustrate the machine learning process in its entirety. We define AutoML as any machine learning system that automates the repetitive tasks required for effective machine learning. For this reason, amongst others, AutoML is capturing the imagination of specialists everywhere. Even Google’s world-leading Google Brain data scientists have been outperformed by AutoML (Le & Zoph, 2017). As machine learning progress is traceable mostly to computer science, it is worth seeing AutoML initially from the code-intensive standpoint. Traditionally, programming has been about automating or simplifying tasks otherwise performed by humans. Machine learning, on the other hand, is about automating complex tasks requiring accuracy and speed beyond the cognitive capabilities of the human brain. The latest in this development, AutoML, is the process of automating machine learning itself. AutoML insulates the analyst from the combined mathematical, statistical, and computer sciences that are taking place “under the hood,” so to speak. As one of us, Dan Becker, has been fond of pointing out, you do not learn to drive a car by studying engine components and how they interact. The process leading to a great driver begins with adjusting the mirrors, putting on the seat belt, placing the right foot on the brake, starting the car, putting the gear shift into “drive,” and slowly releasing the brake.

As the car starts moving, attention shifts to the outside environment as the driver evaluates the complex interactions between the gas, brake, and steering wheel combining to move the car. The driver is also responding to feedback, such as vibrations and the car’s position on the road, all of which require constant adjustments, and is focusing more on the car’s position on the road than on the parts that make it run. In the same way, we best discover machine learning without the distraction of considering its more complex working components: whether the computer you are on can handle the processing requirements of an algorithm, whether you picked the best algorithms, whether you understand how to tune the algorithms to perform their best, as well as a myriad of other considerations. While the Batmans of the world need to understand the difference between a gasoline-powered car and an electric car and how they generate and transfer power to the drivetrain, we thoroughly believe that the first introduction to machine learning should not require advanced mathematical skills.

2.1 What is Automated Machine Learning?

We started the chapter by defining AutoML as the process of automating machine learning, a very time- and knowledge-intensive process. A less self-referential definition may be “off-the-shelf methods that can be used easily in the field, without machine learning knowledge” (Guyon et al., 2015, p. 1). While this definition may be a bit too optimistic about how little knowledge of machine learning is necessary, it is a step in the right direction, especially for fully integrated AutoML, such as Salesforce Einstein.

Most companies that have adopted AutoML tend to be tight-lipped about the experience, but a notable exception comes from Airbnb, an iconic sharing economy company that recently shared its AutoML story (Husain & Handel, 2017). One of the most important data science tasks at Airbnb is to build customer lifetime value (LTV) models for both guests and hosts. This allows Airbnb to make decisions about individual hosts as well as aggregated markets such as any city. Because the traditional hospitality industry has extensive physical investments, whole cities are often lobbied to forbid or restrict sharing economy companies. Customer LTV models allow Airbnb to know where to fight such restrictions and where to expand operations.

To increase efficiency, Airbnb identified four areas where repetitive tasks negatively impacted the productivity of their data scientists. These were areas where AutoML had a definitive positive impact on productivity. While these will be discussed later, it is worth noting these important areas:

1. Exploratory data analysis. The process of examining the descriptive statistics for all features as well as their relationship with the target.

2. Feature engineering. The process of cleaning data, combining features, splitting features into multiple features, handling missing values, and dealing with text, to mention a few of potentially hundreds of steps.

3. Algorithm selection and hyperparameter tuning. Keeping up with the “dizzying number” of available algorithms and their quadrillions of parameter combinations and figuring out which work best for the data at hand.

4. Model diagnostics. Evaluation of top models, including the confusion matrix and different probability cutoffs.
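To make areas 3 and 4 a little more concrete, here is a minimal Python sketch of what automating them can look like at toy scale. It is not the approach used by Airbnb or by any AutoML product; the data, the two toy “algorithms,” and every function name are made up for illustration. The loop tries each algorithm and hyperparameter combination, ranks the candidates on held-out data, and then produces a confusion matrix for the top model.

```python
from typing import Callable, Dict, List, Tuple

Row = Tuple[float, int]          # (feature value, binary label)
Model = Callable[[float], int]   # maps a feature value to a predicted label

# Toy, made-up data for illustration only.
TRAIN: List[Row] = [(0.1, 0), (0.2, 0), (0.3, 0), (0.35, 0), (0.4, 1),
                    (0.5, 0), (0.6, 1), (0.7, 1), (0.8, 1), (0.9, 1)]
VALID: List[Row] = [(0.15, 0), (0.25, 0), (0.45, 0),
                    (0.55, 1), (0.65, 1), (0.85, 1)]


def fit_threshold(train: List[Row], cutoff: float) -> Model:
    """Toy algorithm 1: predict 1 when the feature is at or above a cutoff."""
    return lambda x: 1 if x >= cutoff else 0


def fit_knn(train: List[Row], k: int) -> Model:
    """Toy algorithm 2: majority vote among the k nearest training rows."""
    def predict(x: float) -> int:
        nearest = sorted(train, key=lambda row: abs(row[0] - x))[:k]
        votes = sum(label for _, label in nearest)
        return 1 if 2 * votes > len(nearest) else 0
    return predict


def accuracy(model: Model, rows: List[Row]) -> float:
    """Share of rows where the prediction matches the label."""
    return sum(model(x) == y for x, y in rows) / len(rows)


def confusion_matrix(model: Model, rows: List[Row]) -> Dict[str, int]:
    """Count true/false positives and negatives for a fitted model."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for x, y in rows:
        pred = model(x)
        key = ("t" if pred == y else "f") + ("p" if pred == 1 else "n")
        counts[key] += 1
    return counts


# Search space: every (algorithm, hyperparameter value) pair to try.
SEARCH_SPACE = (
    [("threshold", fit_threshold, c) for c in (0.3, 0.5, 0.7)]
    + [("knn", fit_knn, k) for k in (1, 3, 5)]
)


def run_search(train: List[Row], valid: List[Row]):
    """Fit every candidate and rank by validation accuracy, best first."""
    leaderboard = []
    for name, fit, param in SEARCH_SPACE:
        model = fit(train, param)
        leaderboard.append((accuracy(model, valid), name, param, model))
    # Sort on the score only; ties keep their search-space order.
    leaderboard.sort(key=lambda entry: entry[0], reverse=True)
    return leaderboard


if __name__ == "__main__":
    board = run_search(TRAIN, VALID)
    for score, name, param, _ in board:
        print(f"{name:10s} param={param!s:4s} accuracy={score:.2f}")
    print("Diagnostics for top model:", confusion_matrix(board[0][3], VALID))
```

A real AutoML system searches far more algorithms and parameter values, uses cross-validation rather than a single holdout, and reports diagnostics across many probability cutoffs, but the loop has the same shape: generate candidates, score them consistently, and inspect the leaders.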

In a stunning revelation, Airbnb stated that AutoML increased their data scientists’ productivity, “often by an order of magnitude” (Husain & Handel, 2017). Given data scientist salaries, this should make every CEO sit up and take notice. We have seen this multiplier in several of our projects and customer projects; Airbnb’s experiences fit with our own experiences in using AutoML for research. In one case, a doctoral student applied AutoML to a project that was then in its third month. The student came back after an investment of two hours in AutoML with a performance improvement twice that of earlier attempts. In this case, the problem was that he had not investigated a specific class of machine learning that turned out to work especially well for the data. It is worth noting that rather than feel defeated by this result, the student could fine-tune the hyperparameters of the discovered algorithm to later outperform the AutoML. Without AutoML in the first place, however, we would not have gotten such magnified results. Our experience tracks with feedback from colleagues in industry. The Chief Analytics Officer who convinced Kai to consider the area of AutoML told a story of how DataRobot, the AutoML used in this book, outperformed his large team of data scientists right out of the box. This had clearly impressed him because of the size of his team, their decades of math and statistics knowledge, and their domain expertise. Similarly, AutoML allowed Airbnb data scientists to reduce model error by over 5%, the significance of which can only be explained through analogy. Consider that Usain Bolt, the sprinter whose name has become synonymous with the 100-meter dash, has only improved the world record by 1.6 percent throughout his career (Aschwanden, 2017).

For all the potential of AutoML to support and speed up existing teams of data scientists, the greater potential of AutoML is that it enables the democratization of data science. It makes data science available and understandable to most, and makes subject matter expertise more important because it may now be faster to train a subject matter expert in the use of AutoML than it is to train a data scientist to understand the business subject matter at hand (for example, Accounting).

Sections 2.2 (What AutoML is NOT), 2.3 (Available Tools and Platforms), 2.4 (Eight Criteria for AutoML Excellence), and 2.5 (How Do the Fundamental Principles of Machine Learning and Artificial Intelligence Transfer to AutoML? A Point-by-Point Evaluation) are all available in the full article on ResearchGate. Contribute to the machine learning principles in this Google Sheet.

John L.

AI Architecture / Generative AI / Agentic AI / Enterprise Data and Analytics / ExMag7 / Development and Leadership

6 years ago

Aligning algorithms with use cases - then iterate multiple different algorithm flavors and compare results. Automation can work with many basic use cases - but for real competitive advantage - it requires modeling a process that can’t be easily automated.

Jonathan Hodges

Generative AI & Data Products Leader

6 years ago

Very cool! We plan on doing an evaluation of DataRobot and Driverless AI next year. Your book looks like it will be great companion to the effort.

Erland Mathias Strømmen

Senior Security Architect, Group Security | Financial Crime Prevention Services (FCPS)

6 years ago

Great view

Spencer Rafii

Acquisition Agent at New Western

6 years ago

Very Nice! I recognize the ML Process from somewhere ;)

Weston Ballard

Building in longevity tech | Stanford MBA

6 years ago

Always love reading this!
