Automated Machine Learning: Just How Much?

Consumers everywhere depend on Alexa, Siri, or Google’s Assistant for all sorts of things -- answering obscure trivia questions, checking the weather, ordering groceries, getting driving directions, turning on the lights, or even inspiring a dance party in the kitchen. These are wonderfully useful (often fun) AI-based devices that have enhanced people’s lives. However, humans are not actually partaking in deep, meaningful conversations with these devices. Rather, automated assistants answer the specific requests that are made of them.

If you’re exploring AI and machine learning in your enterprise, you may have encountered the claim that, if fully automated, these technologies can replace data scientists altogether. There is currently a lot of talk about the next wave of machine learning: automated machine learning (AutoML). There is also a high level of skepticism.

The term AutoML is being used extensively in data science conferences, publications, discussions, applications, and systems as an aid to developing better machine learning models. According to Gartner, more than 40% of data science tasks will be automated by 2020. This automation will boost the productivity of data scientists. Most machine learning and data science tasks require expert data engineers, data scientists, and researchers, and there is currently a huge talent shortage. The ability to automate repetitive data science tasks like choosing data sources, feature selection, and data preparation will compensate for the dearth of skilled data science experts. It will also help data scientists build more machine learning models in less time, improve prediction accuracy and model quality, and fine-tune more new ML algorithms. Data scientists can focus on the solution instead of spending time on the process of creating data science workflows.

A few companies, such as Facebook and Google, have already started using AutoML for internal processes. Facebook trains and tests approximately 300,000 ML models every month. It has created its own AutoML engine, referred to as Asimo, which automatically produces improved versions of existing machine learning models. Google is also developing AutoML techniques to automate the design of various machine learning models. AutoML is an exciting trend in the spotlight of the data science space, with big strides of progress anticipated in the near future.

What is automated machine learning?

Automated machine learning is about building a system, process or application able to automatically create, train and test machine learning models with as little human input as possible. The CRISP-DM cycle was introduced almost 20 years ago and is now an established process with standard steps, such as data preparation, feature engineering and feature optimization, model training, model optimization, model testing, and deployment, which are common to most data science projects. The aim of automation is to remove human interaction as much as possible from these steps.


There are different algorithms and strategies to do this, which vary in complexity and performance, but the main idea is empowering business analysts to train a great number of models and deliver the best one with just a small amount of configuration.
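As a rough illustration of that idea, here is a minimal sketch in Python with scikit-learn: a handful of candidate model families and small parameter grids are searched automatically, and the best cross-validated performer is kept. The candidate list, grids, and dataset are illustrative assumptions, not any particular AutoML product's search space.

```python
# Minimal sketch of automated model selection: try several candidates,
# keep the best cross-validated performer. All choices here are illustrative.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

candidates = {
    "logistic_regression": (LogisticRegression(max_iter=5000), {"C": [0.1, 1.0, 10.0]}),
    "decision_tree": (DecisionTreeClassifier(), {"max_depth": [3, 5, 10]}),
}

best_name, best_model, best_score = None, None, -1.0
for name, (estimator, grid) in candidates.items():
    search = GridSearchCV(estimator, grid, cv=5)  # cross-validated grid search
    search.fit(X_train, y_train)
    if search.best_score_ > best_score:
        best_name, best_model, best_score = name, search.best_estimator_, search.best_score_

print(best_name, best_score, best_model.score(X_test, y_test))
```

A business analyst only needs to supply the data and, at most, the candidate list; the loop handles training and selection.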

We usually only talk about automating machine learning, but it’s really about fitting automation into as many steps of the cycle as possible — not just the training/selection of models. For example, applications to automate data wrangling or data visualization are starting to appear as well.

AutoML Process


AutoML tools streamline the data science process, doing the best they can with the information they have available. There are three main stages to the process:

The first phase involves information ‘mining’ to improve the performance of the generated models by creating more information for them to learn from. This is very time-consuming to do manually, as the data scientist needs to uncover relationships between the data elements and devise ways of exposing those insights as additional data fields for the machine to pick up on during training.

This is an important phase, as this additional data very often means the difference between an unsuitable and an excellent model. AutoML is programmed to try a limited range of data discovery techniques, usually in a way that caters to the ‘average’ data problem, limiting the eventual performance of the model, as it is unable to use subject matter expert (SME) knowledge that can be essential to success – something which a data scientist can use to their advantage.
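To make the feature 'mining' idea concrete, here is a hypothetical example of the kind of derived fields a data scientist (or an automated feature-discovery step) might add. The column names and the night-time signal are invented for illustration.

```python
# Hypothetical derived features: relationships between existing fields
# exposed as new columns for the model to learn from.
import pandas as pd

transactions = pd.DataFrame({
    "amount": [120.0, 15.5, 980.0],
    "account_avg_amount": [100.0, 20.0, 150.0],
    "timestamp": pd.to_datetime(["2023-01-05 02:14", "2023-01-05 14:30", "2023-01-06 23:55"]),
})

# Relationship between fields: how unusual is this amount for the account?
transactions["amount_vs_avg"] = transactions["amount"] / transactions["account_avg_amount"]

# Exploit the timestamp: night-time activity is often a useful fraud signal.
transactions["hour"] = transactions["timestamp"].dt.hour
transactions["is_night"] = transactions["hour"].isin(range(0, 6)).astype(int)

print(transactions)
```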

Many data science problems start with significant manual effort going into selecting the data to present to an algorithm. Throwing all the data you have at the system will result in a sub-par model, as there are usually many different, often conflicting signals in the data, which need to be targeted and modelled individually.

This is especially true with fraud, where different geographical regions, payment channels, etc. have vastly differing fraud problems. The manual effort to discover these patterns and design appropriate data sets to allow for accurate detection is still largely un-automated. Taking a multipurpose automated approach to this problem is currently not possible due to the enormous complexity of such an undertaking.
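As a sketch of that manual data-selection step, the snippet below trains a separate model per geographical region instead of one model on everything. The column names and tiny sample data are invented for illustration.

```python
# Sketch of manual data selection: one model per region, so each model
# targets its own fraud pattern. Data and columns are illustrative.
import pandas as pd
from sklearn.linear_model import LogisticRegression

df = pd.DataFrame({
    "region": ["EU", "EU", "US", "US", "EU", "US"],
    "amount": [10.0, 500.0, 30.0, 900.0, 20.0, 15.0],
    "is_night": [0, 1, 0, 1, 0, 0],
    "is_fraud": [0, 1, 0, 1, 0, 0],
})

models = {}
for region, group in df.groupby("region"):
    model = LogisticRegression()
    model.fit(group[["amount", "is_night"]], group["is_fraud"])
    models[region] = model  # each geography gets its own, targeted model
```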

The next phase is model generation. Models with various configurations are created and trained using the data from the previous stage. This is critical as it is almost impossible to use a default configuration for every problem and get the best results.

AutoML has the edge over data scientists here, as it is capable of producing an enormous number of test models in a very short period of time. The majority of AutoML systems aim to be general purpose and produce only deep neural networks, which can be overkill for many problems where a simpler model, such as logistic regression or a decision tree, may be more suitable but would still benefit from hyperparameter optimisation.
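The hyperparameter optimisation mentioned here can be sketched with a simple model and a random search; the parameter distribution and number of iterations below are illustrative assumptions, and a real AutoML system would run many such searches in parallel.

```python
# Minimal sketch of hyperparameter optimisation for a simple model.
from scipy.stats import loguniform
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

X, y = load_breast_cancer(return_X_y=True)

search = RandomizedSearchCV(
    LogisticRegression(max_iter=5000),
    param_distributions={"C": loguniform(1e-3, 1e3)},  # regularisation strength
    n_iter=20,      # 20 randomly sampled configurations
    cv=5,
    random_state=0,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```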

The final phase is bulk performance testing and selection of the best performer. It is at this stage that some manual input is required, not least because it is imperative that the user selects the right model for the task. It’s no use having a fraud risk model that detects 100% of fraud but challenges every authorisation.
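One hedged way to express that trade-off in code is to select the candidate whose fraud detection rate is highest subject to a budget on how many genuine transactions it is allowed to challenge; the 1% budget and the helper names below are illustrative assumptions.

```python
# Business-aware model selection: maximise fraud detection while challenging
# no more than a fixed share of genuine transactions. Budget is illustrative.
from sklearn.metrics import roc_curve

def detection_at_challenge_budget(y_true, scores, max_false_positive_rate=0.01):
    # roc_curve returns the false positive rate (genuine transactions challenged)
    # and true positive rate (fraud detected) across score thresholds.
    fpr, tpr, _ = roc_curve(y_true, scores)
    allowed = tpr[fpr <= max_false_positive_rate]
    return float(allowed.max()) if allowed.size else 0.0

def pick_best(y_true, candidate_scores):
    # candidate_scores: {model_name: predicted fraud scores on a hold-out set}
    return max(candidate_scores,
               key=lambda name: detection_at_challenge_budget(y_true, candidate_scores[name]))
```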

In the current manual process, data scientists work with SMEs to understand the data and develop effective descriptive data features. This essential link between SME and data scientist is something that is missing from general purpose AutoML. As described earlier, the process attempts to generate these models automatically from what the tool can discover in the data, which may not be appropriate, leading to poorly performing models. Future AutoML systems should be designed around this, and other constraints, in order to produce models of the quality a data scientist would deliver.

"AutoML created an AI “child” that outperformed all of its human counterparts in many aspects but to replace human Data Scientists is still in the distant future." 

Will AutoML make Data Scientists obsolete?

Most definitely not, though I can understand why that’s a concern. AI is currently automating a fair number of tasks that data scientists do, but those tasks are relatively low value. The new features are time savers for data scientists, but cannot do what data scientists do. One of the key areas where automated machine learning falls short, and will continue to fall short for the foreseeable future, is feature engineering.

Recall that there are 5 key types of feature engineering:

  • Feature extraction – machines can easily do stuff like one-hot encoding or transforming existing variables (see the sketch after this list)
  • Feature estimation and selection – machines very easily do variable/predictor importance
  • Feature correction – fixing anomalies and errors which machines can partly do, but may not recognize all the errors (especially bias!)
  • Feature creation – the addition of net new data to the dataset – is still largely a creative task
  • Feature imputation – knowing what’s missing from a dataset – is far, far away from automation
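A small sketch of the first two items, using invented column names: one-hot encoding for feature extraction, and impurity-based importances for feature estimation and selection.

```python
# Feature extraction (one-hot encoding) and feature estimation/selection
# (importance ranking). Column names and data are illustrative.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

df = pd.DataFrame({
    "channel": ["web", "pos", "web", "atm", "pos", "atm"],
    "amount": [10.0, 250.0, 35.0, 600.0, 20.0, 900.0],
    "is_fraud": [0, 0, 0, 1, 0, 1],
})

# Feature extraction: one-hot encode the categorical channel column.
X = pd.get_dummies(df[["channel", "amount"]], columns=["channel"])
y = df["is_fraud"]

# Feature estimation/selection: rank predictors by importance.
model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
print(dict(zip(X.columns, model.feature_importances_)))
```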

The last two are nearly impossible for automated machine learning to accomplish; they require vast domain knowledge.

Can automated machine learning really fully automate the Data Science cycle with no expert intervention?

This is a tricky question! Some say it can, some say it can’t.

In my opinion, automated machine learning can fully automate the Data Science cycle for standard data science problems. You know the scenario: You have some data, the data is quite general and describes the problem well, no unbalanced classes. You choose a model, you train it on the training set, and you evaluate it on the test set. If performance is acceptable, you deploy it. No major surprises. In this case, the whole cycle can be automated, even introducing a few additional optimization steps.
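For such a standard problem, the fully automated cycle described above might look like the sketch below; the dataset, model choice, and the 0.9 accuracy threshold for "acceptable performance" are illustrative assumptions, not a recommendation.

```python
# Sketch of a fully automated "standard" cycle: split, train, evaluate,
# and persist the model for deployment if performance is acceptable.
import joblib
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)
accuracy = model.score(X_test, y_test)

if accuracy >= 0.9:                      # acceptable performance: deploy
    joblib.dump(model, "model.joblib")
else:                                    # otherwise a human would step in
    print(f"Accuracy {accuracy:.2f} below threshold; needs expert review")
```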

However, more complex Data Science problems are likely to call for some degree of human — or expert — input.

For example, the domain expert can add some unique knowledge about the data treatment and filtering before continuing with the machine learning process. Also, when the data domain becomes more complex than simple tabular data, e.g., including text, images or time series, the domain expert can contribute with custom techniques for data preparation, data partitioning, and feature engineering.

Basically, the answer to this question is “sometimes.” You can run a fully automated cycle or you can decide to intervene at select points along the way. This is the functionality offered through a feature called Guided Analytics. Guided Analytics allows you to intersperse your workflow with interaction points and thus steer the data science application in different directions, if this is needed.

Guided Analytics is about flexibly adding interaction points in the data pipeline — that is, in between the sequence of steps the data go through during the analysis. When you develop a data processing or data analytics application, you don’t just develop it for yourself; you develop it for other people to use too. So, in order to give anyone the opportunity to tweak how the analysis proceeds, you should add some interaction points at strategic locations throughout the pipeline.
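Guided Analytics is a KNIME feature, but the general idea of an interaction point can be sketched in plain Python as a hook where the analyst is asked for input before an otherwise automated pipeline continues. Everything in this snippet is an illustrative analogy, not the KNIME implementation.

```python
# Analogy for an interaction point: pause the automated pipeline and
# let the analyst steer one step before continuing.
import pandas as pd

def choose_columns(df: pd.DataFrame) -> pd.DataFrame:
    # Interaction point: let the analyst drop columns they consider useless.
    print("Available columns:", list(df.columns))
    to_drop = input("Columns to drop (comma-separated, or blank): ").strip()
    return df.drop(columns=[c.strip() for c in to_drop.split(",")]) if to_drop else df

def run_pipeline(df: pd.DataFrame) -> pd.DataFrame:
    df = choose_columns(df)   # interaction point: data selection/cleaning
    df = df.dropna()          # automated step: basic cleaning
    return df                 # further automated analysis would follow here
```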

[Figure: example Guided Analytics workflow with interaction points]

The figure above shows an illustrative example of such a workflow:

Several of the gray metanodes represent the “interaction points” of the workflow: the data scientist who built it designed them so that, when the workflow is executed on KNIME Server, fellow business analysts can interact at these points in the analysis. In the example workflow, the first interaction point allows the business analyst to select the data set(s) to analyze (“Choose Data”). After the data have been loaded, a second interaction point (“Data Cleaning”) displays a data overview and allows the business analyst to interact: remove useless columns, deal with outliers, fix skewed distributions – whatever the data scientist deemed interesting and relevant at this point. The part in the middle then runs through an analysis and allows the business analyst to provide feedback until a satisfactory result is reached. The workflow concludes by allowing the analyst to either deploy the model directly (in this case, into a database) or inspect the result in an interactive dashboard.

The future of AutoML

AutoML continues to be developed, and there have been some large improvements driven by the main current AutoML providers, Google and Microsoft. Those developments have focused mainly on improving the speed of generating production-ready models, rather than exploring how the technology can be improved for more difficult problems (fraud and network intrusion detection, for example), where AutoML can only go so far before data scientist input is required.

As AutoML solutions continue to develop and expand, more complex manual processes will become possible to automate. Current AutoML systems work extremely well on image and speech processing because the SME knowledge needed for those tasks is already embedded within them. Future AutoML systems will allow business users to input their own knowledge to aid the machine in generating very accurate models automatically.

On top of this, complex data science pipelines will become increasingly streamlined, and adding a larger variety of algorithms to optimise will further expand the range of problems citizen data scientists will be able to tackle.


Although many data science tasks will become automated, this will enable data scientists to perform bespoke tasks for the business, further driving innovation and allowing businesses to focus on more important revenue generation and growth activities.

"The first rule of any technology used in a business is that automation applied to an efficient operation will magnify the efficiency. The second is that automation applied to an inefficient operation will magnify the inefficiency."
