The Elusive Goal of Automated Deep Modelling

Over the last decade, neural networks, or deep learning as the field is better known today, have permeated every sphere of modelling. Two aspects of deep learning have enabled this: the automation of feature extraction for machine learning, and the provision of models of substantial complexity that can be trained using data alone. A more recent trend is increasing automation in the tools used to develop learning models themselves. This has led many in the corporate world to believe that the problem of data modelling is solved, and that provided copious amounts of data can be fed to large-scale AI platforms, all possible kinds of models can be derived automatically. My goal in this article is to establish the contrarian point of view that while automation in the tooling for AI model development will continue to make rapid strides, the dream of fully automated deep modelling will elude us like a receding horizon until fundamentally new paradigms of AI emerge that disrupt the current status quo in the area.

Our Expectations from Deep Learning

Let us first take some time to understand the reasons behind our high expectations of deep learning. Although many of the architectures and algorithms were developed through the eighties and the nineties, the world looked up and noticed the wonder that was deep learning only in the last decade or so, as when Alex Krizhevsky, a relative newcomer to the area of computer vision, won the ImageNet competition in 2012 using deep learning models alone, with over 10% improvement in performance over competitors using conventional models. Prior to this, deep learning had also made inroads into handwriting recognition (Alex Graves at ICDAR 2009); soon afterwards, deep learning replaced the speech recognition processing chain end to end, displacing GMMs and HMMs from their thirty-year hegemony. The general message seemed to be that domain knowledge was no longer necessary to obtain high-performing models -- anybody could do it.

Subsequently, frameworks like TensorFlow and PyTorch have emerged with elaborate tooling support (e.g. TensorBoard) and libraries that streamline the painful parts of deep learning development. Adaptive optimizers like AdaGrad, Adam and RMSProp manage the parameter settings of training regimes, further simplifying the model training process. Lastly, the area of neural network model search has progressed substantially, allowing tools like Google's AutoML to provide automated model selection and training (in a limited manner, of course, as we shall see). With these advances, as well as the widespread availability of pre-trained deep learning models populating several "model zoos", many outside the developer community are beginning to nurture the belief that the problem of deep learning is a solved one, and that all one needs is a platform containing deep learning tools and seed models for transfer learning in order to generate models for any data set; in other words, just feed in data and models, and insights will be mechanically generated.
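To make concrete what optimizers like Adam actually automate: they fold learning-rate adaptation into the update rule itself, so the practitioner no longer hand-tunes a decay schedule. Below is a minimal pure-Python sketch of a single-parameter Adam step using the standard published defaults; it is an illustration of the algorithm, not tied to any particular framework's API.

```python
import math

def adam_step(theta, grad, m, v, t, lr=0.001,
              beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single scalar parameter.

    m and v are running first/second moment estimates of the gradient;
    t is the 1-based step count used for bias correction.
    """
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    m_hat = m / (1 - beta1 ** t)        # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)        # bias-corrected second moment
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v

# Minimise f(x) = x**2 starting from x = 1.0; the gradient is 2x.
x, m, v = 1.0, 0.0, 0.0
for t in range(1, 2001):
    x, m, v = adam_step(x, 2 * x, m, v, t, lr=0.01)
print(f"x after 2000 steps: {x:.4f}")  # the iterate approaches the minimum at 0
```

The point of the sketch is that the effective step size adapts per parameter from the gradient history, which is precisely the tuning burden these schemes remove.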

Data Sets and Task Specificity

Some have even claimed that generic deep learning models can "understand" the data. This is where the true capabilities of deep learning start being coloured by marketing hype and wishful thinking, borne partly out of desperation to address the manpower shortage in the area. Beyond all the hype, deep learning models are essentially what would technically be termed "model-free estimators" -- in other words, a sophisticated and powerful form of curve-fitting to data. Deep learning is therefore only as powerful as the data we use for training. One could still argue that by feeding in data sufficiently rich in information content, we can get deep learning to achieve any target behaviour. The Watson computing system developed in IBM's DeepQA project won the first-place prize of a million dollars on the quiz show "Jeopardy!", after all. So couldn't we dump all the verbal responses of an intelligent person and construct a deep model of the thought processes of a human being?

Not quite! The AI models developed for the Jeopardy! challenge, it turns out, were built specifically for that task, and outside its scope they can do precious little. This is the story of the overwhelming majority of deep learning models today, which excel at specific tasks like handwriting recognition, speech recognition, finding objects in images and videos, text translation, summarization and so on (the last two are extremely intriguing in that they execute without developing an understanding in terms of internal world models the way human beings do). This implies that the process of learning for deep networks requires a specific data set for each new task learned. The model learning process can sometimes reuse some of the training resources of a previous model trained on a similar task, but there is no continuity of learning across tasks in any real sense. Automatically assembling a relevant data set is therefore the foremost nontrivial problem in automated deep modelling. In other words, there is no automated way in which insights will magically emerge through model training merely because the data of an entity like an enterprise has been pooled.

Things are not all hunky-dory even for task-specific modelling with deep networks. Assuming that a data set is available for model training, deep learning models are typically constrained to operate on inputs similar to those encountered during training. When a deep network is faced with variations in the data, its response can be appropriate or off the mark in an unpredictable fashion. Nowhere is this more apparent than where deep learning is used for fraud detection. Fraud is committed by people who constantly adopt new ways to avoid detection. This implies that a learning-based fraud detection model will work only up to the point where the pattern tallies, directly or in some abstract sense, with a case from the past, and thereafter fail miserably. To date, there is no automatic method to detect when such a model should be retrained, or using what kind of data.
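There are heuristics for the detection half of this problem, of course: one can monitor summary statistics of incoming features and flag the model for review when they drift far from the training window. The sketch below is a hypothetical helper illustrating the idea with a simple z-score check on the feature mean; note that even when it fires, it says nothing about what kind of data to retrain on, which is the part that remains open.

```python
from statistics import mean, stdev

def drift_alarm(train_values, recent_values, z_threshold=3.0):
    """Flag possible distribution drift when the mean of recent inputs
    sits more than z_threshold standard errors from the training-time
    mean. A crude heuristic, not a retraining policy: it cannot say
    *what* changed, nor which data the model should be retrained on."""
    mu, sigma = mean(train_values), stdev(train_values)
    standard_error = sigma / (len(recent_values) ** 0.5)
    z = abs(mean(recent_values) - mu) / standard_error
    return z > z_threshold

# Training-time values of some transaction feature.
train = [10.0, 11.0, 9.5, 10.5, 10.2, 9.8, 10.1, 10.4]
print(drift_alarm(train, [10.1, 9.9, 10.3]))   # similar traffic: False
print(drift_alarm(train, [14.0, 15.2, 14.6]))  # shifted pattern: True
```

A determined fraudster, of course, can shift behaviour in ways that leave such marginal statistics untouched, which is exactly why this does not amount to an automatic retraining trigger.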

Automated Model Search

Having resigned ourselves to task-specific learning with deep neural networks, the next question is whether, even under these restricted circumstances, the process of developing deep learning models can be automated. We will approach this exercise by examining the three kinds of variation found in neural network models.

Variations in Node and Layer Structures

Deep learning networks generically contain layers of processing nodes. Most layers comprise nodes that compute the composition of a vector inner product and a nonlinearity, where the nonlinearity can range from rectified linear units to sigmoids to softmaxes and so on; while there are some guiding principles for their usage, the final choice does not follow any established pattern. Then there are kinds of nodes, like “convolutional”, “LSTM”, “GRU”, “max-pooling” and “dropout”, that execute grossly disparate functions compared to the standard node described above, and combinations of all of these are routinely included in deep learning models. This implies that “model search” for deep learning models has to operate in a space whose number of possible variations suffers from combinatorial explosion.
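The scale of that space is easy to appreciate with a toy count. Assuming, purely for illustration, a palette of seven layer types and stacks of up to ten layers, the number of distinct ordered layer sequences alone (ignoring layer widths, connectivity, and hyperparameters entirely) already runs into the hundreds of millions:

```python
# Toy illustration of combinatorial explosion in architecture search.
# The palette and the depth bound are arbitrary choices for the example.
layer_types = ["dense", "conv", "lstm", "gru",
               "max_pool", "dropout", "attention"]

def num_sequences(num_types, max_depth):
    """Count ordered layer sequences of length 1 through max_depth:
    sum over depths of num_types ** depth."""
    return sum(num_types ** depth for depth in range(1, max_depth + 1))

count = num_sequences(len(layer_types), 10)
print(f"{count:,}")  # 329,554,456 candidate sequences
```

And this is a drastic undercount: every real search must also choose widths, activation functions, and inter-layer wiring, each of which multiplies the space further.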

Architectural Variations

When we come to deep model variations at the architectural level, the number of possible models grows larger still. There is foremost the question of the number of layers in the model, the size and composition of nodes in a layer, and the kinds of connections that exist between layers. Some layers are composed via a collation of disparate sub-layers, and many have “skip connections”. BERT and YOLOv3, two deep learning models popular for natural language processing and computer vision tasks respectively, are entirely different beasts. They are also individually huge, comprising hundreds of millions of parameters; YOLOv3 has 106 layers, and the larger BERT variants can exceed the memory of a single GPU. YOLOv3’s final outputs emerge from three different layers, two of which are internal to the stack. YOLOv3 is in turn grossly different from Mask R-CNN, another popular model in computer vision. There are also myriad variants of autoencoders, variational models, generative models and deep reinforcement learning models in use across applications. In summary, the architectural variations in deep learning models have become too diverse to be accommodated in a tractable model search process.
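To make the “skip connection” notion concrete: a residual block adds a layer’s input back to its output, so information can bypass the learned transformation entirely. The framework-free sketch below uses plain Python lists and a toy elementwise transform standing in for the learned layers; it shows the identity-mapping form used in ResNet-style stacks, not any particular model’s exact block.

```python
def relu(v):
    return [max(0.0, x) for x in v]

def linear(v, weight, bias):
    """Toy elementwise 'layer' standing in for a learned transform."""
    return [weight * x + bias for x in v]

def residual_block(v, weight, bias):
    # Output = input + f(input): the skip path lets the input
    # flow around the transformation unchanged.
    transformed = relu(linear(v, weight, bias))
    return [x + t for x, t in zip(v, transformed)]

x = [1.0, -2.0, 3.0]
y = residual_block(x, weight=0.5, bias=0.0)
print(y)  # [1.5, -2.0, 4.5]
```

Whether a given layer gets such a skip path is one more binary choice per connection, which is part of why architectural search spaces balloon the way the text describes.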

Learning Mechanisms

The final aspect in which deep learning models differ from one another is the way the learning process operates to estimate internal parameters. Many of the architectures mentioned previously have distinctive learning mechanisms. Adversarial learning models have two networks trained with completely different criteria -- one to generate realistic samples from a target distribution, and the other to distinguish whether a sample belongs to that distribution or lies outside it. Models for image “style transfer” use their own distinctive cost functions. Even facial recognition uses a “triplet loss function” that is both distinctive and crucial to the success of such networks. For reinforcement learning with “Deep-Q” algorithms, the cost functions are aimed at maximizing the agent’s rewards. There are also abstract notions like “attention” that have varying implementations for domain-specific problems. All this implies that for two identical deep model architectures, there are a very large number of possible ways to generate a model from a given data set simply by varying the training mechanism.
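The triplet loss just mentioned is simple to state: given an anchor embedding, a positive example of the same identity, and a negative of a different identity, the loss is zero only once the anchor sits closer to the positive than to the negative by at least a margin. A pure-Python sketch, using squared Euclidean distance as in FaceNet-style training (the specific vectors below are made-up toy embeddings):

```python
def sq_dist(u, v):
    """Squared Euclidean distance between two embedding vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def triplet_loss(anchor, positive, negative, margin=0.2):
    """max(0, d(a, p) - d(a, n) + margin): zero once the positive is
    closer to the anchor than the negative is, by at least the margin."""
    return max(0.0, sq_dist(anchor, positive)
                    - sq_dist(anchor, negative) + margin)

a = [0.0, 0.0]
p = [0.1, 0.0]   # same identity, close to the anchor
n = [1.0, 0.0]   # different identity, far away
print(triplet_loss(a, p, n))           # 0.0: the triplet is already satisfied
print(triplet_loss(a, p, [0.2, 0.0]))  # positive loss: negative too close
```

Note how different this objective is from, say, a classification cross-entropy over the same architecture: identical networks, entirely different training mechanics, which is the point being made above.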

Automated Transfer Learning

Some would argue, though, that the process of deep model building is far simpler when one resorts to transfer learning. Transfer learning generates a model for a data set by drawing upon other data sets that are similar or related. Frequently the models use parts of models pre-trained on other data sets as their starting point and adopt similar architectures. However, the implied simplification in model development is again deceptive. Firstly, the choice of an appropriate pre-trained model is itself an unsolved problem. In computer vision, many attempt to develop deep learning models atop a stack pre-trained on the ImageNet data set; sometimes it works and sometimes it doesn’t, depending on whether the task at hand can work well with the features extracted by the pre-trained model. Furthermore, in a model like YOLOv3 with 106 layers, only the bottom half of the stack is based on ImageNet, whereas the top 53 layers are newly trained; this implies that the problem of architectural search cannot be wished away for transfer learning either, except in cases where retraining the topmost layer suffices for the new target. To complicate matters further, transfer learning on a large pre-trained stack sometimes takes unpredictable turns, as with BERT, where researchers typically have to make several training runs, many of which fail to converge for unknown reasons. All of this underlines the fact that transfer learning, while reducing the requirements of architectural search, is still hardly amenable to complete automation.
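The mechanics of this partial reuse are easy to sketch, even if the choice of what to reuse is not. Representing a model as an ordered list of layers (a toy structure for illustration, not any framework’s actual API), transfer learning amounts to copying the pre-trained layers, freezing the bottom of the stack, and training only the rest:

```python
# Toy model representation: each layer records whether its
# parameters will be updated during fine-tuning.
def build_model(num_layers):
    return [{"name": f"layer_{i}", "trainable": True}
            for i in range(num_layers)]

def freeze_bottom(model, k):
    """Freeze the first k layers (the pre-trained feature extractor);
    only the remaining top layers are trained on the new task."""
    for layer in model[:k]:
        layer["trainable"] = False
    return model

# e.g. a 106-layer YOLOv3-like stack reusing a 53-layer backbone
model = freeze_bottom(build_model(106), 53)
trainable = sum(layer["trainable"] for layer in model)
print(trainable)  # 53 layers left to train on the new data set
```

The hard, unautomated decisions all live outside this sketch: which pre-trained stack to start from, where to cut it, and what architecture to put on top.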

Conclusion

In conclusion, tooling for deep learning is constantly improving, as are the libraries of building blocks, simplifying aspects of model development and deployment alike; but the dream of automating the process of model development itself appears to be a receding horizon. Automating certain parts of model retraining is a real possibility, though, especially where an automated method for quality assurance of recent data exists.

However, for model building to be automated from data alone, some fundamental changes need to happen in the way deep learning models are structured and trained. Foremost, new classes of deep learning need to be devised in which a single model can simultaneously learn a variety of tasks, many of which may not be related. Such a model needs to have a standard set of building blocks and a common training algorithm. Since we have pointed out the difficulties arising from multiple cost criteria, such a holy grail of deep learning should also have a utility-based meta-model that formulates the appropriate cost function for a given training objective. At the end of it all, it appears that our hope of automating deep learning AI model-building is synonymous with adopting a disruptive computational structure that matches the human brain in more ways than it does today.

Dr. Puranjoy Bhattacharya

Senior data science professional specialized in AI/ML & Cognitive Computing (Distinguished Technologist at Infosys Ltd.)
