A Neural Model to Learn Language Closer to How Humans Do
Jesus Rodriguez
CEO of IntoTheBlock, Co-Founder of LayerLens, Faktory, and NeuralFabric, Founder of The Sequence AI Newsletter, Guest Lecturer at Columbia, Guest Lecturer at Wharton Business School, Investor, Author.
Natural language understanding (NLU) is one of the disciplines that has been leading the deep learning revolution of the last few years. From basic chatbots to general-purpose digital assistants, conversational interfaces have become one of the most prevalent manifestations of artificial intelligence (AI) influencing our daily lives. Despite the remarkable progress, NLU applications seem to be mostly constrained to task-specific models in which the language representation is tailored to a single task. Recently, researchers from Microsoft published a paper and an implementation of a new technique that can learn language representations across different NLU tasks.
The specialization of NLU models has its roots in something called language embeddings. Conceptually, a language embedding is a process of mapping symbolic natural language text (for example, words, phrases and sentences) to semantic vector representations. Currently, most NLU models rely on domain-specific language embeddings that can't be applied to other NLU tasks. In order to create more general-purpose conversational applications, we need language embedding models that can be reused across different NLU tasks.
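To make the idea concrete, here is a minimal sketch of a language embedding as a lookup from tokens to dense vectors, using PyTorch's `nn.Embedding`. The vocabulary, dimensions, and sentence are toy values chosen purely for illustration and are not tied to any particular model.

```python
import torch
import torch.nn as nn

# Toy vocabulary: a real model would cover tens of thousands of tokens.
vocab = {"<pad>": 0, "the": 1, "kid": 2, "learned": 3, "to": 4, "ski": 5}

# A language embedding maps each symbolic token to a dense semantic vector.
embedding = nn.Embedding(num_embeddings=len(vocab), embedding_dim=8)

# Encode the sentence "the kid learned to ski" as a sequence of vectors.
token_ids = torch.tensor([[vocab[w] for w in "the kid learned to ski".split()]])
vectors = embedding(token_ids)
print(vectors.shape)  # torch.Size([1, 5, 8]): 1 sentence, 5 tokens, 8-dim vectors
```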
The domain specialization of language embeddings is in direct contrast to how humans master language concepts. Upon learning a specific concept, humans are able to reuse it and apply it to numerous conversations across different topics. For instance, it is far easier for a kid who has just learned to ski to have a conversation about skating than it is for someone who has never been exposed to either activity. If we were modeling those conversations using NLU methods, we would have to train different models on ski and skating terminology in order to have an effective dialog. Developing language embedding models that can be applied across different NLU tasks is one of the top challenges of the current generation of NLU techniques.
The Foundation: Multi-Task Learning and Language Model Pre-Training
Given its relevance, the idea of creating reusable language embeddings has been an active area of research in the NLU space for the last decade. Those efforts have produced two main techniques that represent the foundation of Microsoft's new language learning model: multi-task learning and language model pre-training.
As its name clearly indicates, multi-task learning (MTL) is inspired by human learning activities, where people often apply the knowledge learned from previous tasks to help learn a new task. In the context of language learning, MTL models create embeddings that can be reused across different NLU activities. The challenge with traditional MTL models is that they rely on supervised learning techniques that require large amounts of task-specific labeled data, which is rarely available and difficult to scale.
To mitigate some of the challenges of MTL, researchers looked at the emerging field of semi-supervised learning. Language model pre-training (LMPT) is a technique that can learn universal language representations by leveraging large amounts of unlabeled data. LMPT models are initially trained on unsupervised objectives and then fine-tuned for specific NLU tasks.
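As a rough illustration of what an unsupervised objective looks like, the sketch below sets up a masked-language-modeling batch: a fraction of tokens is hidden and a model would be trained to predict them, so no human labels are required. The mask rate, placeholder ids, and shapes here are illustrative assumptions, not values from any specific paper.

```python
import torch

# Toy batch of token ids (4 sentences, 16 tokens each); ids below 5 are reserved.
token_ids = torch.randint(5, 30000, (4, 16))

# Randomly hide roughly 15% of the positions.
mask = torch.rand(token_ids.shape) < 0.15

inputs = token_ids.masked_fill(mask, 1)       # id 1 plays the role of a [MASK] token
targets = token_ids.masked_fill(~mask, -100)  # -100 is ignored by nn.CrossEntropyLoss

# A language model would now be trained to predict `targets` from `inputs`,
# learning representations from raw text alone.
```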
Together, MTL and LMPT constitute the foundation of Microsoft's new language embedding model. To some extent, Microsoft approached the challenge of creating reusable language embeddings not by inventing a brand new method but by cleverly combining MTL and LMPT in a new neural network architecture that can learn textual representations applicable to different NLU tasks.
MT-DNN
Multi-Task Deep Neural Network (MT-DNN) is a new multi-task network model for learning universal language embeddings. MT-DNN combines strategies from multi-task learning and language model pre-training to achieve embedding reusability across different NLU tasks while maintaining state-of-the-art performance. Specifically, the embeddings learned by MT-DNN focus on four types of language tasks:
· Single-Sentence Classification: Given a sentence, the model labels it using one of the predefined class labels.
· Text Similarity: Given a pair of sentences, the model predicts a real-value score indicating the semantic similarity of the two sentences.
· Pairwise Text Classification: Given a pair of sentences, the model determines the relationship of the two sentences based on a set of pre-defined labels.
· Relevance Ranking: Given a query and a list of candidate answers, the model ranks all the candidates in the order of relevance to the query.
MT-DNN is based on an intriguing neural network architecture that combines general-purpose and task-specific layers. The lower layers are shared across all tasks while the top layers are task-specific. The input X, either a sentence or a pair of sentences, is first represented as a sequence of embedding vectors, one for each word, in l1. Then the transformer-based encoder captures the contextual information for each word and generates the shared contextual embedding vectors in l2. Finally, for each task, additional task-specific layers generate task-specific representations, followed by the operations necessary for classification, similarity scoring, or relevance ranking. MT-DNN initializes its shared layers using language model pre-training, then refines them via multi-task learning.
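The PyTorch sketch below mirrors that layering at toy scale: a shared lexicon encoder (l1) and a small Transformer encoder (l2) feed four task-specific heads, one per task type listed above. It is a simplified illustration under assumed dimensions and module choices, not Microsoft's open-sourced implementation.

```python
import torch
import torch.nn as nn

class MTDNNSketch(nn.Module):
    """Toy sketch of MT-DNN-style shared lower layers plus task-specific top layers."""

    def __init__(self, vocab_size=30000, hidden=128, num_classes=3):
        super().__init__()
        # Shared lexicon encoder (l1): token ids -> word embedding vectors.
        self.lexicon_encoder = nn.Embedding(vocab_size, hidden)
        # Shared Transformer encoder (l2): contextual embeddings for each token.
        layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=4, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=2)
        # Task-specific top layers, one head per task type.
        self.heads = nn.ModuleDict({
            "classification": nn.Linear(hidden, num_classes),  # single-sentence classification
            "similarity": nn.Linear(hidden, 1),                 # text similarity (real-valued score)
            "pairwise": nn.Linear(hidden, num_classes),         # pairwise text classification
            "ranking": nn.Linear(hidden, 1),                     # relevance score for ranking
        })

    def forward(self, token_ids, task):
        shared = self.transformer(self.lexicon_encoder(token_ids))
        pooled = shared[:, 0]  # use the first token as a simple sentence representation
        return self.heads[task](pooled)
```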
To train the MT-DNN model, Microsoft used a two-stage process. The first stage is based on language model pre-training, in which the parameters of the lexicon encoder and Transformer encoder are learned using two unsupervised prediction tasks: masked language modeling and next sentence prediction. That phase is followed by a multi-task fine-tuning stage in which minibatch-based stochastic gradient descent is used to learn all the parameters of the model and optimize for task-specific objectives.
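The sketch below illustrates only the second stage: multi-task fine-tuning with minibatch SGD, reusing the `MTDNNSketch` class from the previous sketch. Each step draws a minibatch from one task and updates both the shared layers and that task's head. The toy datasets, batch sizes, and loss choices are illustrative assumptions rather than the paper's exact training recipe.

```python
import random
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

def toy_loader(num_examples=32, seq_len=16, num_classes=3, regression=False):
    """Random stand-in for a task-specific labeled dataset."""
    tokens = torch.randint(0, 30000, (num_examples, seq_len))
    targets = torch.rand(num_examples) if regression else torch.randint(0, num_classes, (num_examples,))
    return DataLoader(TensorDataset(tokens, targets), batch_size=8, shuffle=True)

model = MTDNNSketch()  # shared layers would already be initialized by stage-1 pre-training
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

tasks = {
    "classification": (toy_loader(), nn.CrossEntropyLoss()),
    "similarity": (toy_loader(regression=True), nn.MSELoss()),
}

for epoch in range(3):
    # Pack the minibatches of all tasks together and shuffle them.
    batches = [(name, batch) for name, (loader, _) in tasks.items() for batch in loader]
    random.shuffle(batches)
    for name, (tokens, targets) in batches:
        loss_fn = tasks[name][1]
        output = model(tokens, task=name)
        if name == "similarity":
            loss = loss_fn(output.squeeze(-1), targets)  # real-valued score vs. target
        else:
            loss = loss_fn(output, targets)              # class logits vs. label
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()  # updates the shared encoder and the active task head
```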
MT-DNN in Action
Microsoft evaluated MT-DNN against different state-of-the-art multi-task language models using three popular benchmarks: GLUE, Stanford Natural Language Inference (SNLI), and SciTail. Among the candidate models, Microsoft included Google's BERT, considered by many the gold standard of language pre-training techniques. In all tests, MT-DNN consistently outperformed the alternative models, showing a tremendous level of efficiency when adapting to new tasks. The following matrix summarizes the results of the GLUE test.
One way to evaluate how universal the language embeddings are is to measure how fast the embeddings can be adapted to a new task, or how many task-specific labels are needed to get a reasonably good result on the new task. More universal embeddings require fewer task-specific labels. The following chart shows how MT-DNN produced language embeddings that were considerably more universal than those produced by BERT for the same task.
Together with the research paper, Microsoft open-sourced a PyTorch-based implementation of MT-DNN. Developers can test this implementation by downloading and running the Docker image that encapsulates the model.
The transition from task-specific to universal language embeddings will be one of the main areas of focus for the next generation of NLU applications. Although still in its nascent stages, techniques such as MT-DNN highlight some of the key ideas needed to create universal language embedding representations that are applicable to many NLU tasks.