Neural Language Models (NLM) without pain


What is a Language Model?

  • Language Modeling is the task of predicting what word comes next.


  • We can also think of a Language Model as a system that assigns a probability to a piece of text (e.g., a sentence or a paragraph).

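Concretely, the standard way to write this probability (for a text of T words x(1), …, x(T)) is the chain-rule factorization:

P(x(1), x(2), …, x(T)) = P(x(1)) × P(x(2) | x(1)) × … × P(x(T) | x(T-1), …, x(1))

That is, the probability of the whole text is the product of the probabilities the model assigns to each next word given the words before it.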

Why do we need Language Models?

Language Modeling is a subcomponent of many NLP tasks, especially those involving generating text or estimating the probability of text:

  • Predictive typing in smartphones
  • Spelling correction: P(about fifteen minutes from) > P(about fifteen minuets from)
  • Speech recognition: P(I saw a van) > P(eyes awe of an); given a speech signal, produce the corresponding text.
  • Authorship identification: who wrote some sample text
  • Machine translation: P(high winds tonight) > P(large winds tonight); generating output text in one language conditioned on an input sentence in another language.
  • Dialogue bots

n-gram Language Models

Using a large amount of text (a corpus), we collect statistics about how frequently different words occur, and use these to predict the next word. For example, the probability that a word w comes after the three words “students opened their” can be estimated as follows:

  • P(w | students opened their) = count of (students opened their w) / count of (students opened their)

The above example is a 4-gram model, and we might get estimates such as:

  • P(books | students opened their) = 0.4
  • P(cars | students opened their) = 0.05
  • P(... | students opened their) = ...

Then we can conclude that the word “books” is more probable than “cars” in this context.
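As a minimal sketch of how such counts and probabilities can be computed (the function and variable names below are illustrative, not from any specific toolkit), in Python:

from collections import Counter

def build_ngram_counts(tokens, n=4):
    """Count all n-grams and their (n-1)-word prefixes in a token list."""
    ngram_counts = Counter()
    prefix_counts = Counter()
    for i in range(len(tokens) - n + 1):
        ngram = tuple(tokens[i:i + n])
        ngram_counts[ngram] += 1
        prefix_counts[ngram[:-1]] += 1
    return ngram_counts, prefix_counts

def ngram_prob(word, context, ngram_counts, prefix_counts):
    """P(word | context) = count(context + word) / count(context)."""
    context = tuple(context)
    if prefix_counts[context] == 0:
        return 0.0
    return ngram_counts[context + (word,)] / prefix_counts[context]

tokens = "the students opened their books while other students opened their laptops".split()
counts, prefixes = build_ngram_counts(tokens, n=4)
print(ngram_prob("books", ("students", "opened", "their"), counts, prefixes))  # 0.5 on this toy corpus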

Accordingly, given one or more starting words, arbitrary text can be generated from a language model by repeatedly sampling the next word from the model's output probability distribution, appending it, and continuing.
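As a sketch of that generation loop (the next-word distribution here is hard-coded purely for illustration; in practice it would come from n-gram counts or a neural model):

import random

# Toy next-word distribution for the context "students opened their".
next_word_probs = {"books": 0.4, "minds": 0.3, "laptops": 0.25, "cars": 0.05}

def sample_next_word(probs):
    """Sample one word according to its probability."""
    words = list(probs.keys())
    weights = list(probs.values())
    return random.choices(words, weights=weights, k=1)[0]

context = ["students", "opened", "their"]
generated = context + [sample_next_word(next_word_probs)]
print(" ".join(generated))  # e.g. "students opened their books"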

Language Modeling Toolkits and Data:

As an example of the scale involved, Google's publicly released web n-gram data comprises approximately 24 GB of compressed (gzip'ed) text files, with the following counts:

Number of tokens:    1,024,908,267,229
Number of sentences:    95,119,665,584
Number of unigrams:         13,588,391
Number of bigrams:         314,843,401
Number of trigrams:        977,069,902
Number of fourgrams:     1,313,818,354
Number of fivegrams:     1,176,470,663        

Examples of 4-gram data:

serve as the incoming 92
serve as the incubator 99
serve as the independent 794
serve as the index 223
serve as the indication 72
serve as the indicator 120
serve as the indicators 45
serve as the indispensable 111
serve as the indispensible 40
...        

Sparsity problem:

  • What if “students opened their w” never occurred in the data? Add a small δ to the count for every word w (smoothing; see the formula after this list).
  • What if “students opened their” never occurred in the data? We can condition on “opened their” instead (backoff).
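For the smoothing case, the standard add-δ estimate over a vocabulary V is:

P(w | students opened their) = (count(students opened their w) + δ) / (count(students opened their) + δ × |V|)

so every word w receives a small nonzero probability even if the 4-gram was never seen in the corpus.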

Large storage requirements: we need to store counts for all n-grams observed in the corpus.

For more information, kindly refer to the article: Probabilistic Language Models

Neural Language Model (NLM)

An NLM usually (but not always) uses an RNN to learn sequences of words (sentences, paragraphs, etc.) and hence can predict the next word.

Advantages:

  • Can process variable-length input
  • Computations for step t use information from many steps back
  • Model size doesn’t increase for longer input; the same weights are applied at every timestep.


At each step, the model outputs a probability distribution over the vocabulary for the next word.
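As a minimal sketch of such an RNN language model (using PyTorch as an assumed framework; the layer sizes and names are illustrative, not the exact model from the course):

import torch
import torch.nn as nn

class RNNLanguageModel(nn.Module):
    def __init__(self, vocab_size, embed_dim=128, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)           # word ids -> vectors
        self.rnn = nn.RNN(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)               # hidden state -> scores over the vocabulary

    def forward(self, token_ids):
        # token_ids: (batch, seq_len) integer word indices
        embedded = self.embed(token_ids)
        hidden_states, _ = self.rnn(embedded)                      # same weights reused at every timestep
        return self.out(hidden_states)                             # (batch, seq_len, vocab_size) logits

model = RNNLanguageModel(vocab_size=10_000)
tokens = torch.randint(0, 10_000, (1, 5))                          # a dummy 5-word input
probs = torch.softmax(model(tokens), dim=-1)                       # next-word distribution at each step
print(probs.shape)                                                 # torch.Size([1, 5, 10000])

Training such a model minimizes the cross-entropy between these predicted distributions and the actual next words in the training text.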

Disadvantages:

  • Recurrent computation is slow (sequential, one step at a time)
  • In practice, for long sequences, difficult to access information from many steps back


Evaluating Language Models

Perplexity is the standard evaluation metric for Language Models. Perplexity is defined as the inverse probability of a test text, normalized by the number of words, according to the Language Model. A good language model should give a lower perplexity for a test text; specifically, a lower perplexity for a given text means that the text has a higher probability under that Language Model.
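Written out (this is the standard definition, using the same chain-rule probability as above), the perplexity of a text of T words is:

Perplexity = P(x(1), x(2), …, x(T)) ^ (-1/T)

i.e., the inverse probability of the text raised to the power 1/T (normalizing by length), which also equals the exponential of the average per-word cross-entropy loss.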


Moreover, if we have two language models, for example, one trained on sports text and the other on politics text, we can use perplexity to classify a piece of text as sports or politics according to which model gives the lower perplexity, as sketched below.
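As a sketch of that idea (using a deliberately tiny unigram model with add-1 smoothing just to make the comparison concrete; the corpora and text are toy examples):

import math
from collections import Counter

def unigram_perplexity(text_tokens, corpus_tokens):
    """Perplexity of a text under a simple unigram model with add-1 smoothing."""
    counts = Counter(corpus_tokens)
    total = len(corpus_tokens)
    vocab = len(counts) + 1                      # +1 slot for unseen words
    log_prob = 0.0
    for w in text_tokens:
        log_prob += math.log((counts[w] + 1) / (total + vocab))
    return math.exp(-log_prob / len(text_tokens))

sports_corpus = "the team scored a goal in the final match".split()
politics_corpus = "the parliament passed the new budget law today".split()
text = "the team won the match".split()

ppl_sports = unigram_perplexity(text, sports_corpus)
ppl_politics = unigram_perplexity(text, politics_corpus)
print("sports" if ppl_sports < ppl_politics else "politics")   # the sports model is less "surprised"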




More advanced and related topics, such as neural machine translation, attention, and transformers, are or will be discussed in separate articles.

Reference:

CS224n: Natural Language Processing with Deep Learning Stanford / Winter 2019

