Claude Shannon's Information Theory and Language Models

[Photograph of Claude Shannon]

If you work in machine learning, you would have to be living under a rock not to be aware of the revolution that language models have brought to Natural Language Processing. Language models assume that natural language can be regarded as a stochastic process, and then build predictive models of language tokens, sometimes by predicting the future given the past (as in the GPT class of models from OpenAI), and sometimes through more complex setups such as the Cloze task on which models like Google's BERT are trained. If you are trained as an information theorist, the task looks quite familiar; in fact, language models were introduced by Claude Shannon himself in 1948 in the study of natural language, and a large body of theory and algorithms for analyzing and building them has been developed over the decades since. I should note that information theory has largely been sitting on the sidelines of the present-day revolution in machine learning, probably either because information theory is too abstract to be immediately useful to the extremely fast-paced world of deep learning, or because a good researcher in deep learning requires experimental skills beyond the training of many information theorists.
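To make the "predicting the future given the past" setup concrete: the standard autoregressive formulation (this is textbook material, not specific to any particular model) treats a token sequence x1, ..., xT as a draw from a stochastic process and factors its probability with the chain rule,

```latex
p(x_1, \dots, x_T) = \prod_{t=1}^{T} p(x_t \mid x_1, \dots, x_{t-1})
```

with the model trained to approximate each conditional; the Cloze setup instead predicts deleted tokens from their surrounding context.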

Whichever explanation is right, I thought it would be interesting to revisit how extraordinary Shannon's insight was back in 1948 (see his Bell System Technical Journal paper). Think about it - there were no digital computers back then. Certainly no internet. The transistor was being invented at roughly the same time Shannon was developing information theory. Yet his words still read as relevant today. In his paper, he makes the case that English can be approximated by a sequence of increasingly complex random processes, and he gives six examples of ever more complex Natural Language Generation techniques:

[Image: Shannon's examples of increasingly complex approximations to English, from the 1948 paper]

and then states that

``It appears then that a sufficiently complex stochastic process will give a satisfactory representation of a discrete source.''

which, I think most natural language processing researchers would agree, is in essence the foundation of the revolution we are witnessing, in which ever more complex language models are producing breathtaking reproductions of human discourse.
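To see how little machinery Shannon's early approximations require, here is a minimal sketch of the first few levels - symbols chosen uniformly, symbols chosen with their empirical frequencies, and symbols chosen given the previous symbol. The toy corpus is my own stand-in; Shannon, of course, worked from printed letter and word frequency tables rather than code.

```python
import random
from collections import Counter, defaultdict

text = "the cat sat on the mat and the dog sat on the log"  # toy stand-in corpus
alphabet = sorted(set(text))

# Zero-order approximation: symbols independent and equiprobable.
zero_order = "".join(random.choice(alphabet) for _ in range(60))

# First-order approximation: symbols independent, drawn with their empirical frequencies.
freq = Counter(text)
first_order = "".join(random.choices(list(freq), weights=list(freq.values()), k=60))

# Second-order approximation: each symbol drawn given the previous one (a Markov chain).
bigram = defaultdict(Counter)
for a, b in zip(text, text[1:]):
    bigram[a][b] += 1

def second_order(start, n=60):
    out = [start]
    for _ in range(n):
        follow = bigram.get(out[-1])
        if not follow:
            break
        out.append(random.choices(list(follow), weights=list(follow.values()))[0])
    return "".join(out)

print(zero_order)
print(first_order)
print(second_order("t"))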

Shannon then points out that while he could implement the above using tables of word n-grams (which he had available up to 3-grams), a simpler, more implicit method exists based on sampling a book at random. What I find remarkable about this is that, if we fast-forward to the Deep Learning era, the notion of using n-grams, quite prevalent in natural language processing for a while, gave way to a modern setup in which n-gram statistics are largely implicit in neural-network-based language models. Granted, Shannon's book-sampling method bears no resemblance to Stochastic Gradient Descent, but I do wonder whether we are now finally getting closer to that bit of insight from Shannon with things like Google's REALM.
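For readers who have not seen it, the book-sampling construction is easy to mimic in code. The sketch below is a rough paraphrase of the idea (treating a word list as the "book"), not Shannon's exact procedure: record a word, jump to a random occurrence of it elsewhere in the text, record the word that follows, and repeat, so that bigram statistics are used without ever building a bigram table.

```python
import random

def book_sample(words, length=20):
    """Generate word sequences by reusing the text itself as the model,
    so bigram statistics are consulted implicitly rather than tabulated."""
    current = random.choice(words)
    out = [current]
    for _ in range(length):
        # 'Open the book at another page': jump to a random occurrence of the current word
        positions = [i for i, w in enumerate(words[:-1]) if w == current]
        if not positions:
            break
        current = words[random.choice(positions) + 1]
        out.append(current)
    return " ".join(out)

book = ("the quick brown fox jumps over the lazy dog "
        "and the quick dog runs past the lazy fox").split()
print(book_sample(book))
```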

To keep Shannon's streak of uncanny prescience going: he went on to attempt to establish the entropy of English, and came up with several methods, including

``A second method is to delete a certain fraction of the letters from a sample of English text and then let someone attempt to restore them. If they can be restored when 50% are deleted the redundancy must be greater than 50%.''

Those familiar with the Cloze task used in BERT pre-training will recognize some of these ideas. In Cloze, a fraction of the words are deleted and the guesser is asked to restore them. One piece of insight that does not yet seem to have been taken advantage of is Shannon's discussion of the relation between entropy and the existence of crossword puzzles - can anyone figure out whether crossword puzzles are a good task for training language models?
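For concreteness, the deletion step Shannon describes is essentially what a masked-language-modelling data pipeline does. Below is a minimal sketch of that masking step; the whitespace tokenisation is purely illustrative, and while the 15% rate matches the choice reported in the BERT paper, BERT's actual pipeline works on subword tokens and applies some additional replacement rules.

```python
import random

MASK = "[MASK]"

def make_cloze_example(sentence, mask_rate=0.15):
    """Delete (mask) a random fraction of the words; the guesser is scored on restoring them."""
    words = sentence.split()  # naive whitespace tokenisation, purely for illustration
    masked, targets = [], {}
    for i, word in enumerate(words):
        if random.random() < mask_rate:
            masked.append(MASK)
            targets[i] = word  # the deleted words the model must recover
        else:
            masked.append(word)
    return " ".join(masked), targets

inputs, targets = make_cloze_example(
    "a sufficiently complex stochastic process will give a satisfactory representation"
)
print(inputs)
print(targets)
```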

The essential point of this article is that information theory really ought to be brought into the fold of AI much more aggressively than it currently is. The extraordinary progress in the physical sciences in the first half of the 20th century (which then created the digital world in which we live today) was only possible because remarkable experimental and theoretical physicists were engaged together in figuring out the physical world. I think we are at a similar juncture, in a different venture - we are trying to figure out how to build intelligent machines - yet I feel that not enough work is happening at the intersection of information theory and deep learning.



Łukasz Dębowski

Associate Professor at Polish Academy of Sciences, Institute of Computer Science

3y

I have just published a book titled "Information Theory Meets Power Laws: Stochastic Processes and Language Models" at Wiley, which tries to combine mathematical research in stochastic processes with quantitative linguistics and statistical language modeling. I guess that some ideas in this book, like the necessity of dealing with factual knowledge in language models (stated formally in the mathematical disguise of strong nonergodicity), can be also inspiring for artificial intelligence. Here is the link: https://onlinelibrary.wiley.com/doi/book/10.1002/9781119625384

Gil Shamir

Research Team Lead at Google

3y

Luis - very nicely put, and I completely agree. In particular, the last paragraph, I believe, is certainly true. I spent many years of my career at Google advocating for dissemination of mature ideas from information theory into machine learning, and there is definitely room for more impact there. Two examples are:
- Minimum description length inspired regularization for online learning: https://proceedings.mlr.press/v44/shamir15.pdf
- Logistic regression regret: https://proceedings.mlr.press/v125/shamir20a.html
