The Simpsons, Borges and data science

We fans of The Simpsons know that everything, absolutely everything, can be traced back to some episode of that fantastic sitcom. We also know that the show's ability to mirror so many everyday events in the real world comes from its being written by people with high educational levels and a wide variety of backgrounds. Hence the constant references to movies, books, characters and music of all kinds. In fact, some of its writers were mathematics geeks, so if we pay attention we can find several hidden gags in various episodes that refer to that science. In short, they were nerds.


In one episode of the series, Montgomery Burns, the vile billionaire, gives Homer a brief tour of his vast mansion. In addition to a room with exotic birds and another with the largest television in the world, he shows him a room with a thousand monkeys typing on a thousand typewriters. Burns explains that they will soon finish the "greatest novel in history", then takes a random page written by one of the poor primates and reads the opening of "A Tale of Two Cities" by Charles Dickens, but with a misprint in one of its words. Faced with such a mistake, Burns angrily throws the paper back at the monkey.

This scene of just a few seconds illustrates the Infinite Monkey Theorem. Broadly speaking, it proposes that an infinite number of monkeys, typing for an infinite amount of time on infinite typewriters, would be capable of writing any work, such as Cervantes's Don Quixote. The theorem is related to the Borel-Cantelli lemma from the beginning of the 20th century, from which it follows that, under certain conditions, the probability of an event is either 0 (the event will not occur) or 1 (the event will happen with certainty, that is, with a probability of 100%). Now, what is the logic behind this proposition?

The text of The Ingenious Gentleman Don Quixote of La Mancha begins with the well-known phrase:

"Somewhere in la Mancha, in a place whose name I do not care to remember, a gentleman lived not long ago, one of those who has a lance and ancient shield on a shelf and keeps a skinny nag and a greyhound for racing..."

Suppose we want one of our monkeys to randomly type the first expression, "Somewhere in la Mancha". Also, to simplify the example, suppose we ignore capital letters and punctuation, so we are interested in the monkey typing "somewhere in la mancha".

Imagine that our typewriter has only the 26 letters of the alphabet and a space (27 symbols), so the probability that the monkey hits a given symbol at random (the letter s, for example) is 1/27: one symbol out of the 27 possible options. Assuming that keystrokes are statistically independent events (the key the monkey presses does not influence the next key it presses), we can calculate the probability that the monkey types the first two letters we are looking for ('s' and 'o') as the product of their individual probabilities. That is, 1/27 x 1/27, which is the same as (1/27)^2, the number squared.


Extending this same reasoning to all the characters we are looking for, 22 in our example including spaces, the probability that our dear primate types "somewhere in la mancha" is the probability of hitting one given symbol multiplied by itself 22 times, that is, (1/27)^22. By now, you may have realized that this is a very small number (in fact, it has 31 zeros after the decimal point before its first significant digit). Conversely, we can calculate the probability that the monkey does not type the sought-after sequence simply as the total probability (1) minus the probability of typing our sequence. This gives us 1 - (1/27)^22, a number very close to one.
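The arithmetic above is easy to check yourself; a minimal Python sketch of the calculation:

```python
# Probability that a monkey types "somewhere in la mancha" entirely at random
# on a 27-symbol typewriter (26 letters + space).
phrase = "somewhere in la mancha"
n_symbols = 27

# Each keystroke is independent, so the probabilities multiply.
p_success = (1 / n_symbols) ** len(phrase)   # (1/27)^22
p_failure = 1 - p_success                    # very close to 1

print(len(phrase))    # 22 characters, spaces included
print(p_success)      # roughly 3.2e-32
print(p_failure)      # indistinguishable from 1.0 at this precision
```

Running it confirms both claims: the success probability is on the order of 10^-32, and the failure probability rounds to 1.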

So far, it seems that the odds of succeeding with our beloved monkey are not in our favour. But what if we start adding more monkeys? We said that the probability of failure with our first monkey is 1 - (1/27)^22, a number that is very close to 1, but not quite. If we incorporate a second monkey, the calculation for our new probability of failure becomes (1 - (1/27)^22)^2. In general, if we incorporate n monkeys, the calculation is (1 - (1/27)^22)^n. Since the number inside the parentheses is less than 1, as n (our number of monkeys) increases, the overall result decreases. Along these lines, if n tends to infinity, the result of the equation tends to zero. Which is the same as saying: if the number of monkeys tends to infinity, the probability of failure tends to zero. The magic of infinity.
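We can watch this collapse happen numerically. A sketch of the failure probability (1 - (1/27)^22)^n for growing n; since the numbers involved underflow ordinary floating point, the trick is to work with logarithms (math.log1p computes log(1 - p) accurately for tiny p):

```python
import math

p = (1 / 27) ** 22               # success probability for a single monkey
log_fail_one = math.log1p(-p)    # log(1 - p), without losing precision

for n in (10**31, 10**32, 10**33, 10**34):
    # (1 - p)^n == exp(n * log(1 - p))
    p_fail = math.exp(n * log_fail_one)
    print(f"{n:.0e} monkeys -> failure probability {p_fail:.3g}")
```

With around 10^31 monkeys failure is still likely; by 10^33 it is essentially impossible. Infinity just finishes the job.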

Although the presentation so far may seem like an act of revenge by your high school math teacher, let us think about the implications of what we just inferred. Infinite monkeys, typing on infinite typewriters, could effectively write Don Quixote. In fact, they would be condemned to write Don Quixote unfailingly, as well as Homer's Odyssey, and any story, novel or event that has happened or is about to happen. Moreover, if we add enough characters to the typewriters, these could be written in all possible languages. They would, of course, also write innumerable text sequences without any sense at all.

It is worth mentioning that, in an initiative with little scientific rigor, in 2003 scientists in England left a keyboard in a cage with six monkeys for a month, and all they obtained was a succession of repeated letters. This, however, did not stop them from proudly publishing what was written as a limited-edition book entitled "Notes towards the complete work of Shakespeare". In turn, if anyone else wants to carry out the experiment without hurting the literary aspirations of any primate, they can try the virtual version. In some of these virtual simulations, after billions of monkey-years, a few characters were finally generated that matched fragments of literary classics.
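You can run a pocket version of those virtual simulations at home. The sketch below (illustrative, with a simplified prefix-matching rule rather than a full string-matching automaton) types at random and reports the longest opening fragment of the target phrase produced in one unbroken run:

```python
import random
import string

def virtual_monkey(target, keystrokes, seed=0):
    """Simulate random typing on a 27-symbol keyboard and return the longest
    run of consecutive keystrokes matching the start of `target`."""
    rng = random.Random(seed)          # fixed seed -> reproducible monkey
    keyboard = string.ascii_lowercase + " "
    best = matched = 0
    for _ in range(keystrokes):
        key = rng.choice(keyboard)
        if key == target[matched]:
            matched += 1
            best = max(best, matched)
            if matched == len(target):
                break                  # the monkey actually did it
        else:
            # restart, but the key just typed might itself begin a new match
            matched = 1 if key == target[0] else 0
    return best

print(virtual_monkey("somewhere in la mancha", 1_000_000))
```

Even with a million keystrokes, the monkey typically manages only the first three or four characters, which is exactly what the probabilities above predict.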

Jorge Luis Borges, a writer who explores the concepts of infinity and eternity in many of his stories, describes in "The Library of Babel" a library of seemingly endless identical galleries containing innumerable books of 410 pages each, with 40 lines per page and 80 symbols per line, drawn from 25 symbols including the space, comma and period. The protagonist of the story explains that the books of Babel consist of random compositions of these signs, exhausting all possible combinations (a number immensely large, but not infinite). Meanwhile, men wander the eternal corridors in search of some book whose content is more than an incoherent succession of symbols, some precious knowledge, trying to give meaning to a quest destined to fail, where the probability of success is practically zero. It is like walking into a room with a thousand monkeys typing at random and hoping they have written a page of Dickens's novel.
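Just how large "immensely large, but not infinite" is can be checked with logarithms, since the number itself is far too big to write out. A small illustrative computation from the figures Borges gives:

```python
import math

symbols = 25                      # letters, space, comma, period
chars_per_book = 410 * 40 * 80    # pages x lines x symbols per line
print(chars_per_book)             # 1,312,000 characters per book

# Number of distinct books = 25^1,312,000. Count its decimal digits.
digits = math.floor(chars_per_book * math.log10(symbols)) + 1
print(digits)                     # about 1,834,098 digits

# For comparison, the ~10^80 atoms in the observable universe
# need only 81 digits to write down.
```

So the library holds 25^1,312,000 books, a number with over 1.8 million digits: vast beyond any physical quantity, yet strictly finite.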


La salle des planètes / Erik Desmazières

Again, in some of these books you would find the truths of the Universe, the stories of every being that has existed and every being yet to be born, told from all possible perspectives, all the texts of humanity (including Don Quixote and the shameful love letter you wrote in elementary school), and also innumerable facsimiles of the previous ones with minimal variations, imperfections and fallacious statements. The volumes of the library that hold the knowledge of the Universe would be something like the intelligible realm of Forms that Plato proposed more than two thousand years ago, in which the truth hides.

Now, what does all this have to do with data science? To begin with, huge amounts of data (amounts that, to the human mind, border on the infinite) are the speciality of this science. By now, we must have heard ad nauseam that we constantly generate a lot of data with the electronic appendage we call a smartphone, and that we very generously share it with big corporations. These companies monetize our attention through algorithms trained to keep us in their claws, intent on selling us that ergonomic desk chair that the day before you thought you needed, and which you now see everywhere in ads until you think Mark Zuckerberg reads your thoughts (or is that just me?).

The algorithms, unlike a monkey pressing keys at random without a specific purpose, efficiently fulfil the objectives for which they were programmed. Actually, the great thing about algorithms is that they will do exactly what you program them to do; but the bad thing about algorithms is that they will do exactly what you program them to do. In other words, the advantage of algorithms, as well as the limitation, is that their potential is limited to the instructions or rules entered by those who program them.

This is where machine learning comes in which, as the name suggests, tries to break the aforementioned limitation through learning; that is, algorithms that improve their performance through "experience" (the analysis of large databases). This is how, for example, you can train an algorithm to distinguish photos of cats from photos of dogs simply by 'feeding' it many photos of both animals and explaining (or tagging) when it is a dog and when it is a cat. In this manner, with enough information, and imitating what could be the learning process of a child, the algorithm, when shown a new image of a cat it has never seen, will be able to recognize it easily based on the large amount of data already analyzed and learned. It was not necessary to hand-code each differential characteristic of each animal; the program has 'learned' them autonomously.
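The idea can be sketched with a toy nearest-neighbour classifier. To keep it self-contained, the two numeric "features" per animal (say, ear pointiness and snout length on an arbitrary 0-to-1 scale) are invented for illustration; they stand in for whatever a real model would extract from the pixels of a photo:

```python
# Toy training data: (ear_pointiness, snout_length) -> label.
# The feature values below are made up purely for illustration.
training = [
    ((0.9, 0.2), "cat"),
    ((0.8, 0.3), "cat"),
    ((0.7, 0.1), "cat"),
    ((0.3, 0.8), "dog"),
    ((0.4, 0.9), "dog"),
    ((0.2, 0.7), "dog"),
]

def classify(features):
    """1-nearest-neighbour: return the label of the closest training example."""
    def distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    _, label = min(training, key=lambda item: distance(item[0], features))
    return label

# A "new photo" the model has never seen, reduced to the same two features:
print(classify((0.85, 0.15)))   # -> cat
print(classify((0.25, 0.75)))   # -> dog
```

Notice that nowhere did we write a rule like "cats have pointier ears"; the program infers the boundary from the tagged examples, which is the essence of the learning described above.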


But machine learning is not only used to distinguish photos of animals; it can also be used in, for instance, medical diagnosis, where an algorithm can analyze millions of data points per second for the early detection of different types of diseases. On the other hand, it can be used for more perverse objectives, such as building psychometric profiles of citizens to generate hypersegmented political campaigns that seek to shape the opinion of voters. Sounds apocalyptic? This is exactly what happened in the famous Cambridge Analytica case a few years ago.

We have already mentioned that we, like monkeys at typewriters, are supplying information through our interactions on interconnected devices, information which individually may not carry significant value. For instance, no one is particularly interested in how many photos of kittens you liked yesterday or how many milliseconds you paused to watch the cooking video of that dish you are never going to make. Nevertheless, at an aggregate level these interactions are the input of algorithms that, through sophisticated statistical tools, are programmed to learn about human behaviour and have surprising powers of prediction. Like when, in 2012, the manager of a chain of stores in the United States was confronted by an angry father because his teenage daughter was getting discount coupons for cribs and baby clothes from the company, which, the father said, was encouraging her to get pregnant. It turns out that the advertising algorithms had detected subtle changes in the adolescent's consumption patterns and had quickly predicted that she was indeed pregnant, before even the father with whom she shared a home knew. Almost 10 years later, these kinds of stories may no longer seem so surprising to us.

At the same time, the complexity of artificial intelligence (AI) has reached a point where, as a program improves automatically, not even its creators understand what is going on inside. This is the case of the two Facebook chatbots that, while chatting with each other, developed their own language that not even the programmers who created them could understand.

To grasp the power of data: by 2012, the psychologist Michal Kosinski and his team had already developed a model that, knowing an average of 68 Facebook likes of a user, could predict their skin colour with 95% accuracy and their sexual orientation with 88%, along with much more, such as political and religious affiliation, intelligence, drug use and even whether their parents are divorced. It is said that upon publication of these results, Kosinski received two calls, one threatening legal action and the other offering a job, both from Facebook.

Based on the immense amount of data on which these algorithms feed, they can not only analyze it, generate correlations between variables and even predict behaviours, but also combine the information in novel ways and generate new results that never existed before. For example, there is a website called "This Person Does Not Exist" that uses generative adversarial networks (artificial neural networks that "compete" with each other to improve) to generate the face of a person who never existed. In fact, there are already numerous websites that, with the same concept, generate images of cats, CVs, startup websites, works of art, feet (?), even memes, which never existed.


Meme generated with artificial intelligence; you would understand it if you were an AI

Along these lines, earlier this year the company OpenAI published DALL-E (a pun combining the painter Salvador Dalí and Pixar's WALL-E), a trained AI that can create images from text prompts. Perhaps you are wondering what is new here, since Google Images seems to do the same. But the keyword is "create": the system can generate previously non-existent images based on the indications we give it in text, opening up a world of possibilities. Say we want to see a snail. We can tell the system the style of the image (do we want it as a cartoon, in 3D, as an X-ray, or realistic?), the context (on a mountain, in a field or a forest?), the angle (from the side, from above, a macro photograph?), its composition (made of glaciers, motorcycles, avocado?), and any other combination that comes to mind.

Well, for the moment OpenAI has not released the code and only allows a limited set of combinations, but it is a matter of time before the technology advances and various professions find themselves needing to transform, such as graphic design, architecture, or the world of fashion. We can easily imagine a situation where we are designing the logo of our new business and want it to combine the concepts of "technology", "professionalism" and "multiculturalism", and with a couple of clicks we receive thousands of design proposals generated by our AI. Let us also imagine we want options for the facade of the business, the layout of the interior and its furniture, the colour combinations and patterns for the staff uniforms, and so on.


Some results of our snail-bike, generated by DALL-E

Now, let us go back to the beginning. We already have enormous amounts of data (the books in our library), algorithms that seek to make sense of it (the reviewers wandering the endless corridors), and the ability to generate infinite permutations of the data, producing new information that has never existed before. So, are we one step away from creating our own Library of Babel? If we can imagine all the possible combinations of words and images, is there some precious, hidden information to be found there? For example, with the website that creates random faces, has a user ever inadvertently generated the face of Jesus, Buddha, or the future intergalactic president of the year 2132?

The computational power needed to calculate something as immense as the information in Borges's fictional library is unattainable: the number of books far exceeds the number of atoms in the universe (25^1,312,000 books against some 10^80 atoms). Still, who knows whether we will keep developing ingenious ways to reduce the number of permutations to only the relevant ones. But then, who defines what is "relevant"? It is also worth clarifying that the richness of the newly generated combinations will always be tied to the information with which such algorithms are fed. For the moment, we cannot be sure that when we open the startup-website generator we will find the future multimillion-dollar idea that will emerge a few years from now. Similar to when Homer Simpson dreamed that he created a product that made him rich, but could never quite see it, remember?
