PART II: ON NATURAL LANGUAGE PROCESSING (NLP)
1. NLP is Here
NLP is now able to perform many useful business functions. The table below lists practical, language-related applications that NLP can handle today.
2. A bit of history
Natural Language Processing and Understanding are subfields of Artificial Intelligence (AI), so they share the same history as computation in general and AI in particular. In fact, early AI pioneers assumed NLP would be one of the easiest AI applications to implement. After all, how hard could it be to check a dictionary with a computer?
My earlier NLU article discusses why that assessment turned out to be far too optimistic. As for NLP, we can sum up its history in the following eras:
- Prior to 1955 there was mostly theoretical work: the groundwork by Chomsky and others that led to the formalization of context-free grammars, Alan Turing’s work on the nature of universal computation, and John von Neumann’s invention of the architecture behind all modern computers.
- The second era runs from 1955 through 1970 and saw work on NLP within the confines of the AI program launched with the Dartmouth Conference (proposed in 1955 and held in 1956), where computer scientists set out to advance the conjecture that every aspect of learning, or any other feature of intelligence, can in principle be so precisely described that a machine can be made to simulate it. The proposal famously suggested that a 2-month, 10-man study could accomplish this feat!
- During another AI workshop in 1959, some early natural language systems were discussed. These were basic pattern-matching and heuristic-based query-response systems that later led to the famous 1960s Eliza program, which simulated a dialogue with a therapist simply by turning the user’s input into a question. User: “I feel sad.” Computer: “Why do you feel sad?” It is amusing to realize that, at the time, this was impressive! To be fair, several corpora were also developed during this period, such as the Brown Corpus of American English.
- The following period, from 1970 to 2007, saw progress in improved algorithms and heuristics for language parsing, but few practical uses beyond the immature interactive voice response systems of the late 1990s. Expert Systems also made a brief foray during the 80s. Also worth noting: the Cyc project, a long-running artificial intelligence effort that aims to assemble a comprehensive ontology and knowledge base spanning the basic concepts and rules about the world, got started in 1984.
- The current era is marked by advances in machine learning algorithms since 2007, the growing presence of software agents such as Siri since 2010, and the emergence of Big Data, driven by global networks and the increasing power of computer processing, as we shall see next.
3. What is the NLP area of work?
Natural Language interactions can be categorized according to Bloom’s Taxonomy[1]. This taxonomy was mainly developed to help define educational agendas, but it has relevance to our discussion. Bloom defines the following levels of questions:
1. Knowledge. Answers require direct recall of information. E.g. What is the biggest city in Japan?
2. Comprehension. Requires categorizations or comparisons prior to answering. E.g. If I put these blocks together, what shape do they form?
3. Application. Applying knowledge to a new situation. How would you use a hose to get water out of a flooded basement?
4. Analysis. There is a need to identify sources, causations and correlations to find an answer: Why did the Civil War take place?
5. Synthesis. These questions require development of original ideas or approaches: How would you create an AI machine?
6. Evaluation. These questions require judgment based on prior or assumed knowledge, or even formed opinion: What do you think of David Foster Wallace’s book Infinite Jest?
The following table, derived from UCONN’s work[2], exemplifies the key action verbs related to each level of question. NLP can be fairly applied to the first four columns:
Google and Bing can easily answer first-level questions (Knowledge). For example, “What are the colors of the French flag?” yields a valid response thanks to the engines’ powerful search algorithms.
However, results will vary when asking a Level 2 (Comprehension) question, which requires comparing pieces of information. Googling “Which countries have a higher GDP than Mexico?” returns only general links to GDP statistics. And Google and Bing fumble outright when asked a Level 3 (Application) question such as “Which car manufacturer has lower average prices, Mercedes or Ford?”
WolframAlpha is a more powerful system, but it also falters with this question. As it turns out, the problem is with parsing the question’s grammar, not with the system’s ability to figure out the answer. A simpler “Mercedes vs Ford price” query does indeed work as expected in WolframAlpha. Unlike Google and Bing, WolframAlpha works because it checks the car manufacturers, their car models, and each model’s price. Finally, WolframAlpha figures out that the prices must be averaged in order to compare them. Impressive. See below:
Moving up to Level 4 (Analysis), we enter the territory of recommendation and sentiment analysis systems. Even though human language understanding relies on some form of Universal Grammar (as per Chomsky), most successful NLP systems use a “fake it till you make it” approach that relies heavily on statistical and heuristic tricks to recognize the purpose of sentences.
While we are not close to having cognitive algorithms that are able to respond to broad questions requiring implicit knowledge and contextual understanding (Levels 5 and 6), this does not mean we cannot benefit from some current NLP capabilities. In fact, there are many specialized NLP applications capable of providing answers to higher level questions for specific narrow domains.
Take, for instance, Bloom’s Level 5 (Synthesis): the ability to create original content. While creating meaningful text on their own is still beyond the reach of computers, in music the scientist-musician David Cope has produced beautiful computer-generated compositions that closely resemble the styles of Bach and Beethoven, among others[3].
Also, even though Bloom’s Taxonomy places evaluation, the basis of recommendation, at its highest level (Level 6), specialized recommendation engines are applied extensively today by online vendors such as Netflix and Amazon. However, before we uncork the champagne, let’s keep in mind that these engines do not work from a premise of understanding; they rely on targeted heuristics rather than any general model of your preferences.
4. The World of Machine Learning
Machine Learning (ML) algorithms are all about Classification and then Prediction based on those classifications. The ability to classify, and then to predict, is at the core of our survival. Ancient humans who failed to properly classify a Saber-toothed tiger as a member of the class “Dangerous Animals” did not live long enough to procreate and, therefore, are not our ancestors!
At its core, machine learning classification occurs in two forms: The first is via Supervised Learning where you give the algorithm a set of training data that essentially serves as an example for the algorithm (“Fans with red shirts are more likely to be rooting for Manchester United; fans with white shirts most likely support Real Madrid”). The second is with Unsupervised Learning where we essentially allow the algorithm to run loose and create potential groups or clusters based on auto-generated hypotheses that can then be statistically tested for accuracy (“Every time Manchester United scores, a larger percentage of fans wearing red shirts celebrate; likewise, every time Real Madrid scores, many more fans with white shirts cheer”).
Supervised Learning is about training; unsupervised about discovery.
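To make the contrast concrete, here is a minimal sketch in Python, assuming scikit-learn is available; the shirt-color data, encodings, and labels are invented purely for illustration, not taken from any real system:

```python
# A toy sketch of supervised vs. unsupervised classification with scikit-learn.
# The shirt-color data (RGB triples) and team labels are invented for illustration.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.cluster import KMeans

shirts = np.array([
    [200, 30, 30], [210, 25, 40], [190, 35, 20],       # red-ish shirts
    [240, 240, 235], [250, 245, 250], [235, 230, 240]  # white-ish shirts
])
labels = ["Manchester United"] * 3 + ["Real Madrid"] * 3

# Supervised learning: train on labeled examples, then predict for a new fan.
clf = KNeighborsClassifier(n_neighbors=3).fit(shirts, labels)
print(clf.predict([[205, 40, 35]]))        # -> ['Manchester United']

# Unsupervised learning: let the algorithm discover two clusters on its own.
clusters = KMeans(n_clusters=2, n_init=10).fit_predict(shirts)
print(clusters)                            # e.g. [0 0 0 1 1 1] -- cluster ids, not team names
```

Note how the unsupervised run returns anonymous cluster ids rather than team names: discovery tells you that two groups exist, but not what they mean.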
Regardless of approach, all ML algorithms must deal with two problems: overfitting and the so-called curse of dimensionality. Overfitting occurs when the algorithm is inadvertently trained to give spurious responses because of input data that is ultimately irrelevant but does not appear to be so. In a way, you may think of overfitting as the ML equivalent of stereotyping. No, not all tall people make good basketball players, and not all English people have bad teeth. You are perhaps familiar with this pattern: “*nationality* are *ridiculous stereotype*”. Still, overfitting is something that demagogue politicians leverage to rattle their base against other groups.
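To make this concrete, here is a small, hedged illustration on synthetic data (the numbers are invented, and numpy’s polynomial fitting stands in for any learner): a degree-9 polynomial memorizes the noise in a handful of training points and then generalizes worse than a simple straight line.

```python
# Illustration of overfitting on synthetic data: the underlying truth is linear,
# but a degree-9 polynomial chases the noise in the training points and then
# generalizes worse than a simple straight line.
import numpy as np

rng = np.random.default_rng(0)
x_train = np.linspace(0, 1, 12)
y_train = 2 * x_train + rng.normal(0, 0.1, 12)   # linear truth + noise

simple_fit = np.polyfit(x_train, y_train, 1)     # degree-1 model
overfit = np.polyfit(x_train, y_train, 9)        # degree-9 model: memorizes the noise

x_test = np.linspace(0, 1, 100)
y_true = 2 * x_test                              # noise-free ground truth
mse_simple = np.mean((np.polyval(simple_fit, x_test) - y_true) ** 2)
mse_overfit = np.mean((np.polyval(overfit, x_test) - y_true) ** 2)
print(mse_simple, mse_overfit)                   # the overfit model's error is typically much larger
```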
The curse of dimensionality is that with each additional attribute (dimension) a learning algorithm must consider, the number of potential outcomes grows exponentially. A preferences learner that focuses on only three dimensions, say Age, Gender, and Education, can work well, but as additional factors are considered (residency, ethnicity, political affiliation, etc.) the algorithm can rapidly choke and become ineffective. Again, the trick is to realize which dimensions (attributes) are relevant to a problem.
These days there seems to be a plethora of Machine Learning algorithms. Today’s most popular machine learning approaches can be grouped (ahem, classified) into the following categories: Connectionist (Neural), Evolutionary (Genetic), Statistical (Bayesian), Analogizers, and Symbolist (Linguistic). A brief discussion of each follows.
5. Connectionist Learning
Connectionists rely on mimicking the way our brains appear to work. The idea is that simulating the exponentially large number of neural connections and getting these neurons to act as ON/OFF gates based on connection weights can achieve learning analogous to that of humans. Connectionist Learning is better known as Neural Networks and it is this approach that has generated the most excitement and advances in recent years. Recent techniques such as Multilayering, Backpropagation and Boltzmann Machines have advanced this field significantly.
Today’s neural networks consist of three types of layers: an input layer to capture the initial data, hidden layers that weight specific extracted features, and an output layer to produce the weighted results for a given pattern. Neural network systems may consist of many such layers and hence have acquired the term “Deep Learning”.
Boltzmann Machines apply learning via simulated dream-and-awake states, allowing neurons to learn by randomly exploring weights during the dream period while reacting more ‘logically’ when awake. Backpropagation works via a feedback loop that automatically generates weight adjustments and allows the neural system to converge to valid, but sub-optimal, solution states; this is acceptable because a neural network’s output is probabilistic in nature. Modern Deep Learning neural networks accept that perfect states are not attainable and that, sometimes, good-enough learning is a good-enough result. Having said this, Deep Learning has produced great results in speech recognition, translation systems, and various forms of text classification.
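As a hedged sketch of the layering and backpropagation described above (not a description of any specific production system), a tiny multi-layer perceptron trained on the classic XOR toy problem might look like this in scikit-learn:

```python
# Minimal sketch of a multi-layer ("deep") network: an input layer, two hidden
# layers, and an output layer, with weights adjusted by backpropagation.
# The XOR data below is a classic toy problem chosen only for illustration.
from sklearn.neural_network import MLPClassifier

X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0, 1, 1, 0]                                  # XOR: not linearly separable

net = MLPClassifier(hidden_layer_sizes=(8, 8),    # two hidden layers of 8 units each
                    activation="tanh",
                    solver="lbfgs",               # works well on tiny datasets
                    max_iter=1000,
                    random_state=1)
net.fit(X, y)
print(net.predict(X))                             # ideally [0 1 1 0]
```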
6. Evolutionary (Genetic) Learning
Evolutionary Genetic Algorithms simulate the way nature has yielded all the myriad, complex life forms found around the world. If you think of the maintenance of life as a specific engineering problem, nature’s evolutionary laws, first identified by Charles Darwin, provide a blueprint for how solutions to environmental challenges can be found. Evolution works by trying a variety of ‘genetic’ sequences against a specific problem. The sequences that provide the best fit as a solution are then mutated and tried again. Also, in a true-to-form version of computer sex, genetic crossbreeding is simulated in order to generate even more variations. It has taken nature over four billion years to create intelligent systems (i.e. humans, allegedly), but the speed and efficiency of computers is expected to achieve results rather faster than that!
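The mutate-crossbreed-select loop can be sketched in a few lines. The target-string problem below is deliberately trivial and every name and parameter is invented for illustration; real genetic algorithms encode far richer problem domains:

```python
# Toy genetic algorithm: evolve a random string toward a target phrase.
# Fitness counts matching characters; selection keeps the fittest candidates,
# crossover splices two parents, and mutation flips random characters.
import random

TARGET = "natural language"
ALPHABET = "abcdefghijklmnopqrstuvwxyz "

def fitness(candidate):
    return sum(c == t for c, t in zip(candidate, TARGET))

def crossover(a, b):
    cut = random.randrange(len(TARGET))
    return a[:cut] + b[cut:]

def mutate(s, rate=0.05):
    return "".join(random.choice(ALPHABET) if random.random() < rate else c for c in s)

population = ["".join(random.choice(ALPHABET) for _ in TARGET) for _ in range(200)]
for generation in range(500):
    population.sort(key=fitness, reverse=True)
    if population[0] == TARGET:
        break
    parents = population[:50]                     # survival of the fittest
    population = [mutate(crossover(random.choice(parents), random.choice(parents)))
                  for _ in range(200)]

print(generation, population[0])                  # typically converges well before 500 generations
```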
In any case, the difficulty in expressing a problem domain, and in encoding the pseudo-genes to be evolved, makes this technique more suitable for narrow domains. Genetic Algorithms have been applied with some success in Biochemistry research and in creating stock trading algorithms.
7. Statistical Learning
Statistical approaches, particularly those using the Naïve Bayes algorithm, have proven extremely successful as tools for supervised learning. Bayes’ theorem calculates the probability of an event based on prior knowledge of conditions related to that event. Once trained, the algorithm can estimate the likelihood that an item of unknown classification matches each of the classifications in the reference data. For example, if the training (reference) data indicates that 90% of positive Rotten Tomatoes movie reviews contain the word “Good”, then the statistical approach would be justified in assuming that a new review containing the word “Good” is probably a positive one.
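As a hedged sketch of the idea, here is a tiny Naive Bayes sentiment classifier using scikit-learn; the review snippets and labels are invented, not real Rotten Tomatoes data:

```python
# Tiny Naive Bayes sentiment classifier; the review snippets and labels
# below are invented for illustration, not real Rotten Tomatoes data.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

reviews = ["good acting and a good story", "simply good fun",
           "terrible plot and bad acting", "a boring, bad film"]
labels = ["positive", "positive", "negative", "negative"]

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(reviews, labels)                        # the training (reference) data
print(model.predict(["a good movie"]))            # -> ['positive']
print(model.predict_proba(["a good movie"]))      # estimated class probabilities
```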
A problem with statistical methods is that they are highly dependent on the accuracy and validity of the training data. In some cases, they can encounter spurious correlations.
A second problem is that redundant input might be counted more than once, generating a bias toward a particular result. An internet story that goes viral will gradually come to be regarded as likely true, given the large number of shares it receives. This is fine if the story is true, but if the story is a fake meme then all you have is fodder for conspiracy theorists.
8. Analogy Learning
A limitation of the previous methods is their reliance on explicit models learned from training data. Analogizers such as Nearest-Neighbor and Support Vector Machines (SVMs) take a different route: also statistical in nature, they derive learning through analogy, judging new cases by their similarity to known examples.
The central burden for any analogical learner is how to measure similarity. The nearest-neighbor algorithm, which classifies a new case by the closest known examples, has been applied to scores of diverse problems. Do you want to locate the most likely focus of an infection? Find the point nearest to most of the known incidences. This is how the physician John Snow (not to be confused with the Game of Thrones character![4]) discovered that the source of the 1854 London cholera epidemic was one of the city’s water pumps.
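In that spirit, a minimal sketch of the pump-hunting idea (with invented coordinates, not Snow’s actual map data) might look like this:

```python
# Invented coordinates for cholera cases and two candidate water pumps;
# the pump closest, on aggregate, to the reported cases is the likely source.
import numpy as np

cases = np.array([[1.0, 1.2], [1.1, 0.9], [0.8, 1.0], [1.0, 1.1], [3.5, 3.6]])
pumps = {"Broad Street": np.array([1.0, 1.0]),
         "Other pump":   np.array([4.0, 4.0])}

def total_distance(pump_xy):
    return np.linalg.norm(cases - pump_xy, axis=1).sum()

suspect = min(pumps, key=lambda name: total_distance(pumps[name]))
print(suspect)                                    # -> 'Broad Street'
```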
The main issue with nearest neighbor is the so-called ‘curse of dimensionality’. All is well as long as you apply nearest neighbor to compare only a single attribute, but once you start adding a combination of attributes (e.g. ‘height’, ‘weight’, ‘eye color’, etc.) the algorithm becomes exponentially harder to compute; it quickly becomes almost impossible to identify relevant attributes amongst the thousands of attributes that have no bearing on a result.
Support Vector Machines (SVMs) are a linear-algebraic construct used to determine the hyperplane that best separates two sets. In other words, SVMs work by segmenting sample data and finding correlations of the data in each segment. The algorithm tries this several times until it finds a maximum correlation with the relevant elements, known as Support Vectors. SVMs are the mathematical equivalent of sifting a riverbed to separate (classify) sand from gold. Training can depend on initial sampling assumptions (was there gold in the area you chose to search?) and may yield somewhat different results in different runs but, statistically speaking, the classification results converge toward an acceptable range. SVMs have been most successful in handwriting recognition, where they have beaten sophisticated neural network algorithms. It has also been demonstrated mathematically that SVMs are very resilient against over-fitting, thanks to their margin-optimization strategy.
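A minimal linear-SVM sketch, assuming scikit-learn and using a made-up two-class toy set rather than the handwriting benchmarks mentioned above:

```python
# Linear SVM finding the separating hyperplane; the points that define the
# margin are the support vectors. The two-class toy data is made up.
from sklearn.svm import SVC

X = [[1, 2], [2, 3], [2, 1], [6, 5], [7, 7], [8, 6]]
y = ["sand", "sand", "sand", "gold", "gold", "gold"]

svm = SVC(kernel="linear").fit(X, y)
print(svm.support_vectors_)                       # the margin-defining points
print(svm.predict([[3, 3], [7, 5]]))              # -> ['sand' 'gold']
```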
9. Symbolist Learning: the linguistic approach
Symbolist Learning systems rely on linguistic analysis (parsing language into its individual components), deductive logic reasoning (“Socrates is a philosopher. All philosophers are human. All humans are mortal. Therefore, Socrates is mortal.”), and the use of ontological frameworks to provide a contextual environment.
Language parsing has been addressed mostly using so-called Augmented Transition Networks. These represent the recursive analysis of language into its constituent noun phrases and core components (adjectives, nouns, verbs, adverbs, etc.). This technique suffers from the nested complexity and ambiguity of language. As an example, consider the English sentence: Fish fish fish. It can be parsed as Fish (subject) fish (verb, i.e. they fish for) fish (direct object). But the following sentence is also valid: Fish fish fish eat eat eat.[5]
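Although real parsers use far richer machinery than this, a toy context-free grammar in NLTK can illustrate the recursion involved; the grammar below is hand-rolled for illustration and is not a linguistically complete analysis:

```python
# Hand-rolled toy grammar for the 'fish' sentences; real parsers use far
# richer rules, but the recursion (an NP containing a reduced relative
# clause) is the same idea.
import nltk

grammar = nltk.CFG.fromstring("""
    S  -> NP VP
    NP -> 'fish' | NP RC
    RC -> NP V
    VP -> V NP | V
    V  -> 'fish' | 'eat'
""")
parser = nltk.ChartParser(grammar)

for sentence in ["fish fish fish", "fish fish fish eat eat eat"]:
    print(sentence)
    for tree in parser.parse(sentence.split()):
        print(tree)                               # more than one tree would signal ambiguity
```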
Predicate logic is better understood and can be handled mathematically to extract logical equivalences and correlations among different predicates. The Socrates inference given above is an example.
Ironically, even though the Symbolist approach is the most likely path to true language understanding, the commercial success of other forms of machine learning has taken the focus away from this research. In fact, the approach has been met with derision by proponents of other approaches. The head of a speech group at IBM used to quip that every time he fired a linguist, the speech recognizer’s performance went up.
Still, significant work continues in this area thanks to the open availability of corpora of sample texts, lexical databases like WordNet, publicly available ontologies, and tools like NLTK for Python. Furthermore, Object Oriented programming aligns well with symbolist ontological frameworks that benefit from class inheritance and polymorphism. For example, a class for the concept ‘Human’ can be created with a series of attributes related to the concept. If a sub-class ‘Male’ were created, then this sub-class would automatically inherit all the attributes already defined for ‘Human’.
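A minimal sketch of that ontology-as-classes idea (the class names come from the example above; the attributes are invented):

```python
# The class names come from the example above; the attributes are invented.
class Human:
    mortal = True
    can_use_language = True

    def __init__(self, name):
        self.name = name

class Male(Human):                    # inherits 'mortal' and 'can_use_language'
    pass

socrates = Male("Socrates")
print(socrates.mortal)                # -> True, inherited from Human
print(isinstance(socrates, Human))    # -> True, an ontological 'is-a' relation
```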
Much like the other techniques, Symbolist approaches have applicability for narrow domain purposes with very practical applications, such as sentiment analysis, rule-based systems, and text categorization and analysis.
10. Using Machine Learning for NLP
Computer scientist Pedro Domingos[6] suggests that since some machine learning techniques work better than others for specific problems, the path forward will have to be a Master Algorithm that combines many of these approaches[7]. This is precisely what is happening within current NLP research.
For example, symbolists can benefit from machine learning techniques to drive dynamic ontology creation.
Other new NLP approaches combine symbolist techniques with machine learning methods that define attributes, weights, and correlations for words. The resulting data can then be fed into machine learning algorithms such as neural networks to derive language associations.
A general approach known as Word Embeddings covers a broad set of word- and grammar-modeling techniques for feature extraction. Embedding words into a vector space in such a way that semantic and syntactic regularities between words are implicitly captured allows their roles in sentences to be extracted as features. For example, by assigning a vector to the word “car” and another to the word “automobile”, an ML algorithm can determine, based on the similarity of their vectors, that the two words can be used synonymously in some circumstances.
Google has been researching a specific word embedding technique known as Word Vectors (also known as Word2Vec) that encodes words’ syntactic and semantic relationships across collections of words. The approach results in a multidimensional matrix of words whose intersecting rows and columns can be “learned” by the ML algorithm to derive contextual similarities.
Another advantage of representing words as vectors is that they can be manipulated via mathematical operators. This can give results such as King – man + woman = Queen. The approach can also infer that the words “cat” and “dog” both refer to animals in the context of “household pets”, and so on.
The technique works because words that share similar contexts tend to have similar meanings. The context of a word is defined by its surrounding words, and if we measure the probability of two words occurring in overlapping contexts, that measure yields their similarity.
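A toy illustration of vector similarity and the king – man + woman arithmetic; the four-dimensional vectors below are invented for the example, whereas real Word2Vec embeddings have hundreds of dimensions learned from text:

```python
# Toy word vectors invented for illustration; real embeddings have hundreds
# of dimensions learned from large text corpora.
import numpy as np

vectors = {
    "king":  np.array([0.9, 0.8, 0.1, 0.7]),
    "queen": np.array([0.9, 0.1, 0.8, 0.7]),
    "man":   np.array([0.1, 0.9, 0.1, 0.2]),
    "woman": np.array([0.1, 0.1, 0.9, 0.2]),
    "car":   np.array([0.0, 0.2, 0.2, 0.0]),
}

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# king - man + woman should land closest to queen.
target = vectors["king"] - vectors["man"] + vectors["woman"]
best = max((w for w in vectors if w != "king"), key=lambda w: cosine(vectors[w], target))
print(best)                           # -> 'queen' with these toy vectors
```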
Facebook, on the other hand, introduced a technique known as fastText, another word embedding method. Unlike Google’s approach, fastText represents each word as a set of character n-grams. (A bigram is usually understood as a pair of words that commonly occur together, like fast track, but an n-gram can be any consecutive sequence of written units such as letters, syllables, or words.) This approach allows the identification of words that share the same roots and handles prefixes and suffixes more effectively. Algorithms using this technique can not only figure out that run and running describe the same action, but also that Gastroenteritis is related to the word Gastric and to all other Gastric-related bigrams (Gastric Ulcer, Gastric Bypass, etc.)
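A sketch of the character n-gram idea (an illustration of the principle, not the fastText implementation itself):

```python
# Represent a word by the bag of its character trigrams (with '<' and '>'
# marking word boundaries), so related word forms share sub-word pieces.
def char_ngrams(word, n=3):
    padded = f"<{word}>"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}

print(char_ngrams("run") & char_ngrams("running"))              # shared trigrams, e.g. {'<ru', 'run'}
print(char_ngrams("gastric") & char_ngrams("gastroenteritis"))  # shared 'gas', 'ast', 'str', ...
```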
There are variations on this theme, such as WordRank, which tries to improve on Word2Vec by focusing on a “robust ranking” of words. It is primarily an optimization of Word2Vec: it treats the ordering of words as meaningful, placing the context words most related to a given word at the top of the list, i.e. ranking the correlations.
It is believed that both Google and Facebook use these techniques in the automated systems tasked with policing their user-content rules.
11. What about Virtual Agents?
Virtual agents like Alexa, Siri, Google Talk, Cortana and others are getting smarter by the day. Still, let’s keep in mind that at the core of a Virtual Agent’s power are the backend data and search algorithms and not just their voice recognition capabilities and mellifluous voices.
Virtual Agents benefit from the power of Big Data and from a combination of NLP techniques running on high-end, cloud-based processing to do smart parsing and on-the-fly context disambiguation.
From a systems infrastructure perspective, the smartphone virtual agents leverage access to 4G high speed networks, cloud computing services, and large amounts of data storage. The effects of the latter can truly be a game changer as additional sources of data become available online. This is possible because, as processing costs continue to plummet, the focus has shifted to investment in mass storage. For example, Storage Area Networks have gone from 46% of hardware investment in 2000 to 75% in 2005[8]. Data is the engine that’s pushing forward all other technologies; including the explosion in the use of Virtual Agents.
As impressive as the results are, these Virtual Agents are not truly capable of “understanding”, even though their responses may appear ‘intelligent’ in certain instances. These systems work primarily by latching onto dominant words in the parsed query (typically nouns) and using contextual heuristics to fetch the search engine responses with the highest probability of being correct. This ability to extract key information from the question and then pull in implicit information sources is best exemplified by the query: “What is today’s weather?” Here the system parses the sentence to extract the search term “weather today” and, with the clever use of the GPS information for the user’s location, further customizes the response. The search engine in the backend takes care of the rest via highly proprietary search algorithms.
In terms of Bloom’s Taxonomy, most of these systems sit at Level 2 and are just beginning to be capable of handling Level 3 compound questions.
12. Making it all Sound Human
Years ago, a small village in Mexico was celebrating the activation of its first automated phone exchange. As the mayor gave a glowing discourse on how the town was finally “entering modernity,” and how people would now be able to automatically place calls simply by dialing the numbers, an elderly woman sitting next to a friend of mine complained, “Automatic? This ain’t automatic! Automatic was when I lifted the receiver and asked Maria, the switchboard lady, to connect me to my daughter!”
She had a point. From a user’s perspective, we want to be able to articulate a need in a simple way and then have that need satisfied by the appropriate service. Replicating Maria’s level of service via automation ultimately required the emergence of software that could truly recognize natural language and take intelligent action.
First-generation Interactive Voice Response (IVR) systems were used a little too frequently by companies that viewed customer queries as a nuisance rather than a service revenue opportunity. Things, however, have improved somewhat. Voice recognition systems today can recognize upwards of 98% of speech under ideal acoustic conditions and in narrow domains. While earlier voice recognition was based on simple acoustic pattern matching, which worked only with a limited set of options (“Say Yes or No”), today’s systems are capable of performing feature analysis of spoken-language phonemes (English has forty-six of those), giving better results over broader and broader domains.
Statistical analysis systems that use Hidden Markov Models, and other syntactic analysis techniques, have improved the field significantly. These techniques are machine learning systems with the ability to infer words based on grammatical precepts. For example, if the algorithm hears “This is a ???? house” and the voice recognition system is unsure whether ???? is the word “great” or “grit” (this is a Scot speaking), it will apply part-of-speech (POS) analysis to the sentence to determine that ???? is more likely an adjective than a noun, and will then appropriately select the word great.
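A hedged sketch of that part-of-speech step using NLTK’s off-the-shelf tagger (which is itself statistical); the resource names required by nltk.download may vary with the NLTK version:

```python
# NLTK's part-of-speech tagger marks 'great' as an adjective (JJ) in this slot,
# making it a more plausible transcription than the noun 'grit'.
# May require: nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')
# (resource names vary slightly across NLTK versions).
import nltk

for candidate in ("great", "grit"):
    sentence = f"This is a {candidate} house"
    print(nltk.pos_tag(nltk.word_tokenize(sentence)))
# e.g. [..., ('great', 'JJ'), ('house', 'NN')] vs. [..., ('grit', 'NN'), ('house', 'NN')]
```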
Computers can now simulate Maria, the switchboard lady, when it comes to simply responding to a phone connection request. Still, things would be more complicated if we had to instruct the computer to also have CyberMariaTM offer to do some town gossiping.
13. The Road Ahead
As often occurs with technology, it is not a single invention, but rather the convergence of various independent developments that leads to transformational technologies. Take the emergence of social media. With the explosive online participation of over a billion people across the globe it is estimated that 2.5 Exabytes of data are generated every day world-wide[9] (i.e. 500 billion U.S. photocopies, 610 billion e-mails, 7.5 quadrillion minutes of phone conversations, etc.). You would need to purchase two and a half billion 1GB thumb drives at Staples to store that. In fact, as figured by those who have taken the time to do these calculations, all the words ever spoken by mankind amount to ‘only’ 5 Exabytes[10].
In two days, we generate as much data as all words ever spoken by the human race. Indeed, thanks to the advent of high-speed computing, social media, and global connectivity, there are now an estimated one trillion web pages on the Web, of which only 50 billion have been indexed by search engines such as Google and Bing[11]. The area unindexed by the search engines is usually referred to as the Deep Web.
Something’s got to give with so much data, but the problem is that most of that data is plain text, not indexed or structured in any way. On the Natural Language Processing front, as we have discussed, there are plenty of resources and techniques that could help exploit that unexplored wealth of information.
This combination of technologies is also enabling the emergence of smarter systems. At their core, these Virtual Agents take the form of Query Answering Systems (QAS), capable of parsing human-language requests and applying some form of machine learning algorithm. They utilize available back-end information ‘on the fly’, applying heuristics for inferential reasoning and maintaining the level of contextual interaction needed to formulate responses. The results appear magical. To many, these systems are beginning to look as though we are close to fulfilling the original dream of computer scientists: to develop software that appears to intelligently take over tasks normally performed by human beings.
Even if the road to Natural Language Understanding turns out to be a Sisyphean task, there is no doubt that NLP will gradually “appear” to understand more and more of the world around us.
And that just might be as good as it gets for now.
NOTE
A great deal of the material in my NLU/NLP articles first appeared in my earlier 70-page white paper entitled “Cognitive Automation Primer” available for Kindle here: https://www.amazon.com/Cognitive-Automation-Primer-Machine-Learning-ebook/dp/B01L0PRSJO
FOOTNOTES
[1] The second domain of Bloom's Taxonomy, the Affective Domain, was detailed by Bloom, Krathwohl and Masia in 1964 (Taxonomy of Educational Objectives: Volume II, The Affective Domain). Bloom's theory advocates this structure and sequence for developing attitude, also now commonly expressed in personal development as 'beliefs'. https://cft.vanderbilt.edu/guides-sub-pages/blooms-taxonomy/
[2] https://assessment.uconn.edu/primer/taxonomies1.html
[3] https://artsites.ucsc.edu/faculty/cope/mp3page.htm or check his Bach-like chorale in Youtube: https://www.youtube.com/watch?feature=player_detailpage&v=PczDLl92vlc
[4] Though that John Snow could benefit from nearest neighbor to avoid the undead too!
[5] To parse it, think of this: (Fish [that] (fish [that] fish [habitually] eat) [habitually] eat) [habitually] eat [food]. This means that the sentence Fish fish fish fish eat eat eat eat is also valid, and so on. Another way to parse it is as follows: Fish that habitually eat fish that habitually eat fish that habitually eat fish that habitually eat . . . and so on.
[6] The Master Algorithm: How the Quest for the Ultimate Learning Machine Will Remake Our World by Pedro Domingos.
[7] Professor Domingos has been working on a master algorithm that combines the best features form the various discussed ML algorithms. You can download his learner “Alchemy” from alchemy.cs.washington.edu
[8] Enterprise Architecture and New Generation Information Systems—Dimitris N. Chorafas
[9] An Exabyte is equivalent to 1000 petabytes. A petabyte is equivalent to 1000 terabytes. A terabyte is equivalent to 1000 gigabytes, which is about what you can get with two external disk drives for less than $200.
[10] “How much information?”—Hal Varian and Peter Lyman. https://www2.sims.berkeley.edu/research/projects/how-much-info/print.html
[11] The Intelligent Web – Gautam Shroff