Resurgence of Artificial Intelligence During 1983-2010
Dr. Alok Aggarwal
CEO and Chief Data Scientist, Scry Analytics, California, USA
Office: +1 408 872 1078; Mobile: +1 914 980 4717; [email protected]
January 20, 2018
Prologue
Every decade seems to have its technological buzzwords: we had personal computers in the 1980s; the Internet and the World Wide Web in the 1990s; smartphones and social media in the 2000s; and Artificial Intelligence (AI) and Machine Learning in this decade. However, the field of AI is 67 years old, and this is the second in a series of five articles wherein:
1. The first article discusses the genesis of AI and the first hype cycle during 1950-1982
2. This article discusses a resurgence of AI and its achievements during 1983-2010
3. The third article discusses the domains in which AI systems are already rivaling humans
4. The fourth article discusses the current hype cycle in Artificial Intelligence
5. The fifth article discusses what 2018-2035 may portend for brains, minds and machines
In memory of Alan Turing, Marvin Minsky and John McCarthy
Figure: The Timeline, 1950-2035: Genesis of AI & The First Hype Cycle (1950-1982); Resurgence of AI (1983-2010); Current AI & The Hype Cycle (2011-2017); Future of Brains, Minds & Machines (2018-2035)
Resurgence of Artificial Intelligence
The 1950-82 era saw the new field of Artificial Intelligence (AI) being born, a lot of pioneering research being done, massive hype being created, and AI going into hibernation when this hype did not materialize and research funding dried up [56]. Between 1983 and 2010, research funding ebbed and flowed, and research in AI continued to gather steam, although "some computer scientists and software engineers would avoid the term artificial intelligence for fear of being viewed as wild-eyed dreamers" [43].
During the 1980s and 90s, researchers realized that many AI solutions could be improved by using techniques from mathematics and economics such as game theory, stochastic modeling, classical numerical methods, operations research and optimization. Better mathematical descriptions were developed for deep neural networks as well as for evolutionary and genetic algorithms, which matured during this period. All of this led to the creation of new sub-domains and commercial products in AI.
In this article, we first briefly discuss supervised learning, unsupervised learning and reinforcement learning, as well as shallow and deep neural networks, all of which became quite popular during this period. Next, we discuss the main reasons that helped AI research and development gain steam: hardware and network connectivity became cheaper and faster, parallel and distributed computing became practical, and lots of data ("Big Data") became available for training AI systems. Finally, we discuss a few AI applications that were commercialized during this era.
Machine Learning Techniques Improve Substantially
Supervised Machine Learning
These techniques need to be trained by humans using labeled data [58]. Suppose we are given several thousand pictures of faces of dogs and cats and we would like to partition them into two groups – one containing dogs and the other cats. Rather than doing this manually, a machine learning expert writes a computer program that includes the attributes that differentiate dog-faces from cat-faces (e.g., length of whiskers, droopy ears, angular faces, round eyes). After enough attributes have been included and the program checked for accuracy, the first picture is given to this "black box" program. If its output is not the same as that provided by a "human trainer" (who may be training in person or may have provided a pre-labeled picture), the program modifies some of its internal code to ensure that its answer becomes the same as that of the trainer (or the pre-labeled picture). After going through several thousand such pictures and modifying itself accordingly, this black box learns to differentiate the faces of dogs from those of cats. By 2010, researchers had developed many algorithms that could be used inside the black box, most of which are mentioned in the Appendix; today, some applications that commonly use these techniques include object recognition, speaker recognition and speech to text conversion.
Figure 1: Process Flow for Supervised Learning Techniques
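For concreteness, a minimal Python sketch of this training loop is given below; it is purely illustrative, and the attribute names, the made-up values and the use of the scikit-learn library are our own choices rather than anything from the systems of that era:

# Illustrative supervised learning: a human-labeled table of face attributes
# is used to train a "black box" classifier that then labels an unseen face.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Each row is one face described by hand-chosen (hypothetical) attributes:
# [whisker_length_cm, ear_droopiness_0_to_1, face_angularity_0_to_1]
X = np.array([[6.0, 0.1, 0.2],    # cat-like
              [5.5, 0.2, 0.3],    # cat-like
              [2.0, 0.8, 0.7],    # dog-like
              [1.5, 0.9, 0.8]])   # dog-like
y = np.array(["cat", "cat", "dog", "dog"])   # labels supplied by a human trainer

model = DecisionTreeClassifier()   # the "black box" that adjusts its internal rules
model.fit(X, y)                    # learning from the labeled examples
print(model.predict([[5.8, 0.15, 0.25]]))    # expected output: ['cat']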
Unsupervised learning algorithms
These techniques do not require any pre-labeled data; they try to determine hidden structure in "unlabeled" data [59]. One important use case of unsupervised learning is computing the hidden probability distribution with respect to the key attributes and explaining it, e.g., understanding the data by using its attributes and then clustering and partitioning it into "similar" groups. There are several techniques in unsupervised learning, most of which are mentioned in the Appendix. Since the data points given to these algorithms are unlabeled, their accuracy is usually hard to define. Applications that use unsupervised learning include recommender systems (e.g., if a person bought x, will the person also buy y), creating cohorts of groups for marketing purposes (e.g., clustering by gender, spending habits, education, zip code), and creating cohorts of patients for improving disease management. Since k-means is one of the most common techniques, it is briefly described below:
Suppose we are given a lot of data points, each having n attributes (which can be treated as n coordinates), and we want to partition them into k groups. Since each data point has n coordinates, we can imagine these data points as lying in an n-dimensional space. To begin with, the algorithm partitions these data points arbitrarily into k groups. Next, for each group the algorithm computes its centroid, which is an imaginary point with each of its coordinates being the average of the corresponding coordinates of all the points in that group, i.e., this imaginary point's first coordinate is the average of all first coordinates of the points in the group, its second coordinate is the average of all second coordinates, and so on. Then, for each data point, it finds the centroid that is closest to that point, which yields a new partition of the data points into k new groups. The algorithm again finds the centroids of these groups and repeats these steps until it either converges or has gone through a specified number of iterations. An example in a two-dimensional space with k=2 is shown in the picture below:
Another technique, hierarchical clustering, creates hierarchical groups, which at the top level contain "super groups," each containing sub-groups, which may contain sub-sub-groups, and so on. K-means clustering is often used for creating hierarchical groups as well.
Figure 2: Typical Output of a 2-means algorithm for partitioning red and blue points in a plane
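A minimal Python sketch of the k-means procedure just described is given below; the data is randomly generated and the implementation is deliberately bare-bones (in practice a library routine such as scikit-learn's KMeans would be used):

# Illustrative k-means: repeatedly assign points to the nearest centroid
# and recompute centroids as group averages until the centroids stop moving.
import numpy as np

def k_means(points, k, iterations=100, seed=0):
    rng = np.random.default_rng(seed)
    # Start from an arbitrary choice of k centroids (here: k random data points).
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iterations):
        # Assign each point to the centroid closest to it.
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Recompute each centroid as the average of the points assigned to it.
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):   # converged
            break
        centroids = new_centroids
    return labels, centroids

points = np.random.default_rng(1).normal(size=(200, 2))   # 200 points in a 2-D plane
labels, centroids = k_means(points, k=2)                   # two groups, as in Figure 2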
Reinforcement Learning
Reinforcement Learning (RL) algorithms learn from the consequences of their actions, rather than from being taught by humans or by using pre-labeled data [60]; this is analogous to Pavlov's conditioning, wherein Pavlov noticed that his dogs would begin to salivate whenever he entered the room, even when he was not bringing them food [61]. The rules that such algorithms should obey are given upfront, and they select their actions on the basis of their past experiences and by considering new choices; hence, they learn by trial and error in a simulated environment. At the end of each "learning session," the RL algorithm provides itself a "score" that characterizes its level of success or failure, and over time, the algorithm tries to perform those actions that maximize this score. Although IBM's Deep Blue, which won the chess match against Kasparov, did not use Reinforcement Learning, as an example, we describe a potential RL algorithm for playing chess:
As input, the RL algorithm is given the rules of playing chess, e.g., the 8x8 board, the initial location of the pieces, what each chess piece can do in one step, a score of zero if the player's king is checkmated, a score of one if the opponent's king is checkmated, and 0.5 if only the two kings are left on the board. In this setup, the RL algorithm creates two identical solutions, A and B, which start playing chess against each other. After each game is over, the RL algorithm assigns the appropriate scores to A and B and also keeps a complete history of the moves and countermoves made by A and B, which can be used to train A and B (individually) to play better. After several thousand such games in the first round, the RL algorithm uses this "self-generated" labeled data (the outcome of 0, 0.5, or 1 for each game together with all the moves played in that game) and, by using learning techniques, determines the patterns of moves that led A (and similarly B) to a poor score. Hence, for the next round, it refines the solutions for A and for B and optimizes the play of such "poor moves," thereby improving them for the second round, then the third round, and so on, until the improvements from one round to another become minuscule, at which point A and B end up being reasonably well-trained solutions.
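The chess example above is too large to sketch here, but the following toy Python program illustrates the same score-driven trial-and-error loop using tabular Q-learning, a standard RL technique; the five-position "world," the rewards and the parameters are all made up for illustration:

# Illustrative reinforcement learning: an agent on a line of 5 positions learns,
# purely from rewards, that stepping right (toward position 4) maximizes its score.
import random

n_states = 5          # positions 0..4; reaching position 4 earns a reward of 1
actions = [-1, +1]    # step left or step right
Q = {(s, a): 0.0 for s in range(n_states) for a in actions}   # learned "scores"
alpha, gamma, epsilon = 0.5, 0.9, 0.1

for episode in range(500):
    s = 2                                   # start in the middle
    while s not in (0, n_states - 1):
        # Explore occasionally; otherwise exploit past experience.
        a = random.choice(actions) if random.random() < epsilon \
            else max(actions, key=lambda act: Q[(s, act)])
        s_next = s + a
        reward = 1.0 if s_next == n_states - 1 else 0.0
        best_next = 0.0 if s_next in (0, n_states - 1) \
            else max(Q[(s_next, act)] for act in actions)
        # Nudge the stored score for this state/action toward the observed outcome.
        Q[(s, a)] += alpha * (reward + gamma * best_next - Q[(s, a)])
        s = s_next

print(max(actions, key=lambda act: Q[(2, act)]))   # learned best move from the middle: +1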
In 1951, Minsky and Edmonds built the first neural network machine, SNARC (Stochastic Neural Analogy Reinforcement Computer); it successfully modeled the behavior of a rat in a maze searching for food, and as it made its way through the maze, the strength of some synaptic connections would increase, thereby reinforcing the underlying behavior, which seemed to mimic the functioning of living neurons [5]. In general, Reinforcement Learning algorithms perform well in solving optimization problems, in game-theoretic situations (e.g., playing Backgammon [62] or Go [94]), and in problems where the business rules are well defined (e.g., autonomous car driving), since they can self-learn by playing against humans or against each other.
Figure 3: SNARC was the First Reinforcement Network Machine built by Minsky and Edmonds in 1951
Mixed learning
Mixed learning techniques use a combination of supervised, unsupervised and reinforcement learning techniques. Semi-supervised learning is particularly useful in cases where it is expensive or time consuming to label a large dataset, e.g., while differentiating dog-faces from cat-faces when the database contains some images that are labeled but most that are not. Broad uses of these techniques include classification, pattern recognition, anomaly detection, and clustering/grouping.
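As an illustration of the semi-supervised case, the sketch below uses scikit-learn's SelfTrainingClassifier on synthetic, mostly unlabeled data; the dataset, the base model and all parameters are our own choices for illustration:

# Illustrative semi-supervised learning: only a handful of points carry labels
# (label -1 means "unlabeled"); the classifier pseudo-labels the rest iteratively.
import numpy as np
from sklearn.semi_supervised import SelfTrainingClassifier
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(4, 1, (100, 2))])  # two clusters
y = np.array([0] * 100 + [1] * 100)       # true labels (used only for scoring below)
y_partial = y.copy()
y_partial[10:190] = -1                    # keep labels for only 10 points per class

model = SelfTrainingClassifier(LogisticRegression())
model.fit(X, y_partial)                   # learns from few labels plus many unlabeled points
print(model.score(X, y))                  # accuracy measured against the true labels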
Resurgence of Neural Networks – Both Shallow and Deep
As discussed in the previous article [56], a one-layer perceptron network consists of an input layer, connected to one hidden layer of perceptrons, which is in turn connected to an output layer of perceptrons [17]. A signal coming via a connection is recalibrated by the "weight" of that connection, and this weight is assigned to the connection during the "learning process." Like a human neuron, a perceptron "fires" if all the incoming signals together exceed a specified potential, but unlike in humans, in most such networks signals only move from one layer to the one in front of it. The term Artificial Neural Networks (ANNs) was coined by Igor Aizenberg and colleagues in 2000 for Boolean threshold neurons but is also used for perceptrons and other "neurons" of the same ilk [63]. Examples of one-hidden-layer and eight-hidden-layer networks are given below:
Figure 4: Potential Uses of Machine Learning Algorithms
Figure 5: One hidden layer network (Left) and Eight hidden layer network (Right) (Source: Google)
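To make the picture concrete, the following minimal sketch computes a forward pass through a one-hidden-layer network of the kind shown on the left of Figure 5; the layer sizes are arbitrary and the weights are random rather than learned, so it is illustrative only:

# Illustrative forward pass: signals flow from the input layer, through one hidden
# layer, to the output layer; each connection rescales its signal by a weight.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=3)                            # input layer: 3 incoming signals

W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)     # connections into 4 hidden "neurons"
W2, b2 = rng.normal(size=(2, 4)), np.zeros(2)     # connections into 2 output "neurons"

def activate(z):
    # A smooth stand-in for the perceptron's "fire if the total signal is large enough".
    return 1.0 / (1.0 + np.exp(-z))

hidden = activate(W1 @ x + b1)                    # signals move only forward, layer to layer
output = activate(W2 @ hidden + b2)
print(output)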
In 1979, Fukushima provided the first "convolutional neural network" (CNN) when he developed the Neocognitron, in which he used a hierarchical, multilayered design [65]. CNNs are widely used for image processing, speech to text conversion, document processing and bioactivity prediction in structure-based drug discovery [97].
In 1983, Hopfield popularized Recurrent Neural Networks (RNNs), which were originally introduced by Little in 1974 [51,52,55]. RNNs are analogous to Rosenblatt’s perceptron networks but are not feedforward because they allow connections to go towards both the input and output layers; this allows RNNs to exhibit temporal behavior. Unlike feedforward neural networks, RNNs use their internal memory to process arbitrary sequences of incoming data. RNNs have since been used for speech to text conversion, natural language processing and for early detection of heart failure onset [98].
In 1997, Hochreiter and Schmidhuber developed a specific kind of deep learning recurrent neural network called LSTM (long short-term memory) [66]. LSTMs mitigate some problems that occur while training RNNs, and they are well suited for predictions related to time series. Applications of such networks include those in robotics, time-series prediction, speech recognition, grammar learning, handwriting recognition, protein homology detection, and prediction in medical care pathways [99].
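The following minimal sketch shows a single step of an LSTM cell, illustrating the recurrence discussed above and the gates that let such networks retain information across long sequences; the weights are random, biases are omitted, and the sizes are arbitrary:

# Illustrative LSTM cell step: the cell state c acts as long-term memory that the
# forget, input and output gates selectively erase, update and reveal.
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hid = 3, 4
Wf, Wi, Wo, Wg = (rng.normal(size=(n_hid, n_hid + n_in)) for _ in range(4))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c):
    z = np.concatenate([h, x])        # previous hidden state plus current input
    f = sigmoid(Wf @ z)               # forget gate: what to erase from memory
    i = sigmoid(Wi @ z)               # input gate: what new information to store
    o = sigmoid(Wo @ z)               # output gate: what to reveal this step
    g = np.tanh(Wg @ z)               # candidate memory content
    c = f * c + i * g                 # updated cell (long-term) memory
    h = o * np.tanh(c)                # updated hidden (short-term) state
    return h, c

h, c = np.zeros(n_hid), np.zeros(n_hid)
for x in rng.normal(size=(5, n_in)):  # process a toy sequence of 5 inputs
    h, c = lstm_step(x, h, c)
print(h)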
In 2006, Hinton, Osindero and Teh invented Deep Belief Networks and showed that, in many situations, multi-layer feedforward neural networks could be pre-trained one layer at a time by treating each layer as an unsupervised machine and then fine-tuning it using supervised backpropagation [67]. Applications of such networks include those in image recognition, handwriting recognition, and identifying the onset of diseases such as liver cancer and schizophrenia [100, 109].
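As a rough illustration of layer-wise unsupervised pre-training followed by a supervised layer on top, the sketch below stacks two restricted Boltzmann machines (scikit-learn's BernoulliRBM) in a pipeline; unlike the full procedure of Hinton et al., the whole stack is not jointly fine-tuned with backpropagation here, and the data is random placeholder data:

# Illustrative layer-wise pre-training: each RBM learns features without labels,
# and only the logistic-regression layer at the top is trained with labels.
import numpy as np
from sklearn.neural_network import BernoulliRBM
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

X = np.random.default_rng(0).random((500, 64))      # toy "images" with values in [0, 1]
y = np.random.default_rng(1).integers(0, 10, 500)   # toy labels

model = Pipeline([
    ("rbm1", BernoulliRBM(n_components=128, learning_rate=0.05, n_iter=10, random_state=0)),
    ("rbm2", BernoulliRBM(n_components=64, learning_rate=0.05, n_iter=10, random_state=0)),
    ("clf", LogisticRegression(max_iter=1000)),      # supervised layer on top
])
model.fit(X, y)   # the RBMs learn unsupervised features; the classifier learns from labels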
Although multi-layer perceptrons were invented in 1965 and an algorithm for training an 8-layer network was provided in 1971 [18, 19, 20], the term "Deep Learning" was introduced by Rina Dechter in 1986 [64]. For our purposes, a deep learning network has more than one hidden layer.
The important deep learning networks that were developed between 1975 and 2006 and are frequently used today include those discussed above; a detailed description of each is beyond the scope of this article.
Parallel and Distributed Computing Improve AI Capabilities
Between 1983 and 2010, hardware became much cheaper and more than 500,000 times faster; however, for many problems, one computer was still not enough to execute many machine learning algorithms in a reasonable amount of time. At a theoretical level, computer science research during 1950-2000 had shown that such problems could be solved much faster by using many computers working simultaneously in a distributed manner.
However, the following fundamental problems related to distributed computing remained unresolved until 2003: (a) how to parallelize computation, (b) how to distribute data "equitably" among computers and do automatic load balancing, and (c) how to handle computer failures and interrupt computations that go into infinite loops. In 2003, Google published its Google File System paper and followed it up by publishing MapReduce in 2004, a framework and associated implementation for processing and generating big data sets with a parallel, distributed algorithm on a cluster [68]. Since MapReduce was proprietary to Google, in 2006 Cutting and Cafarella (from the University of Washington but working at Yahoo) created an open-source, free version of this framework called Hadoop [69]. Also, in 2012, Spark and its resilient distributed datasets were invented, which reduced the latency of many applications when compared to MapReduce and Hadoop implementations [70]. Today a Hadoop-Spark based infrastructure can handle 100,000 or more computers and several hundred million Gigabytes of storage.
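The following single-machine Python sketch illustrates the map, shuffle and reduce phases of the MapReduce idea on a word-count example; real frameworks such as Hadoop run these phases across many machines and handle load balancing and failures, which this toy version does not:

# Illustrative MapReduce word count on one machine.
from collections import defaultdict

def map_phase(document):
    # Emit (key, value) pairs independently for each input record.
    for word in document.split():
        yield (word.lower(), 1)

def reduce_phase(key, values):
    # Combine all values that share the same key.
    return key, sum(values)

documents = ["the cat sat", "the dog sat", "the cat ran"]

# Shuffle step: group intermediate pairs by key (a real framework does this across machines).
grouped = defaultdict(list)
for doc in documents:
    for key, value in map_phase(doc):
        grouped[key].append(value)

counts = dict(reduce_phase(k, v) for k, v in grouped.items())
print(counts)   # {'the': 3, 'cat': 2, 'sat': 2, 'dog': 1, 'ran': 1}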
Big Data begins to help AI systems
In 1998, John Mashey (at Silicon Graphics) seemingly first coined the term "Big Data," referring to the large volume, variety and velocity at which data is being generated and communicated [71]. Since most learning techniques require lots of data (especially labeled data), the data stored in organizations' repositories and on the World Wide Web became vital for AI. By the late 2000s, social media websites such as Facebook, Twitter, Pinterest, Yelp, and YouTube, as well as weblogs and a plethora of electronic devices, started generating Big Data, which set the stage for creating several "open databases" with labeled and unlabeled data (for researchers to experiment with) [72,73]. By 2010, humans had already created almost a trillion Gigabytes (i.e., one zettabyte) of data, most of which was either structured (e.g., spreadsheets, relational databases) or unstructured (e.g., text, images, audio and video files) [74].
Progress in Sub-fields of AI and Commercial Applications
Reinforcement Learning Algorithms play Backgammon
In 1992, IBM’s Gerald Tesauro built TD-Gammon, which was a reinforcement learning program to play backgammon; its level was slightly below that of the top human backgammon players at that time [62].
Figure 6: Important Sub-fields of Artificial Intelligence in 2010
Machines beat humans in Chess
Alan Turing was the first to design a computer chess program, in 1953, although he "ran the program by flipping through the pages of the algorithm and carrying out its instructions on a chessboard" [75]. In 1989, the chess-playing programs HiTech and Deep Thought, developed at Carnegie Mellon University, defeated a few chess masters [76]. In 1997, IBM's Deep Blue became the first computer chess-playing system to beat the reigning world champion, Garry Kasparov. Deep Blue's success was essentially due to considerably better engineering and its ability to process 200 million positions per second [77].
Robotics
In 1994, Adler and his colleagues at Stanford University invented the CyberKnife, a robot that performs stereotactic radiosurgery to treat tumors; it is almost as accurate as human doctors, and during the last 20 years it has treated over 100,000 patients [78]. In 1997, NASA built Sojourner, a small robot that could perform semi-autonomous operations on the surface of Mars [79].
Better Chat-bots
In 1995, Wallace created A.L.I.C.E., which was based on pattern matching but had no reasoning capabilities [80]. Thereafter, Jabberwacky (renamed Cleverbot in 2008) was created, which had web-searching and game-playing abilities [81] but was still limited in nature. Both chatbots used improved NLP algorithms for communicating with humans.
Improved Natural Language Processing (NLP)
Until the 1980s, most NLP systems were based on complex sets of hand-written rules. In the late 1980s, researchers started using machine learning algorithms for language processing, owing both to faster and cheaper hardware and to the reduced dominance of Chomsky-based theories of linguistics. Instead of hand-written rules, researchers created statistical models that made probabilistic decisions by assigning weights to appropriate input features, and they also started using supervised and semi-supervised learning techniques with partially labeled data [82,83].
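As an illustration of such a statistical, weight-based model, the sketch below trains a Naive Bayes text classifier on a tiny corpus; the sentences, labels and library choice are invented purely for illustration:

# Illustrative statistical NLP: word counts become weighted features,
# and the model makes a probabilistic decision about the label of new text.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

texts = ["great movie, loved it", "terrible plot and bad acting",
         "wonderful and moving film", "boring, awful waste of time"]
labels = ["positive", "negative", "positive", "negative"]

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(texts, labels)                             # learns word weights from labeled examples
print(model.predict(["a wonderful, great film"]))    # expected output: ['positive']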
Speech and Speaker Recognition
During the late 1990s, SRI researchers used deep neural networks for speaker recognition and achieved significant success [84]. In 2009, Hinton and Deng collaborated with several colleagues from the University of Toronto, Microsoft, Google and IBM and showed substantial progress in speech recognition using deep neural networks [85,86].
Recommender Systems
By 2010, several companies (e.g., TiVo, Netflix, Facebook, Pandora) had built recommendation engines using AI and started using them for marketing and sales purposes, thereby improving their revenue and profit margins [87].
Recognizing hand-written digits
In 1989, LeCun and colleagues provided the first practical demonstration of backpropagation: they combined convolutional neural networks (CNNs) with backpropagation in order to read "handwritten" digits. This system was eventually used, starting in 1998, to read the numbers on handwritten checks, and by the early 2000s, such networks processed an estimated 10% to 20% of all the checks written in the United States [88].
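A minimal modern sketch of a convolutional network for digit-like images is given below; it uses the Keras API with random placeholder data and is not a reconstruction of LeCun's original system, only an illustration of the convolution-pooling-classification pattern:

# Illustrative CNN for 28x28 "digit" images; the data here is random placeholder data.
import numpy as np
from tensorflow.keras import layers, models

X = np.random.rand(256, 28, 28, 1).astype("float32")   # stand-in for digit images
y = np.random.randint(0, 10, size=256)                  # stand-in for digit labels 0-9

model = models.Sequential([
    layers.Input(shape=(28, 28, 1)),
    layers.Conv2D(8, (3, 3), activation="relu"),    # learn local visual features
    layers.MaxPooling2D((2, 2)),                    # shrink the feature maps
    layers.Conv2D(16, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(10, activation="softmax"),         # one output per digit
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.fit(X, y, epochs=1, batch_size=32, verbose=0)     # trained with backpropagation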
Conclusion
The year 2000 had come and gone, but Alan Turing's prediction of humans creating an AI computer remained unfulfilled [3,4], and the Loebner Prize was initiated in 1990 with the aim of developing such a computer [89]. Nevertheless, substantial progress was made in AI, especially with respect to deep neural networks, which were invented in 1965, with the first algorithm for training them given in 1971 [18,19,20]; between 1983 and 2010, exemplary research done by Hinton, Schmidhuber, Bengio, LeCun, Hochreiter, and others ensured rapid progress in deep learning techniques [90,91,92,93], and some of these networks began to be used in commercial applications. Because of these techniques, and the availability of inexpensive hardware and data that made them practical, the pace of research and development picked up substantially during 2005-2010, which in turn led to a substantial growth in AI solutions that started rivaling humans during 2011-2017; we will discuss such solutions in the next article, "Domains in Which AI Systems are Rivaling Humans" [151].
References for all articles in this series can be found at www.scryanalytics.com/bibliography
About the Author
Dr. Alok Aggarwal is the founder and CEO of Scry Analytics (www.scryanalytics.com); prior to this, he was the co-founder and Chairman of Evalueserve (www.evalueserve.com). He received his PhD in Electrical Engineering and Computer Science from Johns Hopkins University in 1984 and worked at the IBM Watson Research Center from 1984 to 2000; during 1989-90 he taught at MIT and advised two PhD students, and during 1998-2000 he founded the IBM India Research Lab and grew it to 60 researchers.
Contact: Office: +1 408 872 1078; Mobile: +1 914 980 4717
Email: [email protected]
Appendix: Frequently Used Approaches and Techniques for Supervised & Unsupervised Learning
Supervised Machine Learning Techniques:
Minimum Message Length (decision graphs)
Multilinear subspace learning
Naive Bayes classifier
Maximum entropy classifier
Conditional random fields
Backpropagation
Boosting
Bayesian statistics
Gaussian process regression
Support vector machines
Minimum Complexity Machines
Random Forests
Ensembles of Classifiers
Ordinal classification
Nearest Neighbor Algorithm & Approximations
Neural Networks (shallow and deep)
Probably Approximately Correct (PAC) learning
Symbolic machine learning algorithms
Genetic Algorithms
Handling imbalanced datasets
Statistical relational learning
Group method of data handling
Kernel estimators
Learning Automata
Learning Classifier Systems
Analytical learning
Artificial neural network
Case-based reasoning
Decision tree learning
Inductive logic programming

Unsupervised Machine Learning Techniques:
Clustering
k-means
Hierarchical clustering
Anomaly detection techniques
Density-based techniques, e.g., k-nearest neighbor
Density-based techniques, e.g., local outlier factors
Sub-space based correlation and outlier detection
Correlation-based outlier detection
Special kind of support vector machines
Replicator neural networks
Cluster analysis-based outlier detection
Deviations from association rules
Fuzzy logic based outlier detection
Ensemble techniques using score normalization
Ensemble techniques using feature bagging
Neural Networks (shallow and deep)
Mixture models
Hebbian Learning
Generative Adversarial Networks
Learning latent variable models
Expectation–maximization algorithm
Blind signal separation techniques
Principal components analysis
Singular value decomposition
Independent component analysis
Dependent component analysis
Non-negative matrix factorization
Low-complexity coding and decoding
Stationary subspace analysis
Common spatial pattern recognition