What is Machine Learning? How It Is Changing the World, and How Apple's Siri Works
Bobby Singh
3x RedHat ★ 3x Microsoft ★ DevOps & Cloud Engineer ★ Solution Architecture ★ Aviatrix Certified Engineer ★ Multi hybrid Cloud ★ Jenkins ★ Kubernetes
Who should read this article?
Anyone curious who wants a straightforward, accurate overview of what machine learning is, how it works, and why it matters. We work through each of the questions raised above, drawing on technical definitions from machine learning pioneers and industry leaders to give you a simple introduction to the fascinating scientific field of machine learning.
A glossary of terms can be found at the bottom of the article, along with a small set of resources for further learning, references, and disclosures.
What is Artificial Intelligence?
Artificial intelligence (AI) is a wide-ranging branch of computer science concerned with building smart machines capable of performing tasks that typically require human intelligence. AI is an interdisciplinary science with multiple approaches, but advancements in machine learning and deep learning are creating a paradigm shift in virtually every sector of the tech industry.
How does artificial intelligence work?
Can machines think? — Alan Turing, 1950
Less than a decade after breaking the Nazi encryption machine Enigma and helping the Allied Forces win World War II, mathematician Alan Turing changed history a second time with a simple question: "Can machines think?"
Turing's paper "Computing Machinery and Intelligence" (1950), and its subsequent Turing Test, established the fundamental goal and vision of artificial intelligence.
At its core, AI is the branch of computer science that aims to answer Turing's question in the affirmative. It is an endeavor to replicate or simulate human intelligence in machines.
The expansive goal of artificial intelligence has given rise to many questions and debates. So much so, that no singular definition of the field is universally accepted.
The major limitation of defining AI as simply "building machines that are intelligent" is that it doesn't actually explain what artificial intelligence is or what makes a machine intelligent.
In their groundbreaking textbook Artificial Intelligence: A Modern Approach, authors Stuart Russell and Peter Norvig approach the question by unifying their work around the theme of intelligent agents in machines. With this in mind, AI is "the study of agents that receive percepts from the environment and perform actions." (Russell and Norvig viii)
Norvig and Russell go on to explore four different approaches that have historically defined the field of AI:
- Thinking humanly
- Thinking rationally
- Acting humanly
- Acting rationally
The first two ideas concern thought processes and reasoning, while the other two deal with behavior. Norvig and Russell focus particularly on rational agents that act to achieve the best outcome, noting that "all the skills needed for the Turing Test also allow an agent to act rationally." (Russell and Norvig 4)
Patrick Winston, the Ford professor of artificial intelligence and computer science at MIT, defines AI as "algorithms enabled by constraints, exposed by representations that support models targeted at loops that tie thinking, perception and action together."
"Artificial intelligence is a set of algorithms and intelligence to try to mimic human intelligence. Machine learning is one of them, and deep learning is one of those machine learning techniques."
While these definitions may seem abstract to the average person, they help focus the field as an area of computer science and provide a blueprint for infusing machines and programs with machine learning and other subsets of artificial intelligence.
While addressing a crowd at the Japan AI Experience in 2017, DataRobot CEO Jeremy Achin began his speech by offering the following definition of how AI is used today:
"AI is a computer system able to perform tasks that ordinarily require human intelligence... Many of these artificial intelligence systems are powered by machine learning, some of them are powered by deep learning and some of them are powered by very boring things like rules."
A few examples of narrow AI (AI built to perform one specific task well) include:
- Google search
- Image recognition software
- Siri, Alexa, and other personal assistants
- Self-driving cars
- IBM's Watson
What is machine learning?
The scientific field of machine learning (ML) is a branch of artificial intelligence, as defined by computer scientist and machine learning pioneer Tom M. Mitchell: “Machine learning is the study of computer algorithms that allow computer programs to automatically improve through experience.”
An algorithm can be thought of as a set of rules or instructions that a computer programmer specifies and that a computer can process. Simply put, machine learning algorithms learn from experience, similar to how humans do. For example, after having seen multiple examples of an object, a computer employing a machine learning algorithm can learn to recognize that object in new, previously unseen scenarios.
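To make that idea concrete, here is a minimal sketch of "learning from experience" using the scikit-learn library. The features, values, and labels are invented purely for illustration; the point is only that the program improves at a task by being shown examples rather than by being given explicit rules.

```python
# A minimal sketch of "learning from experience": a classifier is shown labelled
# examples (the experience) and then asked to recognise a new, unseen example.
# The feature values and labels below are made up purely for illustration.
from sklearn.tree import DecisionTreeClassifier

# Each example is described by two invented features: [roundness, has_stem]
examples = [
    [0.9, 1],  # apple
    [0.8, 1],  # apple
    [0.2, 0],  # banana
    [0.1, 0],  # banana
]
labels = ["apple", "apple", "banana", "banana"]

model = DecisionTreeClassifier()
model.fit(examples, labels)          # "experience": learn patterns from the examples

print(model.predict([[0.85, 1]]))    # a new, unseen object -> ['apple']
```

Nothing about apples or bananas was ever programmed explicitly; the rules were inferred from the examples, which is the essence of machine learning.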
How does machine learning work?
In the video above, Yann LeCun, Head of Facebook AI Research, explains how machine learning works with easy-to-follow examples. Machine learning uses a variety of techniques to intelligently handle large and complex amounts of information in order to make decisions and/or predictions.
How does Apple use machine learning today?
Apple has made a habit of crediting machine learning with improving some features in the iPhone, Apple Watch, or iPad in its recent marketing presentations, but it rarely goes into much detail, and most people who buy an iPhone never watch those presentations anyway. Contrast this with Google, for example, which places AI at the center of much of its messaging to consumers.
There are numerous examples of machine learning being used in Apple's software and devices, most of them new in just the past couple of years.
Machine learning is used to help the iPad's software distinguish between a user accidentally resting their palm on the screen while drawing with the Apple Pencil and an intentional press meant to provide input. It's used to monitor usage habits in order to optimize device battery life and charging, both to improve the time users can spend between charges and to protect the battery's long-term health. It's used to make app recommendations.
Then there's Siri, which is perhaps the one thing any iPhone user would immediately perceive as artificial intelligence. Machine learning drives several aspects of Siri, from speech recognition to attempts by Siri to offer useful answers. Savvy iPhone owners might also notice that machine learning is behind the Photos app's ability to automatically sort pictures into pre-made galleries, or to accurately give you photos of a friend named Jane when her name is entered into the app's search field.
In other cases, few users may realize that machine learning is at work. For example, your iPhone may take multiple pictures in rapid succession each time you tap the shutter button. An ML-trained algorithm then analyzes each image and can composite what it deems the best parts of each image into one result.
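Apple has not published the details of this compositing pipeline, so the following is only a toy sketch of the general idea: split each frame of a burst into patches, score each patch for sharpness, and keep the sharpest patch at every position. The function names, patch size, and sharpness measure are all illustrative assumptions, not Apple's method.

```python
# Toy illustration only: this is NOT Apple's published pipeline. It sketches the
# generic idea of compositing the sharpest regions from a burst of frames.
import numpy as np

def sharpness(patch):
    """Simple sharpness score: variance of the image gradient magnitude."""
    gy, gx = np.gradient(patch.astype(float))
    return float(np.var(np.hypot(gx, gy)))

def composite_burst(frames, patch_size=32):
    """Combine a burst of same-sized grayscale frames patch by patch."""
    h, w = frames[0].shape
    out = np.zeros((h, w), dtype=frames[0].dtype)
    for y in range(0, h, patch_size):
        for x in range(0, w, patch_size):
            candidates = [f[y:y + patch_size, x:x + patch_size] for f in frames]
            best = max(candidates, key=sharpness)   # keep the sharpest patch
            out[y:y + patch_size, x:x + patch_size] = best
    return out

# Example with random arrays standing in for a real burst of photos:
burst = [np.random.rand(128, 128) for _ in range(4)]
result = composite_burst(burst)
```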
Siri is a personal assistant that communicates using speech synthesis. Starting in iOS 10 and continuing with new features in iOS 11, we base Siri voices on deep learning. The resulting voices are more natural, smoother, and allow Siri's personality to shine through. This article presents more details about the machine learning-based technology behind Siri's voice.
How Does Speech Synthesis Work?
Building a high-quality text-to-speech (TTS) system for a personal assistant is not an easy task. The first phase is to find a professional voice talent whose voice is both pleasant and intelligible and fits the personality of Siri. To cover some of the vast variety of human speech, we first need to record 10 to 20 hours of speech in a professional studio. The recording scripts vary from audiobooks to navigation instructions, and from prompted answers to witty jokes.
Typically, this natural speech cannot be used as it is recorded, because it is impossible to record every possible utterance the assistant may speak. Thus, unit selection TTS is based on slicing the recorded speech into its elementary components, such as half-phones, and then recombining them according to the input text to create entirely new speech. In practice, selecting appropriate phone segments and joining them together is not easy, because the acoustic characteristics of each phone depend on its neighboring phones and on the prosody of the speech, which often makes the speech units incompatible with each other. The figure below illustrates how speech can be synthesized using a speech database segmented into half-phones.
Illustration of unit selection speech synthesis using half-phones. The synthesized utterance "Unit selection synthesis" and its phonetic transcription using half-phones are shown at the top of the figure. The corresponding synthetic waveform and its spectrogram are shown below. The speech segments delimited by the lines are continuous speech segments from the database that may contain one or more half-phones.
In contrast to the front-end, which converts the input text into phonetic and prosodic information and is largely language-specific, the back-end is mostly language-independent. It consists of unit selection and waveform concatenation parts. When the system is trained, the recorded speech data is segmented into individual speech segments using forced alignment between the recorded speech and the recording script (using speech recognition acoustic models). The segmented speech is then used to create a unit database. The unit database is further augmented with important information, such as the linguistic context and acoustic features of each unit; we refer to this data as the unit index. Using the constructed unit database and the predicted prosodic features that guide the selection process, a Viterbi search is performed in the speech unit space to find the best path of units for synthesis.
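Apple has not published the exact layout of its unit index, so the sketch below is only an assumed, minimal shape of what a single entry might hold: the audio span of one half-phone plus the linguistic context and acoustic features that the selection costs are computed from. All field names are hypothetical.

```python
# Hypothetical sketch of a unit-index entry; field names are assumptions, not
# Apple's actual data structures.
from dataclasses import dataclass, field

@dataclass
class Unit:
    unit_id: int
    half_phone: str              # e.g. "ax_L" (left half of the phone /ax/)
    start_sample: int            # position of this segment in the recorded audio
    end_sample: int
    linguistic_context: dict = field(default_factory=dict)  # neighboring phones, syllable/word/phrase info
    acoustic_features: list = field(default_factory=list)   # e.g. MFCCs, f0, duration used for cost computation

# A unit database is then a searchable collection of such entries, typically
# grouped by half-phone identity so candidates can be looked up quickly.
unit_db = {}  # maps half-phone identity -> list of Unit entries
```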
Illustration of Viterbi search for finding the best path of units in the lattice. The target half-phones for synthesis are shown at the top of the figure, below which each box corresponds to an individual unit. The best path, as found by the Viterbi search, is shown as a line connecting the selected units.
The selection is based on two criteria: 1) the units must obey the target prosody, and 2) the units should, wherever possible, be concatenated without audible glitches at the unit boundaries. These two criteria are called the target cost and concatenation cost, respectively. The target cost is the difference between the predicted target acoustic features and the acoustic features extracted from each unit (stored in the unit index), whereas the concatenation cost is the acoustic difference between consecutive units (see the illustration below). The overall cost is calculated as follows:
$$C(u_1^N, t_1^N) = \sum_{n=1}^{N} w_t\, C_t(t_n, u_n) + \sum_{n=2}^{N} w_c\, C_c(u_{n-1}, u_n)$$
where u_n is the nth unit, t_n is the nth target, N is the number of units, C_t and C_c are the target and concatenation costs, and w_t and w_c are their respective weights. After determining the optimal sequence of units, the individual unit waveforms are concatenated to create continuous synthetic speech.
Illustration of the unit selection method based on target and concatenation costs.
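To make the search concrete, here is a hedged, toy sketch of a Viterbi unit selection over a lattice of candidates, using the weighted target and concatenation costs from the formula above. The cost functions are simple Euclidean distances standing in for the real acoustic comparisons, and all function and variable names are assumptions for illustration, not Apple's implementation.

```python
# Toy Viterbi unit selection: find the cheapest path of units through the lattice.
# Costs are placeholder Euclidean distances; the real system compares acoustic
# features predicted by the deep MDN against features stored in the unit index.
import numpy as np

def target_cost(target_feats, unit_feats):
    return float(np.linalg.norm(np.asarray(target_feats) - np.asarray(unit_feats)))

def concat_cost(prev_unit_feats, unit_feats):
    return float(np.linalg.norm(np.asarray(prev_unit_feats) - np.asarray(unit_feats)))

def viterbi_unit_selection(targets, candidates, w_t=1.0, w_c=1.0):
    """targets: one target feature vector per half-phone position.
    candidates: for each position, a list of candidate unit feature vectors.
    Returns the index of the selected unit at each position."""
    n = len(targets)
    cost = [[w_t * target_cost(targets[0], u) for u in candidates[0]]]
    back = [[-1] * len(candidates[0])]
    for i in range(1, n):
        row_cost, row_back = [], []
        for u in candidates[i]:
            # cheapest way to reach this unit from any unit at the previous position
            best_prev, best_val = 0, float("inf")
            for j, prev_u in enumerate(candidates[i - 1]):
                val = cost[i - 1][j] + w_c * concat_cost(prev_u, u)
                if val < best_val:
                    best_prev, best_val = j, val
            row_cost.append(best_val + w_t * target_cost(targets[i], u))
            row_back.append(best_prev)
        cost.append(row_cost)
        back.append(row_back)
    # Backtrack from the cheapest final unit to recover the best path.
    idx = int(np.argmin(cost[-1]))
    path = [idx]
    for i in range(n - 1, 0, -1):
        idx = back[i][idx]
        path.append(idx)
    return path[::-1]

# Example: 3 target positions, each with a few random candidate units.
targets = [np.random.rand(4) for _ in range(3)]
candidates = [[np.random.rand(4) for _ in range(5)] for _ in range(3)]
print(viterbi_unit_selection(targets, candidates))
```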
Results
We built a deep MDN-based hybrid unit selection TTS system for the new Siri voices. The training speech data contains a minimum of 15 hours of high-quality speech recordings sampled at 48 kHz. We segmented the speech into half-phones using forced alignment, i.e., automatic speech recognition to align the input phone sequence with acoustic features extracted from the speech signal. This segmentation process results in around 1–2 million half-phone units, depending on the amount of recorded speech.
To guide the unit selection process, we trained the unified target and concatenation model using a deep MDN architecture. The input to the deep MDN consists of mostly binary values with some additional continuously valued features. The features represent information about the quinphone context (2 preceding, current, and 2 succeeding phones); syllable, word, phrase, and sentence level information; and additional prominence and stress features. The output vector consists of the following acoustic features: Mel-frequency cepstral coefficients (MFCCs), delta-MFCCs, fundamental frequency (f0), and delta-f0 (including values both at the beginning and the end of each unit), and the duration of the unit. Since we are using an MDN as an acoustic model, the output also contains the variances of each feature, which act as automatic context-dependent weights.
Also, because the fundamental frequency of voiced speech regions is highly dependent on the utterance as a whole, and in order to create natural and lively prosody in the synthesized speech, we employed a recurrent deep MDN to model the f0 features.
The architecture of the trained deep MDN consists of 3 hidden layers with 512 rectified linear units (ReLU) as nonlinearities in each layer. Both input and output features are mean and variance normalized before training. The final unit selection voice consists of the unit database including feature and audio data for each unit, and the trained deep MDN model.
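As a rough sketch of what such a network can look like, here is a minimal PyTorch model with 3 hidden layers of 512 ReLU units that outputs a mean and a variance for each acoustic feature, so the variances can serve as context-dependent weights during unit selection. The exact layer layout, feature dimensions, and training loss are assumptions for illustration, not Apple's implementation.

```python
# Minimal sketch of a deep mixture density network (single Gaussian per feature):
# layer sizes follow the description above, everything else is an assumption.
import torch
import torch.nn as nn

class DeepMDN(nn.Module):
    def __init__(self, n_inputs, n_acoustic_feats, hidden=512):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(n_inputs, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.mean = nn.Linear(hidden, n_acoustic_feats)      # predicted target features
        self.log_var = nn.Linear(hidden, n_acoustic_feats)   # predicted variances (as logs)

    def forward(self, x):
        h = self.body(x)
        return self.mean(h), torch.exp(self.log_var(h))

# Example: linguistic features in, per-unit acoustic means and variances out.
model = DeepMDN(n_inputs=600, n_acoustic_feats=80)   # sizes are illustrative only
feats = torch.randn(4, 600)                          # a batch of 4 feature vectors
means, variances = model(feats)
```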
The quality of the new TTS system is superior to that of the previous Siri system. In AB pairwise subjective listening tests, listeners clearly preferred the new deep MDN-based voices over the old voices. The results are shown below. This better quality can be attributed to multiple improvements in the TTS system, such as the deep MDN-based back-end resulting in better selection and concatenation of the units, a higher sampling rate (48 kHz instead of 22 kHz), and better audio compression.
Results of the AB pairwise subjective listening tests. The new voices were rated clearly better in comparison to the old ones.
Since the TTS system needs to run on mobile devices, we optimized its runtime performance for speed, memory usage, and footprint by using fast preselection, unit pruning, and parallelization of the calculations.
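As a rough illustration of one of those optimizations, fast preselection can be sketched as keeping only the K cheapest candidate units per target position (by target cost alone) before running the full search, which shrinks the lattice considerably. The value of K and the cost function below are illustrative assumptions, not Apple's published settings.

```python
# Hedged sketch of fast preselection: prune the candidate lattice before the
# Viterbi search. K and the distance used for ranking are illustrative choices.
import numpy as np

def _target_cost(target_feats, unit_feats):
    return float(np.linalg.norm(np.asarray(target_feats) - np.asarray(unit_feats)))

def preselect(targets, candidates, k=20):
    """Keep only the k best-matching candidate units per target position."""
    pruned = []
    for t, units in zip(targets, candidates):
        ranked = sorted(units, key=lambda u: _target_cost(t, u))
        pruned.append(ranked[:k])
    return pruned

# The pruned lattice can then be fed to a Viterbi search like the one sketched
# earlier, at a fraction of the cost of searching over all units.
```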