Invention of Voice Recognition: This Century's Phenomenon
Experimentation with voice recognition systems has been a focus since the 1950s, and the Defense Advanced Research Projects Agency's (DARPA) latest solicitation reaches beyond simple identification and classification. Research under its Grounded Artificial Intelligence Language Acquisition (GAILA) project sets a benchmark of human-level voice identification, interpretation, and understanding.
DARPA’s RATS Program
Recent documents from DARPA show that the agency is in the third phase of its Robust Automatic Transcription of Speech (RATS) program. Until recently, the accuracy of speech recognition, and of the signal processing needed to separate speech from other audio, was in question and not effective enough for military applications. The RATS program, with initial funding of $13 million and an additional $2.4 million for Air Force deployment, was designed to produce algorithms that correct earlier shortcomings by performing four tasks (a simplified sketch of such a pipeline follows the list):
- Separating speech patterns to filter out music and background noise.
- Determining the language of the speech.
- Providing identification of a single speaker or multiple speakers.
- Analyzing and identifying specific word patterns.
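A minimal sketch of how those four stages might be chained in software is shown below. The function names, thresholds, and toy "classifiers" here are stand-ins for illustration only, not DARPA's actual algorithms.

```python
"""Illustrative four-stage pipeline mirroring the RATS goals: speech activity
detection, language identification, speaker identification, and keyword
spotting. All models below are placeholders, not the program's algorithms."""
import numpy as np

FRAME = 400             # 25 ms frames at an assumed 16 kHz sample rate
ENERGY_THRESHOLD = 0.01  # toy noise/speech cutoff

def detect_speech(signal: np.ndarray) -> np.ndarray:
    """Stage 1: keep only frames whose energy exceeds a noise threshold."""
    frames = signal[: len(signal) // FRAME * FRAME].reshape(-1, FRAME)
    energies = (frames ** 2).mean(axis=1)
    return frames[energies > ENERGY_THRESHOLD]

def identify_language(speech_frames: np.ndarray) -> str:
    """Stage 2: placeholder language classifier (a real system would use
    acoustic and phonotactic models)."""
    return "english" if len(speech_frames) else "unknown"

def identify_speakers(speech_frames: np.ndarray) -> list[str]:
    """Stage 3: placeholder speaker identification."""
    return ["speaker_1"] if len(speech_frames) else []

def spot_keywords(speech_frames: np.ndarray, keywords: list[str]) -> list[str]:
    """Stage 4: placeholder keyword spotting over the retained speech."""
    return keywords if len(speech_frames) else []

if __name__ == "__main__":
    # Synthetic one-second clip: quiet noise followed by a louder tone
    # standing in for speech.
    rng = np.random.default_rng(0)
    noise = 0.005 * rng.standard_normal(8000)
    speech = 0.2 * np.sin(2 * np.pi * 200 * np.arange(8000) / 16000)
    clip = np.concatenate([noise, speech])

    frames = detect_speech(clip)
    print(identify_language(frames), identify_speakers(frames),
          spot_keywords(frames, ["target phrase"]))
```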
The DARPA RATS program has included IBM, the Massachusetts Institute of Technology (MIT), and other universities, among them Carnegie Mellon, Johns Hopkins, the University of Baltimore, the University of Maryland, and the University of Pittsburgh, as well as universities in the Czech Republic, Spain, England, and Denmark. It has also drawn on research from Raytheon and from SRI International, the former Stanford Research Institute.
All audio communication can now be tagged and recorded from any device that accepts voice commands, including cell phones, cars, refrigerators, and other smart devices, producing what is being called the "universal translator" seen in Star Trek for over 50 years.
“In the Beginning”
Voice recognition software grew out of the Speech Understanding Research (SUR) program funded by DARPA in the 1970s, which produced the Harpy system at Carnegie Mellon, capable of recognizing the vocabulary of an "average three year old." The development of Harpy led to search capabilities that "proved the finite-state network of possible sentences," which is the key to continued identification and accuracy. Around the same time, Bell Labs expanded the research to interpret multiple people's voices.
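The idea behind a finite-state network of possible sentences is that the recognizer only has to search among word sequences the grammar allows. The toy network below illustrates the concept; the states, words, and rules are invented for the example and are not Harpy's actual network.

```python
"""Toy finite-state network of possible sentences, in the spirit of Harpy's
search. Purely illustrative: states and vocabulary are made up."""

# Each state maps allowed next words to the state they lead to;
# "END" marks a complete sentence.
NETWORK = {
    "START": {"show": "VERB", "list": "VERB"},
    "VERB":  {"the": "DET"},
    "DET":   {"files": "END", "messages": "END"},
}

def accepts(words: list[str]) -> bool:
    """Walk the network; a sentence is valid only if it reaches END."""
    state = "START"
    for word in words:
        transitions = NETWORK.get(state, {})
        if word not in transitions:
            return False
        state = transitions[word]
    return state == "END"

print(accepts("show the files".split()))   # True: a path exists in the network
print(accepts("show files the".split()))   # False: no such path, so rejected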
My first experience with voice recognition was at the Micro Computer Lab at Phillips Petroleum Company (PPCO), where I was first employed right out of college. About a year earlier, the PPCO Distributed Data Processing (DDP) group had received the first IBM PC, in boxes, at the reception desk. I passed by the boxes for several days, noticing they were addressed to the DDP group. No one came to claim them, so I dragged them back to my cubicle. I had no idea that the first IBM PC was inside. It arrived in pieces, which I put together like a jigsaw puzzle, matching each piece to a logical idea of its location, but it was a beginning. Processing speed was slow, and speed is a necessary component of voice recognition; within two years, however, a specialized microprocessor arrived at the PPCO lab, using the first artificial intelligence analytics to identify language. After I spoke into a microphone, a recording of the words read from a script could be played back. It was clunky, a broken machine voice with many of the words missed, but again, it was a beginning. The instructions were to read the same script at least ten times for full mimicking accuracy. Numerous tests showed that without enormous amounts of data and faster processing speeds, complex algorithmic identification probability models were left wanting.
In 2008, Google introduced Google Voice Search for the iPhone, relying on its data centers to handle the enormous amount of data analysis needed to match user queries against actual human voice patterns. It wasn't until 2010 that Google made a major breakthrough for Android devices, differentiating speech patterns with the help of 230 billion English words drawn from user queries.
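The value of all that text is statistical: the more word data available, the better the system can estimate which word sequences are probable and use that to choose among acoustically similar transcripts. The minimal bigram language model below illustrates the idea; the tiny corpus and candidate transcripts are invented for the example, whereas production systems estimate these counts from billions of words.

```python
"""Minimal bigram language model used to re-rank recognition hypotheses.
Corpus and hypotheses are toy data for illustration only."""
from collections import Counter
import math

corpus = "call me back call me later call her later".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def log_prob(sentence: list[str]) -> float:
    """Sum of log P(word | previous word) with add-one smoothing."""
    vocab = len(unigrams)
    score = 0.0
    for prev, word in zip(sentence, sentence[1:]):
        score += math.log((bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab))
    return score

# Two acoustically similar transcripts; the model prefers the one whose
# word sequence was actually seen in the text data.
hypotheses = ["call me later".split(), "call knee later".split()]
best = max(hypotheses, key=log_prob)
print(" ".join(best))
```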
Apple sideswiped Google's biggest money maker, its dominant search analytics, or so claimed Google's chairman, Eric Schmidt, in testimony before a US Senate Judiciary antitrust subcommittee, with its purchase of Siri, a spin-off of SRI International (the former Stanford Research Institute), from which the name derives. SRI had been at the forefront of voice recognition research. Google claimed that Siri was potentially a major threat to Google. Over the years, suits have been filed against Apple, Microsoft, Sony, and on and on, by companies of every shape and size claiming patent infringement or violations of antitrust laws. Many technology experts categorize the artificial intelligence (AI) behind voice recognition as one of the technological phenomena of the past century, and many claimants insist their patent means complete ownership. The stakes are very high, possibly affecting every smart device.
“Big Brother” Listening
Recently, the US Government was caught tapping into vast amounts of mobile device communication, causing the US public to rant and rave for a media cycle. The act could have been interpreted as an invasion of privacy, the US conducting surveillance on its own citizens. The public was told not to worry: the government only needed to capture an enormous diversity of languages, speech inflections, and words to perfect the voice recognition system for the military, and surveillance would be conducted only outside US borders. Every day, the lines around the three-letter agencies blur further, and what constitutes US borders is a matter of interpretation.
DARPA's recent solicitation, for a prototype built in partnership with and affiliated with GAILA, requires a fundamentally more advanced technology delivery system. The software must be able to interpret visual cues and translate them into descriptions of experiences before, during, and after an activity or event. It must accept images, video, virtual reality, and other digital imagery as input, and produce both a description of the instant imagery and a time-lapsed, predictive description of its salient elements. In other words, it must interpret moving activity accurately while parsing its elements in sequence and describing the event in English. Another expected outcome is enhancement of Augmented Reality / Virtual Reality (AR/VR), along with feeds to self-driving cars and other autonomous devices.
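In software terms, the requirement implies an interface where time-stamped observations from a video or VR feed go in and a time-ordered sequence of plain-English event descriptions comes out. The sketch below shows one way such an interface could be shaped; the classes, fields, and the simple grouping rule are hypothetical and not part of GAILA's actual design.

```python
"""Hypothetical sketch of a video-to-description interface: frames in,
time-ordered English event descriptions out. Not GAILA's design."""
from dataclasses import dataclass

@dataclass
class Observation:
    timestamp: float        # seconds into the video or VR feed
    objects: list[str]      # salient elements detected in the frame

@dataclass
class EventDescription:
    start: float
    end: float
    sentence: str           # plain-English description of the activity

def describe(observations: list[Observation]) -> list[EventDescription]:
    """Group consecutive frames that share the same salient objects and emit
    one English sentence per group (a stand-in for a learned captioner)."""
    events: list[EventDescription] = []
    for obs in sorted(observations, key=lambda o: o.timestamp):
        label = " and ".join(obs.objects) + " present in the scene"
        if events and events[-1].sentence == label:
            events[-1].end = obs.timestamp      # extend the current event
        else:
            events.append(EventDescription(obs.timestamp, obs.timestamp, label))
    return events

frames = [Observation(0.0, ["person"]), Observation(1.0, ["person"]),
          Observation(2.0, ["person", "vehicle"])]
for event in describe(frames):
    print(f"{event.start:.1f}-{event.end:.1f}s: {event.sentence}")
```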