Yanny vs Laurel & Acoustic Modeling in Speech Recognition Systems
https://centricdigital.com/blog/digital-strategy/a-101-on-natural-language-processing-in-business/

Twitter data shows that 47% of people hear "Yanny" and 53% hear "Laurel".

Auditory Processing Disorder (APD)

Auditory processing disorder, also known as central auditory processing disorder (CAPD), is a hearing problem that affects about 5% of school-aged children.

A child's auditory system isn't fully developed until age 15. Many kids diagnosed with APD can develop better skills over time as their auditory system matures.

Cooper and Gates (1991) estimated the prevalence of APD in older adults to be 10 to 20%.

(Cooper JC Jr., Gates GA. Hearing in the elderly – the Framingham cohort, 1983–1985: Part II. Prevalence of central auditory processing disorders. Ear Hear. 1991;12(5):304–311)

The ability to listen to and comprehend multiple messages at the same time is a trait heavily influenced by our genes, say federal researchers. These "short circuits in the wiring" sometimes run in families or result from a difficult birth, just like any learning disability. Auditory processing disorder can also be associated with conditions affected by genetic traits, such as various developmental disorders.

APD can manifest as problems determining the direction of sounds, difficulty perceiving differences between speech sounds and sequencing them into meaningful words, and confusion of similar sounds such as "hat" with "bat" or "there" with "where". Fewer words may be perceived than were actually said, because there can be problems detecting the gaps between words, creating the sense that someone is speaking unfamiliar or nonsense words. In addition, it is common for APD to cause speech errors involving the distortion and substitution of consonant sounds.

Signs and Symptoms

  • Has difficulty processing and remembering language-related tasks but may have no trouble interpreting or recalling non-verbal environmental sounds, music, etc.
  • May process thoughts and ideas slowly and have difficulty explaining them
  • Misspells and mispronounces similar-sounding words or omits syllables; confuses similar-sounding words (celery/salary; belt/built; borrow/barrow; jab/job; affect/effect)
  • May be confused by figurative language (metaphors, similes) or misunderstand puns and jokes; interprets words too literally
  • Often is distracted by background sounds/noises
  • Finds it difficult to stay focused on or remember a verbal presentation or lecture
  • May misinterpret or have difficulty remembering oral directions; difficulty following directions in a series
  • Has difficulty comprehending complex sentence structure or rapid speech
  • “Ignores” people, especially if engrossed
  • Says “What?” a lot, even after hearing much of what was said

Strategies

  • Show rather than explain
  • Supplement with more intact senses (use visual cues, signals, handouts, manipulatives)
  • Reduce or space directions, give cues such as “ready?”
  • Reword or help decipher confusing oral and/or written directions
  • Teach abstract vocabulary, word roots, synonyms/antonyms
  • Vary pitch and tone of voice, alter pace, stress key words
  • Ask specific questions as you teach to find out if they do understand
  • Allow them 5-6 seconds to respond (“think time”)
  • Have the student constantly verbalize concepts, vocabulary words, rules, etc.

Neural Networks for Acoustic Modeling in Speech Recognition

Most current speech recognition systems use hidden Markov models (HMMs) to deal with the temporal variability of speech and Gaussian mixture models (GMMs) to determine how well each state of each HMM fits a frame or a short window of frames of coefficients that represents the acoustic input.
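To make the GMM half of that description concrete, here is a minimal sketch of scoring one frame of coefficients against the GMM of each HMM state. It assumes diagonal covariances, and the state names, weights, means, and variances are made up for illustration; a real system would use far more states and higher-dimensional frames.

```python
import math

def log_gaussian_diag(frame, mean, var):
    """Log density of a diagonal-covariance Gaussian evaluated at `frame`."""
    total = 0.0
    for x, m, v in zip(frame, mean, var):
        total += -0.5 * (math.log(2.0 * math.pi * v) + (x - m) ** 2 / v)
    return total

def gmm_log_likelihood(frame, weights, means, variances):
    """Log-likelihood of one frame under a GMM, using log-sum-exp for stability."""
    terms = [
        math.log(w) + log_gaussian_diag(frame, m, v)
        for w, m, v in zip(weights, means, variances)
    ]
    peak = max(terms)
    return peak + math.log(sum(math.exp(t - peak) for t in terms))

# Hypothetical example: each HMM state owns one small GMM over 2-dimensional
# frames, stored as (weights, means, variances). All values are illustrative.
state_gmms = {
    "state_a": ([0.6, 0.4], [[0.0, 0.0], [1.0, 1.0]], [[1.0, 1.0], [0.5, 0.5]]),
    "state_b": ([1.0], [[5.0, 5.0]], [[1.0, 1.0]]),
}

def best_state(frame):
    """Return the state whose GMM gives the frame the highest log-likelihood."""
    return max(state_gmms, key=lambda s: gmm_log_likelihood(frame, *state_gmms[s]))
```

In a full recognizer these per-state scores would be combined with the HMM transition probabilities during decoding; picking the single best state per frame, as `best_state` does, is only a way to inspect the fit.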

Recent studies have demonstrated the effectiveness of DBN-DNNs for detecting sub-phonetic speech attributes (also known as phonological or articulatory features) in the widely used Wall Street Journal speech database (5k-WSJ0).

DNN architectures with 5 to 7 hidden layers and up to 2048 hidden units per layer were explored, producing greater than 90% frame-level accuracy for all 21 attributes tested in the full DNN system. On the same data, DBN-DNNs also achieved a very high per-frame phone classification accuracy of 86.6%. Overfitting and the time required for discriminative fine-tuning with back-propagation were among the main impediments to using DNNs.
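Frame-level accuracy, the metric quoted above, is simply the fraction of frames whose predicted label matches the reference label. A minimal sketch follows; the label sequences are invented for illustration and are not taken from the cited experiments.

```python
def frame_accuracy(predicted, reference):
    """Fraction of frames whose predicted label matches the reference label."""
    if len(predicted) != len(reference):
        raise ValueError("label sequences must have the same length")
    if not reference:
        raise ValueError("need at least one frame")
    correct = sum(p == r for p, r in zip(predicted, reference))
    return correct / len(reference)

# Hypothetical per-frame phone labels for a short utterance.
reference = ["l", "l", "ao", "ao", "r", "ax", "l", "l"]
predicted = ["l", "l", "ao", "r", "r", "ax", "l", "l"]
accuracy = frame_accuracy(predicted, reference)  # 7 of 8 frames match
```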

It is now known that most of the gain comes from using deep neural networks to exploit information in neighboring frames and from modeling tied context-dependent states. https://ieeexplore.ieee.org/document/6296526/
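One common way to let a network see neighboring frames is to splice each frame together with a few frames on either side before feeding it to the first layer. The sketch below illustrates that idea under stated assumptions: the function name, the choice to pad edges by repeating the first or last frame, and the toy 2-dimensional frames are all illustrative, not the setup of the cited paper.

```python
def splice_frames(frames, context):
    """Concatenate each frame with `context` neighbors on both sides.

    Edges are padded by repeating the first or last frame, which is one
    possible convention; zero padding is another.
    """
    if not frames:
        return []
    last = len(frames) - 1
    spliced = []
    for t in range(len(frames)):
        window = []
        for offset in range(-context, context + 1):
            idx = min(max(t + offset, 0), last)  # clamp index to valid range
            window.extend(frames[idx])
        spliced.append(window)
    return spliced

# Three toy frames of two coefficients each.
frames = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
spliced = splice_frames(frames, context=1)
```

With `context=1`, every spliced vector is three times as long as a single frame, so the network's input layer sees a short window of acoustic context rather than one isolated frame.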

Chatbots, Voice Activated Technologies & Natural Language Understanding (NLU)

Language can be "grounded" or "inferred".

Humans understand many words in terms of associations with sensory-motor experiences. Abstract words are acquired in relation to other grounded words. This grounding enables humans to acquire and use words and sentences in context.

Inferred language derives meanings from the words themselves rather than from what they represent.

When trained on large corpora of text but not on real-world representations, NLP and NLU systems can manipulate words without understanding what they mean, a limitation illustrated by a thought experiment called the Chinese Room Argument https://plato.stanford.edu/entries/chinese-room/

Recent advances in technology have been impressive, including the Google I/O announcement and demo of Google Duplex https://ai.googleblog.com/2018/05/duplex-ai-system-for-natural-conversation.html

So while Google, Facebook, and Amazon are all racing towards the perfect AI assistant using context for NLU, don't beat yourself up over "Yanny" and "Laurel": the clip is missing "context", a key ingredient for Natural Language Understanding.


