Adding Voice to ChatGPT
We’ve all daydreamed about talking to technology like Jarvis in Iron Man, the Computer in Star Trek, or even Rick’s sarcastic spaceship from Rick and Morty. However, what we’ve come to expect from Siri, Alexa, Google Assistant, and the rest is far less than what many of us envisioned when virtual assistants were introduced. While these devices provide simple voice interfaces for smart home applications like lights, TVs, and thermostats, they lack the conversational depth and human nuance one would expect from anything resembling a conversation. It’s possible to get simple, atomic answers to questions like “what is 60 squared?” or “what’s the generic for Prozac?”, but these products really struggle with anything remotely complex.
Then OpenAI brought us ChatGPT- finally, we could have a real conversation with an AI, and have it perform significant tasks well above and beyond what we could expect of most human assistants. Now, if you ask your virtual assistant “what is the meaning of life?” you’ll get a nice summary of some Greek philosophy and 20th century existentialism. However, it lacks the voice capabilities the older assistants had. My immediate thought was that it’s incredible that humankind created something so miraculously close to a general intelligence, yet somehow hasn’t merged it with existing technology like, say, an Amazon Echo.
Due to the open source nature of computing today, combining these technologies on a home laptop with a little glue code has never been easier- most major operating systems come bundled with a text-to-speech service by default, and Meta has provided an open-source speech-to-text model, available via the ‘Wav2Vec2ForCTC’ class in Hugging Face’s transformers library (the base checkpoint is appropriately small for consumer hardware). Connect these to the OpenAI API, and we’re off to the races, right?
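To make the glue concrete, here’s a minimal sketch of the pipeline, not the exact code in the PoC. It assumes a 16 kHz mono WAV recording already on disk (the file name and prompt are placeholders), uses Hugging Face’s Wav2Vec2 implementation, the pre-1.0 openai Python client, and macOS’s built-in `say` command for text-to-speech:

```python
import subprocess

import openai  # reads OPENAI_API_KEY from the environment (pre-1.0 client)
import soundfile as sf
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# Load Meta's speech-to-text model; the base checkpoint is small
# enough for consumer hardware.
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

# Assumes a 16 kHz mono WAV recorded from the microphone; other
# sample rates would need resampling first.
audio, sample_rate = sf.read("question.wav")
inputs = processor(audio, sampling_rate=sample_rate, return_tensors="pt")

# Greedy CTC decoding: take the most likely token at each audio frame.
with torch.no_grad():
    logits = model(inputs.input_values).logits
transcript = processor.batch_decode(torch.argmax(logits, dim=-1))[0]

# Send the transcript to ChatGPT.
response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": transcript}],
)
answer = response.choices[0].message.content

# Speak the answer with the OS's bundled text-to-speech (macOS here;
# Linux would swap in something like espeak).
subprocess.run(["say", answer])
```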
Well, there were a few hurdles to deal with in development. First, GPT models are “auto-regressive”, meaning each new token is generated conditioned on the tokens that came before it, one at a time- this is why there’s a noticeable delay as ChatGPT appears to “type” its answers into the console. Without going too in-depth about GPT, this unfortunately means there will be an inevitable delay while GPT produces the answer, no matter how fast the rest of the script performs at the client end. That said, GPT-3.5 has gotten quite fast after a few updates, presumably from scaled-up resources following mass adoption of and investment in OpenAI.
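You can actually watch this token-by-token generation happen over the API. The PoC doesn’t use this, but a streaming sketch like the following (again assuming the pre-1.0 openai client) shows how the delay accumulates one token at a time, and hints at a mitigation: start speaking partial sentences as they arrive rather than waiting for the full reply.

```python
import openai

# stream=True yields chunks as tokens are generated, instead of
# blocking until the whole answer is finished.
stream = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "What is the meaning of life?"}],
    stream=True,
)

for chunk in stream:
    # Each chunk's delta carries the next piece of generated text.
    delta = chunk.choices[0].delta
    print(delta.get("content", ""), end="", flush=True)
```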
Another challenge was the performance of the speech-to-text model- the laptop I initially tested this on had an exceptionally poor quality microphone, which produced very noisy audio, and thus noisy transcribed text and some rather strange responses from ChatGPT. Having worked in machine vision previously, I knew it takes surprisingly little information to identify entities despite noise, so it wasn’t too surprising that a larger model was all it took for speech-to-text to function considerably better. I would still recommend using a decent microphone for this, though, as everything performs considerably better with clean input.
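In the sketch above, that upgrade is a one-line swap. The exact checkpoint I used may differ, but Hugging Face hosts Wav2Vec2 at several sizes, for example:

```python
# The large checkpoint is noticeably more robust to noisy audio,
# at the cost of more memory and slower inference.
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-large-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-large-960h")
```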
In the spirit of the open source code this heavily relies on, I have open sourced a proof of concept here: https://github.com/peace-blaster/Voice_ChatGPT. It has currently been tested and confirmed functioning only on Mac and Linux.
There are a few other limitations, primarily due to the simplicity of this code- I have not yet implemented a proper event listener to continuously listen for, and acknowledge, a keyword like “hey Alexa”. This would involve training my own model, as this capability doesn’t seem easily achievable with existing models, and I currently lack the capital and processing resources to take on ML development at that scale. Another limitation is that while this functions like a typical smart home assistant today, it doesn’t handle sustained conversations the way ChatGPT does in its web UI (a rough sketch of how that could work follows below). While I’m sure this is possible, I kept it simple, as this was intended to be a mere proof of concept.
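Sustained conversation is mostly bookkeeping: the chat API is stateless, so the client has to re-send the accumulated history on every turn. A minimal sketch of how the PoC could do this, again assuming the pre-1.0 openai client (the system prompt is a placeholder):

```python
import openai

# The chat API is stateless: to sustain a conversation, keep the
# running history and send it back with every new user turn.
history = [{"role": "system", "content": "You are a helpful voice assistant."}]

def ask(user_text: str) -> str:
    history.append({"role": "user", "content": user_text})
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=history,
    )
    answer = response.choices[0].message.content
    # Remember the assistant's reply so later turns have context.
    history.append({"role": "assistant", "content": answer})
    return answer
```

The catch is the model’s context window: the history can only grow so large before older turns have to be trimmed or summarized away.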
Moving forward, there are several capabilities I would like to see. As mentioned previously, the ability to have long, sustained conversations would be fantastic, especially given that GPT is auto-regressive. This would really pave the way for applications like AI therapy, robot pets and companions, incredible immersion in videogames, and so many other things that were science fiction even 5 years ago. I’d also like to explore some more advanced prompt engineering to “trick” ChatGPT into logically interpreting voice commands, returning actionable substrings that can be parsed out of its answer and passed on to something like IFTTT for smart home control. For instance, the user could say something like “it’s too bright in here”, and this could be prefaced with an instruction like “if the following string sounds like the user wants the light off, append a literal XXXYYYZZZ to the end of your answer”. This may or may not end up being necessary, especially since OpenAI is now working on plugins, which could do this in a far less hacky manner.
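As a sketch of that idea: wrap the transcript in the instruction, then scan the reply for the sentinel. The sentinel, prompt wording, and IFTTT event name below are all made up for illustration; IFTTT’s Webhooks service does accept simple triggers at a URL of this shape.

```python
import openai
import requests

SENTINEL = "XXXYYYZZZ"  # arbitrary marker the model is told to emit

def handle_voice_command(transcript: str) -> str:
    prompt = (
        "If the following string sounds like the user wants the light "
        f"turned off, append a literal {SENTINEL} to the end of your "
        f"answer. User said: {transcript}"
    )
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    answer = response.choices[0].message.content

    if SENTINEL in answer:
        # Hypothetical IFTTT Webhooks trigger; 'light_off' and the key
        # are placeholders for whatever applet you configure.
        requests.post(
            "https://maker.ifttt.com/trigger/light_off/with/key/YOUR_KEY"
        )
        answer = answer.replace(SENTINEL, "").strip()

    return answer  # speak this, minus the control marker
```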