Your Personal Voice GPT Assistant with Eleven Labs
Not long ago, we built an audio ChatGPT bot using the OpenAI Whisper API and Google's Text-to-Speech module. The quality of the synthetic voice wasn't great. Today I am responding to a viewer's request and rebuilding it with the Eleven Labs API.
Eleven Labs API
Eleven Labs has so far produced the best-sounding voices on the market. You can sign up for free to use up to 10,000 characters.
Once you have signed in, click your profile picture and select "Profile".
You will find the API key there.
Select your voice
1. Sample the voices you like.
2. Add the ones you want to VoiceLab.
3. Rename the voice. Go back to VoiceLab and you will see it there.
Finding the Voice ID
To use the voice in our code, you need to find its Voice ID.
I used Postman to send a GET request to the Eleven Labs voices endpoint.
In the response, you can identify the Voice ID by the name you gave the voice when you imported it.
    "voice_id": "bqGlZCw25vVCZyrtYzMnSx",
    "name": "My Therapist"
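If you prefer Python to Postman, the same lookup can be sketched with requests. This is my own sketch, assuming the public GET /v1/voices endpoint; the helper names are mine, not from the article.

```python
import requests

def list_voices(api_key):
    """Fetch all voices on the account via GET /v1/voices."""
    resp = requests.get(
        "https://api.elevenlabs.io/v1/voices",
        headers={"xi-api-key": api_key},
    )
    resp.raise_for_status()
    return resp.json()["voices"]

def find_voice_id(voices, name):
    """Pick out the voice_id matching the name you gave the voice."""
    for voice in voices:
        if voice["name"] == name:
            return voice["voice_id"]
    return None
```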
With the API key and Voice ID, you are now ready to use them in the code.
Replace Google TTS Code
In the previous blog, we already built an audio ChatGPT with the Google TTS module. Now we are just going to replace that part.
Eleven Labs has two TTS endpoints. We are going to send a POST request to the stream endpoint.
https://api.elevenlabs.io/v1/text-to-speech/<voice-id>
https://api.elevenlabs.io/v1/text-to-speech/<voice-id>/stream
The standard endpoint converts the text into speech and returns it as an MP3 file, while the stream endpoint returns an audio stream.
For a better chatbot experience, we do not want to wait for a complete MP3 file before the response can be played back, so we are going to use the streaming method.
From the previous code, you can remove the Google TTS part.
Below is the code to replace it with, based on the Eleven Labs API reference.
import requests

CHUNK_SIZE = 1024
url = "https://api.elevenlabs.io/v1/text-to-speech/bqGlZCw25vVCZyrtYzMnSx/stream"

headers = {
    "Accept": "audio/mpeg",
    "Content-Type": "application/json",
    "xi-api-key": config.ELEVEN_API_KEY
}

data = {
    "text": system_message,
    "model_id": "eleven_monolingual_v1",
    "voice_settings": {
        "stability": 0.5,
        "similarity_boost": 0.5
    }
}

# stream=True lets us play the audio chunks as they arrive
response = requests.post(url, json=data, headers=headers, stream=True)
Instead of loading it from config, you can also paste the API key directly as a quoted string. Adjusting the stability and similarity boost will alter the voice, but I haven't experimented with those settings.
Streaming the Audio
Since we want to stream the AI voice, we can pipe it into ffplay (part of FFmpeg) on Windows instead of relying on Gradio's audio output.
import subprocess

# Pipe the audio stream straight into ffplay; -autoexit closes it when playback ends
cmd = ['ffplay', '-autoexit', '-']
proc = subprocess.Popen(cmd, stdin=subprocess.PIPE)
for chunk in response.iter_content(chunk_size=CHUNK_SIZE):
    proc.stdin.write(chunk)
proc.stdin.close()
proc.wait()
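As an optional hardening step (my own addition, not part of the original setup), you can check that ffplay is actually on the PATH before streaming, and fall back to saving the MP3 to disk when it isn't:

```python
import shutil
import subprocess

def play_stream(response, chunk_size=1024):
    """Pipe the audio stream into ffplay, or save it to reply.mp3 as a fallback."""
    if shutil.which("ffplay") is None:
        # ffplay not installed: save the stream instead of playing it
        with open("reply.mp3", "wb") as f:
            for chunk in response.iter_content(chunk_size=chunk_size):
                f.write(chunk)
        return "reply.mp3"
    proc = subprocess.Popen(["ffplay", "-autoexit", "-"], stdin=subprocess.PIPE)
    for chunk in response.iter_content(chunk_size=chunk_size):
        proc.stdin.write(chunk)
    proc.stdin.close()
    proc.wait()
    return None
```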
Change the display output
Lastly, we want to change the output from audio to displaying our chat transcript.
Since the global conversation variable already holds the full transcript, we just need to make it more readable with the following code.
# Format the conversation for display
formatted_conversation = ""
for message in conversation:
    if message["role"] == "user":
        formatted_conversation += "Me: " + message["content"] + "\n"
    elif message["role"] == "assistant":
        formatted_conversation += "You: " + message["content"] + "\n"
return formatted_conversation.strip()
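To sanity-check the formatting, here is the same logic wrapped in a standalone function with a sample conversation (the function wrapper and sample data are mine; the article uses the code inline):

```python
def format_conversation(conversation):
    """Render a ChatGPT message list as a readable transcript."""
    formatted = ""
    for message in conversation:
        if message["role"] == "user":
            formatted += "Me: " + message["content"] + "\n"
        elif message["role"] == "assistant":
            formatted += "You: " + message["content"] + "\n"
    return formatted.strip()

sample = [
    {"role": "system", "content": "You are a helpful therapist."},
    {"role": "user", "content": "Hello"},
    {"role": "assistant", "content": "Hi, how are you feeling today?"},
]
print(format_conversation(sample))
# Me: Hello
# You: Hi, how are you feeling today?
```

Note that the system message is skipped, so only the spoken turns appear in the transcript.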
Now that we are returning text, we change the Gradio output from audio to text:

bot = gr.Interface(
    fn=transcribe,
    inputs=gr.Audio(source="microphone", type="filepath"),
    outputs="text",
)
This is what the final demo looks like: the audio plays automatically without you clicking anything. It's not perfect, but the output displays the full transcript.
Voice Design
The Eleven Labs documentation mentions that you can add pauses by inserting "-" into the text. So I crafted the system role prompt to get the model to respond with "-" from time to time. It did make the GPT model respond with "-", but when Eleven Labs plays it back, the result is not as good as I expected. If you have worked out a better way to make the voice sound more natural, please let me know.
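For illustration, here is the kind of system prompt I mean. The exact wording is a hypothetical sketch of mine, not a recommendation from the Eleven Labs docs:

```python
# Hypothetical system prompt nudging GPT to emit "-" pauses in its replies
system_prompt = (
    "You are a calm, supportive therapist. Speak naturally, and insert a "
    'dash "-" where a person would pause briefly, about once per sentence.'
)
conversation = [{"role": "system", "content": system_prompt}]
```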