Your Personal Voice GPT Assistant with Eleven Labs

Not long ago we built an audio ChatGPT bot using the OpenAI Whisper API and Google's Text-to-Speech module. The quality of the synthetic voice wasn't great. Today I am responding to a viewer's request and rebuilding it with the Eleven Labs API.

Eleven Labs API

Eleven Labs has so far produced the best-sounding synthetic voices on the market. You can sign up for free to use up to 10,000 characters.

Once you have signed in, click the profile picture and select "Profile".

You will find the API key there.
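Rather than pasting the key into your script, you can read it from an environment variable so it never lands in source control. This is just a sketch: the variable name ELEVEN_API_KEY matches what the later code expects from config, but the helper function name is my own.

```python
import os

def load_eleven_api_key():
    # Read the Eleven Labs key from the environment so it stays out of git
    key = os.environ.get("ELEVEN_API_KEY")
    if not key:
        raise RuntimeError("Set the ELEVEN_API_KEY environment variable first")
    return key
```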

Select your voice

1. Select the menu and go to Voice Library.


2. Sample the voices you like and add them to VoiceLab.


3. Rename the voice. Go back to VoiceLab and you will see it there.

Finding the Voice ID

To use the voice in our code, you need to find the Voice ID.

I used Postman with a GET request.


In the response, you can identify the Voice ID by the name you imported it as.

"voice_id": "bqGlZCw25vVCZyrtYzMnSx",
"name": "My Therapist"
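If you prefer to stay in Python instead of Postman, the same lookup works with requests. This is a sketch: the GET /v1/voices endpoint and the xi-api-key header come from the Eleven Labs API, while find_voice_id, list_voices, and the API_KEY placeholder are names I made up for illustration.

```python
import requests

API_KEY = "YOUR_XI_API_KEY"  # the key from your Profile page

def find_voice_id(voices_json, name):
    # Scan the /v1/voices response for the voice you renamed in VoiceLab
    for voice in voices_json.get("voices", []):
        if voice.get("name") == name:
            return voice.get("voice_id")
    return None

def list_voices(api_key=API_KEY):
    # Same call as the Postman GET, returned as parsed JSON
    resp = requests.get("https://api.elevenlabs.io/v1/voices",
                        headers={"xi-api-key": api_key})
    resp.raise_for_status()
    return resp.json()
```

For example, `find_voice_id(list_voices(), "My Therapist")` would return the ID shown above.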

With the API key and Voice ID, you are now ready to use them in the code.

Replace Google TTS Code

In the previous blog post, we already built an audio ChatGPT with the Google TTS module. Now we are just going to replace that part.

Eleven Labs has two TTS endpoints. We are going to send a POST to the stream endpoint.

https://api.elevenlabs.io/v1/text-to-speech/<voice-id>
https://api.elevenlabs.io/v1/text-to-speech/<voice-id>/stream        

The standard endpoint converts the text into speech as an mp3 file, while the stream endpoint returns an audio stream.

For a better chatbot experience, we do not want to wait for the whole mp3 file before the response can be played back, so we are going to use the streaming method.
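The only difference between the two URLs is the /stream suffix, which a small helper (my own, purely for illustration) makes explicit:

```python
def tts_url(voice_id, stream=False):
    # Build the Eleven Labs text-to-speech endpoint for a given voice
    base = "https://api.elevenlabs.io/v1/text-to-speech/" + voice_id
    return base + "/stream" if stream else base
```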

From the previous code, you can remove the Google TTS part.

Below is the code to replace it with. Here is the reference.

CHUNK_SIZE = 1024
url = "https://api.elevenlabs.io/v1/text-to-speech/bqGlZCw25vVCZyrtYzMnSx/stream"

headers = {
    "Accept": "audio/mpeg",
    "Content-Type": "application/json",
    "xi-api-key": config.ELEVEN_API_KEY
}

data = {
    "text": system_message,
    "model_id": "eleven_monolingual_v1",
    "voice_settings": {
        "stability": 0.5,
        "similarity_boost": 0.5
    }
}

response = requests.post(url, json=data, headers=headers, stream=True)

You can also add the API key directly as a double-quoted string instead of reading it from config. Adjusting the stability and similarity boost will alter the voice, but I haven't tried it.

Streaming the Audio

Since we want to stream the AI voice, we can use ffplay (part of FFmpeg) on Windows instead of relying on the Audio output from Gradio.

import subprocess

cmd = ['ffplay', '-autoexit', '-']
proc = subprocess.Popen(cmd, stdin=subprocess.PIPE)
for chunk in response.iter_content(chunk_size=CHUNK_SIZE):
    proc.stdin.write(chunk)

proc.stdin.close()
proc.wait()
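One failure mode worth guarding against: if FFmpeg is not installed, Popen raises FileNotFoundError with a message that can be cryptic. A quick up-front check (the helper name is mine) gives a clearer error:

```python
import shutil

def require_ffplay():
    # Fail with a clear message if ffplay (part of FFmpeg) is not on PATH
    path = shutil.which("ffplay")
    if path is None:
        raise RuntimeError("ffplay not found: install FFmpeg and add it to PATH")
    return path
```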

Change the display output

Lastly, we want to change the output from audio to displaying our chat transcript.

Since the conversation global variable already holds the full transcript, we just need to make it more readable with the following code.

# Format the conversation for display
formatted_conversation = ""
for message in conversation:
    if message["role"] == "user":
        formatted_conversation += "Me: " + message["content"] + "\n"
    elif message["role"] == "assistant":
        formatted_conversation += "You: " + message["content"] + "\n"
return formatted_conversation.strip()
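As a quick sanity check, here is the same formatter wrapped in a function (the function name and sample history are mine) showing how it turns the raw message list into alternating Me:/You: lines and skips the system prompt:

```python
def format_conversation(conversation):
    # Render the OpenAI-style message list as a readable transcript
    formatted_conversation = ""
    for message in conversation:
        if message["role"] == "user":
            formatted_conversation += "Me: " + message["content"] + "\n"
        elif message["role"] == "assistant":
            formatted_conversation += "You: " + message["content"] + "\n"
    return formatted_conversation.strip()

history = [
    {"role": "system", "content": "You are a helpful therapist."},
    {"role": "user", "content": "Hello"},
    {"role": "assistant", "content": "Hi, how are you feeling today?"},
]
print(format_conversation(history))
# Me: Hello
# You: Hi, how are you feeling today?
```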

Now that we are returning text, we can change the Gradio output from audio to text.

bot = gr.Interface(fn=transcribe,
                   inputs=gr.Audio(source="microphone", type="filepath"),
                   outputs="text")

This is what the final demo looks like. The audio plays back automatically without you clicking anything. It's not perfect, but the output displays the full transcript.


Voice Design

The Eleven Labs documentation mentions you can add pauses by inserting "-" into the text. So I crafted the system role prompt to get the model to respond with "-" from time to time. It did make the GPT model respond with "-", but when Eleven Labs plays it back, the result is not as good as I expected. If you have worked out a better way to make the voice sound more natural, please let me know.
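For reference, here is the kind of system prompt I mean, plus a rough way to count how many pause markers a reply contains. Both the prompt wording and the helper are my own sketch, not a recipe that worked well for me.

```python
# Hypothetical system prompt nudging GPT to emit "-" pause markers
PAUSE_PROMPT = ("You are a caring therapist. Speak naturally, and insert '-' "
                "where a person would briefly pause for breath.")

def count_pauses(reply):
    # Rough count of standalone "-" pause markers in a reply
    return sum(1 for token in reply.split() if token == "-")
```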
