Building a Multilingual AI Assistant: Harnessing Speech Recognition, Google Gemini, and Streamlit
Nasir Uddin Ahmed
Lecturer | Data Scientist | Artificial Intelligence | Data & Machine Learning Modeling Expert | Data Mining | Python | Power BI | SQL | ETL Processes | Dean’s List Award Recipient, Universiti Malaya.
In today's digital era, artificial intelligence (AI) is making vast strides, integrating into everyday applications and offering unprecedented convenience. One exciting AI development is creating multilingual virtual assistants capable of understanding and responding in multiple languages. In this article, I will walk through building a multilingual AI assistant using tools like Speech Recognition, Google Gemini's generative AI, and the Streamlit platform for a simple, interactive user experience.
The Idea: An AI Assistant with a Voice
The idea behind this project is simple: build an assistant that listens to your voice, understands your queries, processes them using a powerful AI model, and responds by speaking back in a natural voice. This AI assistant is multilingual, meaning it can understand and respond in different languages, which makes it accessible and usable across different user groups.
Key Components
- Speech Recognition with speech_recognition library: The assistant starts by capturing voice input using the speech_recognition Python library. This package enables real-time audio capture and voice-to-text conversion, making it an essential part of the pipeline.
- Text Generation using Google Gemini AI: For generating human-like responses, I used Google’s Gemini AI. This generative model excels in understanding user input and creating intelligent, context-aware responses.
- Text-to-Speech with gTTS: Once the AI generates the response, we convert that text into speech using the gTTS (Google Text-to-Speech) library, a Python interface to Google Translate's text-to-speech API. The resulting audio file can then be played back to the user or downloaded for future use.
- Interactive User Interface with Streamlit: Finally, all the components are tied together using Streamlit, a powerful and easy-to-use library for creating web apps in Python. The app listens to user queries, processes them via the Google Gemini AI model, and responds both as text and speech.
Breaking Down the Code
Let’s break down the key components of the assistant. First, list the following dependencies in requirements.txt:
SpeechRecognition
pyaudio
google-generativeai
gTTS
pipwin
streamlit
1. Setting Up Logging for Debugging:
# Logger for the application
import logging
import os

LOG_DIR = "logs"
LOG_FILE_NAME = "application.log"

# Create the log directory if it does not exist yet
os.makedirs(LOG_DIR, exist_ok=True)
log_path = os.path.join(LOG_DIR, LOG_FILE_NAME)

logging.basicConfig(
    filename=log_path,
    format="[ %(asctime)s ] %(name)s - %(levelname)s - %(message)s",
    level=logging.INFO,
)
This section configures Python's logging module to write timestamped records at INFO level and above to logs/application.log. Keeping a log makes it easier to trace failures, such as unrecognized speech, after the app has been running.
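Streamlit re-runs the script on every interaction, so it can help to attach log handlers only once. The following is a minimal sketch of that idea, not the article's original code: the helper names `get_logger` and `log_failure` are hypothetical, and the log format is copied from the configuration above.

```python
import logging
import os


def get_logger(name, log_dir="logs", file_name="application.log"):
    """Return a logger that appends formatted records to log_dir/file_name.

    Hypothetical helper for illustration; the article's code configures
    the root logger with logging.basicConfig instead.
    """
    os.makedirs(log_dir, exist_ok=True)
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    log_path = os.path.join(log_dir, file_name)
    # Avoid attaching a duplicate handler when the function is called twice
    already_attached = any(
        isinstance(h, logging.FileHandler)
        and h.baseFilename == os.path.abspath(log_path)
        for h in logger.handlers
    )
    if not already_attached:
        handler = logging.FileHandler(log_path)
        handler.setFormatter(logging.Formatter(
            "[ %(asctime)s ] %(name)s - %(levelname)s - %(message)s"
        ))
        logger.addHandler(handler)
    return logger


def log_failure(logger, action):
    """Run action(); log any exception with its traceback and return None."""
    try:
        return action()
    except Exception:
        logger.exception("Action failed")
        return None
```

Calling `logger.exception(...)` inside an `except` block records the message at ERROR level together with the traceback, which is usually more useful for debugging than logging the exception object at INFO level.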
2. Capturing User Voice Input:
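import logging
import speech_recognition as sr

def takeCommand():
    """Record one utterance from the microphone and return it as text."""
    r = sr.Recognizer()
    with sr.Microphone() as source:
        print("Listening...")
        audio = r.listen(source)
    try:
        # "en-in" selects Indian English; change the code for other languages
        query = r.recognize_google(audio, language="en-in")
        print(f"User said: {query}")
    except Exception:
        # Log the failure with its traceback instead of crashing the app
        logging.exception("Speech recognition failed")
        return "None"
    return query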
The takeCommand() function uses the microphone to listen for the user’s input. The captured audio is converted to text by the Google Web Speech API through recognize_google(), with the language fixed to Indian English ("en-in"). If recognition fails, the exception is logged and the function returns the string "None" instead of crashing the application.
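Returning the string "None" can be mistaken for a real transcript further down the pipeline. The sketch below shows one possible alternative: a wrapper that returns Python's None on failure and takes the language as a parameter. The name `transcribe_or_none` is hypothetical, and `recognize` is assumed to be any callable that behaves like `Recognizer.recognize_google`, which keeps the example runnable without a microphone.

```python
import logging
from typing import Callable, Optional


def transcribe_or_none(
    recognize: Callable[..., str],
    audio: object,
    language: str = "en-IN",
) -> Optional[str]:
    """Call recognize(audio, language=...) and return None on failure.

    `recognize` is assumed to behave like Recognizer.recognize_google:
    it returns the transcript or raises when the audio is unintelligible
    or the service cannot be reached.
    """
    try:
        text = recognize(audio, language=language)
    except Exception:
        logging.exception("Speech recognition failed")
        return None
    # Treat an empty or whitespace-only transcript as "nothing heard"
    text = text.strip()
    return text or None
```

With the real library this could be called as `transcribe_or_none(r.recognize_google, audio, language="hi-IN")`. The broad `except Exception` could also be narrowed to `sr.UnknownValueError` and `sr.RequestError`, the two exceptions speech_recognition raises for unintelligible audio and failed API requests.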
3. Processing User Input with Google Gemini AI:
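import google.generativeai as genai

def gemini_model(user_input):
    """Send the transcribed query to Gemini and return the reply text."""
    # Replace the placeholder with your own key; avoid committing real keys
    genai.configure(api_key="YOUR_API_KEY")
    model = genai.GenerativeModel('gemini-pro')
    response = model.generate_content(user_input)
    return response.text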
This function takes the text from the voice input and feeds it to the Google Gemini generative AI model. It uses Gemini to generate a contextually appropriate response based on the user’s input.
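Hard-coding the API key in the source makes it easy to leak. One common alternative, sketched here under assumptions, is to read the key from an environment variable. Both the helper name `load_api_key` and the variable name `GEMINI_API_KEY` are hypothetical choices for this example, not names required by the library.

```python
import os


def load_api_key(env_var: str = "GEMINI_API_KEY") -> str:
    """Return the API key stored in env_var, or raise if it is missing."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(
            f"Set the {env_var} environment variable before starting the app"
        )
    return key
```

Inside gemini_model(), the configuration call would then become `genai.configure(api_key=load_api_key())`, so the key never appears in the code.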
4. Converting Text to Speech:
from gtts import gTTS

def text_to_speech(text):
    """Convert text to English speech and save it as speech.mp3."""
    ttx = gTTS(text=text, lang="en")
    ttx.save("speech.mp3")
This function converts the AI-generated response back into speech using the gTTS library. The resulting audio file, "speech.mp3," is saved locally for playback and download.
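As written, the recognizer is fixed to "en-in" and gTTS to "en", so supporting another language means changing both codes together. A minimal sketch of one way to keep them in sync is a lookup table keyed by a display name. The table contents and the names `LanguageCodes`, `LANGUAGES`, and `codes_for` are illustrative assumptions; check each library's list of supported languages before relying on a particular code.

```python
from typing import NamedTuple


class LanguageCodes(NamedTuple):
    recognizer: str  # locale tag passed to recognize_google(language=...)
    tts: str         # language code passed to gTTS(lang=...)


# Illustrative subset; extend with the languages your users need
LANGUAGES = {
    "English": LanguageCodes("en-IN", "en"),
    "Hindi": LanguageCodes("hi-IN", "hi"),
    "Bengali": LanguageCodes("bn-BD", "bn"),
    "French": LanguageCodes("fr-FR", "fr"),
    "Spanish": LanguageCodes("es-ES", "es"),
}


def codes_for(language_name: str) -> LanguageCodes:
    """Look up the code pair for a display name, falling back to English."""
    return LANGUAGES.get(language_name, LANGUAGES["English"])
```

In the Streamlit app, the display name could come from `st.selectbox("Language", list(LANGUAGES))`, with `codes_for(name).recognizer` passed to the recognizer and `codes_for(name).tts` passed to gTTS.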
5. Bringing Everything Together with Streamlit:
import streamlit as st

def main():
    st.title("Multilingual AI Assistant")
    if st.button("Ask me anything!"):
        with st.spinner("Listening..."):
            text = takeCommand()
            response = gemini_model(text)
            text_to_speech(response)
            # Read the generated audio; the context manager closes the file
            with open("speech.mp3", "rb") as audio_file:
                audio_bytes = audio_file.read()
            st.text_area(label="Response:", value=response, height=350)
            st.audio(audio_bytes, format="audio/mp3")
            st.download_button(label="Download Speech",
                               data=audio_bytes,
                               file_name="speech.mp3",
                               mime="audio/mp3")

# Streamlit executes the script top to bottom, so main() must be called
if __name__ == "__main__":
    main()
Here, Streamlit serves as the front end for the AI assistant, making it user-friendly and interactive. Once a user clicks the “Ask me anything!” button, the assistant listens to their query, generates a response, and presents it in both text and audio form.
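One gap in this flow is that when recognition fails, takeCommand() returns the string "None", which is then sent to Gemini as if it were a question. The sketch below shows one way to guard against that. The function name `answer_query` and the constant `NO_INPUT` are hypothetical, and the three pipeline steps are passed in as callables so the logic can be tested without a microphone or network access.

```python
from typing import Callable, Optional

NO_INPUT = "None"  # sentinel string returned by takeCommand() on failure


def answer_query(
    listen: Callable[[], str],
    generate: Callable[[str], str],
    speak: Callable[[str], None],
) -> Optional[str]:
    """Run listen -> generate -> speak, skipping the model if nothing was heard.

    Returns the response text, or None when recognition failed.
    """
    query = listen()
    if not query or query == NO_INPUT:
        return None
    response = generate(query)
    speak(response)
    return response
```

In main(), this could be used as `response = answer_query(takeCommand, gemini_model, text_to_speech)`, showing a message such as `st.warning("Sorry, I did not catch that.")` when the result is None.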
The Benefits of This Approach
- Voice-first Interaction: Using voice input enables hands-free operation and makes the assistant accessible to a broader audience, including users who prefer or need to interact with technology via speech.
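- Multilingual Support: Because gTTS and Google Gemini both handle many languages, the assistant can be adapted to other languages by changing the language codes passed to the speech recognizer and to gTTS (the code above uses English for both), making it suitable for global users.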
- Generative AI for Intelligent Responses: Google Gemini’s advanced capabilities allow the assistant to handle a wide range of questions, generating natural, human-like responses in real time.
- Streamlit for Simplicity: The use of Streamlit simplifies deployment, offering a sleek interface for users while reducing the complexity involved in web development.
Conclusion
Building a multilingual AI assistant is an exciting project that combines the power of speech recognition, generative AI, and user-friendly platforms like Streamlit. This solution showcases how various tools and libraries can be integrated to create a functional, accessible, and intelligent assistant.
AI has immense potential, and projects like these pave the way for more innovative applications. Whether for personal use, education, or business, such assistants have the potential to revolutionize how we interact with technology, offering seamless, intuitive, and efficient communication.