Building a Multilingual AI Assistant: Harnessing Speech Recognition, Google Gemini, and Streamlit

Nasir Uddin Ahmed

Lecturer | Data Scientist | Artificial Intelligence | Data & Machine Learning Modeling Expert | Data Mining | Python | Power BI | SQL | ETL Processes | Dean's List Award Recipient, Universiti Malaya.

Published: September 14, 2024

In today's digital era, artificial intelligence (AI) is making vast strides, integrating into everyday applications and offering unprecedented convenience. One exciting development is the multilingual virtual assistant, which can understand and respond in more than one language. In this article, I walk through building such an assistant with the SpeechRecognition library, Google's Gemini generative AI model, the gTTS text-to-speech library, and Streamlit, which provides a simple, interactive user interface.

The Idea: An AI Assistant with a Voice

The idea behind this project is simple: build an assistant that listens to your voice, understands your query, processes it with a powerful AI model, and answers by speaking back in a natural-sounding voice. The assistant is designed to be multilingual. The speech recognizer, the Gemini model, and the text-to-speech engine all work with many languages, which makes the assistant more accessible to different groups of users.

Key Components

Speech Recognition with the speech_recognition library: The assistant starts by capturing voice input with the SpeechRecognition Python package (imported as speech_recognition). It records audio from the microphone (through PyAudio) and converts it to text, making it the first stage of the pipeline.
Text Generation with Google Gemini: To generate human-like responses, I used Google's Gemini model, which interprets the user's question and produces a context-aware answer.
Text-to-Speech with gTTS: Once the AI generates a response, the text is converted into speech with gTTS (Google Text-to-Speech), a Python library that uses Google Translate's text-to-speech service. The resulting MP3 file can be played back to the user or downloaded for later use.
Interactive User Interface with Streamlit: Finally, all the components are tied together using Streamlit, a powerful and easy-to-use library for creating web apps in Python. The app listens to user queries, processes them via the Google Gemini AI model, and responds both as text and speech.

Breaking Down the Code

Let's walk through the key parts of the assistant:

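Add the following packages to requirements.txt. PyAudio provides the microphone access that speech_recognition relies on, and pipwin is an optional Windows helper for installing prebuilt PyAudio binaries (pipwin install pyaudio) when pip cannot build PyAudio itself: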

SpeechRecognition
pyaudio
google-generativeai
gTTS
pipwin
streamlit
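After installing the requirements (for example with pip install -r requirements.txt), it can help to confirm that every package is importable, because several import names differ from their pip names. The following is a minimal sketch, assuming the import names listed in the dictionary; the missing_packages helper is hypothetical, and pipwin is left out because it is an installer tool rather than a library the app imports.

```python
import importlib.util

# pip package name -> module name used in `import` statements (assumed mapping)
REQUIRED_MODULES = {
    "SpeechRecognition": "speech_recognition",
    "pyaudio": "pyaudio",
    "google-generativeai": "google.generativeai",
    "gTTS": "gtts",
    "streamlit": "streamlit",
}


def missing_packages(required=REQUIRED_MODULES):
    """Return the pip names of packages whose modules cannot be found."""
    missing = []
    for package, module in required.items():
        try:
            found = importlib.util.find_spec(module) is not None
        except ModuleNotFoundError:
            # find_spec raises if the parent package (e.g. "google") is absent
            found = False
        if not found:
            missing.append(package)
    return missing


if __name__ == "__main__":
    print("Missing packages:", missing_packages() or "none")
```

find_spec() locates a module without importing it, so the check is cheap and has no side effects on the app.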

1. Setting Up Logging for Debugging:

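import logging
import os

# Logger configuration for the application
LOG_DIR = "logs"
LOG_FILE_NAME = "application.log"

# Create the log directory if it does not already exist
os.makedirs(LOG_DIR, exist_ok=True)

log_path = os.path.join(LOG_DIR, LOG_FILE_NAME)

# Write timestamped messages of level INFO and above to logs/application.log
logging.basicConfig(
    filename=log_path,
    format="[ %(asctime)s ] %(name)s - %(levelname)s - %(message)s",
    level=logging.INFO,
)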

This code creates a logs directory if needed and configures Python's logging module to write timestamped messages of level INFO and above to logs/application.log. These logs make it easier to follow what the app is doing and to diagnose errors at runtime.
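Beyond logging.info(), the logging module can record a full traceback with logging.exception(), which is often more useful when something fails deep inside a library call. A minimal sketch follows; the setup_logging and risky_division helpers are hypothetical, and force=True (available since Python 3.8) is used so the configuration can be re-applied within the same process.

```python
import logging
import os
import tempfile


def setup_logging(log_dir="logs", file_name="application.log"):
    """Configure root logging to write to a file and return the file's path."""
    os.makedirs(log_dir, exist_ok=True)
    log_path = os.path.join(log_dir, file_name)
    logging.basicConfig(
        filename=log_path,
        format="[ %(asctime)s ] %(name)s - %(levelname)s - %(message)s",
        level=logging.INFO,
        force=True,  # replace any handlers configured earlier
    )
    return log_path


def risky_division(a, b):
    """Hypothetical example of an operation whose failure should be logged."""
    try:
        return a / b
    except ZeroDivisionError:
        # logging.exception logs at ERROR level and appends the traceback
        logging.exception("Division failed for a=%s, b=%s", a, b)
        return None


if __name__ == "__main__":
    path = setup_logging(tempfile.mkdtemp())
    risky_division(1, 0)
    with open(path) as f:
        print(f.read())
```

Calling logging.exception() only inside an except block matters: outside one, there is no active exception whose traceback could be attached.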

2. Capturing User Voice Input:

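import logging
import speech_recognition as sr

def takeCommand(language="en-IN"):
    # Returns the recognized text, or the string "None" if recognition fails
    r = sr.Recognizer()
    with sr.Microphone() as source:
        print("Listening...")
        # Calibrate the energy threshold to the room's background noise
        r.adjust_for_ambient_noise(source, duration=0.5)
        audio = r.listen(source)

    try:
        # Transcribe the audio with the Google Web Speech API
        query = r.recognize_google(audio, language=language)
        print(f"User said: {query}")
    except sr.UnknownValueError:
        logging.warning("Speech was captured but could not be understood")
        return "None"
    except sr.RequestError as e:
        logging.error("Speech recognition request failed: %s", e)
        return "None"
    return query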

The takeCommand() function records from the microphone until the user stops speaking, then sends the audio to the Google Web Speech API through recognize_google() to convert it to text. If the speech cannot be understood or the service cannot be reached, the error is logged and the function returns the placeholder string "None" instead of crashing the application.
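Because takeCommand() signals a failed recognition with the string "None", a caller can ask the user to repeat themselves a few times before giving up. A minimal sketch follows; listen_with_retries and its max_attempts parameter are hypothetical, and it accepts any zero-argument function (such as takeCommand) so it can be tried without a microphone.

```python
from typing import Callable, Optional


def listen_with_retries(listen: Callable[[], str], max_attempts: int = 3) -> Optional[str]:
    """Call `listen` until it returns real text or the attempts run out.

    `listen` follows takeCommand()'s convention of returning the string
    "None" when speech could not be recognized.
    """
    for attempt in range(1, max_attempts + 1):
        text = listen()
        if text and text != "None":
            return text
        print(f"Attempt {attempt} failed, please speak again...")
    return None  # every attempt failed


if __name__ == "__main__":
    # A stand-in for takeCommand() that fails twice, then succeeds
    replies = iter(["None", "None", "What is the capital of Malaysia?"])
    print(listen_with_retries(lambda: next(replies)))
```

In the app, listen_with_retries(takeCommand) would be called in place of takeCommand(), and a None result would mean the user should be told that nothing was understood.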

3. Processing User Input with Google Gemini AI:

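import os
import google.generativeai as genai

def gemini_model(user_input):
    # API key read from an environment variable (GOOGLE_API_KEY is assumed here)
    genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
    # Model names change over time; check the list of currently available models
    model = genai.GenerativeModel("gemini-pro")
    response = model.generate_content(user_input)
    return response.text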

This function sends the transcribed text to the Gemini model as a prompt and returns the generated reply as plain text (response.text). Because each call to generate_content() is independent, the model sees only the current question, not earlier turns of the conversation.
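Since gemini_model() passes the transcript straight through as the prompt, the reply language is left to the model. One way to steer it is to add an explicit instruction to the prompt. A minimal sketch follows; build_prompt and the exact instruction wording are hypothetical, and its result would be passed to gemini_model() in place of the raw text.

```python
def build_prompt(user_input: str, reply_language: str = "English") -> str:
    """Wrap the user's transcribed question with a reply-language instruction."""
    question = user_input.strip()
    if not question:
        raise ValueError("user_input is empty; nothing to send to the model")
    return (
        f"Answer the following question in {reply_language}. "
        "Keep the answer short enough to be read aloud.\n\n"
        f"Question: {question}"
    )


if __name__ == "__main__":
    print(build_prompt("  What is Streamlit? ", reply_language="Malay"))
```

For example, gemini_model(build_prompt(text, "Hindi")) would ask for a Hindi answer; the gTTS language would then need to match (lang="hi") for the spoken reply to sound right.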

4. Converting Text to Speech:

from gtts import gTTS

def text_to_speech(text, lang="en"):
    # Synthesize the text with gTTS and save it as an MP3 file
    tts = gTTS(text=text, lang=lang)
    tts.save("speech.mp3")

This function converts the AI-generated response into speech with gTTS and saves it locally as speech.mp3, so it can be played back and downloaded. Because gTTS sends the text to Google Translate's text-to-speech service, this step needs an internet connection.
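Writing every answer to the same speech.mp3 file means two sessions of the app can overwrite each other's audio. gTTS objects also provide write_to_fp(), which writes the MP3 data to any file-like object, so the audio can be kept in memory instead. A minimal sketch follows; synthesize_to_bytes is a hypothetical helper, and its tts_cls parameter exists only so the function can be exercised without network access.

```python
import io


def synthesize_to_bytes(text, lang="en", tts_cls=None):
    """Return MP3 bytes for `text` without writing a file to disk."""
    if tts_cls is None:
        # Imported lazily so this module loads even where gTTS is not installed
        from gtts import gTTS
        tts_cls = gTTS
    buffer = io.BytesIO()
    # write_to_fp streams the synthesized audio into the in-memory buffer
    tts_cls(text=text, lang=lang).write_to_fp(buffer)
    return buffer.getvalue()
```

In main(), audio_bytes = synthesize_to_bytes(response) could then replace the save-and-reopen steps, with the bytes passed to st.audio() and st.download_button() as before.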

5. Bringing Everything Together with Streamlit:

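import streamlit as st

# Assumes takeCommand(), gemini_model() and text_to_speech() are defined in this file


def main():
    st.title("Multilingual AI Assistant")

    if st.button("Ask me anything!"):
        with st.spinner("Listening..."):
            text = takeCommand()
            # takeCommand() returns the string "None" when recognition fails
            if text == "None":
                st.warning("Sorry, I could not understand that. Please try again.")
                return
            response = gemini_model(text)
            text_to_speech(response)

            # Read the generated MP3; the with-block closes the file afterwards
            with open("speech.mp3", "rb") as audio_file:
                audio_bytes = audio_file.read()

            st.text_area(label="Response:", value=response, height=350)
            st.audio(audio_bytes, format="audio/mp3")
            st.download_button(label="Download Speech",
                               data=audio_bytes,
                               file_name="speech.mp3",
                               mime="audio/mp3")


if __name__ == "__main__":
    main()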

Here, Streamlit serves as the front end for the assistant. When the user clicks the "Ask me anything!" button, the assistant listens for a query, generates a response with Gemini, and presents it as text, as playable audio, and as a downloadable MP3 file. With all the functions saved in one script (for example app.py), the app is started with the command streamlit run app.py.
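The steps inside main() (listen, generate, speak) can also be pulled out of the Streamlit code into a plain function, which makes the flow easy to test with stand-ins for the microphone, Gemini, and gTTS. A minimal sketch follows; run_assistant and AssistantResult are hypothetical names, and the three steps are passed in as functions rather than called directly.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class AssistantResult:
    query: Optional[str]     # what the user said, or None if not understood
    response: Optional[str]  # the model's reply, if any
    audio: Optional[bytes]   # MP3 data for the reply, if any


def run_assistant(
    listen: Callable[[], str],
    generate: Callable[[str], str],
    speak: Callable[[str], bytes],
) -> AssistantResult:
    """Run one listen -> generate -> speak cycle."""
    query = listen()
    if not query or query == "None":  # takeCommand()'s failure value
        return AssistantResult(query=None, response=None, audio=None)
    response = generate(query)
    return AssistantResult(query=query, response=response, audio=speak(response))
```

With this split, main() only decides what to display based on the result. Note that speak must return MP3 bytes, so text_to_speech() as written would need a small wrapper that reads speech.mp3 after saving it.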

The Benefits of This Approach

Voice-first Interaction: Using voice input enables hands-free operation and makes the assistant accessible to a broader audience, including users who prefer or need to interact with technology via speech.
Multilingual Support: Google's speech recognition, Gemini, and gTTS all support many languages, so the assistant can be adapted to another language by changing the language codes passed to the recognizer and to gTTS. This makes it suitable for users around the world.
Generative AI for Intelligent Responses: Gemini's capabilities allow the assistant to handle a wide range of questions and to generate natural, human-like responses.
Streamlit for Simplicity: Streamlit simplifies deployment and provides a clean interface while removing most of the complexity of web development. One caveat: sr.Microphone records from the machine running the Python process, so as written the app is meant to be run locally rather than hosted for remote users.
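Switching languages in practice means passing matching codes to two different services: recognize_google() expects a language tag such as "en-IN", while gTTS expects its own language code such as "en". A minimal sketch of keeping those pairs together follows; the LANGUAGES table and codes_for helper are hypothetical, and the specific codes are assumptions to check against the Google speech recognition and gTTS language lists.

```python
# Display name -> (speech recognition language tag, gTTS language code).
# These pairs are illustrative assumptions; verify them against each
# service's list of supported languages before relying on them.
LANGUAGES = {
    "English (India)": ("en-IN", "en"),
    "Hindi": ("hi-IN", "hi"),
    "Bengali": ("bn-BD", "bn"),
    "Malay": ("ms-MY", "ms"),
}


def codes_for(language_name):
    """Return (recognizer_tag, tts_code) for a display name."""
    try:
        return LANGUAGES[language_name]
    except KeyError:
        supported = ", ".join(sorted(LANGUAGES))
        raise ValueError(
            f"Unsupported language {language_name!r}; choose one of: {supported}"
        ) from None
```

In the Streamlit app, st.selectbox("Language", list(LANGUAGES)) could let the user pick a name, with the first code passed as the language argument of recognize_google() and the second as the lang argument of gTTS.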

Conclusion

Building a multilingual AI assistant is an exciting project that combines the power of speech recognition, generative AI, and user-friendly platforms like Streamlit. This solution showcases how various tools and libraries can be integrated to create a functional, accessible, and intelligent assistant.

AI has immense potential, and projects like these pave the way for more innovative applications. Whether for personal use, education, or business, such assistants have the potential to revolutionize how we interact with technology, offering seamless, intuitive, and efficient communication.
