With Google Gemma 2 LLM – How to set up a Personal Voice AI Assistant on a Local Workstation with NVIDIA GPU
Prabir Guha, MS Comp. Sc.
LLM Builder, Researcher. I regularly publish articles on LLM building, domain adaptation, and deployment.
This article shows how to set up a powerful local AI assistant using the Gemma 2 LLM served via the efficient llama.cpp server, combined with Retrieval-Augmented Generation (RAG) backed by a FAISS vector index, Whisper for speech-to-text transcription, and Piper for text-to-speech synthesis. The assistant runs entirely on a local Ubuntu workstation with an NVIDIA GPU.
Prerequisites
Before you start, make sure you have the following:
The CPU and GPU requirements to run the described setup are very modest. I used a ten-year-old HP Z230 with a 4th Generation Core i7 CPU (3.6GHz) and 32GB of RAM. The GPU is an eight-year-old NVIDIA GeForce GTX 1080 Ti with 3584 CUDA cores and 11GB of VRAM. The OS on my workstation is Ubuntu 20.04 (LTS). To verify that the GPU and a suitable NVIDIA CUDA version are installed, run the following command:
prabir@prabir-HP-Z230:~$ nvidia-smi

Fri Jan 17 10:22:47 2025
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.216.03             Driver Version: 535.216.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce GTX 1080 Ti     Off | 00000000:05:00.0  On |                  N/A |
| 26%   26C    P8              11W / 150W |    549MiB / 11264MiB |     10%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                             |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      1804      G   /usr/lib/xorg/Xorg                           71MiB |
|    0   N/A  N/A      3130      G   /usr/lib/xorg/Xorg                          328MiB |
|    0   N/A  N/A      3323      G   /usr/bin/gnome-shell                         53MiB |
|    0   N/A  N/A      5752      G   ...61653672,7649573882758483658,262144       80MiB |
+---------------------------------------------------------------------------------------+
Note that on my system, nvidia-smi shows CUDA Version 12.2 with an NVIDIA GeForce GTX 1080 Ti GPU.
Step 1: Set up the Environment
If you are running an older version of CUDA, upgrade CUDA first; this setup assumes CUDA 12.x.
1.1. Update your system
Start by ensuring that your system packages are up to date:
sudo apt update && sudo apt upgrade -y
1.2. Install system dependencies
You'll need several system libraries for GPU acceleration and other dependencies:
sudo apt install -y build-essential python3 python3-dev python3-pip \
python3-venv git ffmpeg libsndfile1 libsm6 libxext6 libxrender-dev libglib2.0-0 cmake libasound2-dev portaudio19-dev g++ libopenblas-dev
Step 2: Set Up the Python Virtual Environment
2.1. Create and activate a Python virtual environment
To keep dependencies isolated and avoid conflicts with system packages, we'll create a virtual environment:
python3 -m venv ai-assistant-env
source ai-assistant-env/bin/activate
2.2. Upgrade pip and setuptools
It's a good idea to upgrade pip to the latest version:
pip install --upgrade pip setuptools
2.3. Install PyTorch and the other required Python packages
pip install torch==2.5
pip install openai-whisper requests sounddevice numpy==1.24.4 faiss-gpu sentence-transformers onnxruntime
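Before proceeding, it is worth confirming that PyTorch can actually see the GPU from inside the virtual environment. The following one-liner prints your GPU's name if CUDA is available, and False otherwise (if it prints False, revisit the CUDA installation from the Prerequisites section):
python3 -c "import torch; print(torch.cuda.is_available() and torch.cuda.get_device_name(0))"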
Step 3: Install and compile llama.cpp and Set Up Gemma2 LLM Model
Gemma 2 is a language model that you'll run locally using llama.cpp, an optimized C/C++ inference engine with a built-in HTTP server, designed to run GGUF-format models with high performance and efficiency.
3.1. Clone the llama.cpp repository
First, clone the llama.cpp repository:
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
3.2. Verify installed dependencies
Ensure that the system dependencies listed in Step 1.2 were all installed without errors. Additionally, a recent cmake is needed to compile llama.cpp (this setup was tested with cmake 3.31.2); an older cmake version may not work.
3.3. Build llama.cpp
Run the following commands to build the llama.cpp server with CUDA support enabled, so that layers can later be offloaded to the GPU (on older llama.cpp versions the CUDA option is named LLAMA_CUBLAS instead of GGML_CUDA):
mkdir build
cd build
cmake .. -DGGML_CUDA=ON
make
If the compilation completes without errors, the build directory should contain many executables, including llama-cli and llama-server.
3.4. Download Gemma2 Model
Download the?Gemma2?model. You can get the model from several sources (including Hugging Face). Once downloaded, place the model files in a directory on your local machine.
wget -O gemma-2-2b-it-Q4_K_S.gguf https://huggingface.co/bartowski/gemma-2-2b-it-GGUF/resolve/main/gemma-2-2b-it-Q4_K_S.gguf?download=true
mv gemma-2-2b-it-Q4_K_S.gguf ../models/
The file gemma-2-2b-it-Q4_K_S.gguf contains Gemma 2 2B (a 2-billion-parameter model from Google) saved in the GGUF format, which is designed for efficient inference on devices with limited resources, such as a personal workstation. The Q4_K_S suffix indicates the quantization scheme applied to the model for further performance gains. In short, it is a compact, optimized version of the Gemma 2 2B language model, well suited to tasks like text generation on personal workstations.
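Optionally, you can sanity-check the downloaded model with the llama-cli executable built in Step 3.3 before starting the server (flag names can vary slightly between llama.cpp versions; run ./build/bin/llama-cli --help if in doubt):
./build/bin/llama-cli -m ../models/gemma-2-2b-it-Q4_K_S.gguf -p "Explain what GGUF is in one sentence." -n 64 --gpu-layers 30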
3.5. Run the llama.cpp server
Start the llama.cpp server on your workstation with the Gemma2 model that you downloaded.
./build/bin/llama-server -m ../models/gemma-2-2b-it-Q4_K_S.gguf --port 8080 -t 4 -c 1024 --gpu-layers 30 --host 192.168.1.251
Here, -t 4 specifies 4 CPU threads for parallelism (adjust based on your CPU), -c 1024 sets the context size, and --gpu-layers 30 offloads 30 model layers to the GPU. The --host and --port flags control where the server listens; replace 192.168.1.251 with your workstation's IP address, or use 127.0.0.1 for local-only access.
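To confirm the server is reachable before wiring up the full assistant, you can send a minimal request to its /completion endpoint from Python (a small sketch; replace the IP address with whatever you passed to --host):
import requests

# Minimal smoke test for the llama.cpp completion endpoint
resp = requests.post(
    "http://192.168.1.251:8080/completion",
    json={"prompt": "Briefly, what is a vector database?", "n_predict": 48},
)
print(resp.json().get("content", ""))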
Step 4: Install and Set Up Piper for Text-to-Speech
Piper is a lightweight TTS engine that runs locally.
4.1. Install Piper
To install Piper, first clone the repository:
git clone https://github.com/rhasspy/piper.git
cd piper
mkdir build
cd build
cmake ..
make
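Piper also needs a voice model in addition to the binary. The script later in this article expects the en_US-lessac-medium voice under /usr/local/share/piper/models/; one way to fetch it is shown below (the exact URLs may change, so check the rhasspy/piper-voices repository on Hugging Face if the download fails), followed by a quick test from the build directory:
sudo mkdir -p /usr/local/share/piper/models
sudo wget -O /usr/local/share/piper/models/en_US-lessac-medium.onnx "https://huggingface.co/rhasspy/piper-voices/resolve/main/en/en_US/lessac/medium/en_US-lessac-medium.onnx?download=true"
sudo wget -O /usr/local/share/piper/models/en_US-lessac-medium.onnx.json "https://huggingface.co/rhasspy/piper-voices/resolve/main/en/en_US/lessac/medium/en_US-lessac-medium.onnx.json?download=true"
echo "Piper is working" | ./piper --model /usr/local/share/piper/models/en_US-lessac-medium.onnx --output_file test.wav && aplay test.wav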
Step 5: Set Up FAISS for Retrieval-Augmented Generation (RAG)
RAG combines a large language model (LLM) with a search index (here, FAISS) so that relevant data can be retrieved at answer time. This allows the assistant to answer questions based on indexed documents: the pre-trained LLM is paired with an efficient retrieval system, so its answers are grounded in your own datasets or document collections, which greatly reduces hallucinations.
5.1. Install FAISS
We already installed FAISS (Facebook AI Similarity Search) in Step 2.3. Since we are using an NVIDIA GPU, we installed the GPU version:
pip install faiss-gpu
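A quick way to confirm the GPU build imported correctly is to check how many GPUs FAISS can see:
import faiss
print(faiss.get_num_gpus())  # Should print 1 (or more) on a working GPU install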
5.2. Create a FAISS Vector Database
You'll need a dataset to index. For simplicity, let's assume you have a collection of text snippets; you can use any collection of documents (e.g., PDFs, articles). The following Python code sets up the FAISS vector database using the all-MiniLM-L6-v2 sentence-embedding model.
import whisper, requests, os, sounddevice as sd, numpy as np, tempfile, wave
import faiss
from sentence_transformers import SentenceTransformer
import torch

# Use GPU if available, otherwise use CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Sentence-embedding model used to vectorize documents and queries
embedding_model = SentenceTransformer('all-MiniLM-L6-v2')

# Vector database class to handle document embedding and search using FAISS
class VectorDatabase:
    def __init__(self, dim):
        # Create a FAISS index with the embedding dimension (384 for all-MiniLM-L6-v2 embeddings)
        self.index = faiss.IndexFlatL2(dim)
        self.documents = []

    # Add documents and their embeddings to the FAISS index
    def add_documents(self, docs):
        embeddings = embedding_model.encode(docs)  # Get embeddings for the docs
        self.index.add(np.array(embeddings, dtype=np.float32))  # Add them to the FAISS index
        self.documents.extend(docs)

    # Search for the top K most relevant documents based on the query embedding
    def search(self, query, top_k=3):
        query_embedding = embedding_model.encode([query])[0].astype(np.float32)
        distances, indices = self.index.search(np.array([query_embedding]), top_k)
        return [self.documents[i] for i in indices[0]]


# Your own information/documents to be used in Retrieval-Augmented Generation (RAG)
docs = [
    "The Oracle of Delphi was a sacred site in ancient Greece where a priestess, known as the Pythia, was believed to deliver prophetic messages from the god Apollo, providing guidance and answers to questions about the future to those who sought her counsel; it was considered one of the most important religious centers in the Greek world, with people traveling from far and wide to consult the oracle.",
    "Key points about the Oracle of Delphi: Location: Situated on the slopes of Mount Parnassus in central Greece. Priestess: The Pythia, a woman who would enter a trance-like state to deliver prophecies.",
    "God associated: Apollo, the Greek god of prophecy, music, and healing. Significance: People from across Greece would seek advice from the Oracle on important matters like war, politics, and personal decisions. Cryptic nature: The Oracle's answers were often ambiguous and open to interpretation, requiring careful analysis."
]

# Create a VectorDatabase and add documents to it
db = VectorDatabase(dim=384)
db.add_documents(docs)
Now you have a vector index ready to use with the LLM for retrieval.
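For example, you can verify retrieval works on its own before involving the LLM (the exact document returned depends on what you indexed):
# Retrieve the two most relevant documents for a sample question
for doc in db.search("Where was the Oracle of Delphi located?", top_k=2):
    print(doc[:80], "...")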
Step 6: Load the Whisper speech-to-text model and record audio from the microphone
# Load the Whisper speech-to-text model for voice transcription
whisper_model = whisper.load_model("base").to(device)

# Current directory and paths for beep sound files (used to indicate recording start and end)
current_dir = os.path.dirname(os.path.abspath(__file__))
bip_sound = os.path.join(current_dir, "assets/bip.wav")
bip2_sound = os.path.join(current_dir, "assets/bip2.wav")

# Play a sound (beep) to signal recording start/stop
def play_sound(sound_file):
    os.system(f"aplay {sound_file}")

# Record audio using sounddevice and save it as a 10-second .wav file
def record_audio(filename, duration=10, fs=16000):
    play_sound(bip_sound)  # Start beep
    print("10 seconds recording started...")

    audio = sd.rec(int(duration * fs), samplerate=fs, channels=1, dtype='int16')
    sd.wait()  # Wait for the recording to complete
    with wave.open(filename, 'wb') as wf:
        wf.setnchannels(1)
        wf.setsampwidth(2)
        wf.setframerate(fs)
        wf.writeframes(audio.tobytes())
    play_sound(bip2_sound)  # End beep
    print("recording completed")
6.1. Transcribe the recorded audio to text with Whisper
# Transcribe recorded audio to text using Whisper
def transcribe_audio(filename):
    return whisper_model.transcribe(filename, language="en")['text']
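If you want to test the microphone and transcription path in isolation (assuming a working microphone and the beep files in place), a short check looks like this:
# Record a short clip and print what Whisper heard
with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as tmp:
    record_audio(tmp.name)
    print("You said:", transcribe_audio(tmp.name))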
Step 7: Set up an initial prompt to steer the response behavior of the pre-trained Gemma 2 LLM and guide its output toward the desired domain.

# Initial prompt to guide the model's behavior
initial_prompt = ("You're an AI assistant specialized in AI development. "
                  "Answer questions clearly and concisely in a friendly, professional tone. Do not use asterisks, do not ask new questions "
                  "or act as the user. Keep replies short to speed up inference. If unsure, admit it and suggest looking into it further.")
Step 8: Send the query and the retrieved context to the llama.cpp server. If all goes well, the LLM returns a focused response.
# Configuration for the local LLM server (plain HTTP; adjust the IP address to match your --host setting)
llama_url = "http://192.168.1.251:8080/completion"

# Send a query and context to the llama.cpp server for completion
def ask_llama(query, context):
    data = {
        "prompt": f"{initial_prompt}\nContext: {context}\nQuestion: {query}\nAnswer:",
        "n_predict": 80,    # Limit response length to avoid delays
        "temperature": 0.7  # Adjust temperature for balanced responses
    }
    response = requests.post(llama_url, json=data, headers={'Content-Type': 'application/json'})
    if response.status_code == 200:
        return response.json().get('content', '').strip()
    else:
        return f"Error: {response.status_code}"

# Generate a response using Retrieval-Augmented Generation (RAG)
def rag_ask(query):
    context = " ".join(db.search(query))  # Search for related docs in the FAISS index
    return ask_llama(query, context)  # Ask the LLM using the retrieved context
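At this point the text path can already be tested end to end, assuming the llama-server from Step 3.5 is running:
print(rag_ask("What is the Oracle of Delphi?"))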
Step 9: Convert the answer text to speech with Piper TTS
# Convert text to speech using the Piper TTS model
def text_to_speech(text):
    os.system(f'echo "{text}" | /home/prabir/piper/build/piper --model /usr/local/share/piper/models/en_US-lessac-medium.onnx --output_file response.wav && aplay response.wav')
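Note that the reply text is interpolated into a shell command, so quotes in the model's answer can break the echo pipeline. A slightly more robust variant (a sketch using the same binary and voice paths) feeds the text to Piper over stdin instead:
import subprocess

def text_to_speech_safe(text):
    # Feed the text to Piper via stdin so quotes in the reply cannot break the command
    subprocess.run(
        ["/home/prabir/piper/build/piper",
         "--model", "/usr/local/share/piper/models/en_US-lessac-medium.onnx",
         "--output_file", "response.wav"],
        input=text.encode("utf-8"),
        check=True,
    )
    subprocess.run(["aplay", "response.wav"], check=True)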
Step 10: Test your AI assistant. This is the main() loop.
# Main loop for the assistant
def main():
    while True:
        # Create a temporary .wav file for the recording
        with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as tmpfile:
            record_audio(tmpfile.name)  # Record the audio input
            transcribed_text = transcribe_audio(tmpfile.name)  # Convert speech to text
            print(f"Agent heard: {transcribed_text}")
            response = rag_ask(transcribed_text)  # Generate response using RAG and the LLM
            print(f"Agent response: {response}")
            if response:
                text_to_speech(response)  # Convert response to speech

# Entry point of the script
if __name__ == "__main__":
    main()
Let us fire up a web browser and connect to the llama.cpp server on port 8080 on the local workstation, which is serving our plain pre-trained Gemma 2 LLM. On port 8080, the browser connects to the server's built-in chat interface. Now, let us ask the LLM a generic question, “What is oracle?” Here is the answer it gives us:
You're asking about the word "oracle"! It has two main meanings, and which one you're interested in depends on how you're using it.
1. The Ancient Oracle:
2. The Modern Oracle:
So, to answer your question:
"An oracle" is a word with two main meanings, depending on the context. One refers to a person or place thought to be a conduit to the divine, and the other is a tool for data analysis.
Let me know if you want to know more about one of these meanings in particular!
The LLM offers two likely subjects matching our original query about “oracle”: the modern software/database sense and the mystical fortune-tellers of ancient Greece. Now suppose we are only interested in the oracles of ancient Greece. We therefore provide the LLM with additional information about the Oracle of ancient Greece, along with directives for the overall response behavior, to limit its answers to that domain. See the snippets below:
# Your own information/documents to be used in Retrieval-Augmented Generation (RAG)
docs = [
    "The Oracle of Delphi was a sacred site in ancient Greece where a

<<truncated>>
<<truncated>>

requiring careful analysis."
]

# Initial prompt to guide the model's behavior
initial_prompt = ("You're an AI assistant specialized in AI development. "
                  "Answer questions clearly and concisely in a friendly, professional tone. Do not use asterisks, do not ask new questions "
                  "or act as the user. Keep replies short to speed up inference. If unsure, admit it and suggest looking into it further.")
We are ready to run the Python program, voice-ai-assistant.py, which contains all the code snippets discussed so far. Once started, the program waits for your voice input, and we ask the same question as before: “What is Oracle?” The program transcribes the question into text, combines it with the vectorized domain-specific information (the mystical Oracle of Delphi), submits the combined prompt to the LLM server, receives the response from the LLM, generates a voice response from it, and speaks it back through the speaker.
recording completed
Agent heard: What is Oracle?
Agent response: The Oracle is a sacred site where a priestess, known as the Pythia, delivers prophetic messages from the god Apollo.
[2025-01-20 17:59:10.835] [piper] [info] Loaded voice in 0.307025036 second(s)
[2025-01-20 17:59:10.836] [piper] [info] Initialized piper
response.wav
As can be seen, this time the answer is limited to the domain we are interested in, the ancient Greek Oracle, and says nothing about Oracle database software.
In conclusion, setting up a personal AI assistant with the powerful Gemma 2 LLM on a local Ubuntu workstation with an NVIDIA GPU offers an efficient and scalable way to deploy advanced AI on your own hardware. By integrating a robust stack that includes llama.cpp for model serving, Whisper for speech-to-text, Piper for text-to-speech, and FAISS for Retrieval-Augmented Generation, you can build a fully functional assistant capable of handling complex tasks. This setup delivers solid performance and the flexibility of running everything locally, keeping you in control of data privacy and resource usage. Whether for personal use or development purposes, this guide offers a clear path to creating an intelligent, responsive, and efficient AI assistant tailored to your needs.