With Google Gemma 2 LLM – How to set up a Personal Voice AI Assistant on a Local Workstation with NVIDIA GPU
Prabir Guha, MS Comp. Sc.
LLM Builder, Researcher. I regularly publish articles on LLM building, domain adaptation, and deployment.
This article shows how to set up a powerful local AI assistant using the Gemma 2 LLM served via the efficient llama.cpp server, combined with Retrieval-Augmented Generation (RAG) backed by a FAISS vector index, Whisper for speech-to-text transcription, and Piper for text-to-speech synthesis. The assistant runs entirely on a local Ubuntu workstation with an NVIDIA GPU.
Prerequisites
Before you start, make sure you have the following:
The CPU and GPU requirements to run the described setup are very modest. I used a ten-year-old HP Z230 with a 4th Generation Core i7 CPU (3.6GHz) and 32GB of RAM. The GPU is an eight-year-old NVIDIA GeForce GTX 1080 Ti with 3584 CUDA cores and 11GB of VRAM. The OS on my workstation is Ubuntu 20.04 (LTS). To verify that the GPU and a suitable NVIDIA CUDA version are installed, run the following command:
prabir@prabir-HP-Z230:~$ nvidia-smi

Fri Jan 17 10:22:47 2025
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.216.03             Driver Version: 535.216.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce GTX 1080 Ti     Off | 00000000:05:00.0  On |                  N/A |
| 26%   26C    P8              11W / 150W |    549MiB / 11264MiB |     10%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                             |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      1804      G   /usr/lib/xorg/Xorg                           71MiB |
|    0   N/A  N/A      3130      G   /usr/lib/xorg/Xorg                          328MiB |
|    0   N/A  N/A      3323      G   /usr/bin/gnome-shell                         53MiB |
|    0   N/A  N/A      5752      G   ...61653672,7649573882758483658,262144       80MiB |
+---------------------------------------------------------------------------------------+
Note that on my system, nvidia-smi shows CUDA Version 12.2 with an NVIDIA GeForce GTX 1080 Ti GPU.
Step 1: Set up the Environment
If you are running an older version of CUDA, upgrade CUDA first; this setup assumes CUDA 12.x.
1.1. Update your system
Start by ensuring that your system packages are up to date:
sudo apt update && sudo apt upgrade -y
1.2. Install system dependencies
You'll need several system libraries for GPU acceleration and other dependencies:
sudo apt install -y build-essential python3 python3-dev python3-pip \
python3-venv git ffmpeg libsndfile1 libsm6 libxext6 libxrender-dev libglib2.0-0 cmake libasound2-dev portaudio19-dev g++ libopenblas-dev
Step 2: Set Up the Python Virtual Environment
2.1. Create and activate a Python virtual environment
To keep dependencies isolated and avoid conflicts with system packages, we'll create a virtual environment:
python3 -m venv ai-assistant-env
source ai-assistant-env/bin/activate
2.2. Upgrade pip and setuptools
It's a good idea to upgrade pip to the latest version:
pip install --upgrade pip setuptools
2.3. Install PyTorch and the other required Python packages
pip install torch==2.5
pip install openai-whisper requests sounddevice numpy==1.24.4 faiss-gpu sentence-transformers onnxruntime
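Before proceeding, it is worth confirming that PyTorch can actually see the GPU from inside the virtual environment. The following one-liner prints your GPU's name if CUDA is available, and False otherwise (if it prints False, revisit the CUDA installation from the Prerequisites section):
python3 -c "import torch; print(torch.cuda.is_available() and torch.cuda.get_device_name(0))"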
Step 3: Install and compile llama.cpp and Set Up Gemma2 LLM Model
Gemma 2 is a language model that you'll run locally using llama.cpp, an optimized C/C++ inference engine with a built-in HTTP server, designed to run GGUF-format models with high performance and efficiency.
3.1. Clone the llama.cpp repository
First, clone the llama.cpp repository:
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
3.2. Verify installed dependencies
Ensure that the system dependencies listed in Step 1.2 were all installed without errors. Additionally, a recent cmake is needed to compile llama.cpp (this setup was tested with cmake 3.31.2); an older cmake version may not work.
3.3. Build llama.cpp
Run the following commands to build the llama.cpp server with CUDA support enabled, so that layers can later be offloaded to the GPU (on older llama.cpp versions the CUDA option is named LLAMA_CUBLAS instead of GGML_CUDA):
mkdir build
cd build
cmake .. -DGGML_CUDA=ON
make
If the compilation completes without errors, the build directory should contain many executables, including llama-cli and llama-server.
3.4. Download Gemma2 Model
Download the?Gemma2?model. You can get the model from several sources (including Hugging Face). Once downloaded, place the model files in a directory on your local machine.
wget -O gemma-2-2b-it-Q4_K_S.gguf https://huggingface.co/bartowski/gemma-2-2b-it-GGUF/resolve/main/gemma-2-2b-it-Q4_K_S.gguf?download=true
mv gemma-2-2b-it-Q4_K_S.gguf ../models/
The file gemma-2-2b-it-Q4_K_S.gguf contains Gemma 2 2B (a 2-billion-parameter model from Google) saved in the GGUF format, which is designed for efficient inference on devices with limited resources, such as a personal workstation. The Q4_K_S suffix indicates the quantization scheme applied to the model for further performance gains. In short, it is a compact, optimized version of the Gemma 2 2B language model, well suited to tasks like text generation on personal workstations.
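Optionally, you can sanity-check the downloaded model with the llama-cli executable built in Step 3.3 before starting the server (flag names can vary slightly between llama.cpp versions; run ./build/bin/llama-cli --help if in doubt):
./build/bin/llama-cli -m ../models/gemma-2-2b-it-Q4_K_S.gguf -p "Explain what GGUF is in one sentence." -n 64 --gpu-layers 30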
3.5. Run the llama.cpp server
Start the llama.cpp server on your workstation with the Gemma2 model that you downloaded.
./build/bin/llama-server -m ../models/gemma-2-2b-it-Q4_K_S.gguf --port 8080 -t 4 -c 1024 --gpu-layers 30 --host 192.168.1.251
Here, -t 4 specifies 4 CPU threads for parallelism (adjust based on your CPU), -c 1024 sets the context size, and --gpu-layers 30 offloads 30 model layers to the GPU. The --host and --port flags control where the server listens; replace 192.168.1.251 with your workstation's IP address, or use 127.0.0.1 for local-only access.
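To confirm the server is reachable before wiring up the full assistant, you can send a minimal request to its /completion endpoint from Python (a small sketch; replace the IP address with whatever you passed to --host):
import requests

# Minimal smoke test for the llama.cpp completion endpoint
resp = requests.post(
    "http://192.168.1.251:8080/completion",
    json={"prompt": "Briefly, what is a vector database?", "n_predict": 48},
)
print(resp.json().get("content", ""))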
Step 4: Install and Set Up Piper for Text-to-Speech
Piper is a lightweight TTS engine that runs locally.
4.1. Install Piper
To install Piper, first clone the repository:
git clone https://github.com/rhasspy/piper.git
cd piper
mkdir build
cd build
cmake ..
make
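Piper also needs a voice model in addition to the binary. The script later in this article expects the en_US-lessac-medium voice under /usr/local/share/piper/models/; one way to fetch it is shown below (the exact URLs may change, so check the rhasspy/piper-voices repository on Hugging Face if the download fails), followed by a quick test from the build directory:
sudo mkdir -p /usr/local/share/piper/models
sudo wget -O /usr/local/share/piper/models/en_US-lessac-medium.onnx "https://huggingface.co/rhasspy/piper-voices/resolve/main/en/en_US/lessac/medium/en_US-lessac-medium.onnx?download=true"
sudo wget -O /usr/local/share/piper/models/en_US-lessac-medium.onnx.json "https://huggingface.co/rhasspy/piper-voices/resolve/main/en/en_US/lessac/medium/en_US-lessac-medium.onnx.json?download=true"
echo "Piper is working" | ./piper --model /usr/local/share/piper/models/en_US-lessac-medium.onnx --output_file test.wav && aplay test.wav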
Step 5: Set Up FAISS for Retrieval-Augmented Generation (RAG)
RAG combines a large language model (LLM) with a search index (here, FAISS) so that relevant data can be retrieved at answer time. This allows the assistant to answer questions based on indexed documents: the pre-trained LLM is paired with an efficient retrieval system, so its answers are grounded in your own datasets or document collections, which greatly reduces hallucinations.
5.1. Install FAISS
We already installed FAISS (Facebook AI Similarity Search) in Step 2.3. Since we are using an NVIDIA GPU, we installed the GPU version:
pip install faiss-gpu
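A quick way to confirm the GPU build imported correctly is to check how many GPUs FAISS can see:
import faiss
print(faiss.get_num_gpus())  # Should print 1 (or more) on a working GPU install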
5.2. Create a FAISS Vector Database
You'll need a dataset to index. For simplicity, let's assume you have a collection of text snippets; you can use any collection of documents (e.g., PDFs, articles). The following Python code sets up the FAISS vector database using the all-MiniLM-L6-v2 sentence-embedding model.
import whisper, requests, os, sounddevice as sd, numpy as np, tempfile, wave
import faiss
from sentence_transformers import SentenceTransformer
import torch

# Use GPU if available, otherwise use CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Sentence-embedding model used to vectorize documents and queries
embedding_model = SentenceTransformer('all-MiniLM-L6-v2')

# Vector database class to handle document embedding and search using FAISS
class VectorDatabase:
    def __init__(self, dim):
        # Create a FAISS index with the embedding dimension (384 for all-MiniLM-L6-v2 embeddings)
        self.index = faiss.IndexFlatL2(dim)
        self.documents = []

    # Add documents and their embeddings to the FAISS index
    def add_documents(self, docs):
        embeddings = embedding_model.encode(docs)  # Get embeddings for the docs
        self.index.add(np.array(embeddings, dtype=np.float32))  # Add them to the FAISS index
        self.documents.extend(docs)

    # Search for the top K most relevant documents based on the query embedding
    def search(self, query, top_k=3):
        query_embedding = embedding_model.encode([query])[0].astype(np.float32)
        distances, indices = self.index.search(np.array([query_embedding]), top_k)
        return [self.documents[i] for i in indices[0]]


# Your own information/documents to be used in Retrieval-Augmented Generation (RAG)
docs = [
    "The Oracle of Delphi was a sacred site in ancient Greece where a priestess, known as the Pythia, was believed to deliver prophetic messages from the god Apollo, providing guidance and answers to questions about the future to those who sought her counsel; it was considered one of the most important religious centers in the Greek world, with people traveling from far and wide to consult the oracle.",
    "Key points about the Oracle of Delphi: Location: Situated on the slopes of Mount Parnassus in central Greece. Priestess: The Pythia, a woman who would enter a trance-like state to deliver prophecies.",
    "God associated: Apollo, the Greek god of prophecy, music, and healing. Significance: People from across Greece would seek advice from the Oracle on important matters like war, politics, and personal decisions. Cryptic nature: The Oracle's answers were often ambiguous and open to interpretation, requiring careful analysis."
]

# Create a VectorDatabase and add documents to it
db = VectorDatabase(dim=384)
db.add_documents(docs)
Now you have a vector index ready to use with the LLM for retrieval.
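For example, you can verify retrieval works on its own before involving the LLM (the exact document returned depends on what you indexed):
# Retrieve the two most relevant documents for a sample question
for doc in db.search("Where was the Oracle of Delphi located?", top_k=2):
    print(doc[:80], "...")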
Step 6: Load the Whisper speech-to-text model and record audio from the microphone
# Load the Whisper speech-to-text model for voice transcription
whisper_model = whisper.load_model("base").to(device)

# Current directory and paths for beep sound files (used to indicate recording start and end)
current_dir = os.path.dirname(os.path.abspath(__file__))
bip_sound = os.path.join(current_dir, "assets/bip.wav")
bip2_sound = os.path.join(current_dir, "assets/bip2.wav")

# Play a sound (beep) to signal recording start/stop
def play_sound(sound_file):
    os.system(f"aplay {sound_file}")

# Record audio using sounddevice and save it as a 10-second .wav file
def record_audio(filename, duration=10, fs=16000):
    play_sound(bip_sound)  # Start beep
    print("10 seconds recording started...")

    audio = sd.rec(int(duration * fs), samplerate=fs, channels=1, dtype='int16')
    sd.wait()  # Wait for the recording to complete
    with wave.open(filename, 'wb') as wf:
        wf.setnchannels(1)
        wf.setsampwidth(2)
        wf.setframerate(fs)
        wf.writeframes(audio.tobytes())
    play_sound(bip2_sound)  # End beep
    print("recording completed")
6.1. Transcribe the recorded audio to text with Whisper
# Transcribe recorded audio to text using Whisper
def transcribe_audio(filename):
    return whisper_model.transcribe(filename, language="en")['text']
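If you want to test the microphone and transcription path in isolation (assuming a working microphone and the beep files in place), a short check looks like this:
# Record a short clip and print what Whisper heard
with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as tmp:
    record_audio(tmp.name)
    print("You said:", transcribe_audio(tmp.name))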
Step 7: Set up an initial prompt to steer the response behavior of the pre-trained Gemma 2 LLM and guide its output toward the desired domain.

# Initial prompt to guide the model's behavior
initial_prompt = ("You're an AI assistant specialized in AI development. "
                  "Answer questions clearly and concisely in a friendly, professional tone. Do not use asterisks, do not ask new questions "
                  "or act as the user. Keep replies short to speed up inference. If unsure, admit it and suggest looking into it further.")
Step 8: Send the query and the retrieved context to the llama.cpp server. If all goes well, the LLM returns a focused response.
# Configuration for the local LLM server (plain HTTP; adjust the IP address to match your --host setting)
llama_url = "http://192.168.1.251:8080/completion"

# Send a query and context to the llama.cpp server for completion
def ask_llama(query, context):
    data = {
        "prompt": f"{initial_prompt}\nContext: {context}\nQuestion: {query}\nAnswer:",
        "n_predict": 80,    # Limit response length to avoid delays
        "temperature": 0.7  # Adjust temperature for balanced responses
    }
    response = requests.post(llama_url, json=data, headers={'Content-Type': 'application/json'})
    if response.status_code == 200:
        return response.json().get('content', '').strip()
    else:
        return f"Error: {response.status_code}"

# Generate a response using Retrieval-Augmented Generation (RAG)
def rag_ask(query):
    context = " ".join(db.search(query))  # Search for related docs in the FAISS index
    return ask_llama(query, context)  # Ask the LLM using the retrieved context
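At this point the text path can already be tested end to end, assuming the llama-server from Step 3.5 is running:
print(rag_ask("What is the Oracle of Delphi?"))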
Step 9: Convert the answer text to speech with Piper TTS
# Convert text to speech using the Piper TTS model
def text_to_speech(text):
    os.system(f'echo "{text}" | /home/prabir/piper/build/piper --model /usr/local/share/piper/models/en_US-lessac-medium.onnx --output_file response.wav && aplay response.wav')
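Note that the reply text is interpolated into a shell command, so quotes in the model's answer can break the echo pipeline. A slightly more robust variant (a sketch using the same binary and voice paths) feeds the text to Piper over stdin instead:
import subprocess

def text_to_speech_safe(text):
    # Feed the text to Piper via stdin so quotes in the reply cannot break the command
    subprocess.run(
        ["/home/prabir/piper/build/piper",
         "--model", "/usr/local/share/piper/models/en_US-lessac-medium.onnx",
         "--output_file", "response.wav"],
        input=text.encode("utf-8"),
        check=True,
    )
    subprocess.run(["aplay", "response.wav"], check=True)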
Step 10: Test your AI assistant. This is the main() loop.
# Main loop for the assistant
def main():
    while True:
        # Create a temporary .wav file for the recording
        with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as tmpfile:
            record_audio(tmpfile.name)  # Record the audio input
            transcribed_text = transcribe_audio(tmpfile.name)  # Convert speech to text
            print(f"Agent heard: {transcribed_text}")
            response = rag_ask(transcribed_text)  # Generate response using RAG and the LLM
            print(f"Agent response: {response}")
            if response:
                text_to_speech(response)  # Convert response to speech

# Entry point of the script
if __name__ == "__main__":
    main()
Let us fire up a web browser and connect to the llama.cpp server on port 8080 on the local workstation, which is serving our plain pre-trained Gemma 2 LLM. On port 8080, the browser connects to the server's built-in chat interface. Now, let us ask the LLM a generic question, “What is oracle?” Here is the answer it gives us:
You're asking about the word "oracle"! It has two main meanings, and which one you're interested in depends on how you're using it.
1. The Ancient Oracle:
2. The Modern Oracle:
So, to answer your question:
"An oracle" is a word with two main meanings, depending on the context. One refers to a person or place thought to be a conduit to the divine, and the other is a tool for data analysis.
Let me know if you want to know more about one of these meanings in particular!
The LLM offers two likely subjects matching our original query about “oracle”: the modern software/database sense and the mystical fortune-tellers of ancient Greece. Now suppose we are only interested in the oracles of ancient Greece. We therefore provide the LLM with additional information about the Oracle of ancient Greece, along with directives for the overall response behavior, to limit its answers to that domain. See the snippets below:
# Your own information/documents to be used in Retrieval-Augmented Generation (RAG)
docs = [
    "The Oracle of Delphi was a sacred site in ancient Greece where a

<<truncated>>
<<truncated>>

requiring careful analysis."
]

# Initial prompt to guide the model's behavior
initial_prompt = ("You're an AI assistant specialized in AI development. "
                  "Answer questions clearly and concisely in a friendly, professional tone. Do not use asterisks, do not ask new questions "
                  "or act as the user. Keep replies short to speed up inference. If unsure, admit it and suggest looking into it further.")
We are ready to run the Python program, voice-ai-assistant.py, which contains all the code snippets discussed so far. Once started, the program waits for your voice input, and we ask the same question as before: “What is Oracle?” The program transcribes the question into text, combines it with the vectorized domain-specific information (the mystical Oracle of Delphi), submits the combined prompt to the LLM server, receives the response from the LLM, generates a voice response from it, and speaks it back through the speaker.
recording completed
Agent heard: What is Oracle?
Agent response: The Oracle is a sacred site where a priestess, known as the Pythia, delivers prophetic messages from the god Apollo.
[2025-01-20 17:59:10.835] [piper] [info] Loaded voice in 0.307025036 second(s)
[2025-01-20 17:59:10.836] [piper] [info] Initialized piper
response.wav
As can be seen, this time the answer is limited to the domain we are interested in, the ancient Greek Oracle, and says nothing about Oracle database software.
In conclusion, setting up a personal AI assistant with the powerful Gemma 2 LLM on a local Ubuntu workstation with an NVIDIA GPU offers an efficient and scalable way to deploy advanced AI on your own hardware. By integrating a robust stack that includes llama.cpp for model serving, Whisper for speech-to-text, Piper for text-to-speech, and FAISS for Retrieval-Augmented Generation, you can build a fully functional assistant capable of handling complex tasks. This setup delivers solid performance and the flexibility of running everything locally, keeping you in control of data privacy and resource usage. Whether for personal use or development purposes, this guide offers a clear path to creating an intelligent, responsive, and efficient AI assistant tailored to your needs.