Run DeepSeek AI Assistant on Your Local Machine

Introduction

DeepSeek is an advanced AI model tailored for natural language processing, featuring robust functionalities like text generation, summarization, and reasoning. Its ability to run locally makes it a fantastic option for users seeking privacy, control, and offline access to AI.

In this article, I will provide a quick overview of how to set up a local chatbot that can perform web searches using the open-source DeepSeek R1 model.

Writing this piece was quite challenging because of its technical nature. I grappled with it, since the newsletter aims to present tech concepts in a way that is easy for the audience to understand. Nevertheless, I went ahead, because it should benefit those of you looking to embark on your AI journey.


Why Run Locally? Can’t I Use the DeepSeek Web Version?

Of course, you can!

As the DeepSeek AI Assistant has grown more and more popular, I have been seeing messages like the one below more and more often:

I can imagine how busy it was.

Besides, people might have privacy concerns about using these companies’ AI assistant services.

DeepSeek sends all the data it collects on Americans to servers in China, according to the company’s terms of service.

You would not have this concern if the model runs only on your own computer.

Lastly, this is also a great way to become familiar with some development tools if you want to learn AI.


Getting to work

DeepSeek excels in its flexibility: while it runs on CPU-only systems, a dedicated GPU greatly improves performance. On a CPU, responses can be slow, and larger models may need considerable RAM. With a GPU, DeepSeek generates responses much faster thanks to parallel processing, making real-time interactions smoother.

This guide will walk you through the installation and setup of DeepSeek, ensuring you can get started with AI on your machine, whether you have a high-end GPU or not.

Before we begin, ensure your system meets the minimum requirements. While DeepSeek can run on a CPU-only machine, having a high-performance processor and sufficient RAM will improve execution speed. If a compatible GPU is installed, Ollama will automatically detect and utilize it for accelerated processing.

Software Requirements

Step 1

  • Install Ollama

Ollama is a framework for running and managing large language models (LLMs) on local computing resources. It enables the loading and deployment of selected LLMs and provides access to them through an API.

Ollama is required to run DeepSeek models. It provides an optimized local runtime for running machine learning models efficiently.

Download and run the official Ollama installation script:

$ curl -fsSL https://ollama.com/install.sh | sh        

After installation, verify that Ollama is installed correctly by checking its version:

$ ollama --version        

Additionally, ensure that the Ollama service is running with:

$ systemctl is-active ollama.service        

  • Download DeepSeek-R1

Now, fetch the model you want to run. DeepSeek-R1 models come in several sizes, trading off speed and accuracy against your hardware capabilities. Larger models provide better reasoning and accuracy but require more RAM, VRAM, and disk space. To install the 14B model as an example, run:

$ ollama pull deepseek-r1:14b
pulling manifest
pulling 6e9f90f02bb3... 100% ▕██████████████████████████████████▏ 9.0 GB
verifying sha256 digest
writing manifest
success        

Other, smaller distilled DeepSeek-R1 models are also available for download. However, I had a bad experience with models smaller than 14 billion parameters.

Once the download completes, the model is stored locally and ready to use; we will open an interactive prompt with it in a moment.

Available DeepSeek-R1 Models, Hardware Requirements and Recommendations
Choosing the Right Model:

  • 1.5B – 7B models: Best for everyday tasks, chat applications, and lightweight inference.
  • 8B – 14B models: Balanced models offer improved reasoning while staying relatively efficient.
  • 32B – 70B models: Highly advanced, suitable for research and deep analysis, but require substantial resources.
  • 671B model: Requires data-center-level hardware. Used for cutting-edge AI research.

Even with 512+ GB RAM and multiple GPUs with 100+ GB VRAM, the DeepSeek-R1:671B model remains slow due to its massive 671 billion parameters, which require an immense number of calculations per response. While multiple GPUs improve overall throughput, they do not significantly reduce latency for a single request, as data movement, memory bandwidth, and computational limits create bottlenecks. Even high-end AI infrastructure struggles with this scale, making smaller models (7B–14B) far more practical for real-time applications. The 671B model is best suited for research and large-scale AI experiments, where precision outweighs speed.
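
To get an intuition for these sizes, a rough back-of-the-envelope estimate is simply the parameter count multiplied by the storage used per parameter. The short sketch below is only an illustration under the assumption of roughly 5 bits per parameter, which is in the ballpark of the quantized builds Ollama serves; it is not an exact sizing tool:

# Rough sketch: model size ≈ parameters × bits per parameter ÷ 8 (bytes).
# Assumption: ~5 bits/parameter for quantized builds; full precision would use 16.
def approx_size_gb(params_billions: float, bits_per_param: float = 5.0) -> float:
    return params_billions * 1e9 * bits_per_param / 8 / 1e9

for name, params in [("1.5b", 1.5), ("7b", 7), ("14b", 14), ("70b", 70), ("671b", 671)]:
    print(f"deepseek-r1:{name}  ~{approx_size_gb(params):.0f} GB")

# The 14b estimate (~9 GB) matches the pull output above; 671b lands in the
# hundreds of GB, which is why it calls for data-center-class hardware.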

  • Begin Using DeepSeek

Once the model is downloaded, you can start interacting with it directly. To run the DeepSeek-R1 model, use:

$ ollama run deepseek-r1:14b
>>> Hey DeepSeek, how are you doing today!

<think>
Hi! I'm just a virtual assistant, so I don't have feelings, but thanks for asking! How can I help you today?
</think>        

  • Use a Local API for Integration

If you need to interact with DeepSeek programmatically, you can use Ollama's built-in HTTP API, which listens on port 11434. If the Ollama service is already active (as checked above), you can skip the ollama serve part and just run the curl command.

$ ollama serve & curl http://localhost:11434/api/generate -d '{"model": "deepseek-r1:14b", "prompt": "Hello, how are you?"}'
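
If you prefer Python over curl, here is a minimal sketch of the same call using the requests library (assuming the default port 11434 and the deepseek-r1:14b model pulled above; setting "stream" to false returns one JSON object instead of streamed chunks):

import requests

# Call the local Ollama HTTP API and print the model's reply.
response = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "deepseek-r1:14b",
        "prompt": "Hello, how are you?",
        "stream": False,   # ask for a single JSON response instead of a stream
    },
    timeout=300,
)
response.raise_for_status()
print(response.json()["response"])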

Step 2

  • Running Your Search Engine

SearXNG is an open-source search engine that aggregates search results from various engines without storing or tracking user data.

Save the following Docker Compose file as “docker-compose.yml”.

version: '3.8'
services:
  searxng:
    image: docker.io/searxng/searxng:latest
    container_name: searxng
    ports:
      - "4000:8080"
    volumes:
      - ./searxng:/etc/searxng
    restart: unless-stopped        

We can spin up the search engine by running the following command in a terminal, from the same directory where you saved the “docker-compose.yml” file.

docker-compose up -d        

Docker will download the SearXNG image, and once the container is up, the search engine is running on your machine.

Open up your browser, and you will see a search interface like this at http://localhost:4000/search :

SearXNG user interface
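
One detail worth checking before wiring SearXNG into the chatbot: the script in the final step requests results with format=json, and a default SearXNG configuration may only allow HTML output, in which case those requests are rejected. If that happens, adding the json format to the settings.yml that SearXNG generates in the mounted ./searxng folder (an assumption based on the volume mapping above) and restarting the container should fix it:

# ./searxng/settings.yml – allow the JSON output format used by the chatbot script
search:
  formats:
    - html
    - json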


  • Install Gradio

Gradio is an open-source Python library that makes creating interactive web-based interfaces for your Python scripts and machine-learning models easy without needing any front-end development skills.

Let’s now install the essential components for our user interface. The script in the next step also imports the requests and ollama Python packages, so install those as well if you don’t already have them:

pip install gradio requests ollama

  • Writing the app to make function calls

In the last step, let’s create a simple Python script that combines all our resources and makes the function calls:

import gradio as gr
import requests
import ollama

def search_web(query: str) -> list:
    """Query the local SearXNG instance and return the raw list of results."""
    # Note: SearXNG must allow the "json" output format (see the settings.yml note above).
    SEARXNG_URL = "http://localhost:4000/search"
    params = {'q': query, 'format': 'json'}
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36"
    }

    response = requests.get(SEARXNG_URL, params=params, headers=headers)
    if response.status_code != 200:
        print("Response status code:", response.status_code)
        print("Response text:", response.text)
        raise Exception(f"Search query failed with status code {response.status_code}")
    return response.json().get("results", [])

def chat_with_search(query: str, use_web_search: bool):
    # Optionally integrate web search based on user toggle
    if use_web_search:
        results = search_web(query)
        context_str = format_search_results(results, max_results=5)
    else:
        context_str = "No additional context provided."
    return generate_augmented_response(query, context_str)

def format_search_results(results: list, max_results: int = 5) -> str:
    """
    Format the top search results into a context string.
    """
    formatted = []
    for result in results[:max_results]:
        title = result.get("title", "No title")
        url = result.get("url", "No URL")
        snippet = result.get("content", "No snippet")
        formatted.append(f"Title: {title}\nURL: {url}\nSnippet: {snippet}")
    return "\n\n".join(formatted)

def generate_augmented_response(query: str, context: str) -> str:
    """
    Combine the user's query with the search context and send it to DeepSeek R1 via Ollama.
    """
    # Create a composite prompt
    composite_prompt = f"""
{context}
Please use the web search results above to provide a detailed answer to the following request.
{query}
Answer:"""
    response = ollama.chat(
        model="deepseek-r1:14b",
        messages=[
            {"role": "user", "content": composite_prompt}
        ]
    )
    return response["message"]["content"]

iface = gr.Interface(
    fn=chat_with_search,
    inputs=[
        gr.Textbox(label="Query"),
        gr.Checkbox(label="Enable Web Search", value=True)
    ],
    outputs="text",
    title="Houshang - Deepseek-r1:14b AI Model",
    description="Ask questions and get answers augmented with real-time web search results."
)

iface.launch(share=False, debug=False, server_name="0.0.0.0")        

Let’s run it in the terminal:

python app.py        

You should see something like this:

* Running on local URL:  http://127.0.0.1:7860

You may now open the app at http://127.0.0.1:7860 in your browser:

Houshang - The DeepSeek-r1:14B AI Assistant

Also, you can use the app's API to interact with it:

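For example, here is a minimal sketch using the gradio_client package (an extra dependency, installable with pip install gradio_client, not covered above) that calls the same function the web UI uses:

from gradio_client import Client

# Connect to the locally running Gradio app started by app.py
client = Client("http://127.0.0.1:7860/")

# The two positional arguments mirror the UI inputs:
# the query textbox and the "Enable Web Search" checkbox.
answer = client.predict("What is DeepSeek-R1?", True, api_name="/predict")
print(answer)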

Congrats! You now have a local chatbot and no longer have to rely so heavily on the hosted DeepSeek applications.

You can try typing in some prompts and interacting with the chatbot to see how it performs. Gradio provides a “Flag” button, a simple UI feature that lets you save the response and chat history as a CSV file in the project working directory.


Conclusion

This was just a simple demo of setting up a functional local chatbot. Many optimizations could be made, such as improving the response formatting, using a larger distilled model (more parameters), improving the UI, providing feedback to the model, and preserving context history, as shown in the sketch below. You might also want to refactor and organize the code into a proper software project as you become more seasoned in programming.
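
For instance, preserving context history can be as simple as keeping the running list of messages that ollama.chat already accepts. A minimal sketch, reusing the same deepseek-r1:14b model as above:

import ollama

# Keep the whole conversation so the model sees earlier turns as context.
history = []

def chat(user_message: str) -> str:
    history.append({"role": "user", "content": user_message})
    reply = ollama.chat(model="deepseek-r1:14b", messages=history)
    answer = reply["message"]["content"]
    history.append({"role": "assistant", "content": answer})
    return answer

print(chat("My name is Sam."))
print(chat("What is my name?"))  # the model now remembers the earlier turn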
