Beginner’s Guide to Running Mistral 7B Locally on a Single GPU
Mistral 7B is a large language model developed by Mistral AI. It is designed to handle a wide range of natural language processing (NLP) tasks with high accuracy and efficiency. With 7 billion parameters, it is small enough to run on a single consumer GPU while still delivering strong results across many NLP benchmarks.
In this blog post, we will explore the key features of Mistral 7B and provide a step-by-step guide to running the model using the Hugging Face library.
If you wish to run Llama 3 models locally, check out my other guide, Beginner's Guide to Running Llama 3 Locally on a Single GPU.
Note: The Mistral 7B models are hosted in gated repositories on Hugging Face, which means that to access them you need to either log in with your Hugging Face account or generate an access token in your account and use it to download the repository.
Create a free account on Hugging Face, search for mistralai/Mistral-7B-Instruct-v0.1, fill out the access request form, and wait for approval. Once approved, you'll receive an email; then navigate to Settings in your Hugging Face account, go to Access Tokens, and create a new token.
If you created a token of the fine-grained type, make sure to add the Mistral 7B repository under its Repositories permissions.
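If you prefer not to paste the token directly into your code (as we do later in this guide), you can instead authenticate once from the terminal with the Hugging Face CLI, which ships with the huggingface_hub package:
pip install huggingface_hub
huggingface-cli login
This stores the token on your machine so that later downloads of gated repositories work without passing the token in code.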
Key Features of Mistral 7B:
1. Efficient Scale: Mistral 7B packs its capability into 7 billion parameters, making it compact enough to run on a single GPU while remaining competitive with much larger models.
2. High Performance: The model is designed to deliver high performance in various NLP tasks, thanks to its extensive training on diverse datasets.
3. Versatility: Mistral 7B can be fine-tuned for specific tasks, making it a versatile tool for developers and researchers.
4. Accessibility: The model is available through the Hugging Face library, making it easy to integrate into applications.
Step-by-Step Guide to Running Mistral 7B using Hugging Face:
Create a new directory for your project and access it via the terminal.
mkdir mistral_project
cd mistral_project
Inside this new directory, create a new Python virtual environment by running the following command in your terminal.
python3 -m venv venv_name
Activate the environment by running the command:
source venv_name/bin/activate
If you're on Windows, then run the following commands:
python -m venv venv_name
Then activate the virtual environment.
venv_name\Scripts\activate
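Either way, you can confirm that the virtual environment is active by checking which Python interpreter is on your PATH; the path printed should point inside the venv_name directory.
which python
where python
(Use which on macOS/Linux and where python in the Windows Command Prompt.)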
1. Install the Required Libraries:
Ensure you have Python installed on your computer. Then install the Hugging Face Transformers library, PyTorch, and Accelerate (Accelerate is required for the device_map="auto" option we use below).
pip install transformers torch accelerate
We will load the model in 8-bit quantized form to reduce GPU memory usage, so let's also install the bitsandbytes package.
pip install bitsandbytes
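Before downloading the model, it's worth verifying that PyTorch can actually see your GPU. A quick check from the terminal:
python -c "import torch; print(torch.cuda.is_available())"
If this prints False, the model will fall back to the CPU and inference will be very slow.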
2. Load the Mistral 7B Model:
Inside your project directory, create a new Python file named mistral-7b.py.
Import all dependencies.
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
from huggingface_hub import login
import bitsandbytes as bnb
To check whether a GPU is available, add the following lines of code. (This check is only informational; device placement is handled later by the device_map="auto" argument.)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")
Now, login using your Hugging Face access token.
login('YOUR_ACCESS_TOKEN')
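Hard-coding the token is fine for a quick test, but a safer option is to read it from an environment variable so it never ends up in your source code. A small sketch, assuming you have exported a variable named HF_TOKEN:
import os
login(token=os.environ["HF_TOKEN"])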
Use the Hugging Face library to load the Mistral 7B model and tokenizer.
model_name = "mistralai/Mistral-7B-Instruct-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    load_in_8bit=True,
    device_map="auto"
)
The above block of code downloads the Mistral 7B model and tokenizer from Hugging Face, which can take a few minutes. Once they are downloaded you can start using the model, but it's worth saving both locally so you can reuse them without downloading them every time.
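Note: in recent versions of Transformers, passing load_in_8bit=True directly to from_pretrained is deprecated in favor of a BitsAndBytesConfig object. If your installed version warns about the argument, an equivalent sketch looks like this:
from transformers import BitsAndBytesConfig

quant_config = BitsAndBytesConfig(load_in_8bit=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=quant_config,
    device_map="auto"
)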
3. Save the Mistral 7B Model and Tokenizer for Inference:
# Save the quantized model and tokenizer locally
model.save_pretrained("./quantized_model")
tokenizer.save_pretrained("./quantized_model")
After the model and tokenizer are saved, load them from the local directory using the code below. At this point, you can replace the block of code from step #2 that downloaded the model and tokenizer from Hugging Face.
# Load the saved quantized model and tokenizer
model_dir = "./quantized_model"
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForCausalLM.from_pretrained(
    model_dir,
    load_in_8bit=True,
    device_map="auto"
)
4. Run Locally Saved Mistral 7B For Text Generation
Finally, add the rest of the code, which uses the locally saved model to generate text.
def generate_text(prompt):
    # Tokenize the prompt and move the tensors to the same device as the model
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_length=200)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)
prompt = """
Once upon a time there was a
"""
generated_text = generate_text(prompt)
print("Generated Text:", generated_text)
Full code for the mistral-7b.py file:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
from huggingface_hub import login
import bitsandbytes as bnb
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")
login('YOUR_ACCESS_TOKEN')
# Download Mistral 7B from Hugging Face
model_name = "mistralai/Mistral-7B-Instruct-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    load_in_8bit=True,
    device_map="auto"
)
# Save the quantized model and tokenizer locally
model.save_pretrained("./quantized_model")
tokenizer.save_pretrained("./quantized_model")
# Load the saved quantized model and tokenizer
model_dir = "./quantized_model"
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForCausalLM.from_pretrained(
    model_dir,
    load_in_8bit=True,
    device_map="auto"
)
def generate_text(prompt):
    # Tokenize the prompt and move the tensors to the same device as the model
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_length=200)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)
prompt = """
Once upon a time there was a
"""
generated_text = generate_text(prompt)
print("Generated Text:", generated_text)
The above code was tested on a machine with 32 GB of RAM and a 16 GB GPU. Token generation was not especially fast, but it was usable.
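To try it yourself, save the file and run it from inside the activated virtual environment:
python mistral-7b.py
The first run downloads the model weights (several gigabytes), so expect it to take a while depending on your connection; later runs reuse the cached and locally saved files.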
The Mistral 7B large language model is a powerful tool for natural language processing tasks. By following the steps outlined in this blog post, you can easily run the model on your computer using the Hugging Face library. Whether you are a developer, researcher, or enthusiast, Mistral 7B offers a versatile and accessible solution for your NLP needs.