A Step-by-Step Guide to Running State-of-the-Art Open Source AI on Limited Compute Resources

Artificial intelligence is rapidly advancing, with larger and more powerful language models emerging at a dizzying pace.

However, many of these cutting-edge models require immense computational resources, putting them out of reach for the average developer or researcher.

But what if you could harness the capabilities of a massive 47 billion parameter model - for free - using the limited resources of Google Colab?

That's exactly what we'll explore in this hands-on guide, walking through how to run the impressive open source Mixtral 8x7B model on Colab's free tier.

What is Mixtral?

Mixtral 8x7B is a state-of-the-art sparse mixture of experts (MoE) language model. Although it has roughly 47 billion total parameters, only about 13 billion are active for any given token, which lets Mixtral outperform the 70B-parameter Llama 2 on most benchmarks while providing roughly 6x faster inference.

This combination of high performance and efficiency makes it one of the most promising open source models available today.

The key to Mixtral's capabilities is its MoE architecture.

In each transformer layer, the feed-forward block is split into 8 distinct expert networks.

For each input token, a lightweight gating (router) network selects the 2 most relevant experts to process that token. The selected experts' outputs are then combined in a weighted sum to produce the layer's result.

By activating only a fraction of the model for any given input, Mixtrel achieves excellent performance while remaining computationally lean.

[Diagram illustrating the MoE architecture]
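To make the routing concrete, here is a minimal sketch of top-2 MoE routing in PyTorch. It is purely illustrative - the layer sizes are toy values, not Mixtral's real dimensions - but it shows the essential pattern: score all experts, keep the top two, and mix their outputs using the normalized gate weights.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyMoELayer(nn.Module):
    """Minimal top-2 mixture-of-experts layer (illustrative only, not Mixtral's implementation)."""

    def __init__(self, dim=64, num_experts=8, top_k=2):
        super().__init__()
        self.experts = nn.ModuleList(nn.Linear(dim, dim) for _ in range(num_experts))
        self.gate = nn.Linear(dim, num_experts)  # router: scores each expert per token
        self.top_k = top_k

    def forward(self, x):                        # x: (num_tokens, dim)
        scores = self.gate(x)                    # (num_tokens, num_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)     # normalize the two selected gate scores

        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e            # tokens routed to expert e in slot k
                if mask.any():
                    out[mask] += weights[mask, k:k+1] * expert(x[mask])
        return out

layer = ToyMoELayer()
tokens = torch.randn(5, 64)
print(layer(tokens).shape)  # torch.Size([5, 64])

The important point is the last loop: each token only ever runs through 2 of the 8 experts, which is why the effective compute per token is a fraction of the model's total parameter count.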

Offloading Experts for Low-Resource Environments

To run such a large model on Colab's free tier, which provides a T4 GPU with 16GB of VRAM, we'll leverage the techniques from the paper "Fast Inference of Mixture-of-Experts Language Models with Offloading."

The key insight is that once the gating function selects the experts for a token, the inactive experts are no longer needed and can be offloaded from GPU memory to cheaper storage like system RAM or SSD.

This frees up VRAM for the active experts to perform inference.

The paper combines an LRU cache of recently used experts with speculative expert loading, which guesses which experts the next layer will need and starts fetching them ahead of time.

While the full details are beyond our scope, these techniques let Mixtral run efficiently on Colab by intelligently swapping experts between GPU and CPU memory as needed.
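The core of the caching idea can be sketched in a few lines: keep a small LRU cache of experts on the GPU, and pull an expert in from CPU memory only when the router asks for it. This is a highly simplified illustration of the behaviour, not the paper's actual implementation:

from collections import OrderedDict
import torch.nn as nn

class ExpertCache:
    """Toy LRU cache that keeps at most `capacity` experts resident on the GPU."""

    def __init__(self, experts: list[nn.Module], capacity: int):
        self.experts = experts            # all experts start out in CPU memory
        self.capacity = capacity
        self.on_gpu = OrderedDict()       # expert index -> GPU-resident module

    def get(self, idx: int) -> nn.Module:
        if idx in self.on_gpu:
            self.on_gpu.move_to_end(idx)  # mark as recently used
            return self.on_gpu[idx]

        if len(self.on_gpu) >= self.capacity:
            _, evicted = self.on_gpu.popitem(last=False)  # evict least recently used expert
            evicted.to("cpu")

        expert = self.experts[idx].to("cuda")  # move the requested expert onto the GPU
        self.on_gpu[idx] = expert
        return expert

The real system goes further: it predicts which experts the next layer will need (by applying that layer's gating function to the current hidden state) and begins copying them early, so data transfer overlaps with computation instead of stalling it.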

Setting Up the Colab Environment

With the concepts in place, let's dive into running Mixtral hands-on! We'll start by setting up our Colab environment:

  1. Change the runtime type to "GPU" and make sure a T4 GPU is selected.
  2. Install the required libraries and set key environment variables:

# Install necessary libraries
!pip install numpy transformers

# Set environment variables for GPU support
import os
os.environ["PYTHONIOENCODING"] = "utf-8"
os.environ["LD_LIBRARY_PATH"] = "/usr/local/cuda/lib64"
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

# Clone the Mixtral offloading repo (replace your_username with the actual account)
!git clone https://github.com/your_username/mixtral_offloading.git
%cd mixtral_offloading
!pip install -r requirements.txt

# Download the quantized Mixtral checkpoint to a local directory
# (the repo id below is a placeholder; substitute the actual quantized checkpoint repo)
!huggingface-cli download models/mixtral-8x7b --local-dir models/mixtral-8x7b

This snippet installs key dependencies (NumPy and Transformers), sets environment variables for GPU support, clones the Mixtral offloading repo, and downloads the pre-trained, quantized model. It may take a few minutes to run.
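Before loading the model, it's worth a quick sanity check that Colab actually allocated a GPU. This short snippet isn't part of the setup itself; it just uses PyTorch's standard CUDA utilities:

import torch

# Confirm that a CUDA device is visible and report its name and memory
assert torch.cuda.is_available(), "No GPU detected - check Runtime > Change runtime type"
print(torch.cuda.get_device_name(0))   # should report a Tesla T4 on the free tier
print(f"{torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB VRAM")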

Loading the Model

Next, we'll import additional modules and load the quantized Mixtral model:

import sys
sys.path.append('/content/mixtral_offloading')  # make the cloned repo importable

import torch
from transformers import AutoConfig, AutoTokenizer

# BaseQuantizeConfig comes from the HQQ quantization code bundled under src/;
# OffloadConfig, QuantConfig and build_model come from the offloading repo itself
from src.hqq.core.quant import BaseQuantizeConfig
from src.build import OffloadConfig, QuantConfig, build_model

inputs = {
    "model_name": "Mixtral",
    "quant_model_name": "models/mixtral-8x7b",  # local path of the downloaded quantized checkpoint
    "state_path": "models/mixtral-8x7b",
}

config = AutoConfig.from_pretrained(inputs["quant_model_name"])
device = torch.device("cuda")

# Offload 4 of the 8 experts per layer and set memory budgets for the offloading engine
# (these helpers come from the offloading repo; adjust the sizes to your runtime)
offload_config = OffloadConfig(n_offload=4)
offload_config.set_memory_sizes(main_size="20G", offload_size="10G", buffer_size="6G")

# Quantize attention weights to 4 bits and the much larger expert FFN weights to 2 bits
attn_config = BaseQuantizeConfig(nbits=4, group_size=32, quant_zero=True, quant_scale=True)
ffn_config = BaseQuantizeConfig(nbits=2, group_size=32, quant_zero=True, quant_scale=True)
quant_config = QuantConfig(attn_cfg=attn_config, ffn_cfg=ffn_config)

model = build_model(
    name=inputs["model_name"],
    quant_config=quant_config, 
    offload_config=offload_config, 
    device=device,
    state_path=inputs["state_path"]
)        

Here we configure the key model settings: quantization (compressing attention weights to 4 bits and the much larger expert weights to 2 bits to shrink the model and speed up inference), offloading (how many experts per layer live in CPU memory), and memory budgets.

Finally, we build the Mixtral model with these configurations, preparing it for inference on the GPU.
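To see why this combination fits on a 16GB T4, here is a rough back-of-the-envelope estimate. The layer dimensions are Mixtral's published ones; the byte counts ignore quantization metadata such as scales and zero points, so treat the totals as approximations.

# Rough memory estimate for quantized Mixtral 8x7B (approximate, ignores quantization metadata)
hidden, ffn, layers, experts = 4096, 14336, 32, 8

expert_params = 3 * hidden * ffn                      # w1, w2, w3 projections per expert
all_expert_params = expert_params * experts * layers  # ~45B parameters live in the experts
other_params = 47e9 - all_expert_params               # attention, embeddings, routers

gpu_experts = all_expert_params / 2 * (2 / 8)         # half the experts on GPU at 2 bits/param
gpu_other = other_params * (4 / 8)                    # non-expert weights at 4 bits/param

print(f"Expert parameters in total:   {all_expert_params / 1e9:.1f}B")
print(f"Expert weights kept on GPU:   {gpu_experts / 1e9:.1f} GB")
print(f"Non-expert weights on GPU:    {gpu_other / 1e9:.1f} GB")

Roughly 6-7 GB of weights stay resident on the GPU, leaving headroom for activations, the KV cache, and the buffer used to swap experts in and out.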

Inference with Mixtral

Now the exciting part - let's generate some text with our mighty Mixtral model!

from transformers import TextStreamer

tokenizer = AutoTokenizer.from_pretrained(inputs["quant_model_name"])
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

past_key_values = None  # key/value cache carried across turns so the model keeps the conversation context

while True:
    message = input("Enter your message: ")
    input_ids = tokenizer(message, return_tensors="pt", max_length=1024, truncation=True).input_ids.to(device)

    # The attention mask must cover both the cached tokens from previous turns and the new prompt
    if past_key_values is None:
        attention_mask = torch.ones_like(input_ids)
    else:
        past_len = past_key_values[0][0].shape[2]  # length of the cached sequence
        attention_mask = torch.ones(1, past_len + input_ids.shape[1], dtype=torch.long, device=device)

    gen_kwargs = {
        "max_new_tokens": 512,
        "do_sample": True,
        "top_k": 50,
        "top_p": 0.9,
        "temperature": 1.0,
        "pad_token_id": tokenizer.eos_token_id,
    }

    with torch.no_grad():
        result = model.generate(
            input_ids=input_ids,
            attention_mask=attention_mask,
            past_key_values=past_key_values,
            streamer=streamer,  # prints the response to stdout token by token as it is generated
            return_dict_in_generate=True,
            **gen_kwargs,
        )

    past_key_values = result.past_key_values  # reuse the cache on the next turn
    print()

This code sets up a conversational loop with Mixtral. For each user input, it tokenizes the message, builds an attention mask that covers both the cached conversation and the new prompt, and calls the model's generate function with the sampling settings. The key/value cache (past_key_values) is carried from turn to turn so the model remembers earlier messages.

The model's output is streamed back token by token by the TextStreamer, giving an interactive, chat-like experience.
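The sampling parameters are worth experimenting with. As an illustration, the following alternative gen_kwargs (standard Transformers generate arguments, not settings prescribed by the offloading code) trade creativity for more deterministic, focused answers:

# More deterministic settings: greedy decoding with a shorter response budget
gen_kwargs = {
    "max_new_tokens": 256,
    "do_sample": False,             # pick the highest-probability token at each step
    "repetition_penalty": 1.1,      # discourage the model from looping
    "pad_token_id": tokenizer.eos_token_id,
}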

Conclusion

In this hands-on guide, we've seen how to run the powerful 47B-parameter Mixtral language model on Google Colab's free tier. By leveraging expert offloading and aggressive quantization, Mixtral delivers state-of-the-art performance while remaining efficient enough for modest hardware.

As an open source model, Mixtral opens up exciting possibilities for building advanced language AI applications - chatbots, writing assistants, code generators, and more - without the need for expensive compute resources. We encourage you to experiment further with Mixtral and share your creations!
