A Step-by-Step Guide to Running State-of-the-Art Open Source AI on Limited Compute Resources

Artificial intelligence is rapidly advancing, with larger and more powerful language models emerging at a dizzying pace.

However, many of these cutting-edge models require immense computational resources, putting them out of reach for the average developer or researcher.

But what if you could harness the capabilities of a massive 47 billion parameter model - for free - using the limited resources of Google Colab?

That's exactly what we'll explore in this hands-on guide, walking through how to run the impressive open source Mixtral 8x7B model on Colab's free tier.

What is Mixtral?

Mixtral 8x7B is a state-of-the-art sparse mixture of experts (MoE) language model. Although it has roughly 47 billion total parameters, only about 13 billion are active for any given token, which lets Mixtral outperform the 70B-parameter Llama 2 on most benchmarks while providing roughly 6x faster inference.

This combination of high performance and efficiency makes it one of the most promising open source models available today.

The key to Mixtral's capabilities is its MoE architecture.

In each transformer layer, the feed-forward block is split into 8 distinct expert networks.

For each input token, a lightweight gating (router) network selects the 2 most relevant experts to process that token. The selected experts' outputs are then combined in a weighted sum to produce the layer's result.

By activating only a fraction of the model for any given input, Mixtrel achieves excellent performance while remaining computationally lean.

[Diagram illustrating the MoE architecture]
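To make the routing concrete, here is a minimal sketch of top-2 MoE routing in PyTorch. It is purely illustrative - the layer sizes are toy values, not Mixtral's real dimensions - but it shows the essential pattern: score all experts, keep the top two, and mix their outputs using the normalized gate weights.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyMoELayer(nn.Module):
    """Minimal top-2 mixture-of-experts layer (illustrative only, not Mixtral's implementation)."""

    def __init__(self, dim=64, num_experts=8, top_k=2):
        super().__init__()
        self.experts = nn.ModuleList(nn.Linear(dim, dim) for _ in range(num_experts))
        self.gate = nn.Linear(dim, num_experts)  # router: scores each expert per token
        self.top_k = top_k

    def forward(self, x):                        # x: (num_tokens, dim)
        scores = self.gate(x)                    # (num_tokens, num_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)     # normalize the two selected gate scores

        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e            # tokens routed to expert e in slot k
                if mask.any():
                    out[mask] += weights[mask, k:k+1] * expert(x[mask])
        return out

layer = ToyMoELayer()
tokens = torch.randn(5, 64)
print(layer(tokens).shape)  # torch.Size([5, 64])

The important point is the last loop: each token only ever runs through 2 of the 8 experts, which is why the effective compute per token is a fraction of the model's total parameter count.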

Offloading Experts for Low-Resource Environments

To run such a large model on Colab's free tier, which provides a T4 GPU with 16GB of VRAM, we'll leverage the techniques from the paper "Fast Inference of Mixture-of-Experts Language Models with Offloading."

The key insight is that once the gating function selects the experts for a token, the inactive experts are no longer needed and can be offloaded from GPU memory to cheaper storage like system RAM or SSD.

This frees up VRAM for the active experts to perform inference.

The paper combines an LRU cache of recently used experts with speculative expert loading, which guesses which experts the next layer will need and starts fetching them ahead of time.

While the full details are beyond our scope, these techniques let Mixtral run efficiently on Colab by intelligently swapping experts between GPU and CPU memory as needed.
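The core of the caching idea can be sketched in a few lines: keep a small LRU cache of experts on the GPU, and pull an expert in from CPU memory only when the router asks for it. This is a highly simplified illustration of the behaviour, not the paper's actual implementation:

from collections import OrderedDict
import torch.nn as nn

class ExpertCache:
    """Toy LRU cache that keeps at most `capacity` experts resident on the GPU."""

    def __init__(self, experts: list[nn.Module], capacity: int):
        self.experts = experts            # all experts start out in CPU memory
        self.capacity = capacity
        self.on_gpu = OrderedDict()       # expert index -> GPU-resident module

    def get(self, idx: int) -> nn.Module:
        if idx in self.on_gpu:
            self.on_gpu.move_to_end(idx)  # mark as recently used
            return self.on_gpu[idx]

        if len(self.on_gpu) >= self.capacity:
            _, evicted = self.on_gpu.popitem(last=False)  # evict least recently used expert
            evicted.to("cpu")

        expert = self.experts[idx].to("cuda")  # move the requested expert onto the GPU
        self.on_gpu[idx] = expert
        return expert

The real system goes further: it predicts which experts the next layer will need (by applying that layer's gating function to the current hidden state) and begins copying them early, so data transfer overlaps with computation instead of stalling it.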

Setting Up the Colab Environment

With the concepts in place, let's dive into running Mixtral hands-on! We'll start by setting up our Colab environment:

  1. Change the runtime type to "GPU" and make sure a T4 GPU is selected.
  2. Install the required libraries and set key environment variables:

# Install necessary libraries
!pip install numpy transformers

# Set environment variables for GPU support
import os
os.environ["PYTHONIOENCODING"] = "utf-8"
os.environ["LD_LIBRARY_PATH"] = "/usr/local/cuda/lib64"
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

# Clone the Mixtral offloading repo (replace your_username with the actual account)
!git clone https://github.com/your_username/mixtral_offloading.git
%cd mixtral_offloading
!pip install -r requirements.txt

# Download the quantized Mixtral checkpoint to a local directory
# (the repo id below is a placeholder; substitute the actual quantized checkpoint repo)
!huggingface-cli download models/mixtral-8x7b --local-dir models/mixtral-8x7b

This snippet installs key dependencies (NumPy and Transformers), sets environment variables for GPU support, clones the Mixtral offloading repo, and downloads the pre-trained, quantized model. It may take a few minutes to run.
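Before loading the model, it's worth a quick sanity check that Colab actually allocated a GPU. This short snippet isn't part of the setup itself; it just uses PyTorch's standard CUDA utilities:

import torch

# Confirm that a CUDA device is visible and report its name and memory
assert torch.cuda.is_available(), "No GPU detected - check Runtime > Change runtime type"
print(torch.cuda.get_device_name(0))   # should report a Tesla T4 on the free tier
print(f"{torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB VRAM")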

Loading the Model

Next, we'll import additional modules and load the quantized Mixtral model:

import sys
sys.path.append('/content/mixtral_offloading')  # make the cloned repo importable

import torch
from transformers import AutoConfig, AutoTokenizer

# BaseQuantizeConfig comes from the HQQ quantization code bundled under src/;
# OffloadConfig, QuantConfig and build_model come from the offloading repo itself
from src.hqq.core.quant import BaseQuantizeConfig
from src.build import OffloadConfig, QuantConfig, build_model

inputs = {
    "model_name": "Mixtral",
    "quant_model_name": "models/mixtral-8x7b",  # local path of the downloaded quantized checkpoint
    "state_path": "models/mixtral-8x7b",
}

config = AutoConfig.from_pretrained(inputs["quant_model_name"])
device = torch.device("cuda")

# Offload 4 of the 8 experts per layer and set memory budgets for the offloading engine
# (these helpers come from the offloading repo; adjust the sizes to your runtime)
offload_config = OffloadConfig(n_offload=4)
offload_config.set_memory_sizes(main_size="20G", offload_size="10G", buffer_size="6G")

# Quantize attention weights to 4 bits and the much larger expert FFN weights to 2 bits
attn_config = BaseQuantizeConfig(nbits=4, group_size=32, quant_zero=True, quant_scale=True)
ffn_config = BaseQuantizeConfig(nbits=2, group_size=32, quant_zero=True, quant_scale=True)
quant_config = QuantConfig(attn_cfg=attn_config, ffn_cfg=ffn_config)

model = build_model(
    name=inputs["model_name"],
    quant_config=quant_config, 
    offload_config=offload_config, 
    device=device,
    state_path=inputs["state_path"]
)        

Here we configure the key model settings: quantization (compressing attention weights to 4 bits and the much larger expert weights to 2 bits to shrink the model and speed up inference), offloading (how many experts per layer live in CPU memory), and memory budgets.

Finally, we build the Mixtral model with these configurations, preparing it for inference on the GPU.
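To see why this combination fits on a 16GB T4, here is a rough back-of-the-envelope estimate. The layer dimensions are Mixtral's published ones; the byte counts ignore quantization metadata such as scales and zero points, so treat the totals as approximations.

# Rough memory estimate for quantized Mixtral 8x7B (approximate, ignores quantization metadata)
hidden, ffn, layers, experts = 4096, 14336, 32, 8

expert_params = 3 * hidden * ffn                      # w1, w2, w3 projections per expert
all_expert_params = expert_params * experts * layers  # ~45B parameters live in the experts
other_params = 47e9 - all_expert_params               # attention, embeddings, routers

gpu_experts = all_expert_params / 2 * (2 / 8)         # half the experts on GPU at 2 bits/param
gpu_other = other_params * (4 / 8)                    # non-expert weights at 4 bits/param

print(f"Expert parameters in total:   {all_expert_params / 1e9:.1f}B")
print(f"Expert weights kept on GPU:   {gpu_experts / 1e9:.1f} GB")
print(f"Non-expert weights on GPU:    {gpu_other / 1e9:.1f} GB")

Roughly 6-7 GB of weights stay resident on the GPU, leaving headroom for activations, the KV cache, and the buffer used to swap experts in and out.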

Inference with Mixtral

Now the exciting part - let's generate some text with our mighty Mixtral model!

from transformers import TextStreamer

tokenizer = AutoTokenizer.from_pretrained(inputs["quant_model_name"])
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

past_key_values = None  # key/value cache carried across turns so the model keeps the conversation context

while True:
    message = input("Enter your message: ")
    input_ids = tokenizer(message, return_tensors="pt", max_length=1024, truncation=True).input_ids.to(device)

    # The attention mask must cover both the cached tokens from previous turns and the new prompt
    if past_key_values is None:
        attention_mask = torch.ones_like(input_ids)
    else:
        past_len = past_key_values[0][0].shape[2]  # length of the cached sequence
        attention_mask = torch.ones(1, past_len + input_ids.shape[1], dtype=torch.long, device=device)

    gen_kwargs = {
        "max_new_tokens": 512,
        "do_sample": True,
        "top_k": 50,
        "top_p": 0.9,
        "temperature": 1.0,
        "pad_token_id": tokenizer.eos_token_id,
    }

    with torch.no_grad():
        result = model.generate(
            input_ids=input_ids,
            attention_mask=attention_mask,
            past_key_values=past_key_values,
            streamer=streamer,  # prints the response to stdout token by token as it is generated
            return_dict_in_generate=True,
            **gen_kwargs,
        )

    past_key_values = result.past_key_values  # reuse the cache on the next turn
    print()

This code sets up a conversational loop with Mixtral. For each user input, it tokenizes the message, builds an attention mask that covers both the cached conversation and the new prompt, and calls the model's generate function with the sampling settings. The key/value cache (past_key_values) is carried from turn to turn so the model remembers earlier messages.

The model's output is streamed back token by token by the TextStreamer, giving an interactive, chat-like experience.
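The sampling parameters are worth experimenting with. As an illustration, the following alternative gen_kwargs (standard Transformers generate arguments, not settings prescribed by the offloading code) trade creativity for more deterministic, focused answers:

# More deterministic settings: greedy decoding with a shorter response budget
gen_kwargs = {
    "max_new_tokens": 256,
    "do_sample": False,             # pick the highest-probability token at each step
    "repetition_penalty": 1.1,      # discourage the model from looping
    "pad_token_id": tokenizer.eos_token_id,
}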

Conclusion

In this hands-on guide, we've seen how to run the powerful 47B-parameter Mixtral language model on Google Colab's free tier. By leveraging expert offloading and aggressive quantization, Mixtral delivers state-of-the-art performance while remaining efficient enough for modest hardware.

As an open source model, Mixtral opens up exciting possibilities for building advanced language AI applications - chatbots, writing assistants, code generators, and more - without the need for expensive compute resources. We encourage you to experiment further with Mixtral and share your creations!
