Harnessing the Power of Distributed Language Models: An Excursion into LLaMA, BLOOM, and Petals
Imagine a world where the power of gargantuan language models like LLaMA-65B and BLOOM-176B can be harnessed straight from your humble laptop or desktop. Through the distributed computing revolution, a portal to such a world is now open: an era of endless possibilities.
Stepping into the spotlight, Petals is a pioneering platform that runs these behemoth language models collaboratively. With a small piece of the model dwelling on your device, you join forces with individuals scattered across the globe, each wielding their own share of the model. Together, you breathe life into an intricate web of linguistic understanding.
For the curious minds, the Python libraries 'transformers' and 'petals' serve as your magic wand. A simple spell involving a model name, "enoch/llama-65b-hf" for instance, or perhaps "bigscience/bloom" or "bigscience/bloomz", sets the stage for the marvel that follows. Unleash the raw power of AutoTokenizer and AutoDistributedModelForCausalLM, and voilà! You can generate text and fine-tune these language models for your own tasks.
To ignite this magic, consider this enticing incantation:
from transformers import AutoTokenizer
from petals import AutoDistributedModelForCausalLM

model_name = "enoch/llama-65b-hf"

# Load the tokenizer locally; the model's layers are served by the public swarm
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoDistributedModelForCausalLM.from_pretrained(model_name)

# Ask the swarm to continue the prompt by five tokens
inputs = tokenizer("A cat sat", return_tensors="pt")["input_ids"]
outputs = model.generate(inputs, max_new_tokens=5)
print(tokenizer.decode(outputs[0]))  # A cat sat on a mat...
In this intricate dance of code, the text input "A cat sat" stirs into life: the swarm continues it token by token, much like the story that follows the phrase.
The Petals platform exemplifies an enchanting blend of flexibility and comfort, akin to the agile Python it is built on. With single-batch inference clocking in at 3-4 steps/sec for LLaMA-65B and about 1 step/sec for BLOOM-176B, it outperforms offloading by up to 10 times. That speed is sufficient for chatbots and other interactive apps, and it opens an avenue for personalized fine-tuning, custom sampling methods, and custom paths through the model, as the sketch below suggests.
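To make this concrete, here is a minimal sketch of a chatbot-style loop. It assumes the model and tokenizer from the incantation above and leans on Petals' inference_session together with standard transformers sampling arguments; treat the exact keyword names as assumptions to verify against the current Petals docs.

# Hedged sketch: assumes `model` and `tokenizer` from the snippet above
with model.inference_session(max_length=512) as session:
    while True:
        prompt = input("You: ")
        inputs = tokenizer(prompt, return_tensors="pt")["input_ids"]
        # Reusing the session keeps attention caches on the servers,
        # so each turn only pays for its new tokens
        outputs = model.generate(
            inputs, max_new_tokens=50, session=session,
            do_sample=True, temperature=0.9, top_p=0.6,
        )
        print("Model:", tokenizer.decode(outputs[0]))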
However, any magic requires a little groundwork. An Anaconda environment must be conjured to prepare your device to host a part of LLaMA-65B or its companions. This enchantment requires Linux and Python 3.7+ and is invoked with a few commands:
# Install PyTorch with CUDA support, then Petals itself
conda install pytorch pytorch-cuda=11.7 -c pytorch -c nvidia
pip install git+https://github.com/bigscience-workshop/petals

# Serve your slice of LLaMA-65B, here with the Guanaco adapter attached
python -m petals.cli.run_server enoch/llama-65b-hf --adapters timdettmers/guanaco-65b
If you'd prefer an alternative, don't worry: Docker images are available. They operate smoothly on Linux, macOS, and Windows with WSL2.
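For reference, launching a server from the Docker image looks roughly like this; the image name, port, and flags follow the Petals README, but take them as assumptions to check against the current docs.

# Assumed invocation based on the Petals README (verify image name and flags)
sudo docker run -p 31330:31330 --ipc host --gpus all \
    --volume petals-cache:/cache --rm learningathome/petals:main \
    python -m petals.cli.run_server --port 31330 enoch/llama-65b-hf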
The narrative doesn't end here; rather, it unfurls into a saga of tutorials, examples, and abundant resources. For novices and wizards alike, guides walk you through prompt-tuning LLaMA-65B for text semantic classification or breathing life into a personified chatbot with BLOOM; a sketch of that fine-tuning style follows below.
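To give a flavor of that fine-tuning, here is a minimal prompt-tuning sketch in the style of the Petals tutorials, where tuning_mode="ptune" learns a small set of local prompt embeddings while the remote weights stay frozen. The hyperparameters and the toy training step are illustrative assumptions, not a recipe.

import torch
from transformers import AutoTokenizer
from petals import AutoDistributedModelForCausalLM

model_name = "enoch/llama-65b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)

# "ptune" trains only pre_seq_len local prompt embeddings;
# the distributed model weights themselves remain frozen
model = AutoDistributedModelForCausalLM.from_pretrained(
    model_name, tuning_mode="ptune", pre_seq_len=16
)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

# One illustrative training step on a toy example
batch = tokenizer("A cat sat on a mat.", return_tensors="pt")["input_ids"]
loss = model(input_ids=batch, labels=batch).loss
loss.backward()
optimizer.step()
optimizer.zero_grad()
print(f"loss: {loss.item():.3f}")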
Remember, in Petals you aren't just an observer but an active participant. Your requests flow through a public swarm of volunteers, a cooperative force propelling these models into action. If privacy is your fortress, however, you can conjure a private swarm within trusted confines.
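As a sketch of that private-swarm setup, assuming the --new_swarm and --initial_peers options described in the Petals documentation (the address and peer ID below are placeholders):

# Start the first server as a fresh, isolated swarm
python -m petals.cli.run_server enoch/llama-65b-hf --new_swarm

# Join further servers (and clients) via the first server's multiaddress
python -m petals.cli.run_server enoch/llama-65b-hf \
    --initial_peers /ip4/10.0.0.1/tcp/31337/p2p/<first_server_peer_id>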
This is a universe where you hold the power of titanic language models, yet you're not alone. You're part of a grand orchestra that collaboratively brings these models to life, igniting a new era in the language processing world — an era of distributed language models.