Fine-Tuning LLaMA 2 with Amazon SageMaker JumpStart

In the rapidly advancing realm of machine learning, fine-tuning pre-trained models has become a linchpin for customized AI solutions. At the forefront of this innovation is the Llama 2 model, a powerhouse in language comprehension and generation. This article delves into the details of fine-tuning Llama 2 with Amazon SageMaker JumpStart, a service designed to streamline the deployment and fine-tuning of machine learning models. We will explore how SageMaker JumpStart not only simplifies the process of adapting Llama 2 to specific domains but also amplifies its capabilities, unlocking new potential for AI applications across various industries.

Fine-tuning Llama 2 on Amazon SageMaker JumpStart involves a nuanced process that leverages the advanced capabilities of SageMaker and the Llama 2 language models developed by Meta. This article delves into a more detailed exploration of this process.

  1. Llama 2 Models: Llama 2 is a collection of large language models (LLMs) developed by Meta, with a parameter range from 7 billion to 70 billion. These models are designed for generative text tasks such as chat applications and are optimized for dialogue use cases.
  2. Amazon SageMaker JumpStart: This platform is a hub for machine learning solutions, offering access to algorithms, models, and tools for ML development. It supports the deployment and fine-tuning of various foundation models, including Llama 2, providing an integrated environment for ML tasks.

Fine-Tuning Process

  1. Methods of Fine-Tuning: Llama 2 models can be fine-tuned through the SageMaker Studio UI or the SageMaker Python SDK. The UI method offers a no-code approach in which users configure hyperparameters and deployment settings, while the SDK method gives developers more flexibility and control (a minimal SDK sketch follows this list).
  2. Dataset Formats for Fine-Tuning: SageMaker JumpStart supports both domain adaptation and instruction tuning formats for datasets. Domain adaptation focuses on specific domain data, while instruction tuning improves performance for unseen tasks through zero-shot prompts.
  3. Optimizations for Large Models: Techniques like Low-Rank Adaptation (LoRA), Int8 quantization, and Fully Sharded Data Parallel (FSDP) are employed to manage the large size of these models, addressing memory requirements and optimizing training time.
  4. Hyperparameters and Instance Types: A range of hyperparameters like epoch, learning rate, batch size, and quantization options are available, influencing training speed and model performance. Different instance types with various configurations are provided to suit different model sizes and training needs.
  5. Handling Large Model Challenges: Issues such as output compression and SageMaker Studio kernel timeout are addressed with specific solutions like disabling output compression and using training job names for deployment in case of timeouts.
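As a point of reference for the SDK route, here is a minimal sketch. It assumes the SageMaker Python SDK's JumpStartEstimator, an execution role already configured for the session, and an S3 prefix you own; the model ID and hyperparameter names follow the public JumpStart Llama 2 examples and may differ between SDK versions.

# Minimal sketch: fine-tuning Llama 2 7B through SageMaker JumpStart (SDK route).
# Assumes a configured SageMaker session/role and training data already uploaded to S3.
from sagemaker.jumpstart.estimator import JumpStartEstimator

estimator = JumpStartEstimator(
    model_id="meta-textgeneration-llama-2-7b",
    environment={"accept_eula": "true"},  # Llama 2 requires accepting Meta's EULA
    instance_type="ml.g5.12xlarge",       # adjust to the model size being fine-tuned
)
estimator.set_hyperparameters(instruction_tuned="True", epoch="3")
estimator.fit({"training": "s3://<your-bucket>/llama2-train/"})  # hypothetical S3 prefix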


Training Llamas in Action

SageMaker's Notebook instances

We will use a SageMaker notebook instance for this walkthrough.

From the Notebook instances page in the SageMaker console, create a new notebook instance.

When its status becomes InService, click "Open JupyterLab".

If you have a previously stopped notebook instance, you may need to start it first.

Install the SageMaker library and Hugging Face datasets

!pip install --upgrade sagemaker datasets boto3        


Load Dataset

access_token = "hf_XXXXXXXX"  # update with your Hugging Face access token

model_id = "meta-llama/Llama-2-7b-hf"

dataset_name = "tatsu-lab/alpaca"


from datasets import load_dataset
from transformers import AutoTokenizer

from huggingface_hub.hf_api import HfFolder
HfFolder.save_token(access_token)

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id, token=access_token)

# Load dataset from huggingface.co
dataset = load_dataset(dataset_name)

This code snippet loads a dataset and a tokenizer for natural language processing tasks using Hugging Face's libraries:

  1. Import libraries: from datasets import load_dataset imports the load_dataset function, which loads datasets from the Hugging Face Hub; from transformers import AutoTokenizer imports AutoTokenizer, a utility that automatically selects the appropriate tokenizer for a given model.
  2. Authentication token: from huggingface_hub.hf_api import HfFolder and HfFolder.save_token(access_token) save an access token for the Hugging Face Hub. This is required for gated models such as Llama 2 and for private datasets.
  3. Load tokenizer: tokenizer = AutoTokenizer.from_pretrained(model_id, token=access_token) loads the tokenizer that matches model_id. The token parameter authenticates access to gated or private models.
  4. Load dataset: dataset = load_dataset(dataset_name) loads the tatsu-lab/alpaca dataset from the Hugging Face Hub.
  5. Optional downsampling: the dataset can be shuffled and downsampled before training, for example dataset["train"] = dataset["train"].shuffle(seed=42).select(range(10000)); this step is not included in the snippet above.

This code prepares the data and tokenizer for training or inference, leveraging Hugging Face's extensive library of datasets and tokenizers.

Viewing Dataset Structure:

  • Explore the structure and types of data in the dataset:
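For example, a quick look at the splits, features, and a sample record (the column names noted in the comments are those of the tatsu-lab/alpaca dataset loaded above):

# Inspect the loaded dataset: splits, column types, and one example record.
print(dataset)                    # DatasetDict with its splits and row counts
print(dataset["train"].features)  # column names and types (instruction, input, output, text)
print(dataset["train"][0])        # first record as a plain dict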

Split into "validation". and "train"

if "validation" not in dataset.keys():
    dataset["validation"] = load_dataset(
        dataset_name,
        split="train[:5%]"
    )

    dataset["train"] = load_dataset(
        dataset_name,
        split="train[5%:]"
    )        

This code checks if a validation split is present in the dataset and, if not, creates one:

  1. Condition check: if "validation" not in dataset.keys(): checks whether the dataset already contains a validation split. If it does not, the code inside the if block runs.
  2. Creating the validation split: dataset["validation"] = load_dataset(dataset_name, split="train[:5%]") loads the first 5% of the training data and assigns it to dataset["validation"].
  3. Updating the training split: dataset["train"] = load_dataset(dataset_name, split="train[5%:]") reloads the training data without the first 5%, so the training set no longer overlaps with the validation set.

Essentially, this code ensures that there's a separate validation set by splitting the training data if a dedicated validation set is not originally present in the dataset.


Tokenization

The last step of data preparation is to tokenize and chunk our dataset. Tokenization converts our text inputs into token IDs the model can understand. We then concatenate the samples into chunks of 2048 tokens to avoid unnecessary padding.

from itertools import chain
from functools import partial


def group_texts(examples, block_size=2048):
    # Concatenate all texts.
    concatenated_examples = {k: list(chain(*examples[k])) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # Drop the small remainder; we could pad instead if the model supported it.
    # Customize this part to your needs.
    if total_length >= block_size:
        total_length = (total_length // block_size) * block_size
    # Split into chunks of block_size.
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()
    }
    # For causal language modeling, the labels are the input IDs themselves.
    result["labels"] = result["input_ids"].copy()
    return result


column_names = dataset["train"].column_names

lm_dataset = dataset.map(
    lambda sample: tokenizer(sample["text"], return_token_type_ids=False),
    batched=True,
    remove_columns=list(column_names),
).map(
    partial(group_texts, block_size=2048),
    batched=True,
)

This code defines a group_texts function and applies it to the dataset using the map method from the datasets library:

  1. group_texts function: concatenates the texts in each batch, truncates the concatenation to a multiple of block_size (2048 tokens by default), splits it into chunks of block_size, and copies input_ids into labels for the causal language modeling objective.
  2. column_names: retrieves the column names from the training dataset so the original text columns can be removed.
  3. lm_dataset creation: the first map call tokenizes the text data with the tokenizer and removes the original columns; the second map call applies group_texts to the tokenized data, producing fixed-length sequences.

The result is a tokenized dataset of block_size-token sequences, ready for training Llama 2.
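As an optional sanity check (not part of the original notebook), you can print the chunk counts per split and decode the start of the first sequence:

# Sanity check: number of fixed-length chunks per split and a decoded preview.
print(lm_dataset)
print(tokenizer.decode(lm_dataset["train"][0]["input_ids"][:64]))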

Fine-Tune the Llama 2 Model with Transformers + LoRA on Local GPUs

We will use the four GPUs available on this notebook instance to launch a distributed training job with torch distributed (torchrun).

We will start by saving the tokenized data locally.
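The save step itself is not shown in the original notebook; here is a minimal sketch, assuming scripts/run_clm_lora.py (not reproduced in this article) reads the path with datasets.load_from_disk:

# Persist the tokenized data so the torchrun workers can read it from local disk.
# Saving only the train split here; adjust if the training script expects the full DatasetDict.
lm_dataset["train"].save_to_disk("processed/data")  # matches --dataset_path below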

! torchrun --nnodes 1 \
        --nproc_per_node 4 \
        --master_addr localhost \
        --master_port 7777 \
        scripts/run_clm_lora.py \
        --bf16 True \
        --dataset_path processed/data \
        --output_dir model \
        --epochs 3 \
        --gradient_checkpointing True \
        --model_id {model_id} \
        --optimizer adamw_torch \
        --per_device_train_batch_size 1 \
        --access_token {access_token} \
        --max_steps 100         

This command launches a distributed training job with PyTorch's torchrun (the successor to torch.distributed.launch), a helper that facilitates distributed training:

  • --nnodes 1: Number of nodes to use for training (1 node here).
  • --nproc_per_node 4: Number of processes per node (4 here, one per GPU).
  • --master_addr localhost: Address of the master process used to set up distributed training.
  • --master_port 7777: Port for the master node's process.
  • scripts/run_clm_lora.py: The script to run, a causal language modeling script that uses Low-Rank Adaptation (LoRA); a sketch of a typical LoRA setup follows this list.
  • --bf16 True: Use bfloat16 mixed precision training if available.
  • --dataset_path processed/data: Path to the processed dataset.
  • --output_dir model: Directory where the model checkpoints will be saved.
  • --epochs 3: Number of training epochs.
  • --gradient_checkpointing True: Enable gradient checkpointing to save memory.
  • --model_id {model_id}: ID of the model to be trained.
  • --optimizer adamw_torch: Optimizer to use for training, AdamW here.
  • --per_device_train_batch_size 1: Batch size per device.
  • --access_token {access_token}: Access token for authentication (if needed, e.g., with Hugging Face Hub).
  • --max_steps 100: Maximum number of training steps to perform.
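scripts/run_clm_lora.py itself is not reproduced in this article. For orientation, here is a minimal sketch of the LoRA setup such a script typically contains, assuming the Hugging Face transformers and peft libraries; the rank, alpha, and target modules are illustrative defaults, not the script's actual values.

# Illustrative LoRA setup (not the actual contents of scripts/run_clm_lora.py).
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    torch_dtype=torch.bfloat16,  # pairs with the --bf16 True flag above
)

lora_config = LoraConfig(
    r=8,                                  # low-rank dimension (illustrative)
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections commonly adapted in Llama 2
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the small adapter matrices are trainable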


Merging the LoRA Adapters Back into the Base Model

# This converts the PEFT (LoRA) model back into a full model for inference

! python scripts/merge_peft_adapters.py --base_model_name_or_path meta-llama/Llama-2-7b-hf \
                                        --peft_model_path model/final_checkpoint         

The command runs the merge_peft_adapters.py script with the specified arguments to convert a Parameter-Efficient Fine-Tuning (PEFT) model back into a full causal language model suitable for inference:

  • --base_model_name_or_path meta-llama/Llama-2-7b-hf: Specifies the base model (here, the 7-billion-parameter Llama 2 model).
  • --peft_model_path model/final_checkpoint: The path to the PEFT (LoRA adapter) checkpoint that will be merged into the base model.

When executed, the script will:

  1. Load the base causal language model specified by --base_model_name_or_path.
  2. Load the PEFT model from --peft_model_path.
  3. Merge the PEFT adapters into the base model, converting it back to a full model.
  4. Save the merged model (and tokenizer) to the same path with "-merged" appended to the base model name, unless --push_to_hub is specified, in which case the model would be pushed to the Hugging Face Hub.

Here is scripts/merge_peft_adapters.py:

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch

import argparse


def get_args():
    parser = argparse.ArgumentParser()
    parser.add_argument("--base_model_name_or_path", type=str, default="bigcode/starcoderbase-7b")
    parser.add_argument("--peft_model_path", type=str, default="/")
    parser.add_argument("--push_to_hub", action="store_true", default=False)

    return parser.parse_args()


def main():
    args = get_args()

    # Load the base model in half precision.
    base_model = AutoModelForCausalLM.from_pretrained(
        args.base_model_name_or_path,
        return_dict=True,
        torch_dtype=torch.float16,
    )

    # Attach the LoRA adapters, then merge them into the base weights.
    model = PeftModel.from_pretrained(base_model, args.peft_model_path)
    model = model.merge_and_unload()

    tokenizer = AutoTokenizer.from_pretrained(args.base_model_name_or_path)

    if args.push_to_hub:
        print("Saving to hub ...")
        model.push_to_hub(f"{args.base_model_name_or_path}-merged", use_temp_dir=False, private=True)
        tokenizer.push_to_hub(f"{args.base_model_name_or_path}-merged", use_temp_dir=False, private=True)
    else:
        model.save_pretrained(f"{args.base_model_name_or_path}-merged")
        tokenizer.save_pretrained(f"{args.base_model_name_or_path}-merged")
        print(f"Model saved to {args.base_model_name_or_path}-merged")


if __name__ == "__main__":
    main()
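After merging, the checkpoint can be loaded like any other Transformers model for a quick local smoke test. Here is a minimal sketch, assuming the merge ran with its default save path (the base model name with "-merged" appended) and that the instance has enough GPU memory for fp16 inference; the accelerate package is needed for device_map="auto".

# Load the merged checkpoint and generate from an Alpaca-style prompt.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

merged_path = "meta-llama/Llama-2-7b-hf-merged"  # local directory written by merge_peft_adapters.py

tokenizer = AutoTokenizer.from_pretrained(merged_path)
model = AutoModelForCausalLM.from_pretrained(
    merged_path,
    torch_dtype=torch.float16,
    device_map="auto",  # requires the accelerate package
)

generator = pipeline("text-generation", model=model, tokenizer=tokenizer)
prompt = "### Instruction:\nSummarize what Amazon SageMaker JumpStart provides.\n\n### Response:\n"
print(generator(prompt, max_new_tokens=128)[0]["generated_text"])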



Conclusion

Fine-tuning Llama 2 models on Amazon SageMaker JumpStart represents a confluence of advanced AI technology and robust cloud computing infrastructure. The process is marked by its flexibility, offering various methods and optimizations to cater to different requirements and complexities inherent in handling large-scale language models.

References:

https://github.com/aws/amazon-sagemaker-examples/blob/main/introduction_to_amazon_algorithms/jumpstart-foundation-models/llama-2-finetuning.ipynb


https://aws.amazon.com/blogs/machine-learning/fine-tune-llama-2-for-text-generation-on-amazon-sagemaker-jumpstart/


