#41-42 Let's quiz on the Llama 3 Model. Shall we?

I had a look into the latest Llama 3 code that was just released on GitHub. Llama 3 is a family of large language models (LLMs) released as pre-trained and instruction-tuned generative text models in two sizes: 8B and 70B parameters.

What's the architecture like?

"Llama 3 is an auto-regressive language model that uses an optimized transformer architecture." Let's break it down. Auto-regressive models, like Llama 3, generate text by predicting the next word in a sequence based on the words that came before it.

The Transformer architecture is a type of neural network architecture introduced in the paper "Attention is All You Need" by Vaswani et al. It is particularly well-suited for sequence transduction tasks, such as language modelling and machine translation.

The tuned versions use supervised fine-tuning (SFT) and reinforcement learning with human feedback (RLHF) to align with human preferences for helpfulness and safety. Supervised fine-tuning involves training a pre-trained model (like Llama 3) further on a specific task or dataset using labelled examples. In the context of language models, RLHF involves training the model to generate text that aligns with human preferences for factors such as helpfulness and safety.
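
A minimal sketch of the SFT objective described above, using a toy batch and random stand-in logits (my own illustration, not Meta's training code): standard next-token cross-entropy on a tokenized prompt/response pair, with the loss masked so that only the response tokens contribute.

import torch
import torch.nn.functional as F

vocab_size, seq_len = 100, 8
logits = torch.randn(1, seq_len, vocab_size)              # stand-in model outputs
targets = torch.randint(0, vocab_size, (1, seq_len))      # tokenized prompt + response
response_mask = torch.tensor([[0., 0., 0., 1., 1., 1., 1., 1.]])  # 1 = response token

per_token = F.cross_entropy(
    logits.view(-1, vocab_size), targets.view(-1), reduction="none"
).view(1, seq_len)
sft_loss = (per_token * response_mask).sum() / response_mask.sum()
print(sft_loss)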

Llama 3 was pre-trained on over 15 trillion tokens of data from publicly available sources. The fine-tuning data includes publicly available instruction datasets, as well as over 10M human-annotated examples. Neither the pretraining nor the fine-tuning datasets include Meta user data.

Training data

Llama 3 is pre-trained on over 15T tokens that were all collected from publicly available sources. The training dataset is seven times larger than that used for Llama 2, and it includes four times more code. To prepare for upcoming multilingual use cases, over 5% of the Llama 3 pretraining dataset consists of high-quality non-English data that covers over 30 languages. However, they do not expect the same level of performance in these languages as in English.

To ensure Llama 3 is trained on data of the highest quality, they developed a series of data-filtering pipelines. These pipelines include using heuristic filters, NSFW filters, semantic deduplication approaches, and text classifiers to predict data quality. They found that previous generations of Llama were surprisingly good at identifying high-quality data, hence they used Llama 2 to generate the training data for the text-quality classifiers that are powering Llama 3.

Scaling up pretraining

To train the largest Llama 3 models, they combined three types of parallelization: data parallelization, model parallelization, and pipeline parallelization. They performed training runs on two custom-built 24K-GPU clusters. To maximize GPU uptime, they developed an advanced new training stack that automates error detection, handling, and maintenance. They also greatly improved hardware reliability and detection mechanisms for silent data corruption, and developed new scalable storage systems that reduce the overheads of checkpointing and rollback. Those improvements resulted in an overall effective training time of more than 95%. Combined, these improvements increased the efficiency of Llama 3 training by roughly three times compared to Llama 2.

Approach to post-training

Their approach to post-training is a combination of:

  1. Supervised fine-tuning (SFT)
  2. Rejection sampling
  3. Proximal Policy Optimization (PPO)
  4. Direct Preference Optimization (DPO)

They found that if they ask a model a reasoning question that it struggles to answer, the model will sometimes produce the right reasoning trace: The model knows how to produce the right answer, but it does not know how to select it. Training on preference rankings enables the model to learn how to select it.
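
As an illustration of learning from preference rankings, here is a hedged sketch of a DPO-style loss (one of the techniques listed above, though not necessarily Meta's exact recipe): it rewards the policy for increasing its log-likelihood margin on the chosen answer over the rejected one, relative to a frozen reference model.

import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp: torch.Tensor, policy_rejected_logp: torch.Tensor,
             ref_chosen_logp: torch.Tensor, ref_rejected_logp: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    # Margins are measured relative to the frozen reference model.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()

# Dummy sequence log-probabilities for a batch of two preference pairs.
loss = dpo_loss(torch.tensor([-5.0, -6.0]), torch.tensor([-7.0, -6.5]),
                torch.tensor([-5.5, -6.2]), torch.tensor([-6.8, -6.4]))
print(loss)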

Instruction fine-tuning

Instruction fine-tuning also plays a major role in ensuring the safety of these models. These instruction-fine-tuned models have been red-teamed (tested) for safety through internal and external efforts. Their red-teaming approach leveraged human experts and automation methods to generate adversarial prompts that try to elicit problematic responses. For instance, they apply comprehensive testing to assess risks of misuse related to chemical, biological, cyber security, and other risk areas. All of these efforts are iterative and used to inform the safety fine-tuning of the models being released.


GitHub:

Link to complete code

Llama 3 is implemented as a transformer-based model in PyTorch. The walkthrough below goes through the model code step by step.

  1. Import Libraries

# Copyright (c) Meta Platforms, Inc. and affiliates.
# This software may be used and distributed in accordance with the terms of the Llama 3 Community License Agreement.

import math
from dataclasses import dataclass
from typing import Optional, Tuple

import fairscale.nn.model_parallel.initialize as fs_init
import torch
import torch.nn.functional as F
from fairscale.nn.model_parallel.layers import (
    ColumnParallelLinear,
    RowParallelLinear,
    VocabParallelEmbedding,
)
from torch import nn        

2. Creating Classes

A data class, ModelArgs, is defined to hold the model hyperparameters, such as the dimensionality (dim), number of layers (n_layers), number of attention heads (n_heads), and vocabulary size (vocab_size).

@dataclass
class ModelArgs:
    dim: int = 4096
    n_layers: int = 32
    n_heads: int = 32
    n_kv_heads: Optional[int] = None
    vocab_size: int = -1
    multiple_of: int = 256  # make SwiGLU hidden layer size multiple of large power of 2
    ffn_dim_multiplier: Optional[float] = None
    norm_eps: float = 1e-5
    rope_theta: float = 500000

    max_batch_size: int = 32
    max_seq_len: int = 2048        
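
For illustration, here is how the data class might be instantiated for a small experiment. These values are made up for the example and are not a released Llama 3 configuration; the real values ship with the checkpoint.

args = ModelArgs(dim=512, n_layers=4, n_heads=8, n_kv_heads=2,
                 vocab_size=32000, max_seq_len=256, max_batch_size=4)
print(args.dim // args.n_heads)   # head_dim = 64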

RMSNorm implements RMS normalization. The input tensors are normalized along the last dimension using the root mean square.

class RMSNorm(torch.nn.Module):
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def _norm(self, x):
        return x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)

    def forward(self, x):
        output = self._norm(x.float()).type_as(x)
        return output * self.weight        
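
A quick sanity check of the behaviour described above (assuming the imports from step 1 are in scope): with the weight left at its initial value of ones, every position of the output has a root-mean-square of roughly 1 along the last dimension.

norm = RMSNorm(dim=16)
x = torch.randn(2, 3, 16)
y = norm(x)
print(y.pow(2).mean(-1).sqrt())   # ≈ 1.0 for every (batch, position) pair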

3. Functions

  1. precompute_freqs_cis: Function to precompute frequencies used in the positional encoding for rotary embeddings.
  2. reshape_for_broadcast: Function to reshape the frequencies tensor for broadcasting during rotary embedding computation.
  3. apply_rotary_emb: Function to apply rotary embeddings to the queries and keys of the attention mechanism.
  4. repeat_kv: Function to repeat the key-value pairs for multiple heads if the number of key-value heads is less than the total number of attention heads.

def precompute_freqs_cis(dim: int, end: int, theta: float = 10000.0):
    freqs = 1.0 / (theta ** (torch.arange(0, dim, 2)[: (dim // 2)].float() / dim))
    t = torch.arange(end, device=freqs.device, dtype=torch.float32)
    freqs = torch.outer(t, freqs)
    freqs_cis = torch.polar(torch.ones_like(freqs), freqs)  # complex64
    return freqs_cis


def reshape_for_broadcast(freqs_cis: torch.Tensor, x: torch.Tensor):
    ndim = x.ndim
    assert 0 <= 1 < ndim
    assert freqs_cis.shape == (x.shape[1], x.shape[-1])
    shape = [d if i == 1 or i == ndim - 1 else 1 for i, d in enumerate(x.shape)]
    return freqs_cis.view(*shape)


def apply_rotary_emb(
    xq: torch.Tensor,
    xk: torch.Tensor,
    freqs_cis: torch.Tensor,
) -> Tuple[torch.Tensor, torch.Tensor]:
    xq_ = torch.view_as_complex(xq.float().reshape(*xq.shape[:-1], -1, 2))
    xk_ = torch.view_as_complex(xk.float().reshape(*xk.shape[:-1], -1, 2))
    freqs_cis = reshape_for_broadcast(freqs_cis, xq_)
    xq_out = torch.view_as_real(xq_ * freqs_cis).flatten(3)
    xk_out = torch.view_as_real(xk_ * freqs_cis).flatten(3)
    return xq_out.type_as(xq), xk_out.type_as(xk)


def repeat_kv(x: torch.Tensor, n_rep: int) -> torch.Tensor:
    """torch.repeat_interleave(x, dim=2, repeats=n_rep)"""
    bs, slen, n_kv_heads, head_dim = x.shape
    if n_rep == 1:
        return x
    return (
        x[:, :, :, None, :]
        .expand(bs, slen, n_kv_heads, n_rep, head_dim)
        .reshape(bs, slen, n_kv_heads * n_rep, head_dim)
    )        
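
To make the shapes concrete, here is a quick check (my own sketch, not from the repo) that exercises the helpers above with tiny illustrative sizes; it assumes the imports and definitions from the previous steps are in scope.

head_dim, seqlen, bsz = 8, 4, 1
freqs_cis = precompute_freqs_cis(head_dim, seqlen)    # (4, 4), complex64
xq = torch.randn(bsz, seqlen, 2, head_dim)            # 2 query heads
xk = torch.randn(bsz, seqlen, 1, head_dim)            # 1 key/value head
xq_rot, xk_rot = apply_rotary_emb(xq, xk, freqs_cis)  # shapes are unchanged
xk_rep = repeat_kv(xk_rot, n_rep=2)                   # (1, 4, 2, 8): kv head repeated
print(xq_rot.shape, xk_rep.shape)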

4. Attention class

Implements the multi-head attention mechanism, with grouped-query attention: the number of key/value heads (n_kv_heads) can be smaller than the number of query heads, in which case the key/value heads are repeated via repeat_kv. Uses column-parallel linear layers to compute the query, key, and value projections. Applies rotary embeddings to queries and keys. Caches key-value pairs for efficient computation in subsequent decoding steps. Computes attention scores, applies softmax, and takes a weighted sum of the values. Finally, projects the output using a row-parallel linear layer.

class Attention(nn.Module):
    def __init__(self, args: ModelArgs):
        super().__init__()
        self.n_kv_heads = args.n_heads if args.n_kv_heads is None else args.n_kv_heads
        model_parallel_size = fs_init.get_model_parallel_world_size()
        self.n_local_heads = args.n_heads // model_parallel_size
        self.n_local_kv_heads = self.n_kv_heads // model_parallel_size
        self.n_rep = self.n_local_heads // self.n_local_kv_heads
        self.head_dim = args.dim // args.n_heads

        self.wq = ColumnParallelLinear(
            args.dim,
            args.n_heads * self.head_dim,
            bias=False,
            gather_output=False,
            init_method=lambda x: x,
        )
        self.wk = ColumnParallelLinear(
            args.dim,
            self.n_kv_heads * self.head_dim,
            bias=False,
            gather_output=False,
            init_method=lambda x: x,
        )
        self.wv = ColumnParallelLinear(
            args.dim,
            self.n_kv_heads * self.head_dim,
            bias=False,
            gather_output=False,
            init_method=lambda x: x,
        )
        self.wo = RowParallelLinear(
            args.n_heads * self.head_dim,
            args.dim,
            bias=False,
            input_is_parallel=True,
            init_method=lambda x: x,
        )

        self.cache_k = torch.zeros(
            (
                args.max_batch_size,
                args.max_seq_len,
                self.n_local_kv_heads,
                self.head_dim,
            )
        ).cuda()
        self.cache_v = torch.zeros(
            (
                args.max_batch_size,
                args.max_seq_len,
                self.n_local_kv_heads,
                self.head_dim,
            )
        ).cuda()

    def forward(
        self,
        x: torch.Tensor,
        start_pos: int,
        freqs_cis: torch.Tensor,
        mask: Optional[torch.Tensor],
    ):
        bsz, seqlen, _ = x.shape
        xq, xk, xv = self.wq(x), self.wk(x), self.wv(x)

        xq = xq.view(bsz, seqlen, self.n_local_heads, self.head_dim)
        xk = xk.view(bsz, seqlen, self.n_local_kv_heads, self.head_dim)
        xv = xv.view(bsz, seqlen, self.n_local_kv_heads, self.head_dim)

        xq, xk = apply_rotary_emb(xq, xk, freqs_cis=freqs_cis)

        self.cache_k = self.cache_k.to(xq)
        self.cache_v = self.cache_v.to(xq)

        self.cache_k[:bsz, start_pos : start_pos + seqlen] = xk
        self.cache_v[:bsz, start_pos : start_pos + seqlen] = xv

        keys = self.cache_k[:bsz, : start_pos + seqlen]
        values = self.cache_v[:bsz, : start_pos + seqlen]

        # repeat k/v heads if n_kv_heads < n_heads
        keys = repeat_kv(
            keys, self.n_rep
        )  # (bs, cache_len + seqlen, n_local_heads, head_dim)
        values = repeat_kv(
            values, self.n_rep
        )  # (bs, cache_len + seqlen, n_local_heads, head_dim)

        xq = xq.transpose(1, 2)  # (bs, n_local_heads, seqlen, head_dim)
        keys = keys.transpose(1, 2)  # (bs, n_local_heads, cache_len + seqlen, head_dim)
        values = values.transpose(
            1, 2
        )  # (bs, n_local_heads, cache_len + seqlen, head_dim)
        scores = torch.matmul(xq, keys.transpose(2, 3)) / math.sqrt(self.head_dim)
        if mask is not None:
            scores = scores + mask  # (bs, n_local_heads, seqlen, cache_len + seqlen)
        scores = F.softmax(scores.float(), dim=-1).type_as(xq)
        output = torch.matmul(scores, values)  # (bs, n_local_heads, seqlen, head_dim)
        output = output.transpose(1, 2).contiguous().view(bsz, seqlen, -1)
        return self.wo(output)        
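
For intuition, here is a minimal single-device sketch of the attention core above (my own illustration, without the fairscale layers, rotary embeddings, or the KV cache), just to make the score and mask shapes concrete. It assumes the imports from step 1.

bs, n_heads, seqlen, head_dim = 1, 4, 5, 16
q = torch.randn(bs, n_heads, seqlen, head_dim)
k = torch.randn(bs, n_heads, seqlen, head_dim)
v = torch.randn(bs, n_heads, seqlen, head_dim)

scores = torch.matmul(q, k.transpose(2, 3)) / math.sqrt(head_dim)  # (bs, heads, seqlen, seqlen)
mask = torch.triu(torch.full((seqlen, seqlen), float("-inf")), diagonal=1)
scores = scores + mask                                 # causal mask: no attending to future positions
out = torch.matmul(F.softmax(scores, dim=-1), v)       # (bs, heads, seqlen, head_dim)
print(out.shape)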

5. Feedforward Class

Implements the feedforward layer of the transformer. It uses SwiGLU-style gating: the SiLU-activated output of one linear projection (w1) is multiplied element-wise by a second projection (w3), and the result is projected back to the model dimension (w2).

class FeedForward(nn.Module):
    def __init__(
        self,
        dim: int,
        hidden_dim: int,
        multiple_of: int,
        ffn_dim_multiplier: Optional[float],
    ):
        super().__init__()
        hidden_dim = int(2 * hidden_dim / 3)
        # custom dim factor multiplier
        if ffn_dim_multiplier is not None:
            hidden_dim = int(ffn_dim_multiplier * hidden_dim)
        hidden_dim = multiple_of * ((hidden_dim + multiple_of - 1) // multiple_of)

        self.w1 = ColumnParallelLinear(
            dim, hidden_dim, bias=False, gather_output=False, init_method=lambda x: x
        )
        self.w2 = RowParallelLinear(
            hidden_dim, dim, bias=False, input_is_parallel=True, init_method=lambda x: x
        )
        self.w3 = ColumnParallelLinear(
            dim, hidden_dim, bias=False, gather_output=False, init_method=lambda x: x
        )

    def forward(self, x):
        return self.w2(F.silu(self.w1(x)) * self.w3(x))        
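
As a worked example of the hidden-dimension arithmetic above, here is the calculation with the defaults from ModelArgs (dim=4096, multiple_of=256, no ffn_dim_multiplier). Note that the released checkpoints ship their own configuration values, so the actual width can differ.

dim, multiple_of = 4096, 256       # defaults from ModelArgs above
hidden_dim = 4 * dim               # 16384, as passed in by TransformerBlock below
hidden_dim = int(2 * hidden_dim / 3)                                        # 10922
hidden_dim = multiple_of * ((hidden_dim + multiple_of - 1) // multiple_of)  # round up to a multiple of 256
print(hidden_dim)                  # 11008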

6. TransformerBlock

This is a single transformer block. It has an attention mechanism followed by a feedforward layer. Uses RMSNorm for layer normalization.

class TransformerBlock(nn.Module):
    def __init__(self, layer_id: int, args: ModelArgs):
        super().__init__()
        self.n_heads = args.n_heads
        self.dim = args.dim
        self.head_dim = args.dim // args.n_heads
        self.attention = Attention(args)
        self.feed_forward = FeedForward(
            dim=args.dim,
            hidden_dim=4 * args.dim,
            multiple_of=args.multiple_of,
            ffn_dim_multiplier=args.ffn_dim_multiplier,
        )
        self.layer_id = layer_id
        self.attention_norm = RMSNorm(args.dim, eps=args.norm_eps)
        self.ffn_norm = RMSNorm(args.dim, eps=args.norm_eps)

    def forward(
        self,
        x: torch.Tensor,
        start_pos: int,
        freqs_cis: torch.Tensor,
        mask: Optional[torch.Tensor],
    ):
        h = x + self.attention(self.attention_norm(x), start_pos, freqs_cis, mask)
        out = h + self.feed_forward(self.ffn_norm(h))
        return out        

7. Transformer

This is the overall Transformer model. It embeds input tokens using token embeddings, contains a stack of TransformerBlocks, applies rotary positional encoding using the precomputed frequencies, and outputs logits over the vocabulary using a linear layer.

class Transformer(nn.Module):
    def __init__(self, params: ModelArgs):
        super().__init__()
        self.params = params
        self.vocab_size = params.vocab_size
        self.n_layers = params.n_layers

        self.tok_embeddings = VocabParallelEmbedding(
            params.vocab_size, params.dim, init_method=lambda x: x
        )

        self.layers = torch.nn.ModuleList()
        for layer_id in range(params.n_layers):
            self.layers.append(TransformerBlock(layer_id, params))

        self.norm = RMSNorm(params.dim, eps=params.norm_eps)
        self.output = ColumnParallelLinear(
            params.dim, params.vocab_size, bias=False, init_method=lambda x: x
        )

        self.freqs_cis = precompute_freqs_cis(
            params.dim // params.n_heads,
            params.max_seq_len * 2,
            params.rope_theta,
        )

    @torch.inference_mode()
    def forward(self, tokens: torch.Tensor, start_pos: int):
        _bsz, seqlen = tokens.shape
        h = self.tok_embeddings(tokens)
        self.freqs_cis = self.freqs_cis.to(h.device)
        freqs_cis = self.freqs_cis[start_pos : start_pos + seqlen]

        mask = None
        if seqlen > 1:
            mask = torch.full((seqlen, seqlen), float("-inf"), device=tokens.device)

            mask = torch.triu(mask, diagonal=1)

            # When performing key-value caching, we compute the attention scores
            # only for the new sequence. Thus, the matrix of scores is of size
            # (seqlen, cache_len + seqlen), and the only masked entries are (i, j) for
            # j > cache_len + i, since row i corresponds to token cache_len + i.
            mask = torch.hstack(
                [torch.zeros((seqlen, start_pos), device=tokens.device), mask]
            ).type_as(h)

        for layer in self.layers:
            h = layer(h, start_pos, freqs_cis, mask)
        h = self.norm(h)
        output = self.output(h).float()
        return output        
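
To tie it together, here is a rough sketch (my own, not the repo's generation code) of how this KV-cached forward pass is typically driven at inference time. It assumes model is an already-initialized Transformer (which, as written, requires fairscale model-parallel initialization and a GPU); the real generation code adds sampling, temperature, top-p, and stop-token handling.

def greedy_generate(model, prompt_tokens: list[int], max_new_tokens: int) -> list[int]:
    tokens = torch.tensor([prompt_tokens], dtype=torch.long)
    # The first call runs the full prompt and fills the KV cache from position 0.
    logits = model(tokens, start_pos=0)
    next_tok = logits[:, -1].argmax(dim=-1)
    out = list(prompt_tokens) + [int(next_tok)]
    # Later calls feed only the newest token; start_pos points just past the cached prefix.
    for _ in range(max_new_tokens - 1):
        logits = model(next_tok[:, None], start_pos=len(out) - 1)
        next_tok = logits[:, -1].argmax(dim=-1)
        out.append(int(next_tok))
    return out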

Sources:

  1. Meta guide on Llama 3
  2. Llama 3 Demo



Source: Meta's Blog on Llama 3 (https://ai.meta.com/blog/meta-llama-3/) #AI #Llama3 #Meta #LLM
