Microsoft Phi3 Chat Completion Cookbook

Welcome to the Phi-3 chat completion cookbook! This article works through a Python notebook built around the Microsoft model "microsoft/Phi-3-mini-128k-instruct" and serves as a comprehensive guide for users looking to explore and execute chat completion tasks. Whether you're a beginner or an experienced practitioner, it provides a user-friendly walkthrough for running experiments, analyzing outputs, and refining your chat models.


Let's start with what we're doing here -


!pip install -q torch langchain bitsandbytes accelerate transformers sentence-transformers faiss-gpu        

This command installs several Python packages that are essential for working with language models and chat completions; a quick version check after the list below can confirm everything installed correctly. Here's a brief overview of each package:

  1. torch: An open-source machine learning library used for applications such as computer vision and natural language processing. ( PyTorch )
  2. langchain: A library for building language model applications. ( LangChain )
  3. bitsandbytes: Provides efficient implementations of quantized neural network layers. ( bitsandbytes )
  4. accelerate: A Hugging Face library that handles device placement (CPU/GPU) and mixed-precision or distributed execution; transformers relies on it for features like device_map loading.
  5. transformers: Developed by Hugging Face, this library provides general-purpose architectures for Natural Language Understanding (NLU) and Natural Language Generation (NLG).
  6. sentence-transformers: A library for creating sentence, text, and image embeddings.
  7. faiss-gpu: Efficient similarity search and clustering of dense vectors, with GPU support.
  8. huggingface_hub: Enables users to download and publish models on the Hugging Face Model Hub; it is not listed in the command above but is installed automatically as a dependency of transformers. ( Hugging Face )
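
As mentioned above, an optional sanity check (not part of the original notebook) can confirm the installation by printing the installed version of each package:

import importlib.metadata as metadata

# Optional sanity check: print the installed version of each core package.
for pkg in ["torch", "langchain", "bitsandbytes", "accelerate",
            "transformers", "sentence-transformers", "faiss-gpu", "huggingface-hub"]:
    try:
        print(f"{pkg}: {metadata.version(pkg)}")
    except metadata.PackageNotFoundError:
        print(f"{pkg}: not installed")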




import numpy as np
import pandas as pd
from tqdm.auto import tqdm

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline        

Next, we import several essential libraries and modules for natural language processing tasks using PyTorch and the Hugging Face Transformers library. Let's break down each import:

Importing Libraries:

  • numpy (imported as np): A library for numerical computing in Python.
  • pandas (imported as pd): A data manipulation library that provides data structures and functions for working with structured data.
  • tqdm.auto: A package for creating progress bars in loops and other iterative processes.
  • torch: The PyTorch library for deep learning.
  • transformers: A library for working with pre-trained language models, including BERT, GPT, and others.

Language Model Setup:

  • AutoModelForCausalLM: A class from the transformers library that loads a pre-trained causal language model (e.g., GPT-2, Phi-3).
  • AutoTokenizer: A class for tokenizing text data to prepare it for input to the language model (see the short tokenization sketch after this list).
  • pipeline: A utility for creating a simple pipeline to perform specific tasks using the language model (e.g., text generation, sentiment analysis).
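
To make tokenization concrete before the Phi-3 tokenizer is loaded later, here is a minimal, purely illustrative sketch that uses the small gpt2 tokenizer as a stand-in:

from transformers import AutoTokenizer

# Illustrative only: the tiny "gpt2" tokenizer stands in for the Phi-3 tokenizer loaded later.
demo_tokenizer = AutoTokenizer.from_pretrained("gpt2")
encoded = demo_tokenizer("Bananas and dragonfruits make a great smoothie.")
print(encoded["input_ids"])                                        # integer token IDs
print(demo_tokenizer.convert_ids_to_tokens(encoded["input_ids"]))  # the corresponding token strings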


# Checking if GPU is available
if torch.cuda.is_available():
    print("GPU is available.")
    print('Using GPU: ', torch.cuda.get_device_name(0))
    print('Memory Usage: ')
    print('Allocated: ', round(torch.cuda.memory_allocated(0)/1024**3, 1), 'GB')
    # memory_reserved is the current name for the deprecated memory_cached
    print('Reserved: ', round(torch.cuda.memory_reserved(0)/1024**3, 1), 'GB')
else:
    print("GPU is not available.")

This block checks whether a GPU (Graphics Processing Unit) is available for computation and prints relevant information if one is found.

Let's go through the code step by step:

  1. if torch.cuda.is_available():: This line checks if PyTorch detects a CUDA-enabled GPU. CUDA is a parallel computing platform and application programming interface (API) model created by NVIDIA for utilizing the power of GPUs for general-purpose processing, including deep learning tasks.
  2. print("GPU is available."): If a GPU is detected, this line prints a message indicating that a GPU is available for use.
  3. print('Using GPU: ', torch.cuda.get_device_name(0)): This line prints the name of the GPU being used. torch.cuda.get_device_name(0) retrieves the name of the GPU device at index 0. If there are multiple GPUs, they are indexed accordingly (0, 1, 2, etc.); a short sketch after this list shows how to enumerate them.
  4. print('Memory Usage: '): This line simply prints a heading to indicate that the following information pertains to memory usage.
  5. print('Allocated: ', round(torch.cuda.memory_allocated(0)/1024**3,1), 'GB'): Here, we print the amount of memory currently allocated on the GPU in gigabytes (GB). torch.cuda.memory_allocated(0) returns the amount of memory allocated on the GPU device at index 0. The value is divided by 1024**3 to convert bytes to gigabytes and rounded to one decimal place for readability.
  6. print('Reserved: ', round(torch.cuda.memory_reserved(0)/1024**3, 1), 'GB'): This line prints the amount of memory currently reserved (cached) by PyTorch's allocator on the GPU, in gigabytes. torch.cuda.memory_reserved(0) is the current name for the older torch.cuda.memory_cached(0). As with the allocated memory, the value is converted to gigabytes and rounded for display.
  7. else:: If no GPU is available, this block of code executes.
  8. print("GPU is not available."): This line prints a message indicating that no GPU is available for use.


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")        

This code snippet sets the device for computation to a CUDA-enabled GPU, if one is available; otherwise, it defaults to the CPU. Let's break down the code:

device = torch.device("cuda" if torch.cuda.is_available() else "cpu"): This line creates a device variable that represents the target device for tensor computations. Here's how it works:

  • torch.cuda.is_available() checks if a CUDA-enabled GPU is available. If it is, the condition evaluates to True.
  • If a GPU is available (True), "cuda" is chosen as the device.
  • If no GPU is available (False), "cpu" is chosen as the device.

The ternary operator ("cuda" if torch.cuda.is_available() else "cpu") is used here to conditionally select the device based on GPU availability. If a GPU is present, computations will be performed on the GPU. Otherwise, computations will fall back to the CPU.
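
As a quick illustration (not part of the original notebook) of how this device variable is used, tensors and models are moved to the selected device with .to(device):

# Minimal sketch: the device variable decides where tensors are placed.
x = torch.zeros(2, 3)   # created on the CPU by default
x = x.to(device)        # moved to the GPU if one was detected, otherwise stays on the CPU
print(x.device)         # e.g. "cuda:0" or "cpu"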


torch.random.manual_seed(0)

model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3-mini-128k-instruct", 
    device_map="cuda", 
    torch_dtype="auto", 
    trust_remote_code=True, 
)
tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-128k-instruct")        

This code snippet demonstrates several key steps in working with a pre-trained language model ("microsoft/Phi-3-mini-128k-instruct") using PyTorch and the Hugging Face Transformers library. Let's break down each part:

1. `torch.random.manual_seed(0)`: This line sets the manual seed for PyTorch's random number generator to 0. Setting a seed ensures that random operations performed by PyTorch (such as weight initialization or data shuffling) will be reproducible across runs. By using the same seed, you can obtain consistent results during model training or evaluation.

2. `model = AutoModelForCausalLM.from_pretrained(...)`: Here, we instantiate a pre-trained language model for causal language modeling (LM) using the AutoModelForCausalLM.from_pretrained method. Let's break down the arguments:

  • "microsoft/Phi-3-mini-128k-instruct": This specifies the pre-trained model's identifier or name. In this case, it refers to the Phi-3-mini model with 128k parameters, designed for instructional purposes.
  • device_map="cuda": This parameter specifies that the model should be loaded onto a CUDA-enabled GPU for accelerated computation. The "cuda" value indicates GPU acceleration.
  • torch_dtype="auto": This parameter specifies the tensor data type to use. Setting it to "auto" allows PyTorch to automatically select the appropriate data type based on the device (GPU or CPU) being used.
  • trust_remote_code=True: This parameter allows custom modeling code hosted in the model repository to be downloaded and executed. Because this runs code from an external source, enable it only for repositories you trust, as is done here for the official Microsoft repository.

3. `tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-128k-instruct")`: This line initializes a tokenizer for the same pre-trained model. The AutoTokenizer.from_pretrained method loads the tokenizer associated with the specified model. Tokenizers are used to preprocess text inputs into tokens that can be fed into the language model for processing.
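
Instruction-tuned checkpoints such as Phi-3 ship a chat template with their tokenizer, and recent versions of the text-generation pipeline apply it automatically when given a list of role/content messages. If you want to inspect the formatted prompt yourself, a minimal sketch (assuming the tokenizer loaded above) looks like this:

# Minimal sketch: inspect the prompt string produced by the tokenizer's chat template.
demo_messages = [
    {"role": "user", "content": "Give me a one-line fact about bananas."},
]
prompt = tokenizer.apply_chat_template(
    demo_messages,
    tokenize=False,              # return the formatted string instead of token IDs
    add_generation_prompt=True,  # append the assistant marker so the model knows to reply
)
print(prompt)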


def get_response(question, model, tokenizer):
    messages = [
        {"role": "system", "content": "You are a helpful digital assistant. Please provide safe, ethical and accurate information to the user."},
        {"role": "user", "content": "Can you provide ways to eat combinations of bananas and dragonfruits?"},
        {"role": "assistant", "content": "Sure! Here are some ways to eat bananas and dragonfruits together: 1. Banana and dragonfruit smoothie: Blend bananas and dragonfruits together with some milk and honey. 2. Banana and dragonfruit salad: Mix sliced bananas and dragonfruits together with some lemon juice and honey."},
        {"role": "user", "content": question},
    ]

    pipe = pipeline(
        "text-generation",
        model=model,
        tokenizer=tokenizer,
    )

    generation_args = {
        "max_new_tokens": 4096,
        "return_full_text": False,
        "temperature": 0.0,
        "do_sample": False,
    }

    output = pipe(messages, **generation_args)
    output = output[0]['generated_text']
    return output        

This function, get_response, is designed to generate responses from a pre-trained language model given a user's question. Let's break down how the function works:

1. `def get_response(question, model, tokenizer):`: This line defines the function get_response, which takes three parameters: question (the user's question), model (a pre-trained language model), and tokenizer (the tokenizer associated with the model).

2. `messages = [...]`: This block initializes a list of messages representing a conversation context. Each message is a dictionary with two keys:

- "role": Indicates the role of the message ("system", "user", or "assistant").

- "content": Contains the actual text content of the message.

The messages include a system message indicating guidelines for the digital assistant's behavior, a user message asking about combinations of bananas and dragonfruits, an assistant message providing information about eating bananas and dragonfruits together, and finally, a user message containing the actual question passed to the function.

3. `pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)`: This line creates a text generation pipeline using the Hugging Face Transformers library. The pipeline is configured for text generation tasks and uses the specified model and tokenizer. This pipeline will be used to generate the response to the user's question.

4. `generation_args = {...}`: Here, a dictionary generation_args is defined to specify additional parameters for text generation. These parameters include:

- "max_new_tokens": Limits the maximum number of tokens (words) the generated text can have.

- "return_full_text": Controls whether the full generated text or only the new portion is returned.

- "temperature": Controls the randomness of the generated text. A temperature of 0.0 indicates deterministic (non-random) generation.

- "do_sample": Determines whether sampling is used during generation. Setting it to False means no sampling (deterministic generation).

5. `output = pipe(messages, **generation_args)`: This line generates text based on the provided messages and generation arguments. The `**` unpacks the generation_args dictionary into keyword arguments when the pipe pipeline is invoked, producing an output.

6. `output = output[0]['generated_text']`: Finally, the generated text is extracted from the output. The generated text is typically the response to the user's question.

7. `return output`: The function returns the generated text as the output response.
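
If you prefer more varied, non-deterministic answers, the same pipeline accepts sampling parameters instead. The values below are an illustrative sketch, not tuned settings from the original notebook:

# Illustrative alternative: enable sampling for more varied responses.
sampling_args = {
    "max_new_tokens": 512,
    "return_full_text": False,
    "do_sample": True,     # sample from the distribution instead of greedy decoding
    "temperature": 0.7,    # higher values produce more random text
    "top_p": 0.9,          # nucleus sampling: keep the smallest token set covering 90% of probability
}
# output = pipe(messages, **sampling_args)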


get_response("when I was 6 my sister was half my age. Now I'm 70 how old is my sister?", model, tokenizer)        

This call invokes the get_response function with a small age riddle: when the speaker was 6, the sister was half that age (3), so she is three years younger, which makes the correct answer 67.

The Output -


Congratulations on reaching the end of this breakdown! If you've followed along, you've gained insights into how to leverage a pre-trained language model for generating responses in a conversational context. By understanding the steps involved, from setting up the model and tokenizer to crafting a function for generating responses based on user input, you've learned a valuable skill in NLP and conversational AI development.

This approach not only allows you to interactively engage with users but also showcases the power of leveraging pre-trained models and modern NLP techniques. Whether you're building chatbots, virtual assistants, or exploring creative text generation tasks, mastering these fundamentals opens up a world of possibilities in natural language understanding and generation.

Keep exploring, experimenting, and honing your skills with LLMs. The more you delve into these technologies, the more you'll discover innovative ways to enhance user experiences and solve real-world challenges through intelligent conversational interfaces. Happy coding!

