Microsoft Phi3 Chat Completion Cookbook
Ayush Thakur
Founder @ Reconfigure.in | Gen AI, LLM and Machine Learning | 25+ Research Publications | Patents & 10+ Copyrights Holder | IEEE & Scopus Author | Engineering & Technology Lead
Welcome to the Phi-3 chat completion cookbook notebook! This article works through a Python notebook built around the Microsoft model "microsoft/Phi-3-mini-128k-instruct" and serves as a comprehensive guide for users looking to explore and run chat completion tasks. Whether you're a beginner or an experienced practitioner, the notebook provides a user-friendly way to run experiments, analyze outputs, and refine your chat models.
Let's start with what we're doing here -
!pip install -q torch langchain bitsandbytes accelerate transformers sentence-transformers faiss-gpu
Installing some essential Python packages for our chat completion tasks. This line installs several Python packages that are needed for working with language models and chat completions. Here's a brief overview of each package:
- torch: PyTorch, the deep learning framework the model runs on.
- langchain: a toolkit for building LLM-powered applications.
- bitsandbytes: enables 8-bit/4-bit quantization so large models fit in less GPU memory.
- accelerate: handles device placement and efficient loading of large models.
- transformers: the Hugging Face library that provides the Phi-3 model, tokenizer, and pipeline classes.
- sentence-transformers: embedding models, useful for semantic search and retrieval.
- faiss-gpu: GPU-accelerated similarity search over embeddings.
Only torch, transformers, and accelerate are exercised directly in this walkthrough; the remaining packages support related retrieval experiments.
import numpy as np
import pandas as pd
from tqdm.auto import tqdm
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
Next, we import several essential libraries and modules for working with natural language processing tasks using PyTorch and the Hugging Face Transformers library. Let's break down each import statement:
Importing Libraries:
- numpy (np) and pandas (pd): general-purpose numerical and tabular data handling.
- tqdm.auto.tqdm: progress bars for long-running loops.
- torch: the PyTorch framework used for tensor computation and GPU management.
- AutoModelForCausalLM, AutoTokenizer, and pipeline: Transformers classes for loading a pre-trained causal language model, its tokenizer, and a ready-made text-generation pipeline.
Language Model Setup:
# Checking if GPU is available
if torch.cuda.is_available():
    print("GPU is available.")
    print('Using GPU: ', torch.cuda.get_device_name(0))
    print('Memory Usage: ')
    print('Allocated: ', round(torch.cuda.memory_allocated(0)/1024**3, 1), 'GB')
    # memory_cached() is deprecated; memory_reserved() reports the same quantity
    print('Reserved: ', round(torch.cuda.memory_reserved(0)/1024**3, 1), 'GB')
This snippet checks whether a GPU (Graphics Processing Unit) is available for computation and, if one is found, prints its name along with the allocated and reserved GPU memory.
Let's go through the code step by step:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
This code snippet sets the device for computation to a CUDA-enabled GPU, if one is available; otherwise, it defaults to the CPU. Let's break down the code:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu"): This line creates a device variable that represents the target device for tensor computations. Here's how it works:
The ternary operator ("cuda" if torch.cuda.is_available() else "cpu") is used here to conditionally select the device based on GPU availability. If a GPU is present, computations will be performed on the GPU. Otherwise, computations will fall back to the CPU.
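As a quick illustrative aside (a minimal sketch, not part of the original notebook), this is how the device variable is typically used to move tensors or model layers onto the selected hardware:

import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Move a tensor to the selected device; subsequent operations on it run there.
x = torch.randn(2, 3).to(device)

# Modules can be moved the same way (a tiny linear layer as a stand-in here).
layer = torch.nn.Linear(3, 1).to(device)
print(layer(x).device)  # "cuda:0" on a GPU machine, "cpu" otherwise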
torch.random.manual_seed(0)

model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3-mini-128k-instruct",
    device_map="cuda",
    torch_dtype="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-128k-instruct")
This code snippet demonstrates several key steps in working with a pre-trained language model ("microsoft/Phi-3-mini-128k-instruct") using PyTorch and the Hugging Face Transformers library. Let's break down each part:
1. `torch.random.manual_seed(0)`: This line sets the manual seed for PyTorch's random number generator to 0. Setting a seed ensures that random operations performed by PyTorch (such as weight initialization or data shuffling) will be reproducible across runs. By using the same seed, you can obtain consistent results during model training or evaluation.
2. `model = AutoModelForCausalLM.from_pretrained(...)`: Here, we instantiate a pre-trained language model for causal language modeling (LM) using the AutoModelForCausalLM.from_pretrained method. Let's break down the arguments:
- "microsoft/Phi-3-mini-128k-instruct": the Hugging Face Hub identifier of the checkpoint to load.
- device_map="cuda": places the model weights on the GPU.
- torch_dtype="auto": lets Transformers pick the precision (e.g., float16/bfloat16) recorded in the checkpoint's configuration.
- trust_remote_code=True: allows custom modeling code shipped with the checkpoint to run, which some checkpoints (including Phi-3 on older Transformers releases) require.
3. `tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-128k-instruct")`: This line initializes a tokenizer for the same pre-trained model. The AutoTokenizer.from_pretrained method loads the tokenizer associated with the specified model. Tokenizers are used to preprocess text inputs into tokens that can be fed into the language model for processing.
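As a quick sanity check (an illustrative sketch, not part of the original notebook), you can confirm the tokenizer works by encoding a short string into token IDs and decoding it back:

# Encode a sample sentence into token IDs and decode it back to text.
sample = "Bananas and dragonfruits make a great smoothie."
token_ids = tokenizer(sample)["input_ids"]
print(token_ids[:10])  # the first few token IDs
print(tokenizer.decode(token_ids, skip_special_tokens=True))  # recovers the original text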
def get_response(question, model, tokenizer):
    messages = [
        {"role": "system", "content": "You are a helpful digital assistant. Please provide safe, ethical and accurate information to the user."},
        {"role": "user", "content": "Can you provide ways to eat combinations of bananas and dragonfruits?"},
        {"role": "assistant", "content": "Sure! Here are some ways to eat bananas and dragonfruits together: 1. Banana and dragonfruit smoothie: Blend bananas and dragonfruits together with some milk and honey. 2. Banana and dragonfruit salad: Mix sliced bananas and dragonfruits together with some lemon juice and honey."},
        {"role": "user", "content": question},
    ]

    pipe = pipeline(
        "text-generation",
        model=model,
        tokenizer=tokenizer,
    )

    generation_args = {
        "max_new_tokens": 4096,
        "return_full_text": False,
        "temperature": 0.0,
        "do_sample": False,
    }

    output = pipe(messages, **generation_args)
    output = output[0]['generated_text']
    return output
This function, get_response, is designed to generate responses from a pre-trained language model given a user's question. Let's break down how the function works:
1. `def get_response(question, model, tokenizer):`: This line defines the function get_response, which takes three parameters: question (the user's question), model (a pre-trained language model), and tokenizer (the tokenizer associated with the model).
2. `messages = [...]`: This block initializes a list of messages representing a conversation context. Each message is a dictionary with two keys:
- "role": Indicates the role of the message ("system", "user", or "assistant").
- "content": Contains the actual text content of the message.
The messages include a system message setting guidelines for the digital assistant's behavior, a user message asking about combinations of bananas and dragonfruits, an assistant message providing information about eating bananas and dragonfruits together, and finally, a user message containing the actual question passed to the function (see the chat-template sketch after this breakdown for how these messages are rendered into a single prompt).
3. `pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)`: This line creates a text generation pipeline using the Hugging Face Transformers library. The pipeline is configured for text generation tasks and uses the specified model and tokenizer. This pipeline will be used to generate the response to the user's question.
4. `generation_args = {...}`: Here, a dictionary generation_args is defined to specify additional parameters for text generation. These parameters include:
- "max_new_tokens": Limits the maximum number of tokens (words) the generated text can have.
- "return_full_text": Controls whether the full generated text or only the new portion is returned.
- "temperature": Controls the randomness of the generated text. A temperature of 0.0 indicates deterministic (non-random) generation.
- "do_sample": Determines whether sampling is used during generation. Setting it to False means no sampling (deterministic generation).
5. `output = pipe(messages, **generation_args)`: This line generates text based on the provided messages and generation arguments. The pipe pipeline is invoked with the messages and the unpacked generation arguments, producing an output.
6. `output = output[0]['generated_text']`: Finally, the generated text is extracted from the output. The generated text is typically the response to the user's question.
7. `return output`: The function returns the generated text as the output response.
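As a hedged aside (not shown in the original notebook), you can preview how such a list of messages is rendered into a single prompt string by applying the tokenizer's built-in chat template; the messages below are placeholders for illustration only:

# Render chat messages into the model's prompt format without generating anything.
preview_messages = [
    {"role": "system", "content": "You are a helpful digital assistant."},
    {"role": "user", "content": "How should I store ripe bananas?"},
]
prompt = tokenizer.apply_chat_template(
    preview_messages,
    tokenize=False,              # return the formatted string instead of token IDs
    add_generation_prompt=True,  # append the marker that cues the assistant's reply
)
print(prompt)

This is essentially the same formatting the text-generation pipeline applies internally when you pass it a list of role/content dictionaries, as get_response does.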
get_response("when I was 6 my sister was half my age. Now I'm 70 how old is my sister?", model, tokenizer)
Here, get_response is called with a small reasoning question. Since the sister was half the narrator's age at 6 (i.e., 3 years old), she is 3 years younger, so the expected answer when the narrator is 70 is 67.
The Output -
Congratulations on reaching the end of this breakdown! If you've followed along, you've gained insights into how to leverage a pre-trained language model for generating responses in a conversational context. By understanding the steps involved, from setting up the model and tokenizer to crafting a function for generating responses based on user input, you've learned a valuable skill in NLP and conversational AI development.
This approach not only allows you to interactively engage with users but also showcases the power of leveraging pre-trained models and modern NLP techniques. Whether you're building chatbots, virtual assistants, or exploring creative text generation tasks, mastering these fundamentals opens up a world of possibilities in natural language understanding and generation.
Keep exploring, experimenting, and honing your skills with LLMs. The more you delve into these technologies, the more you'll discover innovative ways to enhance user experiences and solve real-world challenges through intelligent conversational interfaces. Happy coding!