How I Created an AI Version of Myself
Keith McNulty
Leader in Technology, Science and Analytics | Mathematician, Statistician and Psychometrician | Author and Teacher | Coder, Engineer, Architect
Generative AI could best be described as a frustrating breakthrough. When ChatGPT was first released in late 2022, there was open-mouthed, wide-eyed amazement at the quality of the natural language it produced. Since then, numerous updates to that product have come on the scene, as well as a plethora of competitor products.
But the initial excitement has given way to numerous frustrations with the technology. Controlling the output of these models is challenging: hallucinations mean that what they generate is not always reliable, even when the natural language is persuasive. Training cut-off dates often mean content is out of date. Models usually lack the background context for a specific request, so their responses are often off the mark or too generic to be useful, especially in organizational or business settings.
Retrieval Augmented Generation (RAG) is a way of using large language models in a substantially more controlled way. Without expensive fine-tuning, and using a fairly simple workflow, a model can be fed relevant contextual information and restricted to respond only on the basis of that information, or at least to prioritize it. In this way the true value of the large language model is released: it acts as an automated natural language summarizer of content, and undesirable behaviors such as hallucination can be minimized or possibly even eliminated.
To illustrate this, I have constructed a technical tutorial in which I create a minimal RAG application that answers questions related to my statistics textbook Handbook of Regression Modeling. By the end of the tutorial I will have a simple pipeline that allows you to ask a question of an LLM and receive an answer based only on what is in my textbook. So it’s kind of like an AI version of myself, using only my statistics knowledge to answer your questions.
This tutorial is completely replicable, as all of the material I use, including my textbook, is available open source. I even use an open LLM: Google’s Gemma-7b, which is built from the same research and technology as its Gemini models. However, a word of warning: I am only able to run Gemma locally because I am using an extremely high-spec machine (a MacBook with an M3 Max chip and 128GB of RAM). If you do not have a similarly high-spec machine, you will need to use cloud resources or change the pipeline to call an API for a hosted LLM like ChatGPT.
An overview of Retrieval Augmented Generation (RAG)
In a typical LLM interaction, a user will send a prompt directly to an LLM and receive a response purely based on the LLM’s training set. For general tasks, this can be useful. But when the prompt requires specialist knowledge or context, the response is usually unsatisfactory.
The idea behind RAG is that the prompt first pays a visit to a specialized knowledge database, picks up a few relevant documents, and takes them along when it is sent to the LLM. The prompt can be constructed to restrict the LLM to answering only on the basis of the accompanying documents, ensuring a higher-quality, contextual response.
The architecture in the diagram above can be considered to have two components: an information retrieval (IR) component, in which a vector database returns the stored documents most relevant to the prompt, and an LLM component, in which the prompt and the retrieved documents are combined and sent to the large language model to generate a response.
In this tutorial I will construct a minimal example of this architecture using Python. You can find the full Jupyter notebook here.
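In rough pseudocode terms, the flow looks like this (vector_db and llm here are just illustrative placeholders, not the actual objects I build later):
def rag_answer(question: str) -> str:
    # 1. Retrieval: find the stored documents closest to the question
    context_docs = vector_db.query(question, n_results=3)
    # 2. Generation: ask the LLM, restricted to the retrieved context
    prompt = f"Using only this information: {context_docs}\n\nAnswer this question: {question}"
    return llm.generate(prompt)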
Retrieving and preparing my documents
The document store will contain content from my textbook, which exists in open-source form in this GitHub repo. The book is structured into 14 chapters and sections containing text, code and mathematical formulas, each chapter generated from an R Markdown (Rmd) document.
First I will import the packages that I need.
import torch
from langchain_community.document_loaders import DataFrameLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from transformers import AutoTokenizer, AutoModelForCausalLM
from dotenv import load_dotenv
from huggingface_hub import login
import os
import pandas as pd
Now I will pull down the text of all 14 chapters of my textbook and store them in a Pandas dataframe:
import requests
# chapters are Rmd files with the following names
chapter_list = [
    "01-intro",
    "02-basic_r",
    "03-primer_stats",
    "04-linear_regression",
    "05-binomial_logistic_regression",
    "06-multinomial_regression",
    "07-ordinal_regression",
    "08-hierarchical_data",
    "09-survival_analysis",
    "10-tidy_modeling",
    "11-power_tests",
    "12-further",
    "13-solutions",
    "14-bibliography"
]
# create a function to obtain the text of each chapter
def get_text(chapter: str) -> str:
    # URL on the Github repo where the Rmd files are stored
    github_url = f"https://raw.githubusercontent.com/keithmcnulty/peopleanalytics-regression-book/master/r/{chapter}.Rmd"
    result = requests.get(github_url)
    return result.text
# iterate over the chapter URLs and pull down the text content
book_text = []
for chapter in chapter_list:
    chapter_text = get_text(chapter)
    book_text.append(chapter_text)
# write to a dataframe
book_data = dict(chapter = list(range(14)), text = book_text)
book_data = pd.DataFrame.from_dict(book_data)
Now, at this point I have 14 documents, each of which is quite long. Thinking ahead, any documents that I send to my LLM will need to fit inside its context window, which is the maximum number of tokens (roughly, words or word fragments) that the LLM can process at once. At their current length I can't guarantee this, so I am going to need to split these documents into a larger number of shorter documents.
However, I can’t just split them at arbitrary points; I need to split them semantically, so that no individual document is cut off mid-thought and each one still makes sense on its own. I can use a neat function in the langchain Python package to do this. I’m going to limit my documents to 1,000 characters and allow up to a 150-character overlap between them.
# semantically split chapters to a max length of 1000 characters
loader = DataFrameLoader(book_data, page_content_column="text")
data = loader.load()
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=150)
docs = text_splitter.split_documents(data)
# how many docs?
len(docs)
## 578
My 14 original documents have been split into 578 smaller documents. Let’s take a look at the first document so we can see what langchain‘s document format looks like:
# examine a document to ensure it looks as we expect
docs[0]
## Document(page_content="`r if (knitr::is_latex_output()) '\\\\mainmatter'`\n\n# The Importance of Regression in People Analytics {#inf-model}\n\nIn the 19th century,
## when Francis Galton first used the term 'regression' to describe a statistical phenomenon (see Chapter \\@ref(linear-reg-ols)), little did he know how important that
## term would be today. Many of the most powerful tools of statistical inference that we now have at our disposal can be traced back to the types of early analysis that
## Galton and his contemporaries were engaged in. The sheer number of different regression-related methodologies and variants that are available to researchers and practitioners
## today is mind-boggling, and there are still rich veins of ongoing research that are focused on defining and refining new forms of regression to tackle new problems.",
## metadata={'chapter': 0})
We can see that the document object contains a page_content key, which has our text of interest inside it, and a metadata key, which is not of interest to us.
Setting up the vector database to contain my documents
Now that I have my documents at an appropriate length, I need to load them into a vector database. A vector database stores text both in its original form and as embeddings: large arrays of floating point numbers that are fundamental to how large language models process language. Words, sentences or documents whose embeddings are ‘close’ in multidimensional space are closely related in content. See the diagram above for a 2D graphical simplification of the concept of embeddings.
When I submit a prompt, the vector database will take the embedding of the prompt and find the closest embeddings among the documents it contains. There are numerous options for how to define ‘closest’. In this case I will use cosine similarity, which uses the cosine of the angle between two embeddings as the measure of distance: the higher the cosine similarity, the more closely related two documents are.
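As a quick illustration, separate from the pipeline itself and assuming you have the sentence-transformers and numpy packages installed, here is how cosine similarity compares two sentence embeddings, using the same all-MiniLM-L6-v2 model that my vector database will use below:
from sentence_transformers import SentenceTransformer
import numpy as np
# encode two related phrases with the same sentence embedding model used later
embedder = SentenceTransformer("all-MiniLM-L6-v2")
a, b = embedder.encode(["ordinal regression", "ordered category outcomes"])
# cosine similarity = dot product divided by the product of the vector norms
cosine_sim = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(cosine_sim)  # values closer to 1 indicate more closely related content
In practice Chroma computes these similarities for me under the hood, so I never have to do this manually.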
To make this easy I will use the chromadb Python package, which allows me to set up a vector database on my local machine. First I need to set up the database and define the embedding model it will use, as well as the distance metric for comparing embeddings. I will pick a standard, efficient sentence-embedding model for this purpose.
import chromadb
from chromadb.utils import embedding_functions
from chromadb.utils.batch_utils import create_batches
import uuid
# set up the ChromaDB
CHROMA_DATA_PATH = "./chroma_data_regression_book/"
EMBED_MODEL = "all-MiniLM-L6-v2"
COLLECTION_NAME = "regression_book_docs"
client = chromadb.PersistentClient(path=CHROMA_DATA_PATH)
# embedding function based on the chosen sentence transformer model
embedding_func = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name=EMBED_MODEL
)
# create the collection, using cosine similarity as the distance metric
collection = client.create_collection(
    name=COLLECTION_NAME,
    embedding_function=embedding_func,
    metadata={"hnsw:space": "cosine"},
)
Now I am ready to load my documents. Note that vector databases limit how many documents can be loaded in any one operation. In my case I only have a few hundred documents, so I should be fine, but I am going to set up a batch load anyway, just to be on the safe side.
# write text chunks to DB in batches
batches = create_batches(
    api=client,
    ids=[f"{uuid.uuid4()}" for i in range(len(docs))],
    documents=[doc.page_content for doc in docs],
    metadatas=[{'source': './handbook_of_regression_modeling', 'row': k} for k in range(len(docs))]
)
# each batch is a tuple of (ids, embeddings, metadatas, documents)
for batch in batches:
    print(f"Adding batch of size {len(batch[0])}")
    collection.add(ids=batch[0],
                   documents=batch[3],
                   metadatas=batch[2])
## Adding batch of size 578
Only one batch was needed here, but if you try this with a longer document set, the code above will load it in multiple batches.
Now my document store is set up. I can test it to see if it returns documents that have some similarity to my query. I’ll ask a statistics question and request the three closest matching docs.
results = collection.query(
    query_texts=["Which method would you recommend for ordered category outcomes?"],
    n_results=3,
    include=['documents']
)
results
# {'ids': [['371eeda2-c01f-420b-80c6-2b61894d0069',
# '94fc3891-4b37-49b9-a797-e73a11eec739',
# '3360431f-6857-47ac-a329-af10b6528c61']],
# 'distances': None,
# 'metadatas': None,
# 'embeddings': None,
# 'documents': [['# Proportional Odds Logistic Regression for Ordered Category Outcomes {#ord-reg}',
# "# Multinomial Logistic Regression for Nominal Category Outcomes\n\n`r if (knitr::is_latex_output()) '\\\\index{multinomial logistic regression|(}'`\nIn the previous chapter we looked at how to model a binary or dichotomous outcome using a logistic function. In this chapter we look at how to extend this to the case when the outcome has a number of categories that do not have any order to them. When an outcome has this nominal categorical form, it does not have a sense of direction. There is no 'better' or 'worse'‍, no 'higher' or 'lower'‍, there is only 'different'‍.\n\n## When to use it\n\n### Intuition for multinomial logistic regression \n\nA binary or dichotomous outcome like we studied in the previous chapter is already in fact a nominal outcome with two categories, so in principle we already have the basic technology with which to study this problem. That said, the way we approach the problem can differ according to the types of inferences we wish to make.",
# 'In fact, there are numerous known ways to approach the inferential modeling of ordinal outcomes, all of which build on the theory of linear, binomial and multinomial regression which we covered in previous chapters. In this chapter, we will focus on the most commonly adopted approach: *proportional odds* logistic regression. Proportional odds models (sometimes known as constrained cumulative logistic models) are more attractive than other approaches because of their ease of interpretation but cannot be used blindly without important checking of underlying assumptions. \n\n## When to use it\n\n### Intuition for proportional odds logistic regression {#ord-intuit}']],
# 'uris': None,
# 'data': None}
Looks pretty good to me! Now I have completed the information retrieval layer. Time to move on to the LLM layer.
Setting up Google’s Gemma-7b-it LLM on my machine
Gemma-7b-it is the 7-billion-parameter, instruction-tuned version of Gemma, an open model that Google built from the same research and technology as its Gemini models. It is about 20GB in size and is available on Hugging Face. You will need to agree to the terms of use and obtain an access token to download and use the model. For a model of this size, you’ll need some pretty impressive CPU, GPU and RAM to run it.
First I will access the model and download it. This might take a while the first time you do it, but it will load quickly once downloaded and in your cache.
# log in to Huggingface using my token
load_dotenv()
login(token=os.getenv("HF_TOKEN"))
# Download Gemma-7b-it
model = AutoModelForCausalLM.from_pretrained("google/gemma-7b-it")
tokenizer = AutoTokenizer.from_pretrained("google/gemma-7b-it", padding=True, truncation=True, max_length=512)
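If your machine has less memory than mine, one option, sketched below and assuming a recent transformers release plus the accelerate package, is to load the model in half precision and let transformers place it on the best available device:
# optional: load in bfloat16 to roughly halve the memory footprint;
# device_map="auto" places the weights on the best available device (GPU, Apple MPS or CPU)
model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-7b-it",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
If you do this, remember to move the tokenized inputs to the same device before generating, for example input_ids = input_ids.to(model.device).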
Let’s test it out. This model likes to receive prompts in the form of a conversation between model and user. Here is a recommended prompt format:
prompt = """
<start_of_turn>user
What food should I try in New Mexico?<end_of_turn>
<start_of_turn>model
"""
Now I am going to tokenize this prompt into input IDs that Gemma understands, send them to Gemma to generate a response, decode that response back into text, and slice off the new part that Gemma generated for me.
# tokenize the prompt
input_ids = tokenizer(prompt, return_tensors="pt")
# generate the answer
outputs = model.generate(**input_ids, max_new_tokens=512)
# decode the answer
tokenizer.decode(outputs[0], skip_special_tokens=True).split('model\n', 1)[1]
# "New Mexico is known for its unique cuisine, blending Native American,
# Spanish, and Mexican influences. Here are some must-try foods in the
# Land of Enchantment:\n\n**Traditional Native American Foods:**\n\n*
# **Indian tacos:** Corn tortillas filled with meat (often mutton or beef),
# beans, cheese, lettuce, tomato, and red chili powder.\n*
# **Chiles en nogada:** Layers of red and green chiles, potatoes,
# and vegetables in a savory sauce.\n* **Sopaipillas:** Puffy fried
# dough balls often served with honey or jam.\n*
# **Posole:** Stewed pork in a flavorful broth, served with corn
# tortillas and red chili powder.\n\n**Spanish and Mexican Influences:**\n\n*
# **Hatch chiles:** Green and red chiles grown in Hatch, New Mexico,
# known for their unique flavor and heat.\n* **Red and green chile stew:**
# A hearty stew made with red and green chiles, vegetables, and meat.\n*
# **Carne adovada:** Slow-roasted beef marinated in red chile powder.\n*
# **Biscochitos:** Crispy fried dough cookies dusted with cinnamon and sugar.\n*
# **Sopapillas:** A sweet or savory treat made with a fried dough pastry filled
# with fruit, honey, or cheese.\n\n**Other Must-Try Foods:**\n\n*
# **Green chile cheeseburger:** A twist on the classic cheeseburger,
# with green chile instead of red.\n* **Blue corn:** A unique variety of
# corn with a deep blue color, often used in tortillas and other dishes.\n*
# **Turquoise ice cream:** A refreshing ice cream made with turquoise-colored
# ice cream powder.\n\n**Additional Tips:**\n\n*
# **Consider your spice tolerance:** New Mexican cuisine is known for its
# spiciness. If you're not used to eating spicy food, be sure to ask for
# mild versions.\n* **Try a local restaurant:** There are many great
# local restaurants in New Mexico that offer traditional cuisine.\n*
# **Visit a food festival:** New Mexico has a number of food festivals
# throughout the year.\n* **Be sure to try the local beer:**
# New Mexico has a thriving craft beer scene.
Oh man, that sounds delicious! We can see that Gemma generated a good quality response here. So we are ready for my final step, which is to set up the entire RAG pipeline, allowing the IR component to flow into the LLM component.
Setting up the RAG pipeline
In my basic pipeline here, I will define a function that takes a question and sends it to my vector database to retrieve some documents that closely match the question. Then it will combine the original question and the retrieved documents into a Gemma-friendly prompt, with instructions to stick to the documents in answering the question. Finally, it will encode the new prompt, use Gemma to generate a response, and decode the response.
def ask_question(question: str, model: AutoModelForCausalLM = model, tokenizer: AutoTokenizer = tokenizer, collection: str = COLLECTION_NAME, n_docs: int = 3) -> str:
    # find the closest matching documents in chromadb
    collection = client.get_collection(collection)
    results = collection.query(
        query_texts=[question],
        n_results=n_docs
    )
    # collect the retrieved documents into a single context string
    context = "\n".join([r for r in results['documents'][0]])
    # construct a Gemma-friendly prompt restricted to the retrieved context
    prompt = f"""
    <start_of_turn>user
    You are an expert on statistics and its applications to People Analytics.
    Here is a question: {question}\n\nAnswer it with reference to the following information and only using the following information: {context}.<end_of_turn>
    <start_of_turn>model
    """
    # tokenize the prompt
    input_ids = tokenizer(prompt, return_tensors="pt")
    # generate and decode the answer, returning only the newly generated part
    outputs = model.generate(**input_ids, max_new_tokens=512)
    return tokenizer.decode(outputs[0], skip_special_tokens=True).split('model\n', 1)[1]
Testing this AI version of me
Now let’s ask a few questions to see if it responds based on the knowledge I wrote in my textbook.
ask_question("What method would you recommend I use to model ordered category outcomes and why?")
## Sure, here is the answer to the question:\n\nThe text suggests that the
## recommended method for modeling ordered category outcomes is
## proportional odds logistic regression. This is because proportional odds
## models are easy to interpret and are commonly adopted in the field.
## However, it is important to note that proportional odds models
## have some underlying assumptions that should be checked
## before using them.\n\nTherefore, the recommended method for modeling ordered
## category outcomes is proportional odds logistic regression, but it is
## important to check the underlying assumptions of this model before using it.
That sounds like an answer I would give. Let’s try another:
ask_question('What should I look out for when using Proportional Odds regression?')
## When using Proportional Odds regression, you should look out for the following:\n\n
## * **Test of Proportional Odds Assumption:** Before running the model, you
## should test the proportional odds assumption to see if the model is
## appropriate for your data. If the assumption fails, you should
## consider alternative models for ordinal outcomes.\n
## * **Variable Removal:** If the assumption fails, you may consider
## removing variables that do not impact the outcome. However,
## whether or not you are comfortable doing this will depend very
## much on the impact on overall model fit.\n
## * **Alternative Models:** If you are not comfortable removing variables,
## you should consider alternative models for ordinal outcomes. The most
## common alternatives include models such as cumulative logit models,
## rank-based models, and threshold-based models.'
Again, sounds like something I would say. And what if I ask it about something not covered in the textbook?
ask_question('What is the standard model of Physics?')
## The text does not provide information about the standard model of Physics, \
## therefore I cannot answer this question.'
Nice! It’s staying in its lane, just like I would.
So what have we learned?
Not only has creating AI Keith been fun, it also illustrates a simple architecture that harnesses the power of LLMs while enhancing them with contextual information and controlling their tendency to hallucinate. This has significant applications in knowledge search, where this architecture offers a route to automated summaries of searches over large knowledge repositories.
As for AI Keith, I won’t be launching him any time soon. For a start, this example is too small in scale, and hosting an architecture like this in production, always available, would likely be too expensive to justify its use. But I hope you can see the possibilities for larger-scale situations.
What did you think of AI Keith? Have you played around with RAG architectures? Feel free to comment!