Community-Driven Building: How Open Source is Fueling Progress in the LLM Ecosystem, and a Short Guide to Dataset Generation & Transfer Learning
Fred Bliss
AI & LLM Advisory & Research - DM for inquiries (previously: founder @ Aptitive, acquired / exited, Applied AI & Innovation @ Vantage Discovery)
Welcome back! That’s a mouthful of a headline that I asked several different LLMs to improve upon** (ultimately tossing away dozens of variations and sticking with my own) - but they tend to generate clickbait-y titles, even when you explicitly ask them not to. There’s a lesson in there about data - and how the training data impacts the end result - which is actually the topic at hand.
Today we're going to get a little more technical (ok, maybe a lot more technical), but that's OK - whether you're technical or not, if little of it makes sense at first, I'll provide explanations along the way for why you'd do what you're doing. This isn't meant to be a tutorial, a hands-on guide, or a general overview - but something in-between all of those.
This article originally began as a conversation with a friend over the weekend. I mentioned what I'd built and gotten working, he asked me to back up and explain the last four acronyms I'd used, and I did. Soon enough, we had a conversation that opened both our eyes to new possibilities and ideas - mine being to continue raising awareness and educating those interested in what's happening and the momentum building in the space. I've included some code snippets to help folks get started (and 'unstuck') with some of the more complex, less documented components of working with local LLMs.
It's impossible to keep up with everything going on in AI - when you consider every new research paper that drops on a given day, and all the different subfields that fall under the super wide umbrella of 'AI' or even 'LLMs', nobody can go deep into everything. I'll do my best here to catch folks up on some of the most exciting things happening in the open source LLM space: the models themselves, retrieval (aka 'chat with your data / documents'), building synthetic datasets for the purpose of finetuning your own models, and preparing those datasets to finetune base foundational models to achieve your goals and tasks (aka 'transfer learning').
With all the community-created LLaMA finetuned variants, RedPajama, OpenLLaMA, and the many other open source models and variants being worked on, the research papers, techniques, and architectures that build on each other every day, and the developers, hobbyists, and builders creating tools, frameworks, and ideas - along with all that collaboration (the whole 'standing on the shoulders of giants' thing is very applicable in this space) - we're right in the middle of a Cambrian explosion in open source LLM development.
Let’s set the agenda for this article - what we’re going to cover here in detail, what we’re going to briefly talk through, and where we’ll link off to much better written in-depth technical guides for further exploration:
Data engineering, modeling, embedding pipelines, LlamaIndex, vector stores, and how they relate to existing design principles in modern data solutions.
Key Points:
Data engineering pipelines and how you model your data/embeddings are crucial in AI and LLM use cases, just as they are for data warehouses. In fact, the design principles aren’t so different from the ones that any data engineer (or ETL developer) is already familiar with. These pipelines are similar to data warehouses and dashboards in their design patterns, and they play a vital role in embedding pipelines and retrieval queries. At the end of the day, the quality of the data and how it’s structured is the key ingredient to the quality of your finetuned LLM model. We’re in a data-centric world of AI; not a model-centric one. Organizations with high quality data should take note of this - you have leverage and a strong competitive advantage, even if you have zero AI or ML capabilities to date.
In my end-to-end example, I took public company quarterly earnings reports and earnings transcripts, primarily PDF documents, and used them as data sources for my downstream pipelines. Why? Because my goal was to build a finetuned model that could interpret future earnings reports and transcripts, and see how it performed against state of the art such as GPT-4 and Claude (or maybe just curiosity).
The open-source LLM ecosystem consists of the models, developer tools, finetuning frameworks, retrieval systems, and more. These tools and frameworks enable developers to optimize, generate datasets, finetune, and achieve transfer learning with LLMs.
Key Points:
I’ve written about LlamaIndex in a prior article, and won’t rehash it here - except to say that the direction and roadmap seem to be tying the unstructured and structured retrieval scenarios together. There’s much more to this than just Text2SQL - but LlamaIndex abstracts away a large chunk of the complexity, while still allowing you to override its default choices via lower level APIs. That’s neat, and I highly encourage everyone to start with it before DIY’ing their own way of dealing with embeddings and indices.
The retrieval of data is an essential aspect of LLM use cases. To enable analysts and business users to leverage this technology effectively, a UI paradigm shift is required beyond just the prompt and chatbots. Providing citations and links to the source data is also crucial for trust and adoption.
Life on the bleeding edge - AutoGPTQ and ExLlama - quantization of LLMs that make running state of the art LLMs on consumer hardware (and even phones) a reality
4-bit quantization is a technique used to reduce the memory footprint and computational requirements of neural networks. By quantizing the model weights to 4 bits, you can achieve faster inference times and lower memory usage at the cost of some loss in model accuracy. The trade-off between performance and perplexity depends on the specific task and model architecture.
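To make that concrete, here's a toy sketch of round-to-nearest 4-bit quantization of a single group of weights. This is not the actual GPTQ algorithm (which quantizes column by column and uses second-order information to compensate for rounding error) - it's only meant to show the storage/accuracy trade-off in a few lines.
# toy 4-bit (round-to-nearest) quantization sketch - not GPTQ itself
import torch
def quantize_4bit(weights: torch.Tensor):
    # symmetric quantization: map floats onto 16 integer levels (-8..7)
    qmax = 7
    scale = weights.abs().max() / qmax
    q = torch.clamp(torch.round(weights / scale), -8, 7)
    return q.to(torch.int8), scale  # int8 is just a container for the 4-bit values here
def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.float() * scale
w = torch.randn(4096)                # one group of fp16/fp32 weights
q, scale = quantize_4bit(w)
w_hat = dequantize(q, scale)
print("mean abs error:", (w - w_hat).abs().mean().item())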
When working with local models on consumer hardware, whether you’ve got a brand new GPU or are running on a laptop, you need a way to load them without getting out-of-memory errors or waiting an hour for a sentence to get generated from your initial query.
That's where AutoGPTQ and ExLlama come in. Both of these reduce the memory footprint in extremely clever ways, minimizing the accuracy loss while significantly cutting down the resources required to run a given LLM. This puts 33B parameter models in the hands of folks running GPUs with 24GB of VRAM - that’s amazing.
AutoGPTQ is tied to the popular transformers library from Hugging Face, which makes working with it in other open source projects much easier than ExLlama (for now, anyway, who knows what tomorrow brings).
Your first step is to grab a GPTQ model. TheBloke has been a huge contributor to the community by creating quantized models of almost every new model that comes out - usually within a day, sometimes faster (including every GGML variant, which is a llama.cpp model format that's outside the scope of this article; the TLDR is it lets you run these models on CPU+RAM, no GPU needed). He's also been a major contributor to projects like AutoGPTQ, and an overall champion in raising awareness and making the world of local LLMs more accessible to everyone. You can find all his models here: https://huggingface.co/TheBloke
I've provided a code snippet which should work for almost any project that leverages the transformers library. For more detail, check out their GitHub repos here: AutoGPTQ and ExLlama.
# AutoGPTQ model loading snippet
# use this instead of loading directly via the transformers library
# includes some things that may not be needed in the future, such as the tokenizer json and eos/bos id
import os
import json
import torch
from transformers import AutoTokenizer, StoppingCriteria, StoppingCriteriaList, logging
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
# llm_model_path is a placeholder - point it at the directory where you keep your local models
llm_model_path = os.path.expanduser("~/models")
quantized_model_dir = os.path.join(llm_model_path, "TheBloke_WizardLM-30B-GPTQ")
model_basename = "wizardlm-30b-GPTQ-4bit.act.order"
use_triton = False
tokenizer_config_path = os.path.join(quantized_model_dir, "tokenizer_config.json")
# Load the tokenizer config as a dict
with open(tokenizer_config_path, "r") as f:
    tokenizer_config = json.load(f)
# Now initialize the tokenizer with the config
tokenizer = AutoTokenizer.from_pretrained(
    quantized_model_dir, use_fast=True, return_token_type_ids=False, **tokenizer_config
)
# Verify the start and stop tokens
print(f"Start token: {tokenizer.bos_token}, ID: {tokenizer.bos_token_id}")
print(f"End token: {tokenizer.eos_token}, ID: {tokenizer.eos_token_id}")
model = AutoGPTQForCausalLM.from_quantized(
    quantized_model_dir,
    model_basename=model_basename,
    use_safetensors=True,
    trust_remote_code=False,
    device="cuda:0",
    use_triton=use_triton,
    quantize_config=None,
)
# Note: check the prompt template is correct for this model.
prompt = "Tell me about AI"
print("\n\n*** Generate:")
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.cuda()
output = model.generate(inputs=input_ids, temperature=0.7, max_new_tokens=50)
print(tokenizer.decode(output[0], skip_special_tokens=True))
# Set the bos_token_id and eos_token_id
model.config.bos_token_id = tokenizer.bos_token_id
model.config.eos_token_id = tokenizer.eos_token_id
# Prevent printing spurious transformers error when using pipeline with AutoGPTQ
logging.set_verbosity(logging.CRITICAL)
This code snippet should handle your GPTQ model. Ensure you follow the prompt format correctly - every model has a different one (sometimes they have multiple), and the difference between following it and not following it can be the difference between 'this model is awful' and 'this model is amazing'.
Here's an example of a common prompt template - specifically, from the WizardLM instruction-tuned model:
system_prompt = """
A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.\n
USER: {prompt}\n
ASSISTANT: 
"""
# wrapped between every query/prompt
wrapper_prompt = "USER: {prompt}\nASSISTANT: "
Open-source Embedding Models: Turning Text into Numbers
Embedding models are neural networks that map text data (or images, or audio files, or anything!) to high-dimensional vector spaces. These embeddings can be used to represent and compare documents, sentences, or words in a way that captures their semantic meaning. In the near future, we'll see a bigger emphasis on multimodal models (which means they're capable of working with different types of inputs and outputs, such as audio/images/text all combined). And as a sidenote, GPT-4 was trained on image data - which is almost certainly why it's so good at depicting things visually that text has a hard time capturing.
These embeddings can then be stored in a vector store (or a vector database) for efficient retrieval and search. In the case of my quest to create a finetuned model from earnings report data, I used the e5-large-v2 model to generate embeddings from my source data: a bunch of documents and call transcript PDFs. I stored these in JSON files to keep things simple, but I could have also stored them in a vector database (which gives me more power over search, at the cost of added complexity), or in any other database.
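For reference, here's roughly what that embedding step looks like with e5-large-v2 via the sentence-transformers library. This is a sketch: the chunk text, file name, and JSON layout are placeholders of my own, and e5 models expect a 'passage: ' / 'query: ' prefix on their inputs.
# sketch: embed text chunks with e5-large-v2 and dump them to a simple JSON store
import json
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("intfloat/e5-large-v2")
# chunks would come from your PDF parsing / chunking step (e.g. via LlamaIndex)
chunks = [
    "passage: Revenue for the quarter was $2.1B, up 14% year over year...",
    "passage: On the call, the CFO noted gross margin pressure from...",
]
embeddings = model.encode(chunks, normalize_embeddings=True)
# keep it simple: store text and vector side by side in a JSON file
store = [
    {"text": text, "embedding": emb.tolist()}
    for text, emb in zip(chunks, embeddings)
]
with open("vector_store_example.json", "w") as f:
    json.dump(store, f)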
I could have also used MiniLM-L6-v2. Most people work with OpenAI, which means they tend to use OpenAI’s ada model for this task. In the grand scheme of things, this isn’t an area I’d spend too much attention on, aside from one caveat: once you embed your data with one model, you can’t ‘retrieve’ it using another. At least not yet. Every embedding model turns data into numbers, but each one produces different numbers based on its internals.
That means if you use MiniLM to embed your data for a million documents and spend an entire week of compute doing so, it’s probably a bad idea to change your mind and decide to use a different model a week later. I’ve seen efforts to bridge this via lookup tables (and even some startups that do nothing but translation of embeddings), but I haven’t followed this closely enough to be able to speak intelligently on it.
As a proof-of-concept, I loaded my vector store JSON into a Snowflake table with VARIANT data types - for no other reason than it making sense, at least for a prototype, to put my embeddings side-by-side with my other key data. Now I can store my PDFs, my structured data, and my embeddings, all in a single database, and join them via keys. Data modeling 101.
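As a rough sketch of that proof-of-concept, loading the JSON into a VARIANT column with the Snowflake Python connector looks something like the following. The table name, column names, and connection details are made up, and for anything beyond a prototype you'd normally stage the file and use COPY INTO rather than row-by-row inserts.
# sketch: load a JSON vector store into a Snowflake table with a VARIANT column
import json
import snowflake.connector
conn = snowflake.connector.connect(
    account="my_account", user="my_user", password="...",
    warehouse="my_wh", database="my_db", schema="public",
)
cur = conn.cursor()
cur.execute(
    "CREATE TABLE IF NOT EXISTS earnings_embeddings (doc_id STRING, doc VARIANT)"
)
with open("vector_store_example.json") as f:
    records = json.load(f)
for i, record in enumerate(records):
    # PARSE_JSON turns the JSON string into a VARIANT value
    cur.execute(
        "INSERT INTO earnings_embeddings SELECT %s, PARSE_JSON(%s)",
        (f"doc_{i}", json.dumps(record)),
    )
conn.commit()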
This article does a fantastic job talking about the various components of semantic search, including embedding models, and how there’s rarely a one-size-fits-all approach (such is life) - nor is there a single pattern for how to go from data to LLM-based retrieval. Based on various projects over the months, though, I'm at least seeing a design emerge that aligns with other basic principles in data - but that's for a future piece.
This image does a great job of speaking to the current problems / opportunities at hand:
QLoRA, Datasets, Prompting Strategies: The Art of Finetuning a Model from Home and on a Budget
Key Points:
Prompting strategies are techniques used to guide the behavior of LLMs by providing them with carefully crafted input prompts. Different prompting strategies, such as Alpaca, Reflect, Chat, and ShareGPT, are designed to elicit specific types of responses from the model. To put it simply: a model designed for instructions, such as question answering, is not going to perform well at chat. Or at least, theoretically. Some of the open source models can do well in certain chat scenarios - but the way you build your datasets, and the types of data you use, is going to significantly impact the tasks you can (or rather, should) handle with your newly finetuned LLM.
Yes, you can gather a dataset of trivia Q&A data, tune the prompts for that data to get good at learning trivia questions and answering correctly, and then try to use it as a chatbot - but it’s probably going to end up like that annoying friend of yours who can’t stop sharing random, unrelated facts at dinner that have absolutely nothing to do with the conversation.
By tailoring your dataset to a particular prompting strategy, you can improve your model's performance on your target task. This involves selecting or generating data that aligns with the desired prompting strategy and training the model accordingly.
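To make the 'prompting strategy' idea concrete, here's roughly what a single training record looks like in two of the common formats - an Alpaca-style instruction record versus a ShareGPT-style conversation. The field values are invented for illustration, and the exact field names depend on the dataset format you configure in your finetuning tool.
# Alpaca-style: one instruction, optional input context, one response
alpaca_record = {
    "instruction": "Summarize the revenue trend described in the passage.",
    "input": "Revenue for the quarter was $2.1B, up 14% year over year...",
    "output": "Quarterly revenue grew 14% year over year to $2.1B...",
}
# ShareGPT-style: a multi-turn conversation between a human and the assistant
sharegpt_record = {
    "conversations": [
        {"from": "human", "value": "How did revenue trend this quarter?"},
        {"from": "gpt", "value": "Revenue grew 14% year over year to $2.1B..."},
        {"from": "human", "value": "What drove the growth?"},
        {"from": "gpt", "value": "Management attributed it mainly to..."},
    ]
}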
In my case, I had indexed a large number of public company earnings reports and call transcripts. My data was sitting in a local folder, in a JSON vector store (which LlamaIndex facilitated the creation of). Here’s an example of what my vector_store.json data looks like:
Neat, huh?
I can then leverage a nifty experimental function from LlamaIndex that generates questions from this data. Essentially, it does retrieval over all the documents and uses my local LLM to come up with questions by prompting it with something along the lines of, ‘What are some questions that could be answered by the following chunk of text?’
Many of these questions are complete crap, but I have a (half-baked) process to automatically evaluate them. Some questionable ones still make it through, and a few of those turn out to be interesting. ‘Why are these statements not forward looking statements?’, at first glance, seems like one to toss out - but maybe it’ll generate an interesting answer pair. So I take that array of questions and run them against the vector store again, this time asking it to answer each question based on the data stored there. It doesn’t do a fantastic job (generating data like this against GPT-4 is how many finetuned models were born, but almost every major LLM provider out there has a TOS against this - so I stuck with my local models to generate the synthetic data).
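Here's a rough sketch of that loop: generate candidate questions from the indexed documents, answer them against the same index, and write out question/answer pairs in an instruction format. The LlamaIndex APIs have moved around between versions, so treat class and method names as approximate; the filtering step below is a crude placeholder for my real evaluation, and you'd point LlamaIndex's service context at your local LLM rather than the default OpenAI one.
# sketch: synthetic Q&A generation from indexed documents with LlamaIndex
import json
from llama_index import SimpleDirectoryReader, VectorStoreIndex
from llama_index.evaluation import DatasetGenerator
documents = SimpleDirectoryReader("earnings_pdfs").load_data()
# generate candidate questions from the document chunks
# (configure a service_context with your local LLM if you're not using OpenAI)
generator = DatasetGenerator.from_documents(documents)
questions = generator.generate_questions_from_nodes()
# answer each surviving question against the same data via a query engine
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine()
pairs = []
for q in questions:
    if len(q) < 20:  # crude placeholder filter - my real evaluation is messier
        continue
    answer = query_engine.query(q)
    pairs.append({"instruction": q, "input": "", "output": str(answer)})
with open("synthetic_qa.jsonl", "w") as f:
    for p in pairs:
        f.write(json.dumps(p) + "\n")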
OK - I have my data, which was synthetically created from real data using a local LLM that was also finetuned on synthetic data. Now what?
Enter the LoRA. Which sounds like it should be a beast from Where the Wild Things Are. Alas, a LoRA (low-rank adaptation) is just a pair of small weight matrices (more math) - but its value is in how lightweight and cheap these things are to create. They let us keep the original model's weights frozen so it doesn't "forget" what it was trained on (sometimes we want it to forget, in which case we'd take a different approach). You can then add a LoRA on top of an existing model - instead of making predictions from its own weights alone, the model now includes the LoRA's additional weights as part of its matrix multiplications. That effectively makes a LoRA portable.
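If you want to see the idea in (very simplified) code, a LoRA layer is just the frozen original weight matrix plus a trainable low-rank update. This is a sketch only - real implementations like the peft library handle dropout, scaling conventions, and which layers to target.
# sketch: a frozen linear layer plus a trainable low-rank (LoRA) update
import torch
import torch.nn as nn
class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)              # original weights stay frozen
        self.lora_A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scale = alpha / r
    def forward(self, x):
        # base output plus the low-rank correction; only A and B get gradient updates
        return self.base(x) + (x @ self.lora_A.T @ self.lora_B.T) * self.scale
layer = LoRALinear(nn.Linear(4096, 4096))
out = layer(torch.randn(1, 4096))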
To expand on LoRA - QLoRA (Quantized LoRA) is “an efficient finetuning approach that reduces memory usage enough to finetune a 65B parameter model on a single 48GB GPU while preserving full 16-bit finetuning task performance”. It was built to address the challenges of working with large models, such as high memory and computational requirements. Most importantly, it now allows folks like myself (and you!) to finetune models on consumer hardware.
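In practice you rarely wire this up by hand - the usual QLoRA recipe is to load the base model in 4-bit via bitsandbytes and attach LoRA adapters with peft. A hedged sketch follows; the model name and hyperparameters are just examples, and Axolotl (covered below) wraps all of this behind a config file.
# sketch: the typical QLoRA setup - 4-bit base model + LoRA adapters via peft
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # the 4-bit NormalFloat type from the QLoRA paper
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
    "openlm-research/open_llama_7b", quantization_config=bnb_config, device_map="auto"
)
model = prepare_model_for_kbit_training(model)
lora_config = LoraConfig(
    r=8, lora_alpha=16, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],    # which attention projections get adapters
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()          # only a tiny fraction of the 7B weights are trainable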
In the Stable Diffusion / image generation space, LoRAs are popular for representing specific characters (ie: Spiderman) or styles (ie: the 1000000th variation of anime). But when you combine a bunch of these together - you most often get noise - because at some point, cascading matrices on top of each other is going to no longer serve the original purpose.
By using QLoRAs, you can create a more efficient and task-specific version of the base model. This allows you to optimize the model for your specific use case, improving its performance and reducing its resource requirements. It’s like building a small extension that can sit on top of any base model - you can share it around, mix and match, and see how it performs with different variations.
Of course - given it’s essentially a way for the weights of the model to calculate different values without updating the original base model weights - when you stack a bunch of these on top of each other - you’re probably going to get something very broken (or very amazing? That’s the beauty of this field.).
While our models will likely be nowhere near the quality of big research labs (unless high quality data continues to show that it trumps all else), the learning opportunity it gives many of us is incredible. We can now tinker and see exactly what works and what doesn’t work, and for folks on the sidelines wanting an opportunity to jump in - you now can. There’s all kinds of guides out there to help you, and an incredible ecosystem of builders and tinkerers waiting to help.
Axolotl: Your One-stop Shop for Finalizing Your Dataset, Selecting Prompting Strategies, and Finetuning
Axolotl is an open-source toolkit for preparing datasets, fitting them to prompting strategies appropriate to your task, and finetuning LLMs. It provides a streamlined process for optimizing LLMs for specific tasks and use cases. Axolotl Github repo is a great place to start.
Using Axolotl, you can finalize and format your dataset, select a prompting strategy that fits your task, and run the finetune itself (LoRA, QLoRA, or full finetuning) from a single configuration file.
Once you've created a QLoRA using Axolotl, you can merge it back into the base model (e.g., OpenLlama 7B) to create a finetuned model specific to your data and task goals.
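The merge step itself is short - with peft it looks roughly like this. The paths are placeholders, and note that an adapter is usually merged into a full-precision copy of the base model rather than the 4-bit one.
# sketch: merge a trained (Q)LoRA adapter back into the base model
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
base = AutoModelForCausalLM.from_pretrained("openlm-research/open_llama_7b", torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained("openlm-research/open_llama_7b")
# attach the trained adapter, then fold its weights into the base model
model = PeftModel.from_pretrained(base, "path/to/my-qlora-adapter")
merged = model.merge_and_unload()
merged.save_pretrained("my-earnings-finetune")
tokenizer.save_pretrained("my-earnings-finetune")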
Rather than write up a new variation of using Axolotl for finetuning, I’ve linked to a fantastic guide by Timothy Bogdala. Not only does it include the code you need to do this on your own, but he breaks down each part to explain what’s happening along the way. It was immensely helpful for me when doing this for the first time this weekend.
So now this creation of yours - your finetuned model - will ideally be more efficient and effective at performing the target task. In my case, it should do a better job at interpreting data related to public company earnings reports (I haven’t tested it yet, so who knows).
What’s most exciting about this space is that the model could do all kinds of other things that I just never even anticipated. Maybe the way I randomly sampled earnings report transcripts happened to grab a bunch that were high-stakes drama. If I grabbed recent ones, and they were from tech company earnings calls, there’s a pretty high likelihood that AI was hyped up quite a bit in them. Maybe I ended up creating a finetuned model that thinks the answer to every organizational challenge is to sprinkle in a little AI to your company’s strategy and vision.
There’s an anecdote out there about OpenAI discovering that their model was good at translation tasks, despite being trained on only a tiny fraction of non-English data. I forgot where I heard it (probably a podcast), but that sort of thing happens all the time in this space.
It’s fun and it’s exciting - but the only way this happens is by putting it in the hands of people to use. Giving the community something that might be of value so they can both learn from it and find its strengths and weaknesses - and then improve upon it.
That’s the value of the open source community - and why it’s important to not just share the end results - the models - but the datasets you trained it on, the methods you used, and where possible, the code that allows for it to be reproducible. We all learn better together in this strange new world when we’re open by default, especially when it comes to better understanding and improving upon an incredible technology that will, in some shape or form, become a key part of all of our careers and lives.
** Without fail, after every article I write, I'm asked if I used AI to write the article - the answer is always 'no', but I do use them for research and for help poking holes in structure. I often use it to second guess myself and ask for different variations (like coming up with a new title), but I'll usually discard them, or if I do take pieces of them, heavily edit them. Despite many, many attempts, I've never been able to get an LLM to use (what I consider) my own distinct tone and style, so using them to write these types of articles is often more time-consuming than the stream-of-thought, review/edit, add-sources, refine, publish approach I've always used.
It truly is 'all about the data'!
How many fingers does it take to shake an AI hand?