SLMs: Outperforming ChatGPT on a budget
Written by Tarik Hasic - ML Intern @ Tailbox
The power to outperform OpenAI's cutting-edge models right from your laptop!
Impossible? Think again!
The media has frequently portrayed the custom-LLM landscape as a multi-billion dollar investment only suitable for large companies with deep pockets.
Startups have been forced to resort to closed-source, privatized LLM providers like OpenAI and Google. However, small language models (SLMs) have the potential to free these smaller companies, granting them the ability to create custom, cheaper language models for their specific use cases - eliminating their reliance on these AI conglomerates.
Complete Reliance on OpenAI
Born at MIT, Tailbox is a travel companion that personalizes your day - more than just a planning tool. We use generative AI to build storytelling experiences and provide live updates on events and activities tailored to your exact location.
As the technical team, we are responsible for creating the ML solutions and features that build the company around our value proposition: real-time, location-based information.
Early on, one significant financial pain point was our reliance on OpenAI’s GPT models. These massive models, rumored to have parameter counts in the trillions, were being used for relatively low-difficulty tasks.
In our case, we used these LLMs to transform unstructured data scraped from the Internet into structured formats like JSON. Although these tasks were too fuzzy to handle with a plain rule-based Python script, they did not require the immense power of large language models (LLMs) such as OpenAI’s GPT-4. Moreover, these LLMs did not produce consistent and accurate results - their broad, general-purpose training gave them no sense of the exact output format we needed.
There had to be an alternative: wasting valuable capital on our data collection and processing pipeline (for only sub-par results) would only snowball as the company grew in size and scope.
The Solution: Small Language Models
The solution revealed itself in a different category of language models: small language models (SLMs). These models are a fraction of the size of their larger counterparts - typically around 1.5 to 8 billion parameters. SLMs are far inferior to LLMs in general: they have smaller context windows, lack strong reasoning and problem-solving capabilities, and simply know less about the world around them. However, SLMs have a key distinguishing quality that proves vital in countless applications: they are quick, cheap, and easy to fine-tune.
For reference, the SLMs we used as base models for fine-tuning - Llama3-8B, Qwen2-1.5B, and TinyLlama - weigh in at roughly 8 billion, 1.5 billion, and 1.1 billion parameters respectively, orders of magnitude smaller than OpenAI's closest models.
Since SLMs have so few parameters, they are easily molded like clay into whatever shape the machine learning engineer chooses. So long as the task is relatively simple, an SLM can be hyper-focused to perform a narrow natural language task with high precision, even if that leaves it utterly useless in every other use case.
However, this is precisely what Tailbox needed! Instead of a jack-of-all-trades capable of writing Shakespearean poems and discussing quantum mechanics, we needed a simple worker who takes unstructured data as input and outputs structured JSONs suitable for our database.
Towards Fine-Tuning: Data Collection
The first step of any fine-tuning and model training process is data collection. The model needs a large corpus of example inputs and outputs to understand what is expected. In our case, Tailbox had accumulated an enormous trove of inputs and outputs in the form of a database! We instantly had access to thousands of rows of example input data, along with the metadata and structured information that should be returned.
Transforming this data into a dataset was relatively simple. Each data point consisted of three pieces: the instruction, the input, and the output. In our case, the instruction was the same for each data point: “Given this unstructured data, convert it into JSON format with the following keys: key1, key2, …”.
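To make the format concrete, here is a minimal sketch of how such a dataset might be assembled. The `get_rows()` helper, the example record, and the JSON keys (`name`, `time`, `address`, `price`) are all hypothetical stand-ins for Tailbox's actual database schema; only the instruction/input/output structure mirrors what we describe above.

```python
import json

# Hypothetical helper that yields (raw_text, structured_dict) pairs
# pulled from the existing database of scraped inputs and stored outputs.
def get_rows():
    yield (
        "Live jazz at The Blue Note tonight, 8pm, 131 W 3rd St, $25 cover",
        {"name": "Live jazz at The Blue Note", "time": "20:00",
         "address": "131 W 3rd St", "price": "$25"},
    )

# Placeholder keys: the real prompt lists the actual schema fields.
INSTRUCTION = (
    "Given this unstructured data, convert it into JSON format "
    "with the following keys: name, time, address, price"
)

# Write one JSON object per line (JSONL), a format most
# fine-tuning libraries accept for instruction datasets.
with open("dataset.jsonl", "w") as f:
    for raw_text, structured in get_rows():
        record = {
            "instruction": INSTRUCTION,
            "input": raw_text,
            "output": json.dumps(structured),
        }
        f.write(json.dumps(record) + "\n")
```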
With this dataset ready, the next step was to begin the fine-tuning process!
The Memory Monster
Perhaps overly optimistic about the capabilities of our NVIDIA A100 GPU, we immediately loaded entire language models - Llama3-8B, Qwen2-1.5B, and TinyLlama - into memory. We planned to retrain all of their parameters using abstractions provided by the sentence-transformers library.
Despite our eagerness, we were quickly met with a slap of reality: our GPU exploded (metaphorically). When we ran the training script, memory utilization immediately maxed out. After some research, we discovered that our approach, known as full fine-tuning (updating every parameter of the model), is a costly process generally reserved for companies that can spend millions of dollars on compute. The weights, gradients, and optimizer states that must all live on the GPU during training were simply too great a load for our (now humble) A100 to bear.
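A rough back-of-the-envelope estimate shows why. Assuming standard mixed-precision training with the Adam optimizer (a common rule of thumb is about 16 bytes of GPU memory per trainable parameter), an 8-billion-parameter model blows past a single A100 before activations are even counted:

```python
# Rough memory estimate for full fine-tuning (assumes mixed-precision Adam):
#   fp16 weights (2 B) + fp16 gradients (2 B)
#   + fp32 master weights (4 B) + Adam moment estimates (8 B) ~= 16 bytes/param
params = 8e9              # Llama3-8B
bytes_per_param = 16
print(params * bytes_per_param / 1e9, "GB")   # ~128 GB, before activations
# A single A100 offers 40-80 GB, so full fine-tuning simply cannot fit.
```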
We needed to pivot if we were serious about saving ourselves from reliance on OpenAI.
Parameter-Efficient Fine-Tuning (PEFT)
The solution to our memory issues revealed itself in a survey by Zeyu Han et al. on a family of methods known as parameter-efficient fine-tuning (PEFT). The idea is to “freeze” almost all of a language model’s weights and fine-tune only a small fraction of them - on the order of 1-5% of the total.
The most popular PEFT methods are Low-Rank Adaptation (LoRA) and its quantized variant, QLoRA. Thinking of each layer of the network as a large matrix of weights, LoRA keeps that matrix frozen and instead trains two much smaller matrices whose product approximates the update the layer would have learned during full fine-tuning. These smaller matrices, known as “LoRA adapters,” are added on top of the original model’s weights at inference time.
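As a rough illustration of the idea (a minimal PyTorch sketch, not Tailbox's actual training code, with the rank `r` and scaling `alpha` chosen arbitrarily), a LoRA layer keeps the original weight matrix frozen and learns only the two small factors:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer plus a trainable low-rank update: W + (alpha/r) * B @ A."""

    def __init__(self, base: nn.Linear, r: int = 16, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():      # freeze the original weights (and bias)
            p.requires_grad_(False)
        self.scale = alpha / r
        # The two small "LoRA adapter" matrices: (r x d_in) and (d_out x r).
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))

    def forward(self, x):
        # Frozen path plus the low-rank update; only A and B receive gradients.
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)
```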
QLoRA builds on LoRA by adding quantization, which reduces the numerical precision of the frozen base model’s weights - typically storing them in 4-bit (or 8-bit) precision instead of 16-bit - and by introducing paged optimizers that let optimizer state spill over into CPU memory. QLoRA, in particular, would prove vital in conquering the memory issues we encountered during our first attempt at fine-tuning.
A diagram describing the differences between full fine-tuning, LoRA, and QLoRA from QLoRA: Efficient Finetuning of Quantized LLMs by Tim Dettmers et al. QLoRA (far right) makes use of a 4-bit quantized model (bottom block), LoRA Adapters (middle three blocks), and a paged optimizer (top three blocks and CPU). Paged optimizers allow for the GPU to offload memory to the CPU; the CPU holds these pages of memory that can be used as needed by the GPU during the training loop.
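For readers on the standard Hugging Face stack rather than Unsloth (covered next), the same ingredients are exposed through the bitsandbytes integration. This is a generic, hedged sketch of a typical configuration - the model name, batch size, and other hyperparameters are illustrative, not our exact settings:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig, TrainingArguments

# 4-bit NF4 quantization of the frozen base model, as described in the QLoRA paper.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2-1.5B", quantization_config=bnb_config, device_map="auto"
)

# A paged 8-bit AdamW lets optimizer state spill over to CPU RAM when needed.
training_args = TrainingArguments(
    output_dir="qlora-out",
    optim="paged_adamw_8bit",
    per_device_train_batch_size=4,
    bf16=True,
)
```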
QLoRA+: Unsloth
The final puzzle piece was finding an efficient framework to implement QLoRA. Unfortunately, many of our initial attempts resulted in dependency conflicts, mysterious errors, and continued memory “explosions.”
That was until we stumbled upon a gem of a GitHub repository: Unsloth. Written in Triton by a former NVIDIA engineer, Unsloth rewrites and optimizes the core mathematical kernels involved in deep learning - including back-propagation! Complete with a collection of pre-quantized models on the Hugging Face Hub, Unsloth was the first and only framework that finally allowed us to fine-tune a language model on a single GPU.
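In practice, the core of an Unsloth QLoRA run fits in a few lines. The sketch below is a generic example based on Unsloth's documented API rather than Tailbox's exact code: the model name, hyperparameters, and target modules are illustrative, it assumes the JSONL records have already been formatted into a single "text" prompt column, and the exact SFTTrainer argument names vary with the TRL version installed.

```python
from unsloth import FastLanguageModel
from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import load_dataset

# Load a pre-quantized 4-bit base model from Unsloth's Hugging Face collection
# (model name and settings are illustrative).
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen2-1.5b-bnb-4bit",
    max_seq_length=2048,
    load_in_4bit=True,
)

# Attach LoRA adapters; only these small matrices are trained.
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)

# Assumes each record was pre-formatted into a single "text" prompt string.
dataset = load_dataset("json", data_files="dataset.jsonl", split="train")

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=2048,
    args=TrainingArguments(
        output_dir="outputs",
        per_device_train_batch_size=4,
        num_train_epochs=1,
        learning_rate=2e-4,
        optim="paged_adamw_8bit",
    ),
)
trainer.train()
```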
Once training completed, our hypothesis proved correct: the fine-tuned models produced JSONs in exactly the format we expected! Tailbox could now convert its raw, scraped data from the Internet into clean JSON with a smaller local model, cutting costs and eliminating reliance on third-party providers such as OpenAI.
In terms of specific rankings, Qwen2-1.5B produced the most accurate results with strong speed, averaging 25 seconds per inference call. Llama3-8B trailed, clocking in at an average of 27.5 seconds with lower-quality results. TinyLlama was the fastest at an average of 22 seconds, but its output quality was poor.
Speeding Up the Sloth
The only challenge that remained was optimizing the inference of our fine-tuned model. The local model was clocking in at an average raw-data-to-JSON time of 16 seconds - immensely slow compared to its hosted LLM counterparts.
The solution lay in another memory trick that used the server's underutilized GPU memory during inference: batch processing.
At a high level, batch processing groups many inference calls together so the GPU executes them in parallel, since GPU memory and compute often sit underutilized during serialized, one-at-a-time inference. If you are curious about batch processing and other inference optimizations, we encourage you to check out this insightful resource from NVIDIA.
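Here is a minimal sketch of what batched generation looks like with a Hugging Face-style causal model (the function name, prompt list, and generation parameters are illustrative, not our production code):

```python
import torch

def generate_batch(model, tokenizer, prompts, max_new_tokens=512):
    """Run one generate() call over a whole batch of prompts instead of looping."""
    tokenizer.padding_side = "left"            # left-pad for decoder-only models
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
    inputs = tokenizer(prompts, return_tensors="pt", padding=True).to(model.device)
    with torch.no_grad():
        outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Strip the prompt tokens and decode only the newly generated JSON.
    generated = outputs[:, inputs["input_ids"].shape[1]:]
    return tokenizer.batch_decode(generated, skip_special_tokens=True)

# Usage: one call processes many scraped documents at once, e.g.
# jsons = generate_batch(model, tokenizer, list_of_prompts)
```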
With batch processing implemented, we cut the average inference time to 0.85 seconds - a 19-fold decrease in inference latency!
Conclusion and Insights
We sincerely hope you found this journey of circumventing reliance on OpenAI models helpful and insightful.
Using OpenAI's GPT models for many everyday natural language tasks is like using a sledgehammer to crack a walnut - it's overkill. Instead, these tasks can often be broken into smaller pieces that are ideal for SLMs, which deliver comparable quality at a fraction of the cost of expensive third-party solutions.
Now that you know a bit of the tech behind Tailbox, why don't you give it a try?