Here's Why LLM Compression Matters
The image above is ChatGPT's output when prompted to remake the one below in 1920x1080 format:
That hallucination is the result of the compute and energy constraints that even OpenAI faces, despite having raised over $13 billion from Microsoft and other investors.
Large Language Models (LLMs), the back end behind generative AI front ends, are the most important new technology since the Internet, and they will dramatically increase the efficiency of the people, companies, industries and economies that make use of them.
Before that can happen, though, LLMs need to get more accurate and more efficient.
Hundreds of LLM-focused startups have formed to make LLM development and deployment easier, cheaper, faster, less energy intensive and more user friendly.
They have received tens of billions of dollars in funding from investors. Some now have solutions in production, and you're starting to see them exhibit at conferences like Ai4, which I attended in Las Vegas last week.
Typically 80-90% of those startups' employees are deeply technical, with only a handful of people in Sales, Marketing and other less technical functions. That's to be expected at this early-adopter phase. Their buyers, for now, are equally technical.
For LLM startups to cross Geoffrey Moore's chasm from early adopters to business pragmatists, however, they will need to make their solutions understandable to executives outside the CTO and CIO organizations.
More importantly, if they are to grab the lion's share of the market, they'll need the type of broad market awareness that comes from effective marketing.
To market effectively, in turn, these startups will need to come up with analogies that explain the GenAI technical challenges their solutions address.
To give you an idea of what that might look like, let's examine the biggest challenge companies face in deploying LLMs: compute and energy costs that can run from millions to tens of millions of dollars annually.
To address the skyrocketing compute and energy costs of LLMs, companies are turning to compression technologies such as quantization, pruning, low-rank approximation, knowledge distillation and others. Some of these are open source; others are being built by startups, but it is clear that LLM compression tools will be a massive market opportunity.
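For the technically curious, here is a rough sketch of one of those techniques, low-rank approximation, applied to a toy matrix. The sizes and NumPy code below are purely illustrative, not a real LLM layer or any vendor's implementation:

```python
import numpy as np

# A minimal sketch of low-rank approximation: replace a big weight matrix W
# with the product of two thin matrices, keeping only its strongest
# "directions". Toy sizes, not real LLM dimensions.

rng = np.random.default_rng(0)
W = rng.normal(size=(512, 512))           # original layer: 262,144 numbers

U, S, Vt = np.linalg.svd(W, full_matrices=False)
r = 64                                    # keep only the top-64 singular values
A = U[:, :r] * S[:r]                      # 512 x 64
B = Vt[:r, :]                             # 64 x 512
W_approx = A @ B                          # stored as A and B: 65,536 numbers

print("compression ratio:", W.size / (A.size + B.size))            # ~4x fewer numbers
print("relative error:", np.linalg.norm(W - W_approx) / np.linalg.norm(W))
```

(Real model weights are much closer to low-rank than this random toy matrix, which is why the technique works better in practice than the error printed here suggests.)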
What is quantization, you ask? And how does it compare to low-rank approximation? That's easy - just look at this chart!
Lost? Here's what quantization is:
Imagine you're packing a suitcase and want to bring as many clothes as possible without exceeding the airline's weight limit. Instead of folding them neatly, you roll them up tightly to use less space. The clothes might get a bit wrinkled, but they're still wearable, and you've managed to fit everything you need into the suitcase. In this analogy:
- The suitcase is your compute and memory budget.
- The clothes are the model's parameters (its weights).
- Rolling them tightly is storing each weight in fewer bits.
- The wrinkles are the small loss of precision you accept in return.
So, quantization is like rolling up your clothes to fit more into your suitcase. It compresses the LLM by using a more compact representation of the data, allowing the model to take up less space and use fewer resources, while still retaining most of its original functionality.
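If you'd like to see the suitcase trick in code, here is a minimal sketch of 8-bit quantization using NumPy. Real LLM quantizers (GPTQ, AWQ, bitsandbytes and the like) are far more sophisticated; the matrix below is made up for illustration only:

```python
import numpy as np

rng = np.random.default_rng(0)
weights = rng.normal(size=(4, 4)).astype(np.float32)  # 32-bit "folded clothes"

# Map each float32 weight onto one of 256 int8 levels (the "tight roll").
scale = np.abs(weights).max() / 127.0
quantized = np.round(weights / scale).astype(np.int8)  # ~4x less storage

# Unpack at inference time: close to the original, just slightly "wrinkled".
dequantized = quantized.astype(np.float32) * scale

print("max error:", np.abs(weights - dequantized).max())
print("storage: 32 bits per weight -> 8 bits per weight")
```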
How about black-box knowledge distillation, another LLM compression technique? This will help:
Black-box KD usually prompts the teacher LLM to generate a distillation dataset to fine-tune the student LM, thereby transferring capabilities from the teacher LLM to the student LM. In Black-box KD, teacher LLMs such as ChatGPT (gpt-3.5-turbo) and GPT4 (OpenAI, 2024) are typically employed, while smaller LMs (SLMs), such as GPT2 (Radford et al., 2019), T5 (Raffel et al., 2020), FlanT5 (Chung et al., 2024), and CodeT5 (Wang et al., 2021), are commonly utilized as student LMs. On the other hand, researchers find that LLMs have emergent abilities, which refers to a significant improvement in performance when the model reaches a certain scale, showcasing surprising capabilities. Many Black-box KD methods try to distill emergent abilities from LLMs to student LMs, and we introduce three commonly used emergent ability distillation methods: Chain-of-Thought (CoT) Distillation, In-Context Learning (ICL) Distillation, and Instruction Following (IF) Distillation. [SOURCE: https://arxiv.org/pdf/2308.07633]
Still confused? Try this analogy:
Imagine you have an experienced, wise teacher who knows a vast amount of information on many subjects. This teacher has spent years learning and understanding complex topics in great depth. Now, the teacher has an assistant who needs to learn enough to teach others, but doesn't need to know every single detail the teacher does. The teacher takes the time to explain the most important concepts to the assistant, simplifying the knowledge and focusing on what's essential to do the job effectively. In this analogy:
- The teacher is the large, expensive LLM.
- The assistant is the smaller student model.
- The simplified lessons are the teacher's answers, which become the training data for the student.
So, knowledge distillation in LLMs is like a wise teacher passing on the most important and useful knowledge to an assistant. The result is a smaller, more efficient model that retains much of the original model's capabilities, but in a more compact form.
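Here is a minimal sketch of the black-box distillation recipe described above. The query_teacher and finetune_student helpers are hypothetical stand-ins I've written for illustration, not real library APIs; in practice the teacher would be a hosted model such as gpt-3.5-turbo and the student a small open model fine-tuned with your framework of choice:

```python
def query_teacher(prompt: str) -> str:
    # Stand-in for an API call to the large teacher LLM (hypothetical).
    canned = {
        "Explain quantization in one sentence.":
            "Quantization stores model weights in fewer bits to save memory and compute.",
    }
    return canned.get(prompt, "Teacher answer goes here.")

def finetune_student(dataset: list[dict]) -> None:
    # Stand-in for fine-tuning a small student LM on (prompt, completion) pairs.
    print(f"Fine-tuning student on {len(dataset)} teacher-labeled examples...")

# Step 1: prompt the teacher to build a distillation dataset.
prompts = [
    "Explain quantization in one sentence.",
    "Why is LLM inference expensive?",
]
distillation_dataset = [{"prompt": p, "completion": query_teacher(p)} for p in prompts]

# Step 2: transfer the teacher's behavior into the smaller student model.
finetune_student(distillation_dataset)
```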
CONCLUSION: These visual and written metaphors let non-technical finance, operations and executive leaders know what value an LLM startup's technology will bring to their GenAI initiatives.
LLM startups that hope to start an industry movement based on their ground-breaking technology will need to analogize while proving they are necessary and unique.
Do that and you can market effectively, generate awareness and inbound interest, and quickly move deals through the pipeline.
Analogies are the way to cross the LLM chasm.