Learn how to put Large Language Model-based applications into production safely and efficiently.
- LLMs are exciting because they work within the same framework (language) as humans
- Society has been built on language, so effective language models have limitless applications such as chatbots, programming assistants, video games, and AI assistants.
- LLMs are excellent at many tasks and can even pass demanding medical and law exams
- LLMs are wrecking balls, not hammers: avoid them for simple problems, problems that require low latency, and problems with high risks.
- Reasons to buy include:
  - Quickly getting up and running to conduct research and prototype use cases
  - Easy access to highly optimised production models
  - Access to the vendor's technical support and systems
- Reasons to build include:
  - Getting a competitive edge for your business use case
  - Keeping costs low and transparent
  - Ensuring reliability of the model
  - Keeping your data safe
  - Controlling model output on sensitive or private topics
- There is no technical moat preventing you from competing with larger companies: open source frameworks and models provide the building blocks to pave your own path.
- The five components of linguistics are phonetics, syntax, semantics, pragmatics, and morphology:
  - Phonetics can be added through a multimodal model that processes audio files and is likely to improve LLMs in the future, but current datasets are too small.
  - Syntax is what current models are good at.
  - Semantics is added through the embedding layer.
  - Pragmatics can be added through engineering efforts.
  - Morphology is added in the tokenisation layer.
- Language does not necessarily correlate with reality. Understanding the process that people use to create meaning outside of reality is useful to training meaningful (to people) models.
- Proper tokenisation can be a major hurdle: too many `<UNK>` tokens degrade performance, especially on specialised domains like code or maths.
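A minimal sketch of the `<UNK>` problem, using a toy whitespace tokeniser with a hypothetical vocabulary (real subword tokenisers are more forgiving, but the failure mode on out-of-domain text is the same):

```python
def tokenize(text, vocab, unk="<UNK>"):
    # whitespace tokeniser that maps any out-of-vocabulary word to <UNK>
    return [w if w in vocab else unk for w in text.split()]

# illustrative vocabulary built from general-domain text
vocab = {"the", "model", "returns", "a", "value"}

general = tokenize("the model returns a value", vocab)
code = tokenize("def forward(self, x): return self.proj(x)", vocab)

# specialised text collapses almost entirely into <UNK>, destroying signal
unk_rate = code.count("<UNK>") / len(code)
```

The general-domain sentence survives intact, while every token of the code snippet becomes `<UNK>`, which is why domain-matched tokenisation matters.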
- Multilingual processing has consistently outperformed monolingual processing, even on monolingual tasks; this held even before modern language models.
- Each language model type, taken in sequence, shows the natural and organic growth of the LLM field as more and more linguistic concepts are added and improve the models.
- Language modelling has seen an exponential increase in efficacy, correlating with how linguistics-focused the modelling has been.
- Attention is a mathematical shortcut for processing larger context windows faster and is the backbone of modern architectures: encoders, decoders, and Transformers.
  - Encoders improve the semantic approximations in embeddings.
  - Decoders are best at text generation.
  - Transformers combine the two.
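The "mathematical shortcut" is scaled dot-product attention. A minimal NumPy sketch (single head, no masking, illustrative shapes):

```python
import numpy as np

def attention(Q, K, V):
    # scaled dot-product attention: softmax(Q K^T / sqrt(d)) V
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    # numerically stable softmax over the key axis
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 8))   # 3 query positions, head dimension 8
K = rng.normal(size=(5, 8))   # 5 key positions
V = rng.normal(size=(5, 8))   # one value vector per key
out, weights = attention(Q, K, V)
```

Each output row is a weighted mix of the value vectors, with weights summing to 1, which is what lets the model attend over the whole context in one matrix multiply rather than step by step.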
- Larger models demonstrate emergent behaviour: suddenly being able to accomplish tasks they couldn’t before.
- LLMs are difficult to work with mostly because they are big, which means longer times to download, load into memory, and deploy, forcing us to use expensive resources.
- LLMs are also hard to deal with because they deal with natural language and all its complexities including hallucinations, bias, ethics, and security.
- Regardless of whether you build or buy, LLMs are expensive, and managing the costs and risks associated with them will be crucial to the success of any project utilising them.
- Compressing models as much as we can makes them easier to work with; quantisation, pruning, and knowledge distillation are particularly useful for this.
- Quantisation is popular because it is easy and can be done after training without any fine-tuning.
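To make the "easy, post-training" point concrete, here is a minimal sketch of symmetric int8 per-tensor quantisation in NumPy (production schemes add per-channel scales and calibration, but the core idea is just this):

```python
import numpy as np

def quantize_int8(x):
    # symmetric quantisation: map floats onto the int8 range [-127, 127]
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    # recover approximate float weights for use at inference time
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=(256, 256)).astype(np.float32)  # stand-in weight matrix

q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
error = np.abs(w - w_hat).max()   # bounded by half a quantisation step
```

The quantised tensor takes a quarter of the float32 memory, and no fine-tuning was needed to produce it.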
- Low Rank Approximation is an effective way of shrinking a model and has been used heavily for adaptation thanks to LoRA.
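A minimal sketch of the LoRA idea in NumPy, with illustrative dimensions: the pretrained weight `W` stays frozen and only a low-rank delta `A @ B` is trained, so the number of trainable parameters collapses:

```python
import numpy as np

d, k, r = 512, 512, 8   # full weight is d x k; adapter rank r << min(d, k)
rng = np.random.default_rng(0)

W = rng.normal(size=(d, k))           # frozen pretrained weight
A = rng.normal(size=(d, r)) * 0.01    # trainable down-projection
B = np.zeros((r, k))                  # trainable up-projection, zero-initialised
                                      # so the adapted model starts identical to the base

x = rng.normal(size=(1, d))
y = x @ (W + A @ B)   # adapted forward pass; equals x @ W before any training

full_params = d * k          # parameters updated by full fine-tuning
lora_params = d * r + r * k  # parameters updated by LoRA
```

With rank 8 on a 512x512 layer, LoRA trains 32x fewer parameters than full fine-tuning, which is why adapters are so cheap to store and swap.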
- There are three core directions we use to parallelise LLM workflows: Data, Tensor, and Pipeline. DP helps us increase throughput, TP helps us increase speed, and PP makes it all possible to run in the first place.
- Combining the parallelism methods together, we get 3D parallelism (Data+Tensor+Pipeline), where the techniques synergise, covering each other's weaknesses and helping us get more utilisation.
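A toy sketch of the tensor-parallel idea, with NumPy arrays standing in for devices: the weight matrix is split column-wise, each "device" computes a partial output, and the shards are concatenated to recover the serial result (real systems like Megatron-LM also split row-wise and fuse communication, which this sketch omits):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 64))     # activations for a small batch
W = rng.normal(size=(64, 128))   # full weight matrix

# tensor parallelism: split W column-wise across two "devices"
W0, W1 = np.split(W, 2, axis=1)
y0 = x @ W0   # computed on device 0
y1 = x @ W1   # computed on device 1

# an all-gather across devices reassembles the full output
y_parallel = np.concatenate([y0, y1], axis=1)
y_serial = x @ W
```

Because the maths is exact, each device only needs to hold half the weights, which is what makes layers too large for one accelerator runnable at all.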
- The infrastructure for LLMOps is similar to MLOps, but don’t let that fool you since there are many caveats where “good enough” no longer works.
- Many tools are offering new features specifically for LLM support.
- Vector Databases in particular are interesting as a new piece of the infrastructure puzzle needed for LLMs that allow quick search and retrievals of embeddings.
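Under the hood, the core operation of a vector database is nearest-neighbour search over embeddings. A brute-force sketch with cosine similarity (production systems use approximate indexes such as HNSW to scale, but the interface is the same; all names here are illustrative):

```python
import numpy as np

def cosine_top_k(query, index, k=3):
    # normalise both sides so dot products equal cosine similarity
    q = query / np.linalg.norm(query)
    idx = index / np.linalg.norm(index, axis=1, keepdims=True)
    scores = idx @ q
    top = np.argsort(-scores)[:k]   # indices of the k most similar vectors
    return top, scores[top]

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(1000, 64))              # stand-in stored embeddings
query = embeddings[42] + 0.01 * rng.normal(size=64)   # near-duplicate of item 42

top, scores = cosine_top_k(query, embeddings, k=3)
```

The near-duplicate query retrieves item 42 first with a similarity close to 1.0, which is exactly the retrieval step behind RAG-style pipelines.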
- Data Engineers have unique datasets to acquire and manage for LLMs, like model weights, evaluation datasets, and embeddings.
- No matter your task, there is a wide array of open source models you can acquire and fine-tune into your own model.
- Text-based tasks are harder to evaluate than the simple equality metrics you’d find in traditional ML tasks, but there are many industry benchmarks to help you get started.
- Evaluating LLMs for more than just performance, such as bias and potential harm, is your responsibility.
- You can use the Evaluate library to build your own evaluation metrics.
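Custom metrics in the Evaluate library boil down to a function from predictions and references to a score. As an illustration of the kind of text metric you might implement (plain Python, not the library's API), here is token-overlap F1, a common way to score generated text beyond exact equality:

```python
from collections import Counter

def token_f1(prediction, reference):
    # token-overlap F1, SQuAD-style: partial credit for shared tokens
    pred, ref = prediction.split(), reference.split()
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

score = token_f1("the cat sat on the mat", "a cat sat on a mat")
```

Exact equality would score this pair 0, while token F1 gives 2/3, reflecting that most of the content was right: the kind of nuance text evaluation needs.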
- There are many large open source datasets, but most of them come from scraping the web and require cleaning.
- Instruct schemas and annotating your data can be effective ways to clean and analyse your data.
- Fine-tuning a model on a dataset that has an appropriate distribution of speech acts for the task you want your model to perform will help your model generate context appropriate content.
- Building your own subword tokeniser to match your data can greatly improve your model’s performance.
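The standard algorithm behind subword tokenisers is Byte-Pair Encoding: start from characters and repeatedly merge the most frequent adjacent pair. A minimal training sketch on a toy corpus (frequencies and words are illustrative; real tokenisers like Hugging Face Tokenizers add byte fallback and much larger corpora):

```python
from collections import Counter

def get_pair_counts(vocab):
    # vocab maps a space-separated symbol sequence to its corpus frequency
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    # replace every occurrence of the chosen pair with its merged symbol
    a, b = pair
    merged = {}
    for word, freq in vocab.items():
        symbols = word.split()
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and symbols[i] == a and symbols[i + 1] == b:
                out.append(a + b)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[" ".join(out)] = freq
    return merged

# toy corpus: each word pre-split into characters, with a frequency
vocab = {"l o w": 5, "l o w e r": 2, "n e w e s t": 6, "w i d e s t": 3}
merges = []
for _ in range(3):
    best = max(get_pair_counts(vocab), key=get_pair_counts(vocab).get)
    vocab = merge_pair(best, vocab)
    merges.append(best)
```

Training on your own data means the learned merges reflect your domain's frequent substrings, so domain terms stop fragmenting into many rare pieces.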
- Many problems teams are trying to use LLMs for can be solved by simply using embeddings from your model instead.
#LLMs #Production #Building #Product #Modelling #Apis #Code