Exploring the LLM Infra Stack, Part 2: The Model Layer
By Andy Triedman


This is the second post in our 4-part series on the LLM infra stack. You can catch up on our last post on the Data Layer here. If you are building in this space, we’d love to hear from you at [email protected].

The Model Layer

The core component of any LLM system is the model itself. It provides the fundamental capabilities that enable new products.

LLMs are foundation models: models trained on a broad set of data that can be adapted to a wide range of tasks. They tend to fall into one of two categories: large, hosted, closed-source models like GPT-4 and PaLM 2, or smaller, open-source models like Llama 2 and Falcon.

It’s unclear which models will dominate. The state of the art is changing rapidly – a new best-in-class model is announced monthly – and so few products have been built with LLMs that the trade-offs remain largely untested. There are two emerging paths:

1. Large foundation models dominate: Closed-source LLMs continue to provide capabilities other models can’t match. A managed service offers simplicity & rapid cost reduction. These vendors solve security concerns.

2. Fine-tuned models provide the best value: Smaller, less-expensive models prove just as good for most applications after fine-tuning. Businesses prefer them because they control the model and can create intellectual property.

  • In this world, many companies will still start building with managed foundation models and move to self-hosted fine-tuned models over time.

Large foundation models will work best for use cases that require broad knowledge and reasoning. Asking a model to plan a month-long vacation itinerary in Southeast Asia for a vegetarian couple interested in Buddhism requires planning and lots of working memory. These models are also best for new tasks where you don’t have data.

Smaller, fine-tuned models will excel in products composed of repeatable, well-defined tasks. They’ll perform just as well as larger models for a fraction of the cost. Need to categorize customer feedback as positive or negative? Need to extract details from an email into a spreadsheet? A smaller model will work great.
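As a concrete illustration, a compact off-the-shelf classifier can handle this kind of well-defined task without a frontier model. Here is a minimal sketch using Hugging Face’s `transformers` pipeline with a small sentiment model; the feedback strings are made up for illustration.

```python
from transformers import pipeline

# Load a small (~66M parameter) sentiment classifier -- no frontier-scale LLM needed.
# The default checkpoint is a DistilBERT model fine-tuned on SST-2.
classifier = pipeline("sentiment-analysis")

feedback = [  # illustrative examples
    "The new dashboard is fantastic, saves me an hour a day.",
    "Checkout keeps failing on mobile and support never answered.",
]

for item, result in zip(feedback, classifier(feedback)):
    print(f"{result['label']:>8}  ({result['score']:.2f})  {item}")
```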

Which models are safer and more secure? Which are best for compliance and governance? It’s still unclear.

Many companies would prefer to self-host their models to avoid sending their data to a third party. However, emerging research suggests fine-tuning may compromise safety. Foundation models are tuned to avoid inappropriate language and content. LLMs, especially smaller ones, can forget these rules when fine-tuned. Large model providers may also find it easier to demonstrate that their models and data are compliant with nascent regulations. More work needs to be done to establish these conclusions.

In the long term, we can’t predict how LLMs will develop. A large research org could discover a new type of model that is an order of magnitude better than what we have today.

Progress in model development will drastically change the structure of the Model Layer. Regardless of the direction, it will have the following key components:

Core model:

Training LLMs from scratch (also known as pre-training) costs hundreds of thousands to tens of millions of dollars for each model. OpenAI is said to have spent over $100 million on GPT-4. Only companies with huge balance sheets can afford to develop them.

Conveniently, fine-tuning a pre-trained model with your own data can provide similar results at a fraction of the cost. This can cost well under $100 with data you already have on hand. During inference, fine-tuned models can be faster and an order of magnitude cheaper to boot.

Fine-tuning is supported by a rapidly improving set of open-source models. The most capable ones, like Meta’s Llama 2, perform similarly to the best LLMs from 6-12 months ago (e.g., GPT-3.5). When fine-tuned for a concrete task, open-source models often reach near-parity with today’s best closed-source LLMs (e.g., GPT-4).
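For a sense of what fine-tuning an open-source model looks like in practice, here is a minimal sketch using Hugging Face’s `peft` library to attach LoRA adapters to Llama 2 so that only a small fraction of weights are trained. The model choice and hyperparameters are illustrative assumptions (access to the gated Llama 2 weights is assumed), and the actual training loop is omitted.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model, TaskType

base_model = "meta-llama/Llama-2-7b-hf"  # assumption: you have access to the gated weights

tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForCausalLM.from_pretrained(base_model, torch_dtype=torch.bfloat16)

# LoRA freezes the base weights and trains small low-rank adapter matrices instead,
# which is what keeps fine-tuning runs cheap (often a single GPU for a few hours).
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,              # adapter rank (illustrative)
    lora_alpha=16,
    lora_dropout=0.05,
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of total parameters
```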

Serving/compute:

Training LLMs is expensive, but inference isn’t cheap, either. LLMs require an extreme amount of memory and substantial compute. For a product owner experimenting with OpenAI, each GPT-3.5 query will cost $0.01-0.03. GPT-4 is an order of magnitude more expensive per token, with long queries costing up to $3.00.
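To see how per-query costs add up, here is a back-of-the-envelope calculation. The prices and token counts below are illustrative assumptions, not current quotes – API pricing changes frequently.

```python
# Rough per-query cost: tokens used x price per 1K tokens.
PRICING = {  # (input $/1K tokens, output $/1K tokens) -- illustrative assumptions
    "gpt-3.5-turbo": (0.0015, 0.002),
    "gpt-4":         (0.03,   0.06),
}

def query_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    in_price, out_price = PRICING[model]
    return input_tokens / 1000 * in_price + output_tokens / 1000 * out_price

# A chat-style query: ~2,000 tokens of context, ~500 tokens of answer.
for model in PRICING:
    print(f"{model}: ${query_cost(model, 2_000, 500):.4f} per query")
# At ~1M such queries per month, the gap between models is the difference
# between a rounding error and a line item on the P&L.
```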

If you try to self-host for privacy/security or cost reasons, you won’t find it easy to match OpenAI. Virtual machines with NVIDIA A100 GPUs (the best fit for most large foundation models) run $3.67 to $40.55 per hour on Google Cloud. Availability of these machines is very limited and sporadic, even through major cloud providers.

For LLM applications with thousands or millions of users, product owners will face new challenges. Unlike most classic SaaS applications, they’ll need to think about costs. At $3 per query, it’s hard to build a viable product. Similarly, users won’t hang around long if it takes 30 seconds to return an answer.

It will be critical to serve LLMs in a performant and cost-effective way. Even “smaller” models described above still have billions of parameters.

How can you reduce cost and latency in production? There are lots of approaches. On the inference side, batch queries or cache their results. Optimize memory allocation to reduce fragmentation. Use speculative decoding, where a smaller model drafts tokens and a larger model “accepts” or rejects them. On the infrastructure side, compare GPU prices in real time and send workloads to the cheapest option, or use spot pricing if you can manage spotty availability and failovers.
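As a concrete example of one of these levers, here is a minimal sketch of response caching: identical prompts skip the model entirely. The `call_model` function is a hypothetical stand-in for whatever inference client you use, and real systems typically use semantic (embedding-based) caches rather than exact-match ones.

```python
import hashlib
from typing import Callable

def make_cached(call_model: Callable[[str], str]) -> Callable[[str], str]:
    """Wrap an LLM call with an exact-match, in-memory cache."""
    cache: dict[str, str] = {}

    def cached_call(prompt: str) -> str:
        key = hashlib.sha256(prompt.encode()).hexdigest()
        if key not in cache:                 # cache miss: pay for one model call
            cache[key] = call_model(prompt)
        return cache[key]                    # cache hit: zero marginal cost, ~zero latency

    return cached_call

# Usage with a hypothetical inference client:
# cached_llm = make_cached(my_inference_client.complete)
# answer = cached_llm("Summarize this support ticket: ...")
```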

Most companies won’t have the in-house expertise to do all of this themselves. Startups will rise to fill this need.

Model routing/abstraction:

For some LLM applications, it will be obvious which model to use where. Summarize emails with one LLM. Transform the data into a spreadsheet with another.

But for others, it can be unclear. You might have a chat interface where some customer messages are simple, and others are complex. How will you know which should go to a large model vs. a small one?

For these types of applications, there may be an abstraction layer to analyze each request and route it to the best model.
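A minimal sketch of what such a routing layer might look like: a cheap heuristic (or, in practice, a small classifier) estimates request complexity and picks a model. The model clients and the heuristic are hypothetical placeholders, deliberately crude.

```python
from typing import Callable

def make_router(
    small_model: Callable[[str], str],   # hypothetical client for a cheap fine-tuned model
    large_model: Callable[[str], str],   # hypothetical client for a frontier model
) -> Callable[[str], str]:
    REASONING_HINTS = ("why", "plan", "compare", "step by step", "explain")

    def route(prompt: str) -> str:
        # Crude complexity heuristic: long prompts or reasoning-flavored requests
        # go to the large model; everything else goes to the cheap one.
        looks_complex = len(prompt.split()) > 200 or any(
            hint in prompt.lower() for hint in REASONING_HINTS
        )
        return (large_model if looks_complex else small_model)(prompt)

    return route

# Usage (with hypothetical clients):
# route = make_router(small_model=finetuned_llama.complete, large_model=gpt4.complete)
# reply = route("What's my order status?")   # -> small model
```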

For a product owner, it is already difficult to evaluate the behavior of an LLM in production. If your system might route to many different LLMs, that problem could get even more challenging.

However, if dynamic routing improves capabilities and performance, the trade-off might be worth it. This layer could almost be thought of as a separate model itself and evaluated on its own merit.

Fine-tuning & optimization:

As we noted above, many product owners will fine-tune a model for their use case.

For most use cases, fine-tuning will not be a one-off endeavor. It will start in product development and continue indefinitely as the product is used.

The key to great fine-tuning will be fast feedback loops. Today, product owners triage user feedback/bugs in code and ask engineers to fix them. For LLM applications, product owners will monitor qualitative and quantitative data on LLM behavior. They will work with engineers, data ops teams, product/marketing, and evaluators to curate new data and re-fine-tune models.
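A minimal sketch of the logging half of that loop: capture every production interaction along with whatever signal the user provides, so the data ops team can later search, label, and promote the best examples into the next fine-tuning set. The schema and file path are illustrative assumptions.

```python
import json
import time
from pathlib import Path

FEEDBACK_LOG = Path("feedback.jsonl")  # illustrative; production systems use a warehouse/DB

def log_interaction(prompt: str, response: str, model_version: str,
                    user_rating: int | None = None, flagged: bool = False) -> None:
    """Append one production interaction for later curation into fine-tuning data."""
    record = {
        "ts": time.time(),
        "model_version": model_version,
        "prompt": prompt,
        "response": response,
        "user_rating": user_rating,   # e.g. thumbs up/down mapped to 1/0
        "flagged": flagged,           # triaged by product owners or evaluators
    }
    with FEEDBACK_LOG.open("a") as f:
        f.write(json.dumps(record) + "\n")
```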

We’ll discuss post-deployment monitoring more in our next post on the Deployment Layer.

Fine-tuning a model has two major components, and an important cohort of LLM infrastructure companies is serving each of them:

  • Data ops/data curation: Data quality is the most important factor in fine-tuning results. Creating a high-quality dataset is challenging. Product owners must curate examples from millions of data points using conceptual or semantic search (a minimal curation sketch follows this list). They might want to modify or augment examples with synthetic data. Infrastructure to programmatically search, label, and modify data is critical for effective fine-tuning. Fine-tuning data ops may converge with the production Data Layer described in our last post.
  • Fine-tuning ops: Companies will need a platform to orchestrate fine-tuning. It will take curated datasets and actually run the process of updating model weights. The fine-tuning process itself is fairly straightforward; it has less impact on final LLM quality than the fine-tuning data does. However, it’s important to have simple, frictionless tooling to spin up runs and track model versions and behavior over time. Some fine-tuning platforms will also provide tooling for performance optimization, which tries to shrink the model (e.g., by moving from 16-bit to 4-bit precision) while maintaining its capabilities (see the second sketch below).
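Here is a minimal sketch of the semantic-search curation step mentioned above, using `sentence-transformers` to embed logged examples and pull out the ones closest to a concept you want more of in the fine-tuning set. The model name and the example texts are illustrative.

```python
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # small, widely used embedding model

logged_examples = [                     # in practice: millions of rows from production logs
    "Refund request for a duplicate charge on my invoice",
    "How do I export my dashboard to CSV?",
    "The agent gave the wrong cancellation policy",
]
query = "billing and refund problems"   # the concept we want more training examples of

corpus_emb = embedder.encode(logged_examples, convert_to_tensor=True)
query_emb = embedder.encode(query, convert_to_tensor=True)

# Rank logged examples by cosine similarity to the concept and keep the top matches.
hits = util.semantic_search(query_emb, corpus_emb, top_k=2)[0]
for hit in hits:
    print(f"{hit['score']:.2f}  {logged_examples[hit['corpus_id']]}")
```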
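And a minimal sketch of the quantization step the second bullet describes, loading a model in 4-bit precision with `bitsandbytes` via `transformers`. The model name is an illustrative assumption, and production optimization pipelines would also benchmark quality before and after.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                       # store weights in 4-bit instead of 16-bit
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,   # compute still happens in 16-bit
)

# ~4x smaller memory footprint than bf16 weights, at some (task-dependent) quality cost.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",              # illustrative; assumes access to gated weights
    quantization_config=bnb_config,
    device_map="auto",
)
```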

Key open questions:

  • What trajectory will LLM development take? Many companies building in the Model Layer assume that there will be many self-hosted, fine-tuned models. If the vast majority of applications end up using managed foundation models, there could be less of a need for supporting infrastructure in the Model Layer.
  • Which models and hosting mechanisms provide the best safety and security? Will security be managed at the model layer or elsewhere in the stack? Today, these issues drive demand for model ownership and self-hosting. However, it may prove difficult to meet safety/security requirements with smaller, fine-tuned models.
  • How will LLM cost and performance evolve over time? If costs drop by orders of magnitude, optimization at the model and compute level could become less important.
  • What workflows will emerge for updating models with usage data? Will companies be continually fine-tuning them, or will a single pass be good enough?

Companies working on the Model Layer

  • Core model: Google, OpenAI, Anthropic, Adept, Imbue, Cohere, Hugging Face, Stability AI, Mistral, Contextual, DeepInfra
  • Serving/compute: Modular, Lambda Labs, Union, Exafunction, Modal, HippoML, Banana, SkyPilot, Texel, Paperspace, Foundry, Goose
  • Model routing/abstraction: Martian, LiteLLM, NotDiamond
  • Fine-tuning & optimization:
    • Data ops – manual labeling: Scale, Appen, Hive, Labelbox, Surge; programmatic labeling: Snorkel, Watchful, Lilac
    • Fine-tuning ops/optimization: Arcee.ai, Lamini, Predibase, Together, Watchful, Superintel, ThirdAI, GenJet, Glaive, LMFlow, Nolano

If you are building in this space, we’d love to hear from you at [email protected]! In our next post, we’ll explore the Deployment Layer. Subscribe here to follow along!

Azhar H., Partner at Reflexive Capital (comment):

Latency and reliability/stability are the biggest issues I run into working with this generation of models. I think there is definitely a role for software optimization such as batching, caching, and RAG/fine-tuning. I wonder if we also need a step-function improvement in hardware as software optimization reaches its limits.
