Exploring the LLM Infra Stack, Part 2: The Model Layer
By Andy Triedman


This is the second post in our 4-part series on the LLM infra stack. You can catch up on our last post on the Data Layer here. If you are building in this space, we’d love to hear from you at [email protected].

The Model Layer

The core component of any LLM system is the model itself. It provides the fundamental capabilities that enable new products.

LLMs are foundation models: models trained on a broad set of data that can be adapted to a wide range of tasks. They tend to fall into one of two categories: large, hosted, closed-source models like GPT-4 and PaLM 2, or smaller, open-source models like Llama 2 and Falcon.

It’s unclear which models will dominate. The state of the art is changing rapidly – a new best-in-class model is announced monthly – and so few products have been built with LLMs that the trade-offs remain largely untested. There are two emerging paths:

1. Large foundation models dominate: Closed-source LLMs continue to provide capabilities other models can’t match. A managed service offers simplicity & rapid cost reduction. These vendors solve security concerns.

2. Fine-tuned models provide the best value: Smaller, less-expensive models prove just as good for most applications after fine-tuning. Businesses prefer them because they control the model and can create intellectual property.

  • In this world, many companies will still start building with managed foundation models and move to self-hosted fine-tuned models over time.

Large foundation models will work best for use cases that require broad knowledge and reasoning. Asking a model to plan a month-long vacation itinerary in Southeast Asia for a vegetarian couple interested in Buddhism requires planning and lots of working memory. These models are also best for new tasks where you don’t have data.

Smaller, fine-tuned models will excel in products composed of repeatable, well-defined tasks. They’ll perform just as well as larger models for a fraction of the cost. Need to categorize customer feedback as positive or negative? Need to extract details from an email into a spreadsheet? A smaller model will work great.
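As a concrete illustration, a compact off-the-shelf classifier can handle this kind of well-defined task without a frontier model. Here is a minimal sketch using Hugging Face’s `transformers` pipeline with a small sentiment model; the feedback strings are made up for illustration.

```python
from transformers import pipeline

# Load a small (~66M parameter) sentiment classifier -- no frontier-scale LLM needed.
# The default checkpoint is a DistilBERT model fine-tuned on SST-2.
classifier = pipeline("sentiment-analysis")

feedback = [  # illustrative examples
    "The new dashboard is fantastic, saves me an hour a day.",
    "Checkout keeps failing on mobile and support never answered.",
]

for item, result in zip(feedback, classifier(feedback)):
    print(f"{result['label']:>8}  ({result['score']:.2f})  {item}")
```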

Which models are safer and more secure? Which are best for compliance and governance? It’s still unclear.

Many companies would prefer to self-host their models to avoid sending their data to a third party. However, emerging research suggests fine-tuning may compromise safety. Foundation models are tuned to avoid inappropriate language and content. LLMs, especially smaller ones, can forget these rules when fine-tuned. Large model providers may also find it easier to demonstrate that their models and data are compliant with nascent regulations. More work needs to be done to establish these conclusions.

In the long term, we can’t predict how LLMs will develop. A large research org could discover a new type of model that is an order of magnitude better than what we have today.

Progress in model development will drastically change the structure of the Model Layer. Regardless of the direction, it will have the following key components:

Core model:

Training LLMs from scratch (also known as pre-training) costs hundreds of thousands to tens of millions of dollars for each model. OpenAI is said to have spent over $100 million on GPT-4. Only companies with huge balance sheets can afford to develop them.

Conveniently, fine-tuning a pre-trained model with your own data can provide similar results at a fraction of the cost. This can cost well under $100 with data you already have on hand. During inference, fine-tuned models can be faster and an order of magnitude cheaper to boot.

Fine-tuning is supported by a rapidly improving set of open-source models. The most capable ones, like Meta’s Llama 2, perform similarly to the best LLMs from 6-12 months ago (e.g., GPT-3.5). When fine-tuned for a concrete task, open-source models often reach near-parity with today’s best closed-source LLMs (e.g., GPT-4).
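For a sense of what fine-tuning an open-source model looks like in practice, here is a minimal sketch using Hugging Face’s `peft` library to attach LoRA adapters to Llama 2 so that only a small fraction of weights are trained. The model choice and hyperparameters are illustrative assumptions (access to the gated Llama 2 weights is assumed), and the actual training loop is omitted.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model, TaskType

base_model = "meta-llama/Llama-2-7b-hf"  # assumption: you have access to the gated weights

tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForCausalLM.from_pretrained(base_model, torch_dtype=torch.bfloat16)

# LoRA freezes the base weights and trains small low-rank adapter matrices instead,
# which is what keeps fine-tuning runs cheap (often a single GPU for a few hours).
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,              # adapter rank (illustrative)
    lora_alpha=16,
    lora_dropout=0.05,
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of total parameters
```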

Serving/compute:

Training LLMs is expensive, but inference isn’t cheap, either. LLMs require an extreme amount of memory and substantial compute. For a product owner experimenting with OpenAI, each GPT-3.5 query will cost $0.01-0.03. GPT-4 is an order of magnitude more expensive per token, with long queries costing up to $3.00.
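To see how per-query costs add up, here is a back-of-the-envelope calculation. The prices and token counts below are illustrative assumptions, not current quotes – API pricing changes frequently.

```python
# Rough per-query cost: tokens used x price per 1K tokens.
PRICING = {  # (input $/1K tokens, output $/1K tokens) -- illustrative assumptions
    "gpt-3.5-turbo": (0.0015, 0.002),
    "gpt-4":         (0.03,   0.06),
}

def query_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    in_price, out_price = PRICING[model]
    return input_tokens / 1000 * in_price + output_tokens / 1000 * out_price

# A chat-style query: ~2,000 tokens of context, ~500 tokens of answer.
for model in PRICING:
    print(f"{model}: ${query_cost(model, 2_000, 500):.4f} per query")
# At ~1M such queries per month, the gap between models is the difference
# between a rounding error and a line item on the P&L.
```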

If you try to self-host for privacy/security or cost reasons, you won’t find it easy to match OpenAI. Virtual machines with NVIDIA A100 GPUs (the best fit for most large foundation models) run $3.67 to $40.55 per hour on Google Cloud. Availability of these machines is very limited and sporadic, even through major cloud providers.

For LLM applications with thousands or millions of users, product owners will face new challenges. Unlike most classic SaaS applications, they’ll need to think about costs. At $3 per query, it’s hard to build a viable product. Similarly, users won’t hang around long if it takes 30 seconds to return an answer.

It will be critical to serve LLMs in a performant and cost-effective way. Even “smaller” models described above still have billions of parameters.

How can you reduce cost and latency in production? There are lots of approaches. On the inference side, batch queries or cache their results. Optimize memory allocation to reduce fragmentation. Use speculative decoding, where a smaller model drafts tokens and a larger model “accepts” or rejects them. On the infrastructure side, compare GPU prices in real time and send workloads to the cheapest option, or use spot pricing if you can manage spotty availability and failovers.
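As a concrete example of one of these levers, here is a minimal sketch of response caching: identical prompts skip the model entirely. The `call_model` function is a hypothetical stand-in for whatever inference client you use, and real systems typically use semantic (embedding-based) caches rather than exact-match ones.

```python
import hashlib
from typing import Callable

def make_cached(call_model: Callable[[str], str]) -> Callable[[str], str]:
    """Wrap an LLM call with an exact-match, in-memory cache."""
    cache: dict[str, str] = {}

    def cached_call(prompt: str) -> str:
        key = hashlib.sha256(prompt.encode()).hexdigest()
        if key not in cache:                 # cache miss: pay for one model call
            cache[key] = call_model(prompt)
        return cache[key]                    # cache hit: zero marginal cost, ~zero latency

    return cached_call

# Usage with a hypothetical inference client:
# cached_llm = make_cached(my_inference_client.complete)
# answer = cached_llm("Summarize this support ticket: ...")
```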

Most companies won’t have the in-house expertise to do all of this themselves. Startups will rise to fill this need.

Model routing/abstraction:

For some LLM applications, it will be obvious which model to use where. Summarize emails with one LLM. Transform the data into a spreadsheet with another.

But for others, it can be unclear. You might have a chat interface where some customer messages are simple, and others are complex. How will you know which should go to a large model vs. a small one?

For these types of applications, there may be an abstraction layer to analyze each request and route it to the best model.
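A minimal sketch of what such a routing layer might look like: a cheap heuristic (or, in practice, a small classifier) estimates request complexity and picks a model. The model clients and the heuristic are hypothetical placeholders, deliberately crude.

```python
from typing import Callable

def make_router(
    small_model: Callable[[str], str],   # hypothetical client for a cheap fine-tuned model
    large_model: Callable[[str], str],   # hypothetical client for a frontier model
) -> Callable[[str], str]:
    REASONING_HINTS = ("why", "plan", "compare", "step by step", "explain")

    def route(prompt: str) -> str:
        # Crude complexity heuristic: long prompts or reasoning-flavored requests
        # go to the large model; everything else goes to the cheap one.
        looks_complex = len(prompt.split()) > 200 or any(
            hint in prompt.lower() for hint in REASONING_HINTS
        )
        return (large_model if looks_complex else small_model)(prompt)

    return route

# Usage (with hypothetical clients):
# route = make_router(small_model=finetuned_llama.complete, large_model=gpt4.complete)
# reply = route("What's my order status?")   # -> small model
```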

For a product owner, it is already difficult to evaluate the behavior of an LLM in production. If your system might route to many different LLMs, that problem could get even more challenging.

However, if dynamic routing improves capabilities and performance, the trade-off might be worth it. This layer could almost be thought of as a separate model itself and evaluated on its own merit.

Fine-tuning & optimization:

As we noted above, many product owners will fine-tune a model for their use case.

For most use cases, fine-tuning will not be a one-off endeavor. It will start in product development and continue indefinitely as the product is used.

The key to great fine-tuning will be fast feedback loops. Today, product owners triage user feedback/bugs in code and ask engineers to fix them. For LLM applications, product owners will monitor qualitative and quantitative data on LLM behavior. They will work with engineers, data ops teams, product/marketing, and evaluators to curate new data and re-fine-tune models.
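A minimal sketch of the logging half of that loop: capture every production interaction along with whatever signal the user provides, so the data ops team can later search, label, and promote the best examples into the next fine-tuning set. The schema and file path are illustrative assumptions.

```python
import json
import time
from pathlib import Path

FEEDBACK_LOG = Path("feedback.jsonl")  # illustrative; production systems use a warehouse/DB

def log_interaction(prompt: str, response: str, model_version: str,
                    user_rating: int | None = None, flagged: bool = False) -> None:
    """Append one production interaction for later curation into fine-tuning data."""
    record = {
        "ts": time.time(),
        "model_version": model_version,
        "prompt": prompt,
        "response": response,
        "user_rating": user_rating,   # e.g. thumbs up/down mapped to 1/0
        "flagged": flagged,           # triaged by product owners or evaluators
    }
    with FEEDBACK_LOG.open("a") as f:
        f.write(json.dumps(record) + "\n")
```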

We’ll discuss post-deployment monitoring more in our next post on the Deployment Layer.

Fine-tuning a model has two major components, and an important cohort of LLM infrastructure companies is serving each of them:

  • Data ops/data curation: Data quality is the most important factor in fine-tuning results. Creating a high-quality dataset is challenging. Product owners must curate examples from millions of data points using conceptual or semantic search (a minimal curation sketch follows this list). They might want to modify or augment examples with synthetic data. Infrastructure to programmatically search, label, and modify data is critical for effective fine-tuning. Fine-tuning data ops may converge with the production Data Layer described in our last post.
  • Fine-tuning ops: Companies will need a platform to orchestrate fine-tuning. It will take curated datasets and actually run the process of updating model weights. The fine-tuning process itself is fairly straightforward; it has less impact on final LLM quality than the fine-tuning data does. However, it’s important to have simple, frictionless tooling to spin up runs and track model versions and behavior over time. Some fine-tuning platforms will also provide tooling for performance optimization, which tries to shrink the model (e.g., by moving from 16-bit to 4-bit precision) while maintaining its capabilities (see the second sketch below).
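Here is a minimal sketch of the semantic-search curation step mentioned above, using `sentence-transformers` to embed logged examples and pull out the ones closest to a concept you want more of in the fine-tuning set. The model name and the example texts are illustrative.

```python
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # small, widely used embedding model

logged_examples = [                     # in practice: millions of rows from production logs
    "Refund request for a duplicate charge on my invoice",
    "How do I export my dashboard to CSV?",
    "The agent gave the wrong cancellation policy",
]
query = "billing and refund problems"   # the concept we want more training examples of

corpus_emb = embedder.encode(logged_examples, convert_to_tensor=True)
query_emb = embedder.encode(query, convert_to_tensor=True)

# Rank logged examples by cosine similarity to the concept and keep the top matches.
hits = util.semantic_search(query_emb, corpus_emb, top_k=2)[0]
for hit in hits:
    print(f"{hit['score']:.2f}  {logged_examples[hit['corpus_id']]}")
```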
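And a minimal sketch of the quantization step the second bullet describes, loading a model in 4-bit precision with `bitsandbytes` via `transformers`. The model name is an illustrative assumption, and production optimization pipelines would also benchmark quality before and after.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                       # store weights in 4-bit instead of 16-bit
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,   # compute still happens in 16-bit
)

# ~4x smaller memory footprint than bf16 weights, at some (task-dependent) quality cost.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",              # illustrative; assumes access to gated weights
    quantization_config=bnb_config,
    device_map="auto",
)
```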

Key open questions:

  • What trajectory will LLM development take? Many companies building in the Model Layer assume that there will be many self-hosted, fine-tuned models. If the vast majority of applications end up using managed foundation models, there could be less of a need for supporting infrastructure in the Model Layer.
  • Which models and hosting mechanisms provide the best safety and security? Will security be managed at the model layer or elsewhere in the stack? Today, these issues drive demand for model ownership and self-hosting. However, it may prove difficult to meet safety/security requirements with smaller, fine-tuned models.
  • How will LLM cost and performance evolve over time? If costs drop by orders of magnitude, optimization at the model and compute level could become less important.
  • What workflows will emerge for updating models with usage data? Will companies be continually fine-tuning them, or will a single pass be good enough?

Companies working on the Model Layer

  • Core model: Google, OpenAI, Anthropic, Adept, Imbue, Cohere, Hugging Face, Stability AI, Mistral, Contextual, DeepInfra
  • Serving/compute: Modular, Lambda Labs, Union, Exafunction, Modal, HippoML, Banana, SkyPilot, Texel, Paperspace, Foundry, Goose
  • Model routing/abstraction: Martian, LiteLLM, NotDiamond
  • Fine-tuning & optimization:
    • Data ops – manual labeling: Scale, Appen, Hive, Labelbox, Surge; programmatic labeling: Snorkel, Watchful, Lilac
    • Fine-tuning ops/optimization: Arcee.ai, Lamini, Predibase, Together, Watchful, Superintel, ThirdAI, GenJet, Glaive, LMFlow, Nolano

If you are building in this space, we’d love to hear from you at [email protected]! In our next post, we’ll explore the Deployment Layer. Subscribe here to follow along!

Azhar H., Partner at Reflexive Capital (comment):

Latency and reliability/stability are the biggest issues I run into working with this generation of models. I think there is definitely a role for software optimization such as batching, caching, and RAG/fine-tuning. I wonder if we also need a step-function improvement in hardware as software optimization reaches its limits.
