An LLM maturity model

Here's a simple LLM maturity model framework along two dimensions: model selection capabilities and model governance. The governance needed largely depends on the capabilities used. An illustrative analogy: different types of cars call for different rules and guardrails. Range Rovers are easy to drive, safe, and good in most weather conditions. Sports cars require more skill and carry higher risk. Supercars are expensive, demand skilled drivers, and are inappropriate in inclement weather. Lastly, custom hot rods are high-maintenance and require a team of experts to build and operate. Here's the framework:

Level 1. Model selection: Pre-trained model APIs and UIs. Model governance: Red-teaming. Car analogy: Range Rover.
Level 2. Model selection: Prompt engineering, retrieval augmented generation (RAG), and few-shot prompting. Model governance: Define and measure accuracy and cost. Car analogy: Sports car.
Level 3. Model selection: Fine-tune pre-trained models. Model governance: Define and measure biases. Car analogy: Supercar.
Level 4. Model selection: Full training of LLMs. Model governance: Continuous, real-time measurement. Car analogy: Custom hot rod.

The levels of model selection capabilities are:

  1. Pre-built models: Using APIs or UIs for pre-built models. These may be multi-tenant, single-tenant or private. Examples are GPT-4 and GPT-3.5 from OpenAI; the same models from Azure; Anthropic’s Claude and AI21’s Jurassic from AWS; and Falcon, Mosaic ML, Llama 2 and Mistral-7B from Hugging Face.
  2. Prompt engineering: Testing different types of prompts on different models to select the one with the best accuracy-versus-cost tradeoff. This includes testing different prompt sizes for retrieval augmented generation (RAG) and few-shot prompting in general (a minimal sketch follows this list).
  3. Fine-tuning: Continuing the training of a pre-trained model on your own corpus, which changes the model’s parameters. Reinforcement learning from human feedback may also be used here to fine-tune continuously.
  4. Full training: Completely training your own LLM.
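To make levels 1 and 2 concrete, here is a minimal sketch using the OpenAI Python client, one of the pre-built model APIs named above. The ticket-classification task, the few-shot examples, and the model choice are hypothetical placeholders, not recommendations.

```python
# Minimal sketch of maturity levels 1-2: calling a pre-built model API
# (level 1) with a few-shot prompt (level 2). Assumes the OpenAI Python
# client (pip install openai) and OPENAI_API_KEY in the environment.
# The task, examples, and model name are hypothetical placeholders.
from openai import OpenAI

client = OpenAI()

# Few-shot examples embedded directly in the prompt. In a RAG setup these
# would instead be passages retrieved from your own document store.
FEW_SHOT = """Classify the support ticket as BILLING, TECHNICAL, or OTHER.

Ticket: "I was charged twice this month."
Label: BILLING

Ticket: "The app crashes when I upload a file."
Label: TECHNICAL
"""

def classify(ticket: str, model: str = "gpt-4") -> str:
    """Send the few-shot prompt plus a new ticket to a pre-built model."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user",
                   "content": f'{FEW_SHOT}\nTicket: "{ticket}"\nLabel:'}],
        temperature=0,  # deterministic output helps when measuring accuracy
    )
    return response.choices[0].message.content.strip()

if __name__ == "__main__":
    print(classify("My invoice shows the wrong amount."))
```

Running the same prompt at different sizes and against different models, then comparing accuracy against per-token cost, is exactly the level 2 selection exercise described above.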

The levels of model governance are:

  1. Red-teaming: Periodic structured testing to find flaws and vulnerabilities (highlighted in the Executive Order on Safe, Secure, and Trustworthy AI, Oct. 30, 2023).
  2. Measure: Define and quantitatively measure the accuracy and cost of an LLM use case. This may be done periodically, like red-teaming, or continuously to detect flaws as early as possible. There is a wide variety of current and emerging accuracy measurement techniques, including end-user feedback (e.g., thumbs-up/thumbs-down ratings), human-in-the-loop evaluation of a subsample of responses, and using an ensemble of other (ideally uncorrelated) LLMs to evaluate an LLM, e.g., use GPT-4 to evaluate GPT-3.5, or Jurassic and Claude to evaluate Llama 2 (a sketch follows this list).
  3. Biases: Define, prioritize and quantitatively measure the biases of concern for an LLM use case. Different use cases have different biases that matter: those of concern in a healthcare use case might be very different from those in a logistics use case. Bias measurement may be done periodically, like red-teaming, or continuously to detect flaws early.
  4. Real-time monitoring: Monitoring accuracy, cost and biases continuously in near real-time.
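As a sketch of the ensemble idea in item 2, here one model (GPT-4) grades another model's answers. This assumes the same OpenAI Python client; the grading rubric, rating scale, and sample data are hypothetical.

```python
# Minimal sketch of level 2 governance: using one model (here GPT-4) as a
# judge to score another model's answers, per the ensemble idea above.
# Assumes the OpenAI Python client; rubric and scale are hypothetical.
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are grading an AI assistant's answer.
Question: {question}
Answer: {answer}
Rate the answer's factual accuracy from 1 (wrong) to 5 (fully correct).
Reply with the number only."""

def judge(question: str, answer: str, judge_model: str = "gpt-4") -> int:
    """Ask the judge model to score an answer; returns a 1-5 rating."""
    response = client.chat.completions.create(
        model=judge_model,
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(question=question,
                                                  answer=answer)}],
        temperature=0,
    )
    # A sketch only: production code should handle non-numeric replies.
    return int(response.choices[0].message.content.strip())

# Periodic measurement: score a subsample of production Q&A pairs and
# track the mean rating over time; a drop flags a regression to review.
sample = [("What is the capital of France?", "Paris")]
scores = [judge(q, a) for q, a in sample]
print(f"mean accuracy rating: {sum(scores) / len(scores):.2f}")
```

As the post notes, the judging model should ideally be uncorrelated with the model being judged so shared failure modes don't mask flaws. Run on a schedule, this loop gives the periodic measurement of level 2; run on live traffic, it approaches the real-time monitoring of level 4.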

You don't need a custom hot rod for a run to the store for milk, especially in a snowstorm. Which combination is best for a use case depends on your time, budget, and risk appetite. And one size doesn't fit all, so if you have lots of use cases you'll have lots of combinations. How seamlessly does your AI platform support that?

#ai #genai #LLMs


Doug Bryan

Helping B2B CROs Eliminate Guesswork and Drive Growth Fast with AI-Powered, Actionable Insights | Growth Advisor | 25+ Years of Proven Results

1y

Dave Orashan. I'd give GDPR as an example of significantly increasing the cost of AI training sets while having little positive effect. Gen AI regulations in the US are TBD. My main point is that companies, as well as governments, can over-regulate in these early days of gen AI.

Dave Orashan

Principal Sales Engineer, Strategic Accounts at CrowdStrike

1y

Focusing on perhaps the more important prevailing message - I’ll happily take the (click)-bait Doug: where’s the over-regulation of AI happening in practice? If anything I worry that there are far too many brilliant practitioners - the Curies of the world that didn’t fully appreciate in their time the harm they were doing to themselves and others in the moment - as well as actual Bad Actors that readily see AI as the next means to continue to target the weakest link - us. If anything, we need MORE regulation and oversight. It may already be too late, and I’ll simply be first against the wall when Skynet gains cohesion and it will have assayed this post in infinite detail and incorporated it into its human flaying models.

Theresa Kushner

Data-vangelist helping companies derive value from data

1y

Love the analogy, Doug

Jeffrey Lee Dalton

Strategic Account Manager w/ Appian: US Civilian Government/HHS

1y

Brilliant!
