A Guide: Choosing The Perfect Language Model For Your Use Case



The landscape of Large Language Models (LLMs) is changing rapidly, with new models emerging all the time. Which model is the right fit for your specific task?

This article will guide you in choosing an LLM for your specific needs using key criteria and public resources.



TL;DR

  • Consider 3 key factors when choosing a model: 1) task performance, 2) computational efficiency, and 3) commercial terms
  • Use publicly available comparisons such as leaderboards and metrics for actionable insights
  • Prototype an auto-coding app on a suitable LLM as a worked example



LLM Usage Types

Among the 5 deployment types defined in the AWS Generative AI Security Scoping Matrix, we will focus on the two most relevant to app development: Scope 1 and Scope 3:

5 Deployment Types of LLM - Ref. AWS Generative AI Security Scoping Matrix



Key Factors to Consider

Once we have defined the task scope, here are 3 key factors to consider when choosing a model.

1. Task Performance

The key areas to consider for successful task completion are:

  • Accuracy: Percentage of correct answers/completions
  • Domain Closeness: Relevance of the training data to our task domain
  • Fluency: Readability and naturalness of generated text
  • Informativeness: How well the response addresses the prompt or question
  • Engagement: How well the chatbot keeps the user interested
  • Robustness: How well the model can handle unexpected inputs
  • Safety & Fairness: Avoiding harmful or offensive outputs


Technical benchmarks allow us to quantify task performance.

Major benchmarks by task:

< Coding Tasks >

  • HumanEval: Measures functional correctness on 164 hand-written coding challenges, reported with the pass@k metric (see the sketch after this list)
  • MBPP (Mostly Basic Python Problems): Measures functional correctness on roughly 1,000 entry-level Python programming problems
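
Both benchmarks report pass@k, the probability that at least one of k sampled completions passes the unit tests. Below is a minimal sketch of the unbiased pass@k estimator described in the HumanEval paper, assuming n completions are sampled per problem and c of them pass:

    import numpy as np

    def pass_at_k(n: int, c: int, k: int) -> float:
        # Unbiased estimator: 1 - C(n-c, k) / C(n, k)
        # n: completions sampled per problem; c: completions passing the tests
        if n - c < k:
            return 1.0  # every size-k sample must contain a passing completion
        return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

    # Example: 200 samples per problem, 32 of which pass
    print(pass_at_k(200, 32, 1))   # 0.16 (equals c/n when k = 1)
    print(pass_at_k(200, 32, 10))  # ~0.83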

< Chatbot Assistance >

  • MT-Bench (Multi-turn Benchmark): Evaluates conversation flow and instruction-following ability over multi-turn questions
  • TruthfulQA: Measures how well the model avoids reproducing common falsehoods, using 817 questions across 38 categories
  • Context Window: The maximum number of tokens the model can take as input when generating responses (a model specification rather than a benchmark)

< Reasoning Tasks >

  • ARC (AI2 Reasoning Challenge): Evaluates deeper knowledge and reasoning using roughly 7.5K grade-school science questions
  • HellaSwag: Tests commonsense inference via multiple-choice sentence completion, where the model must pick the most plausible ending

< Question Answering and Language Understanding >

  • MMLU (Massive Multitask Language Understanding): Broad knowledge evaluation across 57 diverse subjects
  • TriviaQA: Evaluates reading comprehension using over 650K question-answer-evidence triples


2. Computational Efficiency

Assess whether the model can run efficiently within your environment's capabilities to avoid performance bottlenecks. Key considerations are below, followed by a rough memory-sizing sketch.

  • Parameter Size: Larger models with more parameters generally need more resources. Smaller models offer faster processing but may compromise accuracy.
  • Inference Speed: How quickly the model generates output directly affects throughput and user experience.
  • Hardware: The type of hardware (CPU vs. GPU) determines how efficiently the model runs.
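
As a rough illustration of the parameter-size trade-off above, the memory needed just to hold a model's weights can be approximated from the parameter count and numeric precision. This sketch deliberately ignores activation and KV-cache overhead:

    def approx_weight_memory_gb(params_billions: float, bytes_per_param: float = 2.0) -> float:
        # Weights only; real-world usage adds activations and KV cache on top.
        # bytes_per_param: 2 for fp16/bf16, 1 for int8, 0.5 for 4-bit quantization
        return params_billions * bytes_per_param

    # Example: a 13B-parameter model needs ~26 GB in fp16 but only ~6.5 GB at 4-bit
    print(approx_weight_memory_gb(13))       # 26.0
    print(approx_weight_memory_gb(13, 0.5))  # 6.5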


3. Commercial Terms

Choose a model that fits our long-term business goals and technical evolution.

  • Usage Price & Limits: Price per million tokens, usage limits set by the model provider, and the server cost to run the model (a cost sketch follows this list)
  • Ecosystem: Consider ecosystem support and deployment flexibility on platforms such as Hugging Face, Azure ML, or Amazon SageMaker.
  • Commercial Availability: If we plan to generate revenue from the project, ensure the chosen LLM has commercial licensing options.
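
As a back-of-the-envelope sketch for the usage-price point above, monthly spend can be estimated from request volume and the per-token price. The figures here are hypothetical, purely for illustration:

    def monthly_token_cost_usd(requests_per_day: int, tokens_per_request: int,
                               price_per_million_tokens: float) -> float:
        # Simple volume x unit-price estimate; ignores separate input/output pricing
        tokens_per_month = requests_per_day * tokens_per_request * 30
        return tokens_per_month / 1_000_000 * price_per_million_tokens

    # Hypothetical: 10,000 requests/day, 1,500 tokens each, at $0.50 per 1M tokens
    print(monthly_token_cost_usd(10_000, 1_500, 0.50))  # 225.0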



Leveraging Public Resources

Evaluating LLMs from scratch can be time-intensive. Fortunately, valuable public resources are available to streamline your selection:

Leaderboard & Metrics Comparison


Commercial Terms

Credit: Philipp Schmid / Providers covered: (Azure) OpenAI, Anthropic, Google Vertex AI, Amazon Bedrock, Mistral, Anyscale, MosaicML, Together.AI


Broader Search Based on Technical Specs and Task Objectives



Example: Build an Auto-coding App

Let's explore how to choose the right LLM for building apps, using auto-coding applications as a practical example.

Step 1. Define task and performance metrics

  • Task: Text-to-text generation (understand text inputs and return Python source code in text format)
  • Metric: Use HumanEval as the technical benchmark (an illustrative problem format follows)
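
For reference, each HumanEval problem presents a function signature plus docstring, and the score reflects whether generated completions pass hidden unit tests. Here is an illustrative problem in that style (not a verbatim benchmark item):

    # Prompt given to the model: the signature and docstring only.
    def is_palindrome(text: str) -> bool:
        """Return True if text reads the same forwards and backwards.
        >>> is_palindrome("level")
        True
        >>> is_palindrome("hello")
        False
        """
        return text == text[::-1]  # a model completion that would pass the tests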

Step 2. Choose a model using public resources

Refer to the Hugging Face leaderboard, focusing on Python coding tasks, and choose CodeLlama 13B because it:

  • Achieves a high score on the leaderboard (as of March 2024)
  • Promises an efficiency advantage, with the smallest model size among the top 3 contenders
  • Offers serverless deployment options for easy integration
  • Is available for commercial use

Compare LLMs by task completion score (HuggingFace / As of March 2024)


Step 3. Deployment & Result

Use an inference API to deploy an app:
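
Below is a minimal sketch using the Hugging Face serverless Inference API, assuming a valid API token is stored in the HF_TOKEN environment variable; the generation parameters and prompt are illustrative:

    import os
    import requests

    # Serverless Inference API endpoint for the chosen model
    API_URL = "https://api-inference.huggingface.co/models/codellama/CodeLlama-13b-Python-hf"
    HEADERS = {"Authorization": f"Bearer {os.environ['HF_TOKEN']}"}

    def generate_code(prompt: str) -> str:
        # Standard text-generation payload for the Inference API
        payload = {"inputs": prompt,
                   "parameters": {"max_new_tokens": 256, "temperature": 0.1}}
        response = requests.post(API_URL, headers=HEADERS, json=payload)
        response.raise_for_status()
        return response.json()[0]["generated_text"]

    print(generate_code("# Write a Python function that reverses a string\n"))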

Result:




Conclusion

Future of LLMs - Niche Domination or Universal Powerhouse?

This guideline enables us to choose an LLM that excels at our target tasks while aligning with our resource limitations and commercial requirements. By leveraging public resources, we can focus on our actual needs rather than a model's general popularity.

On the other hand, the rapid advancement of AI technology suggests a shift toward dominance by a smaller number of highly competitive models, whether universal or domain-specific. To navigate this dynamic LLM landscape, we need to:

  • Mind the Metrics: Evaluation results can vary depending on the provider's methodology and testing environment. Consider hands-on testing of a few potential LLMs for a more practical understanding.
  • Revisit Initial Choice: As new models and evaluation methods emerge regularly, be prepared to adapt our choice as technology advances.


In the next article, we will explore LLM architecture (encoders and decoders) and deploy a chatbot.



References:

  • Harnessing the Power of LLMs in Practice: A Survey on ChatGPT and Beyond
  • Magicoder: Source Code Is All You Need
  • Bias Testing and Mitigation in LLM-based Code Generation
  • Hugging Face CodeLlama 13b Python (Model Card)
  • Use Hugging Face with Amazon SageMaker
  • Exploring LLM Platforms and Models: Unpacking OpenAI, Azure, and Hugging Face
