LLMs in production: Lessons from the trenches

Dr. Uday Kamath, Chief Analytics Officer at Smarsh, presented a lecture at the QuantUniversity AI Fall School on November 11, 2024, titled "LLMs in Production." The presentation explored the practical aspects of deploying Large Language Models (LLMs) in real-world applications.

LLM Applications and Metrics

Dr. Kamath began by outlining various LLM applications, mapping them to relevant Natural Language Processing (NLP) tasks and corresponding evaluation metrics. He provided examples such as:

Conversational applications: Chatbots and AI assistants utilizing text generation, summarization, and dialogue management. Common metrics for these applications include BLEU, Perplexity, and human evaluation for naturalness and coherence.

Search and Information Retrieval: Search engines and knowledge base search utilizing information retrieval, semantic search, and summarization. Metrics like Precision, Recall, Mean Reciprocal Rank (MRR), and F1 Score are commonly used.

Content Creation: Social media content generation and marketing copywriting, employing text generation, paraphrasing, and summarization tasks. ROUGE, BLEU, BERTScore, and human evaluation for creativity and coherence are relevant metrics.

Coding Assistants: Tools like GitHub Copilot, using code generation, completion, error detection, and natural language understanding. BLEU and Code Execution Accuracy are used to evaluate these applications.

Translation and Multilingual Applications: Website translation and content localization, leveraging machine translation, language identification, and multilingual generation. BLEU, METEOR, and TER (Translation Edit Rate) are common metrics for this domain.

Document Analysis and Processing: Legal document review and financial report analysis utilizing summarization, document classification, and information extraction. ROUGE, F1 Score, Precision, and Recall are frequently used.

Sentiment and Intent Analysis: Social media sentiment tracking and customer feedback analysis employing sentiment analysis, intent detection, and text classification. Accuracy, F1 Score, Precision, and Recall are common evaluation metrics.

Question-Answering Systems: FAQ bots and educational tutoring systems relying on question answering, knowledge retrieval, and contextual reasoning. Exact Match (EM), F1 Score, and Mean Reciprocal Rank (MRR) are used to assess these systems; a sketch of EM and F1 follows this list.
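To make the question-answering metrics concrete, here is a minimal Python sketch of Exact Match and token-level F1 in the common SQuAD-style formulation. The normalization rules below are a widely used convention, not something specified in the presentation.

```python
import re
import string
from collections import Counter

def normalize(text: str) -> str:
    """Lowercase, strip punctuation and articles, collapse whitespace
    (the usual SQuAD-style normalization)."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))

def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1: harmonic mean of precision and recall over
    overlapping tokens."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

print(exact_match("The Eiffel Tower", "eiffel tower"))   # 1.0
print(round(token_f1("in Paris, France", "Paris"), 2))   # 0.5
```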

Categorization of Evaluation Metrics

Dr. Kamath discussed different ways to categorize LLM evaluation metrics:

With References vs. Without References: Metrics can compare model output to correct answers (e.g., BLEU, ROUGE) or assess fluency and coherence without a reference (e.g., Perplexity).

Character-based, Word-based, and Embedding-based: Metrics can focus on character-level correctness, n-gram overlap in words (e.g., BLEU), or semantic similarity using vector embeddings (e.g., BERTScore); a sketch of the embedding-based approach follows this list.

Human Evaluation vs. LLM Evaluation: Metrics can involve human judges assessing relevance, fluency, and coherence or use another model for automated evaluation.
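The embedding-based category can be illustrated with BERTScore-style greedy matching over token embeddings. The sketch below uses toy vectors purely for illustration; the real BERTScore uses contextual embeddings from a BERT-family model plus IDF weighting and baseline rescaling.

```python
import numpy as np

def greedy_match_f1(cand_emb: np.ndarray, ref_emb: np.ndarray) -> float:
    """BERTScore-style score: build a cosine-similarity matrix between
    candidate and reference token embeddings, greedily take the best
    match in each direction, then combine as an F1."""
    cand = cand_emb / np.linalg.norm(cand_emb, axis=1, keepdims=True)
    ref = ref_emb / np.linalg.norm(ref_emb, axis=1, keepdims=True)
    sim = cand @ ref.T                  # (n_cand, n_ref) cosine similarities
    precision = sim.max(axis=1).mean()  # best reference match per candidate token
    recall = sim.max(axis=0).mean()     # best candidate match per reference token
    return 2 * precision * recall / (precision + recall)

# Toy 4-dimensional "token embeddings" purely for illustration.
candidate = np.array([[0.9, 0.1, 0.0, 0.0],
                      [0.0, 1.0, 0.0, 0.1]])
reference = np.array([[1.0, 0.0, 0.0, 0.0],
                      [0.0, 0.9, 0.1, 0.0]])
print(round(greedy_match_f1(candidate, reference), 3))
```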

He then provided detailed explanations of commonly used metrics, including Perplexity, BLEU, ROUGE, BERTScore, and Pass@k, along with their formulas and pros and cons.
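Two of those formulas are compact enough to state in code. The sketch below gives the standard definitions from the literature: perplexity as the exponentiated average negative log-likelihood, and the unbiased pass@k estimator from Chen et al. (2021). It is a minimal illustration, not code from the presentation.

```python
import math

def perplexity(token_log_probs: list[float]) -> float:
    """Perplexity = exp(-(1/N) * sum_i log p(w_i | w_<i)).
    Lower is better; a perfectly confident model approaches 1."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021):
    pass@k = 1 - C(n - c, k) / C(n, k),
    with n generated samples of which c pass the unit tests."""
    if n - c < k:
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

# A 4-token sequence predicted confidently (log-probs near 0)
# yields perplexity close to 1.
print(round(perplexity([-0.1, -0.2, -0.05, -0.1]), 3))   # ~1.119
# 200 samples, 30 passing: chance that at least one of 10 passes.
print(round(pass_at_k(n=200, c=30, k=10), 3))
```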

LLM Selection Criteria

Dr. Kamath emphasized the importance of choosing the right LLM for production success. He highlighted key attributes to consider:

- Analytic Quality

- Inference Latency

- Total Cost of Ownership (TCO)

- Adaptability and Maintenance

- Data Security and Licensing

He compared open-source and closed-source models, discussing their advantages and disadvantages in terms of flexibility, cost, customization, ease of use, and adaptability.

Evaluation and Optimization

Dr. Kamath stressed the significance of evaluating LLMs to ensure they meet specific task requirements. He recommended using benchmarks like the HuggingFace Open LLM Leaderboard to compare model performance, complemented by domain-specific tests to confirm that performance aligns with the target application.

He also discussed inference latency and cost optimization, noting that model size, number of layers, and numeric precision all affect latency, and suggested reducing response length as one optimization. For TCO, he advised accounting for costs beyond tokens, such as setup, labor, and maintenance, and recommended tools like HuggingFace's TCO calculator.
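As a rough illustration of the "costs beyond tokens" point, the back-of-the-envelope sketch below folds per-token charges together with fixed monthly costs. All prices and volumes are made-up placeholders, not figures from the talk or from HuggingFace's calculator.

```python
def monthly_tco(requests_per_month: int,
                avg_input_tokens: int,
                avg_output_tokens: int,
                price_in_per_1k: float,
                price_out_per_1k: float,
                fixed_monthly: float) -> float:
    """Rough monthly total cost of ownership: per-token API charges
    plus fixed costs (setup amortization, labor, maintenance)."""
    token_cost = requests_per_month * (
        avg_input_tokens / 1000 * price_in_per_1k
        + avg_output_tokens / 1000 * price_out_per_1k)
    return token_cost + fixed_monthly

# Placeholder numbers purely for illustration.
cost = monthly_tco(requests_per_month=500_000,
                   avg_input_tokens=800,
                   avg_output_tokens=300,   # shorter responses shrink this term
                   price_in_per_1k=0.0005,
                   price_out_per_1k=0.0015,
                   fixed_monthly=12_000)    # setup, labor, maintenance
print(f"${cost:,.0f} per month")            # $12,425
```

At these placeholder volumes the fixed costs dominate the token charges, which is precisely why Dr. Kamath advised looking beyond per-token pricing.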

Adaptability, Maintenance, and Security

Dr. Kamath addressed the practical aspects of adaptability, maintenance, and security. He compared open-source and closed-source models in terms of adaptability and maintenance requirements. He cautioned about data privacy, especially with third-party models, and emphasized the importance of considering licensing aspects, including software licensing and data use restrictions. He specifically advised regulated industries to prioritize control over security and adaptability to ensure compliance and safeguard data integrity.

LLMOps

Dr. Kamath introduced LLMOps as an extension of MLOps, focusing on deploying and managing LLMs in production. He outlined key areas of LLMOps:

- Experiment Tracking

- Version Control

- Deployment and CI/CD

- Monitoring and Observability

He provided an example architecture for LLMOps, highlighting the use of prompt templates, adapter models, CI/CD pipelines, and production metrics for feedback and improvement.
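One building block of such an architecture, versioned prompt templates, can be sketched as follows. The class and registry below are hypothetical names illustrating the idea, not the API of any specific LLMOps tool.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass(frozen=True)
class PromptTemplate:
    """A versioned prompt template; treating prompts like code under
    version control is a core LLMOps practice."""
    name: str
    version: str
    template: str
    created_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat())

    def render(self, **kwargs: str) -> str:
        return self.template.format(**kwargs)

# A registry keyed by (name, version) lets a CI/CD pipeline pin the exact
# prompt a release was tested with, and production metrics can be logged
# against the same key to close the feedback loop.
registry: dict[tuple[str, str], PromptTemplate] = {}

summarize_v2 = PromptTemplate(
    name="summarize",
    version="2.0.1",
    template="Summarize the following document in {max_sentences} sentences:\n{document}")
registry[(summarize_v2.name, summarize_v2.version)] = summarize_v2

prompt = registry[("summarize", "2.0.1")].render(
    max_sentences="3", document="<document text here>")
print(prompt)
```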

Dr. Kamath's presentation provided valuable insights into the practical considerations and challenges of deploying LLMs in production environments. He covered a wide range of topics, from application-specific metrics to LLMOps best practices, equipping attendees with a comprehensive understanding of the LLM production landscape.


Slides and Video of the workshop

The slides and video from yesterday's workshop are available at www.qu.academy.

If you don't have a www.qu.academy account, register using the code "QUFallSchool24" to get access to the video and slides. If you already have an account, just log in and you will see this and all the other lectures from the QuantUniversity AI Fall School!


Join 5,000+ subscribers to QuantUniversity's weekly AI & Risk Management Newsletter to get valuable insights from academics, industry professionals, and thought leaders. You will also be alerted about the guest lecture series I host every week!

Yours truly

Sri Krishnamurthy, CFA, CAP

QuantUniversity
