Tracking LLMs with Comet

When building with LLMs, you will spend a lot of time optimizing prompts and diagnosing LLM behavior.

As you put your solutions into production, you need LLMOps tools to track LLMs and analyze prompts.

Here is a demo of how this process might look (use case included):

Step 1 - The Tool

I’ll use Comet’s new LLMOps functionalities to support our solution.

Comet ML provides prompt engineering tools to analyze prompts at scale.

Step 2 - Our Use Case

For the use case, I’ve built an ML paper tagger that uses GPT-3.5 to extract the model names mentioned in paper abstracts.

After the LLM makes the tag predictions, we evaluate the accuracy of the results using another LLM (referred to as the LLM evaluator).
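
To make this concrete, here is a minimal sketch of what such an evaluator call could look like using the openai Python client. The prompt wording and the evaluate_tags helper are my own illustrative choices, not the exact chain from the demo:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def evaluate_tags(expected_tags: str, predicted_tags: str) -> str:
    """Ask an LLM to judge the tagger's output; returns "CORRECT" or "INCORRECT"."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        temperature=0,
        messages=[
            {
                "role": "user",
                "content": (
                    "You are evaluating a model-name tagger for ML paper abstracts.\n"
                    f"Expected tags: {expected_tags}\n"
                    f"Predicted tags: {predicted_tags}\n"
                    "Reply with exactly one word: CORRECT or INCORRECT."
                ),
            }
        ],
    )
    return response.choices[0].message.content.strip()
```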

I’ve built a small validation dataset to evaluate the solution.

I'm experimenting with both few-shot and zero-shot prompting for the tagging task.

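The exact zero-shot prompt from the demo is in the linked notebook; a minimal version could look like the sketch below (the wording is illustrative):

```python
# Illustrative zero-shot template: instruction only, no in-context examples.
ZERO_SHOT_TEMPLATE = """Your task is to extract the names of ML models mentioned in
the paper abstract below. Return the model names as a comma-separated list.
If no model names are mentioned, return "NA".

Abstract: {abstract}

Tags:"""

prompt = ZERO_SHOT_TEMPLATE.format(abstract="<paper abstract goes here>")
```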

Step 3 - Creating an LLM Project

I use Comet’s LLM tools to track how both the paper tagger and the LLM evaluator perform.

The first step is to create an LLM Project in Comet.

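The project can be created from the Comet UI, and the comet-llm SDK can also be pointed at a project name when it is initialized. A minimal sketch, where "ml-paper-tagger" is a placeholder project name of my choosing:

```python
import comet_llm

# Configure credentials and the target LLM project once per session.
comet_llm.init(
    api_key="YOUR_COMET_API_KEY",  # or set the COMET_API_KEY environment variable
    project="ml-paper-tagger",     # placeholder name for this demo's project
)
```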

Step 4 - Logging Prompts in Comet

The next step is to use Comet’s LLM SDK to log all the prompts and related information to Comet. The functionalities are available via the open-source comet-llm Python library.

The comet-llm library lets you log prompts, responses, prompt templates, variables, and any other related metadata you want to track.

For this use case, I am logging the prompts, the expected LLM response, the predicted LLM response, and the final verdict of the LLM-powered evaluator. Remember, we are interested in assessing and tracking the performance and quality of the LLM evaluator, so we also create tags that make this easy to filter and track.

Here is the kind of prompt information and metadata I am logging.

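As a sketch, a single comet-llm log_prompt call can carry all of it. The metadata keys (expected_response, predicted_response, evaluator_verdict) are my own illustrative names, and the placeholder values reuse the sketches above:

```python
import comet_llm

# Placeholder values wired together from the earlier sketches.
abstract = "<paper abstract from the validation set>"
prompt = ZERO_SHOT_TEMPLATE.format(abstract=abstract)
predicted_tags = "<tags returned by the tagger>"
expected_tags = "<ground-truth tags from the validation set>"
verdict = evaluate_tags(expected_tags, predicted_tags)  # "CORRECT" or "INCORRECT"

comet_llm.log_prompt(
    prompt=prompt,
    prompt_template=ZERO_SHOT_TEMPLATE,
    prompt_template_variables={"abstract": abstract},
    output=predicted_tags,
    tags=["zero-shot", verdict],  # the verdict tag makes filtering easy later
    metadata={
        "expected_response": expected_tags,
        "predicted_response": predicted_tags,
        "evaluator_verdict": verdict,
        "model": "gpt-3.5-turbo",
    },
)
```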

Step 5 - Analyzing LLM Behavior

We are particularly interested in analyzing the LLM evaluator, so it’s useful to filter by the final verdict tag (CORRECT/INCORRECT) and other metadata to take a closer look at the evaluator's behavior and quality.

Comet's LLMOps functionalities allow us to quickly track and get insights into the effectiveness of the LLM evaluator.

We can easily navigate prompts, responses, and metadata, and we can filter and group the prompt logs to surface overall trends and insights more quickly.

Having the prompt logs in Comet lets us track the performance of the LLM evaluator in real time, which is super useful.

Step 6 - Search for Prompts

You can also perform quick searches on your prompt logs via the interface.

For our use case, we are interested in quickly finding specific keywords, errors, or prompts of interest.

I search for relatively new model names (e.g., Llama 2 or GPT-4), as I suspect the model might struggle to extract these names from the paper abstracts.

From a few searches and some time navigating the logs, I gathered a few insights and got a better understanding of how the LLMs behave.

For instance, I observed that the LLM evaluator sometimes misclassifies responses as INCORRECT even when the predicted tags overlap with the expected tags in the evaluation dataset.

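One way to surface these cases systematically is a simple deterministic check alongside the LLM verdict: flag any log where the evaluator said INCORRECT even though the predicted and expected tag sets intersect. A minimal sketch (the helper and example values are my own, not part of the demo):

```python
def tags_overlap(expected: str, predicted: str) -> bool:
    """Return True if any predicted tag also appears among the expected tags."""
    expected_set = {t.strip().lower() for t in expected.split(",") if t.strip()}
    predicted_set = {t.strip().lower() for t in predicted.split(",") if t.strip()}
    return bool(expected_set & predicted_set)


# Example: flag a suspicious verdict (placeholder values).
expected_tags = "Llama 2, GPT-4"
predicted_tags = "Llama 2"
verdict = "INCORRECT"  # what the LLM evaluator returned

if verdict == "INCORRECT" and tags_overlap(expected_tags, predicted_tags):
    print(f"Possible evaluator miss: {predicted_tags!r} overlaps {expected_tags!r}")
```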

Next Steps

There is more work to do on improving the LLM evaluator itself. I can continue iterating and improving my solution and use Comet to continue monitoring the quality of the LLM evaluator.

The next step is to experiment with better LLM evaluators or to directly improve the prompt template used by the evaluator chain.

This is a simple use case, but you can use Comet's LLMOps functionalities to track and debug your LLMs across a wide range of use cases. The team is working on many other features, such as tracking user feedback, better grouping features, and viewing and diffing prompt chains.

If there is enough interest, I might do a follow-up thread to demonstrate other steps we can take to keep improving our solution and share more insights.

Find the notebook I used for this demo here: https://github.com/dair-ai/llm-evaluator

Comet LLM SDK: https://github.com/comet-ml/comet-llm

Docs: https://www.comet.com/docs/v2/guides/large-language-models/overview/
