Tracking LLMs with Comet

When building with LLMs, you will spend a lot of time optimizing prompts and diagnosing LLM behavior.

As you put your solutions into production, you need LLMOps tools to track LLMs and analyze prompts.

Here is a demo of how this process might look (use case included):

Step 1 - The Tool

I’ll use Comet’s new LLMOps functionalities to support our solution.

Comet ML provides prompt engineering tools to analyze prompts at scale.

Step 2 - Our Use Case

For the use case, I’ve built an ML paper tagger that uses GPT-3.5 to extract the model names mentioned in paper abstracts.

After the LLM makes the tag predictions, we evaluate the accuracy of the results using another LLM (referred to as the LLM evaluator).
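
To make this concrete, here is a minimal sketch of what such an evaluator call could look like using the openai Python client. The prompt wording and the evaluate_tags helper are my own illustrative choices, not the exact chain from the demo:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def evaluate_tags(expected_tags: str, predicted_tags: str) -> str:
    """Ask an LLM to judge the tagger's output; returns "CORRECT" or "INCORRECT"."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        temperature=0,
        messages=[
            {
                "role": "user",
                "content": (
                    "You are evaluating a model-name tagger for ML paper abstracts.\n"
                    f"Expected tags: {expected_tags}\n"
                    f"Predicted tags: {predicted_tags}\n"
                    "Reply with exactly one word: CORRECT or INCORRECT."
                ),
            }
        ],
    )
    return response.choices[0].message.content.strip()
```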

I’ve built a small validation dataset to evaluate the solution.

I'm experimenting with both few-shot and zero-shot prompting for the tagging task.

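The exact zero-shot prompt from the demo is in the linked notebook; a minimal version could look like the sketch below (the wording is illustrative):

```python
# Illustrative zero-shot template: instruction only, no in-context examples.
ZERO_SHOT_TEMPLATE = """Your task is to extract the names of ML models mentioned in
the paper abstract below. Return the model names as a comma-separated list.
If no model names are mentioned, return "NA".

Abstract: {abstract}

Tags:"""

prompt = ZERO_SHOT_TEMPLATE.format(abstract="<paper abstract goes here>")
```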

Step 3 - Creating an LLM Project

I use Comet’s LLM tools to track how both the paper tagger and the LLM evaluator perform.

The first step is to create an LLM Project in Comet.

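The project can be created from the Comet UI, and the comet-llm SDK can also be pointed at a project name when it is initialized. A minimal sketch, where "ml-paper-tagger" is a placeholder project name of my choosing:

```python
import comet_llm

# Configure credentials and the target LLM project once per session.
comet_llm.init(
    api_key="YOUR_COMET_API_KEY",  # or set the COMET_API_KEY environment variable
    project="ml-paper-tagger",     # placeholder name for this demo's project
)
```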

Step 4 - Logging Prompts in Comet

The next step is to use Comet’s LLM SDK to log all the prompts and related information to Comet. The functionalities are available via the open-source comet-llm Python library.

The comet-llm library lets you log prompts, responses, prompt templates, variables, and any other related metadata you want to track.

For this use case, I am logging the prompts, the expected LLM response, the predicted LLM response, and the final verdict of the LLM-powered evaluator. Remember, we are interested in assessing and tracking the performance and quality of the LLM evaluator, so we also create tags that make this easy to filter and track.

Here is the kind of prompt information and metadata I am logging.

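As a sketch, a single comet-llm log_prompt call can carry all of it. The metadata keys (expected_response, predicted_response, evaluator_verdict) are my own illustrative names, and the placeholder values reuse the sketches above:

```python
import comet_llm

# Placeholder values wired together from the earlier sketches.
abstract = "<paper abstract from the validation set>"
prompt = ZERO_SHOT_TEMPLATE.format(abstract=abstract)
predicted_tags = "<tags returned by the tagger>"
expected_tags = "<ground-truth tags from the validation set>"
verdict = evaluate_tags(expected_tags, predicted_tags)  # "CORRECT" or "INCORRECT"

comet_llm.log_prompt(
    prompt=prompt,
    prompt_template=ZERO_SHOT_TEMPLATE,
    prompt_template_variables={"abstract": abstract},
    output=predicted_tags,
    tags=["zero-shot", verdict],  # the verdict tag makes filtering easy later
    metadata={
        "expected_response": expected_tags,
        "predicted_response": predicted_tags,
        "evaluator_verdict": verdict,
        "model": "gpt-3.5-turbo",
    },
)
```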

Step 5 - Analyzing LLM Behavior

We are particularly interested in analyzing the LLM evaluator, so it’s useful to filter by the final verdict tag (CORRECT/INCORRECT) and other metadata to take a closer look at the evaluator's behavior and quality.

Comet's LLMOps functionalities allow us to quickly track and get insights into the effectiveness of the LLM evaluator.

We can easily navigate prompts, responses, and metadata, and we can filter and group the prompt logs to surface overall trends and insights more quickly.

Having the prompt logs in Comet lets us track the performance of the LLM evaluator in real time, which is super useful.

Step 6 - Search for Prompts

You can also perform quick searches on your prompt logs via the interface.

For our use case, we are interested in quickly finding specific keywords, errors, or prompts of interest.

I search for relatively new model names (e.g., Llama 2 or GPT-4), as I suspect the model might struggle to extract these names from the paper abstracts.

From a few searches and some time navigating the logs, I gathered a few insights and got a better understanding of how the LLMs behave.

For instance, I observed that the LLM evaluator sometimes misclassifies responses as INCORRECT even when the predicted tags overlap with the expected tags in the evaluation dataset.

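One way to surface these cases systematically is a simple deterministic check alongside the LLM verdict: flag any log where the evaluator said INCORRECT even though the predicted and expected tag sets intersect. A minimal sketch (the helper and example values are my own, not part of the demo):

```python
def tags_overlap(expected: str, predicted: str) -> bool:
    """Return True if any predicted tag also appears among the expected tags."""
    expected_set = {t.strip().lower() for t in expected.split(",") if t.strip()}
    predicted_set = {t.strip().lower() for t in predicted.split(",") if t.strip()}
    return bool(expected_set & predicted_set)


# Example: flag a suspicious verdict (placeholder values).
expected_tags = "Llama 2, GPT-4"
predicted_tags = "Llama 2"
verdict = "INCORRECT"  # what the LLM evaluator returned

if verdict == "INCORRECT" and tags_overlap(expected_tags, predicted_tags):
    print(f"Possible evaluator miss: {predicted_tags!r} overlaps {expected_tags!r}")
```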

Next Steps

There is more work to do on improving the LLM evaluator itself. I can continue iterating and improving my solution and use Comet to continue monitoring the quality of the LLM evaluator.

The next step is to experiment with better LLM evaluators or to directly improve the prompt template used by the evaluator chain.

This is a simple use case, but you can use Comet's LLMOps functionalities to track and debug your LLMs across a wide range of use cases. The team is working on many other features, such as tracking user feedback, better grouping features, and viewing and diffing prompt chains.

If there is enough interest, I might do a follow-up thread to demonstrate other steps we can take to keep improving our solution and share more insights.

Find the notebook I used for this demo here: https://github.com/dair-ai/llm-evaluator

Comet LLM SDK: https://github.com/comet-ml/comet-llm

Docs: https://www.comet.com/docs/v2/guides/large-language-models/overview/
