LLMOps: Evaluate LLM apps with LangSmith

Overview:

In this article, I will share my recent work on integrating an LLM application with LangSmith. LangSmith is built by LangChain and is a unified DevOps platform for developing, collaborating on, testing, deploying, and monitoring LLM applications. LangSmith has many features, and I believe a single article would be too much information to digest, so I am breaking it into multiple chapters. This is an important chapter, as it covers how to create a dataset, run an experiment, and evaluate an LLM application.

Takeaways from this article:

In the first chapter (see link), I documented how to integrate LangSmith with an LLM application and trace the LLM calls.

After going through this chapter, readers will understand how to create a dataset using Python code. The most important part of LangSmith is evaluating LLM applications against the created dataset. Since LLM applications are unpredictable by nature, monitoring and evaluation are key to their success.

All chapters are hands-on. Video guides are available, and no API subscription is needed to implement the steps in Python code.

Let's start and create a dataset:

The power of an LLM application lies in the data it can handle and how accurately it can respond. So we need a dataset, and we evaluate the LLM application's responses with a predefined experiment. A dataset can be uploaded directly in .csv format through the LangSmith web portal; the following steps build one using Python code instead.
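
As an aside, the CSV route can also be scripted rather than done through the portal. The following is a minimal sketch assuming the client's upload_csv helper; the questions.csv file name and its question/answer columns are hypothetical and used only for illustration.

from langsmith import Client

client = Client()

# Upload a CSV file as a dataset; input_keys and output_keys name the CSV
# columns that hold the example inputs and reference outputs.
csv_dataset = client.upload_csv(
    csv_file="questions.csv",   # hypothetical file, for illustration only
    input_keys=["question"],    # column(s) used as example inputs
    output_keys=["answer"],     # column(s) used as reference outputs
    name="Demo CSV dataset",
    description="Dataset uploaded from a CSV file",
)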

  1. Create a Python script
  2. Install the LangSmith pip package: "pip install -U langsmith"
  3. Initialize the client object
  4. Create a dataset using create_dataset
  5. Feed the inputs and outputs and run the code below

from langsmith import Client

client = Client()

# Create an empty dataset, then add input/output example pairs to it.
dataset = client.create_dataset(dataset_name="Demo dataset")
client.create_examples(
    inputs=[
        {"postfix": "to LangSmith"},
        {"postfix": "to Evaluations in LangSmith"},
    ],
    outputs=[
        {"output": "Welcome to LangSmith"},
        {"output": "Welcome to Evaluations in LangSmith"},
    ],
    dataset_id=dataset.id,
)

6. Open LangSmith and navigate to Datasets; the newly created dataset will appear there.
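
If you prefer to check from code instead of the web UI, the client can also read the dataset and its examples back. A small sketch, reusing the client object from the snippet above:

# Read the dataset back by name and print its examples.
fetched = client.read_dataset(dataset_name="Demo dataset")
for example in client.list_examples(dataset_id=fetched.id):
    print(example.inputs, "->", example.outputs)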

Let's evaluate the LLM application:

Once the dataset has been added, a few steps are needed to evaluate the LLM application's responses against those same questions and answers. Will our LLM application generate similar output for the same dataset? That is what evaluation tells us.

  1. Add a few imports, such as LangChainStringEvaluator, evaluate, Run, and Example.
  2. Create a prompt template that takes the input, captures the real LLM response, and marks it as Correct or Incorrect.
  3. Define an evaluator, e.g. one that generates a score based on the difference between the prediction and the expected response, as in the code below.

from langsmith.schemas import Run, Example

# Define evaluators
def must_mention(run: Run, example: Example) -> dict:
    # The run carries the application's actual output; the example carries
    # the reference data stored in the dataset.
    prediction = run.outputs.get("output") or ""
    required = example.outputs.get("must_mention") or []
    score = all(phrase in prediction for phrase in required)
    return {"key": "must_mention", "score": score}

4. Evaluate the predictions and responses of the LLM application, as shown below.
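
The evaluate call below passes a target function, predict, as its first argument, which the original snippet leaves undefined. A minimal stand-in is sketched here, assuming the postfix input key from the Demo dataset; it needs no API subscription, and in a real application its body would call your LLM:

def predict(inputs: dict) -> dict:
    # Target function: receives one example's inputs and returns the
    # application's outputs. Replace the body with a real LLM call.
    return {"output": "Welcome " + inputs["postfix"]}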

from langsmith.evaluation import evaluate

dataset_name = "Demo dataset"  # the dataset created earlier

experiment_results = evaluate(
    predict,  # your AI system (the target function)
    data=dataset_name,  # the data to predict and grade over
    evaluators=[must_mention],  # the evaluators to score the results (use exact_match for the Demo dataset)
    experiment_prefix="rap-generator",  # a prefix for your experiment names to easily identify them
    metadata={
        "version": "1.0.0",
    },
)

5. Run the code and open LangSmith. Navigate to Datasets to check the evaluation results.

6. Useful videos for the dataset-creation and evaluation steps are linked below; the detailed Python code and steps are covered in the video guides.

Useful video to create a dataset:

Useful video to evaluate:

Conclusion:

Congratulations! You've now created a dataset and used it to evaluate your agent or LLM.
