LLMOps: Evaluate LLM apps with LangSmith.
Overview: -
In this article, I will share my recent work on integrating an LLM application with LangSmith. LangSmith is built by LangChain; it is a unified DevOps platform for developing, collaborating on, testing, deploying, and monitoring LLM applications. LangSmith has many features, and covering them all in one article would be too much to digest, so I am breaking the material into multiple chapters. This is an important chapter, as it covers how to create a dataset, run an experiment, and evaluate LLM applications.
Takeaways from this article: -
In the first chapter, I documented how to integrate LangSmith with an LLM application and trace the LLM calls.
After going through this chapter, readers will understand how to create a dataset using Python code. The most important part of LangSmith is evaluating LLM applications against the created dataset. Because LLM applications are unpredictable in nature, monitoring and evaluation are key to success.
All chapters are hands-on. Video guides are available, and no paid API subscription is required to implement the steps in Python code.
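Before running the snippets below, you need the langsmith Python package (pip install langsmith) and an API key from your free LangSmith account, as set up in chapter one. Here is a minimal setup sketch, assuming the standard LANGCHAIN_API_KEY environment variable is what the client reads:

import os

# Minimal setup sketch (my assumption of the usual environment variables;
# replace the placeholder with the key from your free LangSmith account).
os.environ["LANGCHAIN_API_KEY"] = "<your-langsmith-api-key>"
os.environ["LANGCHAIN_TRACING_V2"] = "true"  # enables tracing, covered in chapter one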
Let's start and create a dataset:-
The power of an LLM application lies in the data it can handle and how accurately it can predict responses. So we need a dataset, and we evaluate the LLM application's responses with a predefined experiment. We can upload a dataset directly in .csv format in the LangSmith web portal; the following steps use Python code instead.
from langsmith import Client

# Connect to LangSmith (reads the API key from the environment)
client = Client()

# Create an empty dataset, then add input/output examples to it
dataset_name = "Demo dataset"
dataset = client.create_dataset(dataset_name=dataset_name)
client.create_examples(
    inputs=[
        {"postfix": "to LangSmith"},
        {"postfix": "to Evaluations in LangSmith"},
    ],
    outputs=[
        {"output": "Welcome to LangSmith"},
        {"output": "Welcome to Evaluations in LangSmith"},
    ],
    dataset_id=dataset.id,
)
Open LangSmith and navigate to Datasets; the newly created dataset will appear there.
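If you prefer to verify from code rather than the web portal, a small sketch using the client's list_examples call (part of the langsmith Client API, as I understand it) prints what was stored:

# Sketch: list the examples stored in the newly created dataset.
for example in client.list_examples(dataset_id=dataset.id):
    print(example.inputs, example.outputs)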
Let's evaluate the LLM application:-
After the dataset is added, a few steps are required to check the LLM application's responses against the same questions and answers. Will our LLM application be able to generate similar output for the same set of inputs? That's called evaluation. First, define an evaluator that scores each prediction.
# Define evaluators
from langsmith.schemas import Example, Run

def must_mention(run: Run, example: Example) -> dict:
    # Score 1 if the prediction contains every required phrase, else 0
    prediction = run.outputs.get("output") or ""
    required = example.outputs.get("must_mention") or []
    score = all(phrase in prediction for phrase in required)
    return {"key": "must_mention", "score": score}
Next, evaluate the predictions and responses of the LLM application.
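The evaluate call below passes a predict callable as the system under test, but its definition is not shown here. As a minimal, hypothetical stand-in (my placeholder, not part of the original code), something like this keeps the snippet runnable; evaluate() calls it with each example's inputs dict and expects a dict of outputs in return:

# Hypothetical stand-in for the LLM application under test.
def predict(inputs: dict) -> dict:
    return {"output": "Welcome " + inputs.get("postfix", "")}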
from langsmith.evaluation import evaluate

experiment_results = evaluate(
    predict,  # Your AI system
    data=dataset_name,  # The data to predict and grade over
    evaluators=[must_mention],  # The evaluators to score the results
    experiment_prefix="rap-generator",  # A prefix for your experiment names to easily identify them
    metadata={
        "version": "1.0.0",
    },
)
Run the code, then open LangSmith and navigate to Datasets to check the evaluation results.
Useful videos covering the dataset creation and evaluation steps are linked below; the detailed Python code and steps are shown in the video guides.
Useful video to create a dataset:-
Useful video to evaluate:-
Conclusion:
Congratulations! You've now created a dataset and used it to evaluate your agent or LLM.