Writing LLM tests with Pytest & LangSmith integration
Traditional testing methods often fall short when dealing with the complexities of LLMs.
This is where LangSmith, used together with the familiar Pytest framework, steps in to help you test your LLM calls systematically.
Why This Matters
LangSmith enhances Pytest by providing detailed tracing and logging for LLM test runs, making debugging much easier.
You can track various metrics beyond simple pass/fail, gaining a deeper understanding of your LLM's performance over time.
Finally, LangSmith facilitates collaboration by allowing you to easily share test results with both technical and non-technical stakeholders.
Why Should You Care About Enhanced LLM Evaluation?
Developing robust LLM applications requires more than just functional tests. You need to understand why a test fails, track performance changes over time, and effectively communicate results across teams. LangSmith integration with Pytest addresses these challenges head-on. By leveraging these tools, you can move beyond basic assertions and gain actionable insights into your LLM's behavior, ultimately leading to more reliable and higher-quality applications.
What We Will Explore
This article will walk you through the process of integrating LangSmith with Pytest to enhance your LLM testing workflow. We'll cover setting up your project, writing tests, using LangSmith for debugging, logging detailed metrics, and understanding how this integration can streamline your LLM development process.
Getting Started: Project Setup and Basic Tests
Let's begin by setting up a basic Python project for testing an LLM application. Assume you have a project structure like this:
langsmith-pytest
├── .pytest_cache
├── src
│   ├── agent.py
│   └── src.egg-info
├── tests
│   ├── __init__.py
│   ├── test_agent.py
│   └── utils.py
└── setup.py
Inside src/agent.py, you might have a function that generates marketing copy using an LLM:
# src/agent.py
import openai
from langsmith import traceable, wrappers

oai_client = wrappers.wrap_openai(openai.OpenAI())

@traceable
def generate_marketing_copy(request: str) -> str:
    result = oai_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "You are an expert technical marketer."},
            {"role": "user", "content": request},
        ],
    )
    return result.choices[0].message.content
And in tests/utils.py, a function to score the generated copy:
# tests/utils.py
from openai import OpenAI
from langsmith import traceable
from langsmith.wrappers import wrap_openai
from pydantic import BaseModel

client = wrap_openai(OpenAI())

class Response(BaseModel):
    score: int

@traceable
def score_marketing_copy(content: str, content_type: str) -> Response:
    completion = client.beta.chat.completions.parse(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "system",
                "content": f"You are an expert technical marketer. Grade the following {content_type} on how good it is, on a scale of 1-10.",
            },
            {"role": "user", "content": content},
        ],
        response_format=Response,
    )
    return completion.choices[0].message.parsed
Now, let's create some basic Pytest tests in tests/test_agent.py:
# tests/test_agent.py
from src.agent import generate_marketing_copy
from tests.utils import score_marketing_copy

def test_tweet():
    request = "Write a tweet about LLMs"
    response = generate_marketing_copy(request)
    assert len(response) < 280
    score_result = score_marketing_copy(response, "tweet")
    assert score_result.score > 8

def test_linkedin_post():
    request = "Write a LinkedIn post about LLMs"
    response = generate_marketing_copy(request)
    assert len(response) > 280
    assert response.count("\n") > 2
    score_result = score_marketing_copy(response, "linkedin post")
    assert score_result.score > 8
If you run these tests using pytest tests, you might encounter failures. This is where LangSmith integration becomes invaluable.
Debugging with LangSmith: Tracing Test Runs
To integrate LangSmith, first, ensure you have the necessary packages installed:
pip install -U "langsmith[pytest]" openai
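You will also need LangSmith credentials available to the test process. A minimal setup, assuming the standard LANGSMITH_* environment variables (the LANGCHAIN_* equivalents also work):
export LANGSMITH_TRACING=true
export LANGSMITH_API_KEY=<your-langsmith-api-key>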
Then, decorate your test functions with @pytest.mark.langsmith:
# tests/test_agent.py
import pytest
from src.agent import generate_marketing_copy
from tests.utils import score_marketing_copy
from langsmith import testing

@pytest.mark.langsmith
def test_tweet():
    request = "Write a tweet about LLMs"
    response = generate_marketing_copy(request)
    assert len(response) < 280
    score_result = score_marketing_copy(response, "tweet")
    assert score_result.score > 8

@pytest.mark.langsmith
def test_linkedin_post():
    request = "Write a LinkedIn post about LLMs"
    response = generate_marketing_copy(request)
    assert len(response) > 280
    assert response.count("\n") > 2
    score_result = score_marketing_copy(response, "linkedin post")
    assert score_result.score > 8
Now, run your tests again:
pytest tests
With LangSmith integrated, each test run is traced and logged to the LangSmith dashboard. This allows you to inspect the inputs, outputs, and intermediate steps of your LLM calls when a test fails.
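To group runs under a descriptive test suite name in the dashboard, recent versions of the plugin read a LANGSMITH_TEST_SUITE environment variable, and a richer terminal summary is available via an output flag; treat both names below as assumptions to verify against your installed langsmith version:
LANGSMITH_TEST_SUITE="Marketing copy agent" pytest tests
pytest --langsmith-output tests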
Logging Inputs, Outputs, and Feedback for Comprehensive Evaluation
Beyond basic pass/fail, LangSmith allows you to log detailed metrics and feedback within your tests. Let's enhance our tests to log inputs, outputs, and custom feedback:
# tests/test_agent.py
import pytest
from src.agent import generate_marketing_copy
from tests.utils import score_marketing_copy
from langsmith import testing

@pytest.mark.langsmith
def test_tweet():
    request = "Write a tweet about LLMs"
    testing.log_inputs({"query": request})
    response = generate_marketing_copy(request)
    testing.log_outputs({"response": response})
    testing.log_feedback(key="length", score=len(response))
    testing.log_feedback(key="twitter_length", score=len(response) < 280)
    assert len(response) < 280
    score_result = score_marketing_copy(response, "tweet")
    assert score_result.score > 8

@pytest.mark.langsmith
def test_linkedin_post():
    request = "Write a LinkedIn post about LLMs"
    testing.log_inputs({"query": request})
    response = generate_marketing_copy(request)
    testing.log_outputs({"response": response})
    testing.log_feedback(key="length", score=len(response))
    testing.log_feedback(key="linkedin_length", score=len(response) > 280)
    testing.log_feedback(key="newlines_count", score=response.count("\n"))
    testing.log_feedback(key="newlines", score=response.count("\n") > 2)
    assert len(response) > 280
    assert response.count("\n") > 2
    score_result = score_marketing_copy(response, "linkedin post")
    assert score_result.score > 8
Here, we use testing.log_inputs, testing.log_outputs, and testing.log_feedback to record specific data points. This provides a richer context for understanding test results in LangSmith.
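Because @pytest.mark.langsmith is a regular Pytest marker, it composes with the rest of your Pytest toolbox. As a sketch, the same test can be parametrized so one function covers several prompts, with each case logged as its own example (the topic list below is purely illustrative):
# tests/test_agent.py (sketch)
import pytest
from langsmith import testing
from src.agent import generate_marketing_copy

@pytest.mark.langsmith
@pytest.mark.parametrize("topic", ["LLMs", "vector databases", "prompt caching"])
def test_tweet_topics(topic: str):
    request = f"Write a tweet about {topic}"
    testing.log_inputs({"query": request})
    response = generate_marketing_copy(request)
    testing.log_outputs({"response": response})
    testing.log_feedback(key="twitter_length", score=len(response) < 280)
    assert len(response) < 280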
Advanced Debugging: trace_feedback Context Manager
For more complex scenarios, especially when the feedback itself needs debugging, LangSmith's trace_feedback context manager is very useful. Imagine a different application that converts natural-language questions into SQL: you want to evaluate the semantic similarity between the generated SQL and the expected SQL inside a test. You can use trace_feedback to isolate and trace that evaluation step:
# tests/test_my_app.py
import pytest
import openai
from langsmith import testing as t
from langsmith import traceable, wrappers

oai_client = wrappers.wrap_openai(openai.OpenAI())

@traceable
def generate_sql(user_query: str) -> str:
    result = oai_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Convert the user query to a SQL query."},
            {"role": "user", "content": user_query},
        ],
    )
    return result.choices[0].message.content

def is_valid_sql(query: str) -> bool:
    """Return True if the query is valid SQL."""
    # In a real scenario, implement actual SQL validation logic
    return True

@pytest.mark.langsmith
def test_sql_generation_select_all() -> None:
    user_query = "Get all users from the customers table"
    t.log_inputs({"user_query": user_query})
    sql = generate_sql(user_query)
    t.log_outputs({"sql": sql})
    expected = "SELECT * FROM customers;"
    t.log_reference_outputs({"sql": expected})

    # Evaluate semantic equivalence with an LLM judge; trace_feedback keeps
    # this evaluation step as its own run in LangSmith
    with t.trace_feedback():
        instructions = (
            "Return 1 if the ACTUAL and EXPECTED answers are semantically "
            "equivalent, otherwise return 0. Return only 0 or 1 and nothing else."
        )
        grade = oai_client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {"role": "system", "content": instructions},
                {"role": "user", "content": f"ACTUAL: {sql}\nEXPECTED: {expected}"},
            ],
        )
        score = float(grade.choices[0].message.content)
        t.log_feedback(key="correct", score=score)
    assert score  # the test passes if the judge marks the SQL as equivalent
The trace_feedback context manager ensures that the steps within it are traced separately, making it easier to debug the feedback generation process.
Tracing LangChain Applications
If you are using LangChain, LangSmith integrates just as seamlessly. Set the environment variable LANGCHAIN_TRACING_V2 to 'true' and configure your LANGCHAIN_API_KEY, and LangChain components report traces automatically. For code that calls the OpenAI SDK directly, wrapping the client with wrap_openai gives you the same tracing without extensive code modifications.
# utils.py (example)
from openai import OpenAI
from langsmith import traceable
from langsmith.wrappers import wrap_openai
client = wrap_openai(OpenAI()) # Wrap your OpenAI client
# ... rest of your code using 'client' ...
This simple step enables tracing for all calls made through this wrapped client.
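For LangChain components themselves, no wrapper is needed at all. Below is a minimal sketch, assuming langchain-openai is installed and the tracing environment variables above are set; the model name is just an example:
# standalone_example.py (hypothetical)
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini")
# With LANGCHAIN_TRACING_V2=true and LANGCHAIN_API_KEY set, this call is
# traced to LangSmith automatically, with no wrap_openai needed.
reply = llm.invoke("Summarize LangSmith tracing in one sentence.")
print(reply.content)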
That's it! Enjoy testing your LLM calls.