Writing LLM tests with Pytest & LangSmith integration

Traditional testing methods often fall short when dealing with the complexities of LLMs.

This is where LangSmith, used alongside the familiar Pytest framework, steps in to help you test your LLM calls systematically.

Why This Matters

LangSmith enhances Pytest by providing detailed tracing and logging for LLM test runs, making debugging much easier.

You can track various metrics beyond simple pass/fail, gaining a deeper understanding of your LLM's performance over time.

Finally, LangSmith facilitates collaboration by allowing you to easily share test results with both technical and non-technical stakeholders.

Why Should You Care About Enhanced LLM Evaluation?

Developing robust LLM applications requires more than just functional tests. You need to understand why a test fails, track performance changes over time, and effectively communicate results across teams. LangSmith integration with Pytest addresses these challenges head-on. By leveraging these tools, you can move beyond basic assertions and gain actionable insights into your LLM's behavior, ultimately leading to more reliable and higher-quality applications.

What We Will Explore

This article will walk you through the process of integrating LangSmith with Pytest to enhance your LLM testing workflow. We'll cover setting up your project, writing tests, using LangSmith for debugging, logging detailed metrics, and understanding how this integration can streamline your LLM development process.

Getting Started: Project Setup and Basic Tests

Let's begin by setting up a basic Python project for testing an LLM application. Assume you have a project structure like this:

langsmith-pytest
├── .pytest_cache
├── src
│   ├── agent.py
│   └── src.egg-info
├── tests
│   ├── __init__.py
│   ├── test_agent.py
│   └── utils.py
└── setup.py

Inside src/agent.py, you might have a function that generates marketing copy using an LLM:

# src/agent.py

import openai
from langsmith import traceable, wrappers

oai_client = wrappers.wrap_openai(openai.OpenAI())

@traceable
def generate_marketing_copy(request: str) -> str:
    result = oai_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "You are an expert technical marketer."},
            {"role": "user", "content": request},
        ],
    )
    return result.choices[0].message.content        

And in tests/utils.py, a function to score the generated copy:

# tests/utils.py
from openai import OpenAI
from langsmith import traceable
from langsmith.wrappers import wrap_openai
from typing import Any
from pydantic import BaseModel

client = wrap_openai(OpenAI())

class Response(BaseModel):
    score: int

@traceable
def score_marketing_copy(content: str, content_type: str) -> Any:
    completion = client.beta.chat.completions.parse(
        model="gpt-40-mini",
        messages=[
            {
                "role": "system",
                "content": f"You are an expert technical marketer. Grade the following {content_type} on how good",
            },
            {"role": "user", "content": content},
        ],
        response_format=Response
    )
    return completion.choices[0].message.parsed        

Now, let's create some basic Pytest tests in tests/test_agent.py:

# tests/test_agent.py
from src.agent import generate_marketing_copy
from tests.utils import score_marketing_copy

def test_tweet():
    request = "Write a tweet about LLMs"
    response = generate_marketing_copy(request)
    assert len(response) < 280
    score_result = score_marketing_copy(response, "tweet")
    assert score_result.score > 8

def test_linkedin_post():
    request = "Write a LinkedIn post about LLMs"
    response = generate_marketing_copy(request)
    assert len(response) > 280
    assert response.count("\n") > 2
    score_result = score_marketing_copy(response, "linkedin post")
    assert score_result.score > 8        

If you run these tests using pytest tests, you may see failures without much insight into why they failed. This is where LangSmith integration becomes invaluable.

Debugging with LangSmith: Tracing Test Runs

To integrate LangSmith, first, ensure you have the necessary packages installed:

pip install -U "langsmith[pytest]" openai        
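
You will also need API keys available in your environment: an OpenAI key for the model calls and a LangSmith key so test runs can be traced. A minimal setup sketch (variable names follow the OpenAI SDK and the LANGCHAIN_* variables referenced later in this article; adjust to your setup):

export OPENAI_API_KEY="<your-openai-key>"
export LANGCHAIN_API_KEY="<your-langsmith-api-key>"
export LANGCHAIN_TRACING_V2="true"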

Then, decorate your test functions with @pytest.mark.langsmith:

# tests/test_agent.py
import pytest
from src.agent import generate_marketing_copy
from tests.utils import score_marketing_copy
from langsmith import testing

@pytest.mark.langsmith
def test_tweet():
    request = "Write a tweet about LLMs"
    response = generate_marketing_copy(request)
    assert len(response) < 280
    score_result = score_marketing_copy(response, "tweet")
    assert score_result.score > 8

@pytest.mark.langsmith
def test_linkedin_post():
    request = "Write a LinkedIn post about LLMs"
    response = generate_marketing_copy(request)
    assert len(response) > 280
    assert response.count("\n") > 2
    score_result = score_marketing_copy(response, "linkedin post")
    assert score_result.score > 8        

Now, run your tests again:

pytest tests        

With LangSmith integrated, each test run is traced and logged to the LangSmith dashboard. This allows you to inspect the inputs, outputs, and intermediate steps of your LLM calls when a test fails.
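
Because the marker applies per test invocation, it also composes with standard Pytest features. A minimal sketch, assuming @pytest.mark.langsmith can be combined with @pytest.mark.parametrize in the usual way so each case becomes its own traced run (the topics are illustrative):

# tests/test_agent.py (sketch)
import pytest
from src.agent import generate_marketing_copy

@pytest.mark.langsmith
@pytest.mark.parametrize("topic", ["LLMs", "vector databases", "prompt caching"])
def test_tweet_topics(topic):
    # Each parametrized case is logged as a separate traced run in LangSmith.
    response = generate_marketing_copy(f"Write a tweet about {topic}")
    assert len(response) < 280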

Logging Inputs, Outputs, and Feedback for Comprehensive Evaluation

Beyond basic pass/fail, LangSmith allows you to log detailed metrics and feedback within your tests. Let's enhance our tests to log inputs, outputs, and custom feedback:

# tests/test_agent.py
import pytest
from src.agent import generate_marketing_copy
from tests.utils import score_marketing_copy
from langsmith import testing

@pytest.mark.langsmith
def test_tweet():
    request = "Write a tweet about LLMs"
    testing.log_inputs({"query": request})
    response = generate_marketing_copy(request)
    testing.log_outputs({"response": response})
    testing.log_feedback(key="length", score=len(response))
    testing.log_feedback(key="twitter_length", score=len(response) < 280)
    assert len(response) < 280
    score_result = score_marketing_copy(response, "tweet")
    assert score_result.score > 8

@pytest.mark.langsmith
def test_linkedin_post():
    request = "Write a LinkedIn post about LLMs"
    testing.log_inputs({"query": request})
    response = generate_marketing_copy(request)
    testing.log_outputs({"response": response})
    testing.log_feedback(key="length", score=len(response))
    testing.log_feedback(key="linkedin_length", score=len(response) > 280)
    testing.log_feedback(key="newlines_count", score=response.count("\n"))
    testing.log_feedback(key="newlines", score=response.count("\n") > 2)
    assert len(response) > 280
    assert response.count("\n") > 2
    score_result = score_marketing_copy(response, "linkedin post")
    assert score_result.score > 8        

Here, we use testing.log_inputs, testing.log_outputs, and testing.log_feedback to record specific data points. This provides a richer context for understanding test results in LangSmith.
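
When you have an expected or reference answer, you can also attach it with testing.log_reference_outputs (the same helper appears in the SQL example below), so the reference is shown next to the actual output in LangSmith. A minimal sketch; the test name and reference string are illustrative:

# tests/test_agent.py (sketch)
import pytest
from langsmith import testing
from src.agent import generate_marketing_copy

@pytest.mark.langsmith
def test_tagline():
    request = "Write a one-line tagline for an LLM testing tool"
    testing.log_inputs({"query": request})
    response = generate_marketing_copy(request)
    testing.log_outputs({"response": response})
    # Hypothetical reference output, logged for side-by-side comparison in LangSmith.
    testing.log_reference_outputs({"response": "Test your LLM calls like you test your code."})
    assert len(response) > 0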

Advanced Debugging: trace_feedback Context Manager

For more complex scenarios, especially when the feedback itself comes from an LLM (an LLM-as-judge), LangSmith's trace_feedback context manager is incredibly useful. Imagine you want to check whether generated SQL is semantically equivalent to the expected SQL in a test. You can use trace_feedback to isolate and trace this evaluation step:

# tests/test_my_app.py
import pytest
from langsmith import testing as t
from langsmith import traceable, wrappers
import openai

oai_client = wrappers.wrap_openai(openai.OpenAI())

@traceable
def generate_sql(user_query: str) -> str:
    result = oai_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Convert the user query to a SQL"},
            {"role": "user", "content": user_query},
        ],
    )
    return result.choices[0].message.content

def is_valid_sql(query: str) -> bool:
    """Return True if the query is valid SQL."""
    # In a real scenario, implement actual SQL validation logic
    return True

@pytest.mark.langsmith
def test_sql_generation_select_all() -> None:
    user_query = "Get all users from the customers table"
    t.log_inputs({"user_query": user_query})
    sql = generate_sql(user_query)
    t.log_outputs({"sql": sql})
    expected = "SELECT * FROM customers;"
    t.log_reference_outputs({"sql": expected})

    with t.trace_feedback():
        instructions = (
            "Return 1 if the ACTUAL and EXPECTED answers are semantically "
            "otherwise return 0. Return only 0 or 1 and nothing else."
        )
        grade = oai_client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {"role": "system", "content": instructions},
                {"role": "user", "content": f"ACTUAL: {sql}\nEXPECTED: {expected}"},
            ],
        )
        score = float(grade.choices[0].message.content)
        t.log_feedback(key="correct", score=score)

    assert sql == expected        

The trace_feedback context manager ensures that the steps within it are traced separately, making it easier to debug the feedback generation process.
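
If you use the same LLM-as-judge across several tests, one option is to pull it into its own @traceable helper and call it inside trace_feedback, so the judge's trace stays attached to the feedback rather than the main test flow. A minimal sketch under that assumption; the helper name and prompt are illustrative:

# tests/utils.py (sketch)
import openai
from langsmith import traceable, wrappers

judge_client = wrappers.wrap_openai(openai.OpenAI())

@traceable
def judge_equivalence(actual: str, expected: str) -> float:
    """Return 1.0 if ACTUAL and EXPECTED are judged semantically equivalent, else 0.0."""
    grade = judge_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Return 1 if ACTUAL and EXPECTED are semantically equivalent, otherwise 0. Return only 0 or 1."},
            {"role": "user", "content": f"ACTUAL: {actual}\nEXPECTED: {expected}"},
        ],
    )
    return float(grade.choices[0].message.content)

# In a test body:
#     with t.trace_feedback():
#         t.log_feedback(key="correct", score=judge_equivalence(sql, expected))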

Tracing LangChain Applications

If you are using LangChain, LangSmith integrates seamlessly: set the environment variable LANGCHAIN_TRACING_V2 to 'true' and configure your LANGCHAIN_API_KEY, and your chains and runnables are traced automatically. If you call the OpenAI SDK directly, wrapping the client with wrap_openai gives you the same tracing without extensive code modifications.

# utils.py (example)
from openai import OpenAI
from langsmith import traceable
from langsmith.wrappers import wrap_openai

client = wrap_openai(OpenAI()) # Wrap your OpenAI client

# ... rest of your code using 'client' ...        

This simple step enables tracing for all calls made through this wrapped client.
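
For LangChain components themselves, no wrapper is needed once the tracing environment variables are set. A minimal sketch, assuming the langchain-openai package is installed (model and prompt are illustrative):

# langchain_example.py (sketch)
from langchain_openai import ChatOpenAI

# With LANGCHAIN_TRACING_V2=true and LANGCHAIN_API_KEY set,
# this call is traced to LangSmith automatically.
llm = ChatOpenAI(model="gpt-4o-mini")
print(llm.invoke("Write a tweet about LLM testing").content)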

That's it! Enjoy testing your LLM calls.


