Writing LLM tests with Pytest & LangSmith integration

Traditional testing methods often fall short when dealing with the complexities of LLMs.

This is where LangSmith, used alongside the familiar Pytest framework, steps in to help you test your LLM calls systematically.

Why This Matters

LangSmith enhances Pytest by providing detailed tracing and logging for LLM test runs, making debugging much easier.

You can track various metrics beyond simple pass/fail, gaining a deeper understanding of your LLM's performance over time.

Finally, LangSmith facilitates collaboration by allowing you to easily share test results with both technical and non-technical stakeholders.

Why Should You Care About Enhanced LLM Evaluation?

Developing robust LLM applications requires more than just functional tests. You need to understand why a test fails, track performance changes over time, and effectively communicate results across teams. LangSmith integration with Pytest addresses these challenges head-on. By leveraging these tools, you can move beyond basic assertions and gain actionable insights into your LLM's behavior, ultimately leading to more reliable and higher-quality applications.

What We Will Explore

This article will walk you through the process of integrating LangSmith with Pytest to enhance your LLM testing workflow. We'll cover setting up your project, writing tests, using LangSmith for debugging, logging detailed metrics, and understanding how this integration can streamline your LLM development process.

Getting Started: Project Setup and Basic Tests

Let's begin by setting up a basic Python project for testing an LLM application. Assume you have a project structure like this:

langsmith-pytest
├── .pytest_cache
├── src
│   ├── agent.py
│   └── src.egg-info
├── tests
│   ├── __init__.py
│   ├── test_agent.py
│   └── utils.py
└── setup.py

Inside src/agent.py, you might have a function that generates marketing copy using an LLM:

# src/agent.py

import openai
from langsmith import traceable, wrappers

oai_client = wrappers.wrap_openai(openai.OpenAI())

@traceable
def generate_marketing_copy(request: str) -> str:
    result = oai_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "You are an expert technical marketer."},
            {"role": "user", "content": request},
        ],
    )
    return result.choices[0].message.content        

And in tests/utils.py, a function to score the generated copy:

# tests/utils.py
from openai import OpenAI
from langsmith import traceable
from langsmith.wrappers import wrap_openai
from typing import Any
from pydantic import BaseModel

client = wrap_openai(OpenAI())

class Response(BaseModel):
    score: int

@traceable
def score_marketing_copy(content: str, content_type: str) -> Any:
    completion = client.beta.chat.completions.parse(
        model="gpt-40-mini",
        messages=[
            {
                "role": "system",
                "content": f"You are an expert technical marketer. Grade the following {content_type} on how good",
            },
            {"role": "user", "content": content},
        ],
        response_format=Response
    )
    return completion.choices[0].message.parsed        

Now, let's create some basic Pytest tests in tests/test_agent.py:

# tests/test_agent.py
from src.agent import generate_marketing_copy
from tests.utils import score_marketing_copy

def test_tweet():
    request = "Write a tweet about LLMs"
    response = generate_marketing_copy(request)
    assert len(response) < 280
    score_result = score_marketing_copy(response, "tweet")
    assert score_result.score > 8

def test_linkedin_post():
    request = "Write a LinkedIn post about LLMs"
    response = generate_marketing_copy(request)
    assert len(response) > 280
    assert response.count("\n") > 2
    score_result = score_marketing_copy(response, "linkedin post")
    assert score_result.score > 8        

If you run these tests using pytest tests, you may see failures without much insight into why they failed. This is where LangSmith integration becomes invaluable.

Debugging with LangSmith: Tracing Test Runs

To integrate LangSmith, first, ensure you have the necessary packages installed:

pip install -U "langsmith[pytest]" openai        
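
You will also need API keys available in your environment: an OpenAI key for the model calls and a LangSmith key so test runs can be traced. A minimal setup sketch (variable names follow the OpenAI SDK and the LANGCHAIN_* variables referenced later in this article; adjust to your setup):

export OPENAI_API_KEY="<your-openai-key>"
export LANGCHAIN_API_KEY="<your-langsmith-api-key>"
export LANGCHAIN_TRACING_V2="true"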

Then, decorate your test functions with @pytest.mark.langsmith:

# tests/test_agent.py
import pytest
from src.agent import generate_marketing_copy
from tests.utils import score_marketing_copy
from langsmith import testing

@pytest.mark.langsmith
def test_tweet():
    request = "Write a tweet about LLMs"
    response = generate_marketing_copy(request)
    assert len(response) < 280
    score_result = score_marketing_copy(response, "tweet")
    assert score_result.score > 8

@pytest.mark.langsmith
def test_linkedin_post():
    request = "Write a LinkedIn post about LLMs"
    response = generate_marketing_copy(request)
    assert len(response) > 280
    assert response.count("\n") > 2
    score_result = score_marketing_copy(response, "linkedin post")
    assert score_result.score > 8        

Now, run your tests again:

pytest tests        

With LangSmith integrated, each test run is traced and logged to the LangSmith dashboard. This allows you to inspect the inputs, outputs, and intermediate steps of your LLM calls when a test fails.
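
Because the marker applies per test invocation, it also composes with standard Pytest features. A minimal sketch, assuming @pytest.mark.langsmith can be combined with @pytest.mark.parametrize in the usual way so each case becomes its own traced run (the topics are illustrative):

# tests/test_agent.py (sketch)
import pytest
from src.agent import generate_marketing_copy

@pytest.mark.langsmith
@pytest.mark.parametrize("topic", ["LLMs", "vector databases", "prompt caching"])
def test_tweet_topics(topic):
    # Each parametrized case is logged as a separate traced run in LangSmith.
    response = generate_marketing_copy(f"Write a tweet about {topic}")
    assert len(response) < 280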

Logging Inputs, Outputs, and Feedback for Comprehensive Evaluation

Beyond basic pass/fail, LangSmith allows you to log detailed metrics and feedback within your tests. Let's enhance our tests to log inputs, outputs, and custom feedback:

# tests/test_agent.py
import pytest
from src.agent import generate_marketing_copy
from tests.utils import score_marketing_copy
from langsmith import testing

@pytest.mark.langsmith
def test_tweet():
    request = "Write a tweet about LLMs"
    testing.log_inputs({"query": request})
    response = generate_marketing_copy(request)
    testing.log_outputs({"response": response})
    testing.log_feedback(key="length", score=len(response))
    testing.log_feedback(key="twitter_length", score=len(response) < 280)
    assert len(response) < 280
    score_result = score_marketing_copy(response, "tweet")
    assert score_result.score > 8

@pytest.mark.langsmith
def test_linkedin_post():
    request = "Write a LinkedIn post about LLMs"
    testing.log_inputs({"query": request})
    response = generate_marketing_copy(request)
    testing.log_outputs({"response": response})
    testing.log_feedback(key="length", score=len(response))
    testing.log_feedback(key="linkedin_length", score=len(response) > 280)
    testing.log_feedback(key="newlines_count", score=response.count("\n"))
    testing.log_feedback(key="newlines", score=response.count("\n") > 2)
    assert len(response) > 280
    assert response.count("\n") > 2
    score_result = score_marketing_copy(response, "linkedin post")
    assert score_result.score > 8        

Here, we use testing.log_inputs, testing.log_outputs, and testing.log_feedback to record specific data points. This provides a richer context for understanding test results in LangSmith.
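
When you have an expected or reference answer, you can also attach it with testing.log_reference_outputs (the same helper appears in the SQL example below), so the reference is shown next to the actual output in LangSmith. A minimal sketch; the test name and reference string are illustrative:

# tests/test_agent.py (sketch)
import pytest
from langsmith import testing
from src.agent import generate_marketing_copy

@pytest.mark.langsmith
def test_tagline():
    request = "Write a one-line tagline for an LLM testing tool"
    testing.log_inputs({"query": request})
    response = generate_marketing_copy(request)
    testing.log_outputs({"response": response})
    # Hypothetical reference output, logged for side-by-side comparison in LangSmith.
    testing.log_reference_outputs({"response": "Test your LLM calls like you test your code."})
    assert len(response) > 0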

Advanced Debugging: trace_feedback Context Manager

For more complex scenarios, especially when the feedback itself comes from an LLM (an LLM-as-judge), LangSmith's trace_feedback context manager is incredibly useful. Imagine you want to check whether generated SQL is semantically equivalent to the expected SQL in a test. You can use trace_feedback to isolate and trace this evaluation step:

# tests/test_my_app.py
import pytest
from langsmith import testing as t
from langsmith import traceable, wrappers
import openai

oai_client = wrappers.wrap_openai(openai.OpenAI())

@traceable
def generate_sql(user_query: str) -> str:
    result = oai_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Convert the user query to a SQL"},
            {"role": "user", "content": user_query},
        ],
    )
    return result.choices[0].message.content

def is_valid_sql(query: str) -> bool:
    """Return True if the query is valid SQL."""
    # In a real scenario, implement actual SQL validation logic
    return True

@pytest.mark.langsmith
def test_sql_generation_select_all() -> None:
    user_query = "Get all users from the customers table"
    t.log_inputs({"user_query": user_query})
    sql = generate_sql(user_query)
    t.log_outputs({"sql": sql})
    expected = "SELECT * FROM customers;"
    t.log_reference_outputs({"sql": expected})

    with t.trace_feedback():
        instructions = (
            "Return 1 if the ACTUAL and EXPECTED answers are semantically "
            "otherwise return 0. Return only 0 or 1 and nothing else."
        )
        grade = oai_client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {"role": "system", "content": instructions},
                {"role": "user", "content": f"ACTUAL: {sql}\nEXPECTED: {expected}"},
            ],
        )
        score = float(grade.choices[0].message.content)
        t.log_feedback(key="correct", score=score)

    assert sql == expected        

The trace_feedback context manager ensures that the steps within it are traced separately, making it easier to debug the feedback generation process.
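
If you use the same LLM-as-judge across several tests, one option is to pull it into its own @traceable helper and call it inside trace_feedback, so the judge's trace stays attached to the feedback rather than the main test flow. A minimal sketch under that assumption; the helper name and prompt are illustrative:

# tests/utils.py (sketch)
import openai
from langsmith import traceable, wrappers

judge_client = wrappers.wrap_openai(openai.OpenAI())

@traceable
def judge_equivalence(actual: str, expected: str) -> float:
    """Return 1.0 if ACTUAL and EXPECTED are judged semantically equivalent, else 0.0."""
    grade = judge_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Return 1 if ACTUAL and EXPECTED are semantically equivalent, otherwise 0. Return only 0 or 1."},
            {"role": "user", "content": f"ACTUAL: {actual}\nEXPECTED: {expected}"},
        ],
    )
    return float(grade.choices[0].message.content)

# In a test body:
#     with t.trace_feedback():
#         t.log_feedback(key="correct", score=judge_equivalence(sql, expected))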

Tracing LangChain Applications

If you are using LangChain, LangSmith integrates seamlessly: set the environment variable LANGCHAIN_TRACING_V2 to 'true' and configure your LANGCHAIN_API_KEY, and your chains and runnables are traced automatically. If you call the OpenAI SDK directly, wrapping the client with wrap_openai gives you the same tracing without extensive code modifications.

# utils.py (example)
from openai import OpenAI
from langsmith import traceable
from langsmith.wrappers import wrap_openai

client = wrap_openai(OpenAI()) # Wrap your OpenAI client

# ... rest of your code using 'client' ...        

This simple step enables tracing for all calls made through this wrapped client.
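
For LangChain components themselves, no wrapper is needed once the tracing environment variables are set. A minimal sketch, assuming the langchain-openai package is installed (model and prompt are illustrative):

# langchain_example.py (sketch)
from langchain_openai import ChatOpenAI

# With LANGCHAIN_TRACING_V2=true and LANGCHAIN_API_KEY set,
# this call is traced to LangSmith automatically.
llm = ChatOpenAI(model="gpt-4o-mini")
print(llm.invoke("Write a tweet about LLM testing").content)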

That's it! Enjoy testing your LLM calls.


