End-to-End Testing an AI Application with Playwright and GitHub Actions
Creating a Robust AI Testing Workflow From Localhost to Production
Why End-to-End Testing?
LLMs are notoriously finicky. You can try to corral them into an API, fine-tune them, lower their temperature, select JSON mode, pray, but in the end you may still end up with a hallucination rate of 15-20%. Developers expect their code to be deterministic, so this is not ideal. Enterprise applications typically have a great number of automated tests that can click around and point out even the slightest differences in expected behavior. For example, the automated tests for one application I worked on were so sensitive that a simple change to an existing flow could break dozens of tests, leading to long hours of manual testing for the quality engineers.
End-to-end tests are meant to verify that everything in a system works as it should in a real-world scenario. That means striking a balance: tests need to be robust enough to handle acceptable levels of variance, but not so brittle that they break on every other CI run. The reality of development is that time is finite, and the smaller the company the more painful it can be to write tests, whether they are unit, integration, or end-to-end tests. Still, I'd like to argue that if a small team has the time to write even just one test before shipping, it should be an end-to-end test.
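To make that balance concrete, an assertion that demands one exact sentence from an LLM will fail constantly, while one that checks for a keyword or pattern tolerates normal variance. Here's a minimal sketch of the difference, assuming a hypothetical .chat-response element and a /chat route:

const { test, expect } = require('@playwright/test');

test('chatbot reply contains the keyword we asked for', async ({ page }) => {
  await page.goto('/chat'); // hypothetical route, for illustration only

  // Brittle: the model will rarely reproduce this exact sentence
  // await expect(page.locator('.chat-response')).toHaveText('Hello! How can I help you today?');

  // Robust: only require that the reply contains the keyword we explicitly asked for
  await expect(page.locator('.chat-response')).toContainText(/hello/i);
});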
Architecture of Eidolon AI
Recently I've been contributing to the open source AI agent framework Eidolon AI. The Eidolon team noted that one of their highest priority needs for the project was just a simple, full end-to-end test for one of their many AI agent examples. The tech stack for their simplest examples includes a MongoDB database, a server that's built with a Dockerfile (eidolon-server), and a standalone Next.js UI that's also built with a Dockerfile (webui). A Docker Compose file at the base of the repository orchestrates each of these components together.
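The actual Compose file lives in the Eidolon repository; the sketch below only illustrates the shape of that orchestration, and the image tags, build paths, and ports are assumptions rather than the project's exact values:

# docker-compose.yml (illustrative sketch, not the project's actual file)
services:
  mongo:
    image: mongo
    ports:
      - "27017:27017"
  eidolon-server:
    build: ./eidolon-server   # hypothetical build context
    depends_on:
      - mongo
  webui:
    build: ./webui            # hypothetical build context
    ports:
      - "3000:3000"
    depends_on:
      - eidolon-server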
This is what the webui looks like after it's built and running:
Adding a Test to Eidolon AI
I wanted my first E2E test on Eidolon to target the example chatbot. The chatbot was an ideal example to target with E2E tests because it requires the database, server, and front-end application to coordinate, but it doesn't require any additional services outside the scope of the existing Docker Compose.
I decided the best way to test the chatbot example would be to use Playwright with GitHub Actions. Playwright is an excellent way to add end-to-end testing to modern applications because we can configure it to hook into a running Docker instance, and it also provides granular ways to target different parts of the DOM, such as selecting a chatbot's text box.
GitHub Actions is the ideal choice as a CI tool because Eidolon was already orchestrating the server and front end together in other GitHub Actions workflow files, and GitHub Actions has a useful action called upload-artifact that uploads the screenshots and results of the test as artifacts, so we can see exactly why a test failed.
Configuring Playwright
Install
Unlike ordinary packages, when installing Playwright we have to install the package itself as well as the browser binaries:
pnpm install --save-dev @playwright/test@latest
pnpm exec playwright install --with-deps
We need the browser binaries so that Playwright can see and control different browsers programmatically.

Next, we tell Playwright where to find our tests and how to reach the running application in a playwright.config.js file:
const { defineConfig } = require('@playwright/test');

module.exports = defineConfig({
  // Where in the repo Playwright should search for our tests
  testDir: './tests',
  // Where the test results should be stored
  outputDir: 'tests/test-results',
  // Important for CI to specify a timeout, or a hung test could stall the run
  timeout: 30000,
  // Retry flaky tests before reporting a failure
  retries: 2,
  use: {
    // When running our tests we don't open the browser (headless)
    headless: true,
    baseURL: 'http://localhost:3000',
    // Upon test failure a screenshot of the front end will be saved
    screenshot: 'only-on-failure',
  },
  webServer: {
    // We need to launch a dev server before running the tests
    command: 'pnpm docker-compose up',
    // Working directory where we run the above command
    cwd: '../../..',
    port: 3000,
    timeout: 120000,
    // If a server is already running, use that for tests
    reuseExistingServer: true,
  },
});
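With the config in place, a quick sanity check is to ask Playwright to list the tests it discovers without running them:

pnpm exec playwright test --list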
Adding a Test
Now that we've configured Playwright to properly search for our tests and know where our front end and server are running, we can add our first test. Since we're testing the chatbot, the most basic E2E test we can run is to ensure the chatbot responds to basic input. Let's break down the chatbot.test.js file which we'll add in our tests directory as specified in the Playwright config:
const { test, expect } = require('@playwright/test');

// Test to check if the chatbot responds to input
test('Chatbot should respond to input', async ({ page }) => {
  await page.goto('/eidolon-apps/sp/chatbot');

  // If the user is not logged in, log in with a random email
  if (await page.locator('text=Eidolon Demo Cloud').isVisible()) {
    const randomEmail = `test${Math.random().toString(36).substring(7)}@example.com`;
    await page.fill('input[id="input-username-for-credentials-provider"]', randomEmail);
    await page.click('button[type="submit"]');
  }

  // Add a chat
  const addChatButton = page.locator('text=Add Chat');
  await addChatButton.click();

  // Wait for the chat input to appear
  const inputField = page.locator('textarea[aria-invalid="false"]');
  await inputField.waitFor();

  // Fill the input field with a message and submit it
  await inputField.fill('Hello, how are you? Type "Hello!" if you are there!');
  await page.locator('button[id="submit-input-text"]').click();

  // Wait for the expected response to show up on the page
  const response = page.getByText('Hello!', { exact: true });
  await response.waitFor();
  await expect(response).toBeVisible();
  await expect(response).toContainText('Hello!');
});
Let's assume that login is disabled or already handled, and focus on the core flow: adding a chat, filling in text, and waiting for a response.
When we're writing tests, how do we know which elements to target in the DOM? Since we downloaded the browser binaries, we can run Playwright in debug mode and use the browser's developer tools to see exactly which elements are good candidates. The command to run Playwright in debug mode is the following:
pnpm exec playwright test --debug
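Playwright also bundles a code generator that records your clicks in a real browser and suggests locators, which can be a faster way to discover candidate selectors (this assumes the stack is already running on port 3000):

pnpm exec playwright codegen http://localhost:3000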
This is what the Playwright Inspector looks like when it's open. Note that we can use the "Step Over" button that I've highlighted in red to go line by line through our test file and see exactly how our front end changes at each step of our test:
At this point, we can move forward to enter the chatbot app, click "Add Chat" (note that it's an async operation), and select the text area where we'll write our prompt for the chatbot:
When stepping over each step, the Playwright Inspector highlights the relevant locator. In this example we target the element by its element type (textarea) and a unique attribute (its aria-invalid value). It would be even better if our element had a unique ID so that we didn't need to target it by a combination of these two descriptors, but for now it'll suffice.
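For example, if the webui later exposed a dedicated test id on the chat input (the data-testid value below is hypothetical), the locator would collapse to a single, unambiguous call:

// Hypothetical: requires the front end to render <textarea data-testid="chat-input" ...>
const inputField = page.getByTestId('chat-input');
await inputField.fill('Hello, how are you?');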
The URL in the browser has a unique process ID for the current conversation (66b58...). The process ID is already a good sign that our server, front end, and database are working together, because a new process is created with a POST request and then fetched with a GET request.
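If we wanted the test to check that round trip explicitly, Playwright can assert the URL against a pattern once the conversation is created; the path shape and ID format below are assumptions about the webui's routing rather than verified values:

// Assumes the conversation URL ends in a hex process ID, e.g. .../chatbot/66b58...
await expect(page).toHaveURL(/\/chatbot\/[0-9a-f]+/);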
In the final step of our test, we fill in the textarea element with the fill() function, then target the submit arrow by its element type and id (button[id="submit-input-text"]). Finally, we use the async waitFor() function to wait for the application to respond (this is why a timeout is important), and we search the whole text of the page to find whether the "Hello!" response we're expecting appears:
Our test is successful and the Chromium browser closes automatically.
In our front end's package.json we can add some scripts to run Playwright automatically.
I added the following:
"build": "pnpm run build-eidolon && next build",
"docker:up": "docker-compose up -d",
"docker:down": "docker-compose down",
"docker:build": "docker-compose build",
"docker:rebuild": "pnpm run docker:down && pnpm run docker:build && pnpm run docker:up",
"test:e2e": "pnpm run docker:rebuild && pnpm exec playwright test && pnpm run docker:down",
"playwright:debug": "pnpm exec playwright test --debug",
We now have the test:e2e command to run our E2E tests from the CI, and the playwright:debug command to debug tests locally.
GitHub Actions Workflow
Now that our test is working correctly, we need it to run in CI. Let's add an e2e.yml file in the .github/workflows directory at the base of our repo. For brevity, I'll only link to the full workflow file and include the relevant snippets below.
Specify Workflow Trigger
To ensure our E2E workflow file doesn't run unnecessarily, we specify exactly when it should run using GitHub's workflow syntax:
on:
  # Lets us trigger the workflow manually
  workflow_dispatch:
  # Triggered on pushes to main, certain paths, and pull requests
  push:
    branches: [main]
    paths:
      - '**'
      - '!k8s-operator/**'
  pull_request:
    paths:
      - '**'
      - '!k8s-operator/**'
Architecture of the test-e2e Job
We'll run just one job, test-e2e, and it includes runs-on, services, env, and steps. The basic skeleton looks like this:
jobs:
  test-e2e:
    runs-on: ubuntu-latest
    services:
      mongo:
    env:
    steps:
Because the server depends on a healthy MongoDB database, we run the database service first and then run the Docker scripts to spin up the server and front end later. The env section includes variables related to NextAuth.js and an OpenAI key. Finally, we get to the main portion of the workflow, the steps.
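The full job definition is in the linked workflow file; as a rough sketch, a MongoDB service container with a health check might look something like this (the image tag and health command are assumptions, not the repository's exact values):

services:
  mongo:
    image: mongo:7
    ports:
      - 27017:27017
    options: >-
      --health-cmd "mongosh --eval 'db.runCommand({ ping: 1 })'"
      --health-interval 10s
      --health-timeout 5s
      --health-retries 5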
- name: Upload Playwright test results
  uses: actions/upload-artifact@v3
  if: always()
  with:
    name: playwright-results
    path: |
      webui/apps/eidolon-ui2/tests/test-results
    if-no-files-found: ignore

- name: Upload Playwright screenshots
  uses: actions/upload-artifact@v3
  if: always()
  with:
    name: playwright-screenshots
    path: webui/apps/eidolon-ui2/tests/screenshots
    if-no-files-found: ignore
The trickiest part to remember about the above workflow steps is that we need to include if: always() to ensure the CI runs these steps even if a previous step fails (otherwise we would never reach the upload steps when a test fails).
Now when a test fails during our workflow, we can view the Playwright test results by downloading playwright-results, a zip file that contains the relevant images.
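You can download playwright-results from the workflow run's summary page, or, if you have the GitHub CLI installed and authenticated, pull it down directly (replace <run-id> with the failing run's ID):

gh run download <run-id> -n playwright-results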
A test failure screenshot taken by Playwright:
Conclusion
There are plenty of best practices that would take another article to go over in detail, like getting consistent LLM outputs, targeting DOM elements effectively, and adding retry/timeout flows in workflow files. For now you can get started adding simple end-to-end tests for your own AI application using just Playwright and GitHub Actions.
If you're interested in contributing to Eidolon AI the team is always looking for contributors—feel free to join their Discord.
If you'd like help adding end-to-end testing for your own application or just want to say hello, feel free to reach out!