End-to-End Testing an AI Application with Playwright and GitHub Actions
Creating a Robust AI Testing Workflow From Localhost to Production
Why End-to-End Testing?
LLMs are notoriously finicky. You can try to corral them into an API, fine-tune them, lower their temperature, select JSON mode, pray, but in the end you may still end up with a hallucination rate of 15-20%. Developers expect their code to be deterministic, so this is not ideal. Enterprise applications typically have a great number of automated tests that can click around and point out even the slightest differences in expected behavior. For example, the automated tests for one application I worked on were so sensitive that a simple change to an existing flow could break dozens of tests, leading to long hours of manual testing for the quality engineers.
End-to-end tests are meant to verify that everything in a system works as it should in a real-world scenario. That means striking a balance: tests need to be robust enough to handle acceptable levels of variance, but not so brittle that they break on every other CI run. The reality of development is that time is finite, and the smaller the company the more painful it can be to write tests, whether they are unit, integration, or end-to-end tests. Still, I'd like to argue that if a small team has the time to write even just one test before shipping, it should be an end-to-end test.
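To make that balance concrete, an assertion that demands one exact sentence from an LLM will fail constantly, while one that checks for a keyword or pattern tolerates normal variance. Here's a minimal sketch of the difference, assuming a hypothetical .chat-response element and a /chat route:

const { test, expect } = require('@playwright/test');

test('chatbot reply contains the keyword we asked for', async ({ page }) => {
  await page.goto('/chat'); // hypothetical route, for illustration only

  // Brittle: the model will rarely reproduce this exact sentence
  // await expect(page.locator('.chat-response')).toHaveText('Hello! How can I help you today?');

  // Robust: only require that the reply contains the keyword we explicitly asked for
  await expect(page.locator('.chat-response')).toContainText(/hello/i);
});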
Architecture of Eidolon AI
Recently I've been contributing to the open source AI agent framework Eidolon AI. The Eidolon team noted that one of their highest priority needs for the project was just a simple, full end-to-end test for one of their many AI agent examples. The tech stack for their simplest examples includes a MongoDB database, a server that's built with a Dockerfile (eidolon-server), and a standalone Next.js UI that's also built with a Dockerfile (webui). A Docker Compose file at the base of the repository orchestrates each of these components together.
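The actual Compose file lives in the Eidolon repository; the sketch below only illustrates the shape of that orchestration, and the image tags, build paths, and ports are assumptions rather than the project's exact values:

# docker-compose.yml (illustrative sketch, not the project's actual file)
services:
  mongo:
    image: mongo
    ports:
      - "27017:27017"
  eidolon-server:
    build: ./eidolon-server   # hypothetical build context
    depends_on:
      - mongo
  webui:
    build: ./webui            # hypothetical build context
    ports:
      - "3000:3000"
    depends_on:
      - eidolon-server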
This is what the webui looks like after it's built and running:
Adding a Test to Eidolon AI
I wanted my first E2E test on Eidolon to target the example chatbot. The chatbot was an ideal example to target with E2E tests because it requires the database, server, and front-end application to coordinate, but it doesn't require any additional services outside the scope of the existing Docker Compose.
I decided the best way to test the chatbot example would be to use Playwright with GitHub Actions. Playwright is an excellent way to add end-to-end testing to modern applications because we can configure it to hook into a running Docker instance, and it also provides granular ways to target different parts of the DOM, such as selecting a chatbot's text box.
GitHub Actions is the ideal choice as a CI tool because Eidolon was already orchestrating the server and front end together in other GitHub Actions workflow files, and GitHub Actions has a useful action called upload-artifact that uploads the screenshots and results of the test as artifacts, so we can see exactly why a test failed.
Configuring Playwright
Install
Unlike ordinary packages, when installing Playwright we have to install the package itself as well as the browser binaries:
pnpm install --save-dev @playwright/test@latest
pnpm exec playwright install --with-deps
We need the browser binaries so that Playwright can see and control different browsers programmatically.

Next, we tell Playwright where to find our tests and how to reach the running application in a playwright.config.js file:
const { defineConfig } = require('@playwright/test');

module.exports = defineConfig({
  // Where in the repo Playwright should search for our tests
  testDir: './tests',
  // Where the test results should be stored
  outputDir: 'tests/test-results',
  // Important for CI to specify a timeout, or a hung test could stall the run
  timeout: 30000,
  // Retry flaky tests before reporting a failure
  retries: 2,
  use: {
    // When running our tests we don't open the browser (headless)
    headless: true,
    baseURL: 'http://localhost:3000',
    // Upon test failure a screenshot of the front end will be saved
    screenshot: 'only-on-failure',
  },
  webServer: {
    // We need to launch a dev server before running the tests
    command: 'pnpm docker-compose up',
    // Working directory where we run the above command
    cwd: '../../..',
    port: 3000,
    timeout: 120000,
    // If a server is already running, use that for tests
    reuseExistingServer: true,
  },
});
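With the config in place, a quick sanity check is to ask Playwright to list the tests it discovers without running them:

pnpm exec playwright test --list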
Adding a Test
Now that we've configured Playwright to properly search for our tests and know where our front end and server are running, we can add our first test. Since we're testing the chatbot, the most basic E2E test we can run is to ensure the chatbot responds to basic input. Let's break down the chatbot.test.js file which we'll add in our tests directory as specified in the Playwright config:
const { test, expect } = require('@playwright/test');

// Test to check if the chatbot responds to input
test('Chatbot should respond to input', async ({ page }) => {
  await page.goto('/eidolon-apps/sp/chatbot');

  // If the user is not logged in, log in with a random email
  if (await page.locator('text=Eidolon Demo Cloud').isVisible()) {
    const randomEmail = `test${Math.random().toString(36).substring(7)}@example.com`;
    await page.fill('input[id="input-username-for-credentials-provider"]', randomEmail);
    await page.click('button[type="submit"]');
  }

  // Add a chat
  const addChatButton = page.locator('text=Add Chat');
  await addChatButton.click();

  // Wait for the chat input to appear
  const inputField = page.locator('textarea[aria-invalid="false"]');
  await inputField.waitFor();

  // Fill the input field with a message and submit it
  await inputField.fill('Hello, how are you? Type "Hello!" if you are there!');
  await page.locator('button[id="submit-input-text"]').click();

  // Wait for the expected response to show up on the page
  const response = page.getByText('Hello!', { exact: true });
  await response.waitFor();
  await expect(response).toBeVisible();
  await expect(response).toContainText('Hello!');
});
Let's assume that login is disabled or already handled, and focus on the core flow: adding a chat, filling in text, and waiting for a response.
When we're writing tests, how do we know which elements to target in the DOM? Since we downloaded the browser binaries, we can run Playwright in debug mode and use the browser's developer tools to see exactly which elements are good candidates. The command to run Playwright in debug mode is the following:
pnpm exec playwright test --debug
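Playwright also bundles a code generator that records your clicks in a real browser and suggests locators, which can be a faster way to discover candidate selectors (this assumes the stack is already running on port 3000):

pnpm exec playwright codegen http://localhost:3000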
This is what the Playwright Inspector looks like when it's open. Note that we can use the "Step Over" button that I've highlighted in red to go line by line through our test file and see exactly how our front end changes at each step of our test:
At this point, we can move forward to enter the chatbot app, click "Add Chat" (note that it's an async operation), and select the text area where we'll write our prompt for the chatbot:
When stepping over each step, the Playwright Inspector highlights the relevant locator. In this example we target the element by its element type (textarea) and a unique attribute (its aria-invalid value). It would be even better if our element had a unique ID so that we didn't need to target it by a combination of these two descriptors, but for now it'll suffice.
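For example, if the webui later exposed a dedicated test id on the chat input (the data-testid value below is hypothetical), the locator would collapse to a single, unambiguous call:

// Hypothetical: requires the front end to render <textarea data-testid="chat-input" ...>
const inputField = page.getByTestId('chat-input');
await inputField.fill('Hello, how are you?');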
The URL in the browser has a unique process ID for the current conversation (66b58...). The process ID is already a good sign that our server, front end, and database are working together, because a new process is created with a POST request and then fetched with a GET request.
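If we wanted the test to check that round trip explicitly, Playwright can assert the URL against a pattern once the conversation is created; the path shape and ID format below are assumptions about the webui's routing rather than verified values:

// Assumes the conversation URL ends in a hex process ID, e.g. .../chatbot/66b58...
await expect(page).toHaveURL(/\/chatbot\/[0-9a-f]+/);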
In the final step of our test, we fill in the textarea element with the fill() function, then target the submit arrow by its element type and id (button[id="submit-input-text"]). Finally, we use the async waitFor() function to wait for the application to respond (this is why a timeout is important), and we search the whole text of the page to find whether the "Hello!" response we're expecting appears:
Our test is successful and the Chromium browser closes automatically.
In our front end's package.json we can add some scripts to run Playwright automatically.
I added the following:
"build": "pnpm run build-eidolon && next build",
"docker:up": "docker-compose up -d",
"docker:down": "docker-compose down",
"docker:build": "docker-compose build",
"docker:rebuild": "pnpm run docker:down && pnpm run docker:build && pnpm run docker:up",
"test:e2e": "pnpm run docker:rebuild && pnpm exec playwright test && pnpm run docker:down",
"playwright:debug": "pnpm exec playwright test --debug",
We now have the test:e2e command to run our E2E tests from the CI, and the playwright:debug command to debug tests locally.
GitHub Actions Workflow
Now that our test is working correctly, we need it to run in CI. Let's add an e2e.yml file in the .github/workflows directory at the base of our repo. For brevity, I'll only link to the full workflow file and include the relevant snippets below.
Specify Workflow Trigger
To ensure our E2E workflow file doesn't run unnecessarily, we specify exactly when it should run using GitHub's workflow syntax:
on:
  # Lets us trigger the workflow manually
  workflow_dispatch:
  # Triggered on pushes to main, certain paths, and pull requests
  push:
    branches: [main]
    paths:
      - '**'
      - '!k8s-operator/**'
  pull_request:
    paths:
      - '**'
      - '!k8s-operator/**'
Architecture of the test-e2e Job
We'll run just one job, test-e2e, and it includes runs-on, services, env, and steps. The basic skeleton looks like this:
jobs:
  test-e2e:
    runs-on: ubuntu-latest
    services:
      mongo:
    env:
    steps:
Because the server depends on a healthy MongoDB database, we run the database service first and then run the Docker scripts to spin up the server and front end later. The env section includes variables related to NextAuth.js and an OpenAI key. Finally, we get to the main portion of the workflow, the steps.
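The full job definition is in the linked workflow file; as a rough sketch, a MongoDB service container with a health check might look something like this (the image tag and health command are assumptions, not the repository's exact values):

services:
  mongo:
    image: mongo:7
    ports:
      - 27017:27017
    options: >-
      --health-cmd "mongosh --eval 'db.runCommand({ ping: 1 })'"
      --health-interval 10s
      --health-timeout 5s
      --health-retries 5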
- name: Upload Playwright test results
  uses: actions/upload-artifact@v3
  if: always()
  with:
    name: playwright-results
    path: |
      webui/apps/eidolon-ui2/tests/test-results
    if-no-files-found: ignore

- name: Upload Playwright screenshots
  uses: actions/upload-artifact@v3
  if: always()
  with:
    name: playwright-screenshots
    path: webui/apps/eidolon-ui2/tests/screenshots
    if-no-files-found: ignore
The trickiest part to remember about the above workflow steps is that we need to include if: always() to ensure the CI runs these steps even if a previous step fails (otherwise we would never reach the upload steps when a test fails).
Now when a test fails during our workflow, we can view the Playwright test results by downloading playwright-results, a zip file that contains the relevant images.
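You can download playwright-results from the workflow run's summary page, or, if you have the GitHub CLI installed and authenticated, pull it down directly (replace <run-id> with the failing run's ID):

gh run download <run-id> -n playwright-results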
A test failure screenshot taken by Playwright:
Conclusion
There are plenty of best practices that would take another article to go over in detail, like getting consistent LLM outputs, targeting DOM elements effectively, and adding retry/timeout flows in workflow files. For now you can get started adding simple end-to-end tests for your own AI application using just Playwright and GitHub Actions.
If you're interested in contributing to Eidolon AI the team is always looking for contributors—feel free to join their Discord.
If you'd like help adding end-to-end testing for your own application or just want to say hello, feel free to reach out!