Claude webapp-testing skill: a 10-recipe Playwright cookbook
Ten real Playwright recipes (smoke, auth, visual regression, network mocks, mobile viewport, OAuth popups, perf budget, a11y audit, a GitHub Actions CI matrix, and a shared auth fixture), each as a single Claude prompt with the exact code it produces.
Already know what skills are? Skip to the cookbook. First time? Read the explainer then come back. Need the install? It’s on the /skills/webapp-testing page.

On this page
- What this skill does
- The cookbook
- Install + README
- Watch it built
- 01 · Smoke test: homepage loads + primary CTA works
- 02 · Auth flow: login form + session storage + logout
- 03 · Visual regression with toHaveScreenshot()
- 04 · Network mocks via page.route()
- 05 · Mobile viewport with devices['iPhone 13']
- 06 · OAuth popup handling
- 07 · Performance assertion via the Paint Timing API
- 08 · Accessibility audit with @axe-core/playwright
- 09 · CI matrix: chromium / firefox / webkit on GitHub Actions
- 10 · Shared user fixture with test.extend
- Community signal
- The contrarian take
- Real suites shipped
- Gotchas
- Pairs well with
- FAQ
- Sources
What this skill actually does
Sixty seconds of context before the cookbook — what the webapp-testing skill is, what Claude returns when you invoke it, and the one thing it does NOT do for you.
“Toolkit for interacting with and testing local web applications using Playwright. Supports verifying frontend functionality, debugging UI behavior, capturing browser screenshots, and viewing browser logs.”
— anthropics/skills · skills/webapp-testing/SKILL.md · /skills/webapp-testing
What Claude returns
You ask in natural language; Claude returns a runnable Playwright spec — almost always a TypeScript *.spec.ts that imports `test, expect` from `@playwright/test`, drives `page.goto`, `page.locator`, and `page.getByRole`, asserts with `expect(locator).toHaveText(...)` and `expect(page).toHaveScreenshot()` for visual diffs, stubs upstreams via `page.route(...)`, and configures projects (chromium, firefox, webkit, `devices['iPhone 13']`) plus retries in `playwright.config.ts`. The whole file runs unmodified under `npx playwright test`.
What it does NOT do
It does not install Playwright for you — run `npm i -D @playwright/test && npx playwright install` first. It also does not boot your dev server (the helper `scripts/with_server.py` does that, and the contrarian section flags a security gotcha there).
How you trigger it
- write a Playwright test for the login flow
- smoke-test the homepage in chromium and webkit
- add a visual regression test for the pricing page
Cost when idle
~100 tokens
The cookbook
Each entry below is a Playwright test you could ship this week. They run in the order I’d teach them — the early ones (smoke, auth, visual) are reusable on every project, the later ones lean on Playwright features you only need when the shape gets specific (CDP for perf, axe-core for a11y, page.route for flaky-API isolation, popup events for OAuth). Every entry pairs with one or two skills or MCP servers you already have on mcp.directory.
One trade-off worth naming up front: this skill is a competitor to the Playwright MCP server. The skill is ~120 idle tokens (just the SKILL.md description). The MCP ships a tool schema for every browser action and that schema lives in the context window every turn. Pick the skill when each test is a fresh script and idle cost matters; pick the MCP when one agent needs to drive a long-lived browser session across many turns. The contrarian section covers the security trade-off that comes with the skill route.
Install + README
If the skill isn’t on your machine yet, here’s the one-liner. The full install panel (Codex, Copilot, Antigravity variants) lives on the skill page. The README below is the raw SKILL.md from anthropics/skills/webapp-testing — same source the install pulls from.
One-line install · by anthropics
mkdir -p .claude/skills/webapp-testing && curl -L -o skill.zip "https://mcp.directory/api/skills/download/47" && unzip -o skill.zip -d .claude/skills/webapp-testing && rm skill.zip
Installs to .claude/skills/webapp-testing
Watch it built
A practical walkthrough of Claude Code driving Playwright end-to-end — useful before the cookbook because it anchors what the feedback loop feels like (fresh script per task, screenshots and console output back to the agent).
Smoke test: homepage loads + primary CTA works
One test that fails fast in CI when the homepage 500s, the hero CTA is missing, or the route it points at is broken.
For: Every web team. Run on every PR.
The prompt
Write a Playwright smoke test in tests/smoke/homepage.spec.ts. It should: load http://localhost:3000, assert the page has a visible h1, click the primary CTA labelled 'Get started', and assert the URL ends with /signup. Tag the test '@smoke' so we can filter it in CI.
What homepage.spec.ts looks like
import { test, expect } from '@playwright/test';

test('homepage loads and CTA navigates to signup @smoke', async ({ page }) => {
  await page.goto('http://localhost:3000');
  // A visible h1 shows the page rendered rather than returning an error shell.
  await expect(page.getByRole('heading', { level: 1 })).toBeVisible();
  await page.getByRole('link', { name: 'Get started' }).click();
  // The regex anchors on the end of the URL, so any origin passes.
  await expect(page).toHaveURL(/\/signup$/);
});
One-line tweak
Run only this in CI's fast lane with `npx playwright test --grep @smoke`. Add a second assertion on `expect(page).toHaveTitle(/YourBrand/)` if you have brand-name regression worries.
Auth flow: login form + session storage + logout
Verify the full sign-in loop: form validation, redirect on success, session cookie set, logout clears it.
For: Anyone with a credentialed app. Pairs naturally with `storageState` reuse for downstream tests.
The prompt
Write tests/auth/login.spec.ts. Fill the email and password inputs (use getByLabel), click 'Sign in', expect URL to become /dashboard, expect a cookie named 'session' to exist. Then click 'Log out' and expect the cookie to be gone. Save the authenticated state to playwright/.auth/user.json with page.context().storageState.
What login.spec.ts looks like
import { test, expect } from '@playwright/test';

test('user can sign in, persist session, and sign out', async ({ page, context }) => {
  // Relative URLs assume `baseURL` is set in playwright.config.ts.
  await page.goto('/login');
  await page.getByLabel('Email').fill('[email protected]');
  await page.getByLabel('Password').fill(process.env.TEST_PASSWORD!);
  await page.getByRole('button', { name: 'Sign in' }).click();
  await expect(page).toHaveURL('/dashboard');
  expect((await context.cookies()).some(c => c.name === 'session')).toBe(true);
  // Save cookies and storage while the session is still live.
  await context.storageState({ path: 'playwright/.auth/user.json' });
  await page.getByRole('button', { name: 'Log out' }).click();
  // Poll: the click can resolve before the cookie has actually been cleared.
  await expect
    .poll(async () => (await context.cookies()).some(c => c.name === 'session'))
    .toBe(false);
});
One-line tweak
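Reuse `storageState: 'playwright/.auth/user.json'` in `playwright.config.ts` so downstream specs skip the login UI entirely; on login-heavy suites the speed-up can reach 5–10x. If your server invalidates the session on logout, save the state from a step that does not log out, otherwise the file holds a dead session.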
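A minimal sketch of how that reuse is often wired, assuming a hypothetical `tests/auth.setup.ts` that logs in once and writes the state file. The project names and the `testMatch` pattern are illustrative; `testMatch`, `dependencies`, and `use.storageState` are standard Playwright project options. The config is written as a plain object so the snippet stands alone; in a real `playwright.config.ts` you would normally pass it to `defineConfig` and export it as the default.

```typescript
// Sketch of the `projects` part of playwright.config.ts.
// Assumption: a hypothetical tests/auth.setup.ts logs in once and writes AUTH_FILE.
const AUTH_FILE = 'playwright/.auth/user.json';

const config = {
  projects: [
    // Only files matching *.setup.ts belong to this project.
    { name: 'setup', testMatch: /.*\.setup\.ts/ },
    {
      name: 'chromium',
      // Every test in this project starts with the saved cookies and storage.
      use: { storageState: AUTH_FILE },
      // Playwright runs the 'setup' project before this one.
      dependencies: ['setup'],
    },
  ],
};

// In playwright.config.ts: export default defineConfig(config);
```

Because the setup file never logs out, the saved session stays valid for every project that depends on it.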
Visual regression with toHaveScreenshot()
Pixel-diff the rendered page against a committed baseline; fail when a CSS change quietly moves your hero by 4px.
For: Design-system teams, marketing pages, anything where unintended visual drift is a bug.
The prompt
Write tests/visual/pricing.spec.ts. Navigate to /pricing, wait for networkidle, then call expect(page).toHaveScreenshot('pricing.png'). Mask the dynamic '€/$' price block with a CSS selector. Set maxDiffPixels: 100.
What pricing.spec.ts looks like
import { test, expect } from '@playwright/test';

test('pricing page matches the committed baseline', async ({ page }) => {
  await page.goto('/pricing');
  await page.waitForLoadState('networkidle');
  await expect(page).toHaveScreenshot('pricing.png', {
    // Masked regions are painted over before the comparison, so changing prices don't fail the diff.
    mask: [page.locator('[data-test="price-amount"]')],
    maxDiffPixels: 100,
    animations: 'disabled',
  });
});
One-line tweak
Generate baselines on the same OS that runs CI — never locally. `npx playwright test --update-snapshots --project=chromium` from a Linux runner avoids the macOS-vs-Ubuntu anti-aliasing trap.
Network mocks via page.route()
Stub flaky upstream APIs so the test verifies your UI, not someone else's uptime.
For: Any frontend that talks to a third-party (Stripe, Algolia, GitHub API, internal microservices).
The prompt
Write tests/mocks/search.spec.ts. Intercept GET /api/search?q=* and respond with a fixture of three results. Type 'react' into the search input and assert the three result titles appear.
What search.spec.ts looks like
import { test, expect } from '@playwright/test';

test('search renders results from a stubbed API', async ({ page }) => {
  // Register the route before navigating so the first request is already intercepted.
  await page.route('**/api/search?q=*', async route => {
    await route.fulfill({
      status: 200,
      contentType: 'application/json',
      body: JSON.stringify({ hits: [
        { id: '1', title: 'React docs' },
        { id: '2', title: 'React Router' },
        { id: '3', title: 'React Query' },
      ] }),
    });
  });
  await page.goto('/search');
  await page.getByPlaceholder('Search').fill('react');
  await expect(page.getByRole('listitem')).toHaveCount(3);
  await expect(page.getByText('React Router')).toBeVisible();
});
One-line tweak
Move the fixture to `tests/fixtures/search.json` and `import searchFixture from '../fixtures/search.json'` so the response stays diffable in PRs.
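If a static JSON file turns out to be too rigid, for example when several specs need different result counts, one alternative is a small typed builder. This is a sketch only: `buildSearchResponse` and the two interfaces are hypothetical names, and the `{ hits: [...] }` shape is copied from the stub above rather than from any real API.

```typescript
// Shape of the stubbed /api/search response used in the test above.
interface SearchHit {
  id: string;
  title: string;
}

interface SearchResponse {
  hits: SearchHit[];
}

// Hypothetical helper: turns a list of titles into a response with sequential ids.
function buildSearchResponse(titles: string[]): SearchResponse {
  return {
    hits: titles.map((title, index) => ({ id: String(index + 1), title })),
  };
}

// The string you would hand to route.fulfill({ body }).
const searchBody = JSON.stringify(
  buildSearchResponse(['React docs', 'React Router', 'React Query']),
);
```

The interfaces keep the stub honest: if the UI starts reading a new field, the type is the one place to add it.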
Mobile viewport with devices['iPhone 13']
Catch the regressions that only happen at 390×844 — overflow, touch targets too close, off-canvas nav broken.
For: Anyone whose mobile traffic is more than 30% of conversions.
The prompt
Add a 'mobile' project to playwright.config.ts using devices['iPhone 13']. Write tests/mobile/nav.spec.ts that opens the hamburger, taps 'Pricing', and asserts the URL is /pricing.
What the config and spec look like
// playwright.config.ts
import { defineConfig, devices } from '@playwright/test';

export default defineConfig({
  projects: [
    { name: 'chromium', use: { ...devices['Desktop Chrome'] } },
    { name: 'mobile', use: { ...devices['iPhone 13'] } },
  ],
});

// tests/mobile/nav.spec.ts
import { test, expect } from '@playwright/test';

test('mobile nav opens and routes to /pricing', async ({ page }) => {
  await page.goto('/');
  // tap() needs hasTouch, which the iPhone 13 descriptor enables; run this spec with --project=mobile.
  await page.getByRole('button', { name: 'Open menu' }).tap();
  await page.getByRole('link', { name: 'Pricing' }).tap();
  await expect(page).toHaveURL('/pricing');
});
One-line tweak
Run mobile-only with `npx playwright test --project=mobile`. Add `devices['Pixel 7']` as a third project to catch Chrome-on-Android-specific issues.
OAuth popup handling
Test the 'Sign in with Google/GitHub' flow without flaking — the popup must be awaited BEFORE the click that opens it.
For: Any product with social sign-in.
The prompt
Write tests/oauth/github.spec.ts. Click 'Continue with GitHub'. Use page.waitForEvent('popup') CREATED BEFORE the click. In the popup, fill the GitHub login fixture and click Authorize. Expect the original page to land at /dashboard.
What github.spec.ts looks like
import { test, expect } from '@playwright/test';

test('github OAuth popup completes and returns to /dashboard', async ({ page }) => {
  await page.goto('/login');
  const popupPromise = page.waitForEvent('popup'); // create BEFORE click
  await page.getByRole('button', { name: 'Continue with GitHub' }).click();
  const popup = await popupPromise;
  await popup.waitForLoadState();
  await popup.getByLabel('Username').fill(process.env.GH_USER!);
  await popup.getByLabel('Password').fill(process.env.GH_PW!);
  await popup.getByRole('button', { name: 'Sign in' }).click();
  await popup.getByRole('button', { name: 'Authorize' }).click();
  // The assertion is on the original page, not the popup.
  await expect(page).toHaveURL('/dashboard');
});
One-line tweak
For deterministic CI runs, stub the `/oauth/callback` endpoint with `page.route` and never hit the real GitHub IdP — see use case 4 for the pattern.
Performance assertion via the Paint Timing API
Fail the build when First Contentful Paint regresses past a budget — without bolting on a Lighthouse CI process.
For: Teams with a perf budget already (LCP < 2.5s, CLS < 0.1).
The prompt
Write tests/perf/landing.spec.ts. Navigate to /, wait until the page is settled, then read the browser-side `performance.getEntriesByType('paint')` API to assert First Contentful Paint < 1500 ms.
What landing.spec.ts looks like
import { test, expect } from '@playwright/test';

test('landing page FCP stays under 1.5s', async ({ page }) => {
  await page.goto('/', { waitUntil: 'networkidle' });
  // Runs inside the browser: read the paint entry that the Paint Timing API recorded.
  const fcp = await page.evaluate(() => {
    const entry = performance
      .getEntriesByType('paint')
      .find((e) => e.name === 'first-contentful-paint');
    return entry ? entry.startTime : null;
  });
  console.log('FCP (ms):', fcp);
  // Fail loudly if the browser reported no FCP entry at all.
  expect(fcp).not.toBeNull();
  expect(fcp!).toBeLessThan(1500);
});
One-line tweak
Pair with the `perf-lighthouse` skill if you want full Core Web Vitals (LCP, INP, CLS) — the Paint Timing API is the cheap floor; Lighthouse is the audit.
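Once more than one metric is in play, it can help to keep the budget in one object and compare against it in a single helper. The sketch below is illustrative only: `PerfBudget` and `overBudget` are hypothetical names, the thresholds are the ones quoted in this entry (FCP < 1500 ms, LCP < 2.5 s, CLS < 0.1), and the measured values are made up.

```typescript
type Metric = 'fcp' | 'lcp' | 'cls';

// fcp and lcp are in milliseconds; cls is unitless.
type PerfBudget = Record<Metric, number>;

const budget: PerfBudget = { fcp: 1500, lcp: 2500, cls: 0.1 };

// Returns one message per metric that is missing or not strictly under its limit.
function overBudget(
  measured: Partial<Record<Metric, number>>,
  limits: PerfBudget,
): string[] {
  const failures: string[] = [];
  for (const key of Object.keys(limits) as Metric[]) {
    const value = measured[key];
    if (value === undefined) {
      failures.push(`${key}: not measured`);
    } else if (value >= limits[key]) {
      failures.push(`${key}: ${value} >= ${limits[key]}`);
    }
  }
  return failures;
}

// Made-up measurements: only LCP is over its limit.
const failures = overBudget({ fcp: 1200, lcp: 2600, cls: 0.05 }, budget);
```

In a spec, asserting that the returned array is empty gives one failure message that names every metric over budget.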
Accessibility audit with @axe-core/playwright
Run axe against every key page; fail the build on serious or critical violations.
For: Anyone who needs WCAG 2.1 AA. The skill prompt also nudges Claude to fix obvious violations in the same PR.
The prompt
Install @axe-core/playwright. Write tests/a11y/dashboard.spec.ts that navigates to /dashboard, runs AxeBuilder({ page }).analyze(), and fails if any 'serious' or 'critical' violation is reported.
What dashboard.spec.ts looks like
import { test, expect } from '@playwright/test';
import AxeBuilder from '@axe-core/playwright';

test('dashboard has no serious or critical a11y violations', async ({ page }) => {
  await page.goto('/dashboard');
  // Limit the scan to the WCAG 2 A and AA rule sets.
  const results = await new AxeBuilder({ page })
    .withTags(['wcag2a', 'wcag2aa'])
    .analyze();
  const blocking = results.violations.filter(
    v => v.impact === 'serious' || v.impact === 'critical'
  );
  // The second argument becomes the failure message, so the report lists every blocking violation.
  expect.soft(blocking, JSON.stringify(blocking, null, 2)).toEqual([]);
});
One-line tweak
`expect.soft` records the failure without stopping the test, so any assertions you add after it still run, and the JSON message lists every blocking violation from the same run. The test still fails.
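The filter in the spec above can be pulled into a helper that also produces a shorter failure message than raw JSON. This is a sketch under assumptions: `ViolationLike`, `blockingViolations`, and `summarize` are hypothetical names, the local type covers only the fields read here (`id`, `impact`, `nodes`), and the sample data is made up.

```typescript
// Minimal local types: only the fields this helper reads from an axe result.
type Impact = 'minor' | 'moderate' | 'serious' | 'critical';

interface ViolationLike {
  id: string;
  impact?: Impact | null;
  nodes: unknown[];
}

// Keep only violations whose impact is in the blocking list.
function blockingViolations(
  violations: ViolationLike[],
  blockOn: Impact[] = ['serious', 'critical'],
): ViolationLike[] {
  return violations.filter((v) => v.impact != null && blockOn.includes(v.impact));
}

// One readable line per violation: rule id, impact, and affected node count.
function summarize(violations: ViolationLike[]): string[] {
  return violations.map(
    (v) => `${v.id} (${v.impact ?? 'unknown'}): ${v.nodes.length} node(s)`,
  );
}

// Made-up sample in the shape of results.violations.
const sample: ViolationLike[] = [
  { id: 'color-contrast', impact: 'serious', nodes: [{}, {}] },
  { id: 'region', impact: 'moderate', nodes: [{}] },
  { id: 'button-name', impact: 'critical', nodes: [{}] },
];

const report = summarize(blockingViolations(sample));
```

Passing a different `blockOn` list lets a team ratchet the bar, for example blocking on `'moderate'` once the serious and critical findings are cleared.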
CI matrix: chromium / firefox / webkit on GitHub Actions
Run every spec across the three engines on every push, with traces uploaded on failure.
For: Any team that ships to users on Safari (i.e., everyone).
The prompt
Generate .github/workflows/playwright.yml. Matrix over the three projects (chromium, firefox, webkit). Cache Playwright browsers. Upload playwright-report/ as an artifact on failure.
What playwright.yml looks like
# .github/workflows/playwright.yml
name: Playwright Tests
on: [push, pull_request]
jobs:
  test:
    runs-on: ubuntu-latest
    strategy:
      fail-fast: false
      matrix:
        project: [chromium, firefox, webkit]
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with: { node-version: 20, cache: 'npm' }
      - run: npm ci
      - run: npx playwright install --with-deps ${{ matrix.project }}
      - run: npx playwright test --project=${{ matrix.project }}
      - if: ${{ failure() }}
        uses: actions/upload-artifact@v4
        # Block mapping here: `${{ }}` inside a `{ ... }` flow mapping is not valid YAML.
        with:
          name: playwright-report-${{ matrix.project }}
          path: playwright-report/
          retention-days: 7
One-line tweak
Add `--shard=${{ matrix.shard }}/4` and a second matrix axis `shard: [1, 2, 3, 4]` once your suite passes ~5 minutes wall-clock.
Shared user fixture with test.extend
Stop re-logging-in at the start of every spec. Define one authenticated fixture; every downstream test inherits it.
For: Any suite over ~20 tests where login boilerplate is now the slowest part of CI.
The prompt
Refactor the suite. Create tests/fixtures/auth.ts that exports a custom test with an `authedPage` fixture. Use storageState from playwright/.auth/user.json (saved by use case 2). Then rewrite tests/dashboard.spec.ts to import from the fixture and skip the login UI entirely.
What the fixture and spec look like
// tests/fixtures/auth.ts
import { test as base, expect, type Page } from '@playwright/test';

type Fixtures = { authedPage: Page };

export const test = base.extend<Fixtures>({
  authedPage: async ({ browser }, use) => {
    // A fresh context seeded with the saved cookies and storage.
    const context = await browser.newContext({ storageState: 'playwright/.auth/user.json' });
    const page = await context.newPage();
    await use(page);
    // Runs after the test body finishes.
    await context.close();
  },
});
export { expect };

// tests/dashboard.spec.ts
import { test, expect } from './fixtures/auth';

test('authed user sees their org name', async ({ authedPage }) => {
  await authedPage.goto('/dashboard');
  await expect(authedPage.getByTestId('org-name')).toHaveText('Acme Inc.');
});
One-line tweak
Add a second fixture `adminPage` that loads `playwright/.auth/admin.json` for tests that need elevated permissions — the pattern composes.
Community signal
Three voices from engineering blogs written by teams that moved from Cypress to Playwright. The first is about free parallel runs; the second is a productivity estimate; the third is the flakiness that a migration tends to expose.
“Playwright runs tests in parallel by default for free, whereas Cypress performs parallelization only for different machines through a paid feature.”
BigBinary engineering team · Blog
Why-we-switched post-mortem; the single biggest reason teams move once Claude is authoring tests — parallel runs are free.
“I'm likely 50–100% more productive in Playwright than I was in Cypress.”
Michael Lynch · Blog
Honest cost/benefit on Cypress→Playwright migration; useful counterweight to the contrarian section below.
“Playwright exposed problems that Cypress's automatic retries and auto-waiting masked — if tests had race conditions, Playwright exposed them consistently.”
21RISK engineering · Blog
The flakiness story most teams discover the week after they migrate.
The contrarian take
Not everyone is sold on skills for Playwright. The most honest critique on the launch thread came from Michael Lynch (mtlynch.io):
“I have a personal appreciation for Cypress as an open-source company, and in particular, Gleb Bahmutov, their VP of Engineering.”
Michael Lynch (mtlynch.io) · Blog
From the Show HN thread on Playwright skills.
He’s right, and the official anthropics/webapp-testing skill has the receipts. Issue #1021 documents a working command-injection in scripts/with_server.py: the wrapper called subprocess.Popen(server['cmd'], shell=True, ...) on a string that came straight from a CLI argument. With shell=True, a value like "python server.py; touch /tmp/pwned" executed two commands instead of one. The reporter (nobuhiro-sasaki) summarised the threat model bluntly: “When this script is invoked by an AI agent from a prompt-driven workflow, a malicious or injected --server value can execute arbitrary shell commands on the host.”
The fix (PR #1039) is a one-line swap from shell=True to shlex.split(server['cmd']) plus an explicit --cwd argument. The lesson is bigger than the patch: when an agent assembles --server values from prompts, README snippets, or tool output, shell=True is a footgun. AftHurrahWinch’s core point holds — an MCP server is a deterministic surface; a skill is markdown a model interprets. That’s fine for a Playwright spec, but for the wrapper that boots your dev server, audit the version, pin to a commit that contains the fix, and never let untrusted text reach a shell-mode subprocess.
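The reason argv-style splitting closes the hole can be shown without spawning anything. The sketch below is not the patch from PR #1039 (that is Python and uses `shlex.split`); it is a simplified TypeScript analogue with a hypothetical `splitArgv` helper that handles whitespace and simple quotes only, with no backslash escapes. The point is that `;` stays inside a token as data, so a process spawner that takes a file plus an argument array, with no shell in between, never sees a second command.

```typescript
// Simplified tokenizer: splits on whitespace, honours '...' and "..." quoting.
// Assumption: no backslash escapes and no shell expansions of any kind.
function splitArgv(command: string): string[] {
  const tokens: string[] = [];
  let current = '';
  let quote: string | null = null;
  let inToken = false;

  for (const ch of command) {
    if (quote !== null) {
      // Inside quotes: everything is literal until the matching quote.
      if (ch === quote) quote = null;
      else current += ch;
    } else if (ch === '"' || ch === "'") {
      quote = ch;
      inToken = true;
    } else if (/\s/.test(ch)) {
      if (inToken) {
        tokens.push(current);
        current = '';
        inToken = false;
      }
    } else {
      current += ch;
      inToken = true;
    }
  }
  if (quote !== null) throw new Error('unterminated quote');
  if (inToken) tokens.push(current);
  return tokens;
}

// The injected value from the issue report: with a shell this is two commands.
const [file, ...args] = splitArgv('python server.py; touch /tmp/pwned');
// Without a shell it is one program ('python') and three plain arguments.
```

Here `file` is `python` and `args` is `['server.py;', 'touch', '/tmp/pwned']`: the server may fail to start because of the odd arguments, but nothing else runs.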
If determinism matters more than idle cost, the Playwright MCP server is the honest alternative — the tool schemas are static, the actions are well-typed, the trust boundary is the MCP transport. Pick it when you’re running an agent loop across many turns against the same browser session, and reach for this skill when each test is a one-shot script.
Real suites shipped
Concrete examples of teams running Claude + Playwright in anger. None of these are pure-marketing — they cite spec counts, target apps, and visible diffs.
- Microsoft VS Code — desktop smoke tests being migrated from custom drivers to Playwright Electron
- Microsoft Playwright VS Code extension — its own e2e suite is Playwright (`playwright.config.ts` in repo root)
- Disney testing framework — open-source Python + Playwright suite covering home and login flows
- Adobe Analytics tracking — open-source Playwright suite intercepting Adobe Analytics beacons via `page.route`
- Adobe Experience Manager Cloud — Playwright is the supported UI test runner shipped with Cloud Manager
- Microsoft Azure App Testing — managed Playwright cloud for enterprise CI matrices
Gotchas (the four that bite)
Sourced from the anthropics/skills issue tracker and the Show HN thread for Playwright skills.
with_server.py + untrusted --server is a shell-injection vector
Issue #1021 — pre-fix versions used shell=True. Pin to the commit that includes PR #1039 (shlex.split) and never feed --server values built from prompt or tool output.
Skills aren't deterministic; the wrapper still has to be
The model can choose how to author a Playwright spec, but the helper script that boots your dev server should be locked-down code, not LLM-generated. Treat with_server.py as the trust boundary.
page.waitForEvent('popup') must be created BEFORE the click
Awaiting waitForEvent after the click that opens the popup is a classic race: the popup event can fire before the listener is registered, and then the promise never resolves. Use the popupPromise pattern in use case 6.
toHaveScreenshot baselines drift across OS
A baseline captured on macOS won't match the same page on Ubuntu CI because anti-aliasing differs. Either commit per-platform baselines or generate them in CI and never locally. Set maxDiffPixels: 100 as a starting tolerance.
Pairs well with
Curated to match the cookbook’s actual integrations: the Playwright-adjacent skills (playwright-cli, writing-playwright-tests, playwright-pro) plus the perf and a11y skills the later use cases lean on. The natural cross-link is the Flutter skill cookbook — Flutter web apps and these Playwright tests pair perfectly for a full-stack agentic test suite.
Two posts that compose well with this cookbook: What are Claude Code skills? covers the underlying mechanism, and Claude Code best practices covers the orchestration patterns the longer use cases (8, 10) lean on.
Frequently asked questions
What is the webapp-testing skill, and how is it different from the Playwright MCP server?
The webapp-testing skill is Anthropic's official SKILL.md that teaches Claude how to author and run Playwright scripts on demand against a local dev server. The Playwright MCP server keeps a long-lived browser session and exposes browser actions as MCP tools — its tool schemas live in the context window every turn. Reach for the skill when each test is a fresh script and idle cost matters; reach for the MCP when the agent benefits from persistent state across many turns.
Chrome DevTools MCP vs Playwright MCP — which one pairs with this skill?
They solve adjacent problems. Playwright MCP wraps page-level automation (click, fill, navigate, screenshot). Chrome DevTools MCP exposes lower-level CDP primitives (Performance.getMetrics, Coverage, Tracing). Use case 7 in this cookbook needs neither: it reads the browser's Paint Timing API through page.evaluate. If a test does need protocol-level metrics, Playwright's context.newCDPSession (Chromium only) is the cheapest path. If you already run Playwright tests, stay there; only add Chrome DevTools MCP when you need protocol-level surface the skill cannot reach.
Does the webapp-testing skill author Python or TypeScript Playwright?
The official Anthropic SKILL.md leans on Python with sync_playwright() because the helper script (scripts/with_server.py) is a Python wrapper that owns the dev-server lifecycle. The 10 cookbook entries above are written in TypeScript — that's the more common stack on the frontend side, and the patterns map 1:1. Tell Claude which language you want; the skill will follow.
Is there a known security issue with the webapp-testing skill I should know about?
Yes. Issue #1021 in anthropics/skills documents a shell-injection in scripts/with_server.py: it called subprocess.Popen with shell=True on a CLI string. The fix (PR #1039) replaces shell=True with shlex.split. Audit your version, pin to the patched commit, and never let untrusted text reach a --server value. The contrarian section above covers this in detail.
What is the playwright-cli skill, and do I need it on top of webapp-testing?
playwright-cli is a sibling skill that teaches Claude the npx playwright test command surface — flags, projects, shards, reporters. webapp-testing focuses on authoring a Playwright script and lifecycle-managing the server it tests. They compose: webapp-testing writes the spec, playwright-cli runs it. Most teams want both installed.
How do I avoid flaky tests when Claude authors the suite?
Three rules the cookbook prompts already enforce: prefer page.getByRole / page.getByLabel over CSS selectors (they re-bind on copy edits), wait on networkidle or a visible state instead of a fixed timeout, and use page.route to stub external APIs in any test that doesn't explicitly verify the upstream. If a test still flakes, run it with --trace=on and read the trace before adding retries — retries hide the bug.
Why is 'webapp testing' getting impressions on Google but no clicks?
The bare 'webapp testing' query is too broad — it surfaces every Playwright tutorial on the web. This blog targets the long-tail variants that map to the Anthropic skill specifically: 'webapp testing skill', 'webapp-testing skill', 'webapp testing claude skill', plus the playwright-cli and chrome-devtools-mcp comparison cluster where the Anthropic SKILL.md is the right answer.
Sources
Primary
- anthropics/webapp-testing SKILL.md (the skill manifest)
- Playwright official documentation
- Playwright network mocking (page.route)
- Playwright visual comparisons (toHaveScreenshot)
- Playwright accessibility testing with axe-core
- Playwright CDPSession API
- Playwright + GitHub Actions
Community
- BigBinary engineering team — Blog
- Michael Lynch — Blog
- 21RISK engineering — Blog
- Playwright official docs — Blog
- Debbie O'Brien (@debs_obrien) — X / Twitter
- Debbie O'Brien · NDC London — Blog
Critical and contrarian
- Michael Lynch (mtlynch.io) on skills vs. MCPs and command-injection vectors
- anthropics/skills #1021 — with_server.py command injection
- anthropics/skills #1039 — the shlex.split fix
Internal