e2e-test is an end-to-end testing practice skill for critical user journeys. It supports designing E2E coverage strategy, handling flaky tests, defining CI gates, and turning exploratory verification into maintainable automated tests. Its three main strengths are:

- A clear tool path: Agent Browser is preferred for exploration and reproduction, then Playwright or the project's native test framework for code.
- Built-in environment gates, runner selection, and result-strength control, enabling honest degradation across tech stacks instead of rigid templates.
- Structured output plus machine-readable JSON for test governance, triage, and CI integration.
This evaluation reviews the e2e-test skill along two axes: actual task performance and token cost-effectiveness. Three scenarios were designed (E2E journey coverage, flaky test triage, CI gate design), each run in both with-skill and without-skill configurations (3 scenarios × 2 configs = 6 independent subagent runs) and scored against 39 assertions.
Special challenge: issue2md is a pure Go web app with no Node.js/Playwright/package.json, while e2e-test favors Playwright. This tests the skill’s environment adaptation and degradation strategy.
| Evaluation | Improvement | Key driver |
|---|---|---|
| Eval 2: Flaky triage | (highest) | Triage flow depends heavily on structured methodology; the baseline lacks it |
| Eval 1: E2E journey | +46.7% | Gate coverage + Output Contract + environment degradation decision record |
| Eval 3: CI design | +33.3% (lowest) | CI design is a model strength; the skill mainly adds Gates and JSON |
Flaky triage is where the skill adds the most value—the baseline can find root causes and suggest fixes but lacks triage methodology (reproduce → classify → fix/quarantine) and stability proof requirements (-count=20 verification).
This is the most distinctive dimension in this evaluation. The skill is designed for Playwright first, but when faced with a pure Go project:
| Dimension | With Skill | Without Skill |
|---|---|---|
| Runner selection decision | Explicit rationale (no Node.js, no package.json, Constitution constraint) | Implicit choice of Go HTTP tests (no decision record) |
| Degradation path | "Generate the strongest deliverable the environment can support" → Go HTTP | Naturally chose Go (no degradation concept) |
| Playwright code | Explicitly rejected ("Installing Playwright would violate the constitution") | Not considered (no relevant context) |
Analysis: With-skill’s Operating Model §5 ("Produce only the strongest deliverable the environment can actually support") correctly guided the degradation decision. The skill did not blindly generate Playwright code; after the Environment Gate confirmed the toolchain was missing, it chose the Go HTTP path. The degradation rationale was explicitly recorded, which matters for PR review and team alignment.
This is the highest-value dimension of the skill—with-skill covered all 5 Gates in all 3 scenarios; without-skill missed multiple Gates in all 3.
| Gate | With Skill (3 scenarios) | Without Skill (3 scenarios) |
|---|---|---|
| Configuration Gate | 3/3 | 0/3 |
| Environment Gate | 3/3 | 1/3 (Eval 2 partial) |
| Execution Integrity Gate | 3/3 | 0/3 |
| Stability Gate | 2/2 (Eval 2, 3) | 0/2 |
| Side-Effect Gate | 2/2 (Eval 1, 2) | 0/2 |
Practical value: the Gate system prevents three common errors:

1. False execution claims: the Execution Integrity Gate ensures "Not run" is explicitly labeled.
2. Treating a single pass as a fix: the Stability Gate requires `-count=20` verification.
3. Missing config dependencies: the Configuration Gate lists all variables with their available/missing/unknown status.
Without-skill reports were not low quality (Eval 3's CI strategy was thorough), but they lacked standardized structure. This means:

- Report format varies by task type
- CI/tooling cannot consume results programmatically
- Results from multiple runs are hard to compare
| Dimension | With Skill | Without Skill |
|---|---|---|
| Root-cause analysis | 3 contributing factors + Local vs CI comparison table | 4 factors (more detailed) |
| Fix suggestions | 3 fixes + impact ranking | 3 fixes + CI workflow patch |
| Reproduction command | `go test ... -count=10` | No `-count` command |
| Stability verification | "Validation requires: -count=20 with 20/20 pass rate on CI runner" | No stability requirement |
| Quarantine strategy | Template with owner, due date, status | No quarantine discussion |
Analysis: Root-cause quality was comparable (both identified the `go run` compile step colliding with the 3s timeout). Without-skill, however, lacked a triage methodology framework. The skill's Flaky Test Policy ("reproduce with repeat runs → classify → fix → quarantine only with owner, issue, and removal deadline") provides a complete process guarantee.
Analysis: Without-skill showed strong baseline ability in CI design: it produced a tiered strategy, found the swagger generation bug, and provided a detailed Rollout Plan. The skill's added value here lies mainly in structured Gate validation and machine-readable output.
Token cost-effectiveness drags the overall score down: SKILL.md itself is cost-effective, but the Playwright-specific reference content has no value for non-JS projects.