e2e-test Skill Evaluation Report

Evaluation framework: skill-creator
Evaluation date: 2026-03-11
Evaluation target: e2e-test


e2e-test is an end-to-end testing practice skill for critical user journeys. It supports designing E2E coverage strategy, handling flaky tests, defining CI gates, and turning exploratory verification into maintainable automated tests. Its three main strengths are:

  • A clear tool path: Agent Browser is preferred for exploration and reproduction, then Playwright or the project’s native test framework for test code.
  • Honest degradation across tech stacks instead of rigid templates, through built-in environment gates, runner selection, and result-strength control.
  • Structured output plus machine-readable JSON for test governance, triage, and CI integration.

1. Evaluation Overview

This evaluation reviews the e2e-test skill along two axes: actual task performance and token cost-effectiveness. Three scenarios were designed (E2E journey coverage, flaky test triage, CI gate design). Each scenario was run with both with-skill and without-skill configurations, for 3 scenarios × 2 configs = 6 independent subagent runs, scored against 39 assertions.

Special challenge: issue2md is a pure Go web app with no Node.js/Playwright/package.json, while e2e-test favors Playwright. This tests the skill’s environment adaptation and degradation strategy.

| Dimension | With Skill | Without Skill | Delta |
| --- | --- | --- | --- |
| Assertion pass rate | 39/39 (100%) | 20/39 (51.3%) | +48.7 pp |
| 5 Gate coverage | 3/3 scenarios full | 0/3 | Skill-only |
| Output Contract structured report | 3/3 | 0/3 | Skill-only |
| Machine-Readable JSON | 3/3 | 0/3 | Skill-only |
| Quality Scorecard | 1/1 (Eval 1) | 0/1 | Skill-only |
| Environment adaptation (Go ← Playwright degradation) | Correct degradation + rationale | Naturally chose Go (no skill guidance) | Skill provides decision record |
| Skill token cost (SKILL.md only) | ~2,800 tokens | 0 | |
| Skill token cost (typical load) | ~8,580 tokens | 0 | |
| Token cost per 1% pass-rate gain | ~57 tokens (SKILL.md only) / ~176 tokens (typical) | | |

2. Test Methodology

2.1 Scenario Design

| Scenario | Goal | Core focus | Assertions |
| --- | --- | --- | --- |
| Eval 1: E2E journey coverage | Create E2E tests for web convert flow | Environment adaptation (pure Go vs Playwright skill), Gate coverage, test quality | 15 |
| Eval 2: Flaky test triage | Triage intermittently failing SwaggerRedirect test in CI | Triage flow, root-cause classification, stability verification, Gate coverage | 12 |
| Eval 3: CI gate design | Design E2E CI strategy | Trigger strategy, secret handling, artifact collection, retry strategy | 12 |

2.2 Special Challenge: Playwright Skill vs. Go Project

issue2md’s characteristics make it a boundary test scenario:

| issue2md characteristic | e2e-test expectation |
| --- | --- |
| No Node.js / package.json | Skill prefers Playwright (Node.js) |
| No client-side JavaScript | Skill has many DOM selector/wait rules |
| Go html/template server-side rendering | Skill assumes SPA/SSR (Next.js, React, Vue) |
| Existing Go HTTP client E2E tests | Skill recommends Playwright code path |

This tests the skill’s degradation ability—when the preferred toolchain does not apply, can it correctly identify and choose an alternative?

2.3 Execution

  • With-skill runs load SKILL.md and selectively load reference files
  • Without-skill runs load no skill; model default behavior
  • All runs execute in independent subagents in parallel

3. Assertion Pass Rate

3.1 Summary

| Scenario | Assertions | With Skill | Without Skill | Delta |
| --- | --- | --- | --- | --- |
| Eval 1: E2E journey coverage | 15 | 15/15 (100%) | 8/15 (53.3%) | +46.7 pp |
| Eval 2: Flaky test triage | 12 | 12/12 (100%) | 4/12 (33.3%) | +66.7 pp |
| Eval 3: CI gate design | 12 | 12/12 (100%) | 8/12 (66.7%) | +33.3 pp |
| Total | 39 | 39/39 (100%) | 20/39 (51.3%) | +48.7 pp |
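The deltas in the summary can be recomputed from the raw assertion counts. The following is a minimal sketch of that arithmetic; the function names `pass_rate` and `delta_pp` are illustrative and not part of the evaluation framework.

```python
# Assertion counts per scenario, taken from the summary table:
# (total assertions, passed with skill, passed without skill).
results = {
    "Eval 1: E2E journey coverage": (15, 15, 8),
    "Eval 2: Flaky test triage": (12, 12, 4),
    "Eval 3: CI gate design": (12, 12, 8),
}


def pass_rate(passed: int, total: int) -> float:
    """Pass rate as a percentage."""
    return 100.0 * passed / total


def delta_pp(total: int, with_skill: int, without_skill: int) -> float:
    """Difference between the two pass rates, in percentage points."""
    return pass_rate(with_skill, total) - pass_rate(without_skill, total)


for name, (total, with_skill, without_skill) in results.items():
    print(f"{name}: {delta_pp(total, with_skill, without_skill):+.1f} pp")

# Column-wise sums give the totals row: (39, 39, 20).
overall = tuple(sum(column) for column in zip(*results.values()))
print(f"Total: {delta_pp(*overall):+.1f} pp")
```

Because the two rates are both percentages, their difference is a percentage-point gap rather than a relative improvement.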

3.2 Per-Assertion Details

Eval 1: E2E Journey Coverage (15 assertions)

| # | Assertion | With | Without | Notes |
| --- | --- | --- | --- | --- |
| A1 | Configuration gate structured table | ✅ | ❌ | Without mentioned gating but no structured var table |
| A2 | Environment gate evaluation | ✅ | ❌ | Without had no explicit env evaluation |
| A3 | Execution integrity gate | ✅ | ❌ | Without did not state whether tests ran |
| A4 | Correctly identifies no Playwright | ✅ | ✅ | |
| A5 | Does not blindly generate Playwright code | ✅ | ✅ | |
| A6 | Generates appropriate Go E2E tests | ✅ | ✅ | |
| A7 | No guessed secrets/URLs | ✅ | ✅ | |
| A8 | Tests cover convert flow | ✅ | ✅ | |
| A9 | Tests cover error path | ✅ | ✅ | |
| A10 | No unconditional sleep/waitForTimeout | ✅ | ✅ | |
| A11 | Data isolation explicitly stated | ✅ | ❌ | Without did not document data isolation |
| A12 | Output Contract structured report | ✅ | ❌ | Without only brief report |
| A13 | Machine-readable JSON | ✅ | ❌ | Without no JSON summary |
| A14 | Identifies existing E2E tests | ✅ | ✅ | |
| A15 | Next actions provided | ✅ | ❌ | Without no next actions |

Eval 2: Flaky Test Triage (12 assertions)

# Assertion With Without Notes
B1 Follows triage sequence (reproduce, classify, fix/quarantine)
B2 Root-cause classification labeled
B3 Provides reproduction command (with -count) Without no -count reproduction command
B4 Configuration gate Without no config gate analysis
B5 Environment gate Both compared local vs CI
B6 Execution integrity gate Without did not state whether tests ran
B7 No false execution claims
B8 Concrete fix suggestions
B9 Output contract Without no structured output
B10 Artifact strategy Without did not discuss artifacts
B11 Stability gate (single pass ≠ stable) Without no -count=20 stability verification
B12 Side-effect gate Without no side-effect analysis

Eval 3: CI Gate Design (12 assertions)

| # | Assertion | With | Without | Notes |
| --- | --- | --- | --- | --- |
| C1 | Configuration gate | ✅ | ❌ | Without no structured config analysis |
| C2 | Environment gate | ✅ | ❌ | Without no explicit env gate |
| C3 | CI strategy doc (blocking vs nightly) | ✅ | ✅ | Both provided tiered trigger strategy |
| C4 | Artifact collection config | ✅ | ✅ | |
| C5 | GitHub Actions workflow YAML | ✅ | ✅ | |
| C6 | Retry/flaky strategy | ✅ | ✅ | |
| C7 | Output contract | ✅ | ❌ | Without no structured output |
| C8 | Machine-readable JSON | ✅ | ❌ | Without no JSON summary |
| C9 | Identifies existing CI targets | ✅ | ✅ | Both found swagger generation gap |
| C10 | Service startup strategy | ✅ | ✅ | |
| C11 | Parallel vs serial rationale | ✅ | ✅ | |
| C12 | Next actions | ✅ | ✅ | Without had Rollout Plan |

3.3 Classification of 19 Without-Skill Failures

| Failure type | Count | Evals | Notes |
| --- | --- | --- | --- |
| 5 Mandatory Gates missing | 9 | All | Configuration Gate 3×, Environment Gate 2×, Execution Integrity 2×, Stability Gate 1×, Side-Effect Gate 1× |
| Output Contract missing | 3 | All | No structured table for task type/runner/env gate/execution status |
| Machine-Readable JSON missing | 3 | All | No CI/tooling-consumable JSON summary |
| Data isolation not documented | 1 | Eval 1 | No explicit data isolation statement |
| Reproduction command incomplete | 1 | Eval 2 | No -count reproduction command |
| Next actions missing | 1 | Eval 1 | No next-actions list |
| Artifact strategy missing | 1 | Eval 2 | Triage report did not discuss trace/artifact |

3.4 Trend: Skill Advantage by Task Type

| Scenario type | With-Skill advantage | Reason |
| --- | --- | --- |
| Eval 2: Flaky triage | +66.7 pp (highest) | Triage flow depends heavily on structured methodology; baseline lacks it |
| Eval 1: E2E journey | +46.7 pp | Gate coverage + Output Contract + env degradation decision record |
| Eval 3: CI design | +33.3 pp (lowest) | CI design is a model strength; skill mainly adds Gate and JSON |

Flaky triage is where the skill adds the most value—the baseline can find root causes and suggest fixes but lacks triage methodology (reproduce → classify → fix/quarantine) and stability proof requirements (-count=20 verification).


4. Dimension-by-Dimension Comparison

4.1 Environment Adaptation (Core Differentiator)

This is the most distinctive dimension in this evaluation. The skill is designed for Playwright first, but when faced with a pure Go project:

| Dimension | With Skill | Without Skill |
| --- | --- | --- |
| Runner selection decision | Explicit rationale (no Node.js, no package.json, Constitution constraint) | Implicit choice of Go HTTP tests (no decision record) |
| Degradation path | "Generate the strongest deliverable the environment can support" → Go HTTP | Naturally chose Go (no degradation concept) |
| Playwright code | Explicitly rejected ("Installing Playwright would violate the constitution") | Not considered (no relevant context) |

Analysis: The skill’s Operating Model §5 ("Produce only the strongest deliverable the environment can actually support") correctly guided the degradation decision. The with-skill run did not blindly generate Playwright code; after the Environment Gate confirmed the toolchain was missing, it chose the Go HTTP path. The degradation rationale was explicitly recorded, which matters for PR review and team alignment.
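The degradation decision described above can be sketched as a small function that returns both the chosen runner and the recorded reasons. This is an illustration only: `select_runner`, the runner labels, and the exact checks are assumptions, not the skill's actual Environment Gate logic.

```python
def select_runner(files: set[str], node_available: bool) -> tuple[str, list[str]]:
    """Pick the strongest runner the environment supports and record why.

    `files` is the set of file names found at the project root;
    `node_available` says whether a Node.js runtime was detected.
    Both inputs and the labels below are hypothetical.
    """
    blockers = []
    if not node_available:
        blockers.append("no Node.js runtime")
    if "package.json" not in files:
        blockers.append("no package.json")

    if not blockers:
        return "playwright", ["Node.js toolchain present"]
    if "go.mod" in files:
        # Preferred toolchain missing: degrade to native Go HTTP tests.
        return "go-http", blockers
    return "none", blockers + ["no go.mod"]


runner, rationale = select_runner({"go.mod", "main.go"}, node_available=False)
print(runner, rationale)
```

Returning the rationale alongside the choice is what separates a recorded degradation decision from an implicit one.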

4.2 Five Mandatory Gates Coverage

This is the highest-value dimension of the skill: with-skill covered every applicable Gate in all 3 scenarios, while without-skill missed multiple Gates in all 3.

| Gate | With Skill (3 scenarios) | Without Skill (3 scenarios) |
| --- | --- | --- |
| Configuration Gate | 3/3 | 0/3 |
| Environment Gate | 3/3 | 1/3 (Eval 2 partial) |
| Execution Integrity Gate | 3/3 | 0/3 |
| Stability Gate | 2/2 (Eval 2, 3) | 0/2 |
| Side-Effect Gate | 2/2 (Eval 1, 2) | 0/2 |

Practical value: The Gate system prevents three common errors:

  1. False execution claims: the Execution Integrity Gate ensures "Not run" is explicitly labeled
  2. Single pass = fix: the Stability Gate requires -count=20 verification
  3. Missing config dependencies: the Configuration Gate lists all variables and their available/missing/unknown status
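As a rough illustration of the Configuration Gate and Execution Integrity Gate, the sketch below classifies required variables and labels execution status. The function names and the variable names `GITHUB_TOKEN` and `BASE_URL` are hypothetical; the skill's real gate format is not shown in this report.

```python
from typing import Mapping, Optional


def configuration_gate(required: list[str],
                       env: Optional[Mapping[str, str]]) -> dict[str, str]:
    """Classify each required variable as available, missing, or unknown.

    Passing env=None models an environment that cannot be inspected
    (for example CI secrets that are invisible from a local run).
    """
    if env is None:
        return {name: "unknown" for name in required}
    return {name: "available" if env.get(name) else "missing"
            for name in required}


def execution_status(ran: bool, exit_code: Optional[int] = None) -> str:
    """Never report a result for tests that were not actually executed."""
    if not ran:
        return "Not run"
    return "Passed" if exit_code == 0 else "Failed"


# Hypothetical variable names, for illustration only.
print(configuration_gate(["GITHUB_TOKEN", "BASE_URL"],
                         {"BASE_URL": "http://localhost:8080"}))
print(execution_status(ran=False))
```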

4.3 Output Contract and Machine-Readable JSON

With-skill outputs included:

| Structure | Eval 1 | Eval 2 | Eval 3 |
| --- | --- | --- | --- |
| Output Contract table | ✅ 9 fields | ✅ 9 fields | ✅ 9 fields |
| Machine-Readable JSON | ✅ | ✅ | ✅ |
| Quality Scorecard | ✅ (C1–C4, S1–S6, H1–H4) | N/A | N/A |

Without-skill reports were not low quality (Eval 3’s CI strategy was thorough), but they lacked a standardized structure. This means:

  • Report format varies by task type
  • CI/tooling cannot consume results programmatically
  • Results from multiple runs are hard to compare
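To make "consume results programmatically" concrete, here is a minimal sketch of a JSON summary and a CI-side check. The four field names mirror the Output Contract items named in this report (task type, runner, env gate, execution status), but the exact keys, values, and the `ci_should_block` helper are assumptions rather than the skill's real schema.

```python
import json

# Hypothetical summary emitted at the end of a skill run.
summary = {
    "task_type": "flaky-triage",
    "runner": "go-http",
    "environment_gate": "pass",
    "execution_status": "Not run",
}
payload = json.dumps(summary, indent=2)


def ci_should_block(raw: str) -> bool:
    """Block the pipeline unless tests actually ran and passed."""
    report = json.loads(raw)
    return report.get("execution_status") != "Passed"


print(payload)
print("block merge:", ci_should_block(payload))
```

A fixed set of keys is also what makes results from multiple runs comparable, since each run can be diffed field by field.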

4.4 Flaky Triage Methodology (Eval 2 Deep Dive)

This is where the with-skill advantage was largest (+66.7 pp).

| Dimension | With Skill | Without Skill |
| --- | --- | --- |
| Triage template | Standardized Flaky Triage Template (test name, env, frequency, category checkboxes) | Free-form analysis |
| Root-cause depth | 3 contributing factors + Local vs CI comparison table | 4 factors (more detailed) |
| Fix suggestions | 3 fixes + impact ranking | 3 fixes + CI workflow patch |
| Reproduction command | go test ... -count=10 | No -count command |
| Stability verification | "Validation requires: -count=20 with 20/20 pass rate on CI runner" | No stability requirement |
| Quarantine strategy | Template with owner, due date, status | No quarantine discussion |

Analysis: Root-cause quality was comparable (both found go run compile + 3s timeout). Without-skill lacked a triage methodology framework. The skill’s Flaky Test Policy ("reproduce with repeat runs → classify → fix → quarantine only with owner, issue, and removal deadline") provides a complete process guarantee.
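The "reproduce with repeat runs" and stability-proof steps can be sketched as a thin wrapper around `go test`, using its standard `-run` and `-count` flags. The test name `TestSwaggerRedirect` and the package pattern `./...` are assumptions; the report only names the test informally and elides the actual command.

```python
import subprocess


def stability_command(test_name: str, package: str = "./...",
                      count: int = 20) -> list[str]:
    """Build a go test invocation that repeats one test `count` times."""
    return ["go", "test", "-run", f"^{test_name}$", f"-count={count}", package]


def is_stable(cmd: list[str]) -> bool:
    """True only if every repeated run passed.

    go test exits non-zero when any iteration fails, so a zero exit
    code after -count=20 corresponds to a 20/20 pass rate.
    """
    return subprocess.run(cmd).returncode == 0


# Hypothetical test name; printing the command keeps it reviewable.
cmd = stability_command("TestSwaggerRedirect")
print(" ".join(cmd))
```

Calling `is_stable(cmd)` requires a Go toolchain and should be run on the CI runner, since that is where the report says the 20/20 requirement applies.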

4.5 CI Strategy Design (Eval 3 Deep Dive)

This is where without-skill came closest to with-skill (+33.3 pp gap).

| Dimension | With Skill | Without Skill |
| --- | --- | --- |
| Tiered trigger strategy | ✅ Detailed ASCII diagram + per-tier budget | ✅ Table + detailed rationale |
| Token handling | ✅ Security Checklist (5 items) | ✅ Two-tier matrix |
| Swagger generation gap | ✅ Found and fixed | ✅ Found and fixed |
| Quarantine rules | ✅ 4 rules | ✅ Brief mention |
| Rollout plan | None | ✅ 7-phase rollout plan |
| Mandatory Gates table | ✅ | ❌ |
| JSON summary | ✅ | ❌ |

Analysis: Without-skill showed strong baseline ability in CI design—it designed a tiered strategy, found the swagger generation bug, and provided a detailed Rollout Plan. The skill’s increment is mainly in structured Gate validation and machine-readable output.


5. Token Cost-Effectiveness

5.1 Skill Size

| File | Lines | Words | Bytes | Est. tokens |
| --- | --- | --- | --- | --- |
| SKILL.md | 439 | 1,946 | 13,912 | ~2,800 |
| references/checklists.md | 152 | 824 | 5,528 | ~1,200 |
| references/playwright-patterns.md | 220 | 691 | 6,428 | ~1,000 |
| references/playwright-deep-patterns.md | 825 | 2,898 | 24,581 | ~4,200 |
| references/environment-and-dependency-gates.md | 181 | 943 | 6,275 | ~1,350 |
| references/agent-browser-workflows.md | 191 | 893 | 6,812 | ~1,300 |
| references/golden-examples.md | 247 | 1,018 | 8,997 | ~1,500 |
| scripts/discover_e2e_needs.sh | 215 | 755 | 6,413 | ~1,100 |
| Description (always in context) | | ~50 | | ~60 |
| Total | 2,470 | 10,018 | 78,946 | ~14,510 |

5.2 Actual Load Scenarios

| Scenario | Files read | Total tokens |
| --- | --- | --- |
| Eval 1: E2E journey | SKILL.md + checklists + playwright-patterns + env-gates + golden-examples | ~7,850 |
| Eval 2: Flaky triage | SKILL.md + checklists + env-gates + golden-examples | ~6,850 |
| Eval 3: CI design | SKILL.md + checklists + playwright-deep + env-gates + golden-examples | ~11,050 |
| Typical average | | ~8,580 |
| Full load (all refs) | SKILL.md + all 6 references | ~13,350 |
| Minimal load | SKILL.md only | ~2,800 |
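The load totals follow directly from the per-file estimates in section 5.1. A short sketch of that bookkeeping is below; the short file keys are abbreviations chosen for this example.

```python
# Estimated tokens per file, from the skill size table.
TOKENS = {
    "SKILL.md": 2800,
    "checklists": 1200,
    "playwright-patterns": 1000,
    "playwright-deep": 4200,
    "env-gates": 1350,
    "agent-browser-workflows": 1300,
    "golden-examples": 1500,
}

# Files each with-skill run actually read.
LOADS = {
    "Eval 1": ["SKILL.md", "checklists", "playwright-patterns",
               "env-gates", "golden-examples"],
    "Eval 2": ["SKILL.md", "checklists", "env-gates", "golden-examples"],
    "Eval 3": ["SKILL.md", "checklists", "playwright-deep",
               "env-gates", "golden-examples"],
}


def load_tokens(files: list[str]) -> int:
    """Sum the estimated tokens for one load scenario."""
    return sum(TOKENS[name] for name in files)


per_eval = {name: load_tokens(files) for name, files in LOADS.items()}
typical = sum(per_eval.values()) / len(per_eval)
full_load = sum(TOKENS.values())

print(per_eval)
print(f"typical average: {typical:.0f}, full load: {full_load}")
```

The unrounded typical average is about 8,583 tokens, which the report rounds to ~8,580.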

5.3 Token Cost vs. Quality Gain

| Metric | Value |
| --- | --- |
| With-skill pass rate | 100% (39/39) |
| Without-skill pass rate | 51.3% (20/39) |
| Pass-rate gain | +48.7 pp |
| Token cost per assertion fixed | ~147 tokens (SKILL.md only) / ~451 tokens (typical) |
| Token cost per 1% pass-rate gain | ~57 tokens (SKILL.md only) / ~176 tokens (typical) |
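Both cost metrics are simple ratios of tokens loaded to quality gained. The sketch below reproduces them from the figures in this section; the variable names are illustrative, and the results match the table only up to rounding.

```python
TOTAL_ASSERTIONS = 39
PASSED_WITH = 39
PASSED_WITHOUT = 20

SKILL_MD_TOKENS = 2800   # SKILL.md only
TYPICAL_TOKENS = 8580    # typical average load

# Assertions that pass only with the skill loaded.
assertions_fixed = PASSED_WITH - PASSED_WITHOUT
# Pass-rate gain in percentage points.
gain_pp = 100.0 * assertions_fixed / TOTAL_ASSERTIONS

for label, tokens in [("SKILL.md only", SKILL_MD_TOKENS),
                      ("typical", TYPICAL_TOKENS)]:
    per_assertion = tokens / assertions_fixed
    per_point = tokens / gain_pp
    print(f"{label}: {per_assertion:.1f} tok/assertion, "
          f"{per_point:.1f} tok per 1% gain")
```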

5.4 Comparison with Other Skills

| Metric | e2e-test | thirdparty-api-integration-test | api-integration-test | go-makefile-writer | git-commit |
| --- | --- | --- | --- | --- | --- |
| SKILL.md tokens | ~2,800 | ~680 | ~1,800 | ~1,960 | ~1,120 |
| Typical load tokens | ~8,580 | ~2,050 | ~2,850 | ~4,600 | ~1,120 |
| Pass-rate gain | +48.7 pp | +33.3 pp | +36.8 pp | +31.0 pp | +22.7 pp |
| Tokens per 1% (SKILL.md) | ~57 tok | ~20 tok | ~49 tok | ~63 tok | ~51 tok |
| Tokens per 1% (typical) | ~176 tok | ~62 tok | ~77 tok | ~149 tok | ~51 tok |

Analysis:

  • Highest absolute gain (+48.7 pp): the e2e-test assertion delta (19) is the largest in the series
  • SKILL.md cost-effectiveness is good (~57 tok/1%), similar to git-commit (~51 tok) and api-integration-test (~49 tok)
  • Typical-load cost is the highest in the series (~176 tok/1%): the reference volume is large (6 files, ~10,550 tokens), much of it Playwright-specific

5.5 Token Segment Cost-Effectiveness

| Module | Est. tokens | Related assertion deltas | Cost-effectiveness |
| --- | --- | --- | --- |
| Mandatory Gates (5 × ~80 tok each) | ~400 | 9 (A1–A3, B4, B6, B11, B12, C1, C2) | Very high: 44 tok/assertion |
| Output Contract definition | ~200 | 3 (A12, B9, C7) | Very high: 67 tok/assertion |
| Machine-Readable JSON template | ~150 | 3 (A13, B8_partial, C8) | Very high: 50 tok/assertion |
| Flaky Test Policy | ~120 | 2 (B3, B11) | Very high: 60 tok/assertion |
| Quality Scorecard | ~400 | Indirect (Eval 1 scorecard output) | Medium |
| Anti-Examples (7 examples) | ~500 | Indirect (A10 no-sleep) | Low: most anti-examples not applicable to Go |
| Version/Platform Gate | ~250 | 0 | Low: not applicable to Go |
| Command Starters | ~100 | 0 | Low: Agent Browser commands not applicable |
| references/playwright-deep-patterns.md | ~4,200 | 0 direct | Low: pure Go project |
| references/playwright-patterns.md | ~1,000 | 0 direct | Low: pure Go project |
| references/golden-examples.md | ~1,500 | Indirect (report structure) | Medium |
| references/checklists.md | ~1,200 | Indirect (triage template) | High |
| references/environment-and-dependency-gates.md | ~1,350 | Indirect (env evaluation framework) | High |

5.6 Token Efficiency Rating

| Rating | Conclusion |
| --- | --- |
| Overall ROI | Good: ~8,580 tokens (typical) for a +48.7 pp pass-rate gain; highest absolute gain in series |
| SKILL.md ROI | Good: ~2,800 tokens, with cost-effectiveness (~57 tok/1%) on par with series |
| High-leverage token share | ~31% (870/2,800) directly contributes to 17/19 assertion deltas |
| Low-leverage token share | ~30% (850/2,800) contributes nothing in Go project evaluation (Playwright-specific) |
| Reference cost-effectiveness | Mixed: checklists + env-gates high value; playwright-patterns + deep-patterns no value for Go |

6. Boundary with Base Model Capabilities

6.1 Capabilities Base Model Already Has (No Skill Increment)

| Capability | Evidence |
| --- | --- |
| Choosing appropriate E2E tool (Go HTTP vs Playwright) | 3/3 scenarios chose Go HTTP |
| Root-cause depth (flaky test) | Eval 2: Found go run compile + 3s timeout dual factors |
| CI tiered trigger strategy design | Eval 3: PR/main/nightly tiers |
| Swagger generation gap discovery | Eval 3: Both found |
| Artifact upload YAML generation | Eval 3: Full actions/upload-artifact config |
| Secret handling (t.Skip when absent) | 3/3 scenarios correct |
| Serial vs parallel rationale | Eval 3: Detailed analysis |

6.2 Base Model Gaps (Skill Fills)

| Gap | Evidence | Risk level |
| --- | --- | --- |
| 5 Mandatory Gates largely missing | 3/3 scenarios had no structured gate analysis (only the Eval 2 local vs CI comparison counted toward the Environment Gate) | High: risk of false execution claims, missing config deps |
| Output Contract missing | 3/3 scenarios no standardized report structure | Medium: reports not reproducible or comparable |
| Machine-Readable JSON missing | 3/3 scenarios no JSON | Medium: CI/tooling cannot consume programmatically |
| Stability Gate missing | Eval 2 no -count=20 verification requirement | High: single pass claimed as fix |
| Data isolation not documented | Eval 1 no explicit statement | Low: code was isolated |
| Flaky triage methodology missing | Eval 2 no standard triage sequence | Medium: analysis quality depends on experience |
| Degradation decision not recorded | Eval 1 no runner choice rationale | Low: choice correct but no traceability |

7. Overall Score

7.1 Dimension Scores

| Dimension | With Skill | Without Skill | Delta |
| --- | --- | --- | --- |
| Gate coverage (5 gates) | 5.0/5 | 1.0/5 | +4.0 |
| Environment adaptation and degradation | 5.0/5 | 3.5/5 | +1.5 |
| Structured report & JSON | 5.0/5 | 1.0/5 | +4.0 |
| Test quality | 5.0/5 | 4.0/5 | +1.0 |
| Flaky triage methodology | 5.0/5 | 2.5/5 | +2.5 |
| CI design | 5.0/5 | 4.0/5 | +1.0 |
| Mean | 5.00/5 | 2.67/5 | +2.33 |

7.2 Weighted Total

| Dimension | Weight | Score | Weighted |
| --- | --- | --- | --- |
| Assertion pass rate (delta) | 25% | 10/10 | 2.50 |
| Gate coverage system | 20% | 10/10 | 2.00 |
| Structured report & JSON output | 15% | 10/10 | 1.50 |
| Flaky triage methodology | 10% | 10/10 | 1.00 |
| Environment adaptation | 10% | 10/10 | 1.00 |
| Token cost-effectiveness | 15% | 6.0/10 | 0.90 |
| CI design increment | 5% | 7.0/10 | 0.35 |
| Weighted total | | | 9.25/10 |
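The weighted total is the sum of each dimension's score multiplied by its weight. A minimal sketch of that calculation, using the weights and scores from the table, looks like this:

```python
# (weight in percent, score out of 10) per dimension.
dimensions = {
    "Assertion pass rate (delta)": (25, 10.0),
    "Gate coverage system": (20, 10.0),
    "Structured report & JSON output": (15, 10.0),
    "Flaky triage methodology": (10, 10.0),
    "Environment adaptation": (10, 10.0),
    "Token cost-effectiveness": (15, 6.0),
    "CI design increment": (5, 7.0),
}

# Weights are kept as integer percentages so the sanity check is exact.
total_weight = sum(weight for weight, _ in dimensions.values())
assert total_weight == 100, "weights must sum to 100%"

weighted_total = sum(weight * score
                     for weight, score in dimensions.values()) / 100

print(f"weighted total: {weighted_total:.2f}/10")
```

The two dimensions scored below 10 account for the entire 0.75-point gap to a perfect score: 0.60 from token cost-effectiveness and 0.15 from the CI design increment.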

Token cost-effectiveness lowers the total—SKILL.md cost-effectiveness is good, but Playwright-specific reference content has no value for non-JS projects.


8. Evaluation Materials

| Material | Path |
| --- | --- |
| Eval 1 with-skill output | /tmp/e2e-eval/eval-1/with_skill/ |
| Eval 1 without-skill output | /tmp/e2e-eval/eval-1/without_skill/ |
| Eval 2 with-skill output | /tmp/e2e-eval/eval-2/with_skill/ |
| Eval 2 without-skill output | /tmp/e2e-eval/eval-2/without_skill/ |
| Eval 3 with-skill output | /tmp/e2e-eval/eval-3/with_skill/ |
| Eval 3 without-skill output | /tmp/e2e-eval/eval-3/without_skill/ |