Skip to content

tdd-workflow Skill Evaluation Report

Evaluation framework: skill-creator Evaluation date: 2026-03-11 Evaluation subject: tdd-workflow


tdd-workflow is an end-to-end TDD skill for Go code changes, designed to put "write failing tests first, then minimal implementation, then safe refactor" into practice. It is especially suited for new features, bug fixes, and security-sensitive logic testing. Its three standout strengths are: requiring a one-to-one mapping between Defect Hypotheses and test cases, so you define "what bug to catch" before writing tests; enforcing a Red → Green → Refactor evidence chain so the TDD process is verifiable, not just claimed; and using Killer Cases plus coverage and risk-path gates to elevate tests from "runnable" to "capable of catching critical defects".

1. Evaluation Overview

This evaluation reviews the tdd-workflow skill along two axes: actual task performance and token cost-effectiveness. Three scenarios were designed (S-size yamlQuote boundary tests, M-size normalizeSummaryJSON three-function tests, M-size IsPrivateIPLiteral security tests). Each scenario was run with both with-skill and without-skill configurations, for 3 scenarios × 2 configs = 6 independent subagent runs, scored against 39 assertions.

Dimension With Skill Without Skill Delta
Assertion pass rate 39/39 (100%) 21/39 (53.8%) +46.2 pp
Defect Hypothesis → Test Mapping 3/3 scenarios with full mapping 0/3 Skill-only
Red → Green evidence 3/3 0/3 Skill-only
Killer Case mechanism 3/3 (6 killer cases total) 0/3 Skill-only
Output Contract structured report 3/3 0/3 Skill-only
Coverage report 3/3 0/3 Skill-only
Change Size classification 3/3 0/3 Skill-only
Skill Token cost (SKILL.md only) ~2,400 tokens 0
Skill Token cost (typical load) ~3,650 tokens 0
Token cost per 1% pass-rate gain ~52 tokens (SKILL.md only) / ~79 tokens (typical)

2. Test Methodology

2.1 Scenario Design

Scenario Target function Package Core focus Assertions
Eval 1: yamlQuote yamlQuote (4 LOC) converter S-size TDD cycle, Red evidence, boundary conditions, string escaping 12
Eval 2: normalizeSummaryJSON normalizeSummaryJSON + extractSummaryText + buildResponsesEndpoint converter M-size three-function tests, JSON parse boundaries, code fence handling 14
Eval 3: IsPrivateIPLiteral IsPrivateIPLiteral (12 LOC) urlutil Security-sensitive SSRF protection, IPv4/IPv6 dual-stack, RFC 1918 range boundaries 13

2.2 Rationale for Target Selection

  • All use stdlib assertions (project constitution forbids testify) — tests skill adaptation to project assertion style
  • Functions exist but lack direct unit tests — tests skill ability to handle "characterization testing" (adding tests to existing code)
  • Varying complexity — from 4 LOC pure functions to 12 LOC multi-branch security functions

2.3 Execution

  • With-skill runs first read SKILL.md and optionally load reference materials
  • Without-skill runs read no skill, using model default behavior
  • All runs execute in parallel in independent subagents

3. Assertion Pass Rate

3.1 Overview

Scenario Assertions With Skill Without Skill Delta
Eval 1: yamlQuote (S) 12 12/12 (100%) 6/12 (50.0%) +50.0%
Eval 2: normalizeSummaryJSON (M) 14 14/14 (100%) 8/14 (57.1%) +42.9%
Eval 3: IsPrivateIPLiteral (M) 13 13/13 (100%) 7/13 (53.8%) +46.2%
Total 39 39/39 (100%) 21/39 (53.8%) +46.2%

3.2 Per-Assertion Details

Eval 1: yamlQuote S-size (12 assertions)

# Assertion With Without Notes
A1 Change size classified as S Without has no size concept
A2 Defect hypothesis list (≥3) With has 7 items DH1–DH7
A3 Red evidence (failing tests before implementation) With shows 3/7 fail via mutation testing
A4 Green evidence (tests pass)
A5 Table-driven tests
A6 Boundary cases covered Without has more cases (15 vs 7)
A7 Killer case explicitly marked With marks single_quote as KILLER
A8 Stdlib assertions
A9 Test file co-located
A10 Output contract Without only has brief summary
A11 Coverage report With: yamlQuote 100%, package 83.5%
A12 No speculative production code

Eval 2: normalizeSummaryJSON M-size (14 assertions)

# Assertion With Without Notes
B1 Change size classification
B2 Defect hypothesis list (≥5) With has 15 items DH1–DH15
B3 Red evidence With records via characterization approach
B4 Green evidence
B5 Table-driven tests
B6 Happy path (valid JSON, code fence)
B7 Error paths (empty, non-JSON, malformed)
B8 Boundary (code fence with/without lang tag)
B9 Killer case explicitly marked With has 3 killer cases
B10 Stdlib assertions
B11 Output contract
B12 Coverage report With: 85.4% package, 100% target functions
B13 Reasonable test count
B14 No mock abuse

Eval 3: IsPrivateIPLiteral M-size (13 assertions)

# Assertion With Without Notes
C1 Change size classification
C2 Defect hypothesis list (≥4) With has 5 items H1–H5
C3 Red evidence
C4 Green evidence
C5 Table-driven tests
C6 IPv4 private ranges covered
C7 Public address returns false
C8 IPv6 loopback (::1) handling
C9 Non-IP hostname returns false
C10 Killer case With: IPv4-mapped IPv6 SSRF bypass test
C11 Stdlib assertions
C12 Output contract
C13 Coverage report With: 100% on function, 89.7% package

3.3 Classification of 18 Failed Assertions (Without-Skill)

Failure type Count Evals Notes
Change Size classification missing 3 All No S/M/L classification or test budget control
Defect Hypothesis missing 3 All No hypothesis–test mapping; tests lack theoretical basis
Red Evidence missing 3 All No evidence of tests failing before implementation
Killer Case missing 3 All No targeted tests for high-risk hypotheses
Output Contract missing 3 All Simple summary instead of structured deliverable
Coverage report missing 3 All No line coverage or risk-path coverage reported

Key observation: All 18 failures are TDD methodology artifacts, not test code quality issues. Without-skill test code quality is not low (Eval 1 even produced 15 test cases vs With-skill’s 7), but it lacks TDD process evidence and structured reports.

3.4 Delta Stability

Deltas across the three scenarios are highly consistent (+42.9% to +50.0%), indicating the skill’s contribution is not task-dependent but systematically injects six categories of TDD methodology artifacts.


4. Dimension-by-Dimension Comparison

4.1 Defect Hypothesis → Test Mapping (Core Differentiator)

This is the TDD skill’s most distinctive contribution — requiring hypotheses before tests.

Scenario With Skill Without Skill
Eval 1: yamlQuote 7 hypotheses: DH1(empty)→DH7(unicode), each mapped to test name 15 test cases, no hypothesis provenance
Eval 2: normalizeSummaryJSON 15 hypotheses across 3 functions, grouped by function 31 test cases, no hypotheses
Eval 3: IsPrivateIPLiteral 5 hypotheses: H1(mapped IPv6)→H5(unspecified), including SSRF attack hypotheses 36 test cases, boundary tests but no attack hypotheses

Practical value: Defect hypotheses are not just report decoration — they drive more targeted test design:

  • Eval 3 H1 (IPv4-mapped IPv6 bypass) is a test angle completely absent from Without-skill. ::ffff:127.0.0.1 and ::ffff:10.0.0.1 are real SSRF attack vectors; none of Without-skill’s 36 tests touch them.
  • Eval 2 DH5 (nested braces extraction boundary) is a key test for the Index/LastIndex algorithm; Without-skill has the test but no hypothesis rationale.

4.2 Red → Green Evidence

Scenario With Skill Without Skill
Eval 1 Mutation testing: remove ReplaceAll, 3/7 fail (precise red evidence) Direct "All 15 pass" (no red phase)
Eval 2 Characterization testing: run tests on existing code first to confirm behavior Direct "31 subtests, 0 failures"
Eval 3 Hypothesis-driven: killer cases presented as attack hypotheses Direct "ALL PASS"

Key difference: For characterization testing (adding tests to existing code), the skill still requires Red evidence — Eval 1 via mutation, Eval 2–3 via hypothesis. Without-skill only shows "all pass", so it cannot prove what the tests actually verify.

4.3 Killer Case Mechanism

With-skill produced 6 killer cases in total, each with a 4-part structure:

  1. Defect hypothesis — the specific defect hypothesis to verify or falsify
  2. Fault injection — how to trigger that defect (mutation or attack input)
  3. Critical assertion — the key assertion that must succeed
  4. Removal risk — risk if this test is removed
Eval Killer Case Value
1 single_quote — removing ReplaceAll produces invalid YAML Regression protection
2 nested_braces — Index/LastIndex extraction boundary for nested JSON Real AI output scenario
2 first_output_text_wins — preferred semantics for multiple output_text Non-determinism protection
2 /v1_with_trailing_slash — URL path /v1/ deduplication User config variation
3 ::ffff:127.0.0.1 — IPv4-mapped IPv6 loopback SSRF bypass Security-critical
3 ::ffff:10.0.0.1 — IPv4-mapped IPv6 private SSRF bypass Security-critical

Without-skill tests cover boundaries but lack an SSRF attack perspective (Eval 3) and lack mutation-driven regression protection (Eval 1).

4.4 Test Code Quality (Both Sides Comparable)

Notably, Without-skill test code quality is not low:

Dimension With Skill Without Skill
Test count Eval 1: 7, Eval 2: 22, Eval 3: 42 Eval 1: 15, Eval 2: 31, Eval 3: 36
Table-driven
Stdlib assertions
t.Run subtests
t.Parallel Partial ✅ All
Boundary cases
YAML metacharacters None (Eval 1) key: value, text # comment (Eval 1)

Without-skill produced more test cases in Eval 1 (15 vs 7) and even covered YAML metacharacters, which With-skill did not. But it lacks a methodology framework — no hypotheses, no red evidence, no coverage report, no killer cases.

Conclusion: The skill’s core value is not generating more or better test code, but injecting TDD methodology discipline and structured deliverables.

4.5 Residual Risks Analysis (Eval 3 Highlight)

With-skill’s Eval 3 report listed 4 residual risks:

  1. CGNAT (100.64.0.0/10) — currently returns false; extend if threat model includes shared address space
  2. IPv6 zone IDs — upstream handling of fe80::1%eth0 is uncertain
  3. DNS rebinding — design limitation for hostname resolution bypass
  4. Octal/hex IP notation — TOCTOU risk for 0177.0.0.1

This risk analysis is entirely absent from Without-skill and is especially important for security-sensitive code.


5. Token Cost-Effectiveness Analysis

5.1 Skill Size

File Lines Words Bytes Est. Tokens
SKILL.md 296 1,686 11,350 ~2,400
references/tdd-workflow.md 172 732 5,375 ~1,050
references/api-3layer-template.md 162 573 4,508 ~800
references/fake-stub-template.md 66 207 1,532 ~300
references/boundary-checklist.md 56 450 3,124 ~650
Description (always in context) ~30 ~40
Total 752 3,678 25,889 ~5,240

5.2 Actual Load Scenarios

Scenario Files read Total tokens
Eval 1: yamlQuote (S) SKILL.md + boundary-checklist + fake-stub ~3,350
Eval 2: normalizeSummaryJSON (M) SKILL.md + boundary-checklist + fake-stub + tdd-workflow ~4,400
Eval 3: IsPrivateIPLiteral (M) SKILL.md + boundary-checklist ~3,050
Typical average ~3,600
Full load (all references) SKILL.md + all 4 references ~5,200
Minimal load SKILL.md only ~2,400

5.3 Token Cost for Quality Gain

Metric Value
With-skill pass rate 100% (39/39)
Without-skill pass rate 53.8% (21/39)
Pass-rate gain +46.2 pp
Token cost per assertion fixed ~133 tokens (SKILL.md only) / ~200 tokens (typical)
Token cost per 1% pass-rate gain ~52 tokens (SKILL.md only) / ~78 tokens (typical)

5.4 Comparison with Other Skills’ Cost-Effectiveness

Metric tdd-workflow e2e-test thirdparty-api-integration-test go-makefile-writer git-commit
SKILL.md tokens ~2,400 ~1,800 ~680 ~1,960 ~1,120
Typical load tokens ~3,600 ~4,600 ~2,050 ~4,600 ~1,120
Pass-rate gain +46.2% +48.7% +33.3% +31.0% +22.7%
Tokens per 1% (SKILL.md) ~52 tok ~37 tok ~20 tok ~63 tok ~51 tok
Tokens per 1% (typical) ~78 tok ~94 tok ~62 tok ~149 tok ~51 tok

Analysis:

  • Second-highest absolute gain (+46.2%) — behind only e2e-test’s +48.7%
  • Strong typical-load cost-effectiveness (~78 tok/1%) — third in the series, behind only git-commit (~51) and thirdparty-api-integration-test (~62)
  • Good SKILL.md cost-effectiveness (~52 tok/1%) — comparable to git-commit (~51)
  • Lean, effective references — 4 reference files total ~2,800 tokens, each with a clear use case

5.5 Token Segment Cost-Effectiveness

SKILL.md module Est. tokens Related assertion delta Cost-effectiveness
6 Mandatory Gates (defect hypothesis, killer, coverage, execution integrity, concurrency, change-size) ~600 15 assertions (A1–A3, A7, A10–A11, B1–B3, B9, B11–B12, C1–C3, C10, C12–C13) Very high — 40 tok/assertion
Quality Scorecard ~350 Indirect (report structure) High
Output Contract definition ~100 3 assertions (A10, B11, C12) Very high — 33 tok/assertion
Workflow 8-step ~150 Indirect (process guidance) High
Command Playbook ~100 Indirect (standardized commands) Medium
Anti-Examples (7) ~700 Indirect (avoid common mistakes) Medium — no direct assertion match
Hard Rules ~200 Indirect (assertion style adaptation) Medium
references/boundary-checklist.md ~650 Indirect (DH design guidance) High — loaded every scenario
references/fake-stub-template.md ~300 0 direct Low — no fake/stub in this eval
references/tdd-workflow.md ~1,050 0 direct Low — only Eval 2 loaded
references/api-3layer-template.md ~800 0 direct Low — not loaded in this eval

5.6 Token Efficiency Rating

Rating Conclusion
Overall ROI Excellent — ~3,600 tokens (typical) for +46.2% pass rate; third-best cost-effectiveness in series
SKILL.md ROI Excellent — ~2,400 tokens cost-effectiveness (~52 tok/1%) tied with git-commit
High-leverage token share ~44% (~1,050/2,400) directly contributes to 18/18 assertion deltas
Low-leverage token share ~29% (~700/2,400) Anti-Examples with no direct assertion match
Reference cost-effectiveness boundary-checklist high value (loaded every scenario); other 3 loaded on demand

6. Boundary Analysis vs Claude Base Model Capabilities

6.1 Capabilities Base Model Already Has (No Skill Increment)

Capability Evidence
Table-driven tests with t.Run 3/3 scenarios
Stdlib assertions (t.Fatalf with got/want) 3/3 scenarios
Boundary condition testing Eval 1: metacharacters; Eval 3: RFC 1918 boundaries
Error-path coverage Eval 2: empty, no braces, invalid JSON
t.Parallel usage 3/3 scenarios (Without-skill uses Parallel more aggressively)
Co-located test files 3/3 scenarios
Reasonable test count Without-skill even produced more cases

6.2 Base Model Gaps (Skill Fills)

Gap Evidence Impact
TDD Red→Green flow entirely missing 3/3 scenarios lack red evidence High — cannot prove tests actually verify behavior
Defect Hypothesis missing 3/3 scenarios lack hypothesis list High — tests lack theoretical basis and attack perspective
Killer Case missing 3/3 scenarios lack killer cases High — no targeted tests for high-risk hypotheses (e.g. SSRF bypass)
Coverage report missing 3/3 scenarios lack coverage Medium — cannot quantify test adequacy
Change Size classification missing 3/3 scenarios lack S/M/L Medium — no test budget control (may over- or under-test)
Output Contract missing 3/3 scenarios lack structured report Medium — reports not reproducible or comparable
Residual Risks missing 3/3 scenarios lack follow-up risk analysis Low — but critical for security code

7. Overall Score

7.1 Dimension Scores

Dimension With Skill Without Skill Delta
TDD methodology (Red/Green/Refactor) 5.0/5 1.0/5 +4.0
Defect Hypothesis + Killer Case 5.0/5 0.5/5 +4.5
Structured report & Coverage 5.0/5 1.0/5 +4.0
Test code quality 4.5/5 4.0/5 +0.5
Security analysis (Eval 3 residual risks) 5.0/5 2.0/5 +3.0
Overall mean 4.90/5 1.70/5 +3.20

7.2 Weighted Total Score

Dimension Weight Score Weighted
Assertion pass rate (delta) 25% 10/10 2.50
TDD methodology injection 20% 10/10 2.00
Defect Hypothesis + Killer Case 15% 10/10 1.50
Structured report & Coverage 10% 10/10 1.00
Token cost-effectiveness 15% 8.5/10 1.28
Test code quality increment 10% 5.0/10 0.50
Security analysis / Residual Risks 5% 10/10 0.50
Weighted total 9.28/10

Test code quality increment is scored lower because Without-skill test code quality is not low — the skill’s core value lies in methodology discipline, not code generation.


8. Evaluation Materials

Material Path
Eval 1 with-skill output /tmp/tdd-eval/eval-1/with_skill/
Eval 1 without-skill output /tmp/tdd-eval/eval-1/without_skill/
Eval 2 with-skill output /tmp/tdd-eval/eval-2/with_skill/
Eval 2 without-skill output /tmp/tdd-eval/eval-2/without_skill/
Eval 3 with-skill output /tmp/tdd-eval/eval-3/with_skill/
Eval 3 without-skill output /tmp/tdd-eval/eval-3/without_skill/