tdd-workflow Skill Evaluation Report

Evaluation framework: skill-creator
Evaluation date: 2026-03-11
Evaluation subject: tdd-workflow

tdd-workflow is an end-to-end TDD skill for Go code changes, designed to put "write failing tests first, then minimal implementation, then safe refactor" into practice. It is especially suited for new features, bug fixes, and security-sensitive logic testing. Its three standout strengths are: requiring a one-to-one mapping between Defect Hypotheses and test cases, so you define "what bug to catch" before writing tests; enforcing a Red → Green → Refactor evidence chain so the TDD process is verifiable, not just claimed; and using Killer Cases plus coverage and risk-path gates to elevate tests from "runnable" to "capable of catching critical defects".

1. Evaluation Overview

This evaluation reviews the tdd-workflow skill along two axes: actual task performance and token cost-effectiveness. Three scenarios were designed (S-size yamlQuote boundary tests, M-size normalizeSummaryJSON three-function tests, M-size IsPrivateIPLiteral security tests). Each scenario was run with both with-skill and without-skill configurations, for 3 scenarios × 2 configs = 6 independent subagent runs, scored against 39 assertions.

Dimension	With Skill	Without Skill	Delta
Assertion pass rate	39/39 (100%)	21/39 (53.8%)	+46.2 pp
Defect Hypothesis → Test Mapping	3/3 scenarios with full mapping	0/3	Skill-only
Red → Green evidence	3/3	0/3	Skill-only
Killer Case mechanism	3/3 (6 killer cases total)	0/3	Skill-only
Output Contract structured report	3/3	0/3	Skill-only
Coverage report	3/3	0/3	Skill-only
Change Size classification	3/3	0/3	Skill-only
Skill Token cost (SKILL.md only)	~2,400 tokens	0	—
Skill Token cost (typical load)	~3,600 tokens	0	—
Token cost per 1% pass-rate gain	~52 tokens (SKILL.md only) / ~78 tokens (typical)	—	—

2. Test Methodology

2.1 Scenario Design

Scenario	Target function	Package	Core focus	Assertions
Eval 1: yamlQuote	`yamlQuote` (4 LOC)	converter	S-size TDD cycle, Red evidence, boundary conditions, string escaping	12
Eval 2: normalizeSummaryJSON	`normalizeSummaryJSON` + `extractSummaryText` + `buildResponsesEndpoint`	converter	M-size three-function tests, JSON parse boundaries, code fence handling	14
Eval 3: IsPrivateIPLiteral	`IsPrivateIPLiteral` (12 LOC)	urlutil	Security-sensitive SSRF protection, IPv4/IPv6 dual-stack, RFC 1918 range boundaries	13

2.2 Rationale for Target Selection

All targets use stdlib assertions (the project constitution forbids testify), which tests whether the skill adapts to the project’s assertion style
The functions exist but lack direct unit tests, which tests whether the skill can handle "characterization testing" (adding tests to existing code)
Complexity varies, from 4 LOC pure functions to 12 LOC multi-branch security functions

2.3 Execution

With-skill runs first read SKILL.md and optionally load reference materials
Without-skill runs read no skill, using model default behavior
All runs execute in parallel in independent subagents

3. Assertion Pass Rate

3.1 Overview

Scenario	Assertions	With Skill	Without Skill	Delta
Eval 1: yamlQuote (S)	12	12/12 (100%)	6/12 (50.0%)	+50.0 pp
Eval 2: normalizeSummaryJSON (M)	14	14/14 (100%)	8/14 (57.1%)	+42.9 pp
Eval 3: IsPrivateIPLiteral (M)	13	13/13 (100%)	7/13 (53.8%)	+46.2 pp
Total	39	39/39 (100%)	21/39 (53.8%)	+46.2 pp

3.2 Per-Assertion Details

Eval 1: yamlQuote S-size (12 assertions)

#	Assertion	With	Without	Notes
A1	Change size classified as S	✅	❌	Without has no size concept
A2	Defect hypothesis list (≥3)	✅	❌	With has 7 items DH1–DH7
A3	Red evidence (failing tests before implementation)	✅	❌	With shows 3/7 fail via mutation testing
A4	Green evidence (tests pass)	✅	✅
A5	Table-driven tests	✅	✅
A6	Boundary cases covered	✅	✅	Without has more cases (15 vs 7)
A7	Killer case explicitly marked	✅	❌	With marks `single_quote` as KILLER
A8	Stdlib assertions	✅	✅
A9	Test file co-located	✅	✅
A10	Output contract	✅	❌	Without only has brief summary
A11	Coverage report	✅	❌	With: yamlQuote 100%, package 83.5%
A12	No speculative production code	✅	✅

Eval 2: normalizeSummaryJSON M-size (14 assertions)

#	Assertion	With	Without	Notes
B1	Change size classification	✅	❌
B2	Defect hypothesis list (≥5)	✅	❌	With has 15 items DH1–DH15
B3	Red evidence	✅	❌	With records via characterization approach
B4	Green evidence	✅	✅
B5	Table-driven tests	✅	✅
B6	Happy path (valid JSON, code fence)	✅	✅
B7	Error paths (empty, non-JSON, malformed)	✅	✅
B8	Boundary (code fence with/without lang tag)	✅	✅
B9	Killer case explicitly marked	✅	❌	With has 3 killer cases
B10	Stdlib assertions	✅	✅
B11	Output contract	✅	❌
B12	Coverage report	✅	❌	With: 85.4% package, 100% target functions
B13	Reasonable test count	✅	✅
B14	No mock abuse	✅	✅

Eval 3: IsPrivateIPLiteral M-size (13 assertions)

#	Assertion	With	Without	Notes
C1	Change size classification	✅	❌
C2	Defect hypothesis list (≥4)	✅	❌	With has 5 items H1–H5
C3	Red evidence	✅	❌
C4	Green evidence	✅	✅
C5	Table-driven tests	✅	✅
C6	IPv4 private ranges covered	✅	✅
C7	Public address returns false	✅	✅
C8	IPv6 loopback (::1) handling	✅	✅
C9	Non-IP hostname returns false	✅	✅
C10	Killer case	✅	❌	With: IPv4-mapped IPv6 SSRF bypass test
C11	Stdlib assertions	✅	✅
C12	Output contract	✅	❌
C13	Coverage report	✅	❌	With: 100% on function, 89.7% package

3.3 Classification of 18 Failed Assertions (Without-Skill)

Failure type	Count	Evals	Notes
Change Size classification missing	3	All	No S/M/L classification or test budget control
Defect Hypothesis missing	3	All	No hypothesis–test mapping; tests lack theoretical basis
Red Evidence missing	3	All	No evidence of tests failing before implementation
Killer Case missing	3	All	No targeted tests for high-risk hypotheses
Output Contract missing	3	All	Simple summary instead of structured deliverable
Coverage report missing	3	All	No line coverage or risk-path coverage reported

Key observation: All 18 failures are missing TDD methodology artifacts, not test code quality issues. Without-skill test code quality is not low (Eval 1 even produced 15 test cases vs With-skill’s 7), but the runs lack TDD process evidence and structured reports.

3.4 Delta Stability

Deltas across the three scenarios are highly consistent (+42.9 pp to +50.0 pp), indicating that the skill’s contribution is not task-dependent: it systematically injects the same six categories of TDD methodology artifacts.

4. Dimension-by-Dimension Comparison

4.1 Defect Hypothesis → Test Mapping (Core Differentiator)

This is the TDD skill’s most distinctive contribution — requiring hypotheses before tests.

Scenario	With Skill	Without Skill
Eval 1: yamlQuote	7 hypotheses: DH1(empty)→DH7(unicode), each mapped to test name	15 test cases, no hypothesis provenance
Eval 2: normalizeSummaryJSON	15 hypotheses across 3 functions, grouped by function	31 test cases, no hypotheses
Eval 3: IsPrivateIPLiteral	5 hypotheses: H1(mapped IPv6)→H5(unspecified), including SSRF attack hypotheses	36 test cases, boundary tests but no attack hypotheses

Practical value: Defect hypotheses are not just report decoration — they drive more targeted test design:

Eval 3 H1 (IPv4-mapped IPv6 bypass) is a test angle completely absent from Without-skill. ::ffff:127.0.0.1 and ::ffff:10.0.0.1 are real SSRF attack vectors; none of Without-skill’s 36 tests touch them.
Eval 2 DH5 (nested braces extraction boundary) is a key test for the Index/LastIndex algorithm; Without-skill has the test but no hypothesis rationale.

4.2 Red → Green Evidence

Scenario	With Skill	Without Skill
Eval 1	Mutation testing: remove `ReplaceAll`, 3/7 fail (precise red evidence)	Direct "All 15 pass" (no red phase)
Eval 2	Characterization testing: run tests on existing code first to confirm behavior	Direct "31 subtests, 0 failures"
Eval 3	Hypothesis-driven: killer cases presented as attack hypotheses	Direct "ALL PASS"

Key difference: For characterization testing (adding tests to existing code), the skill still requires Red evidence — Eval 1 via mutation, Eval 2–3 via hypothesis. Without-skill only shows "all pass", so it cannot prove what the tests actually verify.

4.3 Killer Case Mechanism

With-skill produced 6 killer cases in total, each with a 4-part structure:

Defect hypothesis — the specific defect hypothesis to verify or falsify
Fault injection — how to trigger that defect (mutation or attack input)
Critical assertion — the key assertion that must succeed
Removal risk — risk if this test is removed

Eval	Killer Case	Value
1	`single_quote` — removing ReplaceAll produces invalid YAML	Regression protection
2	`nested_braces` — Index/LastIndex extraction boundary for nested JSON	Real AI output scenario
2	`first_output_text_wins` — preferred semantics for multiple output_text	Non-determinism protection
2	`/v1_with_trailing_slash` — URL path `/v1/` deduplication	User config variation
3	`::ffff:127.0.0.1` — IPv4-mapped IPv6 loopback SSRF bypass	Security-critical
3	`::ffff:10.0.0.1` — IPv4-mapped IPv6 private SSRF bypass	Security-critical

Without-skill tests cover boundaries but lack an SSRF attack perspective (Eval 3) and lack mutation-driven regression protection (Eval 1).
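The two Eval 3 killer cases can be sketched with the four-part structure listed above. Everything in this block is an assumption for illustration: `isPrivateIPLiteral` is a hypothetical reimplementation built on `net/netip` (the report does not show the `urlutil` source or which ranges it covers), and `killerCase` is an invented type.

```go
package main

import (
	"fmt"
	"net/netip"
)

// isPrivateIPLiteral is a hypothetical sketch, not the urlutil implementation.
// It reports whether host is an IP literal in a loopback or private range.
func isPrivateIPLiteral(host string) bool {
	addr, err := netip.ParseAddr(host)
	if err != nil {
		return false // hostnames and malformed literals are not IP literals
	}
	// Fold IPv4-mapped IPv6 (::ffff:a.b.c.d) to plain IPv4 before classifying.
	addr = addr.Unmap()
	return addr.IsLoopback() || addr.IsPrivate()
}

// killerCase mirrors the four-part structure: hypothesis, fault injection,
// critical assertion, and removal risk.
type killerCase struct {
	hypothesis     string // defect to verify or falsify
	faultInjection string // attack input that would trigger the defect
	want           bool   // critical assertion: the expected result
	removalRisk    string // what goes unnoticed if the case is deleted
}

var killerCases = []killerCase{
	{
		hypothesis:     "IPv4-mapped IPv6 loopback is treated as public",
		faultInjection: "::ffff:127.0.0.1",
		want:           true,
		removalRisk:    "SSRF to localhost via mapped address",
	},
	{
		hypothesis:     "IPv4-mapped IPv6 private address is treated as public",
		faultInjection: "::ffff:10.0.0.1",
		want:           true,
		removalRisk:    "SSRF to internal network via mapped address",
	},
}

func main() {
	for _, kc := range killerCases {
		got := isPrivateIPLiteral(kc.faultInjection)
		status := "ok"
		if got != kc.want {
			status = "KILLER FAILED"
		}
		fmt.Printf("%-18s got=%v want=%v %s (%s)\n",
			kc.faultInjection, got, kc.want, status, kc.hypothesis)
	}
}
```

The explicit `Unmap` call is the line these cases protect; the sketch makes it a separate step so that a mutation removing it is easy to express.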

4.4 Test Code Quality (Both Sides Comparable)

Notably, Without-skill test code quality is not low:

Dimension	With Skill	Without Skill
Test count	Eval 1: 7, Eval 2: 22, Eval 3: 42	Eval 1: 15, Eval 2: 31, Eval 3: 36
Table-driven	✅	✅
Stdlib assertions	✅	✅
t.Run subtests	✅	✅
t.Parallel	Partial	✅ All
Boundary cases	✅	✅
YAML metacharacters	None (Eval 1)	✅ `key: value`, `text # comment` (Eval 1)

Without-skill produced more test cases in Eval 1 (15 vs 7) and even covered YAML metacharacters, which With-skill did not. But it lacks a methodology framework — no hypotheses, no red evidence, no coverage report, no killer cases.

Conclusion: The skill’s core value is not generating more or better test code, but injecting TDD methodology discipline and structured deliverables.

4.5 Residual Risks Analysis (Eval 3 Highlight)

With-skill’s Eval 3 report listed 4 residual risks:

CGNAT (100.64.0.0/10) — currently returns false; extend if threat model includes shared address space
IPv6 zone IDs — upstream handling of fe80::1%eth0 is uncertain
DNS rebinding — design limitation for hostname resolution bypass
Octal/hex IP notation — TOCTOU risk for 0177.0.0.1

This risk analysis is entirely absent from Without-skill and is especially important for security-sensitive code.

5. Token Cost-Effectiveness Analysis

5.1 Skill Size

File	Lines	Words	Bytes	Est. Tokens
SKILL.md	296	1,686	11,350	~2,400
references/tdd-workflow.md	172	732	5,375	~1,050
references/api-3layer-template.md	162	573	4,508	~800
references/fake-stub-template.md	66	207	1,532	~300
references/boundary-checklist.md	56	450	3,124	~650
Description (always in context)	—	~30	—	~40
Total	752	3,678	25,889	~5,240

5.2 Actual Load Scenarios

Scenario	Files read	Total tokens
Eval 1: yamlQuote (S)	SKILL.md + boundary-checklist + fake-stub	~3,350
Eval 2: normalizeSummaryJSON (M)	SKILL.md + boundary-checklist + fake-stub + tdd-workflow	~4,400
Eval 3: IsPrivateIPLiteral (M)	SKILL.md + boundary-checklist	~3,050
Typical average		~3,600
Full load (all references)	SKILL.md + all 4 references	~5,200
Minimal load	SKILL.md only	~2,400

5.3 Token Cost for Quality Gain

Metric	Value
With-skill pass rate	100% (39/39)
Without-skill pass rate	53.8% (21/39)
Pass-rate gain	+46.2 pp
Token cost per assertion fixed	~133 tokens (SKILL.md only) / ~200 tokens (typical)
Token cost per 1% pass-rate gain	~52 tokens (SKILL.md only) / ~78 tokens (typical)

5.4 Comparison with Other Skills’ Cost-Effectiveness

Metric	tdd-workflow	e2e-test	thirdparty-api-integration-test	go-makefile-writer	git-commit
SKILL.md tokens	~2,400	~1,800	~680	~1,960	~1,120
Typical load tokens	~3,600	~4,600	~2,050	~4,600	~1,120
Pass-rate gain	+46.2 pp	+48.7 pp	+33.3 pp	+31.0 pp	+22.7 pp
Tokens per 1% (SKILL.md)	~52 tok	~37 tok	~20 tok	~63 tok	~51 tok
Tokens per 1% (typical)	~78 tok	~94 tok	~62 tok	~149 tok	~51 tok

Analysis:

Second-highest absolute gain (+46.2%) — behind only e2e-test’s +48.7%
Strong typical-load cost-effectiveness (~78 tok/1%) — third in the series, behind only git-commit (~51) and thirdparty-api-integration-test (~62)
Good SKILL.md cost-effectiveness (~52 tok/1%) — comparable to git-commit (~51)
Lean, effective references — 4 reference files total ~2,800 tokens, each with a clear use case

5.5 Token Segment Cost-Effectiveness

SKILL.md module	Est. tokens	Related assertion delta	Cost-effectiveness
6 Mandatory Gates (defect hypothesis, killer, coverage, execution integrity, concurrency, change-size)	~600	15 assertions (A1–A3, A7, A11, B1–B3, B9, B12, C1–C3, C10, C13)	Very high (40 tok/assertion)
Quality Scorecard	~350	Indirect (report structure)	High
Output Contract definition	~100	3 assertions (A10, B11, C12)	Very high (33 tok/assertion)
Workflow 8-step	~150	Indirect (process guidance)	High
Command Playbook	~100	Indirect (standardized commands)	Medium
Anti-Examples (7)	~700	Indirect (avoid common mistakes)	Medium — no direct assertion match
Hard Rules	~200	Indirect (assertion style adaptation)	Medium
references/boundary-checklist.md	~650	Indirect (DH design guidance)	High — loaded every scenario
references/fake-stub-template.md	~300	0 direct	Low — no fake/stub in this eval
references/tdd-workflow.md	~1,050	0 direct	Low — only Eval 2 loaded
references/api-3layer-template.md	~800	0 direct	Low — not loaded in this eval

5.6 Token Efficiency Rating

Rating	Conclusion
Overall ROI	Excellent — ~3,600 tokens (typical) for +46.2% pass rate; third-best cost-effectiveness in series
SKILL.md ROI	Excellent: ~2,400 tokens at ~52 tok/1%, roughly on par with git-commit (~51 tok/1%)
High-leverage token share	~44% (~1,050/2,400) directly contributes to 18/18 assertion deltas
Low-leverage token share	~29% (~700/2,400) Anti-Examples with no direct assertion match
Reference cost-effectiveness	boundary-checklist high value (loaded every scenario); other 3 loaded on demand

6. Boundary Analysis vs Claude Base Model Capabilities

6.1 Capabilities Base Model Already Has (No Skill Increment)

Capability	Evidence
Table-driven tests with t.Run	3/3 scenarios
Stdlib assertions (t.Fatalf with got/want)	3/3 scenarios
Boundary condition testing	Eval 1: metacharacters; Eval 3: RFC 1918 boundaries
Error-path coverage	Eval 2: empty, no braces, invalid JSON
t.Parallel usage	3/3 scenarios (Without-skill uses Parallel more aggressively)
Co-located test files	3/3 scenarios
Reasonable test count	Without-skill produced more cases in Eval 1 (15 vs 7) and Eval 2 (31 vs 22)

6.2 Base Model Gaps (Skill Fills)

Gap	Evidence	Impact
TDD Red→Green flow entirely missing	3/3 scenarios lack red evidence	High — cannot prove tests actually verify behavior
Defect Hypothesis missing	3/3 scenarios lack hypothesis list	High — tests lack theoretical basis and attack perspective
Killer Case missing	3/3 scenarios lack killer cases	High — no targeted tests for high-risk hypotheses (e.g. SSRF bypass)
Coverage report missing	3/3 scenarios lack coverage	Medium — cannot quantify test adequacy
Change Size classification missing	3/3 scenarios lack S/M/L	Medium — no test budget control (may over- or under-test)
Output Contract missing	3/3 scenarios lack structured report	Medium — reports not reproducible or comparable
Residual Risks missing	3/3 scenarios lack follow-up risk analysis	Low — but critical for security code

7. Overall Score

7.1 Dimension Scores

Dimension	With Skill	Without Skill	Delta
TDD methodology (Red/Green/Refactor)	5.0/5	1.0/5	+4.0
Defect Hypothesis + Killer Case	5.0/5	0.5/5	+4.5
Structured report & Coverage	5.0/5	1.0/5	+4.0
Test code quality	4.5/5	4.0/5	+0.5
Security analysis (Eval 3 residual risks)	5.0/5	2.0/5	+3.0
Overall mean	4.90/5	1.70/5	+3.20

7.2 Weighted Total Score

Dimension	Weight	Score	Weighted
Assertion pass rate (delta)	25%	10/10	2.50
TDD methodology injection	20%	10/10	2.00
Defect Hypothesis + Killer Case	15%	10/10	1.50
Structured report & Coverage	10%	10/10	1.00
Token cost-effectiveness	15%	8.5/10	1.28
Test code quality increment	10%	5.0/10	0.50
Security analysis / Residual Risks	5%	10/10	0.50
Weighted total			9.28/10

Test code quality increment is scored lower because Without-skill test code quality is not low — the skill’s core value lies in methodology discipline, not code generation.

8. Evaluation Materials

Material	Path
Eval 1 with-skill output	`/tmp/tdd-eval/eval-1/with_skill/`
Eval 1 without-skill output	`/tmp/tdd-eval/eval-1/without_skill/`
Eval 2 with-skill output	`/tmp/tdd-eval/eval-2/with_skill/`
Eval 2 without-skill output	`/tmp/tdd-eval/eval-2/without_skill/`
Eval 3 with-skill output	`/tmp/tdd-eval/eval-3/with_skill/`
Eval 3 without-skill output	`/tmp/tdd-eval/eval-3/without_skill/`