tdd-workflow is an end-to-end TDD skill for Go code changes, designed to put "write failing tests first, then minimal implementation, then safe refactor" into practice. It is especially suited for new features, bug fixes, and security-sensitive logic testing. Its three standout strengths are: requiring a one-to-one mapping between Defect Hypotheses and test cases, so you define "what bug to catch" before writing tests; enforcing a Red → Green → Refactor evidence chain so the TDD process is verifiable, not just claimed; and using Killer Cases plus coverage and risk-path gates to elevate tests from "runnable" to "capable of catching critical defects".
This evaluation reviews the tdd-workflow skill along two axes: actual task performance and token cost-effectiveness. Three scenarios were designed (S-size yamlQuote boundary tests, M-size normalizeSummaryJSON three-function tests, M-size IsPrivateIPLiteral security tests). Each scenario was run with both with-skill and without-skill configurations, for 3 scenarios × 2 configs = 6 independent subagent runs, scored against 39 assertions.
### 3.3 Classification of 18 Failed Assertions (Without-Skill)
| Failure type | Count | Evals | Notes |
|---|---|---|---|
| Change Size classification missing | 3 | All | No S/M/L classification or test budget control |
| Defect Hypothesis missing | 3 | All | No hypothesis–test mapping; tests lack theoretical basis |
| Red Evidence missing | 3 | All | No evidence of tests failing before implementation |
| Killer Case missing | 3 | All | No targeted tests for high-risk hypotheses |
| Output Contract missing | 3 | All | Simple summary instead of structured deliverable |
| Coverage report missing | 3 | All | No line coverage or risk-path coverage reported |
Key observation: all 18 failures are TDD methodology artifacts, not test-code quality issues. The Without-skill test code is not poor (Eval 1 even produced 15 test cases versus the With-skill run's 7), but it lacks TDD process evidence and structured reports.
The deltas across the three scenarios are highly consistent (+42.9% to +50.0%), indicating that the skill's contribution is not task-dependent: it systematically injects six categories of TDD methodology artifacts regardless of scenario.
### 4.1 Defect Hypothesis → Test Mapping (Core Differentiator)
This is the TDD skill’s most distinctive contribution — requiring hypotheses before tests.
| Scenario | With Skill | Without Skill |
|---|---|---|
| Eval 1: yamlQuote | 7 hypotheses: DH1 (empty) → DH7 (unicode), each mapped to a test name | 15 test cases, no hypothesis provenance |
| Eval 2: normalizeSummaryJSON | 15 hypotheses across 3 functions, grouped by function | 31 test cases, no hypotheses |
| Eval 3: IsPrivateIPLiteral | 5 hypotheses: H1 (mapped IPv6) → H5 (unspecified), including SSRF attack hypotheses | 36 test cases, boundary tests but no attack hypotheses |
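The hypothesis-to-test mapping can be sketched roughly as follows. This is an illustrative example, not the evaluated code: the `yamlQuote` body, the DH IDs, and the specific cases are assumptions about how such a mapping might look in a Go table-driven test.

```go
package main

import (
	"fmt"
	"strings"
)

// yamlQuote is a hypothetical stand-in for the function under test:
// it wraps a string in double quotes, escaping backslashes and quotes.
func yamlQuote(s string) string {
	s = strings.ReplaceAll(s, `\`, `\\`)
	s = strings.ReplaceAll(s, `"`, `\"`)
	return `"` + s + `"`
}

// dhCase carries the ID of the defect hypothesis it exists to catch,
// so every case has explicit provenance (DH1..DH3 are illustrative labels).
type dhCase struct {
	name string // subtest name, prefixed with the hypothesis ID
	in   string
	want string
}

func main() {
	cases := []dhCase{
		{"DH1_empty_string", "", `""`},
		{"DH2_embedded_quote", `a"b`, `"a\"b"`},
		{"DH3_backslash", `a\b`, `"a\\b"`},
	}
	for _, c := range cases {
		got := yamlQuote(c.in)
		fmt.Printf("%s: got=%s ok=%v\n", c.name, got, got == c.want)
	}
}
```

In a real `_test.go` file each entry would become a `t.Run(c.name, ...)` subtest, so a failure report names the hypothesis it falsifies.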
Practical value: Defect hypotheses are not just report decoration — they drive more targeted test design:
- Eval 3 H1 (IPv4-mapped IPv6 bypass) is a test angle completely absent from Without-skill. `::ffff:127.0.0.1` and `::ffff:10.0.0.1` are real SSRF attack vectors; none of Without-skill's 36 tests touch them.
- Eval 2 DH5 (nested-braces extraction boundary) is a key test for the Index/LastIndex algorithm; Without-skill has the test but no hypothesis rationale.
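A minimal sketch of the bypass class that H1 targets, assuming the checker validates IP literals (the function names and the naive prefix check are hypothetical, not the evaluated `IsPrivateIPLiteral`):

```go
package main

import (
	"fmt"
	"net"
	"strings"
)

// naiveIsPrivate is the kind of check the H1 hypothesis targets: a string
// prefix test that an IPv4-mapped IPv6 literal slips straight past.
func naiveIsPrivate(s string) bool {
	return strings.HasPrefix(s, "10.") ||
		strings.HasPrefix(s, "127.") ||
		strings.HasPrefix(s, "192.168.")
}

// parsedIsPrivate parses first; net.IP.To4 unwraps IPv4-mapped addresses,
// so ::ffff:127.0.0.1 is recognized as loopback and ::ffff:10.0.0.1 as private.
func parsedIsPrivate(s string) bool {
	ip := net.ParseIP(s)
	if ip == nil {
		return false
	}
	return ip.IsLoopback() || ip.IsPrivate()
}

func main() {
	for _, s := range []string{"::ffff:127.0.0.1", "::ffff:10.0.0.1"} {
		fmt.Printf("%-18s naive=%v parsed=%v\n", s, naiveIsPrivate(s), parsedIsPrivate(s))
	}
}
```

The naive check returns false for both mapped literals while the parse-based check flags them, which is exactly the gap an SSRF attack hypothesis forces a test to cover.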
| Scenario | With Skill | Without Skill |
|---|---|---|
| Eval 1 | Mutation testing: remove ReplaceAll, 3/7 fail (precise red evidence) | Direct "All 15 pass" (no red phase) |
| Eval 2 | Characterization testing: run tests on existing code first to confirm behavior | Direct "31 subtests, 0 failures" |
| Eval 3 | Hypothesis-driven: killer cases presented as attack hypotheses | Direct "ALL PASS" |
Key difference: For characterization testing (adding tests to existing code), the skill still requires Red evidence — Eval 1 via mutation, Eval 2–3 via hypothesis. Without-skill only shows "all pass", so it cannot prove what the tests actually verify.
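The mutation approach can be sketched as below. Both function bodies are assumptions for illustration; the evaluated `yamlQuote` and its exact mutants are not shown in this report.

```go
package main

import (
	"fmt"
	"strings"
)

// Original (hypothetical) implementation: escapes quotes before wrapping.
func yamlQuote(s string) string {
	return `"` + strings.ReplaceAll(s, `"`, `\"`) + `"`
}

// Mutant: the ReplaceAll escaping step removed. Re-running the suite
// against this mutant should turn some cases red; a case that stays
// green under the mutant is not actually exercising the escaping logic.
func yamlQuoteMutant(s string) string {
	return `"` + s + `"`
}

func main() {
	cases := []struct{ in, want string }{
		{"", `""`},
		{"plain", `"plain"`},
		{`a"b`, `"a\"b"`},
	}
	red := 0
	for _, c := range cases {
		if yamlQuoteMutant(c.in) != c.want {
			red++
		}
	}
	fmt.Printf("mutant killed by %d/%d cases\n", red, len(cases))
	// prints: mutant killed by 1/3 cases
}
```

Counting how many cases a deliberate mutant turns red gives characterization tests the same falsifiability that a failing-first test gives new code.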
Notably, Without-skill test code quality is not low:
| Dimension | With Skill | Without Skill |
|---|---|---|
| Test count | Eval 1: 7, Eval 2: 22, Eval 3: 42 | Eval 1: 15, Eval 2: 31, Eval 3: 36 |
| Table-driven | ✅ | ✅ |
| Stdlib assertions | ✅ | ✅ |
| t.Run subtests | ✅ | ✅ |
| t.Parallel | Partial | ✅ All |
| Boundary cases | ✅ | ✅ |
| YAML metacharacters | None (Eval 1) | ✅ `key: value`, `text # comment` (Eval 1) |
Without-skill produced more test cases in Eval 1 (15 vs 7) and even covered YAML metacharacters, which With-skill did not. But it lacks a methodology framework — no hypotheses, no red evidence, no coverage report, no killer cases.
Conclusion: The skill’s core value is not generating more or better test code, but injecting TDD methodology discipline and structured deliverables.
The test-code-quality increment is scored lower because the Without-skill test code is already solid; the skill's core value lies in methodology discipline, not code generation.