unit-test Skill Design Rationale¶
unit-test is a defect-first unit-testing framework for Go repositories. Its core idea is that high-quality unit testing starts by classifying target risk, then designs high-signal cases around explicit defect hypotheses, and finally explains why the suite deserves to exist through Killer Cases, boundary checklists, tiered scorecards, and real -race / coverage evidence. That is why the skill turns the Go Version Gate, Execution Modes, Defect-First Workflow, High-Signal Test Budget, Boundary Checklist, Auto Scorecard, Property-Based Testing, and Reporting Integrity into one fixed workflow.
1. Definition¶
unit-test is used for:
- adding, strengthening, and fixing unit tests for Go code,
- prioritizing real defects in boundaries, mappings, concurrency, and context propagation,
- organizing tests into maintainable table-driven + t.Run structures,
- selecting Light / Standard / Strict mode based on target risk,
- and delivering tests together with race, coverage, scorecard, and residual-risk evidence.
Its output is not just test code. It also includes:
- execution mode and rationale,
- targets tested and case counts,
- Go version and version-dependent adaptations,
- boundary checklist,
- coverage / race results,
- scorecard and final PASS / FAIL.
Failure Hypothesis List, detailed Killer Case reporting, and the JSON summary mainly belong to Standard / Strict mode outputs; Light mode intentionally reduces output to a lighter boundary check and scorecard package.
From a design perspective, it is closer to a unit-test governance framework than to a prompt that merely adds a few _test.go files.
2. Background and Problems¶
The skill is not solving "models do not know how to write Go tests." It is solving the fact that default unit-test generation often drifts into low-signal patterns:
- many tests exist, but the assertions are weak,
- coverage looks strong, but the suite still misses real bugs,
- lots of cases are present, but there is no visible testing methodology.
Without an explicit framework, the most common failures cluster into eight categories:
| Problem | Typical consequence |
|---|---|
| Coverage is chased without defect thinking | many paths appear tested, but important bugs still escape |
| Target risk is not classified | simple functions get over-engineered while risky concurrent code gets only shallow happy paths |
| No Killer Case exists | maintainers cannot tell which assertions are the real regression barriers |
| Boundary coverage is not systematic | nil, empty, singleton, last-element, and context-cancel cases are easily missed |
| Assertions are mutation-weak | tests only check err == nil or not nil, so field-level regressions still pass |
| Test organization is scattered | one target accumulates many separate TestXxx functions with poor reuse and high edit cost |
| Concurrency tests are nondeterministic | time.Sleep, no -race, and leaked shared state make tests flaky |
| Output is not auditable | the team cannot see why a mode was chosen, why the suite is considered sufficient, or what remains untested |
The design logic of unit-test is to answer "how risky is this code, what is most likely to fail, which assertions form the regression barrier, what coverage gate is reasonable, and what level of test process should apply?" before deciding how many tests to write or how to organize them.
3. Comparison with Common Alternatives¶
It helps to compare it with a few common alternatives:
| Dimension | unit-test skill | Asking a model to "write unit tests" | Treating unit tests as a coverage-filling task |
|---|---|---|---|
| Defect-hypothesis-driven design | Strong | Weak | Weak |
| Mode selection (Light / Standard / Strict) | Strong | Weak | Weak |
| Killer Case discipline | Strong | Weak | Weak |
| Boundary-checklist system | Strong | Medium | Weak |
| Concurrency / -race awareness | Strong | Medium | Weak |
| Test-organization consistency | Strong | Medium | Weak |
| Quality scoring and auditability | Strong | Weak | Weak |
| Coverage philosophy | high-signal first, anti-bloat | often ad hoc | often metric-driven |
Its value is not only that there are "more tests." Its value is that unit tests become an explainable, auditable, and maintainable defect-defense system instead of a pile of disconnected cases.
4. Core Design Rationale¶
4.1 Mode Selection Comes Before Case Design¶
In workflow step 0, unit-test requires choosing Light / Standard / Strict. That matters because it explicitly rejects a common mistake: applying one heavy test process to every target.
It routes based on factors such as:
- target count,
- presence of concurrency,
- dependency complexity,
- branch complexity,
- security sensitivity,
- context / deadline logic,
- collection transforms or property-based triggers.
This design matters because:
- simple pure functions can stay in Light and avoid unnecessary overhead,
- ordinary business logic defaults to Standard,
- concurrent, security-sensitive, or higher-risk targets are promoted to Strict.
What this solves is mismatch between test-process weight and code risk. Without it, the usual distortion is that trivial logic gets wrapped in oversized methodology while dangerous code receives only a handful of shallow cases.
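The routing idea can be sketched as a small function. This is only an illustration: the `TargetProfile` type, its fields, and the numeric thresholds are assumptions made for the example, not the rules defined in SKILL.md.

```go
package main

// Mode is the process weight applied to a test target.
type Mode string

const (
	Light    Mode = "Light"
	Standard Mode = "Standard"
	Strict   Mode = "Strict"
)

// TargetProfile is a hypothetical summary of some routing factors
// (concurrency, security sensitivity, context logic, dependencies, branches).
type TargetProfile struct {
	HasConcurrency    bool
	SecuritySensitive bool
	UsesContext       bool // context / deadline logic
	ExternalDeps      int  // number of injected dependencies
	Branches          int  // rough branch count
}

// SelectMode is an illustrative routing rule; the thresholds are assumed.
func SelectMode(p TargetProfile) Mode {
	switch {
	case p.HasConcurrency || p.SecuritySensitive:
		// Higher-risk targets are promoted to Strict.
		return Strict
	case p.ExternalDeps == 0 && p.Branches <= 2 && !p.UsesContext:
		// Simple pure functions stay in Light.
		return Light
	default:
		// Ordinary business logic defaults to Standard.
		return Standard
	}
}
```

The point of the sketch is the ordering: risk signals are checked first, so a small function that happens to be concurrent is never routed to Light.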
4.2 The Defect-First Workflow Is Central¶
The most important design choice in unit-test is not table-driven structure and not coverage. It is the Defect-First Workflow. In Standard / Strict modes, the skill requires a Failure Hypothesis List before writing tests, covering at least:
- loop / index risks,
- collection-transform risks,
- branching risks,
- concurrency risks,
- context / time risks.
This is critical because the highest-value part of unit testing is not "which parameter combinations exist," but "how is this code most likely to fail." For lower-risk targets, Light mode intentionally skips this heavier methodology layer so that simple unit tests do not turn into oversized process.
The evaluation supports this directly: with-skill and without-skill were almost identical in core functional-path coverage, and the largest gap came from Failure Hypothesis Lists, Killer Cases, and Boundary Checklists. In other words, the skill's main increment is not "testing more." It is "explaining much more systematically why these tests exist."
4.3 Killer Cases Are a Hard Constraint¶
In Standard / Strict modes, the skill requires at least one Killer Case per test target, and each Killer Case must contain four parts:
- defect hypothesis,
- fault injection or boundary setup,
- critical assertion,
- removal-risk statement.
This is one of the skill's most distinctive design choices. The difference between a normal edge case and a Killer Case is that a Killer Case must point to a named defect and explain "if this assertion is removed, which known bug can escape."
This solves a very practical maintenance problem: test files are often simplified, refactored, or partially deleted over time. Without the removal-risk layer, later maintainers cannot easily tell which assertions are decorative and which are the actual regression barrier. The evaluation showed that without-skill often covered the same paths, but lacked this explanatory layer, leaving the regression-defense boundary less explicit.
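A minimal sketch of what a Killer Case with all four parts might look like in a stdlib-only project. The `User`, `UserDTO`, and `ToDTOs` names are hypothetical, and the comment layout is one possible convention rather than the format prescribed by the skill.

```go
package main

import "testing"

type User struct {
	ID   int
	Name string
}

type UserDTO struct {
	ID   int
	Name string
}

// ToDTOs is a hypothetical mapping target used only for illustration.
func ToDTOs(users []User) []UserDTO {
	out := make([]UserDTO, 0, len(users))
	for _, u := range users {
		out = append(out, UserDTO{ID: u.ID, Name: u.Name})
	}
	return out
}

func TestToDTOs(t *testing.T) {
	t.Run("killer: last user survives mapping", func(t *testing.T) {
		// Defect hypothesis: an off-by-one loop bound drops the last user.
		// Boundary setup: three users, so the last index differs from the first.
		in := []User{{ID: 1, Name: "ann"}, {ID: 2, Name: "bob"}, {ID: 3, Name: "cy"}}

		got := ToDTOs(in)

		// Critical assertion: the length and the fields of the last element.
		if len(got) != len(in) {
			t.Fatalf("len(ToDTOs) = %d, want %d", len(got), len(in))
		}
		want := UserDTO{ID: 3, Name: "cy"}
		if last := got[len(got)-1]; last != want {
			t.Errorf("last DTO = %+v, want %+v", last, want)
		}
		// Removal risk: without these checks, a mapper that silently
		// drops the final user would still pass the suite.
	})
}
```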
4.4 The Boundary Checklist Is Explicit, Not Implicit¶
unit-test turns boundary review into:
- a 5-item checklist for Light,
- a 12-item checklist for Standard / Strict.
The checklist covers:
- nil,
- empty,
- singleton,
- size / last-element boundary,
- min/max boundaries,
- invalid format,
- zero-value struct/default trap,
- dependency error,
- context cancellation,
- concurrent / race behavior,
- mapping completeness,
- killer-case mapping.
This is highly practical because missed boundaries are usually not caused by ignorance. They are missed because they do not naturally surface in a stable order during test writing. Once checklist discipline is explicit, test quality no longer depends on immediate memory alone; it has a reviewable baseline.
The evaluation also showed that without-skill did not necessarily omit boundary cases entirely, but it did omit the explicit checklist. That means the team could not quickly see which boundaries were systematically covered and which were only touched incidentally.
4.5 The Coverage Gate Is Scoped and Rationale-Based¶
The skill does not reduce coverage policy to a simplistic "always >= 80%." It explicitly distinguishes:
- logic-heavy packages: default >= 80%,
- infra / IO-heavy packages: possibly lower, but only with explicit rationale.
It also insists that:
- coverage must not be inflated with low-signal tests,
- even when coverage may be lower, boundary discipline still remains,
- multi-package situations should use -coverpkg=./... or per-package profiles for honest measurement.
This is mature design because it rejects two common extremes:
- treating coverage as the whole truth,
- abandoning coverage gates entirely because they are imperfect.
unit-test instead keeps coverage as one quality threshold while never letting it replace defect-first design.
4.6 Why Assertion Mutation Resistance Is Emphasized So Strongly¶
The skill repeatedly requires mutation-resistant assertions while adapting assertion style to project convention. It calls for:
- require / assert in testify-based projects,
- cmp.Diff when go-cmp is the established convention,
- t.Fatalf / t.Errorf in stdlib-only projects,
- no existence-only checks like err == nil or not nil,
- business-field assertions rather than object existence.
This solves the core weakness of low-signal tests. A test that only proves "an object came back" may still pass when fields are swapped, defaults are wrong, or the last mapped item is missing. What the skill truly enforces is not one assertion library, but sufficient assertion strength to express business correctness.
The first rule in bug-finding-techniques.md is Mutation-Resistant Assertions, and the evaluation repeatedly supports the same point: the skill cares about which exact field must fail when behavior regresses, not just whether execution reached the path.
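The difference between an existence-only check and a mutation-resistant assertion can be shown in a few lines. This is a sketch in stdlib style; the `Order` type and `NewOrder` constructor are hypothetical.

```go
package main

import "testing"

type Order struct {
	ID       string
	Quantity int
	Status   string
}

// NewOrder is a hypothetical constructor used only for illustration.
func NewOrder(id string, qty int) (*Order, error) {
	return &Order{ID: id, Quantity: qty, Status: "pending"}, nil
}

func TestNewOrder(t *testing.T) {
	got, err := NewOrder("o-1", 3)

	// Existence-only checks: needed as guards, but on their own they
	// still pass if Quantity or Status is wrong.
	if err != nil {
		t.Fatalf("NewOrder() unexpected error: %v", err)
	}
	if got == nil {
		t.Fatal("NewOrder() returned nil order")
	}

	// Mutation-resistant check: every business field is pinned, so a
	// swapped field or a wrong default makes the test fail.
	want := Order{ID: "o-1", Quantity: 3, Status: "pending"}
	if *got != want {
		t.Errorf("NewOrder() = %+v, want %+v", *got, want)
	}
}
```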
4.7 Why Test Structure Rules Are So Concrete¶
The skill requires:
- top-level naming adapted to target type,
- t.Run groups that map 1-to-1 to test targets,
- table-driven cases,
- defect-oriented readable case names,
- preferring t.Parallel(), but only when cases are truly isolated.
This is not just style. It is about maintenance cost. In the evaluation, without-skill often produced many separate TestXxx functions. Functionally that was sometimes fine, but it makes incremental additions more expensive and repeats setup more often. With-skill consistently used table-driven + t.Run, making future case additions cheaper.
4.8 Concurrency Testing Is Designed Around Determinism¶
unit-test explicitly requires:
- running go test -race,
- avoiding time.Sleep for synchronization,
- using channel barriers, WaitGroups, and channel sequencing for deterministic control,
- avoiding unsafe t.Parallel() combinations.
This is important because the biggest risk in concurrent unit tests is not that they "will not run." It is that they fail only sometimes, or pass locally and fail in CI. The patterns in references/concurrency-testing.md effectively turn concurrency testing from "guessing scheduling with time" into "controlling scheduling with synchronization primitives." That is why -race is treated as a hard requirement rather than an optional enhancement.
4.9 Property-Based Testing Is Supported but Not Allowed to Replace Table-Driven Tests¶
The skill takes a restrained approach to property-based testing:
- not applicable in Light,
- optional recommendation in Standard,
- required recommendation or implementation in Strict when the pattern fits.
It only recommends this path for patterns such as roundtrip, idempotency, preservation, commutativity, parse validity, and monotonicity, and it explicitly says:
- property-based tests verify invariants,
- table-driven tests verify exact boundaries and concrete outputs,
- Killer Cases are still not replaced by property-based tests.
This is mature design because it avoids another testing distortion: seeing one invariant and trying to replace hand-written boundary tests with randomized checks. The skill's position is clear: property-based testing adds breadth; it does not replace defect-driven or boundary-driven tests.
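The division of labor can be sketched with the standard library's testing/quick package: one property test for the roundtrip invariant, one table for exact boundaries. `EncodeID` and `DecodeID` are hypothetical targets, and testing/quick is used here as an assumption; the skill's references may recommend a different library.

```go
package main

import (
	"strconv"
	"testing"
	"testing/quick"
)

// EncodeID and DecodeID are a hypothetical base-36 codec pair.
func EncodeID(n uint32) string {
	return strconv.FormatUint(uint64(n), 36)
}

func DecodeID(s string) (uint32, error) {
	v, err := strconv.ParseUint(s, 36, 32)
	return uint32(v), err
}

// Property test: verifies the roundtrip invariant over random inputs.
func TestID_RoundTrip(t *testing.T) {
	roundTrip := func(n uint32) bool {
		got, err := DecodeID(EncodeID(n))
		return err == nil && got == n
	}
	if err := quick.Check(roundTrip, nil); err != nil {
		t.Error(err)
	}
}

// Table test: pins exact outputs at boundaries the property cannot express.
func TestEncodeID_Boundaries(t *testing.T) {
	tests := []struct {
		name string
		in   uint32
		want string
	}{
		{name: "zero encodes to a single digit", in: 0, want: "0"},
		{name: "largest single digit", in: 35, want: "z"},
		{name: "first two-digit value", in: 36, want: "10"},
	}
	for _, tc := range tests {
		t.Run(tc.name, func(t *testing.T) {
			if got := EncodeID(tc.in); got != tc.want {
				t.Errorf("EncodeID(%d) = %q, want %q", tc.in, got, tc.want)
			}
		})
	}
}
```

The property test would still pass if both functions switched to a different base together, which is why the table of concrete outputs remains necessary.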
4.10 Generated Code Exclusion Is Necessary¶
unit-test explicitly excludes:
- *.pb.go, *_gen.go, wire_gen.go, mock_*.go, *_mock.go,
- files marked with Code generated ... DO NOT EDIT.
This is necessary because generated code is usually better validated by its generator's own guarantees or by testing the higher-level behavior around it. Without this exclusion, a model can easily spend effort where coverage is easy to raise but long-term value is low. The important point is not that generated code should never be tested, but that the skill treats it as outside the default high-value scope.
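One way such an exclusion check could be implemented is sketched below. The `IsGenerated` helper is hypothetical; the header regular expression follows the Go toolchain's documented convention for generated files, which is assumed to be the marker the skill refers to.

```go
package main

import (
	"path/filepath"
	"regexp"
)

// generatedNamePatterns mirrors the file-name patterns excluded by default.
var generatedNamePatterns = []string{
	"*.pb.go", "*_gen.go", "wire_gen.go", "mock_*.go", "*_mock.go",
}

// generatedHeader matches the Go convention for generated-file markers:
// a line of the form "// Code generated <anything> DO NOT EDIT."
var generatedHeader = regexp.MustCompile(`(?m)^// Code generated .* DO NOT EDIT\.$`)

// IsGenerated reports whether a file should be treated as generated,
// based on its base name or on a marker line in its content.
func IsGenerated(path, content string) bool {
	base := filepath.Base(path)
	for _, pattern := range generatedNamePatterns {
		// The patterns are constants, so the error from Match is ignored.
		if ok, _ := filepath.Match(pattern, base); ok {
			return true
		}
	}
	return generatedHeader.MatchString(content)
}
```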
4.11 Auto Scorecard and Reporting Integrity Exist Together¶
The skill's output is not allowed to stop at "tests added." It must report at least:
- mode,
- version adaptation,
- boundary checklist,
- coverage / race results,
- a 13-item or 7-item scorecard,
- final PASS / FAIL.
In Standard / Strict modes, it must additionally include:
- hypothesis / killer-case mapping,
- fuller methodology output,
- JSON summary.
At the same time, Reporting Integrity says the model must not claim -race or coverage results unless they were actually run, and must provide exact commands when execution is not possible.
The value of this pairing is that unit tests become an auditable delivery artifact rather than only a code change. The team gets not only test files, but also answers to:
- why this mode was selected,
- which defects are covered, especially in Standard / Strict outputs,
- which killer cases are critical in Standard / Strict,
- whether coverage / race requirements were really met,
- what residual risks remain.
The evaluation's clearest advantage also sits here: methodology output and audit traceability were the main differentiators.
4.12 Trigger Design Is Part of the Skill's Architecture¶
Unlike many skills, unit-test treats trigger accuracy as a first-class concern in evaluation, and the result was 20/20. That was not incidental; it came from deliberate Description design:
- strong trigger keyword coverage,
- explicit exclusions for benchmark / fuzz / integration / E2E / load / mock tasks,
- strong imperative trigger wording,
- and strong "cannot be replaced from memory" signals.
This belongs in the rationale because it shows that unit-test is not merely adding more rules. It is also solving "when should this skill activate, and when should it yield to another testing skill?"
5. Problems This Design Solves¶
Combining the current SKILL.md, key references, and the evaluation report, the skill solves the following problems:
| Problem type | Corresponding design | Practical effect |
|---|---|---|
| Unit tests chase coverage only | Defect-First Workflow | Tests are shaped more like bug-finding than path-counting |
| High-risk and low-risk code use the same process | Execution Modes | Test-process weight matches code risk |
| Critical assertions get deleted during refactors | Killer Case + Removal Risk | Regression barriers become explicit |
| Boundary cases are missed | Boundary Checklist | Coverage becomes more systematic |
| Assertions are low-signal | Mutation-Resistant Assertions | Field, mapping, and state regressions are easier to catch |
| Concurrency tests are flaky | -race + deterministic concurrency patterns | Tests become more reliable |
| Test organization is scattered | table-driven + t.Run + target adaptation | Maintenance cost goes down |
| Test completeness is not auditable | Scorecard + Reporting Integrity | Teams can judge deliverability more clearly |
6. Key Highlights¶
6.1 It Turns Unit Testing from a Coverage Exercise into a Defect-Discovery Process¶
This is the skill's most important strength and the main source of differentiation in evaluation.
6.2 Light / Standard / Strict Makes Test Intensity Match Risk¶
Not every code path deserves the same process overhead, and this skill makes that judgment explicit.
6.3 Killer Cases Are a Highly Distinctive Design Feature¶
For every Standard / Strict target, at least one case must clearly say "if this assertion disappears, this bug can escape."
6.4 Boundary Checklists and Scorecards Make Test Quality Auditable¶
The team no longer has to rely on intuition about whether the suite is "probably enough."
6.5 It Is Especially Strong on Concurrent and Time-Sensitive Go Code¶
Channel barriers, error fan-in, panic recovery, -race, and t.Parallel() safety rules show specialized design around Go's high-risk testing areas.
6.6 Its Real Increment Is Methodology and Auditability, Not Raw Path Count¶
The evaluation already shows that with-skill and without-skill were similar in core functional-path coverage. The real gap came from Failure Hypothesis Lists, Killer Cases, Boundary Checklists, test organization, and scorecards. That means the core value of unit-test is testing governance, not simply "generating more test cases."
7. When to Use It — and When Not To¶
| Scenario | Suitable | Reason |
|---|---|---|
| Adding unit tests for Go logic code or boundary behavior | Very suitable | This is the core use case |
| Concurrent, context-sensitive, or mapping-heavy code with real bug risk | Very suitable | Defect-First + -race are highly targeted here |
| Reviewing or restructuring existing _test.go quality | Very suitable | Scorecard and Killer Case rules are practical |
| Benchmarks | Not suitable | Should go to benchmarking / performance workflows |
| Fuzz tests | Not suitable | Should go to fuzzing workflows |
| Integration / E2E / load tests | Not suitable | All are outside unit-test scope |
| Mock generation | Not suitable | That is not test design itself |
8. Conclusion¶
The real strength of unit-test is not that it can write Go tests faster. It is that it systematizes the parts of unit testing that most often become formalistic: choose the right mode based on risk, list defect hypotheses first, design a Killer Case for each target, and then constrain the final suite with boundary discipline, mutation-resistant assertions, deterministic concurrency control, real coverage / race evidence, and a tiered scorecard.
From a design perspective, the skill expresses a clear principle: the key to high-quality unit testing is not running more functions once, but being able to explain which bugs the suite is preventing, why certain assertions cannot be removed, which boundaries have been systematically covered, which risks still remain, and whether those conclusions are backed by real -race and coverage evidence. That is why it is especially well suited to Go logic testing, concurrency-sensitive code, and test-quality improvement work.
9. Document Maintenance¶
This document should be updated when:
- the Hard Rules, Execution Modes, Defect-First Workflow, Coverage Gate, Auto Scorecard, Property-Based Testing, Reporting Integrity, or Output Expectations in skills/unit-test/SKILL.md change,
- key patterns or examples in skills/unit-test/references/killer-case-patterns.md, bug-finding-techniques.md, concurrency-testing.md, or property-based-testing.md change,
- key supporting results in evaluate/unit-test-skill-eval-report.md or evaluate/unit-test-skill-eval-report.zh-CN.md change.
Review quarterly; review immediately if the mode-selection rules, Killer Case discipline, scorecard structure, or trigger description of unit-test change substantially.
10. Further Reading¶
- skills/unit-test/SKILL.md
- skills/unit-test/references/killer-case-patterns.md
- skills/unit-test/references/bug-finding-techniques.md
- skills/unit-test/references/concurrency-testing.md
- skills/unit-test/references/property-based-testing.md
- skills/unit-test/scripts/tests/COVERAGE.md
- evaluate/unit-test-skill-eval-report.md
- evaluate/unit-test-skill-eval-report.zh-CN.md