unit-test Skill Evaluation Report¶
Evaluation framework: skill-creator
Evaluation date: 2026-03-11
Evaluation subject: unit-test
unit-test is a Go-focused skill for generating and improving unit tests. It is suited for adding or strengthening logic tests, fixing low-signal tests, and designing targeted test cases for concurrency, boundary, and mapping defects. Its three standout strengths are:
- High trigger accuracy: it reliably distinguishes unit tests from benchmarks, fuzz tests, integration tests, and similar adjacent tasks.
- Emphasis on failure hypotheses, Killer Cases, and boundary checklists, shifting test goals from "coverage chasing" to "catching real bugs".
- Consistent use of table-driven tests, t.Run, race detection, and adaptation to the project's existing assertion style, so tests are both standard and aligned with codebase practice.
1. Evaluation Overview¶
This evaluation reviews the unit-test skill along two axes: trigger accuracy and actual task performance. Task performance covers 3 different types of Go concurrency/time-sensitive target code. Each target was run with both with-skill and without-skill configurations, for 3 scenarios × 2 configs = 6 independent subagent runs, scored against 34 assertions.
| Dimension | With Skill | Without Skill | Delta |
|---|---|---|---|
| Trigger accuracy | 20/20 (100%) | — | Recall 10/10, Precision 10/10 |
| Assertion pass rate | 34/34 (100%) | 21/34 (61.8%) | +38.2 pp |
| Functional coverage (core paths) | Full | Full | No difference |
| Methodology output (hypothesis list / Killer Case / boundary checklist) | Full | Zero | Decisive difference |
| Test organization (table-driven + t.Run) | 3/3 | 0/3 | Skill consistently applied |
2. Trigger Accuracy¶
2.1 Test Method¶
20 test queries were designed (10 should trigger / 10 should not trigger), covering Chinese and English, various unit-test scenarios, and easily confused adjacent tasks (benchmark, fuzz test, integration test, E2E, CI config, mock generation, documentation, translation, pprof analysis). Independent subagents simulated Cursor’s <agent_skills> trigger path. Each query was judged 3 times independently, for 60 judgments total.
Note on `run_eval.py` failure: The skill-creator `run_eval.py` script does not work inside Cursor IDE — the `claude -p` subprocess fails silently due to lost auth context (error: "Your organization does not have access to Claude"), causing all 60 queries to return `triggered=false` and meaningless 0% Recall / 50% Accuracy. This report's trigger evaluation therefore uses a Task subagent simulation instead: each round is evaluated by an independent agent with fresh context.
2.2 Results¶
Overall accuracy: 20/20 (100%)
Recall: 10/10 (100%) — all positive queries correctly triggered (3 rounds consistent)
Precision: 10/10 (100%) — all negative queries correctly excluded (3 rounds consistent)
F1: 100%
Total judgments: 60/60 (TP=30, FN=0, FP=0, TN=30)
2.3 Positive Queries (All Correctly Triggered)¶
| # | Query | Judgment | Trigger reason |
|---|---|---|---|
| 1 | Help me write unit tests for service.go… concurrency issues | ✅ | "write unit tests" + concurrency scenario |
| 2 | I need unit tests for jwt.go… expiry boundary… zero coverage | ✅ | unit test + coverage gate |
| 3 | Unit test failed, TestUserService_Create/duplicate_email… | ✅ | fix test + test debugging |
| 4 | handler_test.go is all TestXxx, want to refactor to table-driven + t.Run | ✅ | table-driven + improve tests |
| 5 | Coverage dropped to 62%, CI blocked… add a few targeted tests | ✅ | coverage gate + add tests |
| 6 | MapReduce function… empty slice and single element… can you write tests to verify | ✅ | "verify this function works" |
| 7 | sync.Pool wrapper… want to confirm no data race under concurrency… run with -race | ✅ | -race + check for race conditions |
| 8 | Add unit tests for retry.go… retry count boundary, context cancellation | ✅ | unit test + boundary scenario |
| 9 | Help me write tests to verify middleware chain execution order… | ✅ | "write tests" + verify function |
| 10 | Review service_test.go test quality… killer case | ✅ | review tests + test quality |
2.4 Negative Queries (All Correctly Excluded)¶
| # | Query | Judgment | Exclusion reason |
|---|---|---|---|
| 11 | Help me write a benchmark comparing sync.Map… -benchmem | ✅ | Benchmark, not unit test |
| 12 | Need integration tests to verify UserRepository with real MySQL… | ✅ | Integration test, not unit test |
| 13 | Write a fuzz test for json_parser.go… go test -fuzz | ✅ | Fuzz test, not unit test |
| 14 | Help configure GitHub Actions CI workflow… | ✅ | CI config, not writing tests |
| 15 | Use mockgen to generate mock for UserStore interface… | ✅ | Mock generation, not writing tests |
| 16 | Help me write an E2E test… chromedp or playwright… | ✅ | E2E test, not unit test |
| 17 | Load test gRPC interface… run ghz for 10 seconds… | ✅ | Load test, not unit test |
| 18 | Help me write a technical doc on Go testing strategy… | ✅ | Documentation, not writing tests |
| 19 | Translate markdown in gocore/map/ to English… | ✅ | Translation, unrelated |
| 20 | Help analyze pprof CPU profile data… | ✅ | Profiling, not writing tests |
2.5 Conclusion¶
The improved Description uses a four-layer strategy for trigger accuracy:
- Irreplaceability signals — "references/ with killer-case pattern templates", "cannot be reproduced from memory", "mandatory 13-check tiered scorecard" make the model judge that its own knowledge cannot replace the skill
- Strong imperative tone — "ALWAYS read this skill before writing, reviewing, or fixing ANY Go test file (_test.go)"
- Broad trigger coverage — 12 keywords in Chinese and English + 4 indirect trigger modes (verify, check for race conditions, improve test quality, coverage is too low)
- Explicit exclusion scope — "Do NOT use for benchmarks, fuzz tests, integration tests, E2E tests, load tests, or mock generation" effectively isolates 6 adjacent task types
3. Actual Task Performance¶
3.1 Test Method¶
Three Go code files in the repo with no existing tests were selected, covering different testing challenges:
| Scenario | Target code | Testing challenge | Assertions |
|---|---|---|---|
| Eval 1: resilience.Do | designpattern/circuitbreaker/resilience/resilience.go | Combined rate-limit + circuit-breaker + retry; multi-component interaction, context propagation, retry boundaries | 11 |
| Eval 2: WorkerPool | designpattern/bulkhead/pool/pool.go | Concurrent worker pool; goroutine leak, double-Shutdown safety, task loss | 11 |
| Eval 3: Limiter | designpattern/circuitbreaker/ratelimiter/ratelimiter.go | Token bucket rate limiter; time-sensitive tests, concurrency races, float precision | 12 |
Each scenario ran 1 with-skill + 1 without-skill subagent, 6 runs total.
3.2 Assertion Pass Rate Overview¶
| Scenario | Assertions | With Skill | Without Skill | Delta |
|---|---|---|---|---|
| Eval 1: resilience.Do | 11 | 11/11 (100%) | 7/11 (63.6%) | +36.4% |
| Eval 2: WorkerPool | 11 | 11/11 (100%) | 6/11 (54.5%) | +45.5% |
| Eval 3: Limiter | 12 | 12/12 (100%) | 8/12 (66.7%) | +33.3% |
| Total | 34 | 34/34 (100%) | 21/34 (61.8%) | +38.2% |
3.3 Per-Item Comparison: Which Assertions Drove the Gap?¶
The 13 assertions that Without-skill failed across all 3 scenarios fall into 4 methodology dimensions:
| Failure type | Count | Failed assertions |
|---|---|---|
| Failure Hypothesis List | 3 | All 3 scenarios — no formal defect hypothesis table |
| Killer Cases | 3 | All 3 scenarios — no named defect-hypothesis-driven killer cases |
| Table-driven + t.Run | 3 | All 3 scenarios use separate TestXxx functions instead of subtest organization |
| Boundary Checklist | 4 | Missing in all 3 scenarios; Eval 2 additionally missing a goroutine-leak discussion |
Key observation: All 13 Without-skill failures are methodology-level; functional coverage was equivalent.
3.4 Functional Coverage Comparison¶
On "which code paths are tested", the two sides are similar:
| Functional path | With Skill | Without Skill |
|---|---|---|
| **Eval 1 core paths** | | |
| Rate limit reject (ErrRateLimited) | ✅ | ✅ |
| Circuit breaker open (ErrBreakerOpen) | ✅ | ✅ |
| Context cancellation (during backoff) | ✅ | ✅ |
| Retry boundary (MaxRetries=0/1) | ✅ | ✅ |
| -race passes | ✅ | ✅ |
| **Eval 2 core paths** | | |
| TrySubmit queue full returns false | ✅ | ✅ |
| Shutdown drains tasks | ✅ | ✅ |
| Double-Shutdown safety | ✅ | ✅ |
| Concurrent Submit stress test | ✅ | ✅ |
| -race passes | ✅ | ✅ |
| **Eval 3 core paths** | | |
| Initial burst capacity | ✅ | ✅ |
| Tokens exhausted | ✅ | ✅ |
| Token refill | ✅ | ✅ |
| Burst cap | ✅ | ✅ |
| Concurrent Allow() | ✅ | ✅ |
| -race passes | ✅ | ✅ |
Conclusion: Functional coverage is the same. Without-skill did not test fewer code paths; it lacked the methodology framework around those paths.
4. Skill Differentiator Value Deep Dive¶
4.1 Failure Hypothesis List¶
With Skill: Each scenario produces 7–9 numbered hypotheses (H1–H9), organized by category (Branching, Concurrency, Loop/index, Context/time), each mapped to specific test cases.
Without Skill: No such output. Tests are organized by functional area (Rate Limiting, Success Paths, Retry Exhaustion) but with no formal defect analysis.
| Dimension | With Skill | Without Skill |
|---|---|---|
| Hypothesis count | Eval1: 9, Eval2: 7, Eval3: 9 | 0, 0, 0 |
| Defect→test mapping | Each hypothesis labeled Covered By | None |
| Coverage analysis | Traceable which defects are tested | Only see which paths were run |
Practical value: The Failure Hypothesis List matters not because "one more table" exists, but because it drives test design — first think "what bugs might this code have", then design tests accordingly, rather than "spread coverage by function signature".
4.2 Killer Cases¶
With Skill: Each scenario has 3–4 killer cases, each with:
- Linked defect hypothesis (e.g. KC1→H3)
- Fault injection description
- Critical assertion (with concrete field and value)
- Removal Risk Statement ("if this assertion is removed, what bug escapes")
Example (Eval 1 KC1):
- Linked hypothesis: H3 — ErrBreakerOpen is retried instead of returned immediately
- Critical assertion: `backoffCalls == 0` — no retry backoff was triggered
- Removal risk: If removed, the known bug (ErrBreakerOpen not short-circuiting retries) can escape detection — 4 unnecessary backoff+retry cycles would occur.
Without Skill: Tests cover the same paths (e.g. TestDo_BreakerOpenStopsRetry) but with no removal risk analysis. Developers cannot tell which assertion is "critical for regression" vs "nice to have".
4.3 Boundary Checklist¶
With Skill: Standard 12-item checklist, each labeled Covered / N/A + notes:
| # | Item | Status |
|---|---|---|
| 1 | nil input | Covered — nil Limiter, nil Backoff |
| 2 | Empty value | N/A |
| 3 | Single element (len==1) | Covered — MaxRetries=1 |
| 4 | Size boundary (n=2, n=3, last) | Covered — MaxRetries=0,2,3,-1 |
| ... | ... | ... |
Without Skill: No such output. Boundary cases are scattered across test functions with no systematic audit.
4.4 Test Organization: Table-Driven vs Separate Functions¶
| Dimension | With Skill | Without Skill |
|---|---|---|
| Organization | t.Run subtests (TestDo/12 subtests) | 17 separate TestXxx functions |
| Parallel | t.Parallel() | None |
| Naming | Snake_case, verb+expectation (rate_limited_returns_ErrRateLimited) | PascalCase (TestDo_RateLimited) |
| Maintainability | Add case = add one table row | Add case = new function + repeated setup |
4.5 Auto Scorecard (13-Check Tiered Scorecard)¶
With-skill output includes a structured 13-item scorecard (3 Critical + 5 Standard + 5 Hygiene) with pass/fail evidence and tiered summary. Without-skill has no such output.
4.6 Additional Findings¶
With-skill runs surfaced real insights in the code that Without-skill did not mention:
| Finding | Scenario | Notes |
|---|---|---|
| Worker pool quit channel dead code | Eval 2 | Worker select’s quit branch never triggers in close(tasks) + range mode |
| Tokens() state mutation risk | Eval 3 | Reading Tokens calls internal refill(), changing state; the read has side effects |
| Token fractional threshold precision risk | Eval 3 | Float comparison l.tokens >= 1 after refill may have precision issues |
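The precision risk in the last row is easy to reproduce. A minimal sketch, with a hypothetical refill increment of 0.1 tokens: ten refills do not sum to exactly 1.0 in float64, so a `tokens >= 1` threshold check can fail despite enough refills having occurred.

```go
package main

import "fmt"

func main() {
	// Hypothetical refill: 0.1 tokens per tick. Ten ticks should yield
	// one full token, but float64 accumulation falls just short of 1.0.
	tokens := 0.0
	for i := 0; i < 10; i++ {
		tokens += 0.1
	}
	fmt.Println(tokens)      // 0.9999999999999999
	fmt.Println(tokens >= 1) // false
}
```

Typical mitigations are an epsilon comparison or tracking integer micro-tokens instead of fractional float64 tokens.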
5. Comprehensive Analysis¶
5.1 Skill Differentiator Value Map¶
| Dimension | Contribution | Notes |
|---|---|---|
| Methodology framework | ★★★★★ | Failure Hypothesis List + Killer Cases + Boundary Checklist are capabilities Without-skill does not produce |
| Test organization discipline | ★★★★☆ | table-driven + t.Run + t.Parallel consistently applied; Without-skill did not follow |
| Quality audit traceability | ★★★★★ | 13-check Scorecard + Removal Risk Statement provide auditable evidence of test quality |
| Defect discovery | ★★☆☆☆ | Code insights (dead code, side effects) are bonus; functional path coverage same as Without-skill |
| Functional coverage difference | ★☆☆☆☆ | Core paths fully covered by both; no difference |
5.2 Skill’s True Value Proposition¶
The skill is not for "testing more paths" — it is for "thinking systematically about why to test a path".
Core value by importance:
- Defect-hypothesis-driven test design — List "possible bugs" (H1–H9) first, then design tests. Without-skill instead "traverses parameter combinations by API signature". The former finds bugs; the latter spreads coverage.
- Killer Case + Removal Risk — Each killer case answers "what bug does this assertion prevent from escaping". Without this, maintainers cannot distinguish critical from redundant assertions and may delete them during refactors.
- Structured quality audit — 13-check Scorecard provides quantifiable quality judgment (Critical tier all pass = mergeable), not subjective "looks well tested".
- Systematic boundary checklist — 12-item standard checklist ensures nil, empty, boundary, concurrency, context cancellation, etc. are not missed; each item’s Covered/N/A provides audit trail.
- Consistent test organization — table-driven + t.Run is not just style; it affects maintainability and the cost of adding cases.
5.3 Skill Weaknesses¶
- No functional coverage difference: In all 3 scenarios, Without-skill covered the same core paths as With-skill (rate limit, circuit breaker, context cancel, concurrency). The skill’s differentiation is entirely at the methodology level, not "what to test".
- Without-skill sometimes has more test cases: Eval 1 Without-skill produced 17 separate test functions vs With-skill’s 12 subtests. More cases ≠ higher quality, but shows Without-skill is not "testing less".
- Methodology output value depends on team: Failure Hypothesis List and Killer Cases may be "helpful but not essential" for senior developers; more valuable for test newcomers or code review.
- Limited evaluation scenarios: Only 3 concurrency/design-pattern scenarios; no database ops, HTTP handlers, pure logic functions, etc.
6. Score Summary¶
6.1 Dimension Scores¶
| Dimension | With Skill | Without Skill | Delta |
|---|---|---|---|
| Functional coverage | 5.0/5 | 5.0/5 | 0.0 |
| Methodology completeness | 5.0/5 | 1.0/5 | +4.0 |
| Test organization | 5.0/5 | 2.5/5 | +2.5 |
| Traceability (audit) | 5.0/5 | 1.0/5 | +4.0 |
| Code insight | 4.0/5 | 3.0/5 | +1.0 |
| Maintainability | 4.5/5 | 3.0/5 | +1.5 |
| Overall mean | 4.75/5 | 2.58/5 | +2.17 |
6.2 Weighted Total Score¶
| Dimension | Weight | Score | Weighted |
|---|---|---|---|
| Trigger accuracy | 25% | 10/10 | 2.50 |
| Assertion pass rate (with/without delta) | 20% | 9.2/10 | 1.84 |
| Methodology output (hypothesis/Killer/checklist) | 20% | 10/10 | 2.00 |
| Test organization & maintainability | 15% | 9.0/10 | 1.35 |
| Code insight added value | 10% | 7.0/10 | 0.70 |
| Functional coverage difference (vs baseline) | 10% | 5.0/10 | 0.50 |
| Weighted total | 100% | — | 8.89 |
7. Evaluation Methodology¶
Trigger Evaluation¶
- Method: Subagent simulation of trigger judgment (3 independent rounds × 20 queries = 60 judgments)
- Query design: 10 positive (Chinese/English, direct/indirect trigger modes) + 10 negative (6 adjacent task types: benchmark/fuzz/integration/E2E/load/mock + CI/docs/translation/profiling)
- Environment: Cursor IDE Task subagent (generalPurpose, fast model), fresh context each round
- Limitation: Proxy test, not end-to-end real trigger; does not account for 50+ competing skills
Task Evaluation¶
- Method: 3 scenarios × 2 configs = 6 independent subagent runs
- Target code: All real Go code in repo with no existing tests (not artificially constructed)
- Assertions: 34, covering file creation, methodology output, functional paths, test organization, race safety, quality audit (6 dimensions)
- Scoring: Manual per-assertion comparison with subagent output; record pass/fail + evidence
- Baseline: Same prompt, SKILL.md not read
Evaluation Materials¶
- Trigger evaluation queries: `unit-test-workspace/trigger-eval-set.json`
- Trigger evaluation results: `unit-test-workspace/trigger-eval-results.json`
- Eval definitions: `unit-test-workspace/evals/evals.json`
- Scoring results: `unit-test-workspace/iteration-1/{resilience-do,worker-pool,rate-limiter}/{with_skill,without_skill}/grading.json`
- Benchmark summary: `unit-test-workspace/iteration-1/benchmark.json`
- Description improvement report: `unit-test-workspace/description-improvement-report.md`
- Eval viewer: `unit-test-workspace/iteration-1/eval-review.html`
- Generated test code: `unit-test-workspace/iteration-1/*/outputs/*_test.go`
- Generated reports: `unit-test-workspace/iteration-1/*/outputs/report.md`