fuzzing-test Skill Evaluation Report
- Evaluation framework: skill-creator
- Evaluation date: 2026-03-12
- Evaluation subject: fuzzing-test
fuzzing-test is a skill specialized in generating high-signal fuzz tests for Go code, suitable for parsers, codecs, state transitions, and other targets with clear invariants. It also helps determine when a target is not worth fuzzing at all. Its three main strengths are: running an Applicability Gate first before deciding whether to enter the generation flow, avoiding "write fuzz for every function"; explicitly rejecting unsuitable targets with alternative suggestions instead of producing low-value code; and built-in target prioritization, cost tiers, and structured output for more controllable, cost-effective fuzz testing.
1. Evaluation Overview
This evaluation reviews the fuzzing-test skill along two dimensions: actual task performance and token cost-effectiveness. It uses 3 fuzz test generation scenarios (suitable parser target, unsuitable network-dependent target, package-level evaluation with multiple candidate functions), each run with both with-skill and without-skill configurations—3 scenarios × 2 configs = 6 independent subagent runs—scored against 35 assertions.
| Dimension | With Skill | Without Skill | Delta |
|---|---|---|---|
| Assertion pass rate | 35/35 (100%) | 16/35 (45.7%) | +54.3 pp |
| Applicability Gate correctness | 3/3 scenarios correct | 0/3 with formal gate | Skill-only |
| Rejection of unsuitable targets | Correct rejection + alternatives | Built workaround instead | Largest single delta |
| Output Contract structured report | 3/3 | 0/3 | Skill-only |
| Size guard coverage | 100% (5/5 harnesses) | 0% (0/5 harnesses) | Skill systematic |
| Skill Token cost (SKILL.md only) | ~4,100 tokens | 0 | — |
| Skill Token cost (typical load) | ~6,500 tokens | 0 | — |
| Token cost per 1% pass-rate gain | ~75 tok (SKILL.md only) / ~120 tok (typical) | — | — |
2. Test Methodology
2.1 Scenario Design
| Scenario | Repo / Target | Focus | Assertions |
|---|---|---|---|
| Eval 1: parser-fuzz | internal/parser/Parse — URL parser, pure function | Full flow for Tier 1 fuzzing target | 15 |
| Eval 2: fetch-reject | internal/github/fetcher.Fetch — network-dependent method | Correct rejection of unsuitable fuzzing target | 7 |
| Eval 3: converter-multi | internal/converter package — multiple candidate functions | Multi-target selection, priority evaluation, selective generation | 13 |
2.2 Execution
- Use the issue2md project as the base; create an independent copy per scenario (/tmp/fuzz-eval-*)
- With-skill runs load SKILL.md and referenced materials first
- Without-skill runs load no skill; model uses default behavior
- All runs execute in parallel in independent subagents
2.3 Scenario Details
Eval 1 — parser.Parse (suitable target)
Parse(rawURL string) (ResourceRef, error) is a classic Tier 1 fuzz target:
- Accepts string input (native Go fuzz type)
- Pure function, no I/O, network, or state
- Multiple verifiable invariants (non-empty Owner, Number > 0, Type ∈ valid set, canonical URL consistency, re-parse idempotency)
- Fast execution (sub-microsecond)
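The invariants listed for Parse can be sketched as a harness like the one below. This is a minimal illustration, not the harness produced in the evaluation: the ResourceRef fields, the Canonical method, the set of valid types, and the body of Parse are hypothetical stand-ins, since the report only gives the signature. The 2048-byte size guard is the value reported for Eval 1.

```go
package main

import (
	"errors"
	"fmt"
	"strconv"
	"strings"
	"testing"
)

// ResourceRef and Parse are hypothetical stand-ins for internal/parser;
// the real issue2md types and rules may differ.
type ResourceRef struct {
	Owner, Repo, Type string
	Number            int
}

// validTypes is an assumed "valid set" for the Type invariant.
var validTypes = map[string]bool{"issues": true, "pull": true, "discussions": true}

// Parse accepts https://github.com/{owner}/{repo}/{type}/{number}.
func Parse(rawURL string) (ResourceRef, error) {
	const prefix = "https://github.com/"
	if !strings.HasPrefix(rawURL, prefix) {
		return ResourceRef{}, errors.New("not a github.com URL")
	}
	parts := strings.Split(strings.TrimPrefix(rawURL, prefix), "/")
	if len(parts) != 4 || parts[0] == "" || parts[1] == "" || !validTypes[parts[2]] {
		return ResourceRef{}, errors.New("unexpected path shape")
	}
	n, err := strconv.Atoi(parts[3])
	if err != nil || n <= 0 {
		return ResourceRef{}, errors.New("number must be a positive integer")
	}
	return ResourceRef{Owner: parts[0], Repo: parts[1], Type: parts[2], Number: n}, nil
}

// Canonical renders the reference back into URL form.
func (r ResourceRef) Canonical() string {
	return fmt.Sprintf("https://github.com/%s/%s/%s/%d", r.Owner, r.Repo, r.Type, r.Number)
}

// checkParse is the oracle: rejected input is fine, but accepted input
// must satisfy every invariant, including re-parse idempotency.
func checkParse(raw string) error {
	ref, err := Parse(raw)
	if err != nil {
		return nil
	}
	switch {
	case ref.Owner == "" || ref.Repo == "":
		return fmt.Errorf("empty owner or repo for %q", raw)
	case ref.Number <= 0:
		return fmt.Errorf("non-positive number for %q", raw)
	case !validTypes[ref.Type]:
		return fmt.Errorf("unknown type %q for %q", ref.Type, raw)
	}
	again, err := Parse(ref.Canonical())
	if err != nil || again != ref {
		return fmt.Errorf("re-parse of %q is not idempotent", ref.Canonical())
	}
	return nil
}

// seeds mixes valid URLs with malformed and boundary inputs.
var seeds = []string{
	"https://github.com/octo/repo/issues/1",
	"https://github.com/octo/repo/pull/42",
	"https://github.com/octo/repo/discussions/7",
	"", "https://github.com/", "https://github.com/octo/repo/issues/-1",
}

// FuzzParse belongs in a _test.go file; run it with: go test -fuzz=FuzzParse
func FuzzParse(f *testing.F) {
	for _, s := range seeds {
		f.Add(s)
	}
	f.Fuzz(func(t *testing.T, raw string) {
		if len(raw) > 2048 { // size guard
			t.Skip("input exceeds size guard")
		}
		if err := checkParse(raw); err != nil {
			t.Fatal(err)
		}
	})
}

// main replays the seed corpus through the oracle without the fuzz engine.
func main() {
	for _, s := range seeds {
		if err := checkParse(s); err != nil {
			fmt.Println("FAIL:", err)
			return
		}
	}
	fmt.Println("seed replay ok")
}
```

Keeping the oracle in a plain function (checkParse) lets the same checks run in a seed replay and in the fuzz loop.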
Eval 2 — fetcher.Fetch (unsuitable target)
Fetch(ctx, ref, opts) (IssueData, error) is a classic unsuitable fuzz target:
- All code paths perform real HTTP requests
- Depends on GitHub API token auth
- Includes retry + backoff logic
- Interesting input space is the API response, not the method parameters
Eval 3 — converter package (multiple candidates)
5 candidate functions: 4 suitable, 1 unsuitable:
- ✅ yamlQuote(string) string: YAML escaping, round-trip invariant
- ✅ normalizeSummaryJSON(string) (string, error): JSON extractor, json.Valid invariant
- ✅ detectSummaryLanguage(string) string: Unicode analysis, finite return set invariant
- ✅ capSummarySourceLength(string) string: rune truncation, length upper-bound invariant
- ❌ Summarize(ctx, data, lang): OpenAI HTTP call, network-dependent
3. Assertion Pass Rate
3.1 Overview
| Scenario | Assertions | With Skill | Without Skill | Delta |
|---|---|---|---|---|
| Eval 1: parser-fuzz | 15 | 15/15 (100%) | 8/15 (53.3%) | +46.7pp |
| Eval 2: fetch-reject | 7 | 7/7 (100%) | 0/7 (0%) | +100pp |
| Eval 3: converter-multi | 13 | 13/13 (100%) | 8/13 (61.5%) | +38.5pp |
| Total | 35 | 35/35 (100%) | 16/35 (45.7%) | +54.3pp |
3.2 Per-Scenario Assertion Details
Eval 1: parser-fuzz (15 assertions)
| # | Assertion | With Skill | Without Skill |
|---|---|---|---|
| A1.1 | Applicability gate before code execution | ✅ Full 5-item checklist | ❌ No formal gate, direct analysis |
| A1.2 | Correctly judged "suitable" | ✅ | ✅ (implicit) |
| A1.3 | 5-item checklist per-item Pass/Fail | ✅ Structured table | ❌ None |
| A1.4 | Fuzz mode identified as "parser robustness" | ✅ "Parser robustness + idempotency" | ❌ Not labeled |
| A1.5 | f.Add() ≥3 valid GitHub URLs | ✅ 5 | ✅ 4 |
| A1.6 | f.Add() includes malformed/boundary | ✅ 14 | ✅ 25 (more) |
| A1.7 | Size guard present | ✅ len > 2048 → t.Skip() | ❌ None |
| A1.8 | Oracle: Owner/Repo non-empty | ✅ | ✅ |
| A1.9 | Oracle: Number > 0 | ✅ | ✅ |
| A1.10 | Oracle: Type ∈ valid set | ✅ | ✅ |
| A1.11 | FuzzXxx naming | ✅ FuzzParse in fuzz_parse_test.go | ✅ FuzzParse in fuzz_test.go |
| A1.12 | Cost class assigned | ✅ "Low, 30-60s" | ❌ None |
| A1.13 | Quick commands provided | ✅ 3 commands | ❌ None |
| A1.14 | Output contract / structured report | ✅ Full Quality Scorecard | ❌ Narrative summary only |
| A1.15 | Corpus replay verification | ✅ 19 seeds green | ✅ 29 seeds passed |
Eval 2: fetch-reject (7 assertions)
| # | Assertion | With Skill | Without Skill |
|---|---|---|---|
| A2.1 | Applicability gate executed | ✅ 5-item structured table | ❌ No gate |
| A2.2 | Judged "unsuitable" | ✅ "Not suitable for fuzzing" | ❌ Did not reject; built workaround |
| A2.3 | Specific failing checks | ✅ Check 1/3/4/5 all Fail | ❌ No failure references |
| A2.4 | No fuzz code generated | ✅ "None" | ❌ Generated 112 lines |
| A2.5 | Alternative test strategies provided | ✅ 4 concrete strategies | ❌ No alternatives |
| A2.6 | Explanation specific (not generic) | ✅ References doWithRetry, f.rest, f.gql, etc. | ❌ No unsuitability explanation |
| A2.7 | Output contract | ✅ Full 5-section report | ❌ None |
Eval 3: converter-multi (13 assertions)
| # | Assertion | With Skill | Without Skill |
|---|---|---|---|
| A3.1 | Per-candidate gate evaluation | ✅ Per-function evaluation | ❌ Informal analysis table |
| A3.2 | Target priority evaluation | ✅ Priority ordering | ❌ No Tier ordering |
| A3.3 | Summarize rejected | ✅ | ✅ "Not suitable" |
| A3.4 | yamlQuote fuzz test generated | ✅ round-trip oracle | ✅ round-trip oracle |
| A3.5 | normalizeSummaryJSON generated | ✅ JSON validity oracle | ✅ JSON validity oracle |
| A3.6 | detectSummaryLanguage generated | ✅ valid set oracle | ✅ valid set oracle |
| A3.7 | capSummarySourceLength generated | ✅ rune count + truncation | ✅ rune count + truncation |
| A3.8 | Each harness has oracle | ✅ 4/4 with t.Fatalf | ✅ 4/4 with t.Fatalf |
| A3.9 | Each harness has seeds | ✅ ≥7 per target | ✅ ≥5 per target |
| A3.10 | Size guards coverage | ✅ 4/4 harnesses have guard | ❌ 0/4 have guard |
| A3.11 | Per-target cost class | ✅ | ❌ None |
| A3.12 | Output contract with per-target details | ✅ | ❌ No structured report |
| A3.13 | Corpus replay verification | ✅ 40 seeds pass | ✅ 38 seeds pass |
3.3 Classification of 19 Without-Skill Failed Assertions
| Failure type | Count | Evals | Notes |
|---|---|---|---|
| Missing Applicability Gate | 3 | Eval 1/2/3 | No formal 5-item checklist; direct coding or analysis |
| Unsuitable target not rejected | 4 | Eval 2 | Built HTTP stub workaround instead of reject + recommend alternatives |
| Missing Output Contract | 3 | Eval 1/2/3 | No structured report, Quality Scorecard |
| Missing Size Guard | 2 | Eval 1/3 | Eval 1 no len check; Eval 3 all four harnesses missing |
| Missing Cost Class | 2 | Eval 1/3 | No Low/Medium/High classification |
| Missing Quick Commands | 1 | Eval 1 | No go test -fuzz command reference |
| Missing Fuzz Mode label | 1 | Eval 1 | No "parser robustness" mode label |
| Missing Target Priority | 1 | Eval 3 | No Tier 1/2/3 priority ordering |
| Missing Checklist structure | 1 | Eval 1 | No per-item Pass/Fail marks |
| Missing alternative strategies | 1 | Eval 2 | Built solution directly instead of recommending better strategies |
3.4 Key Finding: Eval 2 +100pp Delta
This is the largest single-scenario delta among all evaluated skills. Analysis:
With-Skill behavior:
- Runs the 5-item Applicability Gate
- Marks Check 1/3/4/5 as Fail (Check 3, no oracle, triggers the Hard Stop)
- Produces a "Not suitable" verdict
- Recommends 4 alternative strategies, including "fuzz pure mapping functions in the package"

Without-Skill behavior:
- No gate; went straight to analyzing how to make fuzzing work
- Built fuzzRoundTripper (a custom http.RoundTripper) to stub the HTTP layer
- Effectively fuzzed the GraphQL JSON parsing path, not the Fetch method itself
- Only oracle was "no panic"

Assessment: The baseline approach has practical value (it can find panics in JSON parsing) but falls short of fuzz testing best practices:
1. The oracle is only "no panic", so it cannot find logic bugs (invariant violations)
2. It tests the JSON parsing path, not the Fetch method under review
3. It does not tell the user "this is not optimal," missing the chance to steer them toward fuzzing pure functions
The skill's gate mechanism ensures honest engineering decisions: if unsuitable, do not proceed, and recommend better alternatives.
4. Dimension-by-Dimension Comparison
4.1 Applicability Gate
This is the skill's core differentiator, affecting all 3 scenarios.
| Scenario | With Skill | Without Skill |
|---|---|---|
| Eval 1 (suitable) | 5-item checklist all Pass, structured table | Informal analysis, no Pass/Fail marks |
| Eval 2 (unsuitable) | Check 1/3/4/5 Fail → Hard Stop | Not identified as unsuitable |
| Eval 3 (mixed) | Per-function gate; 4 of 5 Pass | Informal analysis table; Summarize correctly identified |
Practical value:
- The Applicability Gate prevents generating useless fuzz tests (Eval 2 saves the cost of writing and maintaining low-value tests)
- The structured checklist makes decisions auditable and reproducible
- In Eval 3, it enforces an "evaluate first, then code" workflow
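One way to picture the gate is as a small checklist structure with a hard-stop flag. The sketch below is illustrative only: the report does not list the five checklist items, so the question wording is hypothetical, as is the rule that any failed check yields "not suitable". What comes from the report is that Checks 1, 3, 4, and 5 fail for Fetch and that Check 3 (no oracle) triggers the Hard Stop.

```go
package main

import "fmt"

// Check is one row of the gate. Question wording is hypothetical.
type Check struct {
	ID       int
	Question string
	Pass     bool
	HardStop bool // a failure here forbids generating any fuzz code
}

// Verdict is the auditable outcome of the gate.
type Verdict struct {
	Suitable bool
	HardStop bool
	Failed   []int
}

// evaluate records every failing check so the report can cite them,
// and flags a hard stop if any failing check is marked as one.
func evaluate(checks []Check) Verdict {
	v := Verdict{Suitable: true}
	for _, c := range checks {
		if c.Pass {
			continue
		}
		v.Suitable = false // assumption: any failure means "not suitable"
		v.Failed = append(v.Failed, c.ID)
		if c.HardStop {
			v.HardStop = true
		}
	}
	return v
}

// fetchChecks encodes the Eval 2 outcome for fetcher.Fetch.
func fetchChecks() []Check {
	return []Check{
		{ID: 1, Question: "Input maps to a fuzzable Go type?", Pass: false},
		{ID: 2, Question: "Target callable from a test?", Pass: true},
		{ID: 3, Question: "A verifiable oracle exists?", Pass: false, HardStop: true},
		{ID: 4, Question: "Deterministic, no network or I/O?", Pass: false},
		{ID: 5, Question: "Fast enough per execution?", Pass: false},
	}
}

func main() {
	v := evaluate(fetchChecks())
	fmt.Printf("suitable=%v hardStop=%v failed=%v\n", v.Suitable, v.HardStop, v.Failed)
	// prints: suitable=false hardStop=true failed=[1 3 4 5]
}
```

Recording the failed check IDs rather than stopping at the first failure is what lets the report cite specific failing checks.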
4.2 Systematic Size Guard Coverage
| Scenario | With Skill | Without Skill |
|---|---|---|
| Eval 1: FuzzParse | ✅ len > 2048 → t.Skip() | ❌ None |
| Eval 3: FuzzYamlQuote | ✅ len > 1<<16 → t.Skip() | ❌ None |
| Eval 3: FuzzNormalizeSummaryJSON | ✅ len > 1<<16 → t.Skip() | ❌ None |
| Eval 3: FuzzDetectSummaryLanguage | ✅ len > 1<<16 → t.Skip() | ❌ None |
| Eval 3: FuzzCapSummarySourceLength | ✅ len > 1<<20 → t.Skip() | ❌ None |
Analysis: The skill's "Size guard present" rule (in SKILL.md Templates A/B/C/D) ensures all string/[]byte harnesses have boundary protection. Without-skill had more seeds in Eval 1 (29 vs 19) but lacked size guards; long fuzz runs risk OOM.
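To make the size guard concrete, here is a minimal sketch for the rune-truncation target. The function body and the maxSummaryRunes limit are hypothetical stand-ins (the report gives only the name, the invariants, and the 1<<20 guard); the point is where the guard sits relative to the oracle.

```go
package main

import (
	"fmt"
	"strings"
	"testing"
	"unicode/utf8"
)

// maxSummaryRunes is a hypothetical limit; the real value is not in the report.
const maxSummaryRunes = 8

// capSummarySourceLength is a stand-in that keeps at most maxSummaryRunes runes.
func capSummarySourceLength(s string) string {
	count := 0
	for i := range s { // i is the byte offset of each rune
		if count == maxSummaryRunes {
			return s[:i]
		}
		count++
	}
	return s
}

// checkCap is the oracle: bounded rune count, and output is a prefix of input.
func checkCap(s string) error {
	out := capSummarySourceLength(s)
	if n := utf8.RuneCountInString(out); n > maxSummaryRunes {
		return fmt.Errorf("got %d runes, want at most %d", n, maxSummaryRunes)
	}
	if !strings.HasPrefix(s, out) {
		return fmt.Errorf("output %q is not a prefix of the input", out)
	}
	return nil
}

// FuzzCapSummarySourceLength belongs in a _test.go file.
func FuzzCapSummarySourceLength(f *testing.F) {
	f.Add("short")
	f.Add("héllo wörld!")
	f.Add(strings.Repeat("界", 100))
	f.Fuzz(func(t *testing.T, s string) {
		// Size guard first: skip oversized inputs before doing any work,
		// so the fuzzer spends its budget on inputs of realistic size.
		if len(s) > 1<<20 {
			t.Skip("input exceeds size guard")
		}
		if err := checkCap(s); err != nil {
			t.Fatal(err)
		}
	})
}

func main() {
	fmt.Println(capSummarySourceLength("héllo wörld!"), checkCap("héllo wörld!"))
}
```

The guard uses t.Skip rather than a failure, so an oversized input is discarded instead of being reported as a crash.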
4.3 Output Contract (Structured Report)
With-Skill runs produce structured reports including:
| Report item | Eval 1 | Eval 2 | Eval 3 |
|---|---|---|---|
| Applicability Verdict | ✅ Suitable | ✅ Not suitable | ✅ Per-function |
| Why (2–6 bullets) | ✅ 5 bullets | ✅ 4 bullets | ✅ Per-function |
| Action | ✅ Implemented | ✅ Stop | ✅ 4 targets implemented |
| Quality Scorecard (C/S/H) | ✅ All PASS | N/A | ✅ All PASS |
| Cost Class | ✅ Low | N/A | ✅ Per-target |
| Quick Commands | ✅ 3 commands | N/A | ✅ |
| Corpus Policy | ✅ | N/A | ✅ |
Without-Skill produces narrative summaries but no standardized structure.
4.4 Fuzz Code Quality Comparison
FuzzYamlQuote from Eval 3 is the best basis for a code quality comparison, since both configurations generated it:
| Feature | With Skill | Without Skill |
|---|---|---|
| Seed count | 11 | 10 |
| Size guard | ✅ len > 1<<16 | ❌ None |
| Oracle: single-quote wrapping | ✅ | ✅ |
| Oracle: odd-quote detection | ✅ | ✅ |
| Oracle: round-trip | ✅ unescaped == value | ✅ unescaped == value |
| Large-input seed | None | strings.Repeat("a", 10000) |
Code quality is similar in oracle design; Claude's base model is already strong at fuzz code generation. The skill's main gains are process discipline (gate, cost class, size guard, output contract), not the code itself.
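The three oracles compared in the table can be sketched as follows. This is a hedged illustration: yamlQuote here is a hypothetical stand-in that wraps the value in single quotes and doubles embedded quotes, which is the standard YAML single-quoted escaping but may not match the project's function exactly. The 1<<16 guard and the large-input seed come from the report.

```go
package main

import (
	"fmt"
	"strings"
	"testing"
)

// yamlQuote is a hypothetical stand-in for the converter's function.
func yamlQuote(value string) string {
	return "'" + strings.ReplaceAll(value, "'", "''") + "'"
}

// checkYamlQuote applies the three oracles from the comparison table.
func checkYamlQuote(value string) error {
	q := yamlQuote(value)
	// Oracle 1: single-quote wrapping.
	if len(q) < 2 || !strings.HasPrefix(q, "'") || !strings.HasSuffix(q, "'") {
		return fmt.Errorf("%q is not wrapped in single quotes", q)
	}
	inner := q[1 : len(q)-1]
	// Oracle 2: odd-quote detection. Every run of quotes inside the
	// wrapper must have even length, or the scalar would end early.
	run := 0
	for i := 0; i <= len(inner); i++ {
		if i < len(inner) && inner[i] == '\'' {
			run++
			continue
		}
		if run%2 != 0 {
			return fmt.Errorf("odd run of quotes in %q", q)
		}
		run = 0
	}
	// Oracle 3: round-trip. Unescaping must recover the original value.
	if unescaped := strings.ReplaceAll(inner, "''", "'"); unescaped != value {
		return fmt.Errorf("round-trip mismatch: %q != %q", unescaped, value)
	}
	return nil
}

// FuzzYamlQuote belongs in a _test.go file.
func FuzzYamlQuote(f *testing.F) {
	for _, s := range []string{"", "plain", "it's", "''", "a: b", strings.Repeat("a", 10000)} {
		f.Add(s)
	}
	f.Fuzz(func(t *testing.T, value string) {
		if len(value) > 1<<16 { // size guard
			t.Skip("input exceeds size guard")
		}
		if err := checkYamlQuote(value); err != nil {
			t.Fatal(err)
		}
	})
}

func main() {
	fmt.Println(yamlQuote("it's"), checkYamlQuote("it's"))
}
```

A large-input seed and a size guard are compatible as long as the seed stays under the guard, as the 10,000-byte seed does here.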
4.5 Alternative Strategy Recommendations
In Eval 2, With-Skill recommended 4 alternatives:
- Integration tests with a real GitHub token (gated)
- Unit tests with HTTP stubbing (httptest.Server)
- Fuzz the pure mapping functions instead, e.g. mapIssueTimelineNode
- Table-driven unit tests for the dispatcher
These recommendations both reject the unsuitable approach and steer users toward more valuable testing paths. Without-Skill built a workaround directly (valuable, but did not inform users of better options).
5. Token Cost-Effectiveness Analysis
5.1 Skill Size
| File | Lines | Words | Est. Tokens |
|---|---|---|---|
| SKILL.md | 679 | 3,062 | ~4,100 |
| references/applicability-checklist.md | 170 | 940 | ~1,250 |
| references/target-priority.md | 179 | 876 | ~1,170 |
| references/crash-handling.md | 76 | 312 | ~420 |
| references/ci-strategy.md | 118 | 463 | ~620 |
| Description (always in context) | — | ~50 | ~65 |
5.2 Load Scenarios
| Scenario | Files read | Total tokens |
|---|---|---|
| Suitable target (Eval 1) | SKILL.md + applicability + target-priority | ~6,520 |
| Unsuitable target (Eval 2) | SKILL.md + applicability | ~5,350 |
| Multi-target evaluation (Eval 3) | SKILL.md + applicability + target-priority | ~6,520 |
| SKILL.md only (min load) | SKILL.md | ~4,100 |
| Full load | All files | ~7,625 |
| Typical load | SKILL.md + applicability + target-priority | ~6,520 |
5.3 Token Cost for Quality Gain
| Metric | Value |
|---|---|
| With-skill pass rate | 100% (35/35) |
| Without-skill pass rate | 45.7% (16/35) |
| Pass-rate gain | +54.3 pp |
| Token cost per assertion fixed | ~216 tok (SKILL.md only) / ~343 tok (typical) |
| Token cost per 1% pass-rate gain | ~75 tok (SKILL.md only) / ~120 tok (typical) |
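The ratios in this table follow directly from the pass counts and the token estimates. The short program below recomputes them as a check; the function name gainPP is illustrative, and the inputs are the figures reported above.

```go
package main

import "fmt"

// gainPP returns the pass-rate gain in percentage points.
func gainPP(withPass, withoutPass, total int) float64 {
	return 100 * float64(withPass-withoutPass) / float64(total)
}

func main() {
	const skillOnly, typical = 4100.0, 6520.0 // estimated tokens per load
	gain := gainPP(35, 16, 35)                // 35/35 vs 16/35
	fixed := float64(35 - 16)                 // assertions fixed by the skill

	fmt.Printf("gain: %.1f pp\n", gain)
	fmt.Printf("per assertion fixed: %.0f / %.0f tok\n", skillOnly/fixed, typical/fixed)
	fmt.Printf("per percentage point: %.1f / %.1f tok\n", skillOnly/gain, typical/gain)
}
```

With these inputs the per-point costs come out at about 75.5 and 120.1 tokens, which the table reports as ~75 and ~120.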
5.4 Token Segment Cost-Effectiveness
| Module | Est. tokens | Linked assertion delta | Cost-effectiveness |
|---|---|---|---|
| Applicability Gate rules | ~300 | 7 (3-scenario gate correctness) | Very high — 43 tok/assertion |
| Output Contract definition | ~200 | 3 (3-scenario report completeness) | Very high — 67 tok/assertion |
| Templates A–D | ~600 | 2 (size guard coverage) | High — 300 tok/assertion |
| Cost Class + Quick Commands | ~100 | 3 (classification + command refs) | Very high — 33 tok/assertion |
| Fuzz Mode classification | ~80 | 1 (mode label) | Very high — 80 tok/assertion |
| Target Priority rules | ~150 | 1 (Tier ordering) | High — 150 tok/assertion |
| Hard Stop rules | ~100 | 2 (unsuitable rejection + no code) | Very high — 50 tok/assertion |
| Quality Scorecard | ~200 | Indirect (structured self-check) | Medium |
| Anti-Examples | ~500 | Indirect (avoid common mistakes) | Medium |
| Coverage Feedback | ~400 | 0 (not tested) | Low |
| Go Version Gate | ~200 | 0 (not tested) | Low |
| Troubleshooting | ~350 | 0 (not tested) | Low |
| applicability-checklist.md | ~1,250 | Indirect (gate quality) | Medium |
| target-priority.md | ~1,170 | Indirect (priority quality) | Medium |
| crash-handling.md | ~420 | 0 (no crash scenario) | Low |
| ci-strategy.md | ~620 | 0 (CI integration not tested) | Low |
5.5 High-Leverage vs Low-Leverage Instructions
High leverage (~930 tokens SKILL.md → 19 assertion delta, 23% of SKILL.md):
- Applicability Gate + Hard Stop rules (400 tok → 9 assertions)
- Output Contract definition (200 tok → 3 assertions)
- Cost Class + Quick Commands (100 tok → 3 assertions)
- Size guard examples in Templates (150 tok → 2 assertions)
- Fuzz Mode + Target Priority (80+150 tok → 2 assertions)

Medium leverage (~700 tokens → indirect):
- Quality Scorecard (200 tok): drives self-check flow
- Anti-Examples (500 tok): avoid common mistakes

Low leverage (~950 tokens → 0 assertion delta):
- Coverage Feedback (~400 tok): not used in eval scenarios
- Go Version Gate (~200 tok): not used in eval scenarios
- Troubleshooting (~350 tok): not used in eval scenarios

References (~3,460 tokens → indirect):
- applicability-checklist.md (1,250 tok): improves gate quality, concrete examples
- target-priority.md (1,170 tok): Tier ordering basis
- crash-handling.md + ci-strategy.md (1,040 tok): no direct contribution in eval
5.6 Token Efficiency Rating
| Rating | Conclusion |
|---|---|
| Overall ROI | Excellent: ~6,520 tokens (typical) for a +54.3 pp pass-rate gain |
| SKILL.md ROI | Good — ~4,100 tokens; high-leverage rules only 23% |
| High-leverage token share | 23% (930/4,100) directly contributes 19/19 assertion delta |
| Low-leverage token share | 23% (950/4,100) no incremental contribution in this eval |
| Reference cost-effectiveness | Medium — ~2,420 tokens (applicability + target-priority) indirect contribution |
| Unused references | ~1,040 tokens (crash-handling + ci-strategy) no contribution |
5.7 Cost-Effectiveness vs Other Skills
| Metric | fuzzing-test | go-makefile-writer | create-pr | go-ci-workflow |
|---|---|---|---|---|
| SKILL.md tokens | ~4,100 | ~1,960 | ~2,700 | ~1,500 |
| Typical load tokens | ~6,520 | ~4,100 | ~4,800 | ~4,500 |
| Pass-rate gain | +54.3 pp | +31.0 pp | +71.0 pp | +33.0 pp |
| Tokens per 1% (SKILL.md) | ~75 tok | ~63 tok | ~38 tok | ~45 tok |
| Tokens per 1% (typical) | ~120 tok | ~132 tok | ~68 tok | ~136 tok |
Analysis:
- fuzzing-test has the second-largest pass-rate gain (+54.3 pp), behind create-pr (+71.0 pp); the gain is driven mainly by Eval 2's +100pp delta
- SKILL.md cost-effectiveness (~75 tok/1%) is the least efficient of the four skills: create-pr (~38), go-ci-workflow (~45), and go-makefile-writer (~63) all cost less per point
- Typical-load cost-effectiveness (~120 tok/1%) is better than go-makefile-writer (~132) and go-ci-workflow (~136), worse than create-pr (~68)
- SKILL.md size (679 lines / ~4,100 tokens) is the largest among the evaluated skills
6. Boundary Analysis vs Claude Base Model
6.1 Base Model Capabilities (No Skill Increment)
| Capability | Evidence |
|---|---|
| Go fuzz test basics | 3/3 scenarios use testing.F correctly |
| f.Add() seed corpus | 3/3 scenarios provide good seeds |
| Oracle design (no-panic, round-trip, valid set) | Eval 1/3 oracle quality close to with-skill |
| Multi-candidate recognition (partial) | Eval 3 correctly identifies Summarize as unsuitable |
| File naming *_test.go | 3/3 scenarios correct |
| Corpus replay verification | 3/3 scenarios run verification |
6.2 Base Model Gaps (Skill Fills)
| Gap | Evidence | Risk level |
|---|---|---|
| Rejecting unsuitable targets | Eval 2: built workaround instead of reject | High — would maintain low-value fuzz tests in prod |
| Systematic Size Guard | 5/5 harnesses missing size guard | High — OOM risk in long fuzz runs |
| Applicability Gate flow | 3/3 scenarios no formal gate | Medium — no decision audit |
| Output Contract | 3/3 scenarios no structured report | Medium — no change traceability |
| Cost Class assignment | 2/3 scenarios no classification | Medium — CI budget cannot be allocated |
| Quick Commands | 1/3 scenarios no command reference | Low — user must look up docs |
| Fuzz Mode label | 1/3 scenarios not labeled | Low — affects readability |
| Target Priority | 1/3 scenarios no Tier ordering | Low — no priority guidance for multi-target |
7. Overall Score
7.1 Dimension Scores
| Dimension | With Skill | Without Skill | Delta |
|---|---|---|---|
| Applicability Gate correctness | 5.0/5 | 1.5/5 | +3.5 |
| Rejection of unsuitable targets | 5.0/5 | 0.0/5 | +5.0 |
| Fuzz code quality (oracle, seed, guard) | 5.0/5 | 3.5/5 | +1.5 |
| Structured report (Output Contract) | 5.0/5 | 0.5/5 | +4.5 |
| Alternative strategy recommendations | 5.0/5 | 1.0/5 | +4.0 |
| Process discipline (cost class, mode, commands) | 5.0/5 | 1.5/5 | +3.5 |
| Overall mean | 5.0/5 | 1.33/5 | +3.67 |
7.2 Weighted Total Score
| Dimension | Weight | Score | Rationale | Weighted |
|---|---|---|---|---|
| Assertion pass-rate delta | 25% | 10.0/10 | +54.3pp, second only to create-pr (+71pp) among evaluated skills | 2.50 |
| Applicability Gate correctness | 20% | 10.0/10 | 3/3 scenarios gate correct; Eval 2 shows Hard Stop value | 2.00 |
| Rejection + alternative recommendations | 15% | 10.0/10 | +100pp single-scenario delta; 4 concrete alternatives | 1.50 |
| Structured report (Output Contract) | 15% | 10.0/10 | 3/3 scenarios full contract; Quality Scorecard | 1.50 |
| Token cost-effectiveness | 15% | 6.0/10 | SKILL.md ~4,100 tok large; ~950 tok low-leverage; ~1,040 tok refs unused | 0.90 |
| Fuzz code quality | 10% | 8.0/10 | Code quality similar to baseline; main gain is size guard | 0.80 |
| Weighted total | 100% | | | 9.20/10 |
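The weighted total is the sum of weight × score over the six dimensions. The sketch below recomputes it from the table; the type and function names are illustrative.

```go
package main

import "fmt"

// dimension is one row of the weighted score table.
type dimension struct {
	name   string
	weight float64 // fraction of the total, all weights sum to 1
	score  float64 // out of 10
}

// weightedTotal sums weight*score over all dimensions.
func weightedTotal(dims []dimension) float64 {
	total := 0.0
	for _, d := range dims {
		total += d.weight * d.score
	}
	return total
}

// scores returns the rows reported in the table above.
func scores() []dimension {
	return []dimension{
		{"Assertion pass-rate delta", 0.25, 10.0},
		{"Applicability Gate correctness", 0.20, 10.0},
		{"Rejection + alternative recommendations", 0.15, 10.0},
		{"Structured report (Output Contract)", 0.15, 10.0},
		{"Token cost-effectiveness", 0.15, 6.0},
		{"Fuzz code quality", 0.10, 8.0},
	}
}

func main() {
	fmt.Printf("%.2f/10\n", weightedTotal(scores())) // prints: 9.20/10
}
```

Only two dimensions fall below 10.0, so the 0.80 shortfall comes from token cost-effectiveness (0.60) and fuzz code quality (0.20).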
7.3 Comparison with Other Skills
| Skill | Weighted total | Pass-rate delta | Tokens/1% (typical) | Strongest dimension |
|---|---|---|---|---|
| create-pr | 9.55/10 | +71pp | ~68 | Gate flow (+3.5), Output Contract (+4.0) |
| fuzzing-test | 9.20/10 | +54.3pp | ~120 | Rejection (+5.0), Output Contract (+4.5) |
| go-makefile-writer | 9.16/10 | +31pp | ~132 | CI reproducibility (+3.0), Output Contract (+4.0) |
| go-ci-workflow | 8.83/10 | +33pp | ~136 | Degradation handling (+4.5), Output Contract (+4.0) |
Analysis:
- fuzzing-test rejection (+5.0 delta) is the largest single-dimension delta among evaluated skills
- The +54.3pp pass-rate delta is the second highest, behind create-pr (+71pp), and reflects the value of the Applicability Gate
- Token cost-effectiveness score (6.0/10) is lower due to SKILL.md size (679 lines) and ~950 tokens of low-leverage content

8. Conclusion
The fuzzing-test skill adds clear value in three areas:
- Applicability Gate rejection (+100pp single-scenario delta): The largest single-scenario delta among evaluated skills, showing that "when not to fuzz" is a major gap for Claude. The baseline builds workarounds for unsuitable targets (not without value) but does not inform users of better strategies.
- Systematic Size Guard coverage (5/5 vs 0/5): The skill's templates and rules ensure all string/[]byte harnesses have length bounds, preventing OOM in long fuzz runs. This is a common omission with large production impact.
- Structured Output Contract: The Quality Scorecard (Critical/Standard/Hygiene) makes fuzz test quality measurable and auditable.
Main risk: SKILL.md size (~4,100 tokens) is the largest among evaluated skills; ~23% (~950 tokens) is low-leverage. Trimming Coverage Feedback, Troubleshooting, Anti-Examples, and Go Version Gate (~1,450 tokens by the Section 5.4 estimates) could reduce SKILL.md by ~35% and, if the pass-rate gain holds, improve typical-load cost-effectiveness from ~120 tok/1% to ~93 tok/1%.
9. Evaluation Materials
| Material | Path |
|---|---|
| Eval 1 with-skill output | /tmp/fuzz-eval-1/internal/parser/fuzz_parse_test.go |
| Eval 1 without-skill output | /tmp/fuzz-eval-b1/internal/parser/fuzz_test.go |
| Eval 2 with-skill output | (no file — gate rejected, no code generated) |
| Eval 2 without-skill output | /tmp/fuzz-eval-b2/internal/github/fetcher_fuzz_test.go |
| Eval 3 with-skill output | /tmp/fuzz-eval-3/internal/converter/{frontmatter,summary_openai}_fuzz_test.go |
| Eval 3 without-skill output | /tmp/fuzz-eval-b3/internal/converter/{fuzz_frontmatter,fuzz_summary_openai}_test.go |
| Evaluated skill | /Users/john/.codex/skills/fuzzing-test/SKILL.md |