fuzzing-test Skill Design Rationale¶

fuzzing-test is a high-signal decision and generation framework for Go fuzz testing. Its core idea is: fuzzing starts with deciding whether the target is worth fuzzing, whether native Go fuzzing can drive it effectively, whether a strong oracle exists, and whether the cost is justified. That is why the skill turns the Applicability Gate, Target Priority, Risk and Cost, Execution Integrity, Crash Handling, and structured reporting into one explicit engineering path.

1. Definition¶

fuzzing-test is used for Go-code scenarios such as:

generating fuzz tests for parsers, decoders, and protocol handlers,
designing round-trip or property-style oracles for codecs, validators, and state transitions,
prioritizing among multiple candidate fuzz targets,
explicitly rejecting targets that are not worth fuzzing and recommending alternatives,
preserving crash corpus, adding regressions, and documenting root cause after failures,
shaping local and CI fuzz-cost strategies.

Its output is not only fuzz code. It also includes:

an Applicability Verdict,
concrete reasons for the verdict,
the resulting Action (implement, stop, or redirect),

When a target is suitable and the workflow proceeds into implementation, high-quality deliverables will often also include:

a cost class and suggested fuzz-time budget,
execution status and commands,
quality scoring and corpus policy.

From a design perspective, it is much closer to a fuzz engineering decision framework than to a prompt that merely emits testing.F boilerplate.

2. Background and Problems¶

The core problem this skill addresses is not "people do not know the syntax of f.Fuzz(...)." It is that real Go fuzzing work tends to fail in two opposite ways:

clearly unsuitable targets still get fuzz harnesses,
clearly suitable targets get only a minimal runnable harness with weak bug-finding yield.

Without a clear framework, failures usually cluster into eight categories:

Problem	Typical consequence
Suitability is never checked first	Network-, DB-, or global-state paths get low-value fuzz harnesses
Oracle is vague or missing	The result collapses into "must not panic" as the only assertion
Target priority is never ranked	Teams spend budget on low-yield functions while missing parser/decoder entry points
No size guard exists	Long fuzz runs risk OOM or pathological allocation paths
Seeds are invented casually	Initial corpus has shallow structure, leading to high skip rate or weak coverage
Fuzz cost is never classified	Local, PR, and nightly budgets become uncontrolled
Crash handling is incomplete	Bugs get fixed without preserving corpus or adding deterministic regressions
Output is unstructured	Teams cannot tell why a target was accepted or rejected, or what to run next

The design logic of fuzzing-test is to make "should this target be fuzzed?" explicit before "how should the fuzz harness be written?"

3. Comparison with Common Alternatives¶

Before looking at the internals, it helps to compare the skill with a few common alternatives:

Dimension	`fuzzing-test` skill	Asking a model to "write a fuzz test"	Manually adding scattered `FuzzXxx` harnesses
Suitability judgment	Strong	Weak	Weak
Rejection of unsuitable targets	Strong	Weak	Weak
Target prioritization	Strong	Weak	Weak
Oracle discipline	Strong	Medium	Medium
Size-guard discipline	Strong	Weak	Weak
Cost classification	Strong	Weak	Weak
Crash closure	Strong	Medium	Weak
Auditability of output	Strong	Weak	Weak

Its value is not only that it can generate fuzz code. Its value is that it turns fuzzing from a one-off code patch into a test workflow that can be justified, audited, and governed.

4. Core Design Rationale¶

4.1 The Applicability Gate Must Come First¶

The first principle of fuzzing-test is simple: before any fuzz code is written, the Applicability Gate must run.

It requires a target to be judged on five conditions:

whether it has meaningful input space,
whether it can be driven by Go-supported fuzz parameter types,
whether it has a clear oracle or invariant,
whether it is deterministic and local enough,
whether it is fast enough for high-iteration fuzzing.

This is critical because the biggest source of waste in fuzzing is not ugly code. It is choosing the wrong target from the start. The evaluation report's Eval 2 delta makes this especially visible: with-skill rejects a network-dependent target outright, while without-skill creatively constructs a workaround that may run but no longer reflects the best bug-finding path.

4.2 Check 2 and Check 3 Are Hard Stops¶

The most important Applicability checks are:

Check 2: whether the target can be driven effectively by native Go fuzz parameter types,
Check 3: whether a clear oracle exists.

The skill treats both as hard stops. If either fails, the workflow stops. That is a very strong and very intentional design choice:

without fuzz-compatible types, the native harness cannot explore the real input space,
without an oracle, even millions of iterations cannot reliably detect logic bugs.

This is also the root cause of many "looks like fuzzing, but is really just no-panic smoke testing" outcomes. fuzzing-test explicitly refuses to normalize that pattern.

4.3 It First Decides Whether Rejection Is Appropriate vs. Defaulting to a Workaround¶

One of the most distinctive parts of this skill is not its generation ability, but its refusal ability.

When a target is not well suited to fuzzing, many models tend to:

stub an HTTP or DB layer,
reshape inputs until a harness becomes technically possible,
reduce the target to a weakened no-panic path.

Those approaches are not completely useless, but they can hide a more important engineering fact: the requested target is not necessarily the most valuable thing to fuzz.

The current contract is more specific than a blanket refusal posture: when Check 2 or Check 3 fails, the skill hard-stops; when a target is dominated by external dependencies, it first surfaces the risk and usually redirects toward purer targets, while still leaving room to proceed if the external layer can be fully stubbed and a strong oracle still exists. That is why fuzzing-test first clarifies the applicability facts and then recommends better alternatives, such as:

fuzz a pure parser or mapping function instead,
use table-driven unit tests for trivial logic,
use integration tests for real DB / network paths,
use property-based testing where complex generators are the real need.

That makes it more like a testing decision assistant than a code generator that feels obliged to emit code every time.

4.4 The Target Priority Gate Is a Separate Layer¶

When a package contains multiple possible fuzz targets, fuzzing-test does not assume they should all be fuzzed. It requires ranking them by expected bug-finding yield:

Tier 1: parsers / decoders / protocol handlers / compression / encoding,
Tier 2: round-trip encode / decode paths, validators / sanitizers, and state transitions with strict invariants,
Tier 3: differential comparison candidates, formatters / renderers, and ordinary configuration loaders.

This is an engineering-focused decision because fuzz budgets are always limited. Priority solves not "which function is more interesting?" but "which target is most likely to find real bugs within bounded time?"

That is also why references/target-priority.md exists as a separate asset. It is not extra commentary; it is a resource-allocation rule.

4.5 Size Guard Is Treated as a Critical Quality Requirement¶

The current Quality Scorecard puts size guards in the Critical section, requiring every []byte / string harness to bound input size.

This is highly practical. Many baseline fuzz harnesses look fine in terms of seeds and oracles, but they omit:

len(data) > N,
t.Skip() for impossible combinations,
upper bounds for expensive payloads.

Short local runs may still pass, but once the harness enters longer fuzz windows, scheduled CI, or multi-worker execution, OOM and allocation blow-up risk rises quickly. The evaluation report makes this visible as well: with-skill covered this systematically, while without-skill often omitted it.

4.6 Seed Mining Is Designed as "Find Real Inputs First, Then Write `f.Add(...)`"¶

fuzzing-test explicitly requires seed mining before writing f.Add(...) calls:

mine real inputs from existing unit tests,
scan testdata/, fixtures, samples, and golden files,
extract payloads from production-like config or data files when relevant.

This matters because seeds are not just decoration. Their real value is that they:

help the mutator enter meaningful structural space quickly,
lower skip rate,
increase early coverage depth.

The skill even requires each target's seeds to cover at least three structural categories. That shows it cares not only whether seeds exist, but whether they are exploration-worthy.

4.7 The Risk and Cost Gate Makes Fuzz Budgets Explicit¶

fuzzing-test classifies targets as Low, Medium, or High cost, and ties each class to different time-budget guidance:

Low: local fuzz for 30-60 seconds,
Medium: local fuzz for 15-45 seconds with stricter guards,
High: corpus replay in PRs, real fuzzing moved to nightly or scheduled lanes.

This is a mature design because fuzzing is not simply "the longer, the better." It depends on:

per-call execution cost,
skip rate,
memory profile,
what CI can actually afford.

Making cost class part of the output helps teams treat fuzzing as a planned testing asset rather than a personal experiment.

4.8 It Draws a Boundary Between Native Fuzzing and Property-Based Testing¶

fuzzing-test explicitly distinguishes between:

good native fuzzing targets: byte/string-like inputs, crash discovery, parsers,
good property-based testing targets: complex generators, rich domain constraints, or targets that would spend most fuzz iterations in t.Skip().

That boundary is important because many Go testing questions are not really "should we do generative testing?" but "should we use fuzzing or property-based testing here?"

The skill does not argue that one technique is universally better. It chooses based on target structure. That makes it more mature than a prompt that only knows how to emit fuzz harnesses.

4.9 Crash Handling Requires Corpus Preservation and Deterministic Regression¶

When fuzzing finds a crash, fuzzing-test requires a full closure loop:

record the minimal reproduction command,
keep the crashing corpus,
classify the failure type,
fix with minimal code change,
replay corpus and rerun a short fuzz burst,
document root cause and the prevention guard.

This is highly valuable because the biggest benefit of fuzzing is rarely "it ran for a long time." It is "the discovered bug became a durable regression asset." If the crashing input never lands in testdata/fuzz/FuzzXxx/, the same defect can easily reappear later.

4.10 References Are Loaded Selectively vs. Expanded Every Time¶

The reference structure in fuzzing-test is intentionally restrained:

applicability-checklist.md is loaded only for borderline suitability decisions,
target-priority.md is loaded only when three or more candidates need prioritization,
crash-handling.md is loaded only after an actual crash,
ci-strategy.md is loaded only when CI integration is requested,
advanced-tuning.md is loaded only for tuning, OOM, leak, or flaky-run work.

This is a strong design choice because fuzz tasks differ widely. Most ordinary generation tasks need only the gate rules and templates. Heavier contexts such as crash closure, CI integration, or performance tuning should not be paid for on every invocation.

This is a classic production-grade pattern: keep the default context compact, and load heavy engineering detail only when it becomes relevant.

4.11 Go Version, Race, Parallelism, and Struct-Aware Input Are Part of the Same Framework¶

The current skill goes beyond basic testing.F generation and explicitly includes:

a Go Version Gate,
race detection alongside fuzzing,
worker-parallelism guidance,
a go-fuzz-headers bridge,
fuzz performance baseline measurement.

That means the design goal is no longer merely "generate a harness." It is to address the real engineering questions that appear throughout a fuzzing lifecycle:

does the Go version support or favor a certain approach,
should a concurrent path also be replayed under -race,
will worker count hide determinism issues,
should complex structured input be derived from []byte,
is the current execs/sec baseline high enough to justify longer fuzz windows.

That makes the skill much closer to a fuzz engineering handbook than to a single-pass code template.

5. Problems This Design Solves¶

Combining the current SKILL.md, references, and evaluation report, the skill addresses the following engineering problems:

Problem type	Corresponding design	Practical effect
Unsuitable targets still get fuzzed	Applicability Gate + Hard Stop	Rejects and stops the workflow explicitly
Fuzz effort is aimed at the wrong targets	Target Priority Gate	Pushes budget toward high-yield targets first
Only weak no-panic assertions exist	Oracle / invariant requirement	Forces a testable property, not just execution
Long fuzz runs risk OOM	Size guard + harness guards	Bounds expensive or pathological inputs
Seeds are weak and skip rate is high	Seed mining strategy	Uses real inputs to improve exploration
Local and CI budgets are uncontrolled	Risk and Cost Gate	Makes fuzz-time allocation explicit
Crash fixes do not persist	Crash Handling + corpus policy	Preserves replayable regression inputs
Reports are not auditable	Output Contract + structured self-check practice	Makes verdict, rationale, and action explicit; when implementation proceeds, it also makes execution details and quality checks easier to surface consistently
It is unclear when to choose another test method	Fuzz vs property-based boundary + alternative strategies	Helps users pick the right test approach

6. Key Highlights¶

6.1 "Decide Whether It Should Be Fuzzed First" Is the Core of the Skill¶

This is the largest source of differentiation. The most important gain in the evaluation is not more elaborate fuzz code. It is knowing when to stop.

6.2 Its Rejection of Unsuitable Targets Is Deliberately Strong¶

Many models still try to produce code for bad targets. fuzzing-test protects engineering decision quality before protecting output volume.

6.3 It Systematizes Size Guards, Cost Class, and Quick Commands¶

Each of these rules is simple on its own, but together they greatly improve the long-term operability of fuzzing within teams and CI.

6.4 It Takes Real Seeds and Real Structure Seriously¶

The skill does not treat f.Add(...) as decoration. It treats seed quality as a core determinant of exploration quality.

6.5 Its Crash Handling Has a Full Closure Loop¶

From preserving corpus to replaying regressions and documenting prevention guards, it ensures that fuzz findings do not remain one-off discoveries.

6.6 The Current Version Goes Further Than the Evaluation Snapshot in Runtime Engineering¶

The evaluation most strongly validates the skill's value in Applicability Gate discipline, rejection ability, size guards, and structured output. At the same time, the current SKILL.md extends that foundation with Go Version Gate, -race, worker parallelism, a go-fuzz-headers bridge, and performance baseline guidance. In other words, the evaluation validates the core direction, while the current skill broadens it into a more complete fuzz engineering practice.

7. When to Use It — and When Not To¶

Scenario	Suitable	Reason
Parser / decoder / protocol handler	Yes	These are the highest-yield fuzzing targets
Round-trip codec	Yes	Strong invariants make fuzzing valuable
Validator / sanitizer	Yes	No-panic plus domain constraints are clear
Package-level screening across multiple candidates	Yes	Target prioritization and rejection are highly valuable
Network / DB / heavily external paths	Usually no	Prefer fuzzing the pure logic layer or using integration tests
Trivial wrappers	No	Unit tests are more appropriate
Functions without a clear oracle	No	Even many iterations will not reliably find logic bugs
Complex structured domains with very high skip rate	Not always for native fuzzing	Property-based testing may be the better fit

8. Conclusion¶

The real strength of fuzzing-test is not that it can write a FuzzXxx template. It is that it systematizes the engineering judgments most often ignored in fuzz testing: run the Applicability Gate first, use priority and cost class to decide what to fuzz and how deeply, and then use size guards, real seeds, strong oracles, and crash-closure rules to make the resulting harnesses both bug-finding and maintainable.

From a design perspective, the skill embodies a clear principle: high-quality fuzzing depends less on "running more" than on choosing the right target, defining the right oracle, controlling cost, and turning discoveries into replayable assets. That is why it is especially well suited to parsers, codecs, validators, and multi-target screening work.

9. Document Maintenance¶

This document should be updated when:

the Applicability Gate, Risk and Cost Gate, Output Contract, Quality Scorecard, or Guardrails in skills/fuzzing-test/SKILL.md change,
key rules in skills/fuzzing-test/references/applicability-checklist.md, target-priority.md, crash-handling.md, ci-strategy.md, or advanced-tuning.md change,
the skill's recommendations around Go version, -race, parallelism, or corpus policy change materially,
key supporting conclusions in evaluate/fuzzing-test-skill-eval-report.md change,
the gap between the evaluation snapshot and the current implementation grows further.

Review quarterly; review immediately if the gates, guardrails, or crash/CI strategy of fuzzing-test changes substantially.

10. Further Reading¶

skills/fuzzing-test/SKILL.md
skills/fuzzing-test/references/applicability-checklist.md
skills/fuzzing-test/references/target-priority.md
skills/fuzzing-test/references/crash-handling.md
skills/fuzzing-test/references/ci-strategy.md
skills/fuzzing-test/references/advanced-tuning.md
evaluate/fuzzing-test-skill-eval-report.md