security-review Skill Design Rationale

security-review is an exploitability-first security review framework. Its core idea is that a review must first establish how deep it should go, which security domains actually apply, which risks have a real exploit path, which suspicious points should be suppressed, and which areas remain uncovered. Only then does it deliver findings with confidence labels, standards mapping, baseline status, and explicit coverage gaps. That is why the skill combines Review Depth, Evidence Confidence, False-Positive Suppression, Applicability-First Execution, Gates A-F, Scenario Checklists, Automation Evidence, and the Output Contract into one fixed process.

1. Definition

security-review is used for:

running exploitability-first security review on code changes,
covering auth, input, secrets, API, data flow, dependencies, resource lifecycle, concurrency, and container risk,
classifying Lite / Standard / Deep review depth from change scope and trigger signals,
expressing findings with confidence, CWE/OWASP mapping, and baseline status,
suppressing false positives while explicitly recording uncovered risks,
and enforcing general mandatory gates such as Gate A together with Go-specific secure-coding coverage such as Gate D when relevant.

Its output is not only findings. Depending on review depth, it may also include:

review depth and rationale,
Go 10-Domain Coverage,
automation evidence,
open questions / assumptions,
risk acceptance register,
remediation plan,
machine-readable JSON,
hardening suggestions,
uncovered risk list.

From a design perspective, it is closer to a security-review governance framework than to a generic prompt for commenting on code safety.

2. Background and Problems

The main problem this skill addresses is not that models cannot spot security issues. It is that security review tends to go wrong in three dangerous ways:

it finds issues but does not separate exploitable vulnerabilities from theoretical concerns,
it reports security problems without confidence labels or standards mapping,
it produces reports that look complete without declaring what was never covered.

Without an explicit process, the most common failures cluster into eight categories:

Problem	Typical consequence
Review depth is not selected first	simple changes get over-reviewed, while complex changes may still be under-reviewed
Applicability is not triaged first	every domain gets reviewed mechanically, at high cost and with many empty N/A outputs
`confirmed` / `likely` / `suspected` are not separated	severity and evidence strength get mixed together
No false-positive suppression exists	path traversal, CSRF, randomness, and similar areas get over-reported
Resource lifecycle is not checked	response bodies, transactions, connections, and goroutines leak without being reviewed as security risk
Uncovered risk is never declared	the report implies false completeness
No standards mapping exists	findings do not integrate cleanly into audit and governance workflows
No baseline comparison exists	new issues, regressions, and legacy issues get blended together

The design logic of security-review is to make "how deep should this review go, which domains are relevant, and which paths are actually exploitable?" explicit before deciding how findings should be written and what the report is allowed to claim.

3. Comparison with Common Alternatives

It helps to compare the skill with a few common alternatives:

Dimension	`security-review` skill	Asking a model to "do a security review"	Manual experience-driven review
Review-depth routing	Strong	Weak	Medium
False-positive suppression discipline	Strong	Weak	Medium
Applicability-first execution	Strong	Weak	Weak
Confidence and standards mapping	Strong	Weak	Medium
Resource lifecycle review	Strong	Medium	Medium
Uncovered-risk declaration	Strong	Weak	Weak
Machine-consumable output	Strong	Weak	Weak
Baseline comparison support	Strong	Weak	Medium

Its value is not only that the report looks more audit-ready. Its value is that it turns security review from one-off issue spotting into an engineering review process with boundaries, gates, and evidence levels.

4. Core Design Rationale

4.1 Review Depth Selection Comes First

The first step in security-review is not vulnerability hunting. It is selecting:

Lite,
Standard,
or Deep,

based on file count and trigger signals.

This is the structural axis of the skill because one of the most common review failures is not total neglect, but applying the same depth to every change. security-review explicitly says:

small changes with no security-sensitive paths can use Lite,
auth, crypto, payment, new endpoints, dependency changes, and infra changes force Standard or Deep,
large changes, new services, new external integrations, or auth redesign push the review into Deep.

The evaluation showed this as one of the clearest skill-only outputs: without-skill could still find important issues, but it never explained why a given review should be Lite or Standard, and therefore never made review cost or coverage boundaries explicit.
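
The routing rules above can be sketched as a small Go function. This is only an illustration: the `Change` type, the trigger names, and the file-count thresholds (`liteMaxFiles`, `deepMinFiles`) are assumptions made for the example, not values taken from SKILL.md.

```go
package main

import "fmt"

// Depth is the review depth selected before any vulnerability hunting starts.
type Depth int

const (
	Lite Depth = iota
	Standard
	Deep
)

func (d Depth) String() string {
	return [...]string{"Lite", "Standard", "Deep"}[d]
}

// Change summarizes the signals the router looks at. All fields are
// illustrative assumptions, not the skill's actual input format.
type Change struct {
	FilesChanged      int
	SensitiveTriggers []string // e.g. "auth", "crypto", "payment", "new-endpoint", "dependency", "infra"
	DeepTriggers      []string // e.g. "new-service", "external-integration", "auth-redesign"
}

// Hypothetical file-count thresholds; the real skill defines its own.
const (
	liteMaxFiles = 3
	deepMinFiles = 20
)

// SelectDepth returns the depth plus a one-line rationale, so the report can
// state why this depth was chosen.
func SelectDepth(c Change) (Depth, string) {
	switch {
	case len(c.DeepTriggers) > 0:
		return Deep, fmt.Sprintf("deep trigger present: %v", c.DeepTriggers)
	case c.FilesChanged >= deepMinFiles:
		return Deep, fmt.Sprintf("large change: %d files", c.FilesChanged)
	case len(c.SensitiveTriggers) > 0:
		return Standard, fmt.Sprintf("security-sensitive trigger present: %v", c.SensitiveTriggers)
	case c.FilesChanged <= liteMaxFiles:
		return Lite, "small change with no security-sensitive paths"
	default:
		return Standard, "medium-sized change without explicit triggers"
	}
}

func main() {
	depth, why := SelectDepth(Change{FilesChanged: 2, SensitiveTriggers: []string{"auth"}})
	fmt.Printf("%s: %s\n", depth, why)
}
```

Returning the rationale together with the depth mirrors the point made above: the choice of depth is itself part of the output, not a hidden judgment.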

4.2 Lite / Standard / Deep Are Cost-Control Mechanisms, Not Just Labels

The skill does not treat review depth as a cosmetic label. Each depth changes the process:

Lite follows only a subset of the gates,
Standard runs the full 15-step process,
Deep runs the full 15-step process plus extended call-graph tracing.

This matters because the cost of security review is not uniform. Lite does not mean "no review"; it means a smaller required subset of the process for genuinely low-risk changes, with Gate B/C/E skipped by scope policy and Fast Pass available when all conditions are met. Deep requires longer-path tracing beyond the immediate diff. In other words, the skill turns review intensity from an implicit judgment into an explicit control surface.
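
One way to picture depth as a control surface is a lookup from depth to obligations. The sketch below is hypothetical: the `Plan` type and its field names are invented for illustration, and the Lite gate set (A, D, F) is inferred only from the statement that Gates B, C, and E are skipped.

```go
package main

import "fmt"

// Plan describes what a given depth obliges the reviewer to do.
type Plan struct {
	Gates               []string // gates that must be executed
	FullProcess         bool     // true when the full 15-step process runs
	ExtendedCallTracing bool     // Deep only: tracing beyond the immediate diff
	FastPassAllowed     bool     // Lite only, and only when all conditions are met
}

// PlanFor maps a depth name to its obligations.
func PlanFor(depth string) (Plan, error) {
	all := []string{"A", "B", "C", "D", "E", "F"}
	switch depth {
	case "Lite":
		// Assumption: with B/C/E skipped by scope policy, A, D, and F remain.
		return Plan{Gates: []string{"A", "D", "F"}, FastPassAllowed: true}, nil
	case "Standard":
		return Plan{Gates: all, FullProcess: true}, nil
	case "Deep":
		return Plan{Gates: all, FullProcess: true, ExtendedCallTracing: true}, nil
	}
	return Plan{}, fmt.Errorf("unknown review depth %q", depth)
}

func main() {
	for _, depth := range []string{"Lite", "Standard", "Deep"} {
		plan, err := PlanFor(depth)
		if err != nil {
			panic(err)
		}
		fmt.Printf("%-8s gates=%v full=%v tracing=%v fastpass=%v\n",
			depth, plan.Gates, plan.FullProcess, plan.ExtendedCallTracing, plan.FastPassAllowed)
	}
}
```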

4.3 Applicability-First Execution Is Necessary

The skill forces a two-phase execution model:

Phase 1: classify each Go domain as Applicable or N/A,
Phase 2: run deep review and domain-specific tooling only for Applicable domains.

This is a critical design choice because security review is easy to drown in exhaustive checklist behavior. Applicability-first execution lets the skill decide which domains are genuinely relevant before paying the cost of deeper review and domain-specific tooling. It does not promise total coverage and then fill large tables with empty N/A; it first proves why a domain deserves attention.
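
A minimal sketch of the two-phase model follows. The `Domain` and `Coverage` types are assumptions for the example, and the path-based predicates in `main` are deliberately simplistic heuristics, not the skill's real triage rules.

```go
package main

import (
	"fmt"
	"strings"
)

// Domain pairs a Gate D domain name with a Phase 1 applicability predicate.
type Domain struct {
	Name    string
	Applies func(changedPaths []string) bool
}

// Coverage is one row of the domain-coverage table.
type Coverage struct {
	Domain     string
	Applicable bool
	Reviewed   bool
}

// RunTwoPhase classifies every domain first (Phase 1) and only then calls
// review for the domains that were classified Applicable (Phase 2).
func RunTwoPhase(domains []Domain, changedPaths []string, review func(domain string)) []Coverage {
	rows := make([]Coverage, 0, len(domains))
	for _, d := range domains { // Phase 1: cheap triage, no deep review yet
		rows = append(rows, Coverage{Domain: d.Name, Applicable: d.Applies(changedPaths)})
	}
	for i := range rows { // Phase 2: pay the review cost only where it applies
		if rows[i].Applicable {
			review(rows[i].Domain)
			rows[i].Reviewed = true
		}
	}
	return rows
}

// anyPath reports whether at least one changed path satisfies match.
func anyPath(paths []string, match func(string) bool) bool {
	for _, p := range paths {
		if match(p) {
			return true
		}
	}
	return false
}

func main() {
	domains := []Domain{
		{"dependency posture", func(ps []string) bool {
			return anyPath(ps, func(p string) bool {
				return strings.HasSuffix(p, "go.mod") || strings.HasSuffix(p, "go.sum")
			})
		}},
		{"injection + SQL lifecycle", func(ps []string) bool {
			// Hypothetical heuristic: a "/store/" directory holds database code.
			return anyPath(ps, func(p string) bool { return strings.Contains(p, "/store/") })
		}},
	}
	rows := RunTwoPhase(domains, []string{"go.mod", "cmd/api/main.go"}, func(d string) {
		fmt.Println("deep review:", d)
	})
	for _, r := range rows {
		fmt.Printf("%-28s applicable=%v reviewed=%v\n", r.Domain, r.Applicable, r.Reviewed)
	}
}
```

Keeping the N/A rows in the returned table preserves the record of why a domain was skipped, while the review callback only ever runs for Applicable domains.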

4.4 False-Positive Suppression Must Be a First-Class Rule

Before publishing a finding, security-review requires four suppression checks. A candidate is suppressed when:

an upstream guard already blocks the path,
the input is not attacker-controlled,
the sink is safely handled by framework guarantees,
the issue is only a theoretical environmental risk with no reachable path.

This is extremely important because the fastest way to erode team trust in security review is not to miss a low-priority hardening issue; it is to over-report non-findings as serious vulnerabilities. In the evaluation, without-skill reported /convert as CSRF and openAPISpecPath as path traversal, while with-skill suppressed or reclassified them to the correct root cause. That shows one of the skill's biggest increments is not only "finding issues," but "not misclassifying issues."
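
The four suppression checks can be expressed as a single predicate that also records why a candidate was dropped. The `Candidate` type and its field names are assumptions for this sketch; in a real review each flag would be the result of reading code, not a given input.

```go
package main

import "fmt"

// Candidate is a suspicious point found during the first pass. The fields
// mirror the four suppression checks one to one.
type Candidate struct {
	Title               string
	UpstreamGuardBlocks bool // an upstream guard already blocks the path
	AttackerControlled  bool // the input can be influenced by an attacker
	FrameworkSafeSink   bool // the sink is safely handled by framework guarantees
	ReachablePath       bool // a concrete path to the sink exists
}

// SuppressionReason returns the first rule that suppresses the candidate.
// suppressed is false when no rule applies and the candidate may be published.
func SuppressionReason(c Candidate) (reason string, suppressed bool) {
	switch {
	case c.UpstreamGuardBlocks:
		return "upstream guard already blocks the path", true
	case !c.AttackerControlled:
		return "input is not attacker-controlled", true
	case c.FrameworkSafeSink:
		return "sink is safely handled by framework guarantees", true
	case !c.ReachablePath:
		return "theoretical risk without a reachable path", true
	}
	return "", false
}

func main() {
	candidates := []Candidate{
		{Title: "file path built from static config", AttackerControlled: false, ReachablePath: true},
		{Title: "query built from request parameter", AttackerControlled: true, ReachablePath: true},
	}
	for _, c := range candidates {
		if reason, suppressed := SuppressionReason(c); suppressed {
			fmt.Printf("SUPPRESS %q: %s\n", c.Title, reason)
		} else {
			fmt.Printf("PUBLISH  %q\n", c.Title)
		}
	}
}
```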

4.5 Evidence Confidence Is Mandatory

Every finding must carry one confidence label:

confirmed,
likely,
suspected.

This is not just formatting. It is evidence discipline. Many security reviews are not completely wrong, but they still blur "this looks bad" and "this is proven exploitable." security-review requires:

stronger evidence for high-severity claims,
`confirmed` findings to be supported by code and/or reproducible path evidence,
`likely` findings to name the one missing runtime assumption,
`suspected` findings to state clearly that the evidence is still weak.

In the evaluation, without-skill had no confidence labels in any scenario, while with-skill had them in all three. That makes confidence labeling one of the skill's clearest process-level increments.
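
The evidence rules can be enforced mechanically before a finding is published. The sketch below is an assumption-laden illustration: the `Finding` fields and the severity scale are invented, and the rule that a high-severity finding may not be published as `suspected` is just one possible reading of "stronger evidence for high-severity claims".

```go
package main

import (
	"errors"
	"fmt"
)

// Finding carries the minimum data needed to check evidence discipline.
type Finding struct {
	Title             string
	Severity          string   // hypothetical scale: "low", "medium", "high", "critical"
	Confidence        string   // "confirmed", "likely", or "suspected"
	Evidence          []string // code references or reproduction steps
	MissingAssumption string   // the one runtime assumption a "likely" finding depends on
}

// Validate rejects findings whose confidence label is not backed by the
// evidence that label requires.
func Validate(f Finding) error {
	highSeverity := f.Severity == "high" || f.Severity == "critical"
	switch f.Confidence {
	case "confirmed":
		if len(f.Evidence) == 0 {
			return errors.New("confirmed finding needs code or reproducible path evidence")
		}
	case "likely":
		if f.MissingAssumption == "" {
			return errors.New("likely finding must name the missing runtime assumption")
		}
	case "suspected":
		// Assumption: high-severity claims need more than a suspicion.
		if highSeverity {
			return errors.New("high-severity claim needs stronger evidence than suspected")
		}
	default:
		return fmt.Errorf("unknown confidence label %q", f.Confidence)
	}
	return nil
}

func main() {
	f := Finding{Title: "token logged in plaintext", Severity: "high", Confidence: "suspected"}
	fmt.Println(Validate(f))
}
```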

4.6 Gate A Separately Audits Constructor-Release Pairing

Gate A requires pairing analysis for every acquisition or constructor in changed code and immediately related call paths, such as:

New*,
Open*,
Acquire*,
Begin*,
Dial*,
Listen*,
Create*,
WithCancel/WithTimeout/WithDeadline,

and verifying matching cleanup such as:

Close,
Release,
Rollback/Commit,
Stop,
Cancel,
or explicit ownership transfer documented in code.

This is a strong design choice because many security-relevant failures are not classic "user-input vulnerabilities." They are lifecycle defects that create availability or consistency risk. By making resource pairing a mandatory gate rather than an optional quality concern, the skill treats leaks, transaction-boundary defects, and unbounded goroutine lifetime as first-class security review issues.

4.7 Gate D's 10-Domain Coverage Is the Structural Core

For Go repositories, the skill always routes through 10 domains:

randomness safety,
injection + SQL lifecycle,
sensitive data handling,
secret/config management,
TLS safety,
crypto primitives,
concurrency safety,
Go-specific injection sinks,
static scanner posture,
dependency posture.

This is the point where the skill most clearly becomes a framework rather than a prompt. It does not assume a reviewer will naturally remember these domains on every change. Instead, it hard-codes them into the structure and then uses Applicable/N/A to control cost. The evaluation also makes this explicit: without-skill was not weak at finding core vulnerabilities, but it had no systematic domain-coverage structure at all, and Gate D coverage itself was 0/3 without the skill.

4.8 Gate E's Second-Pass Falsification Matters

After the first-pass findings, the skill forces a second pass that asks:

what might have been missed because the first pass focused too heavily on one exploit class,
whether availability, consistency, lifecycle, or partial-failure paths were under-reviewed,
whether transaction, rollback, cleanup, or idempotency-race issues were missed.

This is a mature design choice because review bias often comes from over-fixating on the first class of issue that appears. Gate E forces the reviewer to actively challenge the first-pass conclusion rather than simply polishing it.

4.9 `Uncovered Risk List` as Gate F's Required Output

security-review explicitly requires an uncovered-risk list whether or not findings exist.

Each item must explain:

what area was not covered,
why it was not covered,
what the impact would be if a defect were hiding there,
and which follow-up action and suggested owner make sense.

This is one of the most governance-relevant parts of the skill. Many security reviews become dangerous not because they contain too few findings, but because they imply "everything important was checked." Gate F directly resists false completeness. In the evaluation, without-skill omitted Gate F in all three scenarios, while with-skill included it every time.
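
The Gate F requirement lends itself to a simple completeness check. In this sketch the `UncoveredRisk` type and its field names are assumptions, and treating a nil slice as "section omitted" (as opposed to an explicitly empty list) is a modeling choice made for the example.

```go
package main

import (
	"errors"
	"fmt"
)

// UncoveredRisk is one entry of the Gate F list.
type UncoveredRisk struct {
	Area           string // what was not covered
	Reason         string // why it was not covered
	Impact         string // what happens if a defect hides there
	FollowUp       string // suggested follow-up action
	SuggestedOwner string // who should pick it up
}

// CheckGateF fails when the list is missing or any entry is incomplete.
// It does not depend on whether the review produced findings.
func CheckGateF(risks []UncoveredRisk) error {
	if risks == nil {
		return errors.New("uncovered-risk list is missing")
	}
	for i, r := range risks {
		if r.Area == "" || r.Reason == "" || r.Impact == "" || r.FollowUp == "" || r.SuggestedOwner == "" {
			return fmt.Errorf("uncovered-risk item %d is incomplete", i)
		}
	}
	return nil
}

func main() {
	risks := []UncoveredRisk{{
		Area:           "third-party webhook receiver",
		Reason:         "source not included in the diff",
		Impact:         "unauthenticated callbacks could reach internal handlers",
		FollowUp:       "review signature verification in the receiver",
		SuggestedOwner: "integration team",
	}}
	fmt.Println(CheckGateF(risks))
}
```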

4.10 Findings Must Be Standards-Mapped Output Artifacts

Each finding should include standards mapping when applicable:

CWE-xxx,
OWASP ASVS <section>.

The value here is that security-review output becomes useful not only to the immediate engineer, but also to audit, governance, and cross-team tracking. Without standards mapping, a review reads more like an opinionated memo. With mapping, it can enter compliance logs, remediation trackers, and risk registers. This was also one of the clearest skill-only differences in the evaluation.
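
To show how a standards-mapped finding can feed trackers and registers, here is a hedged sketch of a machine-readable record. The JSON field names are hypothetical and do not reproduce the skill's actual Output Contract; the ASVS section in the example should be verified against the ASVS version in use.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Finding is a hypothetical machine-readable record. The mapping fields are
// optional because mapping is only required "when applicable".
type Finding struct {
	ID             string `json:"id"`
	Title          string `json:"title"`
	Confidence     string `json:"confidence"`
	CWE            string `json:"cwe,omitempty"`
	OWASPASVS      string `json:"owasp_asvs,omitempty"`
	BaselineStatus string `json:"baseline_status,omitempty"`
}

// Encode renders a finding as compact JSON.
func Encode(f Finding) (string, error) {
	b, err := json.Marshal(f)
	return string(b), err
}

func main() {
	out, err := Encode(Finding{
		ID:             "SEC-001",
		Title:          "SQL query built by string concatenation",
		Confidence:     "confirmed",
		CWE:            "CWE-89",
		OWASPASVS:      "V5.3.4", // assumed ASVS 4.0 numbering; verify before use
		BaselineStatus: "new",
	})
	if err != nil {
		panic(err)
	}
	fmt.Println(out)
}
```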

4.11 The Focused Automation Gate Uses "Run What Matters, State What Was Skipped"

The skill's stance on automation is not "run every tool all the time." It is:

always run the baseline secret-pattern sweep,
run `gosec`, `govulncheck`, and `go test -race` according to applicable domains and cost,
if a tool is skipped, say exactly why.

This is practical because security automation is valuable, but tool availability, build health, and testability cannot always be assumed. The skill therefore refuses both extremes:

pretending tools were run when they were not,
and requiring every tool on every repository regardless of applicability.

It turns automation into evidence discipline instead of process theater.
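
A small planner makes the "state what was skipped" rule concrete. This sketch only decides what to run and does not execute any tool. The tool-to-domain mapping in `domainTools` and the `Step` type are assumptions for illustration.

```go
package main

import (
	"fmt"
	"os/exec"
)

// Step records whether a tool is scheduled and, if not, exactly why.
type Step struct {
	Tool       string
	Run        bool
	SkipReason string // non-empty whenever Run is false
}

// domainTools ties each tool to the domain that justifies running it.
// This mapping is an assumption made for the sketch.
var domainTools = []struct{ Tool, Binary, Domain string }{
	{"gosec", "gosec", "static scanner posture"},
	{"govulncheck", "govulncheck", "dependency posture"},
	{"go test -race", "go", "concurrency safety"},
}

// PlanAutomation always schedules the baseline secret-pattern sweep and
// schedules the other tools only for applicable domains with an installed binary.
func PlanAutomation(applicable map[string]bool, installed func(binary string) bool) []Step {
	steps := []Step{{Tool: "secret-pattern sweep", Run: true}}
	for _, t := range domainTools {
		switch {
		case !applicable[t.Domain]:
			steps = append(steps, Step{Tool: t.Tool, SkipReason: "domain N/A: " + t.Domain})
		case !installed(t.Binary):
			steps = append(steps, Step{Tool: t.Tool, SkipReason: t.Binary + " not found on PATH"})
		default:
			steps = append(steps, Step{Tool: t.Tool, Run: true})
		}
	}
	return steps
}

func main() {
	onPath := func(bin string) bool { _, err := exec.LookPath(bin); return err == nil }
	for _, s := range PlanAutomation(map[string]bool{"dependency posture": true}, onPath) {
		fmt.Printf("%-22s run=%-5v %s\n", s.Tool, s.Run, s.SkipReason)
	}
}
```

Because every skipped tool carries a reason, the resulting plan can be copied into the report's automation evidence section without implying that a tool ran when it did not.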

4.12 Language Extension Hooks Matter

Although security-review is deepest on Go, it does not bind its core method to Go alone. The skill explicitly includes extension hooks for:

Node.js / TypeScript,
Java / Spring,
Python / FastAPI / Django.

This shows that the stable core of the skill is not Go syntax knowledge itself, but:

exploitability-first review,
depth routing,
suppression discipline,
uncovered-risk declaration,
and structured output.

Go is simply the most fully developed reference path today. That separation between review-governance logic and language-specific checklists is what gives the skill long-term extensibility.

4.13 Baseline Diff Mode Is Preserved

When previous review artifacts exist, the skill classifies each finding as:

new,
regressed,
unchanged,
resolved.

This was not heavily exercised in the current evaluation, but it is still important because security review is rarely a one-time event. Without baseline diffing, teams cannot distinguish:

what this change introduced,
what older issues got worse,
what has actually been fixed.

So Baseline Diff Mode gives the skill continuity across repeated reviews instead of treating every security report as an isolated artifact.
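
The four baseline statuses can be derived by comparing two review snapshots. The sketch below assumes findings have stable IDs and an integer severity rank, and it treats "regressed" as "same ID, higher severity than before". Both are assumptions for illustration, not the skill's documented definition.

```go
package main

import (
	"fmt"
	"sort"
)

// Status is the baseline status attached to a finding.
type Status string

const (
	New       Status = "new"
	Regressed Status = "regressed"
	Unchanged Status = "unchanged"
	Resolved  Status = "resolved"
)

// Diff compares severity ranks keyed by a stable finding ID.
func Diff(previous, current map[string]int) map[string]Status {
	out := make(map[string]Status, len(previous)+len(current))
	for id, sev := range current {
		prevSev, seen := previous[id]
		switch {
		case !seen:
			out[id] = New
		case sev > prevSev:
			out[id] = Regressed
		default:
			out[id] = Unchanged
		}
	}
	for id := range previous {
		if _, still := current[id]; !still {
			out[id] = Resolved
		}
	}
	return out
}

func main() {
	prev := map[string]int{"SEC-1": 2, "SEC-2": 1, "SEC-3": 3}
	curr := map[string]int{"SEC-1": 3, "SEC-2": 1, "SEC-4": 2}
	result := Diff(prev, curr)

	ids := make([]string, 0, len(result))
	for id := range result {
		ids = append(ids, id)
	}
	sort.Strings(ids) // map iteration order is random, so sort for stable output
	for _, id := range ids {
		fmt.Println(id, result[id])
	}
}
```

With the sample data in `main`, SEC-1 is reported as regressed, SEC-2 as unchanged, SEC-3 as resolved, and SEC-4 as new.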

5. Problems This Design Solves

Combining the current SKILL.md, key references, and the evaluation report, the skill solves the following problems:

Problem type	Corresponding design	Practical effect
Review depth is uncontrolled	Review Depth Selection	Small changes are not over-reviewed; risky ones are not under-reviewed
N/A coverage is noisy and expensive	Applicability-First Execution	Triage happens before deep review
False positives are common	False-Positive Suppression Rules	Improves developer trust
Evidence strength is unclear	Evidence Confidence	Separates confirmed from likely or suspected
Resource lifecycle issues get missed	Gate A + Gate B	Improves coverage of response-body, transaction, connection, and goroutine risks
Security-domain coverage is unsystematic	Gate D 10-Domain	Makes review structure more complete
Reports imply false completeness	Gate F Uncovered Risk List	Makes blind spots explicit
Security reports are hard to govern	CWE/OWASP mapping + JSON summary	Better for audit, CI, and tracking

6. Key Highlights

6.1 It Turns Security Review into an Exploitability-First Process

This is not just "more checklist coverage." It begins by asking whether the path can actually be exploited.

6.2 Review-Depth Routing Is One of Its Most Visible Structural Strengths

Lite, Standard, and Deep bind review cost to change risk. That is one of the biggest things missing from default model behavior.

6.3 Its False-Positive Suppression Is Critically Important

Many teams do not reject security review because they dislike security. They reject it because they cannot trust over-reported findings. security-review directly improves that trust boundary.

6.4 Gates A, D, and F Form a Clear Governance Loop

Gate A handles lifecycle pairing, Gate D handles domain coverage, and Gate F handles uncovered-risk declaration. Together they reduce the chance of "formal-looking but actually incomplete" review.

6.5 Its Output Contract Is Built for Downstream Governance

Confidence, CWE/OWASP mapping, baseline status, and JSON summary make the result useful beyond the immediate conversation.

6.6 Its Real Increment Is Process Discipline More Than Vulnerability Discovery

The evaluation already shows this clearly: the base model was not weak at finding many core issues. The main delta came from depth routing, suppression, standards mapping, uncovered-risk declaration, JSON output, and systematic coverage. In other words, the skill's real value is review governance.

7. When to Use It and When Not To

Scenario	Suitable	Reason
Sensitive changes in auth, input handling, secrets, payments, or APIs	Very suitable	Trigger signals and multi-domain coverage are strong
Go services or infra-related changes	Very suitable	Gate A / D support is strongest here
Reviews that need audit traceability	Very suitable	Confidence, mapping, JSON, and Gate F are highly useful
Benign or low-risk changes	Suitable	Lite + Fast Pass can control review cost
Quick informal checking for obvious issues only	Not always	Full structured output may be heavier than needed
Contexts that do not need structured outputs at all	Not always optimal	A plain review may sometimes be enough

8. Conclusion

The real strength of security-review is not that it can produce more lines that sound like security findings. It is that it systematizes the engineering judgments that security review most often distorts: choose depth based on risk and scope, decide which domains are actually applicable, use suppression discipline to control false positives, use confidence and standards mapping to control claim strength, and then use Gate F to say explicitly what the report did not cover.

From a design perspective, the skill embodies a clear principle: the key to a high-quality security review is not making the report longer, but making every finding carry an exploit path, making every uncovered area visible, and making the review know what it looked at, what it skipped, and why. That is why it is especially well suited to engineering security review, audit traceability, and structured remediation workflows.

9. Document Maintenance

This document should be updated when:

the Review Depth logic, Evidence Confidence rules, Suppression Rules, Gate A-F definitions, Scenario Checklists, Focused Automation Gate, Standards Mapping, or Output Contract in skills/security-review/SKILL.md change,
key rules in skills/security-review/references/go-secure-coding.md, scenario-checklists.md, severity-calibration.md, anti-examples.md, security-review.md, or the language-specific references change,
key supporting conclusions in evaluate/security-review-skill-eval-report.md or evaluate/security-review-skill-eval-report.zh-CN.md change.

Review quarterly; review immediately if the depth-routing logic, suppression rules, Gate D / Gate F requirements, or standards-mapping rules of security-review change substantially.

10. Further Reading

skills/security-review/SKILL.md
skills/security-review/references/go-secure-coding.md
skills/security-review/references/scenario-checklists.md
skills/security-review/references/severity-calibration.md
skills/security-review/references/anti-examples.md
evaluate/security-review-skill-eval-report.md
evaluate/security-review-skill-eval-report.zh-CN.md