security-review Skill Evaluation Report

Evaluation framework: skill-creator
Evaluation date: 2026-03-12
Subject: security-review

security-review is an exploitability-first security review skill. It assesses authentication, input handling, secrets, API, data flow, dependency, and resource lifecycle risks in code changes, with an emphasis on reproducible, actionable findings. Its three main strengths are: (1) it selects review depth and multi-domain gate coverage up front, so changes at different risk levels get matching check intensity; (2) every finding carries evidence, a confidence label, and a CWE/OWASP mapping for audit and governance; and (3) it applies systematic false-positive suppression and records uncovered risks, so confirmed vulnerabilities stay separate from suspicious points that do not yet qualify as findings.

1. Evaluation Overview

This evaluation assesses the security-review skill along two dimensions: actual task performance and token cost-effectiveness. It uses 3 security review scenarios of varying complexity (a Web Handler review, an OpenAI API client review, and a benign pure-function review with no security risk). Each scenario runs in both with-skill and without-skill configurations, giving 3 scenarios × 2 configurations = 6 independent subagent runs, scored against 40 assertions in total.

Dimension	With Skill	Without Skill	Delta
Assertion pass rate	40/40 (100%)	20/40 (50.0%)	+50.0 percentage points
Review Depth selection	3/3 correct	0/3	Skill-only
Confidence labels	3/3	0/3	Skill-only
CWE/OWASP mapping	3/3	0/3	Skill-only
Gate D 10-Domain coverage	3/3	0/3	Skill-only
Machine-Readable JSON	3/3	0/3	Skill-only
Gate F Uncovered Risk list	3/3	0/3	Skill-only
False-Positive suppression	3/3 correct	1/3	Largest quality delta
Skill Token cost (SKILL.md)	~3,800 tokens	0	—
Skill Token cost (incl. Go references)	~9,600 tokens	0	—
Token cost per 1% pass-rate gain	~76 tokens (SKILL.md) / ~192 tokens (full)	—	—

2. Test Methodology

2.1 Scenario Design

Scenario	Target code	Core focus	Assertions
Eval 1: Web Handler review	`internal/webapp/handler.go` (285 lines) + `parser.go` + `urlutil.go`	HTTP input validation, SSRF, injection, resource lifecycle, false-positive suppression	15
Eval 2: OpenAI API client review	`internal/converter/summary_openai.go` (294 lines) + `urlutil.go` + `config/loader.go`	Secret management, external HTTP calls, SSRF, prompt injection, response body lifecycle	15
Eval 3: Benign pure-function review	`internal/cli/exitcode.go` (57 lines)	Lite depth judgment, 0 false positives, correct N/A labeling	10

2.2 Execution

- With-skill runs first read SKILL.md and its referenced Go secure-coding and scenario checklists.
- Without-skill runs read no skill; the review follows the model's default security review behavior.
- All runs execute in independent subagents.
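
The run matrix described above can be enumerated in a few lines. This is a minimal sketch: the labels mirror the directory names in the artifact paths listed in Section 8, and nothing here reflects how the evaluation harness itself is implemented.

```python
from itertools import product

# Labels follow the artifact directory names; the harness itself is not shown in this report.
SCENARIOS = ["eval-1", "eval-2", "eval-3"]
CONFIGS = ["with_skill", "without_skill"]

# One independent subagent run per (scenario, configuration) pair.
runs = list(product(SCENARIOS, CONFIGS))

for scenario, config in runs:
    print(f"{scenario}/{config}")
```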

3. Assertion Pass Rate

3.1 Summary

Scenario	Assertions	With Skill	Without Skill	Delta
Eval 1: Web Handler	15	15/15 (100%)	7/15 (46.7%)	+53.3%
Eval 2: API Client	15	15/15 (100%)	9/15 (60.0%)	+40.0%
Eval 3: Benign Code	10	10/10 (100%)	4/10 (40.0%)	+60.0%
Total	40	40/40 (100%)	20/40 (50.0%)	+50.0%
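
As a quick check, the percentages and deltas in the summary table can be recomputed from the raw counts. The sketch below uses only the numbers in the table; the helper names are illustrative.

```python
# (assertions, with-skill passes, without-skill passes) per scenario, from the table above.
results = {
    "Eval 1: Web Handler": (15, 15, 7),
    "Eval 2: API Client": (15, 15, 9),
    "Eval 3: Benign Code": (10, 10, 4),
}

def pass_rate(passed: int, total: int) -> float:
    """Pass rate as a percentage, rounded to one decimal place."""
    return round(100 * passed / total, 1)

def delta_points(total: int, with_skill: int, without_skill: int) -> float:
    """Difference between the two pass rates, in percentage points."""
    return round(100 * (with_skill - without_skill) / total, 1)

# Column-wise sums give the "Total" row.
totals = tuple(sum(col) for col in zip(*results.values()))

for name, row in {**results, "Total": totals}.items():
    total, ws, wo = row
    print(name, pass_rate(ws, total), pass_rate(wo, total), delta_points(*row))
```

Note that the deltas are differences between two percentages, so they are percentage points rather than relative percentages.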

3.2 Classification of 20 Without-Skill Failed Assertions

Failure type	Count	Evals	Notes
Missing Review Depth selection	3	1/2/3	No Lite/Standard/Deep classification, no trigger signal analysis
Missing Confidence labels	3	1/2/3	No confirmed/likely/suspected distinction
Missing CWE/OWASP mapping	3	1/2/3	Only HIGH/MEDIUM/LOW severity, no standard mapping
Missing Gate D 10-Domain coverage	3	1/2/3	No systematic domain coverage assessment
Missing Machine-Readable JSON	3	1/2/3	No CI/inbox-consumable JSON summary
Missing Gate F Uncovered Risk list	3	1/2/3	No declaration of uncovered areas; may imply false completeness
Gate A construct-release pairing audit missing	1	1	No explicit resource lifecycle audit
Insufficient false-positive suppression	1	1	`openAPISpecPath` reported as MEDIUM but path not user-controlled
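
The failure classification can be cross-checked against the per-scenario results in Section 3.1. The sketch below, which uses only figures from the two tables, tallies failures per eval and compares them with "assertions minus without-skill passes".

```python
# Failure types from the table above: (count, evals affected).
failures = {
    "Review Depth selection": (3, [1, 2, 3]),
    "Confidence labels": (3, [1, 2, 3]),
    "CWE/OWASP mapping": (3, [1, 2, 3]),
    "Gate D 10-Domain coverage": (3, [1, 2, 3]),
    "Machine-Readable JSON": (3, [1, 2, 3]),
    "Gate F Uncovered Risk list": (3, [1, 2, 3]),
    "Gate A construct-release pairing audit": (1, [1]),
    "False-positive suppression": (1, [1]),
}

# Each failure type fails once per listed eval, so count should equal len(evals).
per_eval = {1: 0, 2: 0, 3: 0}
for count, evals in failures.values():
    assert count == len(evals)
    for e in evals:
        per_eval[e] += 1

# Failures implied by Section 3.1: assertions minus without-skill passes.
expected = {1: 15 - 7, 2: 15 - 9, 3: 10 - 4}
print(per_eval, expected)
```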

3.3 Pass Rate by Assertion Category

Category	With Skill	Without Skill	Delta
Structural compliance (depth/gates/output contract)	18/18 (100%)	0/18 (0%)	+100%
Security analysis quality (attack surface, suppression, remediation)	13/13 (100%)	12/13 (92.3%)	+7.7%
Standards mapping (CWE/OWASP/confidence)	9/9 (100%)	0/9 (0%)	+100%

Key finding: The skill’s core value is structural compliance and standards mapping; the without-skill pass rate for both categories is 0%. Security analysis quality (finding real vulnerabilities) differs by only 7.7 percentage points, so the base model already has strong security review ability. The skill’s incremental value is process discipline, not discovery capability.

4. Dimension-by-Dimension Comparison

4.1 Review Depth Selection (Skill-Only Capability)

This is the skill’s most distinctive output.

Scenario	With Skill	Without Skill
Eval 1 (HTTP handler)	Standard — "new HTTP endpoints exposed" trigger signal	No depth selection
Eval 2 (API client)	Standard — "new external integration + secret management" trigger signal	No depth selection
Eval 3 (exitcode)	Lite — "1 file, no security-sensitive paths" + full exclusion rationale	No depth selection

Practical value: Review Depth controls the cost-benefit trade-off:
- Lite mode skips Gates B/C/E, saving ~40% review time
- The Standard/Deep distinction ensures security-sensitive code gets adequate review
- Without-skill applies the same depth to all scenarios, over-reviewing simple code and possibly under-reviewing complex code
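
To make the depth mechanism concrete, here is a minimal sketch of how a depth selector might gate which checks run. The skill's actual trigger rules live in SKILL.md and are not reproduced in this report, so the signal labels, the Deep threshold, and the assumption that the gates are exactly A through F are all hypothetical; only "Lite skips Gates B/C/E" comes from the text above.

```python
from dataclasses import dataclass, field

# Assumption: the gates are exactly A-F. The report confirms only that Lite skips B/C/E.
ALL_GATES = ["A", "B", "C", "D", "E", "F"]
LITE_SKIPPED = {"B", "C", "E"}

@dataclass
class Change:
    files: int
    # Hypothetical signal labels, modeled on the rationales quoted in the table above.
    signals: set = field(default_factory=set)

def select_depth(change: Change) -> str:
    """Pick a review depth; the thresholds are illustrative assumptions."""
    if not change.signals:
        return "Lite"
    # Assumption: several simultaneous trigger signals escalate to Deep.
    return "Deep" if len(change.signals) >= 3 else "Standard"

def gates_for(depth: str) -> list:
    """Gates that run at a given depth."""
    if depth == "Lite":
        return [g for g in ALL_GATES if g not in LITE_SKIPPED]
    return list(ALL_GATES)

exitcode = Change(files=1)
handler = Change(files=3, signals={"new_http_endpoint"})
print(select_depth(exitcode), gates_for(select_depth(exitcode)))
print(select_depth(handler), gates_for(select_depth(handler)))
```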

4.2 False-Positive Suppression Quality

This is the skill’s largest quality delta.

Suppression scenario	With Skill	Without Skill
SSRF via user URL (parser restricts github.com)	Correctly suppressed — "parser restricts host to github.com, handler doesn't make HTTP requests to raw URL"	Not reported (implicit handling)
Path traversal via openAPISpecPath	Correctly suppressed — "set at construction time from config, not user-controlled" (Rule 2)	❌ Reported as MEDIUM
Open redirect via http.Redirect	Correctly suppressed — "redirect target is hardcoded /swagger/index.html" (Rule 2)	Not reported (but reported catch-all route)
XSS via template	Correctly suppressed — "html/template auto-escapes" (Rule 3)	Correctly identified (positive observation)
appendThreadText recursion	Correctly suppressed — "GitHub API limits nesting depth"	❌ Reported as LOW (F-8)
CSRF on /convert	Correct N/A — "stateless form, no session, no state mutation"	❌ Reported as HIGH

Analysis: Without-skill’s CSRF finding (Eval 1 Finding #1) conflated cost exhaustion (rate limit exhaustion) with CSRF. With-skill correctly attributed the root cause to missing rate limiting (SEC-001 P2), not CSRF—because /convert is stateless with no session/cookie/state mutation. This demonstrates the skill’s suppression discipline: it prevents inflated severity by separating root cause from delivery mechanism.
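
The suppression decisions in the table follow a recognizable pattern: a candidate is dropped when its input is not attacker-controlled or when the framework already neutralizes it. The sketch below illustrates that pattern only. The rule numbers follow the report's citations, but their wording is inferred from the quoted rationales, and the `Candidate` fields are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Candidate:
    title: str
    attacker_controlled: bool   # can an attacker influence the sink's input?
    framework_mitigated: bool   # does the framework neutralize the issue by default?

def suppress(c: Candidate) -> Optional[str]:
    """Return a suppression reason, or None if the candidate stays a finding.

    Rule wording is inferred from the report's rationales, not quoted from SKILL.md.
    """
    if not c.attacker_controlled:
        return "Rule 2: input is not attacker-controlled"
    if c.framework_mitigated:
        return "Rule 3: framework mitigates by default"
    return None

candidates = [
    Candidate("Path traversal via openAPISpecPath", False, False),
    Candidate("XSS via template", True, True),
    Candidate("Rate limiting missing", True, False),
]
for c in candidates:
    print(c.title, "->", suppress(c) or "keep as finding")
```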

4.3 Output Structure Comparison

Output section	With Skill	Without Skill
Review Depth + rationale	✅ 3/3	❌ 0/3
Trust Boundary Mapping	✅ 3/3	❌ 0/3 (Eval 2 has similar content)
Scenario Checklists (11 items)	✅ 3/3	❌ 0/3
Gate A pairing table	✅ 3/3	❌ 0/3
Gate D 10-Domain table	✅ 3/3	❌ 0/3
Suppression Filter table	✅ 2/2 (Eval 3 N/A)	❌ 0/2
Gate E secondary verification	✅ 2/2 (Lite skips)	❌ 0/2
Findings (severity+confidence+CWE)	✅ 3/3	Partial (no confidence/CWE)
Remediation Plan (immediate/short/backlog)	✅ 3/3	Partial (priority but no SLA)
Risk Acceptance Register	✅ 3/3	❌ 0/3
JSON Summary	✅ 3/3	❌ 0/3
Gate F Uncovered Risk List	✅ 3/3	❌ 0/3

4.4 Security Finding Quality Comparison

Despite large structural differences, both configurations overlap significantly on core security findings:

Core finding	With Skill	Without Skill
Rate limiting missing	SEC-001 P2 ✅	Finding #2 HIGH ✅
Security headers missing	SEC-002/003 P3 ✅	Finding #3 MEDIUM ✅
Prompt injection	SEC-002 P2 (Eval 2) ✅	F-3 MEDIUM ✅
Unbounded response body	SEC-003 P2 (Eval 2) ✅	F-4 MEDIUM ✅
Redirect following leak	SEC-004 P2 (Eval 2) ✅	F-2 HIGH ✅
SSRF DNS rebinding	SEC-005 P3 (Eval 2) ✅	F-1 HIGH ✅
API key plain string	SEC-001 P3 (Eval 2) ✅	F-6 LOW ✅

Without-skill-only findings:
- CSRF on /convert (HIGH): root-cause misattribution
- URL scheme enforcement in parser (MEDIUM): valid defense-in-depth
- Unbounded pagination (MEDIUM): valid; with-skill mentioned it in Gate F
- Token via CLI flag (LOW): valid but outside the changed scope
- appendThreadText recursion (LOW): with-skill correctly suppresses it

With-skill-only findings:
- No core finding is with-skill-only (base model discovery is strong)
- With-skill severity calibration is more precise (e.g., SSRF DNS rebinding is correctly rated P3 suspected, not HIGH)

4.5 Eval 3 (Benign Code) — Lite Review Quality

This is the skill’s clearest efficiency advantage.

Dimension	With Skill	Without Skill
Output lines	227	46
Review Depth declaration	"Lite (1 file, no security-sensitive paths)" + 9 trigger signals excluded	None
Findings	0 (correct)	0 (correct)
Domain Coverage	10/10 N/A (each domain has code evidence)	Table but no numbered domains
Gates Skipped declaration	"Gates B/C/E skipped per Lite scope policy"	No Gate concept
Gate F Uncovered Risk	4 items (gosec not run, govulncheck not run, etc.)	None
JSON Summary	`pass: true`, 0 findings	None
Review depth appropriateness	✅ No over-review of simple code	⚠️ Unclear whether the brevity reflects deliberate scoping or a shallow review

Analysis: Without-skill’s 46-line output correctly concluded "no security vulnerabilities" but lacked audit traceability. With-skill’s 227-line output provides a full audit record: why Lite was chosen, why each domain is N/A, and which checks were skipped and why. For compliance (e.g., a SOC 2 audit), this traceability is required.
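
For illustration, a machine-readable summary of the Eval 3 result could be consumed by a CI gate roughly as follows. Only `pass: true`, the zero findings, the skipped gates, and the two named uncovered risks come from this report; every other field name is a hypothetical placeholder, and the real schema is defined by the skill's output contract.

```python
import json

# Field names other than `pass` are hypothetical placeholders.
# The report lists 4 uncovered-risk items; only the two it names are shown here.
summary = {
    "depth": "Lite",
    "pass": True,
    "findings": [],
    "gates_skipped": ["B", "C", "E"],
    "uncovered_risks": ["gosec not run", "govulncheck not run"],
}

def ci_gate(raw: str) -> int:
    """Return a process exit code: 0 if the review passed with no findings, else 1."""
    data = json.loads(raw)
    return 0 if data.get("pass") is True and not data.get("findings") else 1

payload = json.dumps(summary, indent=2)
print(payload)
print("exit code:", ci_gate(payload))
```

A gate like this is only possible because the summary is structured; the without-skill output offers nothing equivalent to parse.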

5. Token Cost-Effectiveness Analysis

5.1 Skill Size

File	Lines	Words	Bytes	Est. Tokens
SKILL.md	456	2,870	20,818	~3,800
references/go-secure-coding.md	723	2,957	25,019	~4,600
references/scenario-checklists.md	140	889	6,588	~1,200
references/security-review.md	112	557	4,165	~800
references/lang-nodejs.md	149	701	5,389	~1,000
references/lang-java.md	123	561	4,716	~900
references/lang-python.md	122	542	4,363	~800
Description (always in context)	—	~40	—	~50

Typical load scenarios:

Scenario	Files read	Total Tokens
Go code review (Standard/Deep)	SKILL.md + go-secure-coding.md + scenario-checklists.md	~9,600
Go code review (Lite)	SKILL.md + scenario-checklists.md	~5,000
Node.js code review	SKILL.md + scenario-checklists.md + lang-nodejs.md	~6,000
Java code review	SKILL.md + scenario-checklists.md + lang-java.md	~5,900
SKILL.md only (minimal)	SKILL.md	~3,800
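
The load-scenario totals are sums of the per-file estimates in the Skill Size table. A minimal sketch that reproduces them (the dictionary names are illustrative):

```python
# Estimated tokens per file, from the Skill Size table.
TOKENS = {
    "SKILL.md": 3800,
    "go-secure-coding.md": 4600,
    "scenario-checklists.md": 1200,
    "lang-nodejs.md": 1000,
    "lang-java.md": 900,
}

# Files read in each load scenario, from the table above.
LOADS = {
    "Go (Standard/Deep)": ["SKILL.md", "go-secure-coding.md", "scenario-checklists.md"],
    "Go (Lite)": ["SKILL.md", "scenario-checklists.md"],
    "Node.js": ["SKILL.md", "scenario-checklists.md", "lang-nodejs.md"],
    "Java": ["SKILL.md", "scenario-checklists.md", "lang-java.md"],
    "SKILL.md only": ["SKILL.md"],
}

totals = {name: sum(TOKENS[f] for f in files) for name, files in LOADS.items()}
for name, total in totals.items():
    print(f"{name}: ~{total:,} tokens")
```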

5.2 Token Cost for Quality Gain

Metric	Value
With-skill pass rate	100% (40/40)
Without-skill pass rate	50.0% (20/40)
Pass-rate gain	+50.0 percentage points
Token cost per assertion fixed	~190 tokens (SKILL.md only) / ~480 tokens (Go full)
Token cost per 1% pass-rate gain	~76 tokens (SKILL.md only) / ~192 tokens (Go full)
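
Both cost metrics divide the loaded token count by the improvement achieved. The sketch below re-derives the four figures in the table from the pass counts and token estimates; the function and constant names are illustrative.

```python
def cost_per_unit(tokens: int, units: float) -> float:
    """Context tokens spent per unit of improvement."""
    return tokens / units

SKILL_ONLY, GO_FULL = 3800, 9600            # estimated tokens loaded
ASSERTIONS_FIXED = 40 - 20                  # with-skill passes minus without-skill passes
GAIN_POINTS = 100 * ASSERTIONS_FIXED / 40   # 50.0 percentage points

for label, tokens in [("SKILL.md only", SKILL_ONLY), ("Go full", GO_FULL)]:
    print(label,
          round(cost_per_unit(tokens, ASSERTIONS_FIXED)),   # tokens per assertion fixed
          round(cost_per_unit(tokens, GAIN_POINTS)))        # tokens per percentage point
```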

5.3 Token Segment Cost-Effectiveness

SKILL.md content split by functional module:

Module	Est. Tokens	Related assertion delta	Cost-effectiveness
Review Depth Selection	~250	3 (3 evals depth selection)	Very high — 83 tok/assertion
Evidence Confidence	~100	3 (3 evals confidence labels)	Very high — 33 tok/assertion
Suppression Rules	~180	2 (Eval 1/2 suppression quality)	Very high — 90 tok/assertion
Output Contract	~500	3 (3 evals JSON summary)	High — 167 tok/assertion
Gate D 10-Domain	~400	3 (3 evals domain coverage)	High — 133 tok/assertion
Gate A pairing	~150	1 (Eval 1 pairing audit)	High — 150 tok/assertion
Gate F Uncovered Risk	~80	3 (3 evals risk list)	Very high — 27 tok/assertion
Standards Mapping	~50	3 (3 evals CWE mapping)	Very high — 17 tok/assertion
Severity Model + SLA	~200	Indirect (more precise severity calibration)	Medium — no direct assertion
Anti-Examples	~350	Indirect (avoids AE-1/AE-3/AE-5 errors)	Medium — defensive value
Scenario Checklists pointer	~200	Indirect (11-scenario systematic coverage)	Medium — structured review
Baseline Diff Mode	~100	0 (no baseline scenario tested)	Low — not tested
Language Extension Hooks	~150	0 (Go only tested)	Low — not tested
Focused Automation Gate	~350	Indirect (automation tool execution consistency)	Medium — tool discipline
go-secure-coding.md (reference)	~4,600	Indirect (Gate B/D detailed check guide)	Medium — deep review support
scenario-checklists.md (reference)	~1,200	Indirect (11-scenario detailed checks)	Medium — systematic coverage
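
The tok/assertion figures for the modules with a direct assertion delta are the token estimate divided by the delta. A minimal sketch that recomputes them and ranks the modules from cheapest to most expensive per assertion:

```python
# (estimated tokens, related assertion delta) for modules with a direct delta.
modules = {
    "Review Depth Selection": (250, 3),
    "Evidence Confidence": (100, 3),
    "Suppression Rules": (180, 2),
    "Output Contract": (500, 3),
    "Gate D 10-Domain": (400, 3),
    "Gate A pairing": (150, 1),
    "Gate F Uncovered Risk": (80, 3),
    "Standards Mapping": (50, 3),
}

per_assertion = {name: round(tok / delta) for name, (tok, delta) in modules.items()}

# Cheapest modules per assertion first.
for name in sorted(per_assertion, key=per_assertion.get):
    print(f"{name}: {per_assertion[name]} tok/assertion")
```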

5.4 High-Leverage vs Low-Leverage Instructions

High leverage (~1,710 tokens SKILL.md → 18 assertion delta):
- Review Depth Selection (250 tok → 3)
- Evidence Confidence (100 tok → 3)
- Suppression Rules (180 tok → 2)
- Output Contract (500 tok → 3)
- Gate D 10-Domain (400 tok → 3)
- Gate F Uncovered Risk (80 tok → 3)
- Standards Mapping (50 tok → 3)
- Gate A pairing (150 tok → 1)

Medium leverage (~1,100 tokens → indirect quality gain):
- Anti-Examples (350 tok): prevents false positives
- Scenario Checklists pointer (200 tok): systematic coverage
- Severity Model + SLA (200 tok): severity calibration
- Focused Automation Gate (350 tok): tool execution discipline

Low leverage (~250 tokens → no contribution in this evaluation):
- Baseline Diff Mode (100 tok): not tested
- Language Extension Hooks (150 tok): only Go was tested

References (~5,800 tokens → indirect review depth):
- go-secure-coding.md (4,600 tok): Gate B/D depth support
- scenario-checklists.md (1,200 tok): systematic scenario coverage

5.5 Token Efficiency Rating

Rating	Conclusion
Overall ROI	Excellent — ~9,600 tokens for a +50.0 percentage-point pass-rate gain (third-highest gain among the five skills compared in 5.6)
SKILL.md ROI	Excellent — ~3,800 tokens contains all high-leverage rules
High-leverage token share	~45% (1,710/3,800) directly contributes 18/20 assertion delta
Low-leverage token share	~6.6% (250/3,800) contributes nothing in this evaluation
Reference cost-effectiveness	High — though 60% of total tokens, provides required depth for Gate B/D
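
The share figures above follow from the tier totals in 5.4. The sketch below recomputes them; it also shows that the three tiers do not account for all of SKILL.md, so some tokens are left unclassified by this breakdown.

```python
SKILL_MD = 3800        # estimated SKILL.md tokens
FULL_GO_LOAD = 9600    # SKILL.md + both Go-relevant references
tiers = {"high": 1710, "medium": 1100, "low": 250}
references = 4600 + 1200

def share(part: int, whole: int) -> float:
    """Percentage share, rounded to one decimal place."""
    return round(100 * part / whole, 1)

print("high-leverage share of SKILL.md:", share(tiers["high"], SKILL_MD))
print("low-leverage share of SKILL.md:", share(tiers["low"], SKILL_MD))
print("reference share of full load:", share(references, FULL_GO_LOAD))

# Tokens in SKILL.md not assigned to any tier in Section 5.4.
unclassified = SKILL_MD - sum(tiers.values())
print("unclassified SKILL.md tokens:", unclassified)
```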

5.6 Comparison with Other Skills’ Cost-Effectiveness

Metric	security-review	go-makefile-writer	google-search	deep-research	tdd-workflow
SKILL.md Tokens	~3,800	~1,960	~3,500	~2,200	~2,800
Total load Tokens	~9,600	~4,100–4,600	~6,900	~3,500	~4,200
Pass-rate gain	+50.0%	+31.0%	+74.1%	+66.7%	+46.2%
Tokens per 1% (SKILL.md)	~76 tok	~63 tok	~47 tok	~33 tok	~61 tok
Tokens per 1% (full)	~192 tok	~149 tok	~93 tok	~53 tok	~91 tok

Analysis: security-review has the highest token cost per percentage point of the five skills compared (76 tok/1% for SKILL.md, 192 tok/1% for the full load), and its +50.0-point gain ranks third, behind google-search (+74.1) and deep-research (+66.7). The gain is nonetheless concentrated where the base model is weakest: the without-skill structural compliance pass rate is 0%, and the skill closes that gap completely.

References account for ~60% of tokens, but the Go secure-coding reference is required for Gate B/D and cannot be simplified. If selective loading is introduced (Lite skips go-secure-coding.md), Lite scenario token cost could drop from ~9,600 to ~5,000.
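
The SKILL.md tokens-per-point row of the comparison table can be re-derived from the token and gain rows. The sketch below does so and orders the five skills by each metric; the dictionary layout is illustrative.

```python
# (SKILL.md tokens, pass-rate gain in percentage points), from the comparison table.
skills = {
    "security-review": (3800, 50.0),
    "go-makefile-writer": (1960, 31.0),
    "google-search": (3500, 74.1),
    "deep-research": (2200, 66.7),
    "tdd-workflow": (2800, 46.2),
}

tokens_per_point = {name: round(tok / gain) for name, (tok, gain) in skills.items()}

# Cheapest per point first; largest gain first.
by_cost = sorted(tokens_per_point, key=tokens_per_point.get)
by_gain = sorted(skills, key=lambda name: skills[name][1], reverse=True)

print("tokens per point:", tokens_per_point)
print("by cost:", by_cost)
print("by gain:", by_gain)
```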

6. Boundary Analysis vs Claude Base Model Capabilities

6.1 Base Model Capabilities (No Skill Increment)

Capability	Evidence
Identify rate limiting missing	3/3 relevant scenarios correct
Identify prompt injection risk	1/1 scenario correct (Eval 2)
Identify unbounded response body	1/1 scenario correct (Eval 2)
Identify HTTP redirect following risk	1/1 scenario correct (Eval 2)
Identify SSRF DNS rebinding	1/1 scenario correct (Eval 2)
Identify API key storage issue	1/1 scenario correct (Eval 2)
Correctly judge benign code has no vulnerabilities	1/1 scenario correct (Eval 3)
MaxBytesReader positive defense identification	1/1 scenario correct (Eval 1)
html/template safety identification	1/1 scenario correct (Eval 1)
Provide code-level remediation	3/3 scenarios correct

6.2 Base Model Gaps (Skill Fills)

Gap	Evidence	Risk level
No Review Depth classification	3/3 scenarios no depth selection	High — review cost uncontrolled
No Confidence labels	3/3 scenarios no confirmed/likely/suspected	High — can’t distinguish confirmed vs hypothetical
No CWE/OWASP mapping	3/3 scenarios no standard mapping	High — doesn’t meet compliance audit requirements
No systematic domain coverage	3/3 scenarios no Gate D 10-Domain	High — may miss entire security domains
No Machine-Readable output	3/3 scenarios no JSON	Medium — CI automation gates unavailable
No Uncovered Risk declaration	3/3 scenarios no Gate F	High — false completeness (AE-5)
Insufficient false-positive suppression	Eval 1 path traversal false positive; CSRF root-cause misattribution	Medium — developer trust erosion
No resource lifecycle audit	Eval 1 no Gate A pairing table	Medium — may miss resource leaks

7. Overall Score

7.1 Dimension Scores

Dimension	With Skill	Without Skill	Delta
Review process structure	5.0/5	1.0/5	+4.0
Security finding quality	4.5/5	4.0/5	+0.5
False-positive suppression accuracy	5.0/5	2.5/5	+2.5
Severity calibration	5.0/5	3.0/5	+2.0
Standards mapping compliance	5.0/5	0.5/5	+4.5
Output consumability (JSON/audit)	5.0/5	1.0/5	+4.0
Overall mean	4.92/5	2.0/5	+2.92
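
The "Overall mean" row is an unweighted average of the six dimension scores. A minimal sketch that reproduces it from the table values:

```python
from statistics import mean

# (with-skill score, without-skill score) per dimension, each out of 5.
scores = {
    "Review process structure": (5.0, 1.0),
    "Security finding quality": (4.5, 4.0),
    "False-positive suppression accuracy": (5.0, 2.5),
    "Severity calibration": (5.0, 3.0),
    "Standards mapping compliance": (5.0, 0.5),
    "Output consumability (JSON/audit)": (5.0, 1.0),
}

with_mean = mean(w for w, _ in scores.values())
without_mean = mean(wo for _, wo in scores.values())

print(round(with_mean, 2), round(without_mean, 2), round(with_mean - without_mean, 2))
```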

7.2 Weighted Total

Dimension	Weight	Score	Weighted
Assertion pass rate (delta)	25%	10/10	2.50
Review process structure	20%	10/10	2.00
False-positive suppression & severity calibration	20%	9.5/10	1.90
Standards mapping compliance	15%	10/10	1.50
Token cost-effectiveness	10%	7.0/10	0.70
Maintainability & extensibility	10%	8.0/10	0.80
Weighted total			9.40/10
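
The weighted total is the sum of each score multiplied by its weight. The sketch below recomputes it from the table, keeping weights as integer percentages so the arithmetic stays exact.

```python
# (dimension, weight in percent, score out of 10), from the table above.
rows = [
    ("Assertion pass rate (delta)", 25, 10.0),
    ("Review process structure", 20, 10.0),
    ("False-positive suppression & severity calibration", 20, 9.5),
    ("Standards mapping compliance", 15, 10.0),
    ("Token cost-effectiveness", 10, 7.0),
    ("Maintainability & extensibility", 10, 8.0),
]

total_weight = sum(weight for _, weight, _ in rows)
weighted_total = sum(weight * score for _, weight, score in rows) / 100

for name, weight, score in rows:
    print(f"{name}: {weight * score / 100:.2f}")
print(f"Weighted total: {weighted_total:.2f}/10")
```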

8. Evaluation Artifacts

Artifact	Path
Eval 1 with-skill output	`/tmp/secreview-eval/eval-1/with_skill/response.md`
Eval 1 without-skill output	`/tmp/secreview-eval/eval-1/without_skill/response.md`
Eval 2 with-skill output	`/tmp/secreview-eval/eval-2/with_skill/response.md`
Eval 2 without-skill output	`/tmp/secreview-eval/eval-2/without_skill/response.md`
Eval 3 with-skill output	`/tmp/secreview-eval/eval-3/with_skill/response.md`
Eval 3 without-skill output	`/tmp/secreview-eval/eval-3/without_skill/response.md`
Skill file	`/Users/john/.codex/skills/security-review/SKILL.md`
Go secure-coding reference	`/Users/john/.codex/skills/security-review/references/go-secure-coding.md`
Scenario checklist reference	`/Users/john/.codex/skills/security-review/references/scenario-checklists.md`