security-review Skill Evaluation Report

Evaluation framework: skill-creator
Evaluation date: 2026-03-12
Subject: security-review


security-review is an exploitability-first security review skill. It assesses authentication, input handling, secrets, API, data-flow, dependency, and resource-lifecycle risks in code changes, with an emphasis on reproducible, actionable findings. It has three main strengths. First, it selects a review depth and multi-domain gate coverage up front, so changes of different risk levels get matching check intensity. Second, every finding carries evidence, a confidence label, and a CWE/OWASP mapping to support audit and governance. Third, it applies systematic false-positive suppression and records uncovered risks, so real vulnerabilities are separated from suspicious points that are not yet findings.

1. Evaluation Overview

This evaluation assesses the security-review skill along two dimensions: actual task performance and token cost-effectiveness. It uses 3 security review scenarios of differing complexity (a Web Handler review, an OpenAI API client review, and a benign pure-function review with no security risk). Each scenario runs in both with-skill and without-skill configurations, giving 3 scenarios × 2 configurations = 6 independent subagent runs, scored against 40 assertions.

| Dimension | With Skill | Without Skill | Delta |
|---|---|---|---|
| Assertion pass rate | 40/40 (100%) | 20/40 (50.0%) | +50.0 percentage points |
| Review Depth selection | 3/3 correct | 0/3 | Skill-only |
| Confidence labels | 3/3 | 0/3 | Skill-only |
| CWE/OWASP mapping | 3/3 | 0/3 | Skill-only |
| Gate D 10-Domain coverage | 3/3 | 0/3 | Skill-only |
| Machine-Readable JSON | 3/3 | 0/3 | Skill-only |
| Gate F Uncovered Risk list | 3/3 | 0/3 | Skill-only |
| False-Positive suppression | 3/3 correct | 1/3 | Largest quality delta |
| Skill Token cost (SKILL.md) | ~3,800 tokens | 0 | |
| Skill Token cost (incl. Go references) | ~9,600 tokens | 0 | |
| Token cost per 1% pass-rate gain | ~76 tokens (SKILL.md) / ~192 tokens (full) | | |

2. Test Methodology

2.1 Scenario Design

| Scenario | Target code | Core focus | Assertions |
|---|---|---|---|
| Eval 1: Web Handler review | internal/webapp/handler.go (285 lines) + parser.go + urlutil.go | HTTP input validation, SSRF, injection, resource lifecycle, false-positive suppression | 15 |
| Eval 2: OpenAI API client review | internal/converter/summary_openai.go (294 lines) + urlutil.go + config/loader.go | Secret management, external HTTP calls, SSRF, prompt injection, response body lifecycle | 15 |
| Eval 3: Benign pure-function review | internal/cli/exitcode.go (57 lines) | Lite depth judgment, 0 false positives, correct N/A labeling | 10 |

2.2 Execution

  • With-skill runs first read SKILL.md and its referenced Go secure-coding and scenario checklists
  • Without-skill runs read no skill; review follows model default security review behavior
  • All runs execute in independent subagents

3. Assertion Pass Rate

3.1 Summary

| Scenario | Assertions | With Skill | Without Skill | Delta (percentage points) |
|---|---|---|---|---|
| Eval 1: Web Handler | 15 | 15/15 (100%) | 7/15 (46.7%) | +53.3 |
| Eval 2: API Client | 15 | 15/15 (100%) | 9/15 (60.0%) | +40.0 |
| Eval 3: Benign Code | 10 | 10/10 (100%) | 4/10 (40.0%) | +60.0 |
| Total | 40 | 40/40 (100%) | 20/40 (50.0%) | +50.0 |
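
The deltas in the summary are differences between two pass rates, so they are percentage points rather than relative percentages. A minimal sketch of that arithmetic, using only the pass counts from the table (the function names are illustrative, not part of the evaluation framework):

```python
# (assertions, with-skill passes, without-skill passes), copied from the summary table.
results = {
    "Eval 1: Web Handler": (15, 15, 7),
    "Eval 2: API Client": (15, 15, 9),
    "Eval 3: Benign Code": (10, 10, 4),
}


def pass_rate(passed: int, total: int) -> float:
    """Pass rate as a percentage, rounded to one decimal."""
    return round(100.0 * passed / total, 1)


def delta_points(total: int, with_skill: int, without_skill: int) -> float:
    """Difference between the two pass rates, in percentage points."""
    return round(pass_rate(with_skill, total) - pass_rate(without_skill, total), 1)


for name, (total, with_skill, without_skill) in results.items():
    print(name, pass_rate(without_skill, total), delta_points(total, with_skill, without_skill))

# Aggregate row: sum the counts first, then compute the rate.
all_total = sum(t for t, _, _ in results.values())
all_with = sum(w for _, w, _ in results.values())
all_without = sum(wo for _, _, wo in results.values())
print("Total", pass_rate(all_without, all_total), delta_points(all_total, all_with, all_without))
```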

3.2 Classification of 20 Without-Skill Failed Assertions

| Failure type | Count | Evals | Notes |
|---|---|---|---|
| Missing Review Depth selection | 3 | 1/2/3 | No Lite/Standard/Deep classification, no trigger signal analysis |
| Missing Confidence labels | 3 | 1/2/3 | No confirmed/likely/suspected distinction |
| Missing CWE/OWASP mapping | 3 | 1/2/3 | Only HIGH/MEDIUM/LOW severity, no standard mapping |
| Missing Gate D 10-Domain coverage | 3 | 1/2/3 | No systematic domain coverage assessment |
| Missing Machine-Readable JSON | 3 | 1/2/3 | No CI/inbox-consumable JSON summary |
| Missing Gate F Uncovered Risk list | 3 | 1/2/3 | No declaration of uncovered areas; may imply false completeness |
| Gate A construct-release pairing audit missing | 1 | 1 | No explicit resource lifecycle audit |
| Insufficient false-positive suppression | 1 | 1 | openAPISpecPath reported as MEDIUM but path not user-controlled |

3.3 Pass Rate by Assertion Category

| Category | With Skill | Without Skill | Delta |
|---|---|---|---|
| Structural compliance (depth/gates/output contract) | 18/18 (100%) | 0/18 (0%) | +100% |
| Security analysis quality (attack surface, suppression, remediation) | 13/13 (100%) | 12/13 (92.3%) | +7.7% |
| Standards mapping (CWE/OWASP/confidence) | 9/9 (100%) | 0/9 (0%) | +100% |

Key finding: The skill’s core value is structural compliance and standards mapping: the without-skill pass rate for these categories is 0%. Security analysis quality (finding real vulnerabilities) differs by only 7.7 percentage points, so the base model already has strong security review ability; the skill’s incremental value is process discipline, not discovery capability.


4. Dimension-by-Dimension Comparison

4.1 Review Depth Selection (Skill-Only Capability)

This is the skill’s most distinctive output.

| Scenario | With Skill | Without Skill |
|---|---|---|
| Eval 1 (HTTP handler) | Standard: "new HTTP endpoints exposed" trigger signal | No depth selection |
| Eval 2 (API client) | Standard: "new external integration + secret management" trigger signal | No depth selection |
| Eval 3 (exitcode) | Lite: "1 file, no security-sensitive paths" + full exclusion rationale | No depth selection |

Practical value: Review Depth controls cost-benefit:

  • Lite mode skips Gates B/C/E, saving ~40% review time
  • The Standard/Deep distinction ensures security-sensitive code gets adequate review
  • Without-skill applies the same depth to all scenarios, over-reviewing simple code and possibly under-reviewing complex code
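
To make the depth decision concrete, here is a hypothetical sketch of how trigger signals could map to a depth and a gate list. The `ChangeSummary` fields, the `select_depth` function, and the rule that any signal implies Standard are assumptions for illustration; the actual rules live in SKILL.md and are not reproduced in this report. Only the signal wording and the "Lite skips Gates B/C/E" policy come from the evaluation outputs. The Deep tier is deliberately not modelled because the report does not state its threshold.

```python
from dataclasses import dataclass

ALL_GATES = ["A", "B", "C", "D", "E", "F"]
LITE_SKIPPED = {"B", "C", "E"}  # from the report: Lite skips Gates B/C/E


@dataclass
class ChangeSummary:
    """Hypothetical description of a change under review."""
    files_changed: int
    new_http_endpoints: bool = False
    new_external_integration: bool = False
    touches_secrets: bool = False


def select_depth(change: ChangeSummary) -> tuple[str, list[str]]:
    """Return (depth, rationale). Assumed rule: any trigger signal means Standard."""
    signals = []
    if change.new_http_endpoints:
        signals.append("new HTTP endpoints exposed")
    if change.new_external_integration:
        signals.append("new external integration")
    if change.touches_secrets:
        signals.append("secret management")
    if signals:
        return "Standard", signals
    return "Lite", [f"{change.files_changed} file(s), no security-sensitive paths"]


def gates_to_run(depth: str) -> list[str]:
    """Gates executed at a given depth."""
    if depth == "Lite":
        return [g for g in ALL_GATES if g not in LITE_SKIPPED]
    return list(ALL_GATES)


depth, why = select_depth(ChangeSummary(files_changed=1))
print(depth, why, gates_to_run(depth))
```

Recording the rationale alongside the depth is what lets a reader audit why a cheaper review was considered sufficient.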

4.2 False-Positive Suppression Quality

This is the skill’s largest quality delta.

| Suppression scenario | With Skill | Without Skill |
|---|---|---|
| SSRF via user URL (parser restricts github.com) | Correctly suppressed: "parser restricts host to github.com, handler doesn't make HTTP requests to raw URL" | Not reported (implicit handling) |
| Path traversal via openAPISpecPath | Correctly suppressed: "set at construction time from config, not user-controlled" (Rule 2) | ❌ Reported as MEDIUM |
| Open redirect via http.Redirect | Correctly suppressed: "redirect target is hardcoded /swagger/index.html" (Rule 2) | Not reported (but reported catch-all route) |
| XSS via template | Correctly suppressed: "html/template auto-escapes" (Rule 3) | Correctly identified (positive observation) |
| appendThreadText recursion | Correctly suppressed: "GitHub API limits nesting depth" | ❌ Reported as LOW (F-8) |
| CSRF on /convert | Correct N/A: "stateless form, no session, no state mutation" | ❌ Reported as HIGH |

Analysis: The without-skill CSRF finding (Eval 1 Finding #1) conflated cost exhaustion caused by missing rate limiting with CSRF. The with-skill run attributed the root cause to missing rate limiting (SEC-001 P2) rather than CSRF, because /convert is stateless: it has no session, no cookie, and no state mutation. This shows the skill’s suppression discipline: separating root cause from delivery mechanism prevents inflated severity.
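
The reasoning behind the CSRF suppression can be sketched as a small decision function. This is an illustration of the argument, not the skill's actual rule text: CSRF depends on the browser attaching ambient credentials (a session or cookie) to a state-changing request, so an endpoint with neither is not a CSRF target even if it can still be abused for cost. The function and parameter names are hypothetical.

```python
def attribute_root_cause(
    uses_ambient_credentials: bool,
    mutates_state: bool,
    has_rate_limit: bool,
) -> list[str]:
    """Return the findings that actually apply to an endpoint."""
    findings = []
    # CSRF needs both: credentials sent automatically AND a state change to trigger.
    if uses_ambient_credentials and mutates_state:
        findings.append("CSRF")
    # Cost exhaustion is a separate root cause and does not depend on sessions.
    if not has_rate_limit:
        findings.append("missing rate limiting")
    return findings


# /convert as described in Eval 1: stateless, no session/cookie, no rate limit.
print(attribute_root_cause(False, False, False))
```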

4.3 Output Structure Comparison

| Output section | With Skill | Without Skill |
|---|---|---|
| Review Depth + rationale | ✅ 3/3 | ❌ 0/3 |
| Trust Boundary Mapping | ✅ 3/3 | ❌ 0/3 (Eval 2 has similar content) |
| Scenario Checklists (11 items) | ✅ 3/3 | ❌ 0/3 |
| Gate A pairing table | ✅ 3/3 | ❌ 0/3 |
| Gate D 10-Domain table | ✅ 3/3 | ❌ 0/3 |
| Suppression Filter table | ✅ 2/2 (Eval 3 N/A) | ❌ 0/2 |
| Gate E secondary verification | ✅ 2/2 (Lite skips) | ❌ 0/2 |
| Findings (severity+confidence+CWE) | ✅ 3/3 | Partial (no confidence/CWE) |
| Remediation Plan (immediate/short/backlog) | ✅ 3/3 | Partial (priority but no SLA) |
| Risk Acceptance Register | ✅ 3/3 | ❌ 0/3 |
| JSON Summary | ✅ 3/3 | ❌ 0/3 |
| Gate F Uncovered Risk List | ✅ 3/3 | ❌ 0/3 |

4.4 Security Finding Quality Comparison

Despite large structural differences, both configurations overlap significantly on core security findings:

| Core finding | With Skill | Without Skill |
|---|---|---|
| Rate limiting missing | SEC-001 P2 ✅ | Finding #2 HIGH ✅ |
| Security headers missing | SEC-002/003 P3 ✅ | Finding #3 MEDIUM ✅ |
| Prompt injection | SEC-002 P2 (Eval 2) ✅ | F-3 MEDIUM ✅ |
| Unbounded response body | SEC-003 P2 (Eval 2) ✅ | F-4 MEDIUM ✅ |
| Redirect following leak | SEC-004 P2 (Eval 2) ✅ | F-2 HIGH ✅ |
| SSRF DNS rebinding | SEC-005 P3 (Eval 2) ✅ | F-1 HIGH ✅ |
| API key plain string | SEC-001 P3 (Eval 2) ✅ | F-6 LOW ✅ |

Without-skill-only findings:

  • CSRF on /convert (HIGH): root-cause misattribution
  • URL scheme enforcement in parser (MEDIUM): valid defense-in-depth
  • Unbounded pagination (MEDIUM): valid; with-skill mentioned it in Gate F
  • Token via CLI flag (LOW): valid but out of changed scope
  • appendThreadText recursion (LOW): with-skill correctly suppresses

With-skill-only findings:

  • No core finding is with-skill-only (base model discovery is strong)
  • With-skill severity calibration is more precise (e.g., SSRF DNS rebinding correctly P3 suspected, not HIGH)

4.5 Eval 3 (Benign Code) — Lite Review Quality

This is the skill’s clearest efficiency advantage.

| Dimension | With Skill | Without Skill |
|---|---|---|
| Output lines | 227 | 46 |
| Review Depth declaration | "Lite (1 file, no security-sensitive paths)" + 9 trigger signals excluded | None |
| Findings | 0 (correct) | 0 (correct) |
| Domain Coverage | 10/10 N/A (each domain has code evidence) | Table but no numbered domains |
| Gates Skipped declaration | "Gates B/C/E skipped per Lite scope policy" | No Gate concept |
| Gate F Uncovered Risk | 4 items (gosec not run, govulncheck not run, etc.) | None |
| JSON Summary | pass: true, 0 findings | None |
| Review depth appropriateness | ✅ No over-review of simple code | ⚠️ Unclear if brief or appropriate |

Analysis: The without-skill 46-line output correctly concluded "no security vulnerabilities" but lacked audit traceability. The with-skill 227-line output provides a full audit record: why Lite was chosen, why each domain is N/A, and which checks were skipped and why. For compliance work (e.g., a SOC 2 audit), this traceability is required.
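
Part of that audit record is the JSON summary, which Eval 3 reported as `pass: true` with 0 findings. A minimal sketch of how a CI step might consume such a summary follows. The report only confirms the `pass` flag and a findings count, so the `depth` and `findings` keys and the `ci_exit_code` helper are assumptions for illustration, not the skill's documented schema.

```python
import json


def ci_exit_code(summary_json: str) -> int:
    """Map a review summary to a process exit code: 0 passes the gate, 1 fails it."""
    summary = json.loads(summary_json)
    # Fail closed: a missing or non-true "pass" field blocks the pipeline.
    return 0 if summary.get("pass") is True else 1


# Hypothetical summary shaped like the Eval 3 result (Lite review, no findings).
example = '{"pass": true, "depth": "Lite", "findings": []}'
print(ci_exit_code(example))
```

Failing closed on a missing flag is a design choice in this sketch: an unparsable or incomplete summary should not be treated as a clean review.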


5. Token Cost-Effectiveness Analysis

5.1 Skill Size

| File | Lines | Words | Bytes | Est. Tokens |
|---|---|---|---|---|
| SKILL.md | 456 | 2,870 | 20,818 | ~3,800 |
| references/go-secure-coding.md | 723 | 2,957 | 25,019 | ~4,600 |
| references/scenario-checklists.md | 140 | 889 | 6,588 | ~1,200 |
| references/security-review.md | 112 | 557 | 4,165 | ~800 |
| references/lang-nodejs.md | 149 | 701 | 5,389 | ~1,000 |
| references/lang-java.md | 123 | 561 | 4,716 | ~900 |
| references/lang-python.md | 122 | 542 | 4,363 | ~800 |
| Description (always in context) | | ~40 | | ~50 |

Typical load scenarios:

| Scenario | Files read | Total Tokens |
|---|---|---|
| Go code review (Standard/Deep) | SKILL.md + go-secure-coding.md + scenario-checklists.md | ~9,600 |
| Go code review (Lite) | SKILL.md + scenario-checklists.md | ~5,000 |
| Node.js code review | SKILL.md + scenario-checklists.md + lang-nodejs.md | ~6,000 |
| Java code review | SKILL.md + scenario-checklists.md + lang-java.md | ~5,900 |
| SKILL.md only (minimal) | SKILL.md | ~3,800 |

5.2 Token Cost for Quality Gain

| Metric | Value |
|---|---|
| With-skill pass rate | 100% (40/40) |
| Without-skill pass rate | 50.0% (20/40) |
| Pass-rate gain | +50.0 percentage points |
| Token cost per assertion fixed | ~190 tokens (SKILL.md only) / ~480 tokens (Go full) |
| Token cost per 1% pass-rate gain | ~76 tokens (SKILL.md only) / ~192 tokens (Go full) |
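
Both cost metrics are simple ratios of skill size to measured improvement. A short sketch of the arithmetic, using only figures from this report (variable names are illustrative):

```python
SKILL_MD_TOKENS = 3_800   # SKILL.md only
GO_FULL_TOKENS = 9_600    # SKILL.md + Go references

assertions_fixed = 40 - 20      # with-skill passes minus without-skill passes
gain_points = 100.0 - 50.0      # pass-rate gain in percentage points


def cost_per_unit(tokens: int, units: float) -> float:
    """Tokens spent per unit of improvement."""
    return tokens / units


print(cost_per_unit(SKILL_MD_TOKENS, assertions_fixed))  # tokens per assertion, SKILL.md only
print(cost_per_unit(GO_FULL_TOKENS, assertions_fixed))   # tokens per assertion, Go full
print(cost_per_unit(SKILL_MD_TOKENS, gain_points))       # tokens per point, SKILL.md only
print(cost_per_unit(GO_FULL_TOKENS, gain_points))        # tokens per point, Go full
```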

5.3 Token Segment Cost-Effectiveness

SKILL.md content split by functional module:

| Module | Est. Tokens | Related assertion delta | Cost-effectiveness |
|---|---|---|---|
| Review Depth Selection | ~250 | 3 (3 evals depth selection) | Very high: 83 tok/assertion |
| Evidence Confidence | ~100 | 3 (3 evals confidence labels) | Very high: 33 tok/assertion |
| Suppression Rules | ~180 | 2 (Eval 1/2 suppression quality) | Very high: 90 tok/assertion |
| Output Contract | ~500 | 3 (3 evals JSON summary) | High: 167 tok/assertion |
| Gate D 10-Domain | ~400 | 3 (3 evals domain coverage) | High: 133 tok/assertion |
| Gate A pairing | ~150 | 1 (Eval 1 pairing audit) | High: 150 tok/assertion |
| Gate F Uncovered Risk | ~80 | 3 (3 evals risk list) | Very high: 27 tok/assertion |
| Standards Mapping | ~50 | 3 (3 evals CWE mapping) | Very high: 17 tok/assertion |
| Severity Model + SLA | ~200 | Indirect (more precise severity calibration) | Medium: no direct assertion |
| Anti-Examples | ~350 | Indirect (avoids AE-1/AE-3/AE-5 errors) | Medium: defensive value |
| Scenario Checklists pointer | ~200 | Indirect (11-scenario systematic coverage) | Medium: structured review |
| Baseline Diff Mode | ~100 | 0 (no baseline scenario tested) | Low: not tested |
| Language Extension Hooks | ~150 | 0 (Go only tested) | Low: not tested |
| Focused Automation Gate | ~350 | Indirect (automation tool execution consistency) | Medium: tool discipline |
| go-secure-coding.md (reference) | ~4,600 | Indirect (Gate B/D detailed check guide) | Medium: deep review support |
| scenario-checklists.md (reference) | ~1,200 | Indirect (11-scenario detailed checks) | Medium: systematic coverage |
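
The tok/assertion figures for the directly measured modules are the module's estimated token count divided by its assertion delta, rounded to the nearest token. A sketch of that calculation with the values from the table (the dictionary layout is illustrative):

```python
# module name -> (estimated tokens, related assertion delta), from the table above
modules = {
    "Review Depth Selection": (250, 3),
    "Evidence Confidence": (100, 3),
    "Suppression Rules": (180, 2),
    "Output Contract": (500, 3),
    "Gate D 10-Domain": (400, 3),
    "Gate A pairing": (150, 1),
    "Gate F Uncovered Risk": (80, 3),
    "Standards Mapping": (50, 3),
}


def tokens_per_assertion(tokens: int, assertions: int) -> int:
    """Tokens spent per assertion the module helps pass."""
    return round(tokens / assertions)


# Print modules from cheapest to most expensive per assertion.
for name, (tokens, assertions) in sorted(
    modules.items(), key=lambda item: item[1][0] / item[1][1]
):
    print(f"{name}: {tokens_per_assertion(tokens, assertions)} tok/assertion")
```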

5.4 High-Leverage vs Low-Leverage Instructions

High leverage (~1,710 tokens SKILL.md → 18 assertion delta):

  • Review Depth Selection (250 tok → 3)
  • Evidence Confidence (100 tok → 3)
  • Suppression Rules (180 tok → 2)
  • Output Contract (500 tok → 3)
  • Gate D 10-Domain (400 tok → 3)
  • Gate F Uncovered Risk (80 tok → 3)
  • Standards Mapping (50 tok → 3)
  • Gate A pairing (150 tok → 1)

Medium leverage (~1,100 tokens → indirect quality gain):

  • Anti-Examples (350 tok): prevents false positives
  • Scenario Checklists pointer (200 tok): systematic
  • Severity Model + SLA (200 tok): severity calibration
  • Focused Automation Gate (350 tok): tool execution discipline

Low leverage (~250 tokens → no contribution in this evaluation):

  • Baseline Diff Mode (100 tok): not tested
  • Language Extension Hooks (150 tok): Go only tested

References (~5,800 tokens → indirect review depth):

  • go-secure-coding.md (4,600 tok): Gate B/D depth support
  • scenario-checklists.md (1,200 tok): scenario systematic coverage
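
The group totals above, and the token shares derived from them, can be reproduced from the per-module estimates. A small sketch, assuming the ~3,800-token SKILL.md and ~9,600-token Go full load stated in this report (names are illustrative):

```python
SKILL_MD_TOKENS = 3_800
GO_FULL_TOKENS = 9_600

# Per-module token estimates, grouped as in the lists above.
groups = {
    "high": [250, 100, 180, 500, 400, 80, 50, 150],
    "medium": [350, 200, 200, 350],
    "low": [100, 150],
    "references": [4_600, 1_200],
}

totals = {name: sum(tokens) for name, tokens in groups.items()}


def share(part: int, whole: int) -> float:
    """Share of `whole` taken by `part`, as a percentage with one decimal."""
    return round(100.0 * part / whole, 1)


print(totals)
print("high-leverage share of SKILL.md:", share(totals["high"], SKILL_MD_TOKENS))
print("low-leverage share of SKILL.md:", share(totals["low"], SKILL_MD_TOKENS))
print("reference share of Go full load:", share(totals["references"], GO_FULL_TOKENS))
```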

5.5 Token Efficiency Rating

| Rating | Conclusion |
|---|---|
| Overall ROI | Excellent: ~9,600 tokens for a +50.0-point pass-rate gain (third-highest of the five skills compared in 5.6) |
| SKILL.md ROI | Excellent: ~3,800 tokens contains all high-leverage rules |
| High-leverage token share | ~45% (1,710/3,800) directly contributes 18/20 assertion delta |
| Low-leverage token share | ~6.6% (250/3,800) contributes nothing in this evaluation |
| Reference cost-effectiveness | High: though 60% of total tokens, provides required depth for Gate B/D |

5.6 Comparison with Other Skills’ Cost-Effectiveness

| Metric | security-review | go-makefile-writer | google-search | deep-research | tdd-workflow |
|---|---|---|---|---|---|
| SKILL.md Tokens | ~3,800 | ~1,960 | ~3,500 | ~2,200 | ~2,800 |
| Total load Tokens | ~9,600 | ~4,100–4,600 | ~6,900 | ~3,500 | ~4,200 |
| Pass-rate gain | +50.0% | +31.0% | +74.1% | +66.7% | +46.2% |
| Tokens per 1% (SKILL.md) | ~76 tok | ~63 tok | ~47 tok | ~33 tok | ~61 tok |
| Tokens per 1% (full) | ~192 tok | ~149 tok | ~93 tok | ~53 tok | ~91 tok |

Analysis: security-review has the highest token cost per percentage point of the five skills compared (76 tok/1% for SKILL.md, 192 tok/1% for the full load), and its absolute pass-rate gain (+50.0 points) ranks third, behind google-search (+74.1) and deep-research (+66.7). The gain is nonetheless concentrated where the base model is weakest: the without-skill pass rate for structural compliance is 0%, and the skill closes that gap completely.

References account for ~60% of tokens, but the Go secure-coding reference is required for Gate B/D and cannot be simplified. If selective loading is introduced (Lite skips go-secure-coding.md), Lite scenario token cost could drop from ~9,600 to ~5,000.
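
A selective-loading policy of this kind could be sketched as follows. The token estimates and file names come from the Skill Size table; the `files_to_load` function and its rule (Go Lite skips go-secure-coding.md, other languages always load their lang-*.md file) are assumptions that mirror the typical load scenarios listed earlier, not an existing feature of the skill.

```python
# Estimated tokens per file, from the Skill Size table.
TOKENS = {
    "SKILL.md": 3_800,
    "references/go-secure-coding.md": 4_600,
    "references/scenario-checklists.md": 1_200,
    "references/lang-nodejs.md": 1_000,
    "references/lang-java.md": 900,
    "references/lang-python.md": 800,
}


def files_to_load(language: str, depth: str) -> list[str]:
    """Hypothetical policy: always load the core files, add references by need."""
    files = ["SKILL.md", "references/scenario-checklists.md"]
    if language == "go":
        # Assumed rule: the Go secure-coding reference is only needed beyond Lite.
        if depth != "Lite":
            files.append("references/go-secure-coding.md")
    else:
        files.append(f"references/lang-{language}.md")
    return files


def load_cost(language: str, depth: str) -> int:
    """Total estimated tokens; raises KeyError for a language with no reference file."""
    return sum(TOKENS[name] for name in files_to_load(language, depth))


for language, depth in [("go", "Standard"), ("go", "Lite"), ("nodejs", "Standard"), ("java", "Standard")]:
    print(language, depth, load_cost(language, depth))
```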


6. Boundary Analysis vs Claude Base Model Capabilities

6.1 Base Model Capabilities (No Skill Increment)

| Capability | Evidence |
|---|---|
| Identify rate limiting missing | 3/3 relevant scenarios correct |
| Identify prompt injection risk | 1/1 scenario correct (Eval 2) |
| Identify unbounded response body | 1/1 scenario correct (Eval 2) |
| Identify HTTP redirect following risk | 1/1 scenario correct (Eval 2) |
| Identify SSRF DNS rebinding | 1/1 scenario correct (Eval 2) |
| Identify API key storage issue | 1/1 scenario correct (Eval 2) |
| Correctly judge benign code has no vulnerabilities | 1/1 scenario correct (Eval 3) |
| MaxBytesReader positive defense identification | 1/1 scenario correct (Eval 1) |
| html/template safety identification | 1/1 scenario correct (Eval 1) |
| Provide code-level remediation | 3/3 scenarios correct |

6.2 Base Model Gaps (Skill Fills)

| Gap | Evidence | Risk level |
|---|---|---|
| No Review Depth classification | 3/3 scenarios no depth selection | High: review cost uncontrolled |
| No Confidence labels | 3/3 scenarios no confirmed/likely/suspected | High: can’t distinguish confirmed vs hypothetical |
| No CWE/OWASP mapping | 3/3 scenarios no standard mapping | High: doesn’t meet compliance audit requirements |
| No systematic domain coverage | 3/3 scenarios no Gate D 10-Domain | High: may miss entire security domains |
| No Machine-Readable output | 3/3 scenarios no JSON | Medium: CI automation gates unavailable |
| No Uncovered Risk declaration | 3/3 scenarios no Gate F | High: false completeness (AE-5) |
| Insufficient false-positive suppression | Eval 1 path traversal false positive; CSRF root-cause misattribution | Medium: developer trust erosion |
| No resource lifecycle audit | Eval 1 no Gate A pairing table | Medium: may miss resource leaks |

7. Overall Score

7.1 Dimension Scores

| Dimension | With Skill | Without Skill | Delta |
|---|---|---|---|
| Review process structure | 5.0/5 | 1.0/5 | +4.0 |
| Security finding quality | 4.5/5 | 4.0/5 | +0.5 |
| False-positive suppression accuracy | 5.0/5 | 2.5/5 | +2.5 |
| Severity calibration | 5.0/5 | 3.0/5 | +2.0 |
| Standards mapping compliance | 5.0/5 | 0.5/5 | +4.5 |
| Output consumability (JSON/audit) | 5.0/5 | 1.0/5 | +4.0 |
| Overall mean | 4.92/5 | 2.0/5 | +2.92 |
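
The overall mean is the unweighted average of the six dimension scores. A brief sketch reproducing it from the table values (the dictionary layout is illustrative):

```python
# dimension -> (with-skill score, without-skill score), each out of 5
scores = {
    "Review process structure": (5.0, 1.0),
    "Security finding quality": (4.5, 4.0),
    "False-positive suppression accuracy": (5.0, 2.5),
    "Severity calibration": (5.0, 3.0),
    "Standards mapping compliance": (5.0, 0.5),
    "Output consumability (JSON/audit)": (5.0, 1.0),
}

raw_with = sum(w for w, _ in scores.values()) / len(scores)
raw_without = sum(wo for _, wo in scores.values()) / len(scores)

mean_with = round(raw_with, 2)
mean_without = round(raw_without, 2)
# Round the delta from the unrounded means to avoid compounding rounding error.
mean_delta = round(raw_with - raw_without, 2)

print(mean_with, mean_without, mean_delta)
```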

7.2 Weighted Total

| Dimension | Weight | Score | Weighted |
|---|---|---|---|
| Assertion pass rate (delta) | 25% | 10/10 | 2.50 |
| Review process structure | 20% | 10/10 | 2.00 |
| False-positive suppression & severity calibration | 20% | 9.5/10 | 1.90 |
| Standards mapping compliance | 15% | 10/10 | 1.50 |
| Token cost-effectiveness | 10% | 7.0/10 | 0.70 |
| Maintainability & extensibility | 10% | 8.0/10 | 0.80 |
| Weighted total | | | 9.40/10 |
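
The weighted total is the sum of each dimension score multiplied by its weight. A short sketch of the calculation with the weights and scores from the table (names are illustrative):

```python
# dimension -> (weight, score out of 10)
weighted_dimensions = {
    "Assertion pass rate (delta)": (0.25, 10.0),
    "Review process structure": (0.20, 10.0),
    "False-positive suppression & severity calibration": (0.20, 9.5),
    "Standards mapping compliance": (0.15, 10.0),
    "Token cost-effectiveness": (0.10, 7.0),
    "Maintainability & extensibility": (0.10, 8.0),
}

weight_sum = sum(weight for weight, _ in weighted_dimensions.values())
weighted_total = round(
    sum(weight * score for weight, score in weighted_dimensions.values()), 2
)

print(f"weights sum to {weight_sum:.2f}; weighted total = {weighted_total:.2f}/10")
```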

8. Evaluation Artifacts

| Artifact | Path |
|---|---|
| Eval 1 with-skill output | /tmp/secreview-eval/eval-1/with_skill/response.md |
| Eval 1 without-skill output | /tmp/secreview-eval/eval-1/without_skill/response.md |
| Eval 2 with-skill output | /tmp/secreview-eval/eval-2/with_skill/response.md |
| Eval 2 without-skill output | /tmp/secreview-eval/eval-2/without_skill/response.md |
| Eval 3 with-skill output | /tmp/secreview-eval/eval-3/with_skill/response.md |
| Eval 3 without-skill output | /tmp/secreview-eval/eval-3/without_skill/response.md |
| Skill file | /Users/john/.codex/skills/security-review/SKILL.md |
| Go secure-coding reference | /Users/john/.codex/skills/security-review/references/go-secure-coding.md |
| Scenario checklist reference | /Users/john/.codex/skills/security-review/references/scenario-checklists.md |