security-review Skill Evaluation Report¶
Evaluation framework: skill-creator Evaluation date: 2026-03-12 Subject:
security-review
security-review is an exploitability-first security review skill for assessing authentication, input, secrets, API, data flow, dependencies, and resource lifecycle risks in code changes, with emphasis on reproducible, actionable security findings. Its three main strengths are: choosing review depth and multi-domain gate coverage first so changes of different risk levels get matching check intensity; every finding emphasizes evidence, confidence, and CWE/OWASP mapping for audit and governance; and it has systematic false-positive suppression and uncovered-risk recording so "real vulnerabilities" are separated from "suspicious points not yet findings".
1. Evaluation Overview¶
This evaluation assesses the security-review skill along two dimensions: actual task performance and Token cost-effectiveness. It uses 3 security review scenarios of increasing complexity (Web Handler review, OpenAI API client review, benign pure-function review with no security risk). Each scenario runs with both with-skill and without-skill configurations, for 3 scenarios × 2 configurations = 6 independent subagent runs, scored against 40 assertions.
| Dimension | With Skill | Without Skill | Delta |
|---|---|---|---|
| Assertion pass rate | 40/40 (100%) | 20/40 (50.0%) | +50.0 percentage points |
| Review Depth selection | 3/3 correct | 0/3 | Skill-only |
| Confidence labels | 3/3 | 0/3 | Skill-only |
| CWE/OWASP mapping | 3/3 | 0/3 | Skill-only |
| Gate D 10-Domain coverage | 3/3 | 0/3 | Skill-only |
| Machine-Readable JSON | 3/3 | 0/3 | Skill-only |
| Gate F Uncovered Risk list | 3/3 | 0/3 | Skill-only |
| False-Positive suppression | 3/3 correct | 1/3 | Largest quality delta |
| Skill Token cost (SKILL.md) | ~3,800 tokens | 0 | — |
| Skill Token cost (incl. Go references) | ~9,600 tokens | 0 | — |
| Token cost per 1% pass-rate gain | ~76 tokens (SKILL.md) / ~192 tokens (full) | — | — |
2. Test Methodology¶
2.1 Scenario Design¶
| Scenario | Target code | Core focus | Assertions |
|---|---|---|---|
| Eval 1: Web Handler review | internal/webapp/handler.go (285 lines) + parser.go + urlutil.go | HTTP input validation, SSRF, injection, resource lifecycle, false-positive suppression | 15 |
| Eval 2: OpenAI API client review | internal/converter/summary_openai.go (294 lines) + urlutil.go + config/loader.go | Secret management, external HTTP calls, SSRF, prompt injection, response body lifecycle | 15 |
| Eval 3: Benign pure-function review | internal/cli/exitcode.go (57 lines) | Lite depth judgment, 0 false positives, correct N/A labeling | 10 |
2.2 Execution¶
- With-skill runs first read SKILL.md and its referenced Go secure-coding and scenario checklists
- Without-skill runs read no skill; review follows model default security review behavior
- All runs execute in independent subagents
3. Assertion Pass Rate¶
3.1 Summary¶
| Scenario | Assertions | With Skill | Without Skill | Delta |
|---|---|---|---|---|
| Eval 1: Web Handler | 15 | 15/15 (100%) | 7/15 (46.7%) | +53.3% |
| Eval 2: API Client | 15 | 15/15 (100%) | 9/15 (60.0%) | +40.0% |
| Eval 3: Benign Code | 10 | 10/10 (100%) | 4/10 (40.0%) | +60.0% |
| Total | 40 | 40/40 (100%) | 20/40 (50.0%) | +50.0% |
3.2 Classification of 20 Without-Skill Failed Assertions¶
| Failure type | Count | Evals | Notes |
|---|---|---|---|
| Missing Review Depth selection | 3 | 1/2/3 | No Lite/Standard/Deep classification, no trigger signal analysis |
| Missing Confidence labels | 3 | 1/2/3 | No confirmed/likely/suspected distinction |
| Missing CWE/OWASP mapping | 3 | 1/2/3 | Only HIGH/MEDIUM/LOW severity, no standard mapping |
| Missing Gate D 10-Domain coverage | 3 | 1/2/3 | No systematic domain coverage assessment |
| Missing Machine-Readable JSON | 3 | 1/2/3 | No CI/inbox-consumable JSON summary |
| Missing Gate F Uncovered Risk list | 3 | 1/2/3 | No declaration of uncovered areas; may imply false completeness |
| Gate A construct-release pairing audit missing | 1 | 1 | No explicit resource lifecycle audit |
| Insufficient false-positive suppression | 1 | 1 | openAPISpecPath reported as MEDIUM but path not user-controlled |
3.3 Pass Rate by Assertion Category¶
| Category | With Skill | Without Skill | Delta |
|---|---|---|---|
| Structural compliance (depth/gates/output contract) | 18/18 (100%) | 0/18 (0%) | +100% |
| Security analysis quality (attack surface, suppression, remediation) | 13/13 (100%) | 12/13 (92.3%) | +7.7% |
| Standards mapping (CWE/OWASP/confidence) | 9/9 (100%) | 0/9 (0%) | +100% |
Key finding: The skill’s core value is structural compliance and standards mapping—without-skill pass rate for these categories is 0%. Security analysis quality (finding real vulnerabilities) differs by only 7.7%, so the base model already has strong security review ability; the skill’s incremental value is process discipline, not discovery capability.
4. Dimension-by-Dimension Comparison¶
4.1 Review Depth Selection (Skill-Only Capability)¶
This is the skill’s most distinctive output.
| Scenario | With Skill | Without Skill |
|---|---|---|
| Eval 1 (HTTP handler) | Standard — "new HTTP endpoints exposed" trigger signal | No depth selection |
| Eval 2 (API client) | Standard — "new external integration + secret management" trigger signal | No depth selection |
| Eval 3 (exitcode) | Lite — "1 file, no security-sensitive paths" + full exclusion rationale | No depth selection |
Practical value: Review Depth controls cost-benefit: - Lite mode skips Gates B/C/E, saving ~40% review time - Standard/Deep distinction ensures security-sensitive code gets adequate review - Without-skill applies the same depth to all scenarios, over-reviewing simple code and possibly under-reviewing complex code
4.2 False-Positive Suppression Quality¶
This is the skill’s largest quality delta.
| Suppression scenario | With Skill | Without Skill |
|---|---|---|
| SSRF via user URL (parser restricts github.com) | Correctly suppressed — "parser restricts host to github.com, handler doesn't make HTTP requests to raw URL" | Not reported (implicit handling) |
| Path traversal via openAPISpecPath | Correctly suppressed — "set at construction time from config, not user-controlled" (Rule 2) | ❌ Reported as MEDIUM |
| Open redirect via http.Redirect | Correctly suppressed — "redirect target is hardcoded /swagger/index.html" (Rule 2) | Not reported (but reported catch-all route) |
| XSS via template | Correctly suppressed — "html/template auto-escapes" (Rule 3) | Correctly identified (positive observation) |
| appendThreadText recursion | Correctly suppressed — "GitHub API limits nesting depth" | ❌ Reported as LOW (F-8) |
| CSRF on /convert | Correct N/A — "stateless form, no session, no state mutation" | ❌ Reported as HIGH |
Analysis: Without-skill’s CSRF finding (Eval 1 Finding #1) conflated cost exhaustion (rate limit exhaustion) with CSRF. With-skill correctly attributed the root cause to missing rate limiting (SEC-001 P2), not CSRF—because /convert is stateless with no session/cookie/state mutation. This demonstrates the skill’s suppression discipline: it prevents inflated severity by separating root cause from delivery mechanism.
4.3 Output Structure Comparison¶
| Output section | With Skill | Without Skill |
|---|---|---|
| Review Depth + rationale | ✅ 3/3 | ❌ 0/3 |
| Trust Boundary Mapping | ✅ 3/3 | ❌ 0/3 (Eval 2 has similar content) |
| Scenario Checklists (11 items) | ✅ 3/3 | ❌ 0/3 |
| Gate A pairing table | ✅ 3/3 | ❌ 0/3 |
| Gate D 10-Domain table | ✅ 3/3 | ❌ 0/3 |
| Suppression Filter table | ✅ 2/2 (Eval 3 N/A) | ❌ 0/2 |
| Gate E secondary verification | ✅ 2/2 (Lite skips) | ❌ 0/2 |
| Findings (severity+confidence+CWE) | ✅ 3/3 | Partial (no confidence/CWE) |
| Remediation Plan (immediate/short/backlog) | ✅ 3/3 | Partial (priority but no SLA) |
| Risk Acceptance Register | ✅ 3/3 | ❌ 0/3 |
| JSON Summary | ✅ 3/3 | ❌ 0/3 |
| Gate F Uncovered Risk List | ✅ 3/3 | ❌ 0/3 |
4.4 Security Finding Quality Comparison¶
Despite large structural differences, both configurations overlap significantly on core security findings:
| Core finding | With Skill | Without Skill |
|---|---|---|
| Rate limiting missing | SEC-001 P2 ✅ | Finding #2 HIGH ✅ |
| Security headers missing | SEC-002/003 P3 ✅ | Finding #3 MEDIUM ✅ |
| Prompt injection | SEC-002 P2 (Eval 2) ✅ | F-3 MEDIUM ✅ |
| Unbounded response body | SEC-003 P2 (Eval 2) ✅ | F-4 MEDIUM ✅ |
| Redirect following leak | SEC-004 P2 (Eval 2) ✅ | F-2 HIGH ✅ |
| SSRF DNS rebinding | SEC-005 P3 (Eval 2) ✅ | F-1 HIGH ✅ |
| API key plain string | SEC-001 P3 (Eval 2) ✅ | F-6 LOW ✅ |
Without-skill-only findings: - CSRF on /convert (HIGH) — root-cause misattribution - URL scheme enforcement in parser (MEDIUM) — valid defense-in-depth - Unbounded pagination (MEDIUM) — valid; with-skill mentioned in Gate F - Token via CLI flag (LOW) — valid but out of changed scope - appendThreadText recursion (LOW) — with-skill correctly suppresses
With-skill-only findings: - No core finding is with-skill-only (base model discovery is strong) - With-skill severity calibration is more precise (e.g., SSRF DNS rebinding correctly P3 suspected, not HIGH)
4.5 Eval 3 (Benign Code) — Lite Review Quality¶
This is the skill’s clearest efficiency advantage.
| Dimension | With Skill | Without Skill |
|---|---|---|
| Output lines | 227 | 46 |
| Review Depth declaration | "Lite (1 file, no security-sensitive paths)" + 9 trigger signals excluded | None |
| Findings | 0 (correct) | 0 (correct) |
| Domain Coverage | 10/10 N/A (each domain has code evidence) | Table but no numbered domains |
| Gates Skipped declaration | "Gates B/C/E skipped per Lite scope policy" | No Gate concept |
| Gate F Uncovered Risk | 4 items (gosec not run, govulncheck not run, etc.) | None |
| JSON Summary | pass: true, 0 findings | None |
| Review depth appropriateness | ✅ No over-review of simple code | ⚠️ Unclear if brief or appropriate |
Analysis: Without-skill’s 46-line output correctly concluded "no security vulnerabilities" but lacked audit traceability. With-skill’s 227-line output provides full audit record: why Lite was chosen, why each domain is N/A, which checks were skipped and why. For compliance (e.g., SOC 2 audit), this traceability is required.
5. Token Cost-Effectiveness Analysis¶
5.1 Skill Size¶
| File | Lines | Words | Bytes | Est. Tokens |
|---|---|---|---|---|
| SKILL.md | 456 | 2,870 | 20,818 | ~3,800 |
| references/go-secure-coding.md | 723 | 2,957 | 25,019 | ~4,600 |
| references/scenario-checklists.md | 140 | 889 | 6,588 | ~1,200 |
| references/security-review.md | 112 | 557 | 4,165 | ~800 |
| references/lang-nodejs.md | 149 | 701 | 5,389 | ~1,000 |
| references/lang-java.md | 123 | 561 | 4,716 | ~900 |
| references/lang-python.md | 122 | 542 | 4,363 | ~800 |
| Description (always in context) | — | ~40 | — | ~50 |
Typical load scenarios:
| Scenario | Files read | Total Tokens |
|---|---|---|
| Go code review (Standard/Deep) | SKILL.md + go-secure-coding.md + scenario-checklists.md | ~9,600 |
| Go code review (Lite) | SKILL.md + scenario-checklists.md | ~5,000 |
| Node.js code review | SKILL.md + scenario-checklists.md + lang-nodejs.md | ~6,000 |
| Java code review | SKILL.md + scenario-checklists.md + lang-java.md | ~5,900 |
| SKILL.md only (minimal) | SKILL.md | ~3,800 |
5.2 Token Cost for Quality Gain¶
| Metric | Value |
|---|---|
| With-skill pass rate | 100% (40/40) |
| Without-skill pass rate | 50.0% (20/40) |
| Pass-rate gain | +50.0 percentage points |
| Token cost per assertion fixed | ~190 tokens (SKILL.md only) / ~480 tokens (Go full) |
| Token cost per 1% pass-rate gain | ~76 tokens (SKILL.md only) / ~192 tokens (Go full) |
5.3 Token Segment Cost-Effectiveness¶
SKILL.md content split by functional module:
| Module | Est. Tokens | Related assertion delta | Cost-effectiveness |
|---|---|---|---|
| Review Depth Selection | ~250 | 3 (3 evals depth selection) | Very high — 83 tok/assertion |
| Evidence Confidence | ~100 | 3 (3 evals confidence labels) | Very high — 33 tok/assertion |
| Suppression Rules | ~180 | 2 (Eval 1/2 suppression quality) | Very high — 90 tok/assertion |
| Output Contract | ~500 | 3 (3 evals JSON summary) | High — 167 tok/assertion |
| Gate D 10-Domain | ~400 | 3 (3 evals domain coverage) | High — 133 tok/assertion |
| Gate A pairing | ~150 | 1 (Eval 1 pairing audit) | High — 150 tok/assertion |
| Gate F Uncovered Risk | ~80 | 3 (3 evals risk list) | Very high — 27 tok/assertion |
| Standards Mapping | ~50 | 3 (3 evals CWE mapping) | Very high — 17 tok/assertion |
| Severity Model + SLA | ~200 | Indirect (more precise severity calibration) | Medium — no direct assertion |
| Anti-Examples | ~350 | Indirect (avoids AE-1/AE-3/AE-5 errors) | Medium — defensive value |
| Scenario Checklists pointer | ~200 | Indirect (11-scenario systematic coverage) | Medium — structured review |
| Baseline Diff Mode | ~100 | 0 (no baseline scenario tested) | Low — not tested |
| Language Extension Hooks | ~150 | 0 (Go only tested) | Low — not tested |
| Focused Automation Gate | ~350 | Indirect (automation tool execution consistency) | Medium — tool discipline |
| go-secure-coding.md (reference) | ~4,600 | Indirect (Gate B/D detailed check guide) | Medium — deep review support |
| scenario-checklists.md (reference) | ~1,200 | Indirect (11-scenario detailed checks) | Medium — systematic coverage |
5.4 High-Leverage vs Low-Leverage Instructions¶
High leverage (~1,710 tokens SKILL.md → 18 assertion delta): - Review Depth Selection (250 tok → 3) - Evidence Confidence (100 tok → 3) - Suppression Rules (180 tok → 2) - Output Contract (500 tok → 3) - Gate D 10-Domain (400 tok → 3) - Gate F Uncovered Risk (80 tok → 3) - Standards Mapping (50 tok → 3) - Gate A pairing (150 tok → 1)
Medium leverage (~1,100 tokens → indirect quality gain): - Anti-Examples (350 tok) — prevents false positives - Scenario Checklists pointer (200 tok) — systematic - Severity Model + SLA (200 tok) — severity calibration - Focused Automation Gate (350 tok) — tool execution discipline
Low leverage (~250 tokens → no contribution in this evaluation): - Baseline Diff Mode (100 tok) — not tested - Language Extension Hooks (150 tok) — Go only tested
References (~5,800 tokens → indirect review depth): - go-secure-coding.md (4,600 tok) — Gate B/D depth support - scenario-checklists.md (1,200 tok) — scenario systematic coverage
5.5 Token Efficiency Rating¶
| Rating | Conclusion |
|---|---|
| Overall ROI | Excellent — ~9,600 tokens for +50.0% pass rate (highest among evaluated skills) |
| SKILL.md ROI | Excellent — ~3,800 tokens contains all high-leverage rules |
| High-leverage token share | ~45% (1,710/3,800) directly contributes 18/20 assertion delta |
| Low-leverage token share | ~6.6% (250/3,800) contributes nothing in this evaluation |
| Reference cost-effectiveness | High — though 60% of total tokens, provides required depth for Gate B/D |
5.6 Comparison with Other Skills’ Cost-Effectiveness¶
| Metric | security-review | go-makefile-writer | google-search | deep-research | tdd-workflow |
|---|---|---|---|---|---|
| SKILL.md Tokens | ~3,800 | ~1,960 | ~3,500 | ~2,200 | ~2,800 |
| Total load Tokens | ~9,600 | ~4,100–4,600 | ~6,900 | ~3,500 | ~4,200 |
| Pass-rate gain | +50.0% | +31.0% | +74.1% | +66.7% | +46.2% |
| Tokens per 1% (SKILL.md) | ~76 tok | ~63 tok | ~47 tok | ~33 tok | ~61 tok |
| Tokens per 1% (full) | ~192 tok | ~149 tok | ~93 tok | ~53 tok | ~91 tok |
Analysis: security-review’s SKILL.md cost-effectiveness (76 tok/1%) is mid-to-low among evaluated skills, but its absolute pass-rate gain (+50.0%) is highest, meaning the skill addresses a more fundamental gap—the base model has a large gap in security review structural compliance (without-skill structural compliance pass rate 0%), and the skill fully fills it.
References account for ~60% of tokens, but the Go secure-coding reference is required for Gate B/D and cannot be simplified. If selective loading is introduced (Lite skips go-secure-coding.md), Lite scenario token cost could drop from ~9,600 to ~5,000.
6. Boundary Analysis vs Claude Base Model Capabilities¶
6.1 Base Model Capabilities (No Skill Increment)¶
| Capability | Evidence |
|---|---|
| Identify rate limiting missing | 3/3 relevant scenarios correct |
| Identify prompt injection risk | 1/1 scenario correct (Eval 2) |
| Identify unbounded response body | 1/1 scenario correct (Eval 2) |
| Identify HTTP redirect following risk | 1/1 scenario correct (Eval 2) |
| Identify SSRF DNS rebinding | 1/1 scenario correct (Eval 2) |
| Identify API key storage issue | 1/1 scenario correct (Eval 2) |
| Correctly judge benign code has no vulnerabilities | 1/1 scenario correct (Eval 3) |
| MaxBytesReader positive defense identification | 1/1 scenario correct (Eval 1) |
| html/template safety identification | 1/1 scenario correct (Eval 1) |
| Provide code-level remediation | 3/3 scenarios correct |
6.2 Base Model Gaps (Skill Fills)¶
| Gap | Evidence | Risk level |
|---|---|---|
| No Review Depth classification | 3/3 scenarios no depth selection | High — review cost uncontrolled |
| No Confidence labels | 3/3 scenarios no confirmed/likely/suspected | High — can’t distinguish confirmed vs hypothetical |
| No CWE/OWASP mapping | 3/3 scenarios no standard mapping | High — doesn’t meet compliance audit requirements |
| No systematic domain coverage | 3/3 scenarios no Gate D 10-Domain | High — may miss entire security domains |
| No Machine-Readable output | 3/3 scenarios no JSON | Medium — CI automation gates unavailable |
| No Uncovered Risk declaration | 3/3 scenarios no Gate F | High — false completeness (AE-5) |
| Insufficient false-positive suppression | Eval 1 path traversal false positive; CSRF root-cause misattribution | Medium — developer trust erosion |
| No resource lifecycle audit | Eval 1 no Gate A pairing table | Medium — may miss resource leaks |
7. Overall Score¶
7.1 Dimension Scores¶
| Dimension | With Skill | Without Skill | Delta |
|---|---|---|---|
| Review process structure | 5.0/5 | 1.0/5 | +4.0 |
| Security finding quality | 4.5/5 | 4.0/5 | +0.5 |
| False-positive suppression accuracy | 5.0/5 | 2.5/5 | +2.5 |
| Severity calibration | 5.0/5 | 3.0/5 | +2.0 |
| Standards mapping compliance | 5.0/5 | 0.5/5 | +4.5 |
| Output consumability (JSON/audit) | 5.0/5 | 1.0/5 | +4.0 |
| Overall mean | 4.92/5 | 2.0/5 | +2.92 |
7.2 Weighted Total¶
| Dimension | Weight | Score | Weighted |
|---|---|---|---|
| Assertion pass rate (delta) | 25% | 10/10 | 2.50 |
| Review process structure | 20% | 10/10 | 2.00 |
| False-positive suppression & severity calibration | 20% | 9.5/10 | 1.90 |
| Standards mapping compliance | 15% | 10/10 | 1.50 |
| Token cost-effectiveness | 10% | 7.0/10 | 0.70 |
| Maintainability & extensibility | 10% | 8.0/10 | 0.80 |
| Weighted total | 9.40/10 |
8. Evaluation Artifacts¶
| Artifact | Path |
|---|---|
| Eval 1 with-skill output | /tmp/secreview-eval/eval-1/with_skill/response.md |
| Eval 1 without-skill output | /tmp/secreview-eval/eval-1/without_skill/response.md |
| Eval 2 with-skill output | /tmp/secreview-eval/eval-2/with_skill/response.md |
| Eval 2 without-skill output | /tmp/secreview-eval/eval-2/without_skill/response.md |
| Eval 3 with-skill output | /tmp/secreview-eval/eval-3/with_skill/response.md |
| Eval 3 without-skill output | /tmp/secreview-eval/eval-3/without_skill/response.md |
| Skill file | /Users/john/.codex/skills/security-review/SKILL.md |
| Go secure-coding reference | /Users/john/.codex/skills/security-review/references/go-secure-coding.md |
| Scenario checklist reference | /Users/john/.codex/skills/security-review/references/scenario-checklists.md |