create-pr Skill Evaluation Report
Evaluation framework: skill-creator
Evaluation date: 2026-03-11
Evaluation target: create-pr
create-pr is a structured PR-creation skill for main-branch pull requests. It runs branch hygiene checks, quality verification, security scanning, and PR content generation before submission, aiming for reviewable, traceable, and safely mergeable PRs. Its three main strengths are: an 8-gate mandatory preflight flow that prevents unverified changes from entering review; a three-tier confidence model (confirmed / likely / suspected) that drives draft vs ready decisions and reduces misclassification; and a fixed-section PR Body template that makes test evidence, risk notes, and uncovered items more complete and consistent.
1. Skill Overview
create-pr is a structured PR-creation skill that defines 8 mandatory Gates (A–H), a 3-tier confidence model, non-negotiable rules, and a PR Body template with 8 required sections. Its goal is to ensure every PR passes full preflight, quality checks, security scanning, and commit-format validation before push.
Core components:
| File | Lines | Role |
|---|---|---|
| SKILL.md | 373 | Main skill definition (Gate flow, rules, template references) |
| references/pr-body-template.md | 55 | PR Body 8-section template |
| references/create-pr-checklists.md | 59 | Stage-specific checklists |
| references/create-pr-config.example.yaml | 59 | Repo-level config example |
| scripts/create_pr.py | 1449 | One-shot script (Gate execution + PR creation) |
| scripts/tests/test_create_pr.py | 276 | Script unit tests |
2. Test Design
2.1 Scenario Definition
| # | Scenario | Branch | Core challenge | Expected result |
|---|---|---|---|---|
| 1 | Clean feature | feat/add-word-count | Small Go change, conventional commits, all checks should pass | ready, confirmed |
| 2 | Poor commit hygiene | quick-fix | Non-CC commit message + non-standard branch name | draft, suspected |
| 3 | Security-sensitive change | fix/token-handling | Hardcoded ghp_ GitHub token in code | draft, suspected |
2.2 Assertion Matrix (34 items)
Scenario 1 — Clean Feature (13 items)
| ID | Assertion | With-Skill | Without-Skill |
|---|---|---|---|
| A1 | Systematically run all Gates A–G (with command evidence) | PASS | FAIL |
| A2 | Gate A: Check auth, remote, base branch | PASS | FAIL |
| A3 | Gate B: Check branch naming | PASS | FAIL |
| A4 | Gate C: Change-size classification (≤400 lines = normal) | PASS | FAIL |
| A5 | Gate D: Run tests + lint and record results | PASS | PARTIAL |
| A6 | Gate E: Run security scan on changed files | PASS | FAIL |
| A7 | Gate F: Check docs/compatibility | PASS | PARTIAL |
| A8 | Gate G: Validate Conventional Commits format | PASS | FAIL |
| A9 | PR title follows CC format (≤50 chars) | PASS | PASS |
| A10 | PR Body includes all 8 required sections | PASS | FAIL |
| A11 | Explicit Confidence Level declaration | PASS | FAIL |
| A12 | Draft/ready decision based on Gate results | PASS | FAIL |
| A13 | Output follows Output Contract | PASS | FAIL |
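Assertion A9 (PR title in Conventional Commits format, at most 50 characters) can be illustrated with a small check. This is a minimal sketch, not the skill's actual implementation: the `CC_TITLE` pattern, the list of accepted types, and the function name `check_pr_title` are assumptions made for illustration; only the 50-character limit comes from the assertion itself.

```python
import re

# Hypothetical pattern: type, optional (scope), optional "!", then ": subject".
# The accepted type list is an assumption, not taken from SKILL.md.
CC_TITLE = re.compile(
    r"^(feat|fix|docs|style|refactor|perf|test|build|ci|chore|revert)"
    r"(\([\w./-]+\))?!?: \S.*$"
)
MAX_TITLE_LEN = 50  # limit stated in assertion A9


def check_pr_title(title: str) -> list[str]:
    """Return a list of problems; an empty list means the title passes."""
    problems = []
    if not CC_TITLE.match(title):
        problems.append("not in Conventional Commits format")
    if len(title) > MAX_TITLE_LEN:
        problems.append(f"title is {len(title)} chars, limit is {MAX_TITLE_LEN}")
    return problems


if __name__ == "__main__":
    print(check_pr_title("feat(cli): add word count command"))  # no problems
    print(check_pr_title("fixed the bug"))  # format problem
```

Returning a list of problems rather than a boolean keeps the evidence available for a structured Gate verdict.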
Scenario 2 — Poor Commit Hygiene (10 items)
| ID | Assertion | With-Skill | Without-Skill |
|---|---|---|---|
| B1 | Systematically run all Gates A–G | PASS | FAIL |
| B2 | Gate B: Warn on non-standard branch name | PASS | FAIL |
| B3 | Gate D: Run tests + lint | PASS | PARTIAL |
| B4 | Gate G: Flag non-CC commit message | PASS | PASS |
| B5 | PR title follows CC format | PASS | PARTIAL |
| B6 | PR Body includes all 8 required sections | PASS | FAIL |
| B7 | Explicit Confidence Level declaration | PASS | FAIL |
| B8 | Recommend draft based on Gate failures | PASS | PASS |
| B9 | Identify new function missing unit tests | PASS | PASS |
| B10 | Output follows Output Contract (structured Gate verdict) | PASS | FAIL |
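The branch-naming check behind assertion B2 can be sketched as a single regular expression. The `type/short-description` shape comes from this report; the accepted type prefixes and the kebab-case rule below are assumptions, and `branch_name_ok` is a hypothetical helper name.

```python
import re

# Assumed convention: <type>/<kebab-case-description>.
# The set of allowed types is illustrative only.
BRANCH_PATTERN = re.compile(
    r"(feat|fix|docs|refactor|test|chore)/[a-z0-9]+(-[a-z0-9]+)*"
)


def branch_name_ok(name: str) -> bool:
    """True if the whole branch name matches the assumed pattern."""
    return BRANCH_PATTERN.fullmatch(name) is not None


if __name__ == "__main__":
    for branch in ("feat/add-word-count", "quick-fix", "fix/token-handling"):
        print(branch, branch_name_ok(branch))
```

Under these assumptions the two well-named scenario branches pass and `quick-fix` is rejected because it has no type prefix.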
Scenario 3 — Security-Sensitive Change (11 items)
| ID | Assertion | With-Skill | Without-Skill |
|---|---|---|---|
| C1 | Systematically run all Gates A–G | PASS | FAIL |
| C2 | Gate E: Explicitly detect hardcoded ghp_ token | PASS | PASS |
| C3 | Gate E: Mark as blocking security issue | PASS | PASS |
| C4 | Confidence = suspected (multiple Gate failures) | PASS | FAIL |
| C5 | Recommend draft | PASS | PASS |
| C6 | PR title follows CC format | PASS | PASS |
| C7 | PR Body includes all 8 required sections | PASS | FAIL |
| C8 | Security Notes section specifically calls out token issue | PASS | PARTIAL |
| C9 | Output includes structured Uncovered Risk List | PASS | FAIL |
| C10 | Explicitly advise not to push/create PR until key removed | PASS | PARTIAL |
| C11 | Output follows Output Contract | PASS | FAIL |
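The regex part of the Gate E scan (assertion C2) might look like the sketch below. It assumes that classic GitHub personal access tokens are `ghp_` followed by 36 alphanumeric characters; the actual patterns used by the skill are not shown in this report, and `find_tokens` is a hypothetical name. Matches are redacted so the scan output does not leak the secret again.

```python
import re
from typing import Iterator

# Assumption: classic GitHub PATs are "ghp_" plus 36 alphanumeric characters.
TOKEN_RE = re.compile(r"\bghp_[A-Za-z0-9]{36}\b")


def find_tokens(text: str) -> Iterator[tuple[int, str]]:
    """Yield (line number, redacted match) for each suspected token."""
    for lineno, line in enumerate(text.splitlines(), start=1):
        for match in TOKEN_RE.finditer(line):
            # Keep only the first 8 characters so reports stay safe to share.
            yield lineno, match.group(0)[:8] + "..."


if __name__ == "__main__":
    fake = "ghp_" + "A1b2" * 9  # synthetic value, not a real token
    sample = f'package main\nconst token = "{fake}"\n'
    for lineno, redacted in find_tokens(sample):
        print(f"line {lineno}: {redacted}")
```

A regex like this only covers one token family; the report treats it as one of several tools whose findings are cross-checked.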
3. Pass Rate Comparison
3.1 Overall Pass Rate
| Config | Pass | Partial | Fail | Pass rate |
|---|---|---|---|---|
| With Skill | 34 | 0 | 0 | 100% |
| Without Skill | 10 | 5 | 19 | 29% (with PARTIAL = 0.5: 37%) |
Pass-rate gain: +71 pp (with PARTIAL: +63 pp)
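The two pass-rate figures follow from the counts in the table above: the strict rate counts only PASS, while the weighted rate counts each PARTIAL as 0.5. A short sketch of that arithmetic, where the function name `pass_rates` is illustrative:

```python
def pass_rates(passed: int, partial: int, failed: int) -> tuple[int, int]:
    """Return (strict %, PARTIAL-weighted %) rounded to whole percents."""
    total = passed + partial + failed
    strict = 100 * passed / total
    weighted = 100 * (passed + 0.5 * partial) / total
    return round(strict), round(weighted)


with_skill = pass_rates(34, 0, 0)        # (100, 100)
without_skill = pass_rates(10, 5, 19)    # (29, 37)
strict_gain = with_skill[0] - without_skill[0]      # 71
weighted_gain = with_skill[1] - without_skill[1]    # 63

if __name__ == "__main__":
    print(with_skill, without_skill, strict_gain, weighted_gain)
```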
3.2 Pass Rate by Scenario
| Scenario | With-Skill | Without-Skill | Delta |
|---|---|---|---|
| 1. Clean feature | 13/13 (100%) | 2/13 (15%) | +85 pp |
| 2. Poor commit hygiene | 10/10 (100%) | 3.5/10 (35%) | +65 pp |
| 3. Security-sensitive change | 11/11 (100%) | 4.5/11 (41%) | +59 pp |
3.3 Substantive Dimensions (Core Capabilities Independent of Flow Structure)
To control for "flow-assertion bias", 12 substantive checks unrelated to flow structure were evaluated:
| ID | Check | With-Skill | Without-Skill |
|---|---|---|---|
| S1 | Scenario 1: Run tests and pass | PASS | PASS |
| S2 | Scenario 1: Run lint | PASS | FAIL |
| S3 | Scenario 1: Security scan (rg/gosec/govulncheck) | PASS | FAIL |
| S4 | Scenario 1: PR title CC format | PASS | PASS |
| S5 | Scenario 2: Branch naming issue flagged | PASS | FAIL |
| S6 | Scenario 2: Commit message issue flagged | PASS | PASS |
| S7 | Scenario 2: Missing GoDoc flagged | PASS | PASS |
| S8 | Scenario 2: Missing test flagged | PASS | PASS |
| S9 | Scenario 3: Hardcoded token detection | PASS | PASS |
| S10 | Scenario 3: Mark as draft/blocking | PASS | PASS |
| S11 | Scenario 3: Multi-tool cross-validation | PASS | FAIL |
| S12 | All: Structured PR Body | PASS | FAIL |
Substantive pass rate: With-Skill 12/12 (100%) vs Without-Skill 7/12 (58%), gain +42 pp.
4. Key Differences
4.1 With-Skill-Only Behaviors (Baseline Never Shows)
| Behavior | Impact |
|---|---|
| Systematic 8-Gate flow | Each Gate explicitly executed with command evidence and PASS/FAIL/SUPPRESSED verdict |
| Gate A: GitHub auth preflight | Validates gh auth status, gh repo view, branch protection rules |
| Gate B: Branch naming check | Automatically detects quick-fix violates type/short-description pattern |
| Gate C: Change risk classification | Tiers by line count (≤400 / 401–800 / >800), flags high-risk areas |
| Gate E: Multi-tool security scan | rg regex + gosec + govulncheck triple cross-validation |
| Confidence model | confirmed/likely/suspected tiers, directly tied to draft/ready |
| Output Contract | Structured report: Gate results → Uncovered Risk → PR metadata → Next Actions |
| PR Body 8-section template | Problem, What Changed, Why, Risk/Rollback, Test Evidence, Security, Breaking Changes, Reviewer Checklist |
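The Gate C tiering by changed-line count can be sketched as a simple threshold function. The boundaries (≤400 / 401–800 / >800) and the label "normal" come from this report; the labels "large" and "oversized" and the function name are placeholders, since the report does not name the upper tiers.

```python
def classify_change_size(changed_lines: int) -> str:
    """Map a changed-line count to a size tier.

    Only the thresholds and the "normal" label come from the report;
    "large" and "oversized" are hypothetical tier names.
    """
    if changed_lines < 0:
        raise ValueError("changed_lines must be non-negative")
    if changed_lines <= 400:
        return "normal"
    if changed_lines <= 800:
        return "large"
    return "oversized"


if __name__ == "__main__":
    for n in (120, 400, 401, 800, 801):
        print(n, classify_change_size(n))
```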
4.2 Behaviors Baseline Can Do but at Lower Quality
| Behavior | With-Skill quality | Without-Skill quality |
|---|---|---|
| Security issue detection | 3-tool cross-validation, structured report | Code review finds issues, no tool evidence |
| Commit message validation | Precise format check + char count | Identifies issues but no length check |
| Test execution | make test + golangci-lint + go build | Only make test (occasionally go vet) |
| PR Body | 8-section structure | Free-form, missing key sections |
| Draft/Ready decision | Formal reasoning from Gate verdicts | Subjective judgment |
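"Formal reasoning from Gate verdicts" can be pictured as a pure function from verdicts to a confidence tier and a draft/ready state. The sketch below is an assumption-heavy illustration: the report only shows that all Gates passing gave confirmed and ready, and that multiple Gate failures gave suspected and draft. The handling of a single failure, of SUPPRESSED verdicts, and of the likely tier is guessed here, and all names are hypothetical.

```python
from enum import Enum


class Verdict(Enum):
    PASS = "PASS"
    FAIL = "FAIL"
    SUPPRESSED = "SUPPRESSED"


def confidence(verdicts: dict[str, Verdict]) -> str:
    """Derive a confidence tier from per-Gate verdicts (assumed rules)."""
    failures = sum(1 for v in verdicts.values() if v is Verdict.FAIL)
    suppressed = sum(1 for v in verdicts.values() if v is Verdict.SUPPRESSED)
    if failures == 0 and suppressed == 0:
        return "confirmed"
    if failures >= 2:
        return "suspected"
    return "likely"  # assumption: one failure or any suppression


def pr_state(tier: str) -> str:
    """Assumption: only a confirmed PR is opened as ready."""
    return "ready" if tier == "confirmed" else "draft"


if __name__ == "__main__":
    clean = {gate: Verdict.PASS for gate in "ABCDEFG"}
    messy = {**clean, "B": Verdict.FAIL, "D": Verdict.FAIL, "G": Verdict.FAIL}
    print(confidence(clean), pr_state(confidence(clean)))
    print(confidence(messy), pr_state(confidence(messy)))
```

Keeping the decision in one deterministic function is what makes it auditable, in contrast to the subjective judgment seen in the baseline runs.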
4.3 Scenario-Level Findings
Scenario 1 (clean feature):
- With-Skill: All 7 Gates pass, confidence = confirmed, recommend ready. Ran full toolchain: gosec, govulncheck, golangci-lint.
- Without-Skill: Only make test + go vet, no security scan. Incorrectly recommended draft (based on YAGNI, not Gate failures).
Scenario 2 (poor commits):
- With-Skill: Gate B warns on branch name, Gate D detects lint failure (missing GoDoc), Gate G detects non-CC commit. Confidence = suspected, recommend draft. Provides 6-step fix plan.
- Without-Skill: Identified commit message and GoDoc issues but not branch naming; did not run lint; no structured Gate verdict.
Scenario 3 (security-sensitive):
- With-Skill: Gate E detects ghp_ token via rg/gosec/golangci-lint, produces detailed security report with CWE ID, severity, fix steps, token revocation steps. Explicitly blocks push/create.
- Without-Skill: Found token via code review, correctly marked CRITICAL, but no tool evidence chain, no CWE reference, fix advice less specific.
5. Token Cost-Effectiveness
5.1 Skill Context Token Cost
| Component | Lines | Est. tokens | Load timing |
|---|---|---|---|
| SKILL.md | 373 | ~2,500 | Always |
| pr-body-template.md | 55 | ~400 | On demand |
| create-pr-checklists.md | 59 | ~500 | On demand |
| create-pr-config.example.yaml | 59 | ~350 | On demand |
| Typical total | ~487 | ~3,400 | SKILL.md + template + checklists |
Note: scripts/create_pr.py (1449 lines, ~10,000 tokens) is only loaded in script mode and is not part of default context.
5.2 Cost-Effectiveness
| Metric | Value |
|---|---|
| Overall pass-rate gain | +71 pp (strict) / +63 pp (with PARTIAL) |
| Substantive pass-rate gain | +42 pp |
| Skill context cost | ~3,400 tokens |
| Token cost per 1% pass-rate gain (overall) | ~48 tokens/1% |
| Token cost per 1% pass-rate gain (substantive) | ~81 tokens/1% |
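The tokens-per-point figures are the context cost divided by the pass-rate gain in percentage points. A brief sketch of that division using the values from the table, with an illustrative function name:

```python
SKILL_CONTEXT_TOKENS = 3400  # typical context cost from section 5.1


def tokens_per_point(tokens: int, gain_pp: float) -> float:
    """Token cost per percentage point of pass-rate gain."""
    return tokens / gain_pp


overall = round(tokens_per_point(SKILL_CONTEXT_TOKENS, 71))      # 48
substantive = round(tokens_per_point(SKILL_CONTEXT_TOKENS, 42))  # 81

if __name__ == "__main__":
    print(overall, substantive)
```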
5.3 Comparison with Other Skills
| Skill | Token cost | Pass-rate gain | Tokens/1% |
|---|---|---|---|
| git-commit | ~1,150 | +22 pp | ~51 |
| go-makefile-writer | ~1,960 (SKILL.md) / ~4,300 (full) | +31 pp | ~63–139 |
| create-pr | ~3,400 | +71 pp | ~48 |
create-pr has the best tokens/1% among these skills, mainly because its pass-rate delta is very large (+71 pp)—the baseline is weak in structured PR creation, so the skill’s marginal value is high.
5.4 Token Return Curve
Token investment vs. return:
~2,500 tokens (SKILL.md only):
→ Gains: Gate flow, Non-Negotiables, Confidence model, Command Playbook
→ Estimated coverage: ~90% of pass-rate gain
+400 tokens (pr-body-template.md):
→ Gains: 8-section PR Body template
→ Estimated coverage: +8% pass-rate gain (PR Body structure assertions)
+500 tokens (checklists):
→ Gains: Stage-specific checklists
→ Estimated coverage: +2% pass-rate gain (low marginal value)
SKILL.md alone provides ~90% of the value; reference files add the remaining 10%.
6. Overall Score
6.1 Dimension Scores
| Dimension | With Skill | Without Skill | Delta |
|---|---|---|---|
| Gate execution completeness (A–G systematic + command evidence) | 5.0/5 | 1.5/5 | +3.5 |
| PR Body structure quality (8-section template) | 5.0/5 | 2.0/5 | +3.0 |
| Security scan capability (multi-tool cross-validation) | 5.0/5 | 2.0/5 | +3.0 |
| Confidence/Draft decision accuracy | 5.0/5 | 3.0/5 | +2.0 |
| Commit format compliance check | 5.0/5 | 2.5/5 | +2.5 |
| Structured output report (Output Contract) | 5.0/5 | 1.0/5 | +4.0 |
| Mean | 5.0/5 | 2.0/5 | +3.0 |
Dimension notes:
- Gate execution completeness: With-Skill ran all 7 Gates systematically in all 3 scenarios, with exact commands and output evidence. Without-Skill only ran make test (occasionally go vet), with no auth check, branch naming check, risk classification, or security scan tools.
- PR Body structure quality: With-Skill always produced 8 required sections (Problem/Context, What Changed, Why, Risk/Rollback, Test Evidence, Security Notes, Breaking Changes, Reviewer Checklist). Without-Skill produced free-form bodies, often missing Risk/Rollback, Security Notes, Breaking Changes.
- Security scan capability: With-Skill used rg regex + gosec + govulncheck triple cross-validation. In Scenario 3, all three tools independently detected the ghp_ token (including CWE-798). Without-Skill found the token via code review only, with no tool evidence chain.
- Confidence/Draft decision: With-Skill was correct in 3/3 scenarios (1: confirmed→ready; 2: suspected→draft; 3: suspected→draft). Without-Skill incorrectly recommended draft in Scenario 1 (YAGNI, not Gate results); Scenarios 2/3 recommended draft correctly but without formal reasoning.
- Commit format compliance: With-Skill Gate G validates CC format, char count, and tone. In Scenario 2 it identified 3 violations (missing type(scope):, past tense, no structured format). Without-Skill identified issues in Scenario 2 but no char count or precise format check.
- Structured output report: With-Skill strictly followed Output Contract (Gate verdict → Uncovered Risk → PR metadata → Next Actions). Without-Skill had no structured output or Gate verdict summary table.
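The PR Body structure check described in the notes above can be approximated by looking for the eight required section headings. The section names are taken from this report; the assumption that each section appears as a Markdown heading line starting with `#`, and the helper name `missing_sections`, are illustrative and may not match pr-body-template.md.

```python
REQUIRED_SECTIONS = [
    "Problem/Context",
    "What Changed",
    "Why",
    "Risk/Rollback",
    "Test Evidence",
    "Security Notes",
    "Breaking Changes",
    "Reviewer Checklist",
]


def missing_sections(body: str) -> list[str]:
    """Return required section names that have no matching heading line.

    Assumption: sections are Markdown headings ("#", "##", ...) whose text
    equals the section name, compared case-insensitively.
    """
    headings = {
        line.lstrip("#").strip().casefold()
        for line in body.splitlines()
        if line.startswith("#")
    }
    return [name for name in REQUIRED_SECTIONS if name.casefold() not in headings]


if __name__ == "__main__":
    free_form = "## What Changed\nAdded word count.\n\n## Why\nRequested.\n"
    print(missing_sections(free_form))
```

Run against a free-form body like the baseline's, a check of this kind lists the omitted sections by name, such as Risk/Rollback and Security Notes.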
6.2 Weighted Total
| Dimension | Weight | Score | Rationale | Weighted |
|---|---|---|---|---|
| Assertion pass rate (delta) | 25% | 10.0/10 | +71 pp (overall) / +42 pp (substantive), largest delta among skills | 2.50 |
| Gate execution completeness | 20% | 10.0/10 | 3/3 scenarios ran all 7 Gates systematically + command evidence | 2.00 |
| PR Body structure quality | 15% | 10.0/10 | 3/3 scenarios full 8 sections + evidence tables | 1.50 |
| Security scan capability | 15% | 9.5/10 | Strong triple-tool validation; room for more regex patterns | 1.43 |
| Token cost-effectiveness | 15% | 7.5/10 | ~48 tok/1% best, but ~30% content unused (script, Monorepo, merge strategy) | 1.13 |
| Confidence/Draft decision accuracy | 10% | 10.0/10 | 3/3 correct decisions, clear formal reasoning | 1.00 |
| Weighted total | 100% | | | 9.55/10 |
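The weighted total is the sum of each dimension's score multiplied by its weight. The sketch below reproduces it from the table values; weights are kept as integer percents so the arithmetic stays exact, and the variable names are illustrative.

```python
# (dimension, weight in percent, score out of 10), copied from the table above
DIMENSIONS = [
    ("Assertion pass rate (delta)", 25, 10.0),
    ("Gate execution completeness", 20, 10.0),
    ("PR Body structure quality", 15, 10.0),
    ("Security scan capability", 15, 9.5),
    ("Token cost-effectiveness", 15, 7.5),
    ("Confidence/Draft decision accuracy", 10, 10.0),
]

total_weight = sum(weight for _, weight, _ in DIMENSIONS)
weighted_total = sum(weight * score for _, weight, score in DIMENSIONS) / 100

if __name__ == "__main__":
    print(total_weight, weighted_total)  # 100 9.55
```

The per-row values 1.43 and 1.13 in the table are rounded from 1.425 and 1.125, which is why the displayed rows sum to 9.56 while the unrounded total is 9.55.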
6.3 Comparison with Other Skills
| Skill | Weighted total | Pass-rate delta | Tokens/1% | Strongest dimension |
|---|---|---|---|---|
| create-pr | 9.55/10 | +71 pp | ~48 | Gate flow (+3.5), Output Contract (+4.0) |
| go-makefile-writer | 9.16/10 | +31 pp | ~63 | CI reproducibility (+3.0), Output Contract (+4.0) |
| git-commit | — | +22 pp | ~51 | — |
create-pr has the highest overall score among these skills because:
- Very large pass-rate delta (+71 pp): PR creation is a weak area for the baseline model
- No weak dimensions: 4 of 6 dimensions at full score; Security scan capability is near full (9.5/10), and only Token cost-effectiveness (7.5/10) is clearly below full due to ~30% content redundancy
- Best token cost-effectiveness (~48 tok/1%): Despite higher absolute token count (~3,400), the large pass-rate delta makes unit cost lowest
Deductions: Token cost-effectiveness (7.5/10) is the only clearly below-full dimension, mainly because:
- Bundled Script section (~500 tokens) is 14% of SKILL.md but unused by agents
- Merge Strategy Guidance (~200 tokens) has no value in non-Squash scenarios
- Monorepo Support (~80 tokens) is useless for single-module repos
7. Conclusion
The create-pr skill has the largest pass-rate delta in this evaluation (+71 pp) and the best tokens/1% (~48 tokens/1%). This indicates that PR creation is an area where the baseline model lacks structured capability, so the skill’s marginal value is very high.
Core value:
1. 8-Gate mandatory flow: Ensures security scan, lint, auth, etc. are not skipped
2. Confidence model: Turns draft/ready from subjective judgment into formal reasoning
3. Multi-tool cross-validation: Gate E’s rg + gosec + govulncheck triple detection stood out in Scenario 3
4. 8-section PR Body: Standardized output gives reviewers a consistent experience
Main risk: ~30% of SKILL.md (script description, Monorepo, merge strategy) is unused in typical scenarios; token cost-effectiveness could be improved with modular trimming.