writing-plans Skill Evaluation Report

Evaluation framework: skill-creator
Evaluation date: 2026-03-27
Evaluation target: writing-plans
writing-plans is a structured skill for pre-implementation planning in multi-step tasks. It links requirement clarification, applicability checks, path discovery, and scope assessment through 4 mandatory Gates, with the goal of producing high-quality implementation plans that have verified paths, graded risks, defined interfaces, and are ready to execute immediately. Its three strongest advantages are: a 4-Gate upfront flow that prevents "ghost plans" from being written before the task is clear; four execution modes (SKIP / Lite / Standard / Deep) that scale output to task complexity and avoid document overload; and a verification system built around [Existing] / [New] / [Inferred] / [Speculative] path labels plus [interface] / [test-assertion] / [command] code-block tags, turning the plan into an engineering document that is both verifiable and executable.
1. Skill Overview
writing-plans is a structured implementation-planning skill. It defines 4 mandatory Gates, 4 execution modes, 10 anti-pattern checks, and a template system covering 6 change types. Its purpose is to ensure every plan completes requirement clarification, path verification, and risk grading before writing begins.
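The 4-Gate flow can be pictured as a short decision pipeline. The sketch below is illustrative only: the Gate names and execution modes come from the skill, but the data model, field names, and routing logic are hypothetical simplifications written for this report, not the skill's actual mechanism.

```python
from dataclasses import dataclass

@dataclass
class Request:
    is_clear: bool    # Gate 1 verdict: requirement clarity (D1-D5)
    complexity: str   # Gate 2 signal: "docs-only", "small", "standard", or "deep"

def run_gates(req: Request) -> str:
    # Gate 1: requirement clarity -- STOP and ask questions if the request is vague.
    if not req.is_clear:
        return "STOP: ask clarifying questions"
    # Gate 2: applicability -- pick an execution mode scaled to the task.
    if req.complexity == "docs-only":
        return "SKIP: execute directly with a short checklist"
    mode = {"small": "Lite", "standard": "Standard", "deep": "Deep"}[req.complexity]
    # Gates 3-4 (path discovery, scope/risk grading) run before the plan is written.
    return f"{mode}: verify paths, grade risk, then write the plan"

print(run_gates(Request(is_clear=False, complexity="standard")))   # STOP branch
print(run_gates(Request(is_clear=True, complexity="docs-only")))   # SKIP branch
print(run_gates(Request(is_clear=True, complexity="standard")))    # Standard mode
```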
Core components:
| File | Lines | Role |
|---|---|---|
| `SKILL.md` | 301 | Main skill definition (4-Gate flow, 4 execution modes, Output Contract) |
| `references/requirements-clarity-gate.md` | 128 | Gate 1: 5-dimension requirement-clarity rules |
| `references/applicability-gate.md` | 51 | Gate 2: applicability decision tree and mode selection |
| `references/repo-discovery-protocol.md` | 80 | Gate 3: path verification protocol and 4-label system |
| `references/golden-scenarios.md` | 157 | GOOD/BAD examples across 6 scenario types |
| `references/reviewer-checklist.md` | 71 | Three-layer review checklist: B / N / SB |
| `references/anti-examples.md` | 104 | 10 anti-patterns (BAD/GOOD + WHY) |
| `references/plan-update-protocol.md` | 44 | Drift severity and replanning thresholds |
| `references/plan-templates/feature.md` | 39 | Feature-plan template |
| `references/plan-templates/bugfix.md` | 31 | Bug-fix template |
| `references/plan-templates/refactor.md` | 48 | Refactor template |
| `references/plan-templates/migration.md` | 44 | Migration template |
| `references/plan-templates/api-change.md` | 42 | API-change template |
| `references/plan-templates/docs-only.md` | 45 | Documentation-change template, mainly for the SKIP path |
| Test suite (`test_skill_contract.py` + `test_golden_scenarios.py`) | 831 | Contract tests + golden-scenario validation |
2. Test Design
2.1 Scenario Definition
| # | Scenario | Core challenge | Expected result |
|---|---|---|---|
| 1 | Clear feature request | JWT auth for a Go API, crossing auth boundaries in 5 packages | Standard-mode plan, all Gates pass, path labels, interface code blocks |
| 2 | Vague request | "Make the system faster", with no scope, metrics, or target | Gate 1 STOP, ask clarifying questions, no plan document generated |
| 3 | Documentation change | Update README with a new API section | Gate 2 SKIP, concise execution checklist, no full plan document |
Scenario 1 test prompt:
"I need to add JWT-based user authentication to our Go REST API. The API currently serves `/users` and `/products` endpoints. I want to add `/auth/login`, `/auth/register`, and `/auth/refresh` endpoints with middleware that protects existing routes."
Scenario 2 test prompt:
"Make the system faster. There are some performance issues we need to fix."
Scenario 3 test prompt:
"Update the README.md to add a section about the new API endpoints we just added. Just document what they do and show example curl commands."
2.2 Assertion Matrix (34 items)
Scenario 1: Clear feature request (13 items)
| ID | Assertion | With-Skill | Without-Skill |
|---|---|---|---|
| A1 | Systematically run all 4 Gates, with command evidence | PASS | FAIL |
| A2 | Gate 2 (Applicability) selects Standard mode | PASS | FAIL |
| A3 | Gate 3 (Repo Discovery) adds [Existing] / [New] labels to all paths | PASS | FAIL |
| A4 | Uses the feature.md template structure, with all required sections | PASS | FAIL |
| A5 | Uses [interface] code blocks, not full implementations | PASS | FAIL |
| A6 | Uses [command] code blocks for verification steps, with exact commands | PASS | FAIL |
| A7 | Passes all Critical items in the Quality Scorecard (B: 6/6) | PASS | FAIL |
| A8 | Includes a reviewer loop, at least 1 round | PASS | FAIL |
| A9 | Does not include full function implementations (Anti-Pattern #2) | PASS | FAIL |
| A10 | Includes rollback and risk assessment for each task | PASS | PARTIAL |
| A11 | Plan structure matches the Output Contract | PASS | FAIL |
| A12 | Scope and risk grading are explicit, as Gate 4 output | PASS | FAIL |
| A13 | Independent tasks are marked as parallelizable | PASS | FAIL |
Scenario 2: Vague request (10 items)
| ID | Assertion | With-Skill | Without-Skill |
|---|---|---|---|
| B1 | Gate 1 identifies ambiguity, with multiple STOP dimensions triggered | PASS | PARTIAL |
| B2 | Asks specific clarifying questions, at least 3 | PASS | PASS |
| B3 | Does not skip Gate 1 and jump straight to a plan document | PASS | PASS |
| B4 | Clarifying questions cover goals, scope, and constraints | PASS | PASS |
| B5 | Questions include concrete dimensions such as performance metrics, component scope, baseline, and target | PASS | PARTIAL |
| B6 | Clearly explains why clarification is needed instead of guessing | PASS | PASS |
| B7 | Does not use [Speculative] paths, with no degraded-mode abuse | PASS | PASS |
| B8 | Does not generate a plan body | PASS | PASS |
| B9 | Explains the path to continue after clarification | PASS | PASS |
| B10 | Output format matches the Gate 1 failure protocol, with a STOP declaration | PASS | FAIL |
Scenario 3: Documentation change (11 items)
| ID | Assertion | With-Skill | Without-Skill |
|---|---|---|---|
| C1 | Gate 2 (Applicability) correctly chooses SKIP mode | PASS | PARTIAL |
| C2 | Explicitly states the reason for SKIP: docs-only change with no cross-module dependency | PASS | PASS |
| C3 | Does not generate a full Standard or Deep plan document | PASS | PASS |
| C4 | Recommends direct execution, or provides an execution checklist | PASS | PASS |
| C5 | Does not run the full Gate 3 (Repo Discovery) flow | PASS | PASS |
| C6 | Does not invent unverified file paths or endpoints | PASS | FAIL |
| C7 | Does not run a Quality Scorecard evaluation, which is unnecessary in SKIP mode | PASS | PASS |
| C8 | Does not trigger a reviewer loop | PASS | PASS |
| C9 | Output stays concise, with a clear decision section | PARTIAL | FAIL |
| C10 | Matches the SKIP signals in the docs-only.md template | PASS | FAIL |
| C11 | Follows the Output Contract for the SKIP branch | PASS | FAIL |
3. Pass Rate Comparison
3.1 Overall Pass Rate
| Config | Pass | Partial | Fail | Pass rate |
|---|---|---|---|---|
| With Skill | 33 | 1 | 0 | 97% (counting PARTIAL as 0.5 = 98.5%) |
| Without Skill | 13 | 4 | 17 | 38% (counting PARTIAL as 0.5 = 44%) |
Pass-rate gain: +59 pp (with PARTIAL: +54.5 pp)
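The headline figures can be reproduced from the table, counting each PARTIAL as 0.5 in the lenient variant. A quick arithmetic check (assertion counts taken from the table above; the helper function is written here for illustration):

```python
def pass_rate(passed, partial, failed, partial_weight=0.0):
    """Percentage pass rate over all assertions; PARTIAL weighted as given."""
    total = passed + partial + failed
    return 100 * (passed + partial_weight * partial) / total

# With Skill: 33 PASS, 1 PARTIAL, 0 FAIL; Without Skill: 13 / 4 / 17.
print(round(pass_rate(33, 1, 0)))           # 97   (strict)
print(round(pass_rate(33, 1, 0, 0.5), 1))   # 98.5 (PARTIAL = 0.5)
print(round(pass_rate(13, 4, 17)))          # 38   (strict)
print(round(pass_rate(13, 4, 17, 0.5), 1))  # 44.1 (PARTIAL = 0.5)
```

The +54.5 pp lenient gain reported above comes from subtracting the rounded percentages (98.5 - 44).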
3.2 Pass Rate by Scenario
| Scenario | With-Skill | Without-Skill | Delta |
|---|---|---|---|
| 1. Clear feature request | 13/13 (100%) | 0.5/13 (4%) | +96 pp |
| 2. Vague request | 10/10 (100%) | 8/10 (80%) | +20 pp |
| 3. Documentation change | 10.5/11 (95%) | 6.5/11 (59%) | +36 pp |
Note: Scenario 2 has a smaller gap (+20 pp) because when a request is clearly vague, the baseline model also tends to ask clarifying questions naturally. The skill's added value is in the structured Gate 1 analysis, questions mapped precisely to the D1-D5 dimensions, and the standardized STOP-protocol output.
3.3 Substantive Dimensions (Core Capabilities Independent of Flow Structure)
To control for "flow-assertion bias", 12 additional substantive checks that do not depend on the flow itself were evaluated:
| ID | Check | With-Skill | Without-Skill |
|---|---|---|---|
| S1 | Scenario 2: correctly identifies ambiguity and refuses to plan immediately | PASS | PASS |
| S2 | Scenario 3: recognizes that a docs-only change does not need a formal plan | PASS | PASS |
| S3 | Scenario 1: all file paths are verified before being written into the plan | PASS | FAIL |
| S4 | Scenario 1: each task includes an independent rollback step | PASS | FAIL |
| S5 | Scenario 1: plan contains interface definitions only, with no full function bodies | PASS | FAIL |
| S6 | Scenario 1: parallelizable tasks are explicitly marked | PASS | FAIL |
| S7 | Scenario 1: verification steps include runnable, exact commands | PASS | PASS |
| S8 | Scenario 1: execution mode (SKIP / Lite / Standard / Deep) is explicitly declared | PASS | FAIL |
| S9 | Scenario 1: plan contains clear in-scope and out-of-scope boundaries | PASS | FAIL |
| S10 | Scenario 1: change risk level is explicitly classified | PASS | FAIL |
| S11 | Scenario 1: plan is validated against the review checklist (B / N / SB) | PASS | FAIL |
| S12 | Scenario 3: output contains no invented paths or speculative endpoints | PASS | FAIL |
Substantive pass rate: With-Skill 12/12 (100%) vs Without-Skill 3/12 (25%), gain +75 pp.
4. Key Difference Analysis
4.1 Behaviors Unique to With-Skill (Completely Missing in the Baseline)
| Behavior | Impact |
|---|---|
| Systematic 4-Gate flow | Gate 1 checks requirement clarity, Gate 2 selects the mode, Gate 3 verifies paths, and Gate 4 classifies risk, with explicit output at each step |
| Four-label path-verification system | [Existing] / [New] / [Inferred] / [Speculative] prevents ghost paths from appearing in plan documents |
| Semantic code-block labels | [interface] contains only signatures and structs, [test-assertion] captures expected behavior, and [command] contains exact commands, preventing implementation code from leaking into the plan |
| SKIP / Lite / Standard / Deep mode decisions | Adjusts output size to task complexity; docs-only changes do not trigger Standard plans, which avoids over-engineering |
| Per-task rollback protocol | Every task block ends with a concrete rollback step, not a single line at the bottom of a checklist |
| Reviewer loop | Standard mode triggers 1 round of three-layer review (B / N / SB) as a self-check mechanism |
| Output Contract structured output | Fixed structure: Gate verdicts -> file map -> task blocks with dependencies and blockers -> verification commands |
| Gate 1 STOP protocol | For vague requests, explicitly declares STOP, explains why, and gives a "continue after clarification" pipeline |
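Because the path labels are machine-readable, the four-label convention lends itself to mechanical checking. The checker below is a hypothetical illustration: the label vocabulary is the skill's, but the function name, regex, and flagging policy are invented here for demonstration, not part of the skill.

```python
import re

# Hypothetical lint pass over a plan's file map. [Inferred] and [Speculative]
# paths are the ones a reviewer should challenge before the plan is executed.
PATH_LABEL = re.compile(r"\[(Existing|New|Inferred|Speculative)\]\s+(\S+)")

def flag_risky_paths(plan_text: str) -> list[str]:
    """Return paths whose labels mark them as unverified."""
    return [path for label, path in PATH_LABEL.findall(plan_text)
            if label in ("Inferred", "Speculative")]

plan = """
[Existing] internal/auth/middleware.go
[New] internal/auth/jwt.go
[Speculative] internal/users/service.go
"""
print(flag_risky_paths(plan))  # ['internal/users/service.go']
```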
4.2 Behaviors the Baseline Can Do, but at Lower Quality
| Behavior | With-Skill quality | Without-Skill quality |
|---|---|---|
| Ambiguity detection | Systematic Gate 1 analysis with 4 STOP-trigger dimensions and a structured STOP declaration | Natural-language recognition; can ask questions, but without a dimension framework |
| Clarifying-question design | 5 precise questions mapped to D1-D5 dimensions | 4 questions with similar coverage, but weaker structure |
| Handling no-plan-needed scenarios | Formal SKIP decision + execution checklist + Gate summary table | Writes README content directly; useful but oversized and without a decision explanation |
| Commands | [command] tag + exact commands + expected-output notes | Bare command blocks, with no expected output |
| Risk treatment | Formal Gate 4 grading (Medium-High) + per-task rollback | Safety checklist with 8 items, but no risk levels or rollback |
4.3 Key Findings by Scenario
Scenario 1 (clear feature):
- With-Skill: All 4 Gates pass in Standard mode. The 580-line plan includes a file map with 10 fully labeled paths, 6 task blocks with a dependency graph, Tasks 4 and 5 marked as parallelizable, and a Reviewer Loop with B: 6/6 + N: 7/7 + SB: 6/6.
- Without-Skill: Produces a 13-section plan that includes a full Config struct, full handler logic, and full token-service code, violating Anti-Pattern #2. It has no path labels, no parallelization markers, no rollback, and no reviewer loop. The gap is large.
Scenario 2 (vague request):
- With-Skill: Gate 1 clearly identifies 4 STOP triggers, asks 5 precise questions covering p99 latency, component scope, baseline vs target, constraints, and existing profiling data, explains that writing a plan now would "invent the problem by inertia", and gives a 4-step continuation pipeline: rerun Gate 1 -> classify -> discovery -> planning.
- Without-Skill: Asks 4 questions of comparable quality and also proactively gives a pprof usage guide and a classification of common Go performance issues. The main difference is that it has no STOP declaration and no Gate protocol, so it does not define when to "move into planning."
Scenario 3 (documentation change):
- With-Skill: Runs the Gate 2 decision tree fully, shows a decision table with 6 signals all pointing to SKIP, and outputs "no formal plan needed, execute directly" plus a 5-step execution checklist and a Gate summary table, 72 lines total.
- Without-Skill: Correctly recognizes that no plan is needed, but then writes about 200 lines of README content, including 10 endpoints inferred from handler filenames and full curl examples. The output is useful to the user, but the core issue is path hygiene: inferred endpoint paths such as GET /users and POST /products were written without verification, which counts as invented paths in the output.
5. Token Cost-Effectiveness Analysis
5.1 Skill Context Token Cost
| Component | Lines | Estimated tokens | Load timing |
|---|---|---|---|
| `SKILL.md` | 301 | ~2,200 | Always |
| `applicability-gate.md` | 51 | ~360 | Gate 2, in most scenarios |
| `repo-discovery-protocol.md` | 80 | ~560 | Gate 3, for Standard / Deep |
| `requirements-clarity-gate.md` | 128 | ~900 | Gate 1, for vague requests |
| A plan template (any one) | 31-48 | ~220-340 | When matching the scenario type |
| Typical Standard scenario total | ~505 | ~3,390 | SKILL.md + Gate 2 + Gate 3 + 1 template |
| Typical Gate 1 STOP scenario | ~429 | ~3,100 | SKILL.md + Gate 1 reference |
| Typical SKIP scenario | ~397 | ~2,875 | SKILL.md + Gate 2 + docs template |
| Weighted average across the 3 scenarios | ~444 | ~3,122 | - |
Note: golden-scenarios.md (157 lines), reviewer-checklist.md (71 lines), and anti-examples.md (104 lines) are only loaded during reviewer loops or as references and are not counted in typical scenario context.
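The weighted-average row is a plain mean of the three typical-scenario totals, since each scenario occurred once. A quick cross-check (token figures are the report's own estimates):

```python
# Mean context cost across the three typical scenarios, equal weighting.
scenario_tokens = {"Standard": 3390, "Gate 1 STOP": 3100, "SKIP": 2875}
average = sum(scenario_tokens.values()) / len(scenario_tokens)
print(round(average))  # 3122, matching the weighted-average row above
```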
5.2 Cost-Effectiveness Calculation
| Metric | Value |
|---|---|
| Overall pass-rate gain (with PARTIAL) | +54.5 pp |
| Overall pass-rate gain (strict PASS only) | +59 pp |
| Substantive pass-rate gain | +75 pp |
| Skill context cost (typical scenario) | ~3,100 tokens |
| Token cost per 1% pass-rate gain (overall) | ~57 tokens/1% |
| Token cost per 1% pass-rate gain (substantive) | ~41 tokens/1% |
5.3 Comparison with Other Skills
| Skill | Token cost | Pass-rate gain | Tokens/1% |
|---|---|---|---|
| git-commit | ~1,150 | +22 pp | ~51 |
| go-makefile-writer | ~3,960 (full) | +31 pp | ~128 |
| create-pr | ~3,400 | +71 pp | ~48 |
| writing-plans | ~3,100 | +54.5 pp | ~57 |
writing-plans is slightly less efficient than create-pr on a tokens/1% basis (~57 vs ~48), mainly because in Scenario 2 the baseline can already ask clarifying questions naturally. That narrows the Scenario 2 gap to +20 pp and lowers the overall cost-effectiveness. On the substantive dimension, however, the skill's efficiency (~41 tokens/1%) is better than all compared skills.
5.4 Token Return Curve
Mapping token investment to return:
~2,200 tokens (SKILL.md only):
-> Gains: 4-Gate flow skeleton, 4 execution modes, path-label rules,
code-block labeling, Output Contract, 10 anti-patterns
-> Estimated coverage: ~85% of total pass-rate gain
+360 tokens (applicability-gate.md):
-> Gains: decision tree, 7 signal types, "Looks Small But Isn't" patterns
-> Estimated coverage: +8% gain (Gate 2 related assertions)
+560 tokens (repo-discovery-protocol.md):
-> Gains: 5-step discovery protocol, label definitions, path-verification rules
-> Estimated coverage: +5% gain (path-label assertions)
+220-340 tokens (plan template):
-> Gains: scenario-specific template structure and trigger signals
-> Estimated coverage: +2% gain (template-compliance assertions)
SKILL.md alone provides about 85% of the total value; the applicability gate plus discovery protocol add another 13%; templates contribute the final 2% at the margin.
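The marginal-coverage estimates above can be tallied as a running total. The 280-token figure for the template is an assumed mid-range value within the report's 220-340 estimate:

```python
# Running total of estimated context cost and estimated coverage of the
# total pass-rate gain, as each component is loaded in sequence.
components = [
    ("SKILL.md", 2200, 85),
    ("applicability-gate.md", 360, 8),
    ("repo-discovery-protocol.md", 560, 5),
    ("plan template (mid-range estimate)", 280, 2),
]
tokens = coverage = 0
for name, cost, gain in components:
    tokens += cost
    coverage += gain
    print(f"{name}: ~{tokens} tokens -> ~{coverage}% of total gain")
```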
6. Overall Score
6.1 Scores by Dimension
| Dimension | With Skill | Without Skill | Delta |
|---|---|---|---|
| Gate execution completeness (systematic 4-Gate flow + command evidence) | 5.0/5 | 1.0/5 | +4.0 |
| Plan structure quality (template compliance + path labels + code-block labels) | 5.0/5 | 1.5/5 | +3.5 |
| Mode-selection accuracy (SKIP / Lite / Standard / Deep) | 5.0/5 | 2.0/5 | +3.0 |
| Path verification + anti-pattern avoidance (path labels + no ghost paths + no full implementation) | 5.0/5 | 1.5/5 | +3.5 |
| Requirement-clarification quality (structured Gate 1 STOP vs natural questioning) | 5.0/5 | 4.0/5 | +1.0 |
| Structured-output compliance (Output Contract + SKIP branch compliance) | 5.0/5 | 1.5/5 | +3.5 |
| Overall average | 5.0/5 | 1.9/5 | +3.1 |
Notes on the dimension scores:
- Gate execution completeness: With-Skill runs the Gates systematically across all 3 scenarios: Gate 1 STOP in Scenario 2, Gate 1 + 2 with SKIP in Scenario 3, and all 4 Gates in Scenario 1. Each Gate has explicit output and decision evidence. Without-Skill has no Gate system, so the maximum reasonable score is 1.0/5.
- Plan structure quality: In Scenario 1, With-Skill produces a complete 580-line plan with a file map, 6 task blocks, a dependency graph, per-task rollback, and `[interface]` / `[command]` code blocks. Without-Skill produces a 13-section unstructured plan that includes full implementation code and lacks path labels, template sections, and a review loop, so it scores 1.5/5.
- Mode-selection accuracy: With-Skill selects the correct mode in all 3 scenarios (Standard / STOP / SKIP). Without-Skill does not declare a mode in Scenario 1, and in Scenario 3 it writes README content directly instead of a SKIP decision plus checklist, so it scores 2.0/5.
- Path verification + anti-pattern avoidance: In Scenario 1, all 10 paths in the With-Skill file map are labeled, and in Scenario 3 it does not write speculative endpoints. Without-Skill includes full implementation code in Scenario 1 (Anti-Pattern #2) and writes 10 inferred endpoint paths in Scenario 3, so it scores 1.5/5.
- Requirement-clarification quality: In Scenario 2, both versions can ask clarifying questions, and the baseline even adds a `pprof` usage guide, which is extra value. The main gap is that Without-Skill lacks a structured STOP declaration and a follow-up pipeline, so it scores 4.0/5.
- Structured-output compliance: With-Skill produces standardized output in all 3 scenarios, including Gate summary tables, decision-tree walk-throughs, and the Output Contract. Without-Skill has no Output Contract, so it scores 1.5/5.
6.2 Weighted Total Score
| Dimension | Weight | Score | Reason | Weighted |
|---|---|---|---|---|
| Assertion pass rate (delta) | 25% | 9.5/10 | +54.5 pp overall / +75 pp substantive; lower than create-pr (+71 pp) because Scenario 2 has a smaller gap | 2.375 |
| Gate execution completeness | 20% | 10.0/10 | Gates executed systematically in all 3 scenarios, with explicit output at each step | 2.00 |
| Plan structure quality | 15% | 10.0/10 | Path labels, code-block labels, and task dependency graphs are all present | 1.50 |
| Mode-selection accuracy | 15% | 10.0/10 | Correct SKIP / STOP / Standard decisions in all 3 scenarios | 1.50 |
| Token cost-effectiveness | 15% | 7.5/10 | ~57 tokens/1% overall; strong baseline performance in Scenario 2 narrows the gap; on substantive checks ~41 tokens/1% is best-in-class | 1.125 |
| Path verification + anti-pattern avoidance | 10% | 9.5/10 | Only C9 is PARTIAL, because the output length is 72 lines vs a suggested <=15 lines; but the SKIP scenario needs a Gate decision table, so the overage is reasonable | 0.95 |
| Weighted total | 100% | - | - | 9.45/10 |
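The weighted total follows directly from the weights and scores in the table:

```python
# Recompute the weighted total from the dimension weights and scores above.
rows = [
    ("Assertion pass rate", 0.25, 9.5),
    ("Gate execution completeness", 0.20, 10.0),
    ("Plan structure quality", 0.15, 10.0),
    ("Mode-selection accuracy", 0.15, 10.0),
    ("Token cost-effectiveness", 0.15, 7.5),
    ("Path verification + anti-pattern avoidance", 0.10, 9.5),
]
assert abs(sum(w for _, w, _ in rows) - 1.0) < 1e-9  # weights sum to 100%
total = sum(w * s for _, w, s in rows)
print(round(total, 2))  # 9.45
```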
6.3 Comparison with Other Skills
| Skill | Weighted total | Pass-rate delta | Tokens/1% | Strongest dimension |
|---|---|---|---|---|
| create-pr | 9.55/10 | +71 pp | ~48 | Gate flow (+3.5), Output Contract (+4.0) |
| writing-plans | 9.45/10 | +54.5 pp | ~57 | Gate execution (+4.0), path verification (+3.5) |
| go-makefile-writer | 9.16/10 | +31 pp | ~128 | CI reproducibility (+3.0) |
writing-plans receives the second-highest overall score in this evaluation, at 9.45/10, slightly below create-pr at 9.55/10. The main reasons for the gap are:
- Slightly smaller pass-rate delta (+54.5 pp vs +71 pp): in Scenario 2, the baseline also performs well, which reduces the overall difference.
- Slightly weaker token efficiency (~57 tokens/1% vs ~48 tokens/1%): again driven by the small Scenario 2 gap.
What the two skills share is that PR creation and implementation planning are both areas where the baseline model lacks strong structure, so the marginal value of a dedicated skill is high.
Why it lost points:
- Assertion pass rate (9.5/10): In Scenario 2, the baseline model can naturally ask clarifying questions, so the gap is only +20 pp. If the evaluation added boundary cases such as "complex feature + partially existing paths," the difference would likely be larger.
- Token cost-effectiveness (7.5/10): `golden-scenarios.md` (157 lines, ~1,100 tokens) and `plan-update-protocol.md` (44 lines, ~310 tokens) were not loaded in typical scenarios. They are on-demand, low-frequency references rather than real waste.
7. Conclusion
In this evaluation, the writing-plans skill demonstrates highly consistent 4-Gate execution and precise mode-selection logic. Its substantive pass rate reaches 100% (12/12), and its overall pass rate is 98.5%, compared with 44% for the baseline, a gap of +54.5 percentage points.
Core value:
- 4-Gate upfront flow: It blocks "start writing a plan for a vague request" (Anti-Pattern #10) at Gate 1, and blocks "run the full Standard flow for a README update" at Gate 2.
- Four-label path-verification system: `[Existing]` / `[New]` / `[Inferred]` / `[Speculative]` makes every path in the plan traceable and removes ghost paths.
- Semantic code-block labels: `[interface]` / `[test-assertion]` / `[command]` prevents implementation code from leaking into the plan (Anti-Pattern #2) and keeps the plan at interface-level precision.
- Dynamic mode selection: `SKIP` for documentation changes, `STOP` for vague requests, and `Standard` for cross-package feature work. All 3 scenarios chose the correct mode.
Main risks and improvement space:
- Scenario 2 gap is narrow (+20 pp): when a request is obviously vague, the baseline also tends to ask questions. The skill's differentiated value is in the systematic STOP declaration, D1-D5 question design, and follow-up pipeline, but reviewers can easily overlook that structural value.
- C9's line-count limit is too strict: the SKIP scenario needs to show a Gate decision table, which is structured evidence. A 72-line output is reasonable, so the "<=15 lines" assertion would be better replaced with "no full plan body."
- Low usage of `golden-scenarios.md`: this 157-line reference was not actively loaded in any of the 3 test scenarios. `SKILL.md` should give clearer guidance on when to pull it into the reviewer-loop phase.