update-doc is a documentation-synchronization skill for patching or rebuilding repository docs after code changes, including README files, codemaps, and related engineering documentation. Its three main strengths are: project-type routing plus lightweight/full output modes, which keep updates scoped to the actual repository shape and the size of the change; evidence-backed diffs, scorecards, and codemap contracts, which make documentation updates traceable instead of ad hoc; and CI drift guardrails plus maintenance guidance, which reduce the chance that documentation falls behind the code again after future changes.
This evaluation reviews the update-doc skill along two dimensions: actual task performance and token cost-effectiveness. It uses three progressively more complex documentation-update scenarios: a lightweight README patch for a CLI tool, a full README update for a backend service, and codemap generation plus README refactoring for a monorepo. Each scenario was run in both with-skill and without-skill configurations (3 scenarios × 2 configs = 6 independent subagent runs), scored against 42 assertions.
**Eval 1: go-file-converter**

- `cmd/convert/main.go`: flag parsing (`--format`, default `"json"`; `--output-dir`, default `"."`; `--verbose`)
- Existing README.md: missing `--output-dir`; outdated default for `--format` (documented as `csv`)
- Focus: patch only the two mismatches without rewriting the whole file
**Eval 2: go-notification-service**

- `cmd/api/main.go` (API server) + `cmd/worker/main.go` (new worker mode)
- New environment variables: `WORKER_CONCURRENCY` (default 5), `QUEUE_URL` (required)
- Makefile with 9 targets; docker-compose.yml with 4 services
- Existing README.md covers only API mode
**Eval 3: platform-monorepo**

- `services/auth/` (Go; port changed from 8080 to 8443 with TLS), `services/billing/` (Go; new Stripe integration)
- `packages/ui-kit/` (TS), `packages/api-client/` (TS; new AuthClient + BillingClient)
- `.github/workflows/ci.yml` includes markdownlint
- Existing README.md is missing billing and api-client, and the auth port is outdated
Interpretation: the core value of the skill lies in the 17 points of methodology discipline it injects, while its anti-patterns list and Diff Scope Gate provide a further 7 quality guardrails.
3.4 Trend: the Skill Has the Largest Advantage in the Most Complex Scenario
| Scenario complexity | Without-skill pass rate | With-skill advantage |
| --- | --- | --- |
| Eval 1 (simple) | 38.5% | +61.5% |
| Eval 2 (medium) | 53.3% | +46.7% |
| Eval 3 (complex) | 35.7% | +64.3% |
Unlike go-makefile-writer, where the largest advantage appeared in the simplest scenario, update-doc shows its biggest advantage in the most complex monorepo scenario. The reason is that Eval 3 requires skill-specific capabilities such as the Codemap Output Contract, multi-module routing, and CI drift detection, which the baseline model is very unlikely to produce on its own.
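The CI drift guardrail mentioned above can be as lightweight as a docs lint job. A minimal sketch in GitHub Actions, matching the markdownlint setup Eval 3's `ci.yml` already uses (the job layout and action version here are illustrative, not the skill's literal output):

```yaml
# Hypothetical drift guardrail: lint docs whenever docs or service code change.
name: docs-drift
on:
  pull_request:
    paths:
      - "**/*.md"
      - "services/**"
      - "packages/**"
jobs:
  markdownlint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: DavidAnson/markdownlint-cli2-action@v16
```

A guardrail like this does not verify factual accuracy, but it forces documentation files back into review whenever the code they describe changes.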
Project-type classification is the skill's foundational capability, because it directly determines whether all later decisions are correct.
| Scenario | With Skill | Without Skill |
| --- | --- | --- |
| Eval 1 | "CLI Tool" → flags/options-first strategy | No classification; generic handling |
| Eval 2 | "Service / Backend" → runtime-modes-first strategy | No classification, but happened to make reasonable updates |
| Eval 3 | "Monorepo" → module index + submodule linking strategy | No classification; dumped a directory tree instead |
Analysis: in Eval 2, without-skill happened to produce a reasonable service README structure, but without explicit routing the behavior is unpredictable. In Eval 3, the same model chose to dump a directory tree instead of creating a module index table. The skill's routing mechanism ensures consistent behavior across scenarios.
Analysis: the over-rewrite in Eval 1 without the skill, which added sections like "How It Works" and "Error Handling", is exactly the behavior the skill's Lightweight mode and Diff Scope Gate are designed to prevent. In Evals 2 and 3, the without-skill responses were appropriately concise, but they missed structured outputs such as the Evidence Map, Scorecard, and Open Gaps.
Both configurations performed well on factual accuracy:
| Dimension | With Skill | Without Skill |
| --- | --- | --- |
| Environment variable defaults | All correct | All correct |
| Port numbers | All correct | All correct |
| API routes / endpoints | All correct | All correct |
| No invented content | ✅ | ✅ |
| Structured evidence traceability | ✅ (every claim mapped to source file + line numbers) | ❌ (narrative validation only, no structured mapping) |
Key difference: the skill does not win on accuracy; it wins on auditability. The Evidence Map makes every documentation claim traceable to specific code lines, which supports PR review and later maintenance.
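For illustration, an Evidence Map entry of the kind described here might look like the following; the column names and line numbers are assumptions, while the claims themselves come from the scenarios above:

```markdown
| Claim in README                  | Evidence                  |
| -------------------------------- | ------------------------- |
| `--format` defaults to `json`    | cmd/convert/main.go:L14   |
| `QUEUE_URL` is required          | cmd/worker/main.go:L32    |
| auth service listens on 8443/TLS | services/auth/main.go:L21 |
```

A reviewer can check each row against the cited line instead of re-deriving the fact from scratch, which is the auditability gain the skill provides.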
The base model is already strong at fact extraction and basic documentation writing. It got environment variables, ports, and routes correct in all cases. But it completely lacks methodological discipline. The skill's value is not that it "makes the model smarter"; it gives the model a repeatable workflow: project classification -> diff scope -> output mode -> structured report -> CI maintenance guidance.
Roughly 1,000 tokens (~48% of SKILL.md) produced no incremental gain in this evaluation:
| Module | Tokens | Suggestion |
| --- | --- | --- |
| Standard Workflow | ~300 | Compress into a 3-4 line checklist; move the detailed version to a reference file |
| Evidence Commands | ~100 | Move to `references/evidence-commands.md` and load on demand |
| Gate 1: Audience / Language | ~120 | Keep but shorten; the base model already followed the repo language naturally |
| README UX Rules | ~100 | Compress; the base model already maintained a reasonable reader flow |
This would likely remove ~400-500 tokens without affecting the high-leverage assertion gains, improving SKILL.md efficiency from 37 tokens per percentage point of advantage to roughly 28.
The Codemap Output Contract in Eval 3 was one of the largest gaps between with-skill and without-skill. Suggested changes:
- Add a short codemap INDEX.md template to `references/codemap-template.md`
- State explicitly that INDEX.md must include an overview, a child-codemap link table, and cross-module concerns
- For each project type (Service / Monorepo), specify which codemap files are required
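A minimal INDEX.md skeleton along these lines could live in `references/codemap-template.md`; the section names are illustrative rather than the skill's exact contract, and the module rows are drawn from the Eval 3 scenario:

```markdown
# Codemap Index

## Overview
One-paragraph summary of the repository layout and the main build entry points.

## Module Codemaps
| Module              | Codemap                                  | Notes                    |
| ------------------- | ---------------------------------------- | ------------------------ |
| services/auth       | [codemap](services/auth/CODEMAP.md)      | TLS on port 8443         |
| services/billing    | [codemap](services/billing/CODEMAP.md)   | Stripe integration       |
| packages/api-client | [codemap](packages/api-client/CODEMAP.md)| AuthClient, BillingClient|

## Cross-Module Concerns
Shared auth flow, CI (markdownlint), release coordination.
```

Spelling out the required sections this way would make the Codemap Output Contract checkable rather than implied.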
8.3 [P2] Clearer Conditional Loading for Reference Files
The current 3 reference files (~540 tokens total) are still less than one quarter of SKILL.md when all are read together. Their loading rules can still be stated more clearly:
- The user asks to include the Scorecard in the document
- The repository is multi-language (e.g., a mixed Python + Go repository)
8.5 [P3] Consider Moving the Quality Scorecard to a Reference File
The 12-item Scorecard (~250 tokens) is always loaded, but it is only used in Full Output Mode. It could be moved to references/scorecard.md, with SKILL.md keeping a short pointer such as "use the 12-item checklist in references/scorecard.md."