readme-generator Skill Evaluation Report

Evaluation framework: skill-creator
Evaluation date: 2026-03-19
Evaluation subject: readme-generator

readme-generator is a repository-aware README generation and refactoring skill for producing maintainable, evidence-backed project homepages across services, libraries, CLI tools, and monorepos. Its three main strengths are: project-type routing and template selection, so structure matches the real repository shape instead of generic documentation patterns; evidence mapping, badge detection, and no-fabrication rules, which keep each section grounded in actual files, commands, and configs; and structured output contracts plus maintenance guidance, which make the resulting README easier to review, sustain, and keep aligned with future code changes.

1. Evaluation Overview

This evaluation reviews the readme-generator skill along two dimensions: actual task performance and token cost-effectiveness. It uses 3 progressively more complex README generation / refactoring scenarios: creating a README from scratch for a Go service, creating one for a Go CLI tool, and refactoring a flawed README. Each scenario was run with both with-skill and without-skill configurations, for 3 scenarios x 2 configs = 6 independent subagent runs, scored against 42 assertions.

Dimension	With Skill	Without Skill	Delta
Assertion pass rate	42/42 (100%)	26/42 (61.9%)	+38.1 percentage points
Output Contract structured report	3/3 correct	0/3	Skill-only
Documentation Maintenance notes	3/3	0/3	Skill-only
Evidence Mapping table	3/3	0/3	Skill-only
Community file links (Contributing / Security)	2/2	2/2	Tied
CLI end-to-end example	1/1 (no fabricated output body)	0/1	Skill-only
No internal workflow labels	3/3	2/3	Skill advantage
No fabricated content	3/3	2/3	Skill advantage
Skill token overhead (SKILL.md only)	~4,688 tokens	0	-
Skill token overhead (typical full load)	~10,030 tokens	0	-
Token cost per 1% pass-rate gain	~123 tokens (SKILL.md only) / ~263 tokens (full)	-	-

2. Test Method

2.1 Scenario Design

Scenario	Repository	Core evaluation points	Assertions
Eval 1: go-service-from-scratch	Go service: `cmd/api`, `internal/`, `Makefile`, `.env.example`, CI	Project-type routing, evidence-driven sections, badge strategy, Output Contract	14
Eval 2: go-cli-tool	Go CLI tool: Cobra with two subcommands, `Makefile`, CI, `CONTRIBUTING.md`	CLI routing, end-to-end example, ToC quality, no fabrication	13
Eval 3: refactor-stale-readme	Go service with a flawed README: fake badges, wrong config, outdated commands, internal labels	Anti-pattern detection and fixes, community file links, Output Contract	15

2.2 Test Repository Structure

Eval 1 repository (/tmp/readme-eval/eval-repos/go-service):

- cmd/api/main.go - entrypoint (handler -> service -> repository layers)
- internal/handler/user.go - 3 HTTP endpoints (GET/POST /users, GET /users/:id)
- .env.example - 5 environment variables (DATABASE_URL, REDIS_URL, JWT_SECRET, LOG_LEVEL, PORT)
- .github/workflows/ci.yml - GitHub Actions (runs make ci, Go 1.23)
- Makefile - 9 targets, COVER_MIN=80, golangci-lint@v1.62.2
- LICENSE - MIT; Go 1.23; module github.com/acme/user-service

Eval 2 repository (/tmp/readme-eval/eval-repos/go-cli):

- cmd/root/root.go - Cobra root + 2 global flags (--output/-o, --format/-f)
- cmd/generate/generate.go, cmd/validate/validate.go - 2 subcommands
- Makefile - 4 targets (build-schema-gen, test, lint, install)
- .github/workflows/ci.yml, LICENSE (Apache 2.0), CONTRIBUTING.md
- Go 1.22, no .env.example, no sample output files

Eval 3 repository (/tmp/readme-eval/eval-repos/refactor-stale), preloaded with a flawed README:

- Fake badges: Travis CI, Codecov, npm Downloads (the repo actually uses GitHub Actions)
- Wrong config section: DB_HOST, DB_PORT, etc. (.env.example actually uses 7 variables such as POSTGRES_DSN, REDIS_ADDR)
- Outdated command: go run main.go (the Makefile has make run-server)
- Internal labels: the Testing table contains ✅ Verified / ⚠️ Not verified
- Actual repo content: .env.example (7 variables), Makefile (9 targets), CONTRIBUTING.md, SECURITY.md, Go 1.24

2.3 Execution Method

Each scenario used an independent Git repository preloaded with code, go.mod, Makefile, and related files.
With-skill runs first read SKILL.md and followed the skill workflow to generate or refactor the README.
Without-skill runs did not read any skill and completed the same task using the model's default behavior.
All 6 runs were executed in parallel.

3. Assertion Pass Rate

3.1 Overview

Scenario	Assertions	With Skill	Without Skill	Delta
Eval 1: go-service	14	14/14 (100%)	9/14 (64.3%)	+35.7%
Eval 2: go-cli	13	13/13 (100%)	8/13 (61.5%)	+38.5%
Eval 3: refactor-stale	15	15/15 (100%)	9/15 (60.0%)	+40.0%
Total	42	42/42 (100%)	26/42 (61.9%)	+38.1%
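The percentages in this table follow directly from the raw assertion counts. A minimal sketch of the arithmetic, assuming deltas are expressed in percentage points and the Total row pools the counts instead of averaging the per-scenario rates (the function names are illustrative, not part of the evaluation tooling):

```python
# (total assertions, passed with skill, passed without skill), from the table above
RESULTS = {
    "Eval 1: go-service": (14, 14, 9),
    "Eval 2: go-cli": (13, 13, 8),
    "Eval 3: refactor-stale": (15, 15, 9),
}


def pass_rate(passed, total):
    """Pass rate as a percentage, rounded to one decimal place."""
    return round(100 * passed / total, 1)


def summarize(results):
    """Return {row name: (with-skill %, without-skill %, delta in points)}."""
    rows = {}
    for name, (total, with_skill, without_skill) in results.items():
        w, wo = pass_rate(with_skill, total), pass_rate(without_skill, total)
        rows[name] = (w, wo, round(w - wo, 1))
    # The Total row pools the raw counts rather than averaging the percentages.
    total = sum(t for t, _, _ in results.values())
    with_total = sum(w for _, w, _ in results.values())
    without_total = sum(wo for _, _, wo in results.values())
    w, wo = pass_rate(with_total, total), pass_rate(without_total, total)
    rows["Total"] = (w, wo, round(w - wo, 1))
    return rows


for name, (w, wo, delta) in summarize(RESULTS).items():
    print(f"{name}: {w}% vs {wo}% (+{delta} points)")
```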

3.2 Breakdown of the 16 Failed Assertions Without the Skill

Failure type	Count	Affected evals	Notes
No Output Contract / Scorecard	3	Eval 1/2/3	No structured report with `project_type`, `template_used`, `scorecard`, or `badges_added`
No Documentation Maintenance	3	Eval 1/2/3	No maintenance matrix such as "update this README when these repo changes happen"
No Evidence Mapping	3	Eval 1/2/3	No section-to-evidence-file mapping table
No end-to-end example	1	Eval 2	The CLI README showed command snippets only, not a full "input command -> output description" example
No Project Structure section	1	Eval 2	Structure information was scattered across other sections
No ToC	1	Eval 2	The multi-section CLI README lacked navigation
Missing Go version badge	1	Eval 1	Only a CI badge was added; `go.mod` provided evidence for the Go version
Quick Start had more than 3 steps	1	Eval 1	Included `git clone`, resulting in 4 steps (`<=3` is required)
Introduced new fabricated content	1	Eval 3	Added `docker pull acme/notification-svc:latest` despite no Docker evidence
No License section / badge	1	Eval 3	An MIT `LICENSE` file existed but was not referenced

3.3 Trend: the Skill Advantage Grows with Scenario Complexity

Scenario complexity	Failed assertions without skill	With-skill advantage
Eval 1 (service, from scratch)	5	+35.7%
Eval 2 (CLI, from scratch)	5	+38.5%
Eval 3 (refactor, with anti-patterns)	6	+40.0%

Eval 3 shows the largest advantage because refactoring requires not only fixing known problems, but also proactively discovering missing sections such as community files and maintenance notes. This kind of "scan and fill the gaps" behavior is built into the skill workflow, while without-skill runs tend to stop after fixing the obvious problems.

4. Dimension-by-Dimension Comparison

4.1 Output Contract and Structured Reporting

This is a skill-only differentiator: 3/3 scenarios produced it with the skill, compared with 0/3 without it.

Report item	Eval 1	Eval 2	Eval 3
`project_type`	service	cli	service
`template_used`	Template A: Service	Template C: CLI	Template A: Service (Refactor)
`scorecard`	Critical 4/4	Standard 6/6	Hygiene 4/4 -> PASS
`badges_added`	CI + Go 1.23 + License	CI + Go 1.22 + License	CI + Go 1.24 + License
`sections_omitted`	Contributing, Security, Release	Config, Exit Codes, Arch, Deploy	-
`evidence_mapping`	14-row mapping	15-row mapping	12-row mapping

Practical value:

- Reviewers can verify which file supports each section during PR review.
- sections_omitted explains why a section was skipped, instead of leaving "why is section X missing?" unanswered.
- The layered scorecard (Critical / Standard / Hygiene) helps reviewers quickly locate quality issues.
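To make the report items concrete, here is a minimal sketch of how such a contract could be represented and checked in code. The field names come from the table above; the `OutputContract` class, the `missing_fields` helper, and the JSON serialization are assumptions for illustration, since this report does not show the skill's actual output format.

```python
import json
from dataclasses import asdict, dataclass, field

# Fields this sketch treats as mandatory (an assumption, not a skill rule).
REQUIRED_FIELDS = ("project_type", "template_used", "scorecard", "badges_added")


@dataclass
class OutputContract:
    project_type: str
    template_used: str
    scorecard: dict
    badges_added: list
    sections_omitted: list = field(default_factory=list)
    # Each row pairs a README section with the file that supports it.
    evidence_mapping: list = field(default_factory=list)

    def missing_fields(self):
        """Names of required fields that are empty."""
        return [name for name in REQUIRED_FIELDS if not getattr(self, name)]


# Values for Eval 1 as reported in the table above. The two evidence rows are
# illustrative only; the actual 14-row mapping is not reproduced in this report.
eval1 = OutputContract(
    project_type="service",
    template_used="Template A: Service",
    scorecard={"Critical": "4/4"},
    badges_added=["CI", "Go 1.23", "License"],
    sections_omitted=["Contributing", "Security", "Release"],
    evidence_mapping=[
        {"section": "Configuration and Environment", "evidence": ".env.example"},
        {"section": "Common Commands", "evidence": "Makefile"},
    ],
)

print(json.dumps(asdict(eval1), indent=2))
```

A machine-readable shape like this is what would let a reviewer or a CI job check README changes programmatically instead of by reading the prose.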

4.2 Documentation Maintenance Notes

This comes from Hygiene Tier H1 in the skill. It passed in 3/3 scenarios with the skill and 0/3 without it.

Example from the with-skill Eval 1 output:

Repository change	Sections to update
New `cmd/*/main.go` entrypoint	Project Structure, Common Commands, Quick Start
Environment variable added / changed	Configuration and Environment
Makefile target added / renamed	Common Commands
CI workflow changed	Badges, Testing and Quality
New API endpoints added	API Endpoints
Go version bumped in `go.mod`	Badges, Quick Start prerequisites

Practical value: this directly addresses the maintenance pain point where the README gradually drifts away from the codebase, because contributors can see exactly which README sections must be updated when the code changes.
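The maintenance matrix can also be read as a lookup from changed files to README sections. The sketch below encodes the Eval 1 rows as glob patterns; the patterns for environment variables and API endpoints are assumptions based on the Eval 1 repository layout, and `sections_to_update` is a hypothetical helper rather than something the skill ships.

```python
from fnmatch import fnmatch

# (glob pattern for a changed path, README sections to revisit)
MAINTENANCE_MATRIX = [
    ("cmd/*/main.go", ["Project Structure", "Common Commands", "Quick Start"]),
    (".env.example", ["Configuration and Environment"]),  # assumed location of env vars
    ("Makefile", ["Common Commands"]),
    (".github/workflows/*", ["Badges", "Testing and Quality"]),
    ("internal/handler/*", ["API Endpoints"]),  # assumed location of endpoints
    ("go.mod", ["Badges", "Quick Start prerequisites"]),
]


def sections_to_update(changed_paths):
    """Return README sections to review, in first-seen order without duplicates."""
    sections = []
    for path in changed_paths:
        for pattern, affected in MAINTENANCE_MATRIX:
            if fnmatch(path, pattern):
                for section in affected:
                    if section not in sections:
                        sections.append(section)
    return sections


print(sections_to_update(["go.mod", "Makefile"]))
```

Fed with the file list of a pull request, a helper like this could remind contributors which README sections to recheck.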

4.3 CLI End-to-End Examples and No-Fabrication

The skill's End-to-End Example Rule requires CLI tools to provide a complete "input command -> output description" example, and it explicitly forbids inventing JSON / YAML output bodies when there is no evidence.

With skill (Eval 2):

schema-gen generate --format json --output ./schemas ./internal/models
# -> writes schema file(s) to ./schemas/

schema-gen validate ./schemas/models.json
# -> prints validation result to stdout

The Output Contract explicitly records: "No JSON/YAML output body fabricated (no sample fixtures in repo)"

Without skill (Eval 2): it only showed command examples, without the input-to-output description. The Examples subsection under Usage showed command variants, but readers could not tell what output to expect.

4.4 Defense Against Fabricated Content

This is the most important failure in the without-skill runs.

In Eval 3, while fixing existing fabricated content such as fake Travis CI badges and wrong DB config, the without-skill run introduced new fabricated content:

## Installation
docker pull acme/notification-svc:latest

There was no Docker-related evidence anywhere in the repository: no Dockerfile, no docker-compose.yml, and no Docker Hub link. This shows that when fixing one class of issue, the base model may still fill gaps using generic prior knowledge such as "Go services often have Docker images."

The skill's Evidence Completeness Gate explicitly requires "base every statement on repository evidence", and no new fabrication appeared in any of the 3 with-skill scenarios.
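One way to picture an evidence check of this kind is a lookup from a command's tool name to the repository files that would justify it. This is a simplified sketch, not the skill's actual gate: the `EVIDENCE_FOR_TOOL` mapping and the `unsupported_commands` helper are hypothetical, and a real check would need to cover far more than the first word of a command.

```python
# Hypothetical mapping: tool name -> files that count as evidence for using it
EVIDENCE_FOR_TOOL = {
    "docker": ["Dockerfile", "docker-compose.yml"],
    "make": ["Makefile"],
    "go": ["go.mod"],
}


def unsupported_commands(readme_lines, repo_files):
    """Return README command lines whose tool has no supporting file in the repo."""
    flagged = []
    for line in readme_lines:
        words = line.strip().split()
        if not words:
            continue
        evidence = EVIDENCE_FOR_TOOL.get(words[0])
        # Only tools listed in the mapping are checked; other lines pass through.
        if evidence is not None and not any(name in repo_files for name in evidence):
            flagged.append(line.strip())
    return flagged


# Files known to exist in the Eval 3 repository, per the test setup.
eval3_files = {".env.example", "Makefile", "CONTRIBUTING.md", "SECURITY.md", "go.mod"}
readme = ["## Installation", "docker pull acme/notification-svc:latest", "make run-server"]
print(unsupported_commands(readme, eval3_files))
```

With the Eval 3 file set, only the `docker pull` line is flagged, which matches the fabrication observed in the without-skill run.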

Scenario	With Skill	Without Skill
Removed old fake badges (Eval 3)	✅	✅
Corrected old wrong config (Eval 3)	✅	✅
Did not introduce new fabricated content (Eval 3)	✅	❌ (`docker pull`)
CLI examples contained no fabricated output body (Eval 2)	✅	N/A (no end-to-end example)
Go version badge was evidence-based (Eval 1)	✅	❌ (not added)

4.5 Badge Strategy

Dimension	With Skill	Without Skill
CI badge (from `.github/workflows`)	3/3	3/3
Go version badge (from `go.mod`)	3/3	0/3
License badge (from `LICENSE`)	3/3	0/3
Correctly removed fake badges (Eval 3)	1/1	1/1
No placeholder / fake badge URLs	3/3	3/3

The skill's Badge Detection Gate requires scanning in the order CI -> Coverage -> Language version -> License. As a result, the three-badge combination (CI + Go + License) was produced consistently in all three scenarios. Without the skill, the model only added the CI badge proactively. The Go-version and License badges need explicit rules to appear consistently.
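The scan order can be sketched as an ordered list of evidence checks over the repository's file list. The CI, Go version, and License evidence sources are the ones named in the table above; the Coverage file names are an assumption, because this report does not say what the gate accepts as coverage evidence.

```python
def detect_badges(repo_files):
    """Return badge names in scan order: CI -> Coverage -> Language version -> License."""
    checks = [
        ("CI", lambda f: f.startswith(".github/workflows/")),
        # Assumption: a Codecov config file stands in for coverage evidence.
        ("Coverage", lambda f: f in ("codecov.yml", ".codecov.yml")),
        ("Go version", lambda f: f == "go.mod"),
        ("License", lambda f: f == "LICENSE"),
    ]
    return [name for name, matches in checks if any(matches(f) for f in repo_files)]


# File list of the Eval 1 repository as described in the test setup.
eval1_files = [
    "cmd/api/main.go",
    "internal/handler/user.go",
    ".env.example",
    ".github/workflows/ci.yml",
    "Makefile",
    "LICENSE",
    "go.mod",
]
print(detect_badges(eval1_files))
```

Because each badge is tied to a file that must exist, a badge without evidence never enters the list, which is the property that keeps placeholder badges out.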

4.6 Navigation and ToC Quality (Eval 2)

Metric	With Skill	Without Skill
ToC present	✅ (10 items)	❌
Reasonable ToC size (7-10 items)	✅	N/A
ToC labels match headings exactly	✅	N/A

The with-skill Eval 2 ToC:

- [Prerequisites](#prerequisites)
- [Installation](#installation)
- [Quick Start](#quick-start)
- [Commands & Flags](#commands--flags)
- [End-to-End Example](#end-to-end-example)
- [Project Structure](#project-structure)
- [Development Commands](#development-commands)
- [Contributing](#contributing)
- [License](#license)
- [Documentation Maintenance](#documentation-maintenance)

All 10 items matched the actual ## headings exactly, and the item count stays within the 7-10 range set by the skill's ToC size-calibration rule.
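Both properties can be checked mechanically. The sketch below is an approximation under stated assumptions: `slugify` imitates GitHub-style heading anchors (lowercase, punctuation dropped, spaces turned into hyphens) without covering every edge case, and `check_toc` is a hypothetical helper, not part of the skill.

```python
import re

TOC_LINK = re.compile(r"^- \[(?P<label>[^\]]+)\]\(#(?P<anchor>[^)]+)\)$")


def slugify(heading):
    """Approximate a GitHub-style anchor for a heading."""
    text = heading.strip().lower()
    text = re.sub(r"[^\w\- ]", "", text)  # drop punctuation such as "&"
    return text.replace(" ", "-")


def check_toc(toc_lines, headings, min_items=7, max_items=10):
    """Return a list of problems; an empty list means the ToC is consistent."""
    entries = []
    for line in toc_lines:
        match = TOC_LINK.match(line.strip())
        if match:
            entries.append((match["label"], match["anchor"]))
    problems = []
    if not min_items <= len(entries) <= max_items:
        problems.append(f"ToC has {len(entries)} items, expected {min_items}-{max_items}")
    for label, anchor in entries:
        if label not in headings:
            problems.append(f"no heading matches {label!r}")
        elif slugify(label) != anchor:
            problems.append(f"anchor #{anchor} does not match heading {label!r}")
    return problems


toc = [
    "- [Prerequisites](#prerequisites)",
    "- [Installation](#installation)",
    "- [Quick Start](#quick-start)",
    "- [Commands & Flags](#commands--flags)",
    "- [End-to-End Example](#end-to-end-example)",
    "- [Project Structure](#project-structure)",
    "- [Development Commands](#development-commands)",
    "- [Contributing](#contributing)",
    "- [License](#license)",
    "- [Documentation Maintenance](#documentation-maintenance)",
]
headings = [TOC_LINK.match(line)["label"] for line in toc]
print(check_toc(toc, headings))
```

Note that "Commands & Flags" maps to `commands--flags` with a double hyphen, because removing the ampersand leaves two adjacent spaces.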

4.7 Boundary with Claude's Base Model

Capabilities the Base Model Already Has (No Skill Gain)

Capability	Evidence
Correct project-type routing (service / cli)	Correct in 3/3 scenarios
Removes fake badges (Travis CI, Codecov, npm)	Correct in the 1/1 relevant scenario (Eval 3)
Corrects wrong config sections	Correct in the 1/1 relevant scenario (Eval 3)
Fixes outdated commands (`go run` -> `make run-server`)	Correct in the 1/1 relevant scenario (Eval 3)
Removes internal `Verified` / `Not verified` labels	Correct in the 1/1 relevant scenario (Eval 3)
References discovered community files	The without-skill Eval 3 output correctly referenced `CONTRIBUTING.md` + `SECURITY.md`
Documents Makefile targets	Correct in 3/3 scenarios
Basic evidence-driven content	Generally decent, but not systematic

Capability Gaps in the Base Model (Filled by the Skill)

Gap	Evidence	Risk level
No Output Contract	0/3 scenarios produced a structured report	High - README changes cannot be audited programmatically
No Documentation Maintenance	0/3 scenarios added a maintenance matrix	Medium - the README gradually drifts away from the codebase
No Evidence Mapping	0/3 scenarios provided section-to-file mappings	Low - reduces auditability
Missing CLI end-to-end examples	0/1 scenarios provided a full "input -> output" example	Medium - users cannot predict CLI output shape
Introduces new fabricated content in refactor scenarios	Eval 3 `docker pull`	High - fills gaps with generic knowledge instead of repo evidence
Does not proactively add Go / License badges	0/3 scenarios produced the full badge set	Low - leaves information incomplete
Does not proactively add a ToC	0/1 scenarios added a ToC for a long README	Low - hurts readability
Missing Project Structure section	0/1 CLI scenarios included it	Low - structure information stays scattered

5. Token Cost-Effectiveness Analysis

5.1 Skill Size

readme-generator is a multi-file skill. SKILL.md contains the core rules, and references are loaded on demand.

File	Lines	Bytes	Estimated tokens	When loaded
SKILL.md	403	18,755	~4,688	Always
`references/templates.md`	372	7,512	~1,878	When generating from scratch
`references/golden-service.md`	144	4,357	~1,089	Service projects
`references/golden-cli.md`	102	2,638	~660	CLI projects
`references/golden-library.md`	103	3,007	~752	Library projects
`references/golden-monorepo.md`	93	2,951	~738	Monorepo (on demand)
`references/golden-lightweight.md`	61	1,685	~421	Small projects
`references/anti-examples.md`	182	3,306	~826	During refactoring
`references/checklist.md`	171	10,389	~2,597	During refactoring
`references/command-priority.md`	279	8,496	~2,124	When commands conflict
`scripts/discover_readme_needs.sh`	239	9,499	~2,375	Always (step 1)
`references/bilingual-guidelines.md`	28	1,086	~271	Chinese / bilingual (on demand)
`references/monorepo-rules.md`	49	1,687	~421	Monorepo (on demand)
Description (always in context)	-	-	~60	Always

Typical loading scenarios (following the "Load References Selectively" rule):

Scenario	Files loaded	Estimated total tokens
English service (Eval 1)	`SKILL.md` + `templates` + `golden-service` + `discover.sh`	~10,030
CLI tool (Eval 2)	`SKILL.md` + `templates` + `golden-cli` + `discover.sh`	~9,601
Refactor mode (Eval 3)	`SKILL.md` + `anti-examples` + `checklist` + `discover.sh`	~10,486
`SKILL.md` only (minimum load)	`SKILL.md`	~4,688
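Scenario totals are plain sums of the per-file estimates. The per-file figures themselves look consistent with a rule of thumb of roughly four bytes per token; that is an inference from the size table, not something the report states, and the table's rounding is not perfectly uniform. A minimal sketch for the two from-scratch scenarios:

```python
# Estimated tokens per file, copied from the size table above.
TOKENS = {
    "SKILL.md": 4688,
    "references/templates.md": 1878,
    "references/golden-service.md": 1089,
    "references/golden-cli.md": 660,
    "scripts/discover_readme_needs.sh": 2375,
}


def estimate_tokens(num_bytes):
    """Rough heuristic: about 4 bytes per token (an assumption, not a tokenizer)."""
    return num_bytes // 4


def load_cost(files):
    """Total estimated tokens for a set of loaded files."""
    return sum(TOKENS[name] for name in files)


service = ["SKILL.md", "references/templates.md", "references/golden-service.md",
           "scripts/discover_readme_needs.sh"]
cli = ["SKILL.md", "references/templates.md", "references/golden-cli.md",
       "scripts/discover_readme_needs.sh"]

print(load_cost(service), load_cost(cli), estimate_tokens(18755))
```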

5.2 Quality Gains per Token

Metric	Value
With-skill pass rate	100% (42/42)
Without-skill pass rate	61.9% (26/42)
Pass-rate improvement	+38.1 percentage points
Fixed assertions	16
Tokens per fixed assertion (SKILL.md only)	~293 tokens
Tokens per fixed assertion (full load)	~627 tokens
Tokens per 1% gain (SKILL.md only)	~123 tokens
Tokens per 1% gain (full load)	~263 tokens
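The four cost figures are simple ratios of token overhead to quality gain. A short sketch of the calculation, with "1% gain" read as one percentage point (the function name is illustrative):

```python
def cost_metrics(skill_tokens, fixed_assertions=16, gain_points=38.1):
    """Tokens spent per fixed assertion and per percentage point of pass-rate gain."""
    return {
        "per_assertion": round(skill_tokens / fixed_assertions),
        "per_point": round(skill_tokens / gain_points),
    }


print("SKILL.md only:", cost_metrics(4688))
print("Full load:", cost_metrics(10030))
```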

5.3 Cost-Effectiveness by Token Segment

Breaking SKILL.md into functional modules:

Module	Estimated tokens	Related assertion delta	Cost-effectiveness
Output Contract + Scorecard definition	~600	3 assertions (no structured report in all 3 evals)	High - 200 tok/assertion
Documentation Maintenance rules	~200	3 assertions (no maintenance note in all 3 evals)	Very high - 67 tok/assertion
End-to-End Example Rule + no-fabrication	~220	1 assertion (Eval 2 end-to-end example) + prevents new fabrication	High - 220 tok/assertion
Badge Detection Gate (4-step detection)	~250	2 assertions (Go + License badge)	High - 125 tok/assertion
Command Verifiability Gate + hard rule	~250	1 assertion (no execution-status labels)	High - 250 tok/assertion
README Navigation Rule (ToC)	~200	1 assertion (Eval 2 ToC)	Medium - 200 tok/assertion
Community & Governance Files rules	~150	Indirect contribution (tied with without-skill; both referenced community files)	Low (in this evaluation)
Pre-Generation Gates (type routing)	~400	Indirect contribution (type routing was correct in both; the base model could also do it)	Low (in this evaluation)
Anti-Example 1 (internal labels)	~200	Defensive only (without-skill already removed old labels, but this prevents new leakage)	Medium
Evidence Mapping rules	~150	3 assertions (all 3 evals missing evidence mapping)	Very high - 50 tok/assertion
Structure Policy (template routing)	~350	Indirect contribution (Project Structure section completeness)	Medium

5.4 High-Leverage vs Low-Leverage Instructions

High leverage (~1,670 tokens -> directly contributes 13 assertion deltas):

- Documentation Maintenance (200 tok -> 3 assertions)
- Evidence Mapping (150 tok -> 3 assertions)
- Output Contract + Scorecard (600 tok -> 3 assertions)
- End-to-End Example + no-fabrication (220 tok -> 1 assertion + defensive value)
- Badge Detection (250 tok -> 2 assertions)
- Command Verifiability Gate (250 tok -> 1 assertion + defensive value)

Medium leverage (~750 tokens -> indirect contribution):

- README Navigation Rule / ToC (200 tok -> 1 assertion)
- Anti-Example 1 (200 tok -> defensive guarantee)
- Structure Policy (350 tok -> section completeness)

Low leverage (~550 tokens -> 0 direct deltas in untested scenarios):

- Chinese / Bilingual Guidelines (bilingual-guidelines.md, ~271 tok) - on demand, not triggered
- Monorepo Rules (monorepo-rules.md, ~421 tok) - on demand, not triggered

Reference materials (~2,500-5,200 tokens depending on scenario):

- golden-*.md provides README structure templates (indirectly improves section order and completeness)
- templates.md provides the full skeleton (indirectly improves consistency in project-type routing)
- discover_readme_needs.sh provides deterministic scanning (indirectly improves evidence completeness)

5.5 Token Efficiency Rating

Rating area	Conclusion
Overall ROI	Good - ~10,000 tokens for a +38.1 percentage-point pass-rate gain
SKILL.md ROI alone	Moderate - ~4,688 tokens is relatively heavy; high-leverage rules account for about 36% (~1,670 tokens)
Conditional loading design	Excellent - bilingual / monorepo / refactor-specific files are loaded only when needed, so common scenarios avoid unnecessary cost
Defensive token spend	Valuable - the no-fabrication and evidence gates prevented the kind of `docker pull` fabrication seen in the without-skill run, which is hard to quantify fully through assertions alone

5.6 Cost-Effectiveness Compared with `go-makefile-writer`

Metric	readme-generator	go-makefile-writer
SKILL.md tokens	~4,688	~1,960
Typical full load	~10,000	~4,600
Pass-rate improvement	+38.1%	+31.0%
Tokens per 1% gain (SKILL.md)	~123 tok	~63 tok
Tokens per 1% gain (full)	~263 tok	~149 tok

The readme-generator SKILL.md is about 2.4x the size of go-makefile-writer's, and its token cost per percentage point of improvement is about 2.0x as high for SKILL.md alone (~123 vs ~63 tokens) and about 1.8x as high for a full load (~263 vs ~149 tokens). Given that readme-generator has to cover 5 project-type routes, multilingual support, both refactor and generation modes, and a much more complex evidence-driven constraint system than Makefile generation, this gap is a reasonable reflection of task complexity rather than poor efficiency.

6. Overall Score

6.1 Dimension Scores

Dimension	With Skill	Without Skill	Delta
Evidence-driven content (no fabrication)	5.0/5	3.5/5	+1.5
Correct project-type routing	5.0/5	5.0/5	0
Structured reporting (Output Contract)	5.0/5	0/5	+5.0
Maintenance sustainability (maintenance note)	5.0/5	0/5	+5.0
Badge quality and completeness	5.0/5	3.0/5	+2.0
Navigation and ToC quality	5.0/5	2.0/5	+3.0
CLI end-to-end examples	5.0/5	1.5/5	+3.5
No internal workflow labels	5.0/5	4.5/5	+0.5
Overall average	5.0/5	2.44/5	+2.56

6.2 Weighted Total Score

Dimension	Weight	With Skill score	Without Skill score	Weighted (With Skill)
Assertion pass rate (delta)	25%	10/10	6.2/10	2.50
Structured reporting and evidence mapping	20%	10/10	0/10	2.00
Maintenance sustainability	15%	10/10	0/10	1.50
Defense against fabricated content	15%	10/10	5.0/10	1.50
Token cost-effectiveness	15%	6.0/10	-	0.90
Content quality and readability	10%	9.5/10	8.0/10	0.95
Weighted total				9.35/10
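The weighted total is the sum of each with-skill score multiplied by its weight. A minimal sketch that reproduces the figure from the table; weights are kept as whole percentages to avoid floating-point drift:

```python
# (dimension, weight in percent, with-skill score out of 10), from the table above
WEIGHTED_SCORES = [
    ("Assertion pass rate (delta)", 25, 10.0),
    ("Structured reporting and evidence mapping", 20, 10.0),
    ("Maintenance sustainability", 15, 10.0),
    ("Defense against fabricated content", 15, 10.0),
    ("Token cost-effectiveness", 15, 6.0),
    ("Content quality and readability", 10, 9.5),
]


def weighted_total(rows):
    """Weighted score out of 10; the weights must add up to 100 percent."""
    assert sum(weight for _, weight, _ in rows) == 100
    return round(sum(weight * score for _, weight, score in rows) / 100, 2)


print(weighted_total(WEIGHTED_SCORES))
```

No without-skill total is computed here because the table gives no without-skill score for token cost-effectiveness.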

7. Improvement Suggestions

7.1 [P1] Minimum Coverage Constraint for Project Structure

Issue: in the with-skill README for Eval 3, the Project Structure section had only one line:

cmd/server/     # server entry point

It omitted directories such as internal/api/, internal/db/, and pkg/cache/, even though these were clearly evidenced by the import paths in cmd/server/main.go.

Suggestion: in Generation Workflow Step 1 (Discover), add a rule to scan the entrypoint's import paths and use them to supplement internal/ and pkg/ directories. Also enforce a minimum threshold such as "Project Structure must list at least 3 meaningful directories."
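A rough sketch of what such an import scan could look like is shown below. It is illustrative only: the module path `github.com/acme/notification-svc` and the sample `main.go` are hypothetical (the report does not give the Eval 3 module path), and a regular expression is a simplification compared with parsing Go source properly.

```python
import re

IMPORT_BLOCK = re.compile(r"import\s*\((.*?)\)", re.DOTALL)
SINGLE_IMPORT = re.compile(r'^import\s+(?:[\w.]+\s+)?"([^"]+)"', re.MULTILINE)
QUOTED = re.compile(r'"([^"]+)"')

# Hypothetical entrypoint; only the imported directories come from the report.
SAMPLE_MAIN_GO = '''
package main

import (
    "log"

    "github.com/acme/notification-svc/internal/api"
    "github.com/acme/notification-svc/internal/db"
    cache "github.com/acme/notification-svc/pkg/cache"
)
'''


def imported_paths(go_source):
    """Collect import paths from grouped and single-line import declarations."""
    paths = []
    for block in IMPORT_BLOCK.findall(go_source):
        paths.extend(QUOTED.findall(block))
    paths.extend(SINGLE_IMPORT.findall(go_source))
    return paths


def dirs_from_imports(go_source, module_path):
    """Return internal/ and pkg/ directories imported from the project's own module."""
    prefix = module_path + "/"
    dirs = []
    for path in imported_paths(go_source):
        if path.startswith(prefix):
            rel = path[len(prefix):]
            if rel.startswith(("internal/", "pkg/")) and rel not in dirs:
                dirs.append(rel)
    return dirs


def meets_minimum(dirs, minimum=3):
    """Apply the suggested threshold of at least 3 meaningful directories."""
    return len(dirs) >= minimum


found = dirs_from_imports(SAMPLE_MAIN_GO, "github.com/acme/notification-svc")
print(found, meets_minimum(found))
```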

7.2 [P2] Clarify Priority Between License Section and License Badge

Issue: under Community and Governance Files, SKILL.md says "LICENSE -> Add License section or badge", but the priority is unclear, which leads to inconsistent output across scenarios (sometimes only a badge, sometimes only a section).

Suggestion: define an explicit priority rule:

- README > 80 lines: a License badge is enough; no separate License section required
- README <= 80 lines or public-facing repository: keep both the badge and a dedicated License section
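Written as code, the proposed rule is a two-branch decision. The two conditions overlap for a long README in a public-facing repository; this sketch assumes the "keep both" branch wins in that case, which is an interpretation rather than something the suggestion spells out. The function name is hypothetical.

```python
def license_presentation(readme_lines, public_facing):
    """Return which License elements the README should carry under the proposed rule."""
    # Assumption: public-facing repositories keep both elements regardless of length.
    if readme_lines <= 80 or public_facing:
        return {"badge", "section"}
    return {"badge"}


print(license_presentation(readme_lines=120, public_facing=False))
print(license_presentation(readme_lines=120, public_facing=True))
```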

7.3 [P3] Add More Evaluation Scenarios

Untested feature	Suggested scenario
Chinese / bilingual README	A Chinese Go project with Chinese comments, to validate `bilingual-guidelines.md`
Monorepo	`apps/` + `packages/` layout with multiple `go.mod` files, to validate `monorepo-rules.md`
Library / SDK	Pure `pkg/` layout with no `cmd/`, to validate Template B routing
Degraded mode	A bare repository with no `Makefile` and no `go.mod`
Private repository	Badge fallback strategy validation

8. Evaluation Materials

Material	Path
Eval 1 test repository	`/tmp/readme-eval/eval-repos/go-service`
Eval 2 test repository	`/tmp/readme-eval/eval-repos/go-cli`
Eval 3 test repository	`/tmp/readme-eval/eval-repos/refactor-stale`
Eval 1 with-skill output	`/tmp/readme-eval/workspace/iteration-2/eval-1-go-service/with_skill/outputs/`
Eval 1 without-skill output	`/tmp/readme-eval/workspace/iteration-2/eval-1-go-service/without_skill/outputs/`
Eval 2 with-skill output	`/tmp/readme-eval/workspace/iteration-2/eval-2-go-cli/with_skill/outputs/`
Eval 2 without-skill output	`/tmp/readme-eval/workspace/iteration-2/eval-2-go-cli/without_skill/outputs/`
Eval 3 with-skill output	`/tmp/readme-eval/workspace/iteration-2/eval-3-refactor-stale/with_skill/outputs/`
Eval 3 without-skill output	`/tmp/readme-eval/workspace/iteration-2/eval-3-refactor-stale/without_skill/outputs/`
Skill path	`/Users/john/.codex/skills/readme-generator/SKILL.md`