Skip to content

readme-generator Skill Evaluation Report

Evaluation framework: skill-creator Evaluation date: 2026-03-19 Evaluation subject: readme-generator


readme-generator is a repository-aware README generation and refactoring skill for producing maintainable, evidence-backed project homepages across services, libraries, CLI tools, and monorepos. Its three main strengths are: project-type routing and template selection, so structure matches the real repository shape instead of generic documentation patterns; evidence mapping, badge detection, and no-fabrication rules, which keep each section grounded in actual files, commands, and configs; and structured output contracts plus maintenance guidance, which make the resulting README easier to review, sustain, and keep aligned with future code changes.

1. Evaluation Overview

This evaluation reviews the readme-generator skill along two dimensions: actual task performance and token cost-effectiveness. It uses 3 progressively more complex README generation / refactoring scenarios: creating a README from scratch for a Go service, creating one for a Go CLI tool, and refactoring a flawed README. Each scenario was run with both with-skill and without-skill configurations, for 3 scenarios x 2 configs = 6 independent subagent runs, scored against 42 assertions.

Dimension With Skill Without Skill Delta
Assertion pass rate 42/42 (100%) 26/42 (61.9%) +38.1 percentage points
Output Contract structured report 3/3 correct 0/3 Skill-only
Documentation Maintenance notes 3/3 0/3 Skill-only
Evidence Mapping table 3/3 0/3 Skill-only
Community file links (Contributing / Security) 2/2 2/2 Tied
CLI end-to-end example 1/1 (no fabricated output body) 0/1 Skill-only
No internal workflow labels 3/3 2/3 Skill advantage
No fabricated content 3/3 2/3 Skill advantage
Skill token overhead (SKILL.md only) ~4,688 tokens 0 -
Skill token overhead (typical full load) ~10,030 tokens 0 -
Token cost per 1% pass-rate gain ~123 tokens (SKILL.md only) / ~263 tokens (full) - -

2. Test Method

2.1 Scenario Design

Scenario Repository Core evaluation points Assertions
Eval 1: go-service-from-scratch Go service: cmd/api, internal/, Makefile, .env.example, CI Project-type routing, evidence-driven sections, badge strategy, Output Contract 14
Eval 2: go-cli-tool Go CLI tool: Cobra with two subcommands, Makefile, CI, CONTRIBUTING.md CLI routing, end-to-end example, ToC quality, no fabrication 13
Eval 3: refactor-stale-readme Go service with a flawed README: fake badges, wrong config, outdated commands, internal labels Anti-pattern detection and fixes, community file links, Output Contract 15

2.2 Test Repository Structure

Eval 1 repository (/tmp/readme-eval/eval-repos/go-service): - cmd/api/main.go - entrypoint (handler -> service -> repository layers) - internal/handler/user.go - 3 HTTP endpoints (GET/POST /users, GET /users/:id) - .env.example - 5 environment variables (DATABASE_URL, REDIS_URL, JWT_SECRET, LOG_LEVEL, PORT) - .github/workflows/ci.yml - GitHub Actions (runs make ci, Go 1.23) - Makefile - 9 targets, COVER_MIN=80, golangci-lint@v1.62.2 - LICENSE - MIT; Go 1.23; module github.com/acme/user-service

Eval 2 repository (/tmp/readme-eval/eval-repos/go-cli): - cmd/root/root.go - Cobra root + 2 global flags (--output/-o, --format/-f) - cmd/generate/generate.go, cmd/validate/validate.go - 2 subcommands - Makefile - 4 targets (build-schema-gen, test, lint, install) - .github/workflows/ci.yml, LICENSE (Apache 2.0), CONTRIBUTING.md - Go 1.22, no .env.example, no sample output files

Eval 3 repository (/tmp/readme-eval/eval-repos/refactor-stale) - preloaded with a flawed README: - Fake badges: Travis CI, Codecov, npm Downloads (the repo actually uses GitHub Actions) - Wrong config section: DB_HOST, DB_PORT, etc. (.env.example actually uses 7 variables such as POSTGRES_DSN, REDIS_ADDR) - Outdated command: go run main.go (the Makefile has make run-server) - Internal labels: the Testing table contains ✅ Verified / ⚠️ Not verified - Actual repo content: .env.example (7 variables), Makefile (9 targets), CONTRIBUTING.md, SECURITY.md, Go 1.24

2.3 Execution Method

  • Each scenario used an independent Git repository preloaded with code, go.mod, Makefile, and related files.
  • With-skill runs first read SKILL.md and followed the skill workflow to generate or refactor the README.
  • Without-skill runs did not read any skill and completed the same task using the model's default behavior.
  • All 6 runs were executed in parallel.

3. Assertion Pass Rate

3.1 Overview

Scenario Assertions With Skill Without Skill Delta
Eval 1: go-service 14 14/14 (100%) 9/14 (64.3%) +35.7%
Eval 2: go-cli 13 13/13 (100%) 8/13 (61.5%) +38.5%
Eval 3: refactor-stale 15 15/15 (100%) 9/15 (60.0%) +40.0%
Total 42 42/42 (100%) 26/42 (61.9%) +38.1%

3.2 Breakdown of the 16 Failed Assertions Without the Skill

Failure type Count Affected evals Notes
No Output Contract / Scorecard 3 Eval 1/2/3 No structured report with project_type, template_used, scorecard, or badges_added
No Documentation Maintenance 3 Eval 1/2/3 No maintenance matrix such as "update this README when these repo changes happen"
No Evidence Mapping 3 Eval 1/2/3 No section-to-evidence-file mapping table
No end-to-end example 1 Eval 2 The CLI README showed command snippets only, not a full "input command -> output description" example
No Project Structure section 1 Eval 2 Structure information was scattered across other sections
No ToC 1 Eval 2 The multi-section CLI README lacked navigation
Missing Go version badge 1 Eval 1 Only a CI badge was added; go.mod provided evidence for the Go version
Quick Start had more than 3 steps 1 Eval 1 Included git clone, resulting in 4 steps (<=3 is required)
Introduced new fabricated content 1 Eval 3 Added docker pull acme/notification-svc:latest despite no Docker evidence
No License section / badge 1 Eval 3 An MIT LICENSE file existed but was not referenced

3.3 Trend: the Skill Advantage Grows with Scenario Complexity

Scenario complexity Failed assertions without skill With-skill advantage
Eval 1 (service, from scratch) 5 +35.7%
Eval 2 (CLI, from scratch) 5 +38.5%
Eval 3 (refactor, with anti-patterns) 6 +40.0%

Eval 3 shows the largest advantage because refactoring requires not only fixing known problems, but also proactively discovering missing sections such as community files and maintenance notes. This kind of "scan and fill the gaps" behavior is built into the skill workflow, while without-skill runs tend to stop after fixing the obvious problems.


4. Dimension-by-Dimension Comparison

4.1 Output Contract and Structured Reporting

This is a skill-only differentiator: 3/3 scenarios produced it with the skill, compared with 0/3 without it.

Report item Eval 1 Eval 2 Eval 3
project_type service cli service
template_used Template A: Service Template C: CLI Template A: Service (Refactor)
scorecard Critical 4/4 Standard 6/6 Hygiene 4/4 -> PASS
badges_added CI + Go 1.23 + License CI + Go 1.22 + License CI + Go 1.24 + License
sections_omitted Contributing, Security, Release Config, Exit Codes, Arch, Deploy -
evidence_mapping 14-row mapping 15-row mapping 12-row mapping

Practical value: - Reviewers can verify which file supports each section during PR review. - sections_omitted explains why a section was skipped, instead of leaving "why is section X missing?" unanswered. - The layered scorecard (Critical / Standard / Hygiene) helps reviewers quickly locate quality issues.

4.2 Documentation Maintenance Notes

This comes from Hygiene Tier H1 in the skill. It passed in 3/3 scenarios with the skill and 0/3 without it.

Example from the with-skill Eval 1 output:

Repository change Sections to update
New cmd/*/main.go entrypoint Project Structure, Common Commands, Quick Start
Environment variable added / changed Configuration and Environment
Makefile target added / renamed Common Commands
CI workflow changed Badges, Testing and Quality
New API endpoints added API Endpoints
Go version bumped in go.mod Badges, Quick Start prerequisites

Practical value: this directly addresses the maintenance pain point where the README gradually drifts away from the codebase, because contributors can see exactly which README sections must be updated when the code changes.

4.3 CLI End-to-End Examples and No-Fabrication

The skill's End-to-End Example Rule requires CLI tools to provide a complete "input command -> output description" example, and it explicitly forbids inventing JSON / YAML output bodies when there is no evidence.

With skill (Eval 2):

schema-gen generate --format json --output ./schemas ./internal/models
# -> writes schema file(s) to ./schemas/

schema-gen validate ./schemas/models.json
# -> prints validation result to stdout
The Output Contract explicitly records: "No JSON/YAML output body fabricated (no sample fixtures in repo)"

Without skill (Eval 2): it only showed command examples, without the input-to-output description. The Examples subsection under Usage showed command variants, but readers could not tell what output to expect.

4.4 Defense Against Fabricated Content

This is the most important failure in the without-skill runs.

In Eval 3, while fixing existing fabricated content such as fake Travis CI badges and wrong DB config, the without-skill run introduced new fabricated content:

## Installation
docker pull acme/notification-svc:latest
There was no Docker-related evidence anywhere in the repository: no Dockerfile, no docker-compose.yml, and no Docker Hub link. This shows that when fixing one class of issue, the base model may still fill gaps using generic prior knowledge such as "Go services often have Docker images."

The skill's Evidence Completeness Gate explicitly requires "base every statement on repository evidence", and no new fabrication appeared in any of the 3 with-skill scenarios.

Scenario With Skill Without Skill
Removed old fake badges (Eval 3)
Corrected old wrong config (Eval 3)
Did not introduce new fabricated content (Eval 3) ❌ (docker pull)
CLI examples contained no fabricated output body (Eval 2) N/A (no end-to-end example)
Go version badge was evidence-based (Eval 1) ❌ (not added)

4.5 Badge Strategy

Dimension With Skill Without Skill
CI badge (from .github/workflows) 3/3 3/3
Go version badge (from go.mod) 3/3 0/3
License badge (from LICENSE) 3/3 0/3
Correctly removed fake badges (Eval 3) 3/3 3/3
No placeholder / fake badge URLs 3/3 3/3

The skill's Badge Detection Gate requires scanning in the order CI -> Coverage -> Language version -> License. As a result, the three-badge combination (CI + Go + License) was produced consistently in all three scenarios. Without the skill, the model only added the CI badge proactively. The Go-version and License badges need explicit rules to appear consistently.

4.6 ToC Navigation Quality (CLI Scenario)

Metric With Skill Without Skill
ToC present ✅ (10 items)
Reasonable ToC size (7-10 items) N/A
ToC labels match headings exactly N/A

The with-skill Eval 2 ToC:

- [Prerequisites](#prerequisites)
- [Installation](#installation)
- [Quick Start](#quick-start)
- [Commands & Flags](#commands--flags)
- [End-to-End Example](#end-to-end-example)
- [Project Structure](#project-structure)
- [Development Commands](#development-commands)
- [Contributing](#contributing)
- [License](#license)
- [Documentation Maintenance](#documentation-maintenance)
All 10 items matched the actual ## headings exactly, which follows the skill's ToC size-calibration rule.

4.7 Boundary with Claude's Base Model

Capabilities the Base Model Already Has (No Skill Gain)

Capability Evidence
Correct project-type routing (service / cli) Correct in 3/3 scenarios
Removes fake badges (Travis CI, Codecov, npm) Correct in the 1/1 relevant scenario (Eval 3)
Corrects wrong config sections Correct in the 1/1 relevant scenario (Eval 3)
Fixes outdated commands (go run -> make run-server) Correct in the 1/1 relevant scenario (Eval 3)
Removes internal Verified / Not verified labels Correct in the 1/1 relevant scenario (Eval 3)
References discovered community files The without-skill Eval 3 output correctly referenced CONTRIBUTING.md + SECURITY.md
Documents Makefile targets Correct in 3/3 scenarios
Basic evidence-driven content Generally decent, but not systematic

Capability Gaps in the Base Model (Filled by the Skill)

Gap Evidence Risk level
No Output Contract 0/3 scenarios produced a structured report High - README changes cannot be audited programmatically
No Documentation Maintenance 0/3 scenarios added a maintenance matrix Medium - the README gradually drifts away from the codebase
No Evidence Mapping 0/3 scenarios provided section-to-file mappings Low - reduces auditability
Missing CLI end-to-end examples 0/1 scenarios provided a full "input -> output" example Medium - users cannot predict CLI output shape
Introduces new fabricated content in refactor scenarios Eval 3 docker pull High - fills gaps with generic knowledge instead of repo evidence
Does not proactively add Go / License badges 0/3 scenarios produced the full badge set Low - leaves information incomplete
Does not proactively add a ToC 0/1 scenarios added a ToC for a long README Low - hurts readability
Missing Project Structure section 0/1 CLI scenarios included it Low - structure information stays scattered

5. Token Cost-Effectiveness Analysis

5.1 Skill Size

readme-generator is a multi-file skill. SKILL.md contains the core rules, and references are loaded on demand.

File Lines Bytes Estimated tokens When loaded
SKILL.md 403 18,755 ~4,688 Always
references/templates.md 372 7,512 ~1,878 When generating from scratch
references/golden-service.md 144 4,357 ~1,089 Service projects
references/golden-cli.md 102 2,638 ~660 CLI projects
references/golden-library.md 103 3,007 ~752 Library projects
references/golden-monorepo.md 93 2,951 ~738 Monorepo (on demand)
references/golden-lightweight.md 61 1,685 ~421 Small projects
references/anti-examples.md 182 3,306 ~826 During refactoring
references/checklist.md 171 10,389 ~2,597 During refactoring
references/command-priority.md 279 8,496 ~2,124 When commands conflict
scripts/discover_readme_needs.sh 239 9,499 ~2,375 Always (step 1)
references/bilingual-guidelines.md 28 1,086 ~271 Chinese / bilingual (on demand)
references/monorepo-rules.md 49 1,687 ~421 Monorepo (on demand)
Description (always in context) - - ~60 Always

Typical loading scenarios (following the "Load References Selectively" rule):

Scenario Files loaded Estimated total tokens
English service (Eval 1) SKILL.md + templates + golden-service + discover.sh ~10,030
CLI tool (Eval 2) SKILL.md + templates + golden-cli + discover.sh ~9,601
Refactor mode (Eval 3) SKILL.md + anti-examples + checklist + discover.sh ~10,186
SKILL.md only (minimum load) SKILL.md ~4,688

5.2 Quality Gains per Token

Metric Value
With-skill pass rate 100% (42/42)
Without-skill pass rate 61.9% (26/42)
Pass-rate improvement +38.1 percentage points
Fixed assertions 16
Tokens per fixed assertion (SKILL.md only) ~293 tokens
Tokens per fixed assertion (full load) ~627 tokens
Tokens per 1% gain (SKILL.md only) ~123 tokens
Tokens per 1% gain (full load) ~263 tokens

5.3 Cost-Effectiveness by Token Segment

Breaking SKILL.md into functional modules:

Module Estimated tokens Related assertion delta Cost-effectiveness
Output Contract + Scorecard definition ~600 3 assertions (no structured report in all 3 evals) High - 200 tok/assertion
Documentation Maintenance rules ~200 3 assertions (no maintenance note in all 3 evals) Very high - 67 tok/assertion
End-to-End Example Rule + no-fabrication ~220 1 assertion (Eval 2 end-to-end example) + prevents new fabrication High - 220 tok/assertion
Badge Detection Gate (4-step detection) ~250 2 assertions (Go + License badge) High - 125 tok/assertion
Command Verifiability Gate + hard rule ~250 1 assertion (no execution-status labels) High - 250 tok/assertion
README Navigation Rule (ToC) ~200 1 assertion (Eval 2 ToC) Medium - 200 tok/assertion
Community & Governance Files rules ~150 Indirect contribution (tied with without-skill; both referenced community files) Low (in this evaluation)
Pre-Generation Gates (type routing) ~400 Indirect contribution (type routing was correct in both; the base model could also do it) Low (in this evaluation)
Anti-Example 1 (internal labels) ~200 Defensive only (without-skill already removed old labels, but this prevents new leakage) Medium
Evidence Mapping rules ~150 3 assertions (all 3 evals missing evidence mapping) Very high - 50 tok/assertion
Structure Policy (template routing) ~350 Indirect contribution (Project Structure section completeness) Medium

5.4 High-Leverage vs Low-Leverage Instructions

High leverage (~1,620 tokens -> directly contributes 11+ assertion deltas): - Documentation Maintenance (200 tok -> 3 assertions) - Evidence Mapping (150 tok -> 3 assertions) - Output Contract + Scorecard (600 tok -> 3 assertions) - End-to-End Example + no-fabrication (220 tok -> 1 assertion + defensive value) - Badge Detection (250 tok -> 2 assertions) - Command Verifiability Gate (250 tok -> 1 assertion + defensive value)

Medium leverage (~750 tokens -> indirect contribution): - README Navigation Rule / ToC (200 tok -> 1 assertion) - Anti-Example 1 (200 tok -> defensive guarantee) - Structure Policy (350 tok -> section completeness)

Low leverage (~550 tokens -> 0 direct deltas in untested scenarios): - Chinese / Bilingual Guidelines (bilingual-guidelines.md, ~271 tok) - on demand, not triggered - Monorepo Rules (monorepo-rules.md, ~421 tok) - on demand, not triggered

Reference materials (~2,500-5,200 tokens depending on scenario): - golden-*.md provides README structure templates (indirectly improves section order and completeness) - templates.md provides the full skeleton (indirectly improves consistency in project-type routing) - discover_readme_needs.sh provides deterministic scanning (indirectly improves evidence completeness)

5.5 Token Efficiency Rating

Rating area Conclusion
Overall ROI Good - ~10,000 tokens for a +38.1% pass-rate gain
SKILL.md ROI alone Moderate - ~4,688 tokens is relatively heavy; high-leverage rules account for about 34% (~1,620 tokens)
Conditional loading design Excellent - bilingual / monorepo / refactor-specific files are loaded only when needed, so common scenarios avoid unnecessary cost
Defensive token spend Valuable - the no-fabrication and evidence gates prevented the kind of docker pull fabrication seen in the without-skill run, which is hard to quantify fully through assertions alone

5.6 Cost-Effectiveness Compared with go-makefile-writer

Metric readme-generator go-makefile-writer
SKILL.md tokens ~4,688 ~1,960
Typical full load ~10,000 ~4,600
Pass-rate improvement +38.1% +31.0%
Tokens per 1% gain (SKILL.md) ~123 tok ~63 tok
Tokens per 1% gain (full) ~263 tok ~149 tok

The readme-generator SKILL.md is about 2.4x the size of go-makefile-writer, and its token cost per 1% improvement is about 2.0x higher. Given that readme-generator has to cover 5 project-type routes, multilingual support, both refactor and generation modes, and a much more complex evidence-driven constraint system than Makefile generation, this gap is a reasonable reflection of task complexity rather than poor efficiency.


6. Overall Score

6.1 Dimension Scores

Dimension With Skill Without Skill Delta
Evidence-driven content (no fabrication) 5.0/5 3.5/5 +1.5
Correct project-type routing 5.0/5 5.0/5 0
Structured reporting (Output Contract) 5.0/5 0/5 +5.0
Maintenance sustainability (maintenance note) 5.0/5 0/5 +5.0
Badge quality and completeness 5.0/5 3.0/5 +2.0
Navigation and ToC quality 5.0/5 2.0/5 +3.0
CLI end-to-end examples 5.0/5 1.5/5 +3.5
No internal workflow labels 5.0/5 4.5/5 +0.5
Overall average 5.0/5 2.44/5 +2.56

6.2 Weighted Total Score

Dimension Weight With Skill score Without Skill score Weighted (With Skill)
Assertion pass rate (delta) 25% 10/10 6.2/10 2.50
Structured reporting and evidence mapping 20% 10/10 0/10 2.00
Maintenance sustainability 15% 10/10 0/10 1.50
Defense against fabricated content 15% 10/10 5.0/10 1.50
Token cost-effectiveness 15% 6.0/10 - 0.90
Content quality and readability 10% 9.5/10 8.0/10 0.95
Weighted total 9.35/10

7. Improvement Suggestions

7.1 [P1] Minimum Coverage Constraint for Project Structure

Issue: in the with-skill README for Eval 3, the Project Structure section had only one line:

cmd/server/     # server entry point

It omitted directories such as internal/api/, internal/db/, and pkg/cache/, even though these were clearly evidenced by the import paths in cmd/server/main.go.

Suggestion: in Generation Workflow Step 1 (Discover), add a rule to scan the entrypoint's import paths and use them to supplement internal/ and pkg/ directories. Also enforce a minimum threshold such as "Project Structure must list at least 3 meaningful directories."

7.2 [P2] Clarify Priority Between License Section and License Badge

Issue: under Community and Governance Files, SKILL.md says "LICENSE -> Add License section or badge", but the priority is unclear, which leads to inconsistent output across scenarios (sometimes only a badge, sometimes only a section).

Suggestion: define an explicit priority rule: - README > 80 lines: a License badge is enough; no separate License section required - README <= 80 lines or public-facing repository: keep both the badge and a dedicated License section

7.3 [P3] Add More Evaluation Scenarios

Untested feature Suggested scenario
Chinese / bilingual README A Chinese Go project with Chinese comments, to validate bilingual-guidelines.md
Monorepo apps/ + packages/ layout with multiple go.mod files, to validate monorepo-rules.md
Library / SDK Pure pkg/ layout with no cmd/, to validate Template B routing
Degraded mode A bare repository with no Makefile and no go.mod
Private repository Badge fallback strategy validation

8. Evaluation Materials

Material Path
Eval 1 test repository /tmp/readme-eval/eval-repos/go-service
Eval 2 test repository /tmp/readme-eval/eval-repos/go-cli
Eval 3 test repository /tmp/readme-eval/eval-repos/refactor-stale
Eval 1 with-skill output /tmp/readme-eval/workspace/iteration-2/eval-1-go-service/with_skill/outputs/
Eval 1 without-skill output /tmp/readme-eval/workspace/iteration-2/eval-1-go-service/without_skill/outputs/
Eval 2 with-skill output /tmp/readme-eval/workspace/iteration-2/eval-2-go-cli/with_skill/outputs/
Eval 2 without-skill output /tmp/readme-eval/workspace/iteration-2/eval-2-go-cli/without_skill/outputs/
Eval 3 with-skill output /tmp/readme-eval/workspace/iteration-2/eval-3-refactor-stale/with_skill/outputs/
Eval 3 without-skill output /tmp/readme-eval/workspace/iteration-2/eval-3-refactor-stale/without_skill/outputs/
Skill path /Users/john/.codex/skills/readme-generator/SKILL.md