# go-makefile-writer Skill Evaluation Report
Evaluation framework: skill-creator
Evaluation date: 2026-03-11
Subject: go-makefile-writer
go-makefile-writer is a skill for creating or refactoring Makefiles in Go repositories, suitable for unifying build, test, lint, run, and CI entry points, and for converging existing Makefiles of varying quality with minimal effort. Its three main strengths are: automatically planning target sets and naming rules from repository structure for more stable, readable Makefiles; consistent version pinning and normative constraints for key targets like install-tools, ci, and tidy to reduce drift; and in Refactor mode, emphasis on minimal-diff and backward compatibility—fixing issues without breaking existing usage patterns.
## 1. Evaluation Overview
This evaluation assesses the go-makefile-writer skill along two dimensions: actual task performance and token cost-effectiveness. It uses three Makefile generation/refactoring scenarios of increasing complexity (single-binary creation, multi-binary + Docker creation, defective-Makefile refactoring). Each scenario runs in both with-skill and without-skill configurations, for 3 scenarios × 2 configurations = 6 independent subagent runs, scored against 42 assertions.
| Dimension | With Skill | Without Skill | Delta |
|---|---|---|---|
| Assertion pass rate | 42/42 (100%) | 29/42 (69.0%) | +31.0 percentage points |
| Naming convention compliance | 3/3 correct | 1/3 | Largest single-item delta |
| install-tools version pinning | 3/3 | 0/3 | Skill-only |
| Output Contract structured report | 3/3 | 0/3 | Skill-only |
| ci target naming | 3/3 | 1/3 | Skill consistent |
| tidy target | 3/3 | 2/3 | Skill consistent |
| Skill token cost (SKILL.md only) | ~1,960 tokens | 0 | — |
| Skill token cost (incl. references) | ~4,700 tokens | 0 | — |
| Token cost per 1% pass-rate gain | ~63 tokens (SKILL.md only) / ~149 tokens (full) | — | — |
## 2. Test Methodology
### 2.1 Scenario Design
| Scenario | Repository | Core focus | Assertions |
|---|---|---|---|
| Eval 1: simple-create | Single cmd/api, Go 1.23, no Makefile | Basic target set, naming convention, version injection, quality gates | 15 |
| Eval 2: multi-binary-docker | 3× cmd/*, Dockerfile, Go 1.25 | Multi-binary targets, Docker targets, cross-compilation | 15 |
| Eval 3: refactor-defects | Existing Makefile with 6 defects | Refactor mode, backward compatibility, defect fix coverage | 12 |
### 2.2 Execution
- Each scenario uses an independent Git repo with pre-seeded code and go.mod
- With-skill runs first read SKILL.md and its referenced materials (golden template, quality guide)
- Without-skill runs read no skill; Makefile is generated by model default behavior
- All runs execute in independent subagents in parallel
## 3. Assertion Pass Rate
### 3.1 Summary
| Scenario | Assertions | With Skill | Without Skill | Delta |
|---|---|---|---|---|
| Eval 1: simple-create | 15 | 15/15 (100%) | 8/15 (53.3%) | +46.7 pp |
| Eval 2: multi-binary-docker | 15 | 15/15 (100%) | 11/15 (73.3%) | +26.7 pp |
| Eval 3: refactor-defects | 12 | 12/12 (100%) | 10/12 (83.3%) | +16.7 pp |
| Total | 42 | 42/42 (100%) | 29/42 (69.0%) | +31.0 pp |
### 3.2 Classification of 13 Without-Skill Failed Assertions
| Failure type | Count | Evals | Notes |
|---|---|---|---|
| Naming convention non-compliance | 2 | Eval 1 | build/run instead of build-api/run-api, violates cmd/-path semantics |
| Missing install-tools or unpinned version | 3 | Eval 1/2/3 | Eval 1 missing install-tools; Eval 2 uses @latest; Eval 3 missing |
| Missing structured Output Report | 3 | Eval 1/2/3 | No structured report of Go version, layout, entrypoints, validation results |
| ci target missing or different name | 2 | Eval 1/2 | Eval 1 no ci; Eval 2 named check |
| Missing tidy target | 1 | Eval 1 | No go mod tidy + go mod verify |
| Lint tool check missing | 1 | Eval 1 | lint defined as vet+fmt-check, no golangci-lint |
| docker-build variable non-standard | 1 | Eval 2 | Uses DOCKER_IMAGE instead of IMAGE_NAME/IMAGE_TAG |
### 3.3 Trend: Skill Advantage Decreases with Scenario Complexity
| Scenario complexity | With-Skill advantage |
|---|---|
| Eval 1 (simple) | +46.7 pp (7 failures) |
| Eval 2 (medium) | +26.7 pp (4 failures) |
| Eval 3 (refactor) | +16.7 pp (2 failures) |
This is expected: Eval 3’s user prompt explicitly listed all 6 defects, effectively embedding the skill’s knowledge in the prompt. Eval 1’s prompt was minimal and most dependent on the skill’s conventions.
## 4. Dimension-by-Dimension Comparison
### 4.1 Naming Convention (cmd/-path semantics)
This is the largest single-item delta, contributing 2 assertion failures in Eval 1.
| Directory structure | With Skill | Without Skill |
|---|---|---|
| cmd/api/main.go | build-api, run-api | build, run |
| cmd/worker/main.go | build-worker, run-worker | build-worker, run-worker |
| cmd/server/main.go | build-server | build-server |
Analysis: Without-skill naturally used per-binary naming in multi-binary scenarios (Eval 2/3), but in the single-binary scenario defaulted to generic names. The skill’s rule "Map target names to cmd/ path semantics: cmd/<name> → build-<name>" ensures consistency.
Practical value: consistent naming enables:

- No target renaming when scaling from single to multi-binary
- Unified Makefile style across teams
- Predictable target names in CI scripts
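Under this rule, a repository with cmd/api/ and cmd/worker/ would yield targets along the following lines (a minimal sketch assuming the bin/ output directory seen throughout this evaluation; the recipes are illustrative, not the skill's golden template):

```make
# cmd/<name> -> build-<name> / run-<name>: names stay stable as binaries are added.
BIN_DIR := bin

build-api: ## Build the api binary
	go build -o $(BIN_DIR)/api ./cmd/api

build-worker: ## Build the worker binary
	go build -o $(BIN_DIR)/worker ./cmd/worker

run-api: build-api ## Build and run the api binary
	$(BIN_DIR)/api

.PHONY: build-api build-worker run-api
```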
### 4.2 install-tools and Version Pinning
| Dimension | With Skill | Without Skill |
|---|---|---|
| Eval 1 | install-tools pinned v1.62.2 | ❌ No install-tools |
| Eval 2 | install-tools pinned v1.62.2 | ❌ lint auto-installs @latest |
| Eval 3 | install-tools pinned v1.62.2 | ❌ No install-tools |
Analysis: Without-skill in Eval 2 embedded golangci-lint installation in the lint target (@latest auto-install). This works locally, but in CI it causes:

- Non-deterministic builds (different versions at different times)
- Re-installing tools on every CI run (slow)
The skill explicitly requires "Pin tool versions in install-tools for CI reproducibility".
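A pinned install-tools target in this spirit might look as follows (a sketch; v1.62.2 mirrors the version observed in the with-skill runs, and the install path is the standard one for golangci-lint v1.x):

```make
# Pinned tool versions keep CI installs reproducible; bump deliberately, not via @latest.
GOLANGCI_LINT_VERSION := v1.62.2

install-tools: ## Install pinned developer tools
	go install github.com/golangci/golangci-lint/cmd/golangci-lint@$(GOLANGCI_LINT_VERSION)

.PHONY: install-tools
```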
### 4.3 Output Contract (Structured Report)
This output is produced only in with-skill runs, each of which ends with a report containing:
| Report item | Eval 1 | Eval 2 | Eval 3 |
|---|---|---|---|
| Mode (Create/Refactor + rationale) | ✅ | ✅ | ✅ |
| Go version (from go.mod) | 1.23 | 1.25 | 1.24 |
| Layout (single-module/monorepo) | ✅ | ✅ | ✅ |
| Entrypoints discovered | cmd/api | cmd/api, cmd/worker, cmd/migrate | cmd/server, cmd/cli |
| New/updated targets list | ✅ | ✅ | ✅ |
| Deprecated/aliased targets | (none) | (none) | build-srv → build-server |
| Before vs After (Refactor) | N/A | N/A | ✅ |
| Validation results (make help/test/build) | ✅ | ✅ | ✅ |
| Anti-pattern checklist | ✅ | — | — |
Without-skill produced brief task summaries but no structured Output Contract.
Practical value: the Output Contract enables:

- Auditable Makefile changes (PR reviewers know what changed and why)
- Traceable backward compatibility in Refactor mode
- Documented CI validation results
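Reconstructed from the report items above, an Output Contract for a run like Eval 1 would resemble the following skeleton (field names and ordering are inferred from the table; the skill itself defines the exact format):

```text
Mode: Create (no existing Makefile found)
Go version: 1.23 (from go.mod)
Layout: single-module
Entrypoints: cmd/api
Targets added/updated: build-api, run-api, test, lint, ci, install-tools, ...
Deprecated/aliased targets: (none)
Validation: make help OK / make test OK / make build-api OK
```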
### 4.4 ci Target Naming
| Scenario | With Skill | Without Skill |
|---|---|---|
| Eval 1 | ci | ❌ No such target |
| Eval 2 | ci | check (similar but different name) |
| Eval 3 | ci | ci ✅ |
The skill specifies "CI target: ci (fmt-check + lint + test + cover-check in one pass)". Without-skill in Eval 2 used check, composed of fmt-check, vet, and test (missing cover-check), which does not fully match the standard CI pipeline.
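The specified composition maps directly onto an aggregate target (a sketch; fmt-check, lint, test, and cover-check are assumed to be defined elsewhere in the same Makefile):

```make
ci: fmt-check lint test cover-check ## Full quality pipeline in one pass, as run in CI

.PHONY: ci
```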
### 4.5 Impact of the Golden Template
With-skill Makefiles closely follow the golden template structure (variables → build → run → quality → ci → version → tools → clean → phony → help), while without-skill structures varied.
Key Eval 2 difference: Without-skill used $(eval $(call build-template,...)) dynamic metaprogramming for build targets; with-skill used explicit per-binary targets per the golden template. The skill’s Anti-Patterns section explicitly flags "Overly dynamic Make metaprogramming (eval/call/define) that reduces readability when explicit targets would be clearer".
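For illustration, the two styles contrast roughly as follows (a sketch of the pattern, not the actual generated Makefiles; the dynamic variant is shown commented out, since defining both would shadow the explicit targets):

```make
# Anti-pattern flagged by the skill: dynamic target generation via define/eval/call.
# (Commented out here; it would override the explicit targets below.)
#
# define build-template
# build-$(1):
# 	go build -o bin/$(1) ./cmd/$(1)
# endef
# $(foreach bin,api worker,$(eval $(call build-template,$(bin))))

# Preferred, golden-template style: explicit per-binary targets that read
# top-to-bottom and show up directly in grep and code review.
build-api:
	go build -o bin/api ./cmd/api

build-worker:
	go build -o bin/worker ./cmd/worker
```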
### 4.6 Actual Makefile Quality Comparison
Using Eval 2 (most complex scenario) as an example:
| Feature | With Skill | Without Skill |
|---|---|---|
| build target style | Explicit per-binary | $(eval $(call build-template)) dynamic |
| -ldflags placement | Explicit per build target | Embedded in GOBUILD variable (CGO_ENABLED=0 also embedded) |
| clean behavior | rm -rf bin/ coverage.out | rm -rf bin/ coverage.out + go clean -cache -testcache (over-cleanup) |
| lint installation | Separate install-tools, pinned | Embedded in lint target, @latest |
| cross-compile | build-linux target | None |
| cover-check threshold | COVER_MIN ?= 80 | None |
| help format | awk fixed-width, no color | grep+awk+sort, ANSI color |
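The "awk fixed-width, no color" help style referenced above typically relies on the common `## ` comment idiom (a sketch of that idiom; the skill's exact formatting may differ):

```make
.DEFAULT_GOAL := help

help: ## List targets annotated with a trailing "## description"
	@awk 'BEGIN {FS = ":.*## "} /^[a-zA-Z0-9_-]+:.*## / {printf "%-20s %s\n", $$1, $$2}' $(MAKEFILE_LIST)
```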
## 5. Token Cost-Effectiveness Analysis
### 5.1 Skill Size
go-makefile-writer is a multi-file skill (SKILL.md + references + scripts). What is loaded into context depends on which files the subagent reads.
| File | Lines | Words | Bytes | Est. Tokens |
|---|---|---|---|---|
| SKILL.md | 231 | 1,466 | 10,772 | ~1,960 |
| references/makefile-quality-guide.md | 268 | 1,211 | 8,837 | ~1,620 |
| references/golden/simple-project.mk | 101 | 396 | 2,864 | ~530 |
| references/golden/complex-project.mk | 193 | 777 | 6,559 | ~1,040 |
| references/pr-checklist.md | 71 | 429 | 2,980 | ~570 |
| scripts/discover_go_entrypoints.sh | 93 | 285 | 2,279 | ~380 |
| Description (always in context) | — | ~30 | — | ~40 |
Typical load scenarios:
| Scenario | Files read | Total Tokens |
|---|---|---|
| Simple project (Eval 1) | SKILL.md + quality-guide + simple-project.mk | ~4,110 |
| Complex project (Eval 2) | SKILL.md + quality-guide + complex-project.mk | ~4,620 |
| Refactor (Eval 3) | SKILL.md + quality-guide | ~3,580 |
| SKILL.md only (minimal) | SKILL.md | ~1,960 |
### 5.2 Token Cost for Quality Gain
| Metric | Value |
|---|---|
| With-skill pass rate | 100% (42/42) |
| Without-skill pass rate | 69.0% (29/42) |
| Pass-rate gain | +31.0 percentage points |
| Token cost per fixed assertion | ~150 tokens (SKILL.md only) / ~355 tokens (full) |
| Token cost per 1% pass-rate gain | ~63 tokens (SKILL.md only) / ~149 tokens (full) |
### 5.3 Token Segment Cost-Effectiveness
SKILL.md content split by functional module:
| Module | Est. Tokens | Related assertion delta | Cost-effectiveness |
|---|---|---|---|
| Naming Convention rules | ~100 | 2 (Eval 1 build-api/run-api) | Very high — 50 tok/assertion |
| Output Contract definition | ~300 | 3 (3 evals structured report) | High — 100 tok/assertion |
| install-tools version pinning rules | ~80 | 3 (3 evals pinned versions) | Very high — 27 tok/assertion |
| ci target specification | ~50 | 2 (Eval 1/2 ci naming) | Very high — 25 tok/assertion |
| tidy target specification | ~30 | 1 (Eval 1 tidy) | Very high — 30 tok/assertion |
| lint tool-check rules | ~40 | 1 (Eval 1 golangci-lint check) | High — 40 tok/assertion |
| docker-build variable spec | ~60 | 1 (Eval 2 IMAGE_NAME/TAG) | High — 60 tok/assertion |
| Anti-Patterns section | ~250 | Indirect (avoids eval/call metaprogramming) | Medium — no direct assertion |
| Go Version Awareness | ~150 | 0 (no version-diff scenario tested) | Low — no test scenario |
| Monorepo Support | ~200 | 0 (no monorepo tested) | Low — no test scenario |
| Golden templates (references) | ~530–1,040 | Indirect (Makefile structure consistency) | Medium — template-driven structure |
| Quality guide (references) | ~1,620 | Indirect (detailed implementation patterns) | Medium — provides concrete recipes |
### 5.4 High-Leverage vs Low-Leverage Instructions
High leverage (~600 tokens of SKILL.md → 12-assertion delta):

- Naming convention cmd/<name> → build-<name> (100 tok → 2)
- Output Contract definition (300 tok → 3) — template portion contributes most
- install-tools version pinning (80 tok → 3)
- ci target specification (50 tok → 2)
- tidy target (30 tok → 1)
- lint tool check (40 tok → 1)
Medium leverage (~310 tokens → mostly indirect contribution):

- Anti-Patterns section (250 tok) — avoided eval/call metaprogramming in Eval 2
- docker-build variable spec (60 tok → 1)
Low leverage (~350 tokens → 0 delta):

- Go Version Awareness (150 tok) — not tested
- Monorepo Support (200 tok) — not tested
References (~2,150–2,660 tokens → indirect contribution):

- Golden templates drive overall Makefile structure consistency
- Quality guide provides concrete recipe implementations
### 5.5 Token Efficiency Rating
| Rating | Conclusion |
|---|---|
| Overall ROI | Good — ~4,100–4,600 tokens for a +31.0 pp pass-rate gain |
| SKILL.md ROI | Excellent — ~1,960 tokens contains all high-leverage rules |
| High-leverage token share | ~31% (600/1,960) directly contributes 12/13 of the assertion delta |
| Low-leverage token share | ~18% (350/1,960) contributes nothing in this evaluation |
| Reference cost-effectiveness | Medium — ~2,150+ tokens provide indirect quality gain but no direct assertion delta |
### 5.6 Cost-Effectiveness Comparison with the git-commit Skill
| Metric | go-makefile-writer | git-commit |
|---|---|---|
| SKILL.md tokens | ~1,960 | ~1,120 |
| Total load tokens | ~4,100–4,600 | ~1,120 |
| Pass-rate gain | +31.0 pp | +22.7 pp |
| Tokens per 1% (SKILL.md) | ~63 tok | ~51 tok |
| Tokens per 1% (full) | ~149 tok | ~51 tok |
go-makefile-writer’s SKILL.md cost-effectiveness is close to git-commit’s, but the references add significant token overhead. Their value shows mainly in Makefile structure consistency and anti-pattern avoidance, quality dimensions that are hard to quantify with assertions.
## 6. Boundary Analysis vs Claude Base Model Capabilities
### 6.1 Base-Model Capabilities (No Increment from the Skill)
| Capability | Evidence |
|---|---|
| .DEFAULT_GOAL := help pattern | 3/3 scenarios correct |
| .PHONY declarations | 3/3 scenarios correct |
| -ldflags version injection | 3/3 scenarios correct |
| -race flag in test | 3/3 scenarios correct |
| docker-build/push targets | 1/1 scenario correct (Eval 2) |
| Multi-binary per-binary targets | 1/1 scenario correct (Eval 2) |
| build-srv → build-server rename | 1/1 scenario correct (Eval 3) |
| build-srv backward compat alias | 1/1 scenario correct (Eval 3) |
| bin/ output directory | 3/3 scenarios correct |
### 6.2 Base-Model Gaps (Filled by the Skill)
| Gap | Evidence | Risk level |
|---|---|---|
| Single-binary generic naming | Eval 1: build/run instead of build-api/run-api | Medium — requires rename when scaling |
| Missing or unpinned install-tools | 3/3 scenarios: no install-tools or @latest | High — CI not reproducible |
| No structured Output Report | 3/3 scenarios no report | Medium — no audit trail |
| Inconsistent ci target naming | 2/3 scenarios no ci or named check | Medium — team convention mismatch |
| Missing tidy target | 1/3 scenarios no tidy | Low — can run manually |
| Lint missing golangci-lint | 1/3 scenarios lint=vet+fmt-check | Medium — incomplete static analysis |
| eval/call metaprogramming | 1/3 scenarios used dynamic template | Low — functionally equivalent but less readable |
## 7. Overall Score
### 7.1 Dimension Scores
| Dimension | With Skill | Without Skill | Delta |
|---|---|---|---|
| Target set completeness | 5.0/5 | 3.5/5 | +1.5 |
| Naming convention compliance | 5.0/5 | 3.0/5 | +2.0 |
| Version injection & build quality | 5.0/5 | 4.5/5 | +0.5 |
| CI reproducibility (tool pinning) | 5.0/5 | 2.0/5 | +3.0 |
| Structured report | 5.0/5 | 1.0/5 | +4.0 |
| Maintainability & readability | 4.5/5 | 3.5/5 | +1.0 |
| Overall mean | 4.92/5 | 2.92/5 | +2.0 |
### 7.2 Weighted Total
| Dimension | Weight | Score | Weighted |
|---|---|---|---|
| Assertion pass rate (delta) | 25% | 9.5/10 | 2.38 |
| Naming convention & target design | 20% | 10/10 | 2.00 |
| CI reproducibility (tool pinning) | 15% | 10/10 | 1.50 |
| Structured report (Output Contract) | 15% | 10/10 | 1.50 |
| Token cost-effectiveness | 15% | 6.5/10 | 0.98 |
| Maintainability & anti-pattern avoidance | 10% | 8.0/10 | 0.80 |
| Weighted total | 100% | — | 9.16 |
## 8. Evaluation Artifacts
| Artifact | Path |
|---|---|
| Eval definitions | /tmp/makefile-eval/workspace/iteration-1/eval-*/eval_metadata.json |
| Eval 1 with-skill output | /tmp/makefile-eval/workspace/iteration-1/eval-1-simple-create/with_skill/outputs/ |
| Eval 1 without-skill output | /tmp/makefile-eval/workspace/iteration-1/eval-1-simple-create/without_skill/outputs/ |
| Eval 2 with-skill output | /tmp/makefile-eval/workspace/iteration-1/eval-2-multi-binary-docker/with_skill/outputs/ |
| Eval 2 without-skill output | /tmp/makefile-eval/workspace/iteration-1/eval-2-multi-binary-docker/without_skill/outputs/ |
| Eval 3 with-skill output | /tmp/makefile-eval/workspace/iteration-1/eval-3-refactor-defects/with_skill/outputs/ |
| Eval 3 without-skill output | /tmp/makefile-eval/workspace/iteration-1/eval-3-refactor-defects/without_skill/outputs/ |
| Grading results | /tmp/makefile-eval/workspace/iteration-1/eval-*/with_skill/grading.json |
| Benchmark summary | /tmp/makefile-eval/workspace/iteration-1/benchmark.json |
| Eval viewer | /tmp/makefile-eval/eval-review.html |