# go-makefile-writer Skill Evaluation Report
Evaluation framework: skill-creator
Evaluation date: 2026-03-11
Subject: go-makefile-writer
go-makefile-writer is a skill for creating or refactoring Makefiles in Go repositories, suitable for unifying build, test, lint, run, and CI entry points, and for converging existing Makefiles of varying quality with minimal effort. Its three main strengths are: automatically planning target sets and naming rules from repository structure for more stable, readable Makefiles; consistent version pinning and normative constraints for key targets like install-tools, ci, and tidy to reduce drift; and in Refactor mode, emphasis on minimal-diff and backward compatibility—fixing issues without breaking existing usage patterns.
## 1. Evaluation Overview
This evaluation assesses the go-makefile-writer skill along two dimensions: actual task performance and token cost-effectiveness. It uses three Makefile generation/refactoring scenarios of increasing complexity (single-binary creation, multi-binary + Docker creation, defective-Makefile refactoring). Each scenario runs in both with-skill and without-skill configurations, for 3 scenarios × 2 configurations = 6 independent subagent runs, scored against 42 assertions.
| Dimension | With Skill | Without Skill | Delta |
|---|---|---|---|
| Assertion pass rate | 42/42 (100%) | 29/42 (69.0%) | +31.0 percentage points |
| Naming convention compliance | 3/3 correct | 1/3 | Largest single-item delta |
| install-tools version pinning | 3/3 | 0/3 | Skill-only |
| Output Contract structured report | 3/3 | 0/3 | Skill-only |
| ci target naming | 3/3 | 1/3 | Skill consistent |
| tidy target | 3/3 | 2/3 | Skill consistent |
| Skill token cost (SKILL.md only) | ~1,960 tokens | 0 | — |
| Skill token cost (incl. references) | ~4,700 tokens | 0 | — |
| Token cost per 1% pass-rate gain | ~63 tokens (SKILL.md only) / ~149 tokens (full) | — | — |
## 2. Test Methodology
### 2.1 Scenario Design
| Scenario | Repository | Core focus | Assertions |
|---|---|---|---|
| Eval 1: simple-create | Single cmd/api, Go 1.23, no Makefile | Basic target set, naming convention, version injection, quality gates | 15 |
| Eval 2: multi-binary-docker | 3× cmd/*, Dockerfile, Go 1.25 | Multi-binary targets, Docker targets, cross-compilation | 15 |
| Eval 3: refactor-defects | Existing Makefile with 6 defects | Refactor mode, backward compatibility, defect fix coverage | 12 |
### 2.2 Execution
- Each scenario uses an independent Git repo with pre-seeded code and go.mod
- With-skill runs first read SKILL.md and its referenced materials (golden template, quality guide)
- Without-skill runs read no skill; Makefile is generated by model default behavior
- All runs execute in independent subagents in parallel
## 3. Assertion Pass Rate
### 3.1 Summary
| Scenario | Assertions | With Skill | Without Skill | Delta |
|---|---|---|---|---|
| Eval 1: simple-create | 15 | 15/15 (100%) | 8/15 (53.3%) | +46.7 pp |
| Eval 2: multi-binary-docker | 15 | 15/15 (100%) | 11/15 (73.3%) | +26.7 pp |
| Eval 3: refactor-defects | 12 | 12/12 (100%) | 10/12 (83.3%) | +16.7 pp |
| Total | 42 | 42/42 (100%) | 29/42 (69.0%) | +31.0 pp |
### 3.2 Classification of 13 Without-Skill Failed Assertions
| Failure type | Count | Evals | Notes |
|---|---|---|---|
| Naming convention non-compliance | 2 | Eval 1 | build/run instead of build-api/run-api, violates cmd/-path semantics |
| Missing install-tools or unpinned version | 3 | Eval 1/2/3 | Eval 1 missing install-tools; Eval 2 uses @latest; Eval 3 missing |
| Missing structured Output Report | 3 | Eval 1/2/3 | No structured report of Go version, layout, entrypoints, validation results |
| ci target missing or different name | 2 | Eval 1/2 | Eval 1 no ci; Eval 2 named check |
| Missing tidy target | 1 | Eval 1 | No go mod tidy + go mod verify |
| Lint tool check missing | 1 | Eval 1 | lint defined as vet+fmt-check, no golangci-lint |
| docker-build variable non-standard | 1 | Eval 2 | Uses DOCKER_IMAGE instead of IMAGE_NAME/IMAGE_TAG |
### 3.3 Trend: Skill Advantage Decreases with Scenario Complexity
| Scenario complexity | With-Skill advantage |
|---|---|
| Eval 1 (simple) | +46.7 pp (7 failures) |
| Eval 2 (medium) | +26.7 pp (4 failures) |
| Eval 3 (refactor) | +16.7 pp (2 failures) |
This is expected: Eval 3’s user prompt explicitly listed all 6 defects, effectively embedding the skill’s knowledge in the prompt. Eval 1’s prompt was minimal and most dependent on the skill’s conventions.
## 4. Dimension-by-Dimension Comparison
### 4.1 Naming Convention (cmd/-path semantics)
This is the largest single-item delta, contributing 2 assertion failures in Eval 1.
| Directory structure | With Skill | Without Skill |
|---|---|---|
| cmd/api/main.go | build-api, run-api | build, run |
| cmd/worker/main.go | build-worker, run-worker | build-worker, run-worker |
| cmd/server/main.go | build-server | build-server |
Analysis: Without-skill naturally used per-binary naming in multi-binary scenarios (Eval 2/3), but in the single-binary scenario defaulted to generic names. The skill’s rule "Map target names to cmd/ path semantics: cmd/<name> → build-<name>" ensures consistency.
Practical value: consistent naming enables:

- No target renaming when scaling from single to multi-binary
- Unified Makefile style across teams
- Predictable target names in CI scripts
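Under this rule, a repository with cmd/api/ and cmd/worker/ would yield targets along the following lines (a minimal sketch assuming the bin/ output directory seen throughout this evaluation; the recipes are illustrative, not the skill's golden template):

```make
# cmd/<name> -> build-<name> / run-<name>: names stay stable as binaries are added.
BIN_DIR := bin

build-api: ## Build the api binary
	go build -o $(BIN_DIR)/api ./cmd/api

build-worker: ## Build the worker binary
	go build -o $(BIN_DIR)/worker ./cmd/worker

run-api: build-api ## Build and run the api binary
	$(BIN_DIR)/api

.PHONY: build-api build-worker run-api
```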
### 4.2 install-tools and Version Pinning
| Dimension | With Skill | Without Skill |
|---|---|---|
| Eval 1 | install-tools pinned v1.62.2 | ❌ No install-tools |
| Eval 2 | install-tools pinned v1.62.2 | ❌ lint auto-installs @latest |
| Eval 3 | install-tools pinned v1.62.2 | ❌ No install-tools |
Analysis: Without-skill in Eval 2 embedded golangci-lint installation in the lint target (@latest auto-install). This works locally, but in CI it causes:

- Non-deterministic builds (different versions at different times)
- Re-installing tools on every CI run (slow)
The skill explicitly requires "Pin tool versions in install-tools for CI reproducibility".
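A pinned install-tools target in this spirit might look as follows (a sketch; v1.62.2 mirrors the version observed in the with-skill runs, and the install path is the standard one for golangci-lint v1.x):

```make
# Pinned tool versions keep CI installs reproducible; bump deliberately, not via @latest.
GOLANGCI_LINT_VERSION := v1.62.2

install-tools: ## Install pinned developer tools
	go install github.com/golangci/golangci-lint/cmd/golangci-lint@$(GOLANGCI_LINT_VERSION)

.PHONY: install-tools
```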
### 4.3 Output Contract (Structured Report)
This output is produced only in with-skill runs, each of which ends with a report containing:
| Report item | Eval 1 | Eval 2 | Eval 3 |
|---|---|---|---|
| Mode (Create/Refactor + rationale) | ✅ | ✅ | ✅ |
| Go version (from go.mod) | 1.23 | 1.25 | 1.24 |
| Layout (single-module/monorepo) | ✅ | ✅ | ✅ |
| Entrypoints discovered | cmd/api | cmd/api, cmd/worker, cmd/migrate | cmd/server, cmd/cli |
| New/updated targets list | ✅ | ✅ | ✅ |
| Deprecated/aliased targets | (none) | (none) | build-srv → build-server |
| Before vs After (Refactor) | N/A | N/A | ✅ |
| Validation results (make help/test/build) | ✅ | ✅ | ✅ |
| Anti-pattern checklist | ✅ | — | — |
Without-skill produced brief task summaries but no structured Output Contract.
Practical value: the Output Contract enables:

- Auditable Makefile changes (PR reviewers know what changed and why)
- Traceable backward compatibility in Refactor mode
- Documented CI validation results
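Reconstructed from the report items above, an Output Contract for a run like Eval 1 would resemble the following skeleton (field names and ordering are inferred from the table; the skill itself defines the exact format):

```text
Mode: Create (no existing Makefile found)
Go version: 1.23 (from go.mod)
Layout: single-module
Entrypoints: cmd/api
Targets added/updated: build-api, run-api, test, lint, ci, install-tools, ...
Deprecated/aliased targets: (none)
Validation: make help OK / make test OK / make build-api OK
```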
### 4.4 ci Target Naming
| Scenario | With Skill | Without Skill |
|---|---|---|
| Eval 1 | ci | ❌ No such target |
| Eval 2 | ci | check (similar but different name) |
| Eval 3 | ci | ci ✅ |
The skill specifies "CI target: ci (fmt-check + lint + test + cover-check in one pass)". Without-skill in Eval 2 used check, composed of fmt-check, vet, and test (missing cover-check), which does not fully match the standard CI pipeline.
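The specified composition maps directly onto an aggregate target (a sketch; fmt-check, lint, test, and cover-check are assumed to be defined elsewhere in the same Makefile):

```make
ci: fmt-check lint test cover-check ## Full quality pipeline in one pass, as run in CI

.PHONY: ci
```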
### 4.5 Impact of the Golden Template
With-skill Makefiles closely follow the golden template structure (variables → build → run → quality → ci → version → tools → clean → phony → help), while without-skill structures varied.
Key Eval 2 difference: Without-skill used $(eval $(call build-template,...)) dynamic metaprogramming for build targets; with-skill used explicit per-binary targets per the golden template. The skill’s Anti-Patterns section explicitly flags "Overly dynamic Make metaprogramming (eval/call/define) that reduces readability when explicit targets would be clearer".
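For illustration, the two styles contrast roughly as follows (a sketch of the pattern, not the actual generated Makefiles; the dynamic variant is shown commented out, since defining both would shadow the explicit targets):

```make
# Anti-pattern flagged by the skill: dynamic target generation via define/eval/call.
# (Commented out here; it would override the explicit targets below.)
#
# define build-template
# build-$(1):
# 	go build -o bin/$(1) ./cmd/$(1)
# endef
# $(foreach bin,api worker,$(eval $(call build-template,$(bin))))

# Preferred, golden-template style: explicit per-binary targets that read
# top-to-bottom and show up directly in grep and code review.
build-api:
	go build -o bin/api ./cmd/api

build-worker:
	go build -o bin/worker ./cmd/worker
```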
### 4.6 Actual Makefile Quality Comparison
Using Eval 2 (most complex scenario) as an example:
| Feature | With Skill | Without Skill |
|---|---|---|
| build target style | Explicit per-binary | $(eval $(call build-template)) dynamic |
| -ldflags placement | Explicit per build target | Embedded in GOBUILD variable (CGO_ENABLED=0 also embedded) |
| clean behavior | rm -rf bin/ coverage.out | rm -rf bin/ coverage.out + go clean -cache -testcache (over-cleanup) |
| lint installation | Separate install-tools, pinned | Embedded in lint target, @latest |
| cross-compile | build-linux target | None |
| cover-check threshold | COVER_MIN ?= 80 | None |
| help format | awk fixed-width, no color | grep+awk+sort, ANSI color |
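The "awk fixed-width, no color" help style referenced above typically relies on the common `## ` comment idiom (a sketch of that idiom; the skill's exact formatting may differ):

```make
.DEFAULT_GOAL := help

help: ## List targets annotated with a trailing "## description"
	@awk 'BEGIN {FS = ":.*## "} /^[a-zA-Z0-9_-]+:.*## / {printf "%-20s %s\n", $$1, $$2}' $(MAKEFILE_LIST)
```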
## 5. Token Cost-Effectiveness Analysis
### 5.1 Skill Size
go-makefile-writer is a multi-file skill (SKILL.md + references + scripts). What is loaded into context depends on which files the subagent reads.
| File | Lines | Words | Bytes | Est. Tokens |
|---|---|---|---|---|
| SKILL.md | 231 | 1,466 | 10,772 | ~1,960 |
| references/makefile-quality-guide.md | 268 | 1,211 | 8,837 | ~1,620 |
| references/golden/simple-project.mk | 101 | 396 | 2,864 | ~530 |
| references/golden/complex-project.mk | 193 | 777 | 6,559 | ~1,040 |
| references/pr-checklist.md | 71 | 429 | 2,980 | ~570 |
| scripts/discover_go_entrypoints.sh | 93 | 285 | 2,279 | ~380 |
| Description (always in context) | — | ~30 | — | ~40 |
Typical load scenarios:
| Scenario | Files read | Total Tokens |
|---|---|---|
| Simple project (Eval 1) | SKILL.md + quality-guide + simple-project.mk | ~4,110 |
| Complex project (Eval 2) | SKILL.md + quality-guide + complex-project.mk | ~4,620 |
| Refactor (Eval 3) | SKILL.md + quality-guide | ~3,580 |
| SKILL.md only (minimal) | SKILL.md | ~1,960 |
### 5.2 Token Cost for Quality Gain
| Metric | Value |
|---|---|
| With-skill pass rate | 100% (42/42) |
| Without-skill pass rate | 69.0% (29/42) |
| Pass-rate gain | +31.0 percentage points |
| Token cost per fixed assertion | ~150 tokens (SKILL.md only) / ~355 tokens (full) |
| Token cost per 1% pass-rate gain | ~63 tokens (SKILL.md only) / ~149 tokens (full) |
### 5.3 Token Segment Cost-Effectiveness
SKILL.md content split by functional module:
| Module | Est. Tokens | Related assertion delta | Cost-effectiveness |
|---|---|---|---|
| Naming Convention rules | ~100 | 2 (Eval 1 build-api/run-api) | Very high — 50 tok/assertion |
| Output Contract definition | ~300 | 3 (3 evals structured report) | High — 100 tok/assertion |
| install-tools version pinning rules | ~80 | 3 (3 evals pinned versions) | Very high — 27 tok/assertion |
| ci target specification | ~50 | 2 (Eval 1/2 ci naming) | Very high — 25 tok/assertion |
| tidy target specification | ~30 | 1 (Eval 1 tidy) | Very high — 30 tok/assertion |
| lint tool-check rules | ~40 | 1 (Eval 1 golangci-lint check) | High — 40 tok/assertion |
| docker-build variable spec | ~60 | 1 (Eval 2 IMAGE_NAME/TAG) | High — 60 tok/assertion |
| Anti-Patterns section | ~250 | Indirect (avoids eval/call metaprogramming) | Medium — no direct assertion |
| Go Version Awareness | ~150 | 0 (no version-diff scenario tested) | Low — no test scenario |
| Monorepo Support | ~200 | 0 (no monorepo tested) | Low — no test scenario |
| Golden templates (references) | ~530–1,040 | Indirect (Makefile structure consistency) | Medium — template-driven structure |
| Quality guide (references) | ~1,620 | Indirect (detailed implementation patterns) | Medium — provides concrete recipes |
### 5.4 High-Leverage vs Low-Leverage Instructions
High leverage (~600 tokens of SKILL.md → 12-assertion delta):

- Naming convention cmd/<name> → build-<name> (100 tok → 2)
- Output Contract definition (300 tok → 3) — template portion contributes most
- install-tools version pinning (80 tok → 3)
- ci target specification (50 tok → 2)
- tidy target (30 tok → 1)
- lint tool check (40 tok → 1)
Medium leverage (~310 tokens → mostly indirect contribution):

- Anti-Patterns section (250 tok) — avoided eval/call metaprogramming in Eval 2
- docker-build variable spec (60 tok → 1)
Low leverage (~350 tokens → 0 delta):

- Go Version Awareness (150 tok) — not tested
- Monorepo Support (200 tok) — not tested
References (~2,150–2,660 tokens → indirect contribution):

- Golden templates drive overall Makefile structure consistency
- Quality guide provides concrete recipe implementations
### 5.5 Token Efficiency Rating
| Rating | Conclusion |
|---|---|
| Overall ROI | Good — ~4,100–4,600 tokens for a +31.0 pp pass-rate gain |
| SKILL.md ROI | Excellent — ~1,960 tokens contains all high-leverage rules |
| High-leverage token share | ~31% (600/1,960) directly contributes 12/13 of the assertion delta |
| Low-leverage token share | ~18% (350/1,960) contributes nothing in this evaluation |
| Reference cost-effectiveness | Medium — ~2,150+ tokens provide indirect quality gain but no direct assertion delta |
### 5.6 Cost-Effectiveness Comparison with the git-commit Skill
| Metric | go-makefile-writer | git-commit |
|---|---|---|
| SKILL.md tokens | ~1,960 | ~1,120 |
| Total load tokens | ~4,100–4,600 | ~1,120 |
| Pass-rate gain | +31.0 pp | +22.7 pp |
| Tokens per 1% (SKILL.md) | ~63 tok | ~51 tok |
| Tokens per 1% (full) | ~149 tok | ~51 tok |
go-makefile-writer’s SKILL.md cost-effectiveness is close to git-commit’s, but the references add significant token overhead. Their value shows mainly in Makefile structure consistency and anti-pattern avoidance, quality dimensions that are hard to quantify with assertions.
## 6. Boundary Analysis vs Claude Base Model Capabilities
### 6.1 Base-Model Capabilities (No Increment from the Skill)
| Capability | Evidence |
|---|---|
| .DEFAULT_GOAL := help pattern | 3/3 scenarios correct |
| .PHONY declarations | 3/3 scenarios correct |
| -ldflags version injection | 3/3 scenarios correct |
| -race flag in test | 3/3 scenarios correct |
| docker-build/push targets | 1/1 scenario correct (Eval 2) |
| Multi-binary per-binary targets | 1/1 scenario correct (Eval 2) |
| build-srv → build-server rename | 1/1 scenario correct (Eval 3) |
| build-srv backward compat alias | 1/1 scenario correct (Eval 3) |
| bin/ output directory | 3/3 scenarios correct |
### 6.2 Base-Model Gaps (Filled by the Skill)
| Gap | Evidence | Risk level |
|---|---|---|
| Single-binary generic naming | Eval 1: build/run instead of build-api/run-api | Medium — requires rename when scaling |
| Missing or unpinned install-tools | 3/3 scenarios: no install-tools or @latest | High — CI not reproducible |
| No structured Output Report | 3/3 scenarios no report | Medium — no audit trail |
| Inconsistent ci target naming | 2/3 scenarios no ci or named check | Medium — team convention mismatch |
| Missing tidy target | 1/3 scenarios no tidy | Low — can run manually |
| Lint missing golangci-lint | 1/3 scenarios lint=vet+fmt-check | Medium — incomplete static analysis |
| eval/call metaprogramming | 1/3 scenarios used dynamic template | Low — functionally equivalent but less readable |
## 7. Overall Score
### 7.1 Dimension Scores
| Dimension | With Skill | Without Skill | Delta |
|---|---|---|---|
| Target set completeness | 5.0/5 | 3.5/5 | +1.5 |
| Naming convention compliance | 5.0/5 | 3.0/5 | +2.0 |
| Version injection & build quality | 5.0/5 | 4.5/5 | +0.5 |
| CI reproducibility (tool pinning) | 5.0/5 | 2.0/5 | +3.0 |
| Structured report | 5.0/5 | 1.0/5 | +4.0 |
| Maintainability & readability | 4.5/5 | 3.5/5 | +1.0 |
| Overall mean | 4.92/5 | 2.92/5 | +2.0 |
### 7.2 Weighted Total
| Dimension | Weight | Score | Weighted |
|---|---|---|---|
| Assertion pass rate (delta) | 25% | 9.5/10 | 2.38 |
| Naming convention & target design | 20% | 10/10 | 2.00 |
| CI reproducibility (tool pinning) | 15% | 10/10 | 1.50 |
| Structured report (Output Contract) | 15% | 10/10 | 1.50 |
| Token cost-effectiveness | 15% | 6.5/10 | 0.98 |
| Maintainability & anti-pattern avoidance | 10% | 8.0/10 | 0.80 |
| Weighted total | 100% | — | 9.16 |
## 8. Evaluation Artifacts
| Artifact | Path |
|---|---|
| Eval definitions | /tmp/makefile-eval/workspace/iteration-1/eval-*/eval_metadata.json |
| Eval 1 with-skill output | /tmp/makefile-eval/workspace/iteration-1/eval-1-simple-create/with_skill/outputs/ |
| Eval 1 without-skill output | /tmp/makefile-eval/workspace/iteration-1/eval-1-simple-create/without_skill/outputs/ |
| Eval 2 with-skill output | /tmp/makefile-eval/workspace/iteration-1/eval-2-multi-binary-docker/with_skill/outputs/ |
| Eval 2 without-skill output | /tmp/makefile-eval/workspace/iteration-1/eval-2-multi-binary-docker/without_skill/outputs/ |
| Eval 3 with-skill output | /tmp/makefile-eval/workspace/iteration-1/eval-3-refactor-defects/with_skill/outputs/ |
| Eval 3 without-skill output | /tmp/makefile-eval/workspace/iteration-1/eval-3-refactor-defects/without_skill/outputs/ |
| Grading results | /tmp/makefile-eval/workspace/iteration-1/eval-*/with_skill/grading.json |
| Benchmark summary | /tmp/makefile-eval/workspace/iteration-1/benchmark.json |
| Eval viewer | /tmp/makefile-eval/eval-review.html |