
go-makefile-writer Skill Evaluation Report

Evaluation framework: skill-creator
Evaluation date: 2026-03-11
Subject: go-makefile-writer


go-makefile-writer is a skill for creating or refactoring Makefiles in Go repositories. It unifies build, test, lint, run, and CI entry points, and converges existing Makefiles of varying quality with minimal effort. Its three main strengths are: automatically planning target sets and naming rules from the repository structure, yielding more stable and readable Makefiles; consistent version pinning and normative constraints for key targets such as install-tools, ci, and tidy, reducing drift; and, in Refactor mode, an emphasis on minimal diffs and backward compatibility — fixing issues without breaking existing usage patterns.

1. Evaluation Overview

This evaluation assesses the go-makefile-writer skill along two dimensions: actual task performance and token cost-effectiveness. It uses three Makefile generation/refactoring scenarios of increasing complexity (single-binary creation, multi-binary + Docker creation, defective-Makefile refactoring). Each scenario runs in both with-skill and without-skill configurations, for 3 scenarios × 2 configurations = 6 independent subagent runs, scored against 42 assertions.

| Dimension | With Skill | Without Skill | Delta |
| --- | --- | --- | --- |
| Assertion pass rate | 42/42 (100%) | 29/42 (69.0%) | +31.0 percentage points |
| Naming convention compliance | 3/3 correct | 1/3 | Largest single-item delta |
| install-tools version pinning | 3/3 | 0/3 | Skill-only |
| Output Contract structured report | 3/3 | 0/3 | Skill-only |
| ci target naming | 3/3 | 1/3 | Skill consistent |
| tidy target | 3/3 | 2/3 | Skill consistent |
| Skill token cost (SKILL.md only) | ~1,960 tokens | 0 | |
| Skill token cost (incl. references) | ~4,700 tokens | 0 | |
| Token cost per 1% pass-rate gain | ~63 tokens (SKILL.md only) / ~152 tokens (full) | | |

2. Test Methodology

2.1 Scenario Design

| Scenario | Repository | Core focus | Assertions |
| --- | --- | --- | --- |
| Eval 1: simple-create | Single cmd/api, Go 1.23, no Makefile | Basic target set, naming convention, version injection, quality gates | 15 |
| Eval 2: multi-binary-docker | cmd/*, Dockerfile, Go 1.25 | Multi-binary targets, Docker targets, cross-compilation | 15 |
| Eval 3: refactor-defects | Existing Makefile with 6 defects | Refactor mode, backward compatibility, defect fix coverage | 12 |

2.2 Execution

  • Each scenario uses an independent Git repo with pre-seeded code and go.mod
  • With-skill runs first read SKILL.md and its referenced materials (golden template, quality guide)
  • Without-skill runs read no skill; Makefile is generated by model default behavior
  • All runs execute in independent subagents in parallel

3. Assertion Pass Rate

3.1 Summary

| Scenario | Assertions | With Skill | Without Skill | Delta |
| --- | --- | --- | --- | --- |
| Eval 1: simple-create | 15 | 15/15 (100%) | 8/15 (53.3%) | +46.7% |
| Eval 2: multi-binary-docker | 15 | 15/15 (100%) | 11/15 (73.3%) | +26.7% |
| Eval 3: refactor-defects | 12 | 12/12 (100%) | 10/12 (83.3%) | +16.7% |
| Total | 42 | 42/42 (100%) | 29/42 (69.0%) | +31.0% |

3.2 Classification of 13 Without-Skill Failed Assertions

| Failure type | Count | Evals | Notes |
| --- | --- | --- | --- |
| Naming convention non-compliance | 2 | Eval 1 | build/run instead of build-api/run-api; violates cmd/-path semantics |
| Missing install-tools or unpinned version | 3 | Eval 1/2/3 | Eval 1 missing install-tools; Eval 2 uses @latest; Eval 3 missing |
| Missing structured Output Report | 3 | Eval 1/2/3 | No structured report of Go version, layout, entrypoints, validation results |
| ci target missing or differently named | 2 | Eval 1/2 | Eval 1 has no ci; Eval 2 named it check |
| Missing tidy target | 1 | Eval 1 | No go mod tidy + go mod verify |
| Missing lint tool check | 1 | Eval 1 | lint defined as vet+fmt-check, no golangci-lint |
| Non-standard docker-build variables | 1 | Eval 2 | Uses DOCKER_IMAGE instead of IMAGE_NAME/IMAGE_TAG |

3.3 Trend: Skill Advantage Decreases with Scenario Complexity

| Scenario complexity | With-Skill advantage |
| --- | --- |
| Eval 1 (simple) | +46.7% (7 failures) |
| Eval 2 (medium) | +26.7% (4 failures) |
| Eval 3 (refactor) | +16.7% (2 failures) |

This is expected: Eval 3’s user prompt explicitly listed all 6 defects, effectively embedding the skill’s knowledge in the prompt. Eval 1’s prompt was minimal and most dependent on the skill’s conventions.


4. Dimension-by-Dimension Comparison

4.1 Naming Convention (cmd/-path semantics)

This is the largest single-item delta, contributing 2 assertion failures in Eval 1.

| Directory structure | With Skill | Without Skill |
| --- | --- | --- |
| cmd/api/main.go | build-api, run-api | build, run |
| cmd/worker/main.go | build-worker, run-worker | build-worker, run-worker |
| cmd/server/main.go | build-server | build-server |
Analysis: Without-skill runs naturally used per-binary naming in the multi-binary scenarios (Eval 2/3), but defaulted to generic names in the single-binary scenario. The skill’s rule "Map target names to cmd/ path semantics: cmd/<name> → build-<name>" ensures consistency.

Practical value: Consistent naming enables:

  • No target renaming when scaling from single to multi-binary
  • A unified Makefile style across teams
  • Predictable target names in CI scripts
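A minimal sketch of the cmd/-path naming rule, using a hypothetical two-binary layout (cmd/api and cmd/worker, matching the eval scenarios) — target names derive from the cmd/ directory name, so adding a binary never forces a rename:

```makefile
BIN_DIR := bin

# One build/run pair per entrypoint under cmd/; names follow the directory.
build-api:
	go build -o $(BIN_DIR)/api ./cmd/api

build-worker:
	go build -o $(BIN_DIR)/worker ./cmd/worker

run-api: build-api
	$(BIN_DIR)/api

.PHONY: build-api build-worker run-api
```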

4.2 install-tools and Version Pinning

| Dimension | With Skill | Without Skill |
| --- | --- | --- |
| Eval 1 | install-tools pinned v1.62.2 | ❌ No install-tools |
| Eval 2 | install-tools pinned v1.62.2 | ❌ lint auto-installs @latest |
| Eval 3 | install-tools pinned v1.62.2 | ✅ install-tools pinned v1.62.2 |

Analysis: The without-skill run in Eval 2 embedded golangci-lint installation in the lint target (@latest auto-install). This works locally, but in CI it causes:

  • Non-deterministic builds (different versions at different times)
  • Re-installing tools on every CI run (slow)

The skill explicitly requires "Pin tool versions in install-tools for CI reproducibility".
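A sketch of a pinned install-tools target in the spirit of this rule (the version matches the eval tables; the install path is golangci-lint's published v1.x module path):

```makefile
# Pin tool versions here so CI builds are reproducible.
GOLANGCI_LINT_VERSION := v1.62.2

install-tools:
	go install github.com/golangci/golangci-lint/cmd/golangci-lint@$(GOLANGCI_LINT_VERSION)

.PHONY: install-tools
```

Bumping the tool is then a one-line, reviewable diff rather than an implicit @latest drift.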

4.3 Output Contract (Structured Report)

This is a skill-only differentiated output. Each with-skill run produces a report containing:

| Report item | Eval 1 | Eval 2 | Eval 3 |
| --- | --- | --- | --- |
| Mode (Create/Refactor + rationale) | | | |
| Go version (from go.mod) | 1.23 | 1.25 | 1.24 |
| Layout (single-module/monorepo) | | | |
| Entrypoints discovered | cmd/api | cmd/api, cmd/worker, cmd/migrate | cmd/server, cmd/cli |
| New/updated targets list | | | |
| Deprecated/aliased targets | (none) | (none) | build-srv → build-server |
| Before vs After (Refactor) | N/A | N/A | |
| Validation results (make help/test/build) | | | |
| Anti-pattern checklist | | | |

Without-skill produced brief task summaries but no structured Output Contract.

Practical value: The Output Contract enables:

  • Auditable Makefile changes (PR reviewers know what changed and why)
  • Traceable backward compatibility in Refactor mode
  • Documented CI validation results

4.4 ci Target Naming

| Scenario | With Skill | Without Skill |
| --- | --- | --- |
| Eval 1 | ci | ❌ No such target |
| Eval 2 | ci | check (similar but different name) |
| Eval 3 | ci | ci |

The skill specifies "CI target: ci (fmt-check + lint + test + cover-check in one pass)". The without-skill Eval 2 run instead used a check target whose recipe ran fmt-check, vet, and test (missing cover-check), not fully aligned with the standard CI pipeline.
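Sketched per that specification (assuming fmt-check, lint, test, and cover-check already exist as targets in the Makefile):

```makefile
# One-pass CI gate: each prerequisite is its own quality target,
# so CI scripts only ever need to invoke `make ci`.
ci: fmt-check lint test cover-check

.PHONY: ci
```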

4.5 Impact of Golden Template

With-skill Makefiles closely follow the golden template structure (variables → build → run → quality → ci → version → tools → clean → phony → help), while without-skill structures varied.

Key Eval 2 difference: Without-skill used $(eval $(call build-template,...)) dynamic metaprogramming for build targets; with-skill used explicit per-binary targets per the golden template. The skill’s Anti-Patterns section explicitly flags "Overly dynamic Make metaprogramming (eval/call/define) that reduces readability when explicit targets would be clearer".
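An illustrative contrast of the two styles (hypothetical target names; not taken verbatim from either run):

```makefile
# Dynamic form (anti-pattern): the target list is invisible until
# make expands the template, which hurts grep-ability and review.
define build-template
build-$(1):
	go build -o bin/$(1) ./cmd/$(1)
endef
$(eval $(call build-template,api))

# Explicit form (preferred by the golden template): the target reads
# the same in the Makefile source, in `make help`, and in CI logs.
build-worker:
	go build -o bin/worker ./cmd/worker
```

The explicit form costs a few duplicated lines per binary but keeps every target discoverable at a glance.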

4.6 Actual Makefile Quality Comparison

Using Eval 2 (most complex scenario) as an example:

| Feature | With Skill | Without Skill |
| --- | --- | --- |
| build target style | Explicit per-binary | $(eval $(call build-template)) dynamic |
| -ldflags placement | Explicit per build target | Embedded in GOBUILD variable (CGO_ENABLED=0 also embedded) |
| clean behavior | rm -rf bin/ coverage.out | rm -rf bin/ coverage.out + go clean -cache -testcache (over-cleanup) |
| cross-compile | build-linux target | None |
| cover-check threshold | COVER_MIN ?= 80 | None |
| lint installation | Separate install-tools, pinned | Embedded in lint target, @latest |
| help format | awk fixed-width, no color | grep+awk+sort, ANSI color |
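A common awk-based implementation of the fixed-width, colorless help style (this exact recipe is an assumption, not quoted from the skill; it presumes targets are annotated with trailing ## comments):

```makefile
# Self-documenting help: list every `target: ## description` line,
# left-aligned in a fixed-width column, no ANSI color.
help: ## Show available targets
	@awk 'BEGIN {FS = ":.*##"} /^[a-zA-Z0-9_-]+:.*##/ {printf "  %-20s %s\n", $$1, $$2}' $(MAKEFILE_LIST)

.PHONY: help
```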

5. Token Cost-Effectiveness Analysis

5.1 Skill Size

go-makefile-writer is a multi-file skill (SKILL.md + references + scripts). What is loaded into context depends on which files the subagent reads.

| File | Lines | Words | Bytes | Est. Tokens |
| --- | --- | --- | --- | --- |
| SKILL.md | 231 | 1,466 | 10,772 | ~1,960 |
| references/makefile-quality-guide.md | 268 | 1,211 | 8,837 | ~1,620 |
| references/golden/simple-project.mk | 101 | 396 | 2,864 | ~530 |
| references/golden/complex-project.mk | 193 | 777 | 6,559 | ~1,040 |
| references/pr-checklist.md | 71 | 429 | 2,980 | ~570 |
| scripts/discover_go_entrypoints.sh | 93 | 285 | 2,279 | ~380 |
| Description (always in context) | | ~30 | | ~40 |

Typical load scenarios:

| Scenario | Files read | Total Tokens |
| --- | --- | --- |
| Simple project (Eval 1) | SKILL.md + quality-guide + simple-project.mk | ~4,110 |
| Complex project (Eval 2) | SKILL.md + quality-guide + complex-project.mk | ~4,620 |
| Refactor (Eval 3) | SKILL.md + quality-guide | ~3,580 |
| SKILL.md only (minimal) | SKILL.md | ~1,960 |

5.2 Token Cost for Quality Gain

| Metric | Value |
| --- | --- |
| With-skill pass rate | 100% (42/42) |
| Without-skill pass rate | 69.0% (29/42) |
| Pass-rate gain | +31.0 percentage points |
| Token cost per assertion fixed | ~150 tokens (SKILL.md only) / ~355 tokens (full) |
| Token cost per 1% pass-rate gain | ~63 tokens (SKILL.md only) / ~149 tokens (full) |
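The per-unit figures above are simple ratios of the Section 5.1 token estimates to the measured deltas; a quick check (small rounding differences against the table are expected):

```python
# Reproduce the cost-effectiveness ratios from Section 5.2.
skill_md_tokens = 1960       # SKILL.md only (Section 5.1)
full_load_tokens = 4620      # largest typical load: Eval 2 file set
assertions_fixed = 42 - 29   # 13 without-skill failures eliminated
pass_rate_gain = 31.0        # percentage points

print(round(skill_md_tokens / assertions_fixed))   # ~151 tokens/assertion (SKILL.md)
print(round(full_load_tokens / assertions_fixed))  # ~355 tokens/assertion (full)
print(round(skill_md_tokens / pass_rate_gain))     # ~63 tokens per 1% gain (SKILL.md)
print(round(full_load_tokens / pass_rate_gain))    # ~149 tokens per 1% gain (full)
```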

5.3 Token Segment Cost-Effectiveness

SKILL.md content split by functional module:

| Module | Est. Tokens | Related assertion delta | Cost-effectiveness |
| --- | --- | --- | --- |
| Naming Convention rules | ~100 | 2 (Eval 1 build-api/run-api) | Very high — 50 tok/assertion |
| Output Contract definition | ~300 | 3 (structured report in all 3 evals) | High — 100 tok/assertion |
| install-tools version pinning rules | ~80 | 3 (pinned versions in all 3 evals) | Very high — 27 tok/assertion |
| ci target specification | ~50 | 2 (Eval 1/2 ci naming) | Very high — 25 tok/assertion |
| tidy target specification | ~30 | 1 (Eval 1 tidy) | Very high — 30 tok/assertion |
| lint tool-check rules | ~40 | 1 (Eval 1 golangci-lint check) | High — 40 tok/assertion |
| docker-build variable spec | ~60 | 1 (Eval 2 IMAGE_NAME/TAG) | High — 60 tok/assertion |
| Anti-Patterns section | ~250 | Indirect (avoids eval/call metaprogramming) | Medium — no direct assertion |
| Go Version Awareness | ~150 | 0 (no version-diff scenario tested) | Low — no test scenario |
| Monorepo Support | ~200 | 0 (no monorepo tested) | Low — no test scenario |
| Golden templates (references) | ~530–1,040 | Indirect (Makefile structure consistency) | Medium — template-driven structure |
| Quality guide (references) | ~1,620 | Indirect (detailed implementation patterns) | Medium — provides concrete recipes |

5.4 High-Leverage vs Low-Leverage Instructions

High leverage (~360 tokens of SKILL.md → 12 assertion delta):

  • Naming convention cmd/<name> → build-<name> (100 tok → 2)
  • Output Contract definition (300 tok → 3) — template portion contributes most
  • install-tools version pinning (80 tok → 3)
  • ci target specification (50 tok → 2)
  • tidy target (30 tok → 1)
  • lint tool check (40 tok → 1)

Medium leverage (~310 tokens → indirect contribution):

  • Anti-Patterns section (250 tok) — avoided eval/call metaprogramming in Eval 2
  • docker-build variable spec (60 tok → 1)

Low leverage (~350 tokens → 0 delta):

  • Go Version Awareness (150 tok) — not tested
  • Monorepo Support (200 tok) — not tested

References (~2,150–2,660 tokens → indirect contribution):

  • Golden templates drive overall Makefile structure consistency
  • Quality guide provides concrete recipe implementations

5.5 Token Efficiency Rating

| Rating | Conclusion |
| --- | --- |
| Overall ROI | Good — ~4,100–4,600 tokens for +31% pass rate |
| SKILL.md ROI | Excellent — ~1,960 tokens contains all high-leverage rules |
| High-leverage token share | ~18% (360/1,960) directly contributes 12/13 of the assertion delta |
| Low-leverage token share | ~18% (350/1,960) contributes nothing in this evaluation |
| Reference cost-effectiveness | Medium — ~2,150+ tokens provide indirect quality gain but no direct assertion delta |

5.6 Comparison with git-commit Skill Cost-Effectiveness

| Metric | go-makefile-writer | git-commit |
| --- | --- | --- |
| SKILL.md tokens | ~1,960 | ~1,120 |
| Total load tokens | ~4,100–4,600 | ~1,120 |
| Pass-rate gain | +31.0% | +22.7% |
| Tokens per 1% (SKILL.md) | ~63 | ~51 |
| Tokens per 1% (full) | ~149 | ~51 |

go-makefile-writer’s SKILL.md cost-effectiveness is close to git-commit’s, but its references add significant token overhead. The references’ value shows mainly in Makefile structure consistency and anti-pattern avoidance — quality dimensions that are hard to quantify with assertions.


6. Boundary Analysis vs Claude Base Model Capabilities

6.1 Base Model Capabilities (No Skill Increment)

| Capability | Evidence |
| --- | --- |
| .DEFAULT_GOAL := help pattern | 3/3 scenarios correct |
| .PHONY declarations | 3/3 scenarios correct |
| -ldflags version injection | 3/3 scenarios correct |
| -race flag in test | 3/3 scenarios correct |
| docker-build/push targets | 1/1 scenario correct (Eval 2) |
| Multi-binary per-binary targets | 1/1 scenario correct (Eval 2) |
| build-srv → build-server rename | 1/1 scenario correct (Eval 3) |
| build-srv backward-compat alias | 1/1 scenario correct (Eval 3) |
| bin/ output directory | 3/3 scenarios correct |

6.2 Base Model Gaps (Skill Fills)

| Gap | Evidence | Risk level |
| --- | --- | --- |
| Single-binary generic naming | Eval 1: build/run instead of build-api/run-api | Medium — requires rename when scaling |
| Missing or unpinned install-tools | 3/3 scenarios: no install-tools or @latest | High — CI not reproducible |
| No structured Output Report | 3/3 scenarios: no report | Medium — no audit trail |
| Inconsistent ci target naming | 2/3 scenarios: no ci, or named check | Medium — team convention mismatch |
| Missing tidy target | 1/3 scenarios: no tidy | Low — can run manually |
| Lint missing golangci-lint | 1/3 scenarios: lint = vet + fmt-check | Medium — incomplete static analysis |
| eval/call metaprogramming | 1/3 scenarios used a dynamic template | Low — functionally equivalent but less readable |

7. Overall Score

7.1 Dimension Scores

| Dimension | With Skill | Without Skill | Delta |
| --- | --- | --- | --- |
| Target set completeness | 5.0/5 | 3.5/5 | +1.5 |
| Naming convention compliance | 5.0/5 | 3.0/5 | +2.0 |
| Version injection & build quality | 5.0/5 | 4.5/5 | +0.5 |
| CI reproducibility (tool pinning) | 5.0/5 | 2.0/5 | +3.0 |
| Structured report | 5.0/5 | 1.0/5 | +4.0 |
| Maintainability & readability | 4.5/5 | 3.5/5 | +1.0 |
| Overall mean | 4.92/5 | 2.92/5 | +2.0 |

7.2 Weighted Total

| Dimension | Weight | Score | Weighted |
| --- | --- | --- | --- |
| Assertion pass rate (delta) | 25% | 9.5/10 | 2.38 |
| Naming convention & target design | 20% | 10/10 | 2.00 |
| CI reproducibility (tool pinning) | 15% | 10/10 | 1.50 |
| Structured report (Output Contract) | 15% | 10/10 | 1.50 |
| Token cost-effectiveness | 15% | 6.5/10 | 0.98 |
| Maintainability & anti-pattern avoidance | 10% | 8.0/10 | 0.80 |
| Weighted total | 100% | | 9.16/10 |

8. Evaluation Artifacts

| Artifact | Path |
| --- | --- |
| Eval definitions | /tmp/makefile-eval/workspace/iteration-1/eval-*/eval_metadata.json |
| Eval 1 with-skill output | /tmp/makefile-eval/workspace/iteration-1/eval-1-simple-create/with_skill/outputs/ |
| Eval 1 without-skill output | /tmp/makefile-eval/workspace/iteration-1/eval-1-simple-create/without_skill/outputs/ |
| Eval 2 with-skill output | /tmp/makefile-eval/workspace/iteration-1/eval-2-multi-binary-docker/with_skill/outputs/ |
| Eval 2 without-skill output | /tmp/makefile-eval/workspace/iteration-1/eval-2-multi-binary-docker/without_skill/outputs/ |
| Eval 3 with-skill output | /tmp/makefile-eval/workspace/iteration-1/eval-3-refactor-defects/with_skill/outputs/ |
| Eval 3 without-skill output | /tmp/makefile-eval/workspace/iteration-1/eval-3-refactor-defects/without_skill/outputs/ |
| Grading results | /tmp/makefile-eval/workspace/iteration-1/eval-*/with_skill/grading.json |
| Benchmark summary | /tmp/makefile-eval/workspace/iteration-1/benchmark.json |
| Eval viewer | /tmp/makefile-eval/eval-review.html |