git-commit Skill Evaluation Report

Evaluation Framework: skill-creator
Evaluation Date: 2026-03-25
Subject: git-commit

git-commit is a safety-enhanced commit workflow skill that runs repository state checks, staging analysis, secret scanning, ecosystem-aware quality gates, and Angular (Conventional Commits) message generation before executing git commit. Its three standout strengths: a mandatory 7-step workflow (Preflight → Staging → Secret Gate → Quality Gate → Compose → Commit → Report) that replaces the baseline model's habit of staging and committing directly; regex-based secret scanning with layered triage that precisely blocked a hardcoded API key in Scenario 2; and ecosystem-aware quality gates (Go/Node/Python/Java/Rust) that ensure vet/test/lint runs before every commit — steps the baseline skipped in all three scenarios.

1. Skill Overview

git-commit is a structured commit safety skill defining a mandatory 7-step workflow, Hard Rules, five ecosystem quality gates, secret-scanning regexes, and a scope-discovery mechanism. Its goal: every commit passes complete safety preflight, logical staging, secret scanning, quality verification, and message normalization before execution.
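
The ordering guarantee described above can be illustrated with a small gate pipeline. This is a minimal sketch, not the skill's implementation: the step names come from the report, while `StepResult`, `run_workflow`, and the stop-at-first-failure policy are assumptions made for illustration.

```python
from typing import Callable, NamedTuple


class StepResult(NamedTuple):
    ok: bool
    detail: str


# Step names as listed in the report; everything else here is hypothetical.
STEP_ORDER = (
    "Preflight", "Staging", "Secret Gate", "Quality Gate",
    "Compose", "Commit", "Report",
)


def run_workflow(steps: dict[str, Callable[[], StepResult]]) -> list[tuple[str, StepResult]]:
    """Run the steps in fixed order and stop at the first one that fails."""
    log = []
    for name in STEP_ORDER:
        result = steps[name]()
        log.append((name, result))
        if not result.ok:
            break  # a failed gate means later steps (including Commit) never run
    return log


if __name__ == "__main__":
    steps = {name: (lambda: StepResult(True, "ok")) for name in STEP_ORDER}
    steps["Secret Gate"] = lambda: StepResult(False, "hardcoded key found")
    for name, result in run_workflow(steps):
        print(f"{name}: {'PASS' if result.ok else 'BLOCKED'} ({result.detail})")
```

With a failing Secret Gate, the log ends at that step, which mirrors the blocked commit in Scenario 2.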

Core Components:

File	Lines	Responsibility
`SKILL.md`	184	Main skill definition (7-step workflow, Hard Rules, secret regexes, scope discovery)
`references/quality-gate-go.md`	25	Go quality gate (go vet + go test, scaled by package count)
`references/quality-gate-node.md`	40	Node.js/TS quality gate (package manager detection, lint, tsc, test)
`references/quality-gate-python.md`	53	Python quality gate (ruff/flake8, mypy/pyright, pytest)
`references/quality-gate-java.md`	45	Java/Kotlin quality gate (Maven/Gradle, multi-module aware)
`references/quality-gate-rust.md`	32	Rust quality gate (clippy, cargo test, workspace aware)
`scripts/tests/test_skill_contract.py`	187	Contract tests (frontmatter, required sections, key content, reference integrity)

2. Test Design

2.1 Scenario Definitions

#	Scenario	Repo Type	Core Challenge	Expected Outcome
1	Clean Go feature	Go calculator	2-file single-concern change, all checks should pass	Normal commit, CC format
2	Python multi-concern + secret	Python myapp	4 files across 3 logical concerns + hardcoded `sk-proj-` API key	Commit blocked, secret reported
3	Node.js >8 files, messy history	Node task-api	10 files + non-CC history ("WIP", "fix bug")	List files for confirmation, split commits

2.2 Assertion Matrix (35 items)

Scenario 1 — Clean Go Feature (13 items)

ID	Assertion	With-Skill	Without-Skill
A1	Run all preflight checks systematically (7 checks, with commands)	PASS	FAIL
A2	Check for unresolved merge conflicts (diff-filter=U)	PASS	FAIL
A3	Check for detached HEAD state	PASS	FAIL
A4	Check for rebase/merge/cherry-pick in progress	PASS	FAIL
A5	Staging analysis: correctly identify as single logical change	PASS	PASS
A6	Run secret/sensitive-content scan (filename + content regexes)	PASS	FAIL
A7	Run quality gate: go vet + go test	PASS	FAIL
A8	Check git log to determine scope frequency	PASS	PARTIAL
A9	Generate CC-format commit message	PASS	PASS
A10	Subject line ≤ 50 characters (including type(scope):)	PASS	PASS
A11	Use imperative mood	PASS	PASS
A12	Output structured post-commit report (hash + files + gate status)	PASS	FAIL
A13	Follow ordered 7-step workflow (output contract)	PASS	FAIL

Scenario 2 — Python Multi-Concern + Secret (12 items)

ID	Assertion	With-Skill	Without-Skill
B1	Run all preflight checks systematically	PASS	FAIL
B2	Identify 3 independent logical concerns (user feature, config, logging)	PASS	PARTIAL
B3	Propose splitting into separate commits	PASS	PARTIAL
B4	Run secret scan with specific regex patterns	PASS	FAIL
B5	Detect hardcoded `sk-proj-` API key	PASS	PASS
B6	Block the commit	PASS	PASS
B7	Report exact file, line number, and matched pattern name	PASS	PARTIAL
B8	Suggest remediation (env var + .env file + key rotation)	PASS	PASS
B9	Run quality gate (pytest + ruff/flake8)	PASS	FAIL
B10	Generate CC-format messages for each logical group	PASS	PARTIAL
B11	All subject lines ≤ 50 characters	PASS	PARTIAL
B12	Follow structured output contract	PASS	FAIL

Scenario 3 — Node.js >8 Files, Messy History (10 items)

ID	Assertion	With-Skill	Without-Skill
C1	Run all preflight checks systematically	PASS	FAIL
C2	Detect >8 files and list full file set for user confirmation	PASS	FAIL
C3	Partition changes into logical groups	PASS	FAIL
C4	Propose multiple separate commits	PASS	FAIL
C5	Run secret scan	PASS	PARTIAL
C6	Run quality gate (npm test + npm run lint)	PASS	FAIL
C7	Check git log for scope frequency	PASS	PASS
C8	Detect no CC history → omit scope	PASS	FAIL
C9	All subject lines ≤ 50 characters	PASS	FAIL
C10	Follow structured output contract	PASS	FAIL
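
The preflight assertions (A1 to A4, B1, C1) check repository state before anything is staged. The sketch below shows one plausible way to collect those facts in Python. The git commands used (`git diff --name-only --diff-filter=U`, `git symbolic-ref -q HEAD`, `git rev-parse --absolute-git-dir`) are standard, but the function names and the returned dictionary are assumptions, and the skill's own command list may differ.

```python
import subprocess
from pathlib import Path

# Marker files or directories Git keeps in the git directory while an
# operation is in progress.
IN_PROGRESS_MARKERS = {
    "rebase": ("rebase-merge", "rebase-apply"),
    "merge": ("MERGE_HEAD",),
    "cherry-pick": ("CHERRY_PICK_HEAD",),
}


def in_progress_operations(git_dir: Path) -> list[str]:
    """Return the operations whose marker files exist in git_dir."""
    return [
        name
        for name, markers in IN_PROGRESS_MARKERS.items()
        if any((git_dir / marker).exists() for marker in markers)
    ]


def run_git(repo: Path, *args: str) -> subprocess.CompletedProcess:
    """Run a git command inside repo, capturing output without raising."""
    return subprocess.run(
        ["git", "-C", str(repo), *args], capture_output=True, text=True
    )


def preflight(repo: Path) -> dict[str, object]:
    """Collect preflight facts; assumes repo is a Git working tree."""
    conflicts = run_git(
        repo, "diff", "--name-only", "--diff-filter=U"
    ).stdout.splitlines()
    # `git symbolic-ref -q HEAD` exits non-zero when HEAD is detached.
    detached = run_git(repo, "symbolic-ref", "-q", "HEAD").returncode != 0
    git_dir = Path(run_git(repo, "rev-parse", "--absolute-git-dir").stdout.strip())
    return {
        "conflicted_files": conflicts,
        "detached_head": detached,
        "in_progress": in_progress_operations(git_dir),
    }
```

Keeping the marker-file check separate from the subprocess calls makes that part easy to test against a temporary directory.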

3. Pass Rate Comparison

3.1 Overall Pass Rate

Configuration	Pass	Partial	Fail	Pass Rate
With Skill	35	0	0	100%
Without Skill	8	7	20	23% (33% counting PARTIAL as 0.5)

Pass rate improvement: +77 percentage points (strict) / +67pp (with PARTIAL)

3.2 Pass Rate by Scenario

Scenario	With-Skill	Without-Skill	Delta
1. Clean Go feature	13/13 (100%)	4.5/13 (35%)	+65pp
2. Python multi-concern + secret	12/12 (100%)	5.5/12 (46%)	+54pp
3. Node.js >8 files, messy history	10/10 (100%)	1.5/10 (15%)	+85pp
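
The per-scenario figures follow from the Without-Skill column of the Section 2.2 tables, with each PARTIAL counted as 0.5. The sketch below reproduces that arithmetic; the `Tally` class is a hypothetical helper, and the tallies are transcribed by hand from those tables.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tally:
    passed: int
    partial: int
    failed: int

    @property
    def total(self) -> int:
        return self.passed + self.partial + self.failed

    def rate(self, partial_weight: float = 0.0) -> float:
        """Pass rate in percent; partial_weight=0 gives the strict rate."""
        return 100 * (self.passed + partial_weight * self.partial) / self.total


# Without-Skill results counted from the Section 2.2 assertion tables.
WITHOUT_SKILL = {
    "Scenario 1": Tally(passed=4, partial=1, failed=8),
    "Scenario 2": Tally(passed=3, partial=5, failed=4),
    "Scenario 3": Tally(passed=1, partial=1, failed=8),
}

for name, tally in WITHOUT_SKILL.items():
    weighted = tally.rate(partial_weight=0.5)
    # With-Skill is 100% in every scenario, so the delta is 100 - weighted.
    print(f"{name}: {weighted:.0f}% weighted, delta {100 - weighted:+.0f}pp")
```

This prints 35%, 46%, and 15% with deltas of +65pp, +54pp, and +85pp, matching the table.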

3.3 Substantive Dimension (Capability-Focused, Structure-Independent)

To remove "workflow-structure bias," 15 additional checks were evaluated independently of workflow steps:

ID	Check	With-Skill	Without-Skill
S1	Scenario 1: Run tests (go test)	PASS	FAIL
S2	Scenario 1: Run static analysis (go vet)	PASS	FAIL
S3	Scenario 1: Secret scan	PASS	FAIL
S4	Scenario 1: CC-format message	PASS	PASS
S5	Scenario 1: Subject ≤ 50 characters	PASS	PASS
S6	Scenario 2: Identify multiple logical concerns	PASS	PARTIAL
S7	Scenario 2: Detect hardcoded API key	PASS	PASS
S8	Scenario 2: Block commit containing secret	PASS	PASS
S9	Scenario 2: Suggest secret remediation	PASS	PASS
S10	Scenario 2: Run quality gate	PASS	FAIL
S11	Scenario 3: >8 files triggers confirmation	PASS	FAIL
S12	Scenario 3: Propose split commits	PASS	FAIL
S13	Scenario 3: Run tests (npm test)	PASS	FAIL
S14	Scenario 3: Detect no CC history → adapt scope strategy	PASS	FAIL
S15	Scenario 3: Subject ≤ 50 characters	PASS	FAIL

Substantive pass rate: With-Skill 15/15 (100%) vs Without-Skill 5.5/15 (37%), improvement +63pp.

4. Key Difference Analysis

4.1 Behaviors Unique to With-Skill (Completely Absent in Baseline)

Behavior	Impact
Mandatory 7-step workflow	Each step executed explicitly, with precise commands and expected results — nothing skipped
6-item preflight checklist	Conflict detection, detached HEAD, rebase/merge/cherry-pick state, submodule awareness
Regex secret scanning	13 secret patterns (AWS/GitHub/Slack/Google/Stripe/OpenAI/DB URIs, etc.) + filename patterns
4-level triage filtering	allowlist → test/fixture → doc → comment line — eliminates false positives
Ecosystem-aware quality gates	Auto-detects Go/Node/Python/Java/Rust, runs the matching vet/test/lint toolchain
>8 file confirmation threshold	Forces listing all files and requesting user confirmation when changes exceed 8 files
Scope frequency discovery	Uses `git log` frequency (≥3 commits with same scope) to decide whether to include scope
Structured post-commit report	Complete record including hash, file summary, and gate status
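
Scope frequency discovery, as described in the table, can be sketched as a count over recent commit subjects. The ≥3 threshold is taken from the report; the regular expression and the `discover_scopes` helper are assumptions, and the input is expected to be something like the output lines of `git log --format=%s`.

```python
import re
from collections import Counter

# Conventional Commits subject: type, optional (scope), optional "!", then ": ".
CC_SUBJECT = re.compile(r"^(?P<type>[a-z]+)(?:\((?P<scope>[^)]+)\))?!?: .+")


def discover_scopes(subjects: list[str], min_count: int = 3) -> set[str]:
    """Return scopes that appear in at least min_count commit subjects."""
    counts: Counter[str] = Counter()
    for subject in subjects:
        match = CC_SUBJECT.match(subject)
        if match and match.group("scope"):
            counts[match.group("scope")] += 1
    return {scope for scope, n in counts.items() if n >= min_count}


if __name__ == "__main__":
    history = ["feat(calc): add divide", "fix(calc): handle zero",
               "test(calc): cover overflow", "docs: update README"]
    print(discover_scopes(history))              # {'calc'}
    print(discover_scopes(["WIP", "fix bug"]))   # set(), so omit the scope
```

An empty result corresponds to the Scenario 3 outcome: no Conventional Commits history, so the scope is omitted.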

4.2 Behaviors the Baseline Does, But with Lower Quality

Behavior	With-Skill	Without-Skill
Secret detection	Regex pattern matching + filename scan + tiered triage	Manual diff review — catches obvious secrets but no tool evidence
Logical grouping	Precise grouping + split proposal + character counting	Recognizes different concerns but defaults to a single commit
CC message generation	Scope frequency analysis + character counting + imperative mood check	Produces CC format but doesn't verify character limits — occasionally over 50
Quality verification	go vet + go test / npm test + lint / pytest, etc.	Only `git status` — no tests or lint ever run
Post-commit verification	Structured report (hash, files, gate status)	Only `git status` to confirm success
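
The format and length checks in the CC message row are mechanical and easy to automate. The following is a minimal sketch under stated assumptions: the type list is the commonly used Angular set rather than one confirmed by the report, `check_subject` is a hypothetical helper, and imperative mood is not checked because it is not a regex-level property.

```python
import re

MAX_SUBJECT_LEN = 50  # limit used throughout the report

# Assumed type list (common Angular convention); the skill may use another.
CC_TYPES = ("feat", "fix", "docs", "style", "refactor", "perf",
            "test", "build", "ci", "chore", "revert")
TYPE_ALT = "|".join(CC_TYPES)
CC_SUBJECT = re.compile(rf"^(?:{TYPE_ALT})(?:\([a-z0-9_-]+\))?!?: \S.*$")


def check_subject(subject: str) -> list[str]:
    """Return a list of problems; an empty list means the subject passes."""
    problems = []
    if not CC_SUBJECT.match(subject):
        problems.append("not in type(scope): subject form")
    if len(subject) > MAX_SUBJECT_LEN:
        problems.append(f"{len(subject)} chars exceeds {MAX_SUBJECT_LEN}")
    return problems


if __name__ == "__main__":
    for s in ("feat(calc): add Multiply function",
              "feat: add auth, users, and tasks modules with tests",
              "WIP"):
        print(repr(s), check_subject(s) or "OK")
```

The second subject is the baseline's Scenario 3 message, which is flagged at 51 characters.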

4.3 Scenario-Level Key Findings

Scenario 1 (Clean Go feature):
- With-Skill: All 7 steps completed. Full 7-item preflight; secret scan clean; `go vet` + `go test` passed; scope `calc` confirmed from history frequency; message `feat(calc): add multiply operation` (34 chars) is precise and concise; post-commit report complete.
- Without-Skill: Only ran `git status` / `git diff` / `git log` (3 steps). No `go vet` or `go test`. No secret scan. No preflight checks. Message `feat(calc): add Multiply function` (33 chars) has the correct format but included a system-default Co-Authored-By line. No post-commit report.

Scenario 2 (Python secret):
- With-Skill: After passing preflight, precisely identified 3 logical concerns. Secret scan matched both `sk-[A-Za-z0-9]{20,}` and `api[_-]?key\s*=` on `src/config.py:5`. Triage confirmed non-test/non-doc/non-comment → BLOCKED. Report included exact filename, line number, matched pattern names, and remediation (`os.environ["API_KEY"]` + key rotation). Proposed 3 split commits, each subject ≤ 50 chars (49/44/45).
- Without-Skill: Caught the `sk-proj-` key via manual diff review (PASS), correctly blocked the commit, and suggested env var replacement. But it produced no regex evidence chain and mentioned only the file and line, with no matched pattern name. Tended toward a single commit (two at most); did not identify the logger as an independent concern.
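
The scan-then-triage flow from Scenario 2 can be sketched as follows. The two regexes are the ones the report names; the pattern labels, the path and comment heuristics in `triage`, and the sample file content are all hypothetical, and the fake key is assembled at runtime so that no key-like literal appears in the source.

```python
import re
from dataclasses import dataclass

# The two patterns the report says matched; the dictionary keys are illustrative.
PATTERNS = {
    "openai-style-key": re.compile(r"sk-[A-Za-z0-9]{20,}"),
    "api-key-assignment": re.compile(r"api[_-]?key\s*="),
}


@dataclass(frozen=True)
class Finding:
    path: str
    line_no: int
    pattern: str


def triage(path: str, line: str, allowlist: frozenset) -> "str | None":
    """Return why a hit is dismissed, or None if it should block the commit."""
    if path in allowlist:
        return "allowlist"
    if "test" in path or "fixture" in path:   # assumed path heuristic
        return "test/fixture"
    if path.endswith((".md", ".rst", ".txt")):
        return "doc"
    if line.lstrip().startswith(("#", "//")):
        return "comment"
    return None


def scan(path: str, text: str, allowlist: frozenset = frozenset()) -> list:
    findings = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for name, pattern in PATTERNS.items():
            if pattern.search(line) and triage(path, line, allowlist) is None:
                findings.append(Finding(path, line_no, name))
    return findings


if __name__ == "__main__":
    fake_key = "sk-" + "A1b2" * 6  # hypothetical value, not a real credential
    sample = "\n".join([
        "import os",
        "",
        "DEBUG = False",
        "# api_key = <set via environment>",
        f'api_key = "{fake_key}"',
    ])
    for finding in scan("src/config.py", sample):
        print(f"BLOCKED {finding.path}:{finding.line_no} [{finding.pattern}]")
```

The commented line 4 is dismissed by triage, while line 5 is reported once per matching pattern, giving the file, line number, and pattern name that the report treats as the evidence chain.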

Scenario 3 (Node.js messy history):
- With-Skill: Detected 10 files > 8 threshold, listed all files, and requested confirmation. Identified 6 logical groups (config/middleware, auth+test, task+test, user+test, index wiring, README). `git log` showed no CC history → omit scope → use `type: subject` format. All 6 subject lines ≤ 50 chars (42/44/42/38, etc.). Ran `npm test` + `npm run lint` (both exit 0).
- Without-Skill: Committed all 10 files as one commit, with no file count threshold and no logical grouping. Message `feat: add auth, users, and tasks modules with tests` (51 characters, exceeds the 50-char limit). No `npm test` or `npm run lint`. Noticed the non-CC history but kept CC format anyway (correct behavior).

5. Token Cost-Effectiveness

5.1 Skill Context Token Cost

Component	Lines	Estimated Tokens	Load Timing
`SKILL.md`	184	~1,150	Always loaded
`quality-gate-go.md`	25	~150	On-demand for Go projects
`quality-gate-node.md`	40	~240	On-demand for Node projects
`quality-gate-python.md`	53	~320	On-demand for Python projects
`quality-gate-java.md`	45	~270	On-demand for Java/Kotlin projects
`quality-gate-rust.md`	32	~190	On-demand for Rust projects
Typical scenario total	~209–237	~1,300–1,470	SKILL.md + 1 ecosystem gate

Note: Only one ecosystem's quality gate reference is loaded per commit.
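
Loading a single gate reference implies some ecosystem detection step. The report does not say how detection works, so the marker files below are an assumption based on common project layouts; only the `references/quality-gate-*.md` paths come from the component table.

```python
from pathlib import Path
from typing import Optional

# Assumed marker files; the report only states that the ecosystem is auto-detected.
ECOSYSTEM_MARKERS = {
    "go": ("go.mod",),
    "node": ("package.json",),
    "python": ("pyproject.toml", "setup.py", "requirements.txt"),
    "java": ("pom.xml", "build.gradle", "build.gradle.kts"),
    "rust": ("Cargo.toml",),
}


def detect_ecosystem(repo: Path) -> Optional[str]:
    """Return the first ecosystem whose marker file exists in the repo root."""
    for name, markers in ECOSYSTEM_MARKERS.items():
        if any((repo / marker).exists() for marker in markers):
            return name
    return None


def gate_reference(repo: Path) -> Optional[str]:
    """Map the detected ecosystem to the single reference file to load."""
    ecosystem = detect_ecosystem(repo)
    return f"references/quality-gate-{ecosystem}.md" if ecosystem else None
```

A repository containing `go.mod` would resolve to `references/quality-gate-go.md`, so only that file's ~150 tokens are added on top of SKILL.md.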

5.2 Actual Token Usage (6 Evaluation Agents)

Agent	Scenario	Total Tokens	Duration (s)	Tool Calls
S1 With-Skill	Clean Go feature	28,841	128	27
S1 Without-Skill	Clean Go feature	22,156	78	11
S2 With-Skill	Python secret	32,732	179	25
S2 Without-Skill	Python secret	23,217	104	15
S3 With-Skill	Node.js messy	30,068	122	42
S3 Without-Skill	Node.js messy	33,290	170	24

With-Skill average: ~30,547 tokens, ~143s, ~31 tool calls
Without-Skill average: ~26,221 tokens, ~117s, ~17 tool calls

With-Skill agents consumed on average +17% more tokens and +22% more time, spent on the additional preflight checks, secret scanning, and quality gate steps. Scenario 3's Without-Skill agent anomalously consumed more tokens (33,290 vs 30,068) — without structural guidance, it made more exploratory file reads.
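
The averages and overhead percentages can be recomputed directly from the six rows of the table. The snippet below is a small check using only those figures; the variable and function names are arbitrary.

```python
from statistics import mean

# (total tokens, duration in seconds, tool calls) per agent, from the table.
WITH_SKILL = [(28841, 128, 27), (32732, 179, 25), (30068, 122, 42)]
WITHOUT_SKILL = [(22156, 78, 11), (23217, 104, 15), (33290, 170, 24)]


def averages(runs):
    """Return the mean of each column across the runs."""
    tokens, seconds, calls = zip(*runs)
    return mean(tokens), mean(seconds), mean(calls)


with_avg = averages(WITH_SKILL)
without_avg = averages(WITHOUT_SKILL)

for label, w, wo in zip(("tokens", "seconds", "tool calls"), with_avg, without_avg):
    print(f"{label}: {w:.0f} vs {wo:.0f} ({100 * (w / wo - 1):+.1f}%)")
```

The token overhead comes out at about 16.5%, which the report rounds to +17%, and the duration overhead at about 21.9%.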

5.3 Cost-Effectiveness Calculation

Metric	Value
Overall pass rate improvement	+77pp (strict) / +67pp (with PARTIAL)
Substantive pass rate improvement	+63pp
Skill context cost (typical)	~1,300 tokens
Runtime overhead (average)	+4,326 tokens (+17%)
Context tokens per 1% improvement (strict)	~17 tokens/1%
Context tokens per 1% improvement (substantive)	~21 tokens/1%
Including runtime overhead per 1% improvement	~73 tokens/1%

Note: "Context cost" counts only SKILL.md + reference loading; "runtime overhead" includes extra tool calls from preflight, secret scan, and quality gate execution.
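
The "tokens per 1% improvement" rows are simple ratios of token cost to pass rate delta. As a quick check, using the table's own inputs and a hypothetical helper name:

```python
def tokens_per_point(tokens: float, delta_pp: float) -> float:
    """Tokens spent per percentage point of pass rate improvement."""
    return tokens / delta_pp


CONTEXT_TOKENS = 1300     # typical skill context cost
RUNTIME_OVERHEAD = 4326   # average extra runtime tokens
STRICT_DELTA = 77         # strict pass rate improvement, in pp
SUBSTANTIVE_DELTA = 63    # substantive pass rate improvement, in pp

print(round(tokens_per_point(CONTEXT_TOKENS, STRICT_DELTA)))                     # 17
print(round(tokens_per_point(CONTEXT_TOKENS, SUBSTANTIVE_DELTA)))                # 21
print(round(tokens_per_point(CONTEXT_TOKENS + RUNTIME_OVERHEAD, STRICT_DELTA)))  # 73
```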

5.4 Comparison with Other Skills

Skill	Context Tokens	Pass Rate Improvement	Context Tok/1%	With Runtime Tok/1%
`git-commit`	~1,300	+77pp	~17	~73
`create-pr`	~3,400	+71pp	~48	—
`go-makefile-writer`	~1,960–4,300	+31pp	~63–139	—

git-commit leads on context tokens per 1% improvement (~17), for three reasons:
1. SKILL.md is extremely lean (184 lines); progressive reference loading prevents context bloat
2. Per-file quality gate design is highly efficient: only one ecosystem reference is ever loaded, adding just ~150–320 tokens per commit
3. Large pass rate delta (+77pp): git commit is a domain where baseline models critically lack structured safety workflows

Even including runtime overhead (~73 tok/1%), git-commit lands inside go-makefile-writer's context-only range (~63–139 tok/1%), above its low end but well below its high end. The additional tool calls guided by the skill (quality gates, secret scanning) consume more tokens, but they buy the test, lint, and secret checks that the baseline skipped entirely.

5.5 Token Return Curve

Token investment → return mapping:

~1,150 tokens (SKILL.md only):
  → Gets: 7-step workflow, Hard Rules, secret regexes, staging threshold, scope discovery
  → Estimated coverage: ~85% of pass rate improvement

+150–320 tokens (1 quality-gate reference):
  → Gets: Ecosystem-specific vet/test/lint commands and thresholds
  → Estimated coverage: +12% of pass rate improvement (Quality Gate assertions)

+0 tokens (edge cases / examples already inlined):
  → Gets: Empty commit, post-merge residuals, submodule handling
  → Estimated coverage: +3% (edge case coverage)

SKILL.md alone delivers ~85% of the value. The progressive reference loading design achieves optimal token efficiency.

6. Overall Scoring

6.1 Dimension Scores

Dimension	With Skill	Without Skill	Delta
Preflight completeness (6 systematic checks)	5.0/5	1.0/5	+4.0
Secret scanning (regex patterns + triage)	5.0/5	2.5/5	+2.5
Quality gate execution (ecosystem vet/test/lint)	5.0/5	1.0/5	+4.0
Staging logic (grouping + threshold + confirmation)	5.0/5	2.0/5	+3.0
CC message quality (scope discovery + char counting)	5.0/5	3.5/5	+1.5
Structured output report (7-step ordered output)	5.0/5	1.5/5	+3.5
Overall average	5.0/5	1.9/5	+3.1

Dimension notes:

- Preflight completeness: With-Skill ran 6 preflight checks systematically across all 3 scenarios (work tree, status, conflicts, branch, rebase, merge/cherry-pick), plus submodule awareness. Without-Skill only ran `git status` / `git diff` / `git log`, with no conflict detection, no detached HEAD check, and no rebase/merge state check.
- Secret scanning: With-Skill used 13 regex patterns + filename patterns in a dual scan, with 4-level triage to filter false positives. In Scenario 2, it precisely matched `sk-[A-Za-z0-9]{20,}` and `api[_-]?key\s*=`. Without-Skill caught the obvious `sk-proj-` key via manual diff review but had no tool evidence chain and reported no pattern name.
- Quality gate execution: With-Skill ran `go vet` + `go test`, `pytest` + `ruff` (deferred), and `npm test` + `npm run lint` across the 3 scenarios. Without-Skill ran zero tests or lint tools in all 3 scenarios, the largest capability gap.
- Staging logic: With-Skill identified 3 logical concerns in Scenario 2 and proposed 3 split commits; in Scenario 3 it triggered the >8 file threshold, listed all files, and identified 6 logical groups. Without-Skill defaulted to a single commit in Scenario 2 and committed all 10 files together in Scenario 3.
- CC message quality: With-Skill produced subjects ≤ 50 chars across all scenarios (34/49/44/45/42/38, etc.), using `git log` frequency analysis for scope decisions. Without-Skill's Scenario 3 subject was 51 characters, over the limit, and no scope frequency analysis was performed.
- Structured output report: With-Skill strictly followed the 7-step workflow output order; the post-commit report included hash, files, and gate status. Without-Skill only ran `git status` to confirm success, with no structured report.

6.2 Weighted Score

Dimension	Weight	Score	Rationale	Weighted
Assertion pass rate (delta)	25%	10.0/10	+77pp (strict) / +63pp (substantive), best token efficiency	2.50
Quality gate execution	20%	10.0/10	3/3 scenarios ran ecosystem-matched tools; baseline skipped all	2.00
Secret scanning	15%	9.5/10	Excellent regex + triage; could add more patterns (e.g., JWT)	1.43
Staging logic and grouping	15%	10.0/10	>8 threshold + logical grouping + split proposals + hunk-level staging	1.50
Token cost-effectiveness	15%	9.0/10	~17 tok/1% best among the three skills; progressive loading design elegant	1.35
CC message quality	10%	9.5/10	Scope frequency discovery + char counting; 50-char limit strictly enforced	0.95
Weighted total	100%			9.73/10
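
The weighted total is the sum of weight × score over the six rows. The check below uses `decimal` rather than floats, since the exact sum (9.725) sits on a rounding boundary; the row data is copied from the table and the half-up rounding mode is an assumption about how the report rounds.

```python
from decimal import Decimal, ROUND_HALF_UP

# (dimension, weight, score out of 10), copied from the table above.
ROWS = [
    ("Assertion pass rate (delta)", "0.25", "10.0"),
    ("Quality gate execution", "0.20", "10.0"),
    ("Secret scanning", "0.15", "9.5"),
    ("Staging logic and grouping", "0.15", "10.0"),
    ("Token cost-effectiveness", "0.15", "9.0"),
    ("CC message quality", "0.10", "9.5"),
]

# The weights must cover 100% for the total to be on a 0-10 scale.
assert sum(Decimal(weight) for _, weight, _ in ROWS) == Decimal("1.00")

total = sum(Decimal(weight) * Decimal(score) for _, weight, score in ROWS)
rounded = total.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)
print(rounded)  # 9.73
```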

6.3 Comparison with Other Skills

Skill	Weighted Score	Pass Rate Delta	Tokens/1%	Top Advantage Dimension
git-commit	9.73/10	+77pp	~17	Quality gate (+4.0), Preflight (+4.0)
create-pr	9.55/10	+71pp	~48	Gate workflow (+3.5), Output Contract (+4.0)
go-makefile-writer	9.16/10	+31pp	~63–139	CI reproducibility (+3.0), Output Contract (+4.0)

git-commit earns the highest overall score of the three skills, primarily because:

- Token efficiency is significantly ahead (~17 tok/1% vs ~48 and ~63–139): progressive reference loading keeps SKILL.md at just 184 lines while covering 5 ecosystems
- Largest pass rate delta (+77pp): git commit is the domain where baseline models are weakest in structured safety workflows; the baseline skipped tests in every scenario
- No weak dimensions: all 6 dimensions scored ≥ 9.0

Point deductions:
- Secret scanning (9.5/10): Current regexes cover mainstream secret types but lack patterns for JWT, Twilio (`SK[0-9a-fA-F]{32}`), Mailgun, and other newer SaaS platforms
- Token cost-effectiveness (9.0/10): Despite having the best absolute efficiency, the Edge Cases section (~100 tokens) sees low utilization in typical scenarios

7. Conclusion

git-commit is the skill with the largest pass rate delta (+77pp) and best token efficiency (~17 tok/1%) in this evaluation. This indicates that git commit is a domain where baseline models critically lack structured safety processes — baseline never ran any tests or lint tools across all 3 test scenarios, making the skill's marginal value exceptionally high.

Core value:
1. Quality gate, zero to one: Baseline never runs vet/test/lint; the skill guarantees "every commit passes a quality gate"
2. Regex secret scanning: 13 patterns + 4-level triage precisely blocked a hardcoded API key in Scenario 2, providing an evidence chain that manual review cannot match
3. Staging safety net: >8 file confirmation + logical split proposals prevented 10 mixed files from being committed as one blob in Scenario 3
4. Progressive reference loading: 5 ecosystem gates stored in separate files, loaded on demand; typical token cost is just ~1,300 (SKILL.md + 1 gate)

Design strengths:
- SKILL.md strictly held to 184 lines (target ≤ 200) with very high information density
- Quality gate per-file design achieves "one SKILL.md, five ecosystems", a model example of progressive loading
- Hard Rules first + precise thresholds (8 files, 50 chars, 3-commit scope frequency) make rules verifiable and unambiguous

Improvement suggestions:
1. Extend secret regexes to cover JWT, Twilio (`SK[0-9a-fA-F]{32}`), Mailgun, and other emerging platforms
2. Add guidance on creating a `.commit-secret-allowlist` file to reduce first-time setup friction
3. Consider moving the Edge Cases section into a reference file to further trim SKILL.md (~80 tokens saved)
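
As an illustration of the first suggestion, two extra patterns might look like the sketch below. The Twilio expression is the one quoted in this report, with word boundaries added as an assumption; the JWT expression is a common heuristic (base64url header and payload both starting with `eyJ`), not something taken from the skill. Sample values are generated at runtime and are not real credentials.

```python
import base64
import json
import re

EXTRA_PATTERNS = {
    # Pattern quoted in the report; the \b anchors are an added assumption.
    "twilio-api-key": re.compile(r"\bSK[0-9a-fA-F]{32}\b"),
    # Heuristic: three base64url segments, the first two encoding JSON objects.
    "jwt": re.compile(
        r"\beyJ[A-Za-z0-9_-]{5,}\.eyJ[A-Za-z0-9_-]{5,}\.[A-Za-z0-9_-]{5,}"
    ),
}


def b64url(data: bytes) -> str:
    """Base64url-encode without padding, as JWT segments are."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def fake_jwt() -> str:
    """Build a structurally valid but meaningless token for testing."""
    header = b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    payload = b64url(json.dumps({"sub": "1234567890"}).encode())
    return f"{header}.{payload}.{'x' * 20}"


def matched_patterns(line: str) -> list[str]:
    return [name for name, pattern in EXTRA_PATTERNS.items() if pattern.search(line)]


if __name__ == "__main__":
    print(matched_patterns(f'TOKEN = "{fake_jwt()}"'))          # ['jwt']
    print(matched_patterns('SID = "SK' + "0a" * 16 + '"'))      # ['twilio-api-key']
    print(matched_patterns("SKIP this line"))                   # []
```

Any new pattern would still pass through the same triage levels, so test fixtures containing sample tokens would not block commits.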