Advanced
Table of Contents¶
- 6. Design Patterns for High-Quality Skills
- 6.1 Mandatory Gate Architecture
- 6.2 Teaching Through Anti-Examples
- 6.3 Three-Tier Quality Scorecard
- 6.4 Golden Fixtures and Contract Tests
- 6.5 Structured Output Contract
- 6.6 Version and Platform Awareness
- 6.7 Honest Degradation
- 6.8 Degrees of Freedom
- 6.9 Five Execution-Orchestration Patterns (from Anthropic's Official Guide)
- 7. Common Pitfalls and Anti-Patterns
- 7.1 Description Determines Whether a Skill Lives or Dies
- 7.2
SKILL.mdExceeds 500 Lines - 7.3 Reference Files Without Loading Conditions
- 7.4 Positive Examples Only, No Anti-Examples
- 7.5 Ignoring
allowed-toolsSecurity Constraints - 7.6 Good and Bad Uses of Dynamic Context Injection
- 7.7 Creating Extra Files You Do Not Need
- 7.8 Naming and Security Hard Limits
- 7.9 Performance and Loading Limits
- 7.10 Common Misunderstandings
- 8. Real-World Examples: From Simple to Complex
- 9. Design Philosophy: From Teachable to Executable
6. Design Patterns for High-Quality Skills¶
From a systematic review of 10 production-grade, high-quality skills, we can extract 8 quality-assurance patterns (6.1-6.8). In addition, Anthropic's official guide summarizes 5 execution-orchestration patterns (6.9). The two are complementary: the first set governs how well the skill works, while the second governs how execution is organized.
| # | Pattern | One-Line Summary | Frequency |
|---|---|---|---|
| 6.1 | Mandatory gates | If a prerequisite is not met, execution cannot continue | 9/10 |
| 6.2 | Anti-examples | Teaching AI what not to do is often more effective than teaching it what to do | 8/10 |
| 6.3 | Three-tier scorecard | Critical items can veto the whole result, so minor issues do not dilute major defects | 7/10 |
| 6.4 | Golden fixtures + contract tests | Zero-LLM structural checks protect a skill from accidental breakage | 9/10 |
| 6.5 | Structured output contract | Fixed output fields let CI consume AI results reliably | 10/10 |
| 6.6 | Version/platform awareness | Recommendations are filtered based on the project's actual runtime version | 6/10 |
| 6.7 | Honest degradation | When conditions are incomplete, return a clearly marked partial result instead of pretending it is complete | 5/10 |
| 6.8 | Degrees of freedom | Use exact scripts for fragile actions and natural language for flexible ones | Official guidance |
6.1 Mandatory Gate Architecture¶
Gates are the core quality mechanism in a skill: if a prerequisite is not satisfied, execution must stop. The number and shape of gates vary by workflow complexity, from lighter skills such as git-commit to heavier ones such as create-pr.
Common gate types:
| Gate Type | Purpose | Typical Example |
|---|---|---|
| Execution-integrity gate | Prevent the model from claiming it ran a tool when it did not | go-code-reviewer: "Never claim verification ran unless it actually did" |
| Context/evidence gate | Collect the necessary information before acting | security-review: scan the resource inventory before evaluating |
| Version-awareness gate | Adjust behavior based on the actual runtime version | unit-test: read the Go version from go.mod; do not recommend t.Setenv for Go < 1.17 |
| Degradation-output gate | Mark the result as partial when conditions are incomplete | go-ci-workflow: mark # INLINE FALLBACK when no Makefile exists |
| Applicability gate | Decide whether the task should be executed at all | fuzzing-test: stop immediately if the target is not suitable for fuzzing |
Design point: gates form a serial dependency chain. If any one fails, all later steps are blocked. This is different from a checklist, where you may skip an item.
6.2 Teaching Through Anti-Examples¶
This is the most counterintuitive pattern: teaching AI what not to do is often more effective than teaching it what to do. LLMs naturally tend to over-report because they prefer false positives over missed issues. Clear anti-examples suppress that tendency.
Take go-code-reviewer as an example. It defines 8 major false-positive classes:
## Anti-Examples — DO NOT Report
1. Speculative nil dereference with no evidence of actual nil source
2. Over-cautious error handling complaints where stdlib guarantees non-nil
3. False concurrency alarm on a map used only in a single goroutine
4. Premature optimization suggestion without profiling evidence
5. Version-inappropriate recommendation (e.g., slog for Go < 1.21)
6. Context over-propagation complaint when function already has ctx
7. Unnecessary abstraction suggestion for teaching/example code
8. Structural false alarm on intentional test fixtures
The unit-test skill has the same idea, with 10 anti-examples such as "do not test standard-library behavior" and "do not write test cases that only assert err == nil just to raise coverage."
Design point: anti-examples must be specific. Do not write vague advice like "avoid false positives." Say in what scenario, and what kind of output would be wrong. A BAD/GOOD comparison works best.
6.3 Three-Tier Quality Scorecard¶
Split quality dimensions into three layers so critical issues are not averaged away and every dimension does not compete on equal weight:
| Tier | Pass Standard | Typical Example |
|---|---|---|
| Critical | Any FAIL means the overall result FAILS | Gate existence, security scan, killer case |
| Standard | At least 4/5 pass | Test coverage, lint, formatting |
| Hygiene | At least 3/4 pass | Comment completeness, naming style |
This avoids two common problems: critical defects getting "averaged out" by minor wins, and decision paralysis caused by treating every dimension as equally important.
6.4 Golden Fixtures and Contract Tests¶
Golden fixtures are the "anchor tests" of a skill. They define expected rule coverage and behavior assertions for typical input scenarios. Contract tests validate the structural completeness of the skill text itself.
A typical test system includes:
- Contract tests: verify that
SKILL.mdincludes its required gates, reference files, and output fields (pure text matching, no LLM dependency) - Golden-scenario tests: given input scenario X, verify that the skill text contains all required rule keywords
- Regression runner:
scripts/run_regression.shto run all tests in one command - Coverage docs:
COVERAGE.mdto record what is covered and what gaps remain
Test counts across 10 skills:
| Skill | Contract Tests | Golden-Scenario Tests | Total |
|---|---|---|---|
tdd-workflow | 49 | 38 | 87 |
fuzzing-test | 35 | 25 | 60 |
go-ci-workflow | 44 | 17 | 61 |
security-review | 30 | 25 | 55 |
go-makefile-writer | 25 | 20 | 45 |
unit-test | 24 | 17 | 41 |
go-code-reviewer | 33 | 8 | 41 |
Key property: all of these tests have zero LLM dependency and run in under one second. This is not "using AI to test AI." It is plain structural and rule validation.
Concrete Example: 33 Contract Tests and 8 Golden Cases in go-code-reviewer¶
The table above shows that go-code-reviewer has 33 contract tests and 8 golden cases. What do they actually verify? Three examples show the pattern.
Example 1: Contract test — protect two rules that point in opposite directions
Go has a classic subtle distinction around closing HTTP bodies: in a server handler, r.Body is closed automatically by net/http, so manual r.Body.Close() is unnecessary; but on the client side, resp.Body must be closed manually or the connection leaks. That means SKILL.md must contain two opposite rules. Missing either one causes false positives or missed findings.
The contract test verifies this:
def test_http_body_rule_is_server_client_aware(self):
# Rule 1: no manual close needed on the server side
self.assertIn("avoid requiring explicit `r.Body.Close()`", self.skill_text)
# Rule 2: client code must close resp.Body
self.assertIn("require `resp.Body.Close()`", self.skill_text)
# Rule 3: detailed explanation in the reference file
self.assertIn(
"Do not treat missing `r.Body.Close()` in server handlers as an automatic defect.",
self.api_ref_text,
)
If someone accidentally deletes the server-side rule while editing SKILL.md, this test fails immediately. The point of contract tests is to catch accidental rule loss or drift, not to test the model's runtime behavior.
Example 2: Golden case (true positive) — verify that the skill covers a real defect
001_race_shared_map.json defines a real concurrency bug:
{
"id": "GOLDEN-001",
"title": "Race condition on shared package-level map",
"expected_finding": true,
"severity": "High",
"category": "concurrency",
"code": "package cache\n\nvar store = map[string]string{}\n\nfunc Set(k, v string) { store[k] = v }\nfunc Get(k string) string { return store[k] }\n// Both called from HTTP handlers (concurrent goroutines)",
"coverage_rules": [
"Race conditions on shared state (maps, slices, vars)",
"concurrent map write"
]
}
The test logic is: expected_finding: true means the skill should produce a finding for this scenario. The test walks through the coverage_rules array and checks whether SKILL.md plus the reference files contain those keywords:
def test_001_race_shared_map(self):
f = self._load("001_race_shared_map.json")
self.assertTrue(f["expected_finding"])
# Verify that the skill text covers
# "Race conditions on shared state"
# and "concurrent map write"
self._assert_coverage(f)
If someone removes the section about concurrent map writes from a reference file, this test fails and tells you the skill can no longer reliably catch a shared-map race.
Example 3: Golden case (false-positive suppression) — verify that the skill does not over-report
004_server_handler_body_fp.json pairs with Example 1, but checks the other side of the problem: given a correct code example, are the rules sufficient to stop the AI from raising a false positive?
{
"id": "GOLDEN-004",
"title": "Server handler without r.Body.Close — false positive",
"expected_finding": false,
"code": "func handler(w http.ResponseWriter, r *http.Request) {\n data, err := io.ReadAll(r.Body)\n // ... no r.Body.Close() call\n}",
"anti_example_patterns": [
"avoid requiring explicit `r.Body.Close()`"
]
}
expected_finding: false means this scenario should not produce a finding. The test checks whether the suppression rule listed in anti_example_patterns still exists in SKILL.md. If it was deleted, the test fails and warns you that the model will start flagging resource leaks on every server handler that omits r.Body.Close().
Its counterpart is 003_missing_resp_body_close.json (expected_finding: true) for client code, where the AI should report the missing close. Together they form a yin-yang pair that protects this subtle distinction.
Contract Tests vs Golden Cases¶
| Contract Tests | Golden Cases | |
|---|---|---|
| Granularity | Whether a single rule exists | Whether a full scenario is covered by the combined rules |
| What is validated | "Does SKILL.md mention r.Body.Close()?" | "Given a server handler without r.Body.Close(), are the rules enough to avoid a false positive?" |
| Protection target | Prevent accidental deletion or renaming of rules | Prevent coverage gaps in combined scenarios |
| Analogy | Unit test: every brick is present | Integration test: the bricks together cover a real case |
In short: contract tests make sure every brick is still there; golden cases make sure those bricks still cover the real-world structure. Together they keep a skill from quietly degrading over time.
6.5 Structured Output Contract¶
Each skill defines 7-10 required output fields so results are auditable, parseable, and easy to integrate downstream:
## Output Contract (Mandatory Fields)
1. review_mode: Lite | Standard | Strict
2. files_reviewed: list of paths
3. findings: [{id, severity, category, location, description, evidence, recommendation}]
4. suppressed: [{reason, original_finding}]
5. baseline_comparison: {new, regressed, unchanged, resolved}
6. risk_summary: {overall_risk, sla_recommendations}
7. execution_status: {tools_run, tools_skipped, reason}
The output contract solves a common LLM problem: without a contract, the output shape changes every time, so CI cannot consume it reliably.
6.5.1 Output Format Design Methodology¶
Fixed fields answer the question of what to output. But there is a more fundamental question: how should those fields be organized and presented? The following five principles are distilled from a systematic review of the Output Contract sections across all production-grade skills.
Principle 1: Conclusion First
The verdict or status must appear at the very top. Readers should never have to read to the end to learn the overall result.
# Good format
**Status**: FAIL ← result visible at a glance
**Failed**: 3 out of 47
# Bad format
Running tests...
Test suite 1: calculator.test.js
- add(1,2) = 3 ... PASS
- subtract(5,3) = 2 ... PASS
... ← pages of output before the reader knows whether it passed
Note: in skills, "conclusion first" goes beyond a summary line. The review depth (Lite/Standard/Strict) is also declared in the very first section — so the reader knows the scope of coverage before reading any finding.
Principle 2: Execution Integrity
Never claim a tool ran when it did not. When a tool is not executed, always output exactly three things:
# Good format
Coverage: Not run
Reason: service unreachable (REDIS_URL not configured)
Reproduce: REDIS_URL=redis://localhost:6379 go test -tags=integration ./...
# Bad format
Tests ran with some issues. Please check the environment.
This principle governs output truthfulness. The other four principles assume the output is true; this one ensures that assumption holds. All 10 production-grade skills enforce it as a hard gate in their Output Contract.
Principle 3: Actionability
Every piece of information should directly guide the next action. The test: can the recipient — human or downstream agent — act on this immediately without further investigation?
# Actionable
- [src/form.go:45] handleSubmit: email state not bound to form field
Severity: must-fix
Fix: bind `form.email` to the email <input> in the template
# Not actionable
- Some tests failed in the form module → Please check the code
Note: severity labels like must-fix vs follow-up are themselves part of actionability — they tell the recipient whether this finding is blocking or advisory.
Principle 4: Layered Detail
Adjust verbosity to match the volume of results. The same skill produces outputs of different "thickness" depending on the outcome:
# All passing → minimal
**Status**: PASS (47/47)
# Few failures → expand each one
**Status**: FAIL (44/47)
### Failed Tests
- test_1: reason
- test_2: reason
- test_3: reason
# Many failures → group by category
**Status**: FAIL (12/47)
### Failed Tests by Category
- Database connection (8 failures): DB server unreachable
- Auth token (3 failures): Token expired
- Input validation (1 failure): Missing null check
Note: when findings exceed a soft cap (e.g., 10 in Standard mode), the overflow must not be silently dropped. Move them to a Residual Risk section and note "N additional findings deferred to Residual Risk."
Principle 5: Design for Downstream Consumption
A sub-agent's output will be consumed directly by the main conversation or a CI pipeline. Ask yourself before finalizing: "Can Claude or a script use this output immediately, without parsing it further?"
# Good: downstream agent can generate a fix directly
### Failed Tests
- [src/form.go:45] handleSubmit: email state not bound to form field
# Bad: downstream agent still needs to locate the file and function
- Some tests failed in the form module
Implementation patterns: - JSON summary block (Standard/Strict mode): machine-parseable summary that CI can consume with jq - PASS/FAIL verdict: boolean, not vague descriptions like "looks okay" or "some issues" - Stable field names: once published, field names must not change — downstream scripts depend on them, and renaming is a breaking change
6.6 Version and Platform Awareness¶
Read the project's real version information and adjust recommendations dynamically:
## Go Version Gate
Read go.mod → extract Go version → apply rules:
- < 1.17: do NOT recommend t.Setenv
- < 1.21: do NOT recommend slog
- < 1.22: WARN about range variable capture in goroutines
- < 1.24: do NOT recommend t.Parallel() + t.Setenv combination
This looks simple, but it solves one of the most common LLM mistakes: recommending features the current project version does not support. Traditional tools such as golangci-lint and SonarQube do not offer this kind of version-aware filtering.
6.7 Honest Degradation¶
When prerequisites are incomplete, the skill should not skip checks or guess. It should produce an explicitly marked degraded result:
# Degradation strategy from go-ci-workflow:
# Level 1: Makefile target exists → full parity
# Level 2: Makefile exists but target missing → partial parity + recommendations
# Level 3: no Makefile → inline scaffold + mark every line with "# INLINE FALLBACK"
The create-pr skill has a similar pattern: sufficient evidence → ready PR; insufficient evidence → draft PR with suspected items clearly marked.
Design point: degradation does not mean "do nothing if you cannot do it perfectly." It means do what you can, while clearly telling the user which parts are incomplete. That is far more valuable than pretending everything is fine.
6.8 Degrees of Freedom¶
Source: official skill-creator guidance
Choose different levels of instruction precision based on how fragile the operation is:
| Degree of Freedom | Expression Style | Best Use |
|---|---|---|
| High | Natural language ("use appropriate error handling") | Tasks with multiple valid implementations |
| Medium | Pseudocode or parameterized templates | A preferred pattern exists, but variation is acceptable |
| Low | Concrete scripts or full code | Fragile, error-prone actions that must be executed precisely |
The anti-pattern is clear: if everything is described with high freedom, output quality becomes unstable; if everything is defined as low freedom, the skill becomes too rigid to fit different projects.
6.9 Five Execution-Orchestration Patterns (from Anthropic's Official Guide)¶
The 8 patterns above focus on quality assurance: gates, anti-examples, scorecards, and so on. Anthropic's official guide also defines 5 patterns for how a skill organizes execution. The two sets complement each other:
| Pattern | Best Use | Core Technique |
|---|---|---|
| Sequential workflow orchestration | Multi-step flows that must happen in a fixed order | Explicit step order, inter-step dependency, phase-by-phase validation, rollback instructions |
| Multi-MCP coordination | Workflows that span multiple services, such as Figma → Drive → Linear → Slack | Clear phase boundaries, data handoff across MCPs, pre-validation, centralized error handling |
| Iterative refinement | Tasks where output quality improves over multiple passes, such as report generation | Draft → quality check → refinement loop → finalization, with explicit quality bar and stop criteria |
| Context-aware tool selection | One goal, but different tools are better depending on context | Decision trees, fallback options, transparent explanations for tool choice |
| Domain-expertise injection | The skill provides professional knowledge beyond raw tool access | Embedded domain rules, pre-action gates, audit trails, governance records |
Real-world mapping: go-code-reviewer combines sequential workflow orchestration (10 serial gates), context-aware tool selection (loading different references based on code traits), and domain-expertise injection (2,100+ lines of expert knowledge across 8 domains). When designing a skill, first pick the orchestration pattern, then layer on the quality-assurance patterns.
7. Common Pitfalls and Anti-Patterns¶
7.1 Description Determines Whether a Skill Lives or Dies¶
description is the only basis Claude uses to decide whether to auto-load a skill. It is not part of the body of SKILL.md; it lives in frontmatter and is always present in context.
Common mistakes:
# BAD — too vague, Claude cannot tell when to load it
description: A helpful tool for Go developers.
# BAD — explains what it is, but not when to use it
description: Go code review skill with multiple modes.
# GOOD — includes trigger conditions and core capability
description: >
Review Go code changes for real defects (security, concurrency, error handling,
resource leaks). Triggers on PR review, code review, diff analysis.
Supports Lite/Standard/Strict modes. Evidence-based, false-positive-aware.
Rule: everything about "when to use this skill" belongs in description, not the body. The body answers how. The description answers when.
7.2 SKILL.md Exceeds 500 Lines¶
The body of SKILL.md is fully loaded into context whenever the skill triggers. Once it grows past 500 lines, it not only wastes tokens but also weakens Claude's focus on the most important instructions.
Split like this:
- Decision framework, gates, output contract → keep in
SKILL.md - Detailed domain knowledge, templates, checklists → move to
references/ - Deterministic logic (scan, validate, discover) → wrap in
scripts/
7.3 Reference Files Without Loading Conditions¶
If you only list file names without explaining when to load them, Claude may load all of them (wasting tokens) or none of them (missing key knowledge):
# BAD — no loading conditions
## References
- references/security-patterns.md
- references/concurrency-patterns.md
- references/performance-patterns.md
# GOOD — explicit triggers
## References (Load Selectively)
- references/security-patterns.md
Load when diff contains: database/sql, tls.Config, crypto/, jwt, bcrypt
- references/concurrency-patterns.md
Load when diff contains: go func, chan, sync.Mutex, errgroup, context.WithCancel
- references/performance-patterns.md
Load when diff contains: append(, sync.Pool, atomic., reflect.
7.4 Positive Examples Only, No Anti-Examples¶
If you only teach the AI what to look for, but not what not to report, false positives will explode. In review skills especially, the anti-example library may be more valuable than the positive-example guide.
7.5 Ignoring allowed-tools Security Constraints¶
If a production skill runs without tool restrictions, the AI may perform unexpected operations:
# BAD — unrestricted tools; the AI might push or delete files
- uses: anthropics/claude-code-action@v1
with:
prompt: "Review this PR"
# GOOD — allowlist plus denylist
- uses: anthropics/claude-code-action@v1
with:
prompt: "Review this PR following .claude/skills/go-code-reviewer/SKILL.md"
allowed_tools: "Read,Grep,Glob,Bash(go test:*),Bash(golangci-lint:*)"
disallowed_tools: "Bash(git add:*),Bash(git commit:*),Bash(git push:*)"
7.6 Good and Bad Uses of Dynamic Context Injection¶
Skills support the !`command` syntax, which runs a shell command before the skill is loaded and replaces the placeholder with its output:
# Dynamic context injection inside SKILL.md
Current Go version: !`grep '^go ' go.mod | awk '{print $2}'`
Current branch: !`git branch --show-current`
This is preprocessing, not an instruction telling Claude to execute a command. It is useful for deterministic metadata such as project version or branch name.
Bad use: putting complex logic inside !`...`. Complex logic belongs in standalone scripts under scripts/.
7.7 Creating Extra Files You Do Not Need¶
A skill directory should contain only the files it actually needs. The following files should not exist:
README.md(SKILL.mdis already the documentation)CHANGELOG.md(usegit log)INSTALLATION_GUIDE.md(skills do not need installation guides)LICENSE(the skill follows the parent project's license)
7.8 Naming and Security Hard Limits¶
Anthropic enforces the following hard constraints. Violations may cause uploads to fail or the skill to be silently ignored:
| Constraint | Requirement | Bad Example |
|---|---|---|
| Folder naming | Must use kebab-case | My_Cool_Skill ❌, mySkill ❌, my-skill ✅ |
SKILL.md naming | Case-sensitive and exact | skill.md ❌, SKILL.MD ❌, SKILL.md ✅ |
| Reserved words in skill names | Must not contain claude or anthropic | claude-helper ❌ |
description content | XML angle brackets < > are forbidden | Frontmatter is injected into the system prompt, so angle brackets may be interpreted as instructions |
description length | ≤ 1024 characters | — |
7.9 Performance and Loading Limits¶
| Metric | Recommended Value | Notes |
|---|---|---|
SKILL.md size | < 5,000 words (about 500 lines) | Beyond this threshold, both latency and output quality tend to degrade |
| Number of skills enabled at once | 20-50 | Beyond that, frontmatter alone consumes a large chunk of the context window |
| Reference files | Load on demand; do not inline everything | Put detailed content under references/ and define loading conditions in SKILL.md |
Advanced tip (from the official guide): for critical validation logic, prefer executable scripts under scripts/ over natural-language instructions. Code is deterministic; prose is not. Anthropic's Office-related skills (docx, pptx, xlsx) are good examples of this pattern.
7.10 Common Misunderstandings¶
| Misunderstanding | Reality |
|---|---|
| "A skill is just a more advanced prompt" | A skill is a testable, version-controlled, on-demand knowledge module. A normal prompt cannot be regression-tested, loaded conditionally, or shared reliably across a team. The relationship is similar to "temporary script" vs "real tool." |
| "The more detailed the instructions, the better the skill" | Over-detail causes two problems: (1) once you exceed 500 lines, context cost becomes too high; (2) low-freedom instructions cannot adapt across projects. The right approach is to match instruction precision to the degree of freedom (see §6.8). |
| "Every common operation should become a skill" | A skill is worth creating only if it is reused, longer than 50 lines, and not needed in every session (see §3.1). Deterministic steps such as formatting are better as hooks, and short global rules belong in CLAUDE.md. |
| "Once a skill is written, it does not need maintenance" | Skills decay just like code. Tool upgrades, changing team standards, and model behavior changes can all make a skill stale. You need contract tests and real-world iteration to keep it healthy (see Chapter 9). |
| "Only developers can write skills" | SKILL.md is just Markdown. PMs can write process skills, QA can write testing-standard skills, and technical writers can write documentation-style skills. The key is understanding the structure: trigger conditions + operating steps + output requirements. |
8. Real-World Examples: From Simple to Complex¶
8.1 Simple Case: git-commit¶
The git-commit skill has only about 130 lines in SKILL.md and no references/ directory, yet it still implements a complete safe-commit workflow:
Workflow:
Pre-check → Staging strategy → Secret scan → Quality gates → Generate commit message → Commit → Report
Core design points:
- Security gates: scan for AWS keys, PEM files, GitHub tokens, and other secrets using regex; if any are found, block the commit
- Quality gates: for Go projects, run
go vetandgo test; for non-Go projects, run the project's standard checks - Hook awareness: if a git hook rejects the commit, the skill adjusts the message to satisfy the hook instead of bypassing it
- Atomicity rule: "one commit = one logical change"
In practice: running $git-commit in a real project
The following three screenshots come from an actual commit in the issue2md project and show how the gates work at each stage:

Security gate: after staging and before commit, the skill scans the staged changes with two regex passes: file-name patterns such as .env, .pem, and .key, plus content patterns such as AWS keys, SSH private keys, and GitHub tokens. In the screenshot, both return (no output), so the gate passes.

Commit step: after the security gate passes, the skill looks at the style of the previous commit and generates fix(github): classify raw 401 auth errors. Notice that it confirms there are unrelated changes in retry.go and coverage.out, but only stages the 3 files related to the target fix. That is the atomicity rule in action.

Report: the output includes a structured commit hash, file list, quality-gate result (make ci-api-integration passed), and a clear note about unrelated changes that were not committed. This same commit later triggers the all-green CI pipeline shown in §12.2.
Even a simple skill like this still relies on gates. That is a shared trait across all high-quality skills. Its deeper value, though, is that it turns the team's Git standard into executable enforcement, and Chapter 9 explains that idea in more detail.
8.3 A Closer Look at git-commit: The Art of Explicit Constraints¶
§8.1 showed what git-commit produces. This section steps back to the prompt-design level and unpacks three decisions that make it work.
Decision 1: Bidirectional Constraints, Not Just Positive Rules¶
Most skills only state positive norms: they tell the AI what a good commit looks like. The git-commit prompt goes further — it enumerates 5 engineering pain points as negative constraints too:
- Bad repo state: conflict markers, detached HEAD, an in-progress rebase
- Staging sprawl: unrelated changes, temp files, or submodule pointer drift mixing into one commit
- Credential leakage: API keys, private keys, and database URIs entering version history
- Unverified quality: committing before tests pass, letting breakage reach the main branch
- Semantic drift in the message: invented scope names, an overlong subject, mixing multiple intentions into one commit
This "define the target + enumerate what is forbidden" approach is more effective for the same reason anti-examples work in §6.2: LLMs naturally lean toward "do something rather than nothing." Without explicit negative constraints, the model will happily skip checks, invent plausible scope names, and commit even from an abnormal repo state. Positive norms define the destination; negative constraints define the boundaries that block "well-intentioned" shortcuts.
Decision 2: Precision Gradient — Exact Rules for Fragile Actions, Principles for Flexible Ones¶
This is §6.8 "Degrees of Freedom" applied in practice. The git-commit prompt assigns different levels of precision to different actions:
| Action | Constraint Type | Specific Rule |
|---|---|---|
| Subject length | Hard constraint | ≤ 50 characters (GitHub truncation threshold) |
| Body line length | Hard constraint | ≤ 72 characters (terminal git log readability) |
| Scope discovery | Algorithmic rule | Scan the last 30 commits; use a scope only if it appears ≥ 3 times |
| Body content | Principle | "Explain why the change was made, not just what changed" |
| Footer | Principle | "BREAKING CHANGE, Closes #, and Refs: are all optional" |
The logic: subject length is a concrete, verifiable hard constraint — GitHub will truncate anything longer, so the exact number is given. Body content depends on the context of each change and cannot be prescribed in advance, so a guiding principle is given and the judgment is left to the model. Mixing up these two levels — rigid rules for flexible situations, or vague principles for fragile ones — either makes the skill too brittle or leaves critical rules unenforced.
Decision 3: Pain Points Directly Drive Workflow Structure¶
The 5 pain points above are not just background. They map directly onto the skeleton of the 7-step serial workflow:
Pain Point → Workflow Step
────────────────────────────────────────────────────────────────
Bad repo state → Step 1: Preflight (6 repo health checks)
Staging sprawl → Step 2: Staging (force confirmation for >8 files + git add -p)
Credential leakage → Step 3: Secret gate (regex scan + 4-tier filtering)
Unverified quality → Step 4: Quality gate (ecosystem detection + matching commands)
Semantic drift in the message → Step 5: Compose message (scope algorithm + hard constraints)
────────────────────────────────────────────────────────────────
(Workflow-specific) → Step 6: Commit (--amend disabled by default)
(Workflow-specific) → Step 7: Post-commit report (structured output contract)
Every pain point maps to exactly one gate step, nothing missing, nothing extraneous. Deriving the workflow from pain points rather than filling in rules after the workflow is designed makes logical gaps far less likely — every step exists because there is a concrete problem it prevents.
The Concrete Shape of Negative Constraints¶
The scope-discovery algorithm is the best entry point for understanding how anti-examples become executable constraints. The prompt says only "scope names the module affected," but the skill turns that into a runnable algorithm with one critical guard clause:
git log --oneline -30 | grep -oE '^[0-9a-f]+ [a-z]+\([a-z0-9_-]+\):' \
| sort | uniq -c | sort -rn
# >= 3 commits with the same scope → use that scope
# < 3 per scope → omit scope, use <type>: <subject>
# Never invent a scope not in the canonical set.
That last line — "Never invent a scope not in the canonical set" — is a textbook negative constraint. Without it, the model will make reasonable inferences: seeing changes in internal/auth/, it will suggest auth as the scope, even if that project has never used auth in a commit. The inference is logical in isolation, but over time it pollutes the scope namespace in the commit history and breaks scripts that rely on consistent scope names like git log --grep. One explicit negative constraint stops a well-meaning model from doing harm.
The three decisions together demonstrate a single core principle: the quality of a prompt depends not only on what it says, but also on what it explicitly forbids.
8.2 Complex Case: go-code-reviewer¶
go-code-reviewer is a complex skill rated 9.5/10. It spans about 3,100 lines across SKILL.md, 8 reference files, 33 contract tests, and 8 golden cases. It shows the full architecture of a mature skill.
Three execution modes:
| Mode | Best Use | Finding Limit |
|---|---|---|
| Lite | ≤ 3 files, low risk | 5 |
| Standard | Default | 10 |
| Strict | Security, concurrency, or API contract changes | 15 |
7 mandatory gates:
Execution Integrity → Baseline Comparison → False-Positive Suppression
→ Risk Acceptance/SLA → Go Version → Generated Code Exclusion → Reference Loading
Three of these are especially distinctive:
- Go Version Gate: reads the Go version from
go.modand blocks recommendations the project cannot use (for example, noslogrecommendation for a Go 1.20 project). Neithergolangci-lintnor SonarQube can do this. - Reference Loading Gate: when code matches a trigger pattern, the corresponding domain reference file is mandatory. It is not "nice to have"; without loading it, review is not allowed to continue.
- Anti-example library: 8 major false-positive classes. It teaches Claude what not to report, which is harder, and often more valuable, than teaching what to look for.
What anti-examples achieve in practice:

This screenshot comes from a real review of the same fix(github) commit. The skill suppresses two false positives automatically: (1) string matching on "bad credentials" inside isAuthError, because it is limited to the GitHub API domain and is not a real security issue; and (2) field reordering inside statusError, because the 4 fields are already 8-byte aligned, so memory layout does not change. Every suppressed item includes specific domain reasoning and a residual-risk note, rather than simply being ignored. That is the real-world effect of the anti-example pattern in §6.2.
Progressive disclosure in practice:
SKILL.md (457-line operating framework)
└── Load references on demand based on trigger keywords in the code:
├── go-security-patterns.md (581 lines; triggers: database/sql, tls.Config, jwt...)
├── go-concurrency-patterns.md (224 lines; triggers: go func, chan, sync.Mutex...)
├── go-error-and-quality.md (249 lines; triggers: _ =, panic(, errors.Is...)
├── go-test-quality.md (174 lines; triggers: *_test.go files in diff)
├── go-api-http-checklist.md (222 lines; triggers: net/http, gin., grpc...)
├── go-performance-patterns.md (287 lines; triggers: append(, sync.Pool, atomic....)
├── go-modern-practices.md (296 lines; triggers: [T, slog., atomic.Int...)
└── pr-review-quick-checklist.md (65 lines; triggers: any PR/diff review)
If a PR only touches HTTP handlers, the review loads only go-api-http-checklist.md and pr-review-quick-checklist.md, not all 2,100 lines of domain knowledge. That is progressive disclosure working as intended.
9. Design Philosophy: From Teachable to Executable¶
The earlier chapters focus on how to write a skill, what patterns matter, and how to iterate on it. But after writing and refining many skills in real projects, a deeper idea becomes clear: a skill is not just a way to customize an AI coding assistant. It represents an engineering shift, from knowledge that can be taught to knowledge that can be executed.
9.1 Three Forms of Knowledge¶
In any technical team, engineering practice exists in three forms:
Tacit knowledge Explicit knowledge Executable knowledge
(in people's heads) (in documents) (in a skill)
┌──────────┐ ┌──────────┐ ┌──────────┐
│ "Check │ ───────→ │ Git │ ───────→ │ git- │
│ secrets │ Document │ standard │ Skillify │ commit │
│ before │ │ Chapter 6│ │ SKILL.md │
│ commit" │ │ details │ │ 7-step │
└──────────┘ └──────────┘ │ workflow │
✗ depends on memory ✗ depends on reading └──────────┘
✗ cannot be verified ✗ cannot be enforced ✓ enforced by gates
✗ lost with people ✓ preserved ✓ preserved + reusable
Traditional practice usually stops at the second stage: write documents, do training, and pass knowledge through code review. The core problem is that execution still depends entirely on human discipline and memory. Even a great document is useless if nobody remembers to read it.
Skills close the gap from stage two to stage three by automating explicit knowledge.
9.2 Case Study: How git-commit Aligns with the Git Standard¶
The git-commit skill (§8.1) is the most direct proof of this philosophy. For example, our team's Git operations guide had already distilled a complete Git commit standard in Chapter 2, "Daily Operation Commands Explained in Detail," and Chapter 6, "Workflow and Commit Standards." The skill turns those commit-related rules into mandatory gates inside a 7-step workflow:
Reference: xrdy511623/go-notes / productivetools/git
| Git Standard Topic | Current Documentation Topic | Skill Alignment |
|---|---|---|
| Atomic commit: one commit should do one thing | Commit granularity and patch staging | ✅ Hard rule + git add -p as the default strategy |
| Describe why, not just what | Commit message body guidance | ✅ Body must answer "why it changed" |
| Follow the team's commit standard | Commit message conventions | ✅ Full Conventional Commits rule set |
| Conventional Commits format | Subject-line format rules | ✅ type(scope): subject exactly aligned |
| Full type coverage | Type conventions and exceptions | ✅ Superset, with extra build and revert |
| Subject ≤ 50 characters | Subject length guidance | ✅ Aligned with GitHub truncation behavior |
| Body explains why | Responsibility split for the body | ✅ Clear split between what and why |
| Footer rules | Footer and issue-linking conventions | ✅ Supports BREAKING CHANGE, Closes #, and Refs: |
--amend safety warning | History-rewrite and force-push risk | ✅ Includes force-push warnings when the commit was already pushed |
All 9 documented rules are covered, from atomicity to footer formatting, from subject length to --amend risk warnings. The important change is not that the skill repeats the document. It changes how the knowledge is executed: from "please consult the Git standard" to "if the rule is not satisfied, the commit is blocked."
9.3 The Same Philosophy Across Different Skills¶
The same pattern appears across all strong skills:
| Skill | Tacit Knowledge (originally in people's heads) | Explicit Knowledge (documented) | Executable Knowledge (skillified) |
|---|---|---|---|
go-code-reviewer | Senior engineers' review instincts: security patterns, concurrency traps, performance anti-patterns | 8 domain references totaling 2,100+ lines | Trigger keywords auto-load the right references, then mandatory gates run in sequence |
unit-test | "Tests should find bugs, not chase coverage" | Defect-First Workflow methodology docs | Before writing tests, the skill must produce a failure-hypothesis list, and every target needs a killer case |
go-makefile-writer | Team build standards: standard targets and flags for lint/test/fmt | Project Makefile conventions | Generates a Makefile that matches team standards and can be used directly by CI |
security-review | Security lessons learned from real production incidents (for example, DB connection leaks) | Security review checklist | Forces review of resource lifecycle management, including constructor-release pairing |
9.4 Three Core Capabilities¶
Turning experience from "teachable" into "executable" requires three capabilities:
- Identify tacit knowledge: realize which team rules "everyone knows" are actually stored only in a few people's heads. For example, "check for accidentally committed secrets before commit" often feels obvious only after an incident.
- Express it structurally: turn vague experience into precise rules, gates, and workflows. "Be careful in code review" is not executable. "When
database/sqlappears, load the security review reference and check connection-pool config plus transaction boundaries" is executable. - Choose the right level of automation: not all knowledge should be automated. High-determinism rules such as formatting, secret scanning, and resource-lifecycle checks fit skills well. Flexible judgment such as architecture choices or business-logic reasoning should stay with humans. Too much automation is as harmful as too little.
9.5 Three Design Principles¶
Anthropic's official guide highlights the following three principles as the foundation of skills. Together they support the philosophy above:
| Principle | Meaning | Practical Guidance |
|---|---|---|
| Progressive disclosure | Three loading layers: frontmatter always visible, SKILL.md loaded on use, references/ discovered on demand | Do not put everything into SKILL.md; move detailed docs into references/ (see §5) |
| Composability | Claude may load multiple skills at once, so your skill should coexist cleanly with others | Do not assume exclusive tool access; avoid output-field collisions with common skills; keep boundaries clear |
| Portability | The same skill should run on Claude.ai, Claude Code, and the API without changes | Do not rely on platform-specific paths or env vars; declare required runtime dependencies in compatibility |
"Progressive disclosure" was covered in detail in §5. "Composability" and "portability" are especially important here because they determine whether a skill can actually scale across team environments and multiple platforms. If a skill assumes it is the only active skill, or only works on one platform, its reuse value drops sharply.
This is not just about using a tool better. It is an upgrade in engineering thinking. In the AI era, the people and teams that can make this shift will keep a long-term advantage.