# systematic-debugging Skill Evaluation Report

Evaluation framework: skill-creator
Evaluation date: 2026-03-11
Subject: systematic-debugging
systematic-debugging is a debugging skill built around the rule "find the root cause first, then fix". It targets test failures, production anomalies, intermittent issues, performance regressions, and third-party integration failures; its core goal is to avoid guesswork-based patching. Its three main strengths:

- It breaks the debugging process into clear phases and requires investigation before any permanent fix is proposed.
- It demands explicit hypotheses, evidence collection, and complete investigation steps, making debug reports more verifiable and less speculative.
- Its built-in severity triage supports stopping the bleeding during urgent failures while insisting on a return to root-cause analysis afterward.
## 1. Evaluation Overview
This evaluation assesses the systematic-debugging skill along two dimensions: actual task performance and token cost-effectiveness. It uses three debugging scenarios of increasing complexity (a Go test failure, a multi-layer error-mapping bug, and an intermittent empty result). Each scenario runs in both with-skill and without-skill configurations, for 3 scenarios × 2 configurations = 6 independent subagent runs, scored against 40 assertions.
| Dimension | With Skill | Without Skill | Delta |
|---|---|---|---|
| Assertion pass rate | 40/40 (100%) | 29/40 (72.5%) | +27.5 percentage points |
| Phase structure | 3/3 correct | 0/3 | Largest single-item delta |
| Explicit hypothesis statement | 3/3 | 0/3 | Skill-only |
| Investigation step completeness | 3/3 | 0/3 | At least 1 step missing |
| Skill Token cost (SKILL.md) | ~2,000 tokens | 0 | — |
| Skill Token cost (incl. references) | ~3,000 tokens | 0 | — |
| Token cost per 1% pass-rate gain | ~73 tokens (SKILL.md only) / ~109 tokens (full) | — | — |
## 2. Test Methodology

### 2.1 Scenario Design
All scenarios use real code from the issue2md project (/Users/john/issue2md) to construct debugging tasks.
| Scenario | Target file | Core focus | Assertions |
|---|---|---|---|
| Eval 1: Test failure | frontmatter.go yamlQuote | Single-function bug: multiline string breaks YAML output | 14 |
| Eval 2: Error status code | graphql_client.go → handler.go | Multi-layer call chain: GraphQL error not classified, causing 502 | 13 |
| Eval 3: Intermittent empty summary | summary_openai.go | Intermittent bug: LLM output has trailing comma causing JSON validation failure | 13 |
### 2.2 Assertion Design Principles
Assertions focus on debug process discipline, not final bug fix quality. Core checks:
| Dimension | Check content | Assertions covered |
|---|---|---|
| Phase 1 completeness | Read error, reproduce, check history, trace data flow, collect evidence | 15 |
| Phase 2 completeness | Working example comparison, diff analysis | 3 |
| Phase 3 completeness | Explicit hypothesis statement, minimal test | 6 |
| Phase 4 completeness | Failing test, single fix, verification, no incidental changes | 12 |
| Structure discipline | Phase order compliance | 3 |
| Anti-impulse discipline | No fix before investigation | 1 |
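Checks like phase-order compliance are mechanical enough to automate. The sketch below is illustrative only (the actual evaluation harness and its exact assertion wording are not part of this report, and the header strings are an assumption): it scans a debug report for the four phase headers and verifies they appear in the prescribed order.

```go
package main

import (
	"fmt"
	"strings"
)

// requiredPhases lists the phase headers the structure-discipline
// assertions look for, in order. The exact strings are assumed here
// for illustration.
var requiredPhases = []string{
	"Phase 1: Root Cause Investigation",
	"Phase 2: Pattern Analysis",
	"Phase 3: Hypothesis and Testing",
	"Phase 4: Implementation",
}

// checkPhaseOrder reports whether every phase header appears in the
// report, in order. Each search resumes after the previous match, so
// missing or out-of-order headers fail the check.
func checkPhaseOrder(report string) bool {
	pos := 0
	for _, p := range requiredPhases {
		i := strings.Index(report[pos:], p)
		if i < 0 {
			return false
		}
		pos += i + len(p)
	}
	return true
}

func main() {
	flat := "Symptom ... Root Cause ... Fix ..."
	fmt.Println(checkPhaseOrder(flat)) // a flat report fails the check
}
```

A check of this shape is what makes the "flat structure" failures in §3.2 objectively scoreable rather than a matter of reviewer judgment.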
### 2.3 Execution

- With-skill runs first read `SKILL.md` and the `root-cause-tracing.md` reference
- Without-skill runs read no skill files; debugging follows the model's default behavior
- All runs execute in independent subagents
## 3. Assertion Pass Rate

### 3.1 Summary
| Scenario | Assertions | With Skill | Without Skill | Delta |
|---|---|---|---|---|
| Eval 1: Test failure | 14 | 14/14 (100%) | 9/14 (64.3%) | +35.7% |
| Eval 2: Multi-layer bug | 13 | 13/13 (100%) | 11/13 (84.6%) | +15.4% |
| Eval 3: Intermittent bug | 13 | 13/13 (100%) | 9/13 (69.2%) | +30.8% |
| Total | 40 | 40/40 (100%) | 29/40 (72.5%) | +27.5% |
### 3.2 Classification of the 11 Without-Skill Failed Assertions
| Failure type | Count | Evals | Notes |
|---|---|---|---|
| Phase structure missing | 3 | 1/2/3 | Flat structure (Symptom → Root Cause → Fix), no Phase 1→2→3→4 |
| Explicit hypothesis missing | 3 | 1/2/3 | Jump from root cause to fix, no "I think X because Y" hypothesis verification |
| Reproduction attempt missing | 1 | 1 | No description of how to trigger bug or whether reliably reproducible |
| Change history check missing | 1 | 1 | No git history or recent change check |
| Working example comparison missing | 1 | 1 | No comparison with existing working cases |
| Existing test review missing | 1 | 3 | No check of what existing tests cover or miss |
| Fix verification missing | 1 | 3 | Proposed fix but no demonstration of running test to confirm |
### 3.3 Trend: Skill Advantage vs Scenario Characteristics
| Scenario characteristic | With-Skill advantage | Analysis |
|---|---|---|
| Eval 1 (simple, single-point) | +35.7% (5 failures) | Simple bugs most likely to skip investigation; Skill’s Iron Law forces full flow |
| Eval 2 (multi-layer, complex) | +15.4% (2 failures) | Complex scenarios naturally need layered analysis; base model does more complete investigation |
| Eval 3 (intermittent, subtle) | +30.8% (4 failures) | Intermittent bugs’ "silent failure" needs systematic evidence collection; without-skill lacks process rigor |
Key finding: The skill’s largest value is in simple bug scenarios (Eval 1: +35.7%) and intermittent bug scenarios (Eval 3: +30.8%). This aligns with the skill’s "When to Use — Use this ESPECIALLY when 'Just one quick fix' seems obvious" design intent.
## 4. Dimension-by-Dimension Comparison

### 4.1 Phase Structure (Largest Delta Dimension)
This is the most consistent difference across all 3 scenarios: with-skill uses Phase 1→2→3→4 structure; without-skill uses flat structure.
| Dimension | With Skill | Without Skill |
|---|---|---|
| Phase 1: Root Cause Investigation | ✅ 3/3 separate section with sub-steps | ❌ Mixed in "Root Cause" paragraph |
| Phase 2: Pattern Analysis | ✅ 3/3 separate section | ❌ 0/3 missing or implicit |
| Phase 3: Hypothesis and Testing | ✅ 3/3 explicit hypothesis | ❌ 0/3 missing |
| Phase 4: Implementation | ✅ 3/3 RED→GREEN→Verify | ⚠️ 2/3 have fix and test but no verification flow |
Analysis: The base model’s default debugging pattern is Root Cause → Fix → Test, skipping Pattern Analysis and Hypothesis. The skill’s four-phase framework enforces extra analysis cycles. This is especially clear in Eval 2: with-skill’s Phase 2 explicitly compared the REST and GraphQL working/broken paths, while without-skill made a similar comparison but embedded it in the root-cause analysis rather than treating it as a separate step.
Practical value: The four-phase structure ensures that:

- investigation does not jump to a fix as soon as the root cause is spotted (Phase 2 guard);
- there is an explicit, verifiable hypothesis before the fix (Phase 3 guard);
- the fix goes through a red/green verification loop (Phase 4 discipline).
### 4.2 Explicit Hypothesis Statement (Skill-Only)
| Scenario | With Skill hypothesis | Without Skill |
|---|---|---|
| Eval 1 | "The root cause is that yamlQuote does not handle newline characters. Replacing \r\n, \r, and \n with spaces..." | No hypothesis, directly "Fix Applied" |
| Eval 2 | "queryRaw() line 144-146 uses fmt.Errorf with %s, creating plain unclassified error..." | No hypothesis, directly "Proposed Fix" |
| Eval 3 | "normalizeSummaryJSON does not strip trailing commas... json.Valid() returns false..." | No hypothesis, directly "Proposed Fix" |
Analysis: Without-skill skipped the explicit hypothesis step in all 3 scenarios. Although the root-cause descriptions implied hypotheses, the missing "I think X because Y" statement means:

- a "confirmed root cause" cannot be distinguished from a "guessed root cause";
- minimal verification experiments cannot be designed to rule out alternatives;
- on complex bugs, this can lead to fixing the symptom rather than the root cause.
The skill’s Phase 3 rule "Form Single Hypothesis — State clearly: 'I think X is the root cause because Y'" effectively removes this gap.
### 4.3 Investigation Completeness
Several investigation sub-steps performed in every with-skill run are missing from some without-skill runs:
| Investigation sub-step | With Skill | Without Skill | Missing in |
|---|---|---|---|
| Read error message | 3/3 | 3/3 | — |
| Reproduction confirmation | 3/3 | 2/3 | Eval 1 |
| Check change history | 3/3 | 2/3 | Eval 1 |
| Data flow tracing | 3/3 | 3/3 | — |
| Evidence collection (multi-component) | 3/3 | 3/3 | — |
| Working example comparison | 3/3 | 2/3 | Eval 1 |
| Existing test review | 3/3 | 2/3 | Eval 3 |
Analysis: The base model is strong on reading error messages and data flow tracing (3/3), but inconsistent on reproduction confirmation, change-history checks, and working example comparison. Eval 1, the simplest scenario, had the most gaps, suggesting that simple bugs are the most likely to trigger step omission.
### 4.4 Bug Fix Quality Comparison
All 6 agents correctly identified the root cause and proposed equivalent fixes:
| Scenario | With Skill fix | Without Skill fix | Quality delta |
|---|---|---|---|
| Eval 1 | strings.NewReplacer for newlines | strings.NewReplacer for newlines | No difference |
| Eval 2 | Add Type field + isGraphQLNotFoundError + %w | Add Type field + isGraphQLNotFoundError + %w | No difference |
| Eval 3 | stripTrailingCommas() character-level parse | removeTrailingCommas() regex | Minor (different implementation, functionally equivalent) |
Key finding: The base model’s bug-fixing ability is already strong. The skill’s value lies not in improving fix quality but in enforcing a structured process, ensuring:

- full understanding before the fix (prevents symptom-only fixes);
- a verified hypothesis (prevents getting it "fixed by luck");
- a full red/green verification loop around the fix.
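The two single-function fixes above can be sketched compactly. This is a hedged reconstruction, not the actual issue2md code: the names `yamlQuoteSafe` and `stripTrailingCommas` are illustrative, and the real implementations may differ (e.g., the Eval 3 without-skill run used a regex instead of a character-level pass).

```go
package main

import (
	"fmt"
	"strings"
)

// newlineReplacer collapses newline variants to spaces, sketching the
// Eval 1 fix: a multiline value can no longer break the single-line
// YAML frontmatter field. CRLF is listed first so it maps to one space.
var newlineReplacer = strings.NewReplacer("\r\n", " ", "\r", " ", "\n", " ")

func yamlQuoteSafe(s string) string {
	return fmt.Sprintf("%q", newlineReplacer.Replace(s))
}

// stripTrailingCommas sketches the Eval 3 fix with a character-level
// pass: a comma is dropped when the next non-whitespace byte closes an
// object or array. Simplification: it ignores commas inside string
// literals, which a production version would have to skip over.
func stripTrailingCommas(s string) string {
	var b strings.Builder
	for i := 0; i < len(s); i++ {
		if s[i] == ',' {
			j := i + 1
			for j < len(s) && (s[j] == ' ' || s[j] == '\t' || s[j] == '\n' || s[j] == '\r') {
				j++
			}
			if j < len(s) && (s[j] == '}' || s[j] == ']') {
				continue // trailing comma: skip it
			}
		}
		b.WriteByte(s[i])
	}
	return b.String()
}

func main() {
	fmt.Println(yamlQuoteSafe("multi\nline title"))          // "multi line title"
	fmt.Println(stripTrailingCommas(`{"summary": "ok",}`)) // {"summary": "ok"}
}
```

Both versions of each fix pass the same tests, which is the point of §4.4: the process differed, the code quality did not.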
## 5. Token Cost-Effectiveness Analysis

### 5.1 Skill Size
| File | Lines | Words | Est. Tokens |
|---|---|---|---|
| SKILL.md | 296 | 1,504 | ~2,000 |
| root-cause-tracing.md | 169 | 739 | ~1,000 |
| defense-in-depth.md | 122 | 494 | ~650 |
| condition-based-waiting.md | 115 | 498 | ~650 |
| condition-based-waiting-example.ts | 158 | 667 | ~870 |
| find-polluter.sh | 63 | 214 | ~280 |
| test-*.md + test-academic.md | 209 | 1,221 | ~1,600 |
| CREATION-LOG.md | 119 | 612 | ~800 |
| Description (always in context) | — | ~15 | ~20 |
Actual load in evaluation:
| Config | Files read | Total Tokens |
|---|---|---|
| Eval 1/2/3 with-skill | SKILL.md + root-cause-tracing.md | ~3,000 |
| SKILL.md only (minimal) | SKILL.md | ~2,000 |
| Full load (extreme) | All .md + .sh + .ts | ~7,850 |
### 5.2 Token Cost for Quality Gain
| Metric | Value |
|---|---|
| With-skill pass rate | 100% (40/40) |
| Without-skill pass rate | 72.5% (29/40) |
| Pass-rate gain | +27.5 percentage points |
| Token cost per assertion fixed | ~182 tokens (SKILL.md only) / ~273 tokens (full) |
| Token cost per 1% pass-rate gain | ~73 tokens (SKILL.md only) / ~109 tokens (full) |
### 5.3 Token Segment Cost-Effectiveness
SKILL.md content split by functional module:
| Module | Est. Tokens | Related assertion delta | Cost-effectiveness |
|---|---|---|---|
| Iron Law + Phase order | ~120 | 3 (Phase structure) | Very high — 40 tok/assertion |
| Phase 3: Hypothesis rules | ~150 | 3 (explicit hypothesis) | Very high — 50 tok/assertion |
| Phase 1: 5-step investigation checklist | ~400 | 3 (reproduce/history/test review) | High — 133 tok/assertion |
| Phase 4: Implementation discipline | ~250 | 1 (verification) | Medium — 250 tok/assertion |
| Phase 2: Pattern Analysis | ~150 | 1 (working example comparison) | Medium — 150 tok/assertion |
| Red Flags checklist | ~200 | Indirect (reinforces no-skip discipline) | Medium — no direct assertion |
| Common Rationalizations table | ~150 | Indirect (resists "quick fix" temptation) | Medium — no direct assertion |
| "When to Use" section | ~180 | 0 (scenario matching set by evaluation) | Low — no increment in evaluation |
| Phase 4.5: Architecture questioning | ~200 | 0 (evaluation didn’t cover 3+ failed-fix scenarios) | Low — not tested |
| Supporting Techniques pointer | ~50 | 0 (pointer only) | Low — low information density |
| root-cause-tracing.md | ~1,000 | Indirect (Eval 2 multi-layer tracing) | Medium — aids tracing but base model also does it |
### 5.4 High-Leverage vs Low-Leverage Instructions

High leverage (~670 SKILL.md tokens → 9-assertion delta):

- Iron Law + Phase order (120 tok → 3)
- Phase 3 hypothesis rules (150 tok → 3)
- Phase 1 five-step investigation checklist (400 tok → 3)

Medium leverage (~750 tokens → 2 direct + indirect):

- Phase 4 implementation discipline (250 tok → 1 direct, plus the red/green flow indirectly)
- Phase 2 Pattern Analysis (150 tok → 1)
- Red Flags + Rationalizations (350 tok → anti-impulse discipline, indirect)

Low leverage (~430 tokens → 0 delta):

- "When to Use" section (180 tok) — scenario matching was fixed by the evaluation setup
- Phase 4.5 architecture questioning (200 tok) — not exercised
- Supporting Techniques pointer (50 tok) — low information density

References (~1,000 tokens, root-cause-tracing.md → indirect): aided Eval 2's multi-layer tracing structure, but the base model also handled multi-layer tracing well.
### 5.5 Token Efficiency Rating
| Rating | Conclusion |
|---|---|
| Overall ROI | Excellent — ~3,000 tokens for +27.5% pass rate |
| SKILL.md ROI | Excellent — ~2,000 tokens contains all high-leverage rules |
| High-leverage token share | ~34% (670/2,000) directly contributes 9 of the 11-assertion delta |
| Low-leverage token share | ~22% (430/2,000) contributes nothing in this evaluation |
| Reference cost-effectiveness | Medium — ~1,000 tokens root-cause-tracing.md provides indirect gain |
### 5.6 Comparison with Other Skills’ Cost-Effectiveness
| Metric | systematic-debugging | go-makefile-writer | security-review | google-search |
|---|---|---|---|---|
| SKILL.md Tokens | ~2,000 | ~1,960 | ~3,700 | ~2,200 |
| Total load Tokens | ~3,000 | ~4,100–4,600 | ~5,000–9,600 | ~3,600 |
| Pass-rate gain | +27.5% | +31.0% | +50.0% | +74.1% |
| Tokens per 1% (SKILL.md) | ~73 tok | ~63 tok | ~74 tok | ~30 tok |
| Tokens per 1% (full) | ~109 tok | ~149 tok | ~100–192 tok | ~49 tok |
systematic-debugging’s SKILL.md cost-effectiveness (73 tok/1%) is in the same range as go-makefile-writer (63 tok/1%) and security-review (74 tok/1%)—a high-efficiency skill.
## 6. Boundary Analysis vs Base Model Capabilities

### 6.1 Base Model Capabilities (No Skill Increment)
| Capability | Evidence |
|---|---|
| Accurately read error messages | 3/3 scenarios correctly parsed error output |
| Data flow tracing (single and multi-layer) | 3/3 scenarios traced to root cause |
| Correctly identify root cause | 3/3 scenarios root cause consistent |
| Write equivalent fix code | 3/3 scenarios fixes functionally equivalent |
| Write table-driven tests | 3/3 scenarios produced similar tests |
| Multi-component boundary analysis | Eval 2 detailed 5-layer component analysis |
| Intermittent bug symptom→cause mapping | Eval 3 correctly explained "why intermittent" |
### 6.2 Base Model Gaps (Skill Fills)
| Gap | Evidence | Risk level |
|---|---|---|
| Phase structure missing | 3/3 scenarios use flat structure | High — can’t distinguish investigation, analysis, verification, implementation |
| Explicit hypothesis missing | 3/3 scenarios jump from root cause to fix | High — may "fix by luck" on complex bugs |
| Reproduction confirmation inconsistent | 1/3 scenarios skipped | Medium — simple bugs more likely to omit |
| Change history check inconsistent | 1/3 scenarios skipped | Low — scenario-dependent |
| Working example comparison missing | 1/3 scenarios skipped | Medium — Pattern Analysis prevents repeat bugs |
| Existing test review inconsistent | 1/3 scenarios skipped | Medium — may miss test coverage gaps |
| Fix verification inconsistent | 1/3 scenarios skipped | High — unverified fix may introduce new bugs |
### 6.3 Skill Value Proposition
The systematic-debugging skill’s core value is not improving bug fix ability (the base model is already strong) but enforcing debugging discipline:
- Prevent skipping steps: Iron Law + Phase structure forces investigation before fix
- Explicit hypothesis verification: Phase 3 ensures fix is based on verified hypothesis, not intuition
- Investigation checklist completeness: Phase 1’s 5-step checklist ensures no key step is missed
- Anti-impulse mechanism: Red Flags + Rationalizations table is especially effective in "simple bug" scenarios
This is like a flight checklist—not because pilots don’t know how to fly, but to ensure critical steps aren’t skipped when things seem "too simple" or "too urgent".
## 7. Overall Score

### 7.1 Dimension Scores
| Dimension | With Skill | Without Skill | Delta |
|---|---|---|---|
| Phase structure | 5.0/5 | 1.0/5 | +4.0 |
| Hypothesis verification discipline | 5.0/5 | 1.0/5 | +4.0 |
| Investigation completeness | 5.0/5 | 3.5/5 | +1.5 |
| Fix quality | 5.0/5 | 4.5/5 | +0.5 |
| Test coverage | 5.0/5 | 4.0/5 | +1.0 |
| Verification discipline (red/green loop) | 5.0/5 | 3.5/5 | +1.5 |
| Overall mean | 5.0/5 | 2.92/5 | +2.08 |
### 7.2 Weighted Total
| Dimension | Weight | Score | Weighted |
|---|---|---|---|
| Assertion pass rate (delta) | 25% | 8.5/10 | 2.13 |
| Phase structure | 20% | 10/10 | 2.00 |
| Hypothesis verification discipline | 15% | 10/10 | 1.50 |
| Investigation completeness | 15% | 9.0/10 | 1.35 |
| Token cost-effectiveness | 15% | 8.5/10 | 1.28 |
| Bug fix quality increment | 10% | 5.0/10 | 0.50 |
| Weighted total | 100% | — | 8.76 |
Why the Bug fix quality increment score is low (5.0/10): the base model’s fixing ability is already strong, so the skill’s contribution is mainly process discipline, not result quality.
## 8. Evaluation Artifacts
| Artifact | Path |
|---|---|
| Eval 1 with-skill output | /tmp/debug-eval/eval-1/with_skill/response.md |
| Eval 1 without-skill output | /tmp/debug-eval/eval-1/without_skill/response.md |
| Eval 2 with-skill output | /tmp/debug-eval/eval-2/with_skill/response.md |
| Eval 2 without-skill output | /tmp/debug-eval/eval-2/without_skill/response.md |
| Eval 3 with-skill output | /tmp/debug-eval/eval-3/with_skill/response.md |
| Eval 3 without-skill output | /tmp/debug-eval/eval-3/without_skill/response.md |
| Target code | /Users/john/issue2md/internal/converter/ |
| Target code | /Users/john/issue2md/internal/github/ |
| Target code | /Users/john/issue2md/internal/webapp/ |