systematic-debugging Skill Design Rationale¶
systematic-debugging is a debugging framework that turns debugging work from intuition-driven patching into a strict "root cause first, fix second" investigation process. Its core idea: classify severity first, collect evidence, form a single hypothesis, validate it minimally, and only then move into implementation and verification, delivering the whole process as a report that can be reviewed and judged PASS/FAIL. That is why the skill binds Severity Triage, the Iron Law, the four phases, Hypothesis Discipline, the Fix Attempt Gate, the Scorecard, and the Output Contract into one tightly constrained workflow.
1. Definition¶
systematic-debugging is used for:
- debugging test failures, production incidents, intermittent issues, performance regressions, build failures, and third-party breakages,
- requiring root-cause investigation before any permanent fix,
- using explicit hypotheses, boundary evidence, and data-flow tracing to locate the true source,
- handling P0 incidents by mitigating first and then returning to full root-cause analysis,
- and enforcing debugging quality through report structure and scoring rules.
Its output is not just a fix suggestion. It also includes:
- triage,
- reproduction,
- evidence collected,
- hypothesis log,
- root cause,
- fix plan/change,
- verification,
- residual risk/follow-ups,
- scorecard.
From a design perspective, it is closer to a debugging-governance framework than to a prompt that simply reads an error and jumps to a repair.
2. Background and Problems¶
The main problem this skill addresses is not that models cannot fix bugs. It is that debugging naturally drifts toward a few high-risk impulses:
- seeing a symptom and editing immediately,
- changing multiple things at once and destroying attribution,
- declaring success after a change without real verification.
Without process constraints, the most common failures cluster into eight categories:
| Problem | Typical consequence |
|---|---|
| No root-cause investigation first | symptom gets patched and the issue returns quickly |
| No reproduction check first | the issue seems fixed only because it did not reappear yet |
| No recent-change review | the most likely trigger gets missed |
| No environment-health check | full disk, port conflicts, or OOM get treated like code bugs |
| No boundary evidence collection | in multi-component systems, nobody knows which layer actually failed |
| No explicit hypothesis | confirmed cause and guesswork get mixed together |
| Multiple fixes bundled together | nobody knows which change actually mattered |
| Repeated failures without questioning architecture | investigation degrades into Fix #4, Fix #5, and endless trial-and-error |
The design logic of systematic-debugging is to make "what severity is this, how should it be investigated, is the evidence strong enough to support root cause, and was the fix actually verified?" explicit before implementation is allowed.
3. Comparison with Common Alternatives¶
It helps to compare it with a few common alternatives:
| Dimension | systematic-debugging skill | Asking a model to "fix this bug" | Manual intuition-driven debugging |
|---|---|---|---|
| Root-cause-first discipline | Strong | Weak | Medium |
| Explicit phase structure | Strong | Weak | Weak |
| Separation of hypothesis and verification | Strong | Weak | Medium |
| Multi-boundary evidence collection | Strong | Weak | Medium |
| Resistance to impulsive patching | Strong | Weak | Weak |
| P0 mitigation vs permanent-fix separation | Strong | Weak | Medium |
| Debug-report auditability | Strong | Weak | Weak |
| PASS/FAIL quality judgment | Strong | Weak | Weak |
Its value is not only that the debugging write-up looks more formal. Its value is that it turns debugging from one-off trial-and-error into an engineering process with evidence, gates, and reviewable outputs.
4. Core Design Rationale¶
4.1 Severity Triage Comes Before Code Analysis¶
Before entering the four phases, systematic-debugging requires classifying the issue as:
- P0,
- P1, or
- P2.
This matters because different severity levels have different debugging goals. A P0 is first an operational problem and must be stabilized quickly; a P1 goes through the full four-phase process; a P2 can take the simplified path, usually centered on Phase 1 + Phase 4, with Pattern Analysis skipped when the cause is already obvious. The skill therefore bakes "mitigate first, investigate second" into the P0 protocol instead of pretending every incident should be handled identically.
The value of this design is that it cleanly separates service restoration from permanent correction. That prevents emergency incidents from being slowed down by over-idealized investigation, while also preventing a temporary mitigation from being mistaken for the real fix.
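The triage routing above can be sketched in code. This is a minimal illustration, not part of the skill itself; the names `Severity` and `process_for` are hypothetical, and the phase labels stand in for the four phases described later.

```python
from enum import Enum

class Severity(Enum):
    P0 = "P0"  # production down or severe user impact
    P1 = "P1"  # serious defect, full process required
    P2 = "P2"  # minor issue, simplified path allowed

def process_for(severity: Severity, cause_obvious: bool = False) -> list[str]:
    """Route an issue to the process variant its severity requires."""
    if severity is Severity.P0:
        # Mitigate first (rollback / flag / failover), verify the mitigation,
        # then return to the full root-cause investigation.
        return ["mitigate", "verify_mitigation",
                "phase1", "phase2", "phase3", "phase4"]
    if severity is Severity.P2 and cause_obvious:
        # Simplified path: Pattern Analysis skipped when the cause is obvious.
        return ["phase1", "phase4"]
    # P1 (and any non-obvious P2) runs the full four-phase flow.
    return ["phase1", "phase2", "phase3", "phase4"]
```

The point of the sketch is the asymmetry: only P0 gets an action before Phase 1, and even then investigation is deferred, not skipped.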
4.2 Why the Iron Law Is Absolute¶
The skill's Iron Law is absolute: no fix before root-cause investigation. This is not a stylistic preference; it is the central constraint of the skill. It explicitly forbids:
- fixing before investigating,
- stacking multiple changes first and then checking the result,
- proposing a permanent fix first and only later backfilling the investigation.
That is also why the skill separately carves out a diagnostic instrumentation exemption. Temporary logs, breakpoints, and probe scripts are not fixes; they are observation tools. This preserves the "investigate first" rule without blocking necessary evidence collection.
4.3 The Four Phases Form the Skeleton of the Skill¶
systematic-debugging fixes the debugging flow into:
- Root Cause Investigation
- Pattern Analysis
- Hypothesis and Testing
- Implementation
These phases are not cosmetic sectioning. They prevent several common jumps:
- going straight from symptom to fix,
- seeing an apparent cause and skipping working-example comparison,
- touching code before forming a single hypothesis,
- finishing a change without explicit verification.
The evaluation makes this especially clear: without-skill responses naturally collapsed toward Root Cause -> Fix -> Test, while with-skill responses consistently preserved the full Phase 1→2→3→4 structure. That shows one of the skill's main increments is not stronger repair ability, but stronger process integrity.
4.4 Explicit Hypothesis Is a Core Design Rule, Not Just a Writing Convention¶
The skill forces statements like:
I think X is the root cause because Y
and it requires one hypothesis at a time plus one minimal test for that hypothesis.
This matters because the most common debugging distortion is not having zero ideas. It is treating "the explanation that currently feels most likely" as though it were "the cause already proven." Explicit hypotheses force the debugger to answer:
- what exactly I currently believe the cause is,
- what evidence supports that belief,
- what evidence could still disprove it.
That turns Phase 3 into a real scientific-method step instead of a more polished version of intuition.
4.5 It Enforces "One Hypothesis, One Minimal Change"¶
The skill explicitly forbids bundled changes and requires:
- one hypothesis at a time,
- one minimal test per hypothesis,
- one fix at a time.
This is a very strong design choice because one of the most common reasons debugging goes wrong is that several plausible causes get changed together. Even if the issue disappears, nobody knows which change actually mattered. The skill therefore preserves attribution so that a debugging result is not only "passing" but also "understood."
4.6 Environment Health Check Lives in Phase 1¶
When symptoms include:
- intermittent failures,
- timeouts,
- "works on my machine",
- silent process death,
- or no obvious code cause,
the skill explicitly says to check environment health first and even suggests commands like `df -h`, `lsof`, `dmesg`, and `nslookup`.
This is mature design because many issues that look like code bugs are actually:
- full disk,
- OOM kills,
- port conflicts,
- DNS/network failures,
- file-descriptor exhaustion.
By front-loading environment checks, the skill explicitly acknowledges that not every failure should begin inside the source code. This greatly reduces time wasted debugging at the wrong layer.
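A Phase-1 environment check of this kind could be scripted. The sketch below is a Unix-only illustration under assumed thresholds (5% free disk, 256 file descriptors), not part of the skill; the function name and thresholds are hypothetical.

```python
import resource
import shutil
import socket

def environment_health() -> dict[str, bool]:
    """Cheap checks to run before assuming the failure lives in source code."""
    checks: dict[str, bool] = {}

    # A full disk masquerades as write, build, or database failures.
    usage = shutil.disk_usage("/")
    checks["disk_has_headroom"] = usage.free / usage.total > 0.05

    # File-descriptor exhaustion masquerades as intermittent I/O errors.
    soft_limit, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    checks["fd_limit_reasonable"] = soft_limit >= 256

    # DNS failure masquerades as service timeouts.
    try:
        socket.getaddrinfo("localhost", None)
        checks["dns_resolves"] = True
    except socket.gaierror:
        checks["dns_resolves"] = False

    return checks
```

Any `False` here redirects the investigation to the environment layer before a single line of source code is read.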
4.7 Multi-Component Systems Require Boundary Evidence¶
For systems like CI -> build -> signing or API -> service -> database, the skill explicitly requires:
- recording what enters each boundary,
- recording what exits each boundary,
- verifying environment/config propagation,
- using one round of observation to determine which boundary breaks.
This is critical because the most common misread in multi-component debugging is to treat the layer where the error appears as the layer that caused it. Boundary evidence forces the debugger to build an evidence chain instead of reasoning by proximity. The evaluation's multi-layer error-mapping scenario is a direct example of why this rule matters.
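One lightweight way to collect boundary evidence is to wrap each boundary so its inputs and outputs are recorded in a single pass. This is an illustrative sketch, not the skill's mandated tooling; the `boundary` decorator and the `handle` example are hypothetical.

```python
import functools
from typing import Any, Callable

boundary_log: list[dict[str, Any]] = []

def boundary(name: str) -> Callable:
    """Record what enters and exits one component boundary, so one round
    of observation can show which hop actually breaks."""
    def decorator(fn: Callable) -> Callable:
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            entry: dict[str, Any] = {"boundary": name, "in": repr((args, kwargs))}
            try:
                result = fn(*args, **kwargs)
                entry["out"] = repr(result)
                return result
            except Exception as exc:
                entry["out"] = f"RAISED {type(exc).__name__}: {exc}"
                raise
            finally:
                boundary_log.append(entry)
        return wrapper
    return decorator

@boundary("api")  # in a real chain: also @boundary("service"), @boundary("database")
def handle(request: dict) -> dict:
    return {"user_id": request["user"]}
```

Reading `boundary_log` after one failing run shows the first hop whose output stops matching the next hop's expected input, which is the evidence-chain reasoning the section describes.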
4.8 Phase 2 Preserves Pattern Analysis¶
A natural question is: if Phase 1 already identifies the root cause, why keep Pattern Analysis as a separate phase?
Because root-cause investigation answers "where did this fail," while Pattern Analysis answers:
- what similar code is already working,
- what the full reference pattern actually looks like,
- what all the differences are between working and broken behavior,
- and what hidden assumptions the current component depends on.
This phase is the skill's guardrail against jumping from a plausible cause to code changes too early. In the evaluation, working-example comparison was missing in some without-skill scenarios, which is enough to show that this step does not reliably appear unless the structure requires it.
4.9 The Fix Attempt Gate Forces Escalation After Three Failed Tries¶
The skill explicitly says:
- after 3 failed hypotheses or 3 failed fixes,
- stop,
- question the mental model or architecture,
- and do not drift into Fix #4 without escalation.
This design is valuable because repeated debugging failure often means the problem is not "this one line is wrong," but "the whole problem is being interpreted inside the wrong architecture or abstraction." Hard-coding that escalation point prevents endless local patching when the real issue is structural.
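The gate itself is simple enough to express as a counter with a hard stop. This sketch is illustrative; the `FixAttemptGate` class is a hypothetical name, and only the limit of three comes from the skill.

```python
class FixAttemptGate:
    """Hard stop after three failed hypotheses or fixes: question the
    mental model instead of drifting into Fix #4."""

    LIMIT = 3

    def __init__(self) -> None:
        self.failed_attempts = 0

    def record_failure(self) -> None:
        self.failed_attempts += 1
        if self.failed_attempts >= self.LIMIT:
            raise RuntimeError(
                "Escalate: 3 attempts failed; the problem may be "
                "architectural, not a one-line defect."
            )
```

The escalation is an exception rather than a warning on purpose: the whole point of the gate is that attempt four cannot happen silently.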
4.10 The P0 Protocol Says "Mitigate First, Investigate After"¶
For P0 incidents, the skill requires:
- rollback / feature flag / failover / targeted hotfix first,
- verify mitigation worked,
- then begin full root-cause investigation within 24 hours.
This solves a common confusion: does mitigating first mean skipping the debugging process? The skill's answer is no. Mitigation is an operational action; permanent correction is the debugging action. By separating them, the skill avoids delaying recovery for the sake of purity while also refusing to let emergency response replace actual root-cause work.
4.11 Debugging Report Quality Is Also a Hard Constraint¶
systematic-debugging does not only govern actions. It also requires the final report to include an explicit scorecard verdict:
- Critical,
- Standard,
- Hygiene.
This is a strong design choice because many debugging results appear to include:
- a root cause,
- a fix,
- and a test,
while still failing in substance because:
- the root cause is actually a symptom,
- the evidence is incomplete,
- the hypothesis log is missing,
- the verification is vague.
The scorecard separates "the report looks complete" from "the report is trustworthy." It explicitly allows a report to be judged FAIL when the investigation or verification quality is weak. That makes the skill's output not only a technical conclusion, but a debugging artifact whose quality can be judged.
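A verdict rule of this shape could look like the following sketch. The actual pass/fail thresholds live in the scorecard reference and are not reproduced here; this version assumes, hypothetically, that any failed Critical check is blocking while Standard checks only block in aggregate.

```python
def scorecard_verdict(results: dict[str, dict[str, bool]]) -> str:
    """Judge a report PASS/FAIL from per-category check results.

    Hypothetical rule: one failed Critical check fails the report;
    a majority of failed Standard checks also fails it.
    """
    if not all(results.get("Critical", {}).values()):
        return "FAIL"
    standard = results.get("Standard", {})
    if standard and sum(standard.values()) / len(standard) < 0.5:
        return "FAIL"
    return "PASS"
```

The structural point survives whatever the real thresholds are: the verdict is computed from named checks, not from whether the report "reads well."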
4.12 Fixed-Order Output Contract¶
The skill requires debugging reports to follow this order:
- Triage
- Reproduction
- Evidence Collected
- Hypothesis Log
- Root Cause
- Fix Plan and Change
- Verification
- Residual Risk and Follow-ups
- Scorecard
This solves a very practical problem: when report structure is fluid, reviewers cannot quickly tell:
- whether investigation really happened before fixing,
- whether an explicit hypothesis existed,
- whether root cause is truly source-level,
- whether verification actually ran.
A fixed output order turns those into checkable structure instead of subjective reading impressions.
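Because the order is fixed, conformance can be checked mechanically. The sketch below is an illustration, assuming section names appear verbatim as headings; `check_report_structure` is a hypothetical helper, not part of the skill.

```python
REQUIRED_ORDER = [
    "Triage", "Reproduction", "Evidence Collected", "Hypothesis Log",
    "Root Cause", "Fix Plan and Change", "Verification",
    "Residual Risk and Follow-ups", "Scorecard",
]

def check_report_structure(report: str) -> list[str]:
    """Return the contract sections that are missing or out of order."""
    positions = {s: report.find(s) for s in REQUIRED_ORDER}
    problems = [f"missing: {s}" for s, p in positions.items() if p == -1]
    found = [s for s in REQUIRED_ORDER if positions[s] != -1]
    if sorted(found, key=positions.get) != found:
        problems.append("sections out of order")
    return problems
```

A reviewer (or CI step) running this gets the structural questions answered before reading a word of the prose.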
4.13 References Are Loaded by Symptom¶
The skill's references are not meant to be loaded all at once. They are routed by situation:
- deep-stack issues load `root-cause-tracing.md`,
- missing guards or layered validation load `defense-in-depth.md`,
- flaky / async / sleep issues load `condition-based-waiting.md`,
- unclear bug category loads `bug-type-strategies.md`,
- final report writing loads `output-contract-template.md`,
- report grading loads `debugging-report-scorecard.md`.
This structure is sensible because debugging problems vary widely, but not every run needs every debugging technique in context. The skill keeps core discipline in SKILL.md and loads specialized tactics only when symptoms warrant them, balancing coverage against token cost.
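The routing table above amounts to a symptom-keyed lookup. The sketch below restates it in code; the key names are hypothetical labels for the situations the list describes, while the filenames come from the skill's references.

```python
# Symptom-to-reference routing, mirroring the list above.
REFERENCE_ROUTES = {
    "deep_stack": "root-cause-tracing.md",
    "missing_guards_or_layered_validation": "defense-in-depth.md",
    "flaky_async_or_sleep": "condition-based-waiting.md",
    "unclear_bug_category": "bug-type-strategies.md",
    "writing_final_report": "output-contract-template.md",
    "grading_report": "debugging-report-scorecard.md",
}

def references_to_load(symptoms: set[str]) -> list[str]:
    """Load only the references the current symptoms warrant."""
    return sorted({REFERENCE_ROUTES[s] for s in symptoms if s in REFERENCE_ROUTES})
```

An unknown symptom loads nothing extra, which is the token-cost discipline the section describes: core rules always present, tactics on demand.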
5. Problems This Design Solves¶
Combining the current SKILL.md, key references, and the evaluation report, the skill solves the following problems:
| Problem type | Corresponding design | Practical effect |
|---|---|---|
| Seeing a bug and fixing immediately | Iron Law + four-phase process | Forces investigation before implementation |
| Confusing root cause with guesswork | Hypothesis Discipline | Makes root-cause claims more testable |
| Unclear breakpoints in multi-component systems | Boundary Evidence | Locates the failing hop faster |
| Mistaking environment failures for code bugs | Environment Health Check | Reduces debugging at the wrong layer |
| Losing attribution across repeated edits | Single hypothesis, single minimal change | Preserves causal understanding |
| Endless Fix #4 / #5 trial-and-error | Fix Attempt Gate | Escalates to architecture discussion sooner |
| Reports that look complete but are not trustworthy | Output Contract + Scorecard | Makes review and replay easier |
| Emergency mitigation replacing real debugging | P0 protocol | Preserves both restoration and root-cause analysis |
6. Key Highlights¶
6.1 It Turns Debugging from "Fixing Bugs" into "Investigating Bugs"¶
This is the skill's biggest upgrade. Evidence comes first; repair is allowed later.
6.2 The Four-Phase Structure Is Its Most Visible Process Strength¶
Phase 1→2→3→4 separates investigation, analysis, hypothesis, and implementation so debugging does not collapse into "look once, patch once."
6.3 The Explicit Hypothesis Mechanism Is Crucial¶
It forces the debugger to turn "I think this is the cause" into a testable statement instead of leaving it as hidden intuition.
6.4 Environment Health and Boundary Evidence Make It Useful for Real Systems¶
Many debugging playbooks focus only on code. systematic-debugging deliberately includes OS state, config propagation, and cross-component boundaries in root-cause work.
6.5 It Has Direct Countermeasures Against Debugging Impulses¶
Red flags, escalation after three failed attempts, and the P0 mitigate-then-investigate split all directly target the most common human debugging failures.
6.6 Its Real Increment Is Process Discipline More Than Repair Ability¶
The evaluation already shows this: the base model was already strong at reading errors, tracing data flow, identifying root cause, and writing repair code. The real delta came from phase structure, explicit hypothesis, investigation completeness, verification discipline, and report auditability. That means the skill's core value is debugging governance, not simply "smarter fixes."
7. When to Use It — and When Not To¶
| Scenario | Suitable | Reason |
|---|---|---|
| Test failures, build failures, and production incidents | Very suitable | These are its core use cases |
| Multi-layer call chains or multi-component systems | Very suitable | Boundary evidence is especially valuable |
| Intermittent, flaky, or race-related issues | Very suitable | Hypothesis discipline and evidence collection matter most here |
| Situations with strong pressure for a quick fix | Very suitable | That is exactly the failure mode it is designed to constrain |
| Obvious one-line typo or compile error | Suitable but can be lighter | This often fits the simplified P2 path |
8. Conclusion¶
The real strength of systematic-debugging is not that it invents smarter fixes. It is that it systematizes the judgments most likely to go wrong in debugging: classify severity first, investigate before changing code, form a single hypothesis, validate minimally, implement only after the source is understood, and make the whole process reviewable, scoreable, and reproducible by another engineer.
From a design perspective, the skill embodies a clear principle: the key to high-quality debugging is not writing a fix faster, but knowing earlier what you actually understand, where the evidence comes from, whether the hypothesis was tested, and whether the final change rests on a real root cause instead of a convenient symptom. That is why it is especially well suited to bug investigation, incident debugging, and root-cause analysis workflows.
9. Document Maintenance¶
This document should be updated when:
- the Severity Triage, Iron Law, four-phase flow, Mandatory Gates, Scorecard, Output Contract, or P0 protocol in `skills/systematic-debugging/SKILL.md` change,
- key rules in `skills/systematic-debugging/references/root-cause-tracing.md`, `bug-type-strategies.md`, `defense-in-depth.md`, `condition-based-waiting.md`, `output-contract-template.md`, `debugging-report-scorecard.md`, or `bad-good-debugging-reports.md` change,
- key supporting conclusions in `evaluate/systematic-debugging-skill-eval-report.md` or `evaluate/systematic-debugging-skill-eval-report.zh-CN.md` change.
Review quarterly; review immediately if the phase structure, hypothesis discipline, P0 protocol, or scorecard / output contract of systematic-debugging changes substantially.
10. Further Reading¶
- `skills/systematic-debugging/SKILL.md`
- `skills/systematic-debugging/references/root-cause-tracing.md`
- `skills/systematic-debugging/references/bug-type-strategies.md`
- `skills/systematic-debugging/references/output-contract-template.md`
- `skills/systematic-debugging/references/debugging-report-scorecard.md`
- `evaluate/systematic-debugging-skill-eval-report.md`
- `evaluate/systematic-debugging-skill-eval-report.zh-CN.md`