deep-research Skill Design Rationale¶
deep-research is a source-backed research framework for factual and analytical research tasks. Its core idea is: research quality depends on defining scope, evidence requirements, retrieval budget, verification rigor, and delivery structure before retrieval, synthesis, and conclusion writing begin. That is why the skill turns Scope Classification, Ambiguity Resolution, Evidence Requirements, Research Mode, Hallucination Awareness, Budget Control, Content Extraction, and Execution Integrity into a strict serial workflow.
1. Definition¶
deep-research is meant for technical surveys, option comparison, claim verification, trend analysis, codebase research, and hybrid research tasks. Its output is not just a set of findings. It also includes:
- the normalized research scope,
- evidence-chain requirements and confidence expectations,
- the actual retrieval and extraction method used,
- a separation between consensus and debate,
- source-quality notes and explicit research gaps.
From a design perspective, it is much closer to a gated research execution framework than to a free-form web-search prompt.
2. Background and Problems¶
The core problem this skill addresses is that research tasks without explicit constraints tend to fail in four ways at once: conclusions arrive too early, evidence tiers are too weak, citations are hard to verify, and output structure varies too much to reuse.
Without a clear framework, failures usually cluster into eight categories:
| Problem | Typical consequence |
|---|---|
| Scope is never narrowed first | The topic drifts and the final result stops answering the original question |
| Comparison dimensions or time boundaries are unclear | Technology comparisons and trend analysis mix data with different timeframes or meanings |
| No evidence chain is defined up front | Recommendation-style conclusions are supported by only one or two weak sources |
| Conclusions are formed from snippets | Search previews are treated as evidence, leading to distorted reading |
| Model output has no anti-hallucination guardrail | URLs, paper titles, version facts, or benchmark numbers are stated too confidently |
| Retrieval budget is unlimited | Query rounds keep expanding while information gain keeps shrinking |
| Output structure changes from task to task | Reports are hard to compare, review, or reuse |
| Gaps are hidden instead of exposed | Missing evidence is covered over with overly confident prose |
The design logic of deep-research is to turn research from a situational behavior into a workflow that is verifiable, auditable, and reusable.
3. Comparison with Common Alternatives¶
Before looking at the details, it helps to compare the skill with a few common alternatives:
| Dimension | deep-research skill | Asking a model to "research this" | Doing web search and hand-writing a summary |
|---|---|---|---|
| Scope narrowing | Strong | Weak | Medium |
| Ambiguity handling | Strong | Weak | Weak |
| Evidence-chain requirements | Strong | Weak | Weak |
| Anti-hallucination verification | Strong | Weak | Weak |
| Snippet vs. full-content distinction | Strong | Weak | Weak |
| Content extraction discipline | Strong | Weak | Medium |
| Budget control | Strong | Weak | Weak |
| Structured delivery | Strong | Medium | Weak |
| Gap exposure | Strong | Weak | Weak |
Its value does not come from replacing model reasoning. It comes from adding boundary control, evidence discipline, and delivery discipline to the research process.
4. Core Design Rationale¶
4.1 Scope Classification and Ambiguity Resolution Come First¶
deep-research places Scope Classification and Ambiguity Resolution at the very front, requiring the research category, goal, comparison dimensions, and expected depth to be clarified before retrieval starts.
The reason is straightforward: many research failures do not come from lack of sources. They come from never clearly defining the object of research. For example:
- a request like "research microservices" is too broad and must be narrowed;
- trend analysis without a time window will mix stale and current information;
- technology comparison without explicit dimensions such as performance, cost, ecosystem, or operational complexity will drift.
These gates force the workflow to answer "what exactly are we researching?" before it starts searching.
4.2 Evidence Requirements Must Be Defined Before Retrieval¶
The third Mandatory Gate is Evidence Requirements. It requires the workflow to define the minimum acceptable evidence chain before retrieval begins.
This is critical because not all conclusions are the same kind of claim:
- a single factual claim may need an official or primary source;
- a best-practice recommendation may need an official basis plus practitioner experience;
- a technology comparison may require multiple independent benchmarks or reviews;
- a trend judgment may require cross-time data from multiple sources;
- a disputed or fast-moving topic may require several source tiers plus conflict handling.
Without an explicit evidence threshold, the model is likely to stop as soon as it finds something that resembles an answer. deep-research instead defines what "enough evidence" means before it decides when synthesis may begin.
4.3 Research Mode and Budget Control Are Paired¶
deep-research does not only classify work into Quick, Standard, and Deep. It also assigns bounded retrieval and content-extraction budgets to each mode.
That is a mature design choice because research tends to fail in two opposite directions:
- too little retrieval, where evidence is thin but the conclusion sounds complete;
- too much retrieval, where search rounds expand while marginal information value keeps dropping.
Mode answers "how deep should this go?" Budget answers "where should this stop?" The combination keeps the process from being both shallow and unbounded.
It also gives users a predictable cost profile. A quick check should stay quick. A deep dive may spend more budget, but only because the user is explicitly asking for greater completeness.
4.4 Hallucination Awareness Is Elevated to a Mandatory Gate¶
deep-research treats Hallucination Awareness as its own Mandatory Gate and backs it with a dedicated hallucination-and-verification.md reference.
That reflects a very clear understanding of research work: the most dangerous failure mode is not "the analysis is a bit shallow." It is output that looks source-backed but is not actually verifiable. In research tasks, that often appears as:
- fabricated URLs, paper titles, or author names,
- stale information presented as current fact,
- uncertain conclusions written in absolute language,
- conflation of similar products, versions, or concepts,
- selective use of supporting evidence while contradictions are ignored.
Making this a separate gate rather than a few reminders signals that anti-hallucination behavior is foundational to research correctness.
4.5 The Skill Requires Full-Content Extraction vs. Search Snippets¶
The Content Extraction Gate is one of the most important design choices in deep-research. It explicitly requires the workflow to read actual source content before drawing key conclusions.
This addresses one of the most common and least visible research errors:
- search snippets preserve conclusions but not context,
- page titles often carry marketing language that does not reflect evidence strength,
- secondary writeups frequently omit methodology, version boundaries, or experimental conditions.
That is why the skill requires fetch-content before synthesis. If extraction fails for a critical source, the failure should be recorded in gaps instead of being silently treated as sufficient evidence.
This creates a sharp distinction between deep-research and ordinary "search-and-summarize" workflows. The requirement is not merely "locate sources." It is "actually read sources."
4.6 Execution Integrity Is Called Out Separately¶
The Execution Integrity Gate requires the final report to state honestly:
- whether retrieval actually ran,
- whether content extraction actually ran,
- how many sources were extracted and cited,
- which conclusions come from page content and which are only leads from snippets.
This matters because research tasks are particularly prone to producing output that only looks executed. Without execution-integrity discipline, a report can sound complete while mixing assumptions, imagined steps, and partial evidence.
Execution Integrity turns a research writeup from a text artifact into a deliverable with process evidence behind it.
4.7 Honest Degradation Is Better Than Forcing a Complete Answer¶
deep-research supports Full, Partial, and Blocked outcomes, and it requires gaps, causes, and next-step suggestions to be stated explicitly.
That is highly practical because research frequently hits real constraints:
- important sources are paywalled,
- some domains are poorly indexed by public search,
- the topic is too recent for strong public evidence,
- the retrieval budget is exhausted before the evidence chain is satisfied.
Without a degradation model, the model usually falls into one of two bad behaviors:
- evidence is thin, but the prose still sounds definitive;
- one obstacle appears and the workflow stops with nothing useful.
deep-research takes a third route: report honestly what is confirmed, what is missing, and how the missing evidence should be pursued next.
4.8 Output Contract for Consensus, Debate, Source Quality, and Gaps¶
The current Output Contract requires 9 sections for Standard and Deep modes; Quick mode may omit sections 5 and 6. In the full report structure, the most design-significant among them are:
Consensus vs Debate,Source Quality Notes,Gaps & Limitations.
Their purpose is not cosmetic completeness. Their purpose is to let a reader answer four questions quickly:
- which conclusions are relatively stable,
- where sources still disagree,
- how strong the evidence really is,
- which gaps materially affect decision-making.
That is one of the clearest differences between deep-research and ordinary "material collection." The former delivers a research judgment framework; the latter often only accumulates information.
4.9 It Covers Web Research, Codebase Research, and Hybrid Research Together¶
At the Scope Classification stage, deep-research explicitly covers comparative research, trend analysis, claim verification, technical deep-dive, codebase research, and hybrid research.
This shows that the skill does not interpret research as "search the web." In engineering practice, many of the most valuable research tasks are hybrid:
- first inspect the current codebase,
- then validate options or best practices externally,
- finally combine internal constraints and external evidence into a recommendation.
Including codebase research and hybrid research in the same framework makes the skill fit real technical decision work rather than only public-information gathering.
4.10 References Use a "Base Always Loaded + Details On Demand" Pattern¶
The current version of deep-research is no longer a single-file skill. It treats the output contract as always-loaded report infrastructure, while high-risk verification rules and programmer-specific research patterns are loaded conditionally.
This is more scalable than pushing every rule into SKILL.md for three reasons:
- low-frequency but high-importance rules can expand only when needed,
- different research tasks do not need the same detail every time,
- verification protocol, output contract, and technical research patterns are naturally maintainable as separate assets.
This is a classic production-grade skill pattern: the base workflow and output contract stay resident, while heavy detail is loaded on demand.
4.11 The Workflow Is Bound to Subcommands vs. Pure Natural-Language Advice¶
deep-research is not only a document of recommendations. It is explicitly tied to subcommands such as retrieve, fetch-content, search-codebase, validate, and report.
That design matters because it turns research method into an executable process:
- retrieval and extraction can be reproduced,
- validation can be run independently,
- report generation has explicit inputs and outputs,
- future regression testing and automation become much easier.
That is one of the clearest ways it goes beyond an ordinary prompt: it defines not only how to think, but how to execute.
5. Problems This Design Solves¶
Combining the current SKILL.md with its supporting references, the skill addresses the following engineering problems:
| Problem type | Corresponding design | Practical effect |
|---|---|---|
| Scope drift | Scope Classification + Ambiguity Resolution | Narrows the question before retrieval begins |
| Weak evidence strength | Evidence Requirements Gate | Different claim types get different evidence thresholds |
| Retrieval that is too shallow or too deep | Research Mode + Budget Control | Balances cost against completeness |
| Treating snippets as evidence | Content Extraction Gate | Key findings rely on real page content rather than previews |
| Hallucinations contaminating findings | Hallucination Awareness + Verification Protocol | Reduces fabricated citations, phantom facts, and confidence inflation |
| Research steps are not auditable | Execution Integrity Gate | Makes execution state, depth, and citation count explicit |
| Thin evidence is stretched into certainty | Honest Degradation | Clearly separates Full / Partial / Blocked |
| Reports are hard to reuse or compare | Output Contract | Makes research outputs structurally consistent |
| External research is disconnected from the codebase | Codebase / Hybrid Research categories + search-codebase | Makes the workflow fit real engineering decision work |
6. Key Highlights¶
6.1 It Turns Research into a Gated Execution Flow¶
Many research workflows only define "search for information." deep-research makes preconditions, evidence thresholds, extraction discipline, verification protocol, and delivery structure all explicit.
6.2 Its Anti-Hallucination Design Is Systematic¶
It does not stop at general warnings. It organizes hallucination types, verification priority, source tiers, query patterns, and insufficient-evidence handling into a coherent system.
6.3 It Is Strict About Whether Evidence Was Actually Read¶
Mandatory content extraction is one of the skill's strongest quality controls. It raises the minimum reliability of the resulting conclusions.
6.4 Its Structured Delivery Is Well Suited to Long-Term Reuse¶
In Standard and Deep modes, Research Question, Method, Executive Summary, Key Findings, Detailed Analysis, Consensus vs Debate, Source Quality Notes, Sources, and Gaps & Limitations together create a report shape that is easy to review, compare, and update; Quick mode deliberately allows a lighter-weight version for speed.
6.5 It Fits Hybrid Engineering Research Especially Well¶
Many skills are good at either web search or code search. deep-research is stronger because it covers both and places them under the same evidence and delivery framework.
6.6 The Current Version Emphasizes Execution Integrity More Than the Evaluation Snapshot¶
The existing evaluation report most strongly validates the skill's structural discipline, especially template consistency, numbered citations, and source-credibility labels. At the same time, the current SKILL.md has expanded into 8 Mandatory Gates, a 9-section Output Contract, and several supporting references. In other words, the evaluation confirms the core design direction, while the current skill extends that direction with stronger execution gates and anti-hallucination controls.
7. When to Use It — and When Not To¶
| Scenario | Suitable | Reason |
|---|---|---|
| Technology option comparison | Yes | Evidence chains, mode selection, and source-quality notes are highly valuable |
| Fact-checking or claim verification | Yes | The anti-hallucination and verification protocol fits directly |
| Trend analysis | Yes | It explicitly supports time windows and cross-period sources |
| External research informed by current codebase state | Yes | Hybrid research is one of its intended scenarios |
| Producing a reusable research report | Yes | The Output Contract is designed for this |
| Asking a quick general-knowledge question | Not always necessary | A direct answer may be lighter-weight |
| Work that depends entirely on private internal material with no external retrieval path | Limited fit | It needs additional data access mechanisms |
| Purely creative ideation | No | That is outside the skill's design goal |
8. Conclusion¶
The real strength of deep-research is that it systematizes the parts of research most likely to go wrong: define the question boundary first, specify the evidence threshold, constrain depth and budget, force reading of source content, verify high-risk claims through anti-hallucination protocols, and deliver the result in a contract that makes consensus, disagreement, source quality, and missing evidence explicit.
From a design perspective, the skill embodies a clear principle: research quality depends first on evidence discipline and delivery discipline, and only then on writing quality. That is why it is especially well suited to engineering decisions, technical comparisons, and fact-verification work.
9. Document Maintenance¶
This document should be updated when:
- the Mandatory Gates, Research Modes, budgets, Safety Rules, or Output Contract in
skills/deep-research/SKILL.mdchange, - the key protocols in
skills/deep-research/references/output-contract-template.md,hallucination-and-verification.md, orresearch-patterns.mdchange, - the subcommands, execution flow, or output fields in
skills/deep-research/scripts/deep_research.pychange, - key supporting conclusions in
evaluate/deep-research-skill-eval-report.mdchange, - the skill evolves further and the gap between the evaluation snapshot and the current implementation becomes materially larger.
Review quarterly; review immediately if the gate structure, reference set, or script shape of deep-research changes substantially.
10. Further Reading¶
skills/deep-research/SKILL.mdskills/deep-research/references/output-contract-template.mdskills/deep-research/references/hallucination-and-verification.mdskills/deep-research/references/research-patterns.mdskills/deep-research/scripts/deep_research.pyevaluate/deep-research-skill-eval-report.md