deep-research is a source-backed research skill for factual and analytical research tasks. It suits technical surveys, option comparison, claim verification, and cross-source synthesis, emphasizing evidence retrieval before conclusions. Its three main strengths are: built-in evidence-chain requirements and hallucination-aware validation that reduce unsupported conclusions; a stable 7-section output template suitable for reusable research reports; and numbered citations, source-credibility labels, and execution-completeness notes that make results easier to verify, review, and extend.
This evaluation reviews the deep-research skill along two axes: actual task performance and token cost-effectiveness. Three research scenarios of increasing complexity were designed (focused technical research, multi-perspective analysis, cross-domain synthesis). Each scenario was run in both with-skill and without-skill configurations (3 scenarios × 2 configurations = 6 independent subagent runs), with each configuration scored against 27 assertions.
| Dimension | With Skill | Without Skill | Delta |
|---|---|---|---|
| Assertion pass rate | 27/27 (100%) | 9/27 (33.3%) | +66.7 pp |
| 7-section template compliance | 3/3 correct | 0/3 | Skill-only |
| Numbered citation format [1]-[n] | 3/3 correct | 0/3 | Skill-only |
| Source credibility labels | 3/3 correct | 0/3 | Skill-only |
| Content quality (depth/breadth/data) | 3/3 correct | 3/3 correct | No delta |
| Skill token cost | ~1,350 tokens | 0 | — |
| Token cost per 1% pass-rate gain | ~20 tokens | — | Best among evaluated skills |
Key finding: The deep-research skill’s core value is structural discipline, not content quality. The base model already has strong research ability (breadth, depth, data citation), but lacks consistent report structure. The skill’s 7-section template + numbered citations + credibility labels fill that gap.
| Failure mode | Count | Evals affected | Detail |
|---|---|---|---|
| Missing dedicated template sections | 12 | 1/2/3 | Key Findings (3), Areas of Consensus (3), Areas of Debate (3), Gaps and Further Research (3) |
| Missing [1]-[n] citation format | 3 | 1/2/3 | Used inline URLs or reference tables, no unified numbering |
| Missing source credibility labels | 3 | 1/2/3 | Listed sources but no "peer-reviewed / authoritative / moderate credibility" labels |
Note: All 18 failures are structural/format failures, not content-quality failures. Without-skill passed all content dimensions (source count, data points, perspective coverage, evidence layering).
The skill’s advantage is highly stable across scenarios (62.5%–70.0%), unlike other skills with complexity-dependent trends. The reason: the skill’s core value—template compliance—does not depend on scenario complexity. Regardless of topic, the 7-section template and citation format are either followed or not.
This is the skill’s unique differentiator and accounts for 12 assertion deltas.
| Section | With Skill (3/3) | Without Skill alternative |
|---|---|---|
| Executive Summary | ✅ Always present | ✅ Usually present (2/3 have a heading) |
| Key Findings | ✅ Concise points + citations | ❌ No dedicated section; findings scattered |
| Detailed Analysis | ✅ In-depth analysis with subheadings | ⚠️ Often similar content under different naming |
| Areas of Consensus | ✅ Dedicated section | ❌ None; consensus implied in body |
| Areas of Debate | ✅ Dedicated section | ❌ None; debate scattered |
| Sources | ✅ Numbered + credibility | ⚠️ Present but varied format (tables/lists/inline) |
| Gaps and Further Research | ✅ Forward-looking research directions | ❌ No dedicated section, or brief mention only |
Practical value:

- Areas of Consensus + Areas of Debate is the most valuable structural element: it forces researchers to separate "confirmed" from "still debated" findings and keeps readers from treating preliminary findings as settled.
- The Gaps and Further Research section drives forward-looking thinking: without-skill output is a "snapshot of the moment", while with-skill adds a "future research directions" dimension.
- The Key Findings section gives busy readers a quick overview: without-skill readers must read the full report to extract the main points.
Analysis: Without-skill Eval 1 used a Markdown table for sources (URL + "Key Contribution"), Eval 2 used a numbered table, and Eval 3 listed sources by category; all three differed. With-skill, all 3 scenarios used the same format: [n] markers in the body, with a matching end-of-report entry of the form `[n] Full citation (credibility note)`.
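As an illustration, the with-skill citation format described above looks like the following (the claim, author, and venue are invented placeholders, not taken from the evaluated runs):

```
In the body: ... adoption of the practice reduced review turnaround times [3].

Sources:
[3] Author, A. "Example Paper Title." Example Conference Proceedings, 2024.
    (Peer-reviewed conference paper; high credibility)
```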
| Eval | With Skill | Without Skill |
|---|---|---|
| Eval 1 | 18 sources, each labeled e.g. "(Official Go team guidance; highest credibility)" | 11 sources, only a "Key Contribution" column |
| Eval 2 | 19 sources, each labeled e.g. "(Pre-print; moderate credibility)" | 10 sources, only a "Type" column |
| Eval 3 | 29 sources, each labeled e.g. "(Peer-reviewed conference paper; high credibility)" | ~30 sources grouped by Academic/Industry, no per-source credibility |
Practical value: Credibility labels help readers quickly assess evidence weight. E.g. in Eval 3, with-skill labeled "self-reported survey data, not a randomized trial, but the effect sizes are large", making Tidelift data limitations clear. Without-skill only listed source names without authority assessment.
Conclusion: The base model’s content quality is already strong. With-skill and without-skill are nearly identical on source count, data density, and analysis depth. The skill’s incremental value is entirely in structured template and citation format.
High leverage (~280 tokens → 18 assertion deltas):

- Output Format template definition (~200 tokens → 12 deltas)
- Citation format + credibility rules (~80 tokens → 6 deltas)

Medium leverage (~260 tokens → indirect):

- Research Process, 5 steps (~200 tokens)
- Source Evaluation Criteria (~60 tokens)

Low leverage (~810 tokens → 0 direct deltas):

- Full example (~550 tokens): 41% of the total; may indirectly help template adherence
- Other framework content (~260 tokens)
deep-research has the best token cost-effectiveness among evaluated skills because:

1. Single file, zero references: a fixed ~1,350 token cost with no conditional loading.
2. Precise fit for the base-model gap: the gap is the structural template (easy to fill with few tokens), not domain knowledge.
3. Very compact template instructions: the 7-section definition in ~200 tokens drives 12 assertion deltas.
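The cost-effectiveness figures above are simple arithmetic; a quick sketch, using the approximate token counts quoted in this report, reproduces them:

```python
# Approximate values quoted in this report.
skill_tokens = 1350          # fixed cost of loading the single skill file
pass_rate_gain_pp = 66.7     # 100% with skill minus 33.3% without

# Token cost per percentage point of pass-rate gain.
cost_per_pp = skill_tokens / pass_rate_gain_pp
print(f"~{cost_per_pp:.0f} tokens per pp of pass-rate gain")  # prints "~20 tokens per pp of pass-rate gain"

# The three leverage buckets account for the full skill cost.
high, medium, low = 280, 260, 810
assert high + medium + low == skill_tokens
```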
| Issue (without skill) | Observed | Impact |
|---|---|---|
| Missing Areas of Consensus / Areas of Debate | 3/3 scenarios no dedicated section | Medium: readers can't separate confirmed vs. unsettled findings |
| Missing Key Findings quick overview | 3/3 scenarios no dedicated section | Low: readers can extract the main points themselves |
| Missing Gaps and Further Research | 3/3 scenarios none or brief mention | Medium: no forward-looking dimension |
| Inconsistent citation format | 3/3 scenarios use different formats | Low: functionality unaffected |
| No source credibility labels | 3/3 scenarios no per-source assessment | Medium: readers can't quickly assess evidence weight |
Core finding: The base model’s "research ability" (search, synthesis, analysis) is strong, but its "research report discipline" (structure consistency, citation norms, credibility assessment) has clear gaps. The skill fills the latter.
The small deltas on content quality and source counts reflect an important fact: the base model's research ability is already strong. The skill's value lies in structured report writing, not in information gathering or analysis depth. This is not a defect of the skill but an accurate reflection of its design.