
# deep-research Skill Evaluation Report

**Evaluation framework:** skill-creator
**Evaluation date:** 2026-03-12
**Evaluation target:** deep-research


deep-research is a source-backed research skill for factual and analytical research tasks. It suits technical surveys, option comparisons, claim verification, and cross-source synthesis, emphasizing evidence retrieval before conclusions. Its three main strengths:

- built-in evidence-chain requirements and hallucination-aware validation that reduce unsupported conclusions;
- a stable 7-section output template suitable for reusable research reports;
- numbered citations, source-credibility labels, and execution-completeness notes that make results easier to verify, review, and extend.

## 1. Evaluation Overview

This evaluation reviews the deep-research skill along two axes: actual task performance and token cost-effectiveness. Three research scenarios of increasing complexity were designed (focused technical research, multi-perspective analysis, cross-domain synthesis). Each scenario was run with both with-skill and without-skill configurations, for 3 scenarios × 2 configs = 6 independent subagent runs, scored against 27 assertions.

| Dimension | With Skill | Without Skill | Delta |
|---|---|---|---|
| Assertion pass rate | 27/27 (100%) | 9/27 (33.3%) | +66.7 pp |
| 7-section template compliance | 3/3 correct | 0/3 | Skill-only |
| Numbered citation format [1]-[n] | 3/3 correct | 0/3 | Skill-only |
| Source credibility labels | 3/3 correct | 0/3 | Skill-only |
| Content quality (depth/breadth/data) | 3/3 correct | 3/3 correct | No delta |
| Skill token cost | ~1,350 tokens | 0 | — |
| Token cost per 1% pass-rate gain | ~20 tokens | — | Best among evaluated skills |

**Key finding:** The deep-research skill’s core value is structural discipline, not content quality. The base model already has strong research ability (breadth, depth, data citation), but lacks consistent report structure. The skill’s 7-section template + numbered citations + credibility labels fill that gap.


## 2. Test Methodology

### 2.1 Scenario Design

| Scenario | User request | Core focus | Assertions |
|---|---|---|---|
| Eval 1: Focused technical research | "Research Go generics adoption — patterns, best practices, pitfalls" | Template compliance, citation format, technical depth | 10 |
| Eval 2: Multi-perspective analysis | "Research AI code review tools — developer, team lead, security perspectives" | Multi-perspective coverage, debate identification, balance | 8 |
| Eval 3: Cross-domain synthesis | "Research OSS maintainer burnout — causes, strategies, evidence" | Evidence layering, consensus vs. debate, research gaps | 9 |

### 2.2 Execution

- With-skill runs load SKILL.md first and follow its Research Process and Output Format.
- Without-skill runs load no skill; reports are generated by model default behavior.
- All runs may use WebSearch and WebFetch for real sources.
- All 6 subagents run in parallel.
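
The paired-run protocol above can be sketched as follows. This is an illustrative stand-in, not the actual harness: `run_subagent`, `score`, and the `Scenario` fields are hypothetical names, and the stub returns a placeholder instead of calling a real subagent.

```python
from dataclasses import dataclass, field

@dataclass
class Scenario:
    name: str
    prompt: str
    assertions: list = field(default_factory=list)  # (label, check_fn) pairs

def run_subagent(prompt, skill_md=None):
    """Stand-in for a real subagent call: with-skill runs prepend SKILL.md."""
    context = (skill_md + "\n\n" if skill_md else "") + prompt
    return f"<report generated from {len(context)} chars of context>"

def score(report, scenario):
    """Count how many assertion checks the report passes."""
    return sum(1 for _, check in scenario.assertions if check(report))

scenarios = [
    Scenario("eval-1", "Research Go generics adoption ..."),
    Scenario("eval-2", "Research AI code review tools ..."),
    Scenario("eval-3", "Research OSS maintainer burnout ..."),
]

# 3 scenarios x 2 configs = 6 independent runs
results = {}
for sc in scenarios:
    for config, skill in (("with_skill", "SKILL.md contents"), ("without_skill", None)):
        results[(sc.name, config)] = score(run_subagent(sc.prompt, skill), sc)

print(len(results))  # 6
```

In the real evaluation each run would execute in its own subagent context so the six runs cannot contaminate one another.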

### 2.3 Skill Characteristics

deep-research is a single-file skill (SKILL.md only, no references): 193 lines, 985 words, ~1,350 tokens. Core components:

| Component | Lines | Est. tokens |
|---|---|---|
| Research Process (5 steps) | ~30 | ~200 |
| Output Format (7-section template) | ~30 | ~200 |
| Source Evaluation Criteria | ~8 | ~60 |
| Full example (Intermittent Fasting) | ~80 | ~550 |
| Other (description/frontmatter/headers) | ~45 | ~340 |
| **Total** | 193 | ~1,350 |

## 3. Assertion Pass Rate

### 3.1 Summary

| Scenario | Assertions | With Skill | Without Skill | Delta |
|---|---|---|---|---|
| Eval 1: Go generics research | 10 | 10/10 (100%) | 3/10 (30.0%) | +70.0 pp |
| Eval 2: AI code review | 8 | 8/8 (100%) | 3/8 (37.5%) | +62.5 pp |
| Eval 3: OSS maintainer burnout | 9 | 9/9 (100%) | 3/9 (33.3%) | +66.7 pp |
| **Total** | 27 | 27/27 (100%) | 9/27 (33.3%) | +66.7 pp |
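
The per-scenario percentages and deltas reduce to simple arithmetic on the raw counts; a quick sanity check:

```python
# Raw assertion counts from the summary table above.
evals = {
    "eval-1": {"total": 10, "with": 10, "without": 3},
    "eval-2": {"total": 8,  "with": 8,  "without": 3},
    "eval-3": {"total": 9,  "with": 9,  "without": 3},
}

for name, e in evals.items():
    with_pct = 100 * e["with"] / e["total"]
    without_pct = 100 * e["without"] / e["total"]
    print(f"{name}: {with_pct:.1f}% vs {without_pct:.1f}% -> +{with_pct - without_pct:.1f} pp")

# Overall: 27/27 with skill vs 9/27 without -> +66.7 pp
total = sum(e["total"] for e in evals.values())
without_total = sum(e["without"] for e in evals.values())
print(f"overall delta: +{100 * (total - without_total) / total:.1f} pp")
```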

### 3.2 Per-Item Score Details

#### Eval 1: Go Generics Research

| # | Assertion | With Skill | Without Skill |
|---|---|---|---|
| A1 | "Executive Summary" section exists | ✅ | ✅ |
| A2 | "Key Findings" section has numbered citations [1]-[n] | ✅ (6 findings) | ❌ |
| A3 | "Detailed Analysis" section has subtopics | ✅ (7 subtopics) | ❌ |
| A4 | "Areas of Consensus" section | ✅ (6 points) | ❌ |
| A5 | "Areas of Debate" section | ✅ (6 points) | ❌ |
| A6 | "Sources" section uses numbered [1]-[n] citations | ✅ (18 sources) | ❌ |
| A7 | "Gaps and Further Research" section | ✅ (8 gaps) | ❌ |
| A8 | ≥3 independent sources | ✅ (18) | ✅ (11) |
| A9 | Sources include credibility labels | ✅ | ❌ |
| A10 | Findings include concrete data points | ✅ | ✅ |

#### Eval 2: AI Code Review Multi-Perspective Analysis

| # | Assertion | With Skill | Without Skill |
|---|---|---|---|
| B1 | All 7 template sections present | ✅ | ❌ |
| B2 | Covers 3 perspectives (developer/manager/security) | ✅ | ✅ |
| B3 | ≥4 independent sources | ✅ (19) | ✅ (10) |
| B4 | Citations use [1]-[n] format | ✅ | ❌ |
| B5 | Sources section has credibility labels | ✅ | ❌ |
| B6 | Areas of Debate identifies real disagreements | ✅ (6 debates) | ❌ |
| B7 | Balanced pros and cons | ✅ | ❌ |
| B8 | Mentions specific tools or studies | ✅ | ✅ |

#### Eval 3: OSS Maintainer Burnout Research

| # | Assertion | With Skill | Without Skill |
|---|---|---|---|
| C1 | All 7 template sections present | ✅ | ❌ |
| C2 | ≥4 independent sources | ✅ (29) | ✅ (~30) |
| C3 | Citations use [1]-[n] and are referenced in body | ✅ | ❌ |
| C4 | Sources include credibility assessment | ✅ | ❌ |
| C5 | Strategies have evidence layering (strong/moderate/weak) | ✅ | ✅ |
| C6 | Covers three themes (causes/strategies/evidence) | ✅ | ❌ |
| C7 | Consensus vs. debate clearly distinguished | ✅ | ❌ |
| C8 | Gaps section proposes concrete research directions | ✅ (8 gaps) | ❌ |
| C9 | Includes data points and study citations | ✅ | ✅ |

### 3.3 Classification of 18 Without-Skill Failures

| Failure type | Count | Evals | Notes |
|---|---|---|---|
| Missing specific 7-section template sections | 12 | 1/2/3 | Key Findings (3), Areas of Consensus (3), Areas of Debate (3), Gaps and Further Research (3) |
| Missing [1]-[n] citation format | 3 | 1/2/3 | Used inline URLs or reference tables, no unified numbering |
| Missing source credibility labels | 3 | 1/2/3 | Listed sources but no "peer-reviewed / authoritative / moderate credibility" labels |

**Note:** All 18 failures are structural/format failures, not content-quality failures. Without-skill passed all content dimensions (source count, data points, perspective coverage, evidence layering).
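
The classification can be cross-checked against the pass-rate numbers: the three failure types should account for exactly the 18 without-skill failures (27 assertions minus 9 passes), and the per-scenario failure counts reported in the trend analysis below should sum to the same total.

```python
# Failure counts from the classification table above.
failure_types = {
    "missing 7-section template sections": 12,
    "missing [1]-[n] citation format": 3,
    "missing source credibility labels": 3,
}
# Per-scenario failure counts (10-3, 8-3, 9-3 without-skill misses).
per_eval_failures = {"eval-1": 7, "eval-2": 5, "eval-3": 6}

assert sum(failure_types.values()) == 18
assert sum(per_eval_failures.values()) == 18
assert 27 - 9 == 18  # total assertions minus without-skill passes
print("failure classification is internally consistent")
```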

### 3.4 Trend Analysis

| Scenario complexity | With-Skill advantage | Failure type |
|---|---|---|
| Eval 1 (focused technical) | +70.0% (7 failures) | All structural |
| Eval 2 (multi-perspective) | +62.5% (5 failures) | All structural |
| Eval 3 (cross-domain) | +66.7% (6 failures) | All structural |

The skill’s advantage is highly stable across scenarios (62.5%–70.0%), unlike other skills with complexity-dependent trends. The reason: the skill’s core value—template compliance—does not depend on scenario complexity. Regardless of topic, the 7-section template and citation format are either followed or not.


## 4. Dimension-by-Dimension Comparison

### 4.1 Report Structure (7-Section Template)

This is the skill’s unique differentiator and accounts for 12 assertion deltas.

| Section | With Skill (3/3) | Without Skill alternative |
|---|---|---|
| Executive Summary | ✅ Always present | ✅ Usually present (2/3 have heading) |
| Key Findings | ✅ Concise points + citations | ❌ No dedicated section; findings scattered |
| Detailed Analysis | ✅ In-depth analysis with subheadings | ⚠️ Often similar content, different naming |
| Areas of Consensus | ✅ Dedicated section | ❌ None; consensus implied in body |
| Areas of Debate | ✅ Dedicated section | ❌ None; debate scattered |
| Sources | ✅ Numbered + credibility | ⚠️ Present but varied format (tables/lists/inline) |
| Gaps and Further Research | ✅ Forward-looking research directions | ❌ No dedicated section or brief mention only |

**Practical value:**

- **Areas of Consensus + Areas of Debate** is the most valuable structural element: it forces researchers to separate "confirmed" from "still debated" findings and keeps readers from treating preliminary findings as settled.
- **Gaps and Further Research** drives forward-looking thinking; without-skill output is a snapshot of the moment, while with-skill output adds a future-research dimension.
- **Key Findings** gives busy readers a quick overview; without the skill, readers must read the full report to extract the main points.
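
For reference, the 7-section layout the skill enforces looks roughly like the skeleton below. The section names are the ones evaluated in this report; the placeholder content and credibility-label phrasings are illustrative, not quoted from SKILL.md.

```markdown
# <Research Topic>

## Executive Summary
Two or three sentences summarizing the main conclusions.

## Key Findings
1. Finding with supporting citation [1]
2. Finding with supporting citations [2][3]

## Detailed Analysis
### <Subtopic A>
In-depth analysis with subheadings.

## Areas of Consensus
- Point most sources agree on [1][4]

## Areas of Debate
- Question where sources disagree [2][5]

## Sources
[1] Author/Site, "Title" (peer-reviewed; high credibility)
[2] Author/Site, "Title" (vendor blog; moderate credibility)

## Gaps and Further Research
- Open question or missing evidence
```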

### 4.2 Citation Format (Numbered [1]-[n])

| Dimension | With Skill | Without Skill |
|---|---|---|
| Citation format | [1], [2], ..., [n] — body numbers + full citations at end | Inline URLs, tables, parenthetical citations, author-year mix |
| Cross-reference | Body [1][2] maps directly to Sources section | Manual matching across formats |
| Consistency | 3/3 scenarios identical format | 3/3 scenarios different formats |

**Analysis:** Without-skill Eval 1 used a Markdown table for sources (URL + "Key Contribution"), Eval 2 used a numbered table, and Eval 3 listed sources by category; all three differed. With-skill output used the same format in all 3 scenarios: `[n]` in the body, and at the end, `[n] Full citation (credibility note)`.
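
One practical benefit of the unified `[n]` convention is that it is mechanically checkable. The sketch below shows what such a check could look like; `check_citations` is an illustrative helper, not part of the evaluated skill.

```python
import re

def check_citations(body, sources):
    """Return body citation numbers with no matching [n] entry in Sources."""
    cited = {int(n) for n in re.findall(r"\[(\d+)\]", body)}
    # Source entries are expected to start a line with their [n] marker.
    listed = {int(n) for n in re.findall(r"^\[(\d+)\]", sources, flags=re.M)}
    return cited - listed

body = "Generics reduce duplication [1][2], though compile times can grow [3]."
sources = (
    "[1] Go team blog (official; highest credibility)\n"
    "[2] GopherCon talk (conference; high credibility)"
)

print(check_citations(body, sources))  # {3}: cited in the body, never listed
```

The mixed formats of the without-skill runs (inline URLs, tables, author-year) admit no comparably simple cross-reference check.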

### 4.3 Source Credibility Labels

| Scenario | With Skill | Without Skill |
|---|---|---|
| Eval 1 | 18 sources, each labeled, e.g. "(Official Go team guidance; highest credibility)" | 11 sources, only a "Key Contribution" column |
| Eval 2 | 19 sources, each labeled, e.g. "(Pre-print; moderate credibility)" | 10 sources, only a "Type" column |
| Eval 3 | 29 sources, each labeled, e.g. "(Peer-reviewed conference paper; high credibility)" | ~30 sources grouped Academic/Industry, no per-source credibility |

**Practical value:** Credibility labels help readers quickly assess evidence weight. For example, in Eval 3 the with-skill report labeled the Tidelift data "self-reported survey data, not a randomized trial, but the effect sizes are large", making its limitations clear. The without-skill report only listed source names without any authority assessment.

### 4.4 Content Quality Comparison

| Dimension | With Skill | Without Skill | Delta |
|---|---|---|---|
| Source count | 18 / 19 / 29 | 11 / 10 / ~30 | Comparable or with-skill slightly more |
| Data-point density | High | High | No significant difference |
| Code examples (Eval 1) | Multiple full Go code blocks | Multiple full Go code blocks | No significant difference |
| Performance data (Eval 1) | PlanetScale benchmark table | DeepSource citation + qualitative | With-skill slightly better |
| Tool comparison table (Eval 2) | 5 tools × 3 dimensions | 5 tools × 3 dimensions (different data) | Comparable |
| Evidence layering (Eval 3) | Strong/Moderate/Weak + Consensus/Debate | Strongest/Moderate/Weak/Absent | Comparable |
| WebSearch usage | Extensive (12+ searches/eval) | Extensive (8+ searches/eval) | Comparable |
| Research depth | Excellent | Excellent | No significant difference |
Research depth Excellent Excellent No significant difference

**Conclusion:** The base model’s content quality is already strong. With-skill and without-skill are nearly identical on source count, data density, and analysis depth. The skill’s incremental value is entirely in structured template and citation format.


## 5. Token Cost-Effectiveness

### 5.1 Skill Size

deep-research is a very lightweight skill—single file, no references, fixed ~1,350 token cost.

| File | Lines | Words | Bytes | Est. tokens |
|---|---|---|---|---|
| SKILL.md | 193 | 985 | 6,995 | ~1,350 |
| Description (always in context) | — | ~40 | — | ~50 |
| References | None | — | — | 0 |
| **Total** | 193 | 985 | 6,995 | ~1,350 |

### 5.2 Token Cost vs. Quality Gain

| Metric | Value |
|---|---|
| With-skill pass rate | 100% (27/27) |
| Without-skill pass rate | 33.3% (9/27) |
| Pass-rate gain | +66.7 pp |
| Token cost per assertion fixed | ~75 tokens |
| Token cost per 1% pass-rate gain | ~20 tokens |
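
Both cost metrics follow directly from the skill size and the measured deltas; recomputed here as a sanity check:

```python
# Inputs from earlier sections: ~1,350-token skill, 27 assertions,
# 9 without-skill passes.
skill_tokens = 1350
assertion_deltas = 27 - 9            # 18 assertions flipped by the skill
pass_rate_gain_pp = 100 * 18 / 27    # +66.7 percentage points

tokens_per_assertion = skill_tokens / assertion_deltas  # ~75
tokens_per_pp = skill_tokens / pass_rate_gain_pp        # ~20

print(round(tokens_per_assertion), round(tokens_per_pp))  # 75 20
```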

### 5.3 Token Segment Cost-Effectiveness

| Module | Est. tokens | Related assertion deltas | Cost-effectiveness |
|---|---|---|---|
| Output Format template | ~200 | 12 (7-section × 3 evals, minus Executive Summary) | Very high — ~17 tok/assertion |
| Citation rules ([1]-[n] + credibility) | ~80 | 6 (number format 3 + credibility 3) | Very high — ~13 tok/assertion |
| Research Process (5 steps) | ~200 | Indirect (drives systematic method) | Medium — no direct assertion |
| Source Evaluation Criteria | ~60 | Indirect (drives credibility content) | Medium — indirect |
| Full example (Intermittent Fasting) | ~550 | Indirect (demonstrates template use) | Low — 41% of tokens, no direct assertion |
| Other (frontmatter/headers) | ~260 | 0 | Low — basic framework |

### 5.4 High-Leverage vs. Low-Leverage Instructions

**High leverage (~280 tokens → 18 assertion deltas):**

- Output Format template definition (~200 tok → 12)
- Citation format + credibility rules (~80 tok → 6)

**Medium leverage (~260 tokens → indirect):**

- Research Process 5 steps (~200 tok)
- Source Evaluation Criteria (~60 tok)

**Low leverage (~810 tokens → 0 direct deltas):**

- Full example (~550 tok): 41% of the total; may indirectly help template adherence
- Other framework content (~260 tok)

### 5.5 Token Efficiency Rating

| Rating | Conclusion |
|---|---|
| Overall ROI | Excellent — ~1,350 tokens for +66.7% pass rate |
| High-leverage token share | ~21% (280/1,350) directly contributes to 18/18 assertion deltas |
| Low-leverage token share | ~60% (810/1,350) with no direct assertion contribution |
| Reference cost-effectiveness | N/A — no references |
| Example cost-effectiveness | Optimizable — 550 tokens (41%) for one example; room to compress |

### 5.6 Comparison with Other Skills

| Metric | deep-research | yt-dlp-downloader | go-makefile-writer | tdd-workflow |
|---|---|---|---|---|
| SKILL.md tokens | ~1,350 | ~2,370 | ~1,960 | ~2,100 |
| Total load tokens | ~1,350 | ~5,100–5,730 | ~4,100–4,600 | ~3,600–4,800 |
| Pass-rate gain | +66.7% | +55.0% | +31.0% | +46.2% |
| Tokens per 1% (SKILL.md) | ~20 tok | ~43 tok | ~63 tok | ~45 tok |
| Tokens per 1% (full) | ~20 tok | ~95 tok | ~149 tok | ~92 tok |

deep-research has the best token cost-effectiveness among the evaluated skills because:

1. **Single file, zero references** — fixed ~1,350 token cost, no conditional loading
2. **Precise fit for the base-model gap** — the gap is structural template (easy to fill with few tokens), not domain knowledge
3. **Very compact template instructions** — the 7-section definition in ~200 tokens drives 12 assertion deltas


## 6. Boundary with Base Model Capabilities

### 6.1 Capabilities the Base Model Already Has (No Skill Increment)

| Capability | Evidence |
|---|---|
| WebSearch + WebFetch information gathering | 3/3 scenarios used 8–12+ searches |
| Multi-source synthesis | 3/3 scenarios cited 10–30 sources |
| Concrete data-point citation | 3/3 scenarios included numbers, percentages, study results |
| Multi-perspective coverage | Eval 2 correctly covered developer/manager/security |
| Evidence layering (strong/moderate/weak) | Eval 3 without-skill implemented Strongest/Moderate/Weak on its own |
| Code examples and benchmark data | Eval 1 without-skill included full Go code and performance tables |
| Balanced pros and cons | 3/3 scenarios covered both sides |

### 6.2 Base-Model Gaps (Filled by the Skill)

| Gap | Evidence | Risk level |
|---|---|---|
| No consistent report template | 3/3 scenarios used different structures | Medium — hard to compare across reports |
| Missing Areas of Consensus/Debate | 3/3 scenarios had no dedicated sections | Medium — readers can’t separate confirmed vs. unsettled |
| Missing Key Findings quick overview | 3/3 scenarios had no dedicated section | Low — readers can extract themselves |
| Missing Gaps and Further Research | 3/3 scenarios none or brief mention | Medium — no forward-looking dimension |
| Inconsistent citation format | 3/3 scenarios used different formats | Low — functionality unaffected |
| No source credibility labels | 3/3 scenarios had no per-source assessment | Medium — readers can’t quickly assess evidence weight |

**Core finding:** The base model’s "research ability" (search, synthesis, analysis) is strong, but its "research report discipline" (structure consistency, citation norms, credibility assessment) has clear gaps. The skill fills the latter.


## 7. Overall Score

### 7.1 Dimension Scores

| Dimension | With Skill | Without Skill | Delta |
|---|---|---|---|
| Report structure compliance | 5.0/5 | 1.0/5 | +4.0 |
| Citation format and credibility | 5.0/5 | 1.5/5 | +3.5 |
| Consensus/debate distinction | 5.0/5 | 1.0/5 | +4.0 |
| Forward-looking (Gaps section) | 5.0/5 | 1.5/5 | +3.5 |
| Content depth and breadth | 5.0/5 | 4.5/5 | +0.5 |
| Source count and quality | 5.0/5 | 4.5/5 | +0.5 |
| **Mean** | 5.0/5 | 2.33/5 | +2.67 |

### 7.2 Weighted Total

| Dimension | Weight | Score | Weighted |
|---|---|---|---|
| Assertion pass rate (delta) | 25% | 10/10 | 2.50 |
| Report structure compliance | 20% | 10/10 | 2.00 |
| Citation format and credibility | 15% | 10/10 | 1.50 |
| Consensus/debate + forward-looking | 10% | 10/10 | 1.00 |
| Token cost-effectiveness | 15% | 10/10 | 1.50 |
| Content quality increment | 10% | 2.0/10 | 0.20 |
| Source count/quality increment | 5% | 2.0/10 | 0.10 |
| **Weighted total** | 100% | — | **8.80/10** |
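
The weighted total is the dot product of the weights and scores in the table above; recomputed here as a sanity check:

```python
# (dimension, weight, score-out-of-10) rows from the weighted-total table.
rows = [
    ("Assertion pass rate (delta)",        0.25, 10.0),
    ("Report structure compliance",        0.20, 10.0),
    ("Citation format and credibility",    0.15, 10.0),
    ("Consensus/debate + forward-looking", 0.10, 10.0),
    ("Token cost-effectiveness",           0.15, 10.0),
    ("Content quality increment",          0.10,  2.0),
    ("Source count/quality increment",     0.05,  2.0),
]

# Weights must cover the full 100%.
assert abs(sum(w for _, w, _ in rows) - 1.0) < 1e-9

total = sum(w * s for _, w, s in rows)
print(f"{total:.2f}/10")  # 8.80/10
```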

Lower scores on content quality and source increment reflect an important fact: the base model’s research ability is already strong. The skill’s value is in structured report writing, not information gathering or analysis depth. This is not a skill defect but an accurate reflection of its design.


## 8. Evaluation Materials

| Material | Path |
|---|---|
| Eval 1 with-skill output | `/tmp/research-eval/eval-1/with_skill/response.md` |
| Eval 1 without-skill output | `/tmp/research-eval/eval-1/without_skill/response.md` |
| Eval 2 with-skill output | `/tmp/research-eval/eval-2/with_skill/response.md` |
| Eval 2 without-skill output | `/tmp/research-eval/eval-2/without_skill/response.md` |
| Eval 3 with-skill output | `/tmp/research-eval/eval-3/with_skill/response.md` |
| Eval 3 without-skill output | `/tmp/research-eval/eval-3/without_skill/response.md` |