
# deep-research Skill Evaluation Report

**Evaluation framework:** skill-creator
**Evaluation date:** 2026-03-12
**Evaluation target:** deep-research


deep-research is a source-backed research skill for factual and analytical research tasks. It suits technical surveys, option comparisons, claim verification, and cross-source synthesis, emphasizing evidence retrieval before conclusions. Its three main strengths:

- built-in evidence-chain requirements and hallucination-aware validation that reduce unsupported conclusions;
- a stable 7-section output template suitable for reusable research reports;
- numbered citations, source-credibility labels, and execution-completeness notes that make results easier to verify, review, and extend.

## 1. Evaluation Overview

This evaluation reviews the deep-research skill along two axes: actual task performance and token cost-effectiveness. Three research scenarios of increasing complexity were designed (focused technical research, multi-perspective analysis, cross-domain synthesis). Each scenario was run with both with-skill and without-skill configurations, for 3 scenarios × 2 configs = 6 independent subagent runs, scored against 27 assertions.

| Dimension | With Skill | Without Skill | Delta |
|---|---|---|---|
| Assertion pass rate | 27/27 (100%) | 9/27 (33.3%) | +66.7 pp |
| 7-section template compliance | 3/3 correct | 0/3 | Skill-only |
| Numbered citation format [1]-[n] | 3/3 correct | 0/3 | Skill-only |
| Source credibility labels | 3/3 correct | 0/3 | Skill-only |
| Content quality (depth/breadth/data) | 3/3 correct | 3/3 correct | No delta |
| Skill token cost | ~1,350 tokens | 0 | — |
| Token cost per 1% pass-rate gain | ~20 tokens | — | Best among evaluated skills |

**Key finding:** The deep-research skill’s core value is structural discipline, not content quality. The base model already has strong research ability (breadth, depth, data citation), but lacks consistent report structure. The skill’s 7-section template + numbered citations + credibility labels fill that gap.


## 2. Test Methodology

### 2.1 Scenario Design

| Scenario | User request | Core focus | Assertions |
|---|---|---|---|
| Eval 1: Focused technical research | "Research Go generics adoption — patterns, best practices, pitfalls" | Template compliance, citation format, technical depth | 10 |
| Eval 2: Multi-perspective analysis | "Research AI code review tools — developer, team lead, security perspectives" | Multi-perspective coverage, debate identification, balance | 8 |
| Eval 3: Cross-domain synthesis | "Research OSS maintainer burnout — causes, strategies, evidence" | Evidence layering, consensus vs. debate, research gaps | 9 |

### 2.2 Execution

- With-skill runs load SKILL.md first and follow its Research Process and Output Format.
- Without-skill runs load no skill; reports are generated by model default behavior.
- All runs may use WebSearch and WebFetch for real sources.
- All 6 subagents run in parallel.
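
The paired-run protocol above can be sketched as follows. This is an illustrative stand-in, not the actual harness: `run_subagent`, `score`, and the `Scenario` fields are hypothetical names, and the stub returns a placeholder instead of calling a real subagent.

```python
from dataclasses import dataclass, field

@dataclass
class Scenario:
    name: str
    prompt: str
    assertions: list = field(default_factory=list)  # (label, check_fn) pairs

def run_subagent(prompt, skill_md=None):
    """Stand-in for a real subagent call: with-skill runs prepend SKILL.md."""
    context = (skill_md + "\n\n" if skill_md else "") + prompt
    return f"<report generated from {len(context)} chars of context>"

def score(report, scenario):
    """Count how many assertion checks the report passes."""
    return sum(1 for _, check in scenario.assertions if check(report))

scenarios = [
    Scenario("eval-1", "Research Go generics adoption ..."),
    Scenario("eval-2", "Research AI code review tools ..."),
    Scenario("eval-3", "Research OSS maintainer burnout ..."),
]

# 3 scenarios x 2 configs = 6 independent runs
results = {}
for sc in scenarios:
    for config, skill in (("with_skill", "SKILL.md contents"), ("without_skill", None)):
        results[(sc.name, config)] = score(run_subagent(sc.prompt, skill), sc)

print(len(results))  # 6
```

In the real evaluation each run would execute in its own subagent context so the six runs cannot contaminate one another.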

### 2.3 Skill Characteristics

deep-research is a single-file skill (SKILL.md only, no references): 193 lines, 985 words, ~1,350 tokens. Core components:

| Component | Lines | Est. tokens |
|---|---|---|
| Research Process (5 steps) | ~30 | ~200 |
| Output Format (7-section template) | ~30 | ~200 |
| Source Evaluation Criteria | ~8 | ~60 |
| Full example (Intermittent Fasting) | ~80 | ~550 |
| Other (description/frontmatter/headers) | ~45 | ~340 |
| **Total** | 193 | ~1,350 |

## 3. Assertion Pass Rate

### 3.1 Summary

| Scenario | Assertions | With Skill | Without Skill | Delta |
|---|---|---|---|---|
| Eval 1: Go generics research | 10 | 10/10 (100%) | 3/10 (30.0%) | +70.0 pp |
| Eval 2: AI code review | 8 | 8/8 (100%) | 3/8 (37.5%) | +62.5 pp |
| Eval 3: OSS maintainer burnout | 9 | 9/9 (100%) | 3/9 (33.3%) | +66.7 pp |
| **Total** | 27 | 27/27 (100%) | 9/27 (33.3%) | +66.7 pp |
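
The per-scenario percentages and deltas reduce to simple arithmetic on the raw counts; a quick sanity check:

```python
# Raw assertion counts from the summary table above.
evals = {
    "eval-1": {"total": 10, "with": 10, "without": 3},
    "eval-2": {"total": 8,  "with": 8,  "without": 3},
    "eval-3": {"total": 9,  "with": 9,  "without": 3},
}

for name, e in evals.items():
    with_pct = 100 * e["with"] / e["total"]
    without_pct = 100 * e["without"] / e["total"]
    print(f"{name}: {with_pct:.1f}% vs {without_pct:.1f}% -> +{with_pct - without_pct:.1f} pp")

# Overall: 27/27 with skill vs 9/27 without -> +66.7 pp
total = sum(e["total"] for e in evals.values())
without_total = sum(e["without"] for e in evals.values())
print(f"overall delta: +{100 * (total - without_total) / total:.1f} pp")
```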

### 3.2 Per-Item Score Details

#### Eval 1: Go Generics Research

| # | Assertion | With Skill | Without Skill |
|---|---|---|---|
| A1 | "Executive Summary" section exists | ✅ | ✅ |
| A2 | "Key Findings" section has numbered citations [1]-[n] | ✅ (6 findings) | ❌ |
| A3 | "Detailed Analysis" section has subtopics | ✅ (7 subtopics) | ❌ |
| A4 | "Areas of Consensus" section | ✅ (6 points) | ❌ |
| A5 | "Areas of Debate" section | ✅ (6 points) | ❌ |
| A6 | "Sources" section uses numbered [1]-[n] citations | ✅ (18 sources) | ❌ |
| A7 | "Gaps and Further Research" section | ✅ (8 gaps) | ❌ |
| A8 | ≥3 independent sources | ✅ (18) | ✅ (11) |
| A9 | Sources include credibility labels | ✅ | ❌ |
| A10 | Findings include concrete data points | ✅ | ✅ |

#### Eval 2: AI Code Review Multi-Perspective Analysis

| # | Assertion | With Skill | Without Skill |
|---|---|---|---|
| B1 | All 7 template sections present | ✅ | ❌ |
| B2 | Covers 3 perspectives (developer/manager/security) | ✅ | ✅ |
| B3 | ≥4 independent sources | ✅ (19) | ✅ (10) |
| B4 | Citations use [1]-[n] format | ✅ | ❌ |
| B5 | Sources section has credibility labels | ✅ | ❌ |
| B6 | Areas of Debate identifies real disagreements | ✅ (6 debates) | ❌ |
| B7 | Balanced pros and cons | ✅ | ❌ |
| B8 | Mentions specific tools or studies | ✅ | ✅ |

#### Eval 3: OSS Maintainer Burnout Research

| # | Assertion | With Skill | Without Skill |
|---|---|---|---|
| C1 | All 7 template sections present | ✅ | ❌ |
| C2 | ≥4 independent sources | ✅ (29) | ✅ (~30) |
| C3 | Citations use [1]-[n] and are referenced in body | ✅ | ❌ |
| C4 | Sources include credibility assessment | ✅ | ❌ |
| C5 | Strategies have evidence layering (strong/moderate/weak) | ✅ | ✅ |
| C6 | Covers three themes (causes/strategies/evidence) | ✅ | ❌ |
| C7 | Consensus vs. debate clearly distinguished | ✅ | ❌ |
| C8 | Gaps section proposes concrete research directions | ✅ (8 gaps) | ❌ |
| C9 | Includes data points and study citations | ✅ | ✅ |

### 3.3 Classification of 18 Without-Skill Failures

| Failure type | Count | Evals | Notes |
|---|---|---|---|
| Missing specific 7-section template sections | 12 | 1/2/3 | Key Findings (3), Areas of Consensus (3), Areas of Debate (3), Gaps and Further Research (3) |
| Missing [1]-[n] citation format | 3 | 1/2/3 | Used inline URLs or reference tables, no unified numbering |
| Missing source credibility labels | 3 | 1/2/3 | Listed sources but no "peer-reviewed / authoritative / moderate credibility" labels |

**Note:** All 18 failures are structural/format failures, not content-quality failures. Without-skill passed all content dimensions (source count, data points, perspective coverage, evidence layering).
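
The classification can be cross-checked against the pass-rate numbers: the three failure types should account for exactly the 18 without-skill failures (27 assertions minus 9 passes), and the per-scenario failure counts reported in the trend analysis below should sum to the same total.

```python
# Failure counts from the classification table above.
failure_types = {
    "missing 7-section template sections": 12,
    "missing [1]-[n] citation format": 3,
    "missing source credibility labels": 3,
}
# Per-scenario failure counts (10-3, 8-3, 9-3 without-skill misses).
per_eval_failures = {"eval-1": 7, "eval-2": 5, "eval-3": 6}

assert sum(failure_types.values()) == 18
assert sum(per_eval_failures.values()) == 18
assert 27 - 9 == 18  # total assertions minus without-skill passes
print("failure classification is internally consistent")
```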

### 3.4 Trend Analysis

| Scenario complexity | With-Skill advantage | Failure type |
|---|---|---|
| Eval 1 (focused technical) | +70.0% (7 failures) | All structural |
| Eval 2 (multi-perspective) | +62.5% (5 failures) | All structural |
| Eval 3 (cross-domain) | +66.7% (6 failures) | All structural |

The skill’s advantage is highly stable across scenarios (62.5%–70.0%), unlike other skills with complexity-dependent trends. The reason: the skill’s core value—template compliance—does not depend on scenario complexity. Regardless of topic, the 7-section template and citation format are either followed or not.


## 4. Dimension-by-Dimension Comparison

### 4.1 Report Structure (7-Section Template)

This is the skill’s unique differentiator and accounts for 12 assertion deltas.

| Section | With Skill (3/3) | Without Skill alternative |
|---|---|---|
| Executive Summary | ✅ Always present | ✅ Usually present (2/3 have heading) |
| Key Findings | ✅ Concise points + citations | ❌ No dedicated section; findings scattered |
| Detailed Analysis | ✅ In-depth analysis with subheadings | ⚠️ Often similar content, different naming |
| Areas of Consensus | ✅ Dedicated section | ❌ None; consensus implied in body |
| Areas of Debate | ✅ Dedicated section | ❌ None; debate scattered |
| Sources | ✅ Numbered + credibility | ⚠️ Present but varied format (tables/lists/inline) |
| Gaps and Further Research | ✅ Forward-looking research directions | ❌ No dedicated section or brief mention only |

**Practical value:**

- **Areas of Consensus + Areas of Debate** is the most valuable structural element: it forces researchers to separate "confirmed" from "still debated" findings and keeps readers from treating preliminary findings as settled.
- **Gaps and Further Research** drives forward-looking thinking; without-skill output is a snapshot of the moment, while with-skill output adds a future-research dimension.
- **Key Findings** gives busy readers a quick overview; without the skill, readers must read the full report to extract the main points.
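
For reference, the 7-section layout the skill enforces looks roughly like the skeleton below. The section names are the ones evaluated in this report; the placeholder content and credibility-label phrasings are illustrative, not quoted from SKILL.md.

```markdown
# <Research Topic>

## Executive Summary
Two or three sentences summarizing the main conclusions.

## Key Findings
1. Finding with supporting citation [1]
2. Finding with supporting citations [2][3]

## Detailed Analysis
### <Subtopic A>
In-depth analysis with subheadings.

## Areas of Consensus
- Point most sources agree on [1][4]

## Areas of Debate
- Question where sources disagree [2][5]

## Sources
[1] Author/Site, "Title" (peer-reviewed; high credibility)
[2] Author/Site, "Title" (vendor blog; moderate credibility)

## Gaps and Further Research
- Open question or missing evidence
```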

### 4.2 Citation Format (Numbered [1]-[n])

| Dimension | With Skill | Without Skill |
|---|---|---|
| Citation format | [1], [2], ..., [n] — body numbers + full citations at end | Inline URLs, tables, parenthetical citations, author-year mix |
| Cross-reference | Body [1][2] maps directly to Sources section | Manual matching across formats |
| Consistency | 3/3 scenarios identical format | 3/3 scenarios different formats |

**Analysis:** Without-skill Eval 1 used a Markdown table for sources (URL + "Key Contribution"), Eval 2 used a numbered table, and Eval 3 listed sources by category; all three differed. With-skill output used the same format in all 3 scenarios: `[n]` in the body, and at the end, `[n] Full citation (credibility note)`.
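
One practical benefit of the unified `[n]` convention is that it is mechanically checkable. The sketch below shows what such a check could look like; `check_citations` is an illustrative helper, not part of the evaluated skill.

```python
import re

def check_citations(body, sources):
    """Return body citation numbers with no matching [n] entry in Sources."""
    cited = {int(n) for n in re.findall(r"\[(\d+)\]", body)}
    # Source entries are expected to start a line with their [n] marker.
    listed = {int(n) for n in re.findall(r"^\[(\d+)\]", sources, flags=re.M)}
    return cited - listed

body = "Generics reduce duplication [1][2], though compile times can grow [3]."
sources = (
    "[1] Go team blog (official; highest credibility)\n"
    "[2] GopherCon talk (conference; high credibility)"
)

print(check_citations(body, sources))  # {3}: cited in the body, never listed
```

The mixed formats of the without-skill runs (inline URLs, tables, author-year) admit no comparably simple cross-reference check.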

### 4.3 Source Credibility Labels

| Scenario | With Skill | Without Skill |
|---|---|---|
| Eval 1 | 18 sources, each labeled, e.g. "(Official Go team guidance; highest credibility)" | 11 sources, only a "Key Contribution" column |
| Eval 2 | 19 sources, each labeled, e.g. "(Pre-print; moderate credibility)" | 10 sources, only a "Type" column |
| Eval 3 | 29 sources, each labeled, e.g. "(Peer-reviewed conference paper; high credibility)" | ~30 sources grouped Academic/Industry, no per-source credibility |

**Practical value:** Credibility labels help readers quickly assess evidence weight. For example, in Eval 3 the with-skill report labeled the Tidelift data "self-reported survey data, not a randomized trial, but the effect sizes are large", making its limitations clear. The without-skill report only listed source names without any authority assessment.

### 4.4 Content Quality Comparison

| Dimension | With Skill | Without Skill | Delta |
|---|---|---|---|
| Source count | 18 / 19 / 29 | 11 / 10 / ~30 | Comparable or with-skill slightly more |
| Data-point density | High | High | No significant difference |
| Code examples (Eval 1) | Multiple full Go code blocks | Multiple full Go code blocks | No significant difference |
| Performance data (Eval 1) | PlanetScale benchmark table | DeepSource citation + qualitative | With-skill slightly better |
| Tool comparison table (Eval 2) | 5 tools × 3 dimensions | 5 tools × 3 dimensions (different data) | Comparable |
| Evidence layering (Eval 3) | Strong/Moderate/Weak + Consensus/Debate | Strongest/Moderate/Weak/Absent | Comparable |
| WebSearch usage | Extensive (12+ searches/eval) | Extensive (8+ searches/eval) | Comparable |
| Research depth | Excellent | Excellent | No significant difference |
Research depth Excellent Excellent No significant difference

**Conclusion:** The base model’s content quality is already strong. With-skill and without-skill are nearly identical on source count, data density, and analysis depth. The skill’s incremental value is entirely in structured template and citation format.


## 5. Token Cost-Effectiveness

### 5.1 Skill Size

deep-research is a very lightweight skill—single file, no references, fixed ~1,350 token cost.

| File | Lines | Words | Bytes | Est. tokens |
|---|---|---|---|---|
| SKILL.md | 193 | 985 | 6,995 | ~1,350 |
| Description (always in context) | — | ~40 | — | ~50 |
| References | None | — | — | 0 |
| **Total** | 193 | 985 | 6,995 | ~1,350 |

### 5.2 Token Cost vs. Quality Gain

| Metric | Value |
|---|---|
| With-skill pass rate | 100% (27/27) |
| Without-skill pass rate | 33.3% (9/27) |
| Pass-rate gain | +66.7 pp |
| Token cost per assertion fixed | ~75 tokens |
| Token cost per 1% pass-rate gain | ~20 tokens |
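
Both cost metrics follow directly from the skill size and the measured deltas; recomputed here as a sanity check:

```python
# Inputs from earlier sections: ~1,350-token skill, 27 assertions,
# 9 without-skill passes.
skill_tokens = 1350
assertion_deltas = 27 - 9            # 18 assertions flipped by the skill
pass_rate_gain_pp = 100 * 18 / 27    # +66.7 percentage points

tokens_per_assertion = skill_tokens / assertion_deltas  # ~75
tokens_per_pp = skill_tokens / pass_rate_gain_pp        # ~20

print(round(tokens_per_assertion), round(tokens_per_pp))  # 75 20
```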

### 5.3 Token Segment Cost-Effectiveness

| Module | Est. tokens | Related assertion deltas | Cost-effectiveness |
|---|---|---|---|
| Output Format template | ~200 | 12 (7-section × 3 evals, minus Executive Summary) | Very high — ~17 tok/assertion |
| Citation rules ([1]-[n] + credibility) | ~80 | 6 (number format 3 + credibility 3) | Very high — ~13 tok/assertion |
| Research Process (5 steps) | ~200 | Indirect (drives systematic method) | Medium — no direct assertion |
| Source Evaluation Criteria | ~60 | Indirect (drives credibility content) | Medium — indirect |
| Full example (Intermittent Fasting) | ~550 | Indirect (demonstrates template use) | Low — 41% of tokens, no direct assertion |
| Other (frontmatter/headers) | ~260 | 0 | Low — basic framework |

### 5.4 High-Leverage vs. Low-Leverage Instructions

**High leverage (~280 tokens → 18 assertion deltas):**

- Output Format template definition (~200 tok → 12)
- Citation format + credibility rules (~80 tok → 6)

**Medium leverage (~260 tokens → indirect):**

- Research Process 5 steps (~200 tok)
- Source Evaluation Criteria (~60 tok)

**Low leverage (~810 tokens → 0 direct deltas):**

- Full example (~550 tok): 41% of the total; may indirectly help template adherence
- Other framework content (~260 tok)

### 5.5 Token Efficiency Rating

| Rating | Conclusion |
|---|---|
| Overall ROI | Excellent — ~1,350 tokens for +66.7% pass rate |
| High-leverage token share | ~21% (280/1,350) directly contributes to 18/18 assertion deltas |
| Low-leverage token share | ~60% (810/1,350) with no direct assertion contribution |
| Reference cost-effectiveness | N/A — no references |
| Example cost-effectiveness | Optimizable — 550 tokens (41%) for one example; room to compress |

### 5.6 Comparison with Other Skills

| Metric | deep-research | yt-dlp-downloader | go-makefile-writer | tdd-workflow |
|---|---|---|---|---|
| SKILL.md tokens | ~1,350 | ~2,370 | ~1,960 | ~2,100 |
| Total load tokens | ~1,350 | ~5,100–5,730 | ~4,100–4,600 | ~3,600–4,800 |
| Pass-rate gain | +66.7% | +55.0% | +31.0% | +46.2% |
| Tokens per 1% (SKILL.md) | ~20 tok | ~43 tok | ~63 tok | ~45 tok |
| Tokens per 1% (full) | ~20 tok | ~95 tok | ~149 tok | ~92 tok |

deep-research has the best token cost-effectiveness among the evaluated skills because:

1. **Single file, zero references** — fixed ~1,350 token cost, no conditional loading
2. **Precise fit for the base-model gap** — the gap is structural template (easy to fill with few tokens), not domain knowledge
3. **Very compact template instructions** — the 7-section definition in ~200 tokens drives 12 assertion deltas


## 6. Boundary with Base Model Capabilities

### 6.1 Capabilities the Base Model Already Has (No Skill Increment)

| Capability | Evidence |
|---|---|
| WebSearch + WebFetch information gathering | 3/3 scenarios used 8–12+ searches |
| Multi-source synthesis | 3/3 scenarios cited 10–30 sources |
| Concrete data-point citation | 3/3 scenarios included numbers, percentages, study results |
| Multi-perspective coverage | Eval 2 correctly covered developer/manager/security |
| Evidence layering (strong/moderate/weak) | Eval 3 without-skill implemented Strongest/Moderate/Weak on its own |
| Code examples and benchmark data | Eval 1 without-skill included full Go code and performance tables |
| Balanced pros and cons | 3/3 scenarios covered both sides |

### 6.2 Base-Model Gaps (Filled by the Skill)

| Gap | Evidence | Risk level |
|---|---|---|
| No consistent report template | 3/3 scenarios used different structures | Medium — hard to compare across reports |
| Missing Areas of Consensus/Debate | 3/3 scenarios had no dedicated sections | Medium — readers can’t separate confirmed vs. unsettled |
| Missing Key Findings quick overview | 3/3 scenarios had no dedicated section | Low — readers can extract themselves |
| Missing Gaps and Further Research | 3/3 scenarios none or brief mention | Medium — no forward-looking dimension |
| Inconsistent citation format | 3/3 scenarios used different formats | Low — functionality unaffected |
| No source credibility labels | 3/3 scenarios had no per-source assessment | Medium — readers can’t quickly assess evidence weight |

**Core finding:** The base model’s "research ability" (search, synthesis, analysis) is strong, but its "research report discipline" (structure consistency, citation norms, credibility assessment) has clear gaps. The skill fills the latter.


## 7. Overall Score

### 7.1 Dimension Scores

| Dimension | With Skill | Without Skill | Delta |
|---|---|---|---|
| Report structure compliance | 5.0/5 | 1.0/5 | +4.0 |
| Citation format and credibility | 5.0/5 | 1.5/5 | +3.5 |
| Consensus/debate distinction | 5.0/5 | 1.0/5 | +4.0 |
| Forward-looking (Gaps section) | 5.0/5 | 1.5/5 | +3.5 |
| Content depth and breadth | 5.0/5 | 4.5/5 | +0.5 |
| Source count and quality | 5.0/5 | 4.5/5 | +0.5 |
| **Mean** | 5.0/5 | 2.33/5 | +2.67 |

### 7.2 Weighted Total

| Dimension | Weight | Score | Weighted |
|---|---|---|---|
| Assertion pass rate (delta) | 25% | 10/10 | 2.50 |
| Report structure compliance | 20% | 10/10 | 2.00 |
| Citation format and credibility | 15% | 10/10 | 1.50 |
| Consensus/debate + forward-looking | 10% | 10/10 | 1.00 |
| Token cost-effectiveness | 15% | 10/10 | 1.50 |
| Content quality increment | 10% | 2.0/10 | 0.20 |
| Source count/quality increment | 5% | 2.0/10 | 0.10 |
| **Weighted total** | 100% | — | **8.80/10** |
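
The weighted total is the dot product of the weights and scores in the table above; recomputed here as a sanity check:

```python
# (dimension, weight, score-out-of-10) rows from the weighted-total table.
rows = [
    ("Assertion pass rate (delta)",        0.25, 10.0),
    ("Report structure compliance",        0.20, 10.0),
    ("Citation format and credibility",    0.15, 10.0),
    ("Consensus/debate + forward-looking", 0.10, 10.0),
    ("Token cost-effectiveness",           0.15, 10.0),
    ("Content quality increment",          0.10,  2.0),
    ("Source count/quality increment",     0.05,  2.0),
]

# Weights must cover the full 100%.
assert abs(sum(w for _, w, _ in rows) - 1.0) < 1e-9

total = sum(w * s for _, w, s in rows)
print(f"{total:.2f}/10")  # 8.80/10
```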

Lower scores on content quality and source increment reflect an important fact: the base model’s research ability is already strong. The skill’s value is in structured report writing, not information gathering or analysis depth. This is not a skill defect but an accurate reflection of its design.


## 8. Evaluation Materials

| Material | Path |
|---|---|
| Eval 1 with-skill output | `/tmp/research-eval/eval-1/with_skill/response.md` |
| Eval 1 without-skill output | `/tmp/research-eval/eval-1/without_skill/response.md` |
| Eval 2 with-skill output | `/tmp/research-eval/eval-2/with_skill/response.md` |
| Eval 2 without-skill output | `/tmp/research-eval/eval-2/without_skill/response.md` |
| Eval 3 with-skill output | `/tmp/research-eval/eval-3/with_skill/response.md` |
| Eval 3 without-skill output | `/tmp/research-eval/eval-3/without_skill/response.md` |