
yt-dlp-downloader Skill Evaluation Report

Evaluation framework: skill-creator
Evaluation date: 2026-03-12
Evaluation subject: yt-dlp-downloader


yt-dlp-downloader is a skill for generating and running yt-dlp download commands. It covers single videos, playlists, audio extraction, subtitle downloads, SponsorBlock, resolution limits, and authenticated downloads. It has three standout strengths. First, it probes before committing: it uses format lists and subtitle information to choose flag combinations instead of guessing parameters. Second, it applies safe defaults (--no-playlist, retries, output naming, and a download archive) that reduce accidental full-playlist downloads, re-downloads, and runaway commands. Third, it produces structured execution reports, which make complex combined requests easier to reuse, review, and adjust.

1. Evaluation Overview

This evaluation reviews the yt-dlp-downloader skill along two axes: actual task performance and token cost-effectiveness. Three yt-dlp command-generation scenarios of increasing complexity were designed (single video download, audio extraction + subtitles, playlist + resolution + SponsorBlock + subtitles). Each scenario was run with both with-skill and without-skill configurations, for 3 scenarios × 2 configs = 6 independent subagent runs, scored against 40 assertions.

| Dimension | With Skill | Without Skill | Delta |
| --- | --- | --- | --- |
| Assertion pass rate | 40/40 (100%) | 18/40 (45.0%) | +55.0 pp |
| Output Contract structured report | 3/3 correct | 0/3 | Skill-only |
| Probe decision compliance | 3/3 correct | 0/3 | Skill-only |
| Safety guard (--no-playlist) | 3/3 (including correct --yes-playlist in playlist scenario) | 0/2 (missing in single-video scenarios) | Largest safety delta |
| Safe defaults (archive/retries/truncation) | 3/3 correct | 0/3 | Skill-only |
| Skill token cost (SKILL.md only) | ~2,370 tokens | 0 | |
| Skill token cost (with references) | ~5,100–6,260 tokens | 0 | |
| Token cost per 1% pass-rate gain | ~43 tokens (SKILL.md only) / ~103 tokens (full) | | |

2. Test Methodology

2.1 Scenario Design

| Scenario | User request | Core focus | Assertions |
| --- | --- | --- | --- |
| Eval 1: Single video download | "Help me download this YouTube video to ~/Downloads/videos, MP4 format best quality" | Basic command structure, safe defaults, Output Contract | 12 |
| Eval 2: Audio extraction + subtitles | "Extract audio as MP3, save English subtitles as SRT" | Dual-scenario combination, subtitle probing, ffmpeg dependency | 13 |
| Eval 3: Playlist + 720p + SponsorBlock + subtitles | "Download entire playlist at max 720p, skip sponsors, embed Chinese subtitles" | Four scenarios combined, format selection, complex command combination | 15 |

2.2 Execution

  • With-skill runs first read SKILL.md and its referenced materials
  • Without-skill runs read no skill, using model default yt-dlp knowledge
  • All runs in Degraded mode (no yt-dlp installed); evaluates command recommendation quality, not actual execution
  • 6 subagents run in parallel

3. Assertion Pass Rate

3.1 Overview

| Scenario | Assertions | With Skill | Without Skill | Delta |
| --- | --- | --- | --- | --- |
| Eval 1: Single video download | 12 | 12/12 (100%) | 5/12 (41.7%) | +58.3 pp |
| Eval 2: Audio extraction + subtitles | 13 | 13/13 (100%) | 7/13 (53.8%) | +46.2 pp |
| Eval 3: Playlist + 720p + SponsorBlock + subtitles | 15 | 15/15 (100%) | 6/15 (40.0%) | +60.0 pp |
| Total | 40 | 40/40 (100%) | 18/40 (45.0%) | +55.0 pp |
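
The deltas in this table are differences between two pass rates, so they are percentage points rather than relative percentages. A minimal sketch of the arithmetic, using only the counts reported above (the function and variable names are illustrative, not part of the evaluation framework):

```python
# (eval name, total assertions, with-skill passes, without-skill passes)
RESULTS = [
    ("Eval 1", 12, 12, 5),
    ("Eval 2", 13, 13, 7),
    ("Eval 3", 15, 15, 6),
]


def pass_rate(passed: int, total: int) -> float:
    """Pass rate as a percentage."""
    return 100.0 * passed / total


def delta_pp(total: int, with_skill: int, without_skill: int) -> float:
    """Gain in percentage points: a difference of two rates, not a ratio."""
    return pass_rate(with_skill, total) - pass_rate(without_skill, total)


def summarize(results):
    """Per-eval deltas plus the pooled total, rounded to one decimal."""
    rows = [(name, round(delta_pp(t, w, wo), 1)) for name, t, w, wo in results]
    total = sum(t for _, t, _, _ in results)
    with_all = sum(w for _, _, w, _ in results)
    without_all = sum(wo for _, _, _, wo in results)
    rows.append(("Total", round(delta_pp(total, with_all, without_all), 1)))
    return rows


print(summarize(RESULTS))
```

The pooled total is computed from the summed counts (40 and 18 of 40), not by averaging the three per-eval deltas.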

3.2 Per-Assertion Details

Eval 1: Single Video Download

| # | Assertion | With Skill | Without Skill |
| --- | --- | --- | --- |
| A1 | --no-playlist flag present | ✅ | ❌ |
| A2 | Best quality format selector (bv*+ba/b or equivalent) | ✅ | ✅ |
| A3 | --merge-output-format mp4 | ✅ | ✅ |
| A4 | --download-archive flag present | ✅ | ❌ |
| A5 | --retries and --fragment-retries | ✅ | ❌ |
| A6 | Title truncation %(title).200s | ✅ | ❌ |
| A7 | 7-field Output Contract complete | ✅ | ❌ |
| A8 | Probe decision correct (skip + reason) | ✅ | ❌ |
| A9 | Output path includes ~/Downloads/videos | ✅ | ✅ |
| A10 | No hardcoded format ID | ✅ | ✅ |
| A11 | Mentions ffmpeg dependency | ✅ | ✅ |
| A12 | Explicit Degraded mode declaration | ✅ | ❌ |
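
Several of these assertions are purely syntactic and could be checked mechanically. The report does not describe how assertions were graded, so the checker below is a hypothetical sketch: the function name, the result keys, and the regular expression used to spot a hardcoded format ID are all assumptions.

```python
import re
import shlex


def check_single_video_command(command: str) -> dict:
    """Check a yt-dlp command string against a few Eval 1 style assertions."""
    args = shlex.split(command)

    # The output template is the value that follows -o, if present.
    template = ""
    if "-o" in args and args.index("-o") + 1 < len(args):
        template = args[args.index("-o") + 1]

    # A hardcoded format ID looks like "-f 137+140": digits joined by "+".
    hardcoded_id = any(
        flag == "-f" and re.fullmatch(r"\d+(\+\d+)*", value)
        for flag, value in zip(args, args[1:])
    )

    return {
        "A1_no_playlist": "--no-playlist" in args,
        "A4_download_archive": "--download-archive" in args,
        "A5_retries": "--retries" in args and "--fragment-retries" in args,
        "A6_title_truncation": "%(title).200s" in template,
        "A10_no_hardcoded_format_id": not hardcoded_id,
    }
```

Assertions such as A7 (Output Contract completeness) or A8 (probe decision with a reason) concern the surrounding report rather than the command, so they would need a different kind of check.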

Eval 2: Audio Extraction + Subtitles

| # | Assertion | With Skill | Without Skill |
| --- | --- | --- | --- |
| B1 | -x flag present | ✅ | ✅ |
| B2 | --audio-format mp3 | ✅ | ✅ |
| B3 | --audio-quality 0 (best VBR quality) | ✅ | ❌ |
| B4 | Subtitle probe --list-subs recommended | ✅ | ❌ |
| B5 | --write-subs (standalone file, not embed) | ✅ | ✅ |
| B6 | --sub-lang en or equivalent | ✅ | ✅ |
| B7 | --convert-subs srt (ensure SRT output) | ✅ | ✅ |
| B8 | Mentions ffmpeg dependency | ✅ | ✅ |
| B9 | 7-field Output Contract complete | ✅ | ❌ |
| B10 | --no-playlist present | ✅ | ❌ |
| B11 | Output directory ~/Music/podcast/ | ✅ | ✅ |
| B12 | --download-archive present | ✅ | ❌ |
| B13 | Title truncation %(title).200s | ✅ | ❌ |

Eval 3: Playlist + 720p + SponsorBlock + Subtitles

| # | Assertion | With Skill | Without Skill |
| --- | --- | --- | --- |
| C1 | --yes-playlist explicitly declared | ✅ | ❌ |
| C2 | Resolution cap [height<=720] or -S "res:720" | ✅ | ✅ |
| C3 | --sponsorblock-remove with relevant categories | ✅ | ✅ |
| C4 | Subtitle probe --list-subs recommended | ✅ | ❌ |
| C5 | --embed-subs for embedded subtitles | ✅ | ✅ |
| C6 | Chinese subtitle language code coverage | ✅ | ✅ |
| C7 | Nested playlist output template with truncation + zero-padding | ✅ | ❌ |
| C8 | --download-archive present | ✅ | ❌ |
| C9 | 7-field Output Contract complete | ✅ | ❌ |
| C10 | Probe section has format/subtitle probe commands | ✅ | ❌ |
| C11 | --merge-output-format mp4 | ✅ | ❌ |
| C12 | Output directory ~/Videos/course/ | ✅ | ✅ |
| C13 | Mentions ffmpeg dependency | ✅ | ✅ |
| C14 | --write-subs paired with --embed-subs | ✅ | ❌ |
| C15 | Title truncation %(title).200s | ✅ | ❌ |

3.3 Classification of 22 Failed Assertions (Without-Skill)

| Failure type | Count | Evals | Notes |
| --- | --- | --- | --- |
| Missing 7-field Output Contract | 3 | 1/2/3 | No structured Scenario/Inputs/Probe/Command/Status/Location/Next report |
| Missing --download-archive | 3 | 1/2/3 | Re-run would re-download all content |
| Missing title truncation %(title).200s | 3 | 1/2/3 | Long titles may cause filesystem path overflow |
| Missing --no-playlist safety guard | 2 | 1/2 | Single-video URL with list param may trigger full playlist download |
| Missing Probe decision/subtitle probe | 3 | 1/2/3 | Assumes subtitles exist without checking; no skip rationale |
| Missing --retries/--fragment-retries | 1 | 1 | Unstable network may cause download failure |
| Missing Degraded mode declaration | 1 | 1 | Does not state command was not executed |
| Missing --audio-quality 0 | 1 | 2 | MP3 not using best VBR quality |
| Missing --yes-playlist explicit declaration | 1 | 3 | Playlist URL default behavior may be unstable |
| Playlist template missing truncation + zero-padding | 1 | 3 | %(playlist_index)s without zero-padding leads to wrong sort order |
| Missing --merge-output-format mp4 | 1 | 3 | Output format uncertain (may be mkv/webm) |
| Missing --write-subs with --embed-subs | 1 | 3 | --embed-subs requires subtitles to be downloaded first |
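
The truncation and zero-padding failures are easy to reproduce, because yt-dlp output templates follow Python's printf-style string formatting. The sketch below uses plain Python %-formatting as an approximation; yt-dlp itself also sanitizes filenames and supports extra template syntax that is not modelled here, and the sample titles are made up.

```python
def render(template: str, fields: dict) -> str:
    """Expand a printf-style template with plain Python %-formatting."""
    return template % fields


# Three hypothetical playlist entries, including a two-digit index.
entries = [{"playlist_index": i, "title": f"Lesson {i}"} for i in (1, 2, 10)]

# Without zero-padding, lexicographic order puts entry 10 before entry 2.
unpadded = sorted(render("%(playlist_index)s - %(title)s", e) for e in entries)

# With 05d padding, lexicographic order matches playlist order.
padded = sorted(render("%(playlist_index)05d - %(title).200s", e) for e in entries)

# The .200s precision caps the title at 200 characters.
long_title = render("%(title).200s", {"title": "x" * 300})

print(unpadded)
print(padded)
print(len(long_title))
```

The unpadded names sort as 1, 10, 2 in any file browser that orders lexicographically, which is the "wrong sort order" noted in the table.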

3.4 Trend: Skill Advantage Is Largest in the Most Complex Scenario

| Scenario complexity | With-Skill advantage |
| --- | --- |
| Eval 1 (simple single video) | +58.3 pp (7 failures) |
| Eval 2 (medium dual-scenario) | +46.2 pp (6 failures) |
| Eval 3 (complex four-scenario overlay) | +60.0 pp (9 failures) |

The advantage is not monotonic: it dips in Eval 2 (+46.2 pp) and peaks in Eval 3 (+60.0 pp). Unlike the go-makefile-writer evaluation, where the skill advantage decreased with complexity, this skill is strongest in the most complex scenario. The likely reason is that yt-dlp command combinations carry many implicit rules (--write-subs with --embed-subs, playlist template zero-padding, the ffmpeg dependency of SponsorBlock removal, and so on), and the base model omits more of these details when several scenarios are stacked.


4. Dimension-by-Dimension Comparison

4.1 Output Contract (Structured Report)

This is a Skill-only differentiator, contributing 3 assertion deltas.

| Field | With Skill output | Without Skill output |
| --- | --- | --- |
| 1. Scenario | "Single video / Audio extraction + Subtitles / Composite: Playlist + Fixed Resolution + SponsorBlock + Subtitles" | None |
| 2. Inputs | Structured table (URL/dir/format/subs/auth) | Prose description |
| 3. Probe | Explicit decision (skipped + reason / recommended command) | None |
| 4. Final command | Full copy-paste command + table of reasons per flag | Command + brief param notes |
| 5. Execution status | "Not run in this environment" | No explicit declaration |
| 6. Output location | Expected file path pattern | Brief save location |
| 7. Next step | Ordered follow-up action list | Brief hint |

Practical value: the Output Contract enables:

  • Auditable command recommendations (know why specific flags were chosen)
  • Transparent Probe decisions (whether the probe was skipped and why)
  • Clear next steps for users (no guessing)
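
One way to picture the seven-field contract is as a small record type with a completeness check. This is an illustrative sketch only: the class name, attribute names, and sample values are assumptions derived from the field list above, not the skill's actual report format.

```python
from dataclasses import dataclass, fields


@dataclass
class OutputContract:
    """Hypothetical container for the seven report fields."""
    scenario: str
    inputs: dict
    probe: str
    final_command: str
    execution_status: str
    output_location: str
    next_steps: list

    def missing_fields(self) -> list:
        """Names of fields left empty, so an incomplete report is easy to spot."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]


report = OutputContract(
    scenario="Single video",
    inputs={"url": "https://example.com/watch?v=VIDEO_ID", "dir": "~/Downloads/videos"},
    probe="Skipped: public video, default best quality",
    final_command="yt-dlp --no-playlist -f 'bv*+ba/b' URL",
    execution_status="Not run in this environment",
    output_location="~/Downloads/videos/<title> [<id>].mp4",
    next_steps=["Install yt-dlp and ffmpeg", "Run the command"],
)

print(report.missing_fields())
```

A report that omits the probe decision would list "probe" as missing, which corresponds to the kind of gap the 7-field assertions (A7, B9, C9) are meant to catch.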

4.2 Probe Decision Framework

This is the skill’s core design advantage, contributing 3 assertion deltas.

| Scenario | With Skill Probe decision | Without Skill |
| --- | --- | --- |
| Eval 1 | Skipped: public video, default best quality, no probe needed | No framework |
| Eval 2 | --list-subs recommended: subtitle availability unknown; probe before deciding --write-subs or --write-auto-subs | Assumes subtitles exist |
| Eval 3 | 3 probe commands: playlist content, format availability, subtitle availability | No probe |

The key issue in the without-skill runs is that they assume subtitles exist and add --write-subs or --embed-subs; if no subtitles exist, the failure is silent. The skill's Probe Gate forces verification before download.
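
The verify-before-download idea can be sketched as a small decision function. It assumes the probe result is available as a dictionary with `subtitles` and `automatic_captions` keys, which is how yt-dlp's Python API reports them in the info dict returned by `extract_info(url, download=False)`; on the command line the same information comes from --list-subs. The function name and the empty-list fallback are illustrative choices.

```python
def subtitle_flags(info: dict, lang: str = "en") -> list:
    """Pick subtitle flags from probe results instead of assuming subtitles exist."""
    manual = info.get("subtitles") or {}
    auto = info.get("automatic_captions") or {}
    if lang in manual:
        # Uploader-provided subtitles exist for this language.
        return ["--write-subs", "--sub-langs", lang]
    if lang in auto:
        # Only auto-generated captions exist for this language.
        return ["--write-auto-subs", "--sub-langs", lang]
    # Nothing to download: report this to the user instead of failing silently.
    return []
```

Returning an empty list makes the "no subtitles" case explicit, so the caller can say so in the report rather than emit a command that quietly produces no subtitle file.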

4.3 Safety Guard Flags

| Flag | Purpose | With Skill | Without Skill |
| --- | --- | --- | --- |
| --no-playlist | Prevent watch URL from accidentally triggering full playlist download | Eval 1 ✅ / Eval 2 ✅ | ❌ / ❌ |
| --yes-playlist | Explicitly declare playlist intent | Eval 3 ✅ | |
| --download-archive | Prevent re-download | 3/3 ✅ | 0/3 ❌ |
| --retries/--fragment-retries | Network resilience | 3/3 ✅ | 1/3 |
| %(title).200s | Prevent long title path overflow | 3/3 ✅ | 0/3 ❌ |

--no-playlist is the highest-risk safety gap. When a YouTube watch URL includes &list=, omitting --no-playlist downloads the entire playlist instead of one video, potentially causing tens of GB of accidental downloads. This is explicitly addressed in Skill Anti-Example #3.
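
A minimal sketch of the guard, assuming the caller already knows whether the user wants one video or the whole playlist. The helper names are hypothetical; only the --no-playlist and --yes-playlist flags come from yt-dlp itself.

```python
from urllib.parse import parse_qs, urlparse


def has_playlist_param(url: str) -> bool:
    """True when the URL query string carries a non-empty list= parameter."""
    return bool(parse_qs(urlparse(url).query).get("list"))


def playlist_flag(want_playlist: bool) -> str:
    """Always state playlist intent explicitly instead of relying on defaults."""
    return "--yes-playlist" if want_playlist else "--no-playlist"


def guard_args(url: str, want_playlist: bool = False) -> list:
    """Start of a yt-dlp argument list with the playlist guard in place."""
    return ["yt-dlp", playlist_flag(want_playlist), url]


risky = "https://www.youtube.com/watch?v=VIDEO_ID&list=PLAYLIST_ID"
print(has_playlist_param(risky), guard_args(risky))
```

Emitting the flag unconditionally, rather than only when `has_playlist_param` is true, keeps the command safe even if the URL is later swapped for one that carries a list parameter.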

4.4 Command Technical Correctness

| Detail | With Skill | Without Skill |
| --- | --- | --- |
| Format selector | bv*+ba/b (includes pre-merged fallback) | bestvideo+bestaudio/best (equivalent but no *) |
| Playlist template | %(playlist_title).120s/%(playlist_index)05d | %(playlist)s/%(playlist_index)s |
| Subtitle embed chain | --write-subs --write-auto-subs + --embed-subs | --embed-subs (missing --write-subs) |
| SponsorBlock categories | sponsor,selfpromo,interaction | all (may over-delete) |
| Audio quality | --audio-quality 0 (best VBR) | Not specified (default quality 5) |

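The with-skill selector bv*+ba/b is more robust than bestvideo+bestaudio/best. bv* selects the best format that contains video even if it also carries audio, whereas bestvideo (bv) considers only video-only formats. On sites that offer only pre-merged formats, the without-skill selector still works, but only through its /best fallback.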
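
As an illustration, the selector and an optional resolution cap can be produced by one small helper. The function name is hypothetical; the selector strings follow yt-dlp's documented format-selection syntax, with the height filter applied to both the merged branch and the fallback.

```python
from typing import Optional


def format_selector(max_height: Optional[int] = None) -> str:
    """Build a yt-dlp -f value: best video plus best audio, with a fallback.

    With max_height set, the same height filter is applied to both the
    bv* branch and the pre-merged b fallback, so the cap cannot be bypassed.
    """
    if max_height is None:
        return "bv*+ba/b"
    cap = f"[height<={max_height}]"
    return f"bv*{cap}+ba/b{cap}"


print(format_selector())     # bv*+ba/b
print(format_selector(720))  # bv*[height<=720]+ba/b[height<=720]
```

Filtering only the first branch is a common slip: a fallback of plain b could then select a format above the requested cap.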

4.5 Ambiguity Resolution Quality

In Eval 3, "Chinese subtitles" is ambiguous:

| Dimension | With Skill | Without Skill |
| --- | --- | --- |
| Ambiguity identification | Explicitly notes "assumption: zh-Hans" and explains YouTube language tag inconsistency | No ambiguity analysis |
| Language code coverage | zh-Hans,zh-Hant,zh (three-code fallback chain) | zh,zh-Hans,zh-Hant |
| Fallback strategy | Explicitly recommends probe first; adjust if language codes differ | Brief "skip if no Chinese subtitles" |

Both cover the same three language codes, but the with-skill ambiguity resolution is more transparent: users know why these codes were chosen and how to adjust them.
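
The three-code coverage and the embed chain from section 4.4 can be combined in a short helper. This is a sketch under assumptions: the function name is made up, the flag set mirrors the with-skill row of the technical-correctness table, and --sub-langs is used as the yt-dlp option that accepts a comma-separated language list.

```python
def embed_subtitle_flags(lang_codes: list) -> list:
    """Flags for embedding subtitles while trying several language codes."""
    # Drop duplicate codes but keep the caller's order.
    ordered = []
    for code in lang_codes:
        if code not in ordered:
            ordered.append(code)
    return [
        "--write-subs",
        "--write-auto-subs",
        "--embed-subs",
        "--sub-langs",
        ",".join(ordered),
    ]


print(embed_subtitle_flags(["zh-Hans", "zh-Hant", "zh"]))
```

If a probe shows that a site uses different tags for Chinese subtitles, only the list passed to the helper needs to change.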


5. Token Cost-Effectiveness Analysis

5.1 Skill Size

| File | Lines | Words | Bytes | Est. tokens |
| --- | --- | --- | --- | --- |
| SKILL.md | 214 | 1,298 | 9,742 | ~2,370 |
| references/scenario-templates.md | 168 | 548 | 5,053 | ~980 |
| references/decision-rules.md | 124 | 646 | 4,515 | ~870 |
| references/safety-and-recovery.md | 154 | 557 | 3,778 | ~730 |
| references/golden-examples.md | 110 | 497 | 4,290 | ~830 |
| references/format-selection-guide.md | 126 | 515 | 3,512 | ~680 |
| Description (always in context) | | ~50 | | ~70 |

5.2 Typical Load Scenarios

SKILL.md’s "Load References Selectively" section guides on-demand loading:

| Scenario | Files read | Total tokens |
| --- | --- | --- |
| Simple download (Eval 1) | SKILL.md + scenario-templates + golden-examples | ~4,180 |
| Medium combination (Eval 2) | SKILL.md + scenario-templates + decision-rules + golden-examples | ~5,050 |
| Complex multi-scenario (Eval 3) | SKILL.md + scenario-templates + decision-rules + format-selection-guide + golden-examples | ~5,730 |
| Failure recovery | SKILL.md + safety-and-recovery | ~3,100 |
| Full load | All files | ~6,460 |
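
These totals follow directly from the per-file estimates in section 5.1. A short sketch that reproduces them (the dictionary keys are shortened file names chosen for readability):

```python
# Estimated tokens per file, copied from the skill-size table in section 5.1.
TOKENS = {
    "SKILL.md": 2370,
    "scenario-templates": 980,
    "decision-rules": 870,
    "safety-and-recovery": 730,
    "golden-examples": 830,
    "format-selection-guide": 680,
}


def load_cost(*files: str) -> int:
    """Total estimated tokens for the files read in one scenario."""
    return sum(TOKENS[name] for name in files)


simple = load_cost("SKILL.md", "scenario-templates", "golden-examples")
medium = load_cost("SKILL.md", "scenario-templates", "decision-rules", "golden-examples")
complex_load = load_cost(
    "SKILL.md", "scenario-templates", "decision-rules",
    "format-selection-guide", "golden-examples",
)
recovery = load_cost("SKILL.md", "safety-and-recovery")
full = load_cost(*TOKENS)

print(simple, medium, complex_load, recovery, full)
```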

5.3 Token Cost for Quality Gain

| Metric | Value |
| --- | --- |
| With-skill pass rate | 100% (40/40) |
| Without-skill pass rate | 45.0% (18/40) |
| Pass-rate gain | +55.0 pp |
| Token cost per assertion fixed | ~108 tokens (SKILL.md only) / ~240 tokens (average full) |
| Token cost per 1% pass-rate gain | ~43 tokens (SKILL.md only) / ~95 tokens (average full) |
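
The SKILL.md-only figures can be reproduced from numbers already stated in this report: ~2,370 tokens, 22 assertions fixed, and a 55.0-point gain. The sketch below covers only those two values, since the report does not state the token total behind the "average full" figures.

```python
SKILL_MD_TOKENS = 2370       # estimated size of SKILL.md
ASSERTIONS_FIXED = 40 - 18   # with-skill passes minus without-skill passes
GAIN_PP = 100.0 - 45.0       # pass-rate gain in percentage points

tokens_per_assertion = SKILL_MD_TOKENS / ASSERTIONS_FIXED
tokens_per_point = SKILL_MD_TOKENS / GAIN_PP

print(round(tokens_per_assertion), round(tokens_per_point))
```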

5.4 Token Segment Cost-Effectiveness

| Module | Est. tokens | Related assertion delta | Cost-effectiveness |
| --- | --- | --- | --- |
| Output Contract definition | ~200 | 3 assertions (3 evals 7-field report) | Very high (67 tok/assertion) |
| Probe Gate decision framework | ~250 | 3 assertions (probe skip/recommend) | Very high (83 tok/assertion) |
| --no-playlist safety rule + Anti-Example #3 | ~80 | 2 assertions (Eval 1/2 missing guard) | Very high (40 tok/assertion) |
| Safe defaults (archive/retries/truncation) | ~150 | 7 assertions (3×archive + 1×retries + 3×truncation) | Very high (21 tok/assertion) |
| --yes-playlist explicit declaration rule | ~30 | 1 assertion | Very high (30 tok/assertion) |
| Audio quality 0 rule | ~20 | 1 assertion | Very high (20 tok/assertion) |
| --write-subs + --embed-subs chain | ~40 | 1 assertion | Very high (40 tok/assertion) |
| Playlist template truncation + zero-padding | ~30 | 1 assertion | Very high (30 tok/assertion) |
| --merge-output-format mp4 rule | ~20 | 1 assertion | Very high (20 tok/assertion) |
| Degraded mode framework | ~100 | 1 assertion | High (100 tok/assertion) |
| Gate pipeline architecture (7 gates diagram) | ~300 | Indirect (structured thinking) | Medium (no direct assertion) |
| Anti-Examples (8) | ~350 | Indirect (avoid hardcoded format ID, etc.) | Medium (indirect) |
| Scope Classification table | ~120 | Indirect (correct scenario classification) | Medium (indirect) |
| Auth Safety Gate | ~100 | 0 (no auth scenario in this eval) | Low (not tested) |
| Live Stream rules | ~50 | 0 (no live stream in this eval) | Low (not tested) |

5.5 High-Leverage vs Low-Leverage Instructions

High leverage (~820 tokens of SKILL.md → 20 assertion deltas):

  • Safe defaults (150 tok → 7 assertions)
  • Probe Gate (250 tok → 3 assertions)
  • Output Contract (200 tok → 3 assertions)
  • --no-playlist rule (80 tok → 2 assertions)
  • Other single-rule items (140 tok → 5 assertions)

Medium leverage (~770 tokens → indirect):

  • Anti-Examples (350 tok): avoid hardcoded format IDs
  • Gate pipeline (300 tok): drive a structured thinking flow
  • Scope classification (120 tok): correct multi-scenario overlay identification

Low leverage (~150 tokens → 0 deltas):

  • Auth Safety (100 tok): no auth scenario in this eval
  • Live Stream (50 tok): no live stream in this eval

References (~3,090–4,090 tokens → indirect):

  • scenario-templates.md drives command completeness and flag selection
  • golden-examples.md drives answer format consistency
  • decision-rules.md drives format selection technical correctness

5.6 Token Efficiency Rating

| Rating | Conclusion |
| --- | --- |
| Overall ROI | Excellent: ~5,000 tokens for a +55 pp pass-rate gain |
| SKILL.md ROI | Outstanding: ~2,370 tokens contain all high-leverage rules |
| High-leverage token share | ~35% (820/2,370) directly contributes to 20/22 assertion deltas |
| Low-leverage token share | ~6% (150/2,370), no incremental contribution in this eval |
| Reference cost-effectiveness | Good: indirectly improves command completeness and technical correctness |

5.7 Comparison with Other Skills’ Cost-Effectiveness

| Metric | yt-dlp-downloader | go-makefile-writer | tdd-workflow |
| --- | --- | --- | --- |
| SKILL.md tokens | ~2,370 | ~1,960 | ~2,100 |
| Total load tokens | ~5,100–5,730 | ~4,100–4,600 | ~3,600–4,800 |
| Pass-rate gain | +55.0 pp | +31.0 pp | +46.2 pp |
| Tokens per 1% (SKILL.md) | ~43 tok | ~63 tok | ~45 tok |
| Tokens per 1% (full) | ~95 tok | ~149 tok | ~92 tok |

yt-dlp-downloader has the best SKILL.md-only token cost-effectiveness of the three skills (~43 tok per 1%) and is close to tdd-workflow on full load (~95 vs ~92 tok). Three factors contribute:

  1. The base model has a weaker grasp of yt-dlp's implicit rules (45% baseline vs 69% for go-makefile-writer), which leaves more room for improvement.
  2. The skill's high-leverage rules, such as safe defaults, the Probe Gate, and the Output Contract, are compact (only ~820 tokens).
  3. Conditional reference loading is well designed; simple scenarios do not load everything.


6. Boundary Analysis vs Base Model Capabilities

6.1 Capabilities Base Model Already Has (No Skill Increment)

| Capability | Evidence |
| --- | --- |
| -f "bestvideo+bestaudio/best" format selection | 3/3 scenarios correct |
| --merge-output-format mp4 | 2/3 scenarios correct (Eval 3 omitted) |
| -x --audio-format mp3 audio extraction | 1/1 scenario correct |
| --convert-subs srt format conversion | 1/1 scenario correct |
| [height<=720] resolution cap | 1/1 scenario correct |
| --sponsorblock-remove basic usage | 1/1 scenario correct |
| --embed-subs subtitle embedding | 1/1 scenario correct |
| Chinese subtitle multi-language code coverage | 1/1 scenario correct |
| ffmpeg dependency prompt | 3/3 scenarios correct |
| Output path basically correct | 3/3 scenarios correct |

6.2 Base Model Gaps (Skill Fills)

| Gap | Evidence | Risk level |
| --- | --- | --- |
| Missing --no-playlist safety guard | 2/2 single-video scenarios missing | High: may accidentally download entire playlist |
| Missing --download-archive | 3/3 scenarios missing | Medium: re-run re-downloads |
| Missing title truncation | 3/3 scenarios use %(title)s | Medium: long title path overflow |
| No Probe decision framework | 3/3 scenarios no probe awareness | Medium: assumes subtitles exist, silent failure |
| No structured Output Contract | 3/3 scenarios no report | Medium: command recommendations lack auditability |
| --write-subs + --embed-subs chain | 1/1 scenario omitted | High: subtitle embed silent failure |
| Playlist template zero-padding | 1/1 scenario missing | Low: sort order wrong but usable |
| --audio-quality 0 | 1/1 scenario missing | Low: default quality slightly lower but acceptable |
| Degraded mode declaration | 1/3 scenarios missing | Low: user may think command was executed |

7. Overall Score

7.1 Dimension Scores

| Dimension | With Skill | Without Skill | Delta |
| --- | --- | --- | --- |
| Command technical correctness | 5.0/5 | 3.5/5 | +1.5 |
| Safety guards (no-playlist/archive/truncation) | 5.0/5 | 1.5/5 | +3.5 |
| Probe decision framework | 5.0/5 | 1.0/5 | +4.0 |
| Structured report (Output Contract) | 5.0/5 | 1.0/5 | +4.0 |
| Multi-scenario overlay handling | 5.0/5 | 3.0/5 | +2.0 |
| Ambiguity resolution | 5.0/5 | 2.5/5 | +2.5 |
| Overall mean | 5.0/5 | 2.08/5 | +2.92 |

7.2 Weighted Total Score

| Dimension | Weight | Score | Weighted |
| --- | --- | --- | --- |
| Assertion pass rate (delta) | 25% | 10/10 | 2.50 |
| Safety guards | 20% | 10/10 | 2.00 |
| Probe decision + ambiguity resolution | 15% | 10/10 | 1.50 |
| Output Contract | 10% | 10/10 | 1.00 |
| Multi-scenario overlay handling | 10% | 10/10 | 1.00 |
| Token cost-effectiveness | 15% | 9.0/10 | 1.35 |
| Command technical correctness increment | 5% | 7.0/10 | 0.35 |
| Weighted total | | | 9.70/10 |

The command technical correctness increment is scored lower because the without-skill core commands are already technically sound: the base model has a good grasp of basic yt-dlp usage. The skill's core value lies in safety guards, Probe discipline, and structured reports.
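
The weighted total in section 7.2 is a plain weighted sum of the per-dimension scores. A short check of the arithmetic, using the weights and scores from the table (the short dimension labels are abbreviations chosen here):

```python
# (dimension, weight in percent, score out of 10), from the table in 7.2.
ROWS = [
    ("Assertion pass rate (delta)", 25, 10.0),
    ("Safety guards", 20, 10.0),
    ("Probe decision + ambiguity resolution", 15, 10.0),
    ("Output Contract", 10, 10.0),
    ("Multi-scenario overlay handling", 10, 10.0),
    ("Token cost-effectiveness", 15, 9.0),
    ("Command technical correctness increment", 5, 7.0),
]

total_weight = sum(weight for _, weight, _ in ROWS)
weighted_total = sum(weight * score for _, weight, score in ROWS) / 100

print(total_weight, weighted_total)
```

The weights sum to 100%, and the only dimensions that pull the total below 10 are token cost-effectiveness and the command correctness increment.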


8. Evaluation Materials

| Material | Path |
| --- | --- |
| Eval 1 with-skill output | /tmp/ytdlp-eval/eval-1/with_skill/response.md |
| Eval 1 without-skill output | /tmp/ytdlp-eval/eval-1/without_skill/response.md |
| Eval 2 with-skill output | /tmp/ytdlp-eval/eval-2/with_skill/response.md |
| Eval 2 without-skill output | /tmp/ytdlp-eval/eval-2/without_skill/response.md |
| Eval 3 with-skill output | /tmp/ytdlp-eval/eval-3/with_skill/response.md |
| Eval 3 without-skill output | /tmp/ytdlp-eval/eval-3/without_skill/response.md |