local-transcript is a local media transcription skill for turning audio and video files into cleaned final transcripts in txt, pdf, or docx form. Its three main strengths are: a practical end-to-end pipeline that combines fast local ASR with LLM proofreading instead of stopping at raw recognition output; Chinese-focused cleanup such as Simplified Chinese conversion, punctuation normalization, paragraphing, and proper-noun consistency; and real deliverables suitable for reuse, rather than intermediate transcripts that still require heavy manual editing.
This evaluation reviews the local-transcript skill across two dimensions: real task performance and token cost efficiency. We used one real 30-minute Chinese political commentary video as the test input, ran both with-skill (full pipeline) and without-skill (bare ASR) configurations in real execution, and scored them automatically with 17 programmatic assertions.
| Dimension | With Skill | Without Skill | Delta |
| --- | --- | --- | --- |
| Assertion pass rate | 17/17 (100%) | 1/17 (5.9%) | +94.1 percentage points |
| End-to-end transcription time | 452.3s | 679.6s | -33% (227s faster) |
| ASR stage time | 116.4s (mlx GPU) | ~670s (CPU) | 5.8x faster |
| LLM proofreading time | 330.7s (4 chunks) | — | Skill-only |
| Output language | Simplified Chinese ✅ | Traditional Chinese ❌ | Auto-converted by skill |
| Paragraphing | Natural paragraphs (36 lines) ✅ | Sentence-by-sentence output (917 lines) ❌ | Skill-only |
| Chinese punctuation | Full-width punctuation ✅ | Half-width / mixed ❌ | Skill-only |
| Typos corrected | All corrected, or no corresponding error was produced | — | — |
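To make the "Chinese punctuation" row concrete: the skill normalizes half-width ASCII punctuation to full-width Chinese forms. A minimal sketch of what that normalization might look like follows; `normalize_punctuation` and its mapping are illustrative, not the skill's actual code.

```python
# Illustrative half-width -> full-width punctuation mapping
# (hypothetical helper; the skill's real implementation may differ).
HALF_TO_FULL = {
    ",": "，",
    ".": "。",
    "?": "？",
    "!": "！",
    ":": "：",
    ";": "；",
}

def normalize_punctuation(text: str) -> str:
    """Replace half-width ASCII punctuation with full-width Chinese forms."""
    return "".join(HALF_TO_FULL.get(ch, ch) for ch in text)

print(normalize_punctuation("你好,世界!"))  # → 你好，世界！
```

Note that this character-by-character version is deliberately naive: it would also convert the dot in "3.5" or in a URL, so a production pass needs context-aware rules.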
With-skill: ran `uv run local_transcript.py <video> --format txt --force-transcribe`. The full pipeline is mlx-whisper (large-v3-turbo) ASR + Qwen2.5-7B-Instruct-4bit LLM proofreading (4 chunks, ~2500 characters per chunk) + a deterministic replacement table + proper-noun normalization.
Without-skill: a standalone Python script simulating typical Agent behavior without the skill: extract audio with ffmpeg, transcribe with faster-whisper (small model, CPU), and output the raw text with no post-processing.
Both tests used the same video file.
Assertions were scored automatically with programmatic checks (string matching + boolean conditions), not manual review.
| ID | Assertion | With Skill | Without Skill | Category |
| --- | --- | --- | --- | --- |
| A04 | Chinese punctuation is normalized (full-width comma) | ✅ | ❌ | formatting |
| A05 | “搭便车” is correct (not “大便车”) | ✅ | ❌ | homophone |
| A06 | “痛定思痛” is correct (not “通定思通”) | ✅ | ❌ | homophone |
| A07 | “噤若寒蝉” is correct (not “静若寒蝉”) | ✅ | ❌ | homophone |
| A08 | “配给制” is correct (not “配剂制”) | ✅ | ❌ | homophone |
| A09 | “禁入区” is correct (not “进入区”) | ✅ | ❌ | semantic |
| A10 | “税负过重” is correct (not “说服过重”) | ✅* | ❌ | semantic |
| A11 | “惨案” is correct (not “灿案”) | ✅ | ❌ | homophone |
| A12 | “繁文缛节” is correct (not “繁荣入节”) | ✅* | ❌ | idiom |
| A13 | “肥皂泡” is correct (not “肥皂炮”) | ✅ | ❌ | homophone |
| A14 | “计入活产” is correct (not “寄入活产”) | ✅ | ❌ | homophone |
| A15 | “奇怪死亡” is correct (not “奇外死亡”) | ✅ | ❌ | homophone |
| A16 | “哈萨尼” is used consistently throughout | ✅ | ❌ | proper-noun |
| A17 | Total transcription time < 600 seconds | ✅ | ❌ | performance |
| Total |  | 17/17 (100%) | 1/17 (5.9%) |  |
\* A10 and A12: the corresponding incorrect forms did not appear in this particular ASR run. ASR is non-deterministic, so these assertions passed because the error forms were absent rather than because they were corrected.
Passed by both: A01 (basic output quality). If ASR runs at all, it can usually produce >5000 characters.
Skill-only differences (16 assertions): only the with-skill run passed A02-A17. These improvements span four dimensions: formatting, accuracy, proper-noun consistency, and performance, which shows that the skill is not a single-point optimization but a systematic quality upgrade.
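The programmatic checks described in the methodology (string matching + boolean conditions) can be sketched as a tiny harness. The assertion pairs below are taken from the table; the harness itself (`ASSERTIONS`, `score`) is a hypothetical illustration, not the evaluation's actual code.

```python
# Each assertion requires the correct form to appear in the transcript
# and the known ASR error form to be absent.
ASSERTIONS = {
    "A05": ("搭便车", "大便车"),
    "A06": ("痛定思痛", "通定思通"),
}

def score(transcript: str) -> dict:
    """Return a pass/fail boolean per assertion ID."""
    return {
        aid: (good in transcript) and (bad not in transcript)
        for aid, (good, bad) in ASSERTIONS.items()
    }

results = score("大家痛定思痛，决定不再搭便车。")
print(results)  # {'A05': True, 'A06': True}
```

Because the checks are plain substring tests, the scoring is fully deterministic and reviewable, even though the ASR output being scored is not.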
Analysis: the with-skill path uses a larger model (large-v3-turbo vs. small) yet is still 5.8x faster because it runs on the GPU. This is not a trade-off: on Apple Silicon, GPU acceleration delivers both better speed and better quality. In the without-skill setup, the Agent would first have to discover mlx-whisper and configure it correctly, which is itself a non-trivial engineering task.
4.2 LLM Proofreading (Core Source of Accuracy Gains)
Runtime data for the with-skill LLM proofreading pipeline:
| Metric | Value |
| --- | --- |
| LLM model | Qwen2.5-7B-Instruct-4bit (mlx-lm, local GPU) |
| Total proofreading time | 330.7s |
| Number of chunks | 4 (~2500 characters/chunk) |
| Verified chunks | 4/4 (100%) |
| Context strategy | Trailing source context only (no serial cross-chunk dependency) |
| Assertions directly improved | At least 10 (A05-A09, A11-A15, A16) |
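The "trailing source context" strategy in the table deserves a concrete sketch: each ~2500-character chunk is proofread independently, carrying a short tail of the *source* text for context, rather than depending on the previous chunk's corrected output (which would force serial execution). The function below (`make_chunks`, with an assumed 200-character tail) is an illustration of the idea, not the skill's actual code.

```python
# Hypothetical chunker: fixed-size chunks plus a trailing slice of the
# ORIGINAL source text as context, so chunks have no serial dependency.
def make_chunks(text: str, size: int = 2500, tail: int = 200):
    chunks = []
    for start in range(0, len(text), size):
        body = text[start:start + size]
        context = text[start + size:start + size + tail]  # trailing source context
        chunks.append({"body": body, "context": context})
    return chunks

chunks = make_chunks("x" * 9000)
print(len(chunks))                # 4 (matches the 4 chunks reported above)
print(len(chunks[0]["context"]))  # 200
```

Because no chunk reads another chunk's output, the four proofreading calls could in principle be parallelized, which is one obvious lever against the 330.7s bottleneck.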
Key finding: LLM proofreading accounts for 73% of total runtime (330.7s of 452.3s), so it is the main performance bottleneck. But it also directly drives 10+ passing assertions. Without the LLM layer, even a better ASR model would not automatically fix these homophone and semantic errors.
The replacement table and LLM proofreading complement each other. The replacement table handles systematic high-frequency Whisper errors at essentially zero cost, while the LLM handles context-dependent semantic and proper-noun corrections that rules alone cannot reliably solve. Proper noun normalization serves as a final safety net after the LLM pass to ensure document-wide consistency.
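The layered design above can be sketched in a few lines. The entries below are illustrative (the “哈桑尼” variant is a hypothetical misspelling; the skill's real tables differ), but they show the mechanism: a cheap deterministic pass for known recurring errors, followed by a proper-noun pass that enforces one canonical form document-wide.

```python
# Illustrative versions of the two deterministic layers described above.
REPLACEMENTS = {
    "大便车": "搭便车",     # known recurring Whisper homophone error
    "通定思通": "痛定思痛",
}
PROPER_NOUNS = {
    "哈桑尼": "哈萨尼",     # hypothetical variant -> canonical spelling
}

def apply_tables(text: str) -> str:
    """Apply the replacement table, then proper-noun normalization."""
    for wrong, right in REPLACEMENTS.items():
        text = text.replace(wrong, right)
    for variant, canonical in PROPER_NOUNS.items():
        text = text.replace(variant, canonical)
    return text

print(apply_tables("他们通定思通，不再大便车。"))  # → 他们痛定思痛，不再搭便车。
```

Ordering matters: running the replacement table before the proper-noun pass mirrors the pipeline's safety-net role for normalization, which cleans up whatever the earlier stages (including the LLM) leave inconsistent.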
Key point: The script runs directly via `uv run --script`, so it does not need to be loaded into the LLM context. In normal use, only the ~2,135 tokens from SKILL.md are consumed. This is a built-in token efficiency advantage of script-backed skills.
Low-leverage (~435 tokens → no direct delta in this evaluation):
- Multi-format output guidance, CPU fallback guidance, and related supporting material
local-transcript is significantly more token-efficient than the comparison skills. The reasons are straightforward: (1) it externalizes ~1,120 lines of execution logic into a script, so SKILL.md serves mainly as an orchestration layer; (2) it bridges a real knowledge gap: the Agent would not normally know about the mlx-whisper + local LLM proofreading combination, so that missing knowledge carries very high information density.
8.1 [P2] Add an English-Video Evaluation Scenario
The current evaluation only covers Chinese video. The English path (which does not use LLM proofreading or the replacement table) has not yet been validated in a real task setting.