local-transcript Skill Evaluation Report¶

Evaluation framework: skill-creator Evaluation date: 2026-03-14 Target: local-transcript

local-transcript is a local media transcription skill for turning audio and video files into cleaned final transcripts in txt, pdf, or docx form. Its three main strengths are: a practical end-to-end pipeline that combines fast local ASR with LLM proofreading instead of stopping at raw recognition output; Chinese-focused cleanup such as Simplified Chinese conversion, punctuation normalization, paragraphing, and proper-noun consistency; and real deliverables suitable for reuse, rather than intermediate transcripts that still require heavy manual editing.

1. Evaluation Summary¶

This evaluation reviews the local-transcript skill across two dimensions: real task performance and token cost efficiency. We used one real 30-minute Chinese political commentary video as the test input, ran both with-skill (full pipeline) and without-skill (bare ASR) configurations in real execution, and scored them automatically with 17 programmatic assertions.

Dimension	With Skill	Without Skill	Delta
Assertion pass rate	17/17 (100%)	1/17 (5.9%)	+94.1 percentage points
End-to-end transcription time	452.3s	679.6s	-33% (227s faster)
ASR stage time	116.4s (mlx GPU)	~670s (CPU)	5.8x faster
LLM proofreading time	330.7s (4 chunks)	—	Skill-only
Output language	Simplified Chinese ✅	Traditional Chinese ❌	Auto-converted by skill
Paragraphing	Natural paragraphs (36 lines) ✅	Sentence-by-sentence output (917 lines) ❌	Skill-only
Chinese punctuation	Full-width punctuation ✅	Half-width / mixed ❌	Skill-only
Typos corrected	All corrected, or no corresponding error was produced	0/16	Skill-only
Proper noun consistency	✅ (37 occurrences of “哈萨尼”, 0 variants)	❌	Skill-only
Skill token overhead (SKILL.md)	~2,135 tokens	0	—
Token cost per 1% pass-rate improvement	~23 tokens (SKILL.md only)	—	—

2. Test Methodology¶

2.1 Scenario Design¶

Scenario	Video	Duration	What it tests	Assertions
Eval 1: zh-video-full-pipeline	《欧洲你个垃圾》：美国为什么必须拒绝一个"垂死大陆"的失败理念？美国保守派如何看待美欧大分裂	~30 min	Chinese political commentary with many foreign proper nouns	17

2.2 Execution Method¶

With-skill: ran uv run local_transcript.py <video> --format txt --force-transcribe. The full pipeline includes mlx-whisper (large-v3-turbo) ASR + Qwen2.5-7B-Instruct-4bit LLM proofreading (4 chunks, ~2500 characters per chunk) + a deterministic replacement table + proper noun normalization.
Without-skill: wrote a standalone Python script to simulate typical Agent behavior without the skill: extract audio with ffmpeg + transcribe with faster-whisper (small model, CPU), then output raw text with no post-processing.
Both tests used the same video file.
Assertions were scored automatically with programmatic checks (string matching + boolean conditions), not manual review.

2.3 Assertion Design (17 total)¶

Category	Count	Coverage
Basic quality	1	Non-empty output with >5000 characters
Format rules	3	Simplified Chinese, paragraphing, full-width punctuation
Homophone corrections	8	搭便车, 痛定思痛, 噤若寒蝉, 配给制, 惨案, 肥皂泡, 计入活产, 奇怪死亡
Semantic corrections	2	禁入区, 税负过重
Idiom correction	1	繁文缛节
Proper noun consistency	1	Consistent use of “哈萨尼”
Performance	1	Total runtime < 600 seconds

3. Assertion Pass Rate¶

3.1 Per-Assertion Results¶

ID	Assertion	With Skill	Without Skill	Category
A01	Output file exists and is non-empty (>5000 chars)	✅	✅	basic
A02	Output is Simplified Chinese (not Traditional)	✅	❌	formatting
A03	Text is paragraphized (blank-line separated)	✅	❌	formatting
A04	Chinese punctuation is normalized (full-width comma)	✅	❌	formatting
A05	“搭便车” is correct (not “大便车”)	✅	❌	homophone
A06	“痛定思痛” is correct (not “通定思通”)	✅	❌	homophone
A07	“噤若寒蝉” is correct (not “静若寒蝉”)	✅	❌	homophone
A08	“配给制” is correct (not “配剂制”)	✅	❌	homophone
A09	“禁入区” is correct (not “进入区”)	✅	❌	semantic
A10	“税负过重” is correct (not “说服过重”)	✅*	❌	semantic
A11	“惨案” is correct (not “灿案”)	✅	❌	homophone
A12	“繁文缛节” is correct (not “繁荣入节”)	✅*	❌	idiom
A13	“肥皂泡” is correct (not “肥皂炮”)	✅	❌	homophone
A14	“计入活产” is correct (not “寄入活产”)	✅	❌	homophone
A15	“奇怪死亡” is correct (not “奇外死亡”)	✅	❌	homophone
A16	“哈萨尼” is used consistently throughout	✅	❌	proper-noun
A17	Total transcription time < 600 seconds	✅	❌	performance
	Total	17/17 (100%)	1/17 (5.9%)

* A10 and A12: the corresponding incorrect forms did not appear in this particular ASR run. ASR is non-deterministic, so the output passed because the error forms were absent.

3.2 Breakdown of the 16 Without-Skill Failures¶

Failure type	Count	Notes
Traditional Chinese not converted to Simplified	1	`faster-whisper` small tends to output Traditional Chinese by default
No paragraphing	1	Raw ASR output is 917 short sentence lines
Messy punctuation	1	Half-width commas and periods are mixed in
ASR homophone errors not corrected	8	All homophone mistakes remain untouched
ASR semantic errors not corrected	2	禁入区→进入区, 说服过重→税负过重
Idiom error not corrected	1	繁荣入节 (Traditional: 繁榮入節) → 繁文缛节
Proper noun inconsistency	1	The same name appears in multiple variants
Timeout	1	`faster-whisper` on CPU took 679.6s > 600s

3.3 Trend Analysis¶

Passed by both: A01 (basic output quality). If ASR runs at all, it can usually produce >5000 characters.

Skill-only differences (16 assertions): Only the with-skill run passed A02-A17. These improvements span four dimensions: format, accuracy, proper noun consistency, and performance. This shows that the skill is not a single-point optimization, but a systematic quality upgrade. For A10 and A12, the corresponding error forms were not produced in this run, so the output still counts as passing.

4. Dimension-by-Dimension Comparison¶

4.1 ASR Engine Choice (Speed + Quality)¶

Metric	With Skill (mlx-whisper)	Without Skill (faster-whisper)
Model	large-v3-turbo (fp16)	small (int8)
Hardware	Apple Silicon GPU/ANE	CPU multi-threading
ASR time	116.4s	~670s
Relative speed	—	5.8x slower
Output language	Simplified Chinese	Traditional Chinese
Output character count	10,111	10,214

Analysis: The with-skill path uses a larger model (large-v3-turbo vs small), yet is still 5.8x faster because it runs on GPU. This is not a trade-off. On Apple Silicon, GPU acceleration delivers both better speed and better quality at the same time. In the without-skill setup, the Agent would first need to discover mlx-whisper and know how to configure it correctly, which is already a non-trivial engineering task.

4.2 LLM Proofreading (Core Source of Accuracy Gains)¶

Runtime data for the with-skill LLM proofreading pipeline:

Metric	Value
LLM model	Qwen2.5-7B-Instruct-4bit (mlx-lm, local GPU)
Total proofreading time	330.7s
Number of chunks	4 (~2500 characters/chunk)
Verified chunks	4/4 (100%)
Context strategy	Trailing source context only (no serial cross-chunk dependency)
Assertions directly improved	At least 10 (A05-A09, A11-A15, A16)

Key finding: LLM proofreading accounts for 73% of total runtime (330.7 / 452.3s), so it is the main performance bottleneck. But it also directly drives 10+ passing assertions. Without the LLM layer, even a better ASR model would still not automatically fix these homophone and semantic errors.

4.3 Deterministic Replacement Table + Proper Noun Normalization (Zero-Cost Correction Layer)¶

Metric	Value
Built-in replacement entries	17
External replacement file	`zh_replacements.json`, customizable via `--replacements-file`
Token cost	~0 (embedded in script + JSON file, not loaded into context)
Runtime cost	<1ms
Proper noun normalization	“哈萨尼” appears 37 times; 2 variants (哈萨迪×1, 哈塔尼×1) are automatically normalized
Direct contribution	A05, A07, A08, A09, A11, A13, A14, A15, A16

The replacement table and LLM proofreading complement each other. The replacement table handles systematic high-frequency Whisper errors at essentially zero cost, while the LLM handles context-dependent semantic and proper-noun corrections that rules alone cannot reliably solve. Proper noun normalization serves as a final safety net after the LLM pass to ensure document-wide consistency.

4.4 Output Format and Post-Processing¶

Feature	With Skill	Without Skill
Traditional → Simplified conversion	✅ OpenCC t2s	❌ Raw Traditional Chinese output
Paragraphing	✅ 36 natural paragraphs	❌ 917 short lines
Chinese punctuation normalization	✅ Full-width commas / periods	❌ Mixed half-width punctuation
Multi-format output	✅ txt / pdf / docx	❌ Raw text only
Three-layer cache	✅ audio / raw / clean	❌ No cache

5. Token Cost Efficiency Analysis¶

5.1 Skill Size¶

local-transcript is a SKILL.md + script style skill. The script is the main execution engine, but does not consume context during normal use.

File	Lines	Bytes	Estimated Tokens
SKILL.md	175	8,553	~2,135
scripts/local_transcript.py	~1,120	~42,000	~10,200 (not loaded into context)
scripts/zh_replacements.json	~25	~800	~200 (not loaded into context)
Description (always in context)	—	—	~120

5.2 Typical Load Scenarios¶

Scenario	What gets loaded	Token cost
Typical use	SKILL.md → execute script	~2,135
Debugging / modifying script	SKILL.md + local_transcript.py	~12,335
Description-trigger only	frontmatter only	~120

Key point: The script runs directly via uv run --script, so it does not need to be loaded into the LLM context. In normal use, only the ~2,135 tokens from SKILL.md are consumed. This is a built-in token efficiency advantage of script-backed skills.

5.3 Quality Improvement per Token¶

Metric	Value
With-skill pass rate	100% (17/17)
Without-skill pass rate	5.9% (1/17)
Pass-rate improvement	+94.1 percentage points
Token cost per fixed assertion	~134 tokens (SKILL.md only)
Token cost per 1% pass-rate improvement	~23 tokens (SKILL.md only)

5.4 Segment-Level Efficiency Inside SKILL.md¶

Module	Estimated Tokens	Linked Assertion Delta	Efficiency
Default Behavior (ASR backend / model config)	~400	A02, A17 (Simplified Chinese + speed)	Very high — 200 tok / 2 assertions
LLM Proofreading guidance	~300	A05-A16 (12 corrections)	Very high — 25 tok / assertion
Workflow (9-step pipeline)	~300	Indirect (enforces execution order)	High
Execution examples	~350	Indirect (reduces trial-and-error)	High
Cleaning Rules (paragraphing / punctuation)	~200	A03, A04	High — 100 tok / assertion
Format Resolution Gate	~100	Indirect	Medium
Dependency Gate	~150	Indirect (fail fast)	Medium
Output Contract	~200	Indirect (auditability)	Medium

5.5 High-Leverage vs Low-Leverage Tokens¶

High-leverage (~900 tokens → directly drives all 16 assertion deltas): - Default Behavior: ASR backend choice + model configuration (~400 tok → A02, A17) - LLM Proofreading architecture (~300 tok → 12 typo / proper noun assertions) - Cleaning Rules (~200 tok → A03, A04)

Medium-leverage (~800 tokens → indirect contribution): - Workflow, Execution, Format Gate, Dependency Gate, Output Contract

Low-leverage (~435 tokens → no direct delta in this evaluation): - Multi-format output guidance, CPU fallback guidance, and related supporting material

5.6 Token Efficiency Rating¶

Rating	Conclusion
Overall ROI	Excellent — ~2,135 tokens buy +94.1% pass-rate improvement
High-leverage token share	~42% (900 / 2,135) directly drives all 16 deltas
Script efficiency	Extremely high — ~1,120 lines of Python execute at 0 context-token cost

5.7 Efficiency Compared with Other Skills¶

Metric	local-transcript	go-makefile-writer	git-commit
SKILL.md tokens	~2,135	~1,960	~1,120
Typical total loaded tokens	~2,135	~4,100-4,600	~1,120
Pass-rate improvement	+94.1%	+31.0%	+22.7%
Tokens per 1% improvement (SKILL.md)	~23 tok	~63 tok	~51 tok
Tokens per 1% improvement (full)	~23 tok	~149 tok	~51 tok

local-transcript is significantly more token-efficient than the comparison skills. The reasons are straightforward: (1) it externalizes ~1,120 lines of execution logic into a script, so SKILL.md mainly serves as an orchestration layer; (2) it bridges a real knowledge gap: the Agent would not normally know about mlx-whisper + local LLM proofreading, and that missing knowledge carries very high information density.

6. Capability Boundary vs the Base Model¶

6.1 What the Base Model Can Already Do (No Skill Increment)¶

Capability	Evidence from the without-skill run
Call `ffmpeg` to extract audio	Baseline script extracted audio successfully
Use `faster-whisper` for transcription	Baseline transcription succeeded (using the small model)
Write a text file	A01 passed in both runs

6.2 Capability Gaps Filled by the Skill¶

Gap	Evidence in this evaluation	Impact
Does not know about `mlx-whisper`	Baseline used CPU `faster-whisper`, 5.8x slower	A17 performance
Does not know to use `large-v3-turbo`	Baseline used the small model and produced Traditional Chinese	A02 language
No Traditional → Simplified conversion	Baseline output is entirely Traditional Chinese	A02
No paragraphing	Baseline output has 917 short lines	A03
No punctuation normalization	Baseline mixes half-width punctuation	A04
No LLM proofreading	All typo-like errors remain untouched in baseline output	A05-A15 (10 assertions)
No deterministic replacement table	Baseline has no automatic error-correction layer	Same as above
No proper noun normalization	The same name appears in multiple variants	A16
No cache	Baseline always reruns from scratch	Repeated-run efficiency

7. Overall Score¶

7.1 Scores by Dimension¶

Dimension	With Skill	Without Skill	Delta
ASR speed	5.0/5	1.5/5	+3.5
Transcription accuracy	4.5/5	2.0/5	+2.5
Typo correction rate	4.5/5	1.0/5	+3.5
Output format quality	5.0/5	1.0/5	+4.0
Engineering completeness (cache / multi-format support)	5.0/5	1.0/5	+4.0
Overall average	4.80/5	1.30/5	+3.50

7.2 Weighted Total Score¶

Dimension	Weight	Score	Weighted
Assertion pass rate (delta)	25%	10/10	2.50
Typo correction quality	20%	9.0/10	1.80
ASR speed (`mlx-whisper`)	15%	10/10	1.50
Output format and post-processing	15%	9.5/10	1.43
Token efficiency	15%	9.5/10	1.43
Engineering quality (cache / configurability)	10%	9.0/10	0.90
Weighted total			9.56/10

8. Improvement Suggestions¶

8.1 [P2] Add an English-Video Evaluation Scenario¶

The current evaluation only covers Chinese video. The English path (which does not use LLM proofreading or the replacement table) has not yet been validated in a real task setting.

8.2 [P3] Further LLM Speed-Up Opportunities¶

The LLM proofreading stage still takes 73% of total runtime (330.7 / 452.3s). Possible next steps:

Wait for mlx-lm to support a batch inference API, so chunk inference can run truly in parallel.
Skip LLM proofreading for non-Chinese content to save time.
Use an API backend (for example Qwen-Turbo) instead of local inference, trading latency for concurrency.

9. Evaluation Artifacts¶

Artifact	Path
Eval definition	`/tmp/local-transcript-eval/iteration-1/eval-1-zh-video-full-pipeline/eval_metadata.json`
With-skill output	`/tmp/local-transcript-eval/iteration-3/with_skill/outputs/transcript.txt`
With-skill grading	`/tmp/local-transcript-eval/iteration-3/with_skill/grading.json`
Without-skill output	`/tmp/local-transcript-eval/iteration-1/eval-1-zh-video-full-pipeline/without_skill/outputs/transcript.txt`
Without-skill grading	`/tmp/local-transcript-eval/iteration-1/eval-1-zh-video-full-pipeline/without_skill/grading.json`
Without-skill timing	`/tmp/local-transcript-eval/iteration-1/eval-1-zh-video-full-pipeline/without_skill/timing.json`
Test video	`/Users/john/Downloads/《欧洲你个垃圾》...美欧大分裂 [dHiLbgTK_ME].mp4`
Skill path	`/Users/john/.codex/skills/local-transcript/`
Script path	`/Users/john/.codex/skills/local-transcript/scripts/local_transcript.py`

Runtime Timeline¶

Event	With Skill	Without Skill
Start	00:06:35	23:07:58
ASR complete	after 116.4s	—
LLM proofreading	4 chunks / 330.7s	—
Finish	00:14:08 (452.3s)	23:19:18 (679.6s)