local-transcript Skill Evaluation Report

Evaluation framework: skill-creator
Evaluation date: 2026-03-14
Target: local-transcript


local-transcript is a local media transcription skill for turning audio and video files into cleaned final transcripts in txt, pdf, or docx form. Its three main strengths are: a practical end-to-end pipeline that combines fast local ASR with LLM proofreading instead of stopping at raw recognition output; Chinese-focused cleanup such as Simplified Chinese conversion, punctuation normalization, paragraphing, and proper-noun consistency; and real deliverables suitable for reuse, rather than intermediate transcripts that still require heavy manual editing.

1. Evaluation Summary

This evaluation reviews the local-transcript skill across two dimensions: real task performance and token cost efficiency. We used one real 30-minute Chinese political commentary video as the test input, ran both with-skill (full pipeline) and without-skill (bare ASR) configurations in real execution, and scored them automatically with 17 programmatic assertions.

| Dimension | With Skill | Without Skill | Delta |
| --- | --- | --- | --- |
| Assertion pass rate | 17/17 (100%) | 1/17 (5.9%) | +94.1 percentage points |
| End-to-end transcription time | 452.3s | 679.6s | -33% (227s faster) |
| ASR stage time | 116.4s (mlx GPU) | ~670s (CPU) | 5.8x faster |
| LLM proofreading time | 330.7s (4 chunks) | n/a | Skill-only |
| Output language | Simplified Chinese ✅ | Traditional Chinese ❌ | Auto-converted by skill |
| Paragraphing | Natural paragraphs (36 lines) ✅ | Sentence-by-sentence output (917 lines) ❌ | Skill-only |
| Chinese punctuation | Full-width punctuation ✅ | Half-width / mixed ❌ | Skill-only |
| Typos corrected | All corrected, or no corresponding error was produced | 0/16 | Skill-only |
| Proper noun consistency | ✅ (37 occurrences of “哈萨尼”, 0 variants) | | Skill-only |
| Skill token overhead (SKILL.md) | ~2,135 tokens | 0 | |
| Token cost per 1% pass-rate improvement | ~23 tokens (SKILL.md only) | | |

2. Test Methodology

2.1 Scenario Design

| Scenario | Video | Duration | What it tests | Assertions |
| --- | --- | --- | --- | --- |
| Eval 1: zh-video-full-pipeline | 《欧洲你个垃圾》:美国为什么必须拒绝一个"垂死大陆"的失败理念?美国保守派如何看待美欧大分裂 | ~30 min | Chinese political commentary with many foreign proper nouns | 17 |

2.2 Execution Method

  • With-skill: ran uv run local_transcript.py <video> --format txt --force-transcribe. The full pipeline includes mlx-whisper (large-v3-turbo) ASR + Qwen2.5-7B-Instruct-4bit LLM proofreading (4 chunks, ~2500 characters per chunk) + a deterministic replacement table + proper noun normalization.
  • Without-skill: wrote a standalone Python script to simulate typical Agent behavior without the skill: extract audio with ffmpeg + transcribe with faster-whisper (small model, CPU), then output raw text with no post-processing.
  • Both tests used the same video file.
  • Assertions were scored automatically with programmatic checks (string matching + boolean conditions), not manual review.
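
As a rough illustration of what "string matching + boolean conditions" can look like, the sketch below grades a transcript against wrong-form/correct-form pairs. It is not the evaluation's actual grader: the `Assertion` class, the `grade` function, and the sample sentences are hypothetical, and only the two word pairs are taken from the assertion list in section 3.1. The check passes whenever the known wrong form is absent, which matches the note under that table about error forms that never appeared.

```python
from dataclasses import dataclass


@dataclass
class Assertion:
    """One programmatic check (hypothetical structure, not the real grader's schema)."""
    assertion_id: str
    correct_form: str  # kept for reporting only
    wrong_form: str    # the check passes when this string is absent


def grade(text: str, assertions: list[Assertion]) -> dict[str, bool]:
    """Return pass/fail per assertion using plain substring matching."""
    return {a.assertion_id: a.wrong_form not in text for a in assertions}


# Word pairs from the assertion list; the sample sentences are made up.
CHECKS = [
    Assertion("A05", "搭便车", "大便车"),
    Assertion("A07", "噤若寒蝉", "静若寒蝉"),
]

clean_text = "欧洲长期在防务上搭便车，批评者噤若寒蝉。"
raw_text = "欧洲长期在防务上大便车,批评者静若寒蝉."

print(grade(clean_text, CHECKS))
print(grade(raw_text, CHECKS))
```

A stricter variant could also require `correct_form in text`, at the cost of failing runs where the phrase never occurs at all.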

2.3 Assertion Design (17 total)

| Category | Count | Coverage |
| --- | --- | --- |
| Basic quality | 1 | Non-empty output with >5000 characters |
| Format rules | 3 | Simplified Chinese, paragraphing, full-width punctuation |
| Homophone corrections | 8 | 搭便车, 痛定思痛, 噤若寒蝉, 配给制, 惨案, 肥皂泡, 计入活产, 奇怪死亡 |
| Semantic corrections | 2 | 禁入区, 税负过重 |
| Idiom correction | 1 | 繁文缛节 |
| Proper noun consistency | 1 | Consistent use of “哈萨尼” |
| Performance | 1 | Total runtime < 600 seconds |

3. Assertion Pass Rate

3.1 Per-Assertion Results

| ID | Assertion | With Skill | Without Skill | Category |
| --- | --- | --- | --- | --- |
| A01 | Output file exists and is non-empty (>5000 chars) | ✅ | ✅ | basic |
| A02 | Output is Simplified Chinese (not Traditional) | ✅ | ❌ | formatting |
| A03 | Text is paragraphized (blank-line separated) | ✅ | ❌ | formatting |
| A04 | Chinese punctuation is normalized (full-width comma) | ✅ | ❌ | formatting |
| A05 | “搭便车” is correct (not “大便车”) | ✅ | ❌ | homophone |
| A06 | “痛定思痛” is correct (not “通定思通”) | ✅ | ❌ | homophone |
| A07 | “噤若寒蝉” is correct (not “静若寒蝉”) | ✅ | ❌ | homophone |
| A08 | “配给制” is correct (not “配剂制”) | ✅ | ❌ | homophone |
| A09 | “禁入区” is correct (not “进入区”) | ✅ | ❌ | semantic |
| A10 | “税负过重” is correct (not “说服过重”) | ✅* | ❌ | semantic |
| A11 | “惨案” is correct (not “灿案”) | ✅ | ❌ | homophone |
| A12 | “繁文缛节” is correct (not “繁荣入节”) | ✅* | ❌ | idiom |
| A13 | “肥皂泡” is correct (not “肥皂炮”) | ✅ | ❌ | homophone |
| A14 | “计入活产” is correct (not “寄入活产”) | ✅ | ❌ | homophone |
| A15 | “奇怪死亡” is correct (not “奇外死亡”) | ✅ | ❌ | homophone |
| A16 | “哈萨尼” is used consistently throughout | ✅ | ❌ | proper-noun |
| A17 | Total transcription time < 600 seconds | ✅ | ❌ | performance |
| Total | | 17/17 (100%) | 1/17 (5.9%) | |

* A10 and A12: the corresponding incorrect forms did not appear in this particular ASR run. ASR is non-deterministic, so the output passed because the error forms were absent.

3.2 Breakdown of the 16 Without-Skill Failures

| Failure type | Count | Notes |
| --- | --- | --- |
| Traditional Chinese not converted to Simplified | 1 | faster-whisper small tends to output Traditional Chinese by default |
| No paragraphing | 1 | Raw ASR output is 917 short sentence lines |
| Messy punctuation | 1 | Half-width commas and periods are mixed in |
| ASR homophone errors not corrected | 8 | All homophone mistakes remain untouched |
| ASR semantic errors not corrected | 2 | 进入区 → 禁入区, 说服过重 → 税负过重 (error form → expected form) |
| Idiom error not corrected | 1 | 繁荣入节 (Traditional: 繁榮入節) → 繁文缛节 |
| Proper noun inconsistency | 1 | The same name appears in multiple variants |
| Timeout | 1 | faster-whisper on CPU took 679.6s > 600s |

3.3 Trend Analysis

Passed by both: A01 (basic output quality). If ASR runs at all, it can usually produce >5000 characters.

Skill-only differences (16 assertions): Only the with-skill run passed A02-A17. These improvements span four dimensions: format, accuracy, proper noun consistency, and performance. This shows that the skill is not a single-point optimization, but a systematic quality upgrade. For A10 and A12, the corresponding error forms were not produced in this run, so the output still counts as passing.


4. Dimension-by-Dimension Comparison

4.1 ASR Engine Choice (Speed + Quality)

| Metric | With Skill (mlx-whisper) | Without Skill (faster-whisper) |
| --- | --- | --- |
| Model | large-v3-turbo (fp16) | small (int8) |
| Hardware | Apple Silicon GPU/ANE | CPU multi-threading |
| ASR time | 116.4s | ~670s |
| Relative speed | Baseline | 5.8x slower |
| Output language | Simplified Chinese | Traditional Chinese |
| Output character count | 10,111 | 10,214 |

Analysis: The with-skill path uses a larger model (large-v3-turbo vs small), yet is still 5.8x faster because it runs on GPU. This is not a trade-off. On Apple Silicon, GPU acceleration delivers both better speed and better quality at the same time. In the without-skill setup, the Agent would first need to discover mlx-whisper and know how to configure it correctly, which is already a non-trivial engineering task.

4.2 LLM Proofreading (Core Source of Accuracy Gains)

Runtime data for the with-skill LLM proofreading pipeline:

| Metric | Value |
| --- | --- |
| LLM model | Qwen2.5-7B-Instruct-4bit (mlx-lm, local GPU) |
| Total proofreading time | 330.7s |
| Number of chunks | 4 (~2500 characters/chunk) |
| Verified chunks | 4/4 (100%) |
| Context strategy | Trailing source context only (no serial cross-chunk dependency) |
| Assertions directly improved | At least 10 (A05-A09, A11-A15, A16) |

Key finding: LLM proofreading accounts for 73% of total runtime (330.7 / 452.3s), so it is the main performance bottleneck. But it also directly drives 10+ passing assertions. Without the LLM layer, even a better ASR model would still not automatically fix these homophone and semantic errors.
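
One possible reading of the "trailing source context only" strategy is sketched below: each chunk carries the tail of the raw (uncorrected) text that precedes it, so no chunk has to wait for another chunk's corrected output. This is an illustration under assumptions, not the skill's code; `split_into_chunks`, `build_jobs`, and the `context_chars` value are hypothetical, and only the ~2500-character target comes from the table above.

```python
def split_into_chunks(text: str, target_chars: int = 2500) -> list[str]:
    """Greedily pack sentences into chunks of roughly `target_chars` characters."""
    sentences: list[str] = []
    current = ""
    for ch in text:
        current += ch
        if ch in "。！？!?\n":  # sentence-ending characters
            sentences.append(current)
            current = ""
    if current:
        sentences.append(current)

    chunks: list[str] = []
    buf = ""
    for sentence in sentences:
        # Close the current chunk before it would exceed the target size.
        if buf and len(buf) + len(sentence) > target_chars:
            chunks.append(buf)
            buf = ""
        buf += sentence
    if buf:
        chunks.append(buf)
    return chunks


def build_jobs(raw_text: str, target_chars: int = 2500, context_chars: int = 200) -> list[dict]:
    """Pair each chunk with the tail of the raw previous chunk as read-only context."""
    chunks = split_into_chunks(raw_text, target_chars)
    jobs = []
    for i, chunk in enumerate(chunks):
        # Context comes from the raw source, never from another chunk's corrected output.
        context = chunks[i - 1][-context_chars:] if i > 0 else ""
        jobs.append({"index": i, "context": context, "chunk": chunk})
    return jobs


raw = "这是一句话。" * 1000  # synthetic 6000-character input
jobs = build_jobs(raw)
print(len(jobs), [len(job["chunk"]) for job in jobs])
```

Because every job is built from the raw text alone, the jobs can be processed in any order and reassembled by `index`.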

4.3 Deterministic Replacement Table + Proper Noun Normalization (Zero-Cost Correction Layer)

| Metric | Value |
| --- | --- |
| Built-in replacement entries | 17 |
| External replacement file | zh_replacements.json, customizable via --replacements-file |
| Token cost | ~0 (embedded in script + JSON file, not loaded into context) |
| Runtime cost | <1ms |
| Proper noun normalization | “哈萨尼” appears 37 times; 2 variants (哈萨迪×1, 哈塔尼×1) are automatically normalized |
| Direct contribution | A05, A07, A08, A09, A11, A13, A14, A15, A16 |

The replacement table and LLM proofreading complement each other. The replacement table handles systematic high-frequency Whisper errors at essentially zero cost, while the LLM handles context-dependent semantic and proper-noun corrections that rules alone cannot reliably solve. Proper noun normalization serves as a final safety net after the LLM pass to ensure document-wide consistency.
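
A minimal sketch of the two deterministic layers follows, with the LLM pass that sits between them left out. The JSON schema shown is an assumption (the real layout of zh_replacements.json is not given in this report), and `apply_replacements` and `normalize_names` are hypothetical names. The word pairs and name variants are the ones listed in this report; the sample sentence is made up.

```python
import json

# Assumed wrong -> right mapping; the actual zh_replacements.json schema may differ.
REPLACEMENTS_JSON = '{"大便车": "搭便车", "静若寒蝉": "噤若寒蝉", "配剂制": "配给制"}'

# Canonical name mapped to the variants observed in this evaluation.
NAME_VARIANTS = {"哈萨尼": ["哈萨迪", "哈塔尼"]}


def apply_replacements(text: str, table: dict[str, str]) -> str:
    """Apply literal substitutions, longest keys first so short keys cannot pre-empt them."""
    for wrong in sorted(table, key=len, reverse=True):
        text = text.replace(wrong, table[wrong])
    return text


def normalize_names(text: str, variants: dict[str, list[str]]) -> str:
    """Rewrite every known variant spelling to its canonical form."""
    for canonical, alternatives in variants.items():
        for alt in alternatives:
            text = text.replace(alt, canonical)
    return text


table = json.loads(REPLACEMENTS_JSON)
raw = "哈萨迪批评欧洲大便车，哈塔尼的支持者静若寒蝉。"
fixed = normalize_names(apply_replacements(raw, table), NAME_VARIANTS)
print(fixed)
```

Both steps are plain string operations, which is consistent with the near-zero runtime and token cost reported above.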

4.4 Output Format and Post-Processing

| Feature | With Skill | Without Skill |
| --- | --- | --- |
| Traditional → Simplified conversion | ✅ OpenCC t2s | ❌ Raw Traditional Chinese output |
| Paragraphing | ✅ 36 natural paragraphs | ❌ 917 short lines |
| Chinese punctuation normalization | ✅ Full-width commas / periods | ❌ Mixed half-width punctuation |
| Multi-format output | ✅ txt / pdf / docx | ❌ Raw text only |
| Three-layer cache | ✅ audio / raw / clean | ❌ No cache |
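
To make the punctuation and paragraphing rows concrete, here is a small standard-library sketch. It is an assumption about how such cleanup could work, not the skill's implementation: `normalize_punctuation`, `merge_lines`, and the `min_chars` threshold are hypothetical. The Traditional → Simplified step is omitted because it relies on the third-party OpenCC package.

```python
import re

HALF_TO_FULL = {",": "，", ".": "。", "?": "？", "!": "！", ":": "：", ";": "；"}

# Half-width punctuation that directly follows a CJK ideograph.
_PUNCT_AFTER_CJK = re.compile(r"(?<=[\u4e00-\u9fff])[,.?!:;]")


def normalize_punctuation(text: str) -> str:
    """Convert half-width punctuation to full-width, leaving Latin text and decimals alone."""
    return _PUNCT_AFTER_CJK.sub(lambda m: HALF_TO_FULL[m.group(0)], text)


def merge_lines(lines: list[str], min_chars: int = 250) -> list[str]:
    """Merge sentence-level lines into paragraphs of at least `min_chars` characters."""
    paragraphs: list[str] = []
    buf = ""
    for line in lines:
        buf += line.strip()
        if len(buf) >= min_chars:
            paragraphs.append(buf)
            buf = ""
    if buf:
        paragraphs.append(buf)
    return paragraphs


sample = "欧洲在搭便车,这很明显.Version 1.5 is ok"
print(normalize_punctuation(sample))
print(len(merge_lines(["一二三四五"] * 10, min_chars=20)))
```

The lookbehind keeps decimals such as "1.5" and English sentences untouched, which matters for transcripts that mix Chinese with foreign names and numbers.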

5. Token Cost Efficiency Analysis

5.1 Skill Size

local-transcript is a SKILL.md + script style skill. The script is the main execution engine, but does not consume context during normal use.

| File | Lines | Bytes | Estimated Tokens |
| --- | --- | --- | --- |
| SKILL.md | 175 | 8,553 | ~2,135 |
| scripts/local_transcript.py | ~1,120 | ~42,000 | ~10,200 (not loaded into context) |
| scripts/zh_replacements.json | ~25 | ~800 | ~200 (not loaded into context) |
| Description (always in context) | | | ~120 |

5.2 Typical Load Scenarios

| Scenario | What gets loaded | Token cost |
| --- | --- | --- |
| Typical use | SKILL.md → execute script | ~2,135 |
| Debugging / modifying script | SKILL.md + local_transcript.py | ~12,335 |
| Description-trigger only | frontmatter only | ~120 |

Key point: The script runs directly via uv run --script, so it does not need to be loaded into the LLM context. In normal use, only the ~2,135 tokens from SKILL.md are consumed. This is a built-in token efficiency advantage of script-backed skills.

5.3 Quality Improvement per Token

| Metric | Value |
| --- | --- |
| With-skill pass rate | 100% (17/17) |
| Without-skill pass rate | 5.9% (1/17) |
| Pass-rate improvement | +94.1 percentage points |
| Token cost per fixed assertion | ~133 tokens (SKILL.md only; 2,135 / 16) |
| Token cost per 1% pass-rate improvement | ~23 tokens (SKILL.md only; 2,135 / 94.1) |
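
The per-token figures above follow directly from the pass counts and the SKILL.md size. The short calculation below reproduces them; the variable names are illustrative, and the inputs are the values reported in this section.

```python
SKILL_MD_TOKENS = 2135
WITH_SKILL_PASSED, WITHOUT_SKILL_PASSED, TOTAL = 17, 1, 17

with_rate = 100 * WITH_SKILL_PASSED / TOTAL
without_rate = 100 * WITHOUT_SKILL_PASSED / TOTAL
delta_pp = with_rate - without_rate  # improvement in percentage points
fixed_assertions = WITH_SKILL_PASSED - WITHOUT_SKILL_PASSED

tokens_per_pp = SKILL_MD_TOKENS / delta_pp
tokens_per_fixed = SKILL_MD_TOKENS / fixed_assertions

print(f"{delta_pp:.1f} pp, {tokens_per_pp:.1f} tok/pp, {tokens_per_fixed:.1f} tok/assertion")
```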

5.4 Segment-Level Efficiency Inside SKILL.md

| Module | Estimated Tokens | Linked Assertion Delta | Efficiency |
| --- | --- | --- | --- |
| Default Behavior (ASR backend / model config) | ~400 | A02, A17 (Simplified Chinese + speed) | Very high: 200 tok / assertion |
| LLM Proofreading guidance | ~300 | A05-A16 (12 corrections) | Very high: 25 tok / assertion |
| Workflow (9-step pipeline) | ~300 | Indirect (enforces execution order) | High |
| Execution examples | ~350 | Indirect (reduces trial-and-error) | High |
| Cleaning Rules (paragraphing / punctuation) | ~200 | A03, A04 | High: 100 tok / assertion |
| Format Resolution Gate | ~100 | Indirect | Medium |
| Dependency Gate | ~150 | Indirect (fail fast) | Medium |
| Output Contract | ~200 | Indirect (auditability) | Medium |

5.5 High-Leverage vs Low-Leverage Tokens

High-leverage (~900 tokens → directly drives all 16 assertion deltas):

  • Default Behavior: ASR backend choice + model configuration (~400 tok → A02, A17)
  • LLM Proofreading architecture (~300 tok → 12 typo / proper noun assertions)
  • Cleaning Rules (~200 tok → A03, A04)

Medium-leverage (~800 tokens → indirect contribution):

  • Workflow, Execution, Format Gate, Dependency Gate, Output Contract

Low-leverage (~435 tokens → no direct delta in this evaluation):

  • Multi-format output guidance, CPU fallback guidance, and related supporting material

5.6 Token Efficiency Rating

| Rating | Conclusion |
| --- | --- |
| Overall ROI | Excellent: ~2,135 tokens buy a +94.1 percentage point pass-rate improvement |
| High-leverage token share | ~42% (900 / 2,135) directly drives all 16 deltas |
| Script efficiency | Extremely high: ~1,120 lines of Python execute at 0 context-token cost |

5.7 Efficiency Compared with Other Skills

| Metric | local-transcript | go-makefile-writer | git-commit |
| --- | --- | --- | --- |
| SKILL.md tokens | ~2,135 | ~1,960 | ~1,120 |
| Typical total loaded tokens | ~2,135 | ~4,100-4,600 | ~1,120 |
| Pass-rate improvement | +94.1% | +31.0% | +22.7% |
| Tokens per 1% improvement (SKILL.md) | ~23 tok | ~63 tok | ~51 tok |
| Tokens per 1% improvement (full) | ~23 tok | ~149 tok | ~51 tok |

local-transcript is significantly more token-efficient than the comparison skills. The reasons are straightforward: (1) it externalizes ~1,120 lines of execution logic into a script, so SKILL.md mainly serves as an orchestration layer; (2) it bridges a real knowledge gap: the Agent would not normally know about mlx-whisper + local LLM proofreading, and that missing knowledge carries very high information density.


6. Capability Boundary vs the Base Model

6.1 What the Base Model Can Already Do (No Skill Increment)

| Capability | Evidence from the without-skill run |
| --- | --- |
| Call ffmpeg to extract audio | Baseline script extracted audio successfully |
| Use faster-whisper for transcription | Baseline transcription succeeded (using the small model) |
| Write a text file | A01 passed in both runs |

6.2 Capability Gaps Filled by the Skill

| Gap | Evidence in this evaluation | Impact |
| --- | --- | --- |
| Does not know about mlx-whisper | Baseline used CPU faster-whisper, 5.8x slower | A17 performance |
| Does not know to use large-v3-turbo | Baseline used the small model and produced Traditional Chinese | A02 language |
| No Traditional → Simplified conversion | Baseline output is entirely Traditional Chinese | A02 |
| No paragraphing | Baseline output has 917 short lines | A03 |
| No punctuation normalization | Baseline mixes half-width punctuation | A04 |
| No LLM proofreading | All typo-like errors remain untouched in baseline output | A05-A15 (11 assertions) |
| No deterministic replacement table | Baseline has no automatic error-correction layer | Same as above |
| No proper noun normalization | The same name appears in multiple variants | A16 |
| No cache | Baseline always reruns from scratch | Repeated-run efficiency |

7. Overall Score

7.1 Scores by Dimension

| Dimension | With Skill | Without Skill | Delta |
| --- | --- | --- | --- |
| ASR speed | 5.0/5 | 1.5/5 | +3.5 |
| Transcription accuracy | 4.5/5 | 2.0/5 | +2.5 |
| Typo correction rate | 4.5/5 | 1.0/5 | +3.5 |
| Output format quality | 5.0/5 | 1.0/5 | +4.0 |
| Engineering completeness (cache / multi-format support) | 5.0/5 | 1.0/5 | +4.0 |
| Overall average | 4.80/5 | 1.30/5 | +3.50 |

7.2 Weighted Total Score

| Dimension | Weight | Score | Weighted |
| --- | --- | --- | --- |
| Assertion pass rate (delta) | 25% | 10/10 | 2.50 |
| Typo correction quality | 20% | 9.0/10 | 1.80 |
| ASR speed (mlx-whisper) | 15% | 10/10 | 1.50 |
| Output format and post-processing | 15% | 9.5/10 | 1.43 |
| Token efficiency | 15% | 9.5/10 | 1.43 |
| Engineering quality (cache / configurability) | 10% | 9.0/10 | 0.90 |
| Weighted total | | | 9.56/10 |
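
The weighted total can be reproduced from the weights and scores above. The sketch below uses `Decimal` so that the two 1.425 products round half-up to 1.43 as shown in the table; it also shows that 9.56 is the sum of the rounded row values, while the unrounded sum is 9.55. Variable names are illustrative.

```python
from decimal import Decimal, ROUND_HALF_UP

# (dimension, weight, score out of 10), copied from the table above.
ROWS = [
    ("Assertion pass rate (delta)", "0.25", "10"),
    ("Typo correction quality", "0.20", "9.0"),
    ("ASR speed (mlx-whisper)", "0.15", "10"),
    ("Output format and post-processing", "0.15", "9.5"),
    ("Token efficiency", "0.15", "9.5"),
    ("Engineering quality (cache / configurability)", "0.10", "9.0"),
]

CENT = Decimal("0.01")

exact_products = [Decimal(weight) * Decimal(score) for _, weight, score in ROWS]
rounded_products = [p.quantize(CENT, rounding=ROUND_HALF_UP) for p in exact_products]

exact_total = sum(exact_products)
total_of_rounded_rows = sum(rounded_products)

print("unrounded total:", exact_total)
print("sum of rounded rows:", total_of_rounded_rows)
```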

8. Improvement Suggestions

8.1 [P2] Add an English-Video Evaluation Scenario

The current evaluation only covers Chinese video. The English path (which does not use LLM proofreading or the replacement table) has not yet been validated in a real task setting.

8.2 [P3] Further LLM Speed-Up Opportunities

The LLM proofreading stage still takes 73% of total runtime (330.7 / 452.3s). Possible next steps:

  • Wait for mlx-lm to support a batch inference API, so chunk inference can run truly in parallel.
  • Skip LLM proofreading for non-Chinese content to save time.
  • Use an API backend (for example Qwen-Turbo) instead of local inference, trading latency for concurrency.
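
Because the chunks have no serial cross-chunk dependency, they could in principle be dispatched concurrently once a backend allows it. The sketch below shows the general shape using a thread pool. `proofread_chunk` is a hypothetical stand-in that only applies one literal fix; it is not an mlx-lm or Qwen API call, and the job dictionaries are made up for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def proofread_chunk(job: dict) -> str:
    """Hypothetical stand-in for a remote proofreading request."""
    return job["chunk"].replace("大便车", "搭便车")


def proofread_all(jobs: list[dict], max_workers: int = 4) -> str:
    """Proofread chunks concurrently and reassemble them in source order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in input order, regardless of completion order.
        results = list(pool.map(proofread_chunk, jobs))
    return "".join(results)


jobs = [
    {"index": 0, "context": "", "chunk": "欧洲在防务上大便车。"},
    {"index": 1, "context": "大便车。", "chunk": "美国保守派对此不满。"},
]
print(proofread_all(jobs))
```

Threads suit this case because a remote call is I/O-bound; local GPU inference would not benefit in the same way, which is why the first suggestion depends on batch support in the inference library.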

9. Evaluation Artifacts

| Artifact | Path |
| --- | --- |
| Eval definition | /tmp/local-transcript-eval/iteration-1/eval-1-zh-video-full-pipeline/eval_metadata.json |
| With-skill output | /tmp/local-transcript-eval/iteration-3/with_skill/outputs/transcript.txt |
| With-skill grading | /tmp/local-transcript-eval/iteration-3/with_skill/grading.json |
| Without-skill output | /tmp/local-transcript-eval/iteration-1/eval-1-zh-video-full-pipeline/without_skill/outputs/transcript.txt |
| Without-skill grading | /tmp/local-transcript-eval/iteration-1/eval-1-zh-video-full-pipeline/without_skill/grading.json |
| Without-skill timing | /tmp/local-transcript-eval/iteration-1/eval-1-zh-video-full-pipeline/without_skill/timing.json |
| Test video | /Users/john/Downloads/《欧洲你个垃圾》...美欧大分裂 [dHiLbgTK_ME].mp4 |
| Skill path | /Users/john/.codex/skills/local-transcript/ |
| Script path | /Users/john/.codex/skills/local-transcript/scripts/local_transcript.py |

Runtime Timeline

| Event | With Skill | Without Skill |
| --- | --- | --- |
| Start | 00:06:35 | 23:07:58 |
| ASR complete | after 116.4s | |
| LLM proofreading | 4 chunks / 330.7s | |
| Finish | 00:14:08 (452.3s) | 23:19:18 (679.6s) |