Local Transcript¶

Overview¶

Use this skill to turn a local media file into cleaned final transcript files in .txt, .pdf, or .docx format. Extract audio with ffmpeg, transcribe with mlx-whisper (Apple Silicon GPU) or faster-whisper (CPU fallback), then clean and proofread the transcript using a two-layer correction pipeline — deterministic replacements for known ASR bugs, then LLM-based contextual proofreading for domain-specific and semantic errors.

Workflow¶

Validate the input path.
Confirm the requested output format.
Check dependencies.
Resolve the ASR mode: fast, balanced, or accurate.
Reuse cached audio/raw transcript/clean transcript layers when available.
Extract or reuse 16 kHz mono WAV audio.
Transcribe with the selected ASR backend (language auto-detected or user-specified via --language).
Clean the transcript: simplified Chinese → deterministic replacements → LLM proofreading → post-LLM safety replacements.
Paragraphize and write the requested final file(s).

Format Resolution Gate¶

If the user explicitly requests txt, pdf, or word/docx, use that format directly.
If the user requests multiple formats, generate all requested formats from the same cleaned transcript.
If the user asks to transcribe a file but does not specify an output format, ask a short follow-up question before execution: Which output format do you want: txt, pdf, or docx?
Do not guess the output format from context alone when the user did not say.

Dependency Gate¶

Before running, verify:

ffmpeg
local Python execution for scripts/local_transcript.py
Python packages required by the script: mlx-whisper, mlx-lm, faster-whisper, opencc-python-reimplemented, reportlab, python-docx
LLM proofreading: default local backend uses mlx-lm + Qwen2.5 on Apple Silicon (no API key needed). Optional claude backend requires claude CLI with API access.
For Chinese PDF output, the environment must provide at least one supported CJK font. On macOS the script prefers built-in fonts such as STHeiti.

The script bootstraps ASR models automatically if missing (downloaded from HuggingFace Hub).

If a dependency is missing, stop and say which dependency is unavailable.

Default Behavior¶

Input: one local media file path
Default output format for direct script usage: txt
Default output file: same directory as the media file, named <stem>-transcript.<ext>
Default ASR backend: mlx (Apple Silicon GPU acceleration via mlx-whisper)
Default and recommended ASR mode: balanced
Mode presets:
fast: mlx-whisper with whisper-small, no LLM proofreading override needed
balanced: mlx-whisper with whisper-large-v3-turbo + LLM proofreading (recommended)
accurate: mlx-whisper with whisper-large-v3-turbo + higher beam size + LLM proofreading
Fallback backend: --backend faster-whisper for non-Apple-Silicon machines (CPU-only, slower)
Default cache behavior: reuse three cache layers for the same unchanged media file
extracted WAV audio
raw ASR transcript
cleaned final transcript (separate caches for LLM-proofread and non-proofread)
LLM proofreading: enabled by default for Chinese transcripts
Default backend: local — uses mlx-lm on Apple Silicon GPU. No API key, no network, no cost.
- balanced/accurate mode: Qwen2.5-7B-Instruct-4bit (higher quality)
- fast mode: Qwen2.5-3B-Instruct-4bit (faster, ~50% less proofreading time)
Alternative backend: claude — uses claude -p CLI (requires API access)
Splits text into ~1500-char chunks with 400-char context overlap from the previous chunk
Short tail chunks (<500 chars) are automatically merged into the previous chunk to avoid validation failures
Video/audio title is passed to the LLM as domain context for better proper-noun correction
Output validation: rejects LLM responses that are too short/long or contain meta-commentary
Retry: failed/invalid chunks are retried up to 2 times before falling back to the original text
Can be disabled with --no-llm-proofread or --llm-backend none
For English transcripts, LLM proofreading is off by default; enable with --llm-proofread-en for complex content with non-English proper nouns
Custom model: --llm-model <hf-repo> to use a different MLX model
Language: auto-detected from speech, or user-specified via --language zh / --language en
Three-layer Chinese correction pipeline:
Deterministic replacements: a curated table of universal Whisper ASR bugs (not video-specific). Supports extra replacements via --replacements-file <path.json>.
LLM contextual proofreading: handles domain-specific terms, proper nouns, idioms, and homophones
Post-LLM safety pass: deterministic replacements applied again to catch any regressions
Proper noun unification: automatically detects near-duplicate CJK names (e.g. 哈萨迪/哈塔尼→哈萨尼) and unifies low-frequency variants to the dominant form
Final deliverable: cleaned transcript in the user-requested format(s) only
PDF output: use a Chinese-capable font when the inferred transcript language is Chinese
PDF and DOCX output: emit transcript body only, without prepending headers

Execution¶

Run (default: mlx backend, balanced mode, LLM proofreading enabled):

uv run /absolute/path/to/skills/local-transcript/scripts/local_transcript.py "/absolute/path/to/video.mp4"

Request PDF output:

uv run /absolute/path/to/skills/local-transcript/scripts/local_transcript.py "/absolute/path/to/video.mp4" --format pdf

Prioritize speed (smaller model, still fast on Apple Silicon):

uv run /absolute/path/to/skills/local-transcript/scripts/local_transcript.py "/absolute/path/to/video.mp4" --mode fast

Disable LLM proofreading (ASR-only output):

uv run /absolute/path/to/skills/local-transcript/scripts/local_transcript.py "/absolute/path/to/video.mp4" --llm-backend none

Use claude CLI for proofreading (requires API access):

uv run /absolute/path/to/skills/local-transcript/scripts/local_transcript.py "/absolute/path/to/video.mp4" --llm-backend claude

Use a different local LLM model:

uv run /absolute/path/to/skills/local-transcript/scripts/local_transcript.py "/absolute/path/to/video.mp4" --llm-model mlx-community/Qwen2.5-14B-Instruct-4bit

Specify language explicitly (skip auto-detection):

uv run /absolute/path/to/skills/local-transcript/scripts/local_transcript.py "/absolute/path/to/video.mp4" --language zh

Enable LLM proofreading for English transcripts (off by default):

uv run /absolute/path/to/skills/local-transcript/scripts/local_transcript.py "/absolute/path/to/english-video.mp4" --llm-proofread-en

Use CPU fallback backend:

uv run /absolute/path/to/skills/local-transcript/scripts/local_transcript.py "/absolute/path/to/video.mp4" --backend faster-whisper

Force a fresh transcription and ignore cache:

uv run /absolute/path/to/skills/local-transcript/scripts/local_transcript.py "/absolute/path/to/video.mp4" --format pdf --force-transcribe

Load extra replacements from a JSON file:

uv run /absolute/path/to/skills/local-transcript/scripts/local_transcript.py "/absolute/path/to/video.mp4" --replacements-file custom_fixes.json

Request multiple formats:

uv run /absolute/path/to/skills/local-transcript/scripts/local_transcript.py "/absolute/path/to/video.mp4" --format txt --format pdf --format docx

Cleaning Rules¶

Remove timestamps if present.
Collapse caption-style short lines into natural paragraphs.
For Chinese:
Convert traditional to simplified Chinese.
Apply deterministic replacements for universal Whisper ASR bugs (curated, cross-video).
Run LLM-based contextual proofreading with video title as domain context.
Apply deterministic replacements again as a post-LLM safety net.
Normalize Chinese punctuation.
The deterministic replacement table contains only universal, cross-video Whisper errors (also available as scripts/zh_replacements.json). Video-specific corrections (proper nouns, domain terms) are handled by the LLM layer and proper noun unification pass.
Users can supply additional replacements via --replacements-file for domain-specific corrections.
Preserve English output as English.
Strip trailing ASR garbage: repetitive patterns (e.g. "www www www...") from video credits or silence are auto-removed.
Do not invent missing content.

Output Contract¶

For every run, report:

Input file
Detected or inferred transcript language
ASR backend used (mlx or faster-whisper)
ASR mode used
Model used
LLM proofreading status (enabled/disabled)
Requested output format(s)
Cache status for audio/raw transcript/clean transcript
Final output path(s)
Total processing time
Whether the transcript was cleaned successfully

If execution fails, report the exact failed step and stop.

Script¶

Use scripts/local_transcript.py for the workflow. Prefer the script over retyping the pipeline manually.