Local Transcript¶
Overview¶
Use this skill to turn a local media file into cleaned final transcript files in .txt, .pdf, or .docx format. Extract audio with ffmpeg, transcribe with mlx-whisper (Apple Silicon GPU) or faster-whisper (CPU fallback), then clean and proofread the transcript using a two-layer correction pipeline — deterministic replacements for known ASR bugs, then LLM-based contextual proofreading for domain-specific and semantic errors.
Workflow¶
- Validate the input path.
- Confirm the requested output format.
- Check dependencies.
- Resolve the ASR mode:
fast,balanced, oraccurate. - Reuse cached audio/raw transcript/clean transcript layers when available.
- Extract or reuse 16 kHz mono WAV audio.
- Transcribe with the selected ASR backend (language auto-detected or user-specified via
--language). - Clean the transcript: simplified Chinese → deterministic replacements → LLM proofreading → post-LLM safety replacements.
- Paragraphize and write the requested final file(s).
Format Resolution Gate¶
- If the user explicitly requests
txt,pdf, orword/docx, use that format directly. - If the user requests multiple formats, generate all requested formats from the same cleaned transcript.
- If the user asks to transcribe a file but does not specify an output format, ask a short follow-up question before execution:
Which output format do you want: txt, pdf, or docx? - Do not guess the output format from context alone when the user did not say.
Dependency Gate¶
Before running, verify:
ffmpeg- local Python execution for
scripts/local_transcript.py - Python packages required by the script:
mlx-whisper,mlx-lm,faster-whisper,opencc-python-reimplemented,reportlab,python-docx - LLM proofreading: default
localbackend usesmlx-lm+ Qwen2.5 on Apple Silicon (no API key needed). Optionalclaudebackend requiresclaudeCLI with API access. - For Chinese PDF output, the environment must provide at least one supported CJK font. On macOS the script prefers built-in fonts such as
STHeiti.
The script bootstraps ASR models automatically if missing (downloaded from HuggingFace Hub).
If a dependency is missing, stop and say which dependency is unavailable.
Default Behavior¶
- Input: one local media file path
- Default output format for direct script usage:
txt - Default output file: same directory as the media file, named
<stem>-transcript.<ext> - Default ASR backend:
mlx(Apple Silicon GPU acceleration via mlx-whisper) - Default and recommended ASR mode:
balanced - Mode presets:
fast: mlx-whisper withwhisper-small, no LLM proofreading override neededbalanced: mlx-whisper withwhisper-large-v3-turbo+ LLM proofreading (recommended)accurate: mlx-whisper withwhisper-large-v3-turbo+ higher beam size + LLM proofreading- Fallback backend:
--backend faster-whisperfor non-Apple-Silicon machines (CPU-only, slower) - Default cache behavior: reuse three cache layers for the same unchanged media file
- extracted WAV audio
- raw ASR transcript
- cleaned final transcript (separate caches for LLM-proofread and non-proofread)
- LLM proofreading: enabled by default for Chinese transcripts
- Default backend:
local— usesmlx-lmon Apple Silicon GPU. No API key, no network, no cost.balanced/accuratemode:Qwen2.5-7B-Instruct-4bit(higher quality)fastmode:Qwen2.5-3B-Instruct-4bit(faster, ~50% less proofreading time)
- Alternative backend:
claude— usesclaude -pCLI (requires API access) - Splits text into ~1500-char chunks with 400-char context overlap from the previous chunk
- Short tail chunks (<500 chars) are automatically merged into the previous chunk to avoid validation failures
- Video/audio title is passed to the LLM as domain context for better proper-noun correction
- Output validation: rejects LLM responses that are too short/long or contain meta-commentary
- Retry: failed/invalid chunks are retried up to 2 times before falling back to the original text
- Can be disabled with
--no-llm-proofreador--llm-backend none - For English transcripts, LLM proofreading is off by default; enable with
--llm-proofread-enfor complex content with non-English proper nouns - Custom model:
--llm-model <hf-repo>to use a different MLX model - Language: auto-detected from speech, or user-specified via
--language zh/--language en - Three-layer Chinese correction pipeline:
- Deterministic replacements: a curated table of universal Whisper ASR bugs (not video-specific). Supports extra replacements via
--replacements-file <path.json>. - LLM contextual proofreading: handles domain-specific terms, proper nouns, idioms, and homophones
- Post-LLM safety pass: deterministic replacements applied again to catch any regressions
- Proper noun unification: automatically detects near-duplicate CJK names (e.g. 哈萨迪/哈塔尼→哈萨尼) and unifies low-frequency variants to the dominant form
- Final deliverable: cleaned transcript in the user-requested format(s) only
- PDF output: use a Chinese-capable font when the inferred transcript language is Chinese
- PDF and DOCX output: emit transcript body only, without prepending headers
Execution¶
Run (default: mlx backend, balanced mode, LLM proofreading enabled):
uv run /absolute/path/to/skills/local-transcript/scripts/local_transcript.py "/absolute/path/to/video.mp4"
Request PDF output:
uv run /absolute/path/to/skills/local-transcript/scripts/local_transcript.py "/absolute/path/to/video.mp4" --format pdf
Prioritize speed (smaller model, still fast on Apple Silicon):
uv run /absolute/path/to/skills/local-transcript/scripts/local_transcript.py "/absolute/path/to/video.mp4" --mode fast
Disable LLM proofreading (ASR-only output):
uv run /absolute/path/to/skills/local-transcript/scripts/local_transcript.py "/absolute/path/to/video.mp4" --llm-backend none
Use claude CLI for proofreading (requires API access):
uv run /absolute/path/to/skills/local-transcript/scripts/local_transcript.py "/absolute/path/to/video.mp4" --llm-backend claude
Use a different local LLM model:
uv run /absolute/path/to/skills/local-transcript/scripts/local_transcript.py "/absolute/path/to/video.mp4" --llm-model mlx-community/Qwen2.5-14B-Instruct-4bit
Specify language explicitly (skip auto-detection):
uv run /absolute/path/to/skills/local-transcript/scripts/local_transcript.py "/absolute/path/to/video.mp4" --language zh
Enable LLM proofreading for English transcripts (off by default):
uv run /absolute/path/to/skills/local-transcript/scripts/local_transcript.py "/absolute/path/to/english-video.mp4" --llm-proofread-en
Use CPU fallback backend:
uv run /absolute/path/to/skills/local-transcript/scripts/local_transcript.py "/absolute/path/to/video.mp4" --backend faster-whisper
Force a fresh transcription and ignore cache:
uv run /absolute/path/to/skills/local-transcript/scripts/local_transcript.py "/absolute/path/to/video.mp4" --format pdf --force-transcribe
Load extra replacements from a JSON file:
uv run /absolute/path/to/skills/local-transcript/scripts/local_transcript.py "/absolute/path/to/video.mp4" --replacements-file custom_fixes.json
Request multiple formats:
uv run /absolute/path/to/skills/local-transcript/scripts/local_transcript.py "/absolute/path/to/video.mp4" --format txt --format pdf --format docx
Cleaning Rules¶
- Remove timestamps if present.
- Collapse caption-style short lines into natural paragraphs.
- For Chinese:
- Convert traditional to simplified Chinese.
- Apply deterministic replacements for universal Whisper ASR bugs (curated, cross-video).
- Run LLM-based contextual proofreading with video title as domain context.
- Apply deterministic replacements again as a post-LLM safety net.
- Normalize Chinese punctuation.
- The deterministic replacement table contains only universal, cross-video Whisper errors (also available as
scripts/zh_replacements.json). Video-specific corrections (proper nouns, domain terms) are handled by the LLM layer and proper noun unification pass. - Users can supply additional replacements via
--replacements-filefor domain-specific corrections. - Preserve English output as English.
- Strip trailing ASR garbage: repetitive patterns (e.g. "www www www...") from video credits or silence are auto-removed.
- Do not invent missing content.
Output Contract¶
For every run, report:
- Input file
- Detected or inferred transcript language
- ASR backend used (mlx or faster-whisper)
- ASR mode used
- Model used
- LLM proofreading status (enabled/disabled)
- Requested output format(s)
- Cache status for audio/raw transcript/clean transcript
- Final output path(s)
- Total processing time
- Whether the transcript was cleaned successfully
If execution fails, report the exact failed step and stop.
Script¶
Use scripts/local_transcript.py for the workflow. Prefer the script over retyping the pipeline manually.