| categorie | created | modified | tags |
|---|---|---|---|
| project | 2026-01-06 | 2025-12-18 |
Batch MP3 to Text
Script to convert MP3 audio files to text transcriptions. Optimized for Apple Silicon chips (M1/M2/M3).
Repository Layout
All runtime files live in app/. To use this tool, copy the entire app/ folder to a working location (e.g. ~/Desktop/batch-mp3-to-text/) and run from there. The commands below assume your current directory is inside app/.
batch-mp3-to-text/
├── app/ ← runtime — copy this folder anywhere
│ ├── apple_voice_isolate(.swift) Swift CLI for ANE voice isolation
│ ├── audio_files/ drop inputs here
│ ├── launch_transcription.command double-clickable launcher
│ ├── transcribe.conf config
│ ├── transcribe_workflow.sh main pipeline
│ └── transcriptions/ outputs land here
└── docs/ ← reference docs (this README, LICENSE, logs)
Features
- 🚀 Batch processing - Transcribe multiple MP3 files automatically
- 🎯 Apple Silicon optimized - Uses MLX-Whisper for 8-10x faster processing on M1/M2/M3 chips
- 🍎 ANE-accelerated voice isolation - Apple's AUSoundIsolation AudioUnit runs on the Neural Engine (~20× realtime), default for group recordings
- 🎙️ Demucs fallback - Classical ML vocal separation also available
- 🎚️ Audio preprocessing - Five levels from none to full voice isolation
- 📊 Real-time progress - Live progress bars during transcription
- 🧠 Smart memory management - Handles large files with automatic memory cleanup
- 🌍 Multi-language support - Supports 90+ languages including English, Dutch, French, German, Spanish, and more
- ⚙️ Easy configuration - External config file for customization without editing scripts
Requirements
- macOS with Apple Silicon (M1/M2/M3)
- Python 3.9+
- ffmpeg - For audio processing
- mlx-whisper - ML model for transcription
- demucs + torch + torchaudio (optional, for voice isolation)
Installation
1. Install Homebrew (if not installed)
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
2. Install dependencies
# Install ffmpeg
brew install ffmpeg
# Install Python 3.11
brew install python@3.11
# Install MLX-Whisper
pip3 install mlx-whisper
# Install voice isolation (recommended for group recordings)
pip3 install demucs torch torchaudio
3. Fix Python SSL certificates (required for demucs model download)
/Applications/Python\ 3.x/Install\ Certificates.command
Replace 3.x with your Python version (e.g. 3.12).
4. Download this script
git clone https://github.com/YOUR_USERNAME/batch-mp3-to-text.git
cd batch-mp3-to-text/app
chmod +x transcribe_workflow.sh
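Before the first run, it can help to confirm that the tools listed under Requirements are actually on your PATH. The sketch below is not part of this repository: `missing_tools` is a hypothetical helper name, and it only checks that each command exists, not which version is installed.

```shell
# Hypothetical helper (not part of this repo): print every command
# from the argument list that cannot be found on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Check the command-line tools named under Requirements.
missing="$(missing_tools ffmpeg python3 mlx_whisper)"
if [ -n "$missing" ]; then
    echo "Missing tools:" $missing
else
    echo "All required tools found."
fi
```

If anything is reported missing, repeat the matching step under Installation.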
Quick Start
Option 1: GUI Launcher (Easiest for macOS)
- Place your MP3 files in the audio_files/ directory
- Double-click launch_transcription.command
- The launcher will:
- Show you which files will be transcribed
- Ask for language (Dutch/English/config)
- Ask for preprocessing level (default: 4 — voice isolation)
- Ask for Whisper model (default: Large-v3-turbo)
- Run transcription automatically
- Your Mac will not sleep during transcription (uses caffeinate)
Option 2: Command Line
- Place your MP3 files in the audio_files/ directory
- Run the script:
./transcribe_workflow.sh
- Choose preprocessing level (default: 4 — voice isolation)
- Choose Whisper model (default: Large-v3-turbo)
- Find your transcripts in transcriptions/
To prevent your Mac from sleeping during long transcriptions (the GUI launcher already does this by default):
caffeinate -i ./transcribe_workflow.sh
Custom Input Directory
./transcribe_workflow.sh /path/to/your/audio/files
Configuration
Configuration File: transcribe.conf
The script uses an external configuration file (transcribe.conf) that allows you to customize all settings without modifying the script itself. This file is automatically loaded when the script runs.
What's in the config file:
- 🎚️ Preprocessing levels (none/basic/enhanced/apple/voice)
- 🤖 Whisper model selection (tiny/base/small/medium/large)
- 🌍 Language settings (90+ languages supported)
- 📏 File size limits and chunking behavior
- ⚙️ Feature toggles (skip existing, keep originals, diarization)
- 🎛️ Advanced audio settings (sample rate, bitrate, filters)
- 📊 Output preferences (combined transcripts, naming, formats)
Variable precedence: Environment variables > Config file > Script defaults
This means you can override config file settings by setting environment variables when running the script.
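One way to get this precedence in a shell script is to remember the exported values, source the config file, and then fall back to built-in defaults. The sketch below is an illustration under that assumption, not the code in transcribe_workflow.sh; `load_settings` and the `_env_*` names are hypothetical, and the defaults (`nl`, `apple`) are the ones this README documents.

```shell
# Sketch of "environment > config file > script default".
# Assumes the config file consists of plain shell assignments, e.g. LANGUAGE="en".
load_settings() {
    # 1. Remember whatever the caller set before the config is read.
    _env_language="${LANGUAGE:-}"
    _env_level="${PREPROCESS_LEVEL:-}"

    # 2. Source the config file if it exists (this may overwrite both).
    if [ -f "$1" ]; then
        . "$1"
    fi

    # 3. Environment wins, then the config value, then the built-in default.
    LANGUAGE="${_env_language:-${LANGUAGE:-nl}}"
    PREPROCESS_LEVEL="${_env_level:-${PREPROCESS_LEVEL:-apple}}"
}

# Example: load_settings ./transcribe.conf
```

With this behavior, a one-off override such as `LANGUAGE=en ./transcribe_workflow.sh` changes a single run without touching transcribe.conf.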
Quick Start: Using the Config File (Recommended)
- The config file is already included (transcribe.conf)
- Edit the config file:
nano transcribe.conf
# or
open -e transcribe.conf # Opens in TextEdit on macOS
- Customize your settings:
# Example: Change language to English
LANGUAGE="en"
# Example: Use enhanced preprocessing for noisy audio
PREPROCESS_LEVEL="enhanced"
# Example: Use the large model for best accuracy
WHISPER_MODEL="mlx-community/whisper-large-v3-mlx"
- Run the script normally - it will automatically load your config
The config file includes 280+ lines of detailed comments explaining each setting, all available options, and what each change means for performance and quality.
Choosing Preprocessing Level
When you run the script, you'll be asked to choose a preprocessing level:
Choose audio preprocessing level:
1) None - No preprocessing (fastest)
2) Basic - Mono + normalization + highpass filter
3) Enhanced - Basic + FFT noise reduction
4) Apple isolation - AUSoundIsolation (ANE, ~20× realtime, recommended)
5) Demucs isolation - ML vocal separation (slow, legacy fallback)
Enter your choice (1-5) [default: 4]:
| Level | Speed | Best for |
|---|---|---|
| 1 — None | Fastest | Already clean, close-mic mono recordings |
| 2 — Basic | Fast | Solo recordings, quiet environments |
| 3 — Enhanced | Medium | Consistent background hum or hiss |
| 4 — Apple isolation | Fast (~20× realtime, ANE) | Default. Group conversations, meetings, noisy recordings |
| 5 — Demucs isolation | Slow (~0.5–1× realtime, CPU) | Legacy fallback when Apple AU is unavailable |
To set a persistent default, edit PREPROCESS_LEVEL in transcribe.conf (apple / voice / enhanced / basic / none).
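The menu numbers and the PREPROCESS_LEVEL names describe the same five levels. A minimal sketch of that mapping follows; `level_name` is a hypothetical helper used only for illustration, and the real script may structure this differently.

```shell
# Map a menu choice (1-5) to the PREPROCESS_LEVEL name used in transcribe.conf.
# An empty choice (just pressing Enter) falls back to 4, the documented default.
level_name() {
    case "${1:-4}" in
        1) echo none ;;
        2) echo basic ;;
        3) echo enhanced ;;
        4) echo apple ;;
        5) echo voice ;;
        *) echo "invalid choice: $1" >&2; return 1 ;;
    esac
}

level_name 2     # prints: basic
level_name ""    # prints: apple
```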
Tuning Apple isolation (optional env vars):
APPLE_VOICE_WET_DRY=100 # -100 (dry) to +100 (full isolation), default 75
APPLE_VOICE_MODE="High Quality Voice" # or "Voice" (faster mode)
Higher wet/dry values remove more noise but sound more processed. For pure transcription quality (where the audio is never heard, only fed to Whisper) you may prefer APPLE_VOICE_WET_DRY=100.
Model
The script prompts you to choose a Whisper model at startup:
| # | Model | RAM | Speed | Notes |
|---|---|---|---|---|
| 1 | Small | ~3-4GB | ~10x realtime | Fast, lower accuracy |
| 2 | Medium | ~5-6GB | ~10x realtime | Good accuracy |
| 3 | Large-v3-turbo | ~5-7GB | ~10x realtime | Default. Best accuracy/speed balance |
Pressing Enter selects Large-v3-turbo — full large-v3 encoder with a pruned decoder, nearly the same memory as Medium but significantly better multilingual accuracy.
Directory Structure
After running, app/ will contain:
app/
├── apple_voice_isolate # Swift CLI binary (ANE voice isolation)
├── apple_voice_isolate.swift # Source — rebuild with `swiftc -O apple_voice_isolate.swift -o apple_voice_isolate`
├── transcribe_workflow.sh # Main script
├── launch_transcription.command # GUI launcher for macOS (double-click to run)
├── transcribe.conf # Configuration file
├── audio_files/ # Input: MP3/WAV files
├── transcriptions/ # Output: Text transcripts
│ ├── combined_transcript_*.txt
│ └── *_transcript.txt
├── preprocessed_audio/ # Processed audio files
├── temp_chunks/ # Temporary chunks (auto-cleaned)
└── transcription_log_*.txt # Detailed logs
Why Audio Preprocessing Matters
The Whisper Hallucination Problem
Whisper, the underlying transcription model, works best on clean, close-microphone mono audio. Given a group recording with background noise, reverb, and cross-talk, it transcribes clear passages accurately but hallucinates during low-quality passages, generating long runs of repeated words ("Ja." "Oké." "Dan moet ik even...") instead of silence or the actual speech. The audio at those points is not silent: it contains real speech that Whisper cannot hear through the noise.
Two solutions are applied in this script:
- --condition-on-previous-text False passed to Whisper: prevents the model from "locking on" to a token it already generated, which is the mechanism behind repetition loops.
- Voice isolation preprocessing: separates speech from all non-speech before Whisper processes the audio, so Whisper only receives clean vocal signal.
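To show where the first fix sits on the command line, here is a hedged sketch of a single mlx_whisper call. `transcribe_one` is a hypothetical wrapper, the example file name is made up, and the real transcribe_workflow.sh may pass other or additional options; the model name is the one shown in the configuration example.

```shell
# Sketch only: the real transcribe_workflow.sh may pass different options.
transcribe_one() {
    # $1 = audio file, $2 = language code (default: nl)
    mlx_whisper "$1" \
        --model "${WHISPER_MODEL:-mlx-community/whisper-large-v3-mlx}" \
        --language "${2:-nl}" \
        --condition-on-previous-text False \
        --output-dir transcriptions \
        --output-format txt
}

# Example (hypothetical file name):
# transcribe_one preprocessed_audio/meeting.mp3 nl
```

With the flag set to False, each segment is decoded without the previous segment's text as a prompt, so a repeated word in one segment is not carried into the next.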
Two voice isolation backends
The pipeline supports two voice isolation methods:
apple (default) — Wraps Apple's AUSoundIsolation AudioUnit (component aufx/vois/appl). Runs on the Apple Neural Engine via a small Swift CLI (apple_voice_isolate) compiled from apple_voice_isolate.swift. Roughly 20× realtime on M1, minimal RAM usage. Available on macOS with the AU installed (verify: auval -a | grep -i sound).
The Swift CLI hosts the AU in offline rendering mode with these defensive measures:
- 500 ms warmup silence pre-roll to stabilize the ML model's internal state
- 2048-frame render quantum (stable behavior for ML AUs)
- Silence-detection drain (renders until output is silent for 4 consecutive buffers, with a 2 s safety cap). This is necessary because AUSoundIsolation under-reports its tail time by ~4×, which would otherwise clip final consonants
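To get a feel for these numbers, the durations above can be converted into render calls of 2048 frames each. The arithmetic below assumes a 48000 Hz sample rate purely for illustration (this README does not state the rate the CLI uses), and `quanta_for_ms` is a hypothetical helper.

```shell
# Assumption: 48000 Hz is used only for illustration; the CLI's actual
# processing rate is not stated in this README.
quanta_for_ms() {
    # $1 = duration in ms, $2 = sample rate in Hz, $3 = frames per render call
    frames=$(( $1 * $2 / 1000 ))
    # Round up so the last partial buffer is still rendered.
    echo $(( (frames + $3 - 1) / $3 ))
}

quanta_for_ms 500 48000 2048    # warmup pre-roll: prints 12
quanta_for_ms 2000 48000 2048   # drain safety cap: prints 47
```

Under that assumption the warmup costs about a dozen render calls, and the drain loop stops after at most 47 calls even if the output never becomes silent.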
voice (legacy) — Uses demucs, a neural network from Facebook Research for music source separation. The --two-stems=vocals mode splits any audio into vocals and no-vocals. Runs on CPU (or MPS on Apple Silicon, but inconsistent). Roughly 0.5–1× realtime, ~500 MB to 2 GB RAM. Kept for systems where the Apple AU isn't available.
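A rough sketch of the demucs path follows. It relies on demucs writing its stems to `<output dir>/<model name>/<track name>/`; the helper names and the `_voice.wav` suffix are hypothetical, and the model name is left as a parameter because the script's own choice is not shown here.

```shell
# Where demucs writes the isolated vocals: <out_dir>/<model>/<track>/vocals.wav
demucs_vocals_path() {
    # $1 = input file, $2 = output dir, $3 = demucs model name
    track="$(basename "$1")"
    echo "$2/$3/${track%.*}/vocals.wav"
}

isolate_voice() {
    # $1 = input file, $2 = output dir, $3 = demucs model name
    demucs --two-stems=vocals -n "$3" -o "$2" "$1" || return 1
    # Downmix the separated vocals to mono and normalize loudness.
    ffmpeg -y -i "$(demucs_vocals_path "$1" "$2" "$3")" \
        -ac 1 -af loudnorm "$2/$(basename "${1%.*}")_voice.wav"
}

# Example (hypothetical file name):
# isolate_voice audio_files/meeting.mp3 preprocessed_audio htdemucs
```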
Preprocessing Levels
| Level | Tool | What it does | Use when |
|---|---|---|---|
| none | - | Original file as-is | Already clean mono |
| basic | ffmpeg | Stereo→mono, highpass 200Hz, loudnorm | Solo/quiet recordings |
| enhanced | ffmpeg | Basic + FFT noise reduction | Consistent background hum |
| apple (default) | AUSoundIsolation + ffmpeg | ANE voice isolation, then mono + loudnorm | Group conversations, meetings, field recordings on macOS |
| voice | demucs + ffmpeg | ML vocal separation, then mono + loudnorm | Fallback when Apple AU unavailable |
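The two ffmpeg-only levels can be expressed as filter chains. The sketch below is an approximation, not the script's exact command: it assumes the "FFT noise reduction" step corresponds to ffmpeg's `afftdn` filter with default settings, and both function names are hypothetical.

```shell
# Assumption: "FFT noise reduction" is done with ffmpeg's afftdn filter.
ffmpeg_filter_for_level() {
    case "$1" in
        basic)    echo "highpass=f=200,loudnorm" ;;
        enhanced) echo "highpass=f=200,afftdn,loudnorm" ;;
        *)        return 1 ;;   # none/apple/voice take other paths
    esac
}

preprocess_with_ffmpeg() {
    # $1 = level, $2 = input file, $3 = output file
    filter="$(ffmpeg_filter_for_level "$1")" || return 1
    # -ac 1 downmixes stereo to mono; -af applies the filter chain.
    ffmpeg -y -i "$2" -ac 1 -af "$filter" "$3"
}

# Example (hypothetical file names):
# preprocess_with_ffmpeg basic audio_files/solo.mp3 preprocessed_audio/solo.wav
```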
Apple Voice Isolation Setup
Default preprocessing path. Requires the AUSoundIsolation AudioUnit (ships with modern macOS) and a compiled apple_voice_isolate binary in the same folder as transcribe_workflow.sh.
Verify the AU is installed
auval -a | grep -i sound
# Should show: aufx vois appl - Apple: AUSoundIsolation
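The same check can be scripted to decide which isolation backend to use. This is a sketch under the assumption that falling back to `voice` is acceptable whenever the AU or the compiled binary is missing; `pick_isolation_backend` is a hypothetical helper, not a function from transcribe_workflow.sh.

```shell
# Print "apple" when both the compiled CLI and the AudioUnit are present,
# otherwise print "voice" (the demucs fallback).
pick_isolation_backend() {
    # $1 = directory that should contain the apple_voice_isolate binary
    if [ -x "$1/apple_voice_isolate" ] &&
       auval -a 2>/dev/null | grep -q "aufx vois appl"; then
        echo apple
    else
        echo voice
    fi
}

# Example, run from inside app/:
# PREPROCESS_LEVEL="$(pick_isolation_backend .)" ./transcribe_workflow.sh
```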
Rebuild the binary
The repo ships with a pre-built app/apple_voice_isolate. If you ever need to rebuild (different Mac, source changes, etc.):
cd app
swiftc -O apple_voice_isolate.swift -o apple_voice_isolate
Requires Xcode Command Line Tools (xcode-select --install).
Demucs Voice Isolation Setup (Legacy)
Required only if you want to use voice (preprocessing level 5) as a fallback.
Install
pip3 install demucs diffq torch torchaudio
diffq is required by the mdx_extra_q model used by this script.
Fix SSL certificate error (required for model download)
/Applications/Python\ 3.12/Install\ Certificates.command
Fix torch/torchaudio version mismatch
If you see Symbol not found or OSError when running voice isolation:
pip3 install --upgrade torch torchaudio
First run
Demucs downloads the htdemucs model (~80MB) on first use to ~/.cache/torch/hub/checkpoints/. After that, no internet is needed.
Memory Management Tips (8GB Systems)
If you're running on an 8GB M1/M2 Mac, follow these tips for optimal performance:
Before Starting Transcription
1. Free up RAM by closing memory-heavy apps:
# Close browsers (biggest memory hogs)
killall "Google Chrome" Safari Firefox
# Close development tools
killall "Visual Studio Code" Xcode
# Close communication apps
killall Slack Discord Teams
2. Clear memory cache:
sudo purge # Requires password, but frees up inactive memory
3. Check available memory:
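# Apple Silicon uses 16 KB pages, so read the page size from vm_stat's header instead of assuming 4096
vm_stat | awk '/page size of/ {ps=$8} /Pages free/ {print int($3 * ps / 1048576) " MB free"}'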
You want at least 5GB free before starting.
Memory Requirements
The Medium model requires:
- Minimum: 5GB free RAM
- Recommended: 6GB+ free RAM for optimal performance
For 8GB systems: Close all unnecessary applications before running to ensure 5-6GB is available.
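The 5GB minimum can be checked automatically before a run. The helpers below are hypothetical and only a sketch: they parse `vm_stat` output from standard input, read the page size from its header line, and compare the result against a threshold in MB.

```shell
free_mb_from_vm_stat() {
    # Reads `vm_stat` output on stdin and prints free memory in MB.
    awk '/page size of/ { ps = $8 }
         /^Pages free/  { sub(/\.$/, "", $3); print int($3 * ps / 1048576) }'
}

enough_memory() {
    # $1 = free MB, $2 = required MB (default 5120, i.e. the 5GB minimum)
    [ "$1" -ge "${2:-5120}" ]
}

# On macOS:
# free="$(vm_stat | free_mb_from_vm_stat)"
# enough_memory "$free" || echo "Only ${free} MB free; close some apps first."
```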
Overnight Processing
For batch processing multiple files overnight:
Using GUI Launcher (Easiest):
- Restart your Mac (clears all memory)
- Close all applications
- Double-click launch_transcription.command
- Keep Mac plugged in and leave it running
- ✅ Caffeinate is already enabled (prevents sleep automatically)
Using Command Line:
# 1. Restart your Mac (clears all memory)
# 2. Close all applications
# 3. Run with caffeinate to prevent sleep:
caffeinate -i ./transcribe_workflow.sh
# Keep Mac plugged in and leave it running
Monitor Memory During Processing
Open a second terminal to watch memory usage:
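# `watch` is not installed on macOS by default; a plain loop does the same job
while true; do vm_stat | awk '/page size of/ {ps=$8} /Pages free/ {print int($3 * ps / 1048576) " MB free"}'; sysctl vm.swapusage; sleep 5; done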
Warning signs:
- Free memory < 500MB → May slow down
- Swap usage > 3GB → Close more applications and restart
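These two thresholds can also be checked by a small script. The sketch below assumes `sysctl vm.swapusage` reports its values in megabytes with an `M` suffix, as in the sample line in the comment; both function names are hypothetical.

```shell
swap_used_mb() {
    # Reads one line of `sysctl vm.swapusage` output on stdin, e.g.
    # vm.swapusage: total = 4096.00M  used = 3200.50M  free = 895.50M  (encrypted)
    awk '{ for (i = 1; i <= NF; i++)
               if ($i == "used") { v = $(i + 2); sub(/M$/, "", v); print int(v) } }'
}

memory_warning() {
    # $1 = free MB, $2 = swap used MB; thresholds follow the warning signs above
    if [ "$1" -lt 500 ]; then echo "low free memory"; fi
    if [ "$2" -gt 3072 ]; then echo "high swap usage"; fi
}

# On macOS:
# memory_warning 400 "$(sysctl vm.swapusage | swap_used_mb)"
```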
Quick Memory Check Commands
# See memory hogs
ps aux | sort -nrk 4 | head -10
# Free up memory
sudo purge
# Prevent sleep during processing
caffeinate -i ./transcribe_workflow.sh
Troubleshooting
Progress bar not showing
The progress bar displays in real-time during transcription. If you don't see it, the transcription is still running - check the log file for details.
"Fetching 4 files" message
This is normal - it's loading the AI model components (only takes a second).
Out of memory errors
- Close other applications (browsers, IDEs, etc.)
- Run sudo purge to free up inactive memory
- Restart your Mac for a fresh start
- Process files one at a time instead of in batch
Script hangs
- Check transcription_log_*.txt for errors
- Verify MLX-Whisper is installed: mlx_whisper --version
- Ensure Python 3.9+: python3 --version
Permission denied
Make the script executable:
chmod +x transcribe_workflow.sh
Performance
On Apple Silicon M1/M2/M3:
- Processing speed: ~10x realtime (a 1-hour recording transcribes in approximately 6 minutes)
- MLX-Whisper is 8-10x faster than whisper.cpp on Apple Silicon
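For planning a batch, the ~10x figure turns into a simple estimate. `estimate_minutes` is a hypothetical helper; it rounds up to whole minutes and ignores preprocessing time, so treat the result as a rough lower bound.

```shell
estimate_minutes() {
    # $1 = audio length in whole seconds, $2 = realtime factor (default 10)
    echo $(( ($1 / ${2:-10} + 59) / 60 ))
}

estimate_minutes 3600       # 1 hour at ~10x realtime: prints 6
estimate_minutes 5400 10    # 90 minutes: prints 9

# The length of a real file can be read with ffprobe (installed with ffmpeg):
# secs="$(ffprobe -v error -show_entries format=duration \
#         -of default=noprint_wrappers=1:nokey=1 audio_files/meeting.mp3)"
# estimate_minutes "${secs%.*}"
```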
Language Support
Default: Dutch (nl) - This script is optimized for Dutch recordings
Why Language Setting Matters for Quality
⚠️ Setting the correct language significantly improves transcription accuracy!
Benefits of specifying the language:
- ✅ Better Accuracy - Uses language-specific vocabulary and grammar models
- ✅ Faster Processing - Skips auto-detection, saving time
- ✅ Correct Spelling - Dutch words like "bijvoorbeeld", "misschien", "waarschijnlijk" transcribed properly
- ✅ Context Understanding - Recognizes language-specific patterns and expressions
- ✅ Proper Nouns - Dutch names and places recognized correctly
Example:
- ❌ Without language setting: "I can for build" (misheard English)
- ✅ With Dutch setting: "ik kan bijvoorbeeld" (correct Dutch)
Changing the Language
Option 1: Interactive (GUI Launcher)
When using launch_transcription.command, you'll be prompted to select:
- Dutch (Nederlands) - Default
- English
- Other (use config file setting)
Option 2: Configuration File
Edit transcribe.conf:
LANGUAGE="nl" # Dutch (DEFAULT - optimized for this script)
# LANGUAGE="en" # English
# LANGUAGE="fr" # French
# LANGUAGE="de" # German
# LANGUAGE="es" # Spanish
# ... and 85+ more languages!
Supported Languages
90+ languages supported including:
- Western European: English (en), Dutch (nl), French (fr), German (de), Spanish (es), Italian (it), Portuguese (pt)
- Eastern European: Polish (pl), Russian (ru), Ukrainian (uk), Czech (cs), Romanian (ro)
- Asian: Japanese (ja), Chinese (zh), Korean (ko), Hindi (hi), Vietnamese (vi), Thai (th)
- Middle Eastern: Arabic (ar), Turkish (tr), Hebrew (he), Persian (fa)
- Nordic: Swedish (sv), Norwegian (no), Danish (da), Finnish (fi)
- And 60+ more languages!
See Whisper documentation for the complete list.
Logs
Detailed logs are saved to transcription_log_TIMESTAMP.txt including:
- System information
- Memory usage
- Processing speeds
- Errors and warnings
- File locations
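When several runs have left logs behind, the newest one is usually the one you want. A small sketch follows, assuming the logs sit in the directory you pass in and follow the transcription_log_*.txt pattern; `latest_log` is a hypothetical helper.

```shell
latest_log() {
    # Print the most recently modified transcription_log_*.txt
    # in directory $1 (default: current directory).
    ls -t "${1:-.}"/transcription_log_*.txt 2>/dev/null | head -n 1
}

# Example: show the last 50 lines of the newest log
# tail -n 50 "$(latest_log .)"
```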
Security
- No data is sent to external servers
- All processing happens locally on your Mac
- Models are downloaded once and cached in ~/.cache/huggingface/
Contributing
Contributions are welcome! Please feel free to submit issues or pull requests.
License
GNU General Public License v3.0 - see LICENSE file for details.
Author
Douwe
Acknowledgments
- MLX-Whisper - Fast Whisper implementation for Apple Silicon
- OpenAI Whisper - Original Whisper model
- ffmpeg - Audio processing
Version History
- 1.3.0 - Voice isolation and hallucination prevention
- 🎙️ Voice isolation: New preprocessing level 4 using demucs (Facebook Research) — ML-based vocal separation before transcription. Eliminates Whisper hallucinations in group recordings caused by background noise and cross-talk.
- 🔇 Hallucination fix: Added --condition-on-previous-text False to Whisper, which prevents repetition loops ("Ja." "Oké." etc.) during quiet passages.
- 🔧 Fixed lowpass filter: Removed aggressive 3000Hz lowpass that was cutting off consonants. Speech frequencies extend to ~8000Hz.
- 🤖 Simplified model menu: Three choices only — Small / Medium / Large-v3-turbo (default).
- ⚡ Faster startup: Removed "Press ENTER to continue" prompt in GUI launcher.
- 🛡️ Better error reporting: Demucs errors (missing dependencies, torch/torchaudio version mismatch) are now diagnosed and reported with specific fix commands.
- 1.2.0 - Model selection and accuracy improvements
- 🤖 Interactive model selection: Choose Whisper model at startup
- 🎯 New default model: Large-v3-turbo — better Dutch accuracy at similar memory cost to Medium
- 🔁 Fewer repetitions: Turbo's pruned decoder reduces repeated phrases in transcriptions
- 1.1.0 - Code quality improvements
- ✅ Security: Fixed eval vulnerability in log_command function
- ⚙️ Configuration: Added external transcribe.conf file for easy customization
- 🌍 Internationalization: Standardized all UI text to English International
- 🇳🇱 Language Optimization: Defaults to Dutch with interactive language selection in GUI launcher
- 1.0.0 - Initial release
- Batch MP3 transcription
- Apple Silicon optimization
- Audio preprocessing
- Real-time progress display