Script to convert MP3 audio files to text transcriptions. Optimized for Apple Silicon chips (M1/M2/M3)
dosch 7c3bc58ae9 Surface LANGUAGE in the "skipping menus" status line
The launcher correctly exports LANGUAGE=nl to the workflow when defaults
are used, and the workflow's `${LANGUAGE:-nl}` preserves it — but the
"Skipping interactive menus" status line only listed PREPROCESS_LEVEL
and WHISPER_MODEL, making it look like the language wasn't applied.

Now the line shows all three effective values so it's clear what the run
will use.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-11 15:26:20 +02:00

categorie created modified tags
project 2026-01-06 2025-12-18

Batch MP3 to Text

Script to convert MP3 audio files to text transcriptions. Optimized for Apple Silicon chips (M1/M2/M3).

Repository Layout

All runtime files live in app/. To use this tool, copy the entire app/ folder to a working location (e.g. ~/Desktop/batch-mp3-to-text/) and run from there. The commands below assume your current directory is inside app/.

batch-mp3-to-text/
├── app/                ← runtime — copy this folder anywhere
│   ├── apple_voice_isolate(.swift)   Swift CLI for ANE voice isolation
│   ├── audio_files/                  drop inputs here
│   ├── launch_transcription.command  double-clickable launcher
│   ├── transcribe.conf               config
│   ├── transcribe_workflow.sh        main pipeline
│   └── transcriptions/               outputs land here
└── docs/               ← reference docs (this README, LICENSE, logs)

Features

  • 🚀 Batch processing - Transcribe multiple MP3 files automatically
  • 🎯 Apple Silicon optimized - Uses MLX-Whisper for 8-10x faster processing on M1/M2/M3 chips
  • 🍎 ANE-accelerated voice isolation - Apple's AUSoundIsolation AudioUnit runs on the Neural Engine (~20× realtime), default for group recordings
  • 🎙️ Demucs fallback - Classical ML vocal separation also available
  • 🎚️ Audio preprocessing - Five levels from none to full voice isolation
  • 📊 Real-time progress - Live progress bars during transcription
  • 🧠 Smart memory management - Handles large files with automatic memory cleanup
  • 🌍 Multi-language support - Supports 90+ languages including English, Dutch, French, German, Spanish, and more
  • ⚙️ Easy configuration - External config file for customization without editing scripts

Requirements

  • macOS with Apple Silicon (M1/M2/M3)
  • Python 3.9+
  • ffmpeg - For audio processing
  • mlx-whisper - ML model for transcription
  • demucs + torch + torchaudio (optional, for voice isolation)

Installation

1. Install Homebrew (if not installed)

/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"

2. Install dependencies

# Install ffmpeg
brew install ffmpeg

# Install Python 3.11
brew install python@3.11

# Install MLX-Whisper
pip3 install mlx-whisper

# Install Demucs (optional; only needed for the legacy voice isolation fallback, level 5)
pip3 install demucs diffq torch torchaudio

3. Fix Python SSL certificates (required for demucs model download)

/Applications/Python\ 3.x/Install\ Certificates.command

Replace 3.x with your Python version (e.g. 3.12). This command is installed by the python.org installer; a Homebrew Python does not include it.

4. Download this script

git clone https://github.com/YOUR_USERNAME/batch-mp3-to-text.git
# Runtime files live in app/, so work from there
cd batch-mp3-to-text/app
chmod +x transcribe_workflow.sh

Quick Start

Option 1: GUI Launcher (Easiest for macOS)

  1. Place your MP3 files in the audio_files/ directory
  2. Double-click launch_transcription.command
  3. The launcher will:
    • Show you which files will be transcribed
    • Ask for language (Dutch/English/config)
    • Ask for preprocessing level (default: 4 — voice isolation)
    • Ask for Whisper model (default: Large-v3-turbo)
    • Run transcription automatically
    • Your Mac will not sleep during transcription (uses caffeinate)

Option 2: Command Line

  1. Place your MP3 files in the audio_files/ directory
  2. Run the script:
     ./transcribe_workflow.sh
  3. Choose preprocessing level (default: 4, voice isolation)
  4. Choose Whisper model (default: Large-v3-turbo)
  5. Find your transcripts in transcriptions/

The GUI launcher already keeps your Mac awake with caffeinate. To get the same behavior when running from the command line:

caffeinate -i ./transcribe_workflow.sh
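
If the same command line should also work on a machine without caffeinate (for example when testing parts of the pipeline elsewhere), a small wrapper can fall back to running the command directly. This is a minimal sketch; `run_awake` is a hypothetical helper name and is not part of the repository.

```shell
#!/bin/sh
# Hypothetical helper: run a command under caffeinate when it exists,
# otherwise run the command directly.
run_awake() {
  if command -v caffeinate >/dev/null 2>&1; then
    # -i prevents idle sleep for as long as the wrapped command runs
    caffeinate -i "$@"
  else
    "$@"
  fi
}

# Usage (from inside app/):
#   run_awake ./transcribe_workflow.sh
```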

Custom Input Directory

./transcribe_workflow.sh /path/to/your/audio/files

Configuration

Configuration File: transcribe.conf

The script uses an external configuration file (transcribe.conf) that allows you to customize all settings without modifying the script itself. This file is automatically loaded when the script runs.

What's in the config file:

  • 🎚️ Preprocessing levels (none/basic/enhanced/apple/voice)
  • 🤖 Whisper model selection (tiny/base/small/medium/large)
  • 🌍 Language settings (90+ languages supported)
  • 📏 File size limits and chunking behavior
  • ⚙️ Feature toggles (skip existing, keep originals, diarization)
  • 🎛️ Advanced audio settings (sample rate, bitrate, filters)
  • 📊 Output preferences (combined transcripts, naming, formats)

Variable precedence: Environment variables > Config file > Script defaults

This means you can override config file settings by setting environment variables when running the script.
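
The precedence rule can be sketched as follows. This is only an illustration of "environment > config file > script default", not the code used in transcribe_workflow.sh; `resolve_setting` is a hypothetical helper, and it assumes the config file contains plain shell assignments such as LANGUAGE="en".

```shell
#!/bin/sh
# Hypothetical sketch of "environment > config file > script default".
# The function body is a subshell, so sourcing the config cannot leak out.
resolve_setting() (
  name=$1; conf=$2; default=$3
  eval "env_value=\${$name-}"          # value inherited from the environment
  if [ -n "$env_value" ]; then
    printf '%s\n' "$env_value"
    exit 0
  fi
  if [ -f "$conf" ]; then
    . "$conf" >/dev/null 2>&1          # config assigns e.g. LANGUAGE="en"
  fi
  eval "conf_value=\${$name-}"
  printf '%s\n' "${conf_value:-$default}"
)

# Usage (pass a config path that contains a slash, e.g. ./transcribe.conf):
#   LANGUAGE=$(resolve_setting LANGUAGE ./transcribe.conf nl)
```

With this shape, running `LANGUAGE=en ./transcribe_workflow.sh` would win over whatever transcribe.conf says, and "nl" is used only when neither source sets a value.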

  1. The config file is already included (transcribe.conf)

  2. Edit the config file:

    nano transcribe.conf
    # or
    open -e transcribe.conf  # Opens in TextEdit on macOS
    
  3. Customize your settings:

    # Example: Change language to English
    LANGUAGE="en"
    
    # Example: Use enhanced preprocessing for noisy audio
    PREPROCESS_LEVEL="enhanced"
    
    # Example: Use the large model for best accuracy
    WHISPER_MODEL="mlx-community/whisper-large-v3-mlx"
    
  4. Run the script normally - it will automatically load your config

The config file includes 280+ lines of detailed comments explaining each setting, all available options, and what each change means for performance and quality.

Choosing Preprocessing Level

When you run the script, you'll be asked to choose a preprocessing level:

Choose audio preprocessing level:
  1) None              - No preprocessing (fastest)
  2) Basic             - Mono + normalization + highpass filter
  3) Enhanced          - Basic + FFT noise reduction
  4) Apple isolation   - AUSoundIsolation (ANE, ~20× realtime, recommended)
  5) Demucs isolation  - ML vocal separation (slow, legacy fallback)

Enter your choice (1-5) [default: 4]:

| Level | Speed | Best for |
|---|---|---|
| 1 (None) | Fastest | Already clean, close-mic mono recordings |
| 2 (Basic) | Fast | Solo recordings, quiet environments |
| 3 (Enhanced) | Medium | Consistent background hum or hiss |
| 4 (Apple isolation) | Fast (~20× realtime, ANE) | Default. Group conversations, meetings, noisy recordings |
| 5 (Demucs isolation) | Slow (~0.51× realtime, CPU) | Legacy fallback when Apple AU is unavailable |

To set a persistent default, edit PREPROCESS_LEVEL in transcribe.conf (apple / voice / enhanced / basic / none).
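
The relationship between the menu numbers and the PREPROCESS_LEVEL names can be written as a small lookup. This is a sketch of the mapping described above, not the script's own code; `menu_choice_to_level` is a hypothetical function name.

```shell
#!/bin/sh
# Hypothetical mapping from the menu number to the PREPROCESS_LEVEL name.
menu_choice_to_level() {
  case "${1:-4}" in        # empty input (just Enter) falls back to 4
    1) echo none ;;
    2) echo basic ;;
    3) echo enhanced ;;
    4) echo apple ;;
    5) echo voice ;;
    *) echo "invalid choice: $1" >&2; return 1 ;;
  esac
}

menu_choice_to_level 5     # prints: voice
```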

Tuning Apple isolation (optional env vars):

APPLE_VOICE_WET_DRY=100             # -100 (dry) to +100 (full isolation), default 75
APPLE_VOICE_MODE="High Quality Voice"  # or "Voice" (faster mode)

Higher wet/dry removes more noise but sounds more processed. For pure transcription quality (where the audio is never heard, only fed to Whisper) you may prefer APPLE_VOICE_WET_DRY=100.
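
Because these are environment variables, they can be set for a single run without touching transcribe.conf. The sketch below adds an optional sanity check for the documented range; `valid_wet_dry` is a hypothetical helper, and whether the workflow validates the value itself is not stated here.

```shell
#!/bin/sh
# Hypothetical check: accept only integers between -100 and 100.
valid_wet_dry() {
  case "$1" in
    ''|*[!0-9-]*|?*-*|-) return 1 ;;   # empty, non-numeric, or misplaced minus
  esac
  [ "$1" -ge -100 ] && [ "$1" -le 100 ]
}

# Per-run override (from inside app/):
#   APPLE_VOICE_WET_DRY=100 APPLE_VOICE_MODE="Voice" ./transcribe_workflow.sh
```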

Model

The script prompts you to choose a Whisper model at startup:

| # | Model | RAM | Speed | Notes |
|---|---|---|---|---|
| 1 | Small | ~3-4GB | ~10x realtime | Fast, lower accuracy |
| 2 | Medium | ~5-6GB | ~10x realtime | Good accuracy |
| 3 | Large-v3-turbo | ~5-7GB | ~10x realtime | Default. Best accuracy/speed balance |

Pressing Enter selects Large-v3-turbo — full large-v3 encoder with a pruned decoder, nearly the same memory as Medium but significantly better multilingual accuracy.

Directory Structure

After running, app/ will contain:

app/
├── apple_voice_isolate          # Swift CLI binary (ANE voice isolation)
├── apple_voice_isolate.swift    # Source — rebuild with `swiftc -O apple_voice_isolate.swift -o apple_voice_isolate`
├── transcribe_workflow.sh       # Main script
├── launch_transcription.command # GUI launcher for macOS (double-click to run)
├── transcribe.conf              # Configuration file
├── audio_files/                 # Input: MP3/WAV files
├── transcriptions/              # Output: Text transcripts
│   ├── combined_transcript_*.txt
│   └── *_transcript.txt
├── preprocessed_audio/          # Processed audio files
├── temp_chunks/                 # Temporary chunks (auto-cleaned)
└── transcription_log_*.txt      # Detailed logs
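
As an illustration of the output naming, the helper below derives the transcript path for one input file. It assumes that the `*` in `*_transcript.txt` is the input file's base name; `transcript_path_for` is a hypothetical name, not a function from the script.

```shell
#!/bin/sh
# Hypothetical helper: derive the transcript path for one input file,
# assuming the "*" in *_transcript.txt is the input's base name.
transcript_path_for() {
  base=$(basename "$1")
  printf 'transcriptions/%s_transcript.txt\n' "${base%.*}"
}

# Example "skip existing" guard built on top of it:
#   out=$(transcript_path_for "audio_files/meeting.mp3")
#   if [ -f "$out" ]; then echo "skipping, already transcribed"; fi
transcript_path_for "audio_files/meeting.mp3"
```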

Why Audio Preprocessing Matters

The Whisper Hallucination Problem

Whisper, the underlying transcription model, works best on clean, close-microphone audio. Given a group recording with background noise, reverb, and cross-talk, it transcribes clear passages accurately but hallucinates during low-quality ones, generating long runs of repeated words ("Ja." "Oké." "Dan moet ik even...") instead of silence or the words actually spoken. These passages usually do contain real speech; Whisper simply cannot make it out through the noise.

Two solutions are applied in this script:

  1. --condition-on-previous-text False passed to Whisper. With this setting the text of the previous window is no longer fed in as the prompt for the next one, so a repeated phrase cannot carry over and lock the model into a repetition loop.

  2. Voice isolation preprocessing — separates speech from all non-speech before Whisper processes the audio, so Whisper only receives clean vocal signal.
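
For orientation, this is roughly what a transcription call with the first fix looks like. The sketch prints the command instead of running it, so it works without mlx-whisper installed. Only the `--condition-on-previous-text False` flag is taken from the text above; the model name, output options, and the helper name `whisper_command` are assumptions and may differ from what transcribe_workflow.sh actually passes.

```shell
#!/bin/sh
# Hypothetical sketch: print the mlx_whisper command for one file instead of running it.
# Model name and output options are assumptions; only the
# --condition-on-previous-text flag comes from the README.
whisper_command() {
  audio=$1; lang=${2:-nl}
  printf 'mlx_whisper "%s" --model %s --language %s --condition-on-previous-text False --output-dir transcriptions --output-format txt\n' \
    "$audio" "mlx-community/whisper-large-v3-turbo" "$lang"
}

whisper_command "preprocessed_audio/meeting.wav" nl
```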

Two voice isolation backends

The pipeline supports two voice isolation methods:

apple (default) — Wraps Apple's AUSoundIsolation AudioUnit (component aufx/vois/appl). Runs on the Apple Neural Engine via a small Swift CLI (apple_voice_isolate) compiled from apple_voice_isolate.swift. Roughly 20× realtime on M1, minimal RAM usage. Available on macOS with the AU installed (verify: auval -a | grep -i sound).

The Swift CLI hosts the AU in offline rendering mode with these defensive measures:

  • 500 ms warmup silence pre-roll to stabilize the ML model's internal state
  • 2048-frame render quantum (stable behavior for ML AUs)
  • Silence-detection drain (renders until output is silent for 4 consecutive buffers, with a 2 s safety cap) — necessary because AUSoundIsolation under-reports its tail time by ~4×, which would otherwise clip final consonants
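
The stop rule of the drain can be modeled in a few lines. This is a simplified illustration and not the Swift implementation: it reads one integer peak level per rendered buffer from standard input (0 meaning silence) and reports how many buffers were rendered before stopping. The default cap of 47 buffers assumes a 48 kHz sample rate (2 s × 48000 / 2048, rounded up); `drain_buffer_count` is a hypothetical name.

```shell
#!/bin/sh
# Hypothetical model of the drain loop: reads one peak level per rendered buffer
# (integers, 0 = silence) and prints how many buffers were rendered.
drain_buffer_count() {
  silent_needed=${1:-4}   # consecutive silent buffers that end the drain
  max_buffers=${2:-47}    # safety cap: ~2 s of 2048-frame buffers at an assumed 48 kHz
  rendered=0
  silent_run=0
  while [ "$rendered" -lt "$max_buffers" ] && read -r peak; do
    rendered=$((rendered + 1))
    if [ "$peak" -eq 0 ]; then
      silent_run=$((silent_run + 1))
    else
      silent_run=0          # any audible buffer resets the streak
    fi
    if [ "$silent_run" -ge "$silent_needed" ]; then
      break
    fi
  done
  echo "$rendered"
}

# Tail audio, two silent buffers, more tail, then four silent buffers in a row:
printf '5\n3\n0\n0\n2\n0\n0\n0\n0\n9\n' | drain_buffer_count   # prints 9
```

Resetting the streak on every audible buffer is what keeps a short pause inside the tail from ending the drain early.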

voice (legacy) — Uses demucs, a neural network from Facebook Research for music source separation. The --two-stems=vocals mode splits any audio into vocals and no-vocals. Runs on CPU (or MPS on Apple Silicon, but inconsistent). Roughly 0.51× realtime, ~500 MB to 2 GB RAM. Kept for systems where the Apple AU isn't available.
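
When running demucs by hand, it helps to know where the isolated vocals end up: demucs writes stems under `<output dir>/<model>/<track name>/`. The sketch below computes that path. The model is a parameter because this README mentions both htdemucs and mdx_extra_q; `demucs_vocals_path` is a hypothetical helper, and the defaults shown are demucs' own defaults, not necessarily the script's.

```shell
#!/bin/sh
# Hypothetical sketch: where demucs writes the isolated vocals for one input.
# demucs stores stems under <out_dir>/<model>/<track name>/.
demucs_vocals_path() {
  input=$1; model=${2:-htdemucs}; out_dir=${3:-separated}
  track=$(basename "$input"); track=${track%.*}
  printf '%s/%s/%s/vocals.wav\n' "$out_dir" "$model" "$track"
}

# The separation itself would be run like this (requires demucs):
#   demucs --two-stems=vocals -n htdemucs -o separated audio_files/meeting.mp3
demucs_vocals_path "audio_files/meeting.mp3"
```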

Preprocessing Levels

| Level | Tool | What it does | Use when |
|---|---|---|---|
| none | (none) | Original file as-is | Already clean mono |
| basic | ffmpeg | Stereo→mono, highpass 200Hz, loudnorm | Solo/quiet recordings |
| enhanced | ffmpeg | Basic + FFT noise reduction | Consistent background hum |
| apple (default) | AUSoundIsolation + ffmpeg | ANE voice isolation, then mono + loudnorm | Group conversations, meetings, field recordings on macOS |
| voice | demucs + ffmpeg | ML vocal separation, then mono + loudnorm | Fallback when Apple AU unavailable |
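
The ffmpeg part of each level can be pictured as an audio filter chain. The sketch below is an assumption-laden illustration: `highpass=f=200` and `loudnorm` follow the table, while `afftdn` is one ffmpeg filter that performs FFT-based denoising and may not be the filter the script uses. `ffmpeg_filter_for` is a hypothetical name.

```shell
#!/bin/sh
# Hypothetical mapping from preprocessing level to an ffmpeg audio filter chain.
# highpass and loudnorm come from the table; afftdn is an assumed choice for
# "FFT noise reduction". apple and voice run isolation first, then loudnorm.
ffmpeg_filter_for() {
  case "$1" in
    none)        echo "" ;;
    basic)       echo "highpass=f=200,loudnorm" ;;
    enhanced)    echo "highpass=f=200,afftdn,loudnorm" ;;
    apple|voice) echo "loudnorm" ;;
    *)           echo "unknown level: $1" >&2; return 1 ;;
  esac
}

# Example for the basic level (-ac 1 downmixes to mono):
#   ffmpeg -i in.mp3 -ac 1 -af "$(ffmpeg_filter_for basic)" out.wav
```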

Apple Voice Isolation Setup

Default preprocessing path. Requires the AUSoundIsolation AudioUnit (ships with modern macOS) and a compiled apple_voice_isolate binary in the same folder as transcribe_workflow.sh.

Verify the AU is installed

auval -a | grep -i sound
# Should show: aufx vois appl - Apple: AUSoundIsolation
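
The same check can drive a choice of backend. This is a sketch only; whether transcribe_workflow.sh falls back automatically is not stated here, and `pick_isolation_backend` is a hypothetical function.

```shell
#!/bin/sh
# Hypothetical check: prefer the Apple backend when the AudioUnit is listed,
# otherwise fall back to demucs ("voice").
pick_isolation_backend() {
  if command -v auval >/dev/null 2>&1 &&
     auval -a 2>/dev/null | grep -qi "AUSoundIsolation"; then
    echo apple
  else
    echo voice
  fi
}

PREPROCESS_LEVEL=$(pick_isolation_backend)
echo "Using preprocessing level: $PREPROCESS_LEVEL"
```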

Rebuild the binary

The repo ships with a pre-built app/apple_voice_isolate. If you ever need to rebuild (different Mac, source changes, etc.):

cd app
swiftc -O apple_voice_isolate.swift -o apple_voice_isolate

Requires Xcode Command Line Tools (xcode-select --install).
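
To rebuild only when necessary, the binary's timestamp can be compared with the source file. A minimal sketch, assuming both files sit in the current directory; `needs_rebuild` is a hypothetical helper.

```shell
#!/bin/sh
# Hypothetical helper: succeed when the binary is missing or older than its source.
needs_rebuild() {
  src=$1; bin=$2
  if [ ! -x "$bin" ]; then
    return 0
  fi
  # find prints the source path only if it is newer than the binary
  [ -n "$(find "$src" -newer "$bin" 2>/dev/null)" ]
}

# Usage (from inside app/):
#   if needs_rebuild apple_voice_isolate.swift apple_voice_isolate; then
#     swiftc -O apple_voice_isolate.swift -o apple_voice_isolate
#   fi
```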

Demucs Voice Isolation Setup (Legacy)

Required only if you want to use voice (preprocessing level 5) as a fallback.

Install

pip3 install demucs diffq torch torchaudio

diffq is required by the mdx_extra_q model used by this script.

Fix SSL certificate error (required for model download)

/Applications/Python\ 3.12/Install\ Certificates.command

Fix torch/torchaudio version mismatch

If you see Symbol not found or OSError when running voice isolation:

pip3 install --upgrade torch torchaudio

First run

Demucs downloads the htdemucs model (~80MB) on first use to ~/.cache/torch/hub/checkpoints/. After that, no internet is needed.


Memory Management Tips (8GB Systems)

If you're running on an 8GB M1/M2 Mac, follow these tips for optimal performance:

Before Starting Transcription

1. Free up RAM by closing memory-heavy apps (save your work first; killall does not ask before terminating an app):

# Close browsers (biggest memory hogs)
killall "Google Chrome" Safari Firefox

# Close development tools
killall "Visual Studio Code" Xcode

# Close communication apps
killall Slack Discord Teams

2. Clear memory cache:

sudo purge  # Requires password, but frees up inactive memory

3. Check available memory:

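# Page size is 16384 bytes on Apple Silicon, so read it instead of assuming 4096
vm_stat | awk -v ps="$(sysctl -n hw.pagesize)" '/Pages free/ {print int($3 * ps / 1048576) " MB free"}'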

You want at least 5GB free before starting.
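
The conversion from pages to megabytes is simple arithmetic, but it depends on the page size: Apple Silicon Macs use 16384-byte pages, while Intel Macs use 4096. The sketch below isolates that calculation; `pages_to_mb` is a hypothetical helper.

```shell
#!/bin/sh
# Hypothetical helper: convert a page count to MB for a given page size.
pages_to_mb() {
  pages=$1; page_size=$2
  echo $(( pages * page_size / 1048576 ))
}

# 327680 pages of 16384 bytes are exactly the 5 GB target:
pages_to_mb 327680 16384   # prints 5120

# Live value on macOS ($3+0 strips the trailing dot that vm_stat prints):
#   pages_to_mb "$(vm_stat | awk '/Pages free/ {print $3+0}')" "$(sysctl -n hw.pagesize)"
```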

Memory Requirements

The Medium model requires:

  • Minimum: 5GB free RAM
  • Recommended: 6GB+ free RAM for optimal performance

For 8GB systems: Close all unnecessary applications before running to ensure 5-6GB is available.

Overnight Processing

For batch processing multiple files overnight:

Using GUI Launcher (Easiest):

  1. Restart your Mac (clears all memory)
  2. Close all applications
  3. Double-click launch_transcription.command
  4. Keep Mac plugged in and leave it running
    • Caffeinate is already enabled (prevents sleep automatically)

Using Command Line:

# 1. Restart your Mac (clears all memory)
# 2. Close all applications
# 3. Run with caffeinate to prevent sleep:
caffeinate -i ./transcribe_workflow.sh

# Keep Mac plugged in and leave it running

Monitor Memory During Processing

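Open a second terminal to watch memory usage. The watch command is not bundled with macOS (install it with brew install watch), and the page size is read from sysctl because Apple Silicon uses 16 KB pages:

watch -n 5 'vm_stat | awk -v ps=$(sysctl -n hw.pagesize) "/Pages free/ {print int(\$3 * ps / 1048576) \" MB free\"}"; sysctl vm.swapusage'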

Warning signs:

  • Free memory < 500MB → May slow down
  • Swap usage > 3GB → Close more applications and restart
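
The swap threshold can be checked automatically by parsing the output of `sysctl vm.swapusage`. This is a minimal sketch; `swap_used_mb` and `check_swap` are hypothetical helpers, and the sample line stands in for real output so the example also runs off a Mac.

```shell
#!/bin/sh
# Hypothetical parser for `sysctl vm.swapusage`, whose output looks like:
#   vm.swapusage: total = 4096.00M  used = 3500.25M  free = 595.75M  (encrypted)
swap_used_mb() {
  awk '{ for (i = 1; i <= NF; i++) if ($i == "used") { print int($(i + 2)); exit } }'
}

check_swap() {
  used=$(swap_used_mb)
  if [ "${used:-0}" -gt 3072 ]; then   # 3 GB threshold from the warning signs above
    echo "WARNING: swap usage ${used} MB, close more applications and restart"
  else
    echo "swap usage OK (${used:-0} MB)"
  fi
}

# On macOS: sysctl vm.swapusage | check_swap
echo "vm.swapusage: total = 4096.00M  used = 3500.25M  free = 595.75M  (encrypted)" | check_swap
```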

Quick Memory Check Commands

# See memory hogs
ps aux | sort -nrk 4 | head -10

# Free up memory
sudo purge

# Prevent sleep during processing
caffeinate -i ./transcribe_workflow.sh

Troubleshooting

Progress bar not showing

The progress bar displays in real-time during transcription. If you don't see it, the transcription is still running - check the log file for details.

"Fetching 4 files" message

This is normal - it's loading the AI model components (only takes a second).

Out of memory errors

  • Close other applications (browsers, IDEs, etc.)
  • Run sudo purge to free up inactive memory
  • Restart your Mac for a fresh start
  • Process files one at a time instead of in batch

Script hangs

  • Check transcription_log_*.txt for errors
  • Verify MLX-Whisper is installed: pip3 show mlx-whisper
  • Ensure Python 3.9+: python3 --version

Permission denied

Make the script executable:

chmod +x transcribe_workflow.sh

Performance

On Apple Silicon M1/M2/M3:

  • Processing speed: ~10x realtime, so a 1-hour recording transcribes in approximately 6 minutes
  • MLX-Whisper is 8-10x faster than whisper.cpp on Apple Silicon
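
For planning a batch, the realtime factor translates into a rough time estimate. The sketch below assumes the ~10x figure quoted above and rounds up; actual speed depends on the model, the preprocessing level, and available memory. `estimate_minutes` is a hypothetical helper.

```shell
#!/bin/sh
# Hypothetical estimate: processing minutes for a given audio length,
# assuming the ~10x realtime figure above (rounded up).
estimate_minutes() {
  audio_minutes=$1; speedup=${2:-10}
  echo $(( (audio_minutes + speedup - 1) / speedup ))
}

estimate_minutes 60    # prints 6
estimate_minutes 95    # prints 10
```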

Language Support

Default: Dutch (nl) - This script is optimized for Dutch recordings

Why Language Setting Matters for Quality

⚠️ Setting the correct language significantly improves transcription accuracy!

Benefits of specifying the language:

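  • Better Accuracy - The language is fixed up front instead of being guessed from the opening of the recording, so decoding is steered toward the right vocabulary and spelling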
  • Faster Processing - Skips auto-detection, saving time
  • Correct Spelling - Dutch words like "bijvoorbeeld", "misschien", "waarschijnlijk" transcribed properly
  • Context Understanding - Recognizes language-specific patterns and expressions
  • Proper Nouns - Dutch names and places recognized correctly

Example:

  • Without language setting: "I can for build" (misheard English)
  • With Dutch setting: "ik kan bijvoorbeeld" (correct Dutch)

Changing the Language

Option 1: Interactive (GUI Launcher)

When using launch_transcription.command, you'll be prompted to select:

  • Dutch (Nederlands) - Default
  • English
  • Other (use config file setting)

Option 2: Configuration File

Edit transcribe.conf:

LANGUAGE="nl"    # Dutch (DEFAULT - optimized for this script)
# LANGUAGE="en"  # English
# LANGUAGE="fr"  # French
# LANGUAGE="de"  # German
# LANGUAGE="es"  # Spanish
# ... and 85+ more languages!

Supported Languages

90+ languages supported including:

  • Western European: English (en), Dutch (nl), French (fr), German (de), Spanish (es), Italian (it), Portuguese (pt)
  • Eastern European: Polish (pl), Russian (ru), Ukrainian (uk), Czech (cs), Romanian (ro)
  • Asian: Japanese (ja), Chinese (zh), Korean (ko), Hindi (hi), Vietnamese (vi), Thai (th)
  • Middle Eastern: Arabic (ar), Turkish (tr), Hebrew (he), Persian (fa)
  • Nordic: Swedish (sv), Norwegian (no), Danish (da), Finnish (fi)
  • And 60+ more languages!

See Whisper documentation for the complete list.

Logs

Detailed logs are saved to transcription_log_TIMESTAMP.txt including:

  • System information
  • Memory usage
  • Processing speeds
  • Errors and warnings
  • File locations

Security

  • No data is sent to external servers
  • All processing happens locally on your Mac
  • Models are downloaded once and cached in ~/.cache/huggingface/

Contributing

Contributions are welcome! Please feel free to submit issues or pull requests.

License

GNU General Public License v3.0 - see LICENSE file for details.

Author

Douwe

Acknowledgments

Version History

  • 1.3.0 - Voice isolation and hallucination prevention

    • 🎙️ Voice isolation: New preprocessing level 4 using demucs (Facebook Research) — ML-based vocal separation before transcription. Eliminates Whisper hallucinations in group recordings caused by background noise and cross-talk.
    • 🔇 Hallucination fix: Added --condition-on-previous-text False to Whisper — prevents repetition loops ("Ja." "Oké." etc.) during quiet passages.
    • 🔧 Fixed lowpass filter: Removed aggressive 3000Hz lowpass that was cutting off consonants. Speech frequencies extend to ~8000Hz.
    • 🤖 Simplified model menu: Three choices only — Small / Medium / Large-v3-turbo (default).
    • Faster startup: Removed "Press ENTER to continue" prompt in GUI launcher.
    • 🛡️ Better error reporting: Demucs errors (missing dependencies, torch/torchaudio version mismatch) are now diagnosed and reported with specific fix commands.
  • 1.2.0 - Model selection and accuracy improvements

    • 🤖 Interactive model selection: Choose Whisper model at startup
    • 🎯 New default model: Large-v3-turbo — better Dutch accuracy at similar memory cost to Medium
    • 🔁 Fewer repetitions: Turbo's pruned decoder reduces repeated phrases in transcriptions
  • 1.1.0 - Code quality improvements

    • Security: Fixed eval vulnerability in log_command function
    • ⚙️ Configuration: Added external transcribe.conf file for easy customization
    • 🌍 Internationalization: Standardized all UI text to English International
    • 🇳🇱 Language Optimization: Defaults to Dutch with interactive language selection in GUI launcher
  • 1.0.0 - Initial release

    • Batch MP3 transcription
    • Apple Silicon optimization
    • Audio preprocessing
    • Real-time progress display