# Transcribe - Audio Transcription Tool
A CLI tool for transcribing audio files using OpenAI's Whisper model with speaker diarization and multiple output formats.
## Features
- Multiple Whisper model sizes (tiny, base, small, medium, large, turbo)
- Speaker diarization using voice embeddings (resemblyzer + clustering)
- Multiple output formats: SRT subtitles, plain text, JSON
- Batch processing of multiple audio files
- Automatic language detection
- Progress indicators with spinners
## Installation

### Prerequisites
- Go 1.20+
- Python 3.8+
- FFmpeg
### Python Dependencies

```bash
# Required for transcription
pip install openai-whisper

# Required for speaker diarization
pip install resemblyzer scikit-learn librosa
```
**Note:** If `resemblyzer` fails to install because of its `webrtcvad` dependency, install the Python development headers first:

```bash
# Fedora/RHEL
sudo dnf install python3-devel

# Ubuntu/Debian
sudo apt install python3-dev
```
### Build from Source

```bash
go build -o transcribe
```
## Usage

An output file (`-o`) is required unless `--no-write` is specified.
### Basic Transcription

```bash
./transcribe audio.mp3 -o output.srt
```
### Choose a Whisper Model

```bash
./transcribe audio.mp3 --model small -o output.srt
```

Available models: `tiny` (default), `base`, `small`, `medium`, `large`, `turbo`.
### Output Formats

SRT subtitles (default):

```bash
./transcribe audio.mp3 -o subtitles.srt
```

Plain text with timestamps:

```bash
./transcribe audio.mp3 --format text -o output.txt
```

JSON:

```bash
./transcribe audio.mp3 --format json -o output.json
```
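The JSON schema isn't documented in this README. Purely as an illustration, a segment-oriented layout such as the following is plausible; every field name below is an assumption, not the tool's confirmed output:

```json
{
  "segments": [
    { "start": 0.0, "end": 5.2, "text": "Hello, how are you?", "speaker": 1 }
  ]
}
```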
### Speaker Diarization

Enable automatic speaker detection:

```bash
./transcribe audio.mp3 --diarize -o output.srt
```

Specify the number of speakers for better accuracy:

```bash
./transcribe audio.mp3 --diarize --speakers 2 -o output.srt
```
### Print to stdout

```bash
./transcribe audio.mp3 --no-write
```
### Full Example

Transcribe with speaker diarization:

```bash
./transcribe interview.wav --model small --diarize -s 2 -o interview.srt
```

Output:

```
1
00:00:00,000 --> 00:00:05,200
[Speaker 1] Hello, how are you?

2
00:00:05,200 --> 00:00:12,300
[Speaker 2] I'm doing well, thanks!
```
## CLI Reference

```
Usage:
  transcribe <audio files...> [flags]

Flags:
      --diarize         Enable speaker diarization
  -f, --format string   Output format: srt, text, json (default "srt")
  -h, --help            help for transcribe
  -m, --model string    Whisper model: tiny, base, small, medium, large, turbo (default "tiny")
      --no-write        Print output to stdout instead of file
  -o, --output string   Output file path (required)
  -s, --speakers int    Number of speakers (0 = auto-detect)
```
## Supported Audio Formats
MP3, WAV, FLAC, M4A, OGG, OPUS
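The architecture below lists `pkg/audio/audio.go` as "Audio file validation", which for these formats plausibly amounts to an extension allow-list. A minimal sketch of that idea; the function name and error text are illustrative, not the project's actual code:

```go
package audio

import (
	"fmt"
	"path/filepath"
	"strings"
)

// supportedExts lists the audio extensions the CLI accepts.
var supportedExts = map[string]bool{
	".mp3": true, ".wav": true, ".flac": true,
	".m4a": true, ".ogg": true, ".opus": true,
}

// Validate returns an error if the file's extension is not a supported format.
// (Hypothetical helper; the real pkg/audio API may differ.)
func Validate(path string) error {
	ext := strings.ToLower(filepath.Ext(path))
	if !supportedExts[ext] {
		return fmt.Errorf("unsupported audio format %q", ext)
	}
	return nil
}
```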
## Architecture

```
transcribe/
├── cmd/
│   └── root.go              # CLI commands and flags
├── internal/
│   ├── whisper/
│   │   └── client.go        # Whisper Python bridge
│   └── diarization/
│       ├── client.go        # Diarization Python bridge
│       └── align.go         # Speaker-segment alignment
├── pkg/
│   ├── audio/
│   │   └── audio.go         # Audio file validation
│   ├── output/
│   │   ├── formatter.go     # Output formatter interface
│   │   ├── srt.go           # SRT format
│   │   ├── text.go          # Text format
│   │   └── json.go          # JSON format
│   └── progress/
│       └── spinner.go       # Progress spinner
└── README.md
```
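The Whisper bridge (`internal/whisper/client.go`) shells out to Python rather than binding to it. The repository's actual code isn't reproduced here; as a minimal sketch, assuming the `whisper` CLI from openai-whisper is on `PATH` (its `--model`, `--output_format`, and `--output_dir` flags are real), a bridge might look like this, with the Go types and function names being illustrative only:

```go
package whisper

import (
	"encoding/json"
	"fmt"
	"os"
	"os/exec"
	"path/filepath"
	"strings"
)

// Segment mirrors the timestamped segments in whisper's JSON output.
type Segment struct {
	Start float64 `json:"start"`
	End   float64 `json:"end"`
	Text  string  `json:"text"`
}

type result struct {
	Segments []Segment `json:"segments"`
}

// Transcribe runs the whisper CLI as a subprocess and parses its JSON output.
func Transcribe(audioPath, model string) ([]Segment, error) {
	outDir, err := os.MkdirTemp("", "whisper")
	if err != nil {
		return nil, err
	}
	defer os.RemoveAll(outDir)

	cmd := exec.Command("whisper", audioPath,
		"--model", model,
		"--output_format", "json",
		"--output_dir", outDir)
	if out, err := cmd.CombinedOutput(); err != nil {
		return nil, fmt.Errorf("whisper failed: %v\n%s", err, out)
	}

	// whisper names the JSON file after the input audio file's basename.
	base := strings.TrimSuffix(filepath.Base(audioPath), filepath.Ext(audioPath))
	data, err := os.ReadFile(filepath.Join(outDir, base+".json"))
	if err != nil {
		return nil, err
	}
	var r result
	if err := json.Unmarshal(data, &r); err != nil {
		return nil, err
	}
	return r.Segments, nil
}
```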
## How It Works

1. **Transcription** - audio is processed by Whisper (via a Python subprocess) to generate timestamped text segments
2. **Diarization** (optional) - voice embeddings are extracted with resemblyzer and clustered to identify speakers
3. **Alignment** - speaker segments are mapped to transcription segments by timestamp overlap (see the sketch below)
4. **Formatting** - results are rendered in the selected output format (SRT by default)
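A minimal sketch of step 3, assuming (the README doesn't spell this out) that each transcription segment is assigned the speaker whose diarized turns overlap it for the longest total time; the names here are illustrative, not the actual `align.go` API:

```go
package diarization

import "math"

// SpeakerTurn is a diarized span attributed to one speaker.
type SpeakerTurn struct {
	Start, End float64
	Speaker    int
}

// overlap returns the length of the intersection of [aStart, aEnd] and [bStart, bEnd].
func overlap(aStart, aEnd, bStart, bEnd float64) float64 {
	return math.Max(0, math.Min(aEnd, bEnd)-math.Max(aStart, bStart))
}

// AssignSpeaker picks the speaker whose turns overlap the transcription
// segment [segStart, segEnd] for the longest total duration.
// With no overlapping turns it falls back to speaker 0.
func AssignSpeaker(segStart, segEnd float64, turns []SpeakerTurn) int {
	totals := map[int]float64{}
	for _, t := range turns {
		totals[t.Speaker] += overlap(segStart, segEnd, t.Start, t.End)
	}
	best, bestDur := 0, -1.0
	for spk, dur := range totals {
		if dur > bestDur {
			best, bestDur = spk, dur
		}
	}
	return best
}
```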
## License
MIT License - see LICENSE file for details.