# Transcribe - Audio Transcription Tool
A CLI tool for transcribing audio files using OpenAI's Whisper model with speaker diarization and multiple output formats.
## Features
- Multiple Whisper model sizes (tiny, base, small, medium, large, turbo)
- Speaker diarization using voice embeddings (resemblyzer + clustering)
- Multiple output formats: SRT subtitles, plain text, JSON
- Batch processing of multiple audio files
- Automatic language detection
- Progress indicators with spinners
## Installation
### Prerequisites
- Go 1.20+
- Python 3.8+
- FFmpeg
### Python Dependencies
```bash
# Required for transcription
pip install openai-whisper
# Required for speaker diarization
pip install resemblyzer scikit-learn librosa
```
Note: If `resemblyzer` fails to install due to `webrtcvad`, install Python development headers first:
```bash
# Fedora/RHEL
sudo dnf install python3-devel
# Ubuntu/Debian
sudo apt install python3-dev
```
### Build from Source
```bash
go build -o transcribe
```
## Usage
Output file (`-o`) is required unless `--no-write` is specified.
### Basic Transcription
```bash
./transcribe audio.mp3 -o output.srt
```
### Choose Whisper Model
```bash
./transcribe audio.mp3 --model small -o output.srt
```
Available models: `tiny` (default), `base`, `small`, `medium`, `large`, `turbo`
### Output Formats
**SRT subtitles (default):**
```bash
./transcribe audio.mp3 -o subtitles.srt
```
**Plain text with timestamps:**
```bash
./transcribe audio.mp3 --format text -o output.txt
```
**JSON:**
```bash
./transcribe audio.mp3 --format json -o output.json
```
### Speaker Diarization
Enable automatic speaker detection:
```bash
./transcribe audio.mp3 --diarize -o output.srt
```
Specify number of speakers for better accuracy:
```bash
./transcribe audio.mp3 --diarize --speakers 2 -o output.srt
```
### Print to stdout
```bash
./transcribe audio.mp3 --no-write
```
### Full Example
Transcribe with speaker diarization:
```bash
./transcribe interview.wav --model small --diarize -s 2 -o interview.srt
```
Output:
```
1
00:00:00,000 --> 00:00:05,200
[Speaker 1] Hello, how are you?

2
00:00:05,200 --> 00:00:12,300
[Speaker 2] I'm doing well, thanks!
```
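The `HH:MM:SS,mmm` timestamps above can be derived from Whisper's floating-point second offsets. A minimal sketch in Go (the function name `formatSRTTime` is illustrative, not necessarily the tool's actual API):

```go
package main

import "fmt"

// formatSRTTime converts a duration in seconds to the SRT
// timestamp form HH:MM:SS,mmm (note the comma before milliseconds).
func formatSRTTime(seconds float64) string {
	ms := int(seconds*1000 + 0.5) // round to nearest millisecond
	h := ms / 3600000
	m := ms % 3600000 / 60000
	s := ms % 60000 / 1000
	return fmt.Sprintf("%02d:%02d:%02d,%03d", h, m, s, ms%1000)
}

func main() {
	fmt.Println(formatSRTTime(5.2))  // 00:00:05,200
	fmt.Println(formatSRTTime(12.3)) // 00:00:12,300
}
```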
## CLI Reference
```
Usage:
  transcribe <audio files...> [flags]

Flags:
      --diarize          Enable speaker diarization
  -f, --format string    Output format: srt, text, json (default "srt")
  -h, --help             help for transcribe
  -m, --model string     Whisper model: tiny, base, small, medium, large, turbo (default "tiny")
      --no-write         Print output to stdout instead of file
  -o, --output string    Output file path (required)
  -s, --speakers int     Number of speakers (0 = auto-detect)
```
## Supported Audio Formats
MP3, WAV, FLAC, M4A, OGG, OPUS
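Input validation can be as simple as a case-insensitive extension check against this list. A sketch of such a check (the actual logic in `pkg/audio` may differ):

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// supportedExts mirrors the formats listed above.
var supportedExts = map[string]bool{
	".mp3": true, ".wav": true, ".flac": true,
	".m4a": true, ".ogg": true, ".opus": true,
}

// isSupportedAudio reports whether path has a supported
// audio extension, ignoring case.
func isSupportedAudio(path string) bool {
	return supportedExts[strings.ToLower(filepath.Ext(path))]
}

func main() {
	fmt.Println(isSupportedAudio("interview.WAV")) // true
	fmt.Println(isSupportedAudio("notes.txt"))     // false
}
```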
## Architecture
```
transcribe/
├── cmd/
│   └── root.go              # CLI commands and flags
├── internal/
│   ├── whisper/
│   │   └── client.go        # Whisper Python bridge
│   └── diarization/
│       ├── client.go        # Diarization Python bridge
│       └── align.go         # Speaker-segment alignment
├── pkg/
│   ├── audio/
│   │   └── audio.go         # Audio file validation
│   ├── output/
│   │   ├── formatter.go     # Output formatter interface
│   │   ├── srt.go           # SRT format
│   │   ├── text.go          # Text format
│   │   └── json.go          # JSON format
│   └── progress/
│       └── spinner.go       # Progress spinner
└── README.md
```
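The Python bridges above exchange data with the Go side via a subprocess. A simplified sketch of the transcription side (the helper script name `whisper_bridge.py` is hypothetical; the JSON field names match what openai-whisper reports per segment):

```go
package main

import (
	"encoding/json"
	"fmt"
	"os/exec"
)

// Segment mirrors the per-segment fields Whisper reports.
type Segment struct {
	Start float64 `json:"start"`
	End   float64 `json:"end"`
	Text  string  `json:"text"`
}

type whisperResult struct {
	Segments []Segment `json:"segments"`
}

// parseWhisperJSON decodes the JSON a helper script would print.
func parseWhisperJSON(data []byte) ([]Segment, error) {
	var r whisperResult
	if err := json.Unmarshal(data, &r); err != nil {
		return nil, err
	}
	return r.Segments, nil
}

// runWhisper shells out to a (hypothetical) Python helper
// and parses the JSON it writes to stdout.
func runWhisper(audioPath, model string) ([]Segment, error) {
	out, err := exec.Command("python3", "whisper_bridge.py", audioPath, model).Output()
	if err != nil {
		return nil, err
	}
	return parseWhisperJSON(out)
}

func main() {
	segs, _ := parseWhisperJSON([]byte(`{"segments":[{"start":0,"end":5.2,"text":"Hello"}]}`))
	fmt.Println(len(segs), segs[0].Text) // 1 Hello
}
```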
## How It Works
1. **Transcription**: Audio is processed by Whisper (via Python subprocess) to generate timestamped text segments
2. **Diarization** (optional): Voice embeddings are extracted using resemblyzer and clustered to identify speakers
3. **Alignment**: Speaker segments are mapped to transcription segments by timestamp overlap
4. **Formatting**: Results are formatted according to the selected output format (SRT by default)
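Step 3 can be sketched as picking, for each transcription segment, the speaker whose diarized segments overlap it for the longest total time. A simplified illustration (the actual strategy in `internal/diarization/align.go` may differ):

```go
package main

import (
	"fmt"
	"math"
)

type segment struct {
	start, end float64
	speaker    int // 0 = unassigned
}

// overlap returns the length of the intersection of the
// intervals [aStart, aEnd] and [bStart, bEnd], or 0 if disjoint.
func overlap(aStart, aEnd, bStart, bEnd float64) float64 {
	o := math.Min(aEnd, bEnd) - math.Max(aStart, bStart)
	if o > 0 {
		return o
	}
	return 0
}

// assignSpeakers labels each transcript segment with the speaker
// whose diarized segments overlap it for the longest total time.
func assignSpeakers(transcript, speakers []segment) {
	for i := range transcript {
		best, bestOverlap := 0, 0.0
		totals := map[int]float64{}
		for _, sp := range speakers {
			totals[sp.speaker] += overlap(transcript[i].start, transcript[i].end, sp.start, sp.end)
			if totals[sp.speaker] > bestOverlap {
				best, bestOverlap = sp.speaker, totals[sp.speaker]
			}
		}
		transcript[i].speaker = best
	}
}

func main() {
	transcript := []segment{{start: 0, end: 5.2}, {start: 5.2, end: 12.3}}
	speakers := []segment{{0, 5.0, 1}, {5.1, 12.0, 2}}
	assignSpeakers(transcript, speakers)
	fmt.Println(transcript[0].speaker, transcript[1].speaker) // 1 2
}
```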
## License
MIT License - see LICENSE file for details.