# Transcribe - Audio Transcription Tool

A CLI tool for transcribing audio files using OpenAI's Whisper model with speaker diarization and multiple output formats.
## Features

- Multiple Whisper model sizes (tiny, base, small, medium, large, turbo)
- Speaker diarization using voice embeddings (resemblyzer + clustering)
- Multiple output formats: SRT subtitles, plain text, JSON
- Batch processing of multiple audio files
- Automatic language detection
- Progress indicators with spinners
## Installation

### Prerequisites

- Go 1.20+
- Python 3.8+
- FFmpeg
### Python Dependencies

```bash
# Required for transcription
pip install openai-whisper

# Required for speaker diarization
pip install resemblyzer scikit-learn librosa
```
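Before running the tool, you can sanity-check that the Python side is importable. A stdlib-only sketch (this helper is an illustration, not part of the repo):

```python
import importlib.util

# Map importable module name -> pip package name for the dependencies above.
required = {
    "whisper": "openai-whisper",
    "resemblyzer": "resemblyzer",
    "sklearn": "scikit-learn",
    "librosa": "librosa",
}

# find_spec returns None when a top-level module is not installed.
missing = [pkg for mod, pkg in required.items()
           if importlib.util.find_spec(mod) is None]

if missing:
    print("missing:", ", ".join(missing))
```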
Note: If `resemblyzer` fails to install due to `webrtcvad`, install Python development headers first:

```bash
# Fedora/RHEL
sudo dnf install python3-devel

# Ubuntu/Debian
sudo apt install python3-dev
```
### Build from Source

```bash
go build -o transcribe
```
## Usage

An output file (`-o`) is required unless `--no-write` is specified.
### Basic Transcription

```bash
./transcribe audio.mp3 -o output.srt
```
### Choose Whisper Model

```bash
./transcribe audio.mp3 --model small -o output.srt
```

Available models: `tiny` (default), `base`, `small`, `medium`, `large`, `turbo`
### Output Formats

**SRT subtitles (default):**

```bash
./transcribe audio.mp3 -o subtitles.srt
```

**Plain text with timestamps:**

```bash
./transcribe audio.mp3 --format text -o output.txt
```

**JSON:**

```bash
./transcribe audio.mp3 --format json -o output.json
```
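The exact JSON schema isn't documented here, so the shape below is a guess: a plausible per-segment structure carrying start/end times, speaker label, and text (field names are assumptions, not from the repo):

```python
import json

# Hypothetical segment structure -- the actual schema produced by
# `--format json` may differ.
segments = [
    {"start": 0.0, "end": 5.2, "speaker": "Speaker 1", "text": "Hello, how are you?"},
    {"start": 5.2, "end": 12.3, "speaker": "Speaker 2", "text": "I'm doing well, thanks!"},
]

print(json.dumps({"segments": segments}, indent=2))
```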
### Speaker Diarization

Enable automatic speaker detection:

```bash
./transcribe audio.mp3 --diarize -o output.srt
```

Specify the number of speakers for better accuracy:

```bash
./transcribe audio.mp3 --diarize --speakers 2 -o output.srt
```
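Under the hood, diarization groups per-window voice embeddings into one cluster per speaker (the tool uses resemblyzer + scikit-learn for this). A toy stdlib-only sketch of the clustering idea, with made-up 2-D "embeddings" standing in for real voice vectors:

```python
import math

def kmeans(points, k, iters=20):
    """Tiny k-means for illustration only; the real pipeline uses scikit-learn."""
    centroids = points[:k]  # naive init: first k points
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # Assign each embedding to its nearest centroid.
            i = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[i].append(p)
        new_centroids = []
        for i, c in enumerate(clusters):
            if c:
                new_centroids.append(tuple(sum(v) / len(c) for v in zip(*c)))
            else:
                new_centroids.append(centroids[i])
        centroids = new_centroids
    return clusters

# Toy "voice embeddings": two well-separated groups ~ two speakers.
emb = [(0.1, 0.0), (0.2, 0.1), (0.0, 0.2), (5.0, 5.1), (5.2, 4.9), (4.8, 5.0)]
speakers = kmeans(emb, k=2)  # k comes from --speakers when specified
```

With `--speakers 0` (the default), the real implementation has to estimate the cluster count itself, which is why providing the true number improves accuracy.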
### Print to stdout

```bash
./transcribe audio.mp3 --no-write
```
### Full Example

Transcribe with speaker diarization:

```bash
./transcribe interview.wav --model small --diarize -s 2 -o interview.srt
```

Output:

```
1
00:00:00,000 --> 00:00:05,200
[Speaker 1] Hello, how are you?

2
00:00:05,200 --> 00:00:12,300
[Speaker 2] I'm doing well, thanks!
```
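The timestamps in the sample above follow SRT's `HH:MM:SS,mmm` layout. A minimal sketch of the seconds-to-timestamp conversion (a hypothetical helper shown in Python; the repo's `srt.go` does the equivalent in Go):

```python
def srt_timestamp(seconds: float) -> str:
    """Format a duration in seconds as an SRT 'HH:MM:SS,mmm' timestamp."""
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

print(srt_timestamp(5.2))  # 00:00:05,200
```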
## CLI Reference

```
Usage:
  transcribe <audio files...> [flags]

Flags:
      --diarize          Enable speaker diarization
  -f, --format string    Output format: srt, text, json (default "srt")
  -h, --help             help for transcribe
  -m, --model string     Whisper model: tiny, base, small, medium, large, turbo (default "tiny")
      --no-write         Print output to stdout instead of file
  -o, --output string    Output file path (required)
  -s, --speakers int     Number of speakers (0 = auto-detect)
```
## Supported Audio Formats

MP3, WAV, FLAC, M4A, OGG, OPUS
## Architecture

```
transcribe/
├── cmd/
│   └── root.go              # CLI commands and flags
├── internal/
│   ├── whisper/
│   │   └── client.go        # Whisper Python bridge
│   └── diarization/
│       ├── client.go        # Diarization Python bridge
│       └── align.go         # Speaker-segment alignment
├── pkg/
│   ├── audio/
│   │   └── audio.go         # Audio file validation
│   ├── output/
│   │   ├── formatter.go     # Output formatter interface
│   │   ├── srt.go           # SRT format
│   │   ├── text.go          # Text format
│   │   └── json.go          # JSON format
│   └── progress/
│       └── spinner.go       # Progress spinner
└── README.md
```
## How It Works

1. **Transcription**: Audio is processed by Whisper (via Python subprocess) to generate timestamped text segments
2. **Diarization** (optional): Voice embeddings are extracted using resemblyzer and clustered to identify speakers
3. **Alignment**: Speaker segments are mapped to transcription segments by timestamp overlap
4. **Formatting**: Results are formatted according to the selected output format (SRT by default)
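The alignment step (3) can be sketched as "label each transcription segment with the speaker turn it overlaps most". A minimal Python sketch with hypothetical field names (the actual logic lives in `internal/diarization/align.go`, in Go):

```python
def assign_speakers(transcript, turns):
    """Label each transcription segment with the most-overlapping speaker turn."""
    labeled = []
    for seg in transcript:
        best, best_overlap = None, 0.0
        for turn in turns:
            # Overlap of two intervals; negative means no overlap.
            overlap = min(seg["end"], turn["end"]) - max(seg["start"], turn["start"])
            if overlap > best_overlap:
                best, best_overlap = turn["speaker"], overlap
        labeled.append({**seg, "speaker": best})
    return labeled

transcript = [
    {"start": 0.0, "end": 5.2, "text": "Hello, how are you?"},
    {"start": 5.2, "end": 12.3, "text": "I'm doing well, thanks!"},
]
turns = [
    {"start": 0.0, "end": 5.0, "speaker": "Speaker 1"},
    {"start": 5.0, "end": 12.5, "speaker": "Speaker 2"},
]
result = assign_speakers(transcript, turns)
```

Picking the maximum overlap (rather than, say, the turn containing the segment midpoint) keeps labels stable when Whisper's segment boundaries and the diarizer's turn boundaries disagree slightly.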
## License

MIT License - see LICENSE file for details.