# Transcribe - Audio Transcription Tool

A CLI tool for transcribing audio files using OpenAI's Whisper model with speaker diarization and multiple output formats.
## Features

- Multiple Whisper model sizes (tiny, base, small, medium, large, turbo)
- Speaker diarization using voice embeddings (resemblyzer + clustering)
- Multiple output formats: SRT subtitles, plain text, JSON
- Batch processing of multiple audio files
- Automatic language detection
- Progress indicators with spinners
## Installation

### Prerequisites

- Go 1.20+
- Python 3.8+
- FFmpeg
### Python Dependencies

```bash
# Required for transcription
pip install openai-whisper

# Required for speaker diarization
pip install resemblyzer scikit-learn librosa
```
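Before running the tool, you can sanity-check that the Python side is importable. A stdlib-only sketch (this helper is an illustration, not part of the repo):

```python
import importlib.util

# Map importable module name -> pip package name for the dependencies above.
required = {
    "whisper": "openai-whisper",
    "resemblyzer": "resemblyzer",
    "sklearn": "scikit-learn",
    "librosa": "librosa",
}

# find_spec returns None when a top-level module is not installed.
missing = [pkg for mod, pkg in required.items()
           if importlib.util.find_spec(mod) is None]

if missing:
    print("missing:", ", ".join(missing))
```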
Note: If `resemblyzer` fails to install due to `webrtcvad`, install Python development headers first:

```bash
# Fedora/RHEL
sudo dnf install python3-devel

# Ubuntu/Debian
sudo apt install python3-dev
```
### Build from Source

```bash
go build -o transcribe
```
## Usage

An output file (`-o`) is required unless `--no-write` is specified.
### Basic Transcription

```bash
./transcribe audio.mp3 -o output.srt
```
### Choose Whisper Model

```bash
./transcribe audio.mp3 --model small -o output.srt
```

Available models: `tiny` (default), `base`, `small`, `medium`, `large`, `turbo`
### Output Formats

**SRT subtitles (default):**

```bash
./transcribe audio.mp3 -o subtitles.srt
```

**Plain text with timestamps:**

```bash
./transcribe audio.mp3 --format text -o output.txt
```

**JSON:**

```bash
./transcribe audio.mp3 --format json -o output.json
```
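The exact JSON schema isn't documented here, so the shape below is a guess: a plausible per-segment structure carrying start/end times, speaker label, and text (field names are assumptions, not from the repo):

```python
import json

# Hypothetical segment structure -- the actual schema produced by
# `--format json` may differ.
segments = [
    {"start": 0.0, "end": 5.2, "speaker": "Speaker 1", "text": "Hello, how are you?"},
    {"start": 5.2, "end": 12.3, "speaker": "Speaker 2", "text": "I'm doing well, thanks!"},
]

print(json.dumps({"segments": segments}, indent=2))
```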
### Speaker Diarization

Enable automatic speaker detection:

```bash
./transcribe audio.mp3 --diarize -o output.srt
```

Specify the number of speakers for better accuracy:

```bash
./transcribe audio.mp3 --diarize --speakers 2 -o output.srt
```
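Under the hood, diarization groups per-window voice embeddings into one cluster per speaker (the tool uses resemblyzer + scikit-learn for this). A toy stdlib-only sketch of the clustering idea, with made-up 2-D "embeddings" standing in for real voice vectors:

```python
import math

def kmeans(points, k, iters=20):
    """Tiny k-means for illustration only; the real pipeline uses scikit-learn."""
    centroids = points[:k]  # naive init: first k points
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # Assign each embedding to its nearest centroid.
            i = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[i].append(p)
        new_centroids = []
        for i, c in enumerate(clusters):
            if c:
                new_centroids.append(tuple(sum(v) / len(c) for v in zip(*c)))
            else:
                new_centroids.append(centroids[i])
        centroids = new_centroids
    return clusters

# Toy "voice embeddings": two well-separated groups ~ two speakers.
emb = [(0.1, 0.0), (0.2, 0.1), (0.0, 0.2), (5.0, 5.1), (5.2, 4.9), (4.8, 5.0)]
speakers = kmeans(emb, k=2)  # k comes from --speakers when specified
```

With `--speakers 0` (the default), the real implementation has to estimate the cluster count itself, which is why providing the true number improves accuracy.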
### Print to stdout

```bash
./transcribe audio.mp3 --no-write
```
### Full Example

Transcribe with speaker diarization:

```bash
./transcribe interview.wav --model small --diarize -s 2 -o interview.srt
```

Output:

```
1
00:00:00,000 --> 00:00:05,200
[Speaker 1] Hello, how are you?

2
00:00:05,200 --> 00:00:12,300
[Speaker 2] I'm doing well, thanks!
```
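The timestamps in the sample above follow SRT's `HH:MM:SS,mmm` layout. A minimal sketch of the seconds-to-timestamp conversion (a hypothetical helper shown in Python; the repo's `srt.go` does the equivalent in Go):

```python
def srt_timestamp(seconds: float) -> str:
    """Format a duration in seconds as an SRT 'HH:MM:SS,mmm' timestamp."""
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

print(srt_timestamp(5.2))  # 00:00:05,200
```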
## CLI Reference

```
Usage:
  transcribe <audio files...> [flags]

Flags:
      --diarize          Enable speaker diarization
  -f, --format string    Output format: srt, text, json (default "srt")
  -h, --help             help for transcribe
  -m, --model string     Whisper model: tiny, base, small, medium, large, turbo (default "tiny")
      --no-write         Print output to stdout instead of file
  -o, --output string    Output file path (required)
  -s, --speakers int     Number of speakers (0 = auto-detect)
```
## Supported Audio Formats

MP3, WAV, FLAC, M4A, OGG, OPUS
## Architecture

```
transcribe/
├── cmd/
│   └── root.go              # CLI commands and flags
├── internal/
│   ├── whisper/
│   │   └── client.go        # Whisper Python bridge
│   └── diarization/
│       ├── client.go        # Diarization Python bridge
│       └── align.go         # Speaker-segment alignment
├── pkg/
│   ├── audio/
│   │   └── audio.go         # Audio file validation
│   ├── output/
│   │   ├── formatter.go     # Output formatter interface
│   │   ├── srt.go           # SRT format
│   │   ├── text.go          # Text format
│   │   └── json.go          # JSON format
│   └── progress/
│       └── spinner.go       # Progress spinner
└── README.md
```
## How It Works

1. **Transcription**: Audio is processed by Whisper (via Python subprocess) to generate timestamped text segments
2. **Diarization** (optional): Voice embeddings are extracted using resemblyzer and clustered to identify speakers
3. **Alignment**: Speaker segments are mapped to transcription segments by timestamp overlap
4. **Formatting**: Results are formatted according to the selected output format (SRT by default)
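The alignment step (3) can be sketched as "label each transcription segment with the speaker turn it overlaps most". A minimal Python sketch with hypothetical field names (the actual logic lives in `internal/diarization/align.go`, in Go):

```python
def assign_speakers(transcript, turns):
    """Label each transcription segment with the most-overlapping speaker turn."""
    labeled = []
    for seg in transcript:
        best, best_overlap = None, 0.0
        for turn in turns:
            # Overlap of two intervals; negative means no overlap.
            overlap = min(seg["end"], turn["end"]) - max(seg["start"], turn["start"])
            if overlap > best_overlap:
                best, best_overlap = turn["speaker"], overlap
        labeled.append({**seg, "speaker": best})
    return labeled

transcript = [
    {"start": 0.0, "end": 5.2, "text": "Hello, how are you?"},
    {"start": 5.2, "end": 12.3, "text": "I'm doing well, thanks!"},
]
turns = [
    {"start": 0.0, "end": 5.0, "speaker": "Speaker 1"},
    {"start": 5.0, "end": 12.5, "speaker": "Speaker 2"},
]
result = assign_speakers(transcript, turns)
```

Picking the maximum overlap (rather than, say, the turn containing the segment midpoint) keeps labels stable when Whisper's segment boundaries and the diarizer's turn boundaries disagree slightly.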
## License

MIT License - see LICENSE file for details.