2026-01-17 19:18:58 -06:00

Transcribe - Audio Transcription Tool

A CLI tool for transcribing audio files using OpenAI's Whisper model with speaker diarization and multiple output formats.

Features

  • Multiple Whisper model sizes (tiny, base, small, medium, large, turbo)
  • Speaker diarization using voice embeddings (resemblyzer + clustering)
  • Multiple output formats: SRT subtitles, plain text, JSON
  • Batch processing of multiple audio files
  • Automatic language detection
  • Progress indicators with spinners

Installation

Prerequisites

  • Go 1.20+
  • Python 3.8+
  • FFmpeg

Python Dependencies

# Required for transcription
pip install openai-whisper

# Required for speaker diarization
pip install resemblyzer scikit-learn librosa

Note: If resemblyzer fails to install because its webrtcvad dependency cannot be built, install the Python development headers first:

# Fedora/RHEL
sudo dnf install python3-devel

# Ubuntu/Debian
sudo apt install python3-dev

Build from Source

go build -o transcribe

Usage

An output file (-o) is required unless --no-write is specified, in which case the result is printed to stdout.

Basic Transcription

./transcribe audio.mp3 -o output.srt

Choose Whisper Model

./transcribe audio.mp3 --model small -o output.srt

Available models: tiny (default), base, small, medium, large, turbo

Output Formats

SRT subtitles (default):

./transcribe audio.mp3 -o subtitles.srt

Plain text with timestamps:

./transcribe audio.mp3 --format text -o output.txt

JSON:

./transcribe audio.mp3 --format json -o output.json

Speaker Diarization

Enable automatic speaker detection:

./transcribe audio.mp3 --diarize -o output.srt

Specify the number of speakers for better accuracy:

./transcribe audio.mp3 --diarize --speakers 2 -o output.srt

Print to stdout

./transcribe audio.mp3 --no-write

Full Example

Transcribe a two-speaker interview with the small model and speaker diarization:

./transcribe interview.wav --model small --diarize -s 2 -o interview.srt

Output:

1
00:00:00,000 --> 00:00:05,200
[Speaker 1] Hello, how are you?

2
00:00:05,200 --> 00:00:12,300
[Speaker 2] I'm doing well, thanks!
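The cue layout above can be reproduced with a few lines of Go. This is a minimal sketch, not the code in pkg/output/srt.go: the function names formatSRTTimestamp and formatSRTCue are hypothetical, and it assumes segment times arrive as floating-point seconds, which is how Whisper reports them.

```go
package main

import (
	"fmt"
	"math"
)

// formatSRTTimestamp converts seconds to the SRT "HH:MM:SS,mmm" form.
// Negative inputs are clamped to zero.
func formatSRTTimestamp(seconds float64) string {
	if seconds < 0 {
		seconds = 0
	}
	// Round to whole milliseconds first so 5.2 becomes exactly 5200 ms.
	ms := int64(math.Round(seconds * 1000))
	h := ms / 3600000
	ms %= 3600000
	m := ms / 60000
	ms %= 60000
	s := ms / 1000
	ms %= 1000
	return fmt.Sprintf("%02d:%02d:%02d,%03d", h, m, s, ms)
}

// formatSRTCue renders one numbered cue. An empty speaker omits the
// "[Speaker N]" prefix (assumed behavior when diarization is off).
func formatSRTCue(index int, start, end float64, speaker, text string) string {
	if speaker != "" {
		text = "[" + speaker + "] " + text
	}
	return fmt.Sprintf("%d\n%s --> %s\n%s\n",
		index, formatSRTTimestamp(start), formatSRTTimestamp(end), text)
}

func main() {
	fmt.Print(formatSRTCue(1, 0, 5.2, "Speaker 1", "Hello, how are you?"))
	fmt.Println()
	fmt.Print(formatSRTCue(2, 5.2, 12.3, "Speaker 2", "I'm doing well, thanks!"))
}
```

Rounding to milliseconds before splitting into hours, minutes, and seconds avoids off-by-one artifacts from floating-point multiplication.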

CLI Reference

Usage:
  transcribe <audio files...> [flags]

Flags:
      --diarize           Enable speaker diarization
  -f, --format string     Output format: srt, text, json (default "srt")
  -h, --help              help for transcribe
  -m, --model string      Whisper model: tiny, base, small, medium, large, turbo (default "tiny")
      --no-write          Print output to stdout instead of file
  -o, --output string     Output file path (required)
  -s, --speakers int      Number of speakers (0 = auto-detect)
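The flag rules in this reference could be checked before any audio is processed. The sketch below is an assumption about how such validation might look, not the logic in cmd/root.go: the options struct, its field names, the validate function, and the error messages are all hypothetical.

```go
package main

import (
	"errors"
	"fmt"
)

// options mirrors the flags listed in the CLI reference. The struct and
// its field names are hypothetical, not taken from cmd/root.go.
type options struct {
	Format   string
	Model    string
	Output   string
	NoWrite  bool
	Diarize  bool
	Speakers int
}

var (
	validFormats = map[string]bool{"srt": true, "text": true, "json": true}
	validModels  = map[string]bool{
		"tiny": true, "base": true, "small": true,
		"medium": true, "large": true, "turbo": true,
	}
)

// validate checks the flag combinations described in the reference and
// returns the first problem it finds.
func validate(o options, files []string) error {
	if len(files) == 0 {
		return errors.New("at least one audio file is required")
	}
	if !validFormats[o.Format] {
		return fmt.Errorf("unknown format %q (want srt, text, or json)", o.Format)
	}
	if !validModels[o.Model] {
		return fmt.Errorf("unknown model %q", o.Model)
	}
	if o.Output == "" && !o.NoWrite {
		return errors.New("--output is required unless --no-write is set")
	}
	if o.Speakers < 0 {
		return errors.New("--speakers must be 0 (auto-detect) or a positive number")
	}
	return nil
}

func main() {
	o := options{Format: "srt", Model: "tiny"}
	fmt.Println(validate(o, []string{"audio.mp3"})) // missing --output
	o.NoWrite = true
	fmt.Println(validate(o, []string{"audio.mp3"})) // passes validation
}
```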

Supported Audio Formats

MP3, WAV, FLAC, M4A, OGG, OPUS
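A simple way to enforce this list is a case-insensitive extension check. The sketch below assumes validation by file extension only; the actual pkg/audio/audio.go may also inspect the file itself, and the names supportedExts and isSupportedAudio are hypothetical.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// supportedExts lists the extensions for the formats named above.
var supportedExts = map[string]bool{
	".mp3": true, ".wav": true, ".flac": true,
	".m4a": true, ".ogg": true, ".opus": true,
}

// isSupportedAudio reports whether the path has a supported extension,
// ignoring case. It does not open or probe the file.
func isSupportedAudio(path string) bool {
	return supportedExts[strings.ToLower(filepath.Ext(path))]
}

func main() {
	for _, p := range []string{"interview.wav", "talk.MP3", "notes.txt"} {
		fmt.Printf("%s supported: %v\n", p, isSupportedAudio(p))
	}
}
```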

Architecture

transcribe/
├── cmd/
│   └── root.go              # CLI commands and flags
├── internal/
│   ├── whisper/
│   │   └── client.go        # Whisper Python bridge
│   └── diarization/
│       ├── client.go        # Diarization Python bridge
│       └── align.go         # Speaker-segment alignment
├── pkg/
│   ├── audio/
│   │   └── audio.go         # Audio file validation
│   ├── output/
│   │   ├── formatter.go     # Output formatter interface
│   │   ├── srt.go           # SRT format
│   │   ├── text.go          # Text format
│   │   └── json.go          # JSON format
│   └── progress/
│       └── spinner.go       # Progress spinner
└── README.md
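The layout suggests that each output format implements a shared interface from pkg/output/formatter.go. One plausible shape is sketched below; the Segment type, the Formatter method signature, the text line layout, and the newFormatter helper are assumptions for illustration, and the SRT implementation is left out to keep the sketch short.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// Segment is a hypothetical transcription segment; times are in seconds.
type Segment struct {
	Start   float64 `json:"start"`
	End     float64 `json:"end"`
	Speaker string  `json:"speaker,omitempty"`
	Text    string  `json:"text"`
}

// Formatter turns a list of segments into one output document.
type Formatter interface {
	Format(segments []Segment) (string, error)
}

type textFormatter struct{}

// Format writes one timestamped line per segment (assumed layout).
func (textFormatter) Format(segments []Segment) (string, error) {
	var b strings.Builder
	for _, s := range segments {
		fmt.Fprintf(&b, "[%.1fs - %.1fs] %s\n", s.Start, s.End, s.Text)
	}
	return b.String(), nil
}

type jsonFormatter struct{}

// Format marshals the segments as a JSON array.
func (jsonFormatter) Format(segments []Segment) (string, error) {
	data, err := json.Marshal(segments)
	if err != nil {
		return "", err
	}
	return string(data), nil
}

// newFormatter maps a --format value to an implementation.
func newFormatter(name string) (Formatter, error) {
	switch name {
	case "text":
		return textFormatter{}, nil
	case "json":
		return jsonFormatter{}, nil
	default:
		return nil, fmt.Errorf("unsupported format %q in this sketch", name)
	}
}

func main() {
	segs := []Segment{{Start: 0, End: 5.2, Text: "Hello, how are you?"}}
	for _, name := range []string{"text", "json"} {
		f, err := newFormatter(name)
		if err != nil {
			fmt.Println(err)
			continue
		}
		out, _ := f.Format(segs)
		fmt.Println(out)
	}
}
```

Keeping the formats behind one interface means a new format only needs a new type and one extra case in the lookup.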

How It Works

  1. Transcription: Audio is processed by Whisper (via Python subprocess) to generate timestamped text segments
  2. Diarization (optional): Voice embeddings are extracted using resemblyzer and clustered to identify speakers
  3. Alignment: Speaker segments are mapped to transcription segments by timestamp overlap
  4. Formatting: Results are formatted according to the selected output format (SRT by default)
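Step 3 can be sketched as picking, for each transcription segment, the speaker turn that overlaps it the longest. This is an illustration of overlap-based alignment under assumed types, not the code in internal/diarization/align.go: Segment, SpeakerTurn, overlap, and assignSpeakers are hypothetical names, and the tie-breaking rule (first turn wins) is an assumption.

```go
package main

import (
	"fmt"
	"math"
)

// Segment is a transcription segment; times are in seconds.
type Segment struct {
	Start, End float64
	Text       string
	Speaker    string
}

// SpeakerTurn is a time span attributed to one speaker by diarization.
type SpeakerTurn struct {
	Start, End float64
	Speaker    string
}

// overlap returns the length in seconds shared by two intervals,
// or 0 when they do not intersect.
func overlap(aStart, aEnd, bStart, bEnd float64) float64 {
	return math.Max(0, math.Min(aEnd, bEnd)-math.Max(aStart, bStart))
}

// assignSpeakers labels each segment with the speaker whose turn overlaps
// it the most. Segments with no overlapping turn keep an empty Speaker.
// The input slice is not modified.
func assignSpeakers(segs []Segment, turns []SpeakerTurn) []Segment {
	out := make([]Segment, len(segs))
	copy(out, segs)
	for i := range out {
		best := 0.0
		for _, t := range turns {
			if ov := overlap(out[i].Start, out[i].End, t.Start, t.End); ov > best {
				best = ov
				out[i].Speaker = t.Speaker
			}
		}
	}
	return out
}

func main() {
	segs := []Segment{
		{Start: 0, End: 5.2, Text: "Hello, how are you?"},
		{Start: 5.2, End: 12.3, Text: "I'm doing well, thanks!"},
	}
	turns := []SpeakerTurn{
		{Start: 0, End: 5.0, Speaker: "Speaker 1"},
		{Start: 5.0, End: 12.3, Speaker: "Speaker 2"},
	}
	for _, s := range assignSpeakers(segs, turns) {
		fmt.Printf("[%s] %s\n", s.Speaker, s.Text)
	}
}
```

Choosing the largest overlap, rather than the turn active at the segment's start, keeps a label stable when diarization boundaries fall slightly before or after Whisper's segment boundaries.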

License

MIT License - see LICENSE file for details.
