ysandler/transcribe

ysandler b73d5b8078 feat: git init

2026-01-17 19:18:58 -06:00

Transcribe - Audio Transcription Tool

A CLI tool that transcribes audio files with OpenAI's Whisper model, with optional speaker diarization and a choice of output formats.

Features

Multiple Whisper model sizes (tiny, base, small, medium, large, turbo)
Speaker diarization using voice embeddings (resemblyzer + clustering)
Multiple output formats: SRT subtitles, plain text, JSON
Batch processing of multiple audio files
Automatic language detection
Progress indicators with spinners

Installation

Prerequisites

Go 1.20+
Python 3.8+
FFmpeg

Python Dependencies

# Required for transcription
pip install openai-whisper

# Required for speaker diarization
pip install resemblyzer scikit-learn librosa

Note: If resemblyzer fails to install because its webrtcvad dependency cannot be built, install the Python development headers first:

# Fedora/RHEL
sudo dnf install python3-devel

# Ubuntu/Debian
sudo apt install python3-dev

Build from Source

go build -o transcribe

Usage

Output file (-o) is required unless --no-write is specified.
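
The rule above can be sketched as a small validation step that runs before any transcription work starts. This is an illustration only: the `options` struct, its field names, and `validateOutput` are hypothetical and are not taken from the project's source.

```go
package main

import (
	"errors"
	"fmt"
)

// options mirrors the two flags involved in the rule.
// The struct and its field names are hypothetical.
type options struct {
	Output  string // value of -o / --output
	NoWrite bool   // value of --no-write
}

// validateOutput returns an error when neither an output path
// nor --no-write was supplied.
func validateOutput(o options) error {
	if o.NoWrite {
		return nil // printing to stdout, no file needed
	}
	if o.Output == "" {
		return errors.New("output file (-o) is required unless --no-write is specified")
	}
	return nil
}

func main() {
	cases := []options{
		{Output: "output.srt"},
		{NoWrite: true},
		{}, // neither flag set: rejected
	}
	for _, c := range cases {
		fmt.Printf("%+v -> %v\n", c, validateOutput(c))
	}
}
```

Checking this up front means a missing flag is reported immediately rather than after a long Whisper run.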

Basic Transcription

./transcribe audio.mp3 -o output.srt

Choose Whisper Model

./transcribe audio.mp3 --model small -o output.srt

Available models: tiny (default), base, small, medium, large, turbo

Output Formats

SRT subtitles (default):

./transcribe audio.mp3 -o subtitles.srt

Plain text with timestamps:

./transcribe audio.mp3 --format text -o output.txt

JSON:

./transcribe audio.mp3 --format json -o output.json

Speaker Diarization

Enable automatic speaker detection:

./transcribe audio.mp3 --diarize -o output.srt

Specify number of speakers for better accuracy:

./transcribe audio.mp3 --diarize --speakers 2 -o output.srt

Print to Stdout

./transcribe audio.mp3 --no-write

Full Example

Transcribe with speaker diarization:

./transcribe interview.wav --model small --diarize -s 2 -o interview.srt

Output:

1
00:00:00,000 --> 00:00:05,200
[Speaker 1] Hello, how are you?

2
00:00:05,200 --> 00:00:12,300
[Speaker 2] I'm doing well, thanks!
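
For readers curious how output of this shape can be produced, here is a minimal Go sketch that renders segments as SRT cues with a `[Speaker N]` prefix. The `Segment` type, `srtTimestamp`, and `formatSRT` are assumptions made for illustration; the project's own `srt.go` may be structured differently.

```go
package main

import (
	"fmt"
	"math"
	"strings"
)

// Segment is a hypothetical shape for one transcribed span.
// Start and End are in seconds.
type Segment struct {
	Start, End float64
	Speaker    string // empty when diarization is off
	Text       string
}

// srtTimestamp renders seconds as HH:MM:SS,mmm.
func srtTimestamp(sec float64) string {
	ms := int64(math.Round(sec * 1000))
	h := ms / 3600000
	m := ms % 3600000 / 60000
	s := ms % 60000 / 1000
	return fmt.Sprintf("%02d:%02d:%02d,%03d", h, m, s, ms%1000)
}

// formatSRT numbers cues from 1 and ends each cue with a blank line.
func formatSRT(segs []Segment) string {
	var b strings.Builder
	for i, s := range segs {
		text := s.Text
		if s.Speaker != "" {
			text = "[" + s.Speaker + "] " + text
		}
		fmt.Fprintf(&b, "%d\n%s --> %s\n%s\n\n",
			i+1, srtTimestamp(s.Start), srtTimestamp(s.End), text)
	}
	return b.String()
}

func main() {
	fmt.Print(formatSRT([]Segment{
		{0, 5.2, "Speaker 1", "Hello, how are you?"},
		{5.2, 12.3, "Speaker 2", "I'm doing well, thanks!"},
	}))
}
```

Converting to whole milliseconds before splitting into fields avoids floating-point drift in the `,mmm` part.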

CLI Reference

Usage:
  transcribe <audio files...> [flags]

Flags:
      --diarize           Enable speaker diarization
  -f, --format string     Output format: srt, text, json (default "srt")
  -h, --help              help for transcribe
  -m, --model string      Whisper model: tiny, base, small, medium, large, turbo (default "tiny")
      --no-write          Print output to stdout instead of file
  -o, --output string     Output file path (required)
  -s, --speakers int      Number of speakers (0 = auto-detect)

Supported Audio Formats

MP3, WAV, FLAC, M4A, OGG, OPUS
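
One simple way to enforce this list is a case-insensitive check on the file extension. The sketch below assumes that approach; `supportedExts` and `isSupported` are hypothetical names, and the actual validation in `pkg/audio` may do more (for example, inspecting the file contents).

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// supportedExts lists the extensions for the formats named above.
var supportedExts = map[string]bool{
	".mp3": true, ".wav": true, ".flac": true,
	".m4a": true, ".ogg": true, ".opus": true,
}

// isSupported reports whether path has a supported audio extension,
// ignoring case.
func isSupported(path string) bool {
	return supportedExts[strings.ToLower(filepath.Ext(path))]
}

func main() {
	for _, p := range []string{"interview.wav", "talk.MP3", "notes.txt"} {
		fmt.Printf("%s supported: %v\n", p, isSupported(p))
	}
}
```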

Architecture

transcribe/
├── cmd/
│   └── root.go              # CLI commands and flags
├── internal/
│   ├── whisper/
│   │   └── client.go        # Whisper Python bridge
│   └── diarization/
│       ├── client.go        # Diarization Python bridge
│       └── align.go         # Speaker-segment alignment
├── pkg/
│   ├── audio/
│   │   └── audio.go         # Audio file validation
│   ├── output/
│   │   ├── formatter.go     # Output formatter interface
│   │   ├── srt.go           # SRT format
│   │   ├── text.go          # Text format
│   │   └── json.go          # JSON format
│   └── progress/
│       └── spinner.go       # Progress spinner
└── README.md
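
The layout of `pkg/output` suggests one interface with an implementation per format. The sketch below shows what such a design could look like. Every name in it (`Segment`, `Formatter`, `newFormatter`, the JSON field names) and the text line layout are assumptions, not the project's actual API, and the SRT implementation is left out for brevity.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// Segment is a hypothetical transcription segment; times are in seconds.
type Segment struct {
	Start   float64 `json:"start"`
	End     float64 `json:"end"`
	Speaker string  `json:"speaker,omitempty"`
	Text    string  `json:"text"`
}

// Formatter turns segments into one output document.
type Formatter interface {
	Format(segs []Segment) (string, error)
}

type jsonFormatter struct{}

func (jsonFormatter) Format(segs []Segment) (string, error) {
	b, err := json.MarshalIndent(segs, "", "  ")
	return string(b), err
}

type textFormatter struct{}

// Format writes one line per segment; this line layout is an assumption.
func (textFormatter) Format(segs []Segment) (string, error) {
	var b strings.Builder
	for _, s := range segs {
		fmt.Fprintf(&b, "[%.2f - %.2f] %s\n", s.Start, s.End, s.Text)
	}
	return b.String(), nil
}

// newFormatter maps a --format value to an implementation.
func newFormatter(name string) (Formatter, error) {
	switch name {
	case "json":
		return jsonFormatter{}, nil
	case "text":
		return textFormatter{}, nil
	default:
		return nil, fmt.Errorf("unsupported format %q", name)
	}
}

func main() {
	segs := []Segment{{Start: 0, End: 5.2, Speaker: "Speaker 1", Text: "Hello, how are you?"}}
	for _, name := range []string{"text", "json"} {
		f, err := newFormatter(name)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		out, err := f.Format(segs)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Println(out)
	}
}
```

With this shape, adding a format means adding one type and one `case`, while the CLI code only depends on the interface.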

How It Works

Transcription: Audio is processed by Whisper (via Python subprocess) to generate timestamped text segments
Diarization (optional): Voice embeddings are extracted using resemblyzer and clustered to identify speakers
Alignment: Speaker segments are mapped to transcription segments by timestamp overlap
Formatting: Results are formatted according to the selected output format (SRT by default)
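
The alignment step can be illustrated with a short Go sketch that gives each transcription segment the speaker whose turn overlaps it the longest. Picking the maximum overlap is an assumption about how "timestamp overlap" is applied, and the `Span`, `SpeakerTurn`, and `assignSpeaker` names are hypothetical; `align.go` may use a different rule.

```go
package main

import (
	"fmt"
	"math"
)

// Span is a time interval in seconds.
type Span struct{ Start, End float64 }

// SpeakerTurn is a span attributed to one speaker by diarization.
type SpeakerTurn struct {
	Span
	Speaker string
}

// overlap returns the number of seconds shared by two spans (0 if disjoint).
func overlap(a, b Span) float64 {
	start := math.Max(a.Start, b.Start)
	end := math.Min(a.End, b.End)
	if end > start {
		return end - start
	}
	return 0
}

// assignSpeaker returns the speaker whose turn overlaps seg the most,
// or "" when no turn overlaps it at all.
func assignSpeaker(seg Span, turns []SpeakerTurn) string {
	best, bestOverlap := "", 0.0
	for _, t := range turns {
		if o := overlap(seg, t.Span); o > bestOverlap {
			best, bestOverlap = t.Speaker, o
		}
	}
	return best
}

func main() {
	turns := []SpeakerTurn{
		{Span{0, 5.0}, "Speaker 1"},
		{Span{5.0, 12.3}, "Speaker 2"},
	}
	// This segment straddles both turns: 0.5s with Speaker 1, 4.0s with Speaker 2.
	fmt.Println(assignSpeaker(Span{4.5, 9.0}, turns))
}
```

Segments that fall entirely in silence come back with an empty speaker, which a formatter can render without a label.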

License

MIT License - see LICENSE file for details.