# Transcribe - Audio Transcription Tool

A CLI tool for transcribing audio files using OpenAI's Whisper model, with speaker diarization and multiple output formats.
## Features

- Multiple Whisper model sizes (tiny, base, small, medium, large, turbo)
- Speaker diarization using voice embeddings (resemblyzer + clustering)
- Multiple output formats: SRT subtitles, plain text, JSON
- Batch processing of multiple audio files
- Automatic language detection
- Progress indicators with spinners
## Installation

### Prerequisites

- Go 1.20+
- Python 3.8+
- FFmpeg
### Python Dependencies

```bash
# Required for transcription
pip install openai-whisper

# Required for speaker diarization
pip install resemblyzer scikit-learn librosa
```

Note: If `resemblyzer` fails to install due to `webrtcvad`, install the Python development headers first:

```bash
# Fedora/RHEL
sudo dnf install python3-devel

# Ubuntu/Debian
sudo apt install python3-dev
```
### Build from Source

```bash
go build -o transcribe
```
## Usage

An output file (`-o`) is required unless `--no-write` is specified.
### Basic Transcription

```bash
./transcribe audio.mp3 -o output.srt
```
### Choose a Whisper Model

```bash
./transcribe audio.mp3 --model small -o output.srt
```

Available models: `tiny` (default), `base`, `small`, `medium`, `large`, `turbo`.
### Output Formats

**SRT subtitles (default):**

```bash
./transcribe audio.mp3 -o subtitles.srt
```

**Plain text with timestamps:**

```bash
./transcribe audio.mp3 --format text -o output.txt
```

**JSON:**

```bash
./transcribe audio.mp3 --format json -o output.json
```
### Speaker Diarization

Enable automatic speaker detection:

```bash
./transcribe audio.mp3 --diarize -o output.srt
```

Specify the number of speakers for better accuracy:

```bash
./transcribe audio.mp3 --diarize --speakers 2 -o output.srt
```
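Conceptually, diarization clusters per-window voice embeddings and treats each cluster as one speaker. The tool uses resemblyzer embeddings plus scikit-learn clustering; the numpy-only toy below (`greedy_cluster` is a hypothetical illustration, not the actual implementation) shows why a known speaker count helps: without it, clustering has to fall back on a similarity threshold, which is easier to get wrong.

```python
import numpy as np

def greedy_cluster(embeddings, threshold=0.75):
    """Assign each embedding to the first cluster whose representative
    vector has cosine similarity >= threshold; otherwise open a new
    cluster. Returns one integer label per embedding."""
    reps, labels = [], []
    for e in embeddings:
        e = e / np.linalg.norm(e)  # unit-normalize so dot product = cosine
        sims = [float(e @ r) for r in reps]
        if sims and max(sims) >= threshold:
            labels.append(int(np.argmax(sims)))
        else:
            reps.append(e)          # first member represents the cluster
            labels.append(len(reps) - 1)
    return labels
```

With `--speakers N`, the real implementation can instead ask the clusterer for exactly N clusters and skip the threshold guesswork.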
### Print to stdout

```bash
./transcribe audio.mp3 --no-write
```
### Full Example

Transcribe with speaker diarization:

```bash
./transcribe interview.wav --model small --diarize -s 2 -o interview.srt
```

Output:

```
1
00:00:00,000 --> 00:00:05,200
[Speaker 1] Hello, how are you?

2
00:00:05,200 --> 00:00:12,300
[Speaker 2] I'm doing well, thanks!
```
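The timestamps above use SRT's `HH:MM:SS,mmm` format. A minimal sketch of that conversion (a hypothetical helper for illustration; the tool's actual formatter lives in `pkg/output/srt.go`):

```python
def srt_timestamp(seconds: float) -> str:
    """Convert a time in seconds to the SRT HH:MM:SS,mmm format."""
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1_000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

print(srt_timestamp(5.2))  # 00:00:05,200
```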
## CLI Reference

```
Usage:
  transcribe <audio files...> [flags]

Flags:
      --diarize          Enable speaker diarization
  -f, --format string    Output format: srt, text, json (default "srt")
  -h, --help             help for transcribe
  -m, --model string     Whisper model: tiny, base, small, medium, large, turbo (default "tiny")
      --no-write         Print output to stdout instead of file
  -o, --output string    Output file path (required)
  -s, --speakers int     Number of speakers (0 = auto-detect)
```
## Supported Audio Formats

MP3, WAV, FLAC, M4A, OGG, OPUS
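Input files are presumably accepted or rejected by extension (the `pkg/audio` validator); a hypothetical sketch mirroring the list above:

```python
import os

# Extensions matching the supported formats listed above.
SUPPORTED = {".mp3", ".wav", ".flac", ".m4a", ".ogg", ".opus"}

def is_supported(path: str) -> bool:
    """True if the file's extension (case-insensitive) is supported."""
    return os.path.splitext(path)[1].lower() in SUPPORTED
```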
## Architecture

```
transcribe/
├── cmd/
│   └── root.go              # CLI commands and flags
├── internal/
│   ├── whisper/
│   │   └── client.go        # Whisper Python bridge
│   └── diarization/
│       ├── client.go        # Diarization Python bridge
│       └── align.go         # Speaker-segment alignment
├── pkg/
│   ├── audio/
│   │   └── audio.go         # Audio file validation
│   ├── output/
│   │   ├── formatter.go     # Output formatter interface
│   │   ├── srt.go           # SRT format
│   │   ├── text.go          # Text format
│   │   └── json.go          # JSON format
│   └── progress/
│       └── spinner.go       # Progress spinner
└── README.md
```
## How It Works

1. **Transcription**: Audio is processed by Whisper (via a Python subprocess) to generate timestamped text segments.
2. **Diarization** (optional): Voice embeddings are extracted with resemblyzer and clustered to identify speakers.
3. **Alignment**: Speaker segments are mapped to transcription segments by timestamp overlap.
4. **Formatting**: Results are rendered in the selected output format (SRT by default).
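The alignment step (3) can be sketched as a maximum-overlap assignment: each transcript segment gets the speaker whose turns cover the most of it. The field names below are illustrative, not the tool's actual schema:

```python
def overlap(a_start, a_end, b_start, b_end):
    """Length of the intersection of two time intervals, in seconds."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))

def assign_speakers(transcript_segments, speaker_turns):
    """Label each transcript segment with the speaker whose turns
    overlap it the most, or None if no turn overlaps it at all."""
    labeled = []
    for seg in transcript_segments:
        totals = {}  # speaker -> total overlapping seconds
        for turn in speaker_turns:
            o = overlap(seg["start"], seg["end"], turn["start"], turn["end"])
            if o > 0:
                totals[turn["speaker"]] = totals.get(turn["speaker"], 0.0) + o
        best = max(totals, key=totals.get) if totals else None
        labeled.append({**seg, "speaker": best})
    return labeled
```

Diarization turns rarely line up exactly with Whisper's segment boundaries, which is why an overlap vote is used rather than an exact match.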
## License

MIT License - see LICENSE file for details.