feat: git init
2
.gitignore
vendored
Normal file
@@ -0,0 +1,2 @@
|
||||
transcribe
|
||||
test/
|
||||
93
CLAUDE.md
Normal file
@@ -0,0 +1,93 @@
|
||||
# Transcribe Tool
|
||||
|
||||
Audio transcription CLI using OpenAI Whisper with speaker diarization.
|
||||
|
||||
## Quick Reference
|
||||
|
||||
```bash
|
||||
# Basic transcription (SRT output)
|
||||
./transcribe audio.mp3 -o output.srt
|
||||
|
||||
# With speaker diarization
|
||||
./transcribe audio.mp3 --diarize -o output.srt
|
||||
|
||||
# Specify model and speakers
|
||||
./transcribe audio.mp3 --model small --diarize -s 2 -o output.srt
|
||||
|
||||
# Print to stdout
|
||||
./transcribe audio.mp3 --no-write
|
||||
```
|
||||
|
||||
## Flags
|
||||
|
||||
| Flag | Short | Description | Default |
|------|-------|-------------|---------|
| `--output` | `-o` | Output file path | **required** unless `--no-write` is set |
| `--format` | `-f` | Output format: `srt`, `text`, `json` | `srt` |
| `--model` | `-m` | Whisper model: `tiny`, `base`, `small`, `medium`, `large`, `turbo` | `tiny` |
| `--diarize` | | Enable speaker detection | off |
| `--speakers` | `-s` | Number of speakers (0 = auto-detect) | `0` |
| `--no-write` | | Print to stdout instead of writing a file | off |
|
||||
|
||||
## Common Tasks
|
||||
|
||||
**Transcribe a meeting recording:**
|
||||
```bash
|
||||
./transcribe meeting.wav --model small -o meeting.srt
|
||||
```
|
||||
|
||||
**Transcribe interview with 2 speakers:**
|
||||
```bash
|
||||
./transcribe interview.mp3 --model small --diarize -s 2 -o interview.srt
|
||||
```
|
||||
|
||||
**Get JSON output for processing:**
|
||||
```bash
|
||||
./transcribe audio.mp3 --format json -o output.json
|
||||
```
|
||||
|
||||
**Quick preview (stdout):**
|
||||
```bash
|
||||
./transcribe audio.mp3 --no-write
|
||||
```
|
||||
|
||||
## Output Formats
|
||||
|
||||
**SRT (default):** Subtitle format with timestamps
|
||||
```
|
||||
1
|
||||
00:00:00,000 --> 00:00:05,200
|
||||
[Speaker 1] Hello, how are you?
|
||||
```
|
||||
|
||||
**Text:** Plain text with timestamps
|
||||
```
|
||||
[00:00.0 - 00:05.2] [Speaker 1] Hello, how are you?
|
||||
```
|
||||
|
||||
**JSON:** Full metadata including segments, words, duration
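
A minimal sketch of reading the JSON output from another Go program. It assumes that `--format json` emits the fields declared by the struct tags in `internal/whisper/client.go` (`text`, `segments`, `language`, `duration`); the JSON formatter itself is not shown here, so treat the field names, the `parseTranscript` helper, and the sample document as assumptions.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Word, Segment and Transcript mirror the JSON tags declared in
// internal/whisper/client.go. The real --format json output is assumed,
// not confirmed, to use the same field names.
type Word struct {
	Start float64 `json:"start"`
	End   float64 `json:"end"`
	Word  string  `json:"word"`
}

type Segment struct {
	Start   float64 `json:"start"`
	End     float64 `json:"end"`
	Text    string  `json:"text"`
	Words   []Word  `json:"words,omitempty"`
	Speaker string  `json:"speaker,omitempty"`
}

type Transcript struct {
	Text     string    `json:"text"`
	Segments []Segment `json:"segments"`
	Language string    `json:"language"`
	Duration float64   `json:"duration"`
}

// parseTranscript (hypothetical helper) decodes one JSON document.
func parseTranscript(data []byte) (Transcript, error) {
	var tr Transcript
	err := json.Unmarshal(data, &tr)
	return tr, err
}

func main() {
	// Hypothetical sample; real output may contain more fields.
	sample := []byte(`{"text":"Hello, how are you?","language":"en","duration":5.2,
	"segments":[{"start":0,"end":5.2,"text":"Hello, how are you?","speaker":"Speaker 1"}]}`)

	tr, err := parseTranscript(sample)
	if err != nil {
		fmt.Println("decode error:", err)
		return
	}
	for _, s := range tr.Segments {
		fmt.Printf("%.1f-%.1f %s: %s\n", s.Start, s.End, s.Speaker, s.Text)
	}
}
```

Fields marked `omitempty` (`words`, `speaker`) may be absent, for example when diarization is off, so consumers should not rely on them.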
|
||||
|
||||
## Models
|
||||
|
||||
- `tiny` - Fastest, least accurate; use for quick drafts (default)
- `base` - Fast, basic accuracy
- `small` - Good balance of speed/accuracy
- `medium` - Better accuracy, slower
- `large` - Best accuracy, slowest
- `turbo` - Optimized for speed
|
||||
|
||||
## Supported Formats
|
||||
|
||||
MP3, WAV, FLAC, M4A, OGG, OPUS
|
||||
|
||||
## Build
|
||||
|
||||
```bash
|
||||
cd /home/yeho/Documents/tools/transcribe
|
||||
go build -o transcribe
|
||||
```
|
||||
|
||||
## Dependencies
|
||||
|
||||
```bash
|
||||
pip install openai-whisper # Required
|
||||
pip install resemblyzer scikit-learn librosa # For diarization
|
||||
```
|
||||
166
README.md
Normal file
@@ -0,0 +1,166 @@
|
||||
# Transcribe - Audio Transcription Tool
|
||||
|
||||
A CLI tool for transcribing audio files using OpenAI's Whisper model with speaker diarization and multiple output formats.
|
||||
|
||||
## Features
|
||||
|
||||
- Multiple Whisper model sizes (tiny, base, small, medium, large, turbo)
|
||||
- Speaker diarization using voice embeddings (resemblyzer + clustering)
|
||||
- Multiple output formats: SRT subtitles, plain text, JSON
|
||||
- Batch processing of multiple audio files (every result is written to the same `-o` path, so later files overwrite earlier ones)
|
||||
- Automatic language detection
|
||||
- Progress indicators with spinners
|
||||
|
||||
## Installation
|
||||
|
||||
### Prerequisites
|
||||
- Go 1.25.4+ (the version declared in `go.mod`)
|
||||
- Python 3.8+
|
||||
- FFmpeg
|
||||
|
||||
### Python Dependencies
|
||||
```bash
|
||||
# Required for transcription
|
||||
pip install openai-whisper
|
||||
|
||||
# Required for speaker diarization
|
||||
pip install resemblyzer scikit-learn librosa
|
||||
```
|
||||
|
||||
Note: If `resemblyzer` fails to install due to `webrtcvad`, install Python development headers first:
|
||||
```bash
|
||||
# Fedora/RHEL
|
||||
sudo dnf install python3-devel
|
||||
|
||||
# Ubuntu/Debian
|
||||
sudo apt install python3-dev
|
||||
```
|
||||
|
||||
### Build from Source
|
||||
```bash
|
||||
go build -o transcribe
|
||||
```
|
||||
|
||||
## Usage
|
||||
|
||||
Output file (`-o`) is required unless `--no-write` is specified.
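
The rule above can be sketched as a small helper. This is an illustration only: `resolveDestination` is a hypothetical name, and the choice that `--no-write` wins when both options are given is an assumption that this README does not state.

```go
package main

import (
	"errors"
	"fmt"
)

// resolveDestination (hypothetical helper) returns the path to write to,
// or "" when the transcript should go to stdout.
// Assumption: --no-write takes precedence if -o is also given.
func resolveDestination(outputFile string, noWrite bool) (string, error) {
	if noWrite {
		return "", nil // print to stdout
	}
	if outputFile == "" {
		return "", errors.New("output file required: use -o <file> or --no-write")
	}
	return outputFile, nil
}

func main() {
	dest, err := resolveDestination("output.srt", false)
	fmt.Println(dest, err)

	_, err = resolveDestination("", false)
	fmt.Println(err)
}
```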
|
||||
|
||||
### Basic Transcription
|
||||
```bash
|
||||
./transcribe audio.mp3 -o output.srt
|
||||
```
|
||||
|
||||
### Choose Whisper Model
|
||||
```bash
|
||||
./transcribe audio.mp3 --model small -o output.srt
|
||||
```
|
||||
|
||||
Available models: `tiny` (default), `base`, `small`, `medium`, `large`, `turbo`
|
||||
|
||||
### Output Formats
|
||||
|
||||
**SRT subtitles (default):**
|
||||
```bash
|
||||
./transcribe audio.mp3 -o subtitles.srt
|
||||
```
|
||||
|
||||
**Plain text with timestamps:**
|
||||
```bash
|
||||
./transcribe audio.mp3 --format text -o output.txt
|
||||
```
|
||||
|
||||
**JSON:**
|
||||
```bash
|
||||
./transcribe audio.mp3 --format json -o output.json
|
||||
```
|
||||
|
||||
### Speaker Diarization
|
||||
|
||||
Enable automatic speaker detection:
|
||||
```bash
|
||||
./transcribe audio.mp3 --diarize -o output.srt
|
||||
```
|
||||
|
||||
Specify number of speakers for better accuracy:
|
||||
```bash
|
||||
./transcribe audio.mp3 --diarize --speakers 2 -o output.srt
|
||||
```
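
Under the hood, the diarization script labels short overlapping audio windows and then merges consecutive windows that share a speaker. The sketch below ports that merging step to Go for illustration. The `window`, `speakerSegment`, and `mergeWindows` names are hypothetical; the 0.5 s gap threshold and the 1.5 s / 0.75 s window sizes are taken from the Python code in `internal/diarization/client.go`.

```go
package main

import "fmt"

// window is one fixed-length slice of audio with its cluster label.
type window struct {
	Start, End float64
	Label      int
}

// speakerSegment is a merged run of windows from the same speaker.
type speakerSegment struct {
	Speaker    string
	Start, End float64
}

// mergeWindows extends the previous segment while the speaker stays the
// same and the next window starts less than maxGap seconds after it ends.
func mergeWindows(windows []window, maxGap float64) []speakerSegment {
	var merged []speakerSegment
	for _, w := range windows {
		name := fmt.Sprintf("Speaker %d", w.Label+1)
		if n := len(merged); n > 0 {
			last := &merged[n-1]
			if last.Speaker == name && w.Start-last.End < maxGap {
				last.End = w.End
				continue
			}
		}
		merged = append(merged, speakerSegment{Speaker: name, Start: w.Start, End: w.End})
	}
	return merged
}

func main() {
	// 1.5 s windows with a 0.75 s hop, as in the diarization script.
	windows := []window{
		{0, 1.5, 0}, {0.75, 2.25, 0}, {1.5, 3.0, 1}, {2.25, 3.75, 1},
	}
	for _, s := range mergeWindows(windows, 0.5) {
		fmt.Printf("%s: %.2f-%.2f\n", s.Speaker, s.Start, s.End)
	}
}
```

Because the windows overlap, merged segments from different speakers can overlap too; the alignment step resolves this by picking the speaker with the largest overlap per transcript segment.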
|
||||
|
||||
### Print to stdout
|
||||
```bash
|
||||
./transcribe audio.mp3 --no-write
|
||||
```
|
||||
|
||||
### Full Example
|
||||
|
||||
Transcribe with speaker diarization:
|
||||
```bash
|
||||
./transcribe interview.wav --model small --diarize -s 2 -o interview.srt
|
||||
```
|
||||
|
||||
Output:
|
||||
```
|
||||
1
|
||||
00:00:00,000 --> 00:00:05,200
|
||||
[Speaker 1] Hello, how are you?
|
||||
|
||||
2
|
||||
00:00:05,200 --> 00:00:12,300
|
||||
[Speaker 2] I'm doing well, thanks!
|
||||
```
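
For reference, the cue layout shown above (index, `HH:MM:SS,mmm --> HH:MM:SS,mmm`, text, blank line) can be produced with a few lines of Go. This is a sketch under assumptions, not the project's `srt.go`: the `cue` type and the `srtTimestamp` and `renderSRT` helpers are hypothetical, and rounding to the nearest millisecond is a choice made here.

```go
package main

import (
	"fmt"
	"math"
	"strings"
)

// cue is a hypothetical stand-in for one transcript segment.
type cue struct {
	Start, End float64 // seconds
	Speaker    string  // empty when diarization is off
	Text       string
}

// srtTimestamp formats seconds as HH:MM:SS,mmm (SRT uses a comma
// before the milliseconds).
func srtTimestamp(seconds float64) string {
	ms := int64(math.Round(seconds * 1000))
	h := ms / 3600000
	m := ms % 3600000 / 60000
	s := ms % 60000 / 1000
	return fmt.Sprintf("%02d:%02d:%02d,%03d", h, m, s, ms%1000)
}

// renderSRT writes numbered cues separated by blank lines.
func renderSRT(cues []cue) string {
	var b strings.Builder
	for i, c := range cues {
		text := c.Text
		if c.Speaker != "" {
			text = "[" + c.Speaker + "] " + text
		}
		fmt.Fprintf(&b, "%d\n%s --> %s\n%s\n\n",
			i+1, srtTimestamp(c.Start), srtTimestamp(c.End), text)
	}
	return b.String()
}

func main() {
	fmt.Print(renderSRT([]cue{
		{0, 5.2, "Speaker 1", "Hello, how are you?"},
		{5.2, 12.3, "Speaker 2", "I'm doing well, thanks!"},
	}))
}
```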
|
||||
|
||||
## CLI Reference
|
||||
|
||||
```
Usage:
  transcribe <audio files...> [flags]

Flags:
      --diarize          Enable speaker diarization
  -f, --format string    Output format: srt, text, json (default "srt")
  -h, --help             help for transcribe
  -m, --model string     Whisper model: tiny, base, small, medium, large, turbo (default "tiny")
      --no-write         Print output to stdout instead of file
  -o, --output string    Output file path (required)
  -s, --speakers int     Number of speakers (0 = auto-detect)
```
|
||||
|
||||
## Supported Audio Formats
|
||||
|
||||
MP3, WAV, FLAC, M4A, OGG, OPUS
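
A minimal sketch of checking a path against this list by file extension. The real validation lives in `pkg/audio/audio.go`, which is not shown here, so `isSupportedAudio` and the case-insensitive extension match are assumptions for illustration only.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// supportedExt lists the extensions named in this README.
var supportedExt = map[string]bool{
	".mp3": true, ".wav": true, ".flac": true,
	".m4a": true, ".ogg": true, ".opus": true,
}

// isSupportedAudio (hypothetical helper) reports whether the file
// extension is in the supported list, ignoring case.
func isSupportedAudio(path string) bool {
	return supportedExt[strings.ToLower(filepath.Ext(path))]
}

func main() {
	for _, p := range []string{"talk.MP3", "notes.txt", "meeting.opus"} {
		fmt.Printf("%s supported: %v\n", p, isSupportedAudio(p))
	}
}
```

An extension check only filters obvious mistakes; a file with a supported extension can still fail to decode.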
|
||||
|
||||
## Architecture
|
||||
|
||||
```
transcribe/
├── cmd/
│   └── root.go              # CLI commands and flags
├── internal/
│   ├── whisper/
│   │   └── client.go        # Whisper Python bridge
│   └── diarization/
│       ├── client.go        # Diarization Python bridge
│       └── align.go         # Speaker-segment alignment
├── pkg/
│   ├── audio/
│   │   └── audio.go         # Audio file validation
│   ├── output/
│   │   ├── formatter.go     # Output formatter interface
│   │   ├── srt.go           # SRT format
│   │   ├── text.go          # Text format
│   │   └── json.go          # JSON format
│   └── progress/
│       └── spinner.go       # Progress spinner
└── README.md
```
|
||||
|
||||
## How It Works
|
||||
|
||||
1. **Transcription**: Audio is processed by Whisper (via Python subprocess) to generate timestamped text segments
|
||||
2. **Diarization** (optional): Voice embeddings are extracted using resemblyzer and clustered to identify speakers
|
||||
3. **Alignment**: Speaker segments are mapped to transcription segments by timestamp overlap
|
||||
4. **Formatting**: Results are formatted according to the selected output format (SRT by default)
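
The alignment step (3) can be sketched as follows: each transcript segment is assigned the speaker whose time range overlaps it the most. The sketch follows the approach in `internal/diarization/align.go`, but the `turn` type and the `overlap` and `pickSpeaker` names are simplified stand-ins chosen for this example.

```go
package main

import (
	"fmt"
	"math"
)

// turn is one diarized stretch of speech attributed to a speaker.
type turn struct {
	Start, End float64
	Speaker    string
}

// overlap returns how many seconds two time ranges share (0 if none).
func overlap(start1, end1, start2, end2 float64) float64 {
	start := math.Max(start1, start2)
	end := math.Min(end1, end2)
	if end > start {
		return end - start
	}
	return 0
}

// pickSpeaker returns the speaker with the largest overlap with
// [start, end], or "" when no turn overlaps the range at all.
func pickSpeaker(start, end float64, turns []turn) string {
	best, bestOverlap := "", 0.0
	for _, tn := range turns {
		if o := overlap(start, end, tn.Start, tn.End); o > bestOverlap {
			best, bestOverlap = tn.Speaker, o
		}
	}
	return best
}

func main() {
	turns := []turn{{0, 3, "Speaker 1"}, {3, 10, "Speaker 2"}}
	// The segment 0-5 s overlaps Speaker 1 for 3 s and Speaker 2 for 2 s.
	fmt.Println(pickSpeaker(0, 5, turns))
}
```

A segment that overlaps no speaker turn keeps an empty speaker, which is why the `speaker` field is optional in the output.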
|
||||
|
||||
## License
|
||||
|
||||
MIT License - see LICENSE file for details.
|
||||
172
cmd/root.go
Normal file
@@ -0,0 +1,172 @@
|
||||
package cmd
|
||||
|
||||
import (
|
||||
"fmt"
|
||||
"os"
|
||||
|
||||
"transcribe/internal/diarization"
|
||||
"transcribe/internal/whisper"
|
||||
"transcribe/pkg/audio"
|
||||
"transcribe/pkg/output"
|
||||
"transcribe/pkg/progress"
|
||||
|
||||
"github.com/spf13/cobra"
|
||||
)
|
||||
|
||||
var Version = "dev"
|
||||
|
||||
var outputFile string
|
||||
var outputFormat string
|
||||
var diarize bool
|
||||
var numSpeakers int
|
||||
var modelSize string
|
||||
var noWrite bool
|
||||
|
||||
// rootCmd represents the base command when called without any subcommands
|
||||
var rootCmd = &cobra.Command{
|
||||
Use: "transcribe",
|
||||
Short: "A CLI tool for transcribing audio files with speaker diarization",
|
||||
Long: `Transcribe is a command-line tool that uses OpenAI's Whisper model to
|
||||
transcribe audio files. It supports multiple output formats (text, SRT, JSON)
|
||||
and speaker diarization using voice embeddings.
|
||||
|
||||
Output file (-o) is required unless --no-write is specified.
|
||||
|
||||
Output Formats:
|
||||
srt SRT subtitle format (default)
|
||||
text Plain text with timestamps
|
||||
json JSON with full metadata
|
||||
|
||||
Whisper Models (--model, -m):
|
||||
tiny Fastest, least accurate (default)
|
||||
base Fast, basic accuracy
|
||||
small Balanced speed/accuracy
|
||||
medium Good accuracy, slower
|
||||
large Best accuracy, slowest
|
||||
turbo Optimized for speed
|
||||
|
||||
Examples:
|
||||
# Basic transcription to SRT
|
||||
transcribe audio.mp3 -o output.srt
|
||||
|
||||
# Use a larger model
|
||||
transcribe audio.mp3 --model small -o output.srt
|
||||
|
||||
# Output as plain text
|
||||
transcribe audio.mp3 --format text -o output.txt
|
||||
|
||||
# Enable speaker diarization
|
||||
transcribe audio.mp3 --diarize -o output.srt
|
||||
|
||||
# Print to stdout instead of file
|
||||
transcribe audio.mp3 --no-write
|
||||
|
||||
# Full example: diarization + specific model
|
||||
transcribe audio.mp3 --model small --diarize -s 2 -o output.srt`,
|
||||
Run: func(cmd *cobra.Command, args []string) {
|
||||
if len(args) == 0 {
|
||||
fmt.Println("Please provide audio files to transcribe")
|
||||
_ = cmd.Help()
|
||||
os.Exit(1)
|
||||
}
|
||||
|
||||
// Require output file unless --no-write is set
|
||||
if outputFile == "" && !noWrite {
|
||||
fmt.Println("✗ Error: Output file required. Use -o <file> to specify output, or --no-write to print to stdout.")
|
||||
os.Exit(1)
|
||||
}
|
||||
|
||||
// Validate all provided files
|
||||
for _, file := range args {
|
||||
if _, err := os.Stat(file); os.IsNotExist(err) {
|
||||
fmt.Printf("✗ Error: File '%s' does not exist\n", file)
|
||||
os.Exit(1)
|
||||
}
|
||||
|
||||
_, err := audio.NewAudioFile(file)
|
||||
if err != nil {
|
||||
fmt.Printf("✗ Error: File '%s' has unsupported format or error: %v\n", file, err)
|
||||
os.Exit(1)
|
||||
}
|
||||
}
|
||||
|
||||
// Create whisper client and transcribe
|
||||
whisperClient := whisper.NewClient(whisper.ModelSize(modelSize))
|
||||
whisperOptions := whisper.DefaultTranscriptionOptions()
|
||||
|
||||
// Create diarization client if needed
|
||||
var diarizationClient *diarization.Client
|
||||
var diarizationOptions *diarization.DiarizationOptions
|
||||
if diarize {
|
||||
diarizationClient = diarization.NewClient()
|
||||
diarizationOptions = &diarization.DiarizationOptions{
|
||||
NumSpeakers: numSpeakers,
|
||||
}
|
||||
}
|
||||
|
||||
// Create output formatter
|
||||
formatter := output.NewFormatter(output.FormatType(outputFormat))
|
||||
|
||||
for _, file := range args {
|
||||
// Transcription with spinner
|
||||
spinner := progress.NewSpinner(fmt.Sprintf("Transcribing %s (model: %s)...", file, modelSize))
|
||||
spinner.Start()
|
||||
result, err := whisperClient.Transcribe(file, whisperOptions)
|
||||
if err != nil {
|
||||
spinner.StopWithMessage(fmt.Sprintf("✗ Error transcribing %s: %v", file, err))
|
||||
continue
|
||||
}
|
||||
spinner.StopWithMessage(fmt.Sprintf("✓ Transcribed %s (%.1fs audio)", file, result.Duration))
|
||||
|
||||
// Run diarization if enabled
|
||||
if diarize {
|
||||
spinner := progress.NewSpinner("Detecting speakers...")
|
||||
spinner.Start()
|
||||
diarizationResult, err := diarizationClient.Diarize(file, diarizationOptions)
|
||||
if err != nil {
|
||||
spinner.StopWithMessage(fmt.Sprintf("✗ Diarization failed: %v", err))
|
||||
} else {
|
||||
spinner.StopWithMessage(fmt.Sprintf("✓ Detected %d speaker(s)", diarizationResult.NumSpeakers))
|
||||
diarization.AlignSpeakers(result, diarizationResult)
|
||||
}
|
||||
}
|
||||
|
||||
// Format output
|
||||
formattedOutput, err := formatter.Format(result)
|
||||
if err != nil {
|
||||
fmt.Printf("Error formatting output: %v\n", err)
|
||||
continue
|
||||
}
|
||||
|
||||
// Write to file or stdout
|
||||
if outputFile != "" && !noWrite {
|
||||
err := os.WriteFile(outputFile, []byte(formattedOutput), 0644)
|
||||
if err != nil {
|
||||
fmt.Printf("✗ Error writing output file: %v\n", err)
|
||||
} else {
|
||||
fmt.Printf("✓ Saved to %s\n", outputFile)
|
||||
}
|
||||
} else {
|
||||
fmt.Printf("\n%s\n", formattedOutput)
|
||||
}
|
||||
}
|
||||
},
|
||||
}
|
||||
|
||||
func init() {
|
||||
rootCmd.Version = Version
|
||||
rootCmd.PersistentFlags().StringVarP(&outputFile, "output", "o", "", "Output file path (required)")
|
||||
rootCmd.PersistentFlags().StringVarP(&outputFormat, "format", "f", "srt", "Output format: text, srt, json")
|
||||
rootCmd.PersistentFlags().BoolVar(&diarize, "diarize", false, "Enable speaker diarization")
|
||||
rootCmd.PersistentFlags().IntVarP(&numSpeakers, "speakers", "s", 0, "Number of speakers (0 = auto-detect)")
|
||||
rootCmd.PersistentFlags().StringVarP(&modelSize, "model", "m", "tiny", "Whisper model: tiny, base, small, medium, large, turbo")
|
||||
rootCmd.PersistentFlags().BoolVar(&noWrite, "no-write", false, "Print output to stdout instead of file")
|
||||
}
|
||||
|
||||
// Execute adds all child commands to the root command and sets flags appropriately.
|
||||
func Execute() {
|
||||
if err := rootCmd.Execute(); err != nil {
|
||||
fmt.Println(err)
|
||||
os.Exit(1)
|
||||
}
|
||||
}
|
||||
28
go.mod
Normal file
@@ -0,0 +1,28 @@
|
||||
module transcribe
|
||||
|
||||
go 1.25.4
|
||||
|
||||
require (
|
||||
github.com/aymanbagabas/go-osc52/v2 v2.0.1 // indirect
|
||||
github.com/charmbracelet/bubbletea v1.3.10 // indirect
|
||||
github.com/charmbracelet/colorprofile v0.2.3-0.20250311203215-f60798e515dc // indirect
|
||||
github.com/charmbracelet/lipgloss v1.1.0 // indirect
|
||||
github.com/charmbracelet/x/ansi v0.10.1 // indirect
|
||||
github.com/charmbracelet/x/cellbuf v0.0.13-0.20250311204145-2c3ea96c31dd // indirect
|
||||
github.com/charmbracelet/x/term v0.2.1 // indirect
|
||||
github.com/erikgeiser/coninput v0.0.0-20211004153227-1c3628e74d0f // indirect
|
||||
github.com/inconshreveable/mousetrap v1.1.0 // indirect
|
||||
github.com/lucasb-eyer/go-colorful v1.2.0 // indirect
|
||||
github.com/mattn/go-isatty v0.0.20 // indirect
|
||||
github.com/mattn/go-localereader v0.0.1 // indirect
|
||||
github.com/mattn/go-runewidth v0.0.16 // indirect
|
||||
github.com/muesli/ansi v0.0.0-20230316100256-276c6243b2f6 // indirect
|
||||
github.com/muesli/cancelreader v0.2.2 // indirect
|
||||
github.com/muesli/termenv v0.16.0 // indirect
|
||||
github.com/rivo/uniseg v0.4.7 // indirect
|
||||
github.com/spf13/cobra v1.10.2
|
||||
github.com/spf13/pflag v1.0.9 // indirect
|
||||
github.com/xo/terminfo v0.0.0-20220910002029-abceb7e1c41e // indirect
|
||||
golang.org/x/sys v0.36.0 // indirect
|
||||
golang.org/x/text v0.3.8 // indirect
|
||||
)
|
||||
51
go.sum
Normal file
@@ -0,0 +1,51 @@
|
||||
github.com/aymanbagabas/go-osc52/v2 v2.0.1 h1:HwpRHbFMcZLEVr42D4p7XBqjyuxQH5SMiErDT4WkJ2k=
|
||||
github.com/aymanbagabas/go-osc52/v2 v2.0.1/go.mod h1:uYgXzlJ7ZpABp8OJ+exZzJJhRNQ2ASbcXHWsFqH8hp8=
|
||||
github.com/charmbracelet/bubbletea v1.3.10 h1:otUDHWMMzQSB0Pkc87rm691KZ3SWa4KUlvF9nRvCICw=
|
||||
github.com/charmbracelet/bubbletea v1.3.10/go.mod h1:ORQfo0fk8U+po9VaNvnV95UPWA1BitP1E0N6xJPlHr4=
|
||||
github.com/charmbracelet/colorprofile v0.2.3-0.20250311203215-f60798e515dc h1:4pZI35227imm7yK2bGPcfpFEmuY1gc2YSTShr4iJBfs=
|
||||
github.com/charmbracelet/colorprofile v0.2.3-0.20250311203215-f60798e515dc/go.mod h1:X4/0JoqgTIPSFcRA/P6INZzIuyqdFY5rm8tb41s9okk=
|
||||
github.com/charmbracelet/lipgloss v1.1.0 h1:vYXsiLHVkK7fp74RkV7b2kq9+zDLoEU4MZoFqR/noCY=
|
||||
github.com/charmbracelet/lipgloss v1.1.0/go.mod h1:/6Q8FR2o+kj8rz4Dq0zQc3vYf7X+B0binUUBwA0aL30=
|
||||
github.com/charmbracelet/x/ansi v0.10.1 h1:rL3Koar5XvX0pHGfovN03f5cxLbCF2YvLeyz7D2jVDQ=
|
||||
github.com/charmbracelet/x/ansi v0.10.1/go.mod h1:3RQDQ6lDnROptfpWuUVIUG64bD2g2BgntdxH0Ya5TeE=
|
||||
github.com/charmbracelet/x/cellbuf v0.0.13-0.20250311204145-2c3ea96c31dd h1:vy0GVL4jeHEwG5YOXDmi86oYw2yuYUGqz6a8sLwg0X8=
|
||||
github.com/charmbracelet/x/cellbuf v0.0.13-0.20250311204145-2c3ea96c31dd/go.mod h1:xe0nKWGd3eJgtqZRaN9RjMtK7xUYchjzPr7q6kcvCCs=
|
||||
github.com/charmbracelet/x/term v0.2.1 h1:AQeHeLZ1OqSXhrAWpYUtZyX1T3zVxfpZuEQMIQaGIAQ=
|
||||
github.com/charmbracelet/x/term v0.2.1/go.mod h1:oQ4enTYFV7QN4m0i9mzHrViD7TQKvNEEkHUMCmsxdUg=
|
||||
github.com/cpuguy83/go-md2man/v2 v2.0.6/go.mod h1:oOW0eioCTA6cOiMLiUPZOpcVxMig6NIQQ7OS05n1F4g=
|
||||
github.com/erikgeiser/coninput v0.0.0-20211004153227-1c3628e74d0f h1:Y/CXytFA4m6baUTXGLOoWe4PQhGxaX0KpnayAqC48p4=
|
||||
github.com/erikgeiser/coninput v0.0.0-20211004153227-1c3628e74d0f/go.mod h1:vw97MGsxSvLiUE2X8qFplwetxpGLQrlU1Q9AUEIzCaM=
|
||||
github.com/inconshreveable/mousetrap v1.1.0 h1:wN+x4NVGpMsO7ErUn/mUI3vEoE6Jt13X2s0bqwp9tc8=
|
||||
github.com/inconshreveable/mousetrap v1.1.0/go.mod h1:vpF70FUmC8bwa3OWnCshd2FqLfsEA9PFc4w1p2J65bw=
|
||||
github.com/lucasb-eyer/go-colorful v1.2.0 h1:1nnpGOrhyZZuNyfu1QjKiUICQ74+3FNCN69Aj6K7nkY=
|
||||
github.com/lucasb-eyer/go-colorful v1.2.0/go.mod h1:R4dSotOR9KMtayYi1e77YzuveK+i7ruzyGqttikkLy0=
|
||||
github.com/mattn/go-isatty v0.0.20 h1:xfD0iDuEKnDkl03q4limB+vH+GxLEtL/jb4xVJSWWEY=
|
||||
github.com/mattn/go-isatty v0.0.20/go.mod h1:W+V8PltTTMOvKvAeJH7IuucS94S2C6jfK/D7dTCTo3Y=
|
||||
github.com/mattn/go-localereader v0.0.1 h1:ygSAOl7ZXTx4RdPYinUpg6W99U8jWvWi9Ye2JC/oIi4=
|
||||
github.com/mattn/go-localereader v0.0.1/go.mod h1:8fBrzywKY7BI3czFoHkuzRoWE9C+EiG4R1k4Cjx5p88=
|
||||
github.com/mattn/go-runewidth v0.0.16 h1:E5ScNMtiwvlvB5paMFdw9p4kSQzbXFikJ5SQO6TULQc=
|
||||
github.com/mattn/go-runewidth v0.0.16/go.mod h1:Jdepj2loyihRzMpdS35Xk/zdY8IAYHsh153qUoGf23w=
|
||||
github.com/muesli/ansi v0.0.0-20230316100256-276c6243b2f6 h1:ZK8zHtRHOkbHy6Mmr5D264iyp3TiX5OmNcI5cIARiQI=
|
||||
github.com/muesli/ansi v0.0.0-20230316100256-276c6243b2f6/go.mod h1:CJlz5H+gyd6CUWT45Oy4q24RdLyn7Md9Vj2/ldJBSIo=
|
||||
github.com/muesli/cancelreader v0.2.2 h1:3I4Kt4BQjOR54NavqnDogx/MIoWBFa0StPA8ELUXHmA=
|
||||
github.com/muesli/cancelreader v0.2.2/go.mod h1:3XuTXfFS2VjM+HTLZY9Ak0l6eUKfijIfMUZ4EgX0QYo=
|
||||
github.com/muesli/termenv v0.16.0 h1:S5AlUN9dENB57rsbnkPyfdGuWIlkmzJjbFf0Tf5FWUc=
|
||||
github.com/muesli/termenv v0.16.0/go.mod h1:ZRfOIKPFDYQoDFF4Olj7/QJbW60Ol/kL1pU3VfY/Cnk=
|
||||
github.com/rivo/uniseg v0.2.0/go.mod h1:J6wj4VEh+S6ZtnVlnTBMWIodfgj8LQOQFoIToxlJtxc=
|
||||
github.com/rivo/uniseg v0.4.7 h1:WUdvkW8uEhrYfLC4ZzdpI2ztxP1I582+49Oc5Mq64VQ=
|
||||
github.com/rivo/uniseg v0.4.7/go.mod h1:FN3SvrM+Zdj16jyLfmOkMNblXMcoc8DfTHruCPUcx88=
|
||||
github.com/russross/blackfriday/v2 v2.1.0/go.mod h1:+Rmxgy9KzJVeS9/2gXHxylqXiyQDYRxCVz55jmeOWTM=
|
||||
github.com/spf13/cobra v1.10.2 h1:DMTTonx5m65Ic0GOoRY2c16WCbHxOOw6xxezuLaBpcU=
|
||||
github.com/spf13/cobra v1.10.2/go.mod h1:7C1pvHqHw5A4vrJfjNwvOdzYu0Gml16OCs2GRiTUUS4=
|
||||
github.com/spf13/pflag v1.0.9 h1:9exaQaMOCwffKiiiYk6/BndUBv+iRViNW+4lEMi0PvY=
|
||||
github.com/spf13/pflag v1.0.9/go.mod h1:McXfInJRrz4CZXVZOBLb0bTZqETkiAhM9Iw0y3An2Bg=
|
||||
github.com/xo/terminfo v0.0.0-20220910002029-abceb7e1c41e h1:JVG44RsyaB9T2KIHavMF/ppJZNG9ZpyihvCd0w101no=
|
||||
github.com/xo/terminfo v0.0.0-20220910002029-abceb7e1c41e/go.mod h1:RbqR21r5mrJuqunuUZ/Dhy/avygyECGrLceyNeo4LiM=
|
||||
go.yaml.in/yaml/v3 v3.0.4/go.mod h1:DhzuOOF2ATzADvBadXxruRBLzYTpT36CKvDb3+aBEFg=
|
||||
golang.org/x/sys v0.0.0-20210809222454-d867a43fc93e/go.mod h1:oPkhp1MJrh7nUepCBck5+mAzfO9JrbApNNgaTdGDITg=
|
||||
golang.org/x/sys v0.6.0/go.mod h1:oPkhp1MJrh7nUepCBck5+mAzfO9JrbApNNgaTdGDITg=
|
||||
golang.org/x/sys v0.36.0 h1:KVRy2GtZBrk1cBYA7MKu5bEZFxQk4NIDV6RLVcC8o0k=
|
||||
golang.org/x/sys v0.36.0/go.mod h1:OgkHotnGiDImocRcuBABYBEXf8A9a87e/uXjp9XT3ks=
|
||||
golang.org/x/text v0.3.8 h1:nAL+RVCQ9uMn3vJZbV+MRnydTJFPf8qqY42YiA6MrqY=
|
||||
golang.org/x/text v0.3.8/go.mod h1:E6s5w1FMmriuDzIBO73fBruAKo1PCIq6d2Q6DHfQ8WQ=
|
||||
gopkg.in/check.v1 v0.0.0-20161208181325-20d25e280405/go.mod h1:Co6ibVJAznAaIkqp8huTwlJQCZ016jof/cbN4VW5Yz0=
|
||||
27
install.sh
Executable file
@@ -0,0 +1,27 @@
|
||||
#!/bin/bash
|
||||
set -e
|
||||
|
||||
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
|
||||
INSTALL_DIR="$HOME/.local/bin"
|
||||
|
||||
cd "$SCRIPT_DIR"
|
||||
|
||||
VERSION="$(cat "$SCRIPT_DIR/VERSION")"
|
||||
|
||||
echo "Building transcribe (version: $VERSION)..."
|
||||
go build -ldflags "-X transcribe/cmd.Version=$VERSION" -o transcribe .
|
||||
|
||||
echo "Installing to $INSTALL_DIR..."
|
||||
mkdir -p "$INSTALL_DIR"
|
||||
cp transcribe "$INSTALL_DIR/"
|
||||
chmod +x "$INSTALL_DIR/transcribe"
|
||||
|
||||
if [[ ":$PATH:" != *":$HOME/.local/bin:"* ]]; then
|
||||
echo ""
|
||||
echo "Warning: ~/.local/bin is not in your PATH"
|
||||
echo "Add this to your shell rc file (e.g., ~/.bashrc or ~/.zshrc):"
|
||||
echo ' export PATH="$HOME/.local/bin:$PATH"'
|
||||
fi
|
||||
|
||||
echo ""
|
||||
echo "Installed successfully!"
|
||||
59
internal/diarization/align.go
Normal file
@@ -0,0 +1,59 @@
|
||||
package diarization
|
||||
|
||||
import (
|
||||
"transcribe/internal/whisper"
|
||||
)
|
||||
|
||||
// AlignSpeakers maps speaker segments to transcription segments by timestamp overlap
|
||||
func AlignSpeakers(transcription *whisper.TranscriptionResult, diarization *DiarizationResult) {
|
||||
if diarization == nil || len(diarization.Speakers) == 0 {
|
||||
return
|
||||
}
|
||||
|
||||
for i := range transcription.Segments {
|
||||
seg := &transcription.Segments[i]
|
||||
speaker := findSpeakerForSegment(seg.Start, seg.End, diarization.Speakers)
|
||||
seg.Speaker = speaker
|
||||
}
|
||||
}
|
||||
|
||||
// findSpeakerForSegment finds the speaker with the most overlap with the given time range
|
||||
func findSpeakerForSegment(start, end float64, speakers []SpeakerSegment) string {
|
||||
var bestSpeaker string
|
||||
var maxOverlap float64
|
||||
|
||||
for _, spk := range speakers {
|
||||
overlap := calculateOverlap(start, end, spk.Start, spk.End)
|
||||
if overlap > maxOverlap {
|
||||
maxOverlap = overlap
|
||||
bestSpeaker = spk.Speaker
|
||||
}
|
||||
}
|
||||
|
||||
return bestSpeaker
|
||||
}
|
||||
|
||||
// calculateOverlap returns the duration of overlap between two time ranges
|
||||
func calculateOverlap(start1, end1, start2, end2 float64) float64 {
|
||||
overlapStart := max(start1, start2)
|
||||
overlapEnd := min(end1, end2)
|
||||
|
||||
if overlapEnd > overlapStart {
|
||||
return overlapEnd - overlapStart
|
||||
}
|
||||
return 0
|
||||
}
|
||||
|
||||
func max(a, b float64) float64 {
|
||||
if a > b {
|
||||
return a
|
||||
}
|
||||
return b
|
||||
}
|
||||
|
||||
func min(a, b float64) float64 {
|
||||
if a < b {
|
||||
return a
|
||||
}
|
||||
return b
|
||||
}
|
||||
222
internal/diarization/client.go
Normal file
@@ -0,0 +1,222 @@
|
||||
package diarization
|
||||
|
||||
import (
|
||||
"bytes"
|
||||
"encoding/json"
|
||||
"fmt"
|
||||
"os/exec"
|
||||
)
|
||||
|
||||
// SpeakerSegment represents a segment with speaker identification
|
||||
type SpeakerSegment struct {
|
||||
Speaker string `json:"speaker"` // "Speaker 1", "Speaker 2", etc.
|
||||
Start float64 `json:"start"`
|
||||
End float64 `json:"end"`
|
||||
}
|
||||
|
||||
// DiarizationResult contains the speaker diarization output
|
||||
type DiarizationResult struct {
|
||||
Speakers []SpeakerSegment `json:"speakers"`
|
||||
NumSpeakers int `json:"num_speakers"`
|
||||
}
|
||||
|
||||
// Client handles speaker diarization using resemblyzer
|
||||
type Client struct{}
|
||||
|
||||
// NewClient creates a new diarization client
|
||||
func NewClient() *Client {
|
||||
return &Client{}
|
||||
}
|
||||
|
||||
// DiarizationOptions contains options for diarization
|
||||
type DiarizationOptions struct {
|
||||
NumSpeakers int // Number of speakers (0 = auto-detect)
|
||||
}
|
||||
|
||||
// DefaultDiarizationOptions returns default diarization options
|
||||
func DefaultDiarizationOptions() *DiarizationOptions {
|
||||
return &DiarizationOptions{
|
||||
NumSpeakers: 0, // Auto-detect
|
||||
}
|
||||
}
|
||||
|
||||
// Diarize processes an audio file and returns speaker segments
|
||||
func (c *Client) Diarize(audioPath string, options *DiarizationOptions) (*DiarizationResult, error) {
|
||||
if options == nil {
|
||||
options = DefaultDiarizationOptions()
|
||||
}
|
||||
|
||||
// Build the Python command
|
||||
cmd := exec.Command("python3", "-c", c.buildPythonCommand(audioPath, options))
|
||||
|
||||
// Capture stdout and stderr
|
||||
var out bytes.Buffer
|
||||
var errBuf bytes.Buffer
|
||||
cmd.Stdout = &out
|
||||
cmd.Stderr = &errBuf
|
||||
|
||||
// Execute the command
|
||||
err := cmd.Run()
|
||||
if err != nil {
|
||||
return nil, fmt.Errorf("diarization failed: %v, stderr: %s", err, errBuf.String())
|
||||
}
|
||||
|
||||
// Parse the JSON output
|
||||
var result DiarizationResult
|
||||
err = json.Unmarshal(out.Bytes(), &result)
|
||||
if err != nil {
|
||||
return nil, fmt.Errorf("failed to parse diarization output: %v, output: %s", err, out.String())
|
||||
}
|
||||
|
||||
return &result, nil
|
||||
}
|
||||
|
||||
// buildPythonCommand constructs the Python command for diarization
|
||||
func (c *Client) buildPythonCommand(audioPath string, options *DiarizationOptions) string {
|
||||
numSpeakersStr := "None"
|
||||
if options.NumSpeakers > 0 {
|
||||
numSpeakersStr = fmt.Sprintf("%d", options.NumSpeakers)
|
||||
}
|
||||
|
||||
pythonCode := fmt.Sprintf(`
|
||||
import json
|
||||
import sys
|
||||
import os
|
||||
import warnings
|
||||
import numpy as np
|
||||
|
||||
# Suppress warnings
|
||||
warnings.filterwarnings("ignore")
|
||||
|
||||
# Redirect both stdout and stderr during imports to suppress library noise
|
||||
old_stdout = sys.stdout
|
||||
old_stderr = sys.stderr
|
||||
sys.stdout = open(os.devnull, 'w')
|
||||
sys.stderr = open(os.devnull, 'w')
|
||||
|
||||
from resemblyzer import VoiceEncoder, preprocess_wav
|
||||
from sklearn.cluster import SpectralClustering, AgglomerativeClustering
|
||||
import librosa
|
||||
|
||||
# Initialize voice encoder while stdout is suppressed (it prints loading message)
|
||||
encoder = VoiceEncoder()
|
||||
|
||||
# Restore stdout/stderr
|
||||
sys.stdout = old_stdout
|
||||
sys.stderr = old_stderr
|
||||
|
||||
# Configuration
|
||||
AUDIO_PATH = %q
|
||||
NUM_SPEAKERS = %s
|
||||
SEGMENT_DURATION = 1.5 # seconds per segment for embedding extraction
|
||||
HOP_DURATION = 0.75 # hop between segments
|
||||
|
||||
# Load audio
|
||||
audio, sr = librosa.load(AUDIO_PATH, sr=16000)
|
||||
duration = len(audio) / sr
|
||||
|
||||
# Extract embeddings for overlapping segments
|
||||
embeddings = []
|
||||
timestamps = []
|
||||
current_time = 0.0
|
||||
|
||||
while current_time + SEGMENT_DURATION <= duration:
    start_sample = int(current_time * sr)
    end_sample = int((current_time + SEGMENT_DURATION) * sr)
    segment = audio[start_sample:end_sample]

    # Skip silent segments
    if np.abs(segment).mean() > 0.01:
        try:
            wav = preprocess_wav(segment, source_sr=sr)
            if len(wav) > 0:
                embedding = encoder.embed_utterance(wav)
                embeddings.append(embedding)
                timestamps.append((current_time, current_time + SEGMENT_DURATION))
        except:
            pass

    current_time += HOP_DURATION

# Handle edge cases
if len(embeddings) == 0:
    print(json.dumps({"speakers": [], "num_speakers": 0}))
    sys.exit(0)

embeddings = np.array(embeddings)

# Determine number of speakers
if NUM_SPEAKERS is None or NUM_SPEAKERS <= 0:
    # Auto-detect using silhouette score
    from sklearn.metrics import silhouette_score
    best_n = 2
    best_score = -1
    for n in range(2, min(6, len(embeddings))):
        try:
            clustering = AgglomerativeClustering(n_clusters=n)
            labels = clustering.fit_predict(embeddings)
            score = silhouette_score(embeddings, labels)
            if score > best_score:
                best_score = score
                best_n = n
        except:
            pass
    num_speakers = best_n
else:
    num_speakers = NUM_SPEAKERS

# Cluster embeddings
try:
    if len(embeddings) >= num_speakers:
        clustering = AgglomerativeClustering(n_clusters=num_speakers)
        labels = clustering.fit_predict(embeddings)
    else:
        labels = list(range(len(embeddings)))
        num_speakers = len(embeddings)
except Exception as e:
    labels = [0] * len(embeddings)
    num_speakers = 1

# Build speaker segments with merging of consecutive same-speaker segments
speaker_segments = []
prev_speaker = None
prev_start = None
prev_end = None

for i, (start, end) in enumerate(timestamps):
    speaker = f"Speaker {labels[i] + 1}"

    if speaker == prev_speaker and prev_end is not None:
        # Extend previous segment if same speaker and close in time
        if start - prev_end < 0.5:
            prev_end = end
            continue

    # Save previous segment
    if prev_speaker is not None:
        speaker_segments.append({
            "speaker": prev_speaker,
            "start": prev_start,
            "end": prev_end
        })

    prev_speaker = speaker
    prev_start = start
    prev_end = end

# Don't forget the last segment
if prev_speaker is not None:
    speaker_segments.append({
        "speaker": prev_speaker,
        "start": prev_start,
        "end": prev_end
    })

print(json.dumps({
    "speakers": speaker_segments,
    "num_speakers": num_speakers
}))
|
||||
`, audioPath, numSpeakersStr)
|
||||
|
||||
return pythonCode
|
||||
}
|
||||
162
internal/whisper/client.go
Normal file
@@ -0,0 +1,162 @@
package whisper

import (
	"bytes"
	"encoding/json"
	"fmt"
	"os/exec"
)

// ModelSize represents the different Whisper model sizes
type ModelSize string

const (
	ModelTiny   ModelSize = "tiny"
	ModelBase   ModelSize = "base"
	ModelSmall  ModelSize = "small"
	ModelMedium ModelSize = "medium"
	ModelLarge  ModelSize = "large"
	ModelTurbo  ModelSize = "turbo"
)

// TranscriptionResult contains the transcription output
type TranscriptionResult struct {
	Text     string    `json:"text"`
	Segments []Segment `json:"segments"`
	Language string    `json:"language"`
	Duration float64   `json:"duration"`
}

// Segment represents a segment of transcription with timestamps
type Segment struct {
	Start   float64 `json:"start"`
	End     float64 `json:"end"`
	Text    string  `json:"text"`
	Words   []Word  `json:"words,omitempty"`
	Speaker string  `json:"speaker,omitempty"`
}

// Word represents a word with timestamp
type Word struct {
	Start float64 `json:"start"`
	End   float64 `json:"end"`
	Word  string  `json:"word"`
}

// Client is the Whisper client that handles transcription
type Client struct {
	ModelPath string
	ModelSize ModelSize
}

// NewClient creates a new Whisper client
func NewClient(modelSize ModelSize) *Client {
	return &Client{
		ModelSize: modelSize,
	}
}

// Transcribe runs Whisper on an audio file and returns the parsed transcription
func (c *Client) Transcribe(audioPath string, options *TranscriptionOptions) (*TranscriptionResult, error) {
	if options == nil {
		options = &TranscriptionOptions{}
	}

	// Build the Python command
	cmd := exec.Command("python3", "-c", c.buildPythonCommand(audioPath, options))

	// Capture stdout and stderr
	var out bytes.Buffer
	var errBuf bytes.Buffer
	cmd.Stdout = &out
	cmd.Stderr = &errBuf

	// Execute the command; wrap the error with %w so callers can unwrap it
	err := cmd.Run()
	if err != nil {
		return nil, fmt.Errorf("transcription failed: %w, stderr: %s", err, errBuf.String())
	}

	// Parse the JSON output
	var result TranscriptionResult
	err = json.Unmarshal(out.Bytes(), &result)
	if err != nil {
		return nil, fmt.Errorf("failed to parse transcription output: %w", err)
	}

	return &result, nil
}

// buildPythonCommand constructs the Python command for Whisper
func (c *Client) buildPythonCommand(audioPath string, options *TranscriptionOptions) string {
	// Convert Go bool to Python bool string
	verboseStr := "False"
	if options.Verbose {
		verboseStr = "True"
	}

	// Handle language option; %q quotes and escapes the value
	langStr := "None"
	if options.Language != "" && options.Language != "auto" {
		langStr = fmt.Sprintf("%q", options.Language)
	}

	// The audio path is formatted with %q so that quotes or backslashes in
	// the path cannot break out of the generated Python string literal.
	pythonCode := fmt.Sprintf(`
import whisper
import json
import sys
import os
import warnings

# Suppress warnings and stdout during transcription
warnings.filterwarnings("ignore")
old_stdout = sys.stdout
sys.stdout = open(os.devnull, 'w')

# Load model
model = whisper.load_model("%s")

# Transcribe
result = model.transcribe(%q,
    language=%s,
    verbose=%s,
    temperature=%.1f,
    best_of=%d)

# Restore stdout for JSON output
sys.stdout = old_stdout

# Output as JSON
print(json.dumps({
    "text": result["text"],
    "language": result.get("language", ""),
    # Whisper's result dict carries no "duration" key, so fall back to the
    # end time of the last segment as an approximation.
    "duration": result.get("duration", result["segments"][-1]["end"] if result.get("segments") else 0.0),
    "segments": [{
        "start": seg["start"],
        "end": seg["end"],
        "text": seg["text"],
        "words": seg.get("words", [])
    } for seg in result.get("segments", [])]
}))
`, c.ModelSize, audioPath, langStr, verboseStr, options.Temperature, options.BestOf)

	return pythonCode
}

// TranscriptionOptions contains options for transcription
type TranscriptionOptions struct {
	Language    string  // Language code or "auto"
	Verbose     bool    // Passed through as Whisper's verbose flag
	Temperature float64 // Temperature for sampling (higher = more creative)
	BestOf      int     // Number of candidates when sampling with temperature > 0
}

// DefaultTranscriptionOptions returns default transcription options
func DefaultTranscriptionOptions() *TranscriptionOptions {
	return &TranscriptionOptions{
		Language:    "auto",
		Verbose:     false,
		Temperature: 0.0,
		BestOf:      5,
	}
}
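`buildPythonCommand` interpolates the audio path into the Python source text. An alternative that sidesteps quoting entirely is to keep the script constant and hand the path to Python as a command-line argument, which `python3 -c` exposes as `sys.argv[1]`. The following is a minimal sketch of that idea, not the code from this commit; `script` and `buildCommand` are hypothetical names, and the script only echoes its argument instead of calling Whisper.

```go
package main

import (
	"fmt"
	"os/exec"
)

// script is a constant Python program. With "python3 -c", sys.argv[0] is "-c"
// and the first extra argument is available as sys.argv[1].
const script = `
import sys
print(sys.argv[1])
`

// buildCommand passes the path as a separate argv entry, so no escaping of
// quotes, backslashes, or spaces is needed.
func buildCommand(audioPath string) *exec.Cmd {
	return exec.Command("python3", "-c", script, audioPath)
}

func main() {
	cmd := buildCommand(`my "odd" file.mp3`)
	// Print the last argument to show the path is carried unchanged.
	fmt.Println(cmd.Args[len(cmd.Args)-1])
}
```

Because the path never becomes part of the Python source, a file name containing quotes cannot alter the program that runs.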
9
main.go
Normal file
@@ -0,0 +1,9 @@
package main

import (
	"transcribe/cmd"
)

func main() {
	cmd.Execute()
}
56
pkg/audio/audio.go
Normal file
@@ -0,0 +1,56 @@
package audio

import (
	"errors"
	"os"
	"path/filepath"
	"strings"
)

// SupportedAudioFormats lists the audio formats that can be processed
type SupportedAudioFormats []string

// DefaultSupportedFormats holds the lowercase file extensions accepted by default
var DefaultSupportedFormats = SupportedAudioFormats{
	".mp3",
	".wav",
	".flac",
	".m4a",
	".ogg",
	".opus",
}

// AudioFile describes an audio file on disk: its path, extension, and size in bytes
type AudioFile struct {
	Path   string
	Format string
	Size   int64
}

// NewAudioFile stats the file at path and rejects unsupported extensions
func NewAudioFile(path string) (*AudioFile, error) {
	fileInfo, err := os.Stat(path)
	if err != nil {
		return nil, err
	}

	ext := filepath.Ext(path)
	if !IsSupported(ext) {
		return nil, errors.New("unsupported audio format: " + ext)
	}

	return &AudioFile{
		Path:   path,
		Format: ext,
		Size:   fileInfo.Size(),
	}, nil
}

// IsSupported checks if the given extension is in supported formats
// (the comparison is case-insensitive)
func IsSupported(ext string) bool {
	ext = strings.ToLower(ext)
	for _, format := range DefaultSupportedFormats {
		if ext == format {
			return true
		}
	}
	return false
}
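The extension check in `IsSupported` lowercases its input before comparing, so `.MP3` and `.mp3` are treated alike. A compact variant of the same check, assuming Go 1.21 or newer for the standard `slices` package, might look like the sketch below. The names `supported` and `isSupportedPath` are illustrative only and are not part of this commit.

```go
package main

import (
	"fmt"
	"path/filepath"
	"slices"
	"strings"
)

// supported mirrors the extension list from pkg/audio/audio.go.
var supported = []string{".mp3", ".wav", ".flac", ".m4a", ".ogg", ".opus"}

// isSupportedPath extracts the extension from a path, lowercases it, and
// looks it up in the list.
func isSupportedPath(path string) bool {
	ext := strings.ToLower(filepath.Ext(path))
	return slices.Contains(supported, ext)
}

func main() {
	for _, p := range []string{"talk.MP3", "notes.txt", "noext"} {
		fmt.Printf("%s supported: %t\n", p, isSupportedPath(p))
	}
}
```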
33
pkg/output/formatter.go
Normal file
@@ -0,0 +1,33 @@
package output

import (
	"transcribe/internal/whisper"
)

// Formatter interface for converting transcription results to various output formats
type Formatter interface {
	Format(result *whisper.TranscriptionResult) (string, error)
}

// FormatType represents the output format type
type FormatType string

const (
	FormatText FormatType = "text"
	FormatSRT  FormatType = "srt"
	FormatJSON FormatType = "json"
)

// NewFormatter creates a formatter for the given format type
func NewFormatter(format FormatType) Formatter {
	switch format {
	case FormatSRT:
		return &SRTFormatter{}
	case FormatJSON:
		return &JSONFormatter{}
	case FormatText:
		fallthrough
	default:
		return &TextFormatter{}
	}
}
19
pkg/output/json.go
Normal file
@@ -0,0 +1,19 @@
package output

import (
	"encoding/json"

	"transcribe/internal/whisper"
)

// JSONFormatter formats transcription results as JSON
type JSONFormatter struct{}

// Format converts transcription result to JSON format
func (f *JSONFormatter) Format(result *whisper.TranscriptionResult) (string, error) {
	data, err := json.MarshalIndent(result, "", " ")
	if err != nil {
		return "", err
	}
	return string(data), nil
}
49
pkg/output/srt.go
Normal file
@@ -0,0 +1,49 @@
package output

import (
	"fmt"
	"strings"

	"transcribe/internal/whisper"
)

// SRTFormatter formats transcription results as SRT subtitles
type SRTFormatter struct{}

// Format converts transcription result to SRT format
func (f *SRTFormatter) Format(result *whisper.TranscriptionResult) (string, error) {
	var builder strings.Builder

	for i, seg := range result.Segments {
		// Subtitle number (1-indexed)
		builder.WriteString(fmt.Sprintf("%d\n", i+1))

		// Timestamps in SRT format: HH:MM:SS,mmm --> HH:MM:SS,mmm
		startTime := formatSRTTimestamp(seg.Start)
		endTime := formatSRTTimestamp(seg.End)
		builder.WriteString(fmt.Sprintf("%s --> %s\n", startTime, endTime))

		// Text with optional speaker label
		text := strings.TrimSpace(seg.Text)
		if seg.Speaker != "" {
			text = fmt.Sprintf("[%s] %s", seg.Speaker, text)
		}
		builder.WriteString(text)
		builder.WriteString("\n\n")
	}

	return strings.TrimSuffix(builder.String(), "\n"), nil
}

// formatSRTTimestamp converts seconds to SRT timestamp format (HH:MM:SS,mmm)
func formatSRTTimestamp(seconds float64) string {
	totalMs := int64(seconds * 1000)
	ms := totalMs % 1000
	totalSeconds := totalMs / 1000
	s := totalSeconds % 60
	totalMinutes := totalSeconds / 60
	m := totalMinutes % 60
	h := totalMinutes / 60

	return fmt.Sprintf("%02d:%02d:%02d,%03d", h, m, s, ms)
}
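`formatSRTTimestamp` converts seconds to milliseconds by truncating `seconds * 1000`. Since many decimal fractions are not exactly representable as `float64`, truncation can occasionally land one millisecond low. A possible variant that rounds to the nearest millisecond first is sketched here; `srtTimestamp` is a hypothetical name, and rounding instead of truncating is a design choice of this sketch rather than something the commit does.

```go
package main

import (
	"fmt"
	"math"
)

// srtTimestamp formats seconds as HH:MM:SS,mmm, rounding to the nearest
// millisecond before splitting the value into fields.
func srtTimestamp(seconds float64) string {
	totalMs := int64(math.Round(seconds * 1000))
	h := totalMs / 3600000
	m := totalMs % 3600000 / 60000
	s := totalMs % 60000 / 1000
	ms := totalMs % 1000
	return fmt.Sprintf("%02d:%02d:%02d,%03d", h, m, s, ms)
}

func main() {
	fmt.Println(srtTimestamp(3661.5)) // prints 01:01:01,500
	fmt.Println(srtTimestamp(0))      // prints 00:00:00,000
}
```

Working from a single integer millisecond count also means a value that rounds up to a full second carries correctly into the seconds, minutes, and hours fields.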
41
pkg/output/text.go
Normal file
@@ -0,0 +1,41 @@
package output

import (
	"fmt"
	"strings"

	"transcribe/internal/whisper"
)

// TextFormatter formats transcription results as plain text with timestamps
type TextFormatter struct{}

// Format converts transcription result to plain text with timestamps
func (f *TextFormatter) Format(result *whisper.TranscriptionResult) (string, error) {
	var builder strings.Builder

	for _, seg := range result.Segments {
		// Format: [MM:SS.s - MM:SS.s] [Speaker] Text
		startTime := formatTextTimestamp(seg.Start)
		endTime := formatTextTimestamp(seg.End)

		text := strings.TrimSpace(seg.Text)
		if seg.Speaker != "" {
			builder.WriteString(fmt.Sprintf("[%s - %s] [%s] %s\n", startTime, endTime, seg.Speaker, text))
		} else {
			builder.WriteString(fmt.Sprintf("[%s - %s] %s\n", startTime, endTime, text))
		}
	}

	return strings.TrimSuffix(builder.String(), "\n"), nil
}

// formatTextTimestamp converts seconds to MM:SS.s format.
// Minutes are not wrapped at 60, so audio longer than an hour shows e.g. 62:05.0.
func formatTextTimestamp(seconds float64) string {
	totalSeconds := int(seconds)
	m := totalSeconds / 60
	s := totalSeconds % 60
	tenths := int((seconds - float64(totalSeconds)) * 10)

	return fmt.Sprintf("%02d:%02d.%d", m, s, tenths)
}
84
pkg/progress/spinner.go
Normal file
@@ -0,0 +1,84 @@
package progress

import (
	"fmt"
	"sync"
	"time"
)

// Spinner displays an animated spinner with a message
type Spinner struct {
	message  string
	frames   []string
	interval time.Duration
	stop     chan struct{}
	done     chan struct{}
	mu       sync.Mutex
	running  bool
}

// NewSpinner creates a new spinner with the given message
func NewSpinner(message string) *Spinner {
	return &Spinner{
		message:  message,
		frames:   []string{"⠋", "⠙", "⠹", "⠸", "⠼", "⠴", "⠦", "⠧", "⠇", "⠏"},
		interval: 80 * time.Millisecond,
		stop:     make(chan struct{}),
		done:     make(chan struct{}),
	}
}

// Start begins the spinner animation
func (s *Spinner) Start() {
	s.mu.Lock()
	if s.running {
		s.mu.Unlock()
		return
	}
	s.running = true
	s.mu.Unlock()

	go func() {
		i := 0
		for {
			select {
			case <-s.stop:
				// Clear the line and signal done
				fmt.Print("\r\033[K")
				close(s.done)
				return
			default:
				// Read the message under the lock: UpdateMessage may
				// change it concurrently from another goroutine.
				s.mu.Lock()
				msg := s.message
				s.mu.Unlock()
				fmt.Printf("\r%s %s", s.frames[i%len(s.frames)], msg)
				i++
				time.Sleep(s.interval)
			}
		}
	}()
}

// Stop stops the spinner and clears the line
func (s *Spinner) Stop() {
	s.mu.Lock()
	if !s.running {
		s.mu.Unlock()
		return
	}
	s.running = false
	s.mu.Unlock()

	close(s.stop)
	<-s.done
}

// StopWithMessage stops the spinner and prints a final message
func (s *Spinner) StopWithMessage(message string) {
	s.Stop()
	fmt.Println(message)
}

// UpdateMessage updates the spinner message while running
func (s *Spinner) UpdateMessage(message string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.message = message
}