mirror of
https://github.com/dograh-hq/dograh.git
synced 2026-06-07 07:55:16 +02:00
135 lines
4.4 KiB
Markdown
# STT Evaluation Benchmark

Benchmark for comparing Speech-to-Text providers using **WebSocket streaming** with focus on:

- **Speaker diarization** - identifying who said what
- **Keyterm boosting** - improving recognition of specific terms (Deepgram)

## Providers

| Provider | Diarization | Keyterm Boost | Streaming |
|----------|-------------|---------------|-----------|
| Deepgram | Yes | Yes | WebSocket (v1/v2) |
| Speechmatics | Yes | Additional vocab | WebSocket RT |

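
As a rough illustration of how the feature columns might translate into requests, the sketch below builds a Deepgram streaming URL and a Speechmatics `StartRecognition` message. The endpoint, query parameters, and message fields follow the providers' public API docs as understood here and should be checked against them; the helper names `build_deepgram_url` and `build_speechmatics_start` are hypothetical, and the benchmark's own provider code may configure things differently (for example, it may also set a `model` parameter or use the v2 endpoint).

```python
from urllib.parse import urlencode

# Assumed endpoint; the benchmark's provider may use a different version.
DEEPGRAM_WS_URL = "wss://api.deepgram.com/v1/listen"


def build_deepgram_url(sample_rate=8000, diarize=True, keyterms=(), language="en"):
    """Return a WebSocket URL with streaming options as query parameters."""
    params = [
        ("encoding", "linear16"),          # raw PCM16 audio
        ("sample_rate", str(sample_rate)),
        ("channels", "1"),
        ("language", language),
        ("diarize", "true" if diarize else "false"),
    ]
    # Each keyterm is sent as its own repeated `keyterm` parameter.
    params.extend(("keyterm", term) for term in keyterms)
    return f"{DEEPGRAM_WS_URL}?{urlencode(params)}"


def build_speechmatics_start(sample_rate=8000, diarize=True, keyterms=(), language="en"):
    """Return the first JSON message sent on a Speechmatics real-time session."""
    config = {"language": language}
    if diarize:
        config["diarization"] = "speaker"
    if keyterms:
        # Speechmatics has no keyterm boost; custom words go in additional_vocab.
        config["additional_vocab"] = [{"content": term} for term in keyterms]
    return {
        "message": "StartRecognition",
        "audio_format": {
            "type": "raw",
            "encoding": "pcm_s16le",
            "sample_rate": sample_rate,
        },
        "transcription_config": config,
    }


print(build_deepgram_url(keyterms=["Dograh", "Pipecat"]))
print(build_speechmatics_start(keyterms=["Dograh", "Pipecat"]))
```

Keeping request construction in small pure functions like these makes it easy to compare what each provider is actually asked to do before any audio is sent.
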
## Setup

```bash
# Install dependencies
pip install websockets

# Set API keys
export DEEPGRAM_API_KEY="your-key"
export SPEECHMATICS_API_KEY="your-key"
```

**Note:** Requires `ffmpeg` installed for audio conversion to PCM16.

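
A quick way to confirm the setup before running the benchmark is a small preflight check. The sketch below is not part of the repository; `missing_prerequisites` is a hypothetical helper that only looks for the two environment variables named above and for an `ffmpeg` binary on `PATH`.

```python
import os
import shutil

REQUIRED_ENV_VARS = ("DEEPGRAM_API_KEY", "SPEECHMATICS_API_KEY")


def missing_prerequisites(env=None, which=shutil.which):
    """Return a list of human-readable problems; an empty list means ready."""
    env = os.environ if env is None else env
    problems = [
        f"environment variable {name} is not set"
        for name in REQUIRED_ENV_VARS
        if not env.get(name)
    ]
    # shutil.which returns None when the executable is not on PATH.
    if which("ffmpeg") is None:
        problems.append("ffmpeg was not found on PATH")
    return problems


for problem in missing_prerequisites():
    print(f"Missing: {problem}")
```

If only one provider is being tested with `--providers`, the key for the other provider is presumably not needed, so treat a missing key as a warning in that case.
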
## Usage

Run from the project root directory:

```bash
# Test both providers with diarization
python -m evals.stt.benchmark audio/multi_speaker.m4a --diarize

# Test only Deepgram
python -m evals.stt.benchmark audio/multi_speaker.m4a --diarize --providers deepgram

# Test with keyterm boosting (Deepgram)
python -m evals.stt.benchmark audio/multi_speaker.m4a --diarize --keyterms "Dograh" "Pipecat"

# Use different sample rate (default: 8000 Hz)
python -m evals.stt.benchmark audio/multi_speaker.m4a --diarize --sample-rate 16000

# Show word-level timings
python -m evals.stt.benchmark audio/multi_speaker.m4a --diarize --show-words

# Save results to JSON
python -m evals.stt.benchmark audio/multi_speaker.m4a --diarize --save
```

## CLI Options

| Option | Description |
|--------|-------------|
| `audio_file` | Path to audio file (relative to `evals/stt/` or absolute) |
| `--providers` | Providers to test: `deepgram`, `speechmatics` (default: both) |
| `--diarize` | Enable speaker diarization |
| `--keyterms` | Keywords to boost (Deepgram) / additional vocab (Speechmatics) |
| `--language` | Language code (default: `en`) |
| `--sample-rate` | Audio sample rate for streaming, in Hz (default: `8000`) |
| `--show-words` | Show individual word timings |
| `--save` | Save results to JSON in `results/` |

## Directory Structure

```
evals/stt/
├── audio/                        # Audio test files
│   └── multi_speaker.m4a
├── results/                      # Saved benchmark results (JSON)
├── providers/                    # STT provider implementations
│   ├── base.py                   # Base classes
│   ├── deepgram_provider.py      # WebSocket streaming
│   └── speechmatics_provider.py  # WebSocket streaming
├── audio_streamer.py             # PCM16 audio file streamer
├── benchmark.py                  # Main runner script
└── README.md
```

## How It Works

1. **Audio Conversion**: The `AudioStreamer` converts any audio file to raw PCM16 using ffmpeg
2. **WebSocket Connection**: Providers connect to their respective WebSocket APIs
3. **Streaming**: Audio is sent in chunks (configurable sample rate, default 8 kHz)
4. **Result Collection**: Transcripts and speaker info are collected from WebSocket responses
5. **Comparison**: Results are parsed into a common format for comparison

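
Steps 1 and 3 can be sketched roughly as follows. This is not the repository's `AudioStreamer`; the function names, the mono assumption, and the 20 ms chunk length (`chunk_ms`) are illustrative choices, since the actual chunk size used by the benchmark is not stated here.

```python
import subprocess

BYTES_PER_SAMPLE = 2  # PCM16 means 16-bit signed samples; mono is assumed


def ffmpeg_pcm16_command(path, sample_rate=8000):
    """Build an ffmpeg command that decodes `path` to raw mono PCM16 on stdout."""
    return [
        "ffmpeg", "-v", "error", "-i", path,
        "-f", "s16le", "-acodec", "pcm_s16le",
        "-ac", "1", "-ar", str(sample_rate),
        "pipe:1",
    ]


def decode_to_pcm16(path, sample_rate=8000):
    """Run ffmpeg and return the decoded audio as bytes (requires ffmpeg)."""
    result = subprocess.run(
        ffmpeg_pcm16_command(path, sample_rate),
        capture_output=True,
        check=True,
    )
    return result.stdout


def iter_chunks(pcm, sample_rate=8000, chunk_ms=20):
    """Yield fixed-duration slices of PCM16 audio; the last one may be shorter."""
    chunk_bytes = sample_rate * BYTES_PER_SAMPLE * chunk_ms // 1000
    for start in range(0, len(pcm), chunk_bytes):
        yield pcm[start:start + chunk_bytes]


# One second of silence at 8 kHz stands in for decoded audio here.
silence = bytes(8000 * BYTES_PER_SAMPLE)
chunks = list(iter_chunks(silence))
print(len(chunks), "chunks of", len(chunks[0]), "bytes")
```

To approximate a live call, a streamer would typically wait about `chunk_ms` milliseconds between sends so that audio reaches the provider at real-time speed.
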
## Output Example

```
Audio file: /path/to/audio/multi_speaker.m4a
Providers: ['deepgram', 'speechmatics']
Diarization: True
Sample rate: 8000 Hz

============================================================
Provider: DEEPGRAM
============================================================

Duration: 45.32s
Speakers detected: 2 - ['0', '1']

Transcript:
Hello, welcome to the demo...

--- Speaker Segments ---
[0.0s] Speaker 0: Hello, welcome to the demo.
[2.5s] Speaker 1: Thanks for having me.
...

============================================================
COMPARISON SUMMARY
============================================================

Provider        Duration   Speakers   Words
---------------------------------------------
deepgram           45.32          2     312
speechmatics       45.32          2     308
```

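
The "Speaker Segments" part of the output can be derived from word-level results by merging consecutive words that share a speaker label. The sketch below shows one way to do that; the `Word` dataclass and `group_by_speaker` are hypothetical and do not reflect the benchmark's actual result classes.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """Hypothetical word-level result: text, start time in seconds, speaker label."""
    text: str
    start: float
    speaker: str


def group_by_speaker(words):
    """Merge consecutive words from the same speaker into segments."""
    segments = []
    for word in words:
        if segments and segments[-1]["speaker"] == word.speaker:
            # Same speaker keeps talking: extend the current segment.
            segments[-1]["text"] += " " + word.text
        else:
            # Speaker changed (or first word): open a new segment.
            segments.append(
                {"speaker": word.speaker, "start": word.start, "text": word.text}
            )
    return segments


words = [
    Word("Hello,", 0.0, "0"),
    Word("welcome", 0.4, "0"),
    Word("Thanks", 2.5, "1"),
    Word("for", 2.8, "1"),
]
for seg in group_by_speaker(words):
    print(f"[{seg['start']:.1f}s] Speaker {seg['speaker']}: {seg['text']}")
```
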
## Adding New Providers

1. Create a new file in `providers/` (e.g., `whisper_provider.py`)
2. Implement the `STTProvider` abstract class with WebSocket streaming
3. Use `AudioStreamer` for PCM16 conversion
4. Add to `providers/__init__.py`
5. Add to `benchmark.py` provider choices

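
The overall shape of a new provider might look like the sketch below. The real `STTProvider` interface lives in `providers/base.py` and is not reproduced here, so the base class, the `transcribe` signature, the `TranscriptResult` dataclass, and the `PROVIDERS` registry are all stand-ins invented for illustration; check `base.py` for the actual method names before copying anything.

```python
import asyncio
from abc import ABC, abstractmethod
from dataclasses import dataclass, field


@dataclass
class TranscriptResult:
    """Hypothetical common result format shared by all providers."""
    provider: str
    transcript: str
    speakers: list = field(default_factory=list)


class STTProvider(ABC):
    """Stand-in for the abstract class in providers/base.py (assumed shape)."""
    name: str

    @abstractmethod
    async def transcribe(self, pcm_chunks, sample_rate, diarize=False, keyterms=None):
        """Stream PCM16 chunks to the service and return a TranscriptResult."""


class EchoProvider(STTProvider):
    """Toy provider that reports how much audio it received instead of text."""
    name = "echo"

    async def transcribe(self, pcm_chunks, sample_rate, diarize=False, keyterms=None):
        total_bytes = 0
        for chunk in pcm_chunks:
            total_bytes += len(chunk)
            # A real provider would send the chunk over its WebSocket here.
            await asyncio.sleep(0)
        seconds = total_bytes / (sample_rate * 2)  # 2 bytes per PCM16 sample
        speakers = ["0"] if diarize else []
        return TranscriptResult(self.name, f"<{seconds:.2f}s of audio>", speakers)


# Hypothetical registry, analogous to the provider choices in benchmark.py.
PROVIDERS = {"echo": EchoProvider}

result = asyncio.run(
    PROVIDERS["echo"]().transcribe([bytes(320)] * 50, sample_rate=8000, diarize=True)
)
print(result)
```

Returning one shared result type from every provider is what lets the comparison step treat Deepgram, Speechmatics, and any new provider uniformly.
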
## API Documentation

- Deepgram Streaming: https://developers.deepgram.com/docs/live-streaming-audio
- Deepgram Diarization: https://developers.deepgram.com/docs/diarization
- Deepgram Keyterms: https://developers.deepgram.com/docs/keyterm
- Speechmatics RT API: https://docs.speechmatics.com/rt-api-ref
- Speechmatics Diarization: https://docs.speechmatics.com/features/diarization