## rbb Features (for GPU acceleration and persistent cache)

To support voice activity detection (`vad_filter`), the faster_whisper engine has to be used:

- [SYSTRAN/faster-whisper](https://github.com/SYSTRAN/faster-whisper)@[v1.1.0](https://github.com/SYSTRAN/faster-whisper/releases/tag/v1.1.0)
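
For example, in the `.env` file (a minimal sketch; `ASR_ENGINE` is documented under **Environment Variables** below):

```shell
# .env — select the engine that supports vad_filter
ASR_ENGINE=faster_whisper
```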


Before starting the container, create a `.env` file with the content from the provided `.env.example` file.
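
For example, from the repository root:

```shell
cp .env.example .env
# then edit .env and adjust ASR_ENGINE, ASR_MODEL, etc. as needed
```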

The container can then be started with the following command:

```shell
# -d: run detached; -p: expose the API on port 9000
# --env-file: load configuration from the .env file created above
# --gpus all: give the container access to all NVIDIA GPUs
# -v $PWD/cache:/data/whisper: persistent model cache on the host
# -v ISILON_transcript_files:/app/files: named volume with the media files to transcribe
docker run -d -p 9000:9000 \
  --env-file ./.env \
  --gpus all \
  -v $PWD/cache:/data/whisper \
  -v ISILON_transcript_files:/app/files \
  image_name
```
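
Once the container is up, a quick reachability check (a sketch; assumes the service exposes the upstream project's FastAPI Swagger UI at `/docs`):

```shell
# Should return HTTP 200 once the service has started
curl -I http://localhost:9000/docs
```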

## Environment Variables

Key configuration options (see `.env.example` for default values; an example sketch follows this list):

- `ASR_ENGINE`: Engine selection (openai_whisper, faster_whisper, whisperx)
- `ASR_MODEL`: Model selection (tiny, base, small, medium, large-v3, etc.)
- `ASR_MODEL_PATH`: Custom path to store/load models
- `ASR_DEVICE`: Device selection (cuda, cpu)
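
A minimal `.env` sketch using these variables (values are illustrative; take the actual defaults from `.env.example`):

```shell
# Illustrative values; see .env.example for the real defaults
ASR_ENGINE=faster_whisper     # openai_whisper | faster_whisper | whisperx
ASR_MODEL=large-v3            # tiny | base | small | medium | large-v3 | ...
ASR_MODEL_PATH=/data/whisper  # matches the cache volume mounted in the docker run above
ASR_DEVICE=cuda               # cuda | cpu
```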

## Request URL Query Params

| Name            | Values                                        | Description                                                          |
|-----------------|-----------------------------------------------|----------------------------------------------------------------------|
| file_name       | `text`                                        | Basename of the audio or video file to transcribe                    |
| output          | `text` (default), `json`, `vtt`, `srt`, `tsv` | Output format                                                        |
| task            | `transcribe`, `translate`                     | Task type: transcribe in the source language or translate to English |
| language        | `en` (default: automatic detection)           | Source language code (see supported languages)                       |
| word_timestamps | `false` (default)                             | Enable word-level timestamps (Faster Whisper only)                   |
| vad_filter      | `false` (default)                             | Enable voice activity detection filtering (Faster Whisper only)      |
| encode          | `true` (default)                              | Encode audio through FFmpeg before processing                        |
| diarize         | `false` (default)                             | Enable speaker diarization (WhisperX only)                           |
| min_speakers    | `null` (default)                              | Minimum number of speakers for diarization (WhisperX only)           |
| max_speakers    | `null` (default)                              | Maximum number of speakers for diarization (WhisperX only)           |
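
A hypothetical request using these parameters (a sketch; assumes the upstream project's `POST /asr` endpoint, and `interview.mp4` is a placeholder for a file in the volume mounted at `/app/files`):

```shell
# Transcribe a file from the mounted volume, with VAD filtering and SRT output
curl -X POST "http://localhost:9000/asr?file_name=interview.mp4&task=transcribe&output=srt&vad_filter=true"
```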

## Documentation

For complete documentation, visit:
[https://ahmetoner.github.io/whisper-asr-webservice](https://ahmetoner.github.io/whisper-asr-webservice)

## Info About Required NVIDIA Libraries

See the GPU section of the faster-whisper README: [github.com/SYSTRAN/faster-whisper#gpu](https://github.com/SYSTRAN/faster-whisper?tab=readme-ov-file#gpu)
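
To verify that the host's NVIDIA driver and Container Toolkit are working before starting the service (a sketch; the `nvidia/cuda` image tag is illustrative):

```shell
# Should print the GPU table from inside a container
docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi
```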


## Info About Speaker Diarization (Detect Different Speakers)

Set `ASR_ENGINE=whisperx` in the `.env` file.

A Hugging Face account and an access token are required ([create a token here](https://huggingface.co/settings/tokens)).

You need to request access to two gated models:

- [Speaker Diarization](https://huggingface.co/pyannote/speaker-diarization-3.1)
- [Segmentation](https://huggingface.co/pyannote/segmentation-3.0)

Also see **Request URL Query Params** in this README for the diarization-related parameters.
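
Putting it together, a minimal `.env` sketch for diarization (the `HF_TOKEN` variable name is an assumption here; verify in `.env.example` how the token is actually passed to the service):

```shell
# .env — diarization setup (HF_TOKEN is an assumed variable name; check .env.example)
ASR_ENGINE=whisperx
HF_TOKEN=hf_xxx  # Hugging Face token with access to the gated pyannote models
```

And a hypothetical request enabling diarization (`meeting.mp4` is a placeholder file name):

```shell
curl -X POST "http://localhost:9000/asr?file_name=meeting.mp4&diarize=true&min_speakers=2&max_speakers=4&output=json"
```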

## Credits

- This software uses libraries from the [FFmpeg](http://ffmpeg.org) project under the [LGPLv2.1](http://www.gnu.org/licenses/old-licenses/lgpl-2.1.html)