
Speech SDK

Use speech-sdk (Apache 2.0) to reach additional cloud TTS providers (ElevenLabs, Cartesia, Hume, Deepgram, Google Gemini TTS, Inworld, and more) with your own provider API keys. Requests go from the OpenReader server directly to the provider's API; no extra account or proxy is involved.

Models use the provider/model format. The API key you enter belongs to the provider named by the model prefix: for elevenlabs/eleven_multilingual_v2 enter an ElevenLabs key, for cartesia/sonic-3.5 a Cartesia key, and so on.
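As a rough illustration of the naming convention above, the sketch below splits a model string at the first slash to find the provider whose key is required. The helper name `split_model` is hypothetical; it is not part of speech-sdk or OpenReader.

```python
def split_model(model: str) -> tuple[str, str]:
    """Split a 'provider/model' string into (provider, model_name).

    Hypothetical helper for illustration only. Splits on the first slash,
    so any later slashes stay in the model name.
    """
    provider, sep, name = model.partition("/")
    if not sep or not provider or not name:
        raise ValueError(f"expected 'provider/model', got {model!r}")
    return provider, name


provider, name = split_model("elevenlabs/eleven_multilingual_v2")
print(provider)  # the prefix names the provider whose API key is needed
```

Splitting on the first slash only is a deliberate choice in this sketch, since it leaves the model part untouched whatever it contains.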

Setup

Recommended (auth + admin): Settings → Admin → Shared providers

  1. Add a shared provider with type speech-sdk.
  2. Enter the API key for the provider you want to use.
  3. Set the default model to a matching provider/model (for example, elevenlabs/eleven_multilingual_v2).

Users select the enabled shared provider, model, and voice from Settings → TTS Provider.

Built-in models

  • openai/gpt-4o-mini-tts (works with your existing OpenAI API key)
  • elevenlabs/eleven_multilingual_v2
  • cartesia/sonic-3.5
  • deepgram/aura-2
  • google/gemini-2.5-flash-preview-tts
  • inworld/inworld-tts-1.5-max

You can also choose Other and enter any provider/model the SDK supports. Recognized prefixes: openai, elevenlabs, cartesia, hume, deepgram, google, inworld, minimax, fish-audio, murf, resemble, fal-ai, mistral, xai, smallest-ai.
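Before entering a custom value under Other, it can help to check that its prefix is one of the recognized ones. The following is a minimal sketch under that assumption: the prefix set is copied from the list above, while the function name `has_recognized_prefix` is hypothetical and does not reflect how OpenReader or speech-sdk validate input.

```python
# Prefixes copied from the "Recognized prefixes" list in this page.
RECOGNIZED_PREFIXES = frozenset({
    "openai", "elevenlabs", "cartesia", "hume", "deepgram",
    "google", "inworld", "minimax", "fish-audio", "murf",
    "resemble", "fal-ai", "mistral", "xai", "smallest-ai",
})


def has_recognized_prefix(model: str) -> bool:
    """Return True if 'provider/model' starts with a recognized provider prefix.

    Hypothetical check for illustration; it says nothing about whether the
    model name after the slash actually exists at the provider.
    """
    prefix, sep, name = model.partition("/")
    return bool(sep) and bool(name) and prefix in RECOGNIZED_PREFIXES


print(has_recognized_prefix("deepgram/aura-2"))
print(has_recognized_prefix("deepgram"))  # missing the "/model" part
```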

Voice IDs

ElevenLabs and Cartesia identify voices by opaque IDs. The built-in lists map to these shared library voices:

ElevenLabs ID         Name       Cartesia ID                           Name
JBFqnCBsd6RMkjVDRZzb  George     a0e99841-438c-4a64-b679-ae501e7d6091  Barbershop Man
IKne3meq5aSn9XLyUdCD  Charlie    156fb8d2-335b-4950-9cb3-a2d33f0c0c2a  British Lady
XB0fDUnXU5powFXDhCwa  Charlotte  694f9389-aac1-45b6-b726-9d9369183238  California Girl
Xb7hH8MSUJpSbSDYk0k2  Alice      87748186-23bb-4571-8b8b-a73da9bf9c4f  Commercial Lady
iP95p4xoKVk53GoZ742B  Chris      ee7ea9f8-c0c1-498c-9f62-dc2da49a6f98  Friendly Reading Man
nPczCjzI2devNBz1zQrb  Brian      248be419-c632-4f23-adf1-5324ed7dbf1d  Hannah
onwK4e9ZLuTAKqWW03F9  Daniel
pFZP5JQG7iQjIQuC4Bku  Lily
pqHfZKP75CvOlQylNhV4  Bill
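If you need to resolve a voice name to its ID outside the UI (for example in a script), the table above can be expressed as two lookup maps. This is only a sketch: the IDs and names come from the table, while the dictionary names and the `voice_id` function are hypothetical and not part of OpenReader or speech-sdk.

```python
# Voice name -> opaque ID, transcribed from the table above.
ELEVENLABS_VOICES = {
    "George": "JBFqnCBsd6RMkjVDRZzb",
    "Charlie": "IKne3meq5aSn9XLyUdCD",
    "Charlotte": "XB0fDUnXU5powFXDhCwa",
    "Alice": "Xb7hH8MSUJpSbSDYk0k2",
    "Chris": "iP95p4xoKVk53GoZ742B",
    "Brian": "nPczCjzI2devNBz1zQrb",
    "Daniel": "onwK4e9ZLuTAKqWW03F9",
    "Lily": "pFZP5JQG7iQjIQuC4Bku",
    "Bill": "pqHfZKP75CvOlQylNhV4",
}

CARTESIA_VOICES = {
    "Barbershop Man": "a0e99841-438c-4a64-b679-ae501e7d6091",
    "British Lady": "156fb8d2-335b-4950-9cb3-a2d33f0c0c2a",
    "California Girl": "694f9389-aac1-45b6-b726-9d9369183238",
    "Commercial Lady": "87748186-23bb-4571-8b8b-a73da9bf9c4f",
    "Friendly Reading Man": "ee7ea9f8-c0c1-498c-9f62-dc2da49a6f98",
    "Hannah": "248be419-c632-4f23-adf1-5324ed7dbf1d",
}

# Hypothetical registry keyed by the provider prefix used in model strings.
_VOICE_TABLES = {"elevenlabs": ELEVENLABS_VOICES, "cartesia": CARTESIA_VOICES}


def voice_id(provider: str, name: str) -> str:
    """Look up the opaque voice ID for a built-in voice name."""
    try:
        return _VOICE_TABLES[provider][name]
    except KeyError:
        raise KeyError(f"no built-in voice {name!r} for provider {provider!r}") from None


print(voice_id("cartesia", "Hannah"))
```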

Notes

  • One voice per request; Kokoro-style multi-voice mixing does not apply to this provider.
  • Playback speed is applied client-side, so cached audio segments stay valid when you change speed.
  • Providers without a built-in voice list fall back to a default entry, which lets the provider pick its default voice.
  • Word-by-word highlighting works the same as with every other provider, because alignment runs in OpenReader rather than at the provider.
  • TTS requests are sent from the server, not the browser. The API key is never exposed to clients.
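The note on playback speed can be illustrated with a cache key that leaves speed out. The sketch below shows the general idea only: `SegmentRequest` and `cache_key` are hypothetical names, and the choice of fields and hashing is an assumption, not OpenReader's actual cache implementation.

```python
import hashlib
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class SegmentRequest:
    """Hypothetical description of one audio segment to synthesize."""
    model: str
    voice: str
    text: str
    speed: float = 1.0  # applied client-side at playback, not by the provider


def cache_key(req: SegmentRequest) -> str:
    """Derive a cache key from the fields that affect the generated audio.

    Speed is deliberately excluded (an assumption of this sketch), so the
    same cached segment can be replayed at any speed.
    """
    payload = json.dumps([req.model, req.voice, req.text])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


hannah = "248be419-c632-4f23-adf1-5324ed7dbf1d"
normal = SegmentRequest("cartesia/sonic-3.5", hannah, "Hello, world.", speed=1.0)
fast = SegmentRequest("cartesia/sonic-3.5", hannah, "Hello, world.", speed=2.0)
print(cache_key(normal) == cache_key(fast))  # same key regardless of speed
```

Changing the model, voice, or text produces a different key in this sketch, which matches the expectation that only those inputs require a new request to the provider.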

References