Voice Format
Examples: Telnyx.NaturalHD.astra, aws.Polly.Generative.Lucia, azure.en-US-AvaMultilingualNeural, elevenlabs.v3.Adam
aws.Polly.Generative.Lucia correctly.
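The voice strings above all start with a provider prefix followed by dot-separated, provider-specific parts. A minimal client-side sketch of splitting off that prefix is shown here. The helper name `split_voice` and the `VoiceRef` type are hypothetical, and lowercasing the prefix to match the provider key is an assumption, not documented behavior.

```python
from typing import NamedTuple


class VoiceRef(NamedTuple):
    provider: str  # provider prefix, lowercased (assumption: keys are lowercase)
    rest: str      # everything after the first dot; its meaning is provider-specific


def split_voice(voice: str) -> VoiceRef:
    """Split a voice string at the first dot into a provider prefix and the remainder."""
    provider, sep, rest = voice.partition(".")
    if not sep or not provider or not rest:
        raise ValueError(f"expected '<provider>.<...>', got {voice!r}")
    return VoiceRef(provider.lower(), rest)


for v in (
    "Telnyx.NaturalHD.astra",
    "aws.Polly.Generative.Lucia",
    "azure.en-US-AvaMultilingualNeural",
    "elevenlabs.v3.Adam",
):
    print(split_voice(v))
```

Only the first dot is treated as a separator, so the remainder can be interpreted per provider (for example, engine plus voice ID for AWS Polly, or a bare voice ID for Azure Speech).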
Provider Summary
| Provider | Key | Models | Audio Delivery |
|---|---|---|---|
| Telnyx | telnyx | Natural, NaturalHD, KokoroTTS, Qwen3TTS | Streamed |
| AWS Polly | aws | standard, neural, generative, long-form | Concatenated |
| Azure Speech | azure | Neural voices | Concatenated |
| ElevenLabs | elevenlabs | v2, v3, MultiPL.v2 | Direct relay |
| Minimax | minimax | — | Streamed |
| Rime | rime | ArcanaV3 | Streamed |
| Resemble | resemble | Turbo (default) | Streamed |
| Inworld | inworld | inworld-tts-1.5-mini, inworld-tts-1.5-max | Streamed |
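A client that handles several providers may want the Audio Delivery column from the table above available as a lookup. The sketch below simply transcribes that column; the names `DELIVERY` and `delivery_mode` are hypothetical and not part of any SDK.

```python
# Audio delivery mode per provider key, transcribed from the summary table.
DELIVERY = {
    "telnyx": "streamed",
    "aws": "concatenated",
    "azure": "concatenated",
    "elevenlabs": "direct relay",
    "minimax": "streamed",
    "rime": "streamed",
    "resemble": "streamed",
    "inworld": "streamed",
}


def delivery_mode(provider_key: str) -> str:
    """Return the delivery mode for a provider key; raises KeyError if unknown."""
    return DELIVERY[provider_key.lower()]


print(delivery_mode("aws"))
print(delivery_mode("rime"))
```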
Streamed providers send audio in incremental frames; the audio field on the text-bearing chunk is null. Concatenated providers return full audio in a single chunk. Direct relay means frames are forwarded to the upstream provider’s WebSocket.
Telnyx
| Model | Description | Languages |
|---|---|---|
| Natural | Fast, low-latency synthesis | English |
| NaturalHD | Higher quality, supports multiple languages | en, fr, de, es, ar, hi, ja, he, pt |
| KokoroTTS | Lightweight model | — |
| Qwen3TTS | Voice cloning. Requires a cloned voice name as voice_id. | en, zh, fr, de, it, ja, ko, pt, ru, es |
For Qwen3TTS, voice_id is the clone name. Cloned voice usage may require identity verification.
AWS Polly
Voice format: aws.Polly.<Engine>.<VoiceId>
Engines: standard, neural, generative, long-form.
Example: aws.Polly.Generative.Lucia
Engine is parsed from the voice ID suffix (e.g., a voice ending in -longform maps to the long-form engine).
Supports SSML input via text_type: "ssml" in voice settings.
Voices: AWS Polly voice list
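As a rough illustration of the AWS Polly options above, the sketch below assembles a voice string in the documented format and a voice settings mapping that requests SSML. The helper `polly_voice` is hypothetical, and how the voice, settings, and text are actually transmitted is not shown here because it depends on the connection's message format.

```python
def polly_voice(engine: str, voice_id: str) -> str:
    """Format a voice string as aws.Polly.<Engine>.<VoiceId>."""
    return f"aws.Polly.{engine}.{voice_id}"


voice = polly_voice("Generative", "Lucia")

# Voice settings requesting SSML input, per the text_type option described above.
voice_settings = {"text_type": "ssml"}

# SSML text must be wrapped in a <speak> root element.
text = '<speak>Hello. <break time="300ms"/> Goodbye.</speak>'

print(voice)
print(voice_settings)
print(text)
```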
Azure Speech
Voice format: azure.<VoiceId>
Example: azure.en-US-AvaMultilingualNeural
Default voice: en-US-AvaMultilingualNeural. Default output format: audio-24khz-160kbitrate-mono-mp3.
Supports SSML input and audio effects (eq_car, eq_telecomhp8k).
Voices: Azure Speech voices
ElevenLabs
ElevenLabs connections are relayed directly to the ElevenLabs WebSocket API. Frames pass through without going through the standard text buffering pipeline.
Minimax
Supports voice cloning. Cloned voices are scoped to your organization.
Voice settings: speed (float), vol (float), pitch (integer), language_boost (string).
Rime
Voice format: Rime.ArcanaV3.<VoiceId>
Resemble
Self-hosted synthesis engine.
Voice settings: precision (PCM_16, PCM_24, PCM_32, MULAW), sample_rate (8000–48000), format (wav, mp3).
Default model: Turbo. Default format: mp3.
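The Resemble settings listed above lend themselves to a client-side check before sending. This is a minimal sketch, assuming the sample_rate range is inclusive and that any integer within it is accepted; `validate_resemble_settings` is a hypothetical helper, not an API of the service.

```python
PRECISIONS = {"PCM_16", "PCM_24", "PCM_32", "MULAW"}
FORMATS = {"wav", "mp3"}


def validate_resemble_settings(settings: dict) -> dict:
    """Return a copy of settings with the default format applied, or raise ValueError."""
    out = {"format": "mp3", **settings}  # mp3 is the documented default format
    if "precision" in out and out["precision"] not in PRECISIONS:
        raise ValueError(f"unsupported precision: {out['precision']!r}")
    if "sample_rate" in out and not 8000 <= out["sample_rate"] <= 48000:
        # Assumption: both ends of the 8000-48000 range are allowed.
        raise ValueError(f"sample_rate out of range: {out['sample_rate']!r}")
    if out["format"] not in FORMATS:
        raise ValueError(f"unsupported format: {out['format']!r}")
    return out


print(validate_resemble_settings({"precision": "PCM_16", "sample_rate": 24000}))
```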
Inworld
Models: inworld-tts-1.5-mini (faster), inworld-tts-1.5-max (higher quality). Aliases: Mini, Max.
Encodings: MP3, LINEAR16. Default: LINEAR16 for WebSocket, MP3 for REST.
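The Inworld aliases and per-transport defaults above can be expressed as two small lookups. In this sketch the function names and the transport labels "websocket" and "rest" are hypothetical, and matching aliases case-insensitively is an assumption.

```python
MODEL_ALIASES = {
    "mini": "inworld-tts-1.5-mini",
    "max": "inworld-tts-1.5-max",
}
DEFAULT_ENCODING = {"websocket": "LINEAR16", "rest": "MP3"}


def resolve_model(name: str) -> str:
    """Map an alias (Mini, Max) or a full model name to the full model name."""
    full = MODEL_ALIASES.get(name.lower(), name)
    if full not in MODEL_ALIASES.values():
        raise ValueError(f"unknown Inworld model: {name!r}")
    return full


def default_encoding(transport: str) -> str:
    """Return the default encoding for a transport label; raises KeyError if unknown."""
    return DEFAULT_ENCODING[transport]


print(resolve_model("Mini"))
print(default_encoding("websocket"))
```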