Client → Server
All client messages are JSON text frames.
Text Frame
Text to synthesize. Send " " (a single space) for the handshake and "" (an empty string) for end-of-sequence.
Provider-specific voice configuration. Only used in the handshake frame ({"text": " "}). See Voice Settings.
When true, immediately synthesizes all buffered text without waiting for a sentence boundary. Default: false.
When true, stops the current synthesis worker and starts a new one. The original handshake is replayed automatically. Use for barge-in/interruption.
Message Sequence
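A minimal sketch of this sequence, built only from the "text" values described above: a single space for the handshake, the text to synthesize, and an empty string for end-of-sequence. The helper names (handshake_frame, text_frame, end_of_sequence_frame, build_sequence) are hypothetical, and the optional voice configuration and boolean flags are left out because their exact field names are not shown here.

```python
import json
from typing import Iterable, List


def handshake_frame() -> str:
    # A single space in "text" marks the handshake (required first message).
    return json.dumps({"text": " "})


def text_frame(text: str) -> str:
    # Ordinary frame carrying text to synthesize.
    return json.dumps({"text": text})


def end_of_sequence_frame() -> str:
    # An empty string in "text" signals end-of-sequence.
    return json.dumps({"text": ""})


def build_sequence(sentences: Iterable[str]) -> List[str]:
    # Assumed order: handshake first, then text frames, then end-of-sequence.
    frames = [handshake_frame()]
    frames.extend(text_frame(s) for s in sentences)
    frames.append(end_of_sequence_frame())
    return frames
```

Each returned string would be sent as one JSON text frame over the WebSocket connection, using whichever client library is already in place.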
1. Handshake (required first message):
Server → Client
All server messages are JSON text frames.
Audio Chunk
Returned when synthesis produces audio for a complete sentence.
Base64-encoded audio data. null when the provider uses streamed delivery; in that case audio arrives in separate streamed chunk frames instead. See note below.
The text segment this audio corresponds to. null for streamed audio chunks.
false for audio chunks.
true if audio was served from cache.
Time in milliseconds from speech request to first audio frame. Only present on the first chunk of each synthesis.
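As an illustration of the audio chunk described above, the following sketch parses one server frame and decodes its Base64 audio. It relies only on the "audio" and "text" fields mentioned in this section. The function name parse_audio_chunk is hypothetical, and the remaining fields are ignored because their names are not shown here.

```python
import base64
import json
from typing import Optional, Tuple


def parse_audio_chunk(raw: str) -> Tuple[Optional[str], Optional[bytes]]:
    """Return (text, audio_bytes) for a single server frame.

    audio_bytes is None when "audio" is null, which is the case when the
    provider delivers audio in separate streamed chunk frames.
    """
    frame = json.loads(raw)
    encoded = frame.get("audio")
    audio = base64.b64decode(encoded) if encoded is not None else None
    return frame.get("text"), audio
```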
Streamed Audio Chunk
For providers that stream audio incrementally (Telnyx Natural, NaturalHD, Qwen3TTS, Rime, Minimax, Resemble, Inworld), audio arrives in separate frames. In these frames, text is always null. The concatenated audio chunk for these providers has audio: null; only the streamed chunks carry audio bytes.
For AWS Polly and Azure, audio is returned in the audio field of the regular audio chunk frame. For all other providers, ignore the audio field on the text-bearing chunk and collect audio from the streamed frames.
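The rule above can be sketched as a small collector. This is an illustration under stated assumptions, not a reference client: it assumes streamed frames carry Base64 data in the same "audio" field with "text" set to null, and that decoded chunks can simply be concatenated, which depends on the audio format in use. The names collect_audio and streamed_delivery are hypothetical.

```python
import base64
import json
from typing import Iterable


def collect_audio(raw_frames: Iterable[str], streamed_delivery: bool) -> bytes:
    """Gather decoded audio bytes from a sequence of server frames.

    streamed_delivery=False: audio is read from the regular audio chunk
    frames (the AWS Polly and Azure case).
    streamed_delivery=True: the audio field on text-bearing chunks is
    ignored and audio is taken only from streamed frames (text is null).
    """
    collected = bytearray()
    for raw in raw_frames:
        frame = json.loads(raw)
        encoded = frame.get("audio")
        if encoded is None:
            continue  # nothing to decode in this frame
        text_bearing = frame.get("text") is not None
        if streamed_delivery and text_bearing:
            continue  # ignore audio on the text-bearing chunk
        collected.extend(base64.b64decode(encoded))
    return bytes(collected)
```

Keying the decision on a caller-supplied flag rather than on the frame contents keeps the sketch independent of any provider list; a real client would derive that flag from the provider chosen in the handshake.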