Client → Server

All client messages are JSON text frames.

Text Frame

text (string, required)
Text to synthesize. Send " " (a single space) for the handshake and "" (an empty string) for end-of-sequence.
voice_settings (object)
Provider-specific voice configuration. Only used in the handshake frame ({"text": " "}). See Voice Settings.
flush (boolean)
When true, immediately synthesizes all buffered text without waiting for a sentence boundary. Default: false.
force (boolean)
When true, stops the current synthesis worker and starts a new one. The original handshake is replayed automatically. Use for barge-in/interruption.

Message Sequence

1. Handshake (required first message):
{"text": " "}
With optional voice settings:
{
  "text": " ",
  "voice_settings": {
    "voice_speed": 1.2
  }
}
2. Text (one or more):
{"text": "Hello, welcome to Telnyx."}
3. Flush (optional — force synthesis of buffered partial sentences):
{"text": "incomplete fragment", "flush": true}
4. Interrupt (optional — restart synthesis):
{"force": true}
5. End of sequence:
{"text": ""}
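The sequence above can be sketched as a small generator that produces the JSON text frames in order. This is only an illustration: the function name client_frames and the flush_last parameter are hypothetical and not part of the API, and actually sending the frames over a WebSocket is left to whichever client library you use.

```python
import json
from typing import Iterator, Optional


def client_frames(
    sentences: list[str],
    voice_settings: Optional[dict] = None,
    flush_last: bool = False,
) -> Iterator[str]:
    """Yield client JSON text frames in the documented order (hypothetical helper)."""
    # 1. Handshake: a single space, optionally carrying voice settings.
    handshake: dict = {"text": " "}
    if voice_settings is not None:
        handshake["voice_settings"] = voice_settings
    yield json.dumps(handshake)

    # 2. Text frames. 3. Optionally set flush on the last one so a
    #    partial sentence is synthesized without waiting for a boundary.
    for i, sentence in enumerate(sentences):
        frame: dict = {"text": sentence}
        if flush_last and i == len(sentences) - 1:
            frame["flush"] = True
        yield json.dumps(frame)

    # 5. End of sequence: an empty string.
    yield json.dumps({"text": ""})
```

Each yielded string would be sent as one text frame. An interrupt ({"force": true}) is not modeled here because it depends on runtime events such as barge-in rather than on a fixed order.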

Server → Client

All server messages are JSON text frames.

Audio Chunk

Returned when synthesis produces audio for a complete sentence.
{
  "audio": "<base64-encoded-audio>",
  "text": "Hello, welcome to Telnyx.",
  "isFinal": false,
  "cached": false,
  "timeToFirstAudioFrameMs": 245
}
audio (string | null)
Base64-encoded audio data. null when the provider uses streamed delivery; in that case audio arrives in separate streamed chunk frames instead. See note below.
text (string | null)
The text segment this audio corresponds to. null for streamed audio chunks.
isFinal (boolean)
Always false for audio chunks; true only on the final frame.
cached (boolean)
true if audio was served from cache.
timeToFirstAudioFrameMs (integer)
Time in milliseconds from speech request to first audio frame. Only present on the first chunk of each synthesis.

Streamed Audio Chunk

For providers that stream audio incrementally (Telnyx Natural, NaturalHD, Qwen3TTS, Rime, Minimax, Resemble, Inworld), audio arrives in separate frames:
{
  "audio": "<base64-encoded-audio>",
  "text": null,
  "isFinal": false,
  "cached": false
}
These frames contain raw audio data, and text is always null. For these providers, the text-bearing audio chunk has audio: null; only the streamed chunks carry audio bytes.
For AWS Polly and Azure, audio is returned in the audio field of the regular audio chunk frame. For all other providers, ignore the audio field on the text-bearing chunk and collect audio from the streamed frames.
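A minimal sketch of the collection rule described above, assuming the caller already knows whether the chosen provider streams audio. The function name collect_audio and the streamed flag are hypothetical; the server does not send such a flag.

```python
import base64
import json
from typing import Iterable


def collect_audio(raw_frames: Iterable[str], streamed: bool) -> bytes:
    """Concatenate decoded audio from server frames (hypothetical helper).

    streamed=True:  take audio only from streamed chunks (text is null).
    streamed=False: take audio only from text-bearing chunks
                    (the AWS Polly / Azure case described above).
    """
    audio = bytearray()
    for raw in raw_frames:
        msg = json.loads(raw)
        # Skip error frames, final frames, and frames without audio bytes.
        if "error" in msg or msg.get("isFinal") or msg.get("audio") is None:
            continue
        is_streamed_chunk = msg.get("text") is None
        if is_streamed_chunk == streamed:
            audio += base64.b64decode(msg["audio"])
    return bytes(audio)
```

The result is the raw audio bytes in arrival order; how to play or store them depends on the audio format configured for the connection, which this sketch does not cover.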

Final Frame

Signals that synthesis is complete for the current text input:
{
  "audio": null,
  "text": "",
  "isFinal": true
}
The connection remains open after a final frame — send more text or close.

Error Frame

{
  "error": "Provider error message"
}
The connection closes shortly after an error frame.
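To tie the four server frame types together, here is one possible way to tell them apart using only the fields shown in this section. The function name classify_frame and its return labels are hypothetical, chosen for illustration.

```python
import json


def classify_frame(raw: str) -> str:
    """Return a label for a server JSON text frame (hypothetical helper)."""
    msg = json.loads(raw)
    # Error frames carry an "error" key; the connection closes soon after.
    if "error" in msg:
        return "error"
    # The final frame is the only one with isFinal set to true.
    if msg.get("isFinal") is True:
        return "final"
    # Streamed audio chunks always have text set to null.
    if msg.get("text") is None:
        return "streamed_audio"
    # Anything else is a regular, text-bearing audio chunk.
    return "audio"
```

A receive loop could dispatch on this label: decode audio for "audio" and "streamed_audio", send more text or close on "final", and stop reading on "error".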