Telnyx

Client → Server

All client messages are JSON text frames.

Text Frame

text (string, required)

Text to synthesize. Send " " (a single space) for the handshake and "" (an empty string) to signal end of sequence.

voice_settings (object)

Provider-specific voice configuration. Only used in the handshake frame ({"text": " "}). See Voice Settings.

flush (boolean)

When true, immediately synthesizes all buffered text without waiting for a sentence boundary. Default: false.

force (boolean)

When true, stops the current synthesis worker and starts a new one. The original handshake is replayed automatically. Use for barge-in/interruption.

Message Sequence

1. Handshake (required first message):

{"text": " "}

With optional voice settings:

{
  "text": " ",
  "voice_settings": {
    "voice_speed": 1.2
  }
}

2. Text (one or more):

{"text": "Hello, welcome to Telnyx."}

3. Flush (optional — force synthesis of buffered partial sentences):

{"text": "incomplete fragment", "flush": true}

4. Interrupt (optional — restart synthesis):

{"force": true}

5. End of sequence:

{"text": ""}
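The sequence above can be sketched as a small generator that yields each frame as a JSON string. This is a minimal illustration, not an official client: build_frames and its flush_last parameter are hypothetical names, and sending the strings over a WebSocket is left to whichever client library is in use.

```python
import json
from typing import Iterable, Iterator, Optional


def build_frames(
    sentences: Iterable[str],
    voice_settings: Optional[dict] = None,
    flush_last: bool = False,
) -> Iterator[str]:
    """Yield JSON text frames in the documented order (hypothetical helper)."""
    # 1. Handshake: a single space, optionally carrying voice settings.
    handshake: dict = {"text": " "}
    if voice_settings is not None:
        handshake["voice_settings"] = voice_settings
    yield json.dumps(handshake)

    # 2. Text frames; 3. optionally flush the last one so a partial
    #    sentence is synthesized without waiting for a sentence boundary.
    items = list(sentences)
    for i, sentence in enumerate(items):
        frame: dict = {"text": sentence}
        if flush_last and i == len(items) - 1:
            frame["flush"] = True
        yield json.dumps(frame)

    # 5. End of sequence: an empty string.
    yield json.dumps({"text": ""})
```

The interrupt frame (step 4) is left out because it is sent reactively, for example when the user barges in, rather than as part of a fixed script.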

Server → Client

All server messages are JSON text frames.

Audio Chunk

Returned when synthesis produces audio for a complete sentence.

{
  "audio": "<base64-encoded-audio>",
  "text": "Hello, welcome to Telnyx.",
  "isFinal": false,
  "cached": false,
  "timeToFirstAudioFrameMs": 245
}

audio (string | null)

Base64-encoded audio data. null when the provider uses streamed delivery; audio then arrives in separate streamed chunk frames. See the note under Streamed Audio Chunk.

text (string | null)

The text segment this audio corresponds to. null for streamed audio chunks.

isFinal (boolean)

Always false for audio chunks.

cached (boolean)

true if audio was served from cache.

timeToFirstAudioFrameMs (integer)

Time in milliseconds from speech request to first audio frame. Only present on the first chunk of each synthesis.
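Because audio and text may be null and timeToFirstAudioFrameMs is absent on all but the first chunk, a client may want to parse frames into a structure with optional fields. A minimal sketch, assuming a hypothetical AudioChunk container and parse_audio_chunk helper (neither is part of any Telnyx SDK):

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class AudioChunk:  # hypothetical container for illustration only
    audio: Optional[str]    # base64 string, or None for streamed delivery
    text: Optional[str]     # None on streamed audio chunks
    is_final: bool
    cached: bool
    ttfa_ms: Optional[int]  # only set on the first chunk of a synthesis


def parse_audio_chunk(raw: str) -> AudioChunk:
    """Parse one server frame, tolerating the optional latency field."""
    msg = json.loads(raw)
    return AudioChunk(
        audio=msg.get("audio"),
        text=msg.get("text"),
        is_final=bool(msg.get("isFinal", False)),
        cached=bool(msg.get("cached", False)),
        ttfa_ms=msg.get("timeToFirstAudioFrameMs"),
    )
```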

Streamed Audio Chunk

For providers that stream audio incrementally (Telnyx Natural, NaturalHD, Qwen3TTS, Rime, Minimax, Resemble, Inworld), audio arrives in separate frames:

{
  "audio": "<base64-encoded-audio>",
  "text": null,
  "isFinal": false,
  "cached": false
}

These frames carry only audio data; text is always null. For these providers, the regular text-bearing audio chunk frame has audio: null, so only the streamed chunks carry audio bytes.

For AWS Polly and Azure, audio is returned in the audio field of the regular audio chunk frame. For all other providers, ignore the audio field on the text-bearing chunk and collect audio from the streamed frames.
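The rule above can be sketched as a collector that decodes and concatenates audio across frames. This is an illustration under assumptions: collect_audio and its inline_audio flag are hypothetical names, and the caller is assumed to know whether the provider returns audio inline (AWS Polly, Azure) or in streamed frames.

```python
import base64
import json
from typing import Iterable


def collect_audio(raw_frames: Iterable[str], inline_audio: bool) -> bytes:
    """Concatenate decoded audio bytes from a sequence of server frames.

    inline_audio=True  -> audio is on the text-bearing chunk (AWS Polly, Azure).
    inline_audio=False -> audio on text-bearing chunks is ignored and only
                          streamed frames (text is null) are collected.
    """
    out = bytearray()
    for raw in raw_frames:
        msg = json.loads(raw)
        audio = msg.get("audio")
        if audio is None:
            continue  # final frames, error frames, and null-audio chunks
        has_text = msg.get("text") is not None
        if not inline_audio and has_text:
            continue  # streaming provider: skip the text-bearing chunk
        out += base64.b64decode(audio)
    return bytes(out)
```

The result is the concatenated audio payload for the frames seen so far; how to play or store it depends on the audio format the provider returns, which this sketch does not assume.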

Final Frame

Signals that synthesis is complete for the current text input:

{
  "audio": null,
  "text": "",
  "isFinal": true
}

The connection remains open after a final frame — send more text or close.

Error Frame

{
  "error": "Provider error message"
}

The connection closes shortly after an error frame.
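Since every server message is a JSON text frame, a client has to tell the four frame types apart by their fields. One possible way to do that, assuming the shapes shown in this section (classify_frame and its return labels are hypothetical):

```python
import json


def classify_frame(raw: str) -> str:
    """Return a label for a server frame based on the documented fields."""
    msg = json.loads(raw)
    if "error" in msg:
        return "error"           # connection closes shortly afterwards
    if msg.get("isFinal") is True:
        return "final"           # synthesis complete for the current input
    if msg.get("text") is None:
        return "streamed_audio"  # audio-only frame from a streaming provider
    return "audio_chunk"         # text-bearing chunk; audio may be null
```

A receive loop built on this would stop waiting for audio on "final", and on "error" would surface the message and expect the socket to close.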
