Endpoint
Connection Lifecycle
1. Handshake
There are two ways to establish a connection: a direct WebSocket connection or an HTTP upgrade.
Direct WebSocket connection
You can connect directly to the WebSocket endpoint by passing all configuration as query parameters in the wss:// URL.
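
As a rough sketch of the direct approach, the helper below appends configuration to a base URL as percent-encoded query parameters. The host, path, and parameter names used here are placeholders invented for illustration; substitute the real endpoint and the parameters documented for this API.

```python
from urllib.parse import urlencode, urlsplit, urlunsplit


def build_ws_url(base_url: str, params: dict) -> str:
    """Return base_url with params attached as a query string."""
    parts = urlsplit(base_url)
    if parts.scheme != "wss":
        raise ValueError("expected a wss:// URL")
    query = urlencode(params)  # percent-encodes keys and values
    return urlunsplit((parts.scheme, parts.netloc, parts.path, query, ""))


# Hypothetical host, path, and parameter names (assumptions, not the real API).
url = build_ws_url(
    "wss://example.invalid/tts",
    {"voice": "some-voice", "sample_rate": "24000"},
)
print(url)
```

Building the query string with `urlencode` rather than string concatenation avoids malformed URLs when a value contains spaces or reserved characters.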
HTTP upgrade
Alternatively, initiate the connection as an HTTP GET request that upgrades to a WebSocket via the standard 101 Switching Protocols handshake. This is what happens under the hood when a WebSocket client connects, and may be relevant if you need fine-grained control over the upgrade (e.g., setting custom headers in environments where the WebSocket library doesn't expose them directly).
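
For readers handling the upgrade by hand, the sketch below shows the generic RFC 6455 pieces: the request headers a client sends and the Sec-WebSocket-Accept value it should expect back in the 101 response. Nothing here is specific to this API; the `host`, `path`, and `extra_headers` arguments are placeholders, and any authentication header you pass is your own assumption.

```python
import base64
import hashlib
import os
from typing import Dict, Optional

# Fixed GUID defined by RFC 6455 for computing Sec-WebSocket-Accept.
WS_GUID = "258EAFA5-E914-47DA-95CA-C5AB0DC85B11"


def new_websocket_key() -> str:
    """Generate a random 16-byte nonce, base64-encoded."""
    return base64.b64encode(os.urandom(16)).decode("ascii")


def expected_accept(key: str) -> str:
    """Value the server must echo in Sec-WebSocket-Accept for this key."""
    digest = hashlib.sha1((key + WS_GUID).encode("ascii")).digest()
    return base64.b64encode(digest).decode("ascii")


def upgrade_request(host: str, path: str, key: str,
                    extra_headers: Optional[Dict[str, str]] = None) -> str:
    """Build the raw HTTP/1.1 GET request that asks for a WebSocket upgrade."""
    lines = [
        f"GET {path} HTTP/1.1",
        f"Host: {host}",
        "Upgrade: websocket",
        "Connection: Upgrade",
        f"Sec-WebSocket-Key: {key}",
        "Sec-WebSocket-Version: 13",
    ]
    for name, value in (extra_headers or {}).items():
        lines.append(f"{name}: {value}")  # custom headers go here
    return "\r\n".join(lines) + "\r\n\r\n"
```

A client would send the request over a TLS socket, check for a 101 status, and compare the returned Sec-WebSocket-Accept header against `expected_accept(key)` before exchanging frames.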
Initialization frame
Regardless of how the connection is established, send an initialization frame before any text. Use voice_settings to configure provider-specific parameters.
2. Streaming
Once initialized, text and audio flow concurrently, with no request/response pairing. Text is buffered and synthesized at sentence boundaries.
Client → Server

| Frame type | Content |
|---|---|
| text | {"text": "Hello."} - text to synthesize |
| text | {"text": "...", "flush": true} - force immediate synthesis of buffered text |
| text | {"force": true} - interrupt current synthesis (barge-in), restart worker |
| text | {"text": ""} - end of sequence, flush remaining buffer and close |
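
The client frames in the table above can be produced with a few small helpers. This is a minimal sketch: the function names are invented for illustration, and only the JSON shapes come from the table.

```python
import json


def text_frame(text: str, flush: bool = False) -> str:
    """Frame carrying text to synthesize, optionally forcing a flush."""
    if text == "":
        # An empty string means end of sequence, so keep it out of this helper.
        raise ValueError("empty text ends the sequence; use end_frame()")
    payload = {"text": text}
    if flush:
        payload["flush"] = True
    return json.dumps(payload)


def interrupt_frame() -> str:
    """Frame that interrupts the current synthesis (barge-in)."""
    return json.dumps({"force": True})


def end_frame() -> str:
    """Frame that flushes the remaining buffer and closes the sequence."""
    return json.dumps({"text": ""})
```

Keeping the empty-string case in its own helper makes it harder to close the stream by accident when an upstream text source yields an empty chunk.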

Server → Client

| Message | Description |
|---|---|
| Audio chunk | {"audio":"<b64>","text":"Hello.","isFinal":false} |
| Streamed chunk | {"audio":"<b64>","text":null,"isFinal":false} (most providers) |
| Final frame | {"audio":null,"text":"","isFinal":true} - synthesis complete |
| Error | {"error":"..."} - connection closes after |

Include "flush": true in a text frame to force synthesis of buffered partials.
Text preprocessing: Markdown formatting is automatically stripped before synthesis (headers, bold, italics, code blocks, links, lists, emoji). Useful when synthesizing LLM output. Pronunciation dictionary replacements are applied if pronunciation_dict_id is set.
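
To give a feel for what reaches the synthesizer, here is a rough client-side approximation of Markdown stripping. It is not the server's implementation: the rules and their order are assumptions, and it covers only headers, bullet markers, links, inline code, bold, and italics (fenced code blocks and emoji are left out).

```python
import re

# Order matters: bullets before emphasis, bold before italics.
_RULES = [
    (re.compile(r"^[ \t]{0,3}#{1,6}[ \t]+", re.MULTILINE), ""),  # ATX headers
    (re.compile(r"^[ \t]*[-*+][ \t]+", re.MULTILINE), ""),       # bullet markers
    (re.compile(r"\[([^\]]+)\]\([^)]+\)"), r"\1"),               # links -> link text
    (re.compile(r"`([^`]+)`"), r"\1"),                           # inline code
    (re.compile(r"\*\*(.+?)\*\*"), r"\1"),                       # bold
    (re.compile(r"\*(.+?)\*"), r"\1"),                           # italics
]


def strip_markdown(text: str) -> str:
    """Apply each rule in turn and return the plain text."""
    for pattern, replacement in _RULES:
        text = pattern.sub(replacement, text)
    return text


print(strip_markdown("# Title\nSome **bold** and *it* [link](http://x.y)"))
```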
Streamed vs. concatenated delivery: Most providers (Telnyx Natural/NaturalHD/Qwen3TTS, Rime, Minimax, Resemble, Inworld) stream audio in separate frames where text is null. AWS Polly and Azure return audio in the text-bearing chunk instead. See Messages for details.
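
Because audio may arrive either in frames with a null text field or in the text-bearing chunk, a client can treat both cases the same way: decode whatever audio a frame carries and note any text alongside it. The function below is a sketch under that assumption; `collect_audio` is a hypothetical name, and the frame shapes are the ones listed in the Server → Client table.

```python
import base64
import json
from typing import Iterable, List, Tuple


def collect_audio(raw_messages: Iterable[str]) -> Tuple[bytes, List[str]]:
    """Accumulate decoded audio and any accompanying text until the final frame."""
    audio = bytearray()
    texts: List[str] = []
    for raw in raw_messages:
        msg = json.loads(raw)
        if "error" in msg:
            # The connection closes after an error frame.
            raise RuntimeError(msg["error"])
        if msg.get("audio"):
            audio.extend(base64.b64decode(msg["audio"]))
        if msg.get("text"):
            texts.append(msg["text"])
        if msg.get("isFinal"):
            break  # synthesis complete
    return bytes(audio), texts
```

Since the function never branches on the provider, the same loop handles streamed and concatenated delivery without configuration.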
3. Teardown
Send {"text": ""} (empty string) to flush remaining buffered text and close gracefully. The server finishes synthesis, sends any remaining audio and a final frame, then closes the WebSocket.
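
The graceful shutdown described above can be sketched against an abstract transport. Here `send` and `recv` are hypothetical callables standing in for whatever WebSocket library you use, and the initialization frame is deliberately omitted. For simplicity the sketch sends everything before reading, whereas a real client would typically read concurrently with sending.

```python
import json
from typing import Callable, Iterable, List


def run_session(send: Callable[[str], None],
                recv: Callable[[], str],
                sentences: Iterable[str]) -> List[dict]:
    """Send text frames, signal end of sequence, then drain until the final frame."""
    for sentence in sentences:
        send(json.dumps({"text": sentence}))
    send(json.dumps({"text": ""}))  # empty string: flush remaining text and close

    received: List[dict] = []
    while True:
        msg = json.loads(recv())
        received.append(msg)
        # Stop on the final frame or on an error; the server closes after either.
        if "error" in msg or msg.get("isFinal"):
            break
    return received
```

Draining until `isFinal` before closing the socket on the client side avoids cutting off audio that the server is still sending.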