Overview
The Telnyx Text-to-Speech (TTS) WebSocket API provides real-time audio synthesis from text input. This streaming endpoint allows you to send text and receive synthesized audio incrementally, enabling low-latency voice generation for real-time applications.
WebSocket Endpoint
Connection URL
Query Parameters
| Parameter | Type | Required | Description |
|---|---|---|---|
| voice | string | Yes | Voice identifier (e.g., Telnyx.NaturalHD.astra) |
| inactivity_timeout | integer | No | How long, in seconds, the WebSocket stays open without receiving a message (default: 20) |
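As a minimal sketch, the two query parameters can be appended to the connection URL with Python's standard library. `BASE_URL` below is a hypothetical placeholder, not the real endpoint, and `build_tts_url` is an illustrative helper name; substitute the connection URL from the Telnyx documentation.

```python
from typing import Optional
from urllib.parse import urlencode

# Hypothetical placeholder: replace with the documented connection URL.
BASE_URL = "wss://example.invalid/tts"


def build_tts_url(voice: str,
                  inactivity_timeout: Optional[int] = None,
                  base_url: str = BASE_URL) -> str:
    """Append the documented query parameters to the WebSocket URL."""
    if not voice:
        raise ValueError("voice is required")
    params = {"voice": voice}
    if inactivity_timeout is not None:
        # Optional: when omitted, the server default of 20 seconds applies.
        params["inactivity_timeout"] = str(inactivity_timeout)
    return f"{base_url}?{urlencode(params)}"


print(build_tts_url("Telnyx.NaturalHD.astra", inactivity_timeout=30))
```

Leaving `inactivity_timeout` unset keeps it out of the URL entirely, so the server-side default is used.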
Authentication
Include your Telnyx API token as a Bearer token in the Authorization header of the connection request (Authorization: Bearer YOUR_TOKEN).
Example Connection
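A minimal sketch of preparing the handshake header in Python. The helper name `auth_headers` and the environment variable `TELNYX_API_KEY` are assumptions made for illustration. The connect call itself is left out because the option for passing extra handshake headers is named differently across WebSocket client libraries.

```python
import os


def auth_headers(api_token: str) -> dict:
    """Return the Authorization header for the WebSocket handshake."""
    token = api_token.strip()
    if not token:
        raise ValueError("API token is empty")
    # Avoid sending "Bearer Bearer ..." if the caller already added the prefix.
    if token.lower().startswith("bearer "):
        token = token[len("bearer "):].strip()
    return {"Authorization": f"Bearer {token}"}


# TELNYX_API_KEY is a hypothetical environment variable name.
headers = auth_headers(os.environ.get("TELNYX_API_KEY", "YOUR_TOKEN"))
print(sorted(headers))  # print header names only, never the token itself
```

Pass the resulting dictionary to your WebSocket client as additional handshake headers when opening the connection.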
Connection Flow
The TTS WebSocket follows this lifecycle:
- Connect - Establish an authenticated WebSocket connection.
- Initialize - Send an initialization frame containing a single space character.
- Send Text - Send one or more text frames to synthesize.
- Receive Audio - Receive audio frames containing base64-encoded mp3 data.
- Stop - Send an empty text frame to signal completion.
- Close - The connection closes after processing completes.
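The lifecycle above can be sketched as a transport-agnostic driver. This is an illustration under stated assumptions, not the reference client: the JSON field names `text` and `audio` are assumed, `run_session` is a hypothetical helper, and `send`/`recv` stand in for whatever WebSocket library you use on an already-connected, authenticated socket. For simplicity it sends every frame before reading; a real-time client would normally read audio concurrently while sending.

```python
import base64
import json
from typing import Callable, Iterable, Optional


def run_session(send: Callable[[str], None],
                recv: Callable[[], Optional[str]],
                texts: Iterable[str]) -> bytes:
    """Drive one TTS session and return the concatenated audio bytes.

    `send` writes one text message; `recv` returns the next message, or
    None once the server has closed the connection.
    """
    send(json.dumps({"text": " "}))          # Initialize
    for text in texts:                       # Send Text
        send(json.dumps({"text": text}))
    send(json.dumps({"text": ""}))           # Stop

    audio = bytearray()
    while (message := recv()) is not None:   # Receive Audio until Close
        payload = json.loads(message).get("audio")  # assumed field name
        if payload:
            audio += base64.b64decode(payload)
    return bytes(audio)


# In-memory stand-in for a real WebSocket, for illustration only.
sent = []
replies = iter([json.dumps({"audio": base64.b64encode(b"chunk").decode()}), None])
audio = run_session(sent.append, lambda: next(replies), ["Hello world."])
print(len(sent), len(audio))
```

Injecting `send` and `recv` keeps the ordering logic separate from any particular WebSocket library and makes it easy to exercise without a network connection.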
Flow Diagram
Frame Types
Outbound Frames (Client → Server)
All outbound frames are JSON text messages with the following structure:
1. Initialization Frame
Purpose: Initialize the TTS session
Format:
- Must be sent first after connection.
- Contains a single space character.
- Required to begin the session.
2. Text Frame
Purpose: Send text content to be synthesized into speech
Format:
- Can send multiple text frames in one session.
- Each frame is processed and synthesized separately.
- Audio is returned incrementally for each text frame.
3. Stop Frame
Purpose: Signal completion of text input and end the session
Format:
- Contains an empty string.
- Signals the server to finish processing.
- Should be sent after all text frames.
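The three outbound frames can be sketched as small builder functions. The field name `text` is an assumption; confirm it against the frame format in the Telnyx reference. The function names are hypothetical.

```python
import json

TEXT_KEY = "text"  # assumed field name, not confirmed by this page


def init_frame() -> str:
    """Initialization frame: a single space character, sent first."""
    return json.dumps({TEXT_KEY: " "})


def text_frame(text: str) -> str:
    """Text frame: content to synthesize."""
    if text.strip() == "":
        # An empty string would read as a stop frame and a lone space as
        # an initialization frame, so this sketch rejects blank text.
        raise ValueError("text frame must contain non-blank content")
    return json.dumps({TEXT_KEY: text})


def stop_frame() -> str:
    """Stop frame: an empty string, sent after all text frames."""
    return json.dumps({TEXT_KEY: ""})


for frame in (init_frame(), text_frame("Hello from Telnyx."), stop_frame()):
    print(frame)
```

Rejecting blank text is a design choice of this sketch rather than documented server behavior; it guards against ending or re-initializing a session by accident.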
Inbound Frames (Server → Client)
The server sends JSON text messages containing synthesized audio data.
Audio Frame
Purpose: Deliver synthesized audio data
Format:
| Property | Value |
|---|---|
| Format | mp3 |
| Sample Rate | 16 kHz |
| Bit Depth | 16-bit |
| Channels | Mono (1) |
| Encoding | Base64 |
- Multiple audio frames may be received for a single text input.
- Each audio chunk is a complete mp3 file with headers.
- Chunks should be concatenated in the order received.
- Use append mode when saving to file to preserve all audio.
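Handling one inbound frame might look like the sketch below: parse the JSON, base64-decode the payload, and append the bytes to the output file. The field name `audio` and the helper name `append_audio_frame` are assumptions for illustration.

```python
import base64
import json
from pathlib import Path


def append_audio_frame(message: str, out_path: Path) -> int:
    """Decode one inbound frame and append its audio to out_path.

    Returns the number of bytes written (0 if the frame carried no audio).
    """
    frame = json.loads(message)
    payload = frame.get("audio")  # assumed field name
    if not payload:
        return 0
    # validate=True rejects non-base64 characters instead of skipping them.
    chunk = base64.b64decode(payload, validate=True)
    with out_path.open("ab") as f:  # "ab": append in binary mode
        f.write(chunk)
    return len(chunk)
```

Call it once per received message, in arrival order, with the same `out_path`. Opening the file in write mode (`"wb"`) instead would overwrite earlier chunks and leave only the last one, so delete any stale file before starting a new session rather than changing the mode.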
Complete Example
Here’s a complete example showing all frame types in sequence:
Configuration Summary
Required Configuration
Message Sequence
Demo Project
A complete Python implementation is available under the link.
Video Demo
Watch this demonstration to see the Telnyx Text-to-Speech in action:
Troubleshooting
| Issue | Solution |
|---|---|
| Connection fails | Verify token format: Bearer YOUR_TOKEN |
| No audio received | Ensure the initialization frame is sent first |
| Audio is garbled | Check base64 decoding and file append mode |
| Empty audio file | Confirm text frame contains valid content |