Overview
The Telnyx Text-to-Speech (TTS) WebSocket API provides real-time audio synthesis from text input. This streaming endpoint allows you to send text and receive synthesized audio incrementally, enabling low-latency voice generation for real-time applications.
Video Demos
Watch these demonstrations to see Telnyx Text-to-Speech in action:
- Convert text to speech in REAL TIME | Python | TTS websocket streaming
- Telnyx Text-to-Speech API Use-case Demo
- Telnyx TTS Audio Reader
WebSocket Endpoint
Connection URL
Query Parameters
| Parameter | Type | Required | Description |
|---|---|---|---|
| voice | string | Yes | Voice identifier (e.g., Telnyx.NaturalHD.astra) |
| inactivity_timeout | integer | No | How long, in seconds, the WebSocket stays open without receiving a message (default: 20) |
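For example, a connection URL with both query parameters might look like the following; the host and path here are placeholders, since the actual endpoint is given in the Connection URL section above:

```python
# Placeholder host and path; substitute the URL from the "Connection URL" section.
WS_URL = "wss://<telnyx-tts-endpoint>?voice=Telnyx.NaturalHD.astra&inactivity_timeout=60"
```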
Authentication
Include your Telnyx API token in the Authorization header of the connection request (format: Bearer YOUR_TOKEN).
Example Connection
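A minimal connection sketch, assuming the third-party websocket-client package; the endpoint below is a placeholder for the URL in the Connection URL section above, and only the bearer-token header and the voice parameter are taken from this page:

```python
import websocket  # pip install websocket-client

TELNYX_API_KEY = "YOUR_TOKEN"
# Placeholder endpoint; use the URL from the "Connection URL" section above.
WS_URL = "wss://<telnyx-tts-endpoint>?voice=Telnyx.NaturalHD.astra"

# Open the WebSocket with the token in the Authorization header.
ws = websocket.create_connection(
    WS_URL,
    header={"Authorization": f"Bearer {TELNYX_API_KEY}"},
)
```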
Available Voices
Telnyx offers high-quality text-to-speech voices across multiple models, languages, and voice types. Use the interactive explorer below to browse and filter voices by model, language, and gender characteristics. You can search for specific voice names or IDs to quickly find the perfect voice for your application.
Connection Flow
The TTS WebSocket follows this lifecycle:
- Connect - Establish WebSocket connection with authentication.
- Initialize - Send initialization frame with space character.
- Send Text - Send one or more text frames to synthesize.
- Receive Audio - Receive audio frames with base64-encoded mp3 data.
- Stop - Send empty text frame to signal completion.
- Close - Connection closes after processing completes.
Flow Diagram
Frame Types
Outbound Frames (Client → Server)
All outbound frames are JSON text messages with the following structure:
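The exact schema is not reproduced on this page; the sketches in this section assume the payload is a JSON object whose single text field carries the string to synthesize (the field name is an assumption). Continuing with the ws connection opened in the example above:

```python
import json

# Hedged sketch of an outbound frame; the "text" field name is an assumption.
ws.send(json.dumps({"text": "string to synthesize"}))
```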
1. Initialization Frame
Purpose: Initialize the TTS session
Format (shown below):
- Must be sent first after connection.
- Contains a single space character.
- Required to begin the session.
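A minimal initialization frame under the same assumed field name, continuing from the examples above:

```python
# Initialization frame: a single space character (assumed "text" field).
ws.send(json.dumps({"text": " "}))
```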
2. Text Frame
Purpose: Send text content to be synthesized into speech
Format (shown below):
- Can send multiple text frames in one session.
- Each frame is processed and synthesized separately.
- Audio is returned incrementally for each text frame.
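A text frame sketch under the same assumed field name:

```python
# Text frame: the content to synthesize (assumed "text" field).
ws.send(json.dumps({"text": "Hello from the Telnyx TTS WebSocket API."}))
```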
3. Stop Frame
Purpose: Signal completion of text input and end the session
Format (shown below):
- Contains an empty string.
- Signals the server to finish processing.
- Should be sent after all text frames.
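A stop frame sketch under the same assumed field name:

```python
# Stop frame: an empty string signals that no more text will follow (assumed "text" field).
ws.send(json.dumps({"text": ""}))
```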
Inbound Frames (Server → Client)
The server sends JSON text messages containing synthesized audio data.
Audio Frame
Purpose: Deliver synthesized audio data
Format:
| Property | Value |
|---|---|
| Format | mp3 |
| Sample Rate | 16 kHz |
| Bit Depth | 16-bit |
| Channels | Mono (1) |
| Encoding | Base64 |
- Multiple audio frames may be received for a single text input.
- Each audio chunk is a complete mp3 file with headers.
- Chunks should be concatenated in the order received.
- Use append mode when saving to file to preserve all audio.
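A hedged sketch of receiving and saving audio, assuming the base64 payload arrives in an audio field (the inbound schema is not reproduced on this page) and reusing the websocket-client connection from the examples above:

```python
import base64
import json

import websocket

# Append mode preserves audio from earlier chunks, concatenated in arrival order.
with open("output.mp3", "ab") as f:
    while True:
        try:
            message = ws.recv()
        except websocket.WebSocketConnectionClosedException:
            break  # the server closed the connection after processing
        if not message:
            break
        frame = json.loads(message)
        # The "audio" field name is an assumption; inspect the actual payload.
        audio_b64 = frame.get("audio")
        if audio_b64:
            f.write(base64.b64decode(audio_b64))
```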
Complete Example
Here’s a complete example showing all frame types in sequence:
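The original example is not reproduced on this page; the sketch below strings the documented steps together under the same assumptions as the earlier examples (websocket-client package, placeholder endpoint, outbound text field, inbound audio field):

```python
import base64
import json

import websocket  # pip install websocket-client

TELNYX_API_KEY = "YOUR_TOKEN"
# Placeholder endpoint; use the URL from the "Connection URL" section above.
WS_URL = "wss://<telnyx-tts-endpoint>?voice=Telnyx.NaturalHD.astra"

# 1. Connect with the token in the Authorization header.
ws = websocket.create_connection(
    WS_URL,
    header={"Authorization": f"Bearer {TELNYX_API_KEY}"},
)

# 2. Initialization frame: a single space character.
ws.send(json.dumps({"text": " "}))

# 3. Text frame(s): the content to synthesize.
ws.send(json.dumps({"text": "Hello from the Telnyx TTS WebSocket API."}))

# 4. Stop frame: an empty string signals completion.
ws.send(json.dumps({"text": ""}))

# 5. Receive audio frames and append the decoded MP3 chunks to a file.
with open("output.mp3", "ab") as f:
    while True:
        try:
            message = ws.recv()
        except websocket.WebSocketConnectionClosedException:
            break  # 6. The server closes the connection after processing.
        if not message:
            break
        frame = json.loads(message)
        audio_b64 = frame.get("audio")  # assumed field name
        if audio_b64:
            f.write(base64.b64decode(audio_b64))

ws.close()
```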
Configuration Summary
Required Configuration
Message Sequence
Demo Project
A complete Python implementation is available at the link.
Troubleshooting
| Issue | Solution |
|---|---|
| Connection fails | Verify token format: Bearer YOUR_TOKEN |
| No audio received | Ensure initialization frame sent first |
| Audio is garbled | Check base64 decoding and file append mode |
| Empty audio file | Confirm text frame contains valid content |