Text-to-Speech WebSocket Streaming
Overview
The Telnyx Text-to-Speech (TTS) WebSocket API provides real-time audio synthesis from text input. This streaming endpoint allows you to send text and receive synthesized audio incrementally, enabling low-latency voice generation for real-time applications.
WebSocket Endpoint
Connection URL
wss://api.telnyx.com/v2/text-to-speech/speech?voice={voice_id}
Query Parameters
| Parameter | Type | Required | Description |
|---|---|---|---|
| voice | string | Yes | Voice identifier (e.g., Telnyx.NaturalHD.astra) |
| inactivity_timeout | integer | No | Time the WebSocket stays open without receiving a message (default: 20 seconds) |
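As a minimal sketch of how the query parameters combine into a connection URL, the helper below builds the URL from a voice identifier and an optional timeout. The function name `build_tts_url` is hypothetical and not part of any Telnyx SDK; the sketch also assumes the timeout is passed as a whole number of seconds.

```python
from urllib.parse import urlencode

BASE_URL = "wss://api.telnyx.com/v2/text-to-speech/speech"


def build_tts_url(voice, inactivity_timeout=None):
    """Return the connection URL for a voice (hypothetical helper).

    When inactivity_timeout is omitted, the parameter is left out of the
    URL so the server default applies.
    """
    if not voice:
        raise ValueError("voice is required")
    params = {"voice": voice}
    if inactivity_timeout is not None:
        # Assumption: the value is a whole number of seconds.
        params["inactivity_timeout"] = int(inactivity_timeout)
    return f"{BASE_URL}?{urlencode(params)}"


print(build_tts_url("Telnyx.NaturalHD.astra", inactivity_timeout=60))
```

Using `urlencode` rather than string concatenation keeps the URL valid if a voice identifier ever contains characters that need escaping.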
Authentication
Include your Telnyx API token as an Authorization header in the connection request:
Authorization: Bearer YOUR_TELNYX_TOKEN
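One way to avoid hard-coding the token is to read it from the environment when building the header. This is a sketch only: the variable name `TELNYX_API_KEY` and the helper `auth_headers` are assumptions made for illustration, not names defined by Telnyx.

```python
import os


def auth_headers(env=os.environ):
    """Build the Authorization header from an environment mapping.

    TELNYX_API_KEY is a hypothetical variable name chosen for this sketch.
    """
    token = env.get("TELNYX_API_KEY", "").strip()
    if not token:
        raise RuntimeError("TELNYX_API_KEY is not set")
    return {"Authorization": f"Bearer {token}"}


# Passing a plain dict instead of os.environ makes the helper easy to try out.
print(auth_headers({"TELNYX_API_KEY": "YOUR_TELNYX_TOKEN"}))
```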
Example Connection
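import asyncio
import websockets

async def main():
    url = "wss://api.telnyx.com/v2/text-to-speech/speech?voice=Telnyx.NaturalHD.astra"
    headers = {
        "Authorization": "Bearer YOUR_TELNYX_TOKEN"
    }
    # Note: websockets 14.0 and later name this keyword additional_headers
    async with websockets.connect(url, extra_headers=headers) as websocket:
        print("Connected")  # send and receive frames here

asyncio.run(main())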
Available Voices
The list of available voices can be found by querying the following endpoint:
Connection Flow
The TTS WebSocket follows this lifecycle:
- Connect - Establish WebSocket connection with authentication
- Initialize - Send initialization frame with space character
- Send Text - Send one or more text frames to synthesize
- Receive Audio - Receive audio frames with base64-encoded mp3 data
- Stop - Send empty text frame to signal completion
- Close - Connection closes after processing completes
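The outbound side of the lifecycle above can be sketched as a small function that produces the JSON messages in the order they are sent. The name `build_frames` is hypothetical. The sketch drops blank text entries, on the assumption that an empty string or lone space in the middle of a session would be read as a stop or initialization frame.

```python
import json


def build_frames(texts):
    """Return the JSON messages for one session, in send order."""
    frames = [{"text": " "}]  # initialization frame: a single space
    # Skip blank entries so they are not mistaken for init or stop frames.
    frames += [{"text": t} for t in texts if t.strip()]
    frames.append({"text": ""})  # stop frame: an empty string
    return [json.dumps(frame) for frame in frames]


for message in build_frames(["Hello", "", "World"]):
    print(message)
```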
Flow Diagram
Client                     Server
|                               |
|------- Connect -------------->|
|<------ Connected -------------|
|                               |
|------- Init Frame ----------->|
|        {"text": " "}          |
|                               |
|------- Text Frame ----------->|
|        {"text": "Hello"}      |
|                               |
|<------ Audio Frame -----------|
|        {"audio": "base64..."} |
|<------ Audio Frame -----------|
|        {"audio": "base64..."} |
|                               |
|------- Stop Frame ----------->|
|        {"text": ""}           |
|                               |
|<------ Close -----------------|
Frame Types
Outbound Frames (Client → Server)
All outbound frames are JSON text messages with the following structure:
1. Initialization Frame
Purpose: Initialize the TTS session
Format:
{
  "text": " "
}
Example:
import json
init_frame = {"text": " "}
await websocket.send(json.dumps(init_frame))
Notes:
- Must be sent first after connection
- Contains a single space character
- Required to begin the session
2. Text Frame
Purpose: Send text content to be synthesized into speech
Format:
{
  "text": "Your text content here"
}
Example:
text_frame = {"text": "Hello, this is a test of the Telnyx TTS service."}
await websocket.send(json.dumps(text_frame))
Multiple Text Frames:
import asyncio

# You can send multiple text frames sequentially
frames = [
    {"text": "First sentence."},
    {"text": "Second sentence."},
    {"text": "Third sentence."}
]
for frame in frames:
    await websocket.send(json.dumps(frame))
    await asyncio.sleep(0.5)  # brief pause between frames
Notes:
- Can send multiple text frames in one session
- Each frame is processed and synthesized separately
- Audio is returned incrementally for each text frame
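Because each text frame is synthesized separately, longer passages can be broken into several frames before sending. The sketch below splits on sentence-ending punctuation; the helper name `sentence_frames` is hypothetical, and sentence-sized frames are an assumption for illustration rather than a documented recommendation. The splitter is deliberately naive and will also split after abbreviations such as "Dr.".

```python
import re


def sentence_frames(text):
    """Split text into one text frame per sentence (naive splitter)."""
    # Split on whitespace that follows '.', '!' or '?'.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [{"text": part} for part in parts if part]


for frame in sentence_frames("First sentence. Second one! Third?"):
    print(frame)
```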
3. Stop Frame
Purpose: Signal completion of text input and end the session
Format:
{
  "text": ""
}
Example:
stop_frame = {"text": ""}
await websocket.send(json.dumps(stop_frame))
Notes:
- Contains an empty string
- Signals the server to finish processing
- Should be sent after all text frames
Inbound Frames (Server → Client)
The server sends JSON text messages containing synthesized audio data.
Audio Frame
Purpose: Deliver synthesized audio data
Format:
{
  "audio": "UklGRiQAAABXQVZFZm10IBAAAAABAAEAQB8AAAB9AAACABAAZGF0YQAAAAA="
}
Processing Audio:
import base64

async for message in websocket:
    data = json.loads(message)
    if "audio" in data:
        # Decode base64 audio
        audio_bytes = base64.b64decode(data["audio"])
        # Save or process audio
        with open("output.mp3", "ab") as f:
            f.write(audio_bytes)
Audio Specifications:
| Property | Value |
|---|---|
| Format | mp3 |
| Sample Rate | 16 kHz |
| Bit Depth | 16-bit |
| Channels | Mono (1) |
| Encoding | Base64 |
Notes:
- Multiple audio frames may be received for a single text input
- Each audio chunk is a complete mp3 file with headers
- Chunks should be concatenated in the order received
- Use append mode when saving to file to preserve all audio
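As an alternative to appending to a file, the chunks can be collected in memory and joined in arrival order. This is a minimal sketch: the helper name `collect_audio` is hypothetical, and it assumes that any message without a non-empty `audio` field can be ignored.

```python
import base64
import json


def collect_audio(messages):
    """Decode audio frames in arrival order and join them into one bytes object."""
    chunks = []
    for message in messages:
        data = json.loads(message)
        # Assumption: messages without audio data carry nothing to save.
        if data.get("audio"):
            chunks.append(base64.b64decode(data["audio"]))
    return b"".join(chunks)


# Two fake audio frames with an unrelated message in between.
sample = [
    json.dumps({"audio": base64.b64encode(b"abc").decode()}),
    json.dumps({"status": "ignored"}),
    json.dumps({"audio": base64.b64encode(b"def").decode()}),
]
print(collect_audio(sample))
```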
Complete Example
Here's a complete example showing all frame types in sequence:
import asyncio
import base64
import json

import websockets


async def tts_example():
    # 1. Connect to WebSocket
    url = "wss://api.telnyx.com/v2/text-to-speech/speech?voice=Telnyx.NaturalHD.astra"
    headers = {
        "Authorization": "Bearer YOUR_TELNYX_TOKEN"
    }
    async with websockets.connect(url, extra_headers=headers) as ws:
        print("Connected to TTS WebSocket")

        # 2. Send initialization frame
        init_frame = {"text": " "}
        await ws.send(json.dumps(init_frame))
        print("Sent: Initialization frame")

        # 3. Send text frame
        text_frame = {"text": "Hello, welcome to Telnyx Text-to-Speech streaming."}
        await ws.send(json.dumps(text_frame))
        print("Sent: Text frame")

        # 4. Receive audio frames
        audio_count = 0
        stop_sent = False
        async for message in ws:
            data = json.loads(message)
            if "audio" in data:
                audio_count += 1
                audio_bytes = base64.b64decode(data["audio"])
                # Append audio chunks to file
                with open("output.mp3", "ab") as f:
                    f.write(audio_bytes)
                print(f"Received: Audio frame #{audio_count} ({len(audio_bytes)} bytes)")
                # After receiving audio, send the stop frame (only once)
                if audio_count >= 10 and not stop_sent:  # Adjust based on your needs
                    # 5. Send stop frame
                    stop_frame = {"text": ""}
                    await ws.send(json.dumps(stop_frame))
                    stop_sent = True
                    print("Sent: Stop frame")
    print("Connection closed")


asyncio.run(tts_example())
Expected Output:
Connected to TTS WebSocket
Sent: Initialization frame
Sent: Text frame
Received: Audio frame #1 (8192 bytes)
Received: Audio frame #2 (6144 bytes)
Received: Audio frame #3 (4096 bytes)
Sent: Stop frame
Connection closed
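The same session can also be written as a function that accepts any connection object with `send` and async iteration, which makes the flow testable without network access. This is a sketch under stated assumptions: `synthesize` and `FakeConnection` are hypothetical names, and the sketch sends the stop frame immediately after the text frames, assuming the server finishes synthesizing queued text before it closes the connection.

```python
import asyncio
import base64
import json


async def synthesize(ws, texts):
    """Run one TTS session on an open connection and return the audio bytes."""
    await ws.send(json.dumps({"text": " "}))       # initialization frame
    for text in texts:
        await ws.send(json.dumps({"text": text}))  # text frames
    # Assumption: queued text is still synthesized after the stop frame.
    await ws.send(json.dumps({"text": ""}))        # stop frame
    audio = bytearray()
    async for message in ws:                       # ends when the connection closes
        data = json.loads(message)
        if data.get("audio"):
            audio += base64.b64decode(data["audio"])
    return bytes(audio)


class FakeConnection:
    """Hypothetical in-memory stand-in for a WebSocket connection (testing only)."""

    def __init__(self, replies):
        self.sent = []
        self._replies = list(replies)

    async def send(self, message):
        self.sent.append(json.loads(message))

    def __aiter__(self):
        return self

    async def __anext__(self):
        if not self._replies:
            raise StopAsyncIteration
        return self._replies.pop(0)


replies = [json.dumps({"audio": base64.b64encode(b"chunk").decode()})]
fake = FakeConnection(replies)
audio = asyncio.run(synthesize(fake, ["Hello"]))
print(fake.sent, len(audio))
```

With a real connection, the object returned by `websockets.connect` can be passed as `ws` in place of the stub.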
Configuration Summary
Required Configuration
# WebSocket URL
ENDPOINT = "wss://api.telnyx.com/v2/text-to-speech/speech"
VOICE_ID = "Telnyx.NaturalHD.astra"
URL = f"{ENDPOINT}?voice={VOICE_ID}"
# Authentication Header
TELNYX_TOKEN = "YOUR_TELNYX_TOKEN"
HEADERS = {
    "Authorization": f"Bearer {TELNYX_TOKEN}"
}
Message Sequence
# 1. Initialization
{"text": " "}
# 2. Text to synthesize (can send multiple)
{"text": "Your text here"}
# 3. Stop signal
{"text": ""}
Demo Project
A complete Python implementation is available under the link.
Troubleshooting
| Issue | Solution |
|---|---|
| Connection fails | Verify token format: Bearer YOUR_TOKEN |
| No audio received | Ensure initialization frame sent first |
| Audio is garbled | Check base64 decoding and file append mode |
| Empty audio file | Confirm text frame contains valid content |
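Two of the checks in the table can be automated on the client side before involving the network. The sketch below is illustrative only; `check_auth_header` and `decode_audio_strict` are hypothetical helper names. The first confirms the header has the `Bearer <token>` shape, and the second uses strict base64 validation so that a corrupted payload is detected instead of being silently decoded into garbled audio.

```python
import base64
import binascii


def check_auth_header(value):
    """Return True if the header value looks like 'Bearer <token>'."""
    scheme, _, token = value.partition(" ")
    return scheme == "Bearer" and bool(token.strip())


def decode_audio_strict(payload):
    """Decode a base64 payload, returning None if it is not valid base64."""
    try:
        # validate=True rejects characters outside the base64 alphabet
        return base64.b64decode(payload, validate=True)
    except (binascii.Error, ValueError):
        return None


print(check_auth_header("Bearer YOUR_TELNYX_TOKEN"))
print(decode_audio_strict("not base64!"))
```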