Open a WebSocket connection to stream text and receive synthesized audio in real time. Authentication is provided via the standard Authorization: Bearer <API_KEY> header. Send JSON frames with text to synthesize; receive JSON frames containing base64-encoded audio chunks.
Supported providers: aws, telnyx, azure, murfai, minimax, rime, resemble, elevenlabs.
Connection flow:
1. Send {"text": " "} (single space) with optional voice_settings to initialize the session.
2. Send the text to synthesize, for example {"text": "Hello world"}.
3. Receive audio frames; a frame with isFinal: true indicates the end of audio for the current text.
To interrupt and restart synthesis mid-stream, send {"force": true}; the current worker is stopped and a new one is started.
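The flow above can be sketched in Python without committing to a particular WebSocket library. Here `send` and `recv` are hypothetical callables that wrap whichever client you use, and `run_session` is an illustrative name, not part of the API. Error frames and timeouts are not handled in this sketch.

```python
import json
from typing import Callable, List, Optional


def run_session(
    send: Callable[[str], None],
    recv: Callable[[], str],
    text: str,
    voice_settings: Optional[dict] = None,
) -> List[dict]:
    """Send the handshake and one text frame, then collect frames until isFinal."""
    # Handshake: a single space, optionally accompanied by voice_settings.
    handshake = {"text": " "}
    if voice_settings:
        handshake["voice_settings"] = voice_settings
    send(json.dumps(handshake))

    # The actual text to synthesize.
    send(json.dumps({"text": text}))

    # Read frames until one carries isFinal: true.
    frames: List[dict] = []
    while True:
        frame = json.loads(recv())
        frames.append(frame)
        if frame.get("isFinal") is True:
            return frames
```

Keeping the transport behind two callables makes the sequence easy to exercise against an in-memory stand-in before wiring it to a real socket.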
Bearer authentication header of the form Bearer <token>, where <token> is your auth token.
Voice identifier in the format provider.model_id.voice_id or provider.voice_id (e.g. telnyx.NaturalHD.Telnyx_Alloy or azure.en-US-AvaMultilingualNeural). When provided, the provider, model_id, and voice_id are extracted automatically. Takes precedence over individual provider/model_id/voice_id parameters.
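As an illustration of the two documented formats, the following client-side sketch splits a voice identifier into its parts. It is not the server's parser: `parse_voice` and `VoiceParts` are hypothetical names, and the sketch assumes no component contains a dot, so a model ID such as Polly.Generative is outside its scope.

```python
from typing import NamedTuple, Optional


class VoiceParts(NamedTuple):
    provider: str
    model_id: Optional[str]
    voice_id: str


def parse_voice(voice: str) -> VoiceParts:
    """Split provider.model_id.voice_id or provider.voice_id into components."""
    parts = voice.split(".")
    if any(part == "" for part in parts):
        raise ValueError(f"empty component in voice identifier: {voice!r}")
    if len(parts) == 2:
        # provider.voice_id: no model component
        return VoiceParts(parts[0], None, parts[1])
    if len(parts) == 3:
        # provider.model_id.voice_id
        return VoiceParts(parts[0], parts[1], parts[2])
    raise ValueError(f"expected 2 or 3 dot-separated components: {voice!r}")
```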
TTS provider. Defaults to telnyx if not specified. Ignored when voice is provided. Accepted values: aws, telnyx, azure, elevenlabs, minimax, murfai, rime, resemble.
Model identifier for the chosen provider. Examples: Natural, NaturalHD (Telnyx); Polly.Generative (AWS).
Voice identifier for the chosen provider.
When true, bypass the audio cache and generate fresh audio.
Audio output format override. Supported for Telnyx Natural/NaturalHD models only. Accepted values: pcm, wav.
Client-provided socket identifier for tracking. If not provided, one is generated server-side.
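A minimal sketch of assembling the connection URL and Authorization header, assuming the parameters above are passed as query-string parameters. The base URL is a placeholder, since the endpoint address is not given here, and `build_connection` is a hypothetical helper. Only the parameter names that appear in the text (voice, provider, model_id, voice_id) are used.

```python
from typing import Dict, Optional, Tuple
from urllib.parse import urlencode

# Placeholder only: substitute the real WebSocket endpoint.
BASE_URL = "wss://example.invalid/tts"


def build_connection(
    api_key: str,
    voice: Optional[str] = None,
    provider: Optional[str] = None,
    model_id: Optional[str] = None,
    voice_id: Optional[str] = None,
    base_url: str = BASE_URL,
) -> Tuple[str, Dict[str, str]]:
    """Return (url, headers) for opening the WebSocket connection."""
    if voice is not None:
        # voice takes precedence, so the individual parameters are left out.
        params = {"voice": voice}
    else:
        candidates = (("provider", provider), ("model_id", model_id), ("voice_id", voice_id))
        params = {name: value for name, value in candidates if value is not None}
    url = f"{base_url}?{urlencode(params)}" if params else base_url
    headers = {"Authorization": f"Bearer {api_key}"}
    return url, headers
```

The returned pair can be handed to any WebSocket client that accepts custom request headers.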
WebSocket connection established. Communication proceeds via JSON frames.
Client → Server: See ClientTextFrame schema.
Server → Client: See AudioChunkFrame, FinalFrame, and ErrorFrame schemas.
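One way to handle the three server frame types is sketched below. Only isFinal is named in this reference; the "audio" and "error" keys are assumptions made for illustration, so check the AudioChunkFrame and ErrorFrame schemas for the actual field names. `consume_frame` and `SynthesisError` are hypothetical names.

```python
import base64
import json

AUDIO_KEY = "audio"  # assumed field name for the base64 audio payload
ERROR_KEY = "error"  # assumed field name for error details


class SynthesisError(Exception):
    """Raised when the server sends an error frame."""


def consume_frame(raw: str, buffer: bytearray) -> bool:
    """Append any audio in the frame to buffer; return True on the final frame."""
    frame = json.loads(raw)
    if ERROR_KEY in frame:
        raise SynthesisError(str(frame[ERROR_KEY]))
    audio_b64 = frame.get(AUDIO_KEY)
    if audio_b64:
        # Audio chunks arrive base64-encoded inside the JSON frame.
        buffer.extend(base64.b64decode(audio_b64))
    return frame.get("isFinal") is True
```

Call it once per received message and stop reading for the current text when it returns True.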
Client-to-server frame containing text to synthesize.
Text to convert to speech. Send " " (single space) as an initial handshake with optional voice_settings. Subsequent messages contain the actual text to synthesize.
Provider-specific voice settings sent with the initial handshake. Contents vary by provider — e.g. {"speed": 1.2} for Minimax, {"voice_speed": 1.5} for Telnyx.
When true, stops the current synthesis worker and starts a new one. Used to interrupt speech mid-stream and begin synthesizing new text.
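The three client frame fields described above (text, voice_settings, force) can be serialized with a small helper such as the following sketch. `client_frame` is a hypothetical name, and whether the server accepts text and force in the same frame is not stated here, so the helper permits the combination without relying on it.

```python
import json
from typing import Any, Dict, Optional


def client_frame(
    text: Optional[str] = None,
    voice_settings: Optional[Dict[str, Any]] = None,
    force: bool = False,
) -> str:
    """Serialize a client-to-server frame, omitting fields that are not set."""
    frame: Dict[str, Any] = {}
    if text is not None:
        frame["text"] = text
    if voice_settings is not None:
        frame["voice_settings"] = voice_settings
    if force:
        frame["force"] = True
    if not frame:
        raise ValueError("a frame needs at least one of text, voice_settings, force")
    return json.dumps(frame)
```

For example, `client_frame(" ", {"speed": 1.2})` produces a handshake with Minimax-style settings, and `client_frame(force=True)` produces the interrupt frame.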