Open a WebSocket connection to stream text and receive synthesized audio in real time. Authentication is provided via the standard Authorization: Bearer <API_KEY> header. Send JSON frames with text to synthesize; receive JSON frames containing base64-encoded audio chunks.
Supported providers: aws, telnyx, azure, murfai, minimax, rime, resemble, elevenlabs, inworld.
Connection flow:
1. Send {"text": " "} (single space) with optional voice_settings to initialize the session.
2. Send the text to synthesize, for example {"text": "Hello world"}.
3. Read audio frames until one with isFinal: true arrives, which indicates the end of audio for the current text.

To interrupt and restart synthesis mid-stream, send {"force": true}; the current worker is stopped and a new one is started.
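The client side of this flow can be sketched as a sequence of JSON text frames. This is a minimal illustration, not a reference client: the helper names (handshake_frame, text_frame, interrupt_frame, session_frames) are hypothetical, and sending the frames requires a WebSocket client of your choice, which is omitted here.

```python
import json
from typing import Iterable, Iterator, Optional


def handshake_frame(voice_settings: Optional[dict] = None) -> str:
    """Initial frame: a single space, plus optional provider-specific voice_settings."""
    frame = {"text": " "}
    if voice_settings is not None:
        frame["voice_settings"] = voice_settings
    return json.dumps(frame)


def text_frame(text: str) -> str:
    """Frame carrying the text to synthesize."""
    return json.dumps({"text": text})


def interrupt_frame() -> str:
    """Frame asking the server to stop the current worker and start a new one."""
    return json.dumps({"force": True})


def session_frames(
    texts: Iterable[str], voice_settings: Optional[dict] = None
) -> Iterator[str]:
    """Yield frames in the documented order: handshake first, then each text."""
    yield handshake_frame(voice_settings)
    for text in texts:
        yield text_frame(text)
```

For example, list(session_frames(["Hello world"], {"voice_speed": 1.5})) yields the handshake frame followed by one text frame; interrupt_frame() can be sent at any point in between to restart synthesis.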
Bearer authentication header of the form Bearer <token>, where <token> is your auth token.
Voice identifier in the format provider.model_id.voice_id or provider.voice_id (e.g. telnyx.NaturalHD.Telnyx_Alloy or azure.en-US-AvaMultilingualNeural). When provided, the provider, model_id, and voice_id are extracted automatically. Takes precedence over individual provider/model_id/voice_id parameters.
TTS provider. Defaults to telnyx if not specified. Ignored when voice is provided. Accepted values: aws, telnyx, azure, elevenlabs, minimax, murfai, rime, resemble, inworld.
Model identifier for the chosen provider. Examples: Natural, NaturalHD (Telnyx); Polly.Generative (AWS).
Voice identifier for the chosen provider.
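The precedence between voice and the individual provider, model_id, and voice_id parameters can be illustrated client-side. This is a sketch under stated assumptions: the server's actual extraction logic is not documented here, resolve_voice and ResolvedVoice are hypothetical names, and the split rule (first dot-separated segment is the provider, last is the voice_id, anything in between is the model_id) is a guess that fits the two documented examples.

```python
from typing import NamedTuple, Optional


class ResolvedVoice(NamedTuple):
    provider: str
    model_id: Optional[str]
    voice_id: Optional[str]


def resolve_voice(
    voice: Optional[str] = None,
    provider: Optional[str] = None,
    model_id: Optional[str] = None,
    voice_id: Optional[str] = None,
) -> ResolvedVoice:
    """Return the effective provider/model_id/voice_id for a request."""
    if voice:
        # voice takes precedence over the individual parameters.
        parts = voice.split(".")
        if len(parts) < 2:
            raise ValueError(
                "expected provider.voice_id or provider.model_id.voice_id"
            )
        # Assumed rule: segments between the first and last form the model_id.
        middle = ".".join(parts[1:-1]) or None
        return ResolvedVoice(parts[0], middle, parts[-1])
    # Without voice, provider falls back to the documented default.
    return ResolvedVoice(provider or "telnyx", model_id, voice_id)
```

Under this rule, telnyx.NaturalHD.Telnyx_Alloy resolves to provider telnyx, model NaturalHD, and voice Telnyx_Alloy, while azure.en-US-AvaMultilingualNeural resolves to provider azure with no model_id.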
When true, bypass the audio cache and generate fresh audio.
Audio output format override. Supported for Telnyx Natural/NaturalHD models only. Accepted values: pcm, wav.
Client-provided socket identifier for tracking. If not provided, one is generated server-side.
WebSocket connection established. Communication proceeds via JSON frames.
Client → Server: See ClientTextFrame schema.
Server → Client: See AudioChunkFrame, FinalFrame, and ErrorFrame schemas.
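Handling the server-to-client frames might look like the following sketch. The AudioChunkFrame, FinalFrame, and ErrorFrame schemas are not reproduced in this section, so the field names audio and error are assumptions; only isFinal and the base64 encoding come from the description above. collect_audio is a hypothetical helper that works on any iterable of raw JSON strings, so it is independent of the WebSocket library used.

```python
import base64
import json
from typing import Iterable


def collect_audio(raw_frames: Iterable[str]) -> bytes:
    """Decode base64 audio chunks until a frame with isFinal: true arrives."""
    audio = bytearray()
    for raw in raw_frames:
        frame = json.loads(raw)
        if "error" in frame:  # assumed ErrorFrame field name
            raise RuntimeError(f"server error: {frame['error']}")
        chunk = frame.get("audio")  # assumed AudioChunkFrame field name
        if chunk:
            audio.extend(base64.b64decode(chunk))
        if frame.get("isFinal") is True:
            # End of audio for the current text; stop reading.
            break
    return bytes(audio)
```

Stopping at the first isFinal frame keeps the audio for one text message separate from whatever the server sends for the next one.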
Client-to-server frame containing text to synthesize.
Text to convert to speech. Send " " (single space) as an initial handshake with optional voice_settings. Subsequent messages contain the actual text to synthesize.
Provider-specific voice settings sent with the initial handshake. Contents vary by provider — e.g. {"speed": 1.2} for Minimax, {"voice_speed": 1.5} for Telnyx.
When true, stops the current synthesis worker and starts a new one. Used to interrupt speech mid-stream and begin synthesizing new text.