How It Works

You send text in, and audio streams back over the same HTTP connection, with no polling and no callbacks. The response uses HTTP chunked transfer encoding, so audio chunks arrive as they’re synthesized and your client can begin playback immediately instead of waiting for the full file. The connection stays open until synthesis completes or 30 seconds pass with no new chunks. This makes the REST endpoint suitable for real-time playback, not just batch file generation. For multi-turn conversational use cases where you’re continuously feeding text, use WebSocket Streaming instead.
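To make the streaming behavior concrete, here is a minimal client sketch using only the Python standard library. The URL, the `Authorization` header, and the `{"text": ...}` request body are placeholders invented for illustration; the real endpoint path, auth scheme, and fields are defined in the API Reference. The client-side timeout simply mirrors the 30-second idle limit described above and is an assumption about sensible client behavior, not a documented requirement.

```python
import io
import json
import urllib.request
from typing import BinaryIO, Iterable, Iterator

# Hypothetical values: the real endpoint, auth header, and body fields
# come from the API Reference, not from this sketch.
TTS_URL = "https://api.example.com/v1/tts"
IDLE_TIMEOUT_S = 30  # assumed to match the 30-second no-new-chunks limit


def iter_audio_chunks(text: str, api_key: str, url: str = TTS_URL,
                      read_size: int = 4096) -> Iterator[bytes]:
    """POST the text and yield audio bytes as soon as they arrive."""
    body = json.dumps({"text": text}).encode("utf-8")
    req = urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",  # assumed auth scheme
        },
    )
    # urllib decodes chunked transfer encoding transparently. The timeout
    # applies to each blocking socket operation, so a stalled stream raises
    # instead of hanging forever.
    with urllib.request.urlopen(req, timeout=IDLE_TIMEOUT_S) as resp:
        while True:
            # read1 returns whatever is available (up to read_size bytes)
            # rather than waiting to fill the buffer.
            chunk = resp.read1(read_size)
            if not chunk:  # empty bytes means the server finished the stream
                break
            yield chunk


def play_as_received(chunks: Iterable[bytes], sink: BinaryIO) -> int:
    """Forward each chunk to a sink (file, pipe, audio device) immediately.

    Returns the total number of bytes written.
    """
    total = 0
    for chunk in chunks:
        sink.write(chunk)
        total += len(chunk)
    return total
```

A caller would typically open a file or a player's stdin as the sink and pass `iter_audio_chunks("Hello", api_key)` to `play_as_received`. Because the generator yields as data arrives, playback can start before synthesis has finished.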

Text Preprocessing

Before synthesis, text passes through two stages:
  1. Markdown stripping: headers, bold, italics, code blocks, links, lists, and emoji are converted to plain text.
  2. Pronunciation dictionary: if pronunciation_dict_id is set, custom word replacements are applied.
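The two stages above run server-side, but a rough local approximation can help you predict what text actually reaches the synthesizer. The sketch below is not the service's implementation: the regular expressions cover only common Markdown constructs, emoji handling is omitted because its exact behavior isn't specified here, and the whole-word, case-sensitive matching used for the dictionary is an assumption. The function names and the `replacements` mapping are hypothetical.

```python
import re
from typing import Mapping, Optional


def strip_markdown(text: str) -> str:
    """Approximate Markdown-to-plain-text conversion (deliberately incomplete)."""
    # Fenced code blocks: drop the fences, keep the content.
    text = re.sub(r"```[^\n]*\n(.*?)```", r"\1", text, flags=re.S)
    # Inline code: drop the backticks.
    text = re.sub(r"`([^`]*)`", r"\1", text)
    # Links: keep the label, drop the URL.
    text = re.sub(r"\[([^\]]+)\]\([^)]*\)", r"\1", text)
    # Headers: drop leading '#' markers.
    text = re.sub(r"^[ \t]{0,3}#{1,6}[ \t]+", "", text, flags=re.M)
    # List items: drop bullet or number markers.
    text = re.sub(r"^[ \t]*(?:[-*+]|\d+\.)[ \t]+", "", text, flags=re.M)
    # Bold, then italics (asterisk forms only, so snake_case stays intact).
    text = re.sub(r"\*\*(.+?)\*\*", r"\1", text)
    text = re.sub(r"\*(.+?)\*", r"\1", text)
    return text.strip()


def apply_pronunciations(text: str, replacements: Mapping[str, str]) -> str:
    """Replace whole words using a dictionary (assumed matching rules)."""
    if not replacements:
        return text
    # Longest keys first so overlapping entries prefer the longer match.
    keys = sorted(replacements, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(k) for k in keys) + r")\b")
    return pattern.sub(lambda m: replacements[m.group(1)], text)


def preprocess(text: str,
               replacements: Optional[Mapping[str, str]] = None) -> str:
    """Run both stages in the documented order: strip, then replace."""
    plain = strip_markdown(text)
    return apply_pronunciations(plain, replacements or {})
```

For example, `preprocess("Say **SQL** via [docs](http://x.y).", {"SQL": "sequel"})` strips the bold markers and the link target before the dictionary entry for `SQL` is applied. Running the stages in this order means dictionary entries match against plain text rather than Markdown syntax.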

API Reference

The full OpenAPI spec for these endpoints is available in the auto-generated API Reference. Note: the spec is currently being cleaned up, so some fields and provider-specific schemas may be incomplete.