How It Works
You send text in, and audio streams back over the same HTTP connection. No polling, no callbacks. The response uses HTTP chunked transfer encoding, so audio chunks arrive as they're synthesized and your client can begin playback immediately without waiting for the full file. The connection stays open until synthesis completes or 30 seconds pass with no new chunks. This makes the REST endpoint suitable for real-time playback, not just batch file generation. For multi-turn conversational use cases where you're continuously feeding text, use WebSocket Streaming instead.
Text Preprocessing
Before synthesis, text passes through two stages:
- Markdown stripping: headers, bold, italics, code blocks, links, lists, and emoji are converted to plain text.
- Pronunciation dictionary: if pronunciation_dict_id is set, custom word replacements are applied.