Binary Output (default)

When output_type is omitted or "binary_output":
  • Status: 200
  • Content-Type: audio/mpeg (or the format appropriate to the provider)
  • Body: Raw audio bytes
The response uses HTTP chunked transfer encoding, so audio streams as it is synthesized. The stream times out after 30 seconds of inactivity between chunks.
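As a rough illustration of consuming the streamed body, the sketch below writes chunks to a sink as they arrive instead of buffering the whole response. The helper name write_audio_stream is hypothetical and not part of the API. The chunk source is left abstract so that any HTTP client that exposes the body as an iterable of bytes can feed it.

```python
import io
from typing import BinaryIO, Iterable


def write_audio_stream(chunks: Iterable[bytes], sink: BinaryIO) -> int:
    """Write audio chunks to `sink` as they arrive; return total bytes written.

    `chunks` is assumed to be any iterable of raw byte chunks, for example the
    chunk iterator of a streaming HTTP response.
    """
    total = 0
    for chunk in chunks:
        if not chunk:  # ignore empty chunks
            continue
        sink.write(chunk)
        total += len(chunk)
    return total


if __name__ == "__main__":
    # Simulated chunks standing in for a real streamed response body.
    buffer = io.BytesIO()
    written = write_audio_stream([b"ID3", b"", b"\x00\x01"], buffer)
    print(written, "bytes written")
```

With the requests library, for instance, the chunks could come from response.iter_content(chunk_size=4096) on a request made with stream=True. A client read timeout slightly above 30 seconds would roughly match the server's inactivity limit, though that pairing is a suggestion rather than a documented requirement.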

Base64 Output

When output_type is "base64_output":
  • Status: 200
  • Content-Type: application/json
  • Body:
{
  "base64_audio": "<base64-encoded-audio>"
}
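
A minimal sketch of turning this response back into raw audio bytes, assuming the body has exactly the shape shown above. The function name decode_base64_audio is hypothetical.

```python
import base64
import json


def decode_base64_audio(body: str) -> bytes:
    """Parse the JSON response body and return the decoded audio bytes."""
    payload = json.loads(body)
    # validate=True raises binascii.Error on characters outside the base64
    # alphabet instead of silently discarding them.
    return base64.b64decode(payload["base64_audio"], validate=True)


if __name__ == "__main__":
    sample = '{"base64_audio": "aGVsbG8="}'  # placeholder payload, not real audio
    print(decode_base64_audio(sample))
```

The decoded bytes can then be written to a file or passed to an audio player. The container format depends on the provider, as with binary output.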

Audio ID Output

When output_type is "audio_id":
  • Status: 200
  • Body: JSON with an audio_id
Retrieve the audio later:
GET /v2/text-to-speech/speech/:audio_id
This is useful for async workflows where synthesis is triggered separately from playback.
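
The retrieval step might look like the sketch below, which builds the GET URL from the synthesis response. Two things here are assumptions: that the JSON body exposes the identifier under a top-level "audio_id" key, and the base URL https://api.example.com, which is a placeholder. The helper name audio_url is hypothetical.

```python
import json
from urllib.parse import quote


def audio_url(base_url: str, response_body: str) -> str:
    """Build the retrieval URL for the audio_id in a synthesis response."""
    # Assumption: the identifier is a top-level "audio_id" field.
    audio_id = json.loads(response_body)["audio_id"]
    # Percent-encode the identifier so it is safe as a single path segment.
    segment = quote(str(audio_id), safe="")
    return f"{base_url.rstrip('/')}/v2/text-to-speech/speech/{segment}"


if __name__ == "__main__":
    body = '{"audio_id": "abc123"}'  # illustrative response body
    print(audio_url("https://api.example.com", body))
```

A playback component can then issue the GET request whenever it needs the audio, independently of the process that triggered synthesis.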

Errors

400 Bad Request

Invalid parameters, unsupported provider, or malformed voice format.
{
  "error": "invalid voice structure"
}

401 Unauthorized

Missing or invalid API key.

403 Forbidden

  • The Ultra model is restricted on the public WebSocket; use REST instead.
  • Cloned voices require identity verification.

422 Unprocessable Entity

Validation error (e.g., missing text field, invalid voice settings).

502 Bad Gateway

Upstream TTS provider error.
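
One way a client might turn these statuses into readable messages is sketched below. It assumes that error bodies, when present, carry an "error" string as in the 400 example. That shape is only documented for 400, so the code falls back to the bare status text otherwise. The names ERROR_MEANINGS and describe_error are hypothetical.

```python
import json

# Status codes listed in this section, with their standard reason phrases.
ERROR_MEANINGS = {
    400: "Bad Request",
    401: "Unauthorized",
    403: "Forbidden",
    422: "Unprocessable Entity",
    502: "Bad Gateway",
}


def describe_error(status: int, body: str = "") -> str:
    """Return a short message for an error response."""
    meaning = ERROR_MEANINGS.get(status, "Unexpected status")
    detail = None
    try:
        parsed = json.loads(body) if body else None
        # Assumption: the detail lives under an "error" key, as in the 400 example.
        if isinstance(parsed, dict):
            detail = parsed.get("error")
    except json.JSONDecodeError:
        pass  # non-JSON body: report the status alone
    return f"{status} {meaning}" + (f": {detail}" if detail else "")


if __name__ == "__main__":
    print(describe_error(400, '{"error": "invalid voice structure"}'))
```

Whether a 502 is worth retrying depends on the upstream provider, so retry logic is deliberately left out of this sketch.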