The output_type request field controls what the API returns and when.
Streaming Audio (default)
With output_type: "binary_output" (or with the field omitted), the response is raw audio sent over HTTP chunked transfer encoding.
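A minimal sketch of consuming the streamed response with Python's standard library. The host, the POST path, and the "text" field name are assumptions made for illustration; only the output_type field and its values come from this page. Authentication is omitted because it is not described here.

```python
import io
import json
import urllib.request

# Hypothetical values: this page does not state the host or the synthesis path.
BASE_URL = "https://api.example.com"
SPEECH_PATH = "/v2/text-to-speech/speech"  # assumed; only the GET-by-id path is documented


def build_request(text: str, output_type: str = "binary_output") -> urllib.request.Request:
    """Build a JSON POST request. The "text" field name is an assumption."""
    body = json.dumps({"text": text, "output_type": output_type}).encode("utf-8")
    return urllib.request.Request(
        BASE_URL + SPEECH_PATH,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def save_stream(stream: io.BufferedIOBase, out: io.BufferedIOBase, chunk_size: int = 4096) -> int:
    """Copy audio from a readable stream to a writable one, chunk by chunk.

    Returns the number of bytes written.
    """
    total = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:  # an empty read means the server has finished the body
            break
        out.write(chunk)
        total += len(chunk)
    return total


def synthesize_to_file(text: str, path: str) -> int:
    """Send the request and write audio to disk as it arrives (performs network I/O)."""
    with urllib.request.urlopen(build_request(text)) as resp, open(path, "wb") as out:
        return save_stream(resp, out)
```

urllib decodes the chunked transfer encoding transparently, so save_stream only ever sees audio bytes. Writing each chunk as it arrives keeps memory use flat and lets a player start before synthesis finishes.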
Base64
With output_type: "base64_output", the full audio is returned as a JSON payload after synthesis completes.
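Decoding the JSON payload might look like the sketch below. The key that holds the encoded audio is not named on this page, so "audio_data" is a hypothetical placeholder; substitute the key the API actually returns.

```python
import base64
import json

# Hypothetical key name: this page does not say which JSON field carries the audio.
AUDIO_FIELD = "audio_data"


def decode_base64_audio(response_body: str, field: str = AUDIO_FIELD) -> bytes:
    """Parse the JSON response body and return the decoded audio bytes."""
    payload = json.loads(response_body)
    if field not in payload:
        raise KeyError(f"response has no {field!r} field")
    # validate=True rejects characters outside the base64 alphabet
    # instead of silently discarding them.
    return base64.b64decode(payload[field], validate=True)
```

Because the whole clip arrives in one payload, this mode suits short utterances or clients that cannot consume a chunked binary stream. Base64 encoding makes the body roughly a third larger than the raw audio.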
Async (audio_id)
With output_type: "audio_id", synthesis runs in the background and the response returns a URL immediately.
Retrieve the audio with GET /v2/text-to-speech/speech/:audio_id. If the audio is still synthesizing, the GET response itself streams chunks as they become available.
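As an illustration of the retrieval step, the sketch below builds the GET URL from an audio ID and copies the response to disk. The host is a placeholder and authentication is omitted, since neither is specified here; the shape of the initial response that carries the URL is also not shown on this page, so the example starts from an ID you already hold.

```python
import shutil
import urllib.parse
import urllib.request

# Hypothetical host: this page gives only the path.
BASE_URL = "https://api.example.com"


def audio_url(audio_id: str, base_url: str = BASE_URL) -> str:
    """Build the retrieval URL for a given audio ID."""
    # Percent-encode the ID so it cannot introduce extra path segments.
    quoted = urllib.parse.quote(audio_id, safe="")
    return f"{base_url.rstrip('/')}/v2/text-to-speech/speech/{quoted}"


def download_audio(audio_id: str, path: str) -> None:
    """Fetch the audio and write it to disk (performs network I/O).

    copyfileobj reads the response in blocks, so if the server is still
    synthesizing, data is written as each chunk arrives.
    """
    with urllib.request.urlopen(audio_url(audio_id)) as resp, open(path, "wb") as out:
        shutil.copyfileobj(resp, out)
```

Since the GET response streams while synthesis is in progress, the client does not need a separate polling loop: a single request can be issued as soon as the ID is known.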