# Stream text to speech over WebSocket

> AsyncAPI specification for the Telnyx Text-to-Speech WebSocket endpoint.
> The endpoint synthesizes speech in real time: the client streams text and receives audio chunks.

## Supported Providers
- `telnyx` - Telnyx native voices (Natural, NaturalHD, Qwen3TTS)
- `aws` - Amazon Polly
- `azure` - Microsoft Azure TTS
- `elevenlabs` - ElevenLabs voices
- `minimax` - MiniMax voices
- `rime` - Rime voices
- `resemble` - Resemble AI voices
- `xai` - xAI voices (Eve, Ara, Rex, Sal, Leo)
- `inworld` - Inworld AI voices

## Connection Flow
1. Open a WebSocket connection to `wss://api.telnyx.com/v2/text-to-speech/speech` with query parameters.
2. Send an initial handshake message `{"text": " "}` (single space) with optional `voice_settings`.
3. Send text messages as `{"text": "Hello world"}`.
4. Receive audio chunks as JSON frames with base64-encoded audio.
5. A final frame with `isFinal: true` indicates the end of audio for the current text.
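
Steps 1 to 3 above can be sketched as a few small helpers that build the connection URL and the JSON frames. This is a minimal sketch, not an official client: the helper names (`build_url`, `handshake_frame`, `text_frame`) are hypothetical, and only the endpoint, the query parameter names, and the frame shapes come from this page.

```python
import json
from urllib.parse import urlencode

# Endpoint taken from step 1 of the connection flow.
TTS_WS_URL = "wss://api.telnyx.com/v2/text-to-speech/speech"


def build_url(voice, sample_rate=None, audio_format=None):
    """Return the WebSocket URL with the query parameters that were given."""
    params = {"voice": voice}
    if sample_rate is not None:
        params["sample_rate"] = sample_rate
    if audio_format is not None:
        params["audio_format"] = audio_format
    return f"{TTS_WS_URL}?{urlencode(params)}"


def handshake_frame(voice_settings=None):
    """Step 2: a single-space text frame, optionally carrying voice_settings."""
    frame = {"text": " "}
    if voice_settings:
        frame["voice_settings"] = voice_settings
    return json.dumps(frame)


def text_frame(text):
    """Step 3: a frame containing the text to synthesize."""
    return json.dumps({"text": text})


if __name__ == "__main__":
    print(build_url("telnyx.NaturalHD.Telnyx_Alloy", sample_rate=16000))
    print(handshake_frame({"voice_speed": 1.2}))
    print(text_frame("Hello world"))
```

With a WebSocket client library of your choice, you would open the URL from `build_url`, send the handshake frame once, then send one text frame per piece of text.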

## Authentication
Every connection must authenticate with a Bearer token (a Telnyx API v2 key).
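
As an illustration, the token could be attached to the WebSocket upgrade request as shown below. This sketch assumes the key travels in the standard `Authorization: Bearer ...` HTTP header, which is the usual convention for bearer schemes but is not spelled out on this page; the `auth_headers` helper is hypothetical.

```python
def auth_headers(api_key):
    """Return handshake headers carrying the API key as a Bearer token.

    Assumption: the server reads the key from the Authorization header
    of the HTTP upgrade request.
    """
    if not api_key:
        raise ValueError("a Telnyx API v2 key is required")
    return {"Authorization": f"Bearer {api_key}"}


if __name__ == "__main__":
    # Placeholder key for illustration only.
    print(auth_headers("YOUR_API_KEY"))
```

Most WebSocket client libraries accept extra handshake headers through an option on their connect call; check your library's documentation for the exact parameter name.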


## AsyncAPI

````yaml https://telnyx-openapi-ng.s3.us-east-1.amazonaws.com/text-to-speech-websockets/text-to-speech-websockets.yml text-to-speech-websockets
id: text-to-speech-websockets
title: Stream text to speech over WebSocket
description: >
  AsyncAPI specification for the Telnyx Text-to-Speech WebSocket endpoint.

  Real-time speech synthesis by streaming text and receiving audio chunks.


  ## Supported Providers

  - `telnyx` - Telnyx native voices (Natural, NaturalHD, Qwen3TTS)

  - `aws` - Amazon Polly

  - `azure` - Microsoft Azure TTS

  - `elevenlabs` - ElevenLabs voices

  - `minimax` - MiniMax voices

  - `rime` - Rime voices

  - `resemble` - Resemble AI voices

  - `xai` - xAI voices (Eve, Ara, Rex, Sal, Leo)

  - `inworld` - Inworld AI voices


  ## Connection Flow

  1. Open a WebSocket connection to
  `wss://api.telnyx.com/v2/text-to-speech/speech` with query parameters.

  2. Send an initial handshake message `{"text": " "}` (single space) with
  optional `voice_settings`.

  3. Send text messages as `{"text": "Hello world"}`.

  4. Receive audio chunks as JSON frames with base64-encoded audio.

  5. A final frame with `isFinal: true` indicates the end of audio for the
  current text.


  ## Authentication

  Requires authentication via a Bearer token (Telnyx API v2 key).
servers:
  - id: text-to-speech-websockets
    protocol: wss
    host: api.telnyx.com
    bindings:
      - protocol: ws
        version: 0.1.0
        value: {}
        schemaProperties: []
    variables: []
address: /v2/text-to-speech/speech
parameters: []
bindings:
  - protocol: ws
    version: 0.1.0
    value:
      query:
        type: object
        description: Query parameters passed when opening the WebSocket connection.
        properties:
          voice:
            type: string
            description: |
              Voice identifier in the format `provider.model_id.voice_id` or
              `provider.voice_id` (e.g. `telnyx.NaturalHD.Telnyx_Alloy`,
              `azure.en-US-AvaMultilingualNeural`, or `xai.eve`).
              When provided, the provider, model_id, and voice_id are extracted
              automatically. Takes precedence over individual parameters.
          sample_rate:
            type: integer
            description: Sample rate in Hz for audio output.
            default: 16000
          disable_cache:
            type: boolean
            description: When true, bypass the audio cache and generate fresh audio.
            default: false
          audio_format:
            type: string
            description: >
              Audio output format. Available values depend on the provider.

              See docs/voice/tts/websocket-streaming/parameters/audio-formats
              for more details.
            enum:
              - mp3
              - wav
              - linear16
              - mulaw
              - alaw
              - ogg_vorbis
    schemaProperties:
      - name: query
        type: object
        description: Query parameters passed when opening the WebSocket connection.
        required: false
        properties:
          - name: voice
            type: string
            description: |
              Voice identifier in the format `provider.model_id.voice_id` or
              `provider.voice_id` (e.g. `telnyx.NaturalHD.Telnyx_Alloy`,
              `azure.en-US-AvaMultilingualNeural`, or `xai.eve`).
              When provided, the provider, model_id, and voice_id are extracted
              automatically. Takes precedence over individual parameters.
            required: false
          - name: sample_rate
            type: integer
            description: Sample rate in Hz for audio output.
            required: false
          - name: disable_cache
            type: boolean
            description: When true, bypass the audio cache and generate fresh audio.
            required: false
          - name: audio_format
            type: string
            description: >
              Audio output format. Available values depend on the provider.

              See docs/voice/tts/websocket-streaming/parameters/audio-formats
              for more details.
            enumValues:
              - mp3
              - wav
              - linear16
              - mulaw
              - alaw
              - ogg_vorbis
            required: false
operations:
  - &ref_3
    id: sendText
    title: Send Text to Synthesize
    description: Client sends text messages to be synthesized into speech.
    type: receive
    messages:
      - &ref_5
        id: clientText
        contentType: application/json
        payload:
          - name: Text Frame
            description: >
              Client-to-server frame containing text to synthesize.

              The initial handshake message should be {"text": " "} (single
              space)

              with optional voice_settings. Subsequent messages contain actual
              text.

              To interrupt synthesis mid-stream, send {"force": true}.
            type: object
            properties:
              - name: text
                type: string
                description: >
                  Text to convert to speech. Send a single space as an initial

                  handshake with optional voice_settings. Subsequent messages
                  contain

                  the actual text to synthesize.
                required: false
              - name: voice_settings
                type: object
                description: >
                  Provider-specific voice settings sent with the initial
                  handshake.

                  Contents vary by provider. For xAI, see `VoiceSettingsXAI`.
                required: false
              - name: force
                type: boolean
                description: >
                  When true, stops the current synthesis worker and starts a new
                  one.

                  Used to interrupt speech mid-stream and begin synthesizing new
                  text.

                  Can be sent alone or combined with text.
                required: false
        headers: []
        jsonPayloadSchema:
          type: object
          additionalProperties: false
          anyOf:
            - required:
                - text
              x-parser-schema-id: <anonymous-schema-5>
            - required:
                - force
              x-parser-schema-id: <anonymous-schema-6>
          properties:
            text:
              type: string
              description: >
                Text to convert to speech. Send a single space as an initial

                handshake with optional voice_settings. Subsequent messages
                contain

                the actual text to synthesize.
              x-parser-schema-id: <anonymous-schema-2>
            voice_settings:
              type: object
              description: >
                Provider-specific voice settings sent with the initial
                handshake.

                Contents vary by provider. For xAI, see `VoiceSettingsXAI`.
              additionalProperties: true
              x-parser-schema-id: <anonymous-schema-3>
            force:
              type: boolean
              description: >
                When true, stops the current synthesis worker and starts a new
                one.

                Used to interrupt speech mid-stream and begin synthesizing new
                text.

                Can be sent alone or combined with text.
              x-parser-schema-id: <anonymous-schema-4>
          x-parser-schema-id: <anonymous-schema-1>
        title: Text Frame
        description: |
          Client-to-server frame containing text to synthesize.
          The initial handshake message should be {"text": " "} (single space)
          with optional voice_settings. Subsequent messages contain actual text.
          To interrupt synthesis mid-stream, send {"force": true}.
        example: |-
          {
            "text": " ",
            "voice_settings": {
              "voice_speed": 1.2
            }
          }
        bindings: []
        extensions:
          - id: x-parser-unique-object-id
            value: clientText
    bindings: []
    extensions: &ref_2
      - id: x-parser-unique-object-id
        value: text-to-speech-websockets
  - &ref_4
    id: receiveAudio
    title: Receive Audio Chunks
    description: Server sends audio chunks and status frames to the client.
    type: send
    messages:
      - &ref_6
        id: serverAudioChunk
        contentType: application/json
        payload:
          - name: Audio Chunk Frame
            description: >
              Server-to-client frame containing a base64-encoded audio chunk.

              For providers that stream audio in real time (Telnyx
              Natural/NaturalHD,

              Rime, MiniMax, Resemble, Inworld), text will be null because audio
              is

              streamed before full text alignment is available, and cached will
              be false.

              For other providers, text contains the corresponding text segment.
            type: object
            properties:
              - name: audio
                type: &ref_0
                  - string
                  - 'null'
                description: >-
                  Base64-encoded audio data. May be null for concatenated-audio
                  frames from streaming providers.
                required: false
              - name: text
                type: &ref_1
                  - string
                  - 'null'
                description: >-
                  The text segment this audio chunk corresponds to. Null for
                  real-time streamed chunks.
                required: false
              - name: isFinal
                type: boolean
                description: Always false for audio chunk frames.
                required: true
              - name: cached
                type: boolean
                description: Whether this audio was served from cache.
                required: false
              - name: timeToFirstAudioFrameMs
                type: integer
                description: >-
                  Milliseconds from the start-of-speech request to the first
                  audio frame. Only present on the first chunk.
                required: false
        headers: []
        jsonPayloadSchema:
          type: object
          required:
            - isFinal
          properties:
            audio:
              type: *ref_0
              description: >-
                Base64-encoded audio data. May be null for concatenated-audio
                frames from streaming providers.
              x-parser-schema-id: <anonymous-schema-8>
            text:
              type: *ref_1
              description: >-
                The text segment this audio chunk corresponds to. Null for
                real-time streamed chunks.
              x-parser-schema-id: <anonymous-schema-9>
            isFinal:
              type: boolean
              description: Always false for audio chunk frames.
              const: false
              x-parser-schema-id: <anonymous-schema-10>
            cached:
              type: boolean
              description: Whether this audio was served from cache.
              x-parser-schema-id: <anonymous-schema-11>
            timeToFirstAudioFrameMs:
              type: integer
              description: >-
                Milliseconds from the start-of-speech request to the first audio
                frame. Only present on the first chunk.
              x-parser-schema-id: <anonymous-schema-12>
          x-parser-schema-id: <anonymous-schema-7>
        title: Audio Chunk Frame
        description: >
          Server-to-client frame containing a base64-encoded audio chunk.

          For providers that stream audio in real time (Telnyx
          Natural/NaturalHD,

          Rime, MiniMax, Resemble, Inworld), text will be null because audio is

          streamed before full text alignment is available, and cached will be
          false.

          For other providers, text contains the corresponding text segment.
        example: |-
          {
            "audio": "QmFzZTY0RW5jb2RlZEF1ZGlv",
            "text": "Hello world",
            "isFinal": false,
            "cached": false,
            "timeToFirstAudioFrameMs": 245
          }
        bindings: []
        extensions:
          - id: x-parser-unique-object-id
            value: serverAudioChunk
      - &ref_7
        id: serverFinal
        contentType: application/json
        payload:
          - name: Final Frame
            description: >
              Server-to-client frame indicating synthesis is complete for the
              current text.

              The connection remains open for additional text messages.
            type: object
            properties:
              - name: audio
                type: 'null'
                description: Always null for the final frame.
                required: true
              - name: text
                type: string
                description: Always an empty string for the final frame.
                required: true
              - name: isFinal
                type: boolean
                description: Always true.
                required: true
              - name: timeToFirstAudioFrameMs
                type: integer
                description: Present if this was the first response frame.
                required: false
        headers: []
        jsonPayloadSchema:
          type: object
          additionalProperties: false
          required:
            - isFinal
            - audio
            - text
          properties:
            audio:
              type: 'null'
              description: Always null for the final frame.
              x-parser-schema-id: <anonymous-schema-14>
            text:
              type: string
              const: ''
              description: Always an empty string for the final frame.
              x-parser-schema-id: <anonymous-schema-15>
            isFinal:
              type: boolean
              description: Always true.
              const: true
              x-parser-schema-id: <anonymous-schema-16>
            timeToFirstAudioFrameMs:
              type: integer
              description: Present if this was the first response frame.
              x-parser-schema-id: <anonymous-schema-17>
          x-parser-schema-id: <anonymous-schema-13>
        title: Final Frame
        description: >
          Server-to-client frame indicating synthesis is complete for the
          current text.

          The connection remains open for additional text messages.
        example: |-
          {
            "audio": null,
            "text": "",
            "isFinal": true
          }
        bindings: []
        extensions:
          - id: x-parser-unique-object-id
            value: serverFinal
      - &ref_8
        id: serverError
        contentType: application/json
        payload:
          - name: Error Frame
            description: |
              Server-to-client frame indicating an error during synthesis.
              The connection will be closed shortly after sending this frame.
            type: object
            properties:
              - name: error
                type: string
                description: Error message describing what went wrong.
                required: true
        headers: []
        jsonPayloadSchema:
          type: object
          required:
            - error
          properties:
            error:
              type: string
              description: Error message describing what went wrong.
              x-parser-schema-id: <anonymous-schema-19>
          x-parser-schema-id: <anonymous-schema-18>
        title: Error Frame
        description: |
          Server-to-client frame indicating an error during synthesis.
          The connection will be closed shortly after sending this frame.
        example: |-
          {
            "error": "Invalid voice_id specified"
          }
        bindings: []
        extensions:
          - id: x-parser-unique-object-id
            value: serverError
    bindings: []
    extensions: *ref_2
sendOperations:
  - *ref_3
receiveOperations:
  - *ref_4
sendMessages:
  - *ref_5
receiveMessages:
  - *ref_6
  - *ref_7
  - *ref_8
extensions:
  - id: x-parser-unique-object-id
    value: text-to-speech-websockets
securitySchemes:
  - id: bearerAuth
    name: bearerAuth
    type: http
    description: Telnyx API v2 Bearer token authentication.
    scheme: bearer
    extensions: []

````
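
The three server frames in the specification above (audio chunk, final, error) can be told apart by their fields. The sketch below shows one way a client might dispatch on them and build an interrupt frame. The function names and the choice to raise on error frames are assumptions made for illustration; only the field names (`audio`, `isFinal`, `error`, `force`, `text`) come from the specification.

```python
import base64
import json


def handle_frame(raw, audio_buffer):
    """Process one server frame and return "chunk" or "final".

    Decoded audio bytes are appended to audio_buffer (a bytearray).
    Error frames raise RuntimeError, since the specification says the
    connection is closed shortly after one is sent.
    """
    frame = json.loads(raw)
    if "error" in frame:
        raise RuntimeError(frame["error"])
    if frame.get("isFinal"):
        # Final frame: audio is null and text is an empty string.
        return "final"
    audio = frame.get("audio")
    if audio is not None:  # audio may be null on some chunk frames
        audio_buffer.extend(base64.b64decode(audio))
    return "chunk"


def interrupt_frame(text=None):
    """Build a frame with force=true, optionally combined with new text."""
    frame = {"force": True}
    if text is not None:
        frame["text"] = text
    return json.dumps(frame)


if __name__ == "__main__":
    buffer = bytearray()
    chunk = json.dumps({"audio": base64.b64encode(b"pcm").decode(), "isFinal": False})
    print(handle_frame(chunk, buffer), len(buffer))
    print(handle_frame('{"audio": null, "text": "", "isFinal": true}', buffer))
    print(interrupt_frame("New sentence"))
```

Because the connection stays open after a final frame, a client would typically keep calling `handle_frame` in its receive loop and treat `"final"` as the end of audio for the current text rather than as a reason to disconnect.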