
# Stream speech to text over WebSocket

> AsyncAPI specification for the Telnyx Speech-to-Text WebSocket endpoint.
> The endpoint transcribes speech in real time: the client streams audio and receives transcript frames.

## Supported Engines
- `Azure` - Microsoft Azure Speech Services
- `Deepgram` - Deepgram Nova models
- `Google` - Google Cloud Speech-to-Text
- `Telnyx` - Telnyx native transcription (OpenAI Whisper models)
- `xAI` - xAI Grok STT
- `AssemblyAI` - AssemblyAI Universal-Streaming
- `Speechmatics` - Speechmatics real-time transcription
- `Soniox` - Soniox real-time transcription
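
The engine names above are the exact enum values accepted by the `transcription_engine` query parameter, and each engine accepts its own set of `model` values (listed under the `model` parameter in the specification). The following is a minimal client-side sketch that checks an engine and model pair before connecting. `MODELS_BY_ENGINE` and `check_engine` are hypothetical names introduced for illustration; the table is copied from this page and may lag behind what the server accepts, so treat the server's response as authoritative.

```python
from typing import Optional

# Engine -> models, copied from the `model` parameter description on this page.
MODELS_BY_ENGINE = {
    "Azure": {"fast"},
    "Deepgram": {"deepgram/nova-2", "deepgram/nova-3"},
    "Google": {
        "latest_long", "latest_short", "command_and_search", "phone_call",
        "video", "default", "medical_conversation", "medical_dictation",
    },
    "Telnyx": {"openai/whisper-tiny", "openai/whisper-large-v3-turbo"},
    "xAI": {"xai/grok-stt"},
    "AssemblyAI": {"assemblyai/universal-streaming"},
    "Speechmatics": {"speechmatics/standard"},
    "Soniox": {"soniox/stt-rt-v4"},
}


def check_engine(engine: str, model: Optional[str] = None) -> list:
    """Return a list of problems; an empty list means the pair looks valid."""
    problems = []
    if engine not in MODELS_BY_ENGINE:
        # Enum values are matched exactly, so "deepgram" is not "Deepgram".
        problems.append(f"unknown transcription_engine: {engine!r}")
    elif model is not None and model not in MODELS_BY_ENGINE[engine]:
        problems.append(f"model {model!r} is not listed for engine {engine!r}")
    return problems


print(check_engine("Deepgram", "deepgram/nova-3"))
print(check_engine("deepgram"))
```

Returning a list of problems rather than raising lets a caller report every issue at once before opening the connection.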

## Connection Flow
1. Open a WebSocket connection to `wss://api.telnyx.com/v2/speech-to-text/transcription` with query parameters.
2. Send binary audio frames in the format declared by `input_format` (mp3 or wav).
3. Receive JSON transcript frames with a `transcript` field and optional `is_final` and `confidence` fields.
4. Close the connection when done.
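
Step 1 of the flow above amounts to appending a query string to the endpoint URL. The sketch below builds that URL with the Python standard library only. `build_stt_url` is a hypothetical helper, and serializing booleans as lowercase `true`/`false` is an assumption, since the specification only gives the parameter type.

```python
from urllib.parse import urlencode

STT_ENDPOINT = "wss://api.telnyx.com/v2/speech-to-text/transcription"


def build_stt_url(transcription_engine: str, input_format: str, **optional) -> str:
    """Return the WebSocket URL with the query string appended."""
    params = {
        "transcription_engine": transcription_engine,  # required
        "input_format": input_format,                  # required: "mp3" or "wav"
    }
    for name, value in optional.items():
        if value is None:
            continue  # leave unset parameters out of the query string
        if isinstance(value, bool):
            # Assumption: booleans are sent as lowercase "true"/"false".
            value = "true" if value else "false"
        params[name] = value
    # doseq=True repeats the key for list values, e.g. keyterm=a&keyterm=b.
    return f"{STT_ENDPOINT}?{urlencode(params, doseq=True)}"


url = build_stt_url(
    "Deepgram",
    "wav",
    language="en-US",
    interim_results=True,
    keyterm=["Telnyx", "WebSocket"],
)
print(url)
```

Passing `keyterm` as a list produces one `keyterm=` pair per term, which matches the note in the specification that the parameter can be given multiple times.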

## Authentication
Connections must authenticate with a Bearer token (a Telnyx API v2 key).
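
The specification declares an HTTP `bearer` security scheme, which conventionally means sending an `Authorization: Bearer <key>` header with the WebSocket handshake request. The sketch below only builds that header; how it is attached depends on the WebSocket client in use. `auth_headers` and the `TELNYX_API_KEY` environment variable name are assumptions made for illustration.

```python
import os


def auth_headers(api_key: str) -> dict:
    """Build the handshake header for HTTP bearer authentication."""
    if not api_key:
        raise ValueError("a Telnyx API v2 key is required")
    return {"Authorization": f"Bearer {api_key}"}


# Assumption: the key is kept in an environment variable, not in source code.
headers = auth_headers(os.environ.get("TELNYX_API_KEY", "YOUR_API_KEY"))
print(sorted(headers))
```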


## AsyncAPI

````yaml https://telnyx-openapi-ng.s3.us-east-1.amazonaws.com/speech-to-text-websockets/speech-to-text-websockets.yml speech-to-text-websockets
id: speech-to-text-websockets
title: Stream speech to text over WebSocket
description: >
  AsyncAPI specification for the Telnyx Speech-to-Text WebSocket endpoint.

  The endpoint transcribes speech in real time: the client streams audio and
  receives transcript frames.


  ## Supported Engines

  - `Azure` - Microsoft Azure Speech Services

  - `Deepgram` - Deepgram Nova models

  - `Google` - Google Cloud Speech-to-Text

  - `Telnyx` - Telnyx native transcription (OpenAI Whisper models)

  - `xAI` - xAI Grok STT

  - `AssemblyAI` - AssemblyAI Universal-Streaming

  - `Speechmatics` - Speechmatics real-time transcription

  - `Soniox` - Soniox real-time transcription


  ## Connection Flow

  1. Open a WebSocket connection to
  `wss://api.telnyx.com/v2/speech-to-text/transcription` with query parameters.

  2. Send binary audio frames in the format declared by `input_format` (mp3 or
  wav).

  3. Receive JSON transcript frames with a `transcript` field and optional
  `is_final` and `confidence` fields.

  4. Close the connection when done.


  ## Authentication

  Connections must authenticate with a Bearer token (a Telnyx API v2 key).
servers:
  - id: speech-to-text-websockets
    protocol: wss
    host: api.telnyx.com
    bindings:
      - protocol: ws
        version: 0.1.0
        value: {}
        schemaProperties: []
    variables: []
address: /v2/speech-to-text/transcription
parameters: []
bindings:
  - protocol: ws
    version: 0.1.0
    value:
      query:
        type: object
        description: Query parameters passed when opening the WebSocket connection.
        required:
          - transcription_engine
          - input_format
        properties:
          transcription_engine:
            type: string
            description: The transcription engine to use for processing the audio stream.
            enum:
              - Azure
              - Deepgram
              - Google
              - Telnyx
              - xAI
              - AssemblyAI
              - Speechmatics
              - Soniox
          input_format:
            type: string
            description: The format of the input audio stream.
            enum:
              - mp3
              - wav
          language:
            type: string
            description: |
              The language spoken in the audio stream (e.g., en-US, es-ES).
          interim_results:
            type: boolean
            description: >-
              Whether to receive interim transcription results before final
              results.
            default: false
          model:
            type: string
            description: >
              The specific model to use within the selected transcription
              engine.

              Available models vary by engine:

              - Azure: `fast`

              - Deepgram: `deepgram/nova-2`, `deepgram/nova-3`

              - Google: `latest_long`, `latest_short`, `command_and_search`,
              `phone_call`, `video`, `default`, `medical_conversation`,
              `medical_dictation`

              - Telnyx: `openai/whisper-tiny`, `openai/whisper-large-v3-turbo`

              - xAI: `xai/grok-stt`

              - AssemblyAI: `assemblyai/universal-streaming`

              - Speechmatics: `speechmatics/standard`

              - Soniox: `soniox/stt-rt-v4`
          endpointing:
            type: integer
            description: >
              Silence duration (in milliseconds) that triggers end-of-speech
              detection.

              When set, the engine uses this value to determine when a speaker
              has stopped talking.

              Supported by Deepgram, Google, xAI, Speechmatics, and Soniox.
              Other engines ignore this parameter.

              For Soniox, the value is interpreted as `max_endpoint_delay_ms`
              (range: 500–3000 ms).
          redact:
            type: string
            description: >
              Enable redaction of sensitive information (e.g., PCI data, SSN)
              from transcription results.

              Supported values depend on the transcription engine.
          keyterm:
            type: string
            description: |
              A key term to boost in the transcription.
              The engine will be more likely to recognize this term.
              Can be specified multiple times for multiple terms.
          keywords:
            type: string
            description: |
              Comma-separated list of keywords to boost in the transcription.
              The engine will prioritize recognition of these words.
    schemaProperties:
      - name: query
        type: object
        description: Query parameters passed when opening the WebSocket connection.
        required: false
        properties:
          - name: transcription_engine
            type: string
            description: The transcription engine to use for processing the audio stream.
            enumValues:
              - Azure
              - Deepgram
              - Google
              - Telnyx
              - xAI
              - AssemblyAI
              - Speechmatics
              - Soniox
            required: true
          - name: input_format
            type: string
            description: The format of the input audio stream.
            enumValues:
              - mp3
              - wav
            required: true
          - name: language
            type: string
            description: |
              The language spoken in the audio stream (e.g., en-US, es-ES).
            required: false
          - name: interim_results
            type: boolean
            description: >-
              Whether to receive interim transcription results before final
              results.
            required: false
          - name: model
            type: string
            description: >
              The specific model to use within the selected transcription
              engine.

              Available models vary by engine:

              - Azure: `fast`

              - Deepgram: `deepgram/nova-2`, `deepgram/nova-3`

              - Google: `latest_long`, `latest_short`, `command_and_search`,
              `phone_call`, `video`, `default`, `medical_conversation`,
              `medical_dictation`

              - Telnyx: `openai/whisper-tiny`, `openai/whisper-large-v3-turbo`

              - xAI: `xai/grok-stt`

              - AssemblyAI: `assemblyai/universal-streaming`

              - Speechmatics: `speechmatics/standard`

              - Soniox: `soniox/stt-rt-v4`
            required: false
          - name: endpointing
            type: integer
            description: >
              Silence duration (in milliseconds) that triggers end-of-speech
              detection.

              When set, the engine uses this value to determine when a speaker
              has stopped talking.

              Supported by Deepgram, Google, xAI, Speechmatics, and Soniox.
              Other engines ignore this parameter.

              For Soniox, the value is interpreted as `max_endpoint_delay_ms`
              (range: 500–3000 ms).
            required: false
          - name: redact
            type: string
            description: >
              Enable redaction of sensitive information (e.g., PCI data, SSN)
              from transcription results.

              Supported values depend on the transcription engine.
            required: false
          - name: keyterm
            type: string
            description: |
              A key term to boost in the transcription.
              The engine will be more likely to recognize this term.
              Can be specified multiple times for multiple terms.
            required: false
          - name: keywords
            type: string
            description: |
              Comma-separated list of keywords to boost in the transcription.
              The engine will prioritize recognition of these words.
            required: false
operations:
  - &ref_1
    id: sendAudio
    title: Send Audio to Transcribe
    description: Client sends binary audio frames to be transcribed into text.
    type: receive
    messages:
      - &ref_3
        id: clientAudio
        contentType: application/octet-stream
        payload:
          - type: string
            format: binary
            description: >
              Client-to-server binary frame containing audio data to transcribe.

              Audio should be in mp3 or wav format as specified in the
              input_format query parameter.
            x-parser-schema-id: <anonymous-schema-1>
            name: Audio Frame
        headers: []
        jsonPayloadSchema:
          type: string
          format: binary
          description: Binary audio data in mp3 or wav format.
          x-parser-schema-id: <anonymous-schema-1>
        title: Audio Frame
        description: >
          Client-to-server binary frame containing audio data to transcribe.

          Audio should be in mp3 or wav format as specified in the input_format
          query parameter.
        example: '{}'
        bindings: []
        extensions:
          - id: x-parser-unique-object-id
            value: clientAudio
    bindings: []
    extensions: &ref_0
      - id: x-parser-unique-object-id
        value: speech-to-text-websockets
  - &ref_2
    id: receiveTranscript
    title: Receive Transcript Frames
    description: Server sends transcript frames and error frames to the client.
    type: send
    messages:
      - &ref_4
        id: serverTranscript
        contentType: application/json
        payload:
          - name: Transcript Frame
            description: >
              Server-to-client frame containing a transcription result.

              When interim_results is enabled, you may receive multiple interim
              results (is_final=false) before the final result (is_final=true)
              for each utterance.
            type: object
            properties:
              - name: type
                type: string
                description: Frame type identifier.
                required: true
              - name: transcript
                type: string
                description: The transcribed text from the audio.
                required: true
              - name: is_final
                type: boolean
                description: |
                  Whether this is a final transcription result.
                  When false, this is an interim result that may be refined.
                required: false
              - name: confidence
                type: number
                description: Confidence score of the transcription, ranging from 0 to 1.
                required: false
        headers: []
        jsonPayloadSchema:
          type: object
          required:
            - type
            - transcript
          properties:
            type:
              type: string
              const: transcript
              description: Frame type identifier.
              x-parser-schema-id: <anonymous-schema-3>
            transcript:
              type: string
              description: The transcribed text from the audio.
              x-parser-schema-id: <anonymous-schema-4>
            is_final:
              type: boolean
              description: |
                Whether this is a final transcription result.
                When false, this is an interim result that may be refined.
              x-parser-schema-id: <anonymous-schema-5>
            confidence:
              type: number
              description: Confidence score of the transcription, ranging from 0 to 1.
              minimum: 0
              maximum: 1
              x-parser-schema-id: <anonymous-schema-6>
          x-parser-schema-id: <anonymous-schema-2>
        title: Transcript Frame
        description: >
          Server-to-client frame containing a transcription result.

          When interim_results is enabled, you may receive multiple interim
          results (is_final=false) before the final result (is_final=true) for
          each utterance.
        example: |-
          {
            "type": "transcript",
            "transcript": "Hello, this is",
            "is_final": false,
            "confidence": 0.85
          }
        bindings: []
        extensions:
          - id: x-parser-unique-object-id
            value: serverTranscript
      - &ref_5
        id: serverError
        contentType: application/json
        payload:
          - name: Error Frame
            description: |
              Server-to-client frame indicating an error during transcription.
              The connection may be closed shortly after sending this frame.
            type: object
            properties:
              - name: type
                type: string
                description: Frame type identifier.
                required: true
              - name: error
                type: string
                description: Error message describing what went wrong.
                required: true
        headers: []
        jsonPayloadSchema:
          type: object
          required:
            - type
            - error
          properties:
            type:
              type: string
              const: error
              description: Frame type identifier.
              x-parser-schema-id: <anonymous-schema-8>
            error:
              type: string
              description: Error message describing what went wrong.
              x-parser-schema-id: <anonymous-schema-9>
          x-parser-schema-id: <anonymous-schema-7>
        title: Error Frame
        description: |
          Server-to-client frame indicating an error during transcription.
          The connection may be closed shortly after sending this frame.
        example: |-
          {
            "type": "error",
            "error": "Invalid transcription_engine specified"
          }
        bindings: []
        extensions:
          - id: x-parser-unique-object-id
            value: serverError
    bindings: []
    extensions: *ref_0
sendOperations:
  - *ref_1
receiveOperations:
  - *ref_2
sendMessages:
  - *ref_3
receiveMessages:
  - *ref_4
  - *ref_5
extensions:
  - id: x-parser-unique-object-id
    value: speech-to-text-websockets
securitySchemes:
  - id: bearerAuth
    name: bearerAuth
    type: http
    description: Telnyx API v2 Bearer token authentication.
    scheme: bearer
    extensions: []

````
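
The specification defines two server-to-client JSON frames, distinguished by their `type` field: `transcript` (with required `transcript` and optional `is_final` and `confidence`) and `error` (with required `error`). A minimal sketch of decoding them on the client side follows. `TranscriptFrame`, `ErrorFrame`, and `parse_frame` are hypothetical names, and raising on an unrecognized `type` is a design choice of this sketch rather than behavior the specification requires.

```python
import json
from dataclasses import dataclass
from typing import Optional, Union


@dataclass
class TranscriptFrame:
    transcript: str
    is_final: Optional[bool] = None      # optional in the schema
    confidence: Optional[float] = None   # optional, between 0 and 1


@dataclass
class ErrorFrame:
    error: str


def parse_frame(raw: Union[str, bytes]) -> Union[TranscriptFrame, ErrorFrame]:
    """Decode one JSON text frame received from the server."""
    data = json.loads(raw)
    kind = data.get("type")
    if kind == "transcript":
        return TranscriptFrame(
            transcript=data["transcript"],
            is_final=data.get("is_final"),
            confidence=data.get("confidence"),
        )
    if kind == "error":
        return ErrorFrame(error=data["error"])
    raise ValueError(f"unexpected frame type: {kind!r}")


# The two example payloads from the specification.
interim = parse_frame(
    '{"type": "transcript", "transcript": "Hello, this is",'
    ' "is_final": false, "confidence": 0.85}'
)
failure = parse_frame(
    '{"type": "error", "error": "Invalid transcription_engine specified"}'
)
print(interim)
print(failure)
```

Because `is_final` is optional, it stays `None` when the server omits it, so a caller can tell "not reported" apart from an interim result.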