Overview
The Telnyx Speech-to-Text (STT) WebSocket API provides real-time audio transcription. This streaming endpoint lets you send audio and receive transcription results incrementally, enabling low-latency voice transcription for real-time applications. This guide shows how to build a Python client that streams audio to the Telnyx STT engine using WebSocket and asyncio.
Prerequisites
- Python 3.8+.
- A Telnyx API key.
- The `websockets` library: `pip install websockets`.
Connection flow
The Speech-to-Text streaming process involves opening a secure WebSocket connection, streaming audio chunks, and receiving transcription events in real time.

WebSocket endpoint
The Telnyx STT service uses a WebSocket endpoint. You authenticate by passing your API key in the `Authorization` header.
The connection URL follows this format:
wss://api.telnyx.com/v2/speech-to-text/transcription
Query parameters
| Parameter | Type | Required | Description |
|---|---|---|---|
| transcription_engine | string | No | STT engine to use (e.g., Deepgram, Azure). |
| input_format | string | No | Audio input format (e.g., mp3, wav). |
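The query parameters are appended to the endpoint URL. A minimal sketch of building the full connection URL (the Deepgram/wav values here are illustrative, not defaults):

```python
from urllib.parse import urlencode

# Base endpoint from the section above.
BASE_URL = "wss://api.telnyx.com/v2/speech-to-text/transcription"

# Example parameter values; pick the engine and format you need.
params = {"transcription_engine": "Deepgram", "input_format": "wav"}
url = f"{BASE_URL}?{urlencode(params)}"
print(url)
# wss://api.telnyx.com/v2/speech-to-text/transcription?transcription_engine=Deepgram&input_format=wav
```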
Supported engines
Telnyx offers several speech-to-text engines for transcribing audio:

| Engine | Description |
|---|---|
| Telnyx | In-house Telnyx speech-to-text engine with significantly better transcription accuracy and lower latency. |
| Google | Google speech-to-text engine that offers additional features like interim results. |
| Deepgram | Deepgram speech-to-text engine with 3 models (nova-2, nova-3, and flux) that can be set using the transcription_model setting. |
| Azure | Azure speech-to-text engine with strong support for multiple languages and accents. |
Authenticating and connecting
Create a class to handle the connection using the `websockets` library.
Streaming audio
To transcribe audio, send binary frames to the WebSocket. The server processes these chunks in real time.

Receiving transcripts
The server sends JSON messages back with transcription results. Add a method to listen for these messages concurrently while sending audio. Key fields to look for in the response:
- `transcript`: The text transcription.
- `is_final`: Boolean indicating if the sentence is complete.
- `confidence`: The confidence score of the transcription.
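A sketch of a listener that parses these messages, assuming a flat JSON envelope containing the fields above (the exact schema may differ; check the API reference):

```python
import json

def parse_transcript(raw):
    """Pull the key fields out of one server message."""
    msg = json.loads(raw)
    return (
        msg.get("transcript", ""),
        bool(msg.get("is_final")),
        msg.get("confidence"),
    )

async def receive_transcripts(ws):
    # Iterate over incoming messages until the connection closes.
    async for raw in ws:
        text, is_final, confidence = parse_transcript(raw)
        marker = "FINAL" if is_final else "interim"
        print(f"[{marker}] {text} (confidence={confidence})")
```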
Complete example
Here’s how to orchestrate the bi-directional stream using `asyncio`. Use `asyncio.sleep` to simulate real-time streaming when reading from a file.
Crucial step: after sending all your audio, wait a few seconds before closing the connection. This gives the server time to send the final transcription results.