Overview
The Telnyx Speech-to-Text (STT) WebSocket API provides real-time audio transcription. This streaming endpoint lets you send audio and receive transcription results incrementally, enabling low-latency voice transcription for real-time applications. This guide shows how to build a Python client that streams audio to the Telnyx STT engine using WebSockets and `asyncio`.
Prerequisites
- Python 3.8+.
- A Telnyx API Key.
- The `websockets` library: `pip install websockets`.
Connection flow
The Speech-to-Text streaming process involves opening a secure WebSocket connection, streaming audio chunks, and receiving transcription events in real time.
WebSocket endpoint
The Telnyx STT service uses a WebSocket endpoint. You authenticate by passing your API key in the `Authorization` header.
The connection URL follows this format:
wss://api.telnyx.com/v2/speech-to-text/transcription
Query parameters
| Parameter | Type | Required | Description |
|---|---|---|---|
| transcription_engine | string | No | STT engine to use (e.g., Deepgram, Azure). |
| input_format | string | No | Audio input format (e.g., mp3, wav). |
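For example, to request the Deepgram engine with WAV input (values taken from the examples above; adjust them to the engine and format you actually use), the connection URL could look like this:
wss://api.telnyx.com/v2/speech-to-text/transcription?transcription_engine=Deepgram&input_format=wav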
Supported engines
Telnyx offers several speech-to-text engines to process audio into transcription:
| Engine | Description |
|---|---|
| Google | Google speech-to-text engine that offers additional features like interim results. |
| Telnyx | In-house Telnyx speech-to-text engine with significantly better transcription accuracy and lower latency. |
| Deepgram | Deepgram speech-to-text engine with 3 models (nova-2, nova-3, and flux) that can be set using the transcription_model setting. |
| Azure | Azure speech-to-text engine with strong support for multiple languages and accents. |
Authenticating and connecting
Create a class to handle the connection using the `websockets` library.
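A minimal sketch of such a class is shown below. It assumes the API key is passed in as a Bearer token and that the query-parameter values are illustrative defaults; note that the header keyword argument is named `additional_headers` in websockets 14+ and `extra_headers` in older releases.

```python
import websockets


class TelnyxSTTClient:
    """Opens an authenticated WebSocket connection to the Telnyx STT endpoint."""

    BASE_URL = "wss://api.telnyx.com/v2/speech-to-text/transcription"

    def __init__(self, api_key: str, engine: str = "Deepgram", input_format: str = "wav"):
        self.url = f"{self.BASE_URL}?transcription_engine={engine}&input_format={input_format}"
        self.headers = {"Authorization": f"Bearer {api_key}"}
        self.ws = None

    async def connect(self):
        # Pass the API key as a Bearer token in the Authorization header.
        # Use extra_headers= instead on websockets versions older than 14.
        self.ws = await websockets.connect(self.url, additional_headers=self.headers)
        return self.ws

    async def close(self):
        if self.ws is not None:
            await self.ws.close()
```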
Streaming audio
To transcribe audio, send binary frames to the WebSocket; the server processes these chunks in real time.
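A sending sketch, written as a standalone coroutine; the chunk size and pacing delay are illustrative, and `path` is assumed to point to a local audio file in the format you selected at connection time:

```python
import asyncio


async def send_audio(ws, path: str, chunk_size: int = 3200):
    """Read a local audio file and stream it as binary WebSocket frames."""
    with open(path, "rb") as audio:
        while chunk := audio.read(chunk_size):
            await ws.send(chunk)        # binary frame: raw audio bytes
            await asyncio.sleep(0.1)    # pace the stream to approximate real-time capture
```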
Receiving transcripts
The server sends JSON messages back with transcription results. Add a method to listen for these messages concurrently while sending audio. Key fields to look for in the response:
- `transcript`: The text transcription.
- `is_final`: Boolean indicating whether the sentence is complete.
- `confidence`: The confidence score of the transcription.
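A matching receive loop might look like the sketch below. It assumes each message is a flat JSON object carrying the fields listed above; adjust the parsing if your payloads are structured differently:

```python
import json


async def receive_transcripts(ws):
    """Print transcription results as JSON messages arrive from the server."""
    async for message in ws:
        data = json.loads(message)
        transcript = data.get("transcript", "")
        if data.get("is_final"):
            print(f"Final (confidence {data.get('confidence')}): {transcript}")
        else:
            print(f"Partial: {transcript}")
```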
Complete example
Here’s how to orchestrate the bi-directional stream using `asyncio`. Use `asyncio.sleep` to simulate real-time streaming when reading from a file.
Crucial Step: After sending all your audio, wait for a few seconds before closing the connection. This gives the server time to send the final transcription results.
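A sketch of the complete flow under the same assumptions (a TELNYX_API_KEY environment variable, an illustrative file name and chunk size, and a five-second closing delay):

```python
import asyncio
import json
import os

import websockets

URL = (
    "wss://api.telnyx.com/v2/speech-to-text/transcription"
    "?transcription_engine=Deepgram&input_format=wav"
)


async def send_audio(ws, path, chunk_size=3200):
    with open(path, "rb") as audio:
        while chunk := audio.read(chunk_size):
            await ws.send(chunk)
            await asyncio.sleep(0.1)  # simulate real-time streaming from a file


async def receive_transcripts(ws):
    async for message in ws:
        data = json.loads(message)
        label = "Final" if data.get("is_final") else "Partial"
        print(f"{label}: {data.get('transcript')} (confidence: {data.get('confidence')})")


async def main():
    headers = {"Authorization": f"Bearer {os.environ['TELNYX_API_KEY']}"}
    # Use extra_headers= instead on websockets versions older than 14.
    async with websockets.connect(URL, additional_headers=headers) as ws:
        receiver = asyncio.create_task(receive_transcripts(ws))
        await send_audio(ws, "sample.wav")
        await asyncio.sleep(5)   # crucial step: let the server return the final results
        await ws.close()         # a clean close also ends the receiver's loop
        await receiver


if __name__ == "__main__":
    asyncio.run(main())
```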