Overview

The Telnyx Speech-to-Text (STT) WebSocket API provides real-time audio transcription. The streaming endpoint lets you send audio and receive transcription results incrementally, enabling low-latency transcription for real-time applications. This guide shows how to build a Python client that streams audio to the Telnyx STT engine using WebSockets and asyncio.

Prerequisites

  • Python 3.8+.
  • A Telnyx API Key.
  • websockets library: pip install websockets.

Connection flow

The Speech-to-Text streaming process involves opening a secure WebSocket connection, streaming audio chunks, and receiving transcription events in real-time.
Client                                   Server
   |                                        |
   |------- Connect ----------------------->|
   |<------ Connected ----------------------|
   |                                        |
   |------- Audio Frame (Binary) ---------->|
   |                                        |
   |------- Audio Frame (Binary) ---------->|
   |                                        |
   |<------ Transcript (JSON) --------------|
   |        {"transcript": "Hello"}         |
   |                                        |
   |------- Audio Frame (Binary) ---------->|
   |                                        |
   |<------ Transcript (JSON) --------------|
   |        {"transcript": "Hello world"}   |
   |                                        |
   |------- Disconnect -------------------->|
   |<------ Close --------------------------|

WebSocket endpoint

The Telnyx STT service uses a WebSocket endpoint. You authenticate by passing your API key as a Bearer token in the Authorization header. The connection URL follows this format: wss://api.telnyx.com/v2/speech-to-text/transcription

Query parameters

Parameter              Type    Required  Description
transcription_engine   string  No        STT engine to use (e.g., Deepgram, Azure).
input_format           string  No        Audio input format (e.g., mp3, wav).
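
For example, the connection URL can be assembled from the base endpoint and these query parameters. This is a small sketch; the build_stt_url helper and BASE_URL constant are names chosen for this example, and the values shown match the ones used later in this guide.
from urllib.parse import urlencode

BASE_URL = "wss://api.telnyx.com/v2/speech-to-text/transcription"

def build_stt_url(engine="Deepgram", input_format="mp3"):
    # Both parameters are optional; omit them to use the service defaults.
    query = urlencode({"transcription_engine": engine, "input_format": input_format})
    return f"{BASE_URL}?{query}"

print(build_stt_url())
# wss://api.telnyx.com/v2/speech-to-text/transcription?transcription_engine=Deepgram&input_format=mp3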

Supported engines

Telnyx offers several speech-to-text engines for transcribing audio:
Engine    Description
Google    Google speech-to-text engine that offers additional features like interim results.
Telnyx    In-house Telnyx speech-to-text engine with significantly better transcription accuracy and lower latency.
Deepgram  Deepgram speech-to-text engine with three models (nova-2, nova-3, and flux) that can be set using the transcription_model setting.
Azure     Azure speech-to-text engine with strong support for multiple languages and accents.
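
For instance, to request Deepgram's nova-3 model you could add transcription_model to the query string. Note this is only a sketch: passing transcription_model as a query parameter alongside transcription_engine is an assumption here, so confirm the exact placement against the API reference.
from urllib.parse import urlencode

# Assumption: transcription_model travels as a query parameter next to
# transcription_engine; verify against the Telnyx API reference.
query = urlencode({
    "transcription_engine": "Deepgram",
    "transcription_model": "nova-3",
    "input_format": "mp3",
})
url = f"wss://api.telnyx.com/v2/speech-to-text/transcription?{query}"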

Authenticating and connecting

Create a class to handle the connection using the websockets library.
import asyncio
import websockets
import os

class STTClient:
    def __init__(self, token, engine="Deepgram", input_format="mp3"):
        self.url = f"wss://api.telnyx.com/v2/speech-to-text/transcription?transcription_engine={engine}&input_format={input_format}"
        self.headers = {"Authorization": f"Bearer {token}"}
        self.ws = None

    async def connect(self):
        # Note: newer websockets releases use `additional_headers` instead of `extra_headers`.
        self.ws = await websockets.connect(self.url, extra_headers=self.headers)
        print("Connected to Telnyx STT")

    async def disconnect(self):
        if self.ws:
            await self.ws.close()
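
The keyword for passing custom headers differs between websockets releases: older versions accept extra_headers, while more recent versions expect additional_headers. If you need to support either, here is a small fallback sketch (open_socket is a name chosen for this example):
import websockets

async def open_socket(url, headers):
    # Try the newer keyword first, then fall back to the legacy one.
    try:
        return await websockets.connect(url, additional_headers=headers)
    except TypeError:
        return await websockets.connect(url, extra_headers=headers)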

Streaming audio

To transcribe audio, send binary frames to the WebSocket. The server processes these chunks in real-time.
    async def send_audio(self, audio_chunk):
        if self.ws:
            await self.ws.send(audio_chunk)
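
As an illustration, audio already held in memory can be sliced into fixed-size binary frames before sending. The iter_chunks helper below is a sketch; the 2 KB size matches the chunk size used in the complete example later.
def iter_chunks(audio_bytes, size=2048):
    # Yield fixed-size binary frames suitable for send_audio().
    for offset in range(0, len(audio_bytes), size):
        yield audio_bytes[offset:offset + size]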

Receiving transcripts

The server sends JSON messages back with transcription results. Add a method to listen for these messages concurrently while sending audio. Key fields to look for in the response:
  • transcript: The text transcription.
  • is_final: Boolean indicating if the sentence is complete.
  • confidence: The confidence score of the transcription.
import json

    async def receive_messages(self):
        async for message in self.ws:
            data = json.loads(message)
            
            if "transcript" in data:
                print(f"Transcript: {data['transcript']} (Final: {data.get('is_final')})")
            elif "error" in data:
                print(f"Error: {data['error']}")

Complete example

Here’s how to orchestrate the bidirectional stream using asyncio. Use asyncio.sleep to simulate real-time streaming when reading from a file. Crucially, after sending all of your audio, wait a few seconds before closing the connection; this gives the server time to send the final transcription results.
async def run_transcription(api_token, audio_file_path):
    client = STTClient(api_token)
    await client.connect()

    # Task 1: Listen for transcripts
    receive_task = asyncio.create_task(client.receive_messages())

    # Task 2: Send audio chunks
    with open(audio_file_path, "rb") as f:
        while True:
            chunk = f.read(2048) # Read 2KB chunks
            if not chunk:
                break
            await client.send_audio(chunk)
            await asyncio.sleep(0.1) # Simulate delay

    # Wait for final results
    print("Audio sent. Waiting for final transcripts...")
    await asyncio.sleep(20) 
    
    await client.disconnect()
    # Cancel the listener since connection is closed
    receive_task.cancel()

# Run the async loop
# asyncio.run(run_transcription("YOUR_TOKEN", "audio.mp3"))
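
Alternatively, to avoid hard-coding the token, you can read it from an environment variable. The variable name TELNYX_API_KEY below is just a name chosen for this example.
import asyncio
import os

if __name__ == "__main__":
    # The environment variable name is an arbitrary choice for this example.
    asyncio.run(run_transcription(os.environ["TELNYX_API_KEY"], "audio.mp3"))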

Additional resources