Open a WebSocket connection to stream audio and receive transcriptions in real time. Authentication uses the standard Authorization: Bearer <API_KEY> header.
Supported engines: Azure, Deepgram, Google, Telnyx.
Connection flow:
1. The client opens the WebSocket connection, authenticating with a Bearer header of the form Bearer <token>, where <token> is your auth token.
2. The client streams binary audio frames to the server.
3. The server responds with JSON frames carrying transcript, is_final, and confidence fields.
Parameters:
Transcription engine: The transcription engine to use for processing the audio stream. One of Azure, Deepgram, Google, Telnyx.
Audio format: The format of the input audio stream. One of mp3, wav.
Language: The language spoken in the audio stream.
Interim results: Whether to receive interim transcription results.
Model: The specific model to use within the selected transcription engine (for example, fast).
Silence threshold: Silence duration (in milliseconds) that triggers end-of-speech detection. When set, the engine uses this value to determine when a speaker has stopped talking. Not all engines support this parameter.
Redaction: Enables redaction of sensitive information (e.g., PCI data, SSN) from transcription results. Supported values depend on the transcription engine.
Boost term: A key term to boost in the transcription; the engine becomes more likely to recognize it. Can be specified multiple times for multiple terms.
Keywords: A comma-separated list of keywords to boost in the transcription. The engine will prioritize recognition of these words.
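As a minimal sketch of assembling a connection request from the parameters above: the endpoint URL and the query-parameter names (`transcription_engine`, `audio_format`, `language`, `interim_results`) are placeholders/assumptions for illustration, not confirmed names; substitute the values from your API reference.

```python
import urllib.parse

# Placeholder endpoint -- replace with the real WebSocket URL.
BASE_URL = "wss://api.example.com/v2/stt/streaming"

def build_connect_request(api_key: str, engine: str, audio_format: str,
                          language: str = "en", interim: bool = True):
    """Assemble the WebSocket URL and auth header for the STT stream.

    Query-parameter names here are hypothetical; the Authorization
    header form (Bearer <API_KEY>) follows the description above.
    """
    params = {
        "transcription_engine": engine,   # Azure | Deepgram | Google | Telnyx
        "audio_format": audio_format,     # mp3 | wav
        "language": language,
        "interim_results": str(interim).lower(),
    }
    url = BASE_URL + "?" + urllib.parse.urlencode(params)
    headers = {"Authorization": f"Bearer {api_key}"}
    return url, headers
```

The returned URL and header dict can then be passed to whichever WebSocket client library you use.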
Client → Server messages (SttClientEvent schema): binary audio data in mp3 or wav format, sent as binary WebSocket frames.
Once the WebSocket connection is established, communication proceeds via binary audio frames (client) and JSON transcript frames (server):
Client → Server: binary audio data (mp3/wav).
Server → Client: JSON frames conforming to the TranscriptFrame and SttErrorFrame schemas.
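A sketch of dispatching incoming server frames by type. The transcript-frame fields follow the TranscriptFrame schema described in this document; the error-frame shape (a `message` field) is an assumption, since the SttErrorFrame fields are not listed here.

```python
import json

def route_server_frame(raw: str):
    """Classify a server-to-client JSON frame.

    Returns ("transcript", text, is_final) for transcript frames.
    Anything else is treated as an error frame; its "message" field
    is an assumed name, as the SttErrorFrame schema is not shown here.
    """
    frame = json.loads(raw)
    if frame.get("type") == "transcript":
        return ("transcript", frame["transcript"], frame["is_final"])
    return ("error", frame.get("message"), None)
```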
Union of all server-to-client WebSocket events for STT streaming.
type: Frame type identifier ("transcript" for transcript frames).
transcript: The transcribed text from the audio.
is_final: Whether this is a final transcription result. When false, this is an interim result that may be refined.
confidence: Confidence score of the transcription, ranging from 0 to 1.
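A small example of consuming these frames on the client side: it skips interim results (is_final false) and drops low-confidence text, using only the transcript, is_final, and confidence fields defined above. The confidence threshold is an application choice, not part of the API.

```python
import json

def collect_final_text(raw_frames, min_confidence=0.0):
    """Join the text of final transcript frames.

    Interim results (is_final == false) are ignored, as are frames
    below the caller-chosen confidence threshold (0 to 1 per the schema).
    """
    parts = []
    for raw in raw_frames:
        frame = json.loads(raw)
        if (frame.get("type") == "transcript"
                and frame.get("is_final")
                and frame.get("confidence", 0.0) >= min_confidence):
            parts.append(frame["transcript"])
    return " ".join(parts)
```

Because interim frames may be refined later, accumulating only final frames avoids duplicated or revised text in the assembled transcript.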