The WebSocket carries two frame types: binary frames (audio) from client to server, and JSON text frames in both directions.

Client → Server

Audio Data

Binary WebSocket frames containing raw audio bytes. No base64, no JSON wrapping. Recommended chunk size: 2048–8192 bytes. Smaller chunks reduce latency; larger chunks reduce round trips.
[binary frame: audio bytes]
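As a minimal sketch, splitting a raw audio buffer into frame-sized pieces might look like this (the 4096-byte chunk size is one choice within the recommended 2048–8192 range; `chunk_audio` is an illustrative helper, not part of the API):

```python
# Sketch: split raw audio bytes into payloads for binary WebSocket frames.
# CHUNK_SIZE is an assumed value within the recommended 2048-8192 byte range.
CHUNK_SIZE = 4096

def chunk_audio(raw: bytes, size: int = CHUNK_SIZE):
    """Yield successive binary-frame payloads from a raw audio buffer."""
    for offset in range(0, len(raw), size):
        yield raw[offset:offset + size]

# Each yielded chunk would be sent as one binary frame, e.g. ws.send(chunk).
```

Smaller values of `size` trade a few extra frames for lower first-result latency.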

Control Messages

JSON text frames with a type field.
{"type": "Finalize"}
Type          Effect                                         Engine support
Finalize      Flush audio buffer, force a final transcript   Deepgram only
CloseStream   End session, close connection gracefully       All
KeepAlive     Reset idle timeout                             Deepgram only
Unknown text frames are silently ignored.
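A sketch of building these control frames, with a guard mirroring the engine-support column above (the `control_message` helper and its engine check are illustrative, not part of the API):

```python
import json

# Sketch: build the JSON control frames listed above.
# Per the table, Finalize and KeepAlive are Deepgram-only.
DEEPGRAM_ONLY = {"Finalize", "KeepAlive"}

def control_message(msg_type: str, engine: str = "Deepgram") -> str:
    """Return a JSON text frame for the given control message type."""
    if msg_type in DEEPGRAM_ONLY and engine != "Deepgram":
        raise ValueError(f"{msg_type} is only supported by the Deepgram engine")
    return json.dumps({"type": msg_type})
```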

Server → Client

All server messages are JSON text frames.

Transcription Result

Emitted for each recognized speech segment (partial or final).
{
  "transcript": "Hello, how are you today?",
  "is_final": true,
  "speech_final": true,
  "confidence": 0.98
}
Field           Type      Present          Description
transcript      string    Always           Transcribed text
is_final        boolean   Always           true = finalized segment; false = interim (may revise)
speech_final    boolean   Deepgram         true = speaker stopped talking
confidence      float     When available   0.0–1.0 confidence score
utterance_end   boolean   Deepgram         true = silence-triggered utterance boundary
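Since only transcript and is_final are guaranteed, a client should read the other fields defensively. A minimal parsing sketch (the `parse_result` helper is illustrative):

```python
import json

# Sketch: parse a transcription result, tolerating optional fields.
# Only transcript and is_final are always present; the rest are
# engine-dependent per the field table above.
def parse_result(frame: str) -> dict:
    msg = json.loads(frame)
    return {
        "transcript": msg["transcript"],
        "is_final": msg["is_final"],
        "speech_final": msg.get("speech_final", False),    # Deepgram only
        "confidence": msg.get("confidence"),               # None when absent
        "utterance_end": msg.get("utterance_end", False),  # Deepgram only
    }
```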

Utterance End

Emitted when the speaker pauses (Deepgram only). The transcript is empty and is_final is true.
{
  "transcript": "",
  "is_final": true,
  "utterance_end": true
}

Error

Emitted on validation or connection errors. Connection closes shortly after.
{
  "errors": [
    {
      "code": "40002",
      "title": "Unsupported format",
      "detail": "Format 'flac' is not supported by engine 'Azure'",
      "source": {"parameter": "input_format"}
    }
  ]
}
Field                       Type     Description
errors                      array    One or more error objects
errors[].code               string   Error code (see Errors)
errors[].title              string   Short description
errors[].detail             string   Human-readable explanation
errors[].source.parameter   string   Query parameter that caused the error
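Because the connection closes shortly after an error frame, a client should surface it immediately. A sketch of flattening the errors array into log-ready strings (the `format_errors` helper is illustrative):

```python
import json

# Sketch: turn a server error frame into human-readable strings.
# source.parameter is optional, so it is appended only when present.
def format_errors(frame: str) -> list:
    msg = json.loads(frame)
    lines = []
    for e in msg.get("errors", []):
        line = f"[{e['code']}] {e['title']}: {e['detail']}"
        if "source" in e:
            line += f" (parameter: {e['source']['parameter']})"
        lines.append(line)
    return lines
```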

Message Flow

interim_results=false (default) — server sends only final transcripts:
Client:  [binary audio frames]
Server:  {"transcript": "Hello, how are you today?", "is_final": true, "speech_final": true, "confidence": 0.98}
Client:  [binary audio frames]
Server:  {"transcript": "I'm doing well.", "is_final": true, "speech_final": true, "confidence": 0.95}
Client:  {"type": "CloseStream"}
         [connection closed]
interim_results=true — server sends partials, then final:
Client:  [binary audio frames]
Server:  {"transcript": "Hello", "is_final": false, "speech_final": false}
Server:  {"transcript": "Hello, how are", "is_final": false, "speech_final": false}
Server:  {"transcript": "Hello, how are you today?", "is_final": true, "speech_final": true, "confidence": 0.98}
Partials are best-effort and may be revised by later messages. Only is_final: true results are stable.
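The flows above can be sketched as a client-side dispatch function. This is a minimal sketch of the message-handling logic only; a real client would feed it frames from a WebSocket library's receive loop (the `handle_frames` helper is illustrative):

```python
import json

# Sketch: dispatch incoming server frames per the message flow above.
# Keeps only stable (is_final: true) transcripts; skips utterance-end
# boundaries; raises on error frames, since the server closes soon after.
def handle_frames(frames: list) -> list:
    finals = []
    for frame in frames:
        msg = json.loads(frame)
        if "errors" in msg:
            raise RuntimeError(msg["errors"][0]["detail"])
        if msg.get("utterance_end"):
            continue  # silence boundary: empty transcript, no new text
        if msg.get("is_final"):
            finals.append(msg["transcript"])
        # interim results (is_final: false) are display-only and may revise
    return finals
```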