Transcribe speech to text. This endpoint is compatible with the OpenAI Transcription API and can be used with the OpenAI JS or Python SDK.
Bearer authentication header of the form Bearer <token>, where <token> is your auth token.
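As a minimal sketch of the header format described above, the helper below builds the Authorization header from a token. The function name and the TRANSCRIPTION_API_TOKEN environment variable are hypothetical and chosen only for illustration.

```python
import os


def auth_headers(token=None):
    """Return the Bearer authentication header for a request."""
    # Hypothetical environment variable name, used only when no token is passed.
    token = token or os.environ.get("TRANSCRIPTION_API_TOKEN")
    if not token:
        raise ValueError("an auth token is required")
    return {"Authorization": f"Bearer {token}"}
```

The returned dictionary can be passed as the headers argument of whichever HTTP client you use.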
ID of the model to use. distil-whisper/distil-large-v2 has lower latency but is English-only. openai/whisper-large-v3-turbo is multilingual but has slightly higher latency. deepgram/nova-3 supports English variants (en, en-US, en-GB, en-AU, en-NZ, en-IN) and only accepts mp3 and wav files.
Allowed values: distil-whisper/distil-large-v2, openai/whisper-large-v3-turbo, deepgram/nova-3
"distil-whisper/distil-large-v2"
The audio file object to transcribe, in one of these formats: flac, mp3, mp4, mpeg, mpga, m4a, ogg, wav, or webm. File uploads are limited to 100 MB. Cannot be used together with file_url. Note: deepgram/nova-3 only supports mp3 and wav formats.
Link to an audio file in one of these formats: flac, mp3, mp4, mpeg, mpga, m4a, ogg, wav, or webm. Hosted files are limited to 100 MB. Cannot be used together with file. Note: deepgram/nova-3 only supports mp3 and wav formats.
"https://example.com/file.mp3"
The format of the transcript output. Use verbose_json to take advantage of timestamps.
Allowed values: json, verbose_json
"json"
The timestamp granularities to populate for this transcription. response_format must be set to verbose_json to use timestamp granularities. Currently only segment is supported.
Allowed values: segment
"segment"
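The constraints described so far (file and file_url are mutually exclusive, deepgram/nova-3 accepts only mp3 and wav, and timestamp granularities require verbose_json) can be checked client-side before sending a request. The sketch below is one way to do that, not the service's own validation. It assumes the field names model and timestamp_granularities follow OpenAI's naming, that timestamp_granularities is sent as a single string, and that the audio format can be inferred from the file extension. The function name build_fields is hypothetical.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Formats as listed in the file and file_url descriptions.
ALL_FORMATS = {"flac", "mp3", "mp4", "mpeg", "mpga", "m4a", "ogg", "wav", "webm"}
NOVA3_FORMATS = {"mp3", "wav"}


def build_fields(model, file=None, file_url=None,
                 response_format=None, timestamp_granularities=None):
    """Check the documented constraints and return the non-file fields."""
    # file and file_url cannot be used together; exactly one must be given.
    if (file is None) == (file_url is None):
        raise ValueError("provide exactly one of file or file_url")

    # Assumption: the audio format is read from the file extension.
    name = urlparse(file_url).path if file_url is not None else str(file)
    ext = PurePosixPath(name).suffix.lower().lstrip(".")
    allowed = NOVA3_FORMATS if model == "deepgram/nova-3" else ALL_FORMATS
    if ext not in allowed:
        raise ValueError(f"{model} does not accept .{ext} files")

    # A local file is uploaded separately, so only file_url is added here.
    fields = {"model": model}
    if file_url is not None:
        fields["file_url"] = file_url
    if response_format is not None:
        if response_format not in ("json", "verbose_json"):
            raise ValueError("response_format must be json or verbose_json")
        fields["response_format"] = response_format
    if timestamp_granularities is not None:
        if response_format != "verbose_json":
            raise ValueError("timestamp_granularities requires verbose_json")
        if timestamp_granularities != "segment":
            raise ValueError("only segment granularity is supported")
        fields["timestamp_granularities"] = timestamp_granularities
    return fields
```

For example, build_fields("deepgram/nova-3", file="talk.flac") raises ValueError, because that model only accepts mp3 and wav input.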
The language of the audio to be transcribed. deepgram/nova-3 supports only English variants: en, en-US, en-GB, en-AU, en-NZ, en-IN. openai/whisper-large-v3-turbo supports multiple languages. distil-whisper/distil-large-v2 does not support the language parameter.
"en-US"
Additional model-specific configuration parameters. Only allowed with the deepgram/nova-3 model. Can include Deepgram-specific options such as smart_format, punctuate, diarize, utterance, numerals, and language. If language is provided both as a top-level parameter and in model_config, the top-level parameter takes precedence.
{ "smart_format": true, "punctuate": true }
Successful Response
Response fields vary by model. distil-whisper/distil-large-v2 returns text, duration, and segments in verbose_json mode. openai/whisper-large-v3-turbo returns text only. deepgram/nova-3 returns text and, depending on model_config, may include words with per-word timestamps and speaker labels.
The transcribed text for the audio file.
The duration of the audio file in seconds. Returned by distil-whisper/distil-large-v2 and deepgram/nova-3 when response_format is verbose_json. Not returned by openai/whisper-large-v3-turbo.
Segments of the transcribed text and their corresponding details. Returned by distil-whisper/distil-large-v2 when response_format is verbose_json. Not returned by openai/whisper-large-v3-turbo.
Word-level timestamps and optional speaker labels. Only returned by deepgram/nova-3 when word-level output is enabled via model_config.
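Because the response fields vary by model, client code should treat everything except text as optional. The following sketch summarizes a decoded JSON response without assuming the inner structure of segments or words, which is not specified here. The function name and the sample dictionaries are illustrative only.

```python
def summarize_transcription(response):
    """Summarize a decoded response, tolerating fields a model omits."""
    # segments and words may be missing or null depending on the model.
    segments = response.get("segments") or []
    words = response.get("words") or []
    return {
        "text": response["text"],
        "duration": response.get("duration"),
        "segment_count": len(segments),
        "word_count": len(words),
    }


# Illustrative input, not real API output.
print(summarize_transcription({"text": "hello world"}))
```

Reading optional fields with get keeps the same code path working across all three models.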