Transcribe speech to text. This endpoint is compatible with the OpenAI Transcription API and can be used with the OpenAI JS or Python SDK.
Bearer authentication header of the form Bearer <token>, where <token> is your auth token.
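As a minimal sketch, the header described above could be built like this in Python. The helper name `auth_headers` and the environment variable `TRANSCRIPTION_API_TOKEN` are hypothetical, chosen only for illustration; they are not defined by the API.

```python
import os
from typing import Dict, Optional


def auth_headers(token: Optional[str] = None) -> Dict[str, str]:
    """Build an Authorization header of the form `Bearer <token>`."""
    # Fall back to a hypothetical environment variable when no token is passed in.
    token = token or os.environ.get("TRANSCRIPTION_API_TOKEN")
    if not token:
        raise ValueError("an auth token is required")
    return {"Authorization": f"Bearer {token}"}
```

The returned dictionary can be passed as the headers of whichever HTTP client is used to call the endpoint.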
ID of the model to use. distil-whisper/distil-large-v2 has lower latency but is English-only. openai/whisper-large-v3-turbo is multilingual but has slightly higher latency. deepgram/nova-3 supports English variants (en, en-US, en-GB, en-AU, en-NZ, en-IN) and only accepts mp3 and wav files.
Available options: distil-whisper/distil-large-v2, openai/whisper-large-v3-turbo, deepgram/nova-3
"distil-whisper/distil-large-v2"
The audio file object to transcribe, in one of these formats: flac, mp3, mp4, mpeg, mpga, m4a, ogg, wav, or webm. File uploads are limited to 100 MB. Cannot be used together with file_url. Note: deepgram/nova-3 only supports mp3 and wav formats.
URL of an audio file in one of these formats: flac, mp3, mp4, mpeg, mpga, m4a, ogg, wav, or webm. Hosted files are limited to 100 MB. Cannot be used together with file. Note: deepgram/nova-3 only supports mp3 and wav formats.
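The constraints on file and file_url can be checked on the client before sending a request. The following is a sketch only: `check_audio_source` is a hypothetical helper, the format is guessed from the file extension (which a URL may not carry), and reading "100 MB" as 100 × 1024 × 1024 bytes is an assumption.

```python
import os
from typing import List, Optional
from urllib.parse import urlparse

ALLOWED_FORMATS = {"flac", "mp3", "mp4", "mpeg", "mpga", "m4a", "ogg", "wav", "webm"}
NOVA_3_FORMATS = {"mp3", "wav"}
MAX_BYTES = 100 * 1024 * 1024  # assumption: the exact byte count behind "100 MB" is not documented


def check_audio_source(
    model: str,
    file_path: Optional[str] = None,
    file_url: Optional[str] = None,
    size_bytes: Optional[int] = None,
) -> List[str]:
    """Return a list of problems; an empty list means the inputs look valid."""
    # file and file_url cannot be used together, and one of them is needed.
    if (file_path is None) == (file_url is None):
        return ["provide exactly one of file or file_url"]

    # Guess the format from the extension of the path or of the URL path.
    name = file_path if file_path is not None else urlparse(file_url).path
    ext = os.path.splitext(name)[1].lower().lstrip(".")

    problems = []
    if ext not in ALLOWED_FORMATS:
        problems.append(f"unsupported format: {ext or 'unknown'}")
    elif model == "deepgram/nova-3" and ext not in NOVA_3_FORMATS:
        problems.append("deepgram/nova-3 only accepts mp3 and wav")
    if size_bytes is not None and size_bytes > MAX_BYTES:
        problems.append("audio exceeds the 100 MB limit")
    return problems
```

For example, `check_audio_source("deepgram/nova-3", file_path="talk.flac")` reports the mp3/wav restriction, while the same file with openai/whisper-large-v3-turbo passes.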
"https://example.com/file.mp3"
The format of the transcript output. Use verbose_json to take advantage of timestamps.
Available options: json, verbose_json
"json"
The timestamp granularities to populate for this transcription. response_format must be set to verbose_json to use timestamp granularities. Currently, only segment is supported.
Available options: segment
"segment"
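The dependency between the output format and timestamp granularities can be sketched as a small validation step. The helper `build_output_options` is hypothetical, and the field name `timestamp_granularities` is an assumption modeled on the OpenAI Transcription API; only `response_format` is named in this reference.

```python
from typing import Dict, List, Optional


def build_output_options(
    response_format: str = "json",
    timestamp_granularities: Optional[List[str]] = None,
) -> Dict[str, object]:
    """Return output-related request fields, rejecting invalid combinations."""
    if response_format not in ("json", "verbose_json"):
        raise ValueError(f"unsupported response_format: {response_format}")

    fields: Dict[str, object] = {"response_format": response_format}
    if timestamp_granularities:
        # Timestamp granularities are only valid with verbose_json.
        if response_format != "verbose_json":
            raise ValueError("timestamp granularities require response_format=verbose_json")
        unsupported = [g for g in timestamp_granularities if g != "segment"]
        if unsupported:
            raise ValueError(f"unsupported granularities: {unsupported}")
        # Assumed field name, not confirmed by this reference.
        fields["timestamp_granularities"] = list(timestamp_granularities)
    return fields
```

Failing early on the client is a design choice in this sketch; how the endpoint itself responds to an invalid combination is not stated here.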
The language of the audio to be transcribed. deepgram/nova-3 supports only English variants: en, en-US, en-GB, en-AU, en-NZ, en-IN. openai/whisper-large-v3-turbo supports multiple languages. distil-whisper/distil-large-v2 does not support the language parameter.
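The per-model language rules above could be enforced before a request is built, roughly as follows. `language_field` is a hypothetical helper, and raising an error on the client is a choice made for this sketch; whether the endpoint rejects or ignores an unsupported language is not stated here.

```python
from typing import Dict, Optional

NOVA_3_LANGUAGES = {"en", "en-US", "en-GB", "en-AU", "en-NZ", "en-IN"}


def language_field(model: str, language: Optional[str] = None) -> Dict[str, str]:
    """Return the language request field, or an empty dict when none is given."""
    if language is None:
        return {}
    if model == "distil-whisper/distil-large-v2":
        # This model is English-only and does not take a language parameter.
        raise ValueError("distil-whisper/distil-large-v2 does not support the language parameter")
    if model == "deepgram/nova-3" and language not in NOVA_3_LANGUAGES:
        raise ValueError(f"deepgram/nova-3 does not support language: {language}")
    return {"language": language}
```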
"en-US"
Additional model-specific configuration parameters. Only allowed with the deepgram/nova-3 model. Can include Deepgram-specific options such as smart_format, punctuate, diarize, utterance, numerals, and language. If language is provided both as a top-level parameter and in model_config, the top-level parameter takes precedence.
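The precedence rule for language can be mirrored on the client to predict which value will apply. This is a sketch under stated assumptions: `effective_language` is a hypothetical helper that only models the documented behavior and does not call the API.

```python
from typing import Any, Dict, Optional


def effective_language(
    model: str,
    language: Optional[str] = None,
    model_config: Optional[Dict[str, Any]] = None,
) -> Optional[str]:
    """Return the language expected to apply, following the documented precedence."""
    if model_config and model != "deepgram/nova-3":
        raise ValueError("model_config is only allowed with deepgram/nova-3")
    # The top-level parameter wins over model_config["language"].
    if language is not None:
        return language
    return (model_config or {}).get("language")
```

With `language="en-GB"` and `model_config={"language": "en-US", "smart_format": True}`, the function returns `"en-GB"`.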
{ "smart_format": true, "punctuate": true }
Successful Response
The transcribed text for the audio file.
The duration of the audio file in seconds. This is only included if response_format is set to verbose_json.
Segments of the transcribed text and their corresponding details. This is only included if response_format is set to verbose_json.
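A response in verbose_json form could be consumed roughly as follows. The top-level keys `text`, `duration`, and `segments` are assumed names for the three response fields described above, and the per-segment keys `start`, `end`, and `text` are an assumption modeled on the OpenAI verbose_json format. The sample payload is invented purely for illustration.

```python
import json
from typing import Any, Dict, List

# Hypothetical payload, not real API output.
SAMPLE = """
{
  "text": "Hello world.",
  "duration": 1.5,
  "segments": [{"start": 0.0, "end": 1.5, "text": "Hello world."}]
}
"""


def summarize(response: Dict[str, Any]) -> List[str]:
    """Turn a transcription response into printable lines."""
    lines = [response["text"]]
    # duration and segments are only present when response_format is verbose_json.
    if "duration" in response:
        lines.append(f"duration: {response['duration']:.1f}s")
    for seg in response.get("segments", []):
        lines.append(f"[{seg['start']:.1f}-{seg['end']:.1f}] {seg['text']}")
    return lines


if __name__ == "__main__":
    for line in summarize(json.loads(SAMPLE)):
        print(line)
```

Because the optional fields are read defensively, the same function also handles a plain json response that carries only the transcribed text.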