What voice cloning does

Voice cloning captures a speaker’s vocal characteristics — timbre, cadence, accent, pronunciation — from a short audio sample and applies them to new speech synthesis. The clone is a representation of the voice, not a recording of it. The system learns patterns from your audio and encodes them into parameters that guide TTS. This means:
  • The cloned voice can say things the original speaker never said
  • Clone quality is bounded by what the model can learn from your sample
  • Poor recordings, background noise, or inconsistent delivery degrade the clone
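The idea above can be sketched in code. The feature extraction here (a couple of crude signal statistics) is a toy stand-in for the learned speaker encoders real systems use; every name and formula below is illustrative, not how any production model works.

```python
import math

def extract_voice_params(samples: list[float]) -> dict:
    """Toy 'speaker encoder': reduce an audio sample to a few statistics.

    Real systems learn much richer features (timbre, cadence, accent),
    but the shape is the same: audio in, fixed parameters out.
    """
    # RMS energy: a stand-in for the speaker's overall loudness/energy.
    energy = math.sqrt(sum(x * x for x in samples) / len(samples))
    # Zero-crossing rate: a crude proxy for pitch/brightness.
    crossings = sum(1 for a, b in zip(samples, samples[1:]) if a * b < 0)
    return {"energy": energy, "zcr": crossings / len(samples)}

def synthesize(text: str, params: dict, sample_rate: int = 16000) -> list[float]:
    """Toy 'TTS': generate *new* audio shaped by the cloned parameters.

    Note the output is driven by the parameters, not copied from the
    original recording -- the clone can say things the speaker never said.
    """
    tone_hz = 200 + 4000 * params["zcr"]  # brightness follows the clone
    n_samples = sample_rate * len(text) // 100
    return [
        params["energy"] * math.sin(2 * math.pi * tone_hz * n / sample_rate)
        for n in range(n_samples)
    ]
```

Because the parameters are all the model keeps, anything baked into the sample (noise, inconsistent delivery) ends up baked into every synthesis.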

What cloning doesn’t do

A clone is not a recording. It’s a statistical approximation of a voice — the model extracts patterns (formant frequencies, prosodic tendencies, spectral characteristics) and applies them during synthesis. This means:
  • Output passes through the TTS model, which has its own characteristics. A clone sounds like the speaker, but through the lens of the model.
  • Quality has a ceiling set by your source audio. No amount of API parameters will fix a noisy or inconsistent recording.
  • The clone may not handle speech styles far from the original sample well. A voice cloned from calm narration may sound different when asked to express strong emotion.

Two ways to create a clone

  • Upload audio: send an audio file directly. Use this when you already have a recording ready.
  • From a voice design: save a previously generated design as a clone. Use this when you created the voice with Design a Voice.
Both produce the same output: a voice clone with a voice ID you can use in production.
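For the upload path, the request boils down to a multipart POST of the audio file. As a hedged sketch only: the endpoint URL, field names, and filename below are invented for illustration and are not the real API; check the API reference for the actual contract.

```python
import urllib.request

def build_clone_request(audio_path: str, voice_name: str,
                        base_url: str = "https://api.example.com/v1/voice-clones"):
    """Build (but do not send) a multipart/form-data upload request.

    base_url and the form field names are hypothetical placeholders.
    """
    with open(audio_path, "rb") as f:
        audio_bytes = f.read()
    boundary = "clone-upload-boundary"
    head = (
        f"--{boundary}\r\n"
        'Content-Disposition: form-data; name="name"\r\n'
        "\r\n"
        f"{voice_name}\r\n"
        f"--{boundary}\r\n"
        'Content-Disposition: form-data; name="audio"; filename="sample.wav"\r\n'
        "Content-Type: audio/wav\r\n"
        "\r\n"
    ).encode()
    tail = f"\r\n--{boundary}--\r\n".encode()
    return urllib.request.Request(
        base_url,
        data=head + audio_bytes + tail,
        method="POST",
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
    )
```

Whichever path you take, store the returned voice ID; it is what you pass to TTS calls in production.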

Recording best practices

  1. Match your recording to your use case. Don’t read a monotone script if you want an expressive clone. The AI replicates what it hears — including energy, emotion, and pacing.
  2. Speak clearly, avoid background noise. Use a decent microphone in a quiet space; background noise gets cloned too. You don’t need a $10K mic: a $100–300 USB condenser in a quiet room is sufficient.
  3. Avoid long pauses. The cloned voice will mimic pauses between sentences. Keep speech flowing naturally.
  4. Trim your recording. Speech from start to finish, no dead air at the beginning or end.
  5. Speak in the target language. If you want the clone to speak Spanish, record in Spanish.
  6. Keep it consistent. Same tone, accent, and energy throughout. Wide fluctuations confuse the model. The AI clones everything — including stutters, “uhms”, and inconsistencies.
  7. Aim for the right volume. Target -23 to -18 dB RMS with peaks no higher than -3 dB. Too quiet = noise floor issues. Too loud = clipping.
  8. Audio codec doesn’t matter much. MP3 at 128 kbps or above is fine. WAV is ideal but higher bitrate MP3 won’t noticeably hurt quality.
  9. Optimal duration by model:
    • Qwen3TTS: 5–10 seconds. Auto-trims to 10s. More isn’t better.
    • Ultra: Up to 10 seconds.
    • Minimax: 1–2 minutes is the sweet spot. Longer recordings capture more vocal range, but beyond 3 minutes yields diminishing returns.
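The volume guidance in item 7 can be checked programmatically before you upload. A minimal sketch, assuming mono float samples normalized to [-1, 1]; the thresholds are the -23 to -18 dB RMS window and -3 dB peak ceiling recommended above:

```python
import math

def rms_dbfs(samples: list[float]) -> float:
    """RMS level in dB relative to full scale (a full-scale square wave is 0 dBFS)."""
    rms = math.sqrt(sum(x * x for x in samples) / len(samples))
    return 20 * math.log10(rms) if rms > 0 else float("-inf")

def peak_dbfs(samples: list[float]) -> float:
    """Peak sample level in dBFS."""
    peak = max(abs(x) for x in samples)
    return 20 * math.log10(peak) if peak > 0 else float("-inf")

def check_sample(samples: list[float]) -> list[str]:
    """Flag level problems likely to degrade a clone."""
    problems = []
    level = rms_dbfs(samples)
    if level < -23:
        problems.append("too quiet: raise gain or record closer to the mic")
    elif level > -18:
        problems.append("too hot: back off the gain to avoid distortion")
    if peak_dbfs(samples) > -3:
        problems.append("peaks above -3 dBFS: risk of clipping")
    return problems
```

For example, a tone with amplitude 0.14 sits near -20 dB RMS, comfortably inside the recommended window, so `check_sample` returns no problems for it.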