What voice cloning does
Voice cloning captures a speaker’s vocal characteristics — timbre, cadence, accent, pronunciation — from a short audio sample and applies them to new speech synthesis. The clone is a representation of the voice, not a recording of it. The system learns patterns from your audio and encodes them into parameters that guide TTS. This means:
- The cloned voice can say things the original speaker never said
- Clone quality is bounded by what the model can learn from your sample
- Poor recordings, background noise, or inconsistent delivery degrade the clone
What cloning doesn’t do
A clone is not a recording. It’s a statistical approximation of a voice — the model extracts patterns (formant frequencies, prosodic tendencies, spectral characteristics) and applies them during synthesis. This means:
- Output passes through the TTS model, which has its own characteristics. A clone sounds like the speaker, but through the lens of the model.
- Quality has a ceiling set by your source audio. No amount of API parameters will fix a noisy or inconsistent recording.
- The clone may not handle speech styles far from the original sample well. A voice cloned from calm narration may sound different when asked to express strong emotion.
Two ways to create a clone
| Method | What it does | When to use |
|---|---|---|
| Upload audio | Send an audio file directly | You have a recording ready |
| From a voice design | Save a previously generated design as a clone | You used Design a Voice to create it |
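As a sketch of how these two methods might map onto an HTTP API. The endpoint path `/voices/clone` and the field names `audio`, `design_id`, and `name` are illustrative assumptions, not the service's documented API; check the API reference for the real shapes. The functions only build request descriptions so the two payload styles are easy to compare:

```python
# Hypothetical endpoint and field names -- for illustration only.
API_BASE = "https://api.example.com/v1"

def clone_from_upload(audio_path: str, name: str) -> dict:
    """Method 1: upload an audio file directly.

    The recording goes up as multipart form data alongside a display name.
    """
    return {
        "method": "POST",
        "url": f"{API_BASE}/voices/clone",
        "files": {"audio": audio_path},   # the recording itself
        "data": {"name": name},
    }

def clone_from_design(design_id: str, name: str) -> dict:
    """Method 2: save a previously generated voice design as a clone.

    No audio is sent; the request just references the existing design.
    """
    return {
        "method": "POST",
        "url": f"{API_BASE}/voices/clone",
        "json": {"design_id": design_id, "name": name},
    }
```

The key difference: method 1 carries the audio bytes, while method 2 only references a design you already created, so no new recording is needed.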
Recording best practices
- Match your recording to your use case. Don’t read a monotone script if you want an expressive clone. The AI replicates what it hears — including energy, emotion, and pacing.
- Speak clearly and avoid background noise. Use a decent microphone in a quiet space; background noise gets cloned too. You don’t need a studio: a 100–300 USD USB condenser in a quiet room is sufficient.
- Avoid long pauses. The cloned voice will mimic pauses between sentences. Keep speech flowing naturally.
- Trim your recording. Speech from start to finish, no dead air at the beginning or end.
- Speak in the target language. If you want the clone to speak Spanish, record in Spanish.
- Keep it consistent. Same tone, accent, and energy throughout. Wide fluctuations confuse the model. The AI clones everything — including stutters, “uhms”, and inconsistencies.
- Aim for the right volume. Target -23 to -18 dB RMS with peaks no higher than -3 dB. Too quiet = noise floor issues. Too loud = clipping.
- Audio codec doesn’t matter much. WAV is ideal, but MP3 at 128 kbps or above won’t noticeably hurt quality.
- Optimal duration by model:
- Qwen3TTS: 5–10 seconds. Auto-trims to 10s. More isn’t better.
- Ultra: Up to 10 seconds.
  - Minimax: 1–2 minutes is the sweet spot. Longer recordings capture more vocal range, but recordings beyond 3 minutes yield diminishing returns.
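The RMS and peak targets above can be checked before uploading. A minimal sketch, assuming a mono recording already decoded to float samples normalized to [-1.0, 1.0] (so full scale is 0 dBFS):

```python
import math

# Recommended level targets from the best-practices list above.
RMS_MIN_DB, RMS_MAX_DB = -23.0, -18.0  # target RMS range
PEAK_MAX_DB = -3.0                      # peaks should stay at or below this

def level_check(samples: list[float]) -> tuple[float, float, bool]:
    """Return (rms_db, peak_db, ok) for float samples in [-1.0, 1.0].

    ok is True when RMS falls in the recommended range and the peak
    stays at or below -3 dBFS (i.e. no risk of clipping).
    """
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    peak = max(abs(s) for s in samples)
    rms_db = 20 * math.log10(rms) if rms > 0 else float("-inf")
    peak_db = 20 * math.log10(peak) if peak > 0 else float("-inf")
    ok = RMS_MIN_DB <= rms_db <= RMS_MAX_DB and peak_db <= PEAK_MAX_DB
    return rms_db, peak_db, ok
```

For example, a sine wave with amplitude 0.14 has an RMS of about -20 dBFS and a peak of about -17 dBFS, so it passes; a sine at amplitude 0.9 peaks near -1 dBFS and fails the clipping check.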