Voice cloning¶
Some TTS models can only synthesize speech for a fixed set of voices they were
trained on. Others also support voice cloning, i.e., they can generate a new
voice on-the-fly from a provided reference audio file. These Coqui TTS models
allow voice cloning (the model config will contain supports_cloning=True):
All voice conversion models also perform voice cloning, but with speech as the input instead of text.
Important
Voice cloning raises several ethical concerns and must not be used to impersonate individuals without their consent, deceive others, or spread misinformation (deepfakes). We strongly encourage you to respect the privacy and rights of individuals, and to always ensure that your use of Coqui TTS is transparent, responsible, and respectful of others.
Usage¶
Changed in version 0.27.0: Coqui can now cache cloned voices for easy reuse. Implementation details of this may change in future versions.
Reference audio for voice cloning is passed via the speaker_wav argument,
which may be a single file or a list of files. If a custom speaker ID is also
passed in speaker, the resulting voice will be cached in voice_dir. If that
voice already exists, it is overwritten. Subsequent calls can then use that
speaker without having to provide reference audio again.
voice_dir defaults to a subfolder voices/ in the folder of the model
checkpoint. For models used by name, e.g.
tts_models/multilingual/multi-dataset/xtts_v2, this would be
~/.local/share/tts/tts_models--multilingual--multi-dataset--xtts_v2/voices/
(on Linux, see the FAQ for
default model locations on other platforms).
Python API¶
import torch
from TTS.api import TTS
# Get device
device = "cuda" if torch.cuda.is_available() else "cpu"
# Initialize a TTS model with voice cloning support
api = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to(device)
# 1. Clone the voice from `speaker_wav` and cache it under a custom speaker ID
api.tts_to_file(
text="Hello world",
speaker_wav=["my/cloning/audio.wav", "my/cloning/audio2.wav"],
speaker="MySpeaker1",
language="en",
)
# 2. The voice can now be reused without providing reference audio
api.tts_to_file(
text="Hello world",
speaker="MySpeaker1",
language="en",
)
Command-line interface¶
# 1. Clone the voice from `speaker_wav` and cache it under a custom speaker ID
tts --model_name "tts_models/multilingual/multi-dataset/xtts_v2" \
--text "Hello world" \
--language_idx "en" \
--speaker_wav "my/cloning/audio.wav" "my/cloning/audio2.wav" \
--speaker_idx "MySpeaker1"
# 2. The voice can now be reused without providing reference audio
tts --model_name "tts_models/multilingual/multi-dataset/xtts_v2" \
--text "Hello world" \
--language_idx "en" \
--speaker_idx "MySpeaker1"
Metadata¶
Some metadata is stored within the voice files to check what models they are compatible with (Coqui does not enforce any checks yet). Letâs create a voice with the LJSpeech files stored in the repository:
tts --model_name "voice_conversion_models/multilingual/multi-dataset/knnvc" \
--source_wav source.wav \
--target_wav tests/data/ljspeech/wavs/*.wav \
--speaker_idx LJ \
--voice_dir wavlm-voices
We can then print the metadata:
import torch
voice = torch.load("wavlm-voices/LJ.pth", map_location="cpu")
print(voice["metadata"])
{
'model': {'name': 'wavlm', 'layer': 6},
'speaker_id': 'LJ',
'source_files': [
'tests/data/ljspeech/wavs/LJ001-0001.wav',
'tests/data/ljspeech/wavs/LJ001-0002.wav',
...,
],
'created_at': '2025-06-25T12:17+00:00',
'coqui_version': '0.27.0',
}
For developers¶
To add voice cloning support to a model, it needs to inherit from
CloningMixin (in addition to
BaseTTS or
BaseVC). You then only have to implement a
model-specific _clone_voice() method that returns speaker embeddings and
model-specific metadata. For example, for XTTS:
def _clone_voice(
self, speaker_wav: str | os.PathLike[Any] | list[str | os.PathLike[Any]], **generate_kwargs: Any
) -> tuple[dict[str, Any], dict[str, Any]]:
gpt_conditioning_latents, speaker_embedding = self.get_conditioning_latents(
audio_path=speaker_wav,
**generate_kwargs,
)
voice = {"gpt_conditioning_latents": gpt_conditioning_latents, "speaker_embedding": speaker_embedding}
metadata = {"name": self.config["model"]}
return voice, metadata
Then in your modelâs synthesize() method you use the mixinâs
clone_voice() to access the
voice data, while all caching is handled automatically.