Voice cloning¶

Some TTS models can only synthesize speech for a fixed set of voices they were trained on. Others also support voice cloning, i.e., they can generate a new voice on-the-fly from a provided reference audio file. These Coqui TTS models allow voice cloning (the model config will contain supports_cloning=True):

YourTTS (and other d-vector based models)
XTTS
Tortoise
Bark

All voice conversion models also perform voice cloning, but with speech as the input instead of text.

Important

Voice cloning raises several ethical concerns and must not be used to impersonate individuals without their consent, deceive others, or spread misinformation (deepfakes). We strongly encourage you to respect the privacy and rights of individuals, and to always ensure that your use of Coqui TTS is transparent, responsible, and respectful of others.

Usage¶

Changed in version 0.27.0: Coqui can now cache cloned voices for easy reuse. Implementation details of this may change in future versions.

Reference audio for voice cloning is passed via the speaker_wav argument, which may be a single file or a list of files. If a custom speaker ID is also passed in speaker, the resulting voice will be cached in voice_dir. If that voice already exists, it is overwritten. Subsequent calls can then use that speaker without having to provide reference audio again.

voice_dir defaults to a subfolder voices/ in the folder of the model checkpoint. For models used by name, e.g. tts_models/multilingual/multi-dataset/xtts_v2, this would be ~/.local/share/tts/tts_models--multilingual--multi-dataset--xtts_v2/voices/ (on Linux, see the FAQ for default model locations on other platforms).

Python API¶

import torch
from TTS.api import TTS

# Get device
device = "cuda" if torch.cuda.is_available() else "cpu"

# Initialize a TTS model with voice cloning support
api = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to(device)

# 1. Clone the voice from `speaker_wav` and cache it under a custom speaker ID
api.tts_to_file(
  text="Hello world",
  speaker_wav=["my/cloning/audio.wav", "my/cloning/audio2.wav"],
  speaker="MySpeaker1",
  language="en",
)

# 2. The voice can now be reused without providing reference audio
api.tts_to_file(
  text="Hello world",
  speaker="MySpeaker1",
  language="en",
)

Command-line interface¶

# 1. Clone the voice from `speaker_wav` and cache it under a custom speaker ID
tts --model_name "tts_models/multilingual/multi-dataset/xtts_v2" \
    --text "Hello world" \
    --language_idx "en" \
    --speaker_wav "my/cloning/audio.wav" "my/cloning/audio2.wav" \
    --speaker_idx "MySpeaker1"

# 2. The voice can now be reused without providing reference audio
tts --model_name "tts_models/multilingual/multi-dataset/xtts_v2" \
    --text "Hello world" \
    --language_idx "en" \
    --speaker_idx "MySpeaker1"

Metadata¶

Some metadata is stored within the voice files to check what models they are compatible with (Coqui does not enforce any checks yet). Let’s create a voice with the LJSpeech files stored in the repository:

tts --model_name "voice_conversion_models/multilingual/multi-dataset/knnvc" \
    --source_wav source.wav \
    --target_wav tests/data/ljspeech/wavs/*.wav \
    --speaker_idx LJ \
    --voice_dir wavlm-voices

We can then print the metadata:

import torch
voice = torch.load("wavlm-voices/LJ.pth", map_location="cpu")
print(voice["metadata"])

{
  'model': {'name': 'wavlm', 'layer': 6},
  'speaker_id': 'LJ',
  'source_files': [
    'tests/data/ljspeech/wavs/LJ001-0001.wav',
    'tests/data/ljspeech/wavs/LJ001-0002.wav',
    ...,
  ],
  'created_at': '2025-06-25T12:17+00:00',
  'coqui_version': '0.27.0',
}

For developers¶

To add voice cloning support to a model, it needs to inherit from CloningMixin (in addition to BaseTTS or BaseVC). You then only have to implement a model-specific _clone_voice() method that returns speaker embeddings and model-specific metadata. For example, for XTTS:

    def _clone_voice(
        self, speaker_wav: str | os.PathLike[Any] | list[str | os.PathLike[Any]], **generate_kwargs: Any
    ) -> tuple[dict[str, Any], dict[str, Any]]:
        gpt_conditioning_latents, speaker_embedding = self.get_conditioning_latents(
            audio_path=speaker_wav,
            **generate_kwargs,
        )
        voice = {"gpt_conditioning_latents": gpt_conditioning_latents, "speaker_embedding": speaker_embedding}
        metadata = {"name": self.config["model"]}
        return voice, metadata

Then in your model’s synthesize() method you use the mixin’s clone_voice() to access the voice data, while all caching is handled automatically.

Voice cloning¶

Usage¶

Python API¶

Command-line interface¶

Metadata¶

For developers¶

API doc¶

CloningMixin¶

VoiceMetadata¶