# Voice cloning

Some TTS models can only synthesize speech for a fixed set of voices they were
trained on. Others also support _voice cloning_, i.e., they can generate a new
voice on-the-fly from a provided reference audio file. These Coqui TTS models
allow voice cloning (the model config will contain `supports_cloning=True`):

- [YourTTS](models/vits.md) (and other d-vector based models)
- [XTTS](models/xtts.md)
- [Tortoise](models/tortoise.md)
- [Bark](models/bark.md)

All [voice conversion models](vc.md) also perform voice cloning, but with speech as the
input instead of text.

```{important}
Voice cloning raises several ethical concerns and must not be used to
impersonate individuals without their consent, deceive others, or spread
misinformation ([deepfakes](https://en.wikipedia.org/wiki/Deepfake)). We
strongly encourage you to respect the privacy and rights of individuals, and to
always ensure that your use of Coqui TTS is transparent, responsible, and
respectful of others.
```

## Usage

```{versionchanged} 0.27.0
Coqui can now cache cloned voices for easy reuse. Implementation details of this
may change in future versions.
```

Reference audio for voice cloning is passed via the `speaker_wav` argument,
which may be a single file or a list of files. If a custom speaker ID is also
passed in `speaker`, the resulting voice will be cached in `voice_dir`. If that
voice already exists, it is overwritten. Subsequent calls can then use that
`speaker` without having to provide reference audio again.

`voice_dir` defaults to a subfolder `voices/` in the folder of the model
checkpoint. For models used by name, e.g.
`tts_models/multilingual/multi-dataset/xtts_v2`, this would be
`~/.local/share/tts/tts_models--multilingual--multi-dataset--xtts_v2/voices/`
(on Linux, see the [FAQ](faq.md#where-does-coqui-store-downloaded-models) for
default model locations on other platforms).

### Python API

```python
import torch
from TTS.api import TTS

# Get device
device = "cuda" if torch.cuda.is_available() else "cpu"

# Initialize a TTS model with voice cloning support
api = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to(device)

# 1. Clone the voice from `speaker_wav` and cache it under a custom speaker ID
api.tts_to_file(
  text="Hello world",
  speaker_wav=["my/cloning/audio.wav", "my/cloning/audio2.wav"],
  speaker="MySpeaker1",
  language="en",
)

# 2. The voice can now be reused without providing reference audio
api.tts_to_file(
  text="Hello world",
  speaker="MySpeaker1",
  language="en",
)
```

### Command-line interface

```bash
# 1. Clone the voice from `speaker_wav` and cache it under a custom speaker ID
tts --model_name "tts_models/multilingual/multi-dataset/xtts_v2" \
    --text "Hello world" \
    --language_idx "en" \
    --speaker_wav "my/cloning/audio.wav" "my/cloning/audio2.wav" \
    --speaker_idx "MySpeaker1"

# 2. The voice can now be reused without providing reference audio
tts --model_name "tts_models/multilingual/multi-dataset/xtts_v2" \
    --text "Hello world" \
    --language_idx "en" \
    --speaker_idx "MySpeaker1"
```

### Metadata

Some metadata is stored within the voice files to check what models they are
compatible with (Coqui does not enforce any checks yet). Let's create a voice
with the LJSpeech files stored in the repository:

```bash
tts --model_name "voice_conversion_models/multilingual/multi-dataset/knnvc" \
    --source_wav source.wav \
    --target_wav tests/data/ljspeech/wavs/*.wav \
    --speaker_idx LJ \
    --voice_dir wavlm-voices
```

We can then print the metadata:

```python
import torch
voice = torch.load("wavlm-voices/LJ.pth", map_location="cpu")
print(voice["metadata"])
```

```python
{
  'model': {'name': 'wavlm', 'layer': 6},
  'speaker_id': 'LJ',
  'source_files': [
    'tests/data/ljspeech/wavs/LJ001-0001.wav',
    'tests/data/ljspeech/wavs/LJ001-0002.wav',
    ...,
  ],
  'created_at': '2025-06-25T12:17+00:00',
  'coqui_version': '0.27.0',
}
```

## For developers

To add voice cloning support to a model, it needs to inherit from
{py:class}`~TTS.utils.voices.CloningMixin` (in addition to
{py:class}`~TTS.tts.models.base_tts.BaseTTS` or
{py:class}`~TTS.vc.models.base_vc.BaseVC`). You then only have to implement a
model-specific `_clone_voice()` method that returns speaker embeddings and
model-specific metadata. For example, for [XTTS](models/xtts.md):

```{eval-rst}
.. literalinclude:: ../../TTS/tts/models/xtts.py
    :pyobject: Xtts._clone_voice
```

Then in your model's `synthesize()` method you use the mixin's
{py:func}`~TTS.utils.voices.CloningMixin.clone_voice` to access the
voice data, while all caching is handled automatically.

### API doc

#### CloningMixin

```{eval-rst}
.. autoclass:: TTS.utils.voices.CloningMixin
    :members:
```

#### VoiceMetadata

```{eval-rst}
.. autoclass:: TTS.utils.voices.VoiceMetadata
    :members:
```