# Voice conversion

## Overview

Voice conversion (VC) converts the voice in a speech signal from one speaker to
that of another speaker while preserving the linguistic content. Coqui supports
voice conversion on its own, as well as applying it after speech synthesis to
enable multi-speaker output with single-speaker TTS models.

### Python API

Converting the voice in `source_wav` to the voice of `target_wav` (the latter
can also be a list of files):

```python
from TTS.api import TTS

tts = TTS("voice_conversion_models/multilingual/vctk/freevc24").to("cuda")
tts.voice_conversion_to_file(
    source_wav="my/source.wav",
    target_wav="my/target.wav",
    file_path="output.wav"
)
```

You can cache the target voice by setting a custom ID in `speaker` and reusing
that ID later instead of passing the reference audio again.

```python
from TTS.api import TTS

tts = TTS("voice_conversion_models/multilingual/multi-dataset/openvoice_v2").to("cuda")

# First call with target audio
tts.voice_conversion_to_file(
    source_wav="my/source.wav",
    target_wav="my/target.wav",
    speaker="MySpeaker",
    file_path="output.wav"
)

# Later calls don't need target audio
tts.voice_conversion_to_file(
    source_wav="my/source.wav",
    speaker="MySpeaker",
    file_path="output.wav"
)
```

Voice cloning is also possible by combining TTS and VC. The FreeVC model is
used for voice conversion after synthesizing speech.

```python
from TTS.api import TTS

tts = TTS("tts_models/de/thorsten/tacotron2-DDC")
tts.tts_with_vc_to_file(
    "Wie sage ich auf Italienisch, dass ich dich liebe?",
    speaker_wav=["target1.wav", "target2.wav"],
    file_path="output.wav"
)
```

Some models, including [XTTS](models/xtts.md), support voice cloning directly,
so a separate voice conversion step is not necessary:

```python
from TTS.api import TTS

tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to("cuda")
wav = tts.tts(
    text="Hello world!",
    speaker_wav="my/cloning/audio.wav",
    language="en"
)
```

### CLI

```sh
tts --model_name "voice_conversion_models/multilingual/multi-dataset/knnvc" \
    --source_wav "source.wav" \
    --target_wav "target1.wav" "target2.wav" \
    --out_path "output.wav"
```

You can also cache the target speaker by assigning a custom ID in
`--speaker_idx`, so that target audio is not required for subsequent calls.

## Pretrained models

Coqui includes the following pretrained voice conversion models. Training is
not supported.

### FreeVC

- `voice_conversion_models/multilingual/vctk/freevc24`

Adapted from: https://github.com/OlaWod/FreeVC

### kNN-VC

- `voice_conversion_models/multilingual/multi-dataset/knnvc`

At least 1-5 minutes of target speaker data are recommended.

Adapted from: https://github.com/bshall/knn-vc

### OpenVoice

- `voice_conversion_models/multilingual/multi-dataset/openvoice_v1`
- `voice_conversion_models/multilingual/multi-dataset/openvoice_v2`

Adapted from: https://github.com/myshell-ai/OpenVoice
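
Any of the models listed above can be used through the Python API to convert several recordings at once. The following is a minimal sketch rather than part of the library: `convert_directory` is a hypothetical helper, and `tts` is assumed to be a `TTS.api.TTS` instance that was loaded with one of the voice conversion models. It only relies on the `voice_conversion_to_file` call with the `source_wav`, `target_wav` and `file_path` arguments described in the Python API section.

```python
from pathlib import Path


def convert_directory(tts, source_dir, target_wav, out_dir):
    """Convert every .wav file in source_dir to the target voice.

    Hypothetical helper: `tts` is assumed to be a TTS.api.TTS instance
    loaded with a voice conversion model. `target_wav` is a path or a
    list of paths to reference audio of the target speaker.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    written = []
    # Sort the inputs so files are processed in a deterministic order.
    for source in sorted(Path(source_dir).glob("*.wav")):
        # Keep the original file name, but write into out_dir.
        out_path = out_dir / source.name
        tts.voice_conversion_to_file(
            source_wav=str(source),
            target_wav=target_wav,
            file_path=str(out_path),
        )
        written.append(out_path)
    return written
```

For example, with `tts = TTS("voice_conversion_models/multilingual/vctk/freevc24")`, calling `convert_directory(tts, "my/sources", ["target1.wav", "target2.wav"], "my/converted")` would write one converted file per source recording into `my/converted`.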