Voice conversion
Overview
Voice conversion (VC) converts the voice in a speech signal from one speaker to that of another speaker while preserving the linguistic content. Coqui supports voice conversion on its own as well as applying it after speech synthesis, which enables multi-speaker output from single-speaker TTS models.
Python API
Converting the voice in source_wav to the voice of target_wav (the latter
can also be a list of files):
from TTS.api import TTS

tts = TTS("voice_conversion_models/multilingual/vctk/freevc24").to("cuda")
tts.voice_conversion_to_file(
    source_wav="my/source.wav",
    target_wav="my/target.wav",
    file_path="output.wav",
)
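Because target_wav can be a list, one convenient pattern is to gather every reference clip for a speaker from a single directory. The following is a minimal sketch under stated assumptions, not part of the Coqui API: `collect_target_wavs` and `convert_with_references` are hypothetical helper names, and the one-directory-per-speaker layout is assumed. The TTS import is placed inside the function so that the file-collection helper can be used without loading a model.

```python
from pathlib import Path


def collect_target_wavs(directory):
    """Return all .wav files directly inside `directory`, sorted by name."""
    wavs = sorted(str(p) for p in Path(directory).glob("*.wav"))
    if not wavs:
        raise FileNotFoundError(f"no .wav files found in {directory}")
    return wavs


def convert_with_references(
    source_wav,
    target_dir,
    file_path,
    model_name="voice_conversion_models/multilingual/vctk/freevc24",
):
    """Convert `source_wav` to the voice heard in the clips under `target_dir`."""
    # Imported lazily so that collecting files does not require loading TTS.
    from TTS.api import TTS

    tts = TTS(model_name)
    tts.voice_conversion_to_file(
        source_wav=source_wav,
        target_wav=collect_target_wavs(target_dir),  # list of reference files
        file_path=file_path,
    )
    return file_path
```

A call such as `convert_with_references("my/source.wav", "my/targets", "output.wav")` would then use every .wav file in `my/targets` as reference audio.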
You can cache the target voice by passing a custom ID in the speaker argument and
reusing that ID later instead of passing the reference audio again.
from TTS.api import TTS

tts = TTS("voice_conversion_models/multilingual/multi-dataset/openvoice_v2").to("cuda")

# First call with target audio
tts.voice_conversion_to_file(
    source_wav="my/source.wav",
    target_wav="my/target.wav",
    speaker="MySpeaker",
    file_path="output.wav",
)

# Later calls don't need target audio
tts.voice_conversion_to_file(
    source_wav="my/source.wav",
    speaker="MySpeaker",
    file_path="output.wav",
)
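The caching pattern is most useful when many source files are converted to the same target voice. The sketch below plans such a batch so that only the first call carries the reference audio and all later calls reuse the speaker ID. `plan_batch` and `run_batch` are hypothetical helpers, and the output file naming scheme is an assumption; the only library call used is `voice_conversion_to_file` with the arguments shown above.

```python
from pathlib import Path


def plan_batch(source_wavs, target_wav, speaker, out_dir):
    """Return one kwargs dict per source file for voice_conversion_to_file."""
    calls = []
    for i, source in enumerate(source_wavs):
        kwargs = {
            "source_wav": str(source),
            "speaker": speaker,
            # Hypothetical naming scheme: <source stem>_<speaker>.wav
            "file_path": str(Path(out_dir) / f"{Path(source).stem}_{speaker}.wav"),
        }
        if i == 0:
            # Only the first call passes reference audio; it registers the speaker ID.
            kwargs["target_wav"] = target_wav
        calls.append(kwargs)
    return calls


def run_batch(tts, calls):
    """Run every planned call on an already loaded TTS object."""
    for kwargs in calls:
        tts.voice_conversion_to_file(**kwargs)
```

With a model loaded as in the example above, `run_batch(tts, plan_batch(["a.wav", "b.wav"], "my/target.wav", "MySpeaker", "out"))` would convert both files while passing the reference audio only once.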
Voice cloning is also possible by combining TTS and VC: speech is synthesized first, and the FreeVC model then converts it to the target voice.
from TTS.api import TTS

tts = TTS("tts_models/de/thorsten/tacotron2-DDC")
tts.tts_with_vc_to_file(
    "Wie sage ich auf Italienisch, dass ich dich liebe?",
    speaker_wav=["target1.wav", "target2.wav"],
    file_path="output.wav",
)
Some models, including XTTS, support voice cloning directly, so a separate voice conversion step is not necessary:
from TTS.api import TTS

tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to("cuda")
wav = tts.tts(
    text="Hello world!",
    speaker_wav="my/cloning/audio.wav",
    language="en",
)
CLI
tts --model_name "voice_conversion_models/multilingual/multi-dataset/knnvc" \
    --source_wav "source.wav" \
    --target_wav "target1.wav" "target2.wav" \
    --out_path "output.wav"
You can also cache the target speaker by assigning a custom ID in
--speaker_idx, so that target audio is not required for subsequent calls.
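When the CLI is driven from a script, the two kinds of call (with and without target audio) can be assembled from the same flags. The sketch below builds the argument list in Python and uses only the flags named in this section; `build_vc_command` and `run_vc` are hypothetical helpers, and the placement of --speaker_idx within the command is an assumption.

```python
import subprocess


def build_vc_command(model_name, source_wav, out_path, target_wavs=None, speaker_idx=None):
    """Assemble a `tts` voice conversion command as an argument list."""
    cmd = ["tts", "--model_name", model_name, "--source_wav", source_wav]
    if target_wavs:
        # One or more reference files follow a single --target_wav flag.
        cmd += ["--target_wav", *target_wavs]
    if speaker_idx is not None:
        # Custom ID under which the target speaker is cached or looked up.
        cmd += ["--speaker_idx", speaker_idx]
    cmd += ["--out_path", out_path]
    return cmd


def run_vc(cmd):
    """Run the command and raise CalledProcessError if `tts` exits with an error."""
    subprocess.run(cmd, check=True)


MODEL = "voice_conversion_models/multilingual/multi-dataset/knnvc"

# First call: reference audio plus a custom speaker ID.
first = build_vc_command(
    MODEL, "source.wav", "output.wav",
    target_wavs=["target1.wav", "target2.wav"], speaker_idx="MySpeaker",
)
# Later call: the speaker ID alone, without target audio.
later = build_vc_command(MODEL, "source2.wav", "output2.wav", speaker_idx="MySpeaker")
```

Passing `first` and then `later` to `run_vc` would execute the two commands in order. Building the list separately from running it keeps the commands easy to inspect or log before anything is executed.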
Pretrained models
Coqui includes the following pretrained voice conversion models. Training is not supported.
FreeVC
voice_conversion_models/multilingual/vctk/freevc24
Adapted from: https://github.com/OlaWod/FreeVC
kNN-VC
voice_conversion_models/multilingual/multi-dataset/knnvc
At least 1-5 minutes of target speaker data are recommended.
Adapted from: https://github.com/bshall/knn-vc
OpenVoice
voice_conversion_models/multilingual/multi-dataset/openvoice_v1
voice_conversion_models/multilingual/multi-dataset/openvoice_v2
Adapted from: https://github.com/myshell-ai/OpenVoice