LM-VC: Zero-shot Voice Conversion via Speech Generation based on Language Models

Zhichao Wang1, Yuanzhe Chen2, Lei Xie1, Qiao Tian2, Yuping Wang2
1 Audio, Speech and Language Processing Group (ASLP@NPU), School of Computer Science, Northwestern Polytechnical University, Xi'an, China
2 Speech, Audio, and Music Intelligence (SAMI) Group, ByteDance Inc., Shanghai, China

1. Abstract

Language model (LM) based audio generation frameworks, e.g., AudioLM, have recently achieved new state-of-the-art performance in zero-shot audio generation. In this paper, we explore the feasibility of LMs for zero-shot voice conversion. An intuitive approach is to follow AudioLM: tokenize speech into semantic and acoustic tokens with HuBERT and SoundStream, respectively, and convert the source semantic tokens to target acoustic tokens conditioned on acoustic tokens of the target speaker. However, such an approach encounters several issues: 1) the linguistic content carried by the semantic tokens may become dispersed during multi-layer modeling, and the lengthy speech input in the voice conversion task makes contextual learning even harder; 2) the semantic tokens still contain speaker-related information, which may leak into the target speech and lower the target speaker similarity; 3) the generation diversity introduced by sampling in the LM can lead to unexpected outcomes during inference, causing unnatural pronunciation and degraded speech quality. To mitigate these problems, we propose LM-VC, a two-stage language modeling approach that first generates coarse acoustic tokens to recover the source linguistic content and the target speaker's timbre, and then reconstructs the fine acoustic tokens that provide the acoustic details of the converted speech. Specifically, to enhance content preservation and facilitate better disentanglement, a masked prefix LM with a mask prediction strategy is used for coarse acoustic modeling. This model is encouraged to recover the masked content from the surrounding context and to generate the target speech based on the target speaker's utterance and the corrupted semantic tokens. Besides, to further alleviate sampling errors during generation, an external LM, which employs window attention to capture local acoustic relations, participates in the coarse acoustic modeling through shallow fusion. Finally, a prefix LM reconstructs the fine acoustic tokens from the coarse ones, yielding the converted speech. Experiments demonstrate that LM-VC outperforms competitive systems in speech naturalness and speaker similarity.
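
To make the shallow-fusion step described above more concrete, below is a minimal sketch (not the authors' released code) of how an external window-attention LM could be fused with the masked prefix LM during coarse acoustic token sampling. The module interfaces (`mplm`, `elm`, `elm.window_size`), the fusion weight `fusion_lambda`, and the top-k sampling settings are illustrative assumptions.

```python
import torch
import torch.nn.functional as F


def sample_coarse_tokens(mplm, elm, semantic_tokens, prompt_acoustic,
                         max_len=1500, fusion_lambda=0.3, top_k=50):
    """Autoregressively sample coarse (first-codebook) acoustic tokens.

    mplm: masked prefix LM; maps (semantic tokens, acoustic tokens so far)
          to next-token logits over the acoustic codebook.
    elm:  external LM with window attention; sees only a short window of
          recent acoustic tokens and returns logits of the same size.
    """
    generated = prompt_acoustic.clone()  # start from the target speaker prompt
    for _ in range(max_len):
        # Global context: semantic content, speaker prompt, and tokens so far.
        logits_main = mplm(semantic_tokens, generated)            # shape: (vocab,)
        # Local context only: the external LM models local acoustic relations.
        logits_local = elm(generated[-elm.window_size:])          # shape: (vocab,)

        # Shallow fusion: combine the two distributions in log-probability space.
        log_probs = (F.log_softmax(logits_main, dim=-1)
                     + fusion_lambda * F.log_softmax(logits_local, dim=-1))

        # Top-k sampling keeps generation diversity while discarding unlikely tokens.
        topk_vals, topk_idx = log_probs.topk(top_k)
        next_token = topk_idx[torch.multinomial(topk_vals.softmax(dim=-1), 1)]

        generated = torch.cat([generated, next_token])

    return generated[prompt_acoustic.numel():]  # new tokens only, prompt dropped
```

In this reading, `fusion_lambda` trades off the external LM's preference for locally coherent acoustic continuations against the main LM's global conditioning on content and speaker prompt, which is how shallow fusion can suppress occasional sampling errors.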



2. System Description

Comparison systems: YourTTS [1], AudioLM-VC [2] (the AudioLM framework adapted to voice conversion as described above), and two ablations of LM-VC (w/o MPLM and w/o ELM).

3. Demos

Converted audio samples from the comparison and ablation systems on the zero-shot VC task.
Each example includes the target speaker prompt, the source speech, the outputs of the zero-shot VC methods (YourTTS, AudioLM-VC, LM-VC (Proposed)), and the ablation systems (w/o MPLM, w/o ELM).

4. Speakers from the Internet

Zero-shot VC demos for celebrities and characters from the game "Genshin Impact", collected from the Internet.
Each example includes the source speech and the target speaker prompt; target speakers are Tom Hiddleston, Emma Watson, Ningguang, Zhongli, Kaeya, Kamisato Ayato, and Sangonomiya Kokomi.

5. LibriLight Results

We also provide demos of LM-VC trained on a larger dataset, LibriLight (60K hours).
Speakers from the test set: each example includes the target speaker prompt, the source speech, and the LM-VC (Proposed) output.
Speakers from the Internet: each example includes the source speech and the target speaker prompt; target speakers are Barack Obama, Emma Watson, and Sangonomiya Kokomi.

References

  1. E. Casanova, J. Weber, C. D. Shulby, A. C. Junior, E. Gölge, and M. A. Ponti, “YourTTS: Towards zero-shot multi-speaker TTS and zero-shot voice conversion for everyone,” in International Conference on Machine Learning (ICML), 2022, pp. 2709–2720.
  2. Z. Borsos, R. Marinier, D. Vincent, E. Kharitonov, O. Pietquin, M. Sharifi, O. Teboul, D. Grangier, M. Tagliasacchi, and N. Zeghidour, “AudioLM: A language modeling approach to audio generation,” arXiv preprint, 2022.