One-shot Voice Conversion For Style Transfer Based On Speaker Adaptation

Zhichao Wang1, Qicong Xie1, Tao Li1, Hongqiang Du1, Lei Xie1, Pengcheng Zhu2, Mengxiao Bi2
1 Audio, Speech and Language Processing Group (ASLP@NPU), School of Computer Science, Northwestern Polytechnical University, Xi'an, China
2 Fuxi AI Lab, NetEase Inc., Hangzhou, China

1. Abstract

One-shot style transfer is a challenging task: training on a single utterance makes the model extremely prone to over-fitting the training data, which leads to low speaker similarity and a lack of expressiveness. In this paper, we build on the recognition-synthesis framework and propose a one-shot voice conversion (VC) approach for style transfer based on speaker adaptation. First, a speaker normalization module is adopted to remove speaker-related information from the bottleneck features extracted by an automatic speech recognition (ASR) model. Second, we adopt weight regularization in the adaptation process to prevent the over-fitting caused by using only one utterance from the target speaker as training data. Finally, to comprehensively decouple the speech factors (content, speaker, and style) and to transfer the source style to the target, a prosody module is used to extract a prosody representation. Experiments show that our approach is superior to state-of-the-art one-shot VC systems in terms of style and speaker similarity, while also maintaining good speech quality.
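The weight-regularization idea in the abstract can be illustrated with a toy sketch. This page does not state the exact form of the regularizer, so the version below assumes an L2 penalty that pulls the adapted weights back toward their pretrained values. The function names `adapt` and `drift`, the linear model, and the parameter `lam` are hypothetical and chosen only for illustration; this is not the authors' implementation.

```python
def adapt(w0, x, y, lam, lr=0.05, steps=500):
    """Fine-tune linear weights on ONE example (x, y).

    Assumed objective (not taken from the paper):
        (w . x - y)^2 + lam * ||w - w0||^2
    where w0 are the pretrained weights.
    """
    w = list(w0)
    for _ in range(steps):
        pred = sum(wi * xi for wi, xi in zip(w, x))
        err = pred - y
        # Gradient of the data term plus the pull-back toward w0.
        grad = [2 * err * xi + 2 * lam * (wi - w0i)
                for wi, xi, w0i in zip(w, x, w0)]
        w = [wi - lr * gi for wi, gi in zip(w, grad)]
    return w


def drift(w, w0):
    """Euclidean distance between adapted and pretrained weights."""
    return sum((a - b) ** 2 for a, b in zip(w, w0)) ** 0.5


pretrained = [1.0, -0.5, 0.25]
x, y = [1.0, 2.0, 3.0], 4.0   # the single adaptation example

free = adapt(pretrained, x, y, lam=0.0)         # no regularization
regularized = adapt(pretrained, x, y, lam=1.0)  # penalized adaptation
```

Without the penalty the weights move as far as needed to fit the single example exactly. With the penalty they stay closer to the pretrained solution, at the cost of a small residual error on that example, which is the trade-off that limits over-fitting in this toy setting.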



2. System Description

  • Comparison Systems
  • Ablation Systems
  • Speaker & Reference Audio

3. Demos -- Comparison Analysis

3.1 CMU testset results

Target Speaker | Source speech | Method: AGAINVC, VQMIVC, GSE, GSE-finetune, P3 (proposed)
p340
p363
bdl
slt

3.2 ESD testset results (style transfer)

  • Target speaker: slt
  • Target audio:
Emotion | Source speech | Method: AGAINVC, VQMIVC, GSE, GSE-finetune, P3 (proposed)
Happy
Angry
Sad
Surprise

4. Demos -- Ablation Analysis

Target Speaker | Source speech | Method: BL, P1, P2, P3 (proposed)
bdl
slt
p340
p363

5. Demos & Visualization -- Varying Duration

Target Speaker | Source Audio | 1s | 3s | 6s | 9s | 15s

6. Demos & Visualization -- Over-fitting on Spectrograms

  • Target speaker:
Source Audio | BL | P1 | P2 | P3