Enriching Source Style Transfer in Recognition-Synthesis based Non-Parallel Voice Conversion

Abstract: Current voice conversion (VC) methods can successfully convert the timbre of the audio, but they remain limited in modeling the prosodic style of the source audio, which makes them difficult to apply in some scenarios. This study proposes a source style transfer framework built on the recognition-synthesis approach, using bottleneck features and the mel spectrum as input. We address these limitations with hybrid explicit and implicit modeling, comprising three methods. First, as the explicit prosody modeling module, we train our model conditioned on prosodic features to improve its stability and its ability to represent and control prosody. Second, to solve the problem of timbre leakage in prosody and to extract a comprehensive prosody representation, we design an implicit prosody modeling module that uses an adversarial training strategy with the mel spectrum and bottleneck features as input. Third, we use a modified self-attention based encoder to extract sentential context and prosodic information. Experiments show that our approach is superior to the baseline. The proposed system is effective in improving source style transfer: it models prosody representations and controls synthesized prosody better, while audio quality and speaker similarity are well maintained.

System Description

CS: comparison system, which may cause speaker leakage and instability

BL: baseline system using lf0, vuv, and energy

P1: adds a VAE with an auxiliary speaker classifier to the BL system

P2: replaces the CBHG encoder in P1 with the SA-WA encoder

P3 (proposed): the final proposed system, with the SA-WA encoder and prosody module
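
The explicit prosodic features named in the BL system (lf0, vuv, and energy) can be illustrated with a minimal sketch. This is not the authors' implementation: the helper name `prosody_features`, the convention that F0 is 0 on unvoiced frames, and the use of the per-frame L2 norm of the mel spectrum as energy are all assumptions made for illustration.

```python
import numpy as np

def prosody_features(f0_hz, mel):
    """Frame-level [lf0, vuv, energy] features (hypothetical helper).

    f0_hz: shape (T,), F0 in Hz, with 0 marking unvoiced frames (assumed convention).
    mel:   shape (T, n_mels), linear-magnitude mel spectrogram (assumed input).
    """
    f0_hz = np.asarray(f0_hz, dtype=np.float64)
    mel = np.asarray(mel, dtype=np.float64)

    voiced = f0_hz > 0
    vuv = voiced.astype(np.float64)          # 1 for voiced frames, 0 for unvoiced

    lf0 = np.zeros_like(f0_hz)
    lf0[voiced] = np.log(f0_hz[voiced])      # log-F0 on voiced frames only

    energy = np.linalg.norm(mel, axis=1)     # assumed energy definition: L2 norm per frame

    return np.stack([lf0, vuv, energy], axis=1)  # shape (T, 3)

# Toy input: one unvoiced frame followed by two voiced frames.
f0 = np.array([0.0, 100.0, 200.0])
mel = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])
feats = prosody_features(f0, mel)
```

Stacking the three features into one (T, 3) array keeps them frame-aligned, so they could be concatenated with other frame-level inputs such as bottleneck features.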


1. Example speech of the target speaker:

speaker samples
db1

2. Results of source style transfer on the TTS test set:

Test Set Testing scenarios Source Audio Synthesized Speech
CS BL P1 P2 P3(proposed)
1 male-news
2 male-news
3 male-novel
4 male-poetry
5 female-emotion-anger
6 female-emotion-anger
7 female-emotion-disgust
8 female-emotion-disgust
9 female-emotion-happy
10 female-emotion-happy
11 female-emotion-sad
12 female-emotion-sad
13 female-emotion-surprise
14 female-emotion-surprise
15 sing

3. Results of source style transfer on the dubbing test set:

Test Set Testing scenarios Source Audio Synthesized Speech
CS BL P1 P2 P3(proposed)
1 film and television dubbing
2 film and television dubbing
3 film and television dubbing
4 film and television dubbing
5 film and television dubbing
6 film and television dubbing
7 film and television dubbing
8 film and television dubbing
9 film and television dubbing
10 film and television dubbing

4. The results of lf0 control:

Source Audio 0.5 1 1.5

5. The results of energy control:

Source Audio 0.5 1 1.5
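
The lf0 and energy control experiments above can be sketched as a scaling of the conditioning features before synthesis. The page does not say how the values 0.5, 1, and 1.5 are applied, so this sketch assumes they are multiplicative factors: for pitch, the factor multiplies F0 in Hz (equivalent to adding log(factor) to lf0 on voiced frames), and for energy it multiplies the energy contour directly. Both function names are hypothetical.

```python
import numpy as np

def scale_f0(lf0, vuv, factor):
    """Scale F0 by `factor` on voiced frames, working in the log domain.

    Multiplying F0 (Hz) by `factor` equals adding log(factor) to lf0.
    Unvoiced frames (vuv == 0) are left unchanged.
    """
    lf0 = np.asarray(lf0, dtype=np.float64)
    vuv = np.asarray(vuv, dtype=np.float64)
    return lf0 + np.log(factor) * vuv

def scale_energy(energy, factor):
    """Scale the frame-level energy contour by `factor`."""
    return np.asarray(energy, dtype=np.float64) * factor

# Toy contours: one unvoiced frame, then 100 Hz and 200 Hz voiced frames.
lf0 = np.array([0.0, np.log(100.0), np.log(200.0)])
vuv = np.array([0.0, 1.0, 1.0])
energy = np.array([0.0, 5.0, 10.0])

lf0_low = scale_f0(lf0, vuv, 0.5)      # voiced frames become 50 Hz and 100 Hz
lf0_same = scale_f0(lf0, vuv, 1.0)     # factor 1 leaves the contour unchanged
energy_high = scale_energy(energy, 1.5)
```

A factor of 1 reproduces the source prosody, which matches the role of the middle column in the two control tables.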