Enriching Source Style Transfer in Recognition-Synthesis based Non-Parallel Voice Conversion

Abstract: Current voice conversion (VC) methods can successfully convert the timbre of the audio, but they remain limited in modeling the prosodic style of the source audio, which makes them difficult to apply in some scenarios. This study proposes a source style transfer framework built on the recognition-synthesis framework, using bottleneck features and the mel spectrum as input. We address these limitations with hybrid explicit and implicit modeling, comprising three methods. First, as the explicit prosody modeling module, we train our model conditioned on prosodic features to improve its stability and its ability to represent and control prosody. Second, to solve the problem of timbre leakage in prosody and to extract a comprehensive prosody representation, we design an implicit prosody modeling module that uses an adversarial training strategy with the mel spectrum and bottleneck features as input. Third, we use a modified self-attention based encoder to extract sentential context and prosodic information. Experiments show that our approach is superior to the baseline. The proposed system improves source style transfer performance: it models prosody representations and controls synthesized prosody more effectively, while maintaining audio quality and speaker similarity.

System Description

CS: comparison system, which may cause speaker leakage and instability

BL: baseline system using lf0, vuv and energy

P1: adds a VAE with an auxiliary speaker classifier to the BL system

P2: replaces the CBHG encoder in P1 with the SA-WA encoder

P3(proposed): our final proposed system, with the SA-WA encoder and the prosody module
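The BL system is conditioned on lf0, vuv and energy. As a rough illustration of what these three frame-level features are, the sketch below derives them from an f0 contour and framed waveform samples. It is not the authors' feature extractor: the function name `explicit_prosody_features`, the convention that f0 = 0 marks an unvoiced frame, the 0.0 placeholder for lf0 on unvoiced frames, and the use of RMS as the energy measure are all assumptions made for this example.

```python
import math

def explicit_prosody_features(f0_hz, frames):
    """Return per-frame (lf0, vuv, energy) lists.

    f0_hz  : per-frame f0 in Hz; 0.0 marks an unvoiced frame (assumed convention)
    frames : per-frame lists of waveform samples
    """
    lf0, vuv, energy = [], [], []
    for f0, frame in zip(f0_hz, frames):
        voiced = f0 > 0.0
        # vuv: voiced/unvoiced flag
        vuv.append(1.0 if voiced else 0.0)
        # lf0: log-f0, defined only on voiced frames (0.0 placeholder otherwise)
        lf0.append(math.log(f0) if voiced else 0.0)
        # energy: root-mean-square of the frame samples (assumed definition)
        rms = math.sqrt(sum(s * s for s in frame) / len(frame)) if frame else 0.0
        energy.append(rms)
    return lf0, vuv, energy

# Toy input: one unvoiced frame followed by two voiced frames.
f0 = [0.0, 100.0, 200.0]
frames = [[0.0, 0.0], [0.5, -0.5], [1.0, -1.0]]
lf0, vuv, energy = explicit_prosody_features(f0, frames)
```

In a real pipeline the f0 contour would come from a pitch tracker and the features would typically be normalized before being fed to the model; both steps are omitted here.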

1. Example speech from the target speaker:

speaker	samples
db1

2. The results of source style transfer on the TTS test set:

Test Set	Testing scenarios	Source Audio	Synthesized Speech
Test Set	Testing scenarios	Source Audio	CS	BL	P1	P2	P3(proposed)
1	male-news
2	male-news
3	male-novel
4	male-poetry
5	female-emotion-anger
6	female-emotion-anger
7	female-emotion-disgust
8	female-emotion-disgust
9	female-emotion-happy
10	female-emotion-happy
11	female-emotion-sad
12	female-emotion-sad
13	female-emotion-surprise
14	female-emotion-surprise
15	sing

3. The results of source style transfer on the dubbing test set:

Test Set	Testing scenarios	Source Audio	Synthesized Speech
Test Set	Testing scenarios	Source Audio	CS	BL	P1	P2	P3(proposed)
1	film and television dubbing
2	film and television dubbing
3	film and television dubbing
4	film and television dubbing
5	film and television dubbing
6	film and television dubbing
7	film and television dubbing
8	film and television dubbing
9	film and television dubbing
10	film and television dubbing

4. The results of lf0 control:

Source Audio	0.5	1	1.5

5. The results of energy control:

Source Audio	0.5	1	1.5
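
The column headers 0.5, 1 and 1.5 in sections 4 and 5 read as scale factors applied to the explicit prosody features before synthesis. The page does not state how the factor is applied, so the following is only a sketch under assumptions: for pitch, the factor is assumed to multiply f0 in Hz (equivalently, to add log(factor) to lf0 on voiced frames), and for energy it is assumed to multiply the per-frame value directly. The helper names `scale_lf0` and `scale_energy` are hypothetical.

```python
import math

def scale_lf0(lf0, vuv, factor):
    """Scale f0 by `factor` on voiced frames, working in the log domain.

    Assumption: the factor multiplies f0 in Hz, so lf0 shifts by log(factor).
    Unvoiced frames (vuv == 0) are left untouched.
    """
    if factor <= 0:
        raise ValueError("factor must be positive")
    shift = math.log(factor)
    return [v + shift if flag else v for v, flag in zip(lf0, vuv)]

def scale_energy(energy, factor):
    """Scale per-frame energy by `factor` (assumed to be a plain multiplier)."""
    return [e * factor for e in energy]

# Toy contour: one unvoiced frame, then a voiced frame at 200 Hz.
lf0 = [0.0, math.log(200.0)]
vuv = [0.0, 1.0]
energy = [0.0, 0.4]

lf0_half = scale_lf0(lf0, vuv, 0.5)       # voiced frame moves to about 100 Hz
lf0_same = scale_lf0(lf0, vuv, 1.0)       # factor 1 leaves the contour unchanged
energy_up = scale_energy(energy, 1.5)     # energy raised by 50%
```

With a factor of 1 both features pass through unchanged, which matches the middle column serving as the unmodified conversion.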