Enriching Source Style Transfer in Recognition-Synthesis based Non-Parallel Voice Conversion
Abstract: Current voice conversion (VC) methods can successfully convert the timbre of the audio, but their modeling of the source audio's prosodic style is still limited, which makes VC difficult to apply in some scenarios. This study proposes a source style transfer method built on a recognition-synthesis framework that takes bottleneck features and the mel spectrum as input. We address these limitations with hybrid explicit and implicit modeling, comprising three methods. First, as the explicit prosody modeling module, we train the model conditioned on prosodic features to improve its stability and its ability to represent and control prosody. Second, to solve the problem of timbre leakage in prosody and to extract a comprehensive prosody representation, we design an implicit prosody modeling module that uses an adversarial training strategy, with the mel spectrum and bottleneck features as input. Third, we use a modified self-attention based encoder to extract sentential context and prosodic information. Experiments show that our approach outperforms the baseline: the proposed system is better at modeling prosody representations and controlling synthesized prosody, which improves source style transfer, while audio quality and speaker similarity are well maintained.
System Description
CS: comparison system, which may cause speaker leakage and instability
BL: baseline system using lf0, vuv and energy
P1: adds a VAE with an auxiliary speaker classifier to the BL system
P2: replaces the CBHG encoder in P1 with the SA-WA encoder
P3 (proposed): the final proposed system, with the SA-WA encoder and the prosody module
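The explicit prosodic features named for the BL system (lf0, vuv and energy) can be illustrated with a minimal sketch. This page does not say how the features are computed, so everything below is an assumption: the F0 contour is taken as given (for example from an external pitch tracker), lf0 is log-F0 linearly interpolated over unvoiced frames, vuv is a binary voiced/unvoiced flag, and energy is frame-level log RMS. The function names `frame_signal` and `prosody_features`, the frame length and the hop size are hypothetical.

```python
import numpy as np

def frame_signal(wav, frame_len=400, hop=160):
    """Slice a 1-D waveform into overlapping frames (no padding)."""
    n_frames = 1 + (len(wav) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return wav[idx]

def prosody_features(wav, f0, frame_len=400, hop=160, eps=1e-8):
    """Return an (n_frames, 3) array with columns [lf0, vuv, energy].

    Assumed definitions (not taken from the paper):
      lf0    : log F0, linearly interpolated across unvoiced frames
      vuv    : 1.0 where F0 > 0 (voiced), else 0.0
      energy : log of the per-frame RMS amplitude
    """
    frames = frame_signal(np.asarray(wav, dtype=float), frame_len, hop)
    f0 = np.asarray(f0, dtype=float)
    n = min(len(frames), len(f0))          # align the two frame sequences
    frames, f0 = frames[:n], f0[:n]

    voiced = f0 > 0
    vuv = voiced.astype(float)

    lf0 = np.zeros(n)
    if voiced.any():
        # Fill unvoiced gaps so the contour is continuous.
        lf0 = np.interp(np.arange(n), np.flatnonzero(voiced), np.log(f0[voiced]))

    energy = np.log(np.sqrt(np.mean(frames ** 2, axis=1)) + eps)
    return np.stack([lf0, vuv, energy], axis=1)

# Usage: one second of a 200 Hz tone at 16 kHz with a short unvoiced gap.
wav = np.sin(2 * np.pi * 200 * np.arange(16000) / 16000)
f0 = np.full(98, 200.0)
f0[10:20] = 0.0
feats = prosody_features(wav, f0)
```

A feature matrix of this kind could be used as frame-level conditioning for the synthesis model. How the systems listed above actually normalize and consume these features is not stated on this page.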
1. Examples of target speaker speech:
speaker | samples
db1 | (audio)
2. Results of source style transfer on the TTS test set:
Test Set | Testing scenarios | Source Audio | Synthesized Speech: CS | BL | P1 | P2 | P3 (proposed)
1 | male-news
2 | male-news
3 | male-novel
4 | male-poetry
5 | female-emotion-anger
6 | female-emotion-anger
7 | female-emotion-disgust
8 | female-emotion-disgust
9 | female-emotion-happy
10 | female-emotion-happy
11 | female-emotion-sad
12 | female-emotion-sad
13 | female-emotion-surprise
14 | female-emotion-surprise
15 | sing
3. Results of source style transfer on the dubbing test set: