Audio samples from "MELSPECTROGRAM AUGMENTATION FOR SEQUENCE TO SEQUENCE VOICE CONVERSION"







Paper: arXiv

Authors: Yeongtae Hwang, Hyemin Cho, Hongsun Yang, Insoo Oh, and Seong-Whan Lee

Abstract: Training a sequence-to-sequence voice conversion (VC) model requires dealing with insufficient data, i.e., the limited number of speech pairs that share the same utterance. This study experimentally investigated the effects of Mel-spectrogram augmentation on training a sequence-to-sequence VC model from scratch. For Mel-spectrogram augmentation, we adopted the policies proposed in SpecAugment. In addition, we proposed new policies (i.e., frequency warping, loudness control, and time length control) for more data variation. Moreover, to find appropriate hyperparameters of the augmentation policies without training the VC model, we proposed a hyperparameter search strategy and a new metric for reducing experimental cost, namely the deformation per deteriorating ratio. We compared the effects of these Mel-spectrogram augmentation methods across various training-set sizes and augmentation policies. In the experimental results, the time-axis warping based policies (i.e., time length control and time warping) showed better performance than the other policies. These results indicate that Mel-spectrogram augmentation is beneficial for training the VC model.





Mel-spectrogram augmentation

We adopted the policies proposed in SpecAugment, i.e., time masking, frequency masking, and time warping, which deform the time axis and cause partial loss along the time and frequency axes. For a greater variety of Mel-spectrogram variants, we propose new policies, i.e., frequency warping, loudness control, and time length control, which adjust the pitch, loudness, and speed of the speech.

The following audio samples are examples of Mel-spectrogram augmentation with varying parameters.
- The audios were decoded with the Griffin-Lim vocoder using the processed Mel-spectrograms as inputs.
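As a rough illustration of the adopted masking policies, here is a minimal NumPy sketch of time and frequency masking. The function and parameter names (`t`, `f`, `num_masks`) mirror the sample labels on this page but are not the authors' implementation.

```python
import numpy as np

def time_mask(mel, t, num_masks=2, rng=None):
    """Zero out `num_masks` random spans of width `t` frames along the time axis.

    `mel` is a (n_mels, n_frames) array. Illustrative sketch only."""
    rng = rng or np.random.default_rng()
    mel = mel.copy()
    n_frames = mel.shape[1]
    for _ in range(num_masks):
        start = rng.integers(0, max(1, n_frames - t))
        mel[:, start:start + t] = 0.0
    return mel

def frequency_mask(mel, f, num_masks=3, rng=None):
    """Zero out `num_masks` random bands of height `f` bins along the frequency axis."""
    rng = rng or np.random.default_rng()
    mel = mel.copy()
    n_mels = mel.shape[0]
    for _ in range(num_masks):
        start = rng.integers(0, max(1, n_mels - f))
        mel[start:start + f, :] = 0.0
    return mel
```

In SpecAugment the masked region is typically filled with zeros or the mean value; zeros are used here for simplicity.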


sentence : 우리는 오늘, 우리 조선이 독립국이며 조선인이 자주민임을 선언합니다. (Today, we declare that our Korea is an independent nation and that Koreans are a self-governing people.)


Original

This is a sample from the Korean single speaker (KSS) dataset. It was decoded in the same manner, without augmentation.

(1) Time masking (Number of Masking=2)

t=2 t=4 t=6 t=8 t=10 t=12 t=14 t=16

(2) Frequency masking (Number of Masking=3)

f=2 f=4 f=6 f=8 f=10 f=12 f=14 f=16

(3) Time warping

w=0.2 w=0.4 w=0.6 w=0.8 w=0.10 w=0.12 w=0.14 w=0.16

(4) Frequency warping

h=2 h=4 h=6 h=8 h=10 h=12 h=14 h=16

(5) Loudness control

λ=0.02 λ=0.04 λ=0.08 λ=0.16 λ=0.32 λ=0.64

(6) Time length control

l=0.02 l=0.04 l=0.06 l=0.08 l=0.10 l=0.12 l=0.14 l=0.16
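The two proposed policies above can be sketched as follows, assuming λ bounds a random gain factor and l bounds a random time-resampling rate. This is an illustrative reconstruction under those assumptions, not the paper's exact code.

```python
import numpy as np

def loudness_control(mel, lam, rng=None):
    """Scale the magnitude Mel-spectrogram by a random gain in [1 - lam, 1 + lam].

    `lam` plays the role of the λ parameter above; illustrative sketch only."""
    rng = rng or np.random.default_rng()
    scale = rng.uniform(1.0 - lam, 1.0 + lam)
    return mel * scale

def time_length_control(mel, l, rng=None):
    """Resample the time axis by a random rate in [1 - l, 1 + l] with linear
    interpolation, changing the apparent speaking speed."""
    rng = rng or np.random.default_rng()
    rate = rng.uniform(1.0 - l, 1.0 + l)
    n_frames = mel.shape[1]
    new_len = max(1, int(round(n_frames * rate)))
    old_idx = np.arange(n_frames)
    new_idx = np.linspace(0, n_frames - 1, new_len)
    # Interpolate each Mel band independently onto the new time grid.
    return np.stack([np.interp(new_idx, old_idx, band) for band in mel])
```

Note that time length control changes the number of frames, so target-side labels or alignments must be adjusted accordingly when it is applied during training.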


The effect of the size of training data.

The total training data used in the experiment is approximately 8 hours each for the source speaker and the target speaker. We experimented by repeatedly halving the training set, from the whole set down to a 1/16 subset.

The following are the source audios.

sentence 1: 이 열차는 잠시 후 서울역에 도착합니다. (This train will soon arrive at Seoul Station.)

sentence 2: 한 가지 여쭤봐도 될까요? (May I ask you something?)

sentence 3: 저는 동양의 역사에 관심이 있어요. (I am interested in East Asian history.)

sentence 4: 연기가 너무 많이 나는 것 같아요. (There seems to be a lot of smoke.)

sentence 5: 다음 번엔 공부 더 열심히 할게요. (I will study harder next time.)



The following audio samples were synthesized by Seq2Seq VC models trained on different sizes of training data without Mel-spectrogram augmentation.
- The audios were decoded with the WaveNet vocoder using the synthesized Mel-spectrograms as inputs.

sentence 1: 이 열차는 잠시 후 서울역에 도착합니다.

The size of training data

1 1/2 1/4 1/8 1/16

10^5 iterations


sentence 2: 한 가지 여쭤봐도 될까요?

The size of training data

1 1/2 1/4 1/8 1/16

10^5 iterations


sentence 3: 저는 동양의 역사에 관심이 있어요.

The size of training data

1 1/2 1/4 1/8 1/16

10^5 iterations


sentence 4: 연기가 너무 많이 나는 것 같아요.

The size of training data

1 1/2 1/4 1/8 1/16

10^5 iterations


sentence 5: 다음 번엔 공부 더 열심히 할게요.

The size of training data

1 1/2 1/4 1/8 1/16

10^5 iterations



The effect of each policy.

In experiments on each policy with the 1/16 training set, the time-axis warping based policies, i.e., time length control, time length control both, and time warping, achieved better character error rates than the other policies.

The following audio samples were synthesized by Seq2Seq VC models trained on the 1/16 training set with Mel-spectrogram augmentation applied.
- The audios were decoded with the WaveNet vocoder using the synthesized Mel-spectrograms as inputs.

sentence 1: 이 열차는 잠시 후 서울역에 도착합니다.

Time Length Control both Time Length Control Time Masking Time Warping Frequency Masking Frequency Warping Loudness Control

10^5 iterations


sentence 2: 한 가지 여쭤봐도 될까요?

10^5 iterations


sentence 3: 저는 동양의 역사에 관심이 있어요.

10^5 iterations


sentence 4: 연기가 너무 많이 나는 것 같아요.

10^5 iterations


sentence 5: 다음 번엔 공부 더 열심히 할게요.

10^5 iterations



The effect of time-axis warping based policies with various sizes of training data.

The following audio samples were synthesized by Seq2Seq VC models trained on different sizes of training data with the time-axis warping based policies applied.
- The audios were decoded with the WaveNet vocoder using the synthesized Mel-spectrograms as inputs.

sentence : 이 열차는 잠시 후 서울역에 도착합니다.


The size of training data

1 1/2 1/4 1/8 1/16

TLC both

TLC

Time Warping



Future work: prosody control with Mel-spectrogram processing

These policies can be used for prosody control as well as for augmentation. The following is an example of applying "time length control" and "frequency warping" only to the part of the Mel-spectrogram corresponding to the syllable '평'.

sentence : 아버지는 한국어를 연구하는데 '평'생을 바치셨다. (My father devoted his whole ('평'생) life to researching the Korean language.)


Original Time Length Control + Frequency Warping
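The segment-level manipulation demonstrated above can be sketched as follows, assuming the frames covering '평' have already been located by some alignment; the frame indices and stretch rate below are hypothetical, not taken from the paper.

```python
import numpy as np

def stretch_segment(mel, start, end, rate):
    """Apply time length control only to frames [start, end) of a
    (n_mels, n_frames) Mel-spectrogram, leaving the rest untouched.

    For prosody control, [start, end) would be the frames aligned to the
    target syllable (e.g. '평'); here they are hypothetical indices."""
    seg = mel[:, start:end]
    n = seg.shape[1]
    new_len = max(1, int(round(n * rate)))
    old_idx = np.arange(n)
    new_idx = np.linspace(0, n - 1, new_len)
    # Stretch or compress only the selected segment via linear interpolation.
    seg_stretched = np.stack([np.interp(new_idx, old_idx, band) for band in seg])
    return np.concatenate([mel[:, :start], seg_stretched, mel[:, end:]], axis=1)
```

A frequency-warping step for the same segment would shift its Mel bands analogously along the frequency axis before reassembly.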


** python notebook demo will be added soon. **