Audio samples from "MELSPECTROGRAM AUGMENTATION FOR SEQUENCE TO SEQUENCE VOICE CONVERSION"







Paper: arXiv

Authors: Yeongtae Hwang, Hyemin Cho, Hongsun Yang, Insoo Oh, and Seong-Whan Lee

Abstract: Training a sequence-to-sequence voice conversion (VC) model requires dealing with insufficient data, i.e., the limited number of speech pairs that share the same utterance. This study experimentally investigated the effects of Mel-spectrogram augmentation on training a sequence-to-sequence VC model from scratch. For Mel-spectrogram augmentation, we adopted the policies proposed in SpecAugment. In addition, we proposed new policies (i.e., frequency warping, loudness control, and time length control) for more data variation. Moreover, to find appropriate hyperparameters of the augmentation policies without training the VC model, we proposed a hyperparameter search strategy and a new metric for reducing experimental cost, namely the deformation per deteriorating ratio. We compared the effects of these Mel-spectrogram augmentation methods across various training-set sizes and augmentation policies. In the experimental results, the time-axis warping based policies (i.e., time length control and time warping) showed better performance than the other policies. These results indicate that Mel-spectrogram augmentation is beneficial for training the VC model.





Mel-spectrogram augmentation

We adopted the policies proposed in SpecAugment, i.e., time masking, frequency masking, and time warping, which deform the time axis and cause partial loss along the time and frequency axes. For a greater variety of Mel-spectrogram variants, we propose new policies, i.e., frequency warping, loudness control, and time length control, which adjust the pitch, loudness, and speed of the speech.

The following audio samples are examples of Mel-spectrogram augmentation with varying parameters.
- The audios were decoded with the Griffin-Lim vocoder using the processed Mel-spectrograms as inputs.
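As a rough illustration of the adopted masking policies, here is a minimal NumPy sketch of time and frequency masking. The function and parameter names (`t`, `f`, `num_masks`) mirror the sample labels on this page but are not the authors' implementation.

```python
import numpy as np

def time_mask(mel, t, num_masks=2, rng=None):
    """Zero out `num_masks` random spans of width `t` frames along the time axis.

    `mel` is a (n_mels, n_frames) array. Illustrative sketch only."""
    rng = rng or np.random.default_rng()
    mel = mel.copy()
    n_frames = mel.shape[1]
    for _ in range(num_masks):
        start = rng.integers(0, max(1, n_frames - t))
        mel[:, start:start + t] = 0.0
    return mel

def frequency_mask(mel, f, num_masks=3, rng=None):
    """Zero out `num_masks` random bands of height `f` bins along the frequency axis."""
    rng = rng or np.random.default_rng()
    mel = mel.copy()
    n_mels = mel.shape[0]
    for _ in range(num_masks):
        start = rng.integers(0, max(1, n_mels - f))
        mel[start:start + f, :] = 0.0
    return mel
```

In SpecAugment the masked region is typically filled with zeros or the mean value; zeros are used here for simplicity.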


sentence : 우리는 오늘, 우리 조선이 독립국이며 조선인이 자주민임을 선언합니다. (Today, we declare that our Korea is an independent nation and that Koreans are a self-governing people.)


Original

This is a sample from the Korean single speaker (KSS) dataset. It was decoded in the same manner, without augmentation.

(1) Time masking (Number of Masking=2)

t=2 t=4 t=6 t=8 t=10 t=12 t=14 t=16

(2) Frequency masking (Number of Masking=3)

f=2 f=4 f=6 f=8 f=10 f=12 f=14 f=16

(3) Time warping

w=0.2 w=0.4 w=0.6 w=0.8 w=0.10 w=0.12 w=0.14 w=0.16

(4) Frequency warping

h=2 h=4 h=6 h=8 h=10 h=12 h=14 h=16

(5) Loudness control

λ=0.02 λ=0.04 λ=0.08 λ=0.16 λ=0.32 λ=0.64

(6) Time length control

l=0.02 l=0.04 l=0.06 l=0.08 l=0.10 l=0.12 l=0.14 l=0.16
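The two proposed policies above can be sketched as follows, assuming λ bounds a random gain factor and l bounds a random time-resampling rate. This is an illustrative reconstruction under those assumptions, not the paper's exact code.

```python
import numpy as np

def loudness_control(mel, lam, rng=None):
    """Scale the magnitude Mel-spectrogram by a random gain in [1 - lam, 1 + lam].

    `lam` plays the role of the λ parameter above; illustrative sketch only."""
    rng = rng or np.random.default_rng()
    scale = rng.uniform(1.0 - lam, 1.0 + lam)
    return mel * scale

def time_length_control(mel, l, rng=None):
    """Resample the time axis by a random rate in [1 - l, 1 + l] with linear
    interpolation, changing the apparent speaking speed."""
    rng = rng or np.random.default_rng()
    rate = rng.uniform(1.0 - l, 1.0 + l)
    n_frames = mel.shape[1]
    new_len = max(1, int(round(n_frames * rate)))
    old_idx = np.arange(n_frames)
    new_idx = np.linspace(0, n_frames - 1, new_len)
    # Interpolate each Mel band independently onto the new time grid.
    return np.stack([np.interp(new_idx, old_idx, band) for band in mel])
```

Note that time length control changes the number of frames, so target-side labels or alignments must be adjusted accordingly when it is applied during training.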


The effect of the size of training data.

The total training data used in the experiment is approximately 8 hours each for the source speaker and the target speaker. We experimented by repeatedly halving the training set, from the whole set down to a 1/16 subset.

The following are the source audios.

sentence 1: 이 열차는 잠시 후 서울역에 도착합니다. (This train will soon arrive at Seoul Station.)

sentence 2: 한 가지 여쭤봐도 될까요? (May I ask you something?)

sentence 3: 저는 동양의 역사에 관심이 있어요. (I am interested in East Asian history.)

sentence 4: 연기가 너무 많이 나는 것 같아요. (There seems to be a lot of smoke.)

sentence 5: 다음 번엔 공부 더 열심히 할게요. (I will study harder next time.)



The following audio samples were synthesized by Seq2Seq VC models trained on different sizes of training data without Mel-spectrogram augmentation.
- The audios were decoded with the WaveNet vocoder using the synthesized Mel-spectrograms as inputs.

sentence 1: 이 열차는 잠시 후 서울역에 도착합니다.

The size of training data

1 1/2 1/4 1/8 1/16

10^5 iterations


sentence 2: 한 가지 여쭤봐도 될까요?

The size of training data

1 1/2 1/4 1/8 1/16

10^5 iterations


sentence 3: 저는 동양의 역사에 관심이 있어요.

The size of training data

1 1/2 1/4 1/8 1/16

10^5 iterations


sentence 4: 연기가 너무 많이 나는 것 같아요.

The size of training data

1 1/2 1/4 1/8 1/16

10^5 iterations


sentence 5: 다음 번엔 공부 더 열심히 할게요.

The size of training data

1 1/2 1/4 1/8 1/16

10^5 iterations



The effect of each policy.

In experiments on each policy with the 1/16 training set, the time-axis warping based policies, i.e., time length control, time length control both, and time warping, achieved better character error rates than the other policies.

The following audio samples were synthesized by Seq2Seq VC models trained on the 1/16 training set with Mel-spectrogram augmentation applied.
- The audios were decoded with the WaveNet vocoder using the synthesized Mel-spectrograms as inputs.

sentence 1: 이 열차는 잠시 후 서울역에 도착합니다.

Time Length Control both Time Length Control Time Masking Time Warping Frequency Masking Frequency Warping Loudness Control

10^5 iterations


sentence 2: 한 가지 여쭤봐도 될까요?

10^5 iterations


sentence 3: 저는 동양의 역사에 관심이 있어요.

10^5 iterations


sentence 4: 연기가 너무 많이 나는 것 같아요.

10^5 iterations


sentence 5: 다음 번엔 공부 더 열심히 할게요.

10^5 iterations



The effect of time-axis warping based policies with various sizes of training data.

The following audio samples were synthesized by Seq2Seq VC models trained on different sizes of training data with the time-axis warping based policies applied.
- The audios were decoded with the WaveNet vocoder using the synthesized Mel-spectrograms as inputs.

sentence : 이 열차는 잠시 후 서울역에 도착합니다.


The size of training data

1 1/2 1/4 1/8 1/16

TLC both

TLC

Time Warping



Future work: prosody control with Mel-spectrogram processing

These policies can be used for prosody control as well as for augmentation. The following is an example of applying "time length control" and "frequency warping" only to the part of the Mel-spectrogram corresponding to the syllable '평'.

sentence : 아버지는 한국어를 연구하는데 '평'생을 바치셨다. (My father devoted his whole ('평'생) life to researching the Korean language.)


Original Time Length Control + Frequency Warping
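The segment-level manipulation demonstrated above can be sketched as follows, assuming the frames covering '평' have already been located by some alignment; the frame indices and stretch rate below are hypothetical, not taken from the paper.

```python
import numpy as np

def stretch_segment(mel, start, end, rate):
    """Apply time length control only to frames [start, end) of a
    (n_mels, n_frames) Mel-spectrogram, leaving the rest untouched.

    For prosody control, [start, end) would be the frames aligned to the
    target syllable (e.g. '평'); here they are hypothetical indices."""
    seg = mel[:, start:end]
    n = seg.shape[1]
    new_len = max(1, int(round(n * rate)))
    old_idx = np.arange(n)
    new_idx = np.linspace(0, n - 1, new_len)
    # Stretch or compress only the selected segment via linear interpolation.
    seg_stretched = np.stack([np.interp(new_idx, old_idx, band) for band in seg])
    return np.concatenate([mel[:, :start], seg_stretched, mel[:, end:]], axis=1)
```

A frequency-warping step for the same segment would shift its Mel bands analogously along the frequency axis before reassembly.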


** python notebook demo will be added soon. **