PeriodWave: Multi-Period Flow Matching for High-Fidelity Waveform Generation

Abstract

Universal waveform generation has recently been investigated under various out-of-distribution scenarios. Although GAN-based methods have shown their strength in fast waveform generation, they are vulnerable to train-inference mismatch scenarios such as two-stage text-to-speech. Meanwhile, diffusion-based models have shown powerful generative performance in other domains; however, they have received less attention in waveform generation because of their slow inference speed. Above all, there is no generator architecture that can explicitly disentangle the natural periodic features of high-resolution waveform signals. In this paper, we propose PeriodWave, a novel universal waveform generation model. First, we introduce a period-aware flow matching estimator that can capture the periodic features of the waveform signal when estimating the vector fields. Additionally, we utilize a multi-period estimator that avoids overlaps to capture different periodic features of waveform signals. Although increasing the number of periods can improve performance significantly, it also increases the computational cost. To mitigate this issue, we also propose a single period-conditional universal estimator that can run feed-forward in parallel via period-wise batch inference. Additionally, we utilize the discrete wavelet transform to losslessly disentangle the frequency information of waveform signals for high-frequency modeling. The experimental results demonstrate that our model outperforms previous models in both Mel-spectrogram reconstruction and text-to-speech tasks.
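The two signal-level ideas in the abstract, viewing a waveform at several periods and splitting it losslessly into frequency bands, can be illustrated with a minimal sketch. This is not the PeriodWave implementation: the period set (1, 2, 3, 5, 7), the zero padding, and the choice of a Haar wavelet are assumptions made only for illustration, and the function names are hypothetical.

```python
import math

# Assumed period set; prime periods are one way to avoid overlapping views.
PERIODS = (1, 2, 3, 5, 7)


def fold_by_period(signal, period):
    """Fold a 1D signal into rows of length `period`, zero-padding the tail.

    Column j of the result holds samples j, j + period, j + 2 * period, ...
    so each column exposes structure that repeats with that period.
    """
    pad = (-len(signal)) % period
    padded = list(signal) + [0.0] * pad
    return [padded[i:i + period] for i in range(0, len(padded), period)]


def haar_dwt(signal):
    """One level of an orthonormal Haar DWT on an even-length signal.

    Returns (low, high): the low-frequency and high-frequency sub-bands,
    each half the length of the input.
    """
    s = 1.0 / math.sqrt(2.0)
    low = [(signal[i] + signal[i + 1]) * s for i in range(0, len(signal), 2)]
    high = [(signal[i] - signal[i + 1]) * s for i in range(0, len(signal), 2)]
    return low, high


def haar_idwt(low, high):
    """Invert haar_dwt, recovering the original samples (up to rounding)."""
    s = 1.0 / math.sqrt(2.0)
    out = []
    for a, d in zip(low, high):
        out.extend([(a + d) * s, (a - d) * s])
    return out


if __name__ == "__main__":
    x = [float(i) for i in range(10)]
    for p in PERIODS:
        rows = fold_by_period(x, p)
        print(p, len(rows), "rows of length", p)
    low, high = haar_dwt(x)
    print(max(abs(a - b) for a, b in zip(x, haar_idwt(low, high))))
```

Folding with several periods gives one 2D view per period, which is also what makes period-wise batch inference possible: the views can be processed side by side by a single period-conditioned network. The round trip through `haar_dwt` and `haar_idwt` shows what "lossless" means here: the sub-bands together carry exactly the information of the original signal.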


Multi Speakers (LibriTTS Dataset)


Ground Truth
UnivNet
Vocos
BigVGAN-base
BigVGAN
PeriodWave (step 16)
PeriodWave + FreeU (step 16)
MB-PeriodWave (step 16)


Zero-shot TTS (ARDiT-TTS + Vocoder / 24,000 Hz)

To further demonstrate the effectiveness of our model for two-stage TTS, we added results for multi-speaker zero-shot TTS. As the TTS model, we used ARDiT-TTS, an autoregressive diffusion transformer-based zero-shot TTS model that uses the same Mel-spectrogram configuration for 24 kHz audio. We requested Mel-spectrograms generated by ARDiT-TTS from its authors, and they kindly sent us the Mel-spectrograms of 500 samples from the LibriTTS test subsets. We have attached the UTMOS results for each vocoder, and we will conduct a MOS evaluation for this experiment. Although GAN-based models show strong generative performance on the original Mel-spectrogram converted from GT audio, these results show that they are less robust to Mel-spectrograms generated by TTS models. We used the official implementations and checkpoints of BigVGAN and BigVSAN.

[Evaluation table: UTMOS results for zero-shot TTS (500 samples).]

BigVSAN
BigVGAN
PeriodWave (step 16)
PeriodWave + FreeU (step 16)


Zero-shot TTS according to Sampling Steps (ARDiT-TTS + PeriodWave(FreeU) / 24,000 Hz)

[Evaluation table: UTMOS results for zero-shot TTS (500 samples).]

PeriodWave (step 1)
PeriodWave (step 2)
PeriodWave (step 4)
PeriodWave (step 8)
PeriodWave (step 16)
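The "step" counts above refer to how many times the vector-field estimator is evaluated while integrating from noise to a waveform. The sketch below shows fixed-step Euler integration of such an ODE. It is an illustration under stated assumptions: this page does not say which solver is used, so Euler is assumed; the toy vector field stands in for the trained estimator; and `euler_sample` and `make_toy_field` are hypothetical names.

```python
def euler_sample(vector_field, x0, num_steps):
    """Integrate dx/dt = vector_field(x, t) from t = 0 to t = 1.

    Uses `num_steps` equal Euler steps, so the vector field is evaluated
    exactly `num_steps` times.
    """
    if num_steps < 1:
        raise ValueError("num_steps must be at least 1")
    h = 1.0 / num_steps
    x = list(x0)
    for k in range(num_steps):
        t = k * h
        v = vector_field(x, t)
        x = [xi + h * vi for xi, vi in zip(x, v)]
    return x


def make_toy_field(target):
    """Toy stand-in for a trained estimator (assumption, not the real model).

    It always points from the current sample straight at `target`, scaled so
    that the straight-line path reaches the target exactly at t = 1.
    """
    def field(x, t):
        return [(ti - xi) / (1.0 - t) for xi, ti in zip(x, target)]
    return field


if __name__ == "__main__":
    target = [0.5, -1.0, 2.0]
    noise = [0.0, 0.0, 0.0]
    for steps in (1, 2, 4, 8, 16):
        print(steps, euler_sample(make_toy_field(target), noise, steps))
```

With this idealized field every step count reaches the target, because the path is a straight line. A learned vector field is only approximately straight, which is why quality generally depends on the number of sampling steps and why the samples above are given per step count.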


Single Speaker (LJSpeech Dataset)


Ground Truth
HiFi-GAN
BigVGAN-base
BigVGAN
PriorGrad (step 50)
FreGrad (step 50)
PeriodWave (step 16)
PeriodWave + FreeU (step 16)
MB-PeriodWave (step 16)


LJSpeech TTS (Glow TTS + Vocoder / 22,050 Hz)


[Evaluation table]
Ground Truth
HiFi-GAN
BigVGAN-base
BigVGAN
PriorGrad (step 50)
FreGrad (step 50)
PeriodWave (step 16)
PeriodWave + FreeU (step 16)
MB-PeriodWave (𝜏 = 0.333)
MB-PeriodWave (𝜏 = 0.5)
MB-PeriodWave (𝜏 = 0.667)


LJSpeech (per sampling step)


[Evaluation table: Objective evaluation results on LJSpeech according to different sampling steps. We utilized PeriodWave trained for 1M steps.]
Ground Truth
PeriodWave (step 1)
PeriodWave (step 2)
PeriodWave (step 4)
PeriodWave (step 8)
PeriodWave (step 16)
PeriodWave (step 32)
PeriodWave (step 64)


Out-of-Distribution: MUSDB18-HQ (Bass)


Ground Truth
UnivNet
Vocos
BigVGAN-base
BigVGAN
PeriodWave (step 16)
PeriodWave + FreeU (step 16)
MB-PeriodWave (step 16)


Out-of-Distribution: MUSDB18-HQ (Drums)


Ground Truth
UnivNet
Vocos
BigVGAN-base
BigVGAN
PeriodWave (step 16)
PeriodWave + FreeU (step 16)
MB-PeriodWave (step 16)


Out-of-Distribution: MUSDB18-HQ (Mixture)


Ground Truth
UnivNet
Vocos
BigVGAN-base
BigVGAN
PeriodWave (step 16)
PeriodWave + FreeU (step 16)
MB-PeriodWave (step 16)


Out-of-Distribution: MUSDB18-HQ (Others)


Ground Truth
UnivNet
Vocos
BigVGAN-base
BigVGAN
PeriodWave (step 16)
PeriodWave + FreeU (step 16)
MB-PeriodWave (step 16)


Out-of-Distribution: MUSDB18-HQ (Vocals)


Ground Truth
UnivNet
Vocos
BigVGAN-base
BigVGAN
PeriodWave (step 16)
PeriodWave + FreeU (step 16)
MB-PeriodWave (step 16)