PeriodWave: Multi-Period Flow Matching for High-Fidelity Waveform Generation

Anonymous

Abstract

Recently, universal waveform generation tasks have been investigated conditioned on various out-of-distribution scenarios. Although one-step GAN-based methods have shown their strength in fast waveform generation, they are vulnerable to train-inference mismatch scenarios such as two-stage text-to-speech. Meanwhile, diffusion-based models have shown powerful generative performance in other domains; however, they have stayed out of the limelight in waveform generation tasks due to their slow inference speed. Above all, there is no generator architecture that can explicitly disentangle the natural periodic features of high-resolution waveform signals. In this paper, we propose PeriodWave, a novel universal waveform generation model that synthesizes waveforms from Mel-spectrograms or neural audio codec tokens. First, we introduce a period-aware flow matching estimator that effectively captures the periodic features of the waveform signal when estimating the vector fields. Additionally, we utilize a multi-period estimator that avoids overlapping periods to capture different periodic features of waveform signals. Although increasing the number of periods can improve performance significantly, it also increases computational cost. To reduce this cost, we propose a single period-conditional universal estimator that runs feed-forward passes in parallel via period-wise batch inference. Additionally, we are the first to apply FreeU to waveform generation, reducing high-frequency noise. Furthermore, we demonstrate the effectiveness of the proposed method on the neural audio codec decoding task, and present a streaming generation framework for non-autoregressive models in speech language models. Experimental results demonstrate that our model outperforms previous models on reconstruction from Mel-spectrograms and discrete tokens, as well as on text-to-speech.
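The period-aware idea can be illustrated with a small sketch (not the paper's code): as in HiFi-GAN's multi-period discriminator, a 1-D waveform is viewed as a 2-D array of shape (frames, period) so that period-aligned structure lines up along one axis. The function name and padding choice below are illustrative assumptions.

```python
import numpy as np

def reshape_by_period(x: np.ndarray, period: int) -> np.ndarray:
    """View a 1-D signal as a 2-D (frames, period) array, zero-padding
    the tail so the length is a multiple of `period`."""
    pad = (-len(x)) % period
    x = np.pad(x, (0, pad))
    return x.reshape(-1, period)

# Distinct (e.g. prime) periods avoid overlapping harmonics, so each
# period branch sees different periodic structure of the same signal.
sig = np.sin(2 * np.pi * np.arange(100) / 5.0)  # period-5 sinusoid
v = reshape_by_period(sig, 5)
# every row of a period-5 reshape of a period-5 signal is identical
assert np.allclose(v, v[0])
```

A multi-period estimator applies one such view per period; the period-conditional variant instead stacks these views into a batch and processes them in a single forward pass.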

ICLR Rebuttal: Neural Audio Codec Decoding from EnCodec Tokens

We utilized the universal test sets provided in RFWave. These samples consist of speech, vocals, and sound effects. The PeriodWave samples are from a single-band PeriodWave model trained on EnCodec tokens instead of Mel-spectrograms. Following MBD, we used the same EnCodec settings to obtain the tokens, with a maximum bandwidth of 6.0 kbps (Q=8, 75 Hz frame rate for 24,000 Hz waveforms).
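As a quick sanity check on these settings (arithmetic only, not code from the paper): EnCodec's 24 kHz model uses a total stride of 320 samples, giving 75 frames per second, and each of the Q=8 codebooks has 1024 entries (10 bits per code), which yields exactly 6.0 kbps.

```python
import math

frame_rate_hz = 24_000 / 320        # EnCodec 24 kHz model: hop 320 -> 75 Hz
num_codebooks = 8                   # Q = 8
bits_per_code = math.log2(1024)     # 1024-entry codebooks -> 10 bits each
bitrate_kbps = frame_rate_hz * num_codebooks * bits_per_code / 1000
print(frame_rate_hz, bitrate_kbps)  # 75.0 6.0
```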

GT, EnCodec, Vocos, MBD, RFWave, PeriodWave

Multi Speakers (LibriTTS Dataset)

All PeriodWave samples were generated in 16 steps.

GT, UnivNet, Vocos, BigVGAN-base, BigVGAN, PeriodWave, PeriodWave + FreeU, MB-PeriodWave

Zero-shot TTS (using ARDiT-TTS)

To further demonstrate the effectiveness of our model for two-stage TTS, we added results for multi-speaker zero-shot TTS. We utilized ARDiT-TTS, an autoregressive diffusion transformer-based zero-shot TTS model, which uses the same Mel-spectrogram configuration for 24 kHz audio. We requested the generated Mel-spectrograms of ARDiT-TTS from the authors, who kindly sent us the Mel-spectrograms of 500 samples from the LibriTTS test subsets. We have attached the UTMOS results for each vocoder, and we will conduct a MOS evaluation for this experiment. Although GAN-based models have shown powerful generative performance on Mel-spectrograms converted from ground-truth audio, these results show that they have low robustness to Mel-spectrograms generated by TTS models. We used the official implementations and checkpoints of BigVGAN and BigVSAN.

evaluation table
UTMOS results for zero-shot TTS (500 samples): BigVSAN, BigVGAN, PeriodWave, PeriodWave + FreeU

Zero-shot TTS based on sampling steps (ARDiT-TTS + PeriodWave + FreeU / 24,000 Hz)


PeriodWave (step 1), PeriodWave (step 2), PeriodWave (step 4), PeriodWave (step 8), PeriodWave (step 16)

Neural Audio Codec Decoding from Discrete Tokens

We conducted experiments on parallel and streaming generation from discrete tokens. We used Mimi from Moshi, a state-of-the-art neural audio codec that operates at 12.5 Hz. Note that Mimi supports up to 32 codebooks, but Moshi uses Q=8 quantizers for speech language models, so we likewise used the discrete tokens from eight quantizers as input to our model instead of Mel-spectrograms.
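The streaming setup can be sketched with a toy decoder (the `decode` function below is a hypothetical stand-in, not Mimi's or PeriodWave's actual code): tokens arrive at 12.5 Hz, so each token frame corresponds to 24,000 / 12.5 = 1,920 output samples, and the vocoder can emit audio chunk-by-chunk as token frames arrive instead of waiting for the full sequence.

```python
import numpy as np

SR, TOKEN_RATE = 24_000, 12.5              # Mimi operates at 12.5 Hz
SAMPLES_PER_TOKEN = int(SR / TOKEN_RATE)   # 1920 samples per token frame

def decode(tokens: np.ndarray) -> np.ndarray:
    """Toy stand-in for a vocoder: upsample each token id to audio samples.
    A real model would condition on left context (and possibly lookahead)."""
    return np.repeat(tokens.astype(np.float64), SAMPLES_PER_TOKEN)

def decode_streaming(tokens: np.ndarray, chunk: int = 4):
    """Yield audio chunk-by-chunk as groups of `chunk` token frames arrive."""
    for i in range(0, len(tokens), chunk):
        yield decode(tokens[i:i + chunk])

tokens = np.arange(16)                     # 16 frames ~ 1.28 s of audio
parallel = decode(tokens)
streamed = np.concatenate(list(decode_streaming(tokens)))
assert np.array_equal(parallel, streamed)
```

With this stateless toy decoder the streamed and parallel outputs match exactly; for a real non-autoregressive model the streaming framework must manage context across chunk boundaries.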

evaluation table

(a) Parallel Generation

GT, Mimi (Q=32), Mimi (Q=16), Mimi (Q=8), PeriodWave (Q=8, 2 steps), PeriodWave (Q=8, 4 steps)

(b) Streaming Generation

GT, Mimi (Q=8), PeriodWave (Q=8, 2 steps)

Single Speaker (LJSpeech Dataset)

GT, HiFi-GAN, BigVGAN-base, BigVGAN, PriorGrad (step 50), FreGrad (step 50), PeriodWave, PeriodWave + FreeU, MB-PeriodWave

LJSpeech TTS (Glow-TTS + Vocoder / 22,050 Hz)

evaluation table
GT, HiFi-GAN, BigVGAN-base, BigVGAN, PriorGrad (step 50), FreGrad (step 50), PeriodWave (step 16), PeriodWave + FreeU (step 16), MB-PeriodWave (𝜏 = 0.333), MB-PeriodWave (𝜏 = 0.5), MB-PeriodWave (𝜏 = 0.667)

LJSpeech (per sampling step)

evaluation table
Objective evaluation results on LJSpeech according to different sampling steps. We utilized PeriodWave trained for 1M steps.
GT, PeriodWave (step 1), PeriodWave (step 2), PeriodWave (step 4), PeriodWave (step 8), PeriodWave (step 16), PeriodWave (step 32), PeriodWave (step 64)
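The role of the sampling-step count can be seen in how flow matching generates audio: inference numerically integrates an ODE dx/dt = v(x, t) from noise (t = 0) to data (t = 1), and the step count sets the discretization of that integration. The sketch below uses a closed-form conditional vector field as an illustration; in the paper the field is a learned estimator, for which coarser discretizations generally cost fidelity.

```python
import numpy as np

def euler_sample(v, x0: np.ndarray, steps: int) -> np.ndarray:
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with `steps` Euler steps."""
    x, h = x0.copy(), 1.0 / steps
    for k in range(steps):
        t = k * h
        x = x + h * v(x, t)
    return x

rng = np.random.default_rng(0)
x0 = rng.standard_normal(8)      # "noise" starting point
target = np.ones(8)              # stand-in for a data sample

# conditional vector field pointing from x_t toward the target
v = lambda x, t: (target - x) / (1.0 - t)

for steps in (1, 2, 4, 16):
    out = euler_sample(v, x0, steps)
    # this particular field happens to be integrated exactly by Euler;
    # a learned field is not, which is why the step count matters
    assert np.allclose(out, target)
```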

Out-of-Distribution (MUSDB18-HQ)

(a) Bass

GT, UnivNet, Vocos, BigVGAN-base, BigVGAN, PeriodWave (step 16), PeriodWave + FreeU (step 16), MB-PeriodWave (step 16)

(b) Drums

GT, UnivNet, Vocos, BigVGAN-base, BigVGAN, PeriodWave (step 16), PeriodWave + FreeU (step 16), MB-PeriodWave (step 16)

(c) Mixture

GT, UnivNet, Vocos, BigVGAN-base, BigVGAN, PeriodWave (step 16), PeriodWave + FreeU (step 16), MB-PeriodWave (step 16)

(d) Others

GT, UnivNet, Vocos, BigVGAN-base, BigVGAN, PeriodWave (step 16), PeriodWave + FreeU (step 16), MB-PeriodWave (step 16)

(e) Vocals

GT, UnivNet, Vocos, BigVGAN-base, BigVGAN, PeriodWave (step 16), PeriodWave + FreeU (step 16), MB-PeriodWave (step 16)