최대 1 분 소요


TTS/STT

Text to Sound / Sound to Text

  • Tacotron
    seq2seq 구조를 이용하여 text를 학습하여 mel spectogram을 생성, 이를 sound로 변환하는 방법이다.

img

img

  • Tacotron2
    Tacotron2에서는 생성한 mel spectrogram을 WaveNet에 넣는 과정을 추가했다.

img

하지만 convolution을 어떻게 하든 절대값(abs)을 취해 날아간 위상 정보는 생성이 불가능하다!

  • MelGAN
    사라진 위상정보를 이미지 생성 방법인 GAN을 이용하여 새롭게 생성할 수 있다.

img

Reference

  • Yuxuan Wang, RJ Skerry-Ryan, Daisy Stanton, Yonghui Wu, Ron J. Weiss, Navdeep Jaitly, Zongheng Yang, Ying Xiao, Zhifeng Chen, Samy Bengio, Quoc Le, Yannis Agiomyrgiannakis, Rob Clark and Rif A. Saurous. TACOTRON: TOWARDS END-TO-END SPEECH SYNTHESIS. arXiv:1703.10135v2 [cs.CL] 6 Apr 2017.

  • Jonathan Shen1, Ruoming Pang, Ron J. Weiss, Mike Schuster, Navdeep Jaitly, Zongheng Yang, Zhifeng Chen, Yu Zhang, Yuxuan Wang, RJ Skerry-Ryan, Rif A. Saurous, Yannis Agiomyrgiannakis and Yonghui Wu. NATURAL TTS SYNTHESIS BY CONDITIONING WAVENET ON MEL SPECTROGRAM PREDICTIONS. arXiv:1712.05884v2 [cs.CL] 16 Feb 2018.

  • Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior Koray and Kavukcuoglu. WAVENET: A GENERATIVE MODEL FOR RAW AUDIO. arXiv:1609.03499v2 [cs.SD] 19 Sep 2016.

  • Kundan Kumar, Rithesh Kumar, Thibault de Boissiere, Lucas Gestin, Wei Zhen Teoh, Jose Sotelo, Alexandre de Brebisson, Yoshua Bengio and Aaron Courville. MelGAN: Generative Adversarial Networks for Conditional Waveform Synthesis. arXiv:1910.06711v3 [eess.AS] 9 Dec 2019.

댓글남기기