One-shot multi-speaker text-to-speech using RawNet3 speaker representation

  • Sohee Han (School of Electrical Engineering, Korea Advanced Institute of Science and Technology) ;
  • Jisub Um (School of Electrical Engineering, Korea Advanced Institute of Science and Technology) ;
  • Hoirin Kim (School of Electrical Engineering, Korea Advanced Institute of Science and Technology)
  • Received : 2024.01.31
  • Accepted : 2024.03.21
  • Published : 2024.03.31

Abstract

Recent advances in text-to-speech (TTS) technology have significantly improved the quality of synthesized speech, reaching a level where it closely imitates natural human speech. In particular, TTS models that offer diverse voice characteristics and personalized speech are widely used in fields such as artificial intelligence (AI) tutors, advertising, and video dubbing. Accordingly, in this paper we propose a one-shot multi-speaker TTS system that ensures acoustic diversity and synthesizes personalized voices by generating speech conditioned on utterances from target speakers unseen during training. The proposed model integrates a speaker encoder into a TTS model consisting of the FastSpeech2 acoustic model and the HiFi-GAN vocoder. The speaker encoder, based on the pre-trained RawNet3, extracts speaker-specific voice features. Furthermore, we implement not only an English one-shot multi-speaker TTS model but also a Korean one. We evaluate the naturalness and speaker similarity of the generated speech using both objective and subjective metrics. In the subjective evaluation, the proposed Korean one-shot multi-speaker TTS obtained a naturalness mean opinion score (NMOS) of 3.36 and a similarity MOS (SMOS) of 3.16. In the objective evaluation, the proposed English and Korean models showed a prediction MOS (P-MOS) of 2.54 and 3.74, respectively. These results indicate that the proposed models outperform the baseline models in terms of both naturalness and speaker similarity.
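
To make the architecture described above concrete, the sketch below shows one way a fixed speaker embedding produced by a frozen, pre-trained encoder can condition a FastSpeech2-style acoustic model, whose mel-spectrogram output would then be passed to a HiFi-GAN vocoder. This is a minimal illustration under stated assumptions only: the module names (SpeakerEncoder, OneShotTTS), the 256-dimensional embedding, and the broadcast-add integration point are placeholders, not the paper's actual RawNet3 or FastSpeech2 implementation, whose details the abstract does not specify.

# Minimal conditioning sketch (PyTorch). SpeakerEncoder stands in for a frozen,
# pre-trained RawNet3; OneShotTTS stands in for the FastSpeech2 stages; a separate
# HiFi-GAN vocoder would turn the mel output into a waveform. All module internals
# here are illustrative placeholders.
import torch
import torch.nn as nn


class SpeakerEncoder(nn.Module):
    """Placeholder for a pre-trained speaker encoder: raw waveform -> fixed-size embedding."""
    def __init__(self, emb_dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1, 64, kernel_size=251, stride=160), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1), nn.Flatten(),
            nn.Linear(64, emb_dim))

    def forward(self, wav: torch.Tensor) -> torch.Tensor:        # (B, T_wav)
        return self.net(wav.unsqueeze(1))                        # (B, emb_dim)


class OneShotTTS(nn.Module):
    """FastSpeech2-style acoustic model conditioned on a speaker embedding."""
    def __init__(self, n_phones: int = 80, d_model: int = 256, n_mels: int = 80):
        super().__init__()
        self.phone_emb = nn.Embedding(n_phones, d_model)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True), 2)
        self.spk_proj = nn.Linear(256, d_model)   # map speaker embedding to model dim
        self.decoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True), 2)
        self.to_mel = nn.Linear(d_model, n_mels)

    def forward(self, phones: torch.Tensor, spk_emb: torch.Tensor) -> torch.Tensor:
        h = self.encoder(self.phone_emb(phones))                 # (B, T_ph, d_model)
        # Broadcast-add the speaker embedding to every encoder frame; a full
        # FastSpeech2 model would also run its variance adaptor (duration,
        # pitch, energy predictors) before decoding.
        h = h + self.spk_proj(spk_emb).unsqueeze(1)
        return self.to_mel(self.decoder(h))                      # (B, T_ph, n_mels)


if __name__ == "__main__":
    spk_encoder = SpeakerEncoder().eval()        # frozen in the one-shot setting
    tts = OneShotTTS()
    ref_wav = torch.randn(1, 16000)              # 1 s reference from an unseen speaker
    phones = torch.randint(0, 80, (1, 42))       # dummy phoneme ids
    with torch.no_grad():
        spk_emb = spk_encoder(ref_wav)
    mel = tts(phones, spk_emb)                   # feed to a HiFi-GAN vocoder afterwards
    print(mel.shape)                             # torch.Size([1, 42, 80])

In this sketch the speaker encoder is kept frozen and only its output embedding flows into the acoustic model, which is what allows synthesis for a speaker seen only once at inference time.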

Acknowledgement

This work was supported by a National Research Foundation of Korea (NRF) grant funded by the Korean government (Ministry of Science and ICT) (No. 2021R1A2C1014044).

References

  1. Cooper, E., Lai, C. I., Yasuda, Y., Fang, F., Wang, X., Chen, N., & Yamagishi, J. (2020, May). Zero-shot multi-speaker text-to-speech with state-of-the-art neural speaker embeddings. Proceedings of the ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 6184-6188). Barcelona, Spain.
  2. Choi, S., Han, S., Kim, D., & Ha, S. (2020, October). Attentron: Few-shot text-to-speech utilizing attention-based variable-length embedding. Proceedings of the Interspeech 2020. Shanghai, China.
  3. Casanova, E., Shulby, C., Golge, E., Muller, N. M., de Oliveira, F. S., Candido Junior, A., Soares, A. S., ... & Ponti, M. A. (2021, August-September). SC-GlowTTS: An efficient zero-shot multi-speaker text-to-speech model. Proceedings of the Interspeech 2021 (pp. 3645-3649). Brno, Czechia.
  4. Casanova, E., Weber, J., Shulby, C., Candido Jr., A., Golge, E., & Ponti, M. A. (2022, June). YourTTS: Towards zero-shot multi-speaker TTS and zero-shot voice conversion for everyone. Proceedings of the 39th International Conference on Machine Learning (pp. 2709-2720). Baltimore, MD.
  5. Chung, J. S., Nagrani, A., & Zisserman, A. (2018, September). VoxCeleb2: Deep speaker recognition. Proceedings of the Interspeech (pp. 1086-1090). Hyderabad, India.
  6. Desplanques, B., Thienpondt, J., & Demuynck, K. (2020, October). ECAPA-TDNN: Emphasized channel attention, propagation and aggregation in TDNN based speaker verification. Proceedings of the Interspeech (pp. 3830-3834). Shanghai, China.
  7. Heo, H. S., Lee, B. J., Huh, J., & Chung, J. S. (2020, October). Clova baseline system for the VoxCeleb speaker recognition challenge 2020. Proceedings of the Interspeech. Shanghai, China.
  8. Hu, J., Shen, L., & Sun, G. (2018, June). Squeeze-and-excitation networks. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 7132-7141). Salt Lake City, UT.
  9. Hsu, W. N., Zhang, Y., Weiss, R. J., Zen, H., Wu, Y., Wang, Y., Cao, Y., ... Pang, R. (2018, April-May). Hierarchical generative modeling for controllable speech synthesis. Proceedings of the International Conference on Learning Representations. Vancouver, BC.
  10. Jung, J., Kim, Y., Heo, H. S., Lee, B. J., Kwon, Y., & Chung, J. S. (2022, September). Pushing the limits of raw waveform speaker recognition. Proceedings of the Interspeech 2022 (pp. 2228-2232). Incheon, Korea.
  11. Jung, J., Kim, S., Shim, H., Kim, J., & Yu, H. (2020, October). Improved RawNet with feature map scaling for text-independent speaker verification using raw waveforms. Proceedings of the Interspeech 2020 (pp. 1496-1500). Shanghai, China.
  12. Kwon, Y., Heo, H. S., Lee, B. J., & Chung, J. S. (2021, June). The ins and outs of speaker recognition: Lessons from VoxSRC 2020. Proceedings of the ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 5809-5813). Toronto, ON.
  13. Kong, J., Kim, J., & Bae, J. (2020). HiFi-GAN: Generative adversarial networks for efficient and high fidelity speech synthesis. Advances in Neural Information Processing Systems, 33, 17022-17033.
  14. Li, N., Liu, S., Liu, Y., Zhao, S., & Liu, M. (2019, July). Neural speech synthesis with transformer network. Proceedings of the AAAI Conference on Artificial Intelligence (pp. 6706-6713). Honolulu, HI.
  15. Morise, M., Yokomori, F., & Ozawa, K. (2016). WORLD: A vocoder-based high-quality speech synthesis system for real-time applications. IEICE Transactions on Information and Systems, 99(7), 1877-1884.
  16. Moss, H. B., Aggarwal, V., Prateek, N., Gonzalez, J., & Barra-Chicote, R. (2020, May). Boffin TTS: Few-shot speaker adaptation by Bayesian optimization. Proceedings of the ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 7639-7643). Barcelona, Spain.
  17. Nagrani, A., Chung, J. S., & Zisserman, A. (2017, August). VoxCeleb: A large-scale speaker identification dataset. Proceedings of the Interspeech 2017 (pp. 2616-2620). Stockholm, Sweden.
  18. Ren, Y., Ruan, Y., Tan, X., Qin, T., Zhao, S., Zhao, Z., & Liu, T. Y. (2019). FastSpeech: Fast, robust and controllable text to speech. Advances in Neural Information Processing Systems 32 (NeurIPS 2019). Vancouver, BC.
  19. Ren, Y., Hu, C., Tan, X., Qin, T., Zhao, S., Zhao, Z., & Liu, T. Y. (2020, June). FastSpeech 2: Fast and high-quality end-to-end text to speech. Retrieved from arXiv:2006.04558v8
  20. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., ... Polosukhin, I. (2017). Attention is all you need. Advances in Neural Information Processing Systems 30 (NIPS 2017). Long Beach, CA.
  21. Wan, L., Wang, Q., Papir, A., & Moreno, I. L. (2018, April). Generalized end-to-end loss for speaker verification. Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 4879-4883). Calgary, AB.
  22. Zen, H., Dang, V., Clark, R., Zhang, Y., Weiss, R. J., Jia, Y., Chen, Z., ... Wu, Y. (2019, September). LibriTTS: A corpus derived from LibriSpeech for text-to-speech. Proceedings of the Interspeech 2019. Graz, Austria.
  23. Zhao, B., Zhang, X., Wang, J., Cheng, N., & Xiao, J. (2022, May). nnSpeech: Speaker-guided conditional variational autoencoder for zero-shot multi-speaker text-to-speech. Proceedings of the ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 4293-4297). Singapore.