Complex nested U-Net-based speech enhancement model using a dual-branch decoder

  • Seorim Hwang (Intelligent Signal Processing Lab., Yonsei University);
  • Sung Wook Park (Department of Electronic Engineering, Gangneung-Wonju National University);
  • Young-cheol Park (Intelligent Signal Processing Lab., Yonsei University)
  • Received : 2024.01.22
  • Accepted : 2024.02.07
  • Published : 2024.03.31

Abstract

This paper proposes a new speech enhancement model based on a complex nested U-Net with a dual-branch decoder. The proposed model uses a complex nested U-Net to estimate the magnitude and phase components of the speech signal simultaneously, and its decoder adopts a dual-branch structure in which one branch performs spectral mapping and the other performs time-frequency masking. Compared with a single-branch decoder, the dual-branch structure removes noise effectively while minimizing the loss of speech information. Experiments were conducted on the VoiceBank + DEMAND database, which is widely used for training speech enhancement models, and performance was measured with various objective evaluation metrics. The proposed model improved the Perceptual Evaluation of Speech Quality (PESQ) score by about 0.13 over the baseline and achieved higher objective scores than recently proposed speech enhancement models.
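
To make the dual-branch decoder idea concrete, below is a minimal PyTorch sketch of one plausible realization: a shared encoder over the real and imaginary parts of the noisy spectrum feeds a spectral-mapping branch and a complex time-frequency masking branch, and the two outputs are fused. The layer sizes, the Tanh-bounded mask, and the simple averaging fusion are illustrative assumptions for exposition only, not the paper's actual complex nested U-Net.

```python
import torch
import torch.nn as nn


class DualBranchDecoderSketch(nn.Module):
    """Toy dual-branch decoder: spectral mapping + complex T-F masking."""

    def __init__(self, channels: int = 16):
        super().__init__()
        # Shared encoder over the real/imaginary parts of the noisy STFT
        # (2 input channels: real and imaginary).
        self.encoder = nn.Sequential(
            nn.Conv2d(2, channels, kernel_size=3, padding=1),
            nn.PReLU(),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.PReLU(),
        )
        # Mapping branch: directly regresses the clean complex spectrum.
        self.mapping_branch = nn.Conv2d(channels, 2, kernel_size=3, padding=1)
        # Masking branch: predicts a bounded complex ratio mask (assumption:
        # Tanh bounding; the paper may use a different mask formulation).
        self.masking_branch = nn.Sequential(
            nn.Conv2d(channels, 2, kernel_size=3, padding=1),
            nn.Tanh(),
        )

    def forward(self, noisy: torch.Tensor) -> torch.Tensor:
        # noisy: (batch, 2, freq, time), real/imag stacked on the channel axis.
        feats = self.encoder(noisy)
        mapped = self.mapping_branch(feats)   # spectral-mapping estimate
        mask = self.masking_branch(feats)     # complex T-F mask
        # Apply the mask by complex multiplication with the noisy spectrum.
        nr, ni = noisy[:, 0], noisy[:, 1]
        mr, mi = mask[:, 0], mask[:, 1]
        masked = torch.stack((nr * mr - ni * mi, nr * mi + ni * mr), dim=1)
        # Fuse the two branch outputs; plain averaging is an assumption here.
        return 0.5 * (mapped + masked)


if __name__ == "__main__":
    spec = torch.randn(1, 2, 257, 100)          # toy noisy complex spectrogram
    enhanced = DualBranchDecoderSketch()(spec)  # same shape as the input
    print(enhanced.shape)                       # torch.Size([1, 2, 257, 100])
```

The two branches reflect the trade-off stated in the abstract: the masking path filters the noisy spectrum and so tends to preserve speech detail, while the mapping path regresses the spectrum directly and can suppress noise more aggressively; combining them is what the dual-branch decoder is intended to exploit.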

Keywords

References

  1. P. C. Loizou, Speech Enhancement: Theory and Practice, 2nd ed. (CRC Press, Inc., Boca Raton, 2013), pp. 1-768.
  2. S. Pascual, A. Bonafonte, and J. Serra, "SEGAN: Speech enhancement generative adversarial network," arXiv preprint arXiv:1703.09452 (2017).
  3. H. Dubey, A. Aazami, V. Gopal, B. Naderi, S. Braun, R. Cutler, A. Ju, M. Zohourian, M. Tang, H. Gamper, M. Golestaneh, and R. Aichner, "ICASSP 2023 deep speech enhancement challenge," arXiv preprint arXiv:2303.11510 (2023).
  4. S. Hwang, S. W. Park, and Y. Park, "Performance comparison evaluation of real and complex networks for deep neural network-based speech enhancement in the frequency domain" (in Korean), J. Acoust. Soc. Kr. 41, 30-37 (2022).
  5. S. A. Nossier, J. Wall, M. Moniri, C. Glackin, and N. Cannings, "Mapping and masking targets comparison using different deep learning based speech enhancement architectures," Proc. IEEE IJCNN, 1-8 (2020).
  6. H. S. Choi, J. H. Kim, J. Huh, A. Kim, J. W. Ha, and K. Lee, "Phase-aware speech enhancement with deep complex U-net," arXiv preprint arXiv:1903.03107 (2019).
  7. X. Qin, Z. Zhang, C. Huang, M. Dehghan, O. R. Zaiane, and M. Jagersand, "U2-Net: Going deeper with nested U-structure for salient object detection," Pattern Recognition, 106, 107404 (2020).
  8. S. Hwang, S. W. Park, and Y. Park, "Monoaural speech enhancement using a nested U-net with two-level skip connections," Proc. Interspeech, 191-195 (2022).
  9. R. Cao, S. Abdulatif, and B. Yang, "CMGAN: Conformer-based metric GAN for speech enhancement," arXiv preprint arXiv:2203.15149 (2022).
  10. Z. Zhang, S. Xu, X. Zhuang, L. Zhou, H. Li, and M. Wang, "Two-stage UNet with multi-axis gated multilayer perceptron for monaural noisy-reverberant speech enhancement," Proc. IEEE ICASSP, 1-5 (2023).
  11. Y. Hu, Y. Liu, S. Lv, M. Xing, S. Zhang, Y. Fu, J. Wu, B. Zhang, and L. Xie, "DCCRN: Deep complex convolution recurrent network for phase-aware speech enhancement," arXiv preprint arXiv:2008.00264 (2020).
  12. S. Hwang, J. Byun, and Y.-C. Park, "Performance comparison evaluation of speech enhancement using various loss functions" (in Korean), J. Acoust. Soc. Kr. 40, 176-182 (2021).
  13. C. Valentini-Botinhao, X. Wang, S. Takaki, and J. Yamagishi, "Investigating RNN-based speech enhancement methods for noise-robust Text-to-Speech," Proc. SSW, 146-152 (2016).
  14. Y. Hu and P. C. Loizou, "Evaluation of objective measures for speech enhancement," Proc. Interspeech, 1447-1450 (2006).
  15. A. Li, C. Zheng, L. Zhang, and X. Li, "Glance and gaze: A collaborative learning framework for single-channel speech enhancement," Appl. Acoust. 187, 108499 (2022).
  16. A. Defossez, G. Synnaeve, and Y. Adi, "Real time speech enhancement in the waveform domain," arXiv preprint arXiv:2006.12847 (2020).
  17. A. Li, W. Liu, C. Zheng, C. Fan, and X. Li, "Two heads are better than one: A two-stage complex spectral mapping approach for monaural speech enhancement," IEEE/ACM Trans. Audio, Speech, Lang. Process. 29, 1829-1843 (2021).
  18. S. Zhao, B. Ma, K. N. Watcharasupat, and W. S. Gan, "FRCRN: Boosting feature representation using frequency recurrence for monaural speech enhancement," Proc. IEEE ICASSP, 9281-9285 (2022).