A study on the application of residual vector quantization for vector quantized-variational autoencoder-based foley sound generation model

  • Seokjin Lee (School of Electronic and Electrical Engineering, Kyungpook National University)
  • Received : 2024.01.23
  • Accepted : 2024.02.15
  • Published : 2024.03.31

Abstract

Among the Foley sound generation models that have recently begun to be studied, the approach that combines a Vector Quantized-Variational AutoEncoder (VQ-VAE) with a generative model such as Pixelsnail is one of the important research subjects. Meanwhile, in the field of deep learning-based audio signal compression, residual vector quantization has been reported to be more suitable than the conventional VQ-VAE structure. Therefore, in this paper we study whether residual vector quantization can also be applied effectively to Foley sound generation. To this end, we apply the residual vector quantization technique to a conventional VQ-VAE-based Foley sound generation model and, in particular, derive a model that remains compatible with existing models such as Pixelsnail and does not increase computational resource consumption. To evaluate the model, an experiment was conducted using the DCASE2023 Task 7 data. The results show that the proposed model improves the Fréchet audio distance by about 0.3 on average. The performance gain was limited, however, which is believed to be due to the reduced time-frequency resolution adopted to keep computational resource consumption unchanged.
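For illustration, the following is a minimal sketch of the residual vector quantization idea referred to above: a cascade of codebooks in which each stage quantizes the residual left by the previous stage, trained with the usual VQ-VAE codebook and commitment losses and a straight-through estimator. This is not the paper's implementation; the latent shape, codebook size, number of stages, and loss weighting are assumptions chosen for the example.

# Illustrative sketch of residual vector quantization (RVQ); all sizes are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ResidualVectorQuantizer(nn.Module):
    """Cascade of codebooks: each stage quantizes the residual of the previous one."""

    def __init__(self, dim=64, codebook_size=512, num_quantizers=4, beta=0.25):
        super().__init__()
        self.codebooks = nn.ModuleList(
            [nn.Embedding(codebook_size, dim) for _ in range(num_quantizers)]
        )
        self.beta = beta  # commitment-loss weight, as in the original VQ-VAE

    def forward(self, z):
        # z: (batch, frames, dim) latent sequence produced by the encoder
        residual, quantized = z, torch.zeros_like(z)
        indices, vq_loss = [], z.new_zeros(())
        for codebook in self.codebooks:
            # squared Euclidean distance to every codeword: (batch, frames, codebook_size)
            dist = (residual.pow(2).sum(-1, keepdim=True)
                    - 2.0 * residual @ codebook.weight.t()
                    + codebook.weight.pow(2).sum(-1))
            idx = dist.argmin(dim=-1)            # nearest codeword per frame
            q = codebook(idx)                    # (batch, frames, dim)
            # per-stage codebook and commitment losses (VQ-VAE objective)
            vq_loss = vq_loss + F.mse_loss(q, residual.detach()) \
                              + self.beta * F.mse_loss(residual, q.detach())
            quantized = quantized + q
            residual = residual - q.detach()     # the next stage models what is left over
            indices.append(idx)
        # straight-through estimator: gradients bypass the discrete selection
        quantized = z + (quantized - z).detach()
        return quantized, torch.stack(indices, dim=-1), vq_loss


if __name__ == "__main__":
    rvq = ResidualVectorQuantizer()
    z = torch.randn(2, 100, 64)                  # dummy encoder output
    q, codes, loss = rvq(z)
    print(q.shape, codes.shape, loss.item())     # (2, 100, 64), (2, 100, 4), scalar

Index sequences of this kind are what an autoregressive prior such as Pixelsnail would model; how the multiple quantization stages are arranged so that such a prior stays usable without extra computational cost is the design question the paper addresses.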

Acknowledgement

This study was supported by the 2023 Kyungpook National University Research Year Professor research grant.

References

  1. A. Ramesh, M. Pavlov, G. Goh, S. Gray, C. Voss, A. Radford, M. Chen, and I. Sutskever, "Zero-shot text-to-image generation," Proc. ICML, 8821-8831 (2021). 
  2. OpenAI, "GPT-4 technical report," arXiv preprint, arXiv:2303.08774 (2023). 
  3. M. Pasini and J. Schlüter, "Musika! fast infinite waveform music generation," arXiv preprint, arXiv:2208.08706 (2022). 
  4. Z. Borsos, R. Marinier, D. Vincent, E. Kharitonov, O. Pietquin, M. Sharifi, D. Roblek, O. Teboul, D. Grangier, M. Tagliasacchi, and N. Zeghidour, "AudioLM: a language modelling approach to audio generation," IEEE/ACM Trans. on Audio, Speech, and Lang. Process. 31, 2523-2533 (2023). 
  5. J. Kong, J. Kim, and J. Bae, "HiFi-GAN: generative adversarial networks for efficient and high fidelity speech synthesis," Proc. NeurIPS, 33, 17022-17033 (2020). 
  6. J. Engel, L. Hantrakul, C. Gu, and A. Roberts, "DDSP: differentiable digital signal processing," arXiv preprint, arXiv:2001.04643 (2020). 
  7. K. Choi, J. Im, L. M. Heller, B. McFee, K. Imoto, Y. Okamoto, M. Lagrange, and S. Takamichi, "Foley sound synthesis at the DCASE 2023 challenge," arXiv preprint, arXiv:2304.12521 (2023). 
  8. H. C. Chung, "Foley sound synthesis based on GAN using contrastive learning without label information," DCASE2023 Challenge, Tech. Rep. (2023). 
  9. Y. Yuan, H. Liu, X. Liu, X. Kang, M. D. Plumbley, and W. Wang, "Latent diffusion model based Foley sound generation system for DCASE challenge 2023 task 7," arXiv preprint, arXiv:2305.15905 (2023). 
  10. X. Chen, N. Mishra, M. Rohaninejad, and P. Abbeel, "Pixelsnail: an improved autoregressive generative model," Proc. ICML, 864-872 (2018). 
  11. N. Zeghidour, A. Luebs, A. Omran, J. Skoglund, and M. Tagliasacchi, "SoundStream: an end-to-end neural audio codec," IEEE/ACM Trans. on Audio, Speech, and Lang. Process. 30, 495-507 (2021). 
  12. A. Défossez, J. Copet, G. Synnaeve, and Y. Adi, "High fidelity neural audio compression," arXiv preprint, arXiv:2210.13438 (2022). 
  13. D. P. Kingma and M. Welling, "Auto-encoding variational bayes," arXiv preprint, arXiv:1312.6114 (2013). 
  14. A. Caillon and P. Esling, "RAVE: a variational autoencoder for fast and high-quality neural audio synthesis," arXiv preprint, arXiv:2111.05011 (2021). 
  15. A. van den Oord and O. Vinyals, "Neural discrete representation learning," Proc. NeurIPS, 1-10 (2017). 
  16. A. Razavi, A. van den Oord, and O. Vinyals, "Generating diverse high-fidelity images with VQ-VAE-2," Proc. NeurIPS, 1-11 (2019). 
  17. D. P. Kingma and J. Ba, "Adam: a method for stochastic optimization," arXiv preprint, arXiv:1412.6980 (2014). 
  18. K. Kilgour, M. Zuluaga, D. Roblek, and M. Sharifi, "Fréchet audio distance: a metric for evaluating music enhancement algorithms," arXiv preprint, arXiv:1812.08466 (2018). 
  19. S. Hershey, S. Chaudhuri, D. P. W. Ellis, J. F. Gemmeke, A. Jansen, R. C. Moore, M. Plakal, D. Platt, R. A. Saurous, B. Seybold, M. Slaney, R. J. Weiss, and K. Wilson, "CNN architectures for large-scale audio classification," Proc. IEEE ICASSP, 131-135 (2017).