Acknowledgement
This work was supported by the 2023 Kyungpook National University research fund for professors on sabbatical leave.
References
- A. Ramesh, M. Pavlov, G. Goh, S. Gray, C. Voss, A. Radford, M. Chen, and I. Sutskever, "Zero-shot text-to-image generation," Proc. ICML, 8821-8831 (2021).
- OpenAI, "GPT-4 technical report," arXiv preprint, arXiv:2303.08774 (2023).
- M. Pasini and J. Schlüter, "Musika! fast infinite waveform music generation," arXiv preprint, arXiv:2208.08706 (2022).
- Z. Borsos, R. Marinier, D. Vincent, E. Kharitonov, O. Pietquin, M. Sharifi, D. Roblek, O. Teboul, D. Grangier, M. Tagliasacchi, and N. Zeghidour, "Audiolm: a language modeling approach to audio generation," IEEE/ACM Trans. on Audio, Speech, and Lang. Process. 31, 2523-2533 (2023).
- J. Kong, J. Kim, and J. Bae, "Hifi-gan: generative adversarial networks for efficient and high fidelity speech synthesis," Proc. NeurIPS, 33, 17022-17033 (2020).
- J. Engel, L. Hantrakul, C. Gu, and A. Roberts, "DDSP: differentiable digital signal processing," arXiv preprint, arXiv:2001.04643 (2020).
- K. Choi, J. Im, L. M. Heller, B. McFee, K. Imoto, Y. Okamoto, M. Lagrange, and S. Takamichi, "Foley sound synthesis at the dcase 2023 challenge," arXiv preprint, arXiv:2304.12521 (2023).
- H. C. Chung, "Foley sound synthesis based on GAN using contrastive learning without label information," DCASE2023, Tech. Rep. (2023).
- Y. Yuan, H. Liu, X. Liu, X. Kang, M. D. Plumbley, and W. Wang, "Latent diffusion model based Foley sound generation system for DCASE challenge 2023 task 7," arXiv preprint, arXiv:2305.15905 (2023).
- X. Chen, N. Mishra, M. Rohaninejad, and P. Abbeel, "Pixelsnail: an improved autoregressive generative model," Proc. ICML, 864-872 (2018).
- N. Zeghidour, A. Luebs, A. Omran, J. Skoglund, and M. Tagliasacchi, "Soundstream: an end-to-end neural audio codec," IEEE/ACM Trans. on Audio, Speech, and Lang. Process. 30, 495-507 (2021).
- A. Défossez, J. Copet, G. Synnaeve, and Y. Adi, "High fidelity neural audio compression," arXiv preprint, arXiv:2210.13438 (2022).
- D. P. Kingma and M. Welling, "Auto-encoding variational bayes," arXiv preprint, arXiv:1312.6114 (2013).
- A. Caillon and P. Esling, "RAVE: a variational autoencoder for fast and high-quality neural audio synthesis," arXiv preprint, arXiv:2111.05011 (2021).
- A. van den Oord and O. Vinyals, "Neural discrete representation learning," Proc. NeurIPS, 1-10 (2017).
- A. Razavi, A. van den Oord, and O. Vinyals, "Generating diverse high-fidelity images with VQ-VAE-2," Proc. NeurIPS, 1-11 (2019).
- D. P. Kingma and J. Ba, "Adam: a method for stochastic optimization," arXiv preprint, arXiv:1412.6980 (2014).
- K. Kilgour, M. Zuluaga, D. Roblek, and M. Sharifi, "Fréchet audio distance: a metric for evaluating music enhancement algorithms," arXiv preprint, arXiv:1812.08466 (2018).
- S. Hershey, S. Chaudhuri, D. P. W. Ellis, J. F. Gemmeke, A. Jansen, R. C. Moore, M. Plakal, D. Platt, R. A. Saurous, B. Seybold, M. Slaney, R. J. Weiss, and K. Wilson, "CNN architectures for large-scale audio classification," Proc. IEEE ICASSP, 131-135 (2017).