References
- Baevski, A., Zhou, Y., Mohamed, A., & Auli, M. (2020, December). wav2vec 2.0: A framework for self-supervised learning of speech representations. Proceedings of the Advances in Neural Information Processing Systems. Online Conference.
- Bakhturina, E., Lavrukhin, V., Ginsburg, B., & Zhang, Y. (2021). Hi-Fi multi-speaker English TTS dataset. Retrieved from https://doi.org/10.48550/arXiv.2104.01497
- Buduma, N., Buduma, N., & Papa, J. (2022). Fundamentals of deep learning. Sebastopol, CA: O'Reilly Media.
- Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. Retrieved from https://doi.org/10.48550/arXiv.1810.04805
- Graves, A. (2013). Generating sequences with recurrent neural networks. Retrieved from https://doi.org/10.48550/arXiv.1308.0850
- Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8), 1735-1780. https://doi.org/10.1162/neco.1997.9.8.1735
- Hsu, W. N., Bolte, B., Tsai, Y. H. H., Lakhotia, K., Salakhutdinov, R., & Mohamed, A. (2021). HuBERT: Self-supervised speech representation learning by masked prediction of hidden units. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 29, 3451-3460. https://doi.org/10.1109/TASLP.2021.3122291
- Ioffe, S., & Szegedy, C. (2015, June). Batch normalization: Accelerating deep network training by reducing internal covariate shift. Proceedings of the International Conference on Machine Learning (pp. 448-456). Lille, France.
- Keskar, N. S., Mudigere, D., Nocedal, J., Smelyanskiy, M., & Tang, P. T. P. (2016). On large-batch training for deep learning: Generalization gap and sharp minima. Retrieved from https://doi.org/10.48550/arXiv.1609.04836
- Kim, J., Kim, S., Kong, J., & Yoon, S. (2020, December). Glow-TTS: A generative flow for text-to-speech via monotonic alignment search. Proceedings of the Advances in Neural Information Processing Systems. Online Conference.
- Kim, J., Kong, J., & Son, J. (2021, July). Conditional variational autoencoder with adversarial learning for end-to-end text-to-speech. Proceedings of the International Conference on Machine Learning (pp. 5530-5540). Online Conference.
- Kingma, D. P., & Ba, J. (2014). Adam: A method for stochastic optimization. Retrieved from https://doi.org/10.48550/arXiv.1412.6980
- Kingma, D. P., & Dhariwal, P. (2018, December). Glow: Generative flow with invertible 1x1 convolutions. Proceedings of the Advances in Neural Information Processing Systems. Montreal, QC, Canada.
- Kong, J., Kim, J., & Bae, J. (2020, December). HiFi-GAN: Generative adversarial networks for efficient and high fidelity speech synthesis. Proceedings of the Advances in Neural Information Processing Systems. Online Conference.
- Logan, R. L., IV, Balazevic, I., Wallace, E., Petroni, F., Singh, S., & Riedel, S. (2021). Cutting down on prompts and parameters: Simple few-shot learning with language models. Retrieved from https://doi.org/10.48550/arXiv.2106.13353
- McFee, B., Raffel, C., Liang, D., Ellis, D. P. W., McVicar, M., Battenberg, E., & Nieto, O. (2015, July). librosa: Audio and music signal analysis in Python. Proceedings of the 14th Python in Science Conference. Austin, TX.
- Palatucci, M., Pomerleau, D., Hinton, G. E., & Mitchell, T. M. (2009, December). Zero-shot learning with semantic output codes. Proceedings of the Advances in Neural Information Processing Systems. Vancouver, BC, Canada.
- Qian, K., Zhang, Y., Chang, S., Yang, X., & Hasegawa-Johnson, M. (2019, June). AutoVC: Zero-shot voice style transfer with only autoencoder loss. Proceedings of the International Conference on Machine Learning (pp. 5210-5219). Long Beach, CA.
- Rezende, D., & Mohamed, S. (2015, July). Variational inference with normalizing flows. Proceedings of the International Conference on Machine Learning (pp. 1530-1538). Lille, France.
- RVC-Project. (2023). RVC-Project/Retrieval-based-Voice-Conversion-WebUI. GitHub. Retrieved from https://github.com/RVC-Project/Retrieval-based-Voice-Conversion-WebUI
- Schneider, S., Baevski, A., Collobert, R., & Auli, M. (2019). wav2vec: Unsupervised pre-training for speech recognition. Retrieved from https://doi.org/10.48550/arXiv.1904.05862
- Schuster, M., & Paliwal, K. K. (1997). Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing, 45(11), 2673-2681. https://doi.org/10.1109/78.650093
- Shen, K., Ju, Z., Tan, X., Liu, Y., Leng, Y., He, L., Qin, T., ... Bian, J. (2023). NaturalSpeech 2: Latent diffusion models are natural and zero-shot speech and singing synthesizers. Retrieved from https://doi.org/10.48550/arXiv.2304.09116
- Sisman, B., Yamagishi, J., King, S., & Li, H. (2021). An overview of voice conversion and its challenges: From statistical modeling to deep learning. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 29, 132-157. https://doi.org/10.1109/TASLP.2020.3038524
- Sohl-Dickstein, J., Weiss, E., Maheswaranathan, N., & Ganguli, S. (2015, July). Deep unsupervised learning using nonequilibrium thermodynamics. Proceedings of the International Conference on Machine Learning (pp. 2256-2265). Lille, France.
- svc-develop-team. (2023). svc-develop-team/so-vits-svc. GitHub. Retrieved from https://github.com/svc-develop-team/so-vits-svc
- van Niekerk, B., Carbonneau, M. A., Zaidi, J., Baas, M., Seute, H., & Kamper, H. (2022, May). A comparison of discrete and soft speech units for improved voice conversion. Proceedings of the 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (pp. 6562-6566). Singapore, Singapore.
- Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., ... Polosukhin, I. (2017, December). Attention is all you need. Proceedings of the Advances in Neural Information Processing Systems. Long Beach, CA.
- Wester, M., Wu, Z., & Yamagishi, J. (2016, September). Analysis of the voice conversion challenge 2016 evaluation results. Proceedings of Interspeech (pp. 1637-1641). San Francisco, CA.
- Yamagishi, J., Veaux, C., & MacDonald, K. (2019). CSTR VCTK corpus: English multi-speaker corpus for CSTR voice cloning toolkit. University of Edinburgh, The Centre for Speech Technology Research (CSTR).