Fast speaker adaptation using extended diagonal linear transformation for deep neural networks

  • Kim, Donghyun (SW.Contents Research Laboratory, Electronics and Telecommunications Research Institute)
  • Kim, Sanghun (SW.Contents Research Laboratory, Electronics and Telecommunications Research Institute)
  • Received : 2017.08.11
  • Accepted : 2018.06.25
  • Published : 2019.02.12

Abstract

This paper explores new techniques based on a hidden-layer linear transformation for fast speaker adaptation in deep neural networks (DNNs). Conventional methods that adapt a full affine transformation are inefficient because they require a relatively large number of parameters. Methods that employ singular value decomposition (SVD) are effective at reducing the number of adaptation parameters, but the matrix decomposition is computationally expensive for online services. We propose an extended diagonal linear transformation method that minimizes the adaptation parameters without SVD and improves performance on tasks that require only a small degree of adaptation. On Korean large vocabulary continuous speech recognition (LVCSR) tasks, the proposed method shows significant improvements, with error-reduction rates of 8.4% and 17.1% when adapting on five and 50 conversational sentences, respectively. Compared with SVD-based adaptation methods, it achieves higher recognition performance with fewer parameters.
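The abstract does not spell out the exact form of the extended diagonal transformation, so the following is only a minimal NumPy sketch of the general idea: a small, trainable linear transform inserted after a frozen hidden layer, with far fewer parameters than a full affine matrix. It assumes, purely for illustration, that the "extension" adds a tridiagonal band around the main diagonal; the names (adapted_hidden, scale, lower, upper) and the layer size are hypothetical, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Speaker-independent hidden layer: a full affine transform has d*d + d
# parameters, which is what makes adapting it per speaker expensive.
d = 512
W = rng.standard_normal((d, d)) * 0.01  # pretrained weights (kept frozen)
b = np.zeros(d)                         # pretrained bias (kept frozen)

# Hypothetical adaptation transform inserted after the hidden activation.
# A plain diagonal linear transform adapts only 2*d parameters (scale and
# bias); the assumed "extended" variant below also adapts the two
# neighboring off-diagonals (a tridiagonal band), roughly 4*d parameters
# in total, still far fewer than a full d x d affine transform or the
# low-rank factors produced by SVD-based restructuring.
scale = np.ones(d)        # main diagonal, initialized to the identity
lower = np.zeros(d - 1)   # first sub-diagonal (assumed extension)
upper = np.zeros(d - 1)   # first super-diagonal (assumed extension)
bias = np.zeros(d)

def adapted_hidden(x):
    """Frozen hidden layer followed by the speaker-adaptation transform."""
    h = np.maximum(W @ x + b, 0.0)  # ReLU hidden activation
    y = scale * h + bias            # diagonal scale-and-shift
    y[1:] += lower * h[:-1]         # sub-diagonal contribution
    y[:-1] += upper * h[1:]         # super-diagonal contribution
    return y

x = rng.standard_normal(d)
print(adapted_hidden(x).shape)  # -> (512,)
print("adaptation parameters:",
      scale.size + lower.size + upper.size + bias.size)
```

Because the transform is initialized to the identity, the network's output is unchanged before adaptation; per-speaker training would then update only the band parameters and bias by backpropagation while W and b stay frozen, which is what keeps the adaptation footprint small.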

