Speaker Tracking Using Eigendecomposition and an Index Tree of Reference Models

  • Received : 2010.11.15
  • Accepted : 2011.04.22
  • Published : 2011.10.31

Abstract

This paper addresses online speaker tracking for telephone conversations and broadcast news. Online operation imposes limitations on the tracking strategy, such as insufficient data, so a reliable approach is needed to compensate for this shortage. In the proposed framework, a set of reference speaker models is used as side information to facilitate online tracking. To improve indexing accuracy, adaptation approaches in the eigenvoice decomposition space are proposed. We believe that eigenvoice adaptation helps to embed the speaker space in the models and hence improves the generality of the selected speaker models. An index structure over the reference models is also proposed to speed up the search in the model space. The framework is evaluated on the 2002 Rich Transcription Broadcast News and Conversational Telephone Speech corpus as well as on a synthetic dataset. The indexing errors on telephone conversations, broadcast news, and the synthetic dataset are 8.77%, 9.36%, and 12.4%, respectively. Using the index tree structure, the run time of the framework is improved by 22%.
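The idea of indexing reference models to shorten the search can be illustrated with a minimal sketch. The abstract does not specify the tree layout or the distance measure, so everything here is an assumption made for illustration: a two-level tree, models represented as plain mean vectors, Euclidean distance, and a hand-made grouping. The names `build_index`, `search`, and the `spk_*` labels are hypothetical.

```python
import math

def dist(a, b):
    """Euclidean distance between two equal-length vectors (assumed measure)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def centroid(vectors):
    """Component-wise mean of a list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def build_index(models, groups):
    """Build a two-level index: one (centroid, member names) node per group.

    models: dict mapping a model name to its vector.
    groups: list of lists of model names (a hypothetical pre-clustering).
    """
    return [(centroid([models[name] for name in g]), g) for g in groups]

def search(index, models, query):
    """Greedy descent: pick the nearest group, then the nearest model in it.

    Returns (best model name, number of distance computations).
    """
    # Level 1: compare the query against each group centroid.
    _, members = min(index, key=lambda node: dist(node[0], query))
    # Level 2: compare the query only against models in the chosen group.
    best = min(members, key=lambda name: dist(models[name], query))
    return best, len(index) + len(members)

# Toy reference models (illustrative values only, not data from the paper).
models = {
    "spk_a": [0.0, 0.0], "spk_b": [1.0, 0.0], "spk_c": [0.0, 1.0],
    "spk_d": [10.0, 10.0], "spk_e": [11.0, 10.0], "spk_f": [10.0, 11.0],
}
index = build_index(models, [["spk_a", "spk_b", "spk_c"],
                             ["spk_d", "spk_e", "spk_f"]])

best, comparisons = search(index, models, [10.2, 9.9])
print(best, comparisons)  # spk_d 5 (a flat scan would need 6 comparisons)
```

In this toy case the tree needs 5 distance computations instead of 6. A greedy descent of this kind is approximate and can miss the true nearest model when the query lies near a group boundary. The 22% run-time improvement quoted in the abstract is the paper's own measurement and is not something this sketch reproduces.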
