Noise-Robust Audio-Visual Speech Recognition

  • Jong-Seok Lee (School of Electrical Engineering and Computer Science, KAIST)
  • Cheol Hoon Park (School of Electrical Engineering and Computer Science, KAIST)
  • Published: 2007.09.25

Abstract

Keywords

References

  1. H. McGurk and J. MacDonald, 'Hearing lips and seeing voices,' Nature, vol. 264, pp. 746-748, Dec. 1976 https://doi.org/10.1038/264746a0
  2. A. Q. Summerfield, 'Some preliminaries to a comprehensive account of audio-visual speech perception,' in B. Dodd and R. Campbell, eds., Hearing by Eye: The Psychology of Lip-reading, pp. 3-51, Lawrence Erlbaum, London, 1987
  3. L. A. Ross, D. Saint-Amour, V. M. Leavitt, D. C. Javitt, and J. J. Foxe, 'Do you see what I am saying? Exploring visual enhancement of speech comprehension in noisy environments,' Cerebral Cortex, vol. 17, no. 5, pp. 1147-1153, 2007 https://doi.org/10.1093/cercor/bhl024
  4. R. M. Stern, B. Raj, and P. J. Moreno, 'Compensation for environmental degradation in automatic speech recognition,' in Proc. ESCA-NATO Tutorial and Research Workshop on Robust Speech Recognition using Unknown Communication Channels, Pont-a-Mousson, France, pp. 33-42, Apr. 1997
  5. H. Hermansky and N. Morgan, 'RASTA processing of speech,' IEEE Trans. Speech and Audio Processing, vol. 2, no. 4, pp. 578-589, 1994 https://doi.org/10.1109/89.326616
  6. L. R. Rabiner and B.-H. Juang, Fundamentals of Speech Recognition, Prentice-Hall, Englewood Cliffs, New Jersey, 1993
  7. H. P. Graf, E. Cosatto, and G. Potamianos, 'Robust recognition of faces and facial features with a multi-modal system,' in Proc. Int. Conf. Systems, Man and Cybernetics, pp. 2034-2039, 1997
  8. M. N. Kaynak, Q. Zhi, A. D. Cheok, K. Sengupta, Z. Jian, and K. C. Chung, 'Lip geometric features for human-computer interaction using bimodal speech recognition: comparison and analysis,' Speech Communication, vol. 43, no. 1-2, pp. 1-16, Jan. 2004 https://doi.org/10.1016/j.specom.2004.01.003
  9. T. Coianiz, L. Torresani, and B. Caprile, '2D deformable models for visual speech analysis,' in D. G. Stork and M. E. Hennecke, eds., Speechreading by Humans and Machines: Models, Systems and Applications, pp. 391-398, Springer-Verlag, Berlin, Germany, 1996
  10. S. Dupont and J. Luettin, 'Audio-visual speech modeling for continuous speech recognition,' IEEE Trans. Multimedia, vol. 2, no. 3, pp. 141-151, Sept. 2000 https://doi.org/10.1109/6046.865479
  11. G. Potamianos, H. P. Graf, and E. Cosatto, 'An image transform approach for HMM based automatic lipreading,' in Proc. Int. Conf. Image Processing, vol. 3, Chicago, pp. 173-177, 1998
  12. M. S. Gray, J. R. Movellan, and T. J. Sejnowski, 'Dynamic features for visual speechreading: a systematic comparison,' Advances in Neural Information Processing Systems, vol. 9, pp. 751-757, 1997
  13. J.-S. Lee, S.-H. Shim, S.-Y. Kim, and C. H. Park, 'Bimodal speech recognition using robust feature extraction of lip movement under uncontrolled illumination conditions,' Telecommunications Review, vol. 14, no. 1, pp. 123-134, Feb. 2004
  14. T. J. Hazen, 'Visual model structures and synchrony constraints for audio-visual speech recognition,' IEEE Trans. Audio, Speech, Language Processing, vol. 14, no. 3, pp. 1082-1089, May 2006 https://doi.org/10.1109/TSA.2005.857572
  15. C. Benoit, 'The intrinsic bimodality of speech communication and the synthesis of talking faces,' in M. M. Taylor, F. Nel, and D. Bouwhuis, eds., The Structure of Multimodal Dialogue II, John Benjamins, Amsterdam, The Netherlands, pp. 485-502, 2000
  16. B. Conrey and D. B. Pisoni, 'Auditory-visual speech perception and synchrony detection for speech and nonspeech signals,' Journal of the Acoustical Society of America, vol. 119, no. 6, pp. 4065-4073, June 2006 https://doi.org/10.1121/1.2195091
  17. S. Tamura, K. Iwano, and S. Furui, 'A stream-weight optimization method for multi-stream HMMs based on likelihood value normalization,' in Proc. Int. Conf. Acoustics, Speech and Signal Processing, vol. 1, pp. 469-472, 2005
  18. J.-S. Lee and C. H. Park, 'Information integration for audio-visual speech recognition: comparison of reliability measures and an integration method using neural networks,' Telecommunications Review, vol. 17, no. 3, pp. 538-550, June 2007
  19. S. Nakamura, 'Statistical multimodal integration for audio-visual speech processing,' IEEE Trans. Neural Networks, vol. 13, no. 4, pp. 854-866, Jul. 2002 https://doi.org/10.1109/TNN.2002.1021886
  20. S. M. Chu and T. S. Huang, 'Audio-visual speech modeling using coupled hidden Markov models,' in Proc. Int. Conf. Acoustics, Speech and Signal Processing, vol. 2, Orlando, FL, pp. 2009-2012, May 2002
  21. S. Bengio, 'Multimodal speech processing using asynchronous hidden Markov models,' Information Fusion, vol. 5, pp. 81-89, 2004 https://doi.org/10.1016/j.inffus.2003.04.001
  22. A. V. Nefian, L. Liang, X. Pi, X. Liu, and K. Murphy, 'Dynamic Bayesian networks for audio-visual speech recognition,' EURASIP J. Applied Signal Processing, vol. 11, pp. 1-15, 2002
  23. G. Potamianos, C. Neti, G. Gravier, A. Garg, and A. W. Senior, 'Recent advances in the automatic recognition of audiovisual speech,' Proc. IEEE, vol. 91, no. 9, pp. 1306-1326, Sept. 2003 https://doi.org/10.1109/JPROC.2003.817150
  24. S. Pigeon and L. Vandendorpe, 'The M2VTS multimodal face database,' in Proc. Int. Conf. Audio- and Video-based Biometric Person Authentication, pp. 403-409, 1997 https://doi.org/10.1007/BFb0016021
  25. K. Messer, J. Matas, J. Kittler, J. Luettin, and G. Maitre, 'XM2VTS: the extended M2VTS database,' in Proc. Int. Conf. Audio- and Video-based Biometric Person Authentication, pp. 72-76, 1999
  26. http://voice.etri.re.kr/DBSearch/Voice.asp