Feature Fusion Based Audio-Visual Speaker Identification Using Hidden Markov Model under Different Lighting Variations

DOI: 10.1155/2014/831830

Abstract:

The aim of this paper is to propose a feature fusion based Audio-Visual Speaker Identification (AVSI) system that operates under varied illumination conditions. Among the different fusion strategies, feature-level fusion is used for the proposed AVSI system, with a Hidden Markov Model (HMM) employed for learning and classification. Since the feature set contains richer information about the raw biometric data than any other fusion level, integration at the feature level is expected to provide better authentication results. Mel Frequency Cepstral Coefficients (MFCCs) and Linear Prediction Cepstral Coefficients (LPCCs) are combined to form the audio feature vectors, and Active Shape Model (ASM) based appearance and shape facial features are concatenated to form the visual feature vectors. The audio and visual feature vectors are then fused at the feature level, and Principal Component Analysis (PCA) is applied to reduce their dimensionality. The VALID audio-visual database, which covers four different illumination levels, is used to measure the performance of the proposed system. Experimental results show the effectiveness of the proposed audio-visual speaker identification system for various combinations of audio and visual features.

1. Introduction

Human speaker identification is bimodal in nature [1, 2]. In a face-to-face conversation, we listen to what others say and at the same time observe their lip movements, facial expressions, and gestures. In particular, when listening is hampered by environmental noise, visual information plays an important role in speech understanding [3]. Even in a clean environment, speech recognition performance improves when the talking face is visible [4]. In general, an audio-only speaker identification system is not adequate to meet the variety of user requirements for person identification. The AVSI system promises to alleviate some of the drawbacks encountered by audio-only identification. Visual speech information can play an important role in improving natural and robust human-computer interaction [5, 6]. Indeed, various important human-computer interaction components, such as speaker identification, verification [7], localization [8], speech event detection [9], speech signal separation [10], coding [11], video indexing and retrieval [12], and text-to-speech [13], have been shown to benefit from the visual channel [14]. An audio-visual identification system can significantly improve the performance of …
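To make the processing pipeline concrete, the following is a minimal Python sketch of the audio front end, assuming the third-party librosa library for MFCC and LPC extraction; the paper does not name an implementation, and the file name, sampling rate, frame sizes, and coefficient orders here are illustrative. The LPC-to-LPCC conversion uses the standard cepstral recursion for an all-pole model.

import numpy as np
import librosa

def lpc_to_lpcc(a, n_ceps):
    """Convert LPC polynomial coefficients a (with a[0] == 1) to cepstral coefficients."""
    p = len(a) - 1
    c = np.zeros(n_ceps + 1)
    for n in range(1, n_ceps + 1):
        acc = -a[n] if n <= p else 0.0
        for k in range(1, n):
            if n - k <= p:
                acc -= (k / n) * c[k] * a[n - k]
        c[n] = acc
    return c[1:]

# "utterance.wav" is a hypothetical input file, not from the paper.
y, sr = librosa.load("utterance.wav", sr=16000)

# 12 MFCCs per frame, using 25 ms windows with a 10 ms hop.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=12, n_fft=400, hop_length=160).T

# 12 LPCCs per frame from 12th-order LPC on the same framing (no padding).
frames = librosa.util.frame(y, frame_length=400, hop_length=160)
lpcc = np.array([lpc_to_lpcc(librosa.lpc(np.ascontiguousarray(f), order=12), 12)
                 for f in frames.T])

# Trim to a common frame count and concatenate into the audio feature vectors.
n = min(len(mfcc), len(lpcc))
audio_feats = np.hstack([mfcc[:n], lpcc[:n]])  # (n_frames, 24)

The trim at the end is a crude stand-in for computing both streams over exactly the same frame grid, which is what combining MFCCs and LPCCs per frame presupposes.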
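The feature-level fusion and PCA steps then reduce to a projection followed by a concatenation. This sketch continues from the audio front end above (reusing audio_feats), with a random placeholder standing in for the per-frame ASM shape and appearance parameters; whether PCA is applied per stream or to the joint vector is not fully specified in the abstract, so per-stream reduction is shown here, and all dimensions are illustrative.

import numpy as np

def pca_reduce(X, n_components):
    """Project row-vector samples X onto their top n_components principal components."""
    X_centered = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X_centered, full_matrices=False)  # rows of Vt are principal directions
    return X_centered @ Vt[:n_components].T

# Placeholder visual stream: one ASM shape+appearance vector per frame, assumed
# already interpolated to the audio frame rate; 40 is an illustrative dimension.
visual_feats = np.random.randn(len(audio_feats), 40)

# Reduce each stream with PCA, then fuse at the feature level by concatenation.
audio_red = pca_reduce(audio_feats, n_components=12)
visual_red = pca_reduce(visual_feats, n_components=12)
fused = np.hstack([audio_red, visual_red])  # (n_frames, 24) fused feature vectors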
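Finally, a sketch of the HMM learning and identification stage, assuming one Gaussian HMM per enrolled speaker trained on that speaker's fused feature sequences, with identification picking the model that assigns the test sequence the highest log-likelihood. The third-party hmmlearn package is used as a stand-in, since the paper does not specify an implementation, and the state count and toy data are illustrative.

import numpy as np
from hmmlearn.hmm import GaussianHMM  # stand-in for the paper's HMM implementation

def train_speaker_models(train_data, n_states=5):
    """train_data maps speaker_id -> list of (T_i, D) fused feature sequences."""
    models = {}
    for speaker, sequences in train_data.items():
        X = np.vstack(sequences)               # all frames, stacked
        lengths = [len(s) for s in sequences]  # frame count per sequence
        model = GaussianHMM(n_components=n_states, covariance_type="diag", n_iter=50)
        model.fit(X, lengths)                  # Baum-Welch training
        models[speaker] = model
    return models

def identify(models, test_sequence):
    """Return the speaker whose HMM scores the test sequence highest."""
    return max(models, key=lambda spk: models[spk].score(test_sequence))

# Toy usage with random data in place of real fused features.
rng = np.random.default_rng(0)
train_data = {f"spk{i}": [rng.normal(loc=i, size=(100, 20)) for _ in range(3)]
              for i in range(2)}
models = train_speaker_models(train_data)
print(identify(models, rng.normal(loc=1, size=(80, 20))))  # expected: "spk1"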

References

[1]  D. G. Stork and M. E. Hennecke, Eds., Speechreading by Humans and Machines, Springer, Berlin, Germany, 1996.
[2]  R. Campbell, B. Dodd, and D. Burnham, Eds., Hearing by Eye II, Psychology Press, Hove, UK, 1998.
[3]  L. A. Ross, D. Saint-Amour, V. M. Leavitt, D. C. Javitt, and J. J. Foxe, “Do you see what I am saying? Exploring visual enhancement of speech comprehension in noisy environments,” Cerebral Cortex, vol. 17, no. 5, pp. 1147–1153, 2007.
[4]  P. Arnold and F. Hill, “Bisensory augmentation: a speechreading advantage when speech is clearly audible and intact,” British Journal of Psychology, vol. 92, no. 2, pp. 339–355, 2001.
[5]  S. Dupont and J. Luettin, “Audio-visual speech modeling for continuous speech recognition,” IEEE Transactions on Multimedia, vol. 2, no. 3, pp. 141–151, 2000.
[6]  G. Potamianos, J. Luettin, and C. Neti, “Hierarchical discriminant features for audio-visual LVCSR,” in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, pp. 165–168, May 2001.
[7]  C. C. Chibelushi, F. Deravi, and J. S. D. Mason, “A review of speech-based bimodal recognition,” IEEE Transactions on Multimedia, vol. 4, no. 1, pp. 23–37, 2002.
[8]  D. N. Zotkin, R. Duraiswami, and L. S. Davis, “Joint audio-visual tracking using particle filters,” EURASIP Journal on Applied Signal Processing, vol. 2002, no. 11, pp. 1154–1164, 2002.
[9]  P. de Cuetos, C. Neti, and A. W. Senior, “Audio-visual intent-to-speak detection for human-computer interaction,” in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, pp. 2373–2376, Istanbul, Turkey, June 2000.
[10]  D. Sodoyer, J.-L. Schwartz, L. Girin, J. Klinkisch, and C. Jutten, “Separation of audio-visual speech sources: a new approach exploiting the audio-visual coherence of speech stimuli,” EURASIP Journal on Applied Signal Processing, vol. 2002, no. 11, pp. 1165–1173, 2002.
[11]  E. Foucher, L. Girin, and G. Feng, “Audiovisual speech coder: using vector quantization to exploit the audio/video correlation,” in Proceedings of the Conference on Audio-Visual Speech Processing, pp. 67–71, Terrigal, Australia, December 1998.
[12]  J. Huang, Z. Liu, Y. Wang, Y. Chen, and E. Wong, “Integration of multimodal features for video scene classification based on HMM,” in Proceedings of the IEEE 3rd Workshop on Multimedia Signal Processing, pp. 53–58, Copenhagen, Denmark, September 1999.
[13]  E. Cosatto and H. P. Graf, “Photo-realistic talking-heads from image samples,” IEEE Transactions on Multimedia, vol. 2, no. 3, pp. 152–163, 2000.
[14]  G. Potamianos, C. Neti, and S. Deligne, “Joint audio-visual speech processing for recognition and enhancement,” in Proceedings of the Auditory-Visual Speech Processing Tutorial and Research Workshop (AVSP '03), pp. 95–104, Saint-Jorioz, France, September 2003.
[15]  A. Ross and R. Govindarajan, “Feature level fusion using hand and face biometrics,” in Biometric Technology for Human Identification II, vol. 5779 of Proceedings of SPIE, pp. 196–204, Orlando, Fla, USA, March 2005.
[16]  A. Rogozan and P. Deléglise, “Adaptive fusion of acoustic and visual sources for automatic speech recognition,” Speech Communication, vol. 26, no. 1-2, pp. 149–161, 1998.
[17]  G. Potamianos, C. Neti, G. Gravier, A. Garg, and A. W. Senior, “Recent advances in the automatic recognition of audiovisual speech,” Proceedings of the IEEE, vol. 91, no. 9, pp. 1306–1326, 2003.
[18]  J.-S. Lee and C. H. Park, “Adaptive decision fusion for audio-visual speech recognition,” in Speech Recognition, Technologies and Applications, I-Tech, Vienna, Austria, 2008.
[19]  K. Nandakumar, Y. Chen, S. C. Dass, and A. K. Jain, “Likelihood ratio-based biometric score fusion,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 30, no. 2, pp. 342–347, 2008.
[20]  M. R. Islam and M. F. Rahman, “Likelihood ratio based score fusion for audio-visual speaker identification in challenging environment,” International Journal of Computer Applications, vol. 6, no. 7, pp. 6–11, 2010.
[21]  L. Girin, G. Feng, and J. L. Schwartz, “Fusion of auditory and visual information for noisy speech enhancement: a preliminary study of vowel transitions,” in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '98), pp. 1005–1008, 1998.
[22]  S. Doclo and M. Moonen, “On the output SNR of the speech-distortion weighted multichannel Wiener filter,” IEEE Signal Processing Letters, vol. 12, no. 12, pp. 809–811, 2005.
[23]  R. E. A. C. Paley and N. Wiener, Fourier Transforms in the Complex Domain, American Mathematical Society, Providence, RI, USA, 1934.
[24]  K. Kitayama, M. Goto, K. Itou, and T. Kobayashi, “Speech starter: noise-robust endpoint detection by using filled pauses,” in Proceedings of Eurospeech, pp. 1237–1240, Geneva, Switzerland, 2003.
[25]  S. E. Bou-Ghazale and K. Assaleh, “A robust endpoint detection of speech for noisy environments with application to automatic speech recognition,” in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '02), vol. 4, pp. 3808–3811, May 2002.
[26]  J. W. Picone, “Signal modeling techniques in speech recognition,” Proceedings of the IEEE, vol. 81, no. 9, pp. 1215–1247, 1993.
[27]  L. P. Cordella, P. Foggia, C. Sansone, and M. Vento, “A real-time text-independent speaker identification system,” in Proceedings of the 12th International Conference on Image Analysis and Processing, pp. 632–637, IEEE Computer Society Press, Mantova, Italy, September 2003.
[28]  F. J. Harris, “On the use of windows for harmonic analysis with the discrete Fourier transform,” Proceedings of the IEEE, vol. 66, no. 1, pp. 51–83, 1978.
[29]  S. Milborrow, Locating facial features with active shape models [dissertation], Faculty of Engineering, University of Cape Town, Cape Town, South Africa, 2007.
[30]  R. Herpers, G. Verghese, K. Derpanis, and R. McCready, “Detection and tracking of faces in real environments,” in Proceedings of the IEEE International Workshop on Recognition, Analysis, and Tracking of Faces and Gestures in Real-Time Systems, pp. 96–104, Corfu, Greece, 1999.
[31]  E. Hjelmås and B. K. Low, “Face detection: a survey,” Computer Vision and Image Understanding, vol. 83, no. 3, pp. 236–274, 2001.
[32]  R. C. Gonzalez and R. E. Woods, Digital Image Processing, Addison-Wesley, 2002.
[33]  A. Ross and R. Govindarajan, “Feature level fusion using hand and face biometrics,” in Biometric Technology for Human Identification II, vol. 5779 of Proceedings of SPIE, pp. 196–204, Orlando, Fla, USA, March 2005.
[34]  J.-S. Lee and C. H. Park, Speech Recognition, Technologies and Applications, I-Tech, Vienna, Austria, 2008.
[35]  P. A. Devijver, “Baum's forward-backward algorithm revisited,” Pattern Recognition Letters, vol. 3, no. 6, pp. 369–373, 1985.
[36]  N. A. Fox, B. A. O'Mullane, and R. B. Reilly, “VALID: a new practical audio-visual database, and comparative results,” in Audio- and Video-Based Biometric Person Authentication, Lecture Notes in Computer Science, pp. 777–786, 2005.
