Estimating a User's Internal State before the First Input Utterance

DOI: 10.1155/2012/865362


Abstract:

This paper describes a method for estimating the internal state of a user of a spoken dialog system before his/her first input utterance. When actually using a dialog-based system, the user is often perplexed by the prompt. A typical system provides more detailed information to a user who is taking time to make an input utterance, but such assistance is a nuisance if the user is merely considering how to answer the prompt. To respond appropriately, the spoken dialog system should be able to consider the user's internal state before the user's input. Conventional studies on user modeling have focused on the linguistic information of the utterance for estimating the user's internal state, but this approach cannot estimate the user's state until the end of the user's first utterance. We therefore focused on the user's nonverbal behavior, such as fillers, silence, and head movement, before the beginning of the input utterance. The experimental data were collected in a Wizard-of-Oz setting, and the labels were decided by five evaluators. Finally, we conducted a discrimination experiment with a user model trained on the combined features. On the three-class discrimination task, we obtained about 85% accuracy in an open test.

1. Introduction

Speech is the most basic medium of human-human communication and is expected to be one of the main modalities of flexible man-machine interaction, together with other intuitive interfaces, rather than traditional text-based ones. One major topic in speech-based interfaces is the spoken dialog system. Studies on spoken dialog systems have attempted to introduce a user model, which represents the user's internal states, to make the dialog more flexible. The user's internal states cover various aspects of the user, such as belief [1], preference [2], emotion [3], and familiarity with the system [4–6]. These aspects can also be categorized by their persistence: the user's knowledge and preferences are persistent, so they can be used for personalizing the dialog system [7]. Other internal states, such as emotion or belief, are transient and are therefore used to make a dialog more natural and smooth; these kinds of internal state should be estimated session by session. In this paper, we focus on the latter, transient states. These internal states are estimated from the verbal and nonverbal information in the interaction between the user and the system. In this work, we consider a system-initiative dialog system that presents a prompt message at the beginning of a session. In such a system, a session between the …

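The abstract describes a three-class discrimination experiment with a user model trained on combined nonverbal features, and the reference list points to the LIBSVM tools [36, 37]. As a minimal, hypothetical sketch of that step, the Python snippet below trains an SVM on placeholder pre-utterance features (filler count, silence length, head movement); the feature columns, values, and labels are assumptions for illustration, not the authors' data or code.

    # Hypothetical sketch of the three-class discrimination step: an SVM
    # (the paper cites the LIBSVM tools [36, 37]) trained on combined
    # nonverbal features observed between the system prompt and the user's
    # first utterance. All feature columns, values, and labels below are
    # illustrative placeholders, not the authors' actual data.
    import numpy as np
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    # Each row: assumed pre-utterance features, e.g. filler count,
    # silence duration in seconds, and amount of head movement.
    X = np.array([
        [0, 0.8, 0.1],   # answers promptly, little hesitation
        [1, 2.0, 0.4],
        [0, 1.1, 0.2],
        [2, 4.5, 0.9],   # fillers, long silence, head movement
        [3, 6.0, 1.2],
        [1, 3.2, 0.7],
    ])
    # Three internal-state classes assigned by evaluators (placeholder labels).
    y = np.array([0, 1, 0, 2, 2, 1])

    # RBF-kernel SVM with feature scaling, following the LIBSVM practical guide [36].
    model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
    model.fit(X, y)

    # Estimate the internal state for a new pre-utterance observation.
    print(model.predict([[2, 5.0, 1.0]]))

In the actual study, such features would presumably be extracted with components like the filled-pause detector [32] and the face detection and tracking methods [33, 34] cited in the reference list, and the model evaluated in an open test as reported in the abstract.
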
References

[1]  A. Kobsa, “User modeling in dialog systems: potentials and hazards,” AI & Society, vol. 4, no. 3, pp. 214–231, 1990.
[2]  A. N. Pargellis, H. K. J. Kuo, and C. H. Lee, “An automatic dialogue generation platform for personalized dialogue applications,” Speech Communication, vol. 42, no. 3-4, pp. 329–351, 2004.
[3]  R. Gajšek, V. Štruc, S. Dobrišek, and F. Mihelič, “Emotion recognition using linear transformations in combination with video,” in Proceedings of the Interspeech, pp. 1967–1970, 2009.
[4]  K. Jokinen, “Adaptation and user expertise modelling in AthosMail,” Universal Access in the Information Society, vol. 4, no. 4, pp. 374–392, 2004.
[5]  F. de Rosis, N. Novielli, V. Carofiglio, A. Cavalluzzi, and B. De Carolis, “User modeling and adaptation in health promotion dialogs with an animated character,” Journal of Biomedical Informatics, vol. 39, no. 5, pp. 514–531, 2006.
[6]  K. Komatani, S. Ueno, T. Kawahara, and H. G. Okuno, “Flexible guidance generation using user model in spoken dialogue systems,” in Proceedings of the COLING, pp. 256–263, 2003.
[7]  C. A. Thompson, M. H. Göker, and P. Langley, “A personalized system for conversational recommendations,” Journal of Artificial Intelligence Research, vol. 21, pp. 393–428, 2004.
[8]  S. Young, M. Gašić, S. Keizer et al., “The Hidden Information State model: a practical framework for POMDP-based spoken dialogue management,” Computer Speech and Language, vol. 24, no. 2, pp. 150–174, 2010.
[9]  S. Hara, N. Kitaoka, and K. Takeda, “Estimation method of user satisfaction using n-gram-based dialog history model for spoken dialog system,” in Proceedings of the LREC, pp. 78–83, 2010.
[10]  O. Lemon and I. Konstas, “User simulations for context-sensitive speech recognition in spoken dialogue systems,” in Proceedings of the EACL, pp. 505–513, 2009.
[11]  M. Rickert, M. E. Foster, M. Giuliani, G. Panin, T. By, and A. Knoll, “Integrating language, vision and action for human robot dialog systems,” in Universal Access in Human-Computer Interaction. Ambient Interaction, pp. 987–995, 2007.
[12]  C. Breazeal, C. D. Kidd, A. L. Thomaz, G. Hoffman, and M. Berlin, “Effects of nonverbal communication on efficiency and robustness in human-robot teamwork,” in Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS '05), pp. 383–388, August 2005.
[13]  T. Yonezawa, H. Yamazoe, A. Utsumi, and S. Abe, “Evaluating crossmodal awareness of daily-partner robot to user's behaviors with gaze and utterance detection,” in Proceedings of the 3rd ACM International Workshop on Context-Awareness for Self-Managing Systems (Casemans '09), pp. 1–8, May 2009.
[14]  R. M. Maatman, J. Gratch, and S. Marsella, “Natural behavior of a listening agent,” in Proceedings of the Intelligent Virtual Agents, vol. 3661 of Lecture Notes in Computer Science, pp. 25–36, 2005.
[15]  S. Kopp, T. Stocksmeier, and D. Gibbon, “Incremental multimodal feedback for conversational agents,” in Proceedings of the Intelligent Virtual Agents, vol. 4722 of Lecture Notes in Computer Science, pp. 139–146, 2007.
[16]  L. P. Morency, I. de Kok, and J. Gratch, “A probabilistic multimodal approach for predicting listener backchannels,” Autonomous Agents and Multi-Agent Systems, vol. 20, no. 1, pp. 70–84, 2009.
[17]  A. Kobayashi, K. Kayama, and E. Mizukami, “Evaluation of facial direction estimation from cameras for multi-modal spoken dialog system,” in Spoken Dialogue Systems for Ambient Environments, pp. 73–84, 2010.
[18]  O. Buß and D. Schlangen, “Modelling sub-utterance phenomena in spoken dialogue systems,” in Proceedings of SemDial, pp. 33–41, 2010.
[19]  S. E. Hudson, J. Fogarty, C. G. Atkeson et al., “Predicting human interruptibility with sensors: a Wizard of Oz feasibility study,” in Proceedings of the CHI New Horizons: Human Factors in Computing Systems, pp. 257–264, April 2003.
[20]  J. Begole, N. E. Matsakis, and J. C. Tang, “Lilsys: sensing unavailability,” in Proceedings of the Computer Supported Cooperative Work (CSCW '04), pp. 511–514, November 2004.
[21]  S. Satake, T. Kanda, D. F. Glas, M. Imai, H. Ishiguro, and N. Hagita, “How to approach humans? Strategies for social robots to initiate interaction,” in Proceedings of the 4th ACM/IEEE International Conference on Human-Robot Interaction (HRI '09), pp. 109–116, March 2009.
[22]  N. Yankelovich, “How do users know what to say?” Interactions, vol. 3, no. 6, pp. 32–43, 1996.
[23]  S. Benus, A. Gravano, and J. Hirschberg, “Pragmatic aspects of temporal accommodation in turn-taking,” Journal of Pragmatics, vol. 43, no. 12, pp. 3001–3027, 2011.
[24]  L. Ferrer, E. Shriberg, and A. Stolcke, “A prosody-based approach to end-of-utterance detection that does not require speech recognition,” in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, pp. 608–611, April 2003.
[25]  K. Laskowski, J. Edlund, and M. Heldner, “Incremental learning and forgetting in stochastic turn-taking models,” in Proceedings of the Interspeech, pp. 2069–2072, 2011.
[26]  K. Laskowski and E. Shriberg, “Corpus-independent history compression for stochastic turn-taking models,” in Proceedings of the ICASSP, pp. 4937–4940, 2012.
[27]  A. Raux and M. Eskenazi, “A finite-state turn-taking model for spoken dialog systems,” in Proceedings of the Human Language Technologies, 2009.
[28]  R. Sato, R. Higashinaka, M. Tamoto, M. Nakano, and K. Aikawa, “Learning decision trees to determine turn-taking by spoken dialogue systems,” in Proceedings of the ICSLP, pp. 861–864, 2002.
[29]  J. Edlund and M. Nordstrand, “Turn-taking gestures and hourglasses in a multi-modal dialogue system,” in Proceedings of the IDS, 2002.
[30]  A. Batliner, K. Fischer, R. Huber, J. Spilker, and E. Nöth, “Desperately seeking emotions or: actors, wizards, and human beings,” in Proceedings of the SpeechEmotion, pp. 195–200, 2000.
[31]  D. Litman and K. Forbes-Riley, “Spoken tutorial dialogue and the feeling of another’s knowing,” in Proceedings of the SIGDIAL, 2009.
[32]  M. Goto, K. Itou, and S. Hayamizu, “A real-time filled pause detection system: toward spontaneous speech dialogue,” in Proceedings of the Eurospeech, pp. 187–192, 1999.
[33]  P. Viola and M. Jones, “Rapid object detection using a boosted cascade of simple features,” in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. I511–I518, December 2001.
[34]  D. Comaniciu, V. Ramesh, and P. Meer, “Kernel-based object tracking,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 25, no. 5, pp. 564–577, 2003.
[35]  E. J. Keogh and M. J. Pazzani, “A simple dimensionality reduction technique for fast similarity search in large time series databases,” in Proceedings of the Pacific-Asia Conference on Knowledge Discovery and Data Mining, Current Issues and New Applications, pp. 122–133, 2000.
[36]  C.-W. Hsu, C.-C. Chang, and C.-J. Lin, “A practical guide to support vector classification,” Tech. Rep., Department of Computer Science, National Taiwan University, 2003.
[37]  C. C. Chang and C. J. Lin, “LIBSVM: a library for support vector machines,” ACM Transactions on Intelligent Systems and Technology, vol. 2, no. 3, article 27, 2011.
