Multi-modal referring expressions in human-human task descriptions and their implications for human-robot interaction

Gross, Stephanie; Krenn, Brigitte; Scheutz, Matthias

doi:10.1075/is.17.2.02gro

Article published in:

Interaction Studies
Vol. 17:2 (2016), pp. 180–210

Stephanie Gross | Austrian Research Institute for Artificial Intelligence (OFAI)

Brigitte Krenn | Austrian Research Institute for Artificial Intelligence (OFAI)

Matthias Scheutz | Tufts University

Human instructors often refer to objects and actions involved in a task description using both linguistic and non-linguistic means of communication. Hence, for robots to engage in natural human-robot interactions, we need to better understand the various relevant aspects of human multi-modal task descriptions. We analyse reference resolution to objects in a data collection comprising two object manipulation tasks (22 teacher-student interactions in Task 1 and 16 in Task 2) and find that 78.76% of all referring expressions to the objects relevant in Task 1 are verbally underspecified, and that 88.64% of all referring expressions are verbally underspecified in Task 2. The data strongly suggest that a language processing module for robots must be genuinely multi-modal, allowing for seamless integration of information transmitted in the verbal and the visual channel, whereby tracking the speaker’s eye gaze and gestures as well as object recognition are necessary preconditions.

Keywords: multi-modal communication, human-robot interaction, reference resolution
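
The integration requirement stated in the abstract can be illustrated with a minimal sketch. It is not the authors' system: the `Candidate` type, the object labels, the feature sets, and the cue weights `w_gaze` and `w_point` are all hypothetical, chosen only to show how a verbally underspecified expression such as "the tube" or "it" might be resolved once gaze and pointing cues are added to the verbal match.

```python
from dataclasses import dataclass, field


@dataclass
class Candidate:
    """A visible object, as a (hypothetical) object recogniser might report it."""
    name: str
    features: set = field(default_factory=set)


def resolve(words, candidates, gaze=None, pointing=None, w_gaze=1.0, w_point=2.0):
    """Return the name of the best-scoring candidate, or None if ambiguous.

    words:    tokens of the verbal referring expression
    gaze:     name of the object the speaker is looking at, if tracked
    pointing: name of the object the speaker is pointing at, if detected
    The weights are illustrative assumptions, not values from the article.
    """
    if not candidates:
        return None
    scores = {}
    for c in candidates:
        verbal = len(set(words) & c.features)          # verbal channel
        visual = (w_gaze if c.name == gaze else 0.0)   # visual channel: gaze
        visual += (w_point if c.name == pointing else 0.0)  # and gesture
        scores[c.name] = verbal + visual
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    top_name, top_score = ranked[0]
    # No evidence at all, or a tie between the two best candidates:
    # the expression stays unresolved.
    if top_score == 0 or (len(ranked) > 1 and ranked[1][1] == top_score):
        return None
    return top_name


objects = [
    Candidate("tube_1", {"tube", "red"}),
    Candidate("tube_2", {"tube", "blue"}),
]
verbal_only = resolve(["the", "tube"], objects)
with_gaze = resolve(["the", "tube"], objects, gaze="tube_2")
pronoun_with_pointing = resolve(["it"], objects, pointing="tube_1")
```

In this toy setting the verbal channel alone leaves "the tube" unresolved because both objects match equally, while a gaze or pointing cue breaks the tie. A real module would need time-aligned speech, gaze, and gesture streams rather than single labels.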

Article outline

1. Introduction
2. Background and related work
- 2.1 Multi-modal reference resolution in human-human interaction
  - 2.1.1 Variation in language
  - 2.1.2 Gesture, gaze and language
- 2.2 Computational approaches to multi-modal reference resolution
3. Data collection experiments and research questions
- 3.1 Task 1
- 3.2 Task 2
- 3.3 Data collection
- 3.4 Participants and technical tools employed in data analysis
- 3.5 Research questions
4. Results
- 4.1 RQ1 – Variation of referring expressions per object
- 4.2 RQ2 – Underspecified verbal referring expressions
  - 4.2.1 Verbal part of referring expressions
  - 4.2.2 Verbal part of initial references
  - 4.2.3 Pronoun resolution
- 4.3 RQ3 – Multi-modality of referring expressions
5. Analysis and challenges
- 5.1 Challenge 1 – variation of expressions referring to one specific object
- 5.2 Challenge 2 – underspecified verbal referring expressions
- 5.3 Challenge 3 – multi-modality of referring expressions
- 5.4 Lessons for agent design
  - 5.4.1 Variation of expressions referring to one specific object
  - 5.4.2 Underspecified verbal referring expressions
  - 5.4.3 Multi-modality of referring expressions
6. Conclusion, limitations, and future work
Acknowledgements
Notes
References

Published online: 21 December 2016
