How direct is the link between words and images?
Günther et al. (2022) investigated the relationship between words and images and concluded that there may be a direct link between words and embodied experience. In their study, participants were presented with a target noun and a pair of images, one chosen by their model and one chosen at random, and were asked to select the image that best matched the target noun. Building on their work, we addressed the following questions. (1) Apart from visually embodied simulation, what other strategies might participants have used? How much does this setup rely on visual information, and can it be solved using textual representations alone? (2) Do current visually grounded embeddings explain participants' selection behavior better than textual embeddings? (3) Does visual grounding improve the representations of both concrete and abstract words? To this end, we designed novel experiments based on pre-trained word embeddings. Our experiments reveal that participants' selection behavior is explained to a large extent by text-based embeddings and word-based similarities, and that visually grounded embeddings offered only modest advantages over textual embeddings in certain cases. These findings indicate that the experiment of Günther et al. (2022) may not be well suited for tapping into the perceptual experience of participants, and that the extent to which it measures visually grounded knowledge is unclear.
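To make the text-only strategy concrete, the sketch below shows, in minimal form, how a participant's two-alternative choice could in principle be approximated without any mental imagery: compare the target noun with a word naming each image's content in a purely textual embedding space and pick the more similar one. This is an illustration under assumptions, not the actual experimental pipeline; the embedding file, the helper names (`load_embeddings`, `predict_choice`), and the example words are all hypothetical.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def load_embeddings(path):
    """Load pre-trained text-only word vectors from a GloVe-style
    file of 'word v1 v2 ...' lines; the file name used below is an
    assumption, not a requirement."""
    emb = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            word, *values = line.rstrip().split(" ")
            emb[word] = np.asarray(values, dtype=float)
    return emb

def predict_choice(target, label_a, label_b, emb):
    """Predict which of two images a participant would pick for the
    target noun, using only word-to-word similarity between the noun
    and a (hypothetical) one-word label for each image."""
    sim_a = cosine(emb[target], emb[label_a])
    sim_b = cosine(emb[target], emb[label_b])
    return label_a if sim_a >= sim_b else label_b

emb = load_embeddings("glove.6B.300d.txt")  # assumed local file
# Illustrative trial: a model-chosen label vs. a randomly drawn one.
print(predict_choice("justice", "courtroom", "banana", emb))
```

If such a text-only heuristic already reproduces participants' choices, above-chance performance in the task need not implicate visually grounded representations.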
Article outline
- 1. Introduction
- 2. Methodology
- 2.1 Materials from Günther et al. (2022)
- 2.2 Model from Shahmohammadi et al. (2023)
- 2.3 Procedure
- 3. Results
- 3.1 Q1: Can we model participant behaviour without assuming participants generate mental images?
- 3.1.1 Max models
- 3.1.2 GAM models
- 3.2 Q2: Is participants' behaviour best accounted for by purely textual or multimodal word embeddings?
- 3.3 Q3: Does the indirect grounding of abstract words afford a better understanding of the experimental results reported by GPVM?
- 4. Discussion and conclusion
- Acknowledgements
- Notes
- References
References
Abdou, M., Kulmizev, A., Hershcovich, D., Frank, S., Pavlick, E., and Søgaard, A. (2021). Can
Language Models Encode Perceptual Structure Without Grounding? A Case Study in
Color. In Proceedings of the 25th Conference on Computational Natural
Language
Learning, pages 109–132, Stroudsburg, PA, USA. Association for Computational Linguistics.
Anderson, A. J., Bruni, E., Lopopolo, A., Poesio, M., and Baroni, M. (2015). Reading
visually embodied meaning from the brain: Visually grounded computational models decode visual-object mental imagery induced
by written
text. NeuroImage, 120:309–322.
Anschütz, M., Lozano, D. M., and Groh, G. (2023). This is not correct! Negation-aware evaluation of language generation systems.
Baroni, M. (2016). Grounding
distributional semantics in the visual world. Language and Linguistics
Compass, 10(1):3–13.
Barsalou, L. W. (1999). Perceptual
symbol systems. Behavioral and Brain
Sciences, 22(4).
Barsalou, L. W. (2003). Abstraction
in perceptual symbol systems. Philosophical Transactions of the Royal Society of London. Series
B: Biological Sciences, 358(1435).
Barsalou, L. W. (2008). Grounded
Cognition. Annual Review of
Psychology, 59(1).
Barsalou, L. W. (2010). Grounded
cognition: Past, present, and future. Topics in Cognitive Science, 2(4):716–724.
Barsalou, L. W., Santos, A., Simmons, W. K., and Wilson, C. D. (2008). Language
and simulation in conceptual processing. In Symbols and Embodiment:
Debates on meaning and cognition. Oxford University Press.
Bordes, P., Zablocki, E., Soulier, L., Piwowarski, B., and Gallinari, P. (2019). Incorporating
visual semantics into sentence representations within a grounded
space. In Proceedings of the 2019 Conference on Empirical Methods in
Natural Language Processing and the 9th International Joint Conference on Natural Language Processing
(EMNLP-IJCNLP), pages 696–707, Hong Kong, China. Association for Computational Linguistics.
Bruni, E., Tran, N.-K., and Baroni, M. (2014). Multimodal
distributional semantics. Journal of Artificial Intelligence
Research, 49:1–47.
Brysbaert, M., Warriner, A. B., and Kuperman, V. (2014). Concreteness
ratings for 40 thousand generally known English word lemmas. Behavior Research
Methods, 46(3):904–911.
Buchanan, E. M., Valentine, K. D., and Maxwell, N. P. (2019). English
semantic feature production norms: An extended database of 4436 concepts. Behavior Research
Methods, 51(4).
Bulat, L., Clark, S., and Shutova, E. (2017). Speaking,
Seeing, Understanding: Correlating semantic models with conceptual representation in the
brain. In Proceedings of the 2017 Conference on Empirical Methods in
Natural Language Processing, Stroudsburg, PA, USA. Association for Computational Linguistics.
Castelhano, M. S. and Rayner, K. (2008). Eye
movements during reading, visual search, and scene perception: An overview. Cognitive and
cultural influences on eye movements, pages 3–33.
Chatfield, K., Simonyan, K., Vedaldi, A., and Zisserman, A. (2014). Return
of the Devil in the Details: Delving Deep into Convolutional Nets. arXiv preprint
arXiv:1405.3531.
Chrupała, G., Kádár, Á., and Alishahi, A. (2015). Learning
language through pictures. In Proceedings of the 53rd Annual Meeting
of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing
(Volume 2: Short
Papers), pages 112–118, Beijing, China. Association for Computational Linguistics.
Collell Talleda, G., Zhang, T., and Moens, M.-F. (2017). Imagined
visual representations as multimodal embeddings. In Proceedings of
the Thirty-First AAAI Conference on Artificial Intelligence
(AAAI-17), pages 4378–4384. AAAI.
Cree, G. S. and McRae, K. (2003). Analyzing
the factors underlying the structure and computation of the meaning of chipmunk, cherry, chisel, cheese, and cello (and many
other such concrete nouns). Journal of Experimental Psychology:
General, 132(2).
Cronin, D. A., Hall, E. H., Goold, J. E., Hayes, T. R., and Henderson, J. M. (2020). Eye
movements in real-world scene photographs: General characteristics and effects of viewing
task. Frontiers in
Psychology, 10:2915.
De Deyne, S., Navarro, D. J., Collell, G., and Perfors, A. (2021). Visual
and Affective Multimodal Models of Word Meaning in Language and Mind. Cognitive
Science, 45(1).
Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. (2009). ImageNet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. IEEE.
Dolan, R. J. (2002). Emotion,
cognition, and
behavior. Science, 298(5596):1191–1194.
Finkelstein, L., Gabrilovich, E., Matias, Y., Rivlin, E., Solan, Z., Wolfman, G., and Ruppin, E. (2001). Placing
search in context: The concept revisited. In Proceedings of the 10th
international conference on World Wide
Web, pages 406–414.
Gerz, D., Vulić, I., Hill, F., Reichart, R., and Korhonen, A. (2016). SimVerb-3500:
A large-scale evaluation set of verb similarity. In Proceedings of
the 2016 Conference on Empirical Methods in Natural Language
Processing, pages 2173–2182, Austin, Texas. Association for Computational Linguistics.
Goldstone, R. L. (1995). Effects
of Categorization on Color Perception. Psychological
Science, 6(5).
Grondin, R., Lupker, S. J., and McRae, K. (2009). Shared
features dominate semantic richness effects for concrete concepts. Journal of Memory and
Language, 60(1):1–19.
Günther, F., Petilli, M. A., Vergallito, A., and Marelli, M. (2022). Images
of the unseen: extrapolating visual representations for abstract and concrete words in a data-driven computational
model. Psychological Research.
Günther, F., Rinaldi, L., and Marelli, M. (2019). Vector-Space
Models of Semantic Representation From a Cognitive Perspective: A Discussion of Common
Misconceptions. Perspectives on Psychological
Science, 14(6):1006–1033.
Halawi, G., Dror, G., Gabrilovich, E., and Koren, Y. (2012). Large-scale
learning of word relatedness with constraints. In Proceedings of the
18th ACM SIGKDD international conference on Knowledge discovery and data
mining, pages 1406–1414.
Harnad, S. (1990). The
symbol grounding problem. Physica D: Nonlinear
Phenomena, 42(1–3):335–346.
Harris, Z. S. (1954). Distributional
Structure. WORD, 10(2–3).
Hasegawa, M., Kobayashi, T., and Hayashi, Y. (2017). Incorporating
visual features into word embeddings: A bimodal autoencoder-based
approach. In IWCS 2017 – 12th International Conference on
Computational Semantics – Short papers.
Hill, F., Reichart, R., and Korhonen, A. (2015). SimLex-999:
Evaluating semantic models with (genuine) similarity estimation. Computational
Linguistics, 41(4):665–695.
Hochreiter, S. and Schmidhuber, J. (1997). Long
short-term memory. Neural Computation, 9(8):1735–1780.
Hoffman, D. (2019). The
case against reality: Why evolution hid the truth from our eyes. W. W. Norton & Company.
Hollenstein, N., de la Torre, A., Langer, N., and Zhang, C. (2019). CogniVal:
A Framework for Cognitive Word Embedding Evaluation. In Proceedings
of the 23rd Conference on Computational Natural Language Learning (CoNLL), Stroudsburg, PA, USA. Association for Computational Linguistics.
Howell, S. R., Jankowicz, D., and Becker, S. (2005). A
model of grounded language acquisition: Sensorimotor features improve lexical and grammatical
learning. Journal of Memory and
Language, 53(2):258–276.
Kant, I., Guyer, P., and Wood, A. W. (1781/1999). Critique
of pure reason. Cambridge University Press.
Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL-HLT, pages 4171–4186.
Kiela, D. and Bottou, L. (2014). Learning
image embeddings using convolutional neural networks for improved multi-modal
semantics. In Proceedings of the 2014 Conference on Empirical Methods
in Natural Language Processing
(EMNLP), pages 36–45, Doha, Qatar. Association for Computational Linguistics.
Kiela, D., Bulat, L., and Clark, S. (2015). Grounding
semantics in olfactory perception. In Proceedings of the 53rd Annual
Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language
Processing (Volume 2: Short
Papers), pages 231–236.
Kiela, D. and Clark, S. (2015). Multi-
and Cross-Modal Semantics Beyond Vision: Grounding in Auditory
Perception. In Proceedings of the 2015 Conference on Empirical
Methods in Natural Language Processing, Stroudsburg, PA, USA. Association for Computational Linguistics.
Kiela, D., Conneau, A., Jabri, A., and Nickel, M. (2018). Learning
visually grounded sentence representations. In Proceedings of the
2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies,
Volume 1 (Long
Papers), pages 408–418, New Orleans, Louisiana. Association for Computational Linguistics.
Kiros, J., Chan, W., and Hinton, G. (2018). Illustrative
language understanding: Large-scale visual grounding with image
search. In Proceedings of the 56th Annual Meeting of the Association
for Computational Linguistics (Volume 1: Long
Papers), pages 922–933, Melbourne, Australia. Association for Computational Linguistics.
Lakoff, G. (1987). Women,
Fire, and Dangerous Things. University of Chicago Press.
Lakoff, G. and Johnson, M. (1980). The
metaphorical structure of the human conceptual system. Cognitive Science, 4(2):195–208.
Landauer, T. K. and Dumais, S. T. (1997). A
solution to Plato’s problem: The latent semantic analysis theory of acquisition, induction, and representation of
knowledge. Psychological
Review, 104(2).
Langacker, R. W. (1999). A
view from cognitive linguistics. Behavioral and Brain
Sciences, 22(4).
Lazaridou, A., Chrupała, G., Fernández, R., and Baroni, M. (2016). Multimodal
Semantic Learning from Child-Directed Input. In Proceedings of the
2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language
Technologies, Stroudsburg, PA, USA. Association for Computational Linguistics.
Lazaridou, A., Marelli, M., and Baroni, M. (2017). Multimodal
Word Meaning Induction From Minimal Exposure to Natural Text. Cognitive
Science, 41.
Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., and Zitnick, C. L. (2014). Microsoft COCO: Common objects in context. In European conference on computer vision, pages 740–755. Springer.
Louwerse, M. and Connell, L. (2011). A
Taste of Words: Linguistic Context and Perceptual Simulation Predict the Modality of
Words. Cognitive
Science, 35(2):381–398.
Louwerse, M. M. (2011). Symbol
interdependency in symbolic and embodied cognition. Topics in Cognitive
Science, 3(2):273–302.
Louwerse, M. M. and Zwaan, R. A. (2009). Language
Encodes Geographical Information. Cognitive
Science, 33(1):51–73.
Luong, T., Socher, R., and Manning, C. (2013). Better
word representations with recursive neural networks for
morphology. In Proceedings of the Seventeenth Conference on
Computational Natural Language
Learning, pages 104–113, Sofia, Bulgaria. Association for Computational Linguistics.
Lynott, D., Connell, L., Brysbaert, M., Brand, J., and Carney, J. (2020). The
Lancaster Sensorimotor Norms: multidimensional measures of perceptual and action strength for 40,000 English
words. Behavior Research
Methods, 52(3).
Mandera, P., Keuleers, E., and Brysbaert, M. (2017). Explaining
human performance in psycholinguistic tasks with models of semantic similarity based on prediction and counting: A review and
empirical validation. Journal of Memory and
Language, 92:57–78.
Marelli, M. and Amenta, S. (2018). A database of orthography-semantics consistency (OSC) estimates for 15,017 English words. Behavior Research Methods, 50:1482–1495.
Martin, A. (2007). The
Representation of Object Concepts in the Brain. Annual Review of
Psychology, 58(1):25–45.
McRae, K., Cree, G. S., Seidenberg, M. S., and McNorgan, C. (2005). Semantic
feature production norms for a large set of living and nonliving things. Behavior Research
Methods, 37(4).
Mikolov, T., Chen, K., Corrado, G., and Dean, J. (2013). Efficient
estimation of word representations in vector space. arXiv preprint
arXiv:1301.3781.
Mkrtychian, N., Blagovechtchenski, E., Kurmakaeva, D., Gnedykh, D., Kostromina, S., and Shtyrov, Y. (2019). Concrete
vs. Abstract Semantics: From Mental Representations to Functional Brain Mapping. Frontiers in
Human
Neuroscience, 13:267.
Montefinese, M. (2019). Semantic
representation of abstract and concrete words: A minireview of neural evidence. Journal of
Neurophysiology, 121(5):1585–1587.
Park, J. and Myaeng, S.-H. (2017). A computational study on word meanings and their distributed representations via polymodal embedding. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 214–223, Taipei, Taiwan. Asian Federation of Natural Language Processing.
Pennington, J., Socher, R., and Manning, C. (2014). Glove:
Global Vectors for Word Representation. In Proceedings of the 2014
Conference on Empirical Methods in Natural Language Processing (EMNLP), Stroudsburg, PA, USA. Association for Computational Linguistics.
Pezzelle, S., Takmaz, E., and Fernández, R. (2021). Word
representation learning in multimodal pre-trained transformers: An intrinsic
evaluation. Transactions of the Association for Computational
Linguistics, 9:1563–1579.
Rotaru, A. S. and Vigliocco, G. (2020). Constructing semantic models from words, images, and emojis. Cognitive Science, 44(4):e12830.
Rozenkrants, B., Olofsson, J. K., and Polich, J. (2008). Affective
visual event-related potentials: arousal, valence, and repetition effects for normal and distorted
pictures. International Journal of
Psychophysiology, 67(2):114–123.
Shahmohammadi, H., Heitmeier, M., Shafaei-Bajestan, E., Lensch, H. P. A., and Baayen, R. H. (2023). Language
with vision: a study on grounded word and sentence embeddings. Behavior Research
Methods, accepted for publication.
Shahmohammadi, H., Lensch, H. P. A., and Baayen, R. H. (2021). Learning
zero-shot multifaceted visually grounded word embeddings via multi-task
training. In Proceedings of the 25th Conference on Computational
Natural Language
Learning, pages 158–170, Online. Association for Computational Linguistics.
Silberer, C. and Lapata, M. (2014). Learning
grounded meaning representations with autoencoders. In Proceedings of
the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long
Papers), pages 721–732, Baltimore, Maryland. Association for Computational Linguistics.
Simmons, W. K., Martin, A., and Barsalou, L. W. (2005). Pictures
of Appetizing Foods Activate Gustatory Cortices for Taste and Reward. Cerebral
Cortex, 15(10):1602–1608.
Solomon, K. O. and Barsalou, L. W. (2001). Representing
Properties Locally. Cognitive
Psychology, 43(2):129–169.
Solomon, K. O. and Barsalou, L. W. (2004). Perceptual
simulation in property verification. Memory &
Cognition, 32(2):244–259.
Tan, H. and Bansal, M. (2020). Vokenization:
Improving language understanding with contextualized, visual-grounded
supervision. In Proceedings of the 2020 Conference on Empirical
Methods in Natural Language Processing
(EMNLP), pages 2066–2080, Online. Association for Computational Linguistics.
Tan, M. and Le, Q. (2019). EfficientNet:
Rethinking model scaling for convolutional neural
networks. In International conference on machine
learning, pages 6105–6114. PMLR.
Utsumi, A. (2022). A
test of indirect grounding of abstract concepts using multimodal distributional
semantics. Frontiers in Psychology, 13.
Vigliocco, G., Ponari, M., and Norbury, C. (2018). Learning
and processing abstract words and concepts: Insights from typical and atypical
development. Topics in Cognitive Science, 10(3):533–549.
Wang, B., Wang, A., Chen, F., Wang, Y., and Kuo, C.-C. J. (2019). Evaluating
word embedding models: Methods and experimental results. APSIPA Transactions on Signal and Information Processing, 8.
Westbury, C. (2014). You
Can’t Drink a Word: Lexical and Individual Emotionality Affect Subjective Familiarity
Judgments. Journal of Psycholinguistic
Research, 43(5).
Westbury, C. and Hollis, G. (2019). Wriggly,
squiffy, lummox, and boobs: What makes some words funny? Journal of Experimental Psychology:
General, 148(1).
Wood, S. N. (2011). Fast
stable restricted maximum likelihood and marginal likelihood estimation of semiparametric generalized linear
models. Journal of the Royal Statistical Society
(B), 73(1):3–36.
Yun, T., Sun, C., and Pavlick, E. (2021). Does
vision-and-language pretraining improve lexical
grounding? In Findings of the Association for Computational
Linguistics: EMNLP
2021, pages 4357–4366, Punta Cana, Dominican Republic. Association for Computational Linguistics.
Zwaan, R. A. and Madden, C. J. (2005). Embodied
Sentence Comprehension. In Grounding
Cognition. Cambridge University Press.