Multi3Generation

WG 1 – Grounded Multimodal Reasoning and Generation

Linguistic expressions are called grounded when they are linked to non-linguistic, especially perceptual, data (information from modalities such as vision, audition, etc.); grounding is, in essence, a key aspect of acquiring meaning, and it remains a long-standing challenge for Artificial Intelligence.

WG1 focuses on grounded representations for AI systems that, amongst other things, use multimodal information to reason, learn, and generate natural language. The central themes for WG1 are the following:

  • Explainability and transparency in multimodal models;
  • Complementarity / redundancy among data sources or modalities;
  • Interaction between symbolic and sub-symbolic (e.g. neural) representations in models;
  • The role of commonsense and other knowledge; 
  • Situated reasoning and language generation.

WG1 will be working towards:

  1. Drawing up standards for multimodal data sources;
  2. Defining a research roadmap, through an appraisal of existing work and the identification of gaps to be addressed in future work.

Individuals interested in joining WG1 should contact the chair, Mehul Bhatt (Örebro University, Sweden), at mehul.bhatt {at} oru.se.

SELECT PUBLICATIONS

  • L. Parcalabescu, A. Gatt, A. Frank and I. Calixto (2021). Seeing past words: Testing the cross-modal capabilities of pretrained V&L models on counting tasks. In: Beyond Language: Multimodal Semantic Representations Workshop (MMSR) 2021.
  • H. Alberts, T. Huang, Y. Deshpande, Y. Liu, K. Cho, C. Vania, I. Calixto (2021). VisualSem: A high-quality knowledge graph for vision and language. In: Multilingual Representation Learning Workshop (MRL) 2021.
  • J. Suchan, M. Bhatt and S. Varadarajan (2021). Commonsense Visual Sensemaking for Autonomous Driving: On Generalised Neurosymbolic Online Abduction Integrating Vision and Semantics. In: Artificial Intelligence (AIJ), Volume 299, October 2021.
  • M. Cafagna, K. van Deemter and A. Gatt (2021). What Vision-Language models ‘see’ when they see scenes. arXiv preprint arXiv:2109.07301.