Report on the short STSM visit by François Portet (France) to Albert Gatt (Malta)

This short-term scientific mission (STSM) explored and discussed the state of the art in natural language grounding in multimodal settings, including modalities other than vision. Natural language grounding is a fundamental task that humans perform continuously. Research on natural language grounding consists of uncovering how language utterances relate to the real world.

During this one-week mission (19-23 Oct 2020), A. Gatt and F. Portet met daily to discuss the progress of the work and how grounding can be defined and problematised in multimodal settings. Several papers on natural language grounding and non-visual grounding were surveyed and discussed with A. Gatt and his PhD student Michele Cafagna. This discussion was formalised in a report which defines symbol grounding and natural language grounding and positions the latter with respect to current computing science approaches to this task. The report also includes a review of current datasets, covering both audio/video datasets and datasets with further modalities. The main outcomes were the following.

Started a report on grounding (visual and non-visual).
Identified datasets and a rough plan for a shared task on multimodal generation.
Planned the organisation of the next WP1 meeting to formalise this shared task proposal amongst the members of the action on multimodal grounding.
Had fruitful interactions with A. Gatt, his PhD student and postdocs, as well as the Malta NLP group.
Wrote a Master's project on multimodal generation.

The STSM was followed up by the co-supervision, between France and Malta, of the Master's student Mou LI on multimodal generation from video/transcript summarisation. This supervision also involved two PhD students (Michele Cafagna, Malta, and Yongxin Zhou, France), and the Master's project was successfully defended in June 2021.