Structured Multilingual Language Generation

Applicant Details

Applicant: Dr Johannes Bjerva
Home institution: University of Copenhagen (Copenhagen, Denmark)
Home institution address: Universitetsparken 1, 2100 Copenhagen, Denmark

STSM Details

Action: CA18231 – Multi3Generation: Multi-task, Multilingual, Multi-modal Language Generation
STSM title: Structured Multilingual Language Generation
Period: AGA-CA18231-1: 2019-10-01 – 2020-04-30
Start date: 2020-03-16
End date: 2020-03-20
Motivation and Workplan summary

1) Aim and Motivation

About seven thousand languages are spoken in the world today, but only a small number of these have the natural language processing (NLP) tools that the modern information society relies on. The vast majority of languages have no machine translation systems, intelligent search algorithms or grammar checkers—in spite of enormous progress made for large majority languages. The main technical challenge for truly multilingual NLP is the lack of training data for the machine learning methods used, with only spotty coverage across different languages and tasks.

The aim of this STSM is to begin developing more efficient ways to use existing data for training multilingual NLP algorithms, focusing on two related areas: identifying the structured patterns of similarity between different languages due to their evolutionary history, and discovering the common features among different NLP tasks. In both cases, the goal is to find abstractions across languages and tasks so that, for instance, data on the grammatical analysis of Swedish can be maximally informative to a model for information search in the closely related Danish language.

The long-term result of the STSM will be improved machine learning methods for learning a large number of heterogeneous tasks when there is underlying structure among them. We will apply these methods to bring recent advances in natural language technology to thousands of languages, rather than the few dozen languages where they are available today.

While a considerable body of research exists on training models on data from multiple languages or tasks, this STSM aims to fill an important research gap by utilising and integrating the structured relationships between languages. This will allow for more efficient transfer of information across languages, which will in turn lead to more accurate NLP models, in particular for under-resourced languages. In the same vein, we want to extend the multi-task learning (MTL) framework to allow information transfer not just between languages, but between very different types of language data. This will help bridge the divide between languages that have the NLP tools much of modern society relies on, such as machine translation and intelligent search systems, and those that do not.

The main novelties investigated in this project are:

* Structured multilinguality, that is, sharing information such as model parameters between languages in a way that efficiently uses the (mostly) hierarchically structured patterns of similarity between languages; a minimal sketch of this idea follows this list.
* Utilising a very large sample of languages, at least 1,000, which allows generalisations across languages to be studied much more extensively than in previous work.
* Drawing inspiration from linguistic typology, which studies the systematic variation of languages. While this project is in the area of NLP, we expect it to benefit from, and provide contributions to, both fields.
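
As a concrete illustration of the first point, the sketch below composes each language's representation from a shared family-level vector and a language-specific offset, so that related languages such as Danish and Swedish share parameters through their common ancestor. This is a minimal sketch in PyTorch, not the model to be developed; the module name and the two-level hierarchy are illustrative assumptions.

    import torch
    import torch.nn as nn

    class HierarchicalLanguageEmbedding(nn.Module):
        """Hypothetical sketch: a language vector is a shared family vector
        plus a language-specific offset, so related languages share
        parameters through their common ancestor."""

        def __init__(self, n_families: int, n_languages: int, dim: int):
            super().__init__()
            self.family_emb = nn.Embedding(n_families, dim)
            self.language_emb = nn.Embedding(n_languages, dim)

        def forward(self, family_id, language_id):
            # Gradients from any language also update its family vector,
            # transferring information to under-resourced relatives.
            return self.family_emb(family_id) + self.language_emb(language_id)

    # Danish and Swedish share the (single) North Germanic family vector.
    emb = HierarchicalLanguageEmbedding(n_families=1, n_languages=2, dim=32)
    family_ids = torch.tensor([0, 0])    # both North Germanic
    language_ids = torch.tensor([0, 1])  # 0 = Danish, 1 = Swedish
    vectors = emb(family_ids, language_ids)  # shape: (2, 32)

Deeper genealogies (genus, branch, family) would follow the same additive pattern, with one embedding table per level of the hierarchy.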

2) Proposed contribution to the scientific objectives of the Action.

This STSM is particularly related to WG2 and WG4 of Multi3Generation.
The relevance to WG2 is mainly that we will be using methods from multi-task learning, and will work on adapting these to both multi-task and multilingual settings. In particular, we will focus on models using language representations to capture properties of each language that are important to the NLP task(s) at hand, with the main goal being to create methods for solving NLP tasks for multiple languages.
The relevance to WG4 stems from our use of large linguistic knowledge bases to inform our multilingual models of the relationships between languages in terms of various linguistic features, as sketched below.
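
As a sketch of the kind of knowledge-base access this involves, the snippet below queries typological feature vectors for two languages. It assumes the lang2vec package, a Python interface to the URIEL typological database; the actual knowledge bases and feature sets used during the STSM may differ.

    # Assumes the lang2vec package (an interface to the URIEL typological
    # database); the KBs used in the STSM may differ.
    import lang2vec.lang2vec as l2v

    # Syntactic feature vectors for Danish and Swedish (ISO 639-3 codes);
    # the 'syntax_knn' set imputes missing values from similar languages.
    features = l2v.get_features("dan swe", "syntax_knn")
    print(len(features["dan"]))  # dimensionality of the typological vector

Such feature vectors can be concatenated to, or used to predict, the language representations mentioned above, giving the model explicit knowledge of inter-language relationships.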


3) Techniques – Please detail what techniques or equipment you may learn to use, if applicable.

The aim of this STSM is to use techniques at the intersection of large linguistic knowledge bases (KBs) and modern machine learning methods for multi-task learning.
The visit will allow us to further explore such methods, in particular in the context of linguistic probing of NLP representations; a minimal probing sketch follows below. That the STSM will take place at the Department of Linguistics at Stockholm University, which has a strong focus on computational linguistics, is an added benefit in this respect.
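
As an example of what such probing looks like in practice, the sketch below trains a linear probe on frozen representations to test whether a linguistic property is linearly decodable from them. The data here are random stand-ins; in the actual experiments the inputs would be representations from a multilingual NLP model and the labels typological features.

    # Probing sketch with stand-in data; real experiments would use encoder
    # representations and typological labels instead of random arrays.
    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 768))   # stand-in for model representations
    y = rng.integers(0, 2, size=1000)  # stand-in labels, e.g. SVO vs. SOV

    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    # Near-chance accuracy is expected on random data; above-chance accuracy
    # on real representations would suggest the property is encoded in them.
    print(probe.score(X_test, y_test))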


4) Planning – Please detail the steps you will take to achieve your proposed aim.

The aim is for this STSM to result in a publication at a top venue in NLP, such as EMNLP 2020 (https://2020.emnlp.org/call-for-papers). We have already begun a collaboration with a researcher at Stockholm University (Prof. Robert Östling), and will continue it during the visit. During the visit, we expect to finish the experiments and analysis needed to produce a submission for EMNLP.
After the STSM, we expect to finalise the publication and to continue the collaboration with Stockholm University.

Host Details

Name: Prof Robert Östling
Institution: Stockholm University
Institution address: Department of Linguistics, Stockholm University, Universitetsvägen 10 C, Frescati, Stockholm, Sweden

Financial Support

Amount for Travel in EUR: 300
Amount for Subsistence in EUR: 800
Total Amount in EUR: 3500
