|Applicant||Mr Felix Arturo Oncevay Marcos|
|Home institution||University of Edinburgh (Edinburgh, United Kingdom)|
|Home institution address||10 Crichton St EH8 9AB, Edinburgh, United Kingdom|
|Action||CA18231 – Multi3Generation: Multi-task, Multilingual, Multi-modal Language Generation|
|Action General Information: CA18231|
|STSM title||Morphological typology awareness in multilingual NLP evaluation|
|Period||AGA-CA18231-2: 2020-05-01 – 2021-10-31|
|Motivation and Workplan summary||(1) Aim and motivation:|
Despite the progress of the Natural Language Processing (NLP) field, computers remain far from fully understanding human languages. The problem is compounded by the existence of over 7,000 languages around the world, very few of which have enough machine-readable data for modern NLP methods. Languages lacking such data are referred to as low-resource languages, and they consistently miss out on the benefits of existing language technologies. One of the first barriers is the processing of their morphology, i.e., how systematically diverse their word formation processes are. For instance, agglutination and fusion are two morphological processes that concatenate morphemes to a root with explicit or non-explicit boundaries, respectively.
Processing morphologically diverse languages and evaluating morphological competence in NLP models is relevant for language generation tasks, such as machine translation. It is infeasible to develop models with a capacity large enough to encode the full vocabulary of every language, so models must rely on subword segmentation approaches that help to constrain the capacity when generating rare, or even new, words. Hence, understanding morphology is essential to develop robust subword-based models and evaluate their generation outputs. Nevertheless, there is a gap between probing whether an NLP model can handle “morphological richness” and what counts as a proper measure of “morphological richness” in linguistic typology. In most of the NLP literature on generation tasks, morphological complexity is associated with high agglutination: if a model performs well on datasets of highly agglutinative languages like Turkish, then, according to that literature, the model handles morphology well.
There is, however, a debate as to whether languages can indeed be classified into discrete morphological categories. Payne (2017) provided a morphological typology measurement in a continuous space using the indices of synthesis and fusion, which are computed in an intra-language scope. Synthesis is the ratio of morphemes to words in a segment (1 or greater), whereas fusion is the ratio of fusional morpheme joints to the total number of joints (from 0 to 1, or from agglutinative to fusional). Surprisingly, it is possible to identify even English sentences with a very low fusion index, meaning that they are highly agglutinative segments. For instance, in the following fragment, the index of fusion is 1/8 or 0.125 (fusional morpheme joints are marked with a dot and the rest with a hyphen): “The company-’s great break-through came.PAST when they decid-ed to buy trike-s to sell their ice cream around the street-s in the nine-teen twenty-s”.
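As an illustration of the two indices defined above, the following minimal sketch computes them from a segment that has already been annotated by hand. It assumes the notation of the example fragment (a hyphen for an agglutinative joint, a dot for a fusional joint) and assumes that the segment contains no sentence punctuation or abbreviations that use those characters. The names `TypologyIndices` and `compute_indices` are hypothetical and are not part of any existing tool.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed annotation scheme, taken from the example fragment:
# "-" marks an agglutinative joint, "." marks a fusional joint.
AGGLUTINATIVE_JOINT = "-"
FUSIONAL_JOINT = "."


@dataclass
class TypologyIndices:
    words: int
    morphemes: int
    joints: int
    fusional_joints: int

    @property
    def synthesis(self) -> float:
        """Morphemes per word (1 or greater)."""
        return self.morphemes / self.words

    @property
    def fusion(self) -> Optional[float]:
        """Fusional joints over all joints; None if the segment has no joints."""
        if self.joints == 0:
            return None
        return self.fusional_joints / self.joints


def compute_indices(segment: str) -> TypologyIndices:
    """Count words, morphemes and joints in a hand-annotated segment."""
    words = segment.split()
    if not words:
        raise ValueError("segment must contain at least one word")
    agglutinative = sum(w.count(AGGLUTINATIVE_JOINT) for w in words)
    fusional = sum(w.count(FUSIONAL_JOINT) for w in words)
    joints = agglutinative + fusional
    # Each joint separates two morphemes, so a word with k joints has k + 1 morphemes.
    morphemes = len(words) + joints
    return TypologyIndices(len(words), morphemes, joints, fusional)


example = (
    "The company-'s great break-through came.PAST when they decid-ed "
    "to buy trike-s to sell their ice cream around the street-s "
    "in the nine-teen twenty-s"
)
indices = compute_indices(example)
print(f"synthesis = {indices.synthesis:.3f}, fusion = {indices.fusion:.3f}")
```

On the example fragment this counts 8 joints, 1 of them fusional, which reproduces the fusion index of 0.125 given above; with 23 words and 31 morphemes, the synthesis index is 31/23, or roughly 1.35.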
If the references of an evaluation set (in any language generation task) are labelled with the indices, we could perform a stratified analysis (e.g. low fusion versus high fusion, or per quartile) to determine how well an NLP model handles morphology for multiple languages. For example, we could assess whether a machine translation model fails more often when generating fusional morpheme joints than agglutinative ones for a specific target language. Identifying and quantifying such a morphological problem is the first step towards proposing a solution. Therefore, this STSM project aims to begin developing an evaluation framework that assesses morphological competence, in terms of synthesis and fusion, for multilingual NLP tasks with the aid of linguistic typology.
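The stratified analysis described above could be sketched as follows. This is only an assumption about one possible setup: each reference carries a fusion index and a per-segment quality score from some metric, references are grouped by the quartile of their fusion index, and the score is averaged within each group. The function name `mean_score_per_fusion_quartile` and all values in the usage example are hypothetical placeholders, not results.

```python
from bisect import bisect_right
from collections import defaultdict
from statistics import mean, quantiles
from typing import Dict, Sequence


def mean_score_per_fusion_quartile(
    fusion_indices: Sequence[float], scores: Sequence[float]
) -> Dict[int, float]:
    """Average a per-segment score within each quartile of the fusion index.

    Quartile 0 holds the most agglutinative references and quartile 3 the
    most fusional ones.
    """
    if len(fusion_indices) != len(scores):
        raise ValueError("fusion_indices and scores must have the same length")
    # Three cut points that split the fusion indices into four groups.
    cuts = quantiles(fusion_indices, n=4)
    buckets = defaultdict(list)
    for fusion, score in zip(fusion_indices, scores):
        buckets[bisect_right(cuts, fusion)].append(score)
    return {quartile: mean(values) for quartile, values in sorted(buckets.items())}


# Hypothetical placeholder values, for illustration only.
fusion_per_reference = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7]
score_per_reference = [0.9, 0.9, 0.8, 0.8, 0.6, 0.6, 0.4, 0.4]

per_quartile = mean_score_per_fusion_quartile(fusion_per_reference, score_per_reference)
for quartile, avg in per_quartile.items():
    print(f"fusion quartile {quartile}: mean score {avg:.2f}")
```

A score that drops steadily from the lowest to the highest quartile would suggest that the model struggles more with fusional than with agglutinative segments in that target language.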
Specifically, we aim to develop automatic methods for computing the indices proposed by Payne (2017). We are going to build baselines to approximate the computation of the indices using supervised, semi-supervised and unsupervised morphological analysers and subword segmentation methods. Afterwards, we are going to analyse the performance of the baselines using multilingual NLP knowledge bases of morphology.
The main novelties investigated in this project are:
A call for clarity in the evaluation of morphological competence in multilingual NLP models.
Assessing morphological typology in multilingual NLP evaluation: There is no prior work on morphological evaluation in NLP in an intra-language scope. The literature has always labelled a whole language with a specific type of morphology (e.g. fusional or agglutinative).
Utilising multilingual NLP knowledge bases with morphological annotation (Universal Morphology, Universal Dependencies) for a new purpose: to develop automatic methods for computing morphological typology indices and their benchmarking.
The main outcomes for this project are:
At least one Open Access research paper with the new evaluation framework of morphological typology for multilingual NLP tasks, and a call for clarity in the assessment of morphological competence, to be submitted to one of the top NLP venues (e.g. ACL, TACL, Computational Linguistics journal), or a top interdisciplinary venue (e.g. PLOS One).
A tool implementing the methods developed to compute the indices of synthesis and fusion in multiple languages, with source code freely available.
A benchmark of the methods.
Thomas E Payne. 2017. Morphological typology. In The Cambridge Handbook of Linguistic Typology, pages 78–94. Cambridge University Press, March.
(2) Proposed contribution to the scientific objectives of the Action:
This STSM is particularly related to WG 2 and WG 4 of Multi3Generation:
The relevance to WG2 is mainly that we will be using methods from supervised, semi-supervised and unsupervised learning for subword segmentation and sequence tagging (delimitation of morpheme boundaries). We will also work on extending these to multilingual settings. The main goal is to create methods that support the interpretability and probing of NLP tasks for multiple languages.
The relevance to WG4 stems from the fact that we will exploit linguistic knowledge bases to train and evaluate our methods that compute the degree of morphological typology in multiple languages.
This STSM project aims to use techniques that lie at the intersection of computational typology, multilingual NLP evaluation, and modern machine learning for supervised, semi-supervised and unsupervised learning.
The visit will allow us to further explore such methods, in particular in the context of linguistic probing of NLP models concerning morphology. The fact that the STSM is planned to take place at Aalborg University with Johannes Bjerva, who has a strong focus on linguistic typology for multilingual NLP, is an added benefit in this respect.
The aim is that this STSM will result in a publication to be submitted to one of the top NLP venues (e.g. ACL, TACL, Computational Linguistics journal), or a top interdisciplinary venue (e.g. PLOS One).
We have already started collaborating with Johannes Bjerva at Aalborg University. Before the visit, we are going to define the experimentation details, such as which datasets we are going to process and which methods we are going to evaluate. During the visit, we expect to execute the experiments and analysis we need to produce a paper ready to submit. After the STSM, we expect to continue the collaboration on other related topics in computational typology for aiding the evaluation of NLP generation tasks.
|Name||Dr Johannes Bjerva|
|Institution||Aalborg University, Campus Copenhagen|
|Institution address||A. C. Meyers Vænge 15, 2450 Copenhagen, Copenhagen, Denmark|
|Amount for Travel in EUR||150|
|Amount for Subsistence in EUR||3350|
|Total Amount in EUR||3500|