Data acquisition for Machine Translation: the ELRC contribution

The European Language Resource Coordination | ELRC

14-07-2021

Language-centric Artificial Intelligence applications are popping up all around us in our everyday lives. Voice assistants, TV show recommenders and automated translation services, to name a few, are used more and more. Especially in a crisis like the COVID-19 pandemic, their use is important for easy access to reliable sources of information.

The development of such powerful applications is fueled by three key factors: algorithms, computing power, and data. Nowadays, well-established deep learning algorithms are available and can be implemented via open-source machine learning libraries. In parallel, hardware IT infrastructures are continuously extended in terms of computing power and big data storage. The third factor concerns the acquisition of sizable data appropriate for the problem at hand (e.g. millions of translated sentence pairs in various language combinations for training Machine Translation engines).

Focusing on the goal of the EU's multilingualism policy to give citizens access to EU procedures and information in their own languages, the Connecting Europe Facility - Automated Translation (CEF AT) aims at providing multilingual support to pan-European digital services, public administrations and SMEs in all EU Member States and EEA countries.

As a result, a continuous extension of the supported languages and quality improvement of the automated translation services are required. The European Language Resource Coordination (ELRC) contributes to this objective by coordinating the collection and processing of language resources pertinent to Machine Translation and by maintaining the language data repository (ELRC-SHARE), through which the collected data are made available not only to CEF AT but also, depending on their terms of use, to the general public.

To this end, the Institute for Language and Speech Processing / Athena Research Center, one of the founding partners of the ELRC initiative, has set up a workflow and developed a pipeline for parallel language data acquisition from the web, focusing on three domains, namely “Health”, “Culture” and “Scientific Research”. In the current COVID-19 crisis, the ELRC language data collection activities could not remain inactive in the face of the growing demand for improved technology-enhanced multilingual access to COVID-19 information. As part of the data collection activities in the “Health” domain, efforts focused on identifying reliable sources of language data and on compiling dedicated resources on the pandemic. In particular, the relevant MEDISYS metadata collections have been parsed and harvested in order to extract pairs of parallel sentences from comparable corpora by applying the workflow described below. Parts of these datasets were offered to the Covid-19 MLIA-Eval initiative.
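As an illustration, harvesting such metadata collections could look roughly like the sketch below. This is a minimal, hypothetical example assuming the metadata are exposed as RSS/XML feeds: the feed URL is a placeholder, and the per-entry language field and the grouping logic are assumptions rather than the actual MEDISYS interface or the ILSP implementation.

# Hypothetical metadata harvesting sketch: parse an RSS/XML feed and group
# article links by language, so that same-topic documents can later be
# paired as comparable corpora. The URL is a placeholder, not a real endpoint.
import feedparser

FEED_URL = "https://example.org/medisys/covid19.rss"  # placeholder

def harvest(feed_url):
    feed = feedparser.parse(feed_url)
    by_language = {}
    for entry in feed.entries:
        # Per-entry language tags depend on the feed's schema;
        # 'und' marks entries whose language is undetermined.
        lang = entry.get("language", "und")
        by_language.setdefault(lang, []).append(
            {"title": entry.get("title", ""), "url": entry.get("link", "")}
        )
    return by_language

if __name__ == "__main__":
    for lang, items in harvest(FEED_URL).items():
        print(lang, len(items))

Documents grouped this way are only comparable, not parallel; the workflow described below is what turns them into translation units.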

The translation units (TUs) collected for EN-X language pairs as part of the ELRC activities during the last two years amount to more than 40 million. In addition, a considerable number of TUs have been identified for X-Y language pairs, where X and Y are CEF languages other than EN, while millions of TUs have been extracted from websites with multi-domain content, with the aim of clustering them into domain-specific subsets. In total, the constructed Language Resources comprise more than 80 million TUs, a figure that changes constantly.

 

ILSP’s workflow for parallel language data acquisition from the web

The process is triggered by identifying multilingual and bilingual websites with content related to the targeted domains. The main sources are websites of national agencies, international organizations and broadcasters. Then, the ILSP Focused Crawler (ILSP-FC) toolkit is used to acquire the main content of the detected websites and to identify pairs of candidate parallel documents. Depending on the format of the source data, efficient methods for text extraction are applied, including, for instance, OCR on PDF files. The next step leverages multilingual embeddings to extract TUs. Finally, a battery of criteria is applied to filter out TUs of limited or no use (e.g. sentences containing only numbers) and thus generate parallel language resources (LRs) of high quality. It is worth mentioning that the constructed datasets are clustered into groups according to the conditions of use indicated on the websites the data originated from.
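To make the last two steps more concrete, the sketch below mines and filters TUs from the pre-split sentences of a candidate document pair. It is a minimal illustration that assumes LaBSE embeddings via the sentence-transformers library as a stand-in for whatever multilingual embeddings the pipeline actually uses; the similarity threshold, the greedy one-best matching and the filtering rules (numeric-only content, minimum length, length ratio) are illustrative choices, not ILSP's actual criteria.

# Hypothetical TU mining and filtering sketch (illustrative thresholds).
import re
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/LaBSE")

def mine_tus(src_sents, tgt_sents, threshold=0.8):
    # Embed both sides, then greedily pair each source sentence with its
    # most similar target sentence; keep only confident pairs.
    src_emb = model.encode(src_sents, convert_to_tensor=True,
                           normalize_embeddings=True)
    tgt_emb = model.encode(tgt_sents, convert_to_tensor=True,
                           normalize_embeddings=True)
    sim = util.cos_sim(src_emb, tgt_emb)  # len(src) x len(tgt) similarities
    tus = []
    for i in range(len(src_sents)):
        j = int(sim[i].argmax())
        score = float(sim[i][j])
        if score >= threshold:
            tus.append((src_sents[i], tgt_sents[j], score))
    return tus

NUMERIC_ONLY = re.compile(r"^[\d\s.,:;/%()-]+$")

def keep_tu(src, tgt, min_tokens=3, max_len_ratio=2.0):
    # Illustrative filters: drop numeric-only, very short, or badly
    # length-mismatched sentence pairs.
    if NUMERIC_ONLY.match(src) or NUMERIC_ONLY.match(tgt):
        return False
    n_src, n_tgt = len(src.split()), len(tgt.split())
    if min(n_src, n_tgt) < min_tokens:
        return False
    return max(n_src, n_tgt) / min(n_src, n_tgt) <= max_len_ratio

pairs = mine_tus(["The vaccine is safe and effective."],
                 ["Der Impfstoff ist sicher und wirksam."])
clean = [(s, t, score) for s, t, score in pairs if keep_tu(s, t)]

A production pipeline would of course enforce one-to-one matching and apply many more filters, but the two functions above capture the shape of the embedding and filtering steps described here.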

 
 
