Funding

EU – CEF (Connecting Europe Facility) / Telecommunications sector
Total eligible costs: 1,883,714.67 EUR
Estimated CEF contribution: 1,412,786.00 EUR

Duration

2018-10-01 – 2020-09-30 (24 months)

Coordinator

Research Institute for Linguistics of the Hungarian Academy of Sciences (RILMTA)

Partners

  • Institute for Bulgarian Language "Prof. Lyubomir Andreychin", Bulgaria;
  • Faculty of Humanities and Social Sciences, University of Zagreb, Croatia;
  • Institute of Computer Science at the Polish Academy of Sciences, Poland;
  • Institute of Artificial Intelligence at the Romanian Academy, Romania;
  • Ľudovít Štúr Institute of Linguistics at the Slovak Academy of Sciences, Slovakia;
  • Jožef Stefan Institute, Slovenia.

Description

The overall objective of the MARCELL Action (Multilingual Resources for CEF.AT in the legal domain) is to provide automatic translation of the body of national legislation (laws, decrees, regulations) in seven countries: Bulgaria, Croatia, Hungary, Poland, Romania, Slovakia and Slovenia. At present, national legislative texts are not automatically available to CEF.AT, and current machine translation (MT) systems could be improved if they had access to them.

The Action aims to process two resources available in all seven languages concerned: the multilingual ontology-based thesaurus EUROVOC on the one hand, and the corpora of national legislation in the respective languages on the other. As a result, the Action will produce the following deliverables:

  1. Seven large-scale, suitably pre-processed (tokenized and morphologically tagged) monolingual corpora of national legislation, with documents classified by EUROVOC topics/descriptors and enriched with identified EUROVOC and IATE terms.
  2. A comparable corpus of the seven languages, aligned at the level of the topic domains identified by EUROVOC descriptors.
  3. A Croatian-English parallel corpus consisting of ca. 1,800 legislative documents.

In addition to the expected overall improvement of the MT system in the seven languages concerned, the Action will have an impact on both the e-Justice and the Online Dispute Resolution Digital Service Infrastructures (DSIs), as the resources focus on national legislation, which is of direct relevance to both DSIs.

Croatian monolingual corpus of national legislation

The Croatian corpus consists of 33,788 documents that represent the national legislation from 1990 onwards. The corpus is composed of legally binding acts (laws, regulations, decisions, orders, etc.) and internally binding acts (ordinances, recommendations, etc.). There are 12 different text types, with ordinances (11,521), decisions (7,735) and laws (3,798) as the three most frequent. The final data set was received in October 2019 from the database of the Central State Office for the Development of the Digital Society of the Republic of Croatia (RDD) (http://rdd.gov.hr), whose mission includes securing online access to all Croatian legal documentation; the figures presented here reflect the state of the database at that time. The data were delivered in a proprietary XML format that had to be converted into the CoNLL-U Plus format, and the relevant accompanying metadata were extracted from the RDD database.
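The conversion step can be illustrated with a minimal Python sketch. The proprietary RDD XML schema is not described here, so the input layout below (a document element with p children) is purely hypothetical, as are the function name and sample text. Only the output conventions are taken from the CoNLL-U Plus format: a "# global.columns" header line, tab-separated token rows, and a blank line after each sentence. Tokenization is a whitespace placeholder, and each paragraph is treated as a single sentence for brevity.

```python
import xml.etree.ElementTree as ET

# Hypothetical input layout; the real RDD XML schema is not documented here.
SAMPLE_XML = """<document id="doc-1" type="law">
  <p>Ovaj Zakon stupa na snagu.</p>
</document>"""

# The ten standard CoNLL-U columns, declared explicitly as CoNLL-U Plus requires.
COLUMNS = ["ID", "FORM", "LEMMA", "UPOS", "XPOS",
           "FEATS", "HEAD", "DEPREL", "DEPS", "MISC"]


def xml_to_conllu_plus(xml_text):
    """Convert the hypothetical XML layout into CoNLL-U Plus text."""
    root = ET.fromstring(xml_text)
    doc_id = root.get("id", "unknown")
    lines = ["# global.columns = " + " ".join(COLUMNS),
             "# newdoc id = " + doc_id]
    for p_index, p in enumerate(root.iter("p"), start=1):
        text = (p.text or "").strip()
        if not text:
            continue
        lines.append(f"# newpar id = {doc_id}-p{p_index}")
        # Assumption: one sentence per paragraph; a real pipeline would
        # run a sentence splitter here.
        lines.append(f"# sent_id = {doc_id}-p{p_index}-s1")
        lines.append("# text = " + text)
        # Placeholder whitespace tokenization; annotation columns stay "_".
        for tok_id, form in enumerate(text.split(), start=1):
            row = [str(tok_id), form] + ["_"] * (len(COLUMNS) - 2)
            lines.append("\t".join(row))
        lines.append("")  # blank line terminates the sentence
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(xml_to_conllu_plus(SAMPLE_XML), end="")
```

The unfilled columns are left as underscores so that a later annotation step can populate them without changing the file layout.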

The corpus was analysed with the Croatian Language Web Services (Padró et al., 2014): paragraphs and sentences were split, tokens were identified, and the text was morphologically and syntactically annotated. An annotation tool is being developed to annotate IATE terms and EuroVoc descriptors in the corpus by matching them against single-word and multiword expressions (SWEs/MWEs) in the text. The overall size of the corpus is almost 10.3 M sentences and around 102 M tokens.
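One common way to match terminology against single-word and multiword expressions is a greedy longest-match lookup over lemma sequences. The sketch below illustrates that general technique only; it is not the annotation tool under development, and the function name, the max_len limit, the lemmas and the term identifiers are all made up for illustration.

```python
def annotate_terms(lemmas, term_index, max_len=5):
    """Label each token with a term id, or "_" if it is not part of a term.

    lemmas     -- list of lemmas for one sentence
    term_index -- dict mapping a tuple of lemmas to a term identifier
    max_len    -- longest multiword expression (in tokens) to try
    """
    labels = ["_"] * len(lemmas)
    i = 0
    while i < len(lemmas):
        matched = 0
        # Try the longest candidate first so MWEs win over embedded SWEs.
        for n in range(min(max_len, len(lemmas) - i), 0, -1):
            key = tuple(lemmas[i:i + n])
            if key in term_index:
                for j in range(i, i + n):
                    labels[j] = term_index[key]
                matched = n
                break
        # Skip past a matched term, otherwise advance by one token.
        i += matched if matched else 1
    return labels


if __name__ == "__main__":
    # Illustrative lemmas and identifiers, not real EuroVoc or IATE entries.
    index = {
        ("zaštita", "okoliš"): "demo-term-1",
        ("zakon",): "demo-term-2",
        ("zaštita",): "demo-term-3",
    }
    sentence = ["zakon", "o", "zaštita", "okoliš", "stupati", "na", "snaga"]
    for lemma, label in zip(sentence, annotate_terms(sentence, index)):
        print(lemma, label, sep="\t")
```

Matching on lemmas rather than surface forms matters for a morphologically rich language such as Croatian, where a term can appear in many inflected forms.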

Croatian-English parallel corpus of legislative texts

The corpus consists of 1,816 documents: 1,563 are original Croatian legislative texts translated into English, and 253 are international treaties that the Republic of Croatia signed with various international parties, with the Croatian and English versions both valid as originals.

The parallel corpus was processed through paragraph and sentence splitting and segment alignment, and the alignment of each of the 396,984 translation units (TUs) was checked manually. The file format is TMX (v1.4), with additional metadata stored in the header. This corpus represents a rich and valuable source for further studies and developments in machine translation, machine learning, cross-lingual terminological data extraction and classification.
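For readers who want to work with a TMX file, the following Python sketch shows one way to extract aligned segment pairs using only the standard library. The sample document, the function name and the language codes "hr" and "en" are assumptions made for illustration; the actual corpus may use different language tags (for example with region suffixes) and carries additional header metadata that is not modelled here.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Minimal illustrative TMX 1.4 document; not taken from the actual corpus.
SAMPLE_TMX = """<tmx version="1.4">
  <header creationtool="demo" creationtoolversion="1.0"
          datatype="plaintext" segtype="sentence"
          adminlang="en" srclang="hr" o-tmf="demo"/>
  <body>
    <tu>
      <tuv xml:lang="hr"><seg>Ovaj Zakon stupa na snagu.</seg></tuv>
      <tuv xml:lang="en"><seg>This Act shall enter into force.</seg></tuv>
    </tu>
  </body>
</tmx>"""


def read_tmx_pairs(tmx_text, src="hr", tgt="en"):
    """Return (source, target) segment pairs from TMX text."""
    root = ET.fromstring(tmx_text)
    pairs = []
    for tu in root.iter("tu"):
        segs = {}
        for tuv in tu.findall("tuv"):
            lang = (tuv.get(XML_LANG) or "").lower()
            seg = tuv.find("seg")
            if seg is not None:
                # itertext() also collects text nested in inline markup.
                segs[lang] = "".join(seg.itertext()).strip()
        # Keep only translation units that contain both languages.
        if src in segs and tgt in segs:
            pairs.append((segs[src], segs[tgt]))
    return pairs


if __name__ == "__main__":
    for hr, en in read_tmx_pairs(SAMPLE_TMX):
        print(hr, "|||", en)
```

For a corpus of several hundred thousand translation units, a streaming parser such as xml.etree.ElementTree.iterparse would keep memory use lower than loading the whole tree at once.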