EU – CEF (Connecting Europe Facility) / Telecommunications sector
Total eligible costs: 1,883,714.67 EUR
Estimated CEF contribution: 1,412,786.00 EUR
2018-10-01 – 2020-09-30 (24 months)
Research Institute for Linguistics of the Hungarian Academy of Sciences (RILMTA)
The overall objective of the MARCELL Action (Multilingual Resources for CEF.AT in the legal domain) is to provide automatic translation of the body of national legislation (laws, decrees, regulations) in seven countries: Bulgaria, Croatia, Hungary, Poland, Romania, Slovakia and Slovenia. At present, national legislation texts are not automatically available to CEF.AT, and current Machine Translation (MT) systems could be improved if they had access to them.
The Action aims to process two resources available in all seven languages concerned: the multilingual ontology-based thesaurus EuroVoc and the corpora of national legislation in the respective languages. As a result, the Action will produce the following deliverables:
In addition to the expected overall improvement of the MT system in the seven languages concerned, the Action will have an impact on both the e-Justice and the Online Dispute Resolution Digital Service Infrastructures (DSIs), as the resources focus on national legislation, which is of direct relevance to both DSIs.
The Croatian corpus consists of 33,788 documents that represent the national legislation from 1990 until today. The corpus is composed of legally binding acts (laws, regulations, decisions, orders, etc.) and internally binding acts (ordinances, recommendations, etc.). There are 12 different text types, the three most frequent being ordinances (11,521), decisions (7,735) and laws (3,798). The data come from the Central State Office for the Development of the Digital Society of the Republic of Croatia (RDD) (http://rdd.gov.hr), whose mission includes securing online access to all Croatian legal documentation. We received the final data set from the RDD database in October 2019, and the figures given here reflect that state. The data were delivered in a proprietary XML format that had to be converted into the CoNLL-U Plus format; the relevant accompanying metadata were extracted from the RDD database.
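As an illustration of the target format, the following minimal Python sketch serialises one tokenised sentence as CoNLL-U Plus, whose first line declares the columns with a "# global.columns" comment. It is not the converter used in the Action: the proprietary XML input is not shown, and the extra column name EXAMPLE:TERM, the function name and the sample sentence are assumptions made for the example.

```python
# Sketch only: write tokenised sentences as CoNLL-U Plus text.
# The column "EXAMPLE:TERM" and the sample data are hypothetical.

COLUMNS = ["ID", "FORM", "LEMMA", "UPOS", "EXAMPLE:TERM"]


def to_conllu_plus(sentences, columns=COLUMNS):
    """Return CoNLL-U Plus text for a list of sentences.

    Each sentence is a dict with a "text" string and a "tokens" list;
    each token is a dict keyed by column name (ID is generated here).
    """
    # CoNLL-U Plus declares its columns on the first line of the file.
    lines = ["# global.columns = " + " ".join(columns)]
    for sent_no, sent in enumerate(sentences, start=1):
        lines.append(f"# sent_id = {sent_no}")
        lines.append(f"# text = {sent['text']}")
        for tok_no, tok in enumerate(sent["tokens"], start=1):
            fields = [str(tok_no)]
            # Missing values are written as "_", as in plain CoNLL-U.
            fields += [tok.get(col, "_") for col in columns[1:]]
            lines.append("\t".join(fields))
        lines.append("")  # a blank line terminates each sentence
    return "\n".join(lines) + "\n"


example = [{
    "text": "Zakon stupa na snagu.",
    "tokens": [
        {"FORM": "Zakon", "LEMMA": "zakon", "UPOS": "NOUN"},
        {"FORM": "stupa", "LEMMA": "stupati", "UPOS": "VERB"},
        {"FORM": "na", "LEMMA": "na", "UPOS": "ADP"},
        {"FORM": "snagu", "LEMMA": "snaga", "UPOS": "NOUN"},
        {"FORM": ".", "LEMMA": ".", "UPOS": "PUNCT"},
    ],
}]

conllu_plus_text = to_conllu_plus(example)
```

Keeping the column list as a parameter means further annotation layers can be added later as extra namespaced columns without changing the writer.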
The corpus was analysed with the Croatian Language Web Services (Padró et al., 2014): paragraphs and sentences were split, and tokens were identified and annotated morphologically and syntactically. An annotation tool is being developed to annotate IATE terms and EuroVoc descriptors in the corpus by matching them against single-word and multi-word expressions (SWEs/MWEs). Overall, the corpus contains almost 10.3 M sentences and around 102 M tokens.
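The annotation tool itself is not described here, so the following is only a sketch of one common way to match terminology entries against single-word and multi-word expressions: a greedy longest-match lookup over lemma sequences. The term identifiers, the term list and the sample sentence are hypothetical placeholders, not IATE or EuroVoc data.

```python
# Sketch only: greedy longest-match lookup of terms over a lemma sequence.
# Term IDs, the term list and the sample sentence are hypothetical.

def build_index(terms):
    """Map each term's lower-cased lemma tuple to its identifier."""
    return {tuple(lemmas.lower().split()): term_id
            for term_id, lemmas in terms.items()}


def match_terms(lemmas, index):
    """Return (start, end, term_id) spans, preferring the longest match."""
    max_len = max((len(key) for key in index), default=0)
    lowered = [lemma.lower() for lemma in lemmas]
    spans = []
    i = 0
    while i < len(lowered):
        # Try the longest candidate first so that a multi-word term
        # wins over a single-word term starting at the same token.
        for n in range(min(max_len, len(lowered) - i), 0, -1):
            term_id = index.get(tuple(lowered[i:i + n]))
            if term_id is not None:
                spans.append((i, i + n, term_id))
                i += n
                break
        else:
            i += 1  # no term starts at this token
    return spans


terms = {"T1": "kazneni postupak", "T2": "postupak", "T3": "sud"}
lemmas = ["sud", "voditi", "kazneni", "postupak"]
spans = match_terms(lemmas, build_index(terms))
# spans == [(0, 1, "T3"), (2, 4, "T1")]
```

Matching on lemmas rather than surface forms matters for a highly inflected language such as Croatian, where a term can appear in many case and number forms.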
The Croatian-English parallel corpus consists of 1,816 documents: 1,563 are original Croatian legislative texts translated into English, and 253 are international treaties that the Republic of Croatia signed with various international parties, with the Croatian and English versions both valid as originals.
This parallel corpus has been split into paragraphs and sentences and aligned at segment level, and each of the 396,984 translation units (TUs) was manually checked for alignment. The file format is TMX (v1.4), with additional metadata stored in the header. The corpus is a rich and valuable resource for further studies and developments in machine translation, machine learning, cross-lingual terminological data extraction and classification.
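For readers who want to work with TMX data, the sketch below shows one way to read aligned segment pairs from a TMX 1.4 document using Python's standard library. The sample document, its header values and the function name are assumptions made for the example; the actual corpus files and their header metadata may be organised differently.

```python
# Sketch only: extract source/target segment pairs from a TMX document.
# SAMPLE_TMX is a hypothetical stand-in, not an excerpt from the corpus.
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

SAMPLE_TMX = """<tmx version="1.4">
  <header creationtool="example" creationtoolversion="1.0" segtype="sentence"
          o-tmf="example" adminlang="en" srclang="hr" datatype="plaintext"/>
  <body>
    <tu>
      <tuv xml:lang="hr"><seg>Ovaj Zakon stupa na snagu osmoga dana od dana objave.</seg></tuv>
      <tuv xml:lang="en"><seg>This Act shall enter into force on the eighth day after its publication.</seg></tuv>
    </tu>
  </body>
</tmx>
"""


def read_translation_units(tmx_text, src="hr", tgt="en"):
    """Return a list of (source, target) segment pairs.

    Language codes are compared exactly after lower-casing, so a
    regional code such as "hr-HR" would need extra handling.
    """
    root = ET.fromstring(tmx_text)
    pairs = []
    for tu in root.iter("tu"):
        segs = {}
        for tuv in tu.findall("tuv"):
            seg = tuv.find("seg")
            if seg is not None:
                # itertext() also collects text nested in inline markup.
                segs[tuv.get(XML_LANG, "").lower()] = "".join(seg.itertext())
        if src in segs and tgt in segs:
            pairs.append((segs[src], segs[tgt]))
    return pairs


pairs = read_translation_units(SAMPLE_TMX)
```

For a corpus of this size, a streaming parser such as ET.iterparse would keep memory use lower than loading the whole file at once.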