elrc-medical-v2

ELRC-Medical-V2 : European parallel corpus for healthcare machine translation

https://github.com/qanastek/elrc-medical-v2

Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (7.4%) to scientific vocabulary

Repository


Basic Info
  • Host: GitHub
  • Owner: qanastek
  • License: cc-by-4.0
  • Language: Python
  • Default Branch: main
  • Size: 27.6 MB
Statistics
  • Stars: 1
  • Watchers: 1
  • Forks: 1
  • Open Issues: 0
  • Releases: 0
Created about 4 years ago · Last pushed over 2 years ago

README.md


---
annotations_creators:
- machine-generated
- expert-generated
language_creators:
- found
languages:
- en
licenses:
- cc-by-4-0
multilinguality:
- bg
- cs
- da
- de
- el
- es
- et
- fi
- fr
- ga
- hr
- hu
- it
- lt
- lv
- mt
- nl
- pl
- pt
- ro
- sk
- sl
- sv
pretty_name: ELRC-Medical-V2
size_categories:
- 100K<n<1M
source_datasets:
- extended
task_categories:
- translation
task_ids:
- translation
---

ELRC-Medical-V2 : European parallel corpus for healthcare machine translation


Dataset Description

  • Homepage: https://qanastek.github.io/ELRC-Medical-V2/
  • Repository: https://github.com/qanastek/ELRC-Medical-V2/
  • Paper: [Needs More Information]
  • Leaderboard: [Needs More Information]
  • Point of Contact: yanis.labrak@univ-avignon.fr

Dataset Summary

ELRC-Medical-V2 is a parallel corpus for neural machine translation funded by the European Commission and coordinated by the German Research Center for Artificial Intelligence.

Supported Tasks and Leaderboards

translation: The dataset can be used to train a model for translation.

Languages

The corpus consists of pairs of source and target sentences for 23 languages of the European Union (EU), with English (EN) as the source language in every case.

List of languages: Bulgarian (bg), Czech (cs), Danish (da), German (de), Greek (el), Spanish (es), Estonian (et), Finnish (fi), French (fr), Irish (ga), Croatian (hr), Hungarian (hu), Italian (it), Lithuanian (lt), Latvian (lv), Maltese (mt), Dutch (nl), Polish (pl), Portuguese (pt), Romanian (ro), Slovak (sk), Slovenian (sl), Swedish (sv).
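The `lang` field in the data instances uses pair codes such as `en-bg`. As a minimal sketch, assuming every pair follows the same `en-<code>` pattern (only `en-bg` appears in the sample rows), the full list of pair codes can be built from the language codes above:

```python
# Codes of the 23 target languages listed above.
TARGET_LANGS = [
    "bg", "cs", "da", "de", "el", "es", "et", "fi", "fr", "ga", "hr", "hu",
    "it", "lt", "lv", "mt", "nl", "pl", "pt", "ro", "sk", "sl", "sv",
]

# Assumption: pair codes follow the "en-<target>" pattern seen in "en-bg".
LANG_PAIRS = [f"en-{code}" for code in TARGET_LANGS]

print(len(LANG_PAIRS), "language pairs, starting with", LANG_PAIRS[:3])
```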

Load the dataset with HuggingFace

```python
from datasets import load_dataset

# Download the corpus from the Hugging Face Hub and print its splits and features
dataset = load_dataset("qanastek/ELRC-Medical-V2")
print(dataset)
```
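Once loaded, rows can be grouped or filtered by language pair. The sketch below works on any iterable of dict-like rows with the documented fields. The rows in it are made-up placeholders so that it runs offline, and the helper names `count_by_pair` and `select_pair` are hypothetical, not part of the `datasets` library. With the real data, one would pass a loaded split instead (its name is not stated here, so check the output of `print(dataset)`).

```python
from collections import Counter

def count_by_pair(rows):
    """Count rows per language pair (the `lang` field)."""
    return Counter(row["lang"] for row in rows)

def select_pair(rows, pair):
    """Keep only the rows whose `lang` field equals `pair`."""
    return [row for row in rows if row["lang"] == pair]

# Placeholder rows for illustration only; they are not taken from the corpus.
rows = [
    {"id": 1, "lang": "en-bg", "source_text": "placeholder A", "target_text": "placeholder A'"},
    {"id": 2, "lang": "en-bg", "source_text": "placeholder B", "target_text": "placeholder B'"},
    {"id": 3, "lang": "en-fr", "source_text": "placeholder C", "target_text": "placeholder C'"},
]

print(count_by_pair(rows))
print(select_pair(rows, "en-fr"))
```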

Dataset Structure

Data Instances

```plain
id,lang,source_text,target_text
1,en-bg,"TOC \o ""1-3"" \h \z \u Introduction 3","TOC \o ""1-3"" \h \z \u Въведение 3"
2,en-bg,The international humanitarian law and its principles are often not respected.,Международното хуманитарно право и неговите принципи често не се зачитат.
3,en-bg,"At policy level, progress was made on several important initiatives.",На равнище политики напредък е постигнат по няколко важни инициативи.
```

Data Fields

id : The document identifier of type Integer.

lang : The pair of source and target language of type String.

source_text : The source text of type String.

target_text : The target text of type String.
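The sample rows in Data Instances are CSV, and some fields are quoted because they contain commas. As a minimal sketch, assuming the data is read as standard CSV with a header row, Python's `csv` module handles the quoting and the fields can be cast to the types described above. The helper name `parse_rows` and the `source_lang`/`target_lang` keys are illustrative, not part of the dataset.

```python
import csv
import io

# Rows 2 and 3 from the "Data Instances" sample, plus the header.
SAMPLE_CSV = """id,lang,source_text,target_text
2,en-bg,The international humanitarian law and its principles are often not respected.,Международното хуманитарно право и неговите принципи често не се зачитат.
3,en-bg,"At policy level, progress was made on several important initiatives.",На равнище политики напредък е постигнат по няколко важни инициативи.
"""

def parse_rows(text):
    """Parse CSV text into dicts with an integer id and a split language pair."""
    parsed = []
    for raw in csv.DictReader(io.StringIO(text)):
        source_lang, target_lang = raw["lang"].split("-")
        parsed.append({
            "id": int(raw["id"]),
            "source_lang": source_lang,
            "target_lang": target_lang,
            "source_text": raw["source_text"],
            "target_text": raw["target_text"],
        })
    return parsed

parsed_rows = parse_rows(SAMPLE_CSV)
for row in parsed_rows:
    print(row["id"], row["source_lang"], "->", row["target_lang"], "|", row["source_text"])
```

The quoted field in row 3 comes back as a single string with its comma intact, which a naive `split(",")` would break.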

Data Splits

| Lang   | # Docs    | Avg. # Source Tokens   | Avg. # Target Tokens   |
|--------|-----------|------------------------|------------------------|
| bg     | 13 149    | 23                     | 24                     |
| cs     | 13 160    | 23                     | 21                     |
| da     | 13 242    | 23                     | 22                     |
| de     | 13 291    | 23                     | 22                     |
| el     | 13 091    | 23                     | 26                     |
| es     | 13 195    | 23                     | 28                     |
| et     | 13 016    | 23                     | 17                     |
| fi     | 12 942    | 23                     | 16                     |
| fr     | 13 149    | 23                     | 28                     |
| ga     | 412       | 12                     | 12                     |
| hr     | 12 836    | 23                     | 21                     |
| hu     | 13 025    | 23                     | 21                     |
| it     | 13 059    | 23                     | 25                     |
| lt     | 12 580    | 23                     | 18                     |
| lv     | 13 044    | 23                     | 19                     |
| mt     | 3 093     | 16                     | 14                     |
| nl     | 13 191    | 23                     | 25                     |
| pl     | 12 761    | 23                     | 22                     |
| pt     | 13 148    | 23                     | 26                     |
| ro     | 13 163    | 23                     | 25                     |
| sk     | 12 926    | 23                     | 20                     |
| sl     | 13 208    | 23                     | 21                     |
| sv     | 13 099    | 23                     | 21                     |
|        |           |                        |                        |
| Total  | 277 780   | 22.21                  | 21.47                  |
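The Total row can be cross-checked from the per-language rows. The sketch below assumes the two averages in that row are unweighted means over the 23 language pairs, truncated to two decimals. This is an inference from the numbers, not something the source states.

```python
import math

# (documents, avg. source tokens, avg. target tokens) per target language,
# copied from the table above.
STATS = {
    "bg": (13149, 23, 24), "cs": (13160, 23, 21), "da": (13242, 23, 22),
    "de": (13291, 23, 22), "el": (13091, 23, 26), "es": (13195, 23, 28),
    "et": (13016, 23, 17), "fi": (12942, 23, 16), "fr": (13149, 23, 28),
    "ga": (412, 12, 12),   "hr": (12836, 23, 21), "hu": (13025, 23, 21),
    "it": (13059, 23, 25), "lt": (12580, 23, 18), "lv": (13044, 23, 19),
    "mt": (3093, 16, 14),  "nl": (13191, 23, 25), "pl": (12761, 23, 22),
    "pt": (13148, 23, 26), "ro": (13163, 23, 25), "sk": (12926, 23, 20),
    "sl": (13208, 23, 21), "sv": (13099, 23, 21),
}

def truncate(value, digits=2):
    """Drop (rather than round) everything past `digits` decimals."""
    factor = 10 ** digits
    return math.floor(value * factor) / factor

total_docs = sum(docs for docs, _, _ in STATS.values())
# Assumption: unweighted mean across language pairs, not weighted by documents.
mean_source = sum(src for _, src, _ in STATS.values()) / len(STATS)
mean_target = sum(tgt for _, _, tgt in STATS.values()) / len(STATS)

print(total_docs, truncate(mean_source), truncate(mean_target))
```

Because Irish (ga) and Maltese (mt) have far fewer documents than the other pairs, an average weighted by document count would differ from these figures.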

Dataset Creation

Curation Rationale

For details, check the corresponding pages.

Source Data

Initial Data Collection and Normalization

The acquisition of bilingual data (from multilingual websites), normalization, cleaning, deduplication, and identification of parallel documents were performed with the ILSP-FC tool. The Maligna aligner was used to align segments. Segment pairs were then merged and filtered.
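To make the deduplication and filtering steps concrete, here is a generic sketch of what cleaning aligned segment pairs can look like. It is not the ILSP-FC or Maligna implementation; the function name `clean_pairs` and the two rules it applies (drop pairs with an empty side, drop exact duplicates) are assumptions chosen for illustration.

```python
def clean_pairs(pairs):
    """Strip whitespace, drop pairs with an empty side, and drop exact duplicates.

    `pairs` is an iterable of (source, target) strings; order is preserved.
    """
    seen = set()
    kept = []
    for source, target in pairs:
        pair = (source.strip(), target.strip())
        if not pair[0] or not pair[1]:
            continue  # one side is empty after stripping
        if pair in seen:
            continue  # exact duplicate of an earlier pair
        seen.add(pair)
        kept.append(pair)
    return kept

# Placeholder segments for illustration only.
example = [
    ("Hello", "Bonjour"),
    ("Hello ", "Bonjour"),   # duplicate once whitespace is stripped
    ("", "Sans source"),     # empty source side
    ("Thanks", "Merci"),
]
print(clean_pairs(example))
```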

Who are the source language producers?

All data in this corpus was uploaded to ELRC-Share by Vassilis Papavassiliou.

Personal and Sensitive Information

The corpus is free of personal or sensitive information.

Considerations for Using the Data

Other Known Limitations

The nature of the task introduces variability in the quality of the target translations.

Additional Information

Dataset Curators

ELRC-Medical-V2: Labrak Yanis, Dufour Richard

Bilingual corpus from the Publications Office of the EU on the medical domain v.2 (EN-XX) Corpus: Vassilis Papavassiliou and others.

Licensing Information

This work is licensed under an Attribution 4.0 International (CC BY 4.0) License.

Citation Information

Please cite the following paper when using this dataset.

```latex
@inproceedings{losch-etal-2018-european,
    title = "European Language Resource Coordination: Collecting Language Resources for Public Sector Multilingual Information Management",
    author = {
        Lösch, Andrea and
        Mapelli, Valérie and
        Piperidis, Stelios and
        Vasiljevs, Andrejs and
        Smal, Lilli and
        Declerck, Thierry and
        Schnur, Eileen and
        Choukri, Khalid and
        van Genabith, Josef
    },
    booktitle = "Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)",
    month = may,
    year = "2018",
    address = "Miyazaki, Japan",
    publisher = "European Language Resources Association (ELRA)",
    url = "https://aclanthology.org/L18-1213",
}
```

Owner

  • Name: Labrak Yanis
  • Login: qanastek
  • Kind: user
  • Location: Avignon, France
  • Company: Laboratoire Informatique d'Avignon

👨🏻‍🎓 PhD. student in Computer Science (CS), Avignon University 🇫🇷 🏛 Research Scientist - Machine Learning in Healthcare

Citation (CITATION.cff)

cff-version: 1.0.0
message: "European Language Resource Coordination: Collecting Language Resources for Public Sector Multilingual Information Management"
authors:
- family-names: "Lösch"
  given-names: "Andrea"
- family-names: "Mapelli"
  given-names: "Valérie"
- family-names: "Piperidis"
  given-names: "Stelios"
- family-names: "Vasiļjevs"
  given-names: "Andrejs"
- family-names: "Smal"
  given-names: "Lilli"
- family-names: "Declerck"
  given-names: "Thierry"
- family-names: "Schnur"
  given-names: "Eileen"
- family-names: "Choukri"
  given-names: "Khalid"
- family-names: "van Genabith"
  given-names: "Josef"
title: "European Language Resource Coordination: Collecting Language Resources for Public Sector Multilingual Information Management"
version: 2.0.0
date-released: 2018-01-01
url: "https://aclanthology.org/L18-1213/"


Issues and Pull Requests

Last synced: 12 months ago

All Time
  • Total issues: 0
  • Total pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Total issue authors: 0
  • Total pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0