elrc-medical-v2
ELRC-Medical-V2 : European parallel corpus for healthcare machine translation
Science Score: 44.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: found
- ✓ codemeta.json file: found
- ✓ .zenodo.json file: found
- ○ DOI references
- ○ Academic publication links
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (7.4%) to scientific vocabulary
Repository
Basic Info
- Host: GitHub
- Owner: qanastek
- License: cc-by-4.0
- Language: Python
- Default Branch: main
- Size: 27.6 MB
Statistics
- Stars: 1
- Watchers: 1
- Forks: 1
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
---
annotations_creators:
- machine-generated
- expert-generated
language_creators:
- found
languages:
- en
licenses:
- cc-by-4-0
multilinguality:
- bg
- cs
- da
- de
- el
- es
- et
- fi
- fr
- ga
- hr
- hu
- it
- lt
- lv
- mt
- nl
- pl
- pt
- ro
- sk
- sl
- sv
pretty_name: ELRC-Medical-V2
size_categories:
- 100K<n<1M
source_datasets:
- extended
task_categories:
- translation
task_ids:
- translation
---
ELRC-Medical-V2 : European parallel corpus for healthcare machine translation
Table of Contents
- Dataset Card for [Needs More Information]
- Table of Contents
- Dataset Description
  - Dataset Summary
  - Supported Tasks and Leaderboards
  - Languages
- Dataset Structure
  - Data Instances
  - Data Fields
  - Data Splits
- Dataset Creation
  - Curation Rationale
  - Source Data
  - Personal and Sensitive Information
- Considerations for Using the Data
  - Other Known Limitations
- Additional Information
  - Dataset Curators
  - Licensing Information
  - Citation Information
Dataset Description
- Homepage: https://qanastek.github.io/ELRC-Medical-V2/
- Repository: https://github.com/qanastek/ELRC-Medical-V2/
- Paper: [Needs More Information]
- Leaderboard: [Needs More Information]
- Point of Contact: yanis.labrak@univ-avignon.fr
Dataset Summary
ELRC-Medical-V2 is a parallel corpus for neural machine translation funded by the European Commission and coordinated by the German Research Center for Artificial Intelligence.
Supported Tasks and Leaderboards
translation: The dataset can be used to train a model for translation.
Languages
The corpus consists of source-target sentence pairs covering 23 European Union (EU) languages, with English (EN) as the source language in every pair.
List of languages: Bulgarian (bg), Czech (cs), Danish (da), German (de), Greek (el), Spanish (es), Estonian (et), Finnish (fi), French (fr), Irish (ga), Croatian (hr), Hungarian (hu), Italian (it), Lithuanian (lt), Latvian (lv), Maltese (mt), Dutch (nl), Polish (pl), Portuguese (pt), Romanian (ro), Slovak (sk), Slovenian (sl), Swedish (sv).
Load the dataset with HuggingFace
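```python
from datasets import load_dataset

# Download the corpus from the Hugging Face Hub (or reuse the local cache)
dataset = load_dataset("qanastek/ELRC-Medical-V2")

# Show the available splits, column names and row counts
print(dataset)
```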
Dataset Structure
Data Instances
```plain
id,lang,source_text,target_text
1,en-bg,"TOC \o ""1-3"" \h \z \u Introduction 3","TOC \o ""1-3"" \h \z \u Въведение 3"
2,en-bg,The international humanitarian law and its principles are often not respected.,Международното хуманитарно право и неговите принципи често не се зачитат.
3,en-bg,"At policy level, progress was made on several important initiatives.",На равнище политики напредък е постигнат по няколко важни инициативи.
```
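The excerpt above is ordinary CSV: fields that contain commas are quoted, and a literal double quote is written as two double quotes. As a minimal sketch, assuming the rows are available as text in this layout, Python's standard `csv` module can read them. The `read_rows` helper and the `SAMPLE` constant are illustrative names, not part of the dataset tooling.

```python
import csv
import io

# Rows copied from the Data Instances excerpt; a raw string keeps the backslashes intact.
SAMPLE = r'''id,lang,source_text,target_text
1,en-bg,"TOC \o ""1-3"" \h \z \u Introduction 3","TOC \o ""1-3"" \h \z \u Въведение 3"
2,en-bg,The international humanitarian law and its principles are often not respected.,Международното хуманитарно право и неговите принципи често не се зачитат.
3,en-bg,"At policy level, progress was made on several important initiatives.",На равнище политики напредък е постигнат по няколко важни инициативи.
'''


def read_rows(text):
    """Parse CSV text into a list of dicts, converting `id` to int."""
    reader = csv.DictReader(io.StringIO(text))
    return [dict(row, id=int(row["id"])) for row in reader]


rows = read_rows(SAMPLE)
print(rows[0]["source_text"])  # doubled quotes collapse to single quotes
print(rows[2]["source_text"])  # the embedded comma stays inside one field
```

Note that the first row is a table-of-contents field code left over from the source documents, so some filtering of such rows may be worthwhile before training.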
Data Fields
- id: The document identifier, of type Integer.
- lang: The source-target language pair, of type String (for example en-bg).
- source_text: The source text, of type String.
- target_text: The target text, of type String.
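Because every row carries its language pair in the lang field, selecting one target language comes down to splitting that string. The sketch below assumes rows are plain dictionaries with the four fields listed above. `Row`, `split_lang` and `rows_for_target` are hypothetical helper names, and the en-fr row is an invented placeholder used only to exercise the filter.

```python
from typing import Iterable, Iterator, Tuple, TypedDict


class Row(TypedDict):
    id: int
    lang: str
    source_text: str
    target_text: str


def split_lang(lang: str) -> Tuple[str, str]:
    """Split a pair code such as 'en-bg' into (source, target)."""
    source, target = lang.split("-", 1)
    return source, target


def rows_for_target(rows: Iterable[Row], target: str) -> Iterator[Row]:
    """Yield only the rows whose target language matches `target`."""
    for row in rows:
        if split_lang(row["lang"])[1] == target:
            yield row


examples = [
    # Row taken from the Data Instances excerpt
    Row(id=2, lang="en-bg",
        source_text="The international humanitarian law and its principles are often not respected.",
        target_text="Международното хуманитарно право и неговите принципи често не се зачитат."),
    # Invented placeholder row, not taken from the corpus
    Row(id=1, lang="en-fr", source_text="Introduction", target_text="Introduction"),
]

bulgarian = list(rows_for_target(examples, "bg"))
print(len(bulgarian))  # 1
```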
Data Splits
| Lang  | # Docs  | Avg. # Source Tokens | Avg. # Target Tokens |
|-------|---------|----------------------|----------------------|
| bg    | 13 149  | 23                   | 24                   |
| cs    | 13 160  | 23                   | 21                   |
| da    | 13 242  | 23                   | 22                   |
| de    | 13 291  | 23                   | 22                   |
| el    | 13 091  | 23                   | 26                   |
| es    | 13 195  | 23                   | 28                   |
| et    | 13 016  | 23                   | 17                   |
| fi    | 12 942  | 23                   | 16                   |
| fr    | 13 149  | 23                   | 28                   |
| ga    | 412     | 12                   | 12                   |
| hr    | 12 836  | 23                   | 21                   |
| hu    | 13 025  | 23                   | 21                   |
| it    | 13 059  | 23                   | 25                   |
| lt    | 12 580  | 23                   | 18                   |
| lv    | 13 044  | 23                   | 19                   |
| mt    | 3 093   | 16                   | 14                   |
| nl    | 13 191  | 23                   | 25                   |
| pl    | 12 761  | 23                   | 22                   |
| pt    | 13 148  | 23                   | 26                   |
| ro    | 13 163  | 23                   | 25                   |
| sk    | 12 926  | 23                   | 20                   |
| sl    | 13 208  | 23                   | 21                   |
| sv    | 13 099  | 23                   | 21                   |
| Total | 277 780 | 22.21                | 21.47                |
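The Total row can be reproduced from the per-language figures. The sketch below copies the numbers from the table and recomputes them. It suggests that the two token averages in the Total row are unweighted means over the 23 languages, truncated to two decimals, rather than means weighted by document count. That reading is an inference from the arithmetic, not something the dataset card states.

```python
import math

# (language, documents, avg. source tokens, avg. target tokens), copied from the table
STATS = [
    ("bg", 13149, 23, 24), ("cs", 13160, 23, 21), ("da", 13242, 23, 22),
    ("de", 13291, 23, 22), ("el", 13091, 23, 26), ("es", 13195, 23, 28),
    ("et", 13016, 23, 17), ("fi", 12942, 23, 16), ("fr", 13149, 23, 28),
    ("ga", 412, 12, 12), ("hr", 12836, 23, 21), ("hu", 13025, 23, 21),
    ("it", 13059, 23, 25), ("lt", 12580, 23, 18), ("lv", 13044, 23, 19),
    ("mt", 3093, 16, 14), ("nl", 13191, 23, 25), ("pl", 12761, 23, 22),
    ("pt", 13148, 23, 26), ("ro", 13163, 23, 25), ("sk", 12926, 23, 20),
    ("sl", 13208, 23, 21), ("sv", 13099, 23, 21),
]


def truncate2(value):
    """Truncate (not round) a positive number to two decimals."""
    return math.floor(value * 100) / 100


total_docs = sum(docs for _, docs, _, _ in STATS)
mean_source = sum(src for _, _, src, _ in STATS) / len(STATS)
mean_target = sum(tgt for _, _, _, tgt in STATS) / len(STATS)

print(total_docs)              # 277780
print(truncate2(mean_source))  # 22.21
print(truncate2(mean_target))  # 21.47
```

Irish (ga) and Maltese (mt) have far fewer documents than the other languages, so an unweighted mean gives them the same influence on the averages as the large language pairs.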
Dataset Creation
Curation Rationale
For details, check the corresponding pages.
Source Data
Initial Data Collection and Normalization
The acquisition of bilingual data (from multilingual websites), normalization, cleaning, deduplication and identification of parallel documents were performed with the ILSP-FC tool. The Maligna aligner was used to align segments. Segment pairs were then merged and filtered.
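To make the normalization, deduplication and filtering steps concrete, here is a generic sketch of how aligned segment pairs are commonly cleaned. It is not the ILSP-FC or Maligna implementation. The `normalize` and `clean_pairs` names, the case-insensitive duplicate key and the `max_ratio` length threshold are all assumptions chosen for illustration.

```python
import unicodedata
from typing import Iterable, List, Tuple

Pair = Tuple[str, str]


def normalize(text: str) -> str:
    """Apply Unicode NFC normalization and collapse runs of whitespace."""
    return " ".join(unicodedata.normalize("NFC", text).split())


def clean_pairs(pairs: Iterable[Pair], max_ratio: float = 3.0) -> List[Pair]:
    """Normalize pairs, then drop empty, badly length-mismatched and duplicate ones."""
    seen = set()
    kept = []
    for source, target in pairs:
        source, target = normalize(source), normalize(target)
        if not source or not target:
            continue  # one side is empty
        n_source, n_target = len(source.split()), len(target.split())
        if max(n_source, n_target) / min(n_source, n_target) > max_ratio:
            continue  # token counts differ too much (assumed threshold)
        key = (source.casefold(), target.casefold())
        if key in seen:
            continue  # duplicate after normalization
        seen.add(key)
        kept.append((source, target))
    return kept


pairs = [
    ("At policy  level, progress was made.", "На равнище политики напредък е постигнат."),
    ("At policy level, progress was made.", "На равнище политики напредък е постигнат."),
    ("", "Въведение"),
    ("one two three four five six seven eight", "едно"),
]
print(len(clean_pairs(pairs)))  # 1
```

Real pipelines typically add language identification and alignment-score thresholds on top of checks like these.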
Who are the source language producers?
All data in this corpus was uploaded to ELRC-Share by Vassilis Papavassiliou.
Personal and Sensitive Information
The corpus is free of personal or sensitive information.
Considerations for Using the Data
Other Known Limitations
The nature of the task introduces variability in the quality of the target translations.
Additional Information
Dataset Curators
ELRC-Medical-V2: Labrak Yanis, Dufour Richard
Bilingual corpus from the Publications Office of the EU on the medical domain v.2 (EN-XX) Corpus: Vassilis Papavassiliou and others.
Licensing Information

This work is licensed under an Attribution 4.0 International (CC BY 4.0) License.
Citation Information
Please cite the following paper when using this dataset.
```latex
@inproceedings{losch-etal-2018-european,
    title = "European Language Resource Coordination: Collecting Language Resources for Public Sector Multilingual Information Management",
    author = {
        L{\"o}sch, Andrea and
        Mapelli, Valérie and
        Piperidis, Stelios and
        Vasi{\c{l}}jevs, Andrejs and
        Smal, Lilli and
        Declerck, Thierry and
        Schnur, Eileen and
        Choukri, Khalid and
        van Genabith, Josef
    },
    booktitle = "Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)",
    month = may,
    year = "2018",
    address = "Miyazaki, Japan",
    publisher = "European Language Resources Association (ELRA)",
    url = "https://aclanthology.org/L18-1213",
}
```
Owner
- Name: Labrak Yanis
- Login: qanastek
- Kind: user
- Location: Avignon, France
- Company: Laboratoire Informatique d'Avignon
- Website: linkedin.com/in/yanis-labrak-8a7412145/
- Twitter: LabrakYanis
- Repositories: 8
- Profile: https://github.com/qanastek
👨🏻🎓 PhD. student in Computer Science (CS), Avignon University 🇫🇷 🏛 Research Scientist - Machine Learning in Healthcare
Citation (CITATION.cff)
cff-version: 1.0.0
message: "European Language Resource Coordination: Collecting Language Resources for Public Sector Multilingual Information Management"
authors:
  - family-names: "Lösch"
    given-names: "Andrea"
  - family-names: "Mapelli"
    given-names: "Valérie"
  - family-names: "Piperidis"
    given-names: "Stelios"
  - family-names: "Vasiļjevs"
    given-names: "Andrejs"
  - family-names: "Smal"
    given-names: "Lilli"
  - family-names: "Declerck"
    given-names: "Thierry"
  - family-names: "Schnur"
    given-names: "Eileen"
  - family-names: "Choukri"
    given-names: "Khalid"
  - family-names: "van Genabith"
    given-names: "Josef"
title: "European Language Resource Coordination: Collecting Language Resources for Public Sector Multilingual Information Management"
version: 2.0.0
date-released: 2018-01-01
url: "https://aclanthology.org/L18-1213/"
Issues and Pull Requests
Last synced: 12 months ago
All Time
- Total issues: 0
- Total pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Total issue authors: 0
- Total pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 0
- Pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Issue authors: 0
- Pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0