elrc-medical-v2

ELRC-Medical-V2 : European parallel corpus for healthcare machine translation

https://github.com/qanastek/elrc-medical-v2

Repository

Basic Info

Host: GitHub
Owner: qanastek
License: cc-by-4.0
Language: Python
Default Branch: main
Size: 27.6 MB

Statistics

Stars: 1
Watchers: 1
Forks: 1
Open Issues: 0
Releases: 0

Created over 4 years ago · Last pushed almost 3 years ago

annotations_creators:
- machine-generated
- expert-generated
language_creators:
- found
languages:
- en
licenses:
- cc-by-4-0
multilinguality:
- bg
- cs
- da
- de
- el
- es
- et
- fi
- fr
- ga
- hr
- hu
- it
- lt
- lv
- mt
- nl
- pl
- pt
- ro
- sk
- sl
- sv
pretty_name: ELRC-Medical-V2
size_categories:
- 100K<n<1M
source_datasets:
- extended
task_categories:
- translation
task_ids:
- translation

ELRC-Medical-V2 : European parallel corpus for healthcare machine translation

Dataset Card for [Needs More Information]

Dataset Description

Homepage: https://qanastek.github.io/ELRC-Medical-V2/
Repository: https://github.com/qanastek/ELRC-Medical-V2/
Paper: [Needs More Information]
Leaderboard: [Needs More Information]
Point of Contact: yanis.labrak@univ-avignon.fr

Dataset Summary

ELRC-Medical-V2 is a parallel corpus for neural machine translation funded by the European Commission and coordinated by the German Research Center for Artificial Intelligence.

Supported Tasks and Leaderboards

translation: The dataset can be used to train a model for translation.

Languages

The corpus consists of source and target sentence pairs covering 23 languages of the European Union (EU). English (EN) is the source language in every pair.

List of languages: Bulgarian (bg), Czech (cs), Danish (da), German (de), Greek (el), Spanish (es), Estonian (et), Finnish (fi), French (fr), Irish (ga), Croatian (hr), Hungarian (hu), Italian (it), Lithuanian (lt), Latvian (lv), Maltese (mt), Dutch (nl), Polish (pl), Portuguese (pt), Romanian (ro), Slovak (sk), Slovenian (sl), Swedish (sv).

Load the dataset with HuggingFace

    from datasets import load_dataset

    dataset = load_dataset("qanastek/ELRC-Medical-V2")
    print(dataset)

Dataset Structure

Data Instances

    id,lang,source_text,target_text
    1,en-bg,"TOC \o ""1-3"" \h \z \u Introduction 3","TOC \o ""1-3"" \h \z \u Въведение 3"
    2,en-bg,The international humanitarian law and its principles are often not respected.,Международното хуманитарно право и неговите принципи често не се зачитат.
    3,en-bg,"At policy level, progress was made on several important initiatives.",На равнище политики напредък е постигнат по няколко важни инициативи.

Data Fields

id : The document identifier of type Integer.

lang : The pair of source and target language of type String.

source_text : The source text of type String.

target_text : The target text of type String.
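As a minimal sketch of how these four fields map onto a typed record, the snippet below parses rows in the CSV layout shown under Data Instances with Python's standard `csv` module. The `Pair` class and the `parse_rows` helper are illustrative names, not part of the dataset or of any library; the sample rows are copied from the example above.

```python
import csv
import io
from dataclasses import dataclass
from typing import List


@dataclass
class Pair:
    id: int            # document identifier
    lang: str          # language pair, e.g. "en-bg"
    source_text: str   # English source segment
    target_text: str   # target-language segment


def parse_rows(csv_text: str) -> List[Pair]:
    """Parse CSV text whose header is id,lang,source_text,target_text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        Pair(int(r["id"]), r["lang"], r["source_text"], r["target_text"])
        for r in reader
    ]


# Two rows from the Data Instances example; row 3 is quoted because it contains a comma.
SAMPLE = (
    "id,lang,source_text,target_text\n"
    "2,en-bg,The international humanitarian law and its principles are often not respected.,"
    "Международното хуманитарно право и неговите принципи често не се зачитат.\n"
    '3,en-bg,"At policy level, progress was made on several important initiatives.",'
    "На равнище политики напредък е постигнат по няколко важни инициативи.\n"
)

pairs = parse_rows(SAMPLE)
for p in pairs:
    # The lang field joins the source and target codes with a hyphen.
    source_lang, target_lang = p.lang.split("-")
    print(p.id, source_lang, target_lang, len(p.source_text.split()))
```

Using `csv.DictReader` rather than splitting on commas matters here, because source segments can themselves contain commas and doubled quotes.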

Data Splits

| Lang   | # Docs    | Avg. # Source Tokens   | Avg. # Target Tokens   |
|--------|-----------|------------------------|------------------------|
| bg     | 13 149    | 23                     | 24                     |
| cs     | 13 160    | 23                     | 21                     |
| da     | 13 242    | 23                     | 22                     |
| de     | 13 291    | 23                     | 22                     |
| el     | 13 091    | 23                     | 26                     |
| es     | 13 195    | 23                     | 28                     |
| et     | 13 016    | 23                     | 17                     |
| fi     | 12 942    | 23                     | 16                     |
| fr     | 13 149    | 23                     | 28                     |
| ga     | 412       | 12                     | 12                     |
| hr     | 12 836    | 23                     | 21                     |
| hu     | 13 025    | 23                     | 21                     |
| it     | 13 059    | 23                     | 25                     |
| lt     | 12 580    | 23                     | 18                     |
| lv     | 13 044    | 23                     | 19                     |
| mt     | 3 093     | 16                     | 14                     |
| nl     | 13 191    | 23                     | 25                     |
| pl     | 12 761    | 23                     | 22                     |
| pt     | 13 148    | 23                     | 26                     |
| ro     | 13 163    | 23                     | 25                     |
| sk     | 12 926    | 23                     | 20                     |
| sl     | 13 208    | 23                     | 21                     |
| sv     | 13 099    | 23                     | 21                     |
| Total  | 277 780   | 22.21                  | 21.47                  |
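The Total row can be checked against the per-language figures. The short sketch below copies the values from the table into a dictionary (the `STATS` name is only illustrative) and recomputes them. The results are consistent with the document total being a plain sum, and with the two token averages being unweighted means over the 23 languages, truncated to two decimals.

```python
# (documents, avg. source tokens, avg. target tokens) per target language,
# copied from the table above.
STATS = {
    "bg": (13149, 23, 24), "cs": (13160, 23, 21), "da": (13242, 23, 22),
    "de": (13291, 23, 22), "el": (13091, 23, 26), "es": (13195, 23, 28),
    "et": (13016, 23, 17), "fi": (12942, 23, 16), "fr": (13149, 23, 28),
    "ga": (412, 12, 12),   "hr": (12836, 23, 21), "hu": (13025, 23, 21),
    "it": (13059, 23, 25), "lt": (12580, 23, 18), "lv": (13044, 23, 19),
    "mt": (3093, 16, 14),  "nl": (13191, 23, 25), "pl": (12761, 23, 22),
    "pt": (13148, 23, 26), "ro": (13163, 23, 25), "sk": (12926, 23, 20),
    "sl": (13208, 23, 21), "sv": (13099, 23, 21),
}

total_docs = sum(docs for docs, _, _ in STATS.values())
# Unweighted means: each language counts once, regardless of its size.
mean_src = sum(src for _, src, _ in STATS.values()) / len(STATS)
mean_tgt = sum(tgt for _, _, tgt in STATS.values()) / len(STATS)

print(total_docs)          # 277780
print(f"{mean_src:.3f}")   # 22.217
print(f"{mean_tgt:.3f}")   # 21.478
```

Because the means are unweighted, the two small subsets (ga and mt) pull the averages down more than their document counts alone would.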

Dataset Creation

Curation Rationale

For details, check the corresponding pages.

Source Data

Initial Data Collection and Normalization

The acquisition of bilingual data (from multilingual websites), normalization, cleaning, deduplication and identification of parallel documents were performed with the ILSP-FC tool. The Maligna aligner was used to align segments. Segment pairs were then merged and filtered.
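The internals of ILSP-FC and Maligna are not described in this card, so the following is only an illustrative sketch of what deduplicating and filtering aligned segment pairs can look like. The `filter_pairs` name, the whitespace normalization, and the length-ratio threshold are all assumptions made for the example, not the actual behaviour of those tools.

```python
from typing import Iterable, List, Tuple


def filter_pairs(
    pairs: Iterable[Tuple[str, str]], max_ratio: float = 3.0
) -> List[Tuple[str, str]]:
    """Drop empty, badly length-mismatched, and duplicate segment pairs.

    max_ratio is a hypothetical threshold on the token-count ratio
    between the longer and the shorter side of a pair.
    """
    seen = set()
    kept = []
    for source, target in pairs:
        # Normalize: collapse runs of whitespace and trim the ends.
        source = " ".join(source.split())
        target = " ".join(target.split())
        if not source or not target:
            continue  # one side is empty
        n_src, n_tgt = len(source.split()), len(target.split())
        if max(n_src, n_tgt) / min(n_src, n_tgt) > max_ratio:
            continue  # lengths differ too much to be a plausible translation
        key = (source.lower(), target.lower())
        if key in seen:
            continue  # case-insensitive duplicate
        seen.add(key)
        kept.append((source, target))
    return kept


raw = [
    ("Introduction", "Въведение"),
    ("Introduction ", "Въведение"),   # duplicate after normalization
    ("", "x"),                        # empty source
    ("a b c d e f g h", "x"),         # length ratio 8 exceeds the threshold
]
clean = filter_pairs(raw)
print(clean)
```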

Who are the source language producers?

All data in this corpus was uploaded to ELRC-Share by Vassilis Papavassiliou.

Personal and Sensitive Information

The corpus is free of personal or sensitive information.

Considerations for Using the Data

Other Known Limitations

The nature of the task introduces variability in the quality of the target translations.

Additional Information

Dataset Curators

ELRC-Medical-V2: Labrak Yanis, Dufour Richard

Bilingual corpus from the Publications Office of the EU on the medical domain v.2 (EN-XX) Corpus: Vassilis Papavassiliou and others.

Licensing Information

This work is licensed under an Attribution 4.0 International (CC BY 4.0) License.

Citation Information

Please cite the following paper when using this dataset.

    @inproceedings{losch-etal-2018-european,
        title = "European Language Resource Coordination: Collecting Language Resources for Public Sector Multilingual Information Management",
        author = {
            L{\"o}sch, Andrea and
            Mapelli, Valérie and
            Piperidis, Stelios and
            Vasiljevs, Andrejs and
            Smal, Lilli and
            Declerck, Thierry and
            Schnur, Eileen and
            Choukri, Khalid and
            van Genabith, Josef
        },
        booktitle = "Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)",
        month = may,
        year = "2018",
        address = "Miyazaki, Japan",
        publisher = "European Language Resources Association (ELRA)",
        url = "https://aclanthology.org/L18-1213",
    }

Owner

Name: Labrak Yanis
Login: qanastek
Kind: user
Location: Avignon, France
Company: Laboratoire Informatique d'Avignon

Website: linkedin.com/in/yanis-labrak-8a7412145/
Twitter: LabrakYanis
Repositories: 8
Profile: https://github.com/qanastek

👨🏻‍🎓 PhD. student in Computer Science (CS), Avignon University 🇫🇷 🏛 Research Scientist - Machine Learning in Healthcare

Citation (CITATION.cff)

cff-version: 1.0.0
message: "European Language Resource Coordination: Collecting Language Resources for Public Sector Multilingual Information Management"
authors:
- family-names: "Lösch"
  given-names: "Andrea"
- family-names: "Mapelli"
  given-names: "Valérie"
- family-names: "Piperidis"
  given-names: "Stelios"
- family-names: "Vasiļjevs"
  given-names: "Andrejs"
- family-names: "Smal"
  given-names: "Lilli"
- family-names: "Declerck"
  given-names: "Thierry"
- family-names: "Schnur"
  given-names: "Eileen"
- family-names: "Choukri"
  given-names: "Khalid"
- family-names: "van Genabith"
  given-names: "Josef"
title: "European Language Resource Coordination: Collecting Language Resources for Public Sector Multilingual Information Management"
version: 2.0.0
date-released: 2018-01-01
url: "https://aclanthology.org/L18-1213/"

Issues and Pull Requests

Last synced: over 1 year ago

All Time

Total issues: 0
Total pull requests: 0
Average time to close issues: N/A
Average time to close pull requests: N/A
Total issue authors: 0
Total pull request authors: 0
Average comments per issue: 0
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 0
Pull requests: 0
Average time to close issues: N/A
Average time to close pull requests: N/A
Issue authors: 0
Pull request authors: 0
Average comments per issue: 0
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0
