lecontra

https://github.com/bramvanroy/lecontra

Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

✓ CITATION.cff file: Found CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
✓ .zenodo.json file: Found .zenodo.json file
○ DOI references
○ Academic publication links
○ Committers with academic emails
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (8.1%) to scientific vocabulary

Last synced: 6 months ago

Repository

Basic Info

Host: GitHub
Owner: BramVanroy
License: cc-by-4.0
Language: Smalltalk
Default Branch: main
Size: 7.86 MB

Statistics

Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Releases: 0

Created almost 4 years ago · Last pushed almost 3 years ago

Metadata Files

Readme, License, Citation

LeCoNTra

We present LeCoNTra, a learner corpus consisting of English-to-Dutch news translations enriched with translation process data. Three students of a Master’s programme in Translation were asked to translate 50 different English journalistic texts of approximately 250 tokens each. Because we also collected translation process data in the form of keystroke logging, our dataset can be used as part of different research strands such as translation process research, learner corpus research, and corpus-based translation studies. Reference translations, without process data, are also included. The data has been manually segmented and tokenized, and manually aligned at both segment and word level, leading to a high-quality corpus with token-level process data. The data is freely accessible via the Translation Process Research DataBase and GitHub, which emphasises our commitment to distributing our dataset. The tool that was built for manual sentence segmentation and tokenization, Mantis, is also available as an open-source aid for data processing.

Metadata

Metadata was collected with TICQ. Identifying columns such as IP addresses and initials have been removed. The data contains a participant P04 who does not appear in the metadata. As the paper describes, P04 denotes the reference translations, for which there is no process data (only a final translation and product-related metrics). These translations were probably not all made by the same person, so be careful when using P04 in analyses: it is a collection of reference translations, likely by different translators.
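Because P04 has no process data, it can help to keep the reference translations apart from the learner sessions before running any process-based analysis. The following is a minimal sketch, not part of the repository. It assumes that each session name starts with the participant ID followed by an underscore (for example `P01_T1`); that naming pattern and the function names are assumptions made for illustration.

```python
# Participant ID used for the reference translations (from the README).
REFERENCE_PARTICIPANT = "P04"


def is_reference_session(session_name):
    """Return True if the session belongs to the reference translations.

    Assumes names shaped like '<participant>_<text>', e.g. 'P04_T1'.
    """
    return session_name.split("_", 1)[0] == REFERENCE_PARTICIPANT


def split_sessions(session_names):
    """Split session names into (learner_sessions, reference_sessions)."""
    learners = [n for n in session_names if not is_reference_session(n)]
    references = [n for n in session_names if is_reference_session(n)]
    return learners, references
```

Calling `split_sessions` on a list of session names returns the learner sessions first, so process metrics can be computed on that list only.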

Citation

Vanroy, B. and Macken, L. (2022). LeConTra: A Learner Corpus of English-to-Dutch News Translation. In Proceedings of the Language Resources and Evaluation Conference, pages 1807-1816, Marseille, France. European Language Resources Association.

```bibtex
@InProceedings{vanroy-macken:2022:LREC,
  author    = {Vanroy, Bram and Macken, Lieve},
  title     = {{LeConTra}: A Learner Corpus of English-to-Dutch News Translation},
  booktitle = {Proceedings of the Language Resources and Evaluation Conference},
  month     = {June},
  year      = {2022},
  address   = {Marseille, France},
  publisher = {European Language Resources Association},
  pages     = {1807--1816},
  abstract  = {We present {LeConTra}, a learner corpus consisting of English-to-Dutch news translations enriched with translation process data. Three students of a Master's programme in Translation were asked to translate 50 different English journalistic texts of approximately 250 tokens each. Because we also collected translation process data in the form of keystroke logging, our dataset can be used as part of different research strands such as translation process research, learner corpus research, and corpus-based translation studies. Reference translations, without process data, are also included. The data has been manually segmented and tokenized, and manually aligned at both segment and word level, leading to a high-quality corpus with token-level process data. The data is freely accessible via the Translation Process Research DataBase, which emphasises our commitment of distributing our dataset. The tool that was built for manual sentence segmentation and tokenization, Mantis, is also available as an open-source aid for data processing.},
  url       = {https://aclanthology.org/2022.lrec-1.192}
}
```

Erratum

In the paper, the number-of-tokens column in the Appendix is incorrect. The cause is not immediately clear, but it may be related to changes in the TPR-DB table generation and/or to white-space and newline handling. To get the correct number of tokens per text, count the rows in the *.st tables (source tokens) in this repository: they contain one row per token.
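The row count described above could be computed along these lines. This is a minimal sketch rather than a script shipped with the repository. It assumes that the *.st tables are tab-separated text files with a single header line; the function names and the directory argument are hypothetical.

```python
import csv
from pathlib import Path


def count_source_tokens(st_path):
    """Count the data rows (one per source token) in a single *.st table."""
    with open(st_path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        next(reader, None)  # assumption: the first line is a header row
        # Blank lines come back as empty lists and are not counted.
        return sum(1 for row in reader if row)


def tokens_per_text(table_dir):
    """Map each *.st file under table_dir (searched recursively) to its token count."""
    root = Path(table_dir)
    return {
        path.relative_to(root).as_posix(): count_source_tokens(path)
        for path in sorted(root.rglob("*.st"))
    }
```

If the tables turn out to have no header line, drop the `next(reader, None)` call; otherwise every count will be one too low.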

Owner

Name: Bram Vanroy
Login: BramVanroy
Kind: user
Location: Belgium
Company: @CCL-KULeuven @instituutnederlandsetaal

Website: https://bramvanroy.github.io/
Repositories: 29
Profile: https://github.com/BramVanroy

👋 My name is Bram and I work on natural language processing and machine translation (evaluation) but I also spend a lot of time in this open-source world 🌍

Citation (CITATION)

@InProceedings{vanroy-macken:2022:LREC,
  author    = {Vanroy, Bram and Macken, Lieve},
  title     = {LeConTra: A Learner Corpus of English-to-Dutch News Translation},
  booktitle = {Proceedings of the Language Resources and Evaluation Conference},
  month     = {June},
  year      = {2022},
  address   = {Marseille, France},
  publisher = {European Language Resources Association},
  pages     = {1807--1816},
  abstract  = {We present LeConTra, a learner corpus consisting of English-to-Dutch news translations enriched with translation process data. Three students of a Master’s programme in Translation were asked to translate 50 different English journalistic texts of approximately 250 tokens each. Because we also collected translation process data in the form of keystroke logging, our dataset can be used as part of different research strands such as translation process research, learner corpus research, and corpus-based translation studies. Reference translations, without process data, are also included. The data has been manually segmented and tokenized, and manually aligned at both segment and word level, leading to a high-quality corpus with token-level process data. The data is freely accessible via the Translation Process Research DataBase, which emphasises our commitment of distributing our dataset. The tool that was built for manual sentence segmentation and tokenization, Mantis, is also available as an open-source aid for data processing.},
  url       = {https://aclanthology.org/2022.lrec-1.192}
}

Committers

Last synced: 7 months ago

All Time

Total Commits: 13
Total Committers: 1
Avg Commits per committer: 13.0
Development Distribution Score (DDS): 0.0

Past Year

Commits: 0
Committers: 0
Avg Commits per committer: 0.0
Development Distribution Score (DDS): 0.0

Top Committers

Name	Email	Commits
Bram Vanroy	B**y@U**e	13

Committer Domains (Top 20 + Academic)

ugent.be: 1

Issues and Pull Requests

Last synced: 7 months ago

All Time

Total issues: 0
Total pull requests: 1
Average time to close issues: N/A
Average time to close pull requests: less than a minute
Total issue authors: 0
Total pull request authors: 1
Average comments per issue: 0
Average comments per pull request: 0.0
Merged pull requests: 1
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 0
Pull requests: 0
Average time to close issues: N/A
Average time to close pull requests: N/A
Issue authors: 0
Pull request authors: 0
Average comments per issue: 0
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0
