colibrita
Colibrita is a proof-of-concept translation assistance system, translating L1 fragments in an L2 context, using machine learning and statistical machine translation techniques
Science Score: 26.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file (found)
- ✓ .zenodo.json file (found)
- ○ DOI references
- ○ Academic publication links
- ○ Committers with academic emails
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (11.8%) to scientific vocabulary
Repository
Statistics
- Stars: 1
- Watchers: 2
- Forks: 0
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
Colibrita: Translation Assistance System
Colibrita is a proof-of-concept translation assistance system that can translate L1 fragments in an L2 context.
The system was designed ahead of a new task (presented at SemEval 2014) concerning the translation of L1 fragments, i.e. words or phrases, in an L2 context. This type of translation can be applied in writing assistance systems for language learners in which users write in their target language, but are allowed to occasionally back off to their native L1 when they are uncertain of the proper word or expression in L2. These L1 fragments are subsequently translated into L2 fragments, taking the surrounding L2 context into account.
Colibrita was developed to test whether L2 context information aids in the translation of L1 fragments. The results were published at ACL 2014: Maarten van Gompel, Antal van den Bosch. Translation Assistance by Translation of L1 Fragments in an L2 Context. Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 871-880, 2014. http://www.aclweb.org/anthology/P14-1082
Installation
Colibrita is written in Python 3. It is a complex system with quite a number of dependencies.
First make sure you have a modern Linux distribution with the necessary prerequisites: python3, python3-dev, python3-setuptools, python3-lxml, cython3, gcc, g++, autoconf, automake, autoconf-archive, libtool, libboost-dev, libboost-python
If you intend to build your own training models, then you will also require the following two dependencies:
- Moses - https://github.com/moses-smt/mosesdecoder
- GIZA++ - http://code.google.com/p/giza-pp/
Other Unix systems, including FreeBSD and Mac OS X, will most likely work too, but especially for the latter, considerable extra effort may be required to install everything. The instructions here have been tailored for Debian/Ubuntu-based Linux distributions.
In addition to the above dependencies, Colibrita depends on pynlpl, colibri-core, Timbl and python-timbl.
Install PyNLPl from the Python Package Index (or alternatively from https://github.com/proycon/pynlpl):
$ sudo easy_install3 pynlpl
Download colibri-core from https://github.com/proycon/colibri-core and install as follows:
$ bash bootstrap
$ ./configure
$ make
$ sudo make install
$ sudo python3 ./setup.py install
Install Timbl; it may be in your package manager if you use Debian/Ubuntu:
$ sudo apt-get install timbl
Otherwise obtain it from http://ilk.uvt.nl/timbl and compile manually:
$ ./configure
$ make
$ make install
Install Python-Timbl from the Python Package Index (or alternatively from https://github.com/proycon/python-timbl):
$ sudo easy_install3 python-timbl
Then install colibrita from https://github.com/proycon/colibrita:
$ sudo python3 ./setup.py install
Note: If you want to reproduce the results of our ACL paper, then make sure to do
git checkout v0.2.1 in the Colibrita repository prior to installation. Colibrita may have advanced
since then.
Last, if you want to evaluate according to well-known MT metrics such as BLEU, METEOR, NIST, TER, WER, and PER, you should download and unpack http://lst.science.ru.nl/~proycon/mtevalscripts.tar.gz
Usage
The following tools are available:
- colibrita - This is the main system; it is used for training and testing.
- colibrita-evaluate - Tool for evaluation of system output. Point --mtevaldir to the directory where you unpacked mtevalscripts.tar.gz if you want common MT metrics in your report.
- colibrita-setgen - Tool for generating training and test sets from parallel corpus data, GIZA++ word alignments and a Moses phrase table.
- colibrita-datastats - Reports some statistics on a dataset (train or test, XML).
- colibrita-manualsetbuild - Small interactive console-based script for creating datasets manually.
All tools take -h for help on usage options.
Set generation
Building a model starts with generating a training set from a parallel corpus. Ensure you have two plain-text files, one in the source language and one in the target language, with one sentence per line, such that lines with the same line number in the two files are translations of each other. In this documentation we will use two files from our ACL 2014 experiments, obtainable from http://lst.science.ru.nl/~proycon/colibrita-acl2014-data.zip :
* europarl200k-train.nl.txt
* europarl200k-train.en.txt
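Before starting the lengthy set generation, it can be worth checking that the two files really are line-aligned. The following is a minimal sketch; `check_line_aligned` is a hypothetical helper written for this documentation and is not part of Colibrita. It only verifies that both files have the same number of lines.

```python
from itertools import zip_longest


def check_line_aligned(source_path, target_path):
    """Return the number of sentence pairs, or raise ValueError on a mismatch.

    Hypothetical helper (not part of Colibrita): it only checks that the
    source and target files contain the same number of lines.
    """
    pairs = 0
    with open(source_path, encoding="utf-8") as src, \
            open(target_path, encoding="utf-8") as tgt:
        # zip_longest pads the shorter file with None, which exposes a mismatch
        for lineno, (s, t) in enumerate(zip_longest(src, tgt), start=1):
            if s is None or t is None:
                raise ValueError(f"files differ in length at line {lineno}")
            pairs += 1
    return pairs
```

For the files above, this would be called as `check_line_aligned("europarl200k-train.nl.txt", "europarl200k-train.en.txt")`.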
Given this input data, you can use Colibrita's setgen tool:
$ colibrita-setgen --train --mosesdir=/path/to/mosesdecoder -S nl -T en \
-s europarl200k-train.nl.txt -t europarl200k-train.en.txt --bindir=/usr/local/bin \
-o europarl200k
This tool will invoke Moses (which will in turn invoke GIZA++) and the
Colibri-Core patternmodeller. It builds word alignments, a phrase-translation
table and pattern models, and eventually produces an XML file. This process may
take a very long time and demands considerable memory. The output prefix
-o will be used in many of the output files. The parameters
--joinedprobabilitythreshold and --divergencefrombestthreshold can be
used to prevent weaker alignments and alternatives from making it into the set,
and correspond to the parameters λ1 and λ2 in our ACL 2014 paper.
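As a rough illustration of how two such thresholds might prune translation options, consider the sketch below. It rests on assumptions that are not confirmed by this README: that each candidate translation carries the two conditional phrase-translation probabilities, that the joined probability is their product, and that divergence from the best is measured as a ratio to the best-scoring candidate. The function name and data layout are hypothetical; the ACL 2014 paper and the colibrita-setgen source are authoritative for the actual definitions.

```python
def filter_candidates(candidates, joined_threshold, divergence_threshold):
    """Prune translation candidates for one source fragment.

    candidates: dict mapping a target fragment to a pair of probabilities
    (p_target_given_source, p_source_given_target). ASSUMPTION: the joined
    probability is the product of the two, and a candidate is kept only if
    its score is at least `divergence_threshold` times the best score.
    """
    scored = {target: p1 * p2 for target, (p1, p2) in candidates.items()}
    if not scored:
        return {}
    best = max(scored.values())
    return {
        target: score
        for target, score in scored.items()
        if score >= joined_threshold and score >= divergence_threshold * best
    }
```

Under these assumptions, raising either threshold shrinks the set of alternatives that end up in the generated XML file.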
A test set can be generated in the same fashion:
$ colibrita-setgen --test --mosesdir=/path/to/mosesdecoder -S nl -T en \
-s europarl200k-test.nl.txt -t europarl200k-test.en.txt --bindir=/usr/local/bin \
-o europarl200k
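Since the training and test invocations differ only in the mode flag and the input files, both can be produced by one small wrapper. The sketch below is a convenience for scripting, not part of Colibrita: `setgen_command` and `run_setgen` are hypothetical names, and only the flags shown in the commands above are used.

```python
import subprocess


def setgen_command(mode, source_file, target_file, prefix,
                   mosesdir="/path/to/mosesdecoder", bindir="/usr/local/bin",
                   source_lang="nl", target_lang="en"):
    """Build the argument list for colibrita-setgen (hypothetical wrapper)."""
    if mode not in ("train", "test"):
        raise ValueError("mode must be 'train' or 'test'")
    return [
        "colibrita-setgen", f"--{mode}", f"--mosesdir={mosesdir}",
        "-S", source_lang, "-T", target_lang,
        "-s", source_file, "-t", target_file,
        f"--bindir={bindir}", "-o", prefix,
    ]


def run_setgen(*args, **kwargs):
    """Run colibrita-setgen and raise CalledProcessError if it fails."""
    subprocess.run(setgen_command(*args, **kwargs), check=True)
```

Calling `run_setgen("test", "europarl200k-test.nl.txt", "europarl200k-test.en.txt", "europarl200k")` would then be equivalent to the test-set command above, provided colibrita-setgen is on the PATH.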
Training
The next step is feature extraction and classifier training:
$ colibrita --train -f europarl200k.train.xml -l 1 -r 1 \
-o exp-l1r1 --Tclones 4 --trainfortest europarl200k.test.xml
The output will consist of a whole bunch of classifiers (ibase files) in the
directory specified with -o.
Some notes about this example:
- -f specifies the training set, generated by colibrita-setgen in the previous step.
- -l 1 sets a left context size of one.
- -r 1 sets a right context size of one.
- -o specifies a new output prefix, used in generated files; a directory with this name will be created, containing all classifiers and intermediate files.
- --Tclones 4 runs Timbl on four cores.
- --trainfortest generates only those classifiers that will be used in testing, saving time and resources. This implies that the model has to be retrained if other test data is offered, and that it can never be used in a live setting.
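To make the left and right context sizes concrete, the sketch below shows one way a context window around an L1 fragment could be extracted from a tokenised L2 sentence. It is an illustration only: `context_features` is a hypothetical function, and the padding markers `<s>` and `</s>` are assumptions, not necessarily what Colibrita's feature extraction uses.

```python
def context_features(tokens, start, end, left=1, right=1,
                     bos="<s>", eos="</s>"):
    """Return (left context, fragment, right context) for tokens[start:end].

    Hypothetical illustration: positions outside the sentence are padded
    with the (assumed) boundary markers `bos` and `eos`.
    """
    left_ctx = [tokens[i] if i >= 0 else bos
                for i in range(start - left, start)]
    right_ctx = [tokens[i] if i < len(tokens) else eos
                 for i in range(end, end + right)]
    return left_ctx, tokens[start:end], right_ctx


# An English (L2) sentence with a Dutch (L1) fragment at positions 3-4.
sentence = ["I", "want", "a", "kopje", "koffie", "please"]
print(context_features(sentence, 3, 5, left=1, right=1))
# (['a'], ['kopje', 'koffie'], ['please'])
```

With -l 1 -r 1, a classifier would in this picture see one L2 word on each side of the fragment as its features.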
Testing
Testing follows a very similar syntax:
$ colibrita --test -f europarl200k.test.xml -l 1 -r 1 \
-o exp-l1r1 -T train-europarl200k/model/phrase-table.gz
This will generate a file exp-l1r1-output.xml that contains the system output.
Some notes:
- -T passes the original phrase table, which will be used as a fallback option.
- -o is the same output prefix used in the training step; it is used as input as well, and assumes a directory by this name exists.
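The fallback idea can be sketched as follows. This is a simplified illustration under stated assumptions, not Colibrita's actual code: classifiers are assumed to be keyed by source fragment and are represented as plain Python callables, and the phrase table is represented as a nested dict of scores. In the real system these roles are played by Timbl classifiers and a Moses phrase table.

```python
def translate_fragment(fragment, left_ctx, right_ctx, classifiers, phrase_table):
    """Translate one L1 fragment, falling back to the phrase table.

    ASSUMPTIONS (hypothetical data structures):
    - classifiers: dict mapping a source fragment to a callable
      taking (left_ctx, right_ctx) and returning a translation
    - phrase_table: dict mapping a source fragment to a dict of
      {translation: score}
    """
    classifier = classifiers.get(fragment)
    if classifier is not None:
        # A context-aware classifier exists for this fragment
        return classifier(left_ctx, right_ctx)
    options = phrase_table.get(fragment)
    if options:
        # Fallback: pick the highest-scoring context-free translation
        return max(options, key=options.get)
    return None  # fragment is unknown to both resources
```

The design choice shown here is that context-sensitive predictions take priority, and the phrase table is consulted only when no classifier is available for the fragment.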
Evaluation
System output can subsequently be evaluated against the test set using
colibrita-evaluate:
$ colibrita-evaluate --mtevaldir /path/to/mtevalscripts \
--ref europarl200k.test.xml --out exp-l1r1-output.xml
A summary of all scores will be written to exp-l1r1-output.summary.score.
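To give a feel for one of the MT metrics mentioned earlier, the sketch below computes word error rate (WER) in its textbook form: the word-level edit distance between system output and reference, divided by the reference length. This is an independent illustration; it is not the implementation in mtevalscripts, and scores from that package may differ in tokenisation and normalisation.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must not be empty")
    # prev[j] holds the edit distance between the processed reference
    # prefix and the first j hypothesis words
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


print(word_error_rate("a cup of coffee", "a cup coffee"))  # 0.25
```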
Owner
- Name: Maarten van Gompel
- Login: proycon
- Kind: user
- Location: Eindhoven, the Netherlands
- Company: KNAW Humanities Cluster & CLST, Radboud University
- Website: https://proycon.anaproy.nl
- Repositories: 213
- Profile: https://github.com/proycon
Research software engineer - NLP - AI - 🐧 Linux & open-source enthusiast - 🐍 Python/ 🌊C/C++ / 🦀 Rust / 🐚 Shell - 🔐 InfoSec - https://git.sr.ht/~proycon
CodeMeta (codemeta.json)
{
"@context": [
"https://doi.org/10.5063/schema/codemeta-2.0",
"http://schema.org",
{
"entryPoints": {
"@reverse": "schema:actionApplication"
},
"interfaceType": {
"@id": "codemeta:interfaceType"
}
}
],
"@type": "SoftwareSourceCode",
"identifier": "colibrita",
"name": "colibrita",
"version": "0.3.1",
"description": "Colibrita is a proof-of-concept translation assistance system that can translate L1 fragments in an L2 context. The system is designed prior to a new task (presented at SemEval 2014) concerning the translation of L1 fragments, i.e words or phrases, in an L2 context. This type of translation can be applied in writing assistance systems for language learners in which users write in their target language, but are allowed to occasionally back off to their native L1 when they are uncertain of the proper word or expression in L2. These L1 fragments are subsequently translated, along with the L2 context, into L2 fragments. Colibrita was developed to test whether L2 context information aids in translation of L1 fragments.",
"license": "GPLv3",
"url": "https://github.com/proycon/colibrita",
"producer": {
"@id": "https://www.ru.nl/cls",
"@type": "Organization",
"name": "Centre for Language Studies",
"url": "https://www.ru.nl/cls",
"parentOrganization": {
"@id": "https://www.ru.nl",
"name": "Radboud University",
"@type": "Organization",
"url": "https://www.ru.nl",
"location": {
"@type": "Place",
"name": "Nijmegen"
}
}
},
"author": [
{
"@id": "https://orcid.org/0000-0002-1046-0006",
"@type": "Person",
"givenName": "Maarten",
"familyName": "van Gompel",
"email": "proycon@anaproy.nl",
"affiliation": {
"@id": "https://www.ru.nl/cls"
}
}
],
"sourceOrganization": {
"@id": "https://www.ru.nl/cls"
},
"programmingLanguage": {
"@type": "ComputerLanguage",
"identifier": "python",
"name": "python"
},
"operatingSystem": "POSIX",
"codeRepository": "https://github.com/proycon/colibrita",
"softwareRequirements": [
{
"@type": "SoftwareApplication",
"identifier": "colibricore",
"name": "Colibri Core"
},
{
"@type": "SoftwareApplication",
"identifier": "moses",
"name": "Moses"
},
{
"@type": "SoftwareApplication",
"identifier": "GIZA++",
"name": "GIZA++"
},
{
"@type": "SoftwareApplication",
"identifier": "python3-timbl",
"name": "python-timbl"
}
],
"referencePublication": [
{
"@id": "http://hdl.handle.net/2066/129758",
"@type": "ScholarlyArticle",
"name": "Translation Assistance by Translation of L1 Fragments in an L2 Context",
"author": [
"Maarten van Gompel",
"Antal van den Bosch"
],
"pageStart": 871,
"pageEnd": 880,
"isPartOf": {
"@type": "PublicationIssue",
"datePublished": "2014-06-23",
"name": "Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)"
},
"url": "http://www.aclweb.org/anthology/P14-1082"
}
],
"interfaceType": "CLI",
"readme": "https://github.com/proycon/colibrita/blob/master/README.md",
"issueTracker": "https://github.com/proycon/colibrita/issues",
"releaseNotes": "https://github.com/proycon/colibrita/releases",
"developmentStatus": "inactive",
"keywords": [
"nlp",
"natural language processing",
"machine translation",
"collocations",
"translation",
"code switching",
"computer-aided language learning"
],
"dateCreated": "2013-07-09"
}
Committers
Last synced: 11 months ago
Top Committers
| Name | Email | Commits |
|---|---|---|
| Maarten van Gompel | p****n@a****l | 711 |
Issues and Pull Requests
Last synced: 11 months ago
All Time
- Total issues: 0
- Total pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Total issue authors: 0
- Total pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 0
- Pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Issue authors: 0
- Pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Dependencies
- colibricore *
- colibrimt *
- lxml *
- pynlpl *