colibrita

Colibrita is a proof-of-concept translation assistance system, translating L1 fragments in an L2 context, using machine learning and statistical machine translation techniques

https://github.com/proycon/colibrita

Repository

Basic Info

Host: GitHub
Owner: proycon
Language: Python
Default Branch: master
Homepage:
Size: 324 KB

Statistics

Stars: 1
Watchers: 2
Forks: 0
Open Issues: 0
Releases: 0

Created about 13 years ago · Last pushed about 8 years ago

Metadata Files

Readme Codemeta

Colibrita: Translation Assistance System

Colibrita is a proof-of-concept translation assistance system that can translate L1 fragments in an L2 context.

The system was designed ahead of a new task (presented at SemEval 2014) concerning the translation of L1 fragments, i.e. words or phrases, in an L2 context. This type of translation can be applied in writing assistance systems for language learners, in which users write in their target language but may occasionally fall back on their native L1 when they are uncertain of the proper word or expression in L2. These L1 fragments are subsequently translated, along with the L2 context, into L2 fragments.

Colibrita was developed to test whether L2 context information aids the translation of L1 fragments. The results were published at ACL 2014: Maarten van Gompel, Antal van den Bosch. Translation Assistance by Translation of L1 Fragments in an L2 Context. Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 871-880.

Installation

Colibrita is written in Python 3. It is a complex system involving quite a number of dependencies.

First make sure you have a modern Linux distribution with the necessary prerequisites: python3, python3-dev, python3-setuptools, python3-lxml, cython3, gcc, g++, autoconf, automake, autoconf-archive, libtool, libboost-dev, libboost-python

If you intend to build your own training models, then you will also require the following two dependencies:

Moses - https://github.com/moses-smt/mosesdecoder
GIZA++ - http://code.google.com/p/giza-pp/

Other Unix systems, including FreeBSD and Mac OS X, will most likely work too, but especially for the latter considerable extra effort may be required to install everything. The instructions here are tailored to Debian/Ubuntu-based Linux distributions.

In addition to the above dependencies, Colibrita depends on pynlpl, colibri-core, Timbl and python-timbl.

Install PyNLPl from the Python Package Index (or alternatively from https://github.com/proycon/pynlpl):

$ sudo easy_install3 pynlpl

Download colibri-core from https://github.com/proycon/colibri-core and install as follows:

$ bash bootstrap
$ ./configure 
$ make
$ sudo make install
$ sudo python3 ./setup.py install

Install Timbl; it may be available in your package manager if you use Debian/Ubuntu:

$ sudo apt-get install timbl

Otherwise obtain it from http://ilk.uvt.nl/timbl and compile manually:

$ ./configure
$ make
$ make install

Install Python-Timbl from the Python Package Index (or alternatively from https://github.com/proycon/python-timbl):

$ sudo easy_install3 python-timbl

Then install colibrita from https://github.com/proycon/colibrita:

 $ sudo python3 ./setup.py install

Note: If you want to reproduce the results of our ACL paper, then make sure to do git checkout v0.2.1 in the Colibrita repository prior to installation. Colibrita may have advanced since then.

Finally, if you want to evaluate with well-known MT metrics such as BLEU, METEOR, NIST, TER, WER, and PER, download and unpack http://lst.science.ru.nl/~proycon/mtevalscripts.tar.gz

Usage

The following tools are available:

colibrita - The main system, used for training and testing.
colibrita-evaluate - Tool for evaluation of system output. Point --mtevaldir to the directory where you unpacked mtevalscripts.tar.gz if you want common MT metrics in your report.
colibrita-setgen - Tool for generating training and test sets from parallel corpus data, GIZA++ word alignments and a Moses phrase table.
colibrita-datastats - Reports some statistics on a dataset (train or test, XML)
colibrita-manualsetbuild - Small interactive console-based script for creating datasets manually

All tools take -h for help on usage options.
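
Before starting a long experiment it can help to confirm that the tools listed above are actually on your PATH after installation. The following is a minimal sketch using only the Python standard library; the function name missing_tools is hypothetical and not part of Colibrita.

```python
import shutil

# Tool names as listed in this README.
COLIBRITA_TOOLS = [
    "colibrita",
    "colibrita-evaluate",
    "colibrita-setgen",
    "colibrita-datastats",
    "colibrita-manualsetbuild",
]


def missing_tools(tools=COLIBRITA_TOOLS):
    """Return the names in `tools` that cannot be found on the current PATH."""
    return [name for name in tools if shutil.which(name) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All Colibrita tools found.")
```

An empty result only shows that the executables can be located; it does not verify that their dependencies (colibri-core, Timbl, python-timbl) load correctly.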

Set generation

Building a model starts with generating a training set from a parallel corpus. You need two plain-text files, one in the source language and one in the target language, with one sentence per line, such that the sentence on a given line of one file is the translation of the sentence on the same line of the other. In this documentation we use two files from our ACL 2014 experiments, obtainable from http://lst.science.ru.nl/~proycon/colibrita-acl2014-data.zip :

* europarl200k-train.nl.txt
* europarl200k-train.en.txt
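
Since the two files are aligned purely by line number, a mismatch in length silently misaligns every following sentence pair. A small sanity check such as the sketch below can catch this before the expensive set generation step. The function name count_sentence_pairs is hypothetical, and UTF-8 encoding is an assumption about the data.

```python
from itertools import zip_longest


def count_sentence_pairs(source_path, target_path):
    """Count line-aligned sentence pairs; raise ValueError if the files differ in length."""
    pairs = 0
    with open(source_path, encoding="utf-8") as src, \
         open(target_path, encoding="utf-8") as tgt:
        # zip_longest pads the shorter file with None, which exposes a length mismatch.
        for lineno, (s, t) in enumerate(zip_longest(src, tgt), start=1):
            if s is None or t is None:
                raise ValueError(f"files differ in length at line {lineno}")
            pairs += 1
    return pairs
```

For the files above this would be called as count_sentence_pairs("europarl200k-train.nl.txt", "europarl200k-train.en.txt").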

Given this input data, you can use Colibrita's setgen tool:

 $ colibrita-setgen --train --mosesdir=/path/to/mosesdecoder -S nl -T en \
 -s europarl200k-train.nl.txt -t europarl200k-train.en.txt --bindir=/usr/local/bin \
 -o europarl200k

This tool invokes Moses (which in turn invokes GIZA++) and the Colibri-Core patternmodeller. It builds word alignments, a phrase-translation table and pattern models, and eventually produces an XML file. This process may take a very long time and demands considerable memory. The output prefix given with -o is used in the names of many of the output files. The parameters --joinedprobabilitythreshold and --divergencefrombestthreshold keep weaker alignments and alternatives out of the set; they correspond to the parameters λ1 and λ2 in our ACL 2014 paper.
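
To give an intuition for the two thresholds, the sketch below filters the translation options of a single source fragment. It is an illustration under stated assumptions, not Colibrita's code: it assumes the "joined probability" of an option is the product P(t|s) · P(s|t), that λ1 is a lower bound on this product, and that λ2 is a lower bound on the ratio between an option's product and that of the best option. Consult the paper for the exact definitions. The function name and the example probabilities are made up.

```python
def filter_options(options, joined_threshold, divergence_threshold):
    """Filter translation options for one source fragment.

    options: dict mapping target phrase -> (p_t_given_s, p_s_given_t).
    Returns a dict mapping the surviving target phrases to their joined score.
    ASSUMPTION: joined score = P(t|s) * P(s|t); see the lead-in above.
    """
    scored = {t: p_ts * p_st for t, (p_ts, p_st) in options.items()}
    if not scored:
        return {}
    best = max(scored.values())
    return {
        t: score
        for t, score in scored.items()
        # Assumed lambda1: absolute lower bound on the joined score.
        if score >= joined_threshold
        # Assumed lambda2: lower bound relative to the strongest option.
        and score >= divergence_threshold * best
    }


# Made-up probabilities for the Dutch fragment "huis".
example = {
    "house": (0.8, 0.5),
    "home": (0.4, 0.5),
    "building": (0.05, 0.1),
}
kept = filter_options(example, joined_threshold=0.01, divergence_threshold=0.4)
```

With these made-up numbers, "building" falls below the absolute threshold, while "home" survives because its score is at least 40% of the best option's score.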

A test set can be generated in the same fashion:

 $ colibrita-setgen --test --mosesdir=/path/to/mosesdecoder -S nl -T en \
 -s europarl200k-test.nl.txt -t europarl200k-test.en.txt --bindir=/usr/local/bin \
 -o europarl200k

Training

The next step is feature extraction and classifier training:

$ colibrita --train -f europarl200k.train.xml -l 1 -r 1 \
-o exp-l1r1 --Tclones 4 --trainfortest europarl200k.test.xml

The output consists of a large number of classifiers (ibase files) in the directory specified with -o.

Some notes about this example:

-f specifies the training set, generated by colibrita-setgen in the previous step.
-l 1 sets a left context size of one
-r 1 sets a right context size of one
-o specifies a new output prefix; it is used in the names of generated files, and a directory with this name is created to hold all classifiers and intermediate files
--Tclones 4 runs Timbl on four cores
--trainfortest generates only those classifiers that will be used in testing, saving time and resources. This implies that the model has to be retrained if other test data is offered, and that it can never be used in a live setting.
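
The left and right context sizes determine how many L2 words around the L1 fragment are available to the classifier. The sketch below shows what a context window of one word on each side looks like for a tokenised sentence. The function name, the padding token and the example sentence are hypothetical; the actual feature layout Colibrita passes to Timbl may differ.

```python
def context_features(tokens, start, length, left=1, right=1, pad="<pad>"):
    """Return (left_context, fragment, right_context) for tokens[start:start+length].

    Positions beyond the sentence boundaries are filled with `pad`.
    """
    if length < 1 or start < 0 or start + length > len(tokens):
        raise ValueError("fragment span lies outside the sentence")
    padded_left = [pad] * left + tokens[:start]
    left_context = padded_left[len(padded_left) - left:]
    right_context = (tokens[start + length:] + [pad] * right)[:right]
    fragment = tokens[start:start + length]
    return left_context, fragment, right_context


# English (L2) sentence with a Dutch (L1) fragment "huis" at token index 5.
sentence = "I want to buy a huis in Amsterdam".split()
features = context_features(sentence, start=5, length=1, left=1, right=1)
# features == (["a"], ["huis"], ["in"])
```

Larger values for left and right give the classifier more context at the cost of sparser features.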

Testing

Testing follows a very similar syntax:

$ colibrita --test -f europarl200k.test.xml -l 1 -r 1 \
-o exp-l1r1 -T train-europarl200k/model/phrase-table.gz

This will generate a file exp-l1r1-output.xml that contains the system output.

Some notes:

-T passes the original phrase table, which is used as a fallback option
-o is the same output prefix used in the training step; it serves as input here as well, and a directory by this name must already exist
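
The fallback idea can be illustrated as follows: when no classifier produces a translation for a fragment, the most probable translation from the phrase table is used instead. This is a sketch of the general idea only, not Colibrita's implementation. It assumes the standard Moses phrase-table layout, in which fields are separated by ||| and the third score is the direct translation probability P(target|source). All function names are hypothetical and the example entries are made up.

```python
def parse_phrase_table_line(line):
    """Split a Moses phrase-table line into (source, target, scores)."""
    fields = [field.strip() for field in line.split("|||")]
    source, target = fields[0], fields[1]
    scores = [float(value) for value in fields[2].split()]
    return source, target, scores


def load_best_translations(lines):
    """Map each source phrase to its (target, probability) with the highest P(target|source)."""
    best = {}
    for line in lines:
        if not line.strip():
            continue
        source, target, scores = parse_phrase_table_line(line)
        p_direct = scores[2]  # ASSUMPTION: third score is P(target|source)
        if source not in best or p_direct > best[source][1]:
            best[source] = (target, p_direct)
    return best


def translate(fragment, classifier_output, best):
    """Prefer the classifier's output; fall back to the phrase table otherwise."""
    if classifier_output is not None:
        return classifier_output
    entry = best.get(fragment)
    return entry[0] if entry is not None else None


# Made-up phrase-table entries for illustration.
table_lines = [
    "huis ||| house ||| 0.5 0.4 0.7 0.6 ||| 0-0 ||| 10 8 7",
    "huis ||| home ||| 0.3 0.2 0.2 0.1 ||| 0-0 ||| 10 8 2",
]
best_translations = load_best_translations(table_lines)
```

A real phrase-table.gz could be read with gzip.open(path, "rt", encoding="utf-8") and passed to load_best_translations in place of table_lines, although a full phrase table may be too large to hold in memory this way.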

Evaluation

System output can subsequently be evaluated against the test set using colibrita-evaluate:

$ colibrita-evaluate --mtevaldir /path/to/mtevalscripts \
--ref europarl200k.test.xml --out exp-l1r1-output.xml

A summary of all scores will be written to exp-l1r1-output.summary.score.

Owner

Name: Maarten van Gompel
Login: proycon
Kind: user
Location: Eindhoven, the Netherlands
Company: KNAW Humanities Cluster & CLST, Radboud University

Website: https://proycon.anaproy.nl
Repositories: 213
Profile: https://github.com/proycon

Research software engineer - NLP - AI - 🐧 Linux & open-source enthusiast - 🐍 Python/ 🌊C/C++ / 🦀 Rust / 🐚 Shell - 🔐 InfoSec - https://git.sr.ht/~proycon

CodeMeta (codemeta.json)

{
  "@context": [
    "https://doi.org/10.5063/schema/codemeta-2.0",
    "http://schema.org",
    {
      "entryPoints": {
        "@reverse": "schema:actionApplication"
      },
      "interfaceType": {
        "@id": "codemeta:interfaceType"
      }
    }
  ],
  "@type": "SoftwareSourceCode",
  "identifier": "colibrita",
  "name": "colibrita",
  "version": "0.3.1",
  "description": "Colibrita is a proof-of-concept translation assistance system that can translate L1 fragments in an L2 context. The system is designed prior to a new task (presented at SemEval 2014) concerning the translation of L1 fragments, i.e words or phrases, in an L2 context. This type of translation can be applied in writing assistance systems for language learners in which users write in their target language, but are allowed to occasionally back off to their native L1 when they are uncertain of the proper word or expression in L2. These L1 fragments are subsequently translated, along with the L2 context, into L2 fragments. Colibrita was developed to test whether L2 context information aids in translation of L1 fragments.",
  "license": "GPLv3",
  "url": "https://github.com/proycon/colibrita",
  "producer": {
    "@id": "https://www.ru.nl/cls",
    "@type": "Organization",
    "name": "Centre for Language Studies",
    "url": "https://www.ru.nl/cls",
    "parentOrganization": {
      "@id": "https://www.ru.nl",
      "name": "Radboud University",
      "@type": "Organization",
      "url": "https://www.ru.nl",
      "location": {
        "@type": "Place",
        "name": "Nijmegen"
      }
    }
  },
  "author": [
    {
      "@id": "https://orcid.org/0000-0002-1046-0006",
      "@type": "Person",
      "givenName": "Maarten",
      "familyName": "van Gompel",
      "email": "proycon@anaproy.nl",
      "affiliation": {
        "@id": "https://www.ru.nl/cls"
      }
    }
  ],
  "sourceOrganization": {
    "@id": "https://www.ru.nl/cls"
  },
  "programmingLanguage": {
    "@type": "ComputerLanguage",
    "identifier": "python",
    "name": "python"
  },
  "operatingSystem": "POSIX",
  "codeRepository": "https://github.com/proycon/colibrita",
  "softwareRequirements": [
    {
      "@type": "SoftwareApplication",
      "identifier": "colibricore",
      "name": "Colibri Core"
    },
    {
      "@type": "SoftwareApplication",
      "identifier": "moses",
      "name": "Moses"
    },
    {
      "@type": "SoftwareApplication",
      "identifier": "GIZA++",
      "name": "GIZA++"
    },
    {
      "@type": "SoftwareApplication",
      "identifier": "python3-timbl",
      "name": "python-timbl"
    }
  ],
  "referencePublication": [
    {
      "@id": "http://hdl.handle.net/2066/129758",
      "@type": "ScholarlyArticle",
      "name": "Translation Assistance by Translation of L1 Fragments in an L2 Context",
      "author": [
        "Maarten van Gompel",
        "Antal van den Bosch"
      ],
      "pageStart": 871,
      "pageEnd": 880,
      "isPartOf": {
        "@type": "PublicationIssue",
        "datePublished": "2014-06-23",
        "name": "Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)"
      },
      "url": "http://www.aclweb.org/anthology/P14-1082"
    }
  ],
  "interfaceType": "CLI",
  "readme": "https://github.com/proycon/colibrita/blob/master/README.md",
  "issueTracker": "https://github.com/proycon/colibrita/issues",
  "releaseNotes": "https://github.com/proycon/colibrita/releases",
  "developmentStatus": "inactive",
  "keywords": [
    "nlp",
    "natural language processing",
    "machine translation",
    "collocations",
    "translation",
    "code switching",
    "computer-aided language learning"
  ],
  "dateCreated": "2013-07-09"
}

Committers

All Time

Total Commits: 711
Total Committers: 1
Avg Commits per committer: 711.0
Development Distribution Score (DDS): 0.0

Past Year

Commits: 0
Committers: 0
Avg Commits per committer: 0.0
Development Distribution Score (DDS): 0.0

Top Committers

Name	Email	Commits
Maarten van Gompel	p**n@a**l	711

Committer Domains (Top 20 + Academic)

anaproy.nl: 1

Issues and Pull Requests

All Time

Total issues: 0
Total pull requests: 0
Average time to close issues: N/A
Average time to close pull requests: N/A
Total issue authors: 0
Total pull request authors: 0
Average comments per issue: 0
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 0
Pull requests: 0
Average time to close issues: N/A
Average time to close pull requests: N/A
Issue authors: 0
Pull request authors: 0
Average comments per issue: 0
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0

Science Score: 26.0%