colloquery

Web application for searching for phrases/collocations/synonyms in phrase translation tables

https://github.com/proycon/colloquery

Science Score: 26.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (15.5%) to scientific vocabulary

Keywords

computational-linguistics machine-translation mt natural-language-processing nlp
Last synced: 6 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: proycon
  • License: agpl-3.0
  • Language: Python
  • Default Branch: master
  • Homepage:
  • Size: 321 KB
Statistics
  • Stars: 2
  • Watchers: 3
  • Forks: 0
  • Open Issues: 1
  • Releases: 0
Created about 9 years ago · Last pushed over 6 years ago
Metadata Files
Readme License Codemeta

README.rst

.. image:: http://applejack.science.ru.nl/lamabadge.php/colloquery
   :target: http://applejack.science.ru.nl/languagemachines/

.. image:: https://www.repostatus.org/badges/latest/inactive.svg
   :alt: Project Status: Inactive – The project has reached a stable, usable state but is no longer being actively developed; support/maintenance will be provided as time allows.
   :target: https://www.repostatus.org/#inactive

Colloquery
============

Colloquery is a web application to search for phrase translations, or
collocations, as well as synonyms, in bilingual phrase translation tables.

It was developed for `Van Dale <https://www.vandale.nl>`_ by the `Centre for Language
and Speech Technology <https://www.ru.nl/clst>`_, Radboud University Nijmegen, and is
licensed under the GNU Affero General Public License.

.. image:: https://raw.github.com/proycon/colloquery/master/screenshot.jpg
    :alt: Colloquery screenshot
    :align: center

Installation
--------------

First, clone this repository and edit ``settings.py``.

Colloquery is not trivial to set up and train, as it relies on numerous
external dependencies:

* Python 3
* MongoDB
* mongoengine
* Django

On Debian/Ubuntu systems, these can be installed using ``sudo apt-get install
python3 mongodb python3-mongoengine python3-django``.

For the data generation step, the following additional dependencies are required:

* colibri-core (shipped as part of LaMachine)
* colibri-mt

To create phrase translation tables in the first place, use the Moses training
pipeline, which in turn invokes GIZA++:

* Moses
* GIZA++

Data Generation
--------------------

* Prepare your parallel corpus files. A parallel corpus consists of two plain-text,
  UTF-8 encoded files, one for the source language (``corpus.fr`` in our example) and
  one for the target language (``corpus.en``). Make sure they are tokenised,
  lower-cased and contain one sentence per line (you can use ucto for this).
  Sentences on the same line in the two files are considered translations of each
  other.
* Train a phrase translation table using Moses::

    $ /path/to/moses/scripts/training/train-model.perl -external-bin-dir /path/to/moses/bin -root-dir . --parallel --corpus corpus --f fr --e en --first-step 1 --last-step 8

* Invoke the data generation pipeline of Colloquery, adjusting the thresholds as
  needed (see ``./manage.py generatedata --help``). This assumes a running
  and properly configured MongoDB::

    $ ./manage.py generatedata --title "YourCorpus" --phrasetable corpus.fr-en.phrasetable --sourcelang fr --targetlang en --sourcecorpus corpus.fr --targetcorpus corpus.en --pst 0.2 --pts 0.2 --divergencethreshold 0.1 --freqthreshold 4
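
The corpus requirements from the first step can be checked before starting the
lengthy Moses training. The following is a minimal sketch and not part of
Colloquery; the function name ``check_parallel_corpus`` is hypothetical. It only
verifies that both sides have the same number of lines, that no line is empty and
that all text is lower-cased. It does not check tokenisation.

```python
from itertools import zip_longest


def check_parallel_corpus(source_lines, target_lines):
    """Return (line_number, problem) tuples for a sentence-aligned corpus.

    Both arguments are iterables of lines, for example open file objects.
    """
    problems = []
    pairs = zip_longest(source_lines, target_lines)
    for number, (src, tgt) in enumerate(pairs, start=1):
        # zip_longest pads the shorter side with None, so a None value
        # means one file ran out of lines before the other.
        if src is None or tgt is None:
            problems.append((number, "files differ in length"))
            break
        for side, line in (("source", src), ("target", tgt)):
            text = line.rstrip("\n")
            if not text.strip():
                problems.append((number, f"{side} line is empty"))
            elif text != text.lower():
                problems.append((number, f"{side} line is not lower-cased"))
    return problems
```

To check the example files, open ``corpus.fr`` and ``corpus.en`` with
``encoding="utf-8"`` and pass the two file objects to the function. A
``UnicodeDecodeError`` raised while reading indicates that a file is not valid
UTF-8.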

The Moses training and data generation steps may take considerable time and system
resources (most notably memory). Set sane thresholds to prevent the data from
becoming unmanageably large.
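
To get a feel for how probability thresholds shrink the data, the sketch below
filters a Moses phrase table by its two phrase translation probabilities. It is an
illustration only and not Colloquery's actual implementation. It assumes the
standard Moses text format (fields separated by ``|||``, with the first and third
scores being p(source|target) and p(target|source)), and it assumes that ``--pst``
and ``--pts`` refer to those two probabilities. The function names are
hypothetical.

```python
def parse_phrase_table_line(line):
    """Split one Moses phrase table line into source, target and scores."""
    fields = [field.strip() for field in line.split("|||")]
    source, target = fields[0], fields[1]
    scores = [float(value) for value in fields[2].split()]
    return source, target, scores


def filter_phrase_pairs(lines, min_pst=0.2, min_pts=0.2):
    """Yield (source, target, pst, pts) for pairs that pass both thresholds.

    Assumption: scores[0] is p(source|target) and scores[2] is
    p(target|source), as in the standard Moses score order.
    """
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        source, target, scores = parse_phrase_table_line(line)
        pst, pts = scores[0], scores[2]
        if pst >= min_pst and pts >= min_pts:
            yield source, target, pst, pts
```

Counting how many pairs survive for a few candidate threshold values, on a sample
of the phrase table, gives a rough idea of the resulting data size before running
the full pipeline.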

Owner

  • Name: Maarten van Gompel
  • Login: proycon
  • Kind: user
  • Location: Eindhoven, the Netherlands
  • Company: KNAW Humanities Cluster & CLST, Radboud University

Research software engineer - NLP - AI - 🐧 Linux & open-source enthusiast - 🐍 Python/ 🌊C/C++ / 🦀 Rust / 🐚 Shell - 🔐 InfoSec - https://git.sr.ht/~proycon

CodeMeta (codemeta.json)

{
  "@context": [
    "https://doi.org/10.5063/schema/codemeta-2.0",
    "http://schema.org",
    {
      "entryPoints": {
        "@reverse": "schema:actionApplication"
      },
      "interfaceType": {
        "@id": "codemeta:interfaceType"
      }
    }
  ],
  "@type": "SoftwareSourceCode",
  "identifier": "colloquery",
  "name": "Colloquery",
  "version": "0.1.1",
  "description": "Web application for searching for phrases/collocations/synonyms in phrase translation tables",
  "license": "AGPL-3.0-or-later",
  "url": "https://github.com/proycon/colloquery",
  "producer": {
    "@id": "https://www.ru.nl/clst",
    "@type": "Organization",
    "name": "Centre for Language and Speech Technology",
    "url": "https://www.ru.nl/clst",
    "parentOrganization": {
      "@id": "https://www.ru.nl/cls",
      "@type": "Organization",
      "name": "Centre for Language Studies",
      "url": "https://www.ru.nl/cls",
      "parentOrganization": {
        "@id": "https://www.ru.nl",
        "name": "Radboud University",
        "@type": "Organization",
        "url": "https://www.ru.nl",
        "location": {
          "@type": "Place",
          "name": "Nijmegen"
        }
      }
    }
  },
  "author": [
    {
      "@id": "https://orcid.org/0000-0002-1046-0006",
      "@type": "Person",
      "givenName": "Maarten",
      "familyName": "van Gompel",
      "email": "proycon@anaproy.nl",
      "affiliation": {
        "@id": "https://www.ru.nl/clst"
      }
    }
  ],
  "sourceOrganization": {
    "@id": "https://www.ru.nl/clst"
  },
  "programmingLanguage": {
    "@type": "ComputerLanguage",
    "identifier": "python",
    "name": "python"
  },
  "operatingSystem": "POSIX",
  "codeRepository": "https://github.com/proycon/colloquery",
  "softwareRequirements": [
    {
      "@type": "SoftwareApplication",
      "identifier": "django",
      "name": "django"
    }
  ],
  "funder": [
    {
      "@type": "Organization",
      "name": "Van Dale",
      "url": "https://www.vandale.nl"
    }
  ],
  "readme": "https://github.com/proycon/colloquery/blob/master/README.rst",
  "issueTracker": "https://github.com/proycon/colloquery/issues",
  "releaseNotes": "https://github.com/proycon/colloquery/releases",
  "developmentStatus": "inactive",
  "keywords": [
    "nlp",
    "natural language processing",
    "machine translation",
    "collocations",
    "translation"
  ],
  "dateCreated": "2017-01-29",
  "entryPoints": [
    {
      "@type": "EntryPoint",
      "name": "colloquery",
      "urlTemplate": "https://colloquery.science.ru.nl",
      "description": "Web-application",
      "interfaceType": "WUI"
    }
  ]
}

Committers

Last synced: 8 months ago

All Time
  • Total Commits: 113
  • Total Committers: 1
  • Avg Commits per committer: 113.0
  • Development Distribution Score (DDS): 0.0
Past Year
  • Commits: 0
  • Committers: 0
  • Avg Commits per committer: 0.0
  • Development Distribution Score (DDS): 0.0
Top Committers
Name Email Commits
Maarten van Gompel p****n@a****l 113

Issues and Pull Requests

Last synced: 8 months ago

All Time
  • Total issues: 1
  • Total pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Total issue authors: 1
  • Total pull request authors: 0
  • Average comments per issue: 0.0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • proycon (1)
Top Labels
Issue Labels
enhancement (1)