portagetextprocessing

Text processing tools that came out of the Portage SMT project — Outils de traitement de texte issus du projet Portage de TAS

https://github.com/nrc-cnrc/portagetextprocessing

Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (12.1%) to scientific vocabulary

Keywords

machine-translation mt natural-language-processing neural-machine-translation nlp nmt preprocessing smt statistical-machine-translation text-processing
Last synced: 4 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: nrc-cnrc
  • License: mit
  • Language: Perl
  • Default Branch: main
  • Homepage:
  • Size: 414 KB
Statistics
  • Stars: 1
  • Watchers: 10
  • Forks: 0
  • Open Issues: 1
  • Releases: 1
Created almost 5 years ago · Last pushed 11 months ago
Metadata Files
Readme License Citation

README.md

Portage Text Processing

This repository contains a number of text pre- and post-processing utilities written for the Portage Statistical Machine Translation project. Because they are often useful outside that context, we have moved them into this repository, which is designed to be trivial to install.

Installation

Clone this repo to the location of your choice and add this line to your .profile or .bashrc:

source /path/to/PortageTextProcessing/SETUP.bash

Dependencies

PortageTextProcessing requires:

- Perl >= 5.14, as perl on your PATH, with the packages listed in cpanfile;
- any version of Python 3, as python3 on your PATH, with the packages listed in requirements.txt;
- /bin/bash, /bin/sh, /usr/bin/env;
- xmllint (comes with libxml2) and xml_grep (comes with Perl's XML::Twig).

First, check if you already have these dependencies, since they are very common: go to tests/check-installation/ and run ./run-test.sh. This test suite will flag any missing dependencies.
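As a rough complement to that test suite, the presence of the commands and paths listed above can be checked from Python. This is a minimal sketch, not the repository's own check: the helper name `check_dependencies` is hypothetical, and it only looks for the executables. It does not verify the Perl version, the Perl modules in cpanfile, or the Python packages in requirements.txt.

```python
import os
import shutil

# Names taken from the dependency list above; the helper itself is hypothetical.
REQUIRED_COMMANDS = ["perl", "python3", "xmllint", "xml_grep"]
REQUIRED_PATHS = ["/bin/bash", "/bin/sh", "/usr/bin/env"]


def check_dependencies():
    """Return a dict mapping each required command or path to True/False."""
    status = {}
    for command in REQUIRED_COMMANDS:
        # shutil.which returns None when the command is not on PATH.
        status[command] = shutil.which(command) is not None
    for path in REQUIRED_PATHS:
        status[path] = os.path.exists(path)
    return status


if __name__ == "__main__":
    for name, found in check_dependencies().items():
        print(f"{name}: {'found' if found else 'MISSING'}")
```

Anything reported as missing here would still need to be confirmed with the check-installation test suite, which is the authoritative check.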

Install missing dependencies with the package manager of your choice, ideally the OS's own distro manager, like apt, yum, or brew.

CentOS 7 packages:

yum install perl-XML-Twig perl-XML-XPath perl-XML-Writer libxml2 python3

Ubuntu 20.04 packages:

apt-get install libxml-twig-perl libxml-xpath-perl libxml-writer-perl
apt-get install libxml2-utils xml-twig-tools python3

For the Python 3 dependencies, with any OS:

pip3 install -r requirements.txt

Testing

For more extensive testing, go to tests/ and run ./run-all-tests.sh. Go into any directory showing errors and examine _log.run-test to see what went wrong, or run ./run-test.sh interactively.

Some test suites are parallelized to run faster. If you have difficulty figuring out which command caused the error, you can also run make -B interactively in any test suite instead of ./run-test.sh, to run all its test cases sequentially and stop at the first error.

If you have installed PortageClusterUtils, you can also run all the test suites in parallel with ./run-all-tests.sh -j 12.
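If you would rather drive the suites yourself, a sequential runner could look roughly like the sketch below. It assumes that every test suite is a direct subdirectory of tests/ containing an executable run-test.sh, and that a non-zero exit status signals failure. Both are assumptions based on the layout described above, and `run_suites` is a hypothetical helper, not part of the repository.

```python
import subprocess
from pathlib import Path


def run_suites(tests_dir):
    """Run each suite's run-test.sh in turn; return the names of failing suites."""
    failing = []
    for script in sorted(Path(tests_dir).glob("*/run-test.sh")):
        suite_dir = script.parent
        # Assumption: a non-zero exit status means the suite failed.
        result = subprocess.run(
            [str(script.resolve())],
            cwd=suite_dir,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        if result.returncode != 0:
            failing.append(suite_dir.name)
    return failing


if __name__ == "__main__":
    for name in run_suites("tests"):
        print(f"FAILED: {name}")
```

Output is discarded here to keep the summary short; to see what went wrong in a failing suite, rerun it interactively as described above.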

Documentation

Each script accepts the -h option to output its documentation to your terminal.

List of scripts

| Script                          | Brief Description                                          |
| ------------------------------- | ---------------------------------------------------------- |
| clean-utf8-text.pl              | Clean up spaces, control chars, hyphen, etc. in utf8 text. |
| clean_utf8.py                   | Yet another utf8 clean up script, now in Python 3.         |
| crlf2lf.sh                      | Convert CRLF (DOS-style) line endings to LF (UNIX-style).  |
| diff-round.pl                   | Like diff, but ignore rounding errors.                     |
| expand-auto.pl                  | Like expand, with automatically calculated tab stops.      |
| filter-long-lines.pl            | Filter out long lines.                                     |
| filter-parallel.py              | Filter parallel files by scores.                           |
| fix-slashes.pl                  | Separate slash-joined words.                               |
| lc-utf8.pl                      | Map utf8 text to lowercase, regardless of your locale.     |
| lfl2tmx.pl                      | Create a TMX file from plain text aligned files.           |
| li-sort.sh                      | Locale-independent sort.                                   |
| lines.py                        | Extract the given lines from a file.                       |
| map-chinese-punct.pl            | Map Chinese wide punctuation marks to similar narrow ones. |
| normalize-iu-spelling.pl        | Apply Inuktut syllabic character normalization rules.      |
| normalize-unicode.pl            | Normalize unicode input into canonical representations.    |
| parallel-uniq.pl                | Like uniq, but take into consideration parallel files.     |
| ridbom.sh                       | Remove the byte-order marker (BOM) from UTF8 input.        |
| second-to-hms.pl                | Convert from seconds to HH:MM:SS or vice-versa.            |
| select-line                     | Get a given line from a text file.                         |
| select-lines.py                 | Extract the given lines from a file.                       |
| select-random-chunks.py         | Sample random chunks from a file or by indices.            |
| sort-by-length.pl               | Sort a text file by line length.                           |
| stableuniq.pl                   | Remove duplicates without sorting.                         |
| strip-parallel-blank-lines.py   | Strip parallel blank lines from two line-aligned files.    |
| strip-parallel-duplicates.py    | Strip aligned lines that are the same in both files.       |
| tmx2lfl.pl                      | Convert a TMX file to plain text aligned files.            |
| udetokenize.pl                  | Detokenize utf8 text, reversing utokenize.pl.              |
| utokenize.pl                    | Tokenize utf8 text, e.g., for machine translation.         |
| which-test.sh                   | Which-like program with reliable exit status.              |
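Two of the descriptions in this list are compact enough to illustrate in a few lines of Python: removing duplicates without sorting, and converting between seconds and HH:MM:SS. The functions below are illustrative sketches of those ideas only. They are not the code of stableuniq.pl or second-to-hms.pl, whose options and edge-case behaviour may differ, and the function names are hypothetical.

```python
def stable_uniq(lines):
    """Drop repeated lines while keeping first occurrences in input order."""
    seen = set()
    kept = []
    for line in lines:
        if line not in seen:
            seen.add(line)
            kept.append(line)
    return kept


def seconds_to_hms(total_seconds):
    """Format a non-negative whole number of seconds as HH:MM:SS."""
    hours, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"


def hms_to_seconds(hms):
    """Parse an HH:MM:SS string back into a whole number of seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


print(stable_uniq(["b", "a", "b", "c", "a"]))  # ['b', 'a', 'c']
print(seconds_to_hms(3725))                    # 01:02:05
print(hms_to_seconds("01:02:05"))              # 3725
```

For the real behaviour of any script, its -h output remains the reference.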

Contributing

If you want to contribute scripts to this repo, please:

- Make sure they require no compilation or installation (beyond sourcing SETUP.bash).
- Add unit tests for your scripts under tests/.
- Keep them relevant, which means pretty much anything related to text processing goes.

Citation

@misc{Portage_Text_Processing,
  author = {Larkin, Samuel and Joanis, Eric and Stewart, Darlene and Simard, Michel and Foster, George and Ueffing, Nicola and Tikuisis, Aaron},
  license = {MIT},
  title = {{Portage Text Processing}},
  url = {https://github.com/nrc-cnrc/PortageTextProcessing},
  year = {2022},
}

Copyright

Traitement multilingue de textes / Multilingual Text Processing
Centre de recherche en technologies numériques / Digital Technologies Research Centre
Conseil national de recherches Canada / National Research Council Canada
Copyright 2022, Sa Majesté le Roi du Chef du Canada / His Majesty the King in Right of Canada
Published under the MIT License (see LICENSE)

Owner

  • Name: National Research Council of Canada — Conseil national de recherches du Canada
  • Login: nrc-cnrc
  • Kind: organization
  • Email: info@nrc-cnrc.gc.ca
  • Location: Canada

Citation (CITATION.cff)

# This CITATION.cff file was generated with cffinit.
# Visit https://bit.ly/cffinit to generate yours today!

cff-version: 1.2.0
title: Portage Text Processing
message: >-
  If you use this software, please cite it using the
  metadata from this file.
type: software
authors:
  - given-names: Samuel
    family-names: Larkin
    email: Samuel.Larkin@nrc-cnrc.gc.ca
    affiliation: National Research Council Canada
  - given-names: Eric
    family-names: Joanis
    email: Eric.Joanis@nrc-cnrc.gc.ca
    affiliation: National Research Council Canada
  - given-names: Darlene
    family-names: Stewart
    email: Darlene.Stewart@nrc-cnrc.gc.ca
    affiliation: National Research Council Canada
  - given-names: Michel
    family-names: Simard
    email: Michel.Simard@nrc-cnrc.gc.ca
    affiliation: National Research Council Canada
  - given-names: George
    family-names: Foster
  - given-names: Nicola
    family-names: Ueffing
  - given-names: Aaron
    family-names: Tikuisis
repository-code: 'https://github.com/nrc-cnrc/PortageTextProcessing'
abstract: >-
  Text processing tools that came out of the Portage
  SMT project — Outils de traitement de texte issus
  du projet Portage de TAS
keywords:
  - MT
  - Machine Translation
  - NLP
  - NMT
  - Natural Language Processing
  - Neural Machine Translation
  - Preprocessing
  - SMT
  - Statistical Machine Translation
  - Text Processing
license: MIT

GitHub Events

Total
  • Pull request event: 1
  • Create event: 1
Last Year
  • Pull request event: 1
  • Create event: 1

Issues and Pull Requests

Last synced: 4 months ago

All Time
  • Total issues: 0
  • Total pull requests: 8
  • Average time to close issues: N/A
  • Average time to close pull requests: 2 days
  • Total issue authors: 0
  • Total pull request authors: 2
  • Average comments per issue: 0
  • Average comments per pull request: 0.38
  • Merged pull requests: 6
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 2
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 1
  • Average comments per issue: 0
  • Average comments per pull request: 0.0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
Pull Request Authors
  • joanise (6)
  • SamuelLarkin (4)
Top Labels
Issue Labels
Pull Request Labels

Dependencies

requirements.txt pypi
  • click >=8
  • regex *
.github/workflows/test-suite.yml actions
  • actions/checkout v3 composite
  • actions/setup-python v4 composite