clean_with_moses

Clean parallel corpus with moses.

https://github.com/este1le/clean_with_moses

Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (6.4%) to scientific vocabulary

Repository

Basic Info
  • Host: GitHub
  • Owner: Este1le
  • License: MIT
  • Language: Shell
  • Default Branch: main
  • Size: 4.88 KB
Statistics
  • Stars: 1
  • Watchers: 1
  • Forks: 0
  • Open Issues: 0
  • Releases: 0
Created over 2 years ago · Last pushed over 2 years ago
Metadata Files
Readme License Citation

README.md

This repo contains a script that cleans a parallel corpus (e.g. WMT data) with Moses.

Usage

1. Installation

First, clone the Moses repository:

    git clone https://github.com/moses-smt/mosesdecoder.git

Then, clone this repository:

    git clone https://github.com/Este1le/clean_with_moses.git

2. Clean corpus

Set MOSES_PATH in clean_with_moses.sh to the path of your Moses scripts.

This script assumes you have two aligned, line-oriented corpus files with the same prefix and a language suffix. For example: europarl-v10.de and europarl-v10.en.
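Before cleaning, it can help to confirm that the two files really are aligned. The following is a minimal sketch of such a check; `check_parallel` is a hypothetical helper and is not part of clean_with_moses.sh. It only verifies that both files exist and have the same number of lines, which is necessary for alignment but does not prove it.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight check (not part of clean_with_moses.sh).
# Usage: check_parallel <prefix> <src_lang> <tgt_lang>
check_parallel() {
    local prefix="$1" src="$2" tgt="$3"
    local src_file="${prefix}.${src}" tgt_file="${prefix}.${tgt}"
    local f n_src n_tgt

    # Both sides of the corpus must exist.
    for f in "$src_file" "$tgt_file"; do
        if [ ! -f "$f" ]; then
            echo "missing file: $f" >&2
            return 1
        fi
    done

    # Arithmetic expansion strips any padding that wc may add.
    n_src=$(( $(wc -l < "$src_file") ))
    n_tgt=$(( $(wc -l < "$tgt_file") ))

    if [ "$n_src" -ne "$n_tgt" ]; then
        echo "line count mismatch: $src_file has $n_src, $tgt_file has $n_tgt" >&2
        return 1
    fi
    echo "ok: $n_src line pairs"
}
```

For the example above, this would be called as `check_parallel europarl-v10 de en`.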

Then, you can run the script:

    clean_with_moses.sh [prefix] [src_lang] [tgt_lang]

For example:

    clean_with_moses.sh europarl-v10 de en

This will result in two files: europarl-v10.de.clean and europarl-v10.en.clean.

What does it do?

It cleans the data in four steps.

1. Tokenization

It tokenizes your data to separate words from punctuation.

2. Truecasing

It converts the text to a consistent casing, choosing the case of a word based on how likely the word is to appear in that case.

3. Remove sentences

It removes sentences that are either too short or too long.
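To illustrate the idea of length filtering, here is a small stand-in written with `paste` and `awk`. It is not the Moses implementation, and `filter_by_length` is a hypothetical name. The sketch assumes the text is already tokenized (tokens separated by whitespace) and contains no tab characters, and it only covers the length bounds. A pair is kept only if both sides fall within the bounds, so the two files stay aligned.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for the length filter (not the Moses script).
# Usage: filter_by_length <src_file> <tgt_file> <min_tokens> <max_tokens>
# Prints the surviving pairs as "source<TAB>target", one pair per line.
filter_by_length() {
    local src_file="$1" tgt_file="$2" min="$3" max="$4"
    paste "$src_file" "$tgt_file" |
        awk -F'\t' -v min="$min" -v max="$max" '
            {
                # split() on " " counts whitespace-separated tokens.
                ns = split($1, a, " ")
                nt = split($2, b, " ")
                if (ns >= min && ns <= max && nt >= min && nt <= max) print
            }'
}
```

The output can be split back into two files with `cut -f1` and `cut -f2`.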

4. Convert special characters

Finally, it converts special characters (e.g. &amp; or &lt;) to their original form (& or <).
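The four steps above could be chained roughly as in the sketch below. This is an illustration under stated assumptions, not the contents of clean_with_moses.sh: it assumes MOSES_PATH points at the scripts directory of a mosesdecoder checkout, the function name `clean_pair`, the intermediate file names, and the length bounds of 1 and 80 tokens are all made up for the example, and the real script may use different options.

```shell
#!/usr/bin/env bash
# Sketch only; the real clean_with_moses.sh may differ.
# Assumption: MOSES_PATH is the "scripts" directory of a mosesdecoder checkout.
MOSES_PATH="${MOSES_PATH:-$HOME/mosesdecoder/scripts}"

# Usage: clean_pair <prefix> <src_lang> <tgt_lang>
clean_pair() {
    local prefix="$1" src="$2" tgt="$3"
    local min_len=1 max_len=80   # assumed bounds, not taken from the repo
    local lang

    if [ ! -d "$MOSES_PATH" ]; then
        echo "MOSES_PATH is not a directory: $MOSES_PATH" >&2
        return 1
    fi

    for lang in "$src" "$tgt"; do
        # 1. Tokenization
        perl "$MOSES_PATH/tokenizer/tokenizer.perl" -l "$lang" \
            < "$prefix.$lang" > "$prefix.tok.$lang" || return 1

        # 2. Truecasing: train a model on the tokenized text, then apply it
        perl "$MOSES_PATH/recaser/train-truecaser.perl" \
            --model "$prefix.truecase-model.$lang" \
            --corpus "$prefix.tok.$lang" || return 1
        perl "$MOSES_PATH/recaser/truecase.perl" \
            --model "$prefix.truecase-model.$lang" \
            < "$prefix.tok.$lang" > "$prefix.true.$lang" || return 1
    done

    # 3. Remove pairs that are too short or too long (both sides at once,
    #    so the two files stay aligned)
    perl "$MOSES_PATH/training/clean-corpus-n.perl" \
        "$prefix.true" "$src" "$tgt" "$prefix.filt" "$min_len" "$max_len" || return 1

    # 4. Convert escaped special characters back to their original form
    for lang in "$src" "$tgt"; do
        perl "$MOSES_PATH/tokenizer/deescape-special-chars.perl" \
            < "$prefix.filt.$lang" > "$prefix.$lang.clean" || return 1
    done
}
```

With these assumed names, `clean_pair europarl-v10 de en` would leave europarl-v10.de.clean and europarl-v10.en.clean next to the input files, along with the intermediate .tok, .true and .filt files.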

Author

Xuan Zhang

xuanzhang@jhu.edu

Owner

  • Name: Xuan Zhang
  • Login: Este1le
  • Kind: user

Johns Hopkins University

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software, please cite it using these metadata."
title: "Clean with Moses"
abstract: "This is a wrapper of Moses that cleans the parallel corpus for the use of machine translation."
authors:
  - family-names: "Zhang"
    given-names: "Xuan"
    orcid: "https://orcid.org/0000-0001-5460-1176"
version: "1.0.0"
date-released: "2023-10-30"
repository-code: "https://github.com/Este1le/clean_with_moses/tree/main"
license: "MIT"
