clean_with_moses

Clean a parallel corpus with Moses.

https://github.com/este1le/clean_with_moses

Repository

Basic Info

Host: GitHub
Owner: Este1le
License: MIT
Language: Shell
Default Branch: main
Size: 4.88 KB

Statistics

Stars: 1
Watchers: 1
Forks: 0
Open Issues: 0
Releases: 0

Created over 2 years ago · Last pushed over 2 years ago

README.md

This repo contains a script that cleans a parallel corpus (e.g. WMT data) with Moses.

Usage

1. Installation

First, clone the Moses repository:

    git clone https://github.com/moses-smt/mosesdecoder.git

Then, clone this repository:

    git clone https://github.com/Este1le/clean_with_moses.git

2. Clean corpus

Set MOSES_PATH in clean_with_moses.sh to the path to your Moses scripts.
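Before running anything, it can help to confirm that the path you plan to use actually contains the Moses scripts. The following is a minimal sketch, not part of this repo: `check_moses_path` is a hypothetical helper, and the list of scripts is an assumption based on the four steps described below (these files do exist under `mosesdecoder/scripts`, but the exact set used by clean_with_moses.sh is not confirmed here).

```shell
# Hypothetical helper: verify that a directory looks like mosesdecoder/scripts.
# The script list is an assumption derived from the four cleaning steps.
check_moses_path() {
  dir=$1
  for s in \
    tokenizer/tokenizer.perl \
    recaser/train-truecaser.perl \
    recaser/truecase.perl \
    training/clean-corpus-n.perl \
    tokenizer/deescape-special-chars.perl
  do
    if [ ! -f "$dir/$s" ]; then
      echo "not found: $dir/$s" >&2
      return 1
    fi
  done
}
```

For example, `check_moses_path /path/to/mosesdecoder/scripts && echo ok`, where the path is a placeholder for your own checkout.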

This script assumes you have two line-aligned corpus files that share the same prefix and differ only in their language suffix. For example: europarl-v10.de and europarl-v10.en.
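Since the two files must be aligned line by line, a quick sanity check is to compare their line counts before cleaning. This is a small sketch under that assumption; `check_aligned` is a hypothetical helper and is not part of clean_with_moses.sh. Equal line counts are necessary for alignment but do not prove it.

```shell
# Hypothetical helper: check that PREFIX.SRC and PREFIX.TGT exist
# and have the same number of lines.
check_aligned() {
  prefix=$1
  src=$2
  tgt=$3
  for f in "$prefix.$src" "$prefix.$tgt"; do
    if [ ! -f "$f" ]; then
      echo "missing file: $f" >&2
      return 1
    fi
  done
  # $(( ... )) strips the padding some wc implementations add.
  n_src=$(( $(wc -l < "$prefix.$src") ))
  n_tgt=$(( $(wc -l < "$prefix.$tgt") ))
  if [ "$n_src" -ne "$n_tgt" ]; then
    echo "line count mismatch: $n_src ($src) vs $n_tgt ($tgt)" >&2
    return 1
  fi
}
```

With the example files above, this would be called as `check_aligned europarl-v10 de en`.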

Then, run the script:

    clean_with_moses.sh [prefix] [src_lang] [tgt_lang]

For example:

    clean_with_moses.sh europarl-v10 de en

This produces two files: europarl-v10.de.clean and europarl-v10.en.clean.

What does it do?

It cleans the data in four steps.

1. Tokenization

It tokenizes your data to separate words from punctuation.
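As a rough illustration of this step, Moses ships `tokenizer/tokenizer.perl`, which reads text on stdin and takes the language code via `-l`. The wrapper below is a sketch only: `tokenize_file` is a hypothetical name, the default `MOSES_PATH` is a placeholder, and the exact options that clean_with_moses.sh passes are not confirmed here.

```shell
# Placeholder default; point this at your own mosesdecoder/scripts directory.
MOSES_PATH=${MOSES_PATH:-/path/to/mosesdecoder/scripts}

# Hypothetical wrapper: tokenize IN (language LANG) into OUT.
tokenize_file() {
  lang=$1
  in=$2
  out=$3
  perl "$MOSES_PATH/tokenizer/tokenizer.perl" -l "$lang" < "$in" > "$out"
}
```

For the example corpus, one call per language would be needed, e.g. `tokenize_file de europarl-v10.de europarl-v10.tok.de` (the output name is illustrative).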

2. Truecasing

It converts the text into a consistent case format, choosing the casing of each word based on how likely the word is to appear in that case.
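In Moses, truecasing is usually a two-part process: `recaser/train-truecaser.perl` builds a model from a corpus, and `recaser/truecase.perl` applies it. The sketch below wraps those two scripts; the function names and the default `MOSES_PATH` are hypothetical, and whether clean_with_moses.sh trains the model on the same corpus it cleans is an assumption.

```shell
# Placeholder default; point this at your own mosesdecoder/scripts directory.
MOSES_PATH=${MOSES_PATH:-/path/to/mosesdecoder/scripts}

# Hypothetical wrapper: train a truecasing MODEL from a tokenized CORPUS.
train_truecaser() {
  model=$1
  corpus=$2
  perl "$MOSES_PATH/recaser/train-truecaser.perl" --model "$model" --corpus "$corpus"
}

# Hypothetical wrapper: apply MODEL to IN and write the result to OUT.
apply_truecaser() {
  model=$1
  in=$2
  out=$3
  perl "$MOSES_PATH/recaser/truecase.perl" --model "$model" < "$in" > "$out"
}
```

A separate model is needed per language, since casing statistics are language-specific.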

3. Remove sentences

It removes sentences that are either too short or too long.
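Moses provides `training/clean-corpus-n.perl` for this kind of filtering; whether clean_with_moses.sh calls it, and with which limits, is not stated here. To illustrate the idea only, the sketch below is a simplified stand-in written with `paste` and `awk`, not the Moses script: it keeps a sentence pair only if both sides have between MIN and MAX whitespace-separated tokens. `filter_by_length` and its arguments are hypothetical, and it assumes the input lines contain no tab characters.

```shell
# Simplified stand-in for length filtering (NOT the Moses implementation).
# Usage: filter_by_length SRC TGT OUT_SRC OUT_TGT MIN MAX
filter_by_length() {
  src=$1
  tgt=$2
  out_src=$3
  out_tgt=$4
  min=$5
  max=$6
  # Create empty outputs so they exist even if no pair survives.
  : > "$out_src"
  : > "$out_tgt"
  # Join the two sides with a tab so each pair is kept or dropped together.
  paste "$src" "$tgt" | awk -F '\t' \
    -v min="$min" -v max="$max" -v s="$out_src" -v t="$out_tgt" '
    {
      ns = split($1, a, " ")   # token count on the source side
      nt = split($2, b, " ")   # token count on the target side
      if (ns >= min && ns <= max && nt >= min && nt <= max) {
        print $1 > s
        print $2 > t
      }
    }'
}
```

Dropping both sides of a pair together is what keeps the two output files line-aligned.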

4. Convert special characters

Finally, it converts escaped special characters (e.g. &amp;amp; or &amp;lt;) back to their original form (& or <).
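The Moses tokenizer escapes a handful of characters as entities, and Moses ships `tokenizer/deescape-special-chars.perl` to reverse that. The snippet below is a `sed` stand-in that illustrates the mapping, not the Moses script itself; `deescape` is a hypothetical name and the entity list covers the common cases only.

```shell
# sed stand-in illustrating de-escaping (NOT the Moses implementation).
# Reads stdin, writes stdout. The ampersand entity is handled last so that
# a doubly escaped sequence is unescaped by exactly one level.
deescape() {
  sed -e 's/&#124;/|/g' \
      -e 's/&lt;/</g' \
      -e 's/&gt;/>/g' \
      -e 's/&quot;/"/g' \
      -e "s/&apos;/'/g" \
      -e 's/&#91;/[/g' \
      -e 's/&#93;/]/g' \
      -e 's/&amp;/\&/g'
}
```

It can be used as a filter, e.g. `deescape < corpus.filtered.en > corpus.en.clean` (file names are illustrative).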

Author

Xuan Zhang

xuanzhang@jhu.edu

Owner

Name: Xuan Zhang
Login: Este1le
Kind: user

Repositories: 3
Profile: https://github.com/Este1le

Johns Hopkins University

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software, please cite it using these metadata."
title: "Clean with Moses"
abstract: "This is a wrapper of Moses that cleans the parallel corpus for the use of machine translation."
authors:
  - family-names: "Zhang"
    given-names: "Xuan"
    orcid: "https://orcid.org/0000-0001-5460-1176"
version: "1.0.0"
date-released: "2023-10-30"
repository-code: "https://github.com/Este1le/clean_with_moses/tree/main"
license: "MIT"
