Science Score: 44.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: Found CITATION.cff file
- ✓ codemeta.json file: Found codemeta.json file
- ✓ .zenodo.json file: Found .zenodo.json file
- ○ DOI references
- ○ Academic publication links
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: Low similarity (6.4%) to scientific vocabulary
Repository
Clean a parallel corpus with Moses.
Basic Info
- Host: GitHub
- Owner: Este1le
- License: mit
- Language: Shell
- Default Branch: main
- Size: 4.88 KB
Statistics
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
This repo contains a script that cleans a parallel corpus (e.g. WMT data) with Moses.
Usage
1. Installation
First, clone the Moses repository.
git clone https://github.com/moses-smt/mosesdecoder.git
Then, clone this repository.
git clone https://github.com/Este1le/clean_with_moses.git
2. Clean corpus
Set MOSES_PATH in clean_with_moses.sh to the path of the Moses scripts.
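Before editing the script, it can help to confirm that the chosen path really contains the Moses scripts. The following is a minimal sketch, not part of clean_with_moses.sh: the function name check_moses_path is hypothetical, and it assumes MOSES_PATH should point at the scripts directory of the mosesdecoder checkout.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight check, not part of clean_with_moses.sh.
# Assumption: the path given is the "scripts" directory of a mosesdecoder checkout.
check_moses_path() {
    local moses="$1" script missing=0
    for script in \
        tokenizer/tokenizer.perl \
        tokenizer/deescape-special-chars.perl \
        recaser/train-truecaser.perl \
        recaser/truecase.perl \
        training/clean-corpus-n.perl
    do
        if [ ! -f "$moses/$script" ]; then
            echo "not found: $moses/$script" >&2
            missing=1
        fi
    done
    # Return 0 only if every script was found.
    return "$missing"
}

# Example (the path is an assumption about where Moses was cloned):
# check_moses_path "$HOME/mosesdecoder/scripts" && echo "Moses scripts found"
```

The function reports every missing script rather than stopping at the first one, so a wrong path is easy to diagnose in a single run.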
This script assumes you have two aligned, line-oriented corpus files that share the same prefix and use the language codes as suffixes.
For example: europarl-v10.de and europarl-v10.en.
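Since the two files must be aligned line by line, a quick sanity check is to compare their line counts. This is a minimal sketch under that assumption; the helper name check_parallel is hypothetical and is not part of clean_with_moses.sh.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of clean_with_moses.sh:
# verify that both corpus files exist and have the same number of lines.
check_parallel() {
    local prefix="$1" src="$2" tgt="$3" f n_src n_tgt
    for f in "$prefix.$src" "$prefix.$tgt"; do
        if [ ! -f "$f" ]; then
            echo "missing file: $f" >&2
            return 1
        fi
    done
    # Arithmetic expansion strips the padding some wc implementations add.
    n_src=$(( $(wc -l < "$prefix.$src") ))
    n_tgt=$(( $(wc -l < "$prefix.$tgt") ))
    if [ "$n_src" -ne "$n_tgt" ]; then
        echo "line count mismatch: $n_src vs $n_tgt" >&2
        return 1
    fi
    echo "ok: $n_src sentence pairs"
}

# Example:
# check_parallel europarl-v10 de en
```

Equal line counts do not prove the files are correctly aligned, but a mismatch is a sure sign that they are not.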
Then, you can run the script:
clean_with_moses.sh [prefix] [src_lang] [tgt_lang]
For example,
clean_with_moses.sh europarl-v10 de en
This produces two files: europarl-v10.de.clean and europarl-v10.en.clean.
What does it do?
It cleans the data in four steps.
1. Tokenization
It tokenizes your data to separate words from punctuation.
2. Truecasing
It normalizes casing by converting words to the case form in which they most probably appear, based on how often each form occurs in the data.
3. Remove sentences
It removes sentence pairs in which either side is too short or too long.
4. Convert special characters
Finally, it converts escaped special characters (e.g. &amp; or &lt;) back to their original form (& or <).
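The four steps above can be sketched with the standard Moses scripts roughly as follows. This is an illustration, not the contents of clean_with_moses.sh: the function name clean_pair, the intermediate file names, and the length limits of 1 to 80 tokens are all assumptions, and MOSES_PATH is assumed to point at the Moses scripts directory.

```shell
#!/usr/bin/env bash
# Sketch only: the real clean_with_moses.sh may differ.
# Assumptions: intermediate file names, the 1..80 token limits,
# and MOSES_PATH pointing at mosesdecoder/scripts.
clean_pair() {
    local prefix="$1" src="$2" tgt="$3"
    local moses="${MOSES_PATH:-}" lang
    if [ ! -f "$moses/tokenizer/tokenizer.perl" ]; then
        echo "MOSES_PATH must point to the Moses scripts directory" >&2
        return 1
    fi
    for lang in "$src" "$tgt"; do
        # 1. Tokenization
        perl "$moses/tokenizer/tokenizer.perl" -l "$lang" \
            < "$prefix.$lang" > "$prefix.tok.$lang" || return 1
        # 2. Truecasing: train a model on the tokenized text, then apply it
        perl "$moses/recaser/train-truecaser.perl" \
            --model "$prefix.tcmodel.$lang" --corpus "$prefix.tok.$lang" || return 1
        perl "$moses/recaser/truecase.perl" --model "$prefix.tcmodel.$lang" \
            < "$prefix.tok.$lang" > "$prefix.tc.$lang" || return 1
    done
    # 3. Remove pairs whose length falls outside the limits (both sides together)
    perl "$moses/training/clean-corpus-n.perl" \
        "$prefix.tc" "$src" "$tgt" "$prefix.filtered" 1 80 || return 1
    # 4. Convert escaped special characters back to their original form
    for lang in "$src" "$tgt"; do
        perl "$moses/tokenizer/deescape-special-chars.perl" \
            < "$prefix.filtered.$lang" > "$prefix.$lang.clean" || return 1
    done
}

# Example:
# MOSES_PATH="$HOME/mosesdecoder/scripts" clean_pair europarl-v10 de en
```

Filtering is done on the pair as a whole so that the two output files stay aligned line by line.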
Author
xuanzhang@jhu.edu
Owner
- Name: Xuan Zhang
- Login: Este1le
- Kind: user
- Repositories: 3
- Profile: https://github.com/Este1le
Johns Hopkins University
Citation (CITATION.cff)
cff-version: 1.2.0
message: "If you use this software, please cite it using these metadata."
title: "Clean with Moses"
abstract: "This is a wrapper of Moses that cleans the parallel corpus for the use of machine translation."
authors:
  - family-names: "Zhang"
    given-names: "Xuan"
    orcid: "https://orcid.org/0000-0001-5460-1176"
version: "1.0.0"
date-released: "2023-10-30"
repository-code: "https://github.com/Este1le/clean_with_moses/tree/main"
license: "MIT"