https://github.com/ai4bharat/setu-translate

Science Score: 13.0%

This score indicates how likely this project is to be science-related based on various indicators:

○ CITATION.cff file
✓ codemeta.json file (found)
○ .zenodo.json file
○ DOI references
○ Academic publication links
○ Academic email domains
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: low similarity (11.8%) to scientific vocabulary

Last synced: 9 months ago

Repository

Basic Info

Host: GitHub
Owner: AI4Bharat
License: MIT
Language: Python
Default Branch: master
Size: 4.55 MB

Statistics

Stars: 2
Watchers: 5
Forks: 2
Open Issues: 1
Releases: 0

Created over 2 years ago · Last pushed about 2 years ago

Metadata Files

Readme License

Setu-Translate: A Large Scale Translation Pipeline

Setu-Translate uses IndicTrans2 (IT2) to perform large-scale translation between English and 22 Indic languages.

Currently, we provide inference support for the PyTorch and Flax versions of IT2. TPUs can be used for large-scale translation by leveraging the Flax port of IT2.

[Figure: Setu Translate stages overview]

Overview
Quickstart
Usage

Overview

The Setu-Translate pipeline contains 6 main stages:

Templating: Each dataset is input to the pipeline in Parquet format. During this stage, each entry in the dataset is converted into a Document object. During conversion, additional steps such as text cleaning, chunking, duplicate removal, and delimiter splitting are performed.
Global Sentence Dataset: During this stage, the templated data files are processed and formatted into a sentence-level dataset based on doc_ids.
Binarize: During this stage, the sentences are processed using the IndicProcessor and IndicTransTokenizer based on the source and target languages. The output is then padded and saved in either NumPy (np) or PyTorch (pt) format.
Translate: The translation stage uses the IndicTrans2 translation model to translate the English sentences into the corresponding target Indic languages. Translation can run either locally or, for larger datasets, on a TPU cluster.
Decode: The decode stage processes the model output and converts the translated token ids into their corresponding Indic text.
Replace: During this stage, the translated text is placed back at the positions of the original text to maintain document structure. This depends on the output of the templating stage.
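The data flow through these stages can be sketched in a few lines of plain Python. This is only a toy illustration, not the repository's implementation: the function names, the dictionary layout, the regex-based sentence splitter, and `translate_stub` (which uppercases text instead of calling IndicTrans2) are all hypothetical. It shows why the Replace stage depends on the Templating output: the sentence positions recorded during templating are what allow translated text to be put back in place.

```python
import re


def template(doc_id, text):
    """Templating (toy): record where each sentence sits in the original text."""
    spans = []
    for match in re.finditer(r"[^.!?]+[.!?]?", text):
        raw = match.group()
        stripped = raw.strip()
        if not stripped:
            continue
        # Skip leading whitespace so the span covers only the sentence itself.
        start = match.start() + len(raw) - len(raw.lstrip())
        spans.append((start, start + len(stripped)))
    return {"doc_id": doc_id, "text": text, "spans": spans}


def global_sentences(docs):
    """Global sentence dataset (toy): one row per sentence, keyed by doc_id."""
    rows = []
    for doc in docs:
        for idx, (start, end) in enumerate(doc["spans"]):
            rows.append(
                {"doc_id": doc["doc_id"], "sent_idx": idx, "sentence": doc["text"][start:end]}
            )
    return rows


def translate_stub(sentence):
    """Stand-in for Binarize + Translate + Decode; NOT a real translation."""
    return sentence.upper()


def replace(doc, translated):
    """Replace (toy): put each translated sentence back at its original position."""
    parts, cursor = [], 0
    for idx, (start, end) in enumerate(doc["spans"]):
        parts.append(doc["text"][cursor:start])  # keep text between sentences
        parts.append(translated[(doc["doc_id"], idx)])
        cursor = end
    parts.append(doc["text"][cursor:])
    return "".join(parts)


docs = [template("doc-0", "Setu is a bridge. It links languages.")]
rows = global_sentences(docs)
translated = {(r["doc_id"], r["sent_idx"]): translate_stub(r["sentence"]) for r in rows}
print(replace(docs[0], translated))
```

Keying every sentence by `(doc_id, sent_idx)` is one simple way to join the translated sentences back to their documents; the actual pipeline may use a different key.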

Quickstart

Clone repository

```bash
git clone https://github.com/AI4Bharat/setu-translate.git
```

Prepare environment

```bash
conda create -n translate-env python=3.10
conda activate translate-env
conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia
conda install -c conda-forge pyspark
conda install pip
pip install datasets transformers
```

Install IndicTransTokenizer

```bash
cd IndicTransTokenizer

pip install --editable ./
```

Install JAX and Setup for TPU

Depending on your setup (local or TPU), install the appropriate JAX libraries as described in JAX Installation.

Also download the Flax weights for IndicTrans2 and store them at setu-translate/stages/tlt_pipelines/flax/flax_weights/200m.
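Before starting a long translation run, it can help to confirm which backend JAX actually sees and that the weights directory is populated. The following is a minimal sketch, assuming the weights path above is resolved relative to the directory that contains the clone; the helper names `describe_jax_backend` and `weights_present` are made up for illustration and are not part of the repository.

```python
from pathlib import Path

# Path from the setup instructions; assumed to be relative to the
# directory that contains the setu-translate clone.
FLAX_WEIGHTS_DIR = Path("setu-translate/stages/tlt_pipelines/flax/flax_weights/200m")


def describe_jax_backend():
    """Return a short description of the JAX backend, or a note if JAX is missing."""
    try:
        import jax
    except ImportError:
        return "jax not installed"
    # default_backend() reports e.g. "cpu", "gpu" or "tpu".
    return f"{jax.default_backend()} with {jax.device_count()} device(s)"


def weights_present(weights_dir=FLAX_WEIGHTS_DIR):
    """True if the weights directory exists and contains at least one entry."""
    return weights_dir.is_dir() and any(weights_dir.iterdir())


print("JAX backend:", describe_jax_backend())
print("Flax weights found:", weights_present())
```

If the backend is reported as "cpu" on a TPU machine, the TPU-specific JAX installation is most likely missing.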

Usage

For a full run-through using a sample subset of the Wikipedia dataset, refer to the notebook. You can also run the stages individually using the commands below.

Templating Stage

```bash
HF_DATASETS_CACHE=/home/$USER/tmp python perform_templating.py \
    --glob_path "/home/$USER/setu-translate/examples/sample_data/wiki_en_data.parquet" \
    --cache_dir_for_original_data "/home/$USER/setu-translate/examples/cache" \
    --base_save_path "/home/$USER/setu-translate/examples/output/wiki_en/doc_csvs" \
    --save_path "/home/$USER/setu-translate/examples/output/wiki_en/templated" \
    --text_col body \
    --url_col url \
    --timestamp_col timestamp \
    --source_type wiki_en \
    --translation_type sentence \
    --use_cache False \
    --split "train[:5%]"
```

Global Sentence Dataset Stage

```bash
HF_DATASETS_CACHE=/home/$USER/tmp python create_global_ds.py \
    --paths_data "/home/$USER/setu-translate/examples/output/wiki_en/templated/*.arrow" \
    --cache_dir "/home/$USER/setu-translate/examples/cache" \
    --global_sent_ds_path "/home/$USER/setu-translate/examples/output/wiki_en/sentences"
```

Binarize Dataset Stage

```bash
HF_DATASETS_CACHE=/home/$USER/tmp python binarize.py \
    --root_dir "/home/$USER/setu-translate" \
    --data_files "/home/$USER/setu-translate/examples/output/wiki_en/sentences/*.arrow" \
    --cache_dir "/home/$USER/setu-translate/examples/cache" \
    --binarized_dir "/home/$USER/setu-translate/examples/output/wiki_en/binarized_sentences" \
    --batch_size 2048 \
    --total_procs 1 \
    --padding max_length \
    --src_lang eng_Latn \
    --tgt_lang hin_Deva \
    --return_format np
```
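To make the `--padding max_length` option concrete, the sketch below pads every tokenized sentence to the same fixed length so that a batch can be stored as a rectangular array. It is an illustration in plain Python only: the pad id of 0, right-side padding, truncation of over-long inputs, and the function name `pad_to_max_length` are assumptions made here, and the actual behaviour is determined by IndicTransTokenizer.

```python
def pad_to_max_length(batch, max_length, pad_id=0):
    """Pad (or truncate) each list of token ids to exactly max_length.

    Returns (input_ids, attention_mask); the mask is 1 for real tokens
    and 0 for padding. pad_id=0 and right-side padding are assumptions.
    """
    input_ids, attention_mask = [], []
    for ids in batch:
        ids = list(ids[:max_length])  # truncate sentences that are too long
        n_pad = max_length - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return input_ids, attention_mask


# Two sentences of different lengths become rows of equal length.
ids, mask = pad_to_max_length([[5, 6, 7], [8]], max_length=4)
print(ids)
print(mask)
```

Fixed-length rows are what allow the binarized output to be saved as a single NumPy or PyTorch array per batch, at the cost of some wasted space on short sentences.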

Translate Stage

```bash
HF_DATASETS_CACHE=/home/$USER/tmp python tlt_pipelines/translate_joblib.py \
    --root_dir "/home/$USER/setu-translate" \
    --data_files "/home/$USER/setu-translate/examples/output/wiki_en/binarized_sentences/*.arrow" \
    --cache_dir "/home/$USER/setu-translate/examples/cache" \
    --base_save_dir "/home/$USER/setu-translate/examples/output/wiki_en/model_out" \
    --joblib_temp_folder "/home/$USER/setu-translate/tmp" \
    --batch_size 512 \
    --total_procs 1 \
    --devices "0"
```

Decode Stage

```bash
HF_DATASETS_CACHE=/home/$USER/tmp python decode.py \
    --root_dir "/home/$USER/setu-translate" \
    --data_files "/home/$USER/setu-translate/examples/output/wiki_en/model_out/*/*.arrow" \
    --cache_dir "/home/$USER/setu-translate/examples/cache" \
    --decode_dir "/home/$USER/setu-translate/examples/output/wiki_en/decode" \
    --batch_size 64 \
    --total_procs 1 \
    --src_lang eng_Latn \
    --tgt_lang hin_Deva
```

Replace Stage

```bash
HF_DATASETS_CACHE=/home/$USER/tmp python replace.py \
    --paths_data "/home/$USER/setu-translate/examples/output/wiki_en/templated/*.arrow" \
    --cache_dir "/home/$USER/setu-translate/examples/cache" \
    --batch_size 64 \
    --num_procs 1 \
    --decode_base_path "/home/$USER/setu-translate/examples/output/wiki_en/decode/*.arrow" \
    --translated_save_path "/home/$USER/setu-translate/examples/output/wiki_en/translated"
```
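Each stage reads the directory that the previous stage wrote, so a single mistyped path breaks the chain. One way to reduce that risk is to derive every directory from one base path. The sketch below is a hypothetical helper, not part of the repository: `stage_paths` and `replace_command` are illustrative names, while the directory names and flags are the ones used in the commands above. It only builds the argument list and does not run anything.

```python
from pathlib import PurePosixPath


def stage_paths(repo_root, source_type):
    """Derive all stage output directories from a single base directory."""
    out = PurePosixPath(repo_root) / "examples" / "output" / source_type
    return {
        "templated": out / "templated",
        "sentences": out / "sentences",
        "binarized": out / "binarized_sentences",
        "model_out": out / "model_out",
        "decode": out / "decode",
        "translated": out / "translated",
    }


def replace_command(repo_root, source_type):
    """Build the argument list for the Replace stage from the derived paths."""
    paths = stage_paths(repo_root, source_type)
    return [
        "python", "replace.py",
        "--paths_data", str(paths["templated"] / "*.arrow"),
        "--cache_dir", str(PurePosixPath(repo_root) / "examples" / "cache"),
        "--batch_size", "64",
        "--num_procs", "1",
        "--decode_base_path", str(paths["decode"] / "*.arrow"),
        "--translated_save_path", str(paths["translated"]),
    ]


cmd = replace_command("/home/user/setu-translate", "wiki_en")
print(" ".join(cmd))
```

The resulting list could be passed to `subprocess.run` from the directory that contains the stage scripts; the same pattern extends to the other five stages.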

Owner

Name: AI4Bhārat
Login: AI4Bharat
Kind: organization
Email: opensource@ai4bharat.org
Location: India

Website: https://ai4bharat.org
Twitter: AI4Bharat
Repositories: 37
Profile: https://github.com/AI4Bharat

Artificial-Intelligence-For-Bhārat : Building open-source AI solutions for India!

GitHub Events

Total

Watch event: 2

Last Year

Watch event: 2

Issues and Pull Requests

Last synced: 9 months ago

All Time

Total issues: 3
Total pull requests: 2
Average time to close issues: 9 days
Average time to close pull requests: less than a minute
Total issue authors: 2
Total pull request authors: 2
Average comments per issue: 0.0
Average comments per pull request: 0.0
Merged pull requests: 2
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 0
Pull requests: 0
Average time to close issues: N/A
Average time to close pull requests: N/A
Issue authors: 0
Pull request authors: 0
Average comments per issue: 0
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0

Top Authors

Issue Authors

kathir-ks (2)
kdcyberdude (1)

Pull Request Authors

safikhanSoofiyani (1)
Shanks0465 (1)

Dependencies

IndicTransTokenizer/requirements.txt pypi

nltk *
sacremoses *
sentencepiece *
setuptools ==68.2.2
torch *
transformers *

environment.yaml pypi

aiohttp ==3.9.1
aiosignal ==1.3.1
alabaster ==0.7.16
annotated-types ==0.6.0
argon2-cffi ==23.1.0
argon2-cffi-bindings ==21.2.0
async-timeout ==4.0.3
attrs ==23.2.0
babel ==2.14.0
blis ==0.7.11
catalogue ==2.0.10
certifi ==2023.11.17
cffi ==1.16.0
charset-normalizer ==3.3.2
click ==8.1.7
cloudpathlib ==0.16.0
confection ==0.1.4
ctranslate2 ==3.24.0
cymem ==2.0.8
datasets ==2.16.1
dill ==0.3.7
docutils ==0.20.1
filelock ==3.13.1
frozenlist ==1.4.1
fsspec ==2023.10.0
huggingface-hub ==0.20.2
idna ==3.6
imagesize ==1.4.1
indic-nlp-library ==0.92.1
jinja2 ==3.1.3
joblib ==1.3.2
langcodes ==3.3.0
markupsafe ==2.1.3
minio ==7.2.3
morfessor ==2.0.6
mpmath ==1.3.0
multidict ==6.0.4
multiprocess ==0.70.15
murmurhash ==1.0.10
networkx ==3.0
nltk ==3.8.1
numpy ==1.26.3
packaging ==23.2
pandas ==2.1.4
pillow ==9.3.0
preshed ==3.0.9
pyarrow ==14.0.2
pyarrow-hotfix ==0.6
pycparser ==2.21
pycryptodome ==3.20.0
pydantic ==2.5.3
pydantic-core ==2.14.6
pygments ==2.17.2
python-dateutil ==2.8.2
pytz ==2023.3.post1
pyyaml ==6.0.1
regex ==2023.12.25
requests ==2.31.0
sacremoses ==0.1.1
safetensors ==0.4.1
sentencepiece ==0.1.99
six ==1.16.0
smart-open ==6.4.0
snowballstemmer ==2.2.0
spacy ==3.7.2
spacy-legacy ==3.0.12
spacy-loggers ==1.0.5
sphinx ==7.2.6
sphinx-argparse ==0.4.0
sphinx-rtd-theme ==2.0.0
sphinxcontrib-applehelp ==1.0.8
sphinxcontrib-devhelp ==1.0.6
sphinxcontrib-htmlhelp ==2.0.5
sphinxcontrib-jquery ==4.1
sphinxcontrib-jsmath ==1.0.1
sphinxcontrib-qthelp ==1.0.7
sphinxcontrib-serializinghtml ==1.1.10
srsly ==2.4.8
sympy ==1.12
thinc ==8.2.2
tokenizers ==0.15.0
torch ==2.1.2
torchaudio ==2.1.2
torchvision ==0.16.2
tqdm ==4.66.1
transformers ==4.36.2
triton ==2.1.0
typer ==0.9.0
typing-extensions ==4.9.0
tzdata ==2023.4
urllib3 ==2.1.0
wasabi ==1.1.2
weasel ==0.3.4
xxhash ==3.4.1
yarl ==1.9.4
