https://github.com/ai4bharat/setu-translate

Science Score: 13.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (11.8%) to scientific vocabulary
Last synced: 9 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: AI4Bharat
  • License: MIT
  • Language: Python
  • Default Branch: master
  • Size: 4.55 MB
Statistics
  • Stars: 2
  • Watchers: 5
  • Forks: 2
  • Open Issues: 1
  • Releases: 0
Created over 2 years ago · Last pushed about 2 years ago

README.md

Setu-Translate: A Large Scale Translation Pipeline

Setu-Translate uses IndicTrans2 (IT2) to perform large-scale translation between English and 22 Indic languages.

Currently, we provide inference support for the PyTorch and Flax versions of IT2. TPUs can be used for large-scale translation by leveraging the Flax port of IT2.

Setu Translate Stages Overview

Table of Contents

  1. Overview
  2. Quickstart
  3. Usage

Overview

The Setu-Translate pipeline contains six main stages:

  • Templating: Each dataset is input to the pipeline in parquet format. During this stage, each entry in the dataset is converted into a Document object. During conversion, additional steps such as text cleaning, chunking, duplicate removal, and delimiter splitting are performed.

  • Global Sentence Dataset: During this stage, the templated data files are processed and formatted into a sentence-level dataset based on doc_ids.

  • Binarize: During this stage, the sentences are processed using the IndicProcessor and IndicTransTokenizer based on the source and target language. The sequences are then padded and the output is saved in either NumPy (np) or PyTorch (pt) format.

  • Translate: The translation stage uses the IndicTrans2 translation model to translate the English sentences into the corresponding target Indic languages. Translation can run either locally or, for larger datasets, on a TPU cluster.

  • Decode: The decode stage processes the model output and converts the translated token ids back into the corresponding Indic text.

  • Replace: During this stage, the translated text is placed back at the original text positions to maintain document structure. This depends on the output of the templating stage.
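The stage order and the directory each stage writes to can be summarized in a short Python sketch. The script and directory names follow the example commands in the Usage section; the `STAGES` table and the `stage_dirs` helper are hypothetical names introduced only for illustration and are not part of the repository.

```python
from pathlib import PurePosixPath

# (stage name, script, output sub-directory) in execution order.
# Script and directory names follow the example commands in Usage;
# this table itself is illustrative and not part of the repository.
STAGES = [
    ("templating", "perform_templating.py", "templated"),
    ("global_sentence_dataset", "create_global_ds.py", "sentences"),
    ("binarize", "binarize.py", "binarized_sentences"),
    ("translate", "tlt_pipelines/translate_joblib.py", "model_out"),
    ("decode", "decode.py", "decode"),
    ("replace", "replace.py", "translated"),
]


def stage_dirs(base):
    """Map each stage name to the directory it writes under `base`."""
    root = PurePosixPath(base)
    return {name: str(root / out_dir) for name, _script, out_dir in STAGES}


if __name__ == "__main__":
    for name, path in stage_dirs("examples/output/wiki_en").items():
        print(f"{name:>24} -> {path}")
```

Each stage reads what the previous one wrote, except Replace, which also reads the templated output again.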

Quickstart

  1. Clone repository

```bash
git clone https://github.com/AI4Bharat/setu-translate.git
```

  2. Prepare environment

```bash
conda create -n translate-env python=3.10
conda activate translate-env
conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia
conda install -c conda-forge pyspark
conda install pip
pip install datasets transformers
```

  3. Install IndicTransTokenizer

```bash
cd IndicTransTokenizer
pip install --editable ./
```

  4. Install JAX and Setup for TPU

Based on your setup (local or TPU), install the appropriate JAX libraries from JAX Installation.

Also download the Flax weights for IndicTrans2 and store them at setu-translate/stages/tlt_pipelines/flax/flax_weights/200m.

Usage

For a full run-through using a sample subset of the Wikipedia dataset, refer to the notebook. You can also run the stages individually using the commands below.

Templating Stage

```bash
HF_DATASETS_CACHE=/home/$USER/tmp python perform_templating.py \
    --glob_path "/home/$USER/setu-translate/examples/sample_data/wiki_en_data.parquet" \
    --cache_dir_for_original_data "/home/$USER/setu-translate/examples/cache" \
    --base_save_path "/home/$USER/setu-translate/examples/output/wiki_en/doc_csvs" \
    --save_path "/home/$USER/setu-translate/examples/output/wiki_en/templated" \
    --text_col body \
    --url_col url \
    --timestamp_col timestamp \
    --source_type wiki_en \
    --translation_type sentence \
    --use_cache False \
    --split "train[:5%]"
```
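To give a feel for what templating does to a single document, here is a minimal sketch of cleaning, delimiter splitting, and duplicate removal. It is not the code in `perform_templating.py`: the function name `template_text`, the whitespace-only cleaning, and the default delimiter set are all assumptions made for illustration.

```python
import re


def template_text(text, delimiters=".?!"):
    """Clean a document and split it into unique sentences, keeping order."""
    # Collapse runs of whitespace (a stand-in for the real text cleaning).
    cleaned = re.sub(r"\s+", " ", text).strip()
    # Split on whitespace that directly follows one of the delimiter characters.
    pattern = r"(?<=[" + re.escape(delimiters) + r"])\s+"
    parts = re.split(pattern, cleaned)
    seen = set()
    sentences = []
    for part in parts:
        if part and part not in seen:  # drop empty strings and exact duplicates
            seen.add(part)
            sentences.append(part)
    return sentences


if __name__ == "__main__":
    print(template_text("Hello  world. Hello world. How are you?\nFine!"))
```

The real stage additionally wraps each entry in a Document object and records enough structure for the Replace stage to restore the layout later.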

Global Sentence Dataset Stage

```bash
HF_DATASETS_CACHE=/home/$USER/tmp python create_global_ds.py \
    --paths_data "/home/$USER/setu-translate/examples/output/wiki_en/templated/*.arrow" \
    --cache_dir "/home/$USER/setu-translate/examples/cache" \
    --global_sent_ds_path "/home/$USER/setu-translate/examples/output/wiki_en/sentences"
```
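Conceptually, this stage flattens document-level records into one row per sentence, keyed by document id. The sketch below illustrates that idea only; the field names `doc_id`, `sentences`, `sent_idx`, and `text` are assumptions and may not match the columns written by `create_global_ds.py`.

```python
def build_sentence_rows(documents):
    """Flatten {doc_id, sentences} records into one row per sentence."""
    rows = []
    for doc in documents:
        for position, sentence in enumerate(doc["sentences"]):
            rows.append({
                "doc_id": doc["doc_id"],   # links the sentence back to its document
                "sent_idx": position,      # position inside the source document
                "text": sentence,
            })
    return rows


if __name__ == "__main__":
    docs = [
        {"doc_id": "a", "sentences": ["First.", "Second."]},
        {"doc_id": "b", "sentences": ["Only one."]},
    ]
    for row in build_sentence_rows(docs):
        print(row)
```

Keeping the document id and sentence position on every row is what allows translated sentences to be matched back to their documents in later stages.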

Binarize Dataset Stage

```bash
HF_DATASETS_CACHE=/home/$USER/tmp python binarize.py \
    --root_dir "/home/$USER/setu-translate" \
    --data_files "/home/$USER/setu-translate/examples/output/wiki_en/sentences/*.arrow" \
    --cache_dir "/home/$USER/setu-translate/examples/cache" \
    --binarized_dir "/home/$USER/setu-translate/examples/output/wiki_en/binarized_sentences" \
    --batch_size 2048 \
    --total_procs 1 \
    --padding max_length \
    --src_lang eng_Latn \
    --tgt_lang hin_Deva \
    --return_format np
```
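The `--padding max_length` option means every tokenized sentence is brought to the same length before being saved. The following sketch shows that idea on plain Python lists. It assumes right-side padding and a pad id of 0, neither of which is confirmed by the source; the actual values come from IndicTransTokenizer, and `pad_batch` is a hypothetical helper.

```python
def pad_batch(token_ids, max_length, pad_id=0):
    """Pad or truncate each id sequence to max_length and build attention masks.

    Assumes right-side padding and pad_id=0 purely for illustration.
    """
    input_ids, attention_mask = [], []
    for ids in token_ids:
        ids = list(ids)[:max_length]           # truncate sequences that are too long
        n_pad = max_length - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        attention_mask.append([1] * len(ids) + [0] * n_pad)  # 1 = real token, 0 = padding
    return input_ids, attention_mask


if __name__ == "__main__":
    ids, mask = pad_batch([[5, 6, 7], [8]], max_length=4)
    print(ids)
    print(mask)
```

Fixed-length batches are convenient for accelerators, since every batch then has the same shape.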

Translate Stage

```bash
HF_DATASETS_CACHE=/home/$USER/tmp python tlt_pipelines/translate_joblib.py \
    --root_dir "/home/$USER/setu-translate" \
    --data_files "/home/$USER/setu-translate/examples/output/wiki_en/binarized_sentences/*.arrow" \
    --cache_dir "/home/$USER/setu-translate/examples/cache" \
    --base_save_dir "/home/$USER/setu-translate/examples/output/wiki_en/model_out" \
    --joblib_temp_folder "/home/$USER/setu-translate/tmp" \
    --batch_size 512 \
    --total_procs 1 \
    --devices "0"
```
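The `--batch_size` and `--devices` options suggest that sentences are grouped into batches and spread over one or more devices. The sketch below shows one way such a split could work. It assumes `--devices` accepts a comma-separated list and that batches are assigned round-robin; the source shows only `"0"`, so both points are assumptions, and `make_batches` and `assign_to_devices` are hypothetical helpers rather than functions from `translate_joblib.py`.

```python
def make_batches(items, batch_size):
    """Split a list into consecutive chunks of at most batch_size items."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


def assign_to_devices(batches, devices):
    """Assign batches to device ids round-robin.

    `devices` is assumed to be a comma-separated string such as "0" or "0,1".
    """
    device_ids = [d.strip() for d in devices.split(",") if d.strip()]
    if not device_ids:
        raise ValueError("at least one device id is required")
    plan = {device: [] for device in device_ids}
    for index, batch in enumerate(batches):
        plan[device_ids[index % len(device_ids)]].append(batch)
    return plan


if __name__ == "__main__":
    batches = make_batches(list(range(5)), batch_size=2)
    print(assign_to_devices(batches, "0,1"))
```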

Decode Stage

```bash
HF_DATASETS_CACHE=/home/$USER/tmp python decode.py \
    --root_dir "/home/$USER/setu-translate" \
    --data_files "/home/$USER/setu-translate/examples/output/wiki_en/model_out/*/*.arrow" \
    --cache_dir "/home/$USER/setu-translate/examples/cache" \
    --decode_dir "/home/$USER/setu-translate/examples/output/wiki_en/decode" \
    --batch_size 64 \
    --total_procs 1 \
    --src_lang eng_Latn \
    --tgt_lang hin_Deva
```
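Decoding turns the model's output ids back into text. The toy example below illustrates the general idea with a hand-written vocabulary. It relies on the SentencePiece convention that "▁" marks the start of a word; the vocabulary, the special ids `(0, 1, 2)`, and the `decode_ids` helper are made up for illustration and do not reflect the real IndicTransTokenizer vocabulary or the post-processing done by `decode.py`.

```python
def decode_ids(ids, id_to_token, special_ids=(0, 1, 2)):
    """Convert token ids to text, skipping special ids (pad, bos, eos here)."""
    tokens = [id_to_token[i] for i in ids if i not in special_ids]
    # SentencePiece-style pieces use "▁" to mark the beginning of a word.
    return "".join(tokens).replace("▁", " ").strip()


if __name__ == "__main__":
    toy_vocab = {3: "▁hel", 4: "lo", 5: "▁world"}  # illustrative vocabulary only
    print(decode_ids([1, 3, 4, 5, 2, 0], toy_vocab))
```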

Replace Stage

```bash
HF_DATASETS_CACHE=/home/$USER/tmp python replace.py \
    --paths_data "/home/$USER/setu-translate/examples/output/wiki_en/templated/*.arrow" \
    --cache_dir "/home/$USER/setu-translate/examples/cache" \
    --batch_size 64 \
    --num_procs 1 \
    --decode_base_path "/home/$USER/setu-translate/examples/output/wiki_en/decode/*.arrow" \
    --translated_save_path "/home/$USER/setu-translate/examples/output/wiki_en/translated"
```
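One way to put translations back without disturbing the document layout is to substitute them at recorded character spans. The sketch below assumes the templating output provides non-overlapping `(start, end)` offsets for each sentence; the source does not say how positions are stored, so this representation and the `replace_spans` helper are hypothetical.

```python
def replace_spans(text, spans, translations):
    """Replace each (start, end) span of `text` with its translation.

    Spans are assumed to be non-overlapping character offsets.
    """
    if len(spans) != len(translations):
        raise ValueError("spans and translations must have the same length")
    result = text
    # Work right to left so earlier offsets stay valid after each replacement.
    pairs = sorted(zip(spans, translations), key=lambda pair: pair[0][0], reverse=True)
    for (start, end), translated in pairs:
        result = result[:start] + translated + result[end:]
    return result


if __name__ == "__main__":
    document = "Hello world.\n\nHow are you?"
    spans = [(0, 12), (14, 26)]
    print(replace_spans(document, spans, ["A.", "B?"]))
```

Text outside the spans, such as the blank line between paragraphs, is left untouched, which is how the document structure is preserved.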

Owner

  • Name: AI4Bhārat
  • Login: AI4Bharat
  • Kind: organization
  • Email: opensource@ai4bharat.org
  • Location: India

Artificial-Intelligence-For-Bhārat : Building open-source AI solutions for India!

GitHub Events

Total
  • Watch event: 2
Last Year
  • Watch event: 2

Issues and Pull Requests

All Time
  • Total issues: 3
  • Total pull requests: 2
  • Average time to close issues: 9 days
  • Average time to close pull requests: less than a minute
  • Total issue authors: 2
  • Total pull request authors: 2
  • Average comments per issue: 0.0
  • Average comments per pull request: 0.0
  • Merged pull requests: 2
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • kathir-ks (2)
  • kdcyberdude (1)
Pull Request Authors
  • safikhanSoofiyani (1)
  • Shanks0465 (1)

Dependencies

IndicTransTokenizer/requirements.txt pypi
  • nltk *
  • sacremoses *
  • sentencepiece *
  • setuptools ==68.2.2
  • torch *
  • transformers *
environment.yaml pypi
  • aiohttp ==3.9.1
  • aiosignal ==1.3.1
  • alabaster ==0.7.16
  • annotated-types ==0.6.0
  • argon2-cffi ==23.1.0
  • argon2-cffi-bindings ==21.2.0
  • async-timeout ==4.0.3
  • attrs ==23.2.0
  • babel ==2.14.0
  • blis ==0.7.11
  • catalogue ==2.0.10
  • certifi ==2023.11.17
  • cffi ==1.16.0
  • charset-normalizer ==3.3.2
  • click ==8.1.7
  • cloudpathlib ==0.16.0
  • confection ==0.1.4
  • ctranslate2 ==3.24.0
  • cymem ==2.0.8
  • datasets ==2.16.1
  • dill ==0.3.7
  • docutils ==0.20.1
  • filelock ==3.13.1
  • frozenlist ==1.4.1
  • fsspec ==2023.10.0
  • huggingface-hub ==0.20.2
  • idna ==3.6
  • imagesize ==1.4.1
  • indic-nlp-library ==0.92.1
  • jinja2 ==3.1.3
  • joblib ==1.3.2
  • langcodes ==3.3.0
  • markupsafe ==2.1.3
  • minio ==7.2.3
  • morfessor ==2.0.6
  • mpmath ==1.3.0
  • multidict ==6.0.4
  • multiprocess ==0.70.15
  • murmurhash ==1.0.10
  • networkx ==3.0
  • nltk ==3.8.1
  • numpy ==1.26.3
  • packaging ==23.2
  • pandas ==2.1.4
  • pillow ==9.3.0
  • preshed ==3.0.9
  • pyarrow ==14.0.2
  • pyarrow-hotfix ==0.6
  • pycparser ==2.21
  • pycryptodome ==3.20.0
  • pydantic ==2.5.3
  • pydantic-core ==2.14.6
  • pygments ==2.17.2
  • python-dateutil ==2.8.2
  • pytz ==2023.3.post1
  • pyyaml ==6.0.1
  • regex ==2023.12.25
  • requests ==2.31.0
  • sacremoses ==0.1.1
  • safetensors ==0.4.1
  • sentencepiece ==0.1.99
  • six ==1.16.0
  • smart-open ==6.4.0
  • snowballstemmer ==2.2.0
  • spacy ==3.7.2
  • spacy-legacy ==3.0.12
  • spacy-loggers ==1.0.5
  • sphinx ==7.2.6
  • sphinx-argparse ==0.4.0
  • sphinx-rtd-theme ==2.0.0
  • sphinxcontrib-applehelp ==1.0.8
  • sphinxcontrib-devhelp ==1.0.6
  • sphinxcontrib-htmlhelp ==2.0.5
  • sphinxcontrib-jquery ==4.1
  • sphinxcontrib-jsmath ==1.0.1
  • sphinxcontrib-qthelp ==1.0.7
  • sphinxcontrib-serializinghtml ==1.1.10
  • srsly ==2.4.8
  • sympy ==1.12
  • thinc ==8.2.2
  • tokenizers ==0.15.0
  • torch ==2.1.2
  • torchaudio ==2.1.2
  • torchvision ==0.16.2
  • tqdm ==4.66.1
  • transformers ==4.36.2
  • triton ==2.1.0
  • typer ==0.9.0
  • typing-extensions ==4.9.0
  • tzdata ==2023.4
  • urllib3 ==2.1.0
  • wasabi ==1.1.2
  • weasel ==0.3.4
  • xxhash ==3.4.1
  • yarl ==1.9.4