cross-align

EMNLP2022 "Cross-Align: Modeling Deep Cross-lingual Interactions for Word Alignment"

https://github.com/lisasiyu/cross-align

Repository

Basic Info
  • Host: GitHub
  • Owner: lisasiyu
  • License: apache-2.0
  • Language: Python
  • Default Branch: main
  • Homepage:
  • Size: 16.8 MB
Statistics
  • Stars: 18
  • Watchers: 1
  • Forks: 3
  • Open Issues: 1
  • Releases: 0
Created over 3 years ago · Last pushed over 3 years ago

README.md

Cross-Align

Code for EMNLP2022 "Cross-Align: Modeling Deep Cross-lingual Interactions for Word Alignment"

Cross-Align is a high-quality word alignment tool which fully considers the cross-lingual context by modeling deep interactions between the input sentence pairs.

The following table compares Cross-Align with popular alignment models in terms of alignment error rate (AER, lower is better); the best scores are in bold:

| | De-En | En-Fr | Ro-En | Zh-En | Ja-En |
|:---------------------------------------------------------|------:|:-----:|------:|:-----:|:-----:|
| FastAlign | 26.2 | 10.5 | 31.4 | 23.7 | 51.1 |
| GIZA++ | 18.9 | 5.5 | 26.6 | 19.4 | 48.0 |
| SimAlign | 18.8 | 7.6 | 27.2 | 21.6 | 46.6 |
| Awesome-Align | 15.6 | 4.4 | 23.0 | 12.9 | 38.4 |
| Ours | **13.6** | **3.4** | **20.9** | **10.1** | **35.4** |

We have released Cross-Align models for the five language pairs above. You can download them HERE and run inference on the test data directly.

Requirements

pip install --user --editable ./

Input format

Inputs should be tokenized. Each line contains a source-language sentence and its target-language translation, separated by `|||`. For example:

`Das stimmt nicht ! ||| But this is not what happens .`
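As a minimal sketch of how one such line can be read, the snippet below splits it into source and target token lists. The function name `parse_parallel_line` is hypothetical and is not part of the Cross-Align code; it only illustrates the format described above.

```python
def parse_parallel_line(line):
    """Split one 'source ||| target' line into two lists of tokens.

    Hypothetical helper for illustration; not taken from the repository.
    """
    parts = line.strip().split("|||")
    if len(parts) != 2:
        raise ValueError("expected exactly one '|||' separator: %r" % line)
    # The input is already tokenized, so whitespace splitting is enough.
    src_tokens = parts[0].split()
    tgt_tokens = parts[1].split()
    if not src_tokens or not tgt_tokens:
        raise ValueError("empty source or target side: %r" % line)
    return src_tokens, tgt_tokens


src, tgt = parse_parallel_line("Das stimmt nicht ! ||| But this is not what happens .")
print(src)
print(tgt)
```

Checking every line this way before training or inference can catch sentences with a missing separator or an empty side early.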

Two-stage Training

Cross-Align is trained on parallel data in two stages to obtain good alignments.

First training stage

In the first stage, the model is trained with translation language modeling (TLM) to learn cross-lingual representations:

`sh ./srcipt/train_stage1.sh`

Second training stage

After the first training stage, the model is fine-tuned with a self-supervised alignment objective to bridge the gap between training and inference:

`sh ./srcipt/train_stage2.sh`

Inference

To extract word alignments from Cross-Align, run:

`sh ./srcipt/inference.sh`

Cross-Align produces outputs in the widely used i-j "Pharaoh format", where a pair i-j indicates that the i-th word (zero-indexed) of the source sentence is aligned to the j-th word of the target sentence. You can see some examples in data/xx.out.
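The Pharaoh format is easy to consume downstream. The sketch below is illustrative only: `parse_pharaoh` and `aligned_words` are hypothetical helper names, and the alignment string is a made-up example rather than real model output. It turns one output line into index pairs and looks up the corresponding words.

```python
def parse_pharaoh(line):
    """Parse a line such as '0-0 1-2' into a set of (src_idx, tgt_idx) pairs."""
    pairs = set()
    for item in line.split():
        i, j = item.split("-")
        pairs.add((int(i), int(j)))
    return pairs


def aligned_words(src_tokens, tgt_tokens, pairs):
    """Map zero-indexed alignment pairs back to (source word, target word)."""
    return [(src_tokens[i], tgt_tokens[j]) for i, j in sorted(pairs)]


# Made-up alignment for illustration; not produced by Cross-Align.
src = "Das stimmt nicht !".split()
tgt = "But this is not what happens .".split()
pairs = parse_pharaoh("0-1 2-3 3-6")
for src_word, tgt_word in aligned_words(src, tgt, pairs):
    print(src_word, "->", tgt_word)
```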

Calculating AER

The gold alignment file should have the same format as the Cross-Align outputs. For sample parallel sentences and their gold alignments, see data/test.xx-xx and data/xx.talp. To calculate the alignment error rate (AER), run:

`sh ./srcipt/cal_aer.sh`
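For reference, here is a minimal sketch of the standard AER definition, AER = 1 - (|A ∩ S| + |A ∩ P|) / (|A| + |S|), where A is the predicted link set, S the sure gold links and P the possible gold links (with S contained in P). It is not the code behind `cal_aer.sh`. The function name `alignment_error_rate` is hypothetical, and when no possible links are given the sketch assumes every gold link is sure.

```python
def alignment_error_rate(hypotheses, sure_links, possible_links=None):
    """Corpus-level AER over parallel lists of per-sentence link sets.

    Each element is a set of (src_idx, tgt_idx) pairs. If possible_links
    is None, all gold links are treated as sure (an assumption of this sketch).
    """
    if possible_links is None:
        possible_links = sure_links
    n_hyp = n_sure = n_hyp_sure = n_hyp_possible = 0
    for hyp, sure, possible in zip(hypotheses, sure_links, possible_links):
        possible = possible | sure  # sure links always count as possible
        n_hyp += len(hyp)
        n_sure += len(sure)
        n_hyp_sure += len(hyp & sure)
        n_hyp_possible += len(hyp & possible)
    if n_hyp + n_sure == 0:
        return 0.0  # nothing predicted and nothing expected
    return 1.0 - (n_hyp_sure + n_hyp_possible) / (n_hyp + n_sure)


predicted = [{(0, 0), (1, 1)}]
gold = [{(0, 0), (1, 2)}]
print(alignment_error_rate(predicted, gold))
```

Counts are summed over the whole corpus before dividing, which is the usual convention, so the result is not an average of per-sentence scores.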

Publication

If you use the code, please cite:

@inproceedings{lai-etal-2022-cross,
    title = "Cross-Align: Modeling Deep Cross-lingual Interactions for Word Alignment",
    author = "Lai, Siyu and Yang, Zhen and Meng, Fandong and Chen, Yufeng and Xu, Jinan and Zhou, Jie",
    booktitle = "Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing",
    month = dec,
    year = "2022",
    address = "Abu Dhabi, United Arab Emirates",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.emnlp-main.244",
    pages = "3715--3725",
}

Dependencies

docker/transformers-cpu/Dockerfile docker
  • ubuntu 18.04 build
docker/transformers-gpu/Dockerfile docker
  • nvidia/cuda 10.2-cudnn7-devel-ubuntu18.04 build
docker/transformers-pytorch-cpu/Dockerfile docker
  • ubuntu 18.04 build
docker/transformers-pytorch-gpu/Dockerfile docker
  • nvidia/cuda 10.2-cudnn7-devel-ubuntu18.04 build
docker/transformers-pytorch-tpu/Dockerfile docker
  • google/cloud-sdk slim build
docker/transformers-tensorflow-cpu/Dockerfile docker
  • ubuntu 18.04 build
docker/transformers-tensorflow-gpu/Dockerfile docker
  • nvidia/cuda 10.1-cudnn7-runtime-ubuntu18.04 build
setup.py pypi
  • deps *
src/transformers.egg-info/requires.txt pypi
  • GitPython <3.1.19
  • Pillow *
  • black *
  • codecarbon ==1.2.0
  • cookiecutter ==1.7.2
  • dataclasses *
  • datasets *
  • deepspeed >=0.5.9
  • fairscale >0.3
  • faiss-cpu *
  • fastapi *
  • filelock *
  • flake8 >=3.8.3
  • flax >=0.3.5
  • fugashi >=1.0
  • huggingface-hub <1.0,>=0.1.0
  • importlib_metadata *
  • ipadic <2.0,>=1.0.0
  • isort >=5.5.4
  • jax >=0.2.8
  • jaxlib >=0.1.65
  • librosa *
  • nltk *
  • numpy >=1.17
  • onnxconverter-common *
  • onnxruntime >=1.4.0
  • onnxruntime-tools >=1.4.2
  • optax >=0.0.8
  • optuna *
  • packaging >=20.0
  • parameterized *
  • phonemizer *
  • protobuf *
  • psutil *
  • pyctcdecode >=0.3.0
  • pydantic *
  • pytest *
  • pytest-timeout *
  • pytest-xdist *
  • pyyaml >=5.1
  • ray *
  • regex *
  • requests *
  • rouge-score *
  • sacrebleu <2.0.0,>=1.4.12
  • sacremoses *
  • sagemaker >=2.31.0
  • scikit-learn *
  • sentencepiece *
  • sigopt *
  • starlette *
  • tensorflow >=2.3
  • tensorflow-cpu >=2.3
  • tf2onnx *
  • timeout-decorator *
  • timm *
  • tokenizers *
  • torch >=1.0
  • torchaudio *
  • tqdm >=4.27
  • unidic >=1.0.2
  • unidic_lite >=1.0.7
  • uvicorn *
tests/sagemaker/scripts/pytorch/requirements.txt pypi
  • datasets ==1.8.0 test
pyproject.toml pypi
tests/sagemaker/scripts/tensorflow/requirements.txt pypi