cross-align

EMNLP2022 "Cross-Align: Modeling Deep Cross-lingual Interactions for Word Alignment"

https://github.com/lisasiyu/cross-align

Repository

Basic Info

Host: GitHub
Owner: lisasiyu
License: apache-2.0
Language: Python
Default Branch: main
Homepage:
Size: 16.8 MB

Statistics

Stars: 18
Watchers: 1
Forks: 3
Open Issues: 1
Releases: 0

Created over 3 years ago · Last pushed over 3 years ago

Cross-Align

Code for EMNLP2022 "Cross-Align: Modeling Deep Cross-lingual Interactions for Word Alignment"

Cross-Align is a high-quality word alignment tool which fully considers the cross-lingual context by modeling deep interactions between the input sentence pairs.

The following table compares it with popular alignment models by alignment error rate (AER; lower is better). The best scores are in bold:

| | De-En | En-Fr | Ro-En | Zh-En | Ja-En |
|:--------------|------:|:-----:|------:|:-----:|:-----:|
| FastAlign | 26.2 | 10.5 | 31.4 | 23.7 | 51.1 |
| GIZA++ | 18.9 | 5.5 | 26.6 | 19.4 | 48.0 |
| SimAlign | 18.8 | 7.6 | 27.2 | 21.6 | 46.6 |
| Awesome-Align | 15.6 | 4.4 | 23.0 | 12.9 | 38.4 |
| Ours | **13.6** | **3.4** | **20.9** | **10.1** | **35.4** |

We have released Cross-Align models for the five language pairs above. You can download them HERE and run inference on test data directly.

Requirements

pip install --user --editable ./

Input format

Inputs should be tokenized. Each line contains a source-language sentence and its target-language translation, separated by (|||). For example:

Das stimmt nicht ! ||| But this is not what happens .
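As an illustration of this input format, here is a minimal Python sketch for reading and writing such lines. The helper names `parse_parallel_line` and `format_parallel_line` are hypothetical and are not part of the Cross-Align code base; the sketch only assumes whitespace-tokenized text with a single `|||` separator per line.

```python
from typing import List, Tuple

SEPARATOR = "|||"


def parse_parallel_line(line: str) -> Tuple[List[str], List[str]]:
    """Split one 'source ||| target' line into two token lists."""
    parts = line.strip().split(SEPARATOR)
    if len(parts) != 2:
        raise ValueError(f"expected exactly one '{SEPARATOR}' separator: {line!r}")
    src_tokens, tgt_tokens = parts[0].split(), parts[1].split()
    if not src_tokens or not tgt_tokens:
        raise ValueError(f"empty source or target side: {line!r}")
    return src_tokens, tgt_tokens


def format_parallel_line(src_tokens: List[str], tgt_tokens: List[str]) -> str:
    """Join two token lists back into a single 'source ||| target' line."""
    return f"{' '.join(src_tokens)} {SEPARATOR} {' '.join(tgt_tokens)}"


if __name__ == "__main__":
    src, tgt = parse_parallel_line("Das stimmt nicht ! ||| But this is not what happens .")
    print(src)
    print(tgt)
```

Checking every line this way before training or inference can catch empty sides or missing separators early.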

Two-stage Training

Train Cross-Align on parallel data to obtain good alignments.

First training stage

In the first stage, the model is trained with translation language modeling (TLM) to learn cross-lingual representations:

sh ./srcipt/train_stage1.sh

Second training stage

After the first training stage, the model is fine-tuned with a self-supervised alignment objective to bridge the gap between training and inference:

sh ./srcipt/train_stage2.sh

Inference

Extracting word alignments from Cross-Align:

sh ./srcipt/inference.sh

Cross-Align produces outputs in the widely used i-j "Pharaoh format", where a pair i-j indicates that the i-th word (zero-indexed) of the source sentence is aligned to the j-th word of the target sentence. You can see some examples in data/xx.out.
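To make the Pharaoh format concrete, the following Python sketch parses one output line into index pairs and maps them back to words. The function names and the alignment string in the example are made up for illustration; they are not taken from the repository or from actual model output.

```python
from typing import List, Set, Tuple


def parse_pharaoh(line: str) -> Set[Tuple[int, int]]:
    """Parse a line such as '0-1 2-3' into a set of (source, target) index pairs."""
    links = set()
    for item in line.split():
        i, j = item.split("-")
        links.add((int(i), int(j)))
    return links


def aligned_words(
    src_tokens: List[str], tgt_tokens: List[str], links: Set[Tuple[int, int]]
) -> List[Tuple[str, str]]:
    """Look up the zero-indexed word pairs named by the alignment links."""
    return [(src_tokens[i], tgt_tokens[j]) for i, j in sorted(links)]


if __name__ == "__main__":
    src = "Das stimmt nicht !".split()
    tgt = "But this is not what happens .".split()
    # Hypothetical alignment, chosen only to show the indexing.
    links = parse_pharaoh("0-1 2-3 3-6")
    print(aligned_words(src, tgt, links))
```

Because indices are zero-based, the pair 3-6 here links the fourth source token to the seventh target token.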

Calculating AER

The gold alignment file should have the same format as the Cross-Align outputs. For sample parallel sentences and their gold alignments, see data/test.xx-xx and data/xx.talp.

sh ./srcipt/cal_aer.sh
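For reference, the standard definition of alignment error rate over predicted links A, sure gold links S, and possible gold links P is AER = 1 - (|A ∩ S| + |A ∩ P|) / (|A| + |S|), with counts summed over the whole corpus. The Python sketch below implements that definition; it is not the code behind cal_aer.sh, the function names are hypothetical, and it assumes that P equals S when the gold file carries no separate possible links.

```python
from typing import Iterable, Optional, Set, Tuple

Links = Set[Tuple[int, int]]


def parse_links(line: str) -> Links:
    """Parse a Pharaoh-format line ('0-0 1-2') into a set of index pairs."""
    return {(int(i), int(j)) for i, j in (item.split("-") for item in line.split())}


def corpus_aer(
    hypotheses: Iterable[Links],
    sure: Iterable[Links],
    possible: Optional[Iterable[Links]] = None,
) -> float:
    """Corpus-level AER: counts are accumulated over sentences, then combined."""
    sure = list(sure)
    # Assumption: without separate possible links, treat P as equal to S.
    possible = sure if possible is None else list(possible)
    hyp_total = sure_total = sure_hits = possible_hits = 0
    for a, s, p in zip(hypotheses, sure, possible):
        p = p | s  # every sure link also counts as a possible link
        hyp_total += len(a)
        sure_total += len(s)
        sure_hits += len(a & s)
        possible_hits += len(a & p)
    denominator = hyp_total + sure_total
    if denominator == 0:
        raise ValueError("no links in hypothesis or gold alignments")
    return 1.0 - (sure_hits + possible_hits) / denominator


if __name__ == "__main__":
    hyp = [parse_links("0-0 1-1 2-3")]
    gold = [parse_links("0-0 1-1 2-2")]
    print(f"AER = {corpus_aer(hyp, gold):.3f}")
```

Summing counts before dividing weights long sentences more heavily than averaging per-sentence scores would.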

Publication

If you use the code, please cite:

@inproceedings{lai-etal-2022-cross,
    title = "Cross-Align: Modeling Deep Cross-lingual Interactions for Word Alignment",
    author = "Lai, Siyu and Yang, Zhen and Meng, Fandong and Chen, Yufeng and Xu, Jinan and Zhou, Jie",
    booktitle = "Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing",
    month = dec,
    year = "2022",
    address = "Abu Dhabi, United Arab Emirates",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.emnlp-main.244",
    pages = "3715--3725",
}

GitHub Events

Total

Watch event: 3

Last Year

Watch event: 3

Dependencies

docker/transformers-cpu/Dockerfile docker

ubuntu 18.04 build

docker/transformers-gpu/Dockerfile docker

nvidia/cuda 10.2-cudnn7-devel-ubuntu18.04 build

docker/transformers-pytorch-cpu/Dockerfile docker

ubuntu 18.04 build

docker/transformers-pytorch-gpu/Dockerfile docker

nvidia/cuda 10.2-cudnn7-devel-ubuntu18.04 build

docker/transformers-pytorch-tpu/Dockerfile docker

google/cloud-sdk slim build

docker/transformers-tensorflow-cpu/Dockerfile docker

ubuntu 18.04 build

docker/transformers-tensorflow-gpu/Dockerfile docker

nvidia/cuda 10.1-cudnn7-runtime-ubuntu18.04 build

setup.py pypi

deps *

src/transformers.egg-info/requires.txt pypi

GitPython <3.1.19
Pillow *
black *
codecarbon ==1.2.0
cookiecutter ==1.7.2
dataclasses *
datasets *
deepspeed >=0.5.9
fairscale >0.3
faiss-cpu *
fastapi *
filelock *
flake8 >=3.8.3
flax >=0.3.5
fugashi >=1.0
huggingface-hub <1.0,>=0.1.0
importlib_metadata *
ipadic <2.0,>=1.0.0
isort >=5.5.4
jax >=0.2.8
jaxlib >=0.1.65
librosa *
nltk *
numpy >=1.17
onnxconverter-common *
onnxruntime >=1.4.0
onnxruntime-tools >=1.4.2
optax >=0.0.8
optuna *
packaging >=20.0
parameterized *
phonemizer *
protobuf *
psutil *
pyctcdecode >=0.3.0
pydantic *
pytest *
pytest-timeout *
pytest-xdist *
pyyaml >=5.1
ray *
regex *
requests *
rouge-score *
sacrebleu <2.0.0,>=1.4.12
sacremoses *
sagemaker >=2.31.0
scikit-learn *
sentencepiece *
sigopt *
starlette *
tensorflow >=2.3
tensorflow-cpu >=2.3
tf2onnx *
timeout-decorator *
timm *
tokenizers *
torch >=1.0
torchaudio *
tqdm >=4.27
unidic >=1.0.2
unidic_lite >=1.0.7
uvicorn *

tests/sagemaker/scripts/pytorch/requirements.txt pypi

datasets ==1.8.0 test

pyproject.toml pypi

tests/sagemaker/scripts/tensorflow/requirements.txt pypi
