cross-align
EMNLP2022 "Cross-Align: Modeling Deep Cross-lingual Interactions for Word Alignment"
Cross-Align
Code for EMNLP2022 "Cross-Align: Modeling Deep Cross-lingual Interactions for Word Alignment"
Cross-Align is a high-quality word alignment tool which fully considers
the cross-lingual context by modeling deep interactions between the input sentence pairs.
The following table compares its alignment error rate (AER, lower is better) with popular alignment models; the best scores are in bold:
| Model         | De-En    | En-Fr   | Ro-En    | Zh-En    | Ja-En    |
|:--------------|---------:|:-------:|---------:|:--------:|:--------:|
| FastAlign     | 26.2     | 10.5    | 31.4     | 23.7     | 51.1     |
| GIZA++        | 18.9     | 5.5     | 26.6     | 19.4     | 48.0     |
| SimAlign      | 18.8     | 7.6     | 27.2     | 21.6     | 46.6     |
| Awesome-Align | 15.6     | 4.4     | 23.0     | 12.9     | 38.4     |
| Ours          | **13.6** | **3.4** | **20.9** | **10.1** | **35.4** |
We have released Cross-Align models for the five language pairs above; you can download them HERE and run inference on the test data directly.
Requirements
pip install --user --editable ./
Input format
Inputs should be tokenized. Each line contains a source-language sentence and
its target-language translation, separated by (|||). For example:
Das stimmt nicht ! ||| But this is not what happens .
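To check that a data file follows this format before training, a small validator can help. The following is a minimal sketch, not part of the Cross-Align code base; the function name `parse_parallel_line` is hypothetical, and it assumes whitespace tokenization with exactly one `|||` separator per line.

```python
def parse_parallel_line(line: str, sep: str = "|||"):
    """Split one input line into (source_tokens, target_tokens).

    Hypothetical helper for illustration; assumes tokens are
    whitespace-separated and the separator occurs exactly once.
    """
    parts = line.rstrip("\n").split(sep)
    if len(parts) != 2:
        raise ValueError(f"expected exactly one {sep!r} separator: {line!r}")
    src_tokens = parts[0].split()
    tgt_tokens = parts[1].split()
    if not src_tokens or not tgt_tokens:
        raise ValueError(f"empty source or target side: {line!r}")
    return src_tokens, tgt_tokens


src, tgt = parse_parallel_line(
    "Das stimmt nicht ! ||| But this is not what happens ."
)
print(src)  # source-side tokens
print(tgt)  # target-side tokens
```

Running this over every line of a corpus catches missing separators and empty sides early, before they surface as errors during training.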
Two-stage Training
Training Cross-Align on parallel data to get good alignments.
First training stage
In the first stage, the model is trained with translation language modeling (TLM) to learn cross-lingual representations.
sh ./srcipt/train_stage1.sh
Second training stage
After the first training stage, the model is then finetuned with a self-supervised alignment
objective to bridge the gap between the training and inference.
sh ./srcipt/train_stage2.sh
Inference
Extracting word alignments from Cross-Align.
commandline
sh ./srcipt/inference.sh
Cross-Align produces outputs in the widely used i-j “Pharaoh format,” where a pair i-j indicates that the i-th word (zero-indexed) of
the source sentence is aligned to the j-th word of the target sentence. You can see some examples in data/xx.out.
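The Pharaoh format is easy to read back into index pairs. The sketch below is illustrative only and not taken from the repository; `parse_pharaoh` and `aligned_words` are hypothetical helper names, and the alignment string is made up for demonstration rather than actual model output.

```python
def parse_pharaoh(line: str):
    """Parse a Pharaoh-format line such as '0-1 1-2' into (i, j) int pairs."""
    pairs = []
    for item in line.split():
        i, j = item.split("-")
        pairs.append((int(i), int(j)))
    return pairs


def aligned_words(src_tokens, tgt_tokens, pairs):
    """Map zero-indexed (i, j) pairs to (source_word, target_word) tuples."""
    return [(src_tokens[i], tgt_tokens[j]) for i, j in pairs]


src_tokens = "Das stimmt nicht !".split()
tgt_tokens = "But this is not what happens .".split()

# Made-up alignment for illustration, not real Cross-Align output.
pairs = parse_pharaoh("0-1 2-3 3-6")
for src_word, tgt_word in aligned_words(src_tokens, tgt_tokens, pairs):
    print(f"{src_word} -> {tgt_word}")
```

Because the indices are zero-based, the pair 3-6 refers to the fourth source token and the seventh target token.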
Calculating AER
The gold alignment file should have the same format as Cross-Align outputs. For sample parallel sentences and their gold alignments, see data/test.xx-xx and data/xx.talp.
commandline
sh ./srcipt/cal_aer.sh
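For reference, AER is conventionally defined as 1 - (|A ∩ S| + |A ∩ P|) / (|A| + |S|), where A is the predicted alignment, S the sure gold links, and P the possible gold links (with S a subset of P). The sketch below implements that standard definition; it is not the logic of cal_aer.sh, which may differ, and the function name `aer` is hypothetical. It assumes that when the gold file contains only sure links, P equals S.

```python
def aer(hyp, sure, possible=None):
    """Alignment error rate over sets of (i, j) links.

    hyp: predicted links; sure: sure gold links;
    possible: optional extra possible gold links (sure links are
    always counted as possible as well).
    """
    hyp = set(hyp)
    sure = set(sure)
    possible = sure | set(possible or ())
    denom = len(hyp) + len(sure)
    if denom == 0:
        return 0.0  # nothing predicted and nothing expected
    return 1.0 - (len(hyp & sure) + len(hyp & possible)) / denom


# Toy example with made-up links.
score = aer(hyp={(0, 0), (1, 2)}, sure={(0, 0), (1, 1)})
print(f"AER = {score:.3f}")
```

To score a whole test set with this sketch, one option is to prefix every link with its sentence index, for example (sentence_id, i, j), so that the counts are pooled over the corpus rather than averaged per sentence.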
Publication
If you use the code, please cite
@inproceedings{lai-etal-2022-cross,
title = "Cross-Align: Modeling Deep Cross-lingual Interactions for Word Alignment",
author = "Lai, Siyu and
Yang, Zhen and
Meng, Fandong and
Chen, Yufeng and
Xu, Jinan and
Zhou, Jie",
booktitle = "Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing",
month = dec,
year = "2022",
address = "Abu Dhabi, United Arab Emirates",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2022.emnlp-main.244",
pages = "3715--3725",
}
Dependencies
- ubuntu 18.04 build
- nvidia/cuda 10.2-cudnn7-devel-ubuntu18.04 build
- ubuntu 18.04 build
- nvidia/cuda 10.2-cudnn7-devel-ubuntu18.04 build
- google/cloud-sdk slim build
- ubuntu 18.04 build
- nvidia/cuda 10.1-cudnn7-runtime-ubuntu18.04 build
- deps *
- GitPython <3.1.19
- Pillow *
- black *
- codecarbon ==1.2.0
- cookiecutter ==1.7.2
- dataclasses *
- datasets *
- deepspeed >=0.5.9
- fairscale >0.3
- faiss-cpu *
- fastapi *
- filelock *
- flake8 >=3.8.3
- flax >=0.3.5
- fugashi >=1.0
- huggingface-hub <1.0,>=0.1.0
- importlib_metadata *
- ipadic <2.0,>=1.0.0
- isort >=5.5.4
- jax >=0.2.8
- jaxlib >=0.1.65
- librosa *
- nltk *
- numpy >=1.17
- onnxconverter-common *
- onnxruntime >=1.4.0
- onnxruntime-tools >=1.4.2
- optax >=0.0.8
- optuna *
- packaging >=20.0
- parameterized *
- phonemizer *
- protobuf *
- psutil *
- pyctcdecode >=0.3.0
- pydantic *
- pytest *
- pytest-timeout *
- pytest-xdist *
- pyyaml >=5.1
- ray *
- regex *
- requests *
- rouge-score *
- sacrebleu <2.0.0,>=1.4.12
- sacremoses *
- sagemaker >=2.31.0
- scikit-learn *
- sentencepiece *
- sigopt *
- starlette *
- tensorflow >=2.3
- tensorflow-cpu >=2.3
- tf2onnx *
- timeout-decorator *
- timm *
- tokenizers *
- torch >=1.0
- torchaudio *
- tqdm >=4.27
- unidic >=1.0.2
- unidic_lite >=1.0.7
- uvicorn *
- datasets ==1.8.0 test