astred

An easy-to-use library to linguistically compare one sentence and its words to another, in the same language or a different one. It is useful, for instance, for comparing a translation with the original text, for finding differences and similarities between two different translations, or for seeing how a machine translation differs from a reference translation.

https://github.com/bramvanroy/astred

Science Score: 41.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
  • .zenodo.json file
  • DOI references
    Found 8 DOI reference(s) in README
  • Academic publication links
    Links to: arxiv.org, springer.com, frontiersin.org
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (14.6%) to scientific vocabulary

Keywords

alignment linguistics nlp parallel-corpus parsing spacy stanza translation
Last synced: 6 months ago

Repository


Basic Info
  • Host: GitHub
  • Owner: BramVanroy
  • License: apache-2.0
  • Language: Python
  • Default Branch: master
  • Homepage:
  • Size: 257 KB
Statistics
  • Stars: 24
  • Watchers: 3
  • Forks: 0
  • Open Issues: 1
  • Releases: 0
Topics
alignment linguistics nlp parallel-corpus parsing spacy stanza translation
Created almost 6 years ago · Last pushed about 4 years ago
Metadata Files
Readme License Citation

README.rst

Easily compare two word-aligned sentences with ASTrED
=====================================================

Example notebooks
-----------------

A couple of example notebooks are available, each with a different degree of automation for initialising the aligned object.
Once an aligned object has been created, the functionality is identical.

- `High automation`_: *automate all the things*. Tokenisation, parsing, and word alignment are done automatically
- `Normal automation`_: the typical scenario where you have tokenised and aligned text that is not parsed yet
- `No automation`_: full-manual mode, where you provide all the required information, including dependency labels
  and heads
- `Monolingual`_: in this example we rely on spaCy to compare two English sentences and calculate semantic similarity
  between aligned words

.. _High automation: examples/full-auto.ipynb
.. _Normal automation: examples/automatic-parsing.ipynb
.. _No automation: examples/full-manual.ipynb
.. _Monolingual: examples/monolingual.ipynb

Installation
------------

Requires Python 3.7 or higher. To keep the overhead low, no default parser is installed. Currently both `spaCy`_ and
`stanza`_ are supported, and you can choose which one to use. Stanza is recommended for bilingual research, because all
of its models are guaranteed to use Universal Dependencies, but spaCy can be used as well. spaCy is especially useful
for monolingual comparisons, or if you are not interested in the linguistic comparisons and only need word reordering
metrics.
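As an aside, word reordering between a source and a target sentence can be quantified from alignment positions alone. The following is a minimal, library-independent sketch (not astred's own implementation) of one common approach, the normalised Kendall's tau distance:

.. code-block:: python

    from itertools import combinations

    def kendall_reordering(target_positions):
        """Normalised Kendall's tau distance over aligned target positions,
        given in source order: the fraction of word pairs whose relative
        order is inverted in the target. 0.0 means monotone order,
        1.0 means fully reversed."""
        pairs = list(combinations(target_positions, 2))
        if not pairs:
            return 0.0
        inversions = sum(1 for a, b in pairs if a > b)
        return inversions / len(pairs)

    # "I like eating cookies" -> "Ik eet graag koekjes",
    # aligned 0-0, 1-2, 2-1, 3-3: one inverted pair out of six
    print(kendall_reordering([0, 2, 1, 3]))

astred's actual reordering metrics may be defined differently; this only illustrates the general idea.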

A pre-release is available on PyPI. You can install it with pip as follows:

.. code-block:: bash

    # Install with stanza (recommended)
    pip install astred[stanza]
    # ... or install with spacy
    pip install astred[spacy]
    # ... or install with both and decide later
    pip install astred[parsers]

If you want to use spaCy, make sure that you `install`_ the required models yourself; this cannot be
automated.

.. _spaCy: https://spacy.io/
.. _stanza: https://github.com/stanfordnlp/stanza
.. _install: https://spacy.io/usage/models
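Installing a spaCy model is a one-off command. For example, to fetch the small English model (the model name here is just an example; pick whichever model suits your language and accuracy needs):

.. code-block:: bash

    python -m spacy download en_core_web_sm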

Automatic Word Alignment
------------------------

Automatic word alignment is supported under the hood by a modified version of `Awesome Align`_, a neural
word aligner that uses transfer learning with multilingual models. It does require some manual installation
work: you need to install the :code:`astred_compat` branch from `this fork`_.
If you are using pip, you can run the following command:

.. code-block:: bash

    pip install git+https://github.com/BramVanroy/awesome-align.git@astred_compat

Awesome Align requires PyTorch, like :code:`stanza` above.

If it is installed, you can initialize :code:`AlignedSentences` without providing word alignments. Those will be added
automatically behind the scenes. See `this example notebook`_ for more.

.. code-block:: python

	from astred import AlignedSentences, Sentence

	sent_en = Sentence.from_text("I like eating cookies", "en")
	sent_nl = Sentence.from_text("Ik eet graag koekjes", "nl")

	# Word alignments do not need to be added on init:
	aligned = AlignedSentences(sent_en, sent_nl)

Keep in mind, however, that automatic alignment will never match the quality of manual alignments. Use it with caution!
I highly suggest reading `the paper`_ on Awesome Align to see whether it is a good pick for you.
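If you provide word alignments yourself, they are conventionally written as Pharaoh/GIZA-style ``i-j`` index pairs (e.g. ``"0-0 1-2"``). A small, library-independent sketch of parsing that notation (the format assumption is mine; check astred's examples for the exact input it expects):

.. code-block:: python

    def parse_alignments(pharaoh):
        """Parse Pharaoh-style word alignments such as "0-0 1-2 2-1 3-3"
        into a list of (source_index, target_index) tuples."""
        pairs = []
        for chunk in pharaoh.split():
            src, tgt = chunk.split("-")
            pairs.append((int(src), int(tgt)))
        return pairs

    print(parse_alignments("0-0 1-2 2-1 3-3"))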

.. _Awesome Align: https://github.com/neulab/awesome-align
.. _this fork: https://github.com/BramVanroy/awesome-align/tree/astred_compat
.. _this example notebook: examples/full-auto.ipynb
.. _the paper: https://arxiv.org/abs/2101.08231

License
-------
Licensed under Apache License Version 2.0. See the LICENSE file attached to this repository.

Citation
--------
Please cite our `papers`_ if you use this library.

Vanroy, B., De Clercq, O., Tezcan, A., Daems, J., & Macken, L. (2021). Metrics of syntactic equivalence to assess 
translation difficulty. In M. Carl (Ed.), *Explorations in empirical translation process research* (Vol. 3, pp. 259–294).
Cham, Switzerland: Springer International Publishing. https://doi.org/10.1007/978-3-030-69777-8_10

.. code-block:: bibtex

	@incollection{vanroy2021metrics,
	    title = {Metrics of syntactic equivalence to assess translation difficulty},
	    booktitle = {Explorations in empirical translation process research},
	    author = {Vanroy, Bram and De Clercq, Orph{\'e}e and Tezcan, Arda and Daems, Joke and Macken, Lieve},
	    editor = {Carl, Michael},
	    year = {2021},
	    series = {Machine {{Translation}}: {{Technologies}} and {{Applications}}},
	    volume = {3},
	    pages = {259--294},
	    publisher = {{Springer International Publishing}},
	    address = {{Cham, Switzerland}},
	    isbn = {978-3-030-69776-1},
	    url = {https://link.springer.com/chapter/10.1007/978-3-030-69777-8_10},
	    doi = {10.1007/978-3-030-69777-8_10}
	}

Vanroy, B., Schaeffer, M., & Macken, L. (2021). Comparing the Effect of Product-Based Metrics on the Translation Process. *Frontiers in Psychology*, 12. https://doi.org/10.3389/fpsyg.2021.681945

.. code-block:: bibtex

	@article{vanroy2021comparing,
	    publisher = {Frontiers},
	    author = {Vanroy, Bram and Schaeffer, Moritz and Macken, Lieve},
	    title = {Comparing the effect of product-based metrics on the translation process},
	    year = {2021},
	    journal = {Frontiers in Psychology},
	    volume = {12}, 
	    issn = {1664-1078}, 
	    url = {https://www.frontiersin.org/article/10.3389/fpsyg.2021.681945},
	    doi = {10.3389/fpsyg.2021.681945}, 
	}


.. _papers: CITATION

Owner

  • Name: Bram Vanroy
  • Login: BramVanroy
  • Kind: user
  • Location: Belgium
  • Company: @CCL-KULeuven @instituutnederlandsetaal

👋 My name is Bram and I work on natural language processing and machine translation (evaluation) but I also spend a lot of time in this open-source world 🌍

Citation (CITATION)

@incollection{vanroy2021metrics,
    title = {Metrics of syntactic equivalence to assess translation difficulty},
    booktitle = {Explorations in empirical translation process research},
    author = {Vanroy, Bram and De Clercq, Orph{\'e}e and Tezcan, Arda and Daems, Joke and Macken, Lieve},
    editor = {Carl, Michael and Way, Andy},
    year = {2021},
    series = {Machine {{Translation}}: {{Technologies}} and {{Applications}}},
    volume = {3},
    pages = {259--294},
    publisher = {{Springer International Publishing}},
    address = {{Cham, Switzerland}},
    isbn = {978-3-030-69776-1},
    url = {https://link.springer.com/chapter/10.1007/978-3-030-69777-8_10},
    doi = {10.1007/978-3-030-69777-8_10}
}

@article{vanroy2021comparing,
    publisher = {Frontiers},
    author = {Vanroy, Bram and Schaeffer, Moritz and Macken, Lieve},
    title = {Comparing the effect of product-based metrics on the translation process},
    year = {2021},
    journal = {Frontiers in Psychology},
    volume = {12},
    issn = {1664-1078}, 
    url = {https://www.frontiersin.org/article/10.3389/fpsyg.2021.681945},
    doi = {10.3389/fpsyg.2021.681945}
}

GitHub Events

Total
  • Watch event: 5
Last Year
  • Watch event: 5

Committers

Last synced: almost 3 years ago

All Time
  • Total Commits: 172
  • Total Committers: 1
  • Avg Commits per committer: 172.0
  • Development Distribution Score (DDS): 0.0
Top Committers
Name Email Commits
Bram Vanroy B****y@U****e 172
Committer Domains (Top 20 + Academic)

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 3
  • Total pull requests: 1
  • Average time to close issues: about 6 hours
  • Average time to close pull requests: 1 minute
  • Total issue authors: 3
  • Total pull request authors: 1
  • Average comments per issue: 1.0
  • Average comments per pull request: 0.0
  • Merged pull requests: 1
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • eelegiap (1)
  • BramVanroy (1)
  • moxley01 (1)
Pull Request Authors
  • BramVanroy (1)
Top Labels
Issue Labels
Pull Request Labels

Dependencies

setup.py pypi
  • apted *
  • nltk *