grecy

Ancient Greek language models for spaCy

https://github.com/jmyerston/grecy

Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (11.3%) to scientific vocabulary
Last synced: 6 months ago

Repository

Ancient Greek language models for spaCy

Basic Info
  • Host: GitHub
  • Owner: jmyerston
  • License: MIT
  • Default Branch: main
  • Homepage:
  • Size: 135 MB
Statistics
  • Stars: 32
  • Watchers: 3
  • Forks: 5
  • Open Issues: 3
  • Releases: 3
Created over 3 years ago · Last pushed 12 months ago
Metadata Files
Readme License Citation

README.md

greCy

Ancient Greek models for spaCy

greCy is a set of Ancient Greek spaCy models and their installer. The models were trained using the Perseus and Proiel UD corpora. Prior to installation, the models can be tested with my Ancient Greek Syntax Analyzer on the Hugging Face Hub, where you can also check the performance metrics of each model.

In general, models trained with the Proiel corpus perform better at POS tagging and dependency parsing, while Perseus models are better at punctuation-based sentence segmentation and morphological analysis. Lemmatization is similar across models because they share the same neural lemmatizer in two variants: the more accurate lemmatizer was trained with word vectors, and the other was not. The best models for lemmatization are the large models.

Installation

First, install the Python package as usual:

```bash
pip install -U grecy
```

Once the package is successfully installed, you can proceed to download and install any of the following models:

  • grc_perseus_sm
  • grc_proiel_sm
  • grc_perseus_lg
  • grc_proiel_lg
  • grc_perseus_trf
  • grc_proiel_trf

The models can be installed from the terminal with the commands below:

```bash
python -m grecy install MODEL
```

Replace MODEL with any of the model names listed above, e.g. `python -m grecy install grc_proiel_sm`. The suffixes after the corpus name (_sm, _lg, and _trf) indicate the size of the model, which depends directly on the word embeddings used for training. The smallest models end in _sm (small) and are the least accurate: they are good for testing and for building lightweight apps. The _lg and _trf models are the large and transformer models, which are more accurate. The _lg models were trained using fastText word vectors in spaCy's floret version, and the _trf models were trained using a special version of BERT, pretrained by ourselves on the largest available Ancient Greek corpus, namely the TLG. The vectors for the large models were also trained with the TLG corpus.

Loading

As usual, you can load any of the six models with the following Python lines:

```python
import spacy

nlp = spacy.load("grc_proiel_XX")
```

Replace _XX with the size of the model you would like to use: _sm for small, _lg for large, and _trf for transformer. The _trf model is the most accurate but also the slowest.

Use

spaCy is a powerful NLP library with many applications. Its most basic function is the morpho-syntactic annotation of texts for further processing. A common routine is to load a model, create a doc object, and process a text:

```python
import spacy

nlp = spacy.load("grc_proiel_sm")

text = "καὶ πρὶν μὲν ἐν κακοῖσι κειμένην ὅμως ἐλπίς μʼ ἀεὶ προσῆγε σωθέντος τέκνου ἀλκήν τινʼ εὑρεῖν κἀπικούρησιν δόμον"

doc = nlp(text)

for token in doc:
    print(f'{token.text}, lemma: {token.lemma_} pos: {token.pos_}')
```

The apostrophe issue

Unfortunately, there is no consensus among the different internet projects that offer Ancient Greek texts on how to represent the Ancient Greek apostrophe. Modern Greek simply uses the regular apostrophe, but ancient texts available in Perseus and Perseus under Philologic use various Unicode characters for the apostrophe. Instead of the apostrophe, we find the Greek koronis, the modifier letter apostrophe, and the right single quotation mark. Provisionally, I have opted to use the modifier letter apostrophe in the corpus with which I trained the models. This means that if you want the greCy models to handle the apostrophe properly, you have to make sure that the Ancient Greek texts you are processing use the modifier letter apostrophe ʼ (U+02BC). Otherwise the models will fail to lemmatize and tag some words in your texts that end with an 'apostrophe'.
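If your texts come from a source that uses one of the other characters, you can normalize them before processing. The sketch below is my own illustration (the helper name and the exact mapping are not part of greCy); it maps the Greek koronis (U+1FBD), the right single quotation mark (U+2019), and the ASCII apostrophe to U+02BC:

```python
# Normalize apostrophe-like characters to the modifier letter apostrophe
# (U+02BC) used in the greCy training corpus. The helper name and the
# exact mapping are illustrative, not part of greCy itself.
APOSTROPHE_MAP = str.maketrans({
    "\u1fbd": "\u02bc",  # Greek koronis
    "\u2019": "\u02bc",  # right single quotation mark
    "\u0027": "\u02bc",  # ASCII apostrophe
})

def normalize_apostrophes(text: str) -> str:
    """Return text with all apostrophe variants replaced by U+02BC."""
    return text.translate(APOSTROPHE_MAP)

print(normalize_apostrophes("ἐλπίς μ\u2019 ἀεὶ"))
```

Running the normalizer once over your corpus before calling `nlp()` is enough; U+02BC characters already in the text are left untouched.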

Building

I offer here the project file I use to train the models, in case you want to customize the models for your specific needs. The six standard spaCy models (small, large, and transformer) are built and packaged using the following commands:

  1. python -m spacy project assets
  2. python -m spacy project run all
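For orientation, the two commands above rely on a `project.yml` that declares the treebank assets, the workflow, and the individual commands it chains. The following is a hypothetical sketch of that structure, not the actual file shipped with greCy:

```yaml
# Hypothetical sketch of a spaCy project.yml; greCy's actual file
# defines the real corpora, training configs, and model names.
vars:
  treebank: "UD_Ancient_Greek-PROIEL"

assets:
  - dest: "assets/${vars.treebank}"
    git:
      repo: "https://github.com/UniversalDependencies/${vars.treebank}"
      branch: "master"
      path: ""

workflows:
  all:
    - convert    # UD .conllu files -> spaCy's binary .spacy format
    - train      # train the pipeline from a config file
    - evaluate   # report UPOS, UFeats, UAS, LAS on the dev split
    - package    # build an installable pip package

commands:
  - name: convert
    script:
      - "python -m spacy convert assets/${vars.treebank} corpus --converter conllu"
```

`spacy project assets` fetches everything under `assets:`, and `spacy project run all` executes the `all` workflow step by step.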

Performance

For a general comparison, I share here the metrics of the transformer models grc_proiel_trf and grc_perseus_trf. For fine-tuning, these models use a transformer that was specifically trained to be used with spaCy, which makes them much smaller than the alternatives offered by Python NLP libraries such as Stanza and Trankit (for more information on the transformer model and how it was trained, see aristoBERTo). greCy's _trf models outperform Stanza and Trankit in most metrics and have the advantage that their size is only ~430 MB vs. the 1.2 GB of the Trankit model trained with XLM-RoBERTa. See the tables below:

Proiel

| Library | Tokens | Sentences | UPOS | XPOS | UFeats | Lemmas | UAS | LAS |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| spaCy | 100 | 71.74 | 98.45 | 98.53 | 94.18 | 96.59 | 85.79 | 82.30 |
| Trankit | 99.91 | 67.60 | 97.86 | 97.93 | 93.03 | 97.50 | 85.63 | 82.31 |
| Stanza | 100 | 51.65 | 97.38 | 97.75 | 92.09 | 97.42 | 80.34 | 76.33 |

Perseus

| Library | Tokens | Sentences | UPOS | XPOS | UFeats | Lemmas | UAS | LAS |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| spaCy | 100 | 99.38 | 96.75 | 96.82 | 95.16 | 97.33 | 81.92 | 77.26 |
| Trankit | 99.71 | 98.70 | 93.97 | 87.25 | 91.66 | 88.52 | 83.48 | 78.56 |
| Stanza | 99.8 | 98.85 | 92.54 | 85.22 | 91.06 | 88.26 | 78.75 | 73.35 |
| OdyCy | - | 84.09 | 97.32 | 94.18 | 94.09 | 93.89 | 81.40 | 76.42 |

Caveat

Metrics, however, can be misleading. This becomes particularly obvious when you work with texts that are not part of the training and evaluation datasets. In addition, greCy's lemmatizers (in all sizes) exhibit lower benchmark scores than the above-mentioned NLP libraries, but they have a substantially larger vocabulary than the Stanza and Trankit models because they were trained with a complementary lemma corpus derived from Giuseppe G. A. Celano's lemmatized corpus. This means that greCy's lemmatizers perform better than Trankit and Stanza when processing texts not included in the Perseus and Proiel datasets.

Future Developments

This project was initiated as part of the Diogenet Project, a research initiative that focuses on the automatic extraction of social relations from Ancient Greek texts. As part of this project, greCy will first add, in the near future, a NER pipeline for the identification of entities; later I hope to offer a pipeline for the extraction of social relations from Greek texts as well. This pipeline should contribute to the study of social networks in the ancient world.

Owner

  • Name: Jacobo Myerston
  • Login: jmyerston
  • Kind: user
  • Location: San Diego, CA
  • Company: University of California, San Diego

Citation (CITATION.cff)

cff-version: 1.2.0
message: "You can cite greCy as it is indicated below."
preferred-citation:
  type: software
  authors:
  - family-names: "Myerston"
    given-names: "Jacobo"
  - family-names: "López"
    given-names: "Jose"
  title: "greCy: Ancient Greek spaCy models for Natural Language Processing in Python"
  version: 1.0
  date-released: 2023
  url: "https://github.com/jmyerston/greCy"

GitHub Events

Total
  • Create event: 1
  • Issues event: 2
  • Release event: 1
  • Watch event: 10
  • Issue comment event: 4
  • Fork event: 2
Last Year
  • Create event: 1
  • Issues event: 2
  • Release event: 1
  • Watch event: 10
  • Issue comment event: 4
  • Fork event: 2

Packages

  • Total packages: 1
  • Total downloads:
    • pypi 39 last-month
  • Total dependent packages: 0
  • Total dependent repositories: 0
  • Total versions: 1
  • Total maintainers: 1
pypi.org: grecy

grecy installs six ancient Greek spaCy models that were trained using the Universal Dependency Proiel and Perseus treebanks.

  • Homepage: https://github.com/jmyerston/greCy
  • Documentation: https://grecy.readthedocs.io/
  • License: MIT License Copyright (c) 2023 Jacobo Myerston Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
  • Latest release: 1.0
    published over 2 years ago
  • Versions: 1
  • Dependent Packages: 0
  • Dependent Repositories: 0
  • Downloads: 39 Last month
Rankings
Dependent packages count: 7.5%
Stargazers count: 16.1%
Forks count: 22.9%
Average: 29.0%
Dependent repos count: 69.6%
Maintainers (1)
Last synced: 7 months ago