Repository

Basic Info
  • Host: GitHub
  • Owner: CCS-ZCU
  • License: cc-by-sa-4.0
  • Language: Jupyter Notebook
  • Default Branch: master
  • Size: 761 MB
Statistics
  • Stars: 0
  • Watchers: 1
  • Forks: 0
  • Open Issues: 0
  • Releases: 12
Created over 1 year ago · Last pushed 10 months ago
Metadata Files
Readme License Citation

README.md

WEEMS: Word Embeddings for Early Modern Science


Authors

  • Vojtěch Kaše, Jan Tvrz, Jana Švadlenková, Petr Pavlas

License

CC-BY-SA 4.0, see attached License.md


In this repository, we make available for reuse a series of word vector models trained on two corpora of Early Modern Latin texts:

  • Noscemus Digital Sourcebook (a corpus of digitized Early Modern scientific texts in Latin, https://doi.org/10.5281/zenodo.15040256)
  • EMLAP (a corpus of digitized Early Modern Latin Alchemical Prints, https://doi.org/10.5281/zenodo.14765294)

In addition, for comparison, we include two pretrained word embedding models based on LASLA and Opera Maiora, which are publicly available here: https://embeddings.lila-erc.eu/#topnav

In total, we offer 4 temporal models based on NOSCEMUS, 8 discipline-specific models based on NOSCEMUS, 1 model trained on the EMLAP corpus, and 2 pretrained models inherited from other resources:

  • NOSCEMUS - 1501-1550
  • NOSCEMUS - 1551-1600
  • NOSCEMUS - 1601-1650
  • NOSCEMUS - 1651-1700
  • NOSCEMUS - Alchemy/Chemistry
  • NOSCEMUS - Astronomy/Astrology/Cosmography
  • NOSCEMUS - Biology
  • NOSCEMUS - Geography/Cartography
  • NOSCEMUS - Mathematics
  • NOSCEMUS - Medicine
  • NOSCEMUS - Meteorology/Earth sciences
  • NOSCEMUS - Physics
  • LASLA
  • Opera Maiora
  • EMLAP

We train the models on textual data that we preprocessed and automatically morphologically annotated using the scripts in the following GitHub repositories: https://github.com/CCS-ZCU/noscemusETF and https://github.com/CCS-ZCU/EMLAPETL. The training data therefore take the form of automatically lemmatized and morphologically annotated Latin sentences.

From these sentences, we first keep only the words morphologically annotated as nouns (NOUN), verbs (VERB), adjectives (ADJ), and proper names (PROPN), as these word classes tend to carry the most semantic content.
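The filtering step can be illustrated with a minimal sketch. It assumes, purely for illustration, that each annotated sentence is a list of (lemma, POS tag) pairs. The function name and the toy sentences are hypothetical, and the actual data format used in the repository may differ.

```python
# Hypothetical input format: each sentence is a list of (lemma, upos) pairs.
# The four tags below are the ones named in the text above.
CONTENT_POS = {"NOUN", "VERB", "ADJ", "PROPN"}


def filter_content_lemmas(sentence):
    """Keep only the lemmas whose POS tag is one of CONTENT_POS."""
    return [lemma for lemma, upos in sentence if upos in CONTENT_POS]


# Toy sentences, invented for this example only.
sentences = [
    [("sol", "NOUN"), ("in", "ADP"), ("caelum", "NOUN"), ("luceo", "VERB")],
    [("et", "CCONJ"), ("luna", "NOUN"), ("clarus", "ADJ"), ("sum", "AUX")],
]

filtered = [filter_content_lemmas(sentence) for sentence in sentences]
print(filtered)
```

Function words such as prepositions, conjunctions, and auxiliaries are dropped, so each sentence is reduced to a list of content lemmata.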

Next, we calculate the raw frequencies of these words across the subcorpora. We use these frequencies to further reduce the size of the vocabulary, i.e., the list of words for which we generate vectors. First, we extract the 2,000 most frequent (lemmatized) words for each subcorpus. This produces a list of 6,643 [NOTICE: outdated values from a previous version] unique words. Second, we exclude all words that appear fewer than 5 times in any of the subcorpora. This reduces the vocabulary to 6,005 unique lemmata. As a result, the models share an extensive vocabulary, which makes it possible to align them.
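The two-step vocabulary selection can be sketched as follows. This is an illustration under stated assumptions, not the repository's actual script. The function name, the input format (a dictionary mapping subcorpus names to lists of lemmatized sentences), and the toy data are hypothetical. The sketch also assumes one reading of the second step, namely that a lemma is dropped if it falls below the threshold in at least one subcorpus.

```python
from collections import Counter


def build_shared_vocabulary(subcorpora, top_n=2000, min_count=5):
    """Select a shared vocabulary from several subcorpora.

    subcorpora: dict mapping a subcorpus name to a list of sentences,
    where each sentence is a list of lemmata (hypothetical format).
    """
    # Raw lemma frequencies per subcorpus.
    freqs = {
        name: Counter(lemma for sentence in sentences for lemma in sentence)
        for name, sentences in subcorpora.items()
    }
    # Step 1: union of the top_n most frequent lemmata of each subcorpus.
    candidates = set()
    for counter in freqs.values():
        candidates.update(lemma for lemma, _ in counter.most_common(top_n))
    # Step 2 (assumed reading): drop every lemma that occurs fewer than
    # min_count times in at least one subcorpus.
    return sorted(
        lemma
        for lemma in candidates
        if all(counter[lemma] >= min_count for counter in freqs.values())
    )


# Toy data with small thresholds so the effect of both steps is visible.
toy = {
    "A": [["aqua", "aqua", "ignis", "ignis", "terra"]],
    "B": [["aqua", "aqua", "aer", "aer", "ignis"]],
}
vocab = build_shared_vocabulary(toy, top_n=2, min_count=2)
print(vocab)
```

In the toy data, only "aqua" is both among the top lemmata and frequent enough in every subcorpus, so it is the only lemma that survives.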

For the models, we employ the FastText algorithm, with the exact same parametrization as in this paper:

Sprugnoli, R., Moretti, G., & Passarotti, M. (2020). Building and Comparing Lemma Embeddings for Latin. Classical Latin versus Thomas Aquinas. Italian Journal of Computational Linguistics, 6(1). https://doi.org/10.5281/ZENODO.4618000

This makes our vectors directly comparable with the vectors they generated for LASLA and Opera Maiora.

The models are available as a single pickle file containing a Python dictionary of Gensim keyed vectors: /data/vectors_dict_comp_v0-3.pkl. Once you download or clone the repository, you can load them directly using the following Python code snippet:

```python
import pickle

# gensim must be installed, otherwise the keyed vectors cannot be unpickled.
# The relative path is presumably resolved from the scripts directory.
with open("../data/vectors_dict_comp_v0-3.pkl", "rb") as file:
    vectors_dict = pickle.load(file)
```
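Once the dictionary is loaded, one possible way to explore it is to compare the nearest neighbours of a lemma across all models. The helper below is a minimal sketch. Its name is hypothetical, and it assumes that the dictionary values are Gensim 4 keyed vectors exposing `key_to_index` and `most_similar`. The exact dictionary keys are not documented here, so the sketch simply iterates over whatever keys are present.

```python
def neighbours_across_models(vectors_dict, lemma, topn=5):
    """Return {model_name: [(neighbour, similarity), ...]} for every
    model whose vocabulary contains the given lemma."""
    results = {}
    for name, keyed_vectors in vectors_dict.items():
        # key_to_index is the vocabulary mapping of Gensim 4 keyed vectors;
        # checking it skips lemmata missing from a model's vocabulary.
        if lemma in keyed_vectors.key_to_index:
            results[name] = keyed_vectors.most_similar(lemma, topn=topn)
    return results


# Usage, assuming vectors_dict was loaded from the pickle file and that the
# lemma "aqua" is in the vocabulary (the lemma is only an illustration):
# for name, hits in neighbours_across_models(vectors_dict, "aqua").items():
#     print(name, [word for word, _ in hits])
```

Because all models are restricted to a largely shared vocabulary, the same lemma can usually be queried in every model, which is what makes this kind of side-by-side comparison possible.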

This repository is part of the TOME project.

Getting started

```bash
git clone [url-of-the-git-file]
cd [name-of-the-repo]

# recommendation: create and activate a virtual environment first

pip install -r requirements.txt
```

We recommend using a dedicated virtual environment for the whole project:

```bash
python3 -m venv latin_venv  # or use a specific interpreter (e.g. python3.12)
latin_venv/bin/python -m pip install --upgrade pip
latin_venv/bin/python -m pip install -r requirements.txt
# create the Jupyter kernel to be used by the notebooks
latin_venv/bin/python -m ipykernel install --user --name=noscemus_kernel
# add the virtual environment directory to .gitignore so it is not synced to GitHub
echo "/latin_venv/" >> .gitignore
```

Whenever you need to install another package, run `latin_venv/bin/python -m pip install <package-name>`, or activate the environment first with `source latin_venv/bin/activate`.

Finally, go to the scripts directory and run whichever Jupyter notebooks you wish ;-).


Scripts

The scripts are in the scripts subfolder and their numbers and titles should be self-explanatory. Usually, they have the form of Jupyter notebooks.

Owner

  • Name: CCS-Lab (Computing Culture & Society)
  • Login: CCS-ZCU
  • Kind: organization
  • Email: kase@kfi.zcu.cz
  • Location: Czech Republic

GitHub Events

Total
  • Release event: 11
  • Push event: 25
  • Create event: 10
Last Year
  • Release event: 11
  • Push event: 25
  • Create event: 10