weems

https://github.com/ccs-zcu/weems


Repository

Basic Info

Host: GitHub
Owner: CCS-ZCU
License: cc-by-sa-4.0
Language: Jupyter Notebook
Default Branch: master
Size: 761 MB

Statistics

Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Releases: 12

Created over 1 year ago · Last pushed 10 months ago

Metadata Files

Readme License Citation

WEEMS: Word Embeddings for Early Modern Science

Authors

Vojtěch Kaše, Jan Tvrz, Jana Švadlenková, Petr Pavlas

License

CC-BY-SA 4.0, see attached License.md

In this repository, we make available for reuse a series of word vector models trained on two corpora of Early Modern Latin texts:

* Noscemus Digital Sourcebook (a corpus of digitized Early Modern scientific texts in Latin, https://doi.org/10.5281/zenodo.15040256)
* EMLAP (a corpus of digitized Early Modern Latin Alchemical Prints, https://doi.org/10.5281/zenodo.14765294)

In addition, for comparison, we include two existing word embedding models based on LASLA and Opera Maiora, which are publicly available here: https://embeddings.lila-erc.eu/#topnav

In total, we offer 4 temporal models based on NOSCEMUS, 8 discipline-specific models based on NOSCEMUS, 1 model trained on the EMLAP corpus, and 2 pretrained models inherited from other resources:

* NOSCEMUS - 1501-1550
* NOSCEMUS - 1551-1600
* NOSCEMUS - 1601-1650
* NOSCEMUS - 1651-1700
* NOSCEMUS - Alchemy/Chemistry
* NOSCEMUS - Astronomy/Astrology/Cosmography
* NOSCEMUS - Biology
* NOSCEMUS - Geography/Cartography
* NOSCEMUS - Mathematics
* NOSCEMUS - Medicine
* NOSCEMUS - Meteorology/Earth sciences
* NOSCEMUS - Physics
* LASLA
* Opera Maiora
* EMLAP

We train the models on textual data, which we previously preprocessed and automatically morphologically annotated using scripts in the following GitHub repositories: https://github.com/CCS-ZCU/noscemusETF and https://github.com/CCS-ZCU/EMLAPETL. Thus, the training textual data have the form of automatically lemmatized and morphologically annotated Latin sentences.

From these sentences, we first keep only the words morphologically annotated as nouns (NOUN), verbs (VERB), adjectives (ADJ), and proper names (PROPN), as these tend to carry the most semantic content.
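The filtering step can be illustrated with a minimal sketch. It assumes each annotated sentence is a list of `(lemma, pos)` tuples; this format, the function name `filter_content_lemmas`, and the example sentence are illustrative assumptions, not the actual data structures used in the preprocessing repositories.

```python
# POS tags kept for training, as listed in the text above.
KEPT_POS = {"NOUN", "VERB", "ADJ", "PROPN"}


def filter_content_lemmas(sentence, kept_pos=KEPT_POS):
    """Return the lemmas of one sentence whose POS tag is in kept_pos.

    `sentence` is assumed to be a list of (lemma, pos) tuples; the real
    annotation format of the corpora may differ.
    """
    return [lemma for lemma, pos in sentence if pos in kept_pos]


# Hypothetical annotated sentence ("aurum est metallum perfectum").
example = [
    ("aurum", "NOUN"),
    ("sum", "AUX"),
    ("metallum", "NOUN"),
    ("perfectus", "ADJ"),
]
print(filter_content_lemmas(example))  # ['aurum', 'metallum', 'perfectus']
```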

Next, we calculate the raw frequencies of these words across the subcorpora. We use these frequencies to further reduce the size of the vocabulary, i.e., the list of words for which we generate the vectors. First, we extract the 2,000 most frequent (lemmatized) words for each subcorpus. This produces a list of 6643 [NOTICE: outdated values from a previous version] unique words. Second, we exclude all words that appear fewer than 5 times in any of the subcorpora. This reduces the vocabulary to 6,005 unique lemmata. As a result, the models share a large common vocabulary, which makes it possible to align them.
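The two-step vocabulary selection can be sketched as follows. This is an illustration under stated assumptions rather than the repository's actual code: the function name `build_shared_vocabulary`, the input format (a dictionary mapping a subcorpus name to a list of lemmatized sentences), and the toy data are hypothetical. The sketch also reads the second step as dropping a lemma when its count falls below the threshold in at least one subcorpus.

```python
from collections import Counter


def build_shared_vocabulary(subcorpora, top_n=2000, min_count=5):
    """Select a vocabulary shared across subcorpora.

    `subcorpora` is assumed to map a subcorpus name to a list of
    sentences, each sentence being a list of lemmas.
    """
    # Raw lemma frequencies per subcorpus.
    counts = {
        name: Counter(lemma for sentence in sentences for lemma in sentence)
        for name, sentences in subcorpora.items()
    }

    # Step 1: union of the top_n most frequent lemmas of each subcorpus.
    candidates = set()
    for counter in counts.values():
        candidates.update(lemma for lemma, _ in counter.most_common(top_n))

    # Step 2: drop lemmas that occur fewer than min_count times
    # in at least one subcorpus (a missing lemma counts as 0).
    return sorted(
        lemma
        for lemma in candidates
        if all(counter[lemma] >= min_count for counter in counts.values())
    )


# Toy data with small thresholds, only to show the mechanics.
toy = {
    "a": [["aurum", "ignis"]] * 3 + [["aqua"]],
    "b": [["aurum"]] * 2 + [["ignis"]],
}
print(build_shared_vocabulary(toy, top_n=2, min_count=2))  # ['aurum']
```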

For the models, we employ the FastText algorithm, with the exact same parametrization as in this paper:

Sprugnoli, R., Moretti, G., & Passarotti, M. (2020). Building and Comparing Lemma Embeddings for Latin. Classical Latin versus Thomas Aquinas. Italian Journal of Computational Linguistics, 6(1). https://doi.org/10.5281/ZENODO.4618000

This makes our vectors directly comparable with the vectors they generated for LASLA and Opera Maiora.

The models are available as a single pickle file containing a Python dictionary of Gensim keyed vectors: /data/vectors_dict_comp_v0-3.pkl. Once you download or clone the repository, you can load them directly using the following Python code snippet:

```python
import pickle

# Gensim must be installed so that the keyed vectors can be unpickled.
# Adjust the relative path to the location you run the code from.
with open("../data/vectors_dict_comp_v0-3.pkl", "rb") as file:
    vectors_dict = pickle.load(file)
```
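Once the dictionary is loaded, a typical use is to compare the nearest neighbours of a lemma across the models. The sketch below assumes that the dictionary values behave like Gensim 4 `KeyedVectors` objects (exposing `key_to_index` and `most_similar`); the helper name `compare_neighbours` and the query lemma are hypothetical, and the dictionary keys are not documented here, so inspect `vectors_dict.keys()` first.

```python
def compare_neighbours(vectors_dict, lemma, topn=5):
    """Collect the nearest neighbours of `lemma` in every model that knows it.

    Returns a dict mapping model name to a list of neighbouring lemmas.
    Models whose vocabulary lacks the lemma are skipped.
    """
    neighbours = {}
    for model_name, keyed_vectors in vectors_dict.items():
        # key_to_index is the vocabulary mapping of Gensim 4 KeyedVectors.
        if lemma not in keyed_vectors.key_to_index:
            continue
        neighbours[model_name] = [
            word for word, _ in keyed_vectors.most_similar(lemma, topn=topn)
        ]
    return neighbours


# Usage, once vectors_dict has been loaded from the pickle file:
# for name, words in compare_neighbours(vectors_dict, "aurum").items():
#     print(name, words)
```

Checking the explicit vocabulary rather than using `lemma in keyed_vectors` is a deliberate choice here: FastText-based vectors can synthesize a vector for unseen words from character n-grams, so a plain membership test may not tell you whether the lemma was actually in the training vocabulary.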

This repository is part of the TOME project.

Getting started

```bash
git clone [url-of-the-git-file]
cd [name-of-the-repo]

# (recommendation: create and activate a virtual environment)

pip install -r requirements.txt
```

We recommend using a dedicated virtual environment for the whole project:

```bash
python3 -m venv latin_venv  # or specify your own source python to replicate (e.g. python3.12 etc.)
latin_venv/bin/python -m pip install --upgrade pip
latin_venv/bin/python -m pip install -r requirements.txt
latin_venv/bin/python -m ipykernel install --user --name=noscemus_kernel  # create the jupyter kernel to be used by the notebooks
echo "/latin_venv/" >> .gitignore  # add the latin_venv directory to .gitignore to prevent its synchronization via GitHub
```

Whenever you need to install another package, run `latin_venv/bin/python -m pip install <package-name>`, or activate the environment first: `source latin_venv/bin/activate`.

Finally, go to the scripts directory and run the Jupyter notebooks you wish;-).

Scripts

The scripts are in the scripts subfolder and their numbers and titles should be self-explanatory. Usually, they have the form of Jupyter notebooks.

Owner

Name: CCS-Lab (Computing Culture & Society)
Login: CCS-ZCU
Kind: organization
Email: kase@kfi.zcu.cz
Location: Czech Republic

Website: https://ccs.zcu.cz
Repositories: 1
Profile: https://github.com/CCS-ZCU

GitHub Events

Total

Release event: 11
Push event: 25
Create event: 10

Last Year

Release event: 11
Push event: 25
Create event: 10
