word2vecelastic

Collect sentences from ElasticSearch, preprocess and train diachronic Word2Vec models

https://github.com/centrefordigitalhumanities/word2vecelastic

Science Score: 54.0%

This score indicates how likely this project is to be science-related based on various indicators:

✓ CITATION.cff file: Found CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
✓ .zenodo.json file: Found .zenodo.json file
○ DOI references
✓ Academic publication links: Links to zenodo.org
○ Committers with academic emails
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (14.9%) to scientific vocabulary

Keywords

elasticsearch embeddings word2vec

Last synced: 6 months ago

Repository

Basic Info

Host: GitHub
Owner: CentreForDigitalHumanities
License: bsd-3-clause
Language: Python
Default Branch: develop
Homepage:
Size: 119 KB

Statistics

Stars: 1
Watchers: 5
Forks: 0
Open Issues: 0
Releases: 1

Topics

elasticsearch embeddings word2vec

Created almost 7 years ago · Last pushed 7 months ago

Metadata Files

Readme License Citation

Word2VecElastic

This repository includes utility functions to build diachronic Word2Vec models in gensim, using an Elasticsearch index to collect the data, and SpaCy and NLTK to preprocess it.

It can also be used to train word models for small datasets using the positive pointwise mutual information (PPMI) metric to retain matrices of word similarity, following this paper.
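As a rough illustration of the PPMI metric mentioned above, the sketch below counts co-occurrences within a symmetric window and keeps only the positive pointwise mutual information values. It is written in plain Python and is not the repository's implementation; the function name `ppmi_matrix`, the window size and the toy sentences are all hypothetical.

```python
import math
from collections import Counter


def ppmi_matrix(sentences, window=2):
    """Return {(word, context): ppmi} for pairs co-occurring within `window`."""
    pair_counts = Counter()
    for sentence in sentences:
        for i, word in enumerate(sentence):
            start = max(0, i - window)
            stop = min(len(sentence), i + window + 1)
            for j in range(start, stop):
                if j != i:
                    pair_counts[(word, sentence[j])] += 1

    # Marginal counts for words and contexts, derived from the pair counts.
    total = sum(pair_counts.values())
    word_counts = Counter()
    context_counts = Counter()
    for (word, context), count in pair_counts.items():
        word_counts[word] += count
        context_counts[context] += count

    # PMI = log2(P(w, c) / (P(w) * P(c))); only positive values are kept.
    ppmi = {}
    for (word, context), count in pair_counts.items():
        pmi = math.log2(count * total / (word_counts[word] * context_counts[context]))
        if pmi > 0:
            ppmi[(word, context)] = pmi
    return ppmi


# Hypothetical toy data: each sentence is a list of words.
toy_sentences = [
    ["old", "ship", "sails"],
    ["old", "ship", "sinks"],
    ["new", "train", "runs"],
]
scores = ppmi_matrix(toy_sentences)
```

Pairs that never co-occur, such as ("old", "train") in the toy data, are simply absent from the result, which keeps the representation sparse for small datasets.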

The data is read in year batches from Elasticsearch and preprocessed. Every year's preprocessed data is saved to hard disk (as a pickled list of lists of words), so that for multiple passes (e.g., one to build the vocabulary, one to train the model), the data is available more readily.
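The caching idea described above can be sketched as follows: each year's sentences are pickled once and then re-read for every pass. This is a minimal sketch, not the repository's code; the helper names `save_year` and `load_year` are hypothetical, and the `{index_name}-{year}.pkl` file naming follows the pattern shown in the "Preprocessing output" section.

```python
import pickle
import tempfile
from pathlib import Path


def save_year(sentences, directory, index_name, year):
    """Pickle one year's sentences (a list of lists of words) to disk."""
    path = Path(directory) / f"{index_name}-{year}.pkl"
    with open(path, "wb") as handle:
        pickle.dump(sentences, handle)
    return path


def load_year(directory, index_name, year):
    """Read one year's sentences back from disk."""
    path = Path(directory) / f"{index_name}-{year}.pkl"
    with open(path, "rb") as handle:
        return pickle.load(handle)


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical preprocessed data for a single year.
    save_year([["first", "sentence"], ["second", "one"]], tmp, "my-index", 1960)

    # First pass: collect the vocabulary.
    vocabulary = {word for sentence in load_year(tmp, "my-index", 1960) for word in sentence}

    # Second pass: reuse the same file without querying Elasticsearch again.
    token_count = sum(len(sentence) for sentence in load_year(tmp, "my-index", 1960))
```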

For the whole time period, a full model will be generated, which will be used as the pre-trained starting point for the individual models. Alternatively, independent models can be trained by setting the -in flag (see Usage).

Prerequisites

Elasticsearch

The data is fetched from Elasticsearch. By default, this will attempt to fetch from a local instance (i.e., localhost:9200). For local development, install Elasticsearch.

In order to fetch data from a remote Elasticsearch cluster and/or on a different port, set environment variables through an .env file. The .env-dist file can be copied as a starting point.

For instance, to fetch from http://url-of-your-cluster:9900, you would set:

    ES_HOST=http://url-of-your-cluster
    ES_PORT=9900

If not set, ES_HOST will fall back to localhost and ES_PORT to 9200.

To connect to a remote cluster through SSL (recommended), you will also need to set the following variables: ES_API_ID, ES_API_KEY, CERTS_LOCATION.

Finally, if you would like to read from an index with a different name than the corpus, you can do this by setting INDEX. Once your .env file is set correctly, you can load the variables into your environment like so:

    source .env
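The fallback behaviour described in this section could look roughly like the sketch below. It only illustrates reading the environment variables named above with their documented defaults; the function name `elasticsearch_settings` and the shape of the returned dictionary are assumptions, not the repository's actual code.

```python
import os


def elasticsearch_settings(environ=os.environ):
    """Collect connection settings, falling back to localhost:9200."""
    settings = {
        "host": environ.get("ES_HOST", "localhost"),
        "port": int(environ.get("ES_PORT", "9200")),
    }
    # Optional variables: only included when they are set and non-empty.
    for name in ("ES_API_ID", "ES_API_KEY", "CERTS_LOCATION", "INDEX"):
        if environ.get(name):
            settings[name] = environ[name]
    return settings


# With nothing set, the documented defaults apply.
defaults = elasticsearch_settings({})

# With the example values from this section.
remote = elasticsearch_settings(
    {"ES_HOST": "http://url-of-your-cluster", "ES_PORT": "9900"}
)
```

Passing the environment as an argument keeps the function easy to test without touching the real process environment.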

Python

The code was tested in Python 3.11. Create a virtualenv (python -m venv your_env_name), activate it (source your_env_name/bin/activate), and then run pip install -r requirements.txt.

SpaCy language models

With the environment activated, download the SpaCy language models required for preprocessing as follows:

    python -m spacy download en_core_web_sm

See [the SpaCy documentation](https://spacy.io/usage/models).

Corpus configurations

To train word models for a corpus, update [the CORPUS_CONFIGURATIONS dictionary](code/corpusconfig.py). The required settings are:
- index: the name of the Elasticsearch index
- language: the language of the corpus
- textfield: the field of the index in which the text data for training can be found

Optional settings are:
- algorithm: set 'ppmi' or leave unset (will default to 'word2vec')
- datefield: the field used to filter for specific years. If not set, a warning is raised to inform that date will be used as the default.
- independent: if False, the generate_models script will first train a large model on all data, and then proceed to retrain it for time slices of the data. Defaults to True (i.e., each model is trained independently of data from other time slices). Note that limiting the size of the vocabulary with maxvocabsize and mincount may not be as effective when training with independent=False.
- maxvocabsize: can be used to prune the vocabulary during training, so that the memory size does not explode. Defaults to None.
- mincount: the number of times a word must appear in the data in order to be included in the model. Defaults to None.
- maxfinalvocab: can be used to choose the min_count automatically to limit the final vocabulary to this size. Defaults to None.
- vectorsize: the number of dimensions of the resulting word vectors. Defaults to 100.
- windowsize: the size of the window around a target word for the word2vec algorithm. Defaults to 5.
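To make the settings above concrete, here is a hedged sketch of what one configuration entry and its defaults might look like. The key names are copied as they appear in the list above, and their exact spelling in the repository may differ (for example, they may contain underscores). The corpus name, field values and the `resolve_config` helper are hypothetical.

```python
import warnings

REQUIRED_SETTINGS = ("index", "language", "textfield")

# Defaults as documented in the list above.
DEFAULT_SETTINGS = {
    "algorithm": "word2vec",
    "independent": True,
    "maxvocabsize": None,
    "mincount": None,
    "maxfinalvocab": None,
    "vectorsize": 100,
    "windowsize": 5,
}

# Hypothetical entry; the values are placeholders.
CORPUS_CONFIGURATIONS = {
    "my-corpus": {
        "index": "my-index",
        "language": "english",
        "textfield": "content",
        "mincount": 5,
    }
}


def resolve_config(corpus_name, configurations=CORPUS_CONFIGURATIONS):
    """Check the required settings and fill in the documented defaults."""
    config = configurations[corpus_name]
    missing = [key for key in REQUIRED_SETTINGS if key not in config]
    if missing:
        raise KeyError(f"missing required settings: {missing}")
    resolved = {**DEFAULT_SETTINGS, **config}
    if "datefield" not in resolved:
        warnings.warn("datefield not set, using 'date' as default")
        resolved["datefield"] = "date"
    return resolved


resolved = resolve_config("my-corpus")
```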

Usage

To train models, with the environment activated, use the command:

    python generate_models.py -i your-index-name -s 1960 -e 2000 -md /path/to/output/models

Meaning of the flags:
- c: name of the corpus
- s: start year of training
- e: end year of training
- md: set the output directory where models will be written

Optional flags:
- n: number of years per model (default: 10)
- sh: shift between models (default: 5)
- sd: path to output preprocessed training data (default: 'source_data')
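The interplay of the start year, end year, years per model and shift can be illustrated with a small sketch. The function `time_windows` is hypothetical, and whether the script treats the end year as inclusive or exclusive is an assumption here; the sketch only shows how overlapping windows arise when the shift is smaller than the window length.

```python
def time_windows(start_year, end_year, n_years=10, shift=5):
    """List (start, end) windows of n_years, moved forward by shift each time."""
    windows = []
    window_start = start_year
    # Assumption: a window is only used if it fits entirely before end_year.
    while window_start + n_years <= end_year:
        windows.append((window_start, window_start + n_years))
        window_start += shift
    return windows


# Using the example values from the command above with the default n and sh.
windows = time_windows(1960, 2000)
for start, end in windows:
    print(f"{start}-{end}")
```

Under these assumptions, the defaults of 10 years per model and a shift of 5 give windows that each share half of their years with the next one.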

You can also run python generate_models.py -h to see this documentation.

Output

The training script generates three kinds of output:
- preprocessing output, the result of tokenizing, stop word removal and (optional) lemmatization of the source data from Elasticsearch, saved as Python binary .pkl files, named after the index and year, in the source_directory (set through the -sd flag).
- word2vec output, the result of training on the preprocessed data, saved as KeyedVectors, named after the index and time window, with the extension .wv, in the model_directory (set through the -md flag).
- statistics about the number of tokens (all words in the model not discarded during stop word removal) and the number of terms (all distinct tokens), named after the index and time window, and saved as a comma-separated table (.csv) in the model_directory.
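The distinction between tokens and terms in the statistics output can be shown with a short sketch. It is an illustration only: the function names and the CSV column names (`time_window`, `n_tokens`, `n_terms`) are hypothetical and may not match the files the script writes.

```python
import csv
import io


def window_statistics(sentences):
    """Count tokens (all remaining words) and terms (distinct tokens)."""
    tokens = [word for sentence in sentences for word in sentence]
    return {"n_tokens": len(tokens), "n_terms": len(set(tokens))}


def write_statistics(rows, file_handle):
    """Write one row per time window as a comma-separated table."""
    writer = csv.DictWriter(file_handle, fieldnames=["time_window", "n_tokens", "n_terms"])
    writer.writeheader()
    writer.writerows(rows)


# Hypothetical preprocessed sentences for one time window.
stats = window_statistics([["ship", "sails", "fast"], ["ship", "sinks"]])

# An in-memory buffer stands in for a .csv file in the model directory.
buffer = io.StringIO()
write_statistics([{"time_window": "1960-1970", **stats}], buffer)
```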

Preprocessing output

To inspect the preprocessing output, install the dependencies of this repository (see [Prerequisites](#prerequisites)), then open a Python terminal. To get a list of sentences (each a list of words), use the following workflow:

    from util import inspect_source_data
    sentences = inspect_source_data('/{filepath}/{index_name}-{year}.pkl')

To write the source data to a text file, use the following workflow:

    from util import source_data_to_file
    source_data_to_file('/{filepath}/{index_name}-{year}.pkl')

Owner

Name: Centre for Digital Humanities
Login: CentreForDigitalHumanities
Kind: organization
Email: cdh@uu.nl
Location: Netherlands

Website: https://cdh.uu.nl/
Repositories: 39
Profile: https://github.com/CentreForDigitalHumanities

Interdisciplinary centre for research and education in computational and data-driven methods in the humanities.

Citation (CITATION.cff)

# This CITATION.cff file was generated with cffinit.
# Visit https://bit.ly/cffinit to generate yours today!

cff-version: 1.2.0
title: Word2VecElastic
message: >-
  If you use this software, please cite it using the
  metadata from this file.
type: software
authors:
  - name: >-
      Research Software Lab, Centre for Digital Humanities,
      Utrecht University
abstract: >-
  Collect sentences from ElasticSearch, preprocess and train
  diachronic Word2Vec models
license: BSD-3-Clause
version: v1.0.0

GitHub Events

Total

Delete event: 1
Issue comment event: 1
Member event: 1
Push event: 10
Pull request event: 1
Create event: 2

Last Year

Delete event: 1
Issue comment event: 1
Member event: 1
Push event: 10
Pull request event: 1
Create event: 2

Committers

Last synced: 8 months ago

All Time

Total Commits: 165
Total Committers: 2
Avg Commits per committer: 82.5
Development Distribution Score (DDS): 0.006

Past Year

Commits: 16
Committers: 1
Avg Commits per committer: 16.0
Development Distribution Score (DDS): 0.0

Top Committers

Name	Email	Commits
BeritJanssen	b**n@g**m	164
lukavdplas	l**s@g**m	1

Issues and Pull Requests

Last synced: 8 months ago

All Time

Total issues: 3
Total pull requests: 4
Average time to close issues: 3 months
Average time to close pull requests: 2 months
Total issue authors: 1
Total pull request authors: 1
Average comments per issue: 1.0
Average comments per pull request: 1.25
Merged pull requests: 3
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 0
Pull requests: 1
Average time to close issues: N/A
Average time to close pull requests: 7 days
Issue authors: 0
Pull request authors: 1
Average comments per issue: 0
Average comments per pull request: 1.0
Merged pull requests: 1
Bot issues: 0
Bot pull requests: 0

Top Authors

Issue Authors

lukavdplas (3)

Pull Request Authors

BeritJanssen (5)

Top Labels

Issue Labels

wontfix (1)

Pull Request Labels

Dependencies

requirements.in pypi

elasticsearch *
gensim *
pytest *
scikit-learn *
spacy *

requirements.txt pypi

attrs ==22.2.0
blis ==0.7.9
catalogue ==2.0.8
certifi ==2022.12.7
charset-normalizer ==2.1.1
click ==8.1.3
confection ==0.0.3
cymem ==2.0.7
elastic-transport ==8.4.0
elasticsearch ==8.6.1
exceptiongroup ==1.1.0
fst-pso ==1.8.1
fuzzytm ==2.0.5
gensim ==4.3.0
idna ==3.4
iniconfig ==1.1.1
jinja2 ==3.1.2
joblib ==1.2.0
langcodes ==3.3.0
markupsafe ==2.1.1
miniful ==0.0.6
murmurhash ==1.0.9
numpy ==1.24.1
packaging ==22.0
pandas ==1.5.2
pathy ==0.10.1
pluggy ==1.0.0
preshed ==3.0.8
pydantic ==1.10.4
pyfume ==0.2.25
pytest ==7.2.0
python-dateutil ==2.8.2
pytz ==2022.7
requests ==2.28.1
scikit-learn ==1.2.0
scipy ==1.10.0
simpful ==2.9.0
six ==1.16.0
smart-open ==6.3.0
spacy ==3.4.4
spacy-legacy ==3.0.11
spacy-loggers ==1.0.4
srsly ==2.4.5
thinc ==8.1.6
threadpoolctl ==3.1.0
tomli ==2.0.1
tqdm ==4.64.1
typer ==0.7.0
typing-extensions ==4.4.0
urllib3 ==1.26.13
wasabi ==0.10.1
