hist-w2v

Tools for downloading, filtering, and training word embeddings on Google Ngrams

https://github.com/eric-d-knowles/hist_w2v

Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

✓ CITATION.cff file: Found CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
✓ .zenodo.json file: Found .zenodo.json file
○ DOI references
○ Academic publication links
○ Academic email domains
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (12.9%) to scientific vocabulary

Last synced: 10 months ago

Repository

Basic Info

Host: GitHub
Owner: eric-d-knowles
License: mit
Language: Jupyter Notebook
Default Branch: main
Size: 3.56 MB

Statistics

Stars: 2
Watchers: 1
Forks: 0
Open Issues: 0
Releases: 4

Created over 1 year ago · Last pushed 11 months ago

Metadata Files

Readme License Citation

hist_w2v

Tools for downloading, processing, and training word2vec models on Google Ngrams

hist_w2v is a Python package that helps researchers use Google Ngrams to examine semantic change over years, decades, and centuries. It automates downloading and pre-processing raw ngrams and training word2vec models on the resulting corpus.

Installation

There are two ways to install hist_w2v:

1. Clone the GitHub repository (https://github.com/eric-d-knowles/hist_w2v) to your Python environment.
2. Install from PyPI.org by running pip install hist_w2v in your Python environment.

After installing hist_w2v, the best way to learn how to use it is to work through the provided Jupyter Notebook workflows. Together, these notebooks provide a fully documented, end-to-end illustration of the package's functionality.
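To confirm that a pip-based install succeeded before opening the notebooks, a quick check along the following lines may help. This is a minimal sketch using only the standard library; the helper name installed_version is hypothetical and not part of hist_w2v, and a repository that was cloned but not pip-installed would likely report as missing.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name):
    """Return the installed version string of a distribution, or None if absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version("hist_w2v")
    print(found if found else "hist_w2v is not installed in this environment")
```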

Package Contents

The library consists of the following modules and notebooks:

src/ngram_tools
1. download_ngrams.py: downloads the desired ngram types (e.g., 3-grams with part-of-speech [POS] tags, 5-grams without POS tags).
2. convert_to_jsonl.py: converts the raw-text ngrams from Google into a more flexible JSONL format.
3. lowercase_ngrams.py: makes the ngrams all lowercase.
4. lemmatize_ngrams.py: lemmatizes the ngrams (i.e., reduces them to their base grammatical forms).
5. filter_ngrams.py: screens out undesired tokens (e.g., stop words, numbers, words not in a vocabulary file) from the ngrams.
6. sort_ngrams.py: combines multiple ngram files into a single sorted file.
7. consolidate_ngrams.py: consolidates duplicate ngrams resulting from the previous steps.
8. index_and_create_vocabulary.py: numerically indexes a list of unigrams and creates a "vocabulary file" to screen multigrams.
9. create_yearly_files.py: splits the master corpus into yearly sub-corpora.
10. helpers/file_handler.py: helper script to simplify reading and writing files in the other modules.
11. helpers/print_jsonl_lines.py: helper script to view a snippet of ngrams in a JSONL file.
12. helpers/verify_sort.py: helper script to confirm whether an ngram file is properly sorted.
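The flow of the pre-processing steps above can be illustrated with a small, standard-library-only sketch. It is not the package's implementation: the function names, the record layout ("ngram" and "freq" keys), and the stop-word list are all hypothetical, and the raw line layout (ngram, then tab-separated "year,match_count,volume_count" fields) is an assumption about the Google Ngrams export format. Lemmatization is omitted because it requires external language data.

```python
import json
from collections import defaultdict

# Hypothetical stop-word list for illustration only.
STOP_WORDS = {"the", "of", "a", "an", "and"}


def parse_raw_line(line):
    """Parse 'ngram<TAB>year,matches,volumes<TAB>...' (assumed layout) into a dict."""
    ngram, *year_fields = line.rstrip("\n").split("\t")
    freq = {}
    for field in year_fields:
        year, match_count, _volume_count = field.split(",")
        freq[int(year)] = int(match_count)
    return {"ngram": ngram, "freq": freq}


def lowercase(record):
    """Lowercase the ngram text, leaving the counts untouched."""
    return {"ngram": record["ngram"].lower(), "freq": record["freq"]}


def filter_tokens(record, min_tokens=2):
    """Drop stop words and numeric tokens; return None if too few tokens remain."""
    kept = [t for t in record["ngram"].split()
            if t not in STOP_WORDS and not t.isdigit()]
    if len(kept) < min_tokens:
        return None
    return {"ngram": " ".join(kept), "freq": record["freq"]}


def consolidate(records):
    """Merge records with identical ngram text by summing their per-year counts."""
    merged = defaultdict(lambda: defaultdict(int))
    for rec in records:
        for year, count in rec["freq"].items():
            merged[rec["ngram"]][year] += count
    return [{"ngram": ngram, "freq": dict(sorted(years.items()))}
            for ngram, years in sorted(merged.items())]


def preprocess(raw_lines):
    """Run parse -> lowercase -> filter -> consolidate over raw text lines."""
    cleaned = []
    for line in raw_lines:
        rec = filter_tokens(lowercase(parse_raw_line(line)))
        if rec is not None:
            cleaned.append(rec)
    return consolidate(cleaned)


def split_by_year(records):
    """Group ngram counts into one list per year (yearly sub-corpora)."""
    by_year = defaultdict(list)
    for rec in records:
        for year, count in rec["freq"].items():
            by_year[year].append({"ngram": rec["ngram"], "count": count})
    return dict(by_year)


def to_jsonl(records):
    """Serialize records as JSONL text, one JSON object per line."""
    return "\n".join(json.dumps(r) for r in records)
```

The sketch shows why consolidation follows lowercasing and filtering: ngrams that differed only in case or in removed tokens become identical, so their counts need to be merged before the corpus is split by year.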

src/training_tools
1. train_ngrams.py: trains word2vec models on pre-processed multigram corpora.
2. evaluate_models.py: evaluates training quality on intrinsic benchmarks (i.e., similarity and analogy tests).
3. plotting.py: plots various types of model results.
4. w2v_model.py: a Python class (W2VModel) to aid in the evaluation, normalization, and alignment of yearly word2vec models.
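Normalization and alignment matter because word2vec models trained on different years live in arbitrarily rotated vector spaces, so their vectors are not directly comparable. One common way to align them is orthogonal Procrustes, sketched below with NumPy. This is an illustration of the general technique, not the W2VModel API: the function names are hypothetical, the source does not state which alignment method the package uses, and the sketch assumes both matrices hold vectors for the same words in the same row order.

```python
import numpy as np


def normalize_rows(matrix):
    """Scale each word vector (row) to unit L2 length."""
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    return matrix / np.where(norms == 0, 1.0, norms)


def procrustes_rotation(source, target):
    """Return the orthogonal matrix R minimizing ||source @ R - target||_F."""
    u, _singular_values, vt = np.linalg.svd(source.T @ target)
    return u @ vt


def align(source, target):
    """Normalize both matrices, then rotate source into the target's space."""
    src = normalize_rows(source)
    tgt = normalize_rows(target)
    return src @ procrustes_rotation(src, tgt)
```

After alignment, the cosine similarity between a word's vector in one year and its vector in another can be read as a rough measure of how much its usage has shifted.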

notebooks
1. workflow_unigrams.ipynb: Jupyter Notebook showing how to download and preprocess unigrams.
2. workflow_multigrams.ipynb: Jupyter Notebook showing how to download and preprocess multigrams.
3. workflow_training.ipynb: Jupyter Notebook showing how to train, evaluate, and plot results from word2vec models.

Finally, the training_results folder stores a file containing evaluation metrics for a set of models.

System Requirements

Downloading, processing, and training models on ngrams efficiently requires many processor cores and a large amount of memory. Unless you have a very powerful PC, you should run hist_w2v only on a high-performance computing (HPC) cluster or similar platform. On my university's HPC, I typically request 14 cores and 128 GB of RAM. A development priority is refactoring the code to run on individual systems.

Citing hist_w2v

If you use hist_w2v in your research or other publications, I kindly ask that you cite it. GitHub's "Cite this repository" option generates citation text from the CITATION.cff file.

License

This project is released under the MIT License.

Owner

Login: eric-d-knowles
Kind: user

Repositories: 1
Profile: https://github.com/eric-d-knowles

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
- family-names: "Knowles"
  given-names: "Eric D."
  orcid: "https://orcid.org/0000-0000-0000-0000"
title: "hist_w2v: Tools for training word2vec models on Google Ngrams"
version: 0.1.0
doi:
date-released: 2025-01-21
url: "https://github.com/eric-d-knowles/hist_w2v"

GitHub Events

Total

Release event: 3
Watch event: 2
Member event: 1
Public event: 1
Push event: 57
Create event: 7

Last Year

Release event: 3
Watch event: 2
Member event: 1
Public event: 1
Push event: 57
Create event: 7

Packages

Total packages: 1
Total downloads:
- pypi 26 last-month

Total dependent packages: 0
Total dependent repositories: 0
Total versions: 5
Total maintainers: 1

pypi.org: hist-w2v

Tools for downloading, processing, and training word2vec models on Google Ngrams

Homepage: https://github.com/eric-d-knowles/hist_w2v
Documentation: https://hist-w2v.readthedocs.io/
License: MIT
Latest release: 0.1.6
published about 1 year ago

Versions: 5
Dependent Packages: 0
Dependent Repositories: 0
Downloads: 26 Last month

Rankings

Dependent packages count: 9.7%

Average: 32.3%

Dependent repos count: 54.8%

Maintainers (1)

ericdknowles

Last synced: 11 months ago

Dependencies

pyproject.toml pypi

gensim >=4.3.0,<5.0
matplotlib >=3.4,<4.0
nltk >=3.7,<4.0
orjson >=3.6,<4.0
pandas >=1.3.0,<2.0
requests >=2.2,<3.0
seaborn >=0.11,<1.0
tqdm >=4.64,<5.0

src/hist_w2v.egg-info/requires.txt pypi

gensim <5.0,>=4.3.0
matplotlib <4.0,>=3.4
nltk <4.0,>=3.7
orjson <4.0,>=3.6
pandas <2.0,>=1.3.0
requests <3.0,>=2.2
seaborn <1.0,>=0.11
tqdm <5.0,>=4.64
