https://github.com/ausgerechnet/semmap

Semantic Maps for Association Tables

Repository

Basic Info

Host: GitHub
Owner: ausgerechnet
License: gpl-3.0
Language: HTML
Default Branch: master
Size: 1.53 MB

Statistics

Stars: 0
Watchers: 0
Forks: 0
Open Issues: 0
Releases: 0

Created almost 6 years ago · Last pushed about 1 year ago

Metadata Files

Readme License

Semantic Map

This is a Python 3 module for creating and updating the coordinates of semantic maps. Semantic maps are two-dimensional representations of sets of type embeddings, where the types are usually obtained from collocation or keyword analyses.

The module tackles two problems:
- OOV in embeddings look-up (coordinates can be generated for all items, including items without embeddings)
- iterative projection onto lower-dimensional coordinates (items can be added to an existing projection)

Limitations:
- no functionality for context-aware embeddings (ELMo, BERT)
- for updating the store dynamically (e.g. for MWUs), I will have to migrate e.g. to FAISS

Installation

pip install git+https://github.com/ausgerechnet/semmap.git

Embeddings

Import

- the module supports creating embeddings via SentenceTransformers
- it can also read pre-computed embeddings in FastText ("C-text") format
  - NB: pre-computed embeddings might come without character n-gram encodings (or other subtoken representations)

CLI for creating embeddings

- creation of embeddings via transformers
  - input: .tsv
  - output: embeddings-keys.txt, embeddings-representations.tsv → embeddings.tsv
- storage of embeddings via annoy
  - input: embeddings.tsv
  - output: embeddings.ann
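
The step that combines the two intermediate files into embeddings.tsv could look roughly like the sketch below. It assumes that embeddings-keys.txt holds one type per line and that embeddings-representations.tsv holds the matching tab-separated vector on the same line number; the README does not specify the layout, and `merge_embeddings` is a hypothetical helper, not part of the semmap CLI.

```python
import io


def merge_embeddings(keys_file, representations_file):
    """Join each key with the vector on the same line (assumed file layout)."""
    keys = [line.rstrip("\n") for line in keys_file]
    rows = [line.rstrip("\n") for line in representations_file]
    if len(keys) != len(rows):
        raise ValueError(f"{len(keys)} keys but {len(rows)} representations")
    # one output line per type: key, then the tab-separated vector components
    return [f"{key}\t{row}" for key, row in zip(keys, rows)]


# in-memory stand-ins for embeddings-keys.txt and embeddings-representations.tsv
keys_file = io.StringIO("cat\ndog\n")
representations_file = io.StringIO("0.1\t0.2\n0.3\t0.4\n")
merged = merge_embeddings(keys_file, representations_file)
print("\n".join(merged))
```

Checking that both files have the same number of lines catches the most likely failure of such a two-file layout, namely keys and vectors drifting out of alignment.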

Storage

- by default, embeddings are stored in a custom store (EmbeddingsStore) based on annoy
  - the EmbeddingsStore config file ends in ".semmap" and links to the annoy database and the type dictionary
- alternatively, embeddings can be stored in a pymagnitude database

OOV functionality

- pymagnitude:
  - construct character n-gram embeddings during import time
  - center and normalise randomly (but reproducibly)
  - interpolate with in-vocabulary words via string similarity
  - improved string similarity: morphology-aware for English, shrinking repeated characters, ...
- EmbeddingsStore:
  - default: create on the fly -- only reasonable for SentenceTransformers (or FastText, but NotImplemented)
  - random (but reproducible) init
  - based on string similarity via Levenshtein (edit distance)
- NB FastText:
  - constructs character n-gram embeddings during initial encoding (e.g. FastText)
  - yields OOV support for all words if at least one character n-gram has been observed (trivial for unigrams: alphabet)
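
Two of the EmbeddingsStore strategies above, the reproducible random init and the look-up via edit distance, can be illustrated with a minimal standard-library sketch. This is not the module's implementation: the names `levenshtein`, `random_vector` and `oov_vector`, the `max_distance` threshold, and the choice to copy the vector of the closest in-vocabulary type are all assumptions made for illustration.

```python
import hashlib
import random


def levenshtein(a, b):
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def random_vector(item, dim):
    """Pseudo-random vector whose seed is derived from the item, so it is reproducible."""
    digest = hashlib.sha256(item.encode("utf-8")).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    return [rng.uniform(-1.0, 1.0) for _ in range(dim)]


def oov_vector(item, store, max_distance=2):
    """Vector of the closest known type if it is close enough, else a seeded random vector."""
    if item in store:
        return store[item]
    dim = len(next(iter(store.values())))  # assumes a non-empty store
    closest = min(store, key=lambda known: levenshtein(item, known))
    if levenshtein(item, closest) <= max_distance:
        return store[closest]
    return random_vector(item, dim)


store = {"colour": [0.9, 0.1], "house": [0.1, 0.8]}
print(oov_vector("color", store))       # one edit away from "colour"
print(oov_vector("zzzzzzzzzz", store))  # nothing similar: seeded random vector
```

Seeding from a hash of the item rather than from Python's built-in `hash()` keeps the vectors stable across interpreter runs, since string hashing is randomised per process by default.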

Nearest neighbours

- annoy functionality (fixed lookup trees)
- pymagnitude also uses annoy

Working with semantic maps

API

- the central API is offered by semmap.SemanticSpace
- init with a path (must end in "magnitude" or "semmap")

dimensionality reduction

- default: sklearn.manifold.TSNE
- umap.UMAP
- openTSNE.TSNE

iterative projection

- default: convex combination of the 2D coordinates of similar types (similarity measured as cosine similarity of the high-dimensional embeddings)
- random (but reproducible) projection
- iterative t-SNE (NotImplemented, openTSNE)
- iterative UMAP (NotImplemented, umap-learn)
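
The default strategy can be sketched as follows: a new item is placed at a weighted average of the 2D positions of its most similar, already projected types. The function name `project_new_item`, the neighbourhood size `k`, and the use of normalised cosine similarities as weights are assumptions for illustration; the module may choose neighbours and weights differently.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equally long vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def project_new_item(vector, embeddings, coordinates, k=3):
    """Convex combination of the 2D coordinates of the k most similar projected types."""
    scored = sorted(((cosine(vector, embeddings[t]), t) for t in coordinates),
                    reverse=True)[:k]
    # keep only positively similar neighbours so that all weights are non-negative
    neighbours = [(s, t) for s, t in scored if s > 0]
    if not neighbours:
        raise ValueError("no positively similar neighbour found")
    total = sum(s for s, _ in neighbours)
    # weights s / total are non-negative and sum to 1, hence a convex combination
    x = sum(s / total * coordinates[t][0] for s, t in neighbours)
    y = sum(s / total * coordinates[t][1] for s, t in neighbours)
    return (x, y)


embeddings = {"a": [1.0, 0.0], "b": [0.0, 1.0]}    # high-dimensional vectors (toy size)
coordinates = {"a": (0.0, 0.0), "b": (10.0, 0.0)}  # existing 2D projection
print(project_new_item([1.0, 1.0], embeddings, coordinates))
```

Because the weights are non-negative and sum to one, the new point always falls inside the convex hull of its neighbours, so adding items never moves or distorts the existing map.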

Roadmap

- [ ] PyPI
- [x] GitHub tests
- [x] OOV
- [x] use annoy instead of pymagnitude
- [ ] use openTSNE instead of sklearn -- fails when there are many items ("IndexError: Vector has wrong length (expected 300, got 1000)")
- [ ] iterative projection (add-item) with openTSNE and UMAP

Owner

Name: Philipp Heinrich
Login: ausgerechnet
Kind: user
Location: Erlangen
Company: @fau-klue

Website: https://philipp-heinrich.eu
Repositories: 2
Profile: https://github.com/ausgerechnet

GitHub Events

Total

Delete event: 1
Push event: 10
Pull request event: 2
Create event: 1

Last Year

Delete event: 1
Push event: 10
Pull request event: 2
Create event: 1

Dependencies

pyproject.toml pypi

requirements-dev.txt pypi

adjustText * development
bokeh * development
matplotlib * development
plotnine * development
pytest ==7.4.0 development
pytest-cov ==4.1.0 development

requirements.txt pypi

pandas >=2.0
pymagnitude-lite >=0.1.143
scikit-learn >=1.3.0
umap-learn >=0.5.5

setup.py pypi

pandas >=2.0
pymagnitude-lite >=0.1.143
scikit-learn >=1.3.0
umap-learn >=0.5.5
