https://github.com/ausgerechnet/semmap

Semantic Maps for Association Tables

Repository

Basic Info
  • Host: GitHub
  • Owner: ausgerechnet
  • License: gpl-3.0
  • Language: HTML
  • Default Branch: master
  • Size: 1.53 MB
Statistics
  • Stars: 0
  • Watchers: 0
  • Forks: 0
  • Open Issues: 0
  • Releases: 0
Created almost 6 years ago · Last pushed about 1 year ago
README.md

Semantic Map

This is a Python 3 module for creating and updating the coordinates of semantic maps. Semantic maps are two-dimensional representations of sets of type embeddings, where the types are usually obtained from collocation or keyword analyses.

The module tackles two problems:

  • out-of-vocabulary (OOV) items in embedding look-up: coordinates can be generated for all items, including items without embeddings
  • iterative projection onto lower-dimensional coordinates: items can be added to an existing projection

Limitations:

  • no functionality for context-aware embeddings (ELMo, BERT)
  • to update the store dynamically (e.g. for multi-word units, MWUs), I will have to migrate to another backend, e.g. FAISS

Installation

pip install git+https://github.com/ausgerechnet/semmap.git

Embeddings

Import

  • module supports creation of embeddings via SentenceTransformers
  • it can also read in pre-created embeddings from FastText ("C-text") format
    • NB: pre-computed embeddings might come without n-gram character encodings (or other subtoken representations)

CLI for creating embeddings

  • creation of embeddings via transformers
    • input: .tsv
    • output: embeddings-keys.txt, embeddings-representations.tsv → embeddings.tsv
  • storage of embeddings via annoy
    • input: embeddings.tsv
    • output: embeddings.ann

Storage

  • by default, embeddings are stored in a custom storage (EmbeddingsStore) based on annoy
    • the EmbeddingsStore config file name ends in ".semmap"; the file links to the annoy database and the type dictionary
  • alternatively, embeddings can be stored in a pymagnitude database

OOV functionality

  • pymagnitude:

    • construct character n-gram embeddings during import time
    • center and normalise randomly (but reproducibly)
    • interpolate with in-vocabulary words via string similarity
    • improved string similarity: morphology-aware for English, shrinking repeated characters, ...
  • EmbeddingsStore:

    • default: create on the fly; only reasonable for SentenceTransformers (FastText would also qualify, but is not implemented)
    • random (but reproducible) init
    • based on string similarity via levenshtein (edit distance)
  • NB FastText:

    • constructs character n-gram embeddings during initial encoding
    • yields OOV support for all words if at least one character n-gram has been observed (trivial for unigrams: alphabet)
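
Two of the EmbeddingsStore strategies listed above, the reproducible random initialisation and the look-up via edit distance, can be sketched in plain Python. This is a minimal illustration and not the module's implementation: the function names, the fallback order, and the `max_distance` threshold are assumptions made for the example.

```python
import hashlib
import random


def levenshtein(a, b):
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                        # deletion
                current[j - 1] + 1,                     # insertion
                previous[j - 1] + (char_a != char_b),   # substitution
            ))
        previous = current
    return previous[-1]


def random_vector(item, dim):
    """Pseudo-random vector seeded by the item string, so repeated calls agree."""
    digest = hashlib.sha256(item.encode("utf-8")).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    return [rng.uniform(-1.0, 1.0) for _ in range(dim)]


def oov_vector(item, vocabulary, max_distance=2):
    """Hypothetical fallback chain: stored vector, then closest string, then random."""
    if item in vocabulary:
        return vocabulary[item]
    closest = min(vocabulary, key=lambda known: levenshtein(item, known))
    if levenshtein(item, closest) <= max_distance:
        return vocabulary[closest]
    dim = len(next(iter(vocabulary.values())))
    return random_vector(item, dim)
```

With a toy vocabulary such as `{"colour": [1.0, 0.0], "map": [0.0, 1.0]}`, the unseen item `"color"` would borrow the vector of `"colour"`, while a string with no close match receives a random vector that is identical on every call.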

Nearest neighbours

  • annoy functionality (fixed lookup trees)
  • pymagnitude also uses annoy

Working with semantic maps

API

  • central API offered by semmap.SemanticSpace
  • init with a path (must end in "magnitude" or "semmap")

Dimensionality reduction

  • default: sklearn.manifold.TSNE
  • umap.UMAP
  • openTSNE.TSNE
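
As a rough illustration of the default reducer, the following sketch calls `sklearn.manifold.TSNE` directly on synthetic vectors. It does not go through `semmap.SemanticSpace`; the random data, the dimensionality, and the parameter values are assumptions chosen only to keep the example small.

```python
import numpy as np
from sklearn.manifold import TSNE

# synthetic stand-in for 40 type embeddings with 50 dimensions each
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(40, 50))

# perplexity must be smaller than the number of items;
# random_state makes the layout reproducible
tsne = TSNE(n_components=2, perplexity=10, random_state=0)
coordinates = tsne.fit_transform(embeddings)  # array of shape (40, 2)
```

`umap.UMAP(n_components=2)` exposes the same `fit_transform` interface, so swapping reducers in a sketch like this only changes the constructor line.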

Iterative projection

  • default: convex combination of 2d mapping of similar types (cosine similarity of high-dimensional embeddings)
  • random (but reproducible) projection
  • iterative t-SNE (NotImplemented, openTSNE)
  • iterative UMAP (NotImplemented, umap-learn)
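
The default strategy, a convex combination of the 2D coordinates of similar types, might look roughly like the sketch below. The function name, the number of neighbours, and the clipping of negative similarities are assumptions for illustration; the module may weight neighbours differently.

```python
import math


def cosine(u, v):
    """Cosine similarity of two vectors given as sequences of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def project_new_item(vector, embeddings, coordinates, n_neighbours=3):
    """Place a new item at a weighted average of its neighbours' 2D coordinates."""
    # rank already-projected items by similarity in the high-dimensional space
    scored = sorted(
        ((cosine(vector, embeddings[key]), key) for key in coordinates),
        reverse=True,
    )[:n_neighbours]
    # clip negative similarities so that all weights are non-negative
    weights = [(max(sim, 0.0), key) for sim, key in scored]
    total = sum(w for w, _ in weights)
    if total == 0.0:
        # no positively similar neighbour: fall back to equal weights
        weights = [(1.0, key) for _, key in scored]
        total = float(len(weights))
    # weights divided by their sum are non-negative and add up to 1
    x = sum(w * coordinates[key][0] for w, key in weights) / total
    y = sum(w * coordinates[key][1] for w, key in weights) / total
    return x, y
```

Because the normalised weights are non-negative and sum to one, the new point always lies inside the convex hull of its neighbours, so adding an item never moves or stretches the existing map.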

Roadmap

  • [ ] PyPI
  • [x] github tests
  • [x] OOV
  • [x] use annoy instead of pymagnitude
  • [ ] use openTSNE instead of sklearn -- fails when there are many items ("IndexError: Vector has wrong length (expected 300, got 1000)")
  • [ ] iterative projection (add-item) with openTSNE and UMAP

Owner

  • Name: Philipp Heinrich
  • Login: ausgerechnet
  • Kind: user
  • Location: Erlangen
  • Company: @fau-klue

GitHub Events

Total
  • Delete event: 1
  • Push event: 10
  • Pull request event: 2
  • Create event: 1
Last Year
  • Delete event: 1
  • Push event: 10
  • Pull request event: 2
  • Create event: 1

Dependencies

pyproject.toml pypi
requirements-dev.txt pypi
  • adjustText * development
  • bokeh * development
  • matplotlib * development
  • plotnine * development
  • pytest ==7.4.0 development
  • pytest-cov ==4.1.0 development
requirements.txt pypi
  • pandas >=2.0
  • pymagnitude-lite >=0.1.143
  • scikit-learn >=1.3.0
  • umap-learn >=0.5.5
setup.py pypi
  • pandas >=2.0
  • pymagnitude-lite >=0.1.143
  • scikit-learn >=1.3.0
  • umap-learn >=0.5.5