https://github.com/centre-for-humanities-computing/embedding-explorer

Tools for interactive visual exploration of semantic embeddings.

Science Score: 13.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (13.1%) to scientific vocabulary

Keywords

clustering embedding embeddings interactive knowledge-graph machine-learning networks nlp projection semantic
Last synced: 5 months ago

Repository

Basic Info
Statistics
  • Stars: 37
  • Watchers: 1
  • Forks: 6
  • Open Issues: 0
  • Releases: 0
Created almost 3 years ago · Last pushed over 1 year ago
Metadata Files
Readme License

README.md

embedding-explorer

Tools for interactive visual exploration of semantic embeddings.

Documentation

New in version 0.6.0

You can now pass a custom Neofuzz process to the explorer if you have specific requirements.

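```python
from embedding_explorer import show_network_explorer
from neofuzz import char_ngram_process

# corpus and embeddings are assumed to be defined already,
# for instance as in the Static Word Embeddings example.
process = char_ngram_process()
show_network_explorer(corpus=corpus, embeddings=embeddings, fuzzy_search=process)
```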
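To give a rough idea of what a character n-gram search process does, here is a minimal standard-library sketch. It is not how Neofuzz works internally: the helper names (`char_ngrams`, `jaccard`, `fuzzy_search`) and the choice of Jaccard similarity are assumptions made for illustration only.

```python
def char_ngrams(text, n=3):
    """Return the set of character n-grams of a lowercased, space-padded string."""
    padded = f" {text.lower()} "
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity between two sets (1.0 means identical)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def fuzzy_search(query, options, limit=3, n=3):
    """Rank options by n-gram overlap with the query, best match first."""
    query_grams = char_ngrams(query, n)
    scored = [(jaccard(query_grams, char_ngrams(o, n)), o) for o in options]
    # Sort by descending similarity, breaking ties alphabetically.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [option for _, option in scored[:limit]]


# Hypothetical toy vocabulary; the query contains a typo on purpose.
vocabulary = ["embedding", "embeddings", "bedding", "network", "projection"]
matches = fuzzy_search("embeding", vocabulary, limit=2)
print(matches)
```

A misspelled query still shares most of its trigrams with the intended word, which is why this kind of search tolerates typos in seed terms.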

Installation

Install embedding-explorer from PyPI:

```bash
pip install embedding-explorer
```

Semantic Explorer

embedding-explorer comes with a web application built for exploring semantic relations in a corpus with the help of embeddings. In this section I will show a couple of examples of running the app with different embedding models and corpora.

Static Word Embeddings

Let's say that you would like to explore semantic relations by investigating static word embeddings, such as those generated with Word2Vec or GloVe. You can do this by passing the vocabulary of the model and the embedding matrix to embedding-explorer.

For this example I will use Gensim, which can be installed from PyPI:

```bash
pip install gensim
```

We will download GloVe Twitter 25 from gensim's repositories.

```python
from gensim import downloader
from embedding_explorer import show_network_explorer

model = downloader.load("glove-twitter-25")
vocabulary = model.index_to_key
embeddings = model.vectors
show_network_explorer(corpus=vocabulary, embeddings=embeddings)
```

This will open a new browser window with the Explorer, where you can enter seed words and set the number of associations that you would like to see on the screen.

Screenshot of the Explorer
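To make the idea of "associations" for a seed word more concrete, here is a small sketch that ranks vocabulary items by cosine similarity to a seed. This only illustrates the general nearest-neighbour idea; whether embedding-explorer uses exactly this measure is an assumption on my part, and the toy vectors and the `top_associations` helper are made up for the example.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def top_associations(seed, vocabulary, embeddings, k=2):
    """Return the k words whose vectors are most similar to the seed's vector."""
    seed_vector = embeddings[vocabulary.index(seed)]
    scored = [
        (cosine_similarity(seed_vector, vector), word)
        for word, vector in zip(vocabulary, embeddings)
        if word != seed
    ]
    scored.sort(reverse=True)
    return [word for _, word in scored[:k]]


# Toy 3-dimensional vectors, invented for illustration only.
toy_vocabulary = ["cat", "dog", "kitten", "car"]
toy_embeddings = [
    [0.9, 0.1, 0.0],
    [0.7, 0.3, 0.1],
    [0.8, 0.2, 0.0],
    [0.0, 0.1, 0.9],
]
associations = top_associations("cat", toy_vocabulary, toy_embeddings, k=2)
print(associations)
```

With a real model you would use the `vocabulary` and `embeddings` from the Gensim example instead of the toy lists, ideally with a vectorized implementation.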

Dynamic Embedding Models

If you want to explore relations in a corpus using, say, a sentence transformer, which creates contextually aware embeddings, you can do so by specifying a scikit-learn compatible vectorizer model instead of passing along an embedding matrix.

One clear advantage here is that you can input arbitrary sequences as seeds instead of a predetermined set of texts.

We are going to use the package embetter for embedding documents.

```bash
pip install embetter[sentence-trf]
```

I decided to examine four-grams in the 20newsgroups dataset. We will limit the number of four-grams to 4000 so we only see the most relevant ones.

```python
from embetter.text import SentenceEncoder
from embedding_explorer import show_network_explorer
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer

corpus = fetch_20newsgroups(
    remove=("headers", "footers", "quotes"),
).data

# We will use CountVectorizer for obtaining the possible n-grams
four_grams = (
    CountVectorizer(
        stop_words="english", ngram_range=(4, 4), max_features=4000
    )
    .fit(corpus)
    .get_feature_names_out()
)

model = SentenceEncoder()
show_network_explorer(corpus=four_grams, vectorizer=model)
```

Screenshot of the Explorer
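If it is unclear what limiting the number of n-grams does, the following standard-library sketch mimics the idea: count every word n-gram in the corpus and keep only the most frequent ones. It is a simplification of `CountVectorizer` (the tokenizer, the tiny stop word list, and the helper names are my own assumptions), and it uses bigrams on three made-up sentences so the result is easy to check by hand.

```python
import re
from collections import Counter

# Tiny illustrative stop word list, much shorter than scikit-learn's "english" list.
STOP_WORDS = {"the", "a", "of", "on", "is", "in", "and"}


def word_ngrams(text, n):
    """Lowercase, drop stop words, then join each run of n consecutive tokens."""
    tokens = [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOP_WORDS]
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def most_frequent_ngrams(documents, n=2, max_features=2):
    """Keep the max_features most frequent n-grams, returned in alphabetical order."""
    counts = Counter()
    for doc in documents:
        counts.update(word_ngrams(doc, n))
    return sorted(gram for gram, _ in counts.most_common(max_features))


toy_docs = [
    "the space shuttle launch is on monday",
    "a space shuttle launch in florida",
    "shuttle launch delayed",
]
top_bigrams = most_frequent_ngrams(toy_docs, n=2, max_features=2)
print(top_bigrams)
```

Note that stop words are removed before the n-grams are assembled, so an n-gram can span words that were not adjacent in the original text.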

Projection and Clustering

:star2: New in version 0.5.0

In embedding-explorer you can now inspect corpora or embeddings by projecting them into 2D space, and optionally clustering observations.

In this example I'm going to demonstrate how to visualize 20 Newsgroups using various projection and clustering methods in embedding-explorer. We are going to use sentence transformers to encode texts.

```python
from embetter.text import SentenceEncoder
from sklearn.datasets import fetch_20newsgroups

from embedding_explorer import show_clustering

newsgroups = fetch_20newsgroups(
    remove=("headers", "footers", "quotes"),
)
corpus = newsgroups.data

show_clustering(corpus=corpus, vectorizer=SentenceEncoder())
```

In the app you can choose whether and how you want to reduce embedding dimensionality, how you want to cluster the embeddings, and how you intend to project them onto the 2D plane.

Screenshot of the Clustering parameters

After this you can investigate the semantic structure of your corpus interactively.

Screenshot of the Clustering
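For readers who have not worked with clustering before, here is a bare-bones sketch of the clustering step in isolation, using a plain k-means loop on a handful of 2D points. This is not the code embedding-explorer runs: the choice of k-means, the deterministic initialization, and the toy points are all assumptions made to keep the example short and reproducible.

```python
import math


def nearest_centroid(point, centroids):
    """Index of the centroid closest to the point (Euclidean distance)."""
    distances = [math.dist(point, c) for c in centroids]
    return distances.index(min(distances))


def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return [sum(coords) / len(points) for coords in zip(*points)]


def k_means(points, k, n_iter=10):
    """Assign each point to one of k clusters and return the list of labels."""
    # Deterministic initialization: use the first k points as starting centroids.
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(n_iter):
        labels = [nearest_centroid(p, centroids) for p in points]
        for cluster in range(k):
            members = [p for p, label in zip(points, labels) if label == cluster]
            if members:  # leave the centroid in place if its cluster is empty
                centroids[cluster] = mean(members)
    return labels


# Two visibly separated groups of toy points, invented for illustration.
toy_points = [[0.0, 0.1], [5.0, 5.1], [0.2, 0.0], [5.2, 4.9], [0.1, 0.2], [4.9, 5.0]]
cluster_labels = k_means(toy_points, k=2)
print(cluster_labels)
```

In practice the points would be (possibly dimensionality-reduced) document embeddings rather than hand-written coordinates, and you would rely on a library implementation.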

Dashboard

If you have multiple models to examine the same corpus, or multiple corpora that you want to examine with the same model, you can create a dashboard containing all of these options. Users can click on an option to go to the appropriate explorer page.

For this we will have to assemble these options into a list of Card objects, which contain the information about the individual pages.

In the following example I will set up two different sentence transformers with the same four-gram corpus from the Dynamic Embedding Models example.

```python
from embetter.text import SentenceEncoder
from embedding_explorer import show_dashboard
from embedding_explorer.cards import NetworkCard, ClusteringCard

cards = [
    NetworkCard("MiniLM", corpus=four_grams, vectorizer=SentenceEncoder("all-MiniLM-L12-v2")),
    NetworkCard("MPNET", corpus=four_grams, vectorizer=SentenceEncoder("all-mpnet-base-v2")),
]
show_dashboard(cards)
```

Screenshot of the Dashboard

Owner

  • Name: Center for Humanities Computing Aarhus
  • Login: centre-for-humanities-computing
  • Kind: organization
  • Email: chcaa@cas.au.dk
  • Location: Aarhus, Denmark

GitHub Events

Total
  • Watch event: 9
  • Fork event: 4
Last Year
  • Watch event: 9
  • Fork event: 4

Issues and Pull Requests

Last synced: 7 months ago

All Time
  • Total issues: 0
  • Total pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Total issue authors: 0
  • Total pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0

Packages

  • Total packages: 1
  • Total downloads:
    • pypi 95 last-month
  • Total dependent packages: 0
  • Total dependent repositories: 0
  • Total versions: 15
  • Total maintainers: 1
pypi.org: embedding-explorer

Tools for interactive visual inspection of semantic embeddings.

  • Versions: 15
  • Dependent Packages: 0
  • Dependent Repositories: 0
  • Downloads: 95 Last month
Rankings
Dependent packages count: 6.9%
Average: 20.1%
Downloads: 22.8%
Dependent repos count: 30.6%
Maintainers (1)
Last synced: 6 months ago

Dependencies

pyproject.toml pypi
  • dash ^2.7.1
  • dash-extensions ^0.1.10
  • dash-iconify ^0.1.2
  • dash-mantine-components ^0.11.1
  • neofuzz >=0.1.2
  • numpy >=1.23.0
  • pandas >=1.5.2
  • python >=3.8.0
  • scikit-learn ^1.1.0
  • wordcloud ^1.8.2.2
.github/workflows/static.yml actions
  • actions/checkout v3 composite
  • actions/configure-pages v2 composite
  • actions/deploy-pages v1 composite
  • actions/upload-pages-artifact v1 composite