https://github.com/centre-for-humanities-computing/embedding-projection

WIP, name is temporary

Science Score: 26.0%

This score indicates how likely this project is to be science-related based on various indicators:

○ CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
✓ .zenodo.json file: Found .zenodo.json file
○ DOI references
○ Academic publication links
○ Academic email domains
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (6.3%) to scientific vocabulary

Last synced: 9 months ago

Repository

Basic Info

Host: GitHub
Owner: centre-for-humanities-computing
License: mit
Language: Python
Default Branch: main
Size: 94.7 MB

Statistics

Stars: 0
Watchers: 0
Forks: 1
Open Issues: 0
Releases: 0

Created about 1 year ago · Last pushed 10 months ago

Metadata Files

Readme License

embedding-projection

A project developing a technique for extracting information from contextual sentence embeddings (sBERT) by projecting the embeddings onto a concept vector (known as a steering vector in the LLM decoder steering literature).
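As a minimal sketch of the projection step, the snippet below computes the scalar projection of each embedding onto a unit-length concept axis. The function name `project_onto_concept` and the toy 3-dimensional vectors are illustrative assumptions, not the repository's code; real embeddings would come from an sBERT encoder.

```python
import numpy as np

def project_onto_concept(embeddings, concept_vector):
    """Return the scalar projection of each row of `embeddings` onto the concept axis."""
    # Normalise so the score does not depend on the length of the concept vector.
    unit = concept_vector / np.linalg.norm(concept_vector)
    return embeddings @ unit

# Toy stand-ins for sentence embeddings (hypothetical values).
corpus = np.array([
    [0.9, 0.1, 0.0],
    [-0.7, 0.3, 0.1],
    [0.0, 0.5, 0.5],
])

# Hypothetical concept vector pointing along the first dimension.
concept_vector = np.array([2.0, 0.0, 0.0])

scores = project_onto_concept(corpus, concept_vector)
print(scores)
```

Each text ends up with one number: positive scores lie toward the concept, negative scores lie away from it, and a score near zero means the embedding is roughly orthogonal to the concept axis.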

The projection of a corpus shows that different classes are separable on the concept axis:

Projection of Reviews onto Sentiment Vector

Words corresponding to the 10 smallest embeddings: ['worse', 'terrible', 'sucked', 'horrible', 'worst', 'bad', 'rotten', 'unacceptable', 'stupidity', 'awful']

Words corresponding to the 10 largest embeddings: ['pleasure', 'anytime', 'admired', 'admire', 'fabulous', 'classical', 'beloved', 'romantic', 'anthologies', 'lovely']
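Lists like these can be produced by sorting words by their projection score and taking both ends of the ordering. The sketch below assumes the scores are already computed; the helper `extreme_words` and the example scores are made up for illustration.

```python
import numpy as np

def extreme_words(words, scores, k=10):
    """Return the k lowest-scoring and k highest-scoring words."""
    order = np.argsort(scores)  # indices from smallest to largest score
    lowest = [words[i] for i in order[:k]]
    highest = [words[i] for i in order[::-1][:k]]
    return lowest, highest

# Hypothetical words and projection scores.
words = ["awful", "fine", "lovely", "bad", "fabulous"]
scores = np.array([-0.9, 0.1, 0.8, -0.6, 0.95])

lowest, highest = extreme_words(words, scores, k=2)
print(lowest, highest)
```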

To check whether the projection scores correlate with human annotators:

Human Annotator Correlation with Semantic Projection

It seems there is a rather strong correlation between the average human annotator and the projection method!
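The README does not say which correlation coefficient is used, so the sketch below shows two common choices as an assumption: Pearson (linear) and a simple Spearman (rank-based) that is only valid when there are no tied values. All numbers are invented for illustration.

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation between two 1-D arrays."""
    return float(np.corrcoef(x, y)[0, 1])

def ranks(values):
    """Rank positions (0 = smallest); assumes no tied values."""
    return np.argsort(np.argsort(values))

def spearman_no_ties(x, y):
    """Spearman correlation computed as Pearson on ranks (no tie handling)."""
    return pearson(ranks(x), ranks(y))

# Hypothetical projection scores and mean human ratings for five sentences.
projection_scores = np.array([-1.2, -0.4, 0.1, 0.9, 1.5])
human_ratings = np.array([1.0, 4.0, 5.0, 7.0, 9.0])

r = pearson(projection_scores, human_ratings)
rho = spearman_no_ties(projection_scores, human_ratings)
print(r, rho)
```

For real data with tied ratings, a library routine with proper tie handling would be the safer choice.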

Let's get an idea of what a correlation baseline would even look like.

Annotations of semantics vary when made on a continuous scale. Different computational methods produce different scores, but different humans also correlate differently with each other.

Annotator Correlation
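One way to get such a baseline is to correlate every pair of annotators and look at the spread. The sketch below assumes a ratings matrix with one row per sentence and one column per annotator; the values are invented and Pearson correlation is an assumption.

```python
import numpy as np

# Hypothetical ratings: rows are sentences, columns are annotators.
ratings = np.array([
    [2.0, 3.0, 1.0],
    [5.0, 6.0, 4.0],
    [7.0, 6.0, 8.0],
    [9.0, 8.0, 9.0],
])

# Annotator-by-annotator correlation matrix (columns are the variables).
corr = np.corrcoef(ratings, rowvar=False)

# Keep each annotator pair once by reading the upper triangle above the diagonal.
pairs = corr[np.triu_indices_from(corr, k=1)]
mean_pairwise = float(pairs.mean())
print(pairs, mean_pairwise)
```

The mean pairwise value gives a rough ceiling: a method that correlates with the average annotator about as well as annotators correlate with each other is performing close to human agreement.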

Defining the positive- and negative centroids from gold-standard annotations.

In an attempt to improve the correlation with the gold standard, I defined the model from binarized gold-standard human ratings: ratings of 7 or higher count as positive, and ratings of 3 or lower count as negative. This improved the correlation by 0.02:

![./img/Scatterplotfiction4wPersonMiniLM.png]
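A minimal sketch of this binarization, assuming the concept vector is the difference between the positive and negative centroids (the README does not spell out this step). The helper `concept_from_ratings`, its parameter names, and the toy 2-dimensional embeddings are hypothetical.

```python
import numpy as np

def concept_from_ratings(embeddings, ratings, pos_min=7, neg_max=3):
    """Build a concept vector from embeddings with extreme gold-standard ratings."""
    ratings = np.asarray(ratings)
    # Ratings strictly between neg_max and pos_min are ignored.
    pos_centroid = embeddings[ratings >= pos_min].mean(axis=0)
    neg_centroid = embeddings[ratings <= neg_max].mean(axis=0)
    # Assumption: the concept axis points from the negative to the positive centroid.
    return pos_centroid - neg_centroid

# Toy embeddings and ratings (invented values).
embeddings = np.array([
    [1.0, 0.0],
    [0.8, 0.2],
    [0.0, 1.0],
    [-0.9, 0.1],
    [-1.0, 0.0],
])
ratings = [9, 7, 5, 3, 1]

concept_vector = concept_from_ratings(embeddings, ratings)
print(concept_vector)
```

Dropping the mid-range ratings keeps ambiguous examples from pulling the two centroids toward each other.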

Switching to the larger MPNET-base-v2 model improved the correlation further:

![./img/Scatterplotfiction4wPersonMPNET.png]

We now try the method in a more complex setting: Linguistic Acceptability.

This task can be found in the glue/cola test set, formally defined as: "The Corpus of Linguistic Acceptability consists of English acceptability judgments drawn from books and journal articles on linguistic theory. Each example is a sequence of words annotated with whether it is a grammatical English sentence."

Fiction4 Dataset used: https://huggingface.co/datasets/chcaa/fiction4sentiment

EmoBank Dataset used: https://github.com/JULIELab/EmoBank/blob/master/corpus/emobank.csv

Owner

Name: Center for Humanities Computing Aarhus
Login: centre-for-humanities-computing
Kind: organization
Email: chcaa@cas.au.dk
Location: Aarhus, Denmark

Website: https://chc.au.dk/
Repositories: 130
Profile: https://github.com/centre-for-humanities-computing

GitHub Events

Total

Issue comment event: 1
Push event: 1
Pull request event: 2
Fork event: 1
Create event: 2

Last Year

Issue comment event: 1
Push event: 1
Pull request event: 2
Fork event: 1
Create event: 2

Dependencies

requirements.txt pypi
