https://github.com/centre-for-humanities-computing/embedding-projection
WIP, name is temporary
Science Score: 26.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file: Found codemeta.json file
- ✓ .zenodo.json file: Found .zenodo.json file
- ○ DOI references
- ○ Academic publication links
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: Low similarity (6.3%) to scientific vocabulary
Repository
Basic Info
- Host: GitHub
- Owner: centre-for-humanities-computing
- License: mit
- Language: Python
- Default Branch: main
- Size: 94.7 MB
Statistics
- Stars: 0
- Watchers: 0
- Forks: 1
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
embedding-projection
A project developing a technique for extracting information from contextual sentence embeddings (sBERT) by projecting the embeddings onto a concept vector (i.e. a steering vector in the LLM decoder-steering literature).
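The projection step can be sketched in a few lines. This is a minimal illustration, not the repository's implementation: it assumes the concept vector is the difference between a positive and a negative centroid, and it uses toy 3-dimensional vectors in place of real sBERT embeddings. The helper names `centroid`, `concept_vector` and `project` are hypothetical.

```python
import math

def centroid(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def concept_vector(positive, negative):
    """Assumed definition: positive centroid minus negative centroid."""
    pos_c = centroid(positive)
    neg_c = centroid(negative)
    return [p - q for p, q in zip(pos_c, neg_c)]

def project(embedding, concept):
    """Scalar projection of an embedding onto the concept axis."""
    norm = math.sqrt(sum(c * c for c in concept))
    return sum(e * c for e, c in zip(embedding, concept)) / norm

# Toy vectors standing in for sentence embeddings (assumption).
positive = [[1.0, 0.0, 0.0], [0.8, 0.2, 0.0]]
negative = [[-1.0, 0.0, 0.0], [-0.8, 0.2, 0.0]]
concept = concept_vector(positive, negative)

corpus = [[0.5, 0.1, 0.3], [-0.4, 0.9, 0.0]]
scores = [project(e, concept) for e in corpus]
print(scores)
```

Each score is a single number per text, so a whole corpus can be sorted or plotted along the concept axis.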
The projection of a corpus shows that different classes are separable on the concept axis:

Words corresponding to the 10 smallest embeddings: ['worse', 'terrible', 'sucked', 'horrible', 'worst', 'bad', 'rotten', 'unacceptable', 'stupidity', 'awful']
Words corresponding to the 10 largest embeddings: ['pleasure', 'anytime', 'admired', 'admire', 'fabulous', 'classical', 'beloved', 'romantic', 'anthologies', 'lovely']
To check if the annotation correlates with human annotators:

It seems there is a rather strong correlation between the average human annotator and the projection method!
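A comparison of this kind could look like the sketch below. The README does not say which correlation coefficient was used, so Spearman rank correlation is an assumption here, and the scores and ratings are invented toy values rather than results from the project.

```python
import math

def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Toy values (assumption): projection score and mean human rating per sentence.
projection_scores = [-0.8, -0.2, 0.1, 0.4, 0.9]
human_mean = [2.0, 4.5, 4.0, 6.5, 8.0]
rho = spearman(projection_scores, human_mean)
print(round(rho, 3))
```

A rank-based coefficient only asks whether the two orderings agree, which suits a projection score whose scale is arbitrary.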
Let's get an idea of what a correlation baseline would even look like.
Annotations of semantics vary when they are made on a continuous scale. Different computational methods produce different scores, but different humans also correlate differently with each other.
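One way to get such a baseline is to average the correlation over every pair of human annotators. The sketch below assumes Pearson correlation and uses invented ratings; the annotator names and the function `mean_pairwise_correlation` are hypothetical.

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def mean_pairwise_correlation(annotators):
    """Average correlation over all unordered pairs of annotators."""
    pairs = list(combinations(annotators.values(), 2))
    return sum(pearson(a, b) for a, b in pairs) / len(pairs)

# Toy ratings (assumption): three annotators rating the same five sentences.
annotators = {
    "annotator_a": [1, 3, 5, 7, 9],
    "annotator_b": [2, 3, 6, 6, 9],
    "annotator_c": [1, 4, 4, 8, 8],
}
baseline = mean_pairwise_correlation(annotators)
print(round(baseline, 3))
```

If humans only agree with each other up to some level, a method that correlates with the human average at a similar level is already close to the practical ceiling.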

Defining the positive and negative centroids from gold-standard annotations.
In an attempt to improve the correlation with the gold standard, I defined the model by converting the gold-standard human ratings into binary positive/negative labels: a rating >= 7 counts as positive and a rating <= 3 counts as negative. This improved the correlation by 0.02:
![](./img/Scatterplotfiction4wPersonMiniLM.png)
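The thresholding and centroid construction described above can be sketched as follows. The thresholds (>= 7 and <= 3) come from the text; dropping the ratings in between is an assumption, as are the toy 2-dimensional embeddings and the helper names `label_from_rating` and `centroids_from_ratings`.

```python
def label_from_rating(rating):
    """Map a gold-standard rating to a binary label, or None if in between."""
    if rating >= 7:
        return "positive"
    if rating <= 3:
        return "negative"
    return None  # assumption: mid-range ratings are not used for centroids

def centroids_from_ratings(embeddings, ratings):
    """Group embeddings by binarized rating and average each group."""
    groups = {"positive": [], "negative": []}
    for embedding, rating in zip(embeddings, ratings):
        label = label_from_rating(rating)
        if label is not None:
            groups[label].append(embedding)
    return {
        label: [sum(col) / len(vectors) for col in zip(*vectors)]
        for label, vectors in groups.items()
    }

# Toy data (assumption): five sentences with 2-d embeddings and ratings.
embeddings = [[1.0, 0.0], [0.6, 0.4], [-1.0, 0.0], [0.0, 1.0], [-0.5, 0.5]]
ratings = [8, 9, 2, 5, 1]
centroids = centroids_from_ratings(embeddings, ratings)
print(centroids)
```

The sentence rated 5 contributes to neither centroid, so only clearly positive and clearly negative examples shape the concept axis.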
Switching to the larger MPNET-base-v2 model improved the correlation further:
![](./img/Scatterplotfiction4wPersonMPNET.png)
We now try the method in a more complex setting: linguistic acceptability.
This measure can be found in the glue/cola test set, formally defined as: "The Corpus of Linguistic Acceptability consists of English acceptability judgments drawn from books and journal articles on linguistic theory. Each example is a sequence of words annotated with whether it is a grammatical English sentence."
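Because acceptability labels are binary, one possible way to evaluate the projection is to threshold the scores and compare the result with the labels. The sketch below does this with the Matthews correlation coefficient, the metric GLUE reports for CoLA. The threshold of 0.0, the toy scores and labels, and the function `classify` are assumptions for illustration, not the project's evaluation code.

```python
import math

def classify(scores, threshold=0.0):
    """Label a sentence acceptable (1) if its score exceeds the threshold."""
    return [1 if s > threshold else 0 for s in scores]

def matthews_corrcoef(y_true, y_pred):
    """Matthews correlation coefficient for binary labels in {0, 1}."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

# Toy values (assumption): projection scores and acceptability labels.
scores = [0.42, 0.10, -0.31, -0.05, 0.27, -0.22]
labels = [1, 1, 0, 1, 1, 0]
predictions = classify(scores)
mcc = matthews_corrcoef(labels, predictions)
print(round(mcc, 3))
```

Matthews correlation stays informative when the two classes are imbalanced, which is why it is preferred over plain accuracy for this task.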
Fiction4 Dataset used: https://huggingface.co/datasets/chcaa/fiction4sentiment
EmoBank Dataset used: https://github.com/JULIELab/EmoBank/blob/master/corpus/emobank.csv
Owner
- Name: Center for Humanities Computing Aarhus
- Login: centre-for-humanities-computing
- Kind: organization
- Email: chcaa@cas.au.dk
- Location: Aarhus, Denmark
- Website: https://chc.au.dk/
- Repositories: 130
- Profile: https://github.com/centre-for-humanities-computing
GitHub Events
Total
- Issue comment event: 1
- Push event: 1
- Pull request event: 2
- Fork event: 1
- Create event: 2
Last Year
- Issue comment event: 1
- Push event: 1
- Pull request event: 2
- Fork event: 1
- Create event: 2