classy-classification

This repository contains an easy and intuitive approach to few-shot classification using sentence-transformers or spaCy models, or zero-shot classification with Huggingface.

https://github.com/davidberenstein1957/classy-classification

Science Score: 54.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
    Links to: arxiv.org
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (12.3%) to scientific vocabulary

Keywords

few-shot-classifcation hacktoberfest machine-learning natural-language-processing nlp nlu sentence-transformers spacy text-classification
Last synced: 6 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: davidberenstein1957
  • License: mit
  • Language: Python
  • Default Branch: main
  • Homepage:
  • Size: 613 KB
Statistics
  • Stars: 219
  • Watchers: 6
  • Forks: 15
  • Open Issues: 0
  • Releases: 22
Topics
few-shot-classifcation hacktoberfest machine-learning natural-language-processing nlp nlu sentence-transformers spacy text-classification
Created almost 4 years ago · Last pushed about 1 year ago
Metadata Files
Readme License Citation

README.md

Classy Classification

Have you ever struggled with needing a spaCy TextCategorizer but didn't have the time to train one from scratch? Classy Classification is the way to go! For few-shot classification using sentence-transformers or spaCy models, provide a dictionary with labels and examples. For zero-shot classification with Hugging Face zero-shot classifiers, just provide a list of labels.

Install

pip install classy-classification

SetFit support

I got a lot of requests for SetFit support, but I decided to create a separate package for this. Feel free to check it out. ❤️

Quickstart

SpaCy embeddings

```python
import spacy
# or import standalone
# from classy_classification import ClassyClassifier

data = {
    "furniture": [
        "This text is about chairs.",
        "Couches, benches and televisions.",
        "I really need to get a new sofa.",
    ],
    "kitchen": [
        "There also exist things like fridges.",
        "I hope to be getting a new stove today.",
        "Do you also have some ovens.",
    ],
}

nlp = spacy.load("en_core_web_trf")
nlp.add_pipe(
    "classy_classification",
    config={
        "data": data,
        "model": "spacy"
    }
)

print(nlp("I am looking for kitchen appliances.")._.cats)

# Output:
#
# [{"furniture" : 0.21}, {"kitchen": 0.79}]
```
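The output above is shown as a list of single-label dictionaries. A small helper for picking the best label might look like the sketch below. It is plain Python and not part of the package: `flatten_cats` and `top_label` are hypothetical names, and the sketch assumes the scores arrive either as a list of `{label: score}` dicts or as one flat dict.

```python
from typing import Dict, List, Tuple, Union

# Hypothetical type alias: either one flat dict or a list of {label: score} dicts.
Cats = Union[Dict[str, float], List[Dict[str, float]]]


def flatten_cats(cats: Cats) -> Dict[str, float]:
    """Merge label-score entries into a single {label: score} dict."""
    if isinstance(cats, dict):
        return dict(cats)
    merged: Dict[str, float] = {}
    for entry in cats:
        merged.update(entry)
    return merged


def top_label(cats: Cats) -> Tuple[str, float]:
    """Return the (label, score) pair with the highest score."""
    flat = flatten_cats(cats)
    label = max(flat, key=flat.get)
    return label, flat[label]


example = [{"furniture": 0.21}, {"kitchen": 0.79}]
print(top_label(example))  # ('kitchen', 0.79)
```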

Sentence level classification

```python
import spacy

data = {
    "furniture": [
        "This text is about chairs.",
        "Couches, benches and televisions.",
        "I really need to get a new sofa.",
    ],
    "kitchen": [
        "There also exist things like fridges.",
        "I hope to be getting a new stove today.",
        "Do you also have some ovens.",
    ],
}

# reuse the pipeline from the previous example
nlp = spacy.load("en_core_web_trf")
nlp.add_pipe(
    "classy_classification",
    config={
        "data": data,
        "model": "spacy",
        "include_sent": True
    }
)

doc = nlp("I am looking for kitchen appliances. And I love doing so.")
# doc.sents is a generator, so materialize it before indexing
print(list(doc.sents)[0]._.cats)

# Output:
#
# [{"furniture" : 0.21}, {"kitchen": 0.79}]
```

Define random seed and verbosity

```python
# assumes `nlp` and `data` are defined as in the examples above
nlp.add_pipe(
    "classy_classification",
    config={
        "data": data,
        "verbose": True,
        "config": {"seed": 42}
    }
)
```

Multi-label classification

Sometimes multiple labels are necessary to fully describe the contents of a text. In that case, use the multi-label implementation, where the sum of label scores is not limited to 1. Just pass the same training example under every key it belongs to.

```python
import spacy

data = {
    "furniture": [
        "This text is about chairs.",
        "Couches, benches and televisions.",
        "I really need to get a new sofa.",
        "We have a new dinner table.",
        "There also exist things like fridges.",
        "I hope to be getting a new stove today.",
        "Do you also have some ovens.",
        "We have a new dinner table.",
    ],
    "kitchen": [
        "There also exist things like fridges.",
        "I hope to be getting a new stove today.",
        "Do you also have some ovens.",
        "We have a new dinner table.",
        "There also exist things like fridges.",
        "I hope to be getting a new stove today.",
        "Do you also have some ovens.",
        "We have a new dinner table.",
    ],
}

nlp = spacy.load("en_core_web_md")
nlp.add_pipe(
    "classy_classification",
    config={
        "data": data,
        "model": "spacy",
        "multi_label": True,
    }
)

print(nlp("I am looking for furniture and kitchen equipment.")._.cats)

# Output:
#
# [{"furniture": 0.92}, {"kitchen": 0.91}]
```
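Because multi-label scores do not sum to 1, a common way to read them is to keep every label whose score clears a threshold. The sketch below illustrates that idea in plain Python. `labels_above` and the default threshold of 0.5 are assumptions for illustration, not part of the package.

```python
from typing import Dict, List


def labels_above(cats: List[Dict[str, float]], threshold: float = 0.5) -> List[str]:
    """Return every label whose score is at least `threshold`, in input order."""
    selected: List[str] = []
    for entry in cats:
        for label, score in entry.items():
            if score >= threshold:
                selected.append(label)
    return selected


multi = [{"furniture": 0.92}, {"kitchen": 0.91}]
single = [{"furniture": 0.21}, {"kitchen": 0.79}]

print(labels_above(multi))   # ['furniture', 'kitchen']
print(labels_above(single))  # ['kitchen']
```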

Outlier detection

Sometimes it is useful to do outlier detection or binary classification. This can be approached with a binary training dataset (Approach 1). Alternatively, I have implemented support for a OneClassSVM for outlier detection using a single label (Approach 2). Note that this method does not return probabilities; the output is still formatted as label-score pairs to ensure uniformity.

Approach 1:

```python
import spacy

data_binary = {
    "inlier": [
        "This text is about chairs.",
        "Couches, benches and televisions.",
        "I really need to get a new sofa.",
    ],
    "outlier": [
        "Text about kitchen equipment",
        "This text is about politics",
        "Comments about AI and stuff.",
    ],
}

nlp = spacy.load("en_core_web_md")
nlp.add_pipe(
    "classy_classification",
    config={
        "data": data_binary,
    }
)

print(nlp("This text is a random text")._.cats)

# Output:
#
# [{'inlier': 0.2926672385488411, 'outlier': 0.707332761451159}]
```

Approach 2:

```python
import spacy

data_singular = {
    "furniture": [
        "This text is about chairs.",
        "Couches, benches and televisions.",
        "I really need to get a new sofa.",
        "We have a new dinner table.",
    ]
}

nlp = spacy.load("en_core_web_md")
nlp.add_pipe(
    "classy_classification",
    config={
        "data": data_singular,
    }
)

print(nlp("This text is a random text")._.cats)

# Output:
#
# [{'furniture': 0, 'not_furniture': 1}]
```
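The 0/1 values in the Approach 2 output follow from how one-class models report results: scikit-learn's `OneClassSVM.predict` returns +1 for inliers and -1 for outliers rather than a probability. The sketch below shows one way such a prediction could be mapped to the label-score format above. `to_label_scores` is a hypothetical helper written for illustration, not the package's own code.

```python
from typing import Dict


def to_label_scores(prediction: int, label: str) -> Dict[str, int]:
    """Map a one-class prediction (+1 inlier, -1 outlier) to label-score pairs."""
    if prediction not in (1, -1):
        raise ValueError("expected a prediction of +1 or -1")
    is_inlier = 1 if prediction == 1 else 0
    # The scores are hard 0/1 indicators, not probabilities.
    return {label: is_inlier, f"not_{label}": 1 - is_inlier}


print(to_label_scores(-1, "furniture"))  # {'furniture': 0, 'not_furniture': 1}
print(to_label_scores(1, "furniture"))   # {'furniture': 1, 'not_furniture': 0}
```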

Sentence-transformer embeddings

```python
import spacy

data = {
    "furniture": [
        "This text is about chairs.",
        "Couches, benches and televisions.",
        "I really need to get a new sofa.",
    ],
    "kitchen": [
        "There also exist things like fridges.",
        "I hope to be getting a new stove today.",
        "Do you also have some ovens.",
    ],
}

nlp = spacy.blank("en")
nlp.add_pipe(
    "classy_classification",
    config={
        "data": data,
        "model": "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2",
        "device": "gpu"
    }
)

print(nlp("I am looking for kitchen appliances.")._.cats)

# Output:
#
# [{"furniture": 0.21}, {"kitchen": 0.79}]
```

Hugging Face zero-shot classifiers

```python
import spacy

data = ["furniture", "kitchen"]

nlp = spacy.blank("en")
nlp.add_pipe(
    "classy_classification",
    config={
        "data": data,
        "model": "typeform/distilbert-base-uncased-mnli",
        "cat_type": "zero",
        "device": "gpu"
    }
)

print(nlp("I am looking for kitchen appliances.")._.cats)

# Output:
#
# [{"furniture": 0.21}, {"kitchen": 0.79}]
```

Credits

Inspiration Drawn From

Hugging Face does offer some nice models for few/zero-shot classification, but these are not tailored to multilingual approaches. Rasa NLU has a nice approach for this, but it's too embedded in their codebase for easy usage outside of Rasa/chatbots. Additionally, it made sense to integrate sentence-transformers and Hugging Face zero-shot classifiers instead of default word embeddings. Finally, I decided to integrate with spaCy, since training a custom spaCy TextCategorizer seems like a lot of hassle if you want something quick and dirty.

Standalone usage without spaCy

```python
from classy_classification import ClassyClassifier

data = {
    "furniture": [
        "This text is about chairs.",
        "Couches, benches and televisions.",
        "I really need to get a new sofa.",
    ],
    "kitchen": [
        "There also exist things like fridges.",
        "I hope to be getting a new stove today.",
        "Do you also have some ovens.",
    ],
}

classifier = ClassyClassifier(data=data)
classifier("I am looking for kitchen appliances.")
classifier.pipe(["I am looking for kitchen appliances."])

# overwrite training data
classifier.set_training_data(data=data)
classifier("I am looking for kitchen appliances.")

# overwrite embedding model
classifier.set_embedding_model(model="paraphrase-MiniLM-L3-v2")
classifier("I am looking for kitchen appliances.")

# overwrite SVC config
classifier.set_classification_model(
    config={
        "C": [1, 2, 5, 10, 20, 100],
        "kernel": ["linear"],
        "max_cross_validation_folds": 5
    }
)
classifier("I am looking for kitchen appliances.")
```
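The list-valued entries in the SVC config above read like a hyperparameter grid. The sketch below enumerates the candidate settings such a grid describes, assuming the lists are searched exhaustively in the way scikit-learn's `GridSearchCV` treats a `param_grid`. `expand_grid` is a hypothetical helper, and how the package itself consumes the config is not confirmed here.

```python
from itertools import product
from typing import Any, Dict, Iterator, List


def expand_grid(grid: Dict[str, List[Any]]) -> Iterator[Dict[str, Any]]:
    """Yield one {parameter: value} dict per combination in the grid."""
    keys = list(grid)
    for values in product(*(grid[key] for key in keys)):
        yield dict(zip(keys, values))


config = {
    "C": [1, 2, 5, 10, 20, 100],
    "kernel": ["linear"],
    "max_cross_validation_folds": 5,
}

# Assumption: only list-valued entries are hyperparameters to search over;
# the scalar fold setting is kept out of the grid.
grid = {key: value for key, value in config.items() if isinstance(value, list)}
candidates = list(expand_grid(grid))

print(len(candidates))  # 6
print(candidates[0])    # {'C': 1, 'kernel': 'linear'}
```

With six values for `C` and one kernel, the grid yields six candidate models, each of which would be scored by cross-validation.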

Save and load models

```python
import pickle

from classy_classification import ClassyClassifier

data = {
    "furniture": [
        "This text is about chairs.",
        "Couches, benches and televisions.",
        "I really need to get a new sofa.",
    ],
    "kitchen": [
        "There also exist things like fridges.",
        "I hope to be getting a new stove today.",
        "Do you also have some ovens.",
    ],
}
classifier = ClassyClassifier(data=data)

# save the fitted classifier
with open("./classifier.pkl", "wb") as f:
    pickle.dump(classifier, f)

# load it back and classify
with open("./classifier.pkl", "rb") as f:
    classifier = pickle.load(f)

classifier("I am looking for kitchen appliances.")
```

Owner

  • Name: David Berenstein
  • Login: davidberenstein1957
  • Kind: user
  • Location: Madrid
  • Company: @argilla-io

👨🏽‍🍳 Cooking, 👨🏽‍💻 Coding, 🏆 Committing Developer Advocate @argilla-io

Citation (CITATION.cff)

cff-version: 1.0.0
message: "If you use this software, please cite it as below."
authors:
  - family-names: David
    given-names: Berenstein
title: "Classy Classification - an easy and intuitive approach to few-shot classification using sentence-transformers or spaCy models, or zero-shot classification with Huggingface."
version: 0.6.0
date-released: 2022-12-31

GitHub Events

Total
  • Create event: 4
  • Release event: 2
  • Issues event: 7
  • Watch event: 8
  • Issue comment event: 13
  • Push event: 9
  • Pull request event: 3
Last Year
  • Create event: 4
  • Release event: 2
  • Issues event: 7
  • Watch event: 8
  • Issue comment event: 13
  • Push event: 9
  • Pull request event: 3

Committers

Last synced: 9 months ago

All Time
  • Total Commits: 117
  • Total Committers: 4
  • Avg Commits per committer: 29.25
  • Development Distribution Score (DDS): 0.342
Past Year
  • Commits: 23
  • Committers: 1
  • Avg Commits per committer: 23.0
  • Development Distribution Score (DDS): 0.0
Top Committers
Name Email Commits
david d****n@g****m 77
David Berenstein d****n@p****m 34
Pepijn Boers p****b@g****m 4
Boers p****s@z****l 2
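
The summary figures above can be reproduced from the commit counts in the table. The sketch below assumes the Development Distribution Score is defined as 1 minus the top committer's share of all commits. That definition is an assumption here, though it matches both reported values (0.342 all time, 0.0 for a single committer).

```python
# Commit counts per committer identity, copied from the table above.
commits = {"david": 77, "David Berenstein": 34, "Pepijn Boers": 4, "Boers": 2}

total = sum(commits.values())
average = total / len(commits)
# Assumed definition: 1 - (commits by the top committer / total commits).
dds = 1 - max(commits.values()) / total

print(total, average, round(dds, 3))  # 117 29.25 0.342
```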

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 44
  • Total pull requests: 10
  • Average time to close issues: 2 months
  • Average time to close pull requests: about 2 months
  • Total issue authors: 31
  • Total pull request authors: 6
  • Average comments per issue: 3.64
  • Average comments per pull request: 1.9
  • Merged pull requests: 6
  • Bot issues: 0
  • Bot pull requests: 1
Past Year
  • Issues: 3
  • Pull requests: 3
  • Average time to close issues: 23 days
  • Average time to close pull requests: 20 minutes
  • Issue authors: 2
  • Pull request authors: 1
  • Average comments per issue: 3.33
  • Average comments per pull request: 0.0
  • Merged pull requests: 3
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • davidberenstein1957 (4)
  • swageeth (3)
  • andremacola (3)
  • kbillesk (2)
  • nv78 (2)
  • koaning (2)
  • KikeVen (2)
  • RaiAmanRai (2)
  • saitej123 (2)
  • drkonafa (1)
  • nsankar (1)
  • dpicca (1)
  • espdev (1)
  • atefalvi (1)
  • jackrvaughan (1)
Pull Request Authors
  • davidberenstein1957 (5)
  • RobinRojowiec (2)
  • adelevie (1)
  • PepijnBoers (1)
  • Masboes (1)
  • dependabot[bot] (1)
Top Labels
Issue Labels
enhancement (6) bug (6) documentation (1)
Pull Request Labels
dependencies (2) hacktoberfest (1)

Packages

  • Total packages: 1
  • Total downloads:
    • pypi 246 last-month
  • Total dependent packages: 1
  • Total dependent repositories: 1
  • Total versions: 33
  • Total maintainers: 2
pypi.org: classy-classification

Have you ever struggled with needing a spaCy TextCategorizer but didn't have the time to train one from scratch? Classy Classification is the way to go!

  • Versions: 33
  • Dependent Packages: 1
  • Dependent Repositories: 1
  • Downloads: 246 Last month
Rankings
Downloads: 4.2%
Dependent packages count: 4.8%
Stargazers count: 5.3%
Average: 9.2%
Forks count: 10.2%
Dependent repos count: 21.6%
Maintainers (2)
Last synced: 6 months ago

Dependencies

poetry.lock pypi
  • 105 dependencies
pyproject.toml pypi
  • black ^22.3.0 develop
  • flake8 ^4.0.1 develop
  • flake8-bugbear ^22.3.23 develop
  • flake8-docstrings ^1.6.0 develop
  • isort ^5.10.1 develop
  • pep8-naming ^0.12.1 develop
  • pre-commit ^2.17.0 develop
  • pytest ^7.0.1 develop
  • python ^3.7
  • scikit-learn ^1.0
  • sentence-transformers ^2.0
  • spacy ^3.0
  • txtai ^4.5.0
.github/workflows/python-package.yml actions
  • actions/checkout v3 composite
  • actions/setup-python v3 composite
.github/workflows/python-publish.yml actions
  • actions/checkout v3 composite
  • actions/setup-python v3 composite
  • pypa/gh-action-pypi-publish 27b31702a0e7fc50959f5ad993c78deac1bdfc29 composite