AuDoLab

AuDoLab: Automatic document labelling and classification for extremely unbalanced data - Published in JOSS (2021)

https://github.com/arnetillmann/audolab

Science Score: 95.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 1 DOI reference(s) in JOSS metadata
  • Academic publication links
  • Committers with academic emails
    2 of 11 committers (18.2%) from academic institutions
  • Institutional organization owner
  • JOSS paper metadata
    Published in Journal of Open Source Software

Keywords from Contributors

genome

Scientific Fields

Artificial Intelligence and Machine Learning Computer Science - 69% confidence
Last synced: 4 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: ArneTillmann
  • License: other
  • Language: Jupyter Notebook
  • Default Branch: main
  • Size: 9.89 MB
Statistics
  • Stars: 5
  • Watchers: 2
  • Forks: 1
  • Open Issues: 0
  • Releases: 3
Created almost 5 years ago · Last pushed about 4 years ago
Metadata Files
Readme Changelog Contributing License Authors

README.html

AuDoLab


With AuDoLab you can perform Latent Dirichlet Allocation (LDA) on highly imbalanced datasets.

Installation

Stable release

To install AuDoLab, run this command in your terminal:

$ pip install AuDoLab

This is the preferred method to install AuDoLab, as it will always install the most recent stable release.

If you don't have pip installed, this Python installation guide can walk you through the process.

From sources

The sources for AuDoLab can be downloaded from the GitHub repo.

You can either clone the public repository:

$ git clone https://github.com/ArneTillmann/AuDoLab

Or download the tarball:

$ curl -OJL https://github.com/ArneTillmann/AuDoLab/tarball/main

Once you have a copy of the source, you can install it with:

$ python setup.py install

Usage

To use AuDoLab in a project:

from AuDoLab import AuDoLab
import asyncio

Then create an instance of the AuDoLab class:

audo = AuDoLab.AuDoLab()

This example uses the publicly available Reuters corpus from the nltk package:

import nltk
import numpy as np
import pandas as pd
from nltk.corpus import reuters

# The Reuters corpus must be downloaded once before first use.
nltk.download("reuters")

data = []

# File ids look like "test/14826": a train/test tag and a file name.
for fileid in reuters.fileids():
    tag, filename = fileid.split("/")
    data.append(
        (filename, ", ".join(reuters.categories(fileid)), reuters.raw(fileid))
    )

data = pd.DataFrame(data, columns=["filename", "categories", "text"])

Next, scrape abstracts, e.g. from IEEE, with the abstract scraper:

async def scrape():
    return await audo.scrape_abstracts(
        url=None, keywords=["cotton"], in_data="all_meta", pages=5
    )

scraped_documents = asyncio.get_event_loop().run_until_complete(scrape())
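
The call above drives the coroutine with asyncio.get_event_loop().run_until_complete(...). In a plain script (as opposed to a notebook, where an event loop is usually already running), asyncio.run is the more common entry point. The following is only a sketch: fake_scrape is a hypothetical stand-in for audo.scrape_abstracts so that the example runs without network access, and the records it returns are placeholders, not the scraper's real output format.

```python
import asyncio


async def fake_scrape(keywords, pages):
    # Hypothetical stand-in for audo.scrape_abstracts: it only yields
    # control once and returns one placeholder record per page.
    await asyncio.sleep(0)
    return [{"page": p, "keywords": list(keywords)} for p in range(1, pages + 1)]


async def scrape():
    return await fake_scrape(keywords=["cotton"], pages=5)


# asyncio.run creates a fresh event loop, runs the coroutine to
# completion, and closes the loop afterwards.
scraped = asyncio.run(scrape())
print(len(scraped))
```

Inside Jupyter, asyncio.run raises an error because a loop is already running; the nest-asyncio package listed in the dependencies is one way around that.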

Both the data and the scraped papers need to be preprocessed and converted to tf-idf features before they are used in the classifier:

preprocessed_target = audo.preprocessing(data=data, column="text")

preprocessed_paper = audo.preprocessing(
    data=scraped_documents, column="text")

target_tfidf, training_tfidf = audo.tf_idf(
    data=preprocessed_target,
    papers=preprocessed_paper,
    data_column="lemma",
    papers_column="lemma",
    features=100000,
)
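
To illustrate what a pair of tf-idf matrices like target_tfidf and training_tfidf looks like, here is a minimal sketch using scikit-learn's TfidfVectorizer directly. It is not AuDoLab's implementation: the toy documents are invented, and fitting the vocabulary on the papers only is an assumption made for this example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-ins for the "lemma" columns of the preprocessed data.
papers = ["cotton yield fibre quality", "cotton harvest yield model"]
targets = ["cotton price rise", "bank raise interest rate", "cotton export yield"]

# Assumption for this sketch: the vocabulary is learned from the scraped
# papers only, so target documents are described in paper-domain terms.
vectorizer = TfidfVectorizer(max_features=100000)
training_tfidf = vectorizer.fit_transform(papers)
target_tfidf = vectorizer.transform(targets)

# Both matrices share the same columns (one per vocabulary term), which is
# what lets a classifier trained on one be applied to the other.
print(training_tfidf.shape, target_tfidf.shape)
```

A target document that shares no term with the papers (the second one here) ends up as an all-zero row, which a one-class classifier trained on the papers would be expected to treat as out-of-class.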

Afterwards we can train and use the classifiers and choose the desired one:

classifier = audo.one_class_svm(
    training=training_tfidf,
    predicting=target_tfidf,
    nus=np.round(np.arange(0.01, 0.5, 0.01), 7),
    quality_train=0.9,
    min_pred=0.001,
    max_pred=0.05,
)

df_data = audo.choose_classifier(preprocessed_target, classifier, 2)
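
The one_class_svm call above takes a grid of nu values together with quality_train, min_pred and max_pred. One plausible reading, which is an assumption here and not confirmed by the README, is that a classifier is kept when it accepts at least quality_train of the training documents and labels between min_pred and max_pred of the target documents as in-class. A sketch of that selection with scikit-learn's OneClassSVM on synthetic data (select_classifiers and the toy arrays are hypothetical):

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
# Synthetic features: training points cluster near the origin; most target
# points lie far away, and a small minority falls in the training region.
training = rng.normal(0.0, 1.0, size=(200, 5))
target = np.vstack([
    rng.normal(6.0, 1.0, size=(384, 5)),  # out-of-class majority
    rng.normal(0.0, 1.0, size=(16, 5)),   # in-class minority
])


def select_classifiers(training, target, nus, quality_train, min_pred, max_pred):
    """Keep every nu whose model satisfies the assumed constraints."""
    kept = []
    for nu in nus:
        model = OneClassSVM(kernel="rbf", gamma="scale", nu=nu).fit(training)
        # predict returns +1 for inliers and -1 for outliers.
        train_rate = np.mean(model.predict(training) == 1)
        pred_rate = np.mean(model.predict(target) == 1)
        if train_rate >= quality_train and min_pred <= pred_rate <= max_pred:
            kept.append((float(nu), float(train_rate), float(pred_rate)))
    return kept


kept = select_classifiers(
    training,
    target,
    nus=np.round(np.arange(0.01, 0.5, 0.01), 7),
    quality_train=0.9,
    min_pred=0.001,
    max_pred=0.05,
)
print(len(kept), "classifiers satisfy the constraints")
```

Since nu acts as an upper bound on the fraction of training errors, larger nu values reject more training documents, so only the lower end of the grid can meet a quality_train of 0.9.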

And finally you can estimate the topics of the data:

audo.lda_modeling(df_data, num_topics=2)

a = audo.lda_visualize_topics()
html = a.data
with open('html_file.html', 'w') as f:
    f.write(html)

Owner

  • Login: ArneTillmann
  • Kind: user

JOSS Publication

AuDoLab: Automatic document labelling and classification for extremely unbalanced data
Published
October 19, 2021
Volume 6, Issue 66, Page 3719
Authors
Arne Tillmann
Georg-August-Universität Göttingen, Göttingen, Germany
Anton Thielmann
Georg-August-Universität Göttingen, Göttingen, Germany
Gillian Kant
Georg-August-Universität Göttingen, Göttingen, Germany
Christoph Weisser
Georg-August-Universität Göttingen, Göttingen, Germany, Campus-Institut Data Science (CIDAS), Göttingen, Germany
Benjamin Säfken
Georg-August-Universität Göttingen, Göttingen, Germany, Campus-Institut Data Science (CIDAS), Göttingen, Germany
Alexander Silbersdorff
Georg-August-Universität Göttingen, Göttingen, Germany, Campus-Institut Data Science (CIDAS), Göttingen, Germany
Thomas Kneib
Georg-August-Universität Göttingen, Göttingen, Germany, Campus-Institut Data Science (CIDAS), Göttingen, Germany
Editor
Arfon Smith
Tags
  • One-class SVM
  • Unsupervised Document Classification
  • One-class Document Classification
  • LDA Topic Modelling
  • Out-of-domain Training Data

Committers

Last synced: 5 months ago

All Time
  • Total Commits: 425
  • Total Committers: 11
  • Avg Commits per committer: 38.636
  • Development Distribution Score (DDS): 0.285
Past Year
  • Commits: 0
  • Committers: 0
  • Avg Commits per committer: 0.0
  • Development Distribution Score (DDS): 0.0
Top Committers
Name Email Commits
ArneTillmann a****r@g****m 304
Christoph Weisser 5****9 54
Anton Thielmann 5****n 42
AFThielmann a****n@t****e 13
Benjamin Säfken b****e@u****e 3
kantg g****t@f****e 3
Arfon Smith a****n 2
tkneib t****b@u****e 1
pyup-bot g****t@p****o 1
Gillian Kant 5****g 1
AlexanderSilbersdorff 8****f 1
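
The all-time committer statistics can be reproduced from the table above. The sketch below assumes the common definition of the Development Distribution Score, namely one minus the top committer's share of all commits; that definition is not stated on this page, but it matches the reported value.

```python
# Commit counts copied from the Top Committers table.
commits = {
    "ArneTillmann": 304,
    "Christoph Weisser": 54,
    "Anton Thielmann": 42,
    "AFThielmann": 13,
    "Benjamin Säfken": 3,
    "kantg": 3,
    "Arfon Smith": 2,
    "tkneib": 1,
    "pyup-bot": 1,
    "Gillian Kant": 1,
    "AlexanderSilbersdorff": 1,
}

total = sum(commits.values())
average = total / len(commits)
# Assumed definition: DDS = 1 - (commits by top committer / total commits).
dds = 1 - max(commits.values()) / total

print(total, round(average, 3), round(dds, 3))
```

A DDS near 0 means one person wrote almost all commits; here the top committer accounts for roughly 72% of them.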

Issues and Pull Requests

Last synced: 4 months ago

All Time
  • Total issues: 5
  • Total pull requests: 29
  • Average time to close issues: about 1 month
  • Average time to close pull requests: 4 days
  • Total issue authors: 3
  • Total pull request authors: 4
  • Average comments per issue: 0.6
  • Average comments per pull request: 0.07
  • Merged pull requests: 26
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • ArneTillmann (3)
  • linuxscout (1)
  • pyup-bot (1)
Pull Request Authors
  • ArneTillmann (25)
  • arfon (2)
  • ChrisW09 (1)
  • pyup-bot (1)

Packages

  • Total packages: 1
  • Total downloads:
    • pypi 311 last-month
  • Total dependent packages: 0
  • Total dependent repositories: 0
  • Total versions: 34
  • Total maintainers: 1
pypi.org: audolab

With AuDoLab you can do LDA on highly imbalanced datasets.

  • Versions: 34
  • Dependent Packages: 0
  • Dependent Repositories: 0
  • Downloads: 311 Last month
Rankings
Dependent packages count: 10.1%
Downloads: 17.0%
Forks count: 19.1%
Stargazers count: 21.5%
Average: 27.0%
Dependent repos count: 67.4%
Maintainers (1)
Last synced: 4 months ago

Dependencies

requirements.txt pypi
  • Click >=7.0
  • Sphinx >=3.5.4
  • WordCloud >=1.8.1
  • bs4 >=0.0.1
  • bump2version >=0.5.11
  • coverage >=6.0b1
  • flake8 >=3.7.8
  • funcy *
  • gensim >=3.8.3
  • lime >=0.2.0.1
  • matplotlib >=3.3.4
  • nest-asyncio >=1.5.1
  • nltk >=3.5
  • numpy >=1.19.2
  • pandas >=1.2.3
  • pyldavis >=3.3.1
  • pyppeteer *
  • pytest >=4.6.5
  • pytest-runner >=5.1
  • requests >=2.25.1
  • scikit-learn >=0.24.1
  • tox >=3.14.0
  • tqdm >=4.62.0
  • twine *
  • watchdog >=0.9.0
  • webbot >=0.34
  • wheel >=0.33.6
  • wordcloud *