AuDoLab

AuDoLab: Automatic document labelling and classification for extremely unbalanced data - Published in JOSS (2021)

https://github.com/arnetillmann/audolab

Science Score: 95.0%

This score indicates how likely this project is to be science-related based on various indicators:

○ CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
✓ .zenodo.json file: Found .zenodo.json file
✓ DOI references: Found 1 DOI reference(s) in JOSS metadata
○ Academic publication links
✓ Committers with academic emails: 2 of 11 committers (18.2%) from academic institutions
○ Institutional organization owner
✓ JOSS paper metadata: Published in Journal of Open Source Software

Keywords from Contributors

genome

Scientific Fields

Artificial Intelligence and Machine Learning Computer Science - 69% confidence

Last synced: 11 months ago

Repository

Basic Info

Host: GitHub
Owner: ArneTillmann
License: other
Language: Jupyter Notebook
Default Branch: main
Size: 9.89 MB

Statistics

Stars: 5
Watchers: 2
Forks: 1
Open Issues: 0
Releases: 3

Created over 5 years ago · Last pushed almost 5 years ago

Metadata Files

Readme Changelog Contributing License Authors

README.html


AuDoLab



With AuDoLab you can perform Latent Dirichlet Allocation (LDA) on highly imbalanced datasets.


Installation

Stable release
To install AuDoLab, run this command in your terminal:
$ pip install AuDoLab

This is the preferred method to install AuDoLab, as it will always install the most recent stable release.
If you don't have pip installed, the Python installation guide can walk
you through the process.


From sources
The sources for AuDoLab can be downloaded from the GitHub repo.
You can either clone the public repository:
$ git clone https://github.com/ArneTillmann/AuDoLab

Or download the tarball:
$ curl -OJL https://github.com/ArneTillmann/AuDoLab/tarball/main

Once you have a copy of the source, you can install it with:
$ python setup.py install




Usage
To use AuDoLab in a project:
from AuDoLab import AuDoLab
import asyncio

Then create an instance of the AuDoLab class:

audo = AuDoLab.AuDoLab()
This example uses the publicly available Reuters corpus from the nltk package:
from nltk.corpus import reuters
import numpy as np
import pandas as pd

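# The corpus must be available locally, e.g. after nltk.download("reuters").
data = []

# Collect one (filename, categories, text) tuple per Reuters document.
for fileid in reuters.fileids():
    tag, filename = fileid.split("/")
    data.append(
        (filename, ", ".join(reuters.categories(fileid)), reuters.raw(fileid))
    )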

data = pd.DataFrame(data, columns=["filename", "categories", "text"])

Next, scrape abstracts, e.g. from IEEE, with the abstract scraper:
async def scrape():
    return await audo.scrape_abstracts(
        url=None, keywords=["cotton"], in_data="all_meta", pages=5
    )

scraped_documents = asyncio.get_event_loop().run_until_complete(scrape())
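On recent Python versions, `asyncio.run` is a common alternative to fetching the event loop by hand. The sketch below only illustrates that pattern: `fake_scrape` is a hypothetical stand-in for `audo.scrape_abstracts`, so it runs without network access or AuDoLab installed, and its return format is invented for the example.

```python
import asyncio


async def fake_scrape(keywords, pages):
    # Hypothetical stand-in for audo.scrape_abstracts: yields control once
    # and returns one placeholder record per requested page.
    await asyncio.sleep(0)
    return [{"page": p, "keywords": list(keywords)} for p in range(1, pages + 1)]


async def main():
    return await fake_scrape(keywords=["cotton"], pages=5)


# asyncio.run creates a fresh event loop, runs the coroutine, and closes the loop.
scraped = asyncio.run(main())
print(len(scraped))
```

Inside a Jupyter notebook an event loop is already running, so `asyncio.run` raises a `RuntimeError` there; that may be why the project lists nest-asyncio among its dependencies.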

The data as well as the scraped papers need to be preprocessed before use in the
classifier:
preprocessed_target = audo.preprocessing(data=data, column="text")

preprocessed_paper = audo.preprocessing(
    data=scraped_documents, column="text")

target_tfidf, training_tfidf = audo.tf_idf(
    data=preprocessed_target,
    papers=preprocessed_paper,
    data_column="lemma",
    papers_column="lemma",
    features=100000,
)
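The call above returns two matrices that presumably share one vocabulary, so that the classifier sees the target documents in the same feature space as the scraped papers. The following is a minimal plain-Python sketch of that idea, not AuDoLab's implementation: it assumes the IDF weights are learned from the papers only, uses the smoothed formula `ln((1 + n) / (1 + df)) + 1`, and skips row normalization. The helper names `fit_idf` and `transform` and the toy documents are made up for illustration.

```python
import math
from collections import Counter


def fit_idf(docs):
    """Learn smoothed inverse document frequencies from tokenized documents."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    return {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}


def transform(docs, idf):
    """Weight term counts with the learned IDF; terms unseen in fitting are dropped."""
    vocab = sorted(idf)
    rows = []
    for doc in docs:
        counts = Counter(t for t in doc if t in idf)
        rows.append([counts[t] * idf[t] for t in vocab])
    return vocab, rows


# Toy stand-ins for the lemmatized papers (training) and target documents.
papers = [["cotton", "yield", "soil"], ["cotton", "fibre", "quality"]]
target = [["cotton", "price", "rise"], ["oil", "price", "fall"]]

idf = fit_idf(papers)
vocab, training_rows = transform(papers, idf)
_, target_rows = transform(target, idf)

# Both matrices have one column per vocabulary term learned from the papers.
print(vocab)
print(target_rows)
```

A target document that shares no term with the papers, like the second one here, becomes an all-zero row, which is one intuition for why such documents end up outside the learned class.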

Afterwards, train the classifiers over the grid of nu values and choose the
desired one:
classifier = audo.one_class_svm(
    training=training_tfidf,
    predicting=target_tfidf,
    nus=np.round(np.arange(0.01, 0.5, 0.01), 7),
    quality_train=0.9,
    min_pred=0.001,
    max_pred=0.05,
)

df_data = audo.choose_classifier(preprocessed_target, classifier, 2)
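One plausible reading of `quality_train`, `min_pred` and `max_pred` is that they filter the candidate classifiers: keep a candidate only if it accepts at least `quality_train` of the training papers and labels between `min_pred` and `max_pred` of the target documents as in-class. The sketch below illustrates that reading only; it is an assumption, not taken from the AuDoLab source, and the `Candidate` class, the `select` helper and all numbers are invented for the example.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    nu: float                   # nu value the one-class SVM was trained with
    train_inlier_share: float   # share of training papers accepted as in-class
    target_inlier_share: float  # share of target documents predicted in-class


def select(candidates, quality_train, min_pred, max_pred):
    """Keep candidates that fit the training data and label a plausible share."""
    return [
        c for c in candidates
        if c.train_inlier_share >= quality_train
        and min_pred <= c.target_inlier_share <= max_pred
    ]


# Invented numbers, for illustration only.
candidates = [
    Candidate(nu=0.01, train_inlier_share=0.99, target_inlier_share=0.20),
    Candidate(nu=0.05, train_inlier_share=0.95, target_inlier_share=0.03),
    Candidate(nu=0.30, train_inlier_share=0.70, target_inlier_share=0.01),
]

kept = select(candidates, quality_train=0.9, min_pred=0.001, max_pred=0.05)
print([c.nu for c in kept])
```

Under this reading, a narrow `min_pred` to `max_pred` window encodes the prior belief that only a small fraction of the target corpus belongs to the class of interest; the AuDoLab documentation is the authority on the exact semantics.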

And finally you can estimate the topics of the data:
audo.lda_modeling(df_data, num_topics=2)

a = audo.lda_visualize_topics()
html = a.data
with open('html_file.html', 'w') as f:
    f.write(html)


Free software: GNU General Public License v3
Documentation: https://AuDoLab.readthedocs.io.

Owner

Login: ArneTillmann
Kind: user

Repositories: 3
Profile: https://github.com/ArneTillmann

JOSS Publication

AuDoLab: Automatic document labelling and classification for extremely unbalanced data

Published

October 19, 2021

DOI

10.21105/joss.03719

Volume 6, Issue 66, Page 3719

Authors

Arne Tillmann
Georg-August-Universität Göttingen, Göttingen, Germany

Anton Thielmann
Georg-August-Universität Göttingen, Göttingen, Germany

Gillian Kant
Georg-August-Universität Göttingen, Göttingen, Germany

Christoph Weisser
Georg-August-Universität Göttingen, Göttingen, Germany, Campus-Institut Data Science (CIDAS), Göttingen, Germany

Benjamin Säfken
Georg-August-Universität Göttingen, Göttingen, Germany, Campus-Institut Data Science (CIDAS), Göttingen, Germany

Alexander Silbersdorff
Georg-August-Universität Göttingen, Göttingen, Germany, Campus-Institut Data Science (CIDAS), Göttingen, Germany

Thomas Kneib
Georg-August-Universität Göttingen, Göttingen, Germany, Campus-Institut Data Science (CIDAS), Göttingen, Germany

Editor

Arfon Smith

Committers

Last synced: 12 months ago

All Time

Total Commits: 425
Total Committers: 11
Avg Commits per committer: 38.636
Development Distribution Score (DDS): 0.285

Past Year

Commits: 0
Committers: 0
Avg Commits per committer: 0.0
Development Distribution Score (DDS): 0.0

Top Committers

Name	Email	Commits
ArneTillmann	a**r@g**m	304
Christoph Weisser	5****9	54
Anton Thielmann	5****n	42
AFThielmann	a**n@t**e	13
Benjamin Säfken	b**e@u**e	3
kantg	g**t@f**e	3
Arfon Smith	a****n	2
tkneib	t**b@u**e	1
pyup-bot	g**t@p**o	1
Gillian Kant	5****g	1
AlexanderSilbersdorff	8****f	1

Committer Domains (Top 20 + Academic)

uni-goettingen.de: 2
pyup.io: 1
fida.de: 1
t-online.de: 1

Issues and Pull Requests

Last synced: 11 months ago

All Time

Total issues: 5
Total pull requests: 29
Average time to close issues: about 1 month
Average time to close pull requests: 4 days
Total issue authors: 3
Total pull request authors: 4
Average comments per issue: 0.6
Average comments per pull request: 0.07
Merged pull requests: 26
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 0
Pull requests: 0
Average time to close issues: N/A
Average time to close pull requests: N/A
Issue authors: 0
Pull request authors: 0
Average comments per issue: 0
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0

Top Authors

Issue Authors

ArneTillmann (3)
linuxscout (1)
pyup-bot (1)

Pull Request Authors

ArneTillmann (25)
arfon (2)
ChrisW09 (1)
pyup-bot (1)

Packages

Total packages: 1
Total downloads: 311 last month (PyPI)

Total dependent packages: 0
Total dependent repositories: 0
Total versions: 34
Total maintainers: 1

pypi.org: audolab

With AuDoLab you can do LDA on highly imbalanced datasets.

Homepage: https://github.com/ArneTillmann/AuDoLab
Documentation: https://audolab.readthedocs.io/
License: GNU General Public License v3
Latest release: 1.0.16
published almost 5 years ago

Versions: 34
Dependent Packages: 0
Dependent Repositories: 0
Downloads: 311 Last month

Rankings

Dependent packages count: 10.1%

Downloads: 17.0%

Forks count: 19.1%

Stargazers count: 21.5%

Average: 27.0%

Dependent repos count: 67.4%

Maintainers (1)

ArneTillmann

Last synced: 11 months ago

Dependencies

requirements.txt pypi

Click >=7.0
Sphinx >=3.5.4
WordCloud >=1.8.1
bs4 >=0.0.1
bump2version >=0.5.11
coverage >=6.0b1
flake8 >=3.7.8
funcy *
gensim >=3.8.3
lime >=0.2.0.1
matplotlib >=3.3.4
nest-asyncio >=1.5.1
nltk >=3.5
numpy >=1.19.2
pandas >=1.2.3
pyldavis >=3.3.1
pyppeteer *
pytest >=4.6.5
pytest-runner >=5.1
requests >=2.25.1
scikit-learn >=0.24.1
tox >=3.14.0
tqdm >=4.62.0
twine *
watchdog >=0.9.0
webbot >=0.34
wheel >=0.33.6
wordcloud *
