AuDoLab
AuDoLab: Automatic document labelling and classification for extremely unbalanced data - Published in JOSS (2021)
Science Score: 95.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file: Found codemeta.json file
- ✓ .zenodo.json file: Found .zenodo.json file
- ✓ DOI references: Found 1 DOI reference(s) in JOSS metadata
- ○ Academic publication links
- ✓ Committers with academic emails: 2 of 11 committers (18.2%) from academic institutions
- ○ Institutional organization owner
- ✓ JOSS paper metadata: Published in Journal of Open Source Software
Repository
Basic Info
- Host: GitHub
- Owner: ArneTillmann
- License: other
- Language: Jupyter Notebook
- Default Branch: main
- Size: 9.89 MB
Statistics
- Stars: 5
- Watchers: 2
- Forks: 1
- Open Issues: 0
- Releases: 3
Metadata Files
README.html
Installation
Stable release
To install AuDoLab, run this command in your terminal:
$ pip install AuDoLab
This is the preferred method to install AuDoLab, as it will always install the most recent stable release.
If you don't have pip installed, this Python installation guide can guide you through the process.
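To confirm that the install succeeded, one option is to ask Python's packaging metadata for the installed version. The sketch below is not part of AuDoLab; `installed_version` is a hypothetical helper name, and it relies only on the standard library's `importlib.metadata` (Python 3.8 or newer).

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(distribution):
    """Return the installed version string, or None if the distribution is absent."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version("AuDoLab")
    print("AuDoLab version:", found if found else "not installed")
```

If this prints "not installed", pip most likely installed the package into a different Python environment than the one running the script.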
From sources
The sources for AuDoLab can be downloaded from the GitHub repo.
You can either clone the public repository:
$ git clone https://github.com/ArneTillmann/AuDoLab
Or download the tarball:
$ curl -OJL https://github.com/ArneTillmann/AuDoLab/tarball/master
Once you have a copy of the source, you can install it with:
$ python setup.py install
Usage
To use AuDoLab in a project:

    from AuDoLab import AuDoLab
    import asyncio

Then create an instance of the AuDoLab class:

    audo = AuDoLab.AuDoLab()

In this example we use publicly available data from the nltk package:

    # The Reuters corpus must be available locally; if it is not,
    # run nltk.download("reuters") once beforehand.
    from nltk.corpus import reuters
    import numpy as np
    import pandas as pd

    data = []
    for fileid in reuters.fileids():
        tag, filename = fileid.split("/")
        data.append(
            (filename, ", ".join(reuters.categories(fileid)), reuters.raw(fileid))
        )

    data = pd.DataFrame(data, columns=["filename", "categories", "text"])

Then scrape abstracts, e.g. from IEEE, with the abstract scraper:

    async def scrape():
        return await audo.scrape_abstracts(
            url=None, keywords=["cotton"], in_data="all_meta", pages=5
        )

    scraped_documents = asyncio.get_event_loop().run_until_complete(scrape())

The data as well as the scraped papers need to be preprocessed before use in the classifier:

    preprocessed_target = audo.preprocessing(data=data, column="text")
    preprocessed_paper = audo.preprocessing(data=scraped_documents, column="text")

    target_tfidf, training_tfidf = audo.tf_idf(
        data=preprocessed_target,
        papers=preprocessed_paper,
        data_column="lemma",
        papers_column="lemma",
        features=100000,
    )

Afterwards we can train the classifiers and choose the desired one:

    classifier = audo.one_class_svm(
        training=training_tfidf,
        predicting=target_tfidf,
        nus=np.round(np.arange(0.01, 0.5, 0.01), 7),
        quality_train=0.9,
        min_pred=0.001,
        max_pred=0.05,
    )

    df_data = audo.choose_classifier(preprocessed_target, classifier, 2)

Finally, you can estimate the topics of the data:

    audo.lda_modeling(df_data, num_topics=2)
    a = audo.lda_visualize_topics()
    html = a.data
    with open('html_file.html', 'w') as f:
        f.write(html)

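The scraping step above drives the coroutine with `asyncio.get_event_loop().run_until_complete(...)`. Newer Python releases deprecate relying on `asyncio.get_event_loop()` to create a loop implicitly, so in a plain script `asyncio.run` may be the safer entry point. The sketch below shows that pattern with `fake_scrape`, a hypothetical stand-in that performs no network access and is not AuDoLab's scraper; `run_scrape` is likewise an illustrative name.

```python
import asyncio


async def fake_scrape(keywords, pages):
    """Hypothetical stand-in for audo.scrape_abstracts; does no network access."""
    await asyncio.sleep(0)  # yield to the event loop once, as real I/O would
    return [f"{kw}-page-{p}" for kw in keywords for p in range(1, pages + 1)]


def run_scrape(keywords, pages):
    # asyncio.run creates a fresh event loop, runs the coroutine, then closes the loop.
    return asyncio.run(fake_scrape(keywords, pages))


if __name__ == "__main__":
    print(run_scrape(["cotton"], 2))
```

Note that `asyncio.run` raises a `RuntimeError` when a loop is already running, as is the case inside a Jupyter notebook; there, awaiting the coroutine directly in a cell is the usual alternative.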
- Free software: GNU General Public License v3
- Documentation: https://AuDoLab.readthedocs.io.
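The classification step in the usage example trains only on the scraped, in-domain abstracts and then decides which target documents resemble them. The sketch below illustrates that one-class idea with the standard library alone. It deliberately swaps the one-class SVM for a much simpler centroid-similarity threshold, so it is not AuDoLab's implementation; every function name and the `threshold` parameter are assumptions made for illustration.

```python
import math
from collections import Counter


def fit_idf(docs):
    """Learn vocabulary and smoothed IDF weights from the training documents only."""
    n = len(docs)
    df = Counter(t for tokens in docs for t in set(tokens))
    idf = {t: math.log((1 + n) / (1 + d)) + 1.0 for t, d in df.items()}
    return set(idf), idf


def tfidf_vectors(docs, vocabulary, idf):
    """Turn tokenised documents into L2-normalised sparse TF-IDF dicts."""
    vectors = []
    for tokens in docs:
        counts = Counter(t for t in tokens if t in vocabulary)
        vec = {t: c * idf[t] for t, c in counts.items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors


def centroid(vectors):
    """Average the training vectors term by term."""
    total = {}
    for vec in vectors:
        for t, w in vec.items():
            total[t] = total.get(t, 0.0) + w
    return {t: w / len(vectors) for t, w in total.items()}


def similarity(vec, center):
    """Dot product of a unit-length document vector with the centroid."""
    return sum(w * center.get(t, 0.0) for t, w in vec.items())


def label_targets(training_docs, target_docs, threshold):
    """Mark target documents whose similarity to the training centroid reaches the threshold."""
    vocabulary, idf = fit_idf(training_docs)
    center = centroid(tfidf_vectors(training_docs, vocabulary, idf))
    targets = tfidf_vectors(target_docs, vocabulary, idf)
    return [similarity(vec, center) >= threshold for vec in targets]


if __name__ == "__main__":
    training = [
        ["cotton", "crop", "yield"],
        ["cotton", "fibre", "price"],
        ["cotton", "harvest", "yield"],
    ]
    targets = [
        ["cotton", "price", "rise"],
        ["bank", "interest", "rate"],
        ["oil", "export", "quota"],
    ]
    print(label_targets(training, targets, threshold=0.1))
```

Only the first target shares vocabulary with the training documents, so it is the only one labelled as in-domain. A one-class SVM plays the same role with a learned decision boundary rather than a fixed similarity cut-off.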
Owner
- Login: ArneTillmann
- Kind: user
- Repositories: 3
- Profile: https://github.com/ArneTillmann
JOSS Publication
AuDoLab: Automatic document labelling and classification for extremely unbalanced data
Authors
Georg-August-Universität Göttingen, Göttingen, Germany
Georg-August-Universität Göttingen, Göttingen, Germany
Georg-August-Universität Göttingen, Göttingen, Germany
Georg-August-Universität Göttingen, Göttingen, Germany, Campus-Institut Data Science (CIDAS), Göttingen, Germany
Georg-August-Universität Göttingen, Göttingen, Germany, Campus-Institut Data Science (CIDAS), Göttingen, Germany
Georg-August-Universität Göttingen, Göttingen, Germany, Campus-Institut Data Science (CIDAS), Göttingen, Germany
Georg-August-Universität Göttingen, Göttingen, Germany, Campus-Institut Data Science (CIDAS), Göttingen, Germany
Tags
One-class SVM, Unsupervised Document Classification, One-class Document Classification, LDA Topic Modelling, Out-of-domain Training Data
Committers
Last synced: 5 months ago
Top Committers
| Name | Email | Commits |
|---|---|---|
| ArneTillmann | a****r@g****m | 304 |
| Christoph Weisser | 5****9 | 54 |
| Anton Thielmann | 5****n | 42 |
| AFThielmann | a****n@t****e | 13 |
| Benjamin Säfken | b****e@u****e | 3 |
| kantg | g****t@f****e | 3 |
| Arfon Smith | a****n | 2 |
| tkneib | t****b@u****e | 1 |
| pyup-bot | g****t@p****o | 1 |
| Gillian Kant | 5****g | 1 |
| AlexanderSilbersdorff | 8****f | 1 |
Issues and Pull Requests
Last synced: 4 months ago
All Time
- Total issues: 5
- Total pull requests: 29
- Average time to close issues: about 1 month
- Average time to close pull requests: 4 days
- Total issue authors: 3
- Total pull request authors: 4
- Average comments per issue: 0.6
- Average comments per pull request: 0.07
- Merged pull requests: 26
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 0
- Pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Issue authors: 0
- Pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- ArneTillmann (3)
- linuxscout (1)
- pyup-bot (1)
Pull Request Authors
- ArneTillmann (25)
- arfon (2)
- ChrisW09 (1)
- pyup-bot (1)
Packages
- Total packages: 1
- Total downloads: 311 last month (PyPI)
- Total dependent packages: 0
- Total dependent repositories: 0
- Total versions: 34
- Total maintainers: 1
pypi.org: audolab
With AuDoLab you can do LDA on highly imbalanced datasets.
- Homepage: https://github.com/ArneTillmann/AuDoLab
- Documentation: https://audolab.readthedocs.io/
- License: GNU General Public License v3
- Latest release: 1.0.16 (published about 4 years ago)
Dependencies
- Click >=7.0
- Sphinx >=3.5.4
- WordCloud >=1.8.1
- bs4 >=0.0.1
- bump2version >=0.5.11
- coverage >=6.0b1
- flake8 >=3.7.8
- funcy *
- gensim >=3.8.3
- lime >=0.2.0.1
- matplotlib >=3.3.4
- nest-asyncio >=1.5.1
- nltk >=3.5
- numpy >=1.19.2
- pandas >=1.2.3
- pyldavis >=3.3.1
- pyppeteer *
- pytest >=4.6.5
- pytest-runner >=5.1
- requests >=2.25.1
- scikit-learn >=0.24.1
- tox >=3.14.0
- tqdm >=4.62.0
- twine *
- watchdog >=0.9.0
- webbot >=0.34
- wheel >=0.33.6
- wordcloud *
