niacin

niacin: A Python package for text data enrichment - Published in JOSS (2020)

https://github.com/deniederhut/niacin

Science Score: 93.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 4 DOI reference(s) in README and JOSS metadata
  • Academic publication links
    Links to: joss.theoj.org
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
    Published in Journal of Open Source Software

Scientific Fields

Mathematics Computer Science - 37% confidence
Last synced: 6 months ago

Repository

Enrich your data

Basic Info
  • Host: GitHub
  • Owner: deniederhut
  • License: bsd-3-clause
  • Language: Python
  • Default Branch: master
  • Homepage: https://niacin.readthedocs.io
  • Size: 2.36 MB
Statistics
  • Stars: 17
  • Watchers: 2
  • Forks: 2
  • Open Issues: 2
  • Releases: 2
Created over 7 years ago · Last pushed almost 4 years ago
Metadata Files
Readme Contributing License

README.md

niacin

A Python library for replacing the missing variation in your text data.


Why should I use this?

Data collected for model training necessarily undersamples the likely variance in the input space. This library is a collection of tools for two tasks: inserting typical kinds of perturbations to better approximate population variance, and creating similar-but-incorrect examples to help reduce the total size of the hypothesis space. These are commonly known as enrichment and negative sampling, respectively.

How do I use this?

Functions in niacin are separated into submodules for specific data types. Functions expose a similar API, with two input arguments: the data to be transformed, and the probability of applying a specific transformation.

enrichment:

```python
from niacin.text import en
data = "This is the song that never ends and it goes on and on my friends"
print(en.add_misspelling(data, p=1.0))
```

```output
This is teh song tath never ends adn it goes on anbd on my firends
```

negative sampling:

```python
from niacin.text import en
data = "This is the song that never ends and it goes on and on my friends"
print(en.add_hypernyms(data, p=1.0))
```

```output
This is the musical composition that never extremity and it exit on and on my person
```
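The shared calling convention described above (the data to transform, plus a probability `p` of applying the transformation) can be illustrated without installing niacin. The sketch below is a hypothetical stand-in, not niacin's implementation: the function name `swap_adjacent_letters` and its `seed` parameter are assumptions introduced here only to show how a per-word probability might gate a perturbation.

```python
import random
from typing import Optional


def swap_adjacent_letters(text: str, p: float = 0.5, seed: Optional[int] = None) -> str:
    """Hypothetical misspelling-style transform (not part of niacin).

    Each alphabetic word of four or more letters is perturbed with
    probability p by swapping one pair of adjacent interior letters.
    """
    rng = random.Random(seed)
    words = []
    for word in text.split(" "):
        # rng.random() lies in [0, 1), so p=0.0 never fires and p=1.0 always does.
        if len(word) >= 4 and word.isalpha() and rng.random() < p:
            # Pick an interior position so the first and last letters stay put.
            i = rng.randrange(1, len(word) - 2)
            word = word[:i] + word[i + 1] + word[i] + word[i + 2:]
        words.append(word)
    return " ".join(words)


data = "This is the song that never ends and it goes on and on my friends"
print(swap_adjacent_letters(data, p=0.0))          # p=0.0 leaves the text unchanged
print(swap_adjacent_letters(data, p=1.0, seed=0))  # every eligible word is perturbed
```

Words shorter than four letters are left alone in this sketch, and a swap of two identical letters produces no visible change, so even `p=1.0` does not guarantee that every word differs from the input.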

How do I install this?

with pip:

```sh
pip install niacin
```

from source:

```sh
git clone git@github.com:deniederhut/niacin.git && cd niacin && python setup.py install
```

If you have installed niacin from source, you can run the test suite to verify that everything is working properly. We use pytest, which you will first need to install:

```sh
pip install pytest
```

Then you can run the library's tests with:

```sh
pytest -m 'not slow'
```

If you would like to see the coverage report, you can generate one with pytest-cov:

```sh
pip install pytest-cov
pytest -m 'not slow' --cov=niacin && coverage html
```

How can I install the optional dependencies?

If you want to use the backtranslate functionality, niacin will need pytorch and some other libraries. These can be installed as extras with:

```sh
pip install niacin[backtranslate]
```

If you are on macOS, this might fail with a warning about your version of gcc:

```
Your compiler (g++) is not compatible with the compiler Pytorch was built with for this platform, which is clang++ on darwin.
```

You can avoid this error by executing the following:

```sh
CFLAGS='-stdlib=libc++' pip install niacin[backtranslate]
```
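After installing the extras, one way to confirm that the optional dependencies are visible to the interpreter is to look them up without importing them. This is a minimal sketch, not something niacin provides: the helper `missing_modules` is hypothetical, and the module names are assumed from the backtranslate requirements listed for the project. A package's import name can differ from its PyPI name, so `fastbpe` is left out here rather than guessed.

```python
from importlib.util import find_spec

# Assumed import names, taken from the project's backtranslate requirements.
BACKTRANSLATE_MODULES = ("torch", "fairseq", "sacremoses")


def missing_modules(names=BACKTRANSLATE_MODULES):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]


missing = missing_modules()
if missing:
    print("Not installed:", ", ".join(missing))
else:
    print("All backtranslate modules were found.")
```

Because `find_spec` only locates a module and does not execute it, this check stays fast and does not trigger the compiler-related failure described above.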

Owner

  • Name: Dillon Niederhut
  • Login: deniederhut
  • Kind: user
  • Company: @novilabs

data science @novilabs | editor @scipy-conference | ml research @adversarial-designs | bad poetry everywhere | he/him

JOSS Publication

niacin: A Python package for text data enrichment
Published
June 11, 2020
Volume 5, Issue 50, Page 2136
Authors
Dillon Niederhut ORCID
Novi Labs
Editor
Olivia Guest ORCID
Tags
data augmentation natural language processing machine learning

GitHub Events

Total
  • Watch event: 1
Last Year
  • Watch event: 1

Committers

Last synced: 7 months ago

All Time
  • Total Commits: 70
  • Total Committers: 1
  • Avg Commits per committer: 70.0
  • Development Distribution Score (DDS): 0.0
Past Year
  • Commits: 0
  • Committers: 0
  • Avg Commits per committer: 0.0
  • Development Distribution Score (DDS): 0.0
Top Committers
| Name | Email | Commits |
| --- | --- | --- |
| Dillon Niederhut | d****t@g****m | 70 |
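The committer figures above can be reproduced with a few lines of arithmetic. The page does not define the Development Distribution Score, so this sketch assumes a common definition: one minus the top committer's share of all commits. The function name `commit_stats` is hypothetical, and returning 0.0 when there are no commits is a choice made here to mirror the "Past Year" figures.

```python
def commit_stats(commits_per_author):
    """Summarize a mapping of author name -> commit count.

    DDS is assumed to be 1 - (top committer's commits / total commits).
    """
    total = sum(commits_per_author.values())
    committers = len(commits_per_author)
    if total == 0:
        # No activity: report zeros, as the page does for the past year.
        return {"total": 0, "committers": committers, "average": 0.0, "dds": 0.0}
    top = max(commits_per_author.values())
    return {
        "total": total,
        "committers": committers,
        "average": total / committers,
        "dds": 1 - top / total,
    }


print(commit_stats({"Dillon Niederhut": 70}))
# {'total': 70, 'committers': 1, 'average': 70.0, 'dds': 0.0}
```

Under this assumed definition, a single-author repository always scores 0.0, and the score rises toward 1 as commits are spread across more people.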

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 8
  • Total pull requests: 34
  • Average time to close issues: about 1 month
  • Average time to close pull requests: 3 days
  • Total issue authors: 4
  • Total pull request authors: 1
  • Average comments per issue: 0.88
  • Average comments per pull request: 0.18
  • Merged pull requests: 34
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • sara-02 (5)
  • gunturbudi (1)
  • IllyShaieb (1)
  • fabiencro (1)
Pull Request Authors
  • deniederhut (34)
Top Labels
Issue Labels
Pull Request Labels

Dependencies

backtranslate-requirements.txt pypi
  • fairseq *
  • fastbpe *
  • sacremoses *
  • torch *
requirements.txt pypi
  • nltk >=3.0
  • regex *
  • scipy >=1.0.0
torch-requirements.txt pypi
  • pandas *
  • torch *
  • torchtext *