Nostril

Nostril: A nonsense string evaluator written in Python - Published in JOSS (2018)

https://github.com/casics/nostril

Science Score: 95.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 9 DOI reference(s) in README and JOSS metadata
  • Academic publication links
  • Committers with academic emails
    1 of 1 committers (100.0%) from academic institutions
  • Institutional organization owner
  • JOSS paper metadata
    Published in Journal of Open Source Software

Keywords

detector gibberish identifier-string identifiers inference mining-software-repositories nonsense nonsense-string-evaluator source-code text-processing

Scientific Fields

Psychology Social Sciences - 40% confidence
Last synced: 4 months ago

Repository

Nostril: Nonsense String Evaluator

Basic Info
  • Host: GitHub
  • Owner: casics
  • License: lgpl-2.1
  • Language: Python
  • Default Branch: master
  • Size: 143 MB
Statistics
  • Stars: 197
  • Watchers: 2
  • Forks: 35
  • Open Issues: 17
  • Releases: 4
Archived
Topics
detector gibberish identifier-string identifiers inference mining-software-repositories nonsense nonsense-string-evaluator source-code text-processing
Created about 8 years ago · Last pushed over 3 years ago
Metadata Files
Readme Changelog Contributing License Citation

README.md

Nostril

Nostril is the Nonsense String Evaluator: a Python module that infers whether a given short string of characters is likely to be random gibberish or something meaningful.


Author: Michael Hucka (ORCID: 0000-0001-9105-5960)
Code repository: https://github.com/casics/nostril
License: Unless otherwise noted, this content is licensed under the LGPL version 2.1 license.

Recent news and activities

November 2019: Release version 1.2.0 changes the license for Nostril to LGPL version 2.1. There are no API or behavioral changes; all changes are limited to documentation strings, the README file, and a new DOI.

The file NEWS contains a more complete change log that includes information about previous releases.

Introduction

A number of research efforts have investigated extracting and analyzing textual information contained in software artifacts. However, source code files can contain meaningless text, such as random text used as markers or test cases, and code extraction methods can also sometimes make mistakes and produce garbled text. When used in processing pipelines without human intervention, it is often important to include a data cleaning step before passing tokens extracted from source code to subsequent analysis or machine learning algorithms. Thus, a basic (and often unmentioned) step is to filter out nonsense tokens.

Nostril is a Python 3 module that can be used to infer whether a given word or text string is likely to be nonsense or meaningful text. Nostril takes a text string and returns True if it is probably nonsense, False otherwise. Meaningful in this case means a string of characters that is probably constructed from real or real-looking English words or fragments of real words (even if the words are run togetherlikethis). The main use case is to decide whether short strings returned by source code mining methods are likely to be program identifiers (of classes, functions, variables, etc.), or random characters or other non-identifier strings. To illustrate, the following example code,

```python
from nostril import nonsense

real_test = ['bunchofwords', 'getint', 'xywinlist', 'ioFlXFndrInfo',
             'DMEcalPreshowerDigis', 'httpredaksikatakamiwordpresscom']
junk_test = ['faiwtlwexu', 'asfgtqwafazfyiur', 'zxcvbnmlkjhgfdsaqwerty']

for s in real_test + junk_test:
    print('{}: {}'.format(s, 'nonsense' if nonsense(s) else 'real'))
```

produces the following output:

```
bunchofwords: real
getint: real
xywinlist: real
ioFlXFndrInfo: real
DMEcalPreshowerDigis: real
httpredaksikatakamiwordpresscom: real
faiwtlwexu: nonsense
asfgtqwafazfyiur: nonsense
zxcvbnmlkjhgfdsaqwerty: nonsense
```

Nostril uses a combination of heuristic rules and a probabilistic assessment. It is not always correct (see below). It is tuned to reduce false positives: it is more likely to say something is not gibberish when it really might be. This is suitable for its intended purpose of filtering source code identifiers – a difficult problem, incidentally, because program identifiers often consist of acronyms and word fragments jammed together (e.g., "kBoPoMoFoOrderIdCID", "ioFlXFndrInfo", etc.), which can challenge even humans. Nevertheless, on the identifier strings from the Loyola University of Delaware Identifier Splitting Oracle, Nostril classifies over 99% correctly.
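
As a quick sanity check of this tuning, the identifiers quoted above can be fed straight to nonsense(). This is a minimal sketch: the earlier example shows ioFlXFndrInfo labeled real, while kBoPoMoFoOrderIdCID is included only to illustrate the kind of input being discussed, with no guarantee about its classification.

```python
from nostril import nonsense

# Tricky real-world identifiers mentioned in the text above.
for ident in ['kBoPoMoFoOrderIdCID', 'ioFlXFndrInfo']:
    print(ident, '->', 'nonsense' if nonsense(ident) else 'real')
```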

Nostril is reasonably fast: once the module is loaded, on a 4 GHz Apple OS X 10.12 computer, calling the evaluation function returns a result in 30–50 microseconds per string on average.
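
A rough way to check this timing on your own machine is a sketch like the following, which keeps the module load outside the timed loop; absolute numbers will of course vary with hardware.

```python
import timeit

from nostril import nonsense  # module load happens once, outside the timing

# Time many calls on a single string and report the per-call average.
n = 10_000
elapsed = timeit.timeit(lambda: nonsense('DMEcalPreshowerDigis'), number=n)
print('{:.1f} microseconds per call'.format(elapsed / n * 1e6))
```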

Please cite the Nostril paper and the version you use

Article citations are critical for academic developers. If you use Nostril and you publish papers about work that uses Nostril, please cite the Nostril paper:

Hucka, M. (2018). Nostril: A nonsense string evaluator written in Python. Journal of Open Source Software, 3(25), 596, https://doi.org/10.21105/joss.00596

Please also use the DOI to indicate the specific version you use, to improve other people's ability to reproduce your results.

Installation instructions

The following is probably the simplest and most direct way to install Nostril on your computer:

```sh
sudo pip3 install git+https://github.com/casics/nostril.git
```

Alternatively, you can clone this repository and install from the cloned directory:

```sh
git clone https://github.com/casics/nostril.git
cd nostril
sudo python3 -m pip install .
```

Both of these installation approaches should automatically install some Python dependencies that Nostril relies upon, namely plac, tabulate, humanize, and pytest.
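
A minimal smoke test to confirm the installation, using strings whose classifications appear in the README example above:

```python
from nostril import nonsense

# Classifications taken from the example earlier in this README.
assert not nonsense('bunchofwords')  # real words run together -> False
assert nonsense('faiwtlwexu')        # gibberish -> True
print('Nostril is installed and working')
```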

Using Nostril

The basic usage is very simple. Nostril provides a Python function named nonsense(). This function takes a single text string as an argument and returns a Boolean value as a result. Here is an example:

```python
from nostril import nonsense

if nonsense('yoursinglestringhere'):
    print("nonsense")
else:
    print("real")
```

The Nostril source code distribution also comes with a command-line program called nostril. You can invoke the nostril command-line interface in two ways:

  1. Using the Python interpreter: python3 -m nostril
  2. On Linux and macOS systems, using the program nostril, which should be installed automatically by setup.py in a bin directory on your shell's command search path. Thus, you should be able to run it normally: nostril

The command-line program can take strings on the command line or (with the -f option) in a file, and will return nonsense-or-not assessments for each string. It can be useful for interactive testing and experimentation. For example:

```sh
nostril bunchofwords xywinlist ioFlXFndrInfo lasaakldfalakj xyxyxyx
bunchofwords [real]
xywinlist [real]
ioFlXFndrInfo [real]
lasaakldfalakj [nonsense]
xyxyxyx [nonsense]
```

Beware that the Nostril module takes a noticeable amount of time to load, and since the command-line program must reload the module anew each time, it is relatively slow as a means of using Nostril. (In normal usage, your program would only load the Python module once and not incur the loading time on every call.)

Nostril ignores numbers, spaces and punctuation characters embedded in the input string. This was a design decision made for practicality – it makes Nostril a bit easier to use. If, for your application, non-letter characters indicate a string that is definitely nonsense, then you may wish to test for that separately before passing the string to Nostril.
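
For instance, a minimal wrapper along those lines might look like this (strict_nonsense is a hypothetical helper written for illustration, not part of Nostril's API):

```python
import re

from nostril import nonsense

def strict_nonsense(s):
    """Treat any string containing non-letter characters as nonsense,
    then fall back to Nostril for all-letter strings. This stricter
    policy is an illustration, not Nostril's own behavior."""
    if re.search(r'[^A-Za-z]', s):
        return True
    return nonsense(s)
```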

Please see the docs subdirectory for more information about Nostril and its operation.

Performance

You can verify the following results yourself by running the small test program tests/test.py. The following are the results on sets of strings that are all either real identifiers or all random/gibberish text:

| Test case | Meaningful strings | Gibberish strings | False pos. | False neg. | Accuracy |
|---|---|---|---|---|---|
| /usr/share/dict/web2 | 218,752 | 0 | 89 | 0 | 99.96% |
| Ludiso oracle | 2,540 | 0 | 6 | 0 | 99.76% |
| Auto-generated random strings | 0 | 997,636 | 0 | 82,754 | 91.70% |
| Hand-written random strings | 0 | 1,000 | 0 | 205 | 79.50% |

In tests on real identifiers extracted from actual software source code mined by the author in another project, Nostril's performance is as follows:

| Test case | Meaningful strings | Gibberish strings | False pos. | False neg. | Precision | Recall |
|---|---|---|---|---|---|---|
| Strings mined from real code | 4,261 | 364 | 6 | 5 | 98.36% | 98.63% |
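
To make explicit how the precision and recall figures follow from the raw counts (treating "nonsense" as the positive class), they can be recomputed directly:

```python
# Raw counts from the table above.
gibberish, false_pos, false_neg = 364, 6, 5

true_pos = gibberish - false_neg               # 359 gibberish strings caught
precision = true_pos / (true_pos + false_pos)  # 359/365
recall = true_pos / (true_pos + false_neg)     # 359/364

print('precision={:.2%} recall={:.2%}'.format(precision, recall))
# precision=98.36% recall=98.63%
```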

Limitations

Nostril is not fool-proof; it will generate some false positives and false negatives. This is an unavoidable consequence of the problem domain: without direct knowledge, even a human cannot recognize a real text string in all cases. Nostril's default trained system puts emphasis on reducing false positives (i.e., reducing how often it mistakenly labels something as nonsense) rather than false negatives, so it will sometimes report that something is not nonsense when it really is.

A vexing result is that Nostril performs more poorly on supposedly "random" strings typed by a human. I hypothesize this is because those strings are less random than they seem: if someone is asked to type junk at random on a QWERTY keyboard, they are likely to use many characters from the home row (a-s-d-f-g-h-j-k-l), and those characters actually turn out to be rather common in English words. In other words, strings "typed at random" on a keyboard are not that random after all, and probably have statistical properties similar to those of real words. These cases are hard for Nostril but, thankfully, rare in real-world situations. This view is supported by the fact that Nostril's performance is much better on statistically random text strings generated by software.
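
One way to probe this hypothesis is to compare uniformly random strings against home-row-biased ones. The following is a small experiment sketch; the home-row bias model is an assumption made here for illustration.

```python
import random
import string

from nostril import nonsense

random.seed(42)

# Uniformly random gibberish vs. "human-style" gibberish biased toward
# the QWERTY home row, per the hypothesis above.
uniform = [''.join(random.choices(string.ascii_lowercase, k=10)) for _ in range(5)]
home_row = [''.join(random.choices('asdfghjkl', k=10)) for _ in range(5)]

for s in uniform + home_row:
    print(s, '->', 'nonsense' if nonsense(s) else 'real')
```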

Nostril has been trained using American English words, and is unlikely to work for other languages unchanged. However, the underlying framework may work if it is retrained on different sample inputs. Nostril uses n-grams coupled with a custom TF-IDF weighting scheme. See the subdirectory training for the code used to train the system.
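
To convey the general idea, here is an illustrative sketch (not Nostril's actual implementation): score a string by the corpus-derived weights of its character n-grams, where the weights table would come from training.

```python
from collections import Counter

def ngrams(s, n=4):
    """Extract character n-grams from the letters of s."""
    s = ''.join(c for c in s.lower() if c.isalpha())
    return [s[i:i + n] for i in range(len(s) - n + 1)]

def englishness(s, weights):
    """Average TF-IDF-style weight of the string's n-grams; n-grams
    never seen in training contribute nothing. A low score suggests
    gibberish. `weights` is a hypothetical trained table."""
    grams = ngrams(s)
    if not grams:
        return 0.0
    counts = Counter(grams)
    return sum(c * weights.get(g, 0.0) for g, c in counts.items()) / len(grams)
```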

Finally, the algorithm does not perform well on very short text. By default, Nostril requires strings longer than 6 characters and raises an exception for anything shorter.
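
If your input may include short strings, a guard like the following avoids the exception (a sketch only; safe_nonsense is a hypothetical helper, and the exception type is assumed here to be ValueError):

```python
from nostril import nonsense

def safe_nonsense(s, default=False):
    # Nostril rejects strings of 6 characters or fewer; catch that case
    # and return a caller-chosen default instead.
    try:
        return nonsense(s)
    except ValueError:
        return default
```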

More information

Please see the docs subdirectory for more information.

Getting help and support

If you find an issue, please submit it in the GitHub issue tracker for this repository.

Contributing — info for developers

Any constructive contributions – bug reports, pull requests (code or documentation), suggestions for improvements, and more – are welcome. Please feel free to contact me directly, or even better, jump right in and use the standard GitHub approach of forking the repo and creating a pull request.

Everyone is asked to read and respect the code of conduct when participating in this project.

Acknowledgments

This material is based upon work supported by the National Science Foundation under Grant Number 1533792 (Principal Investigator: Michael Hucka). Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.



Owner

  • Name: CASICS
  • Login: casics
  • Kind: organization
  • Email: casics-team@googlegroups.com
  • Location: Pasadena, California

Comprehensive and Automated Software Inventory Creation System

JOSS Publication

Nostril: A nonsense string evaluator written in Python
Published
May 11, 2018
Volume 3, Issue 25, Page 596
Authors
Michael Hucka ORCID
Department of Computing and Mathematical Sciences, California Institute of Technology, Pasadena, CA 91125, USA
Editor
Jake Vanderplas ORCID
Tags
mining source repositories identifiers text processing inference

GitHub Events

Total
  • Watch event: 7
Last Year
  • Watch event: 8

Committers

Last synced: 5 months ago

All Time
  • Total Commits: 152
  • Total Committers: 1
  • Avg Commits per committer: 152.0
  • Development Distribution Score (DDS): 0.0
Past Year
  • Commits: 0
  • Committers: 0
  • Avg Commits per committer: 0.0
  • Development Distribution Score (DDS): 0.0
Top Committers
Name Email Commits
Michael Hucka m****a@c****u 152

Issues and Pull Requests

Last synced: 4 months ago

All Time
  • Total issues: 18
  • Total pull requests: 6
  • Average time to close issues: about 1 month
  • Average time to close pull requests: 10 months
  • Total issue authors: 10
  • Total pull request authors: 5
  • Average comments per issue: 1.17
  • Average comments per pull request: 0.33
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 1
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 1
  • Pull request authors: 0
  • Average comments per issue: 0.0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • mhucka (7)
  • 1Hyena (3)
  • TheFausap (1)
  • pombredanne (1)
  • dedpehto (1)
  • rsuhaibani (1)
  • fowler-mychale (1)
  • alanhamlett (1)
  • pete7628 (1)
  • ShridharSahu (1)
Pull Request Authors
  • Forest216 (2)
  • alexwilson1 (1)
  • dzoladz (1)
  • MonsieurV (1)
  • FlorentJeannot (1)
Top Labels
Issue Labels
Coding: enhancement ✨ (3) Task: docs 📜 (2) Task: chore 📥 (1)

Packages

  • Total packages: 1
  • Total downloads: 36,648 last month (pypi)
  • Total dependent packages: 0
  • Total dependent repositories: 0
  • Total versions: 4
  • Total maintainers: 1
pypi.org: nostril-detector

Nonsense String Evaluator

  • Versions: 4
  • Dependent Packages: 0
  • Dependent Repositories: 0
  • Downloads: 36,648 Last month
Rankings
  • Dependent packages count: 10.0%
  • Dependent repos count: 65.7%
  • Average: 37.8%
Maintainers (1)
Last synced: 4 months ago