Nostril

Nostril: A nonsense string evaluator written in Python - Published in JOSS (2018)

https://github.com/casics/nostril

Science Score: 95.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 9 DOI reference(s) in README and JOSS metadata
  • Academic publication links
  • Committers with academic emails
    1 of 1 committers (100.0%) from academic institutions
  • Institutional organization owner
  • JOSS paper metadata
    Published in Journal of Open Source Software

Keywords

detector gibberish identifier-string identifiers inference mining-software-repositories nonsense nonsense-string-evaluator source-code text-processing

Scientific Fields

Psychology Social Sciences - 40% confidence
Last synced: 4 months ago

Repository

Nostril: Nonsense String Evaluator

Basic Info
  • Host: GitHub
  • Owner: casics
  • License: lgpl-2.1
  • Language: Python
  • Default Branch: master
  • Size: 143 MB
Statistics
  • Stars: 197
  • Watchers: 2
  • Forks: 35
  • Open Issues: 17
  • Releases: 4
Archived
Topics
detector gibberish identifier-string identifiers inference mining-software-repositories nonsense nonsense-string-evaluator source-code text-processing
Created about 8 years ago · Last pushed over 3 years ago
Metadata Files
Readme Changelog Contributing License Citation

README.md

Nostril

Nostril is the Nonsense String Evaluator: a Python module that infers whether a given short string of characters is likely to be random gibberish or something meaningful.


Author: Michael Hucka (ORCID: 0000-0001-9105-5960)
Code repository: https://github.com/casics/nostril
License: Unless otherwise noted, this content is licensed under the LGPL version 2.1 license.

Recent news and activities

November 2019: Release version 1.2.0 changes the license for Nostril to LGPL version 2.1. There are no API or behavioral changes; all changes are limited to documentation strings, the README file, and a new DOI.

The file NEWS contains a more complete change log that includes information about previous releases.

Introduction

A number of research efforts have investigated extracting and analyzing textual information contained in software artifacts. However, source code files can contain meaningless text, such as random text used as markers or test cases, and code extraction methods can also sometimes make mistakes and produce garbled text. When used in processing pipelines without human intervention, it is often important to include a data cleaning step before passing tokens extracted from source code to subsequent analysis or machine learning algorithms. Thus, a basic (and often unmentioned) step is to filter out nonsense tokens.

Nostril is a Python 3 module that can be used to infer whether a given word or text string is likely to be nonsense or meaningful text. Nostril takes a text string and returns True if it is probably nonsense, False otherwise. Meaningful in this case means a string of characters that is probably constructed from real or real-looking English words or fragments of real words (even if the words are run togetherlikethis). The main use case is to decide whether short strings returned by source code mining methods are likely to be program identifiers (of classes, functions, variables, etc.), or random characters or other non-identifier strings. To illustrate, the following example code,

```python
from nostril import nonsense

real_test = ['bunchofwords', 'getint', 'xywinlist', 'ioFlXFndrInfo',
             'DMEcalPreshowerDigis', 'httpredaksikatakamiwordpresscom']
junk_test = ['faiwtlwexu', 'asfgtqwafazfyiur', 'zxcvbnmlkjhgfdsaqwerty']

for s in real_test + junk_test:
    print('{}: {}'.format(s, 'nonsense' if nonsense(s) else 'real'))
```

produces the following output:

```
bunchofwords: real
getint: real
xywinlist: real
ioFlXFndrInfo: real
DMEcalPreshowerDigis: real
httpredaksikatakamiwordpresscom: real
faiwtlwexu: nonsense
asfgtqwafazfyiur: nonsense
zxcvbnmlkjhgfdsaqwerty: nonsense
```

Nostril uses a combination of heuristic rules and a probabilistic assessment. It is not always correct (see below). It is tuned to reduce false positives: it is more likely to say something is not gibberish when it really might be. This is suitable for its intended purpose of filtering source code identifiers – a difficult problem, incidentally, because program identifiers often consist of acronyms and word fragments jammed together (e.g., "kBoPoMoFoOrderIdCID", "ioFlXFndrInfo", etc.), which can challenge even humans. Nevertheless, on the identifier strings from the Loyola University of Delaware Identifier Splitting Oracle, Nostril classifies over 99% correctly.
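
As a quick sanity check of this tuning, the identifiers quoted above can be fed straight to nonsense(). This is a minimal sketch: the earlier example shows ioFlXFndrInfo labeled real, while kBoPoMoFoOrderIdCID is included only to illustrate the kind of input being discussed, with no guarantee about its classification.

```python
from nostril import nonsense

# Tricky real-world identifiers mentioned in the text above.
for ident in ['kBoPoMoFoOrderIdCID', 'ioFlXFndrInfo']:
    print(ident, '->', 'nonsense' if nonsense(ident) else 'real')
```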

Nostril is reasonably fast: once the module is loaded, on a 4 GHz Apple OS X 10.12 computer, calling the evaluation function returns a result in 30–50 microseconds per string on average.
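
A rough way to check this timing on your own machine is a sketch like the following, which keeps the module load outside the timed loop; absolute numbers will of course vary with hardware.

```python
import timeit

from nostril import nonsense  # module load happens once, outside the timing

# Time many calls on a single string and report the per-call average.
n = 10_000
elapsed = timeit.timeit(lambda: nonsense('DMEcalPreshowerDigis'), number=n)
print('{:.1f} microseconds per call'.format(elapsed / n * 1e6))
```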

Please cite the Nostril paper and the version you use

Article citations are critical for academic developers. If you use Nostril and you publish papers about work that uses Nostril, please cite the Nostril paper:

Hucka, M. (2018). Nostril: A nonsense string evaluator written in Python. Journal of Open Source Software, 3(25), 596, https://doi.org/10.21105/joss.00596

Please also use the DOI to indicate the specific version you use, to improve other people's ability to reproduce your results.

Installation instructions

The following is probably the simplest and most direct way to install Nostril on your computer:

```sh
sudo pip3 install git+https://github.com/casics/nostril.git
```

Alternatively, you can clone this repository and install from the cloned directory:

```sh
git clone https://github.com/casics/nostril.git
cd nostril
sudo python3 -m pip install .
```

Both of these installation approaches should automatically install some Python dependencies that Nostril relies upon, namely plac, tabulate, humanize, and pytest.
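
A minimal smoke test to confirm the installation, using strings whose classifications appear in the README example above:

```python
from nostril import nonsense

# Classifications taken from the example earlier in this README.
assert not nonsense('bunchofwords')  # real words run together -> False
assert nonsense('faiwtlwexu')        # gibberish -> True
print('Nostril is installed and working')
```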

Using Nostril

The basic usage is very simple. Nostril provides a Python function named nonsense(). This function takes a single text string as an argument and returns a Boolean value as a result. Here is an example:

```python
from nostril import nonsense

if nonsense('yoursinglestringhere'):
    print("nonsense")
else:
    print("real")
```

The Nostril source code distribution also comes with a command-line program called nostril. You can invoke the nostril command-line interface in two ways:

  1. Using the Python interpreter: python3 -m nostril
  2. On Linux and macOS systems, using the program nostril, which should be installed automatically by setup.py in a bin directory on your shell's command search path. Thus, you should be able to run it normally: nostril

The command-line program can take strings on the command line or (with the -f option) in a file, and will return nonsense-or-not assessments for each string. It can be useful for interactive testing and experimentation. For example:

```sh
nostril bunchofwords xywinlist ioFlXFndrInfo lasaakldfalakj xyxyxyx
bunchofwords [real]
xywinlist [real]
ioFlXFndrInfo [real]
lasaakldfalakj [nonsense]
xyxyxyx [nonsense]
```

Beware that the Nostril module takes a noticeable amount of time to load, and since the command-line program must reload the module anew each time, it is relatively slow as a means of using Nostril. (In normal usage, your program would only load the Python module once and not incur the loading time on every call.)

Nostril ignores numbers, spaces and punctuation characters embedded in the input string. This was a design decision made for practicality – it makes Nostril a bit easier to use. If, for your application, non-letter characters indicate a string that is definitely nonsense, then you may wish to test for that separately before passing the string to Nostril.
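
For instance, a minimal wrapper along those lines might look like this (strict_nonsense is a hypothetical helper written for illustration, not part of Nostril's API):

```python
import re

from nostril import nonsense

def strict_nonsense(s):
    """Treat any string containing non-letter characters as nonsense,
    then fall back to Nostril for all-letter strings. This stricter
    policy is an illustration, not Nostril's own behavior."""
    if re.search(r'[^A-Za-z]', s):
        return True
    return nonsense(s)
```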

Please see the docs subdirectory for more information about Nostril and its operation.

Performance

You can verify the following results yourself by running the small test program tests/test.py. The following are the results on sets of strings that are all either real identifiers or all random/gibberish text:

| Test case | Meaningful strings | Gibberish strings | False pos. | False neg. | Accuracy |
|---|---|---|---|---|---|
| /usr/share/dict/web2 | 218,752 | 0 | 89 | 0 | 99.96% |
| Ludiso oracle | 2,540 | 0 | 6 | 0 | 99.76% |
| Auto-generated random strings | 0 | 997,636 | 0 | 82,754 | 91.70% |
| Hand-written random strings | 0 | 1,000 | 0 | 205 | 79.50% |

In tests on real identifiers extracted from actual software source code mined by the author in another project, Nostril's performance is as follows:

| Test case | Meaningful strings | Gibberish strings | False pos. | False neg. | Precision | Recall |
|---|---|---|---|---|---|---|
| Strings mined from real code | 4,261 | 364 | 6 | 5 | 98.36% | 98.63% |
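
To make explicit how the precision and recall figures follow from the raw counts (treating "nonsense" as the positive class), they can be recomputed directly:

```python
# Raw counts from the table above.
gibberish, false_pos, false_neg = 364, 6, 5

true_pos = gibberish - false_neg               # 359 gibberish strings caught
precision = true_pos / (true_pos + false_pos)  # 359/365
recall = true_pos / (true_pos + false_neg)     # 359/364

print('precision={:.2%} recall={:.2%}'.format(precision, recall))
# precision=98.36% recall=98.63%
```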

Limitations

Nostril is not fool-proof; it will generate some false positives and false negatives. This is an unavoidable consequence of the problem domain: without direct knowledge, even a human cannot recognize a real text string in all cases. Nostril's default trained system puts emphasis on reducing false positives (i.e., reducing how often it mistakenly labels something as nonsense) rather than false negatives, so it will sometimes report that something is not nonsense when it really is.

A vexing result is that Nostril performs more poorly on supposedly "random" strings typed by a human. I hypothesize this is because those strings are less random than they seem: if someone is asked to type junk at random on a QWERTY keyboard, they are likely to use many characters from the home row (a-s-d-f-g-h-j-k-l), and those characters actually turn out to be rather common in English words. In other words, strings "typed at random" on a keyboard are not that random after all, and probably have statistical properties similar to those of real words. These cases are hard for Nostril but, thankfully, rare in real-world situations. This view is supported by the fact that Nostril's performance is much better on statistically random text strings generated by software.
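
One way to probe this hypothesis is to compare uniformly random strings against home-row-biased ones. The following is a small experiment sketch; the home-row bias model is an assumption made here for illustration.

```python
import random
import string

from nostril import nonsense

random.seed(42)

# Uniformly random gibberish vs. "human-style" gibberish biased toward
# the QWERTY home row, per the hypothesis above.
uniform = [''.join(random.choices(string.ascii_lowercase, k=10)) for _ in range(5)]
home_row = [''.join(random.choices('asdfghjkl', k=10)) for _ in range(5)]

for s in uniform + home_row:
    print(s, '->', 'nonsense' if nonsense(s) else 'real')
```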

Nostril has been trained using American English words, and is unlikely to work for other languages unchanged. However, the underlying framework may work if it is retrained on different sample inputs. Nostril uses n-grams coupled with a custom TF-IDF weighting scheme. See the subdirectory training for the code used to train the system.
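
To convey the general idea, here is an illustrative sketch (not Nostril's actual implementation): score a string by the corpus-derived weights of its character n-grams, where the weights table would come from training.

```python
from collections import Counter

def ngrams(s, n=4):
    """Extract character n-grams from the letters of s."""
    s = ''.join(c for c in s.lower() if c.isalpha())
    return [s[i:i + n] for i in range(len(s) - n + 1)]

def englishness(s, weights):
    """Average TF-IDF-style weight of the string's n-grams; n-grams
    never seen in training contribute nothing. A low score suggests
    gibberish. `weights` is a hypothetical trained table."""
    grams = ngrams(s)
    if not grams:
        return 0.0
    counts = Counter(grams)
    return sum(c * weights.get(g, 0.0) for g, c in counts.items()) / len(grams)
```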

Finally, the algorithm does not perform well on very short text. By default, Nostril requires strings longer than 6 characters and raises an exception for anything shorter.
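
If your input may include short strings, a guard like the following avoids the exception (a sketch only; safe_nonsense is a hypothetical helper, and the exception type is assumed here to be ValueError):

```python
from nostril import nonsense

def safe_nonsense(s, default=False):
    # Nostril rejects strings of 6 characters or fewer; catch that case
    # and return a caller-chosen default instead.
    try:
        return nonsense(s)
    except ValueError:
        return default
```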

More information

Please see the docs subdirectory for more information.

Getting help and support

If you find an issue, please submit it in the GitHub issue tracker for this repository.

Contributing — info for developers

Any constructive contributions – bug reports, pull requests (code or documentation), suggestions for improvements, and more – are welcome. Please feel free to contact me directly, or even better, jump right in and use the standard GitHub approach of forking the repo and creating a pull request.

Everyone is asked to read and respect the code of conduct when participating in this project.

Acknowledgments

This material is based upon work supported by the National Science Foundation under Grant Number 1533792 (Principal Investigator: Michael Hucka). Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.



Owner

  • Name: CASICS
  • Login: casics
  • Kind: organization
  • Email: casics-team@googlegroups.com
  • Location: Pasadena, California

Comprehensive and Automated Software Inventory Creation System

JOSS Publication

Nostril: A nonsense string evaluator written in Python
Published
May 11, 2018
Volume 3, Issue 25, Page 596
Authors
Michael Hucka ORCID
Department of Computing and Mathematical Sciences, California Institute of Technology, Pasadena, CA 91125, USA
Editor
Jake Vanderplas ORCID
Tags
mining source repositories identifiers text processing inference

GitHub Events

Total
  • Watch event: 7
Last Year
  • Watch event: 8

Committers

Last synced: 5 months ago

All Time
  • Total Commits: 152
  • Total Committers: 1
  • Avg Commits per committer: 152.0
  • Development Distribution Score (DDS): 0.0
Past Year
  • Commits: 0
  • Committers: 0
  • Avg Commits per committer: 0.0
  • Development Distribution Score (DDS): 0.0
Top Committers
Name Email Commits
Michael Hucka m****a@c****u 152

Issues and Pull Requests

Last synced: 4 months ago

All Time
  • Total issues: 18
  • Total pull requests: 6
  • Average time to close issues: about 1 month
  • Average time to close pull requests: 10 months
  • Total issue authors: 10
  • Total pull request authors: 5
  • Average comments per issue: 1.17
  • Average comments per pull request: 0.33
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 1
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 1
  • Pull request authors: 0
  • Average comments per issue: 0.0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • mhucka (7)
  • 1Hyena (3)
  • TheFausap (1)
  • pombredanne (1)
  • dedpehto (1)
  • rsuhaibani (1)
  • fowler-mychale (1)
  • alanhamlett (1)
  • pete7628 (1)
  • ShridharSahu (1)
Pull Request Authors
  • Forest216 (2)
  • alexwilson1 (1)
  • dzoladz (1)
  • MonsieurV (1)
  • FlorentJeannot (1)
Top Labels
Issue Labels
Coding: enhancement ✨ (3) Task: docs 📜 (2) Task: chore 📥 (1)

Packages

  • Total packages: 1
  • Total downloads: 36,648 last month (pypi)
  • Total dependent packages: 0
  • Total dependent repositories: 0
  • Total versions: 4
  • Total maintainers: 1
pypi.org: nostril-detector

Nonsense String Evaluator

  • Versions: 4
  • Dependent Packages: 0
  • Dependent Repositories: 0
  • Downloads: 36,648 Last month
Rankings
  • Dependent packages count: 10.0%
  • Dependent repos count: 65.7%
  • Average: 37.8%
Maintainers (1)
Last synced: 4 months ago