faststylometry
Stylometry library for Burrows' Delta method
Science Score: 67.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: Found CITATION.cff file
- ✓ codemeta.json file: Found codemeta.json file
- ✓ .zenodo.json file: Found .zenodo.json file
- ✓ DOI references: Found 5 DOI reference(s) in README
- ✓ Academic publication links: Links to: zenodo.org
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: Low similarity (8.9%) to scientific vocabulary
Repository
Stylometry library for Burrows' Delta method
Basic Info
- Host: GitHub
- Owner: fastdatascience
- License: mit
- Language: Jupyter Notebook
- Default Branch: main
- Homepage: https://fastdatascience.com/fast-stylometry-python-library/
- Size: 6.2 MB
Statistics
- Stars: 42
- Watchers: 2
- Forks: 10
- Open Issues: 2
- Releases: 16
Metadata Files
README.md
Fast Stylometry Python library: Natural Language Processing tool
You can run the walkthrough notebook in Google Colab with a single click:
☄ Fast Stylometry - Burrows Delta NLP technique ☄
Developed by Fast Data Science. Fast Data Science develops products, offers consulting services, and training courses in natural language processing (NLP). Subscribe to our blog for regular news from the NLP universe.
Source code at https://github.com/fastdatascience/faststylometry
Tutorial at https://fastdatascience.com/fast-stylometry-python-library/
Fast Stylometry is a Python library for calculating Burrows' Delta. Burrows' Delta is an algorithm for measuring how similar the writing styles of documents are, a task known as forensic stylometry.
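To give a feel for what the measure does, here is a minimal sketch of the classic Burrows' Delta calculation in plain Python. It is not the code used inside Fast Stylometry; the function names `relative_frequencies` and `burrows_delta` and the `vocab_size` parameter are made up for this illustration, and it assumes every text is a non-empty list of tokens.

```python
from collections import Counter
from statistics import mean, pstdev


def relative_frequencies(tokens, vocabulary):
    """Return the relative frequency of each vocabulary word in tokens."""
    counts = Counter(tokens)
    total = len(tokens)
    return [counts[word] / total for word in vocabulary]


def burrows_delta(known_texts, unknown_tokens, vocab_size=50):
    """Return {label: delta} for one unknown text against each known text.

    known_texts maps a label (for example an author) to a list of tokens.
    Illustrative sketch only, not the Fast Stylometry implementation.
    """
    # 1. Choose the most frequent words across the whole known corpus.
    all_counts = Counter()
    for tokens in known_texts.values():
        all_counts.update(tokens)
    vocabulary = [word for word, _ in all_counts.most_common(vocab_size)]

    # 2. Relative frequency of each chosen word in each known text.
    freqs = {label: relative_frequencies(tokens, vocabulary)
             for label, tokens in known_texts.items()}

    # 3. Mean and standard deviation of each word across the known texts.
    columns = list(zip(*freqs.values()))
    means = [mean(col) for col in columns]
    # Fall back to 1.0 when a word has zero deviation, to avoid dividing by zero.
    stdevs = [pstdev(col) or 1.0 for col in columns]

    def z_scores(values):
        return [(v - m) / s for v, m, s in zip(values, means, stdevs)]

    # 4. Delta is the mean absolute difference between z-scores.
    unknown_z = z_scores(relative_frequencies(unknown_tokens, vocabulary))
    return {
        label: mean(abs(a - b) for a, b in zip(z_scores(values), unknown_z))
        for label, values in freqs.items()
    }
```

A lower delta means the unknown text's use of common words is closer to that known text. For example, `burrows_delta({"A": tokens_a, "B": tokens_b}, unknown_tokens)` returns one score per label, and the smallest score points to the most similar style.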
💻 Installing the Fast Stylometry Python package
You can install from PyPI.
pip install faststylometry
Troubleshooting the installation
Due to compatibility problems with NumPy, faststylometry==1.0.15 requires Python 3.12 or later, and you also need to downgrade NumPy after installing it.
This is inconvenient because Google Colab runs Python 3.11 by default, so anyone running the Colab notebook needs to upgrade Python within Colab before the library will work.
Then you can install with
pip install faststylometry==1.0.15
pip install numpy==1.26.4
The second command downgrades NumPy. We tried to get the library to build so that it runs with NumPy 2.x, but have not found a way to do that. This is still an open issue, and help with making the PyPI package build against NumPy 2.x is welcome.
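If you are unsure whether an environment meets these requirements, a small check along the following lines may help. This is only a sketch: the helper names `numpy_major_version` and `check_environment` are made up for this example, and the thresholds simply restate the versions mentioned above.

```python
import sys
from importlib import metadata


def numpy_major_version(version_string):
    """Return the major version number from a string such as '1.26.4'."""
    return int(version_string.split(".")[0])


def check_environment():
    """Return a list of human-readable problems; empty if none were found."""
    problems = []
    # The README states that Python 3.12 or later is needed.
    if sys.version_info < (3, 12):
        problems.append("Python 3.12 or later is required")
    try:
        # Read the installed version without importing NumPy itself.
        numpy_version = metadata.version("numpy")
    except metadata.PackageNotFoundError:
        problems.append("NumPy is not installed")
    else:
        if numpy_major_version(numpy_version) >= 2:
            problems.append(f"NumPy {numpy_version} found; downgrade to 1.26.4")
    return problems
```

Calling `check_environment()` before importing the library gives a quick list of anything that still needs fixing.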
For anyone coming across this issue
Please can you check the pyproject.toml and .github scripts to see how you can make this package build, so that it runs out of the box with Numpy 2.x?
🌟 Using Fast Stylometry NLP library for the first time 🌟
⚠️ We recommend that you follow the walkthrough notebook, Burrows Delta Walkthrough.ipynb, to understand how the library works. If you don't have a suitable environment set up on your machine, you can run the walkthrough notebook in Google Colab instead.
💡 Usage examples
Demonstration of Burrows' Delta on a small corpus downloaded from Project Gutenberg.
We will test the Burrows' Delta code on two "unknown" texts: Sense and Sensibility by Jane Austen, and Villette by Charlotte Bronte. Both authors are in our training corpus.
You can get the training corpus by cloning https://github.com/fastdatascience/faststylometry; the data is in the data folder. Alternatively, you can call download_examples() from Python after importing Fast Stylometry:
from faststylometry import download_examples
download_examples()
📖 Create a corpus
The Burrows Delta Walkthrough.ipynb Jupyter notebook is the best place to start, but here are the basic commands to use the library:
To create a corpus and add books, the pattern is as follows:
from faststylometry import Corpus
corpus = Corpus()
corpus.add_book("Jane Austen", "Pride and Prejudice", [whole book text])
Here is the pattern for creating a corpus and adding books from a directory on your system. You can also use the method util.load_corpus_from_folder(folder, pattern).
```
import os
import re

from faststylometry.corpus import Corpus

corpus = Corpus()
# folder is the path to a directory of .txt files
for root, _, files in os.walk(folder):
    for filename in files:
        if filename.endswith(".txt") and "_" in filename:
            with open(os.path.join(root, filename), "r", encoding="utf-8") as f:
                text = f.read()
            # File names are expected to look like author_-_title.txt
            author, book = re.split("_-_", re.sub(r"\.txt$", "", filename))

            corpus.add_book(author, book, text)
```
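The loop above relies on each file name carrying both the author and the title. As a small illustration of that step on its own, the sketch below splits a file name into its two parts. The helper name `parse_author_and_book` is hypothetical, and the `_-_` separator is an assumption about how the files are named; adjust it to match your own corpus.

```python
from pathlib import Path


def parse_author_and_book(filename, separator="_-_"):
    """Split a name like 'author_-_title.txt' into ('author', 'title').

    The separator is an assumption about the file naming convention.
    """
    stem = Path(filename).stem  # drops any directory and the file extension
    author, found, book = stem.partition(separator)
    if not found:
        raise ValueError(f"{filename!r} does not contain {separator!r}")
    return author, book
```

Raising an error for unexpected names makes a badly named file visible straight away, rather than silently producing a wrong author label.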
💡 Example 1
Download some example data (Project Gutenberg texts) from the Fast Stylometry repository:
from faststylometry import download_examples
download_examples()
Load a corpus and calculate Burrows' Delta
```
from faststylometry.util import load_corpus_from_folder
from faststylometry.en import tokenise_remove_pronouns_en
from faststylometry.burrows_delta import calculate_burrows_delta

train_corpus = load_corpus_from_folder("data/train")

train_corpus.tokenise(tokenise_remove_pronouns_en)

test_corpus_sense_and_sensibility = load_corpus_from_folder("data/test", pattern="sense")

test_corpus_sense_and_sensibility.tokenise(tokenise_remove_pronouns_en)

calculate_burrows_delta(train_corpus, test_corpus_sense_and_sensibility)
```
This returns a pandas DataFrame of Burrows' Delta scores.
💡 Example 2
Using the probability calibration functionality, you can calculate the probability of two books being by the same author.
from faststylometry.probability import predict_proba, calibrate
calibrate(train_corpus)
predict_proba(train_corpus, test_corpus_sense_and_sensibility)
This outputs a pandas DataFrame of probabilities.
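To illustrate the general idea of turning a distance into a probability, the sketch below fits a one-variable logistic curve that maps a delta value to a same-author probability. This is an assumption about how such a calibration could work, not a description of what `calibrate` and `predict_proba` do internally; the names `sigmoid`, `fit_logistic` and `same_author_probability` are made up, and the numbers in the usage note are invented toy values rather than real results.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def fit_logistic(deltas, same_author, learning_rate=0.5, epochs=2000):
    """Fit p(same author) = sigmoid(w * delta + b) by gradient descent.

    deltas: delta values for pairs of texts with known authorship.
    same_author: 1 if the pair shares an author, else 0.
    """
    w, b = 0.0, 0.0
    n = len(deltas)
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for delta, label in zip(deltas, same_author):
            error = sigmoid(w * delta + b) - label
            grad_w += error * delta
            grad_b += error
        # Step against the average gradient of the log loss.
        w -= learning_rate * grad_w / n
        b -= learning_rate * grad_b / n
    return w, b


def same_author_probability(delta, w, b):
    """Map a delta value to a probability using the fitted curve."""
    return sigmoid(w * delta + b)
```

With toy training pairs such as `fit_logistic([0.4, 0.5, 0.6, 1.2, 1.3, 1.5], [1, 1, 1, 0, 0, 0])`, low delta values map to probabilities above 0.5 and high delta values to probabilities below 0.5.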
✉️ Who to contact
Thomas Wood at Fast Data Science
🤝 Contributing to the project
If you'd like to contribute to this project, you can contact us at https://fastdatascience.com/ or make a pull request on our GitHub repository. You can also raise an issue.
Developing the library
Automated tests
Test code is in the tests/ folder and uses unittest.
The testing tool tox is used in the automation with GitHub Actions CI/CD.
Use tox locally
Install tox and run it:
pip install tox
tox
In our configuration, tox checks the source distribution using check-manifest, runs setuptools's check, and runs the unit tests using pytest. check-manifest requires your repo to be git-initialized (git init) and to have its files added (git add .) at least. You don't need to install check-manifest or pytest yourself; tox will install them in a separate environment.
The automated tests are run against several Python versions, but on your machine you might be using only one version of Python. If that is Python 3.9, then run:
tox -e py39
Thanks to GitHub Actions' automated process, you don't need to generate distribution files locally.
🤖 Continuous integration/deployment to PyPI
This package is based on the template https://pypi.org/project/example-pypi-package/
This package
- uses GitHub Actions for both testing and publishing
- is tested when pushing to the master or main branch, and is published when a release is created
- includes test files in the source distribution
- uses setup.cfg for version single-sourcing (setuptools 46.4.0+)
🧍 Re-releasing the package manually
The code to re-release Fast Stylometry on PyPI is as follows:
source activate py311
pip install twine
rm -rf dist
python setup.py sdist
twine upload dist/*
😊 Who worked on the Fast Stylometry NLP library?
The tool was developed by:
- Thomas Wood, Natural Language Processing consultant and data scientist at Fast Data Science.
📜 License of Fast Stylometry library
MIT License. Copyright (c) 2023 Fast Data Science
✍️ Citing the Fast Stylometry library
If you are undertaking research in AI, NLP, or other areas, and are publishing your findings, I would be grateful if you could please cite the project.
Wood, T.A., Fast Stylometry Computer software. Data Science Ltd. DOI: 10.5281/zenodo.11096941, accessed at https://fastdatascience.com/fast-stylometry-python-library, Fast Data Science (2024)
A BibTeX entry for LaTeX users is:
@software{faststylometry,
author = {Wood, T.A.},
title = {Fast Stylometry (Computer software), Version 1.0.15},
year = {2024},
url = {https://fastdatascience.com/fast-stylometry-python-library/},
doi = {10.5281/zenodo.11096941},
}
Owner
- Name: Fast Data Science
- Login: fastdatascience
- Kind: organization
- Website: https://fastdatascience.com
- Twitter: fastdatascienc1
- Repositories: 15
- Profile: https://github.com/fastdatascience
NLP and data science consulting
Citation (CITATION.cff)
cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
  - family-names: "Wood"
    given-names: "Thomas Andrew"
    orcid: "https://orcid.org/0000-0001-8962-8571"
title: "Fast Stylometry (Computer software)"
version: 1.0.15
doi: 10.5281/zenodo.11096941
date-released: 2024-05-01
url: "https://fastdatascience.com/fast-stylometry-python-library"
GitHub Events
Total
- Create event: 3
- Release event: 5
- Issues event: 2
- Watch event: 11
- Issue comment event: 3
- Push event: 18
- Fork event: 1
Last Year
- Create event: 3
- Release event: 5
- Issues event: 2
- Watch event: 11
- Issue comment event: 3
- Push event: 18
- Fork event: 1
Packages
- Total packages: 1
- Total downloads: pypi 360 last-month
- Total dependent packages: 0
- Total dependent repositories: 1
- Total versions: 18
- Total maintainers: 1
pypi.org: faststylometry
Python library for calculating the Burrows Delta.
- Documentation: https://fastdatascience.com/fast-stylometry-python-library
- License: MIT License Copyright (c) 2023 Fast Data Science (maintainer: Thomas Wood, tutorial: https://fastdatascience.com/fast-stylometry-python-library/) Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
- Latest release: 1.0.15 (published 7 months ago)
Maintainers (1)
Dependencies
- actions/checkout v2 composite
- actions/setup-python v2 composite
- actions/checkout v2 composite
- actions/setup-python v2 composite
- nltk ==3.7
- numpy ==1.24.3
- pandas ==2.0.0
- scikit-learn ==1.3.0