faststylometry

Stylometry library for Burrows' Delta method

https://github.com/fastdatascience/faststylometry

Science Score: 67.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 5 DOI reference(s) in README
  • Academic publication links
    Links to: zenodo.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (8.9%) to scientific vocabulary

Keywords

natural-language-processing nlp stylometry
Last synced: 6 months ago

Repository

Stylometry library for Burrows' Delta method

Basic Info
Statistics
  • Stars: 42
  • Watchers: 2
  • Forks: 10
  • Open Issues: 2
  • Releases: 16
Topics
natural-language-processing nlp stylometry
Created about 5 years ago · Last pushed 7 months ago
Metadata Files
Readme License Citation

README.md


Fast Stylometry Python library: Natural Language Processing tool



You can run the walkthrough notebook in Google Colab with a single click.

☄ Fast Stylometry - Burrows Delta NLP technique ☄

Developed by Fast Data Science. Fast Data Science develops products, offers consulting services, and training courses in natural language processing (NLP). Subscribe to our blog for regular news from the NLP universe.

Source code at https://github.com/fastdatascience/faststylometry

Tutorial at https://fastdatascience.com/fast-stylometry-python-library/

Fast Stylometry is a Python library for calculating Burrows' Delta. Burrows' Delta is an algorithm for comparing the similarity of the writing styles of documents, a technique known as forensic stylometry.
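To make the method concrete, here is a toy sketch of the Delta calculation in plain Numpy (the word frequencies are invented and this is not the library's API): each function word's relative frequency is z-scored across the training authors, and the delta between two documents is the mean absolute difference of their z-score vectors.

```python
import numpy as np

# Toy relative frequencies of three function words (values are invented)
train = {
    "austen": np.array([0.055, 0.031, 0.020]),
    "bronte": np.array([0.048, 0.025, 0.028]),
}
unknown = np.array([0.054, 0.030, 0.021])

# z-score each word's frequency across the training authors
freqs = np.array(list(train.values()))
mu, sigma = freqs.mean(axis=0), freqs.std(axis=0)
z_train = {author: (v - mu) / sigma for author, v in train.items()}
z_unknown = (unknown - mu) / sigma

# Burrows' Delta: mean absolute difference of z-scores;
# the lowest delta is the closest stylistic match
deltas = {author: np.mean(np.abs(z_unknown - z)) for author, z in z_train.items()}
best_match = min(deltas, key=deltas.get)  # "austen"
```

The real library does the same kind of computation over the most frequent tokens of a whole corpus rather than a hand-picked handful of words.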

💻 Installing the Fast Stylometry Python package

You can install from PyPI.

pip install faststylometry

Troubleshooting the installation

Due to compatibility problems with Numpy, faststylometry==1.0.15 requires Python 3.12 or later, and you need to downgrade Numpy after installing.

This is inconvenient because Google Colab runs Python 3.11 by default, so anyone running the Colab notebook first needs to work out how to upgrade Python within Colab to get this library to work.

Then you can install with:

pip install faststylometry==1.0.15
pip install numpy==1.26.4

The second command downgrades Numpy. We tried to get the library to build so that it runs with Numpy 2.x, but we could not see how to do that. This is still an open issue if anyone can see how to make the PyPI package build with an upgraded Numpy.

For anyone coming across this issue

Please could you check the pyproject.toml and the .github scripts to see how to make this package build so that it runs out of the box with Numpy 2.x?

🌟 Using Fast Stylometry NLP library for the first time 🌟

⚠️ We recommend you follow the walkthrough notebook Burrows Delta Walkthrough.ipynb in order to understand how the library works. If you don't have the correct environment set up on your machine, you can run the walkthrough notebook easily using this link to create a notebook in Google Colab.

💡 Usage examples

Demonstration of Burrows' Delta on a small corpus downloaded from Project Gutenberg.

We will test the Burrows' Delta code on two "unknown" texts: Sense and Sensibility by Jane Austen, and Villette by Charlotte Brontë. Both authors are in our training corpus.

You can get the training corpus by cloning https://github.com/fastdatascience/faststylometry; the data is in the data folder. Alternatively, you can call download_examples() from Python after importing Fast Stylometry:

```
from faststylometry import download_examples

download_examples()
```

📖 Create a corpus

The Burrows Delta Walkthrough.ipynb Jupyter notebook is the best place to start, but here are the basic commands to use the library:

To create a corpus and add books, the pattern is as follows:

```
from faststylometry import Corpus

corpus = Corpus()
corpus.add_book("Jane Austen", "Pride and Prejudice", [whole book text])
```

Here is the pattern for creating a corpus and adding books from a directory on your system. You can also use the method util.load_corpus_from_folder(folder, pattern).

```
import os
import re

from faststylometry.corpus import Corpus

corpus = Corpus()
for root, _, files in os.walk(folder):
    for filename in files:
        if filename.endswith(".txt") and "-" in filename:
            with open(os.path.join(root, filename), "r", encoding="utf-8") as f:
                text = f.read()
            author, book = re.split("-", re.sub(r'\.txt', '', filename))
            corpus.add_book(author, book, text)
```

💡 Example 1

Download some example data (Project Gutenberg texts) from the Fast Stylometry repository:

```
from faststylometry import download_examples

download_examples()
```

Load a corpus and calculate Burrows' Delta

```
from faststylometry.util import load_corpus_from_folder
from faststylometry.en import tokenise_remove_pronouns_en
from faststylometry.burrows_delta import calculate_burrows_delta

train_corpus = load_corpus_from_folder("data/train")
train_corpus.tokenise(tokenise_remove_pronouns_en)

test_corpus_sense_and_sensibility = load_corpus_from_folder("data/test", pattern="sense")
test_corpus_sense_and_sensibility.tokenise(tokenise_remove_pronouns_en)

calculate_burrows_delta(train_corpus, test_corpus_sense_and_sensibility)
```

This returns a Pandas dataframe of Burrows' Delta scores.
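The usual way to read such a dataframe is that the lowest delta indicates the closest stylistic match. A hypothetical illustration with invented scores (the dataframe layout here is assumed for the example, not taken from the library's output):

```python
import pandas as pd

# Hypothetical Burrows' Delta scores: rows are candidate training
# authors, the column is the test document (values are invented)
deltas = pd.DataFrame(
    {"sense and sensibility": [0.45, 1.02, 1.31]},
    index=["jane austen", "charlotte bronte", "charles dickens"],
)

# Lowest delta = most stylistically similar candidate author
predicted = deltas["sense and sensibility"].idxmin()  # "jane austen"
```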

💡 Example 2

Using the probability calibration functionality, you can calculate the probability of two books being by the same author.

```
from faststylometry.probability import predict_proba, calibrate

calibrate(train_corpus)
predict_proba(train_corpus, test_corpus_sense_and_sensibility)
```

This outputs a Pandas dataframe of probabilities.
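Calibration of this kind typically works by fitting a classifier that maps delta scores for known same-author and different-author pairs to probabilities. A minimal sketch of the idea using scikit-learn's logistic regression (invented data; this is not the library's internal implementation):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Invented calibration data: delta scores for pairs of texts known
# to be by the same author (1) or by different authors (0)
deltas = np.array([[0.3], [0.5], [0.6], [1.1], [1.4], [1.8]])
same_author = np.array([1, 1, 1, 0, 0, 0])

model = LogisticRegression().fit(deltas, same_author)

# Lower deltas map to higher same-author probabilities
p_low = model.predict_proba([[0.4]])[0, 1]
p_high = model.predict_proba([[1.6]])[0, 1]
```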

✉️ Who to contact

Thomas Wood at Fast Data Science

🤝 Contributing to the project

If you'd like to contribute to this project, you can contact us at https://fastdatascience.com/ or make a pull request on our Github repository. You can also raise an issue.

Developing the library

Automated tests

Test code is in tests/ folder using unittest.

The testing tool tox is used in the automation with GitHub Actions CI/CD.

Use tox locally

Install tox and run it:

pip install tox
tox

In our configuration, tox runs a check of the source distribution using check-manifest (which requires your repository to be initialised with git init and to have files staged with git add . at least), setuptools's check, and the unit tests using pytest. You don't need to install check-manifest and pytest yourself; tox installs them in a separate environment.

The automated tests run against several Python versions, but on your machine you might have only one version of Python installed. If that is Python 3.9, then run:

tox -e py39

Thanks to GitHub Actions' automated process, you don't need to generate distribution files locally. If you insist, see the "Generate distribution files" section of the template's documentation.

🤖 Continuous integration/deployment to PyPI

This package is based on the template https://pypi.org/project/example-pypi-package/

This package

  • uses GitHub Actions for both testing and publishing
  • is tested when pushing to the master or main branch, and is published when a release is created
  • includes test files in the source distribution
  • uses setup.cfg for version single-sourcing (setuptools 46.4.0+)

🧍 Re-releasing the package manually

The code to re-release Fast Stylometry on PyPI is as follows:

source activate py311
pip install twine
rm -rf dist
python setup.py sdist
twine upload dist/*

😊 Who worked on the Fast Stylometry NLP library?

The tool was developed by:

  • Thomas Wood, Natural Language Processing consultant and data scientist at Fast Data Science.

📜 License of Fast Stylometry library

MIT License. Copyright (c) 2023 Fast Data Science

✍️ Citing the Fast Stylometry library

If you are undertaking research in AI, NLP, or other areas, and are publishing your findings, I would be grateful if you could please cite the project.

Wood, T.A. (2024). Fast Stylometry (Computer software). Data Science Ltd. DOI: 10.5281/zenodo.11096941, accessed at https://fastdatascience.com/fast-stylometry-python-library


A BibTeX entry for LaTeX users is:

@software{faststylometry,
  author = {Wood, T.A.},
  title = {Fast Stylometry (Computer software), Version 1.0.15},
  year = {2024},
  url = {https://fastdatascience.com/fast-stylometry-python-library/},
  doi = {10.5281/zenodo.11096941},
}

Owner

  • Name: Fast Data Science
  • Login: fastdatascience
  • Kind: organization

NLP and data science consulting

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
- family-names: "Wood"
  given-names: "Thomas Andrew"
  orcid: "https://orcid.org/0000-0001-8962-8571"
title: "Fast Stylometry (Computer software)"
version: 1.0.15
doi: 10.5281/zenodo.11096941
date-released: 2024-05-01
url: "https://fastdatascience.com/fast-stylometry-python-library"

GitHub Events

Total
  • Create event: 3
  • Release event: 5
  • Issues event: 2
  • Watch event: 11
  • Issue comment event: 3
  • Push event: 18
  • Fork event: 1
Last Year
  • Create event: 3
  • Release event: 5
  • Issues event: 2
  • Watch event: 11
  • Issue comment event: 3
  • Push event: 18
  • Fork event: 1

Packages

  • Total packages: 1
  • Total downloads:
    • pypi 360 last-month
  • Total dependent packages: 0
  • Total dependent repositories: 1
  • Total versions: 18
  • Total maintainers: 1
pypi.org: faststylometry

Python library for calculating the Burrows Delta.

  • Documentation: https://fastdatascience.com/fast-stylometry-python-library
  • License: MIT License Copyright (c) 2023 Fast Data Science (maintainer: Thomas Wood, tutorial: https://fastdatascience.com/fast-stylometry-python-library/) Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
  • Latest release: 1.0.15
    published 7 months ago
  • Versions: 18
  • Dependent Packages: 0
  • Dependent Repositories: 1
  • Downloads: 360 Last month
Rankings
Dependent packages count: 10.0%
Forks count: 14.2%
Stargazers count: 14.8%
Average: 16.4%
Downloads: 21.1%
Dependent repos count: 21.7%
Maintainers (1)
Last synced: 6 months ago

Dependencies

.github/workflows/release.yml actions
  • actions/checkout v2 composite
  • actions/setup-python v2 composite
.github/workflows/test.yml actions
  • actions/checkout v2 composite
  • actions/setup-python v2 composite
pyproject.toml pypi
setup.py pypi
  • nltk ==3.7
  • numpy ==1.24.3
  • pandas ==2.0.0
  • scikit-learn ==1.3.0