superstyl

Supervised Stylometry

https://github.com/supervisedstylometry/superstyl

Science Score: 67.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 8 DOI reference(s) in README
  • Academic publication links
    Links to: zenodo.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (11.7%) to scientific vocabulary

Keywords

stylometry svm
Last synced: 6 months ago

Repository

Supervised Stylometry

Basic Info
  • Host: GitHub
  • Owner: SupervisedStylometry
  • License: gpl-3.0
  • Language: Python
  • Default Branch: master
  • Homepage:
  • Size: 135 MB
Statistics
  • Stars: 23
  • Watchers: 1
  • Forks: 5
  • Open Issues: 8
  • Releases: 3
Topics
stylometry svm
Created almost 5 years ago · Last pushed 9 months ago
Metadata Files
Readme License Citation

README.md

SUPERvised STYLometry


Installing

You will need Python 3.9 or a later version, pip, and optionally virtualenv.

```bash
git clone https://github.com/SupervisedStylometry/SuperStyl.git
cd SuperStyl
virtualenv -p python3.9 env  # or later
source env/bin/activate
pip install -r requirements.txt
```

Basic usage

To use Superstyl, you have two options:

  1. Use the provided command-line interface from your OS terminal (tested on Linux)
  2. Import Superstyl in a Python script or notebook, and use the API commands

You also need a collection of files containing the text that you wish to analyse. The naming convention for source files in SuperStyl is as follows:

Class_anythingthatyouwant

For instance: Moliere_Amphitryon.txt

The text before the first underscore will be used as the class for training models.
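The filename convention above can be sketched in Python (a hypothetical helper for illustration, not part of SuperStyl's API):

```python
from pathlib import Path

def class_from_filename(path):
    # The text before the first underscore in the filename is the class label;
    # the rest of the stem is free-form.
    return Path(path).stem.split("_", 1)[0]

print(class_from_filename("data/train/Moliere_Amphitryon.txt"))  # → Moliere
```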

Command-Line Interface

A very simple usage, for building a corpus of character 3-gram frequencies, training an SVM model with leave-one-out cross-validation, and predicting the class of unknown texts, would be:

```bash
# Creating the corpus and extracting character 3-grams from text files
python load_corpus.py -s data/train/*.txt -t chars -n 3 -o train
python load_corpus.py -s data/test/*.txt -t chars -n 3 -o unknown -f train_feats.json

# Training an SVM, with cross-validation, and using it to predict the class of unknown samples
python train_svm.py train.csv --test_path unknown.csv --cross_validate leave-one-out --final
```

The first two commands will write to disk the files train.csv and unknown.csv, containing the metadata and feature frequencies for both sets of files, and a file train_feats.json containing the list of features used.

The last one will print the scores of the cross-validation, and then write to disk a file FINAL_PREDICTIONS.csv containing the class predictions for the unknown texts.
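For intuition, the character 3-gram relative frequencies that these commands compute amount to something like the following sketch (illustrative only, not SuperStyl's actual implementation):

```python
from collections import Counter

def char_ngram_freqs(text, n=3):
    # Slide a window of length n over the text and count each n-gram,
    # then normalise counts into relative frequencies.
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    total = len(grams)
    return {g: c / total for g, c in Counter(grams).items()}

freqs = char_ngram_freqs("to be or not to be")
```

Each text then becomes one row of such frequencies, with the feature list fixed across the corpus.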

This is just a small sample of all available corpus and training options.

To know more, do:

```bash
python load_corpus.py --help
python train_svm.py --help
```

Python API

A very simple usage, for building a corpus, training an SVM model with cross-validation, and predicting the class of an unknown text, would be:

```python
import superstyl as sty
import glob

# Creating the corpus and extracting character 3-grams from text files
train, train_feats = sty.load_corpus(glob.glob("data/train/*.txt"), feats="chars", n=3)
unknown, unknown_feats = sty.load_corpus(glob.glob("data/test/*.txt"), feat_list=train_feats, feats="chars", n=3)

# Training an SVM, with cross-validation, and using it
# to predict the class of unknown samples
sty.train_svm(train, unknown, cross_validate="leave-one-out", final_pred=True)
```

This is just a small sample of all available corpus and training options.

To know more, do:

```python
help(sty.load_corpus)
help(sty.train_svm)
```

Advanced usage

Look inside the scripts, or do

```bash
python load_corpus.py --help
python train_svm.py --help
```

for full documentation on the main functionalities of the CLI, regarding data generation (load_corpus.py) and SVM training (train_svm.py).

For more particular data processing usages (splitting and merging datasets), see also:

```bash
python split.py --help
python merge_datasets.csv.py --help
```

Get feats

With or without a preexisting feature list:

```bash
# without it
python load_corpus.py -s path/to/docs/* -t chars -n 3

# with it
python load_corpus.py -s path/to/docs/* -f feature_list.json -t chars -n 3
```

There are several other available options; see --help.

Alternatively, you can build samples out of the data, for a given number of verses or words:

```bash
# words from txt
python load_corpus.py -s data/psyche/train/* -t chars -n 3 -x txt --sampling --sample_units words --sample_size 1000

# verses from TEI-encoded docs
python load_corpus.py -s data/psyche/train/* -t chars -n 3 -x tei --sampling --sample_units verses --sample_size 200
```

There are many options for feature extraction (inclusion or exclusion of punctuation and symbols, sampling, source file formats, …) that can be accessed through the help.

Optional: Merge different features

You can merge several sets of features, extracted in csv with the previous commands, by doing:

```bash
python merge_datasets.csv.py -o merged.csv char3grams.csv words.csv affixes.csv
```
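Conceptually, merging amounts to aligning rows on a shared text identifier and concatenating the feature columns. A minimal pandas sketch (the column names here are invented for illustration):

```python
import pandas as pd

# Two feature tables for the same texts, indexed by source file identifier.
chars = pd.DataFrame({"c3_the": [0.10, 0.20]}, index=["Moliere_A", "Corneille_B"])
words = pd.DataFrame({"w_le": [0.05, 0.07]}, index=["Moliere_A", "Corneille_B"])

# Align on the shared index and join the columns side by side.
merged = pd.concat([chars, words], axis=1)
```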

Optional: Do a fixed split

You can either perform k-fold cross-validation (including leave-one-out), in which case this step is unnecessary, or do a classical train/test random split.
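The difference between the two strategies can be sketched in scikit-learn terms (an illustration of the concepts, not SuperStyl's internal code):

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut, train_test_split

X = np.arange(20).reshape(10, 2)  # 10 toy samples, 2 features
y = np.array([0, 1] * 5)

# Leave-one-out: as many folds as samples, so no fixed split is needed.
n_folds = sum(1 for _ in LeaveOneOut().split(X))

# Classical random split: one fixed train/test partition.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)
```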

If you want to do an initial random split,

```bash
python split.py feats_tests.csv
```

If you want to split according to an existing JSON file,

```bash
python split.py feats_tests.csv -s split.json
```

There are other available options, see --help, e.g.

```bash
python split.py feats_tests.csv -m langcert_revised.csv -e wilhelmus_train.csv
```

Train svm

It's quite simple really,

```bash
python train_svm.py path-to-train-data.csv [--test_path TEST_PATH] [--cross_validate {leave-one-out,k-fold}] [--k K] [--dim_reduc {pca}] [--norms] [--balance {class_weight,downsampling,Tomek,upsampling,SMOTE,SMOTETomek}] [--class_weights] [--kernel {LinearSVC,linear,polynomial,rbf,sigmoid}] [--final] [--get_coefs]
```
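In scikit-learn terms, combining normalisation, class weights, and a linear kernel amounts roughly to the following (an illustrative sketch under those assumptions, not SuperStyl's internals):

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Scale the features, then fit a linear SVM that reweights
# classes inversely to their frequency (cf. --norms, --class_weights).
clf = make_pipeline(StandardScaler(), SVC(kernel="linear", class_weight="balanced"))

X = [[0, 0], [0, 1], [1, 0], [1, 1]]  # toy feature rows
y = ["A", "A", "B", "B"]              # class labels from filenames
clf.fit(X, y)
pred = clf.predict([[0, 0.2]])
```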

For instance, using leave-one-out or 10-fold cross-validation

```bash
# e.g.
python train_svm.py data/feats_tests_train.csv --norms --cross_validate leave-one-out
python train_svm.py data/feats_tests_train.csv --norms --cross_validate k-fold --k 10
```

Or a train/test split

```bash
# e.g.
python train_svm.py data/feats_tests_train.csv --test_path test_feats.csv --norms
```

And for a final analysis, applied on unseen data:

```bash
# e.g.
python train_svm.py data/feats_tests_train.csv --test_path unseen.csv --norms --final
```

With a little more options,

```bash
# e.g.
python train_svm.py data/feats_tests_train.csv --test_path unseen.csv --norms --class_weights --final --get_coefs
```

Rolling Stylometry Visualization

If you've created samples using --sampling to segment your text into consecutive slices (e.g., every 1000 words):

bash python load_corpus.py -s data/text/*.txt -t chars -n 3 -o rolling_train --sampling --units words --sample_size 1000 python load_corpus.py -s data/text_to_predict/*.txt -t chars -n 3 -o rolling_unknown -f rolling_train_feats.json --sampling --units words --sample_size 1000 You can then train and produce final predictions, and directly visualize how the decision function changes across these segments:

```bash
python train_svm.py rolling_train.csv --test_path rolling_unknown.csv --final --plot_rolling --plot_smoothing 5
```

This will produce FINAL_PREDICTIONS.csv and a plot showing how the classifier's authorial signal varies segment by segment through the text. The --plot_smoothing option applies a simple moving-average smoothing to make trends clearer; if not set, the default window is 3, and smoothing can be disabled by setting it to None.
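The moving-average smoothing described above can be sketched as follows (illustrative; not SuperStyl's plotting code):

```python
import numpy as np

def moving_average(values, window=3):
    # Average each run of `window` consecutive decision-function values;
    # the output is shorter than the input by window - 1.
    kernel = np.ones(window) / window
    return np.convolve(values, kernel, mode="valid")

# Toy per-segment decision-function values for one text.
scores = [0.9, 0.7, 0.8, -0.2, -0.6, -0.4]
smoothed = moving_average(scores, window=3)
```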

Sources

Cite this repository

You can cite it using the CITATION.cff file (and GitHub's citation functionality), as follows:

BIBTEX:

```bibtex
@software{camps_cafiero_2024,
  author  = {Jean-Baptiste Camps and Florian Cafiero},
  title   = {{SUPERvised STYLometry (SuperStyl)}},
  month   = {11},
  year    = {2024},
  version = {v1.0},
  doi     = {10.5281/zenodo.14069799},
  url     = {https://doi.org/10.5281/zenodo.14069799}
}
```

MLA:

```plaintext
Camps, Jean-Baptiste, and Florian Cafiero. SUPERvised STYLometry (SuperStyl). Version 1.0, 11 Nov. 2024, doi:10.5281/zenodo.14069799.
```

APA:

```plaintext
Camps, J.-B., & Cafiero, F. (2024). SUPERvised STYLometry (SuperStyl) (Version v1.0) [Computer software]. https://doi.org/10.5281/zenodo.14069799
```

Owner

  • Name: SupervisedStylometry
  • Login: SupervisedStylometry
  • Kind: organization

Citation (CITATION.cff)

cff-version: 1.3.0
message: "If you use this software, please cite it as below."
authors:
  - family-names: Camps
    given-names: Jean-Baptiste
    orcid: https://orcid.org/0000-0003-0385-7037
  - family-names: Cafiero
    given-names: Florian
    orcid: https://orcid.org/0000-0002-1951-6942
title: "SUPERvised STYLometry (SuperStyl)"
version: v1.0
doi: 10.5281/zenodo.14069799
date-released: 2024-11-11

GitHub Events

Total
  • Create event: 10
  • Issues event: 14
  • Release event: 2
  • Watch event: 2
  • Delete event: 7
  • Member event: 1
  • Issue comment event: 22
  • Push event: 52
  • Pull request review event: 2
  • Pull request event: 20
Last Year
  • Create event: 10
  • Issues event: 14
  • Release event: 2
  • Watch event: 2
  • Delete event: 7
  • Member event: 1
  • Issue comment event: 22
  • Push event: 52
  • Pull request review event: 2
  • Pull request event: 20

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 3
  • Total pull requests: 5
  • Average time to close issues: almost 2 years
  • Average time to close pull requests: 10 days
  • Total issue authors: 2
  • Total pull request authors: 3
  • Average comments per issue: 0.67
  • Average comments per pull request: 0.4
  • Merged pull requests: 2
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 1
  • Pull requests: 5
  • Average time to close issues: N/A
  • Average time to close pull requests: 10 days
  • Issue authors: 1
  • Pull request authors: 3
  • Average comments per issue: 0.0
  • Average comments per pull request: 0.4
  • Merged pull requests: 2
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • Jean-Baptiste-Camps (19)
  • gabays (3)
  • EtienneFerrandi (1)
  • floriancafiero (1)
  • mailanlopez (1)
Pull Request Authors
  • Jean-Baptiste-Camps (18)
  • floriancafiero (10)
  • TheoMoins (1)
Top Labels
Issue Labels
bug (3) feature_request (2)
Pull Request Labels

Dependencies

requirements.txt pypi
  • argparse ==1.4.0
  • click *
  • fasttext ==0.9.1
  • imbalanced-learn ==0.8.1
  • joblib ==1.2.0
  • lxml ==4.9.1
  • matplotlib *
  • nltk ==3.6.6
  • numpy ==1.22.0
  • pandas ==1.3.4
  • pybind11 ==2.8.1
  • regex *
  • scikit-learn ==1.0.1
  • scipy ==1.7.3
  • six ==1.16.0
  • tqdm *
  • unidecode ==1.3.2
.github/workflows/python-package.yml actions
  • actions/checkout v3 composite
  • actions/setup-python v3 composite
  • codecov/codecov-action v4 composite