superstyl

Supervised Stylometry

https://github.com/supervisedstylometry/superstyl

Science Score: 67.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 8 DOI reference(s) in README
  • Academic publication links
    Links to: zenodo.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (11.7%) to scientific vocabulary

Keywords

stylometry svm
Last synced: 6 months ago

Repository

Supervised Stylometry

Basic Info
  • Host: GitHub
  • Owner: SupervisedStylometry
  • License: gpl-3.0
  • Language: Python
  • Default Branch: master
  • Homepage:
  • Size: 135 MB
Statistics
  • Stars: 23
  • Watchers: 1
  • Forks: 5
  • Open Issues: 8
  • Releases: 3
Topics
stylometry svm
Created almost 5 years ago · Last pushed 9 months ago
Metadata Files
Readme License Citation

README.md

SUPERvised STYLometry


Installing

You will need Python 3.9 or a later version, pip, and optionally virtualenv.

```bash
git clone https://github.com/SupervisedStylometry/SuperStyl.git
cd SuperStyl
virtualenv -p python3.9 env  # or later
source env/bin/activate
pip install -r requirements.txt
```

Basic usage

To use Superstyl, you have two options:

  1. Use the provided command-line interface from your OS terminal (tested on Linux)
  2. Import Superstyl in a Python script or notebook, and use the API commands

You also need a collection of files containing the text that you wish to analyse. The naming convention for source files in SuperStyl is as follows:

Class_anythingthatyouwant

For instance: Moliere_Amphitryon.txt

The text before the first underscore will be used as the class for training models.
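The filename convention above can be sketched in Python (a hypothetical helper for illustration, not part of SuperStyl's API):

```python
from pathlib import Path

def class_from_filename(path):
    # The text before the first underscore in the filename is the class label;
    # the rest of the stem is free-form.
    return Path(path).stem.split("_", 1)[0]

print(class_from_filename("data/train/Moliere_Amphitryon.txt"))  # → Moliere
```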

Command-Line Interface

A very simple usage, for building a corpus of character 3-gram frequencies, training an SVM model with leave-one-out cross-validation, and predicting the class of unknown texts, would be:

```bash
# Creating the corpus and extracting character 3-grams from text files
python load_corpus.py -s data/train/*.txt -t chars -n 3 -o train
python load_corpus.py -s data/test/*.txt -t chars -n 3 -o unknown -f train_feats.json

# Training an SVM, with cross-validation, and using it to predict the class of unknown samples
python train_svm.py train.csv --test_path unknown.csv --cross_validate leave-one-out --final
```

The first two commands will write to disk the files train.csv and unknown.csv, containing the metadata and feature frequencies for both sets of files, and a file train_feats.json containing the list of features used.

The last one will print the scores of the cross-validation, and then write to disk a file FINAL_PREDICTIONS.csv containing the class predictions for the unknown texts.
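For intuition, the character 3-gram relative frequencies that these commands compute amount to something like the following sketch (illustrative only, not SuperStyl's actual implementation):

```python
from collections import Counter

def char_ngram_freqs(text, n=3):
    # Slide a window of length n over the text and count each n-gram,
    # then normalise counts into relative frequencies.
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    total = len(grams)
    return {g: c / total for g, c in Counter(grams).items()}

freqs = char_ngram_freqs("to be or not to be")
```

Each text then becomes one row of such frequencies, with the feature list fixed across the corpus.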

This is just a small sample of all available corpus and training options.

To know more, do:

```bash
python load_corpus.py --help
python train_svm.py --help
```

Python API

A very simple usage, for building a corpus, training an SVM model with cross-validation, and predicting the class of an unknown text, would be:

```python
import superstyl as sty
import glob

# Creating the corpus and extracting character 3-grams from text files
train, train_feats = sty.load_corpus(glob.glob("data/train/*.txt"), feats="chars", n=3)
unknown, unknown_feats = sty.load_corpus(glob.glob("data/test/*.txt"), feat_list=train_feats, feats="chars", n=3)

# Training an SVM, with cross-validation, and using it
# to predict the class of unknown samples
sty.train_svm(train, unknown, cross_validate="leave-one-out", final_pred=True)
```

This is just a small sample of all available corpus and training options.

To know more, do:

```python
help(sty.load_corpus)
help(sty.train_svm)
```

Advanced usage

Look inside the scripts, or do

```bash
python load_corpus.py --help
python train_svm.py --help
```

for full documentation on the main functionalities of the CLI, regarding data generation (load_corpus.py) and SVM training (train_svm.py).

For more particular data processing usages (splitting and merging datasets), see also:

```bash
python split.py --help
python merge_datasets.csv.py --help
```

Get feats

With or without a preexisting feature list:

```bash
# without it
python load_corpus.py -s path/to/docs/* -t chars -n 3

# with it
python load_corpus.py -s path/to/docs/* -f feature_list.json -t chars -n 3
```

There are several other available options; see --help.

Alternatively, you can build samples out of the data, for a given number of verses or words:

```bash
# words from txt
python load_corpus.py -s data/psyche/train/* -t chars -n 3 -x txt --sampling --sample_units words --sample_size 1000

# verses from TEI-encoded docs
python load_corpus.py -s data/psyche/train/* -t chars -n 3 -x tei --sampling --sample_units verses --sample_size 200
```

There are many options for feature extraction (inclusion or exclusion of punctuation and symbols, sampling, source file formats, …) that can be accessed through the help.

Optional: Merge different features

You can merge several sets of features, extracted in csv with the previous commands, by doing:

```bash
python merge_datasets.csv.py -o merged.csv char3grams.csv words.csv affixes.csv
```
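Conceptually, merging amounts to aligning rows on a shared text identifier and concatenating the feature columns. A minimal pandas sketch (the column names here are invented for illustration):

```python
import pandas as pd

# Two feature tables for the same texts, indexed by source file identifier.
chars = pd.DataFrame({"c3_the": [0.10, 0.20]}, index=["Moliere_A", "Corneille_B"])
words = pd.DataFrame({"w_le": [0.05, 0.07]}, index=["Moliere_A", "Corneille_B"])

# Align on the shared index and join the columns side by side.
merged = pd.concat([chars, words], axis=1)
```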

Optional: Do a fixed split

You can either perform k-fold cross-validation (including leave-one-out), in which case this step is unnecessary, or do a classical train/test random split.
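The difference between the two strategies can be sketched in scikit-learn terms (an illustration of the concepts, not SuperStyl's internal code):

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut, train_test_split

X = np.arange(20).reshape(10, 2)  # 10 toy samples, 2 features
y = np.array([0, 1] * 5)

# Leave-one-out: as many folds as samples, so no fixed split is needed.
n_folds = sum(1 for _ in LeaveOneOut().split(X))

# Classical random split: one fixed train/test partition.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)
```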

If you want to do an initial random split,

```bash
python split.py feats_tests.csv
```

If you want to split according to an existing JSON file,

```bash
python split.py feats_tests.csv -s split.json
```

There are other available options, see --help, e.g.

```bash
python split.py feats_tests.csv -m langcert_revised.csv -e wilhelmus_train.csv
```

Train svm

It's quite simple really,

```bash
python train_svm.py path-to-train-data.csv [--test_path TEST_PATH] [--cross_validate {leave-one-out,k-fold}] [--k K] [--dim_reduc {pca}] [--norms] [--balance {class_weight,downsampling,Tomek,upsampling,SMOTE,SMOTETomek}] [--class_weights] [--kernel {LinearSVC,linear,polynomial,rbf,sigmoid}] [--final] [--get_coefs]
```
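In scikit-learn terms, combining normalisation, class weights, and a linear kernel amounts roughly to the following (an illustrative sketch under those assumptions, not SuperStyl's internals):

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Scale the features, then fit a linear SVM that reweights
# classes inversely to their frequency (cf. --norms, --class_weights).
clf = make_pipeline(StandardScaler(), SVC(kernel="linear", class_weight="balanced"))

X = [[0, 0], [0, 1], [1, 0], [1, 1]]  # toy feature rows
y = ["A", "A", "B", "B"]              # class labels from filenames
clf.fit(X, y)
pred = clf.predict([[0, 0.2]])
```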

For instance, using leave-one-out or 10-fold cross-validation

```bash
# e.g.
python train_svm.py data/feats_tests_train.csv --norms --cross_validate leave-one-out
python train_svm.py data/feats_tests_train.csv --norms --cross_validate k-fold --k 10
```

Or a train/test split

```bash
# e.g.
python train_svm.py data/feats_tests_train.csv --test_path test_feats.csv --norms
```

And for a final analysis, applied on unseen data:

```bash
# e.g.
python train_svm.py data/feats_tests_train.csv --test_path unseen.csv --norms --final
```

With a little more options,

```bash
# e.g.
python train_svm.py data/feats_tests_train.csv --test_path unseen.csv --norms --class_weights --final --get_coefs
```

Rolling Stylometry Visualization

If you've created samples using --sampling to segment your text into consecutive slices (e.g., every 1000 words):

bash python load_corpus.py -s data/text/*.txt -t chars -n 3 -o rolling_train --sampling --units words --sample_size 1000 python load_corpus.py -s data/text_to_predict/*.txt -t chars -n 3 -o rolling_unknown -f rolling_train_feats.json --sampling --units words --sample_size 1000 You can then train and produce final predictions, and directly visualize how the decision function changes across these segments:

```bash
python train_svm.py rolling_train.csv --test_path rolling_unknown.csv --final --plot_rolling --plot_smoothing 5
```

This will produce FINAL_PREDICTIONS.csv and a plot showing how the classifier's authorial signal varies segment by segment through the text. The --plot_smoothing option applies a simple moving-average smoothing to make trends clearer; if not set, the default window is 3, and smoothing can be disabled by setting it to None.
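The moving-average smoothing described above can be sketched as follows (illustrative; not SuperStyl's plotting code):

```python
import numpy as np

def moving_average(values, window=3):
    # Average each run of `window` consecutive decision-function values;
    # the output is shorter than the input by window - 1.
    kernel = np.ones(window) / window
    return np.convolve(values, kernel, mode="valid")

# Toy per-segment decision-function values for one text.
scores = [0.9, 0.7, 0.8, -0.2, -0.6, -0.4]
smoothed = moving_average(scores, window=3)
```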

Sources

Cite this repository

You can cite it using the CITATION.cff file (and GitHub's citation functionality), as follows:

BIBTEX:

```bibtex
@software{camps_cafiero_2024,
  author  = {Jean-Baptiste Camps and Florian Cafiero},
  title   = {{SUPERvised STYLometry (SuperStyl)}},
  month   = {11},
  year    = {2024},
  version = {v1.0},
  doi     = {10.5281/zenodo.14069799},
  url     = {https://doi.org/10.5281/zenodo.14069799}
}
```

MLA:

```plaintext
Camps, Jean-Baptiste, and Florian Cafiero. SUPERvised STYLometry (SuperStyl). Version 1.0, 11 Nov. 2024, doi:10.5281/zenodo.14069799.
```

APA:

```plaintext
Camps, J.-B., & Cafiero, F. (2024). SUPERvised STYLometry (SuperStyl) (Version v1.0) [Computer software]. https://doi.org/10.5281/zenodo.14069799
```

Owner

  • Name: SupervisedStylometry
  • Login: SupervisedStylometry
  • Kind: organization

Citation (CITATION.cff)

cff-version: 1.3.0
message: "If you use this software, please cite it as below."
authors:
  - family-names: Camps
    given-names: Jean-Baptiste
    orcid: https://orcid.org/0000-0003-0385-7037
  - family-names: Cafiero
    given-names: Florian
    orcid: https://orcid.org/0000-0002-1951-6942
title: "SUPERvised STYLometry (SuperStyl)"
version: v1.0
doi: 10.5281/zenodo.14069799
date-released: 2024-11-11

GitHub Events

Total
  • Create event: 10
  • Issues event: 14
  • Release event: 2
  • Watch event: 2
  • Delete event: 7
  • Member event: 1
  • Issue comment event: 22
  • Push event: 52
  • Pull request review event: 2
  • Pull request event: 20
Last Year
  • Create event: 10
  • Issues event: 14
  • Release event: 2
  • Watch event: 2
  • Delete event: 7
  • Member event: 1
  • Issue comment event: 22
  • Push event: 52
  • Pull request review event: 2
  • Pull request event: 20

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 3
  • Total pull requests: 5
  • Average time to close issues: almost 2 years
  • Average time to close pull requests: 10 days
  • Total issue authors: 2
  • Total pull request authors: 3
  • Average comments per issue: 0.67
  • Average comments per pull request: 0.4
  • Merged pull requests: 2
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 1
  • Pull requests: 5
  • Average time to close issues: N/A
  • Average time to close pull requests: 10 days
  • Issue authors: 1
  • Pull request authors: 3
  • Average comments per issue: 0.0
  • Average comments per pull request: 0.4
  • Merged pull requests: 2
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • Jean-Baptiste-Camps (19)
  • gabays (3)
  • EtienneFerrandi (1)
  • floriancafiero (1)
  • mailanlopez (1)
Pull Request Authors
  • Jean-Baptiste-Camps (18)
  • floriancafiero (10)
  • TheoMoins (1)
Top Labels
Issue Labels
bug (3) feature_request (2)
Pull Request Labels

Dependencies

requirements.txt pypi
  • argparse ==1.4.0
  • click *
  • fasttext ==0.9.1
  • imbalanced-learn ==0.8.1
  • joblib ==1.2.0
  • lxml ==4.9.1
  • matplotlib *
  • nltk ==3.6.6
  • numpy ==1.22.0
  • pandas ==1.3.4
  • pybind11 ==2.8.1
  • regex *
  • scikit-learn ==1.0.1
  • scipy ==1.7.3
  • six ==1.16.0
  • tqdm *
  • unidecode ==1.3.2
.github/workflows/python-package.yml actions
  • actions/checkout v3 composite
  • actions/setup-python v3 composite
  • codecov/codecov-action v4 composite