Science Score: 67.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: found CITATION.cff file
- ✓ codemeta.json file: found codemeta.json file
- ✓ .zenodo.json file: found .zenodo.json file
- ✓ DOI references: found 8 DOI reference(s) in README
- ✓ Academic publication links: links to zenodo.org
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (11.7%) to scientific vocabulary
Keywords
Repository
Supervised Stylometry
Basic Info
Statistics
- Stars: 23
- Watchers: 1
- Forks: 5
- Open Issues: 8
- Releases: 3
Topics
Metadata Files
README.md
SUPERvised STYLometry
Installing
You will need python3.9 or a later version, pip, and optionally virtualenv.
```bash
git clone https://github.com/SupervisedStylometry/SuperStyl.git
cd SuperStyl
virtualenv -p python3.9 env # or later
source env/bin/activate
pip install -r requirements.txt
```
Basic usage
To use Superstyl, you have two options:
- Use the provided command-line interface from your OS terminal (tested on Linux)
- Import Superstyl in a Python script or notebook, and use the API commands
You also need a collection of files containing the texts that you wish to analyse. The naming convention for source files in Superstyl is:
Class_anythingthatyouwant
For instance:
Moliere_Amphitryon.txt
The text before the first underscore will be used as the class for training models.
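For illustration, here is a minimal sketch (not Superstyl's internal code) of how such a class label can be read off a file name:

```python
from pathlib import Path

def class_from_filename(path: str) -> str:
    # Everything before the first underscore in the file name is the class label
    return Path(path).name.split("_", 1)[0]

print(class_from_filename("data/train/Moliere_Amphitryon.txt"))  # -> "Moliere"
```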
Command-Line Interface
A very simple usage, building a corpus of character 3-gram frequencies, training an SVM model with leave-one-out cross-validation, and predicting the class of unknown texts, would be:
```bash
# Creating the corpus and extracting character 3-grams from the text files
python load_corpus.py -s data/train/*.txt -t chars -n 3 -o train
python load_corpus.py -s data/test/*.txt -t chars -n 3 -o unknown -f train_feats.json
# Training a SVM, with cross-validation, and using it to predict the class of the unknown samples
python train_svm.py train.csv --test_path unknown.csv --cross_validate leave-one-out --final
```
The first two commands will write to disk the files train.csv and unknown.csv,
containing the metadata and feature frequencies for both sets of files,
as well as a file train_feats.json containing the list of features used.
The last one will print the cross-validation scores, and then write
to disk a file FINAL_PREDICTIONS.csv containing the class predictions
for the unknown texts.
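Since these outputs are plain CSV files, they can be inspected with pandas (already a Superstyl dependency); a quick sketch, making no assumption about the exact column names:

```python
import pandas as pd

# Inspect the predictions written by train_svm.py --final
preds = pd.read_csv("FINAL_PREDICTIONS.csv")
print(preds.columns.tolist())  # see which columns are available
print(preds.head())            # first few predicted classes
```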
This is just a small sample of all available corpus and training options.
To know more, do:
```bash
python load_corpus.py --help
python train_svm.py --help
```
Python API
A very simple usage, building a corpus, training an SVM model with cross-validation, and predicting the class of an unknown text, would be:
```python
import superstyl as sty
import glob

# Creating the corpus and extracting character 3-grams from the text files
train, train_feats = sty.load_corpus(glob.glob("data/train/*.txt"), feats="chars", n=3)
unknown, unknown_feats = sty.load_corpus(glob.glob("data/test/*.txt"), feat_list=train_feats, feats="chars", n=3)

# Training a SVM, with cross-validation, and using it
# to predict the class of the unknown samples
sty.train_svm(train, unknown, cross_validate="leave-one-out", final_pred=True)
```
This is just a small sample of all available corpus and training options.
To know more, do:
```python
help(sty.load_corpus)
help(sty.train_svm)
```
Advanced usage
FIXME: look inside the scripts, or do
```bash
python load_corpus.py --help
python train_svm.py --help
```
for full documentation on the main functionalities of the CLI, regarding data generation (load_corpus.py) and SVM training (train_svm.py).
For more particular data processing usages (splitting and merging datasets), see also:
```bash
python split.py --help
python merge_datasets.csv.py --help
```
Get feats
With or without a preexisting feature list:
```bash
# without a preexisting feature list
python load_corpus.py -s path/to/docs/* -t chars -n 3
# with it
python load_corpus.py -s path/to/docs/* -f feature_list.json -t chars -n 3
# There are several other available options
# See --help
```
Alternatively, you can build samples out of the data, for a given number of verses or words:
```bash
# words from txt
python load_corpus.py -s data/psyche/train/* -t chars -n 3 -x txt --sampling --sample_units words --sample_size 1000
# verses from TEI encoded docs
python load_corpus.py -s data/psyche/train/* -t chars -n 3 -x tei --sampling --sample_units verses --sample_size 200
```
You have many options for feature extraction (inclusion or not of punctuation and symbols, sampling, source file formats, …), all accessible through the help.
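To make the sampling idea concrete, here is a rough sketch of cutting a text into consecutive fixed-size word samples, in the spirit of --sampling --sample_units words (illustrative only, not the actual Superstyl implementation):

```python
def word_samples(text: str, size: int = 1000) -> list[str]:
    # Cut a text into consecutive samples of `size` words each,
    # as with --sampling --sample_units words --sample_size 1000
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

text = "word " * 2500  # toy text of 2,500 words
print([len(s.split()) for s in word_samples(text)])  # -> [1000, 1000, 500]
```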
Optional: Merge different features
You can merge several sets of features, extracted as CSV with the previous commands, by doing:
```bash
python merge_datasets.csv.py -o merged.csv char3grams.csv words.csv affixes.csv
```
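Conceptually, the merge joins the per-document feature tables on their shared document identifiers; a rough pandas sketch under that assumption (the column layout is assumed, this is not the script's actual code):

```python
import pandas as pd

# Assumption: the first column identifies the document, the rest are feature frequencies
char3 = pd.read_csv("char3grams.csv", index_col=0)
words = pd.read_csv("words.csv", index_col=0)

# Keep documents present in both tables, with their feature columns side by side
merged = char3.join(words, how="inner", lsuffix="_char3", rsuffix="_word")
merged.to_csv("merged.csv")
```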
Optional: Do a fixed split
You can either perform k-fold cross-validation (including leave-one-out), in which case this step is unnecessary, or do a classical train/test random split.
If you want to do an initial random split:
```bash
python split.py feats_tests.csv
```
If you want to split according to an existing JSON file:
```bash
python split.py feats_tests.csv -s split.json
```
There are other available options (see --help), e.g.:
```bash
python split.py feats_tests.csv -m langcert_revised.csv -e wilhelmus_train.csv
```
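For intuition, the random hold-out variant of this step resembles scikit-learn's train_test_split (scikit-learn is already a dependency); a minimal sketch, not split.py itself:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

feats = pd.read_csv("feats_tests.csv")
# Random 80/20 hold-out; split.py additionally supports fixed splits from a JSON file
train, test = train_test_split(feats, test_size=0.2, random_state=42)
train.to_csv("feats_train.csv", index=False)  # hypothetical output names
test.to_csv("feats_valid.csv", index=False)
```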
Train svm
It's quite simple really,
```bash
python train_svm.py path-to-train-data.csv [--test_path TEST_PATH] [--cross_validate {leave-one-out,k-fold}] [--k K] [--dim_reduc {pca}] [--norms] [--balance {class_weight,downsampling,Tomek,upsampling,SMOTE,SMOTETomek}] [--class_weights] [--kernel {LinearSVC,linear,polynomial,rbf,sigmoid}] [--final] [--get_coefs]
```
For instance, using leave-one-out or 10-fold cross-validation:
```bash
# e.g.
python train_svm.py data/feats_tests_train.csv --norms --cross_validate leave-one-out
python train_svm.py data/feats_tests_train.csv --norms --cross_validate k-fold --k 10
```
Or a train/test split:
```bash
# e.g.
python train_svm.py data/feats_tests_train.csv --test_path test_feats.csv --norms
```
And for a final analysis, applied on unseen data:
```bash
# e.g.
python train_svm.py data/feats_tests_train.csv --test_path unseen.csv --norms --final
```
With a few more options:
```bash
# e.g.
python train_svm.py data/feats_tests_train.csv --test_path unseen.csv --norms --class_weights --final --get_coefs
```
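As an aside, for a linear kernel what --get_coefs exposes corresponds to the fitted coefficients of a linear SVM, one weight per feature; a toy scikit-learn sketch of the same idea (not Superstyl's code):

```python
import numpy as np
from sklearn.svm import LinearSVC

# Toy data: rows = documents, columns = feature frequencies
X = np.array([[0.2, 0.1, 0.0],
              [0.0, 0.3, 0.1],
              [0.1, 0.0, 0.4],
              [0.3, 0.1, 0.1]])
y = ["Moliere", "Corneille", "Corneille", "Moliere"]

clf = LinearSVC(class_weight="balanced").fit(X, y)
print(clf.coef_)  # one weight per feature: its pull toward one class or the other
```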
Rolling Stylometry Visualization
If you've created samples using --sampling to segment your text into consecutive slices (e.g., every 1000 words):
```bash
python load_corpus.py -s data/text/*.txt -t chars -n 3 -o rolling_train --sampling --sample_units words --sample_size 1000
python load_corpus.py -s data/text_to_predict/*.txt -t chars -n 3 -o rolling_unknown -f rolling_train_feats.json --sampling --sample_units words --sample_size 1000
```
You can then train and produce final predictions, and directly visualize how the decision function changes across these segments:
```bash
python train_svm.py rolling_train.csv --test_path rolling_unknown.csv --final --plot_rolling --plot_smoothing 5
```
This will produce FINAL_PREDICTIONS.csv and a plot showing how the classifier's authorial signals vary segment by segment through the text. The --plot_smoothing option applies a simple moving-average smoothing to make trends clearer; if it is not given, the default value is 3, and it can be set to None.
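The moving-average smoothing itself is simple; a sketch of what a window of 5 does to a sequence of per-segment scores (illustrative, not the plotting code):

```python
import numpy as np

def moving_average(values, window=5):
    # Replace each point by the mean of a sliding window over its neighbours
    return np.convolve(values, np.ones(window) / window, mode="valid")

scores = [0.9, 0.2, 0.8, 0.1, 0.7, 0.6, 0.3]
print(moving_average(scores))  # smoother trend across consecutive segments
```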
Sources
Cite this repository
You can cite it using the CITATION.cff file (and GitHub's citation functionality), as follows:
BibTeX:
```bibtex
@software{camps_cafiero_2024,
  author = {Jean-Baptiste Camps and Florian Cafiero},
  title = {{SUPERvised STYLometry (SuperStyl)}},
  month = {11},
  year = {2024},
  version = {v1.0},
  doi = {10.5281/zenodo.14069799},
  url = {https://doi.org/10.5281/zenodo.14069799}
}
```
MLA:
```plaintext
Camps, Jean-Baptiste, and Florian Cafiero. *SUPERvised STYLometry (SuperStyl)*. Version 1.0, 11 Nov. 2024, doi:10.5281/zenodo.14069799.
```
APA:
```plaintext
Camps, J.-B., & Cafiero, F. (2024). SUPERvised STYLometry (SuperStyl) (Version v1.0) [Computer software]. https://doi.org/10.5281/zenodo.14069799
```
Owner
- Name: SupervisedStylometry
- Login: SupervisedStylometry
- Kind: organization
- Repositories: 1
- Profile: https://github.com/SupervisedStylometry
Citation (CITATION.cff)
```yaml
cff-version: 1.3.0
message: "If you use this software, please cite it as below."
authors:
  - family-names: Camps
    given-names: Jean-Baptiste
    orcid: https://orcid.org/0000-0003-0385-7037
  - family-names: Cafiero
    given-names: Florian
    orcid: https://orcid.org/0000-0002-1951-6942
title: "SUPERvised STYLometry (SuperStyl)"
version: v1.0
doi: 10.5281/zenodo.14069799
date-released: 2024-11-11
```
GitHub Events
Total
- Create event: 10
- Issues event: 14
- Release event: 2
- Watch event: 2
- Delete event: 7
- Member event: 1
- Issue comment event: 22
- Push event: 52
- Pull request review event: 2
- Pull request event: 20
Last Year
- Create event: 10
- Issues event: 14
- Release event: 2
- Watch event: 2
- Delete event: 7
- Member event: 1
- Issue comment event: 22
- Push event: 52
- Pull request review event: 2
- Pull request event: 20
Issues and Pull Requests
Last synced: 6 months ago
All Time
- Total issues: 3
- Total pull requests: 5
- Average time to close issues: almost 2 years
- Average time to close pull requests: 10 days
- Total issue authors: 2
- Total pull request authors: 3
- Average comments per issue: 0.67
- Average comments per pull request: 0.4
- Merged pull requests: 2
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 1
- Pull requests: 5
- Average time to close issues: N/A
- Average time to close pull requests: 10 days
- Issue authors: 1
- Pull request authors: 3
- Average comments per issue: 0.0
- Average comments per pull request: 0.4
- Merged pull requests: 2
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- Jean-Baptiste-Camps (19)
- gabays (3)
- EtienneFerrandi (1)
- floriancafiero (1)
- mailanlopez (1)
Pull Request Authors
- Jean-Baptiste-Camps (18)
- floriancafiero (10)
- TheoMoins (1)
Top Labels
Issue Labels
Pull Request Labels
Dependencies
- argparse ==1.4.0
- click *
- fasttext ==0.9.1
- imbalanced-learn ==0.8.1
- joblib ==1.2.0
- lxml ==4.9.1
- matplotlib *
- nltk ==3.6.6
- numpy ==1.22.0
- pandas ==1.3.4
- pybind11 ==2.8.1
- regex *
- scikit-learn ==1.0.1
- scipy ==1.7.3
- six ==1.16.0
- tqdm *
- unidecode ==1.3.2
- actions/checkout v3 composite
- actions/setup-python v3 composite
- codecov/codecov-action v4 composite