https://github.com/antoinelemor/augmentedsocialscientistfork
A lightweight fork of AugmentedSocialScientist that streamlines BERT/CamemBERT fine-tuning for social-science datasets with per-epoch logging, intelligent best-model selection, optional reinforced training, and seamless CPU/CUDA/MPS support.
Science Score: 26.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file: found codemeta.json file
- ✓ .zenodo.json file: found .zenodo.json file
- ○ DOI references
- ○ Academic publication links
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (6.7%) to scientific vocabulary
Statistics
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
AugmentedSocialScientist enhanced fork
Fine‑tuning BERT & friends for social‑science projects, with robust tracking, smart model selection, and a reinforced‑learning safety‑net.
1 Overview
This repository is a fork of the original rubingshen/AugmentedSocialScientist that adds new functionality.
All base classes (BertBase, CamembertBase, …) function identically while exposing the additional capabilities listed below.
2 Key capabilities
| Capability | Description |
| --------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| Comprehensive metric logging | Every epoch is appended to training_logs/training_metrics.csv (and reinforced_training_metrics.csv if applicable) with losses, per‑class precision/recall/F1, and macro F1. |
| Per‑epoch checkpoints | A lightweight checkpoint is written after each epoch; only the best checkpoint (see below) is retained to save disk space. |
| Smart best‑model selection | By default the model maximising 0.7 × F1₁ + 0.3 × macro‑F1 is kept. The weight and even the formula can be overridden. |
| Automatic reinforced training | When the positive‑class F1 stays below 0.60, the library launches an adaptive reinforced phase with class‑weighted loss, oversampling, larger batches and a reduced learning rate. |
| Rescue logic for class 1 | If the best normal model achieved F1₁ == 0, reinforced training considers any epoch where F1₁ > f1_1_rescue_threshold (default 0) as an improvement. |
| Apple Silicon / MPS support | Native GPU acceleration on macOS (M‑series) sits alongside CUDA and CPU fall‑backs—no flags required. |
3 Metric tracking
- Calling `run_training` automatically creates the CSV logs mentioned above.
- A concise summary of any newly selected checkpoint (normal or reinforced) is appended to `training_logs/best_models.csv`.
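
The append-per-epoch behaviour can be pictured with a minimal sketch. This is not the fork's actual code: the column names and the `append_epoch_metrics` and `read_metrics` helpers are assumptions made for illustration, and only the file name `training_metrics.csv` comes from this README.

```python
import csv
import os
import tempfile

# Hypothetical column names; the fork's real CSV header may differ.
FIELDS = ["epoch", "train_loss", "val_loss", "f1_0", "f1_1", "macro_f1"]


def append_epoch_metrics(log_dir, row):
    """Append one epoch's metrics to <log_dir>/training_metrics.csv."""
    os.makedirs(log_dir, exist_ok=True)
    path = os.path.join(log_dir, "training_metrics.csv")
    write_header = not os.path.exists(path)  # header only on first write
    with open(path, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if write_header:
            writer.writeheader()
        writer.writerow(row)
    return path


def read_metrics(path):
    """Load the log back as a list of dicts (all values are strings)."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))


# Demo with made-up numbers, written to a temporary directory.
demo_dir = tempfile.mkdtemp()
for epoch, f1_1 in enumerate([0.42, 0.58], start=1):
    log_path = append_epoch_metrics(demo_dir, {
        "epoch": epoch, "train_loss": 0.5, "val_loss": 0.6,
        "f1_0": 0.9, "f1_1": f1_1, "macro_f1": (0.9 + f1_1) / 2,
    })
rows = read_metrics(log_path)
```

Appending rather than rewriting the file means a crashed run still leaves the metrics of every completed epoch on disk.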
4 Checkpointing & model selection
- The combined metric is re‑evaluated after every epoch.
- When it improves, the corresponding checkpoint is moved to `models/<name>/` and the previous best is deleted.
- Upon completion, the folder `models/<save_model_as>` always contains the best checkpoint (whether it came from the main loop or the reinforced phase).
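
As a rough illustration of the selection rule, the sketch below computes the default combined metric and picks the best epoch from a list of scores. The function names and the example scores are hypothetical, and weighting macro-F1 by `1 - f1_class_1_weight` is an assumption inferred from the default 0.7 / 0.3 split; the fork's own implementation may differ.

```python
def combined_metric(f1_class_1, macro_f1, f1_class_1_weight=0.7):
    """Weighted blend of positive-class F1 and macro F1.

    Assumes the macro-F1 weight is the complement of f1_class_1_weight.
    """
    return f1_class_1_weight * f1_class_1 + (1.0 - f1_class_1_weight) * macro_f1


def select_best_epoch(epoch_scores, f1_class_1_weight=0.7):
    """Return (epoch, score) for the epoch with the highest combined metric.

    epoch_scores is a list of (f1_class_1, macro_f1) pairs, one per epoch.
    """
    best_epoch, best_score = None, float("-inf")
    for epoch, (f1_1, macro_f1) in enumerate(epoch_scores, start=1):
        score = combined_metric(f1_1, macro_f1, f1_class_1_weight)
        if score > best_score:  # strict improvement only
            best_epoch, best_score = epoch, score
    return best_epoch, best_score


# Made-up validation scores for three epochs.
scores = [(0.40, 0.60), (0.55, 0.70), (0.50, 0.75)]
best_epoch, best_score = select_best_epoch(scores)
```

With these numbers the second epoch wins: its higher positive-class F1 outweighs the third epoch's better macro-F1 because of the 0.7 weight.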
5 Reinforced‑training safety‑net
If reinforced_learning=True and F1(class 1) < 0.60 at the end of the main loop, an additional cycle starts:
- Oversampling of the minority class through `WeightedRandomSampler`.
- Batch size doubled to 64 and learning rate reduced (default 5e‑6).
- Weighted cross‑entropy with `pos_weight = 2.0` emphasises the positive class.
- Full logging to `reinforced_training_metrics.csv` and standard checkpoint selection.
- Optional rescue logic (`rescue_low_class1_f1=True`) promotes any epoch where F1₁ breaks the zero‑barrier (threshold configurable).
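
The decision logic of this phase can be sketched in plain Python. The three helper functions below are hypothetical and only mirror the rules stated in this README; inverse class frequency is one common way to build oversampling weights and is an assumption here, not necessarily what the fork does.

```python
from collections import Counter


def needs_reinforced_phase(f1_class_1, reinforced_learning=True, threshold=0.60):
    """True when the safety-net phase should start after the main loop."""
    return reinforced_learning and f1_class_1 < threshold


def is_rescue_improvement(best_f1_1, epoch_f1_1,
                          rescue_low_class1_f1=True,
                          f1_1_rescue_threshold=0.0):
    """True when an epoch counts as an improvement under the rescue rule.

    Applies only if the best normal model never detected class 1 (F1 == 0).
    """
    return (rescue_low_class1_f1
            and best_f1_1 == 0
            and epoch_f1_1 > f1_1_rescue_threshold)


def inverse_frequency_weights(labels):
    """One weight per sample, so rarer classes are drawn more often."""
    counts = Counter(labels)
    return [1.0 / counts[y] for y in labels]


# Four negatives and one positive: the positive gets four times the weight.
labels = [0, 0, 0, 0, 1]
sample_weights = inverse_frequency_weights(labels)
```

Weights of this shape are what `torch.utils.data.WeightedRandomSampler(weights, num_samples, replacement=True)` expects as its first argument.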
6 Device auto‑detection
BertBase.__init__() selects the computation device in this order:
- CUDA
- Apple Silicon MPS
- CPU
A one‑line message confirms the choice at runtime.
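
A minimal sketch of that priority order is shown below. `pick_device` and `detect_device` are hypothetical names, and the fallback to CPU when PyTorch is not installed exists only so the sketch runs anywhere; the fork itself requires torch.

```python
def pick_device(cuda_available, mps_available):
    """Apply the CUDA > MPS > CPU priority order."""
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"


def detect_device():
    """Query PyTorch for available back-ends and pick one."""
    try:
        import torch
    except ImportError:
        # Assumption for this sketch only: no torch means CPU.
        return pick_device(False, False)
    # torch.backends.mps is absent in older PyTorch builds.
    mps_backend = getattr(torch.backends, "mps", None)
    mps_ok = mps_backend is not None and mps_backend.is_available()
    return pick_device(torch.cuda.is_available(), mps_ok)


print(f"Using device: {detect_device()}")
```

Keeping the priority rule in a pure function makes it easy to check without any GPU present.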
7 Quick‑start
```python
from AugmentedSocialScientistFork import BertBase

# 1 – encode data
model = BertBase(model_name="bert-base-cased")
train_loader = model.encode(train_texts, train_labels)
val_loader = model.encode(val_texts, val_labels)

# 2 – train & keep the best checkpoint
after_training_scores = model.run_training(
    train_loader,
    val_loader,
    n_epochs=10,
    save_model_as="my_policy_model",
    reinforced_learning=True,
    rescue_low_class1_f1=True,
)

# 3 – reload & predict
best_model = model.load_model("./models/my_policy_model")
probas = model.predict_with_model(val_loader, "./models/my_policy_model")
```
Typical console excerpt:
```
======== Epoch 4 / 10 ========
Training...
Average training loss: 0.35
Running Validation...
New best model found at epoch 4 with combined metric = 0.7123
```
Resulting layout:
```
models/
└── my_policy_model/                    # final checkpoint
training_logs/
├── training_metrics.csv                # main loop
├── best_models.csv                     # checkpoint summary
└── reinforced_training_metrics.csv     # only if reinforced phase executed
```
8 Configuration reference
| Argument | Default | Purpose |
| ----------------------- | ------------------- | --------------------------------------------------------- |
| n_epochs | 3 | Epochs in the main loop. |
| lr | 5e‑5 | Learning rate in the main loop. |
| f1_class_1_weight | 0.7 | Weight of F1₁ in the combined metric. |
| metrics_output_dir | "./training_logs" | Location of CSV logs. |
| pos_weight | None | Class weights for the loss in the main loop. |
| reinforced_learning | False | Enable the safety‑net phase. |
| n_epochs_reinforced | 2 | Epochs in the reinforced phase. |
| rescue_low_class1_f1 | False | Activate the rescue logic for stalled F1₁. |
| f1_1_rescue_threshold | 0.0 | F1₁ value an epoch must exceed to trigger rescue promotion. |
Hyper‑parameters inside reinforced_training (batch size, LR, pos_weight) can be overridden by subclassing or editing the method.
9 Installation
```bash
git clone https://github.com/antoinelemor/AugmentedSocialScientistFork.git
cd AugmentedSocialScientistFork
pip install -e .
```
Requirements: Python 3.10+, torch >= 2.0, transformers >= 4.40.
10 License & citation
This fork remains under the original MIT License.
If used academically, please cite the upstream repository: rubingshen/AugmentedSocialScientist, and this repo if you're cool.
Happy fine‑tuning!
Owner
- Name: Antoine Lemor
- Login: antoinelemor
- Kind: user
- Location: Montréal
- Company: Université de Montréal
- Twitter: AntoineLemor
- Repositories: 1
- Profile: https://github.com/antoinelemor
PhD candidate in political science • science & public policy
GitHub Events
Total
- Public event: 1
- Push event: 14
Last Year
- Public event: 1
- Push event: 14
Dependencies
- pandas >=1.5
- torch >=1.13
- transformers >=4.30