Science Score: 67.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: Found CITATION.cff file
- ✓ codemeta.json file: Found codemeta.json file
- ✓ .zenodo.json file: Found .zenodo.json file
- ✓ DOI references: Found 7 DOI reference(s) in README
- ✓ Academic publication links: Links to zenodo.org
- ○ Committers with academic emails
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: Low similarity (9.3%) to scientific vocabulary
Repository
Enriching Social Science Research via Survey Item Linking
Basic Info
Statistics
- Stars: 0
- Watchers: 1
- Forks: 1
- Open Issues: 1
- Releases: 0
Metadata Files
README.md
Enriching Social Science Research via Survey Item Linking (SIL)
This repository is the official implementation of Enriching Social Science Research via Survey Item Linking (2024).
Requirements
To install requirements, use either poetry or pip:
setup
poetry install
poetry install --only data_s44k # if you want to reproduce the S44k dataset
poetry install --only data_gsim # if you want to reproduce the GSIM dataset
setup
pip install -r requirements/requirements.txt
pip install -r requirements/data_s44k.txt # if you want to reproduce the S44k dataset
pip install -r requirements/data_gsim.txt # if you want to reproduce the GSIM dataset
[!IMPORTANT] After installing the requirements, the supplementary datasets (GSIM, LLM-Gen, or S44k) can be re-created by following the instructions for each. SILD is the exception: it is archived on Zenodo and should be downloaded from there and placed into the /data/sild/ directory.
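Before running the experiments, it can help to confirm that SILD was placed where the note above expects it. The following is a minimal sketch, not part of the repository: the helper name `sild_ready` is hypothetical, and it assumes that `/data/sild/` is meant relative to the repository root.

```python
from pathlib import Path


def sild_ready(repo_root: str = ".") -> bool:
    """Return True if <repo_root>/data/sild exists and is not empty.

    Assumption: the README's /data/sild/ path is relative to the repository root.
    """
    sild_dir = Path(repo_root) / "data" / "sild"
    # is_dir() is checked first so iterdir() is never called on a missing path.
    return sild_dir.is_dir() and any(sild_dir.iterdir())


if __name__ == "__main__":
    if sild_ready():
        print("SILD found in data/sild")
    else:
        print("SILD missing: download it from Zenodo and place it in data/sild")
```

The check only verifies that the directory is non-empty; it does not validate the contents of the dataset.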
Experiments
To run the experiments in the paper, run the following commands:
train
bash ./experiments/md/pretrain.slurm # continue pretraining PLMs on S44k
bash ./experiments/md/train_linear.sh # train linear classifiers on SILD
bash ./experiments/md/train_linear_da.sh # train linear classifiers using data augmentation
bash ./experiments/md/train_plms.sh # fine-tune PLMs on SILD
bash ./experiments/md/train_plms_da.sh # fine-tune PLMs using data augmentation
bash ./experiments/md/train_knn.sh # train kNN on SILD
bash ./experiments/md/eval_rac.sh # combine the best PLM with the best kNN
bash ./experiments/md/eval_icl.sh # evaluate In-Context Learning
bash ./experiments/ed/eval_bm25.sh # evaluate BM25
bash ./experiments/ed/eval_plms.sh # evaluate PLMs (including sentence transformers)
bash ./experiments/ed/train_sosse.sh # train SoSSE models by fine-tuning sentence-transformers on GSIM and LLM-Gen
bash ./experiments/ed/eval_sosse.sh # evaluate SoSSE models
Models
[!IMPORTANT] The models will be uploaded to HuggingFace Hub soon!
You can download multilingual pretrained models for the social science domain:
- SSOAR-XLM-R-base is pre-trained on S44k using masked language modeling (MLM), a batch size of 8, and a sequence length of 512 tokens.
You can download multilingual fine-tuned models for MD on SILD, which used a batch size of 32 and a sequence length of 64, here:
- XLM-R-base-SILD is fine-tuned on SILD using ... .
- XLM-R-large-SILD is fine-tuned on SILD using ... .
- SSOAR-XLM-R-base-SILD is pre-trained on S44k and then fine-tuned on SILD using ... .
You can download multilingual fine-tuned models for ED on LLM-Gen, which used a batch size of 1024 and a sequence length of 512, here:
- SoSSE-mE5-base is fine-tuned on LLM-Gen using ... .
Results
Our models achieve the following performance:
Mention Detection (MD) on SILD
| Model name | F1-binary (English) | F1-binary (German) | F1-binary (Total) |
| --- | --- | --- | --- |
| XLM-R-base-SILD | 58.5% | 53.9% | 57.1% |
| SSOAR-XLM-R-base-SILD | 60.7% | 61.8% | 61.0% |
| XLM-R-large-SILD | 61.4% | 65.1% | 62.6% |
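For readers unfamiliar with the metric, binary F1 is the harmonic mean of precision and recall on the positive class. The sketch below illustrates the standard definition only; it is not the repository's evaluation code, and the function name and toy labels are hypothetical. It assumes one binary label per example (1 = contains a survey item mention), which the README does not spell out.

```python
def f1_binary(y_true, y_pred, positive=1):
    """Binary F1 for the positive class (standard definition, illustrative only)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # No true positives: precision and recall are both zero (or undefined).
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy labels: 2 true positives, 1 false positive, 1 false negative.
gold = [1, 0, 1, 1, 0]
pred = [1, 0, 0, 1, 1]
print(round(f1_binary(gold, pred), 3))  # 0.667
```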
Entity Disambiguation (ED) on SILD
| Model name | MAP@10 (English) | MAP@10 (German) |
| --- | --- | --- |
| mE5-base (baseline) | 57.9% | 65.6% |
| SoSSE-mE5-base | 63.2% | 68.1% |
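MAP@10 averages, over all queries, the precision at each rank where a relevant item appears within the top 10 results. The sketch below shows one common convention and is not the repository's evaluation code: it normalizes by min(number of relevant items, k), which is an assumption since other variants exist. The function names and toy data are hypothetical, and each ranking is assumed to contain no duplicates.

```python
def average_precision_at_k(ranked, relevant, k=10):
    """AP@k for one query, normalized by min(len(relevant), k) (assumed convention)."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    hits = 0
    score = 0.0
    for rank, item in enumerate(ranked[:k], start=1):
        if item in relevant:
            hits += 1
            score += hits / rank  # precision at this rank
    return score / min(len(relevant), k)


def map_at_k(rankings, relevants, k=10):
    """Mean of AP@k over all queries; returns 0.0 when there are no queries."""
    aps = [average_precision_at_k(r, rel, k) for r, rel in zip(rankings, relevants)]
    return sum(aps) / len(aps) if aps else 0.0


# Hypothetical toy data: two mentions, each with a ranked list of candidate items.
rankings = [["a", "b", "c"], ["x", "y"]]
relevants = [{"a", "c"}, {"y"}]
print(round(map_at_k(rankings, relevants), 3))  # 0.667
```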
Licensing Information
Dataset licensing can be found under the respective directories (SILD, GSIM, LLM-Gen, or S44k). This work (including the models and the annotations) is licensed under CC BY 4.0.
Owner
- Name: ₸ornike
- Login: e-tornike
- Kind: user
- Location: Germany
- Website: e-tornike.github.io/
- Twitter: e_tornike
- Repositories: 4
- Profile: https://github.com/e-tornike
PhD Candidate | Computational Linguistics | NLP | Machine Learning @e_tornike@sigmoid.social
Citation (CITATION.cff)
cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
  - family-names: "Tsereteli"
    given-names: "Tornike"
    orcid: "https://orcid.org/0000-0003-4298-3570"
  - family-names: "Ruffinelli"
    given-names: "Daniel"
    orcid: "https://orcid.org/0000-0002-4831-2930"
  - family-names: "Ponzetto"
    given-names: "Simone Paolo"
    orcid: "https://orcid.org/0000-0001-7484-2049"
title: "Enriching Social Science Research via Survey Item Linking"
version: 1.0.0
doi:
date-released: 2024-06-14
url: "https://github.com/e-tornike/SIL"
GitHub Events
Total
- Push event: 8
- Fork event: 1
Last Year
- Push event: 8
- Fork event: 1
Issues and Pull Requests
Last synced: 12 months ago
All Time
- Total issues: 1
- Total pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Total issue authors: 1
- Total pull request authors: 0
- Average comments per issue: 0.0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 1
- Pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Issue authors: 1
- Pull request authors: 0
- Average comments per issue: 0.0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- e-tornike (1)