sil

Enriching Social Science Research via Survey Item Linking

https://github.com/e-tornike/sil

Science Score: 67.0%

This score indicates how likely this project is to be science-related based on various indicators:

✓ CITATION.cff file: Found CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
✓ .zenodo.json file: Found .zenodo.json file
✓ DOI references: Found 7 DOI reference(s) in README
✓ Academic publication links: Links to zenodo.org
○ Committers with academic emails
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (9.3%) to scientific vocabulary

Last synced: 9 months ago

Repository

Enriching Social Science Research via Survey Item Linking

Basic Info

Host: GitHub
Owner: e-tornike
Language: Jupyter Notebook
Default Branch: main
Homepage:
Size: 706 KB

Statistics

Stars: 0
Watchers: 1
Forks: 1
Open Issues: 1
Releases: 0

Created about 2 years ago · Last pushed over 1 year ago

Metadata Files

Readme Citation

Enriching Social Science Research via Survey Item Linking (SIL)

This repository is the official implementation of Enriching Social Science Research via Survey Item Linking (2024).

A figure showing the pipeline for Survey Item Linking

Requirements

To install requirements, use either poetry or pip:

    poetry install
    poetry install --only data_s44k  # if you want to reproduce the S44k dataset
    poetry install --only data_gsim  # if you want to reproduce the GSIM dataset

    pip install -r requirements/requirements.txt
    pip install -r requirements/data_s44k.txt  # if you want to reproduce the S44k dataset
    pip install -r requirements/data_gsim.txt  # if you want to reproduce the GSIM dataset

[!IMPORTANT] After installing the requirements, the supplementary datasets can be re-created by following the instructions for each (GSIM, LLM-Gen, or S44k). The exception is SILD, which is archived on Zenodo and should be downloaded and placed into the /data/sild/ directory.

Experiments

To run the experiments in the paper, run the following commands:

    bash ./experiments/md/pretrain.slurm       # continue pretraining PLMs on S44k
    bash ./experiments/md/train_linear.sh      # train linear classifiers on SILD
    bash ./experiments/md/train_linear_da.sh   # train linear classifiers using data augmentation
    bash ./experiments/md/train_plms.sh        # fine-tune PLMs on SILD
    bash ./experiments/md/train_plms_da.sh     # fine-tune PLMs using data augmentation
    bash ./experiments/md/train_knn.sh         # train kNN on SILD
    bash ./experiments/md/eval_rac.sh          # combine the best PLM w/ the best kNN
    bash ./experiments/md/eval_icl.sh          # evaluate In-Context Learning
    bash ./experiments/ed/eval_bm25.sh         # evaluate BM25
    bash ./experiments/ed/eval_plms.sh         # evaluate PLMs (including sentence transformers)
    bash ./experiments/ed/train_sosse.sh       # train SoSSE models by fine-tuning sentence-transformers on GSIM and LLM-Gen
    bash ./experiments/ed/eval_sosse.sh        # evaluate SoSSE models
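The ED experiments above include a BM25 baseline. As a rough illustration of what such a lexical baseline does, the sketch below ranks candidate survey items for a mention sentence with Okapi BM25. It is not the code behind `eval_bm25.sh`: the function name `bm25_scores`, the whitespace tokenization, the parameter values `k1=1.5` and `b=0.75`, the IDF variant, and the toy survey items are all assumptions made for this example.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, docs_tokens, k1=1.5, b=0.75):
    """Score every tokenized document against a tokenized query with Okapi BM25.

    Hypothetical helper for illustration; k1 and b are common defaults,
    not values taken from the repository.
    """
    n_docs = len(docs_tokens)
    if n_docs == 0:
        return []
    avg_len = sum(len(doc) for doc in docs_tokens) / n_docs
    if avg_len == 0:
        return [0.0] * n_docs

    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter()
    for doc in docs_tokens:
        doc_freq.update(set(doc))

    scores = []
    for doc in docs_tokens:
        term_freq = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in term_freq:
                continue
            # Non-negative IDF variant: ln(1 + (N - n + 0.5) / (n + 0.5)).
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Term-frequency saturation with document-length normalization.
            norm = term_freq[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * term_freq[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Toy survey items (made up for this example) and one mention sentence.
items = [
    "how much do you trust the national parliament".split(),
    "how satisfied are you with your life".split(),
    "do you trust the police".split(),
]
mention = "respondents reported low trust in parliament".split()

scores = bm25_scores(mention, items)
best_item = max(range(len(items)), key=lambda i: scores[i])
```

Here the first item shares two terms with the mention and therefore ranks highest, while the second item shares none and scores zero. A purely lexical ranker like this cannot match paraphrased mentions, which is one reason to compare it against embedding-based models.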

Models

[!IMPORTANT] The models will be uploaded to HuggingFace Hub soon!

You can download multilingual pretrained models for the social science domain:

SSOAR-XLM-R-base is pre-trained on S44k using masked language modeling (MLM), a batch size of 8, and a sequence length of 512 tokens.

You can download multilingual fine-tuned models for mention detection (MD) on SILD, which used a batch size of 32 and a sequence length of 64, here:

XLM-R-base-SILD is fine-tuned on SILD using ... .
XLM-R-large-SILD is fine-tuned on SILD using ... .
SSOAR-XLM-R-base-SILD is pre-trained on S44k and then fine-tuned on SILD using ... .

You can download multilingual fine-tuned models for entity disambiguation (ED) on LLM-Gen, which used a batch size of 1024 and a sequence length of 512, here:

SoSSE-mE5-base is fine-tuned on LLM-Gen using ... .

Results

Our models achieve the following performance:

Mention Detection (MD) on SILD

| Model name | F1-binary (English) | F1-binary (German) | F1-binary (Total) |
| --------------------- | ------------------- | ------------------ | ----------------- |
| XLM-R-base-SILD | 58.5% | 53.9% | 57.1% |
| SSOAR-XLM-R-base-SILD | 60.7% | 61.8% | 61.0% |
| XLM-R-large-SILD | 61.4% | 65.1% | 62.6% |
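For readers unfamiliar with the metric in the table above, here is a minimal sketch of a binary F1 computation. It assumes MD is scored as binary classification with label 1 for the positive class (a sentence that mentions a survey item); the function name `binary_f1` and the toy labels are made up for this example and are not taken from the repository.

```python
def binary_f1(gold, pred):
    """F1 of the positive class (label 1) for two equal-length label lists."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    if tp == 0:
        # No true positives: precision or recall is zero (or undefined).
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy labels: 1 = sentence mentions a survey item, 0 = it does not.
gold = [1, 0, 1, 1, 0, 0]
pred = [1, 1, 0, 1, 0, 0]
f1 = binary_f1(gold, pred)  # 2 true positives, 1 false positive, 1 false negative
```

With two true positives, one false positive, and one false negative, precision and recall are both 2/3, so F1 is also 2/3. Because only the positive class counts, correctly rejected sentences do not raise the score.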

Entity Disambiguation (ED) on SILD

| Model name | MAP@10 (English) | MAP@10 (German) |
| ------------------- | ---------------- | --------------- |
| mE5-base (baseline) | 57.9% | 65.6% |
| SoSSE-mE5-base | 63.2% | 68.1% |
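The MAP@10 figures above summarize ranking quality over many mentions. The sketch below shows one common way to compute it, normalizing each query's average precision by the smaller of the number of relevant items and k. Whether the paper uses this exact normalization is an assumption, and the function names and toy identifiers are hypothetical.

```python
def average_precision_at_k(ranked_ids, relevant_ids, k=10):
    """AP@k for one query; ranked_ids is assumed to contain no duplicates."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, item in enumerate(ranked_ids[:k], start=1):
        if item in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this cut-off
    return precision_sum / min(len(relevant), k)


def mean_average_precision_at_k(rankings, relevants, k=10):
    """Mean of AP@k over all queries (here: mention sentences)."""
    scores = [average_precision_at_k(r, rel, k) for r, rel in zip(rankings, relevants)]
    return sum(scores) / len(scores) if scores else 0.0


# Two toy queries with made-up survey item identifiers.
rankings = [["v1", "v2", "v3"], ["a", "b"]]
relevants = [{"v1", "v3"}, {"b"}]
map_at_10 = mean_average_precision_at_k(rankings, relevants, k=10)
```

In the first query the relevant items sit at ranks 1 and 3, giving (1/1 + 2/3) / 2 = 5/6. In the second the only relevant item is at rank 2, giving 1/2. The mean of the two is 2/3.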

Licensing Information

Dataset licensing can be found under the respective directories (SILD, GSIM, LLM-Gen, or S44k). This work (including the models and the annotations) is licensed under CC BY 4.0.

Owner

Name: ₸ornike
Login: e-tornike
Kind: user
Location: Germany

Website: e-tornike.github.io/
Twitter: e_tornike
Repositories: 4
Profile: https://github.com/e-tornike

PhD Candidate | Computational Linguistics | NLP | Machine Learning @e_tornike@sigmoid.social

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
- family-names: "Tsereteli"
  given-names: "Tornike"
  orcid: "https://orcid.org/0000-0003-4298-3570"
- family-names: "Ruffinelli"
  given-names: "Daniel"
  orcid: "https://orcid.org/0000-0002-4831-2930"
- family-names: "Ponzetto"
  given-names: "Simone Paolo"
  orcid: "https://orcid.org/0000-0001-7484-2049"
title: "Enriching Social Science Research via Survey Item Linking"
version: 1.0.0
doi: 
date-released: 2024-06-14
url: "https://github.com/e-tornike/SIL"

GitHub Events

Total

Push event: 8
Fork event: 1

Last Year

Push event: 8
Fork event: 1

Committers

Last synced: 12 months ago

All Time

Total Commits: 9
Total Committers: 1
Avg Commits per committer: 9.0
Development Distribution Score (DDS): 0.0

Past Year

Commits: 8
Committers: 1
Avg Commits per committer: 8.0
Development Distribution Score (DDS): 0.0

Top Committers

Name	Email	Commits
₸ornike	2****e	9
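
The Development Distribution Score (DDS) reported above can be reproduced with a short calculation. This sketch assumes the usual definition, one minus the top committer's share of all commits; the function name is hypothetical and the definition is not confirmed by this page.

```python
def development_distribution_score(commit_counts):
    """DDS = 1 - (commits by the top committer / total commits).

    commit_counts holds one commit count per committer. The formula is an
    assumed definition, not one stated on this page.
    """
    total = sum(commit_counts)
    if total == 0:
        return 0.0
    return 1 - max(commit_counts) / total


# This repository: a single committer with 9 commits.
dds_single = development_distribution_score([9])

# A made-up project with three committers, for contrast.
dds_shared = development_distribution_score([6, 3, 1])
```

With a single committer the top committer holds every commit, so the score is 0.0, matching the values listed above. The made-up three-committer project scores about 0.4.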

Issues and Pull Requests

Last synced: 12 months ago

All Time

Total issues: 1
Total pull requests: 0
Average time to close issues: N/A
Average time to close pull requests: N/A
Total issue authors: 1
Total pull request authors: 0
Average comments per issue: 0.0
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 1
Pull requests: 0
Average time to close issues: N/A
Average time to close pull requests: N/A
Issue authors: 1
Pull request authors: 0
Average comments per issue: 0.0
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0
