sil

Enriching Social Science Research via Survey Item Linking

https://github.com/e-tornike/sil

Science Score: 67.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 7 DOI reference(s) in README
  • Academic publication links
    Links to: zenodo.org
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (9.3%) to scientific vocabulary
Last synced: 9 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: e-tornike
  • Language: Jupyter Notebook
  • Default Branch: main
  • Homepage:
  • Size: 706 KB
Statistics
  • Stars: 0
  • Watchers: 1
  • Forks: 1
  • Open Issues: 1
  • Releases: 0
Created about 2 years ago · Last pushed over 1 year ago

README.md

Enriching Social Science Research via Survey Item Linking (SIL)

This repository is the official implementation of Enriching Social Science Research via Survey Item Linking (2024).

A figure showing the pipeline for Survey Item Linking

Requirements

To install requirements, use either poetry or pip:

    poetry install
    poetry install --only data_s44k  # if you want to reproduce the S44k dataset
    poetry install --only data_gsim  # if you want to reproduce the GSIM dataset

    pip install -r requirements/requirements.txt
    pip install -r requirements/data_s44k.txt  # if you want to reproduce the S44k dataset
    pip install -r requirements/data_gsim.txt  # if you want to reproduce the GSIM dataset

[!IMPORTANT] After installing the requirements, the supplementary datasets (GSIM, LLM-Gen, and S44k) can be re-created by following the instructions for each. SILD is the exception: it should be downloaded from here and placed into the /data/sild/ directory. SILD is archived on Zenodo.

Experiments

To run the experiments in the paper, run the following commands:

    bash ./experiments/md/pretrain.slurm       # continue pretraining PLMs on S44k
    bash ./experiments/md/train_linear.sh      # train linear classifiers on SILD
    bash ./experiments/md/train_linear_da.sh   # train linear classifiers using data augmentation
    bash ./experiments/md/train_plms.sh        # fine-tune PLMs on SILD
    bash ./experiments/md/train_plms_da.sh     # fine-tune PLMs using data augmentation
    bash ./experiments/md/train_knn.sh         # train kNN on SILD
    bash ./experiments/md/eval_rac.sh          # combine the best PLM w/ the best kNN
    bash ./experiments/md/eval_icl.sh          # evaluate In-Context Learning
    bash ./experiments/ed/eval_bm25.sh         # evaluate BM25
    bash ./experiments/ed/eval_plms.sh         # evaluate PLMs (including sentence transformers)
    bash ./experiments/ed/train_sosse.sh       # train SoSSE models by fine-tuning sentence-transformers on GSIM and LLM-Gen
    bash ./experiments/ed/eval_sosse.sh        # evaluate SoSSE models
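
For readers unfamiliar with the BM25 baseline evaluated for ED, the following is a minimal sketch of Okapi BM25 ranking over a small set of candidate survey items. It is not the code behind `eval_bm25.sh`: the whitespace tokenization, the values `k1=1.5` and `b=0.75`, the smoothed idf variant, the function name `bm25_rank`, and the example items are all assumptions made for illustration.

```python
import math
from collections import Counter
from typing import List, Tuple


def bm25_rank(query: str, docs: List[str], k1: float = 1.5, b: float = 0.75) -> List[Tuple[int, float]]:
    """Return (doc_index, score) pairs sorted by descending BM25 score.

    Assumptions: lowercase whitespace tokenization and the smoothed idf
    ln((N - df + 0.5) / (df + 0.5) + 1), which is always non-negative.
    """
    if not docs:
        return []
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs

    # Document frequency: in how many documents each term occurs.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    query_terms = set(query.lower().split())
    scores = []
    for idx, tokens in enumerate(tokenized):
        tf = Counter(tokens)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            length_norm = k1 * (1.0 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1.0) / (tf[term] + length_norm)
        scores.append((idx, score))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Hypothetical survey items and a hypothetical mention sentence.
items = [
    "How much do you trust the national parliament",
    "What is your age in years",
    "How satisfied are you with your life",
]
for index, score in bm25_rank("trust in parliament", items):
    print(index, round(score, 3))
```

In this toy setup the first item shares two terms with the query and therefore ranks first, while the last item shares none and scores zero.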

Models

[!IMPORTANT] The models will be uploaded to HuggingFace Hub soon!

You can download multilingual pretrained models for the social science domain:

  • SSOAR-XLM-R-base is pre-trained on S44k using masked language modeling (MLM), a batch size of 8, and a sequence length of 512 tokens.

You can download multilingual fine-tuned models for MD on SILD, which used a batch size of 32 and a sequence length of 64, here:

You can download multilingual fine-tuned models for ED on LLM-Gen, which used a batch size of 1024 and a sequence length of 512, here:

Results

Our models achieve the following performance:

Mention Detection (MD) on SILD

| Model name | F1-binary (English) | F1-binary (German) | F1-binary (Total) |
| --------------------- | ------------------- | ------------------ | ----------------- |
| XLM-R-base-SILD | 58.5% | 53.9% | 57.1% |
| SSOAR-XLM-R-base-SILD | 60.7% | 61.8% | 61.0% |
| XLM-R-large-SILD | 61.4% | 65.1% | 62.6% |
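
As a rough illustration of the F1-binary metric reported for MD, the sketch below computes F1 for the positive class from gold and predicted binary labels. It assumes that F1-binary refers to the F1 score of the positive (mention) class; the function name `f1_binary` and the toy labels are hypothetical and not taken from this repository.

```python
from typing import Sequence


def f1_binary(gold: Sequence[int], pred: Sequence[int]) -> float:
    """F1 score of the positive class (label 1) for binary labels."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    if tp == 0:
        # No true positives: precision and recall are both zero (or undefined).
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy example: 1 = sentence mentions a survey item, 0 = it does not.
gold_labels = [1, 0, 1, 1, 0, 0]
pred_labels = [1, 1, 0, 1, 0, 0]
print(f"F1-binary: {f1_binary(gold_labels, pred_labels):.3f}")
```

Here two of the three gold mentions are found and one prediction is a false alarm, so precision and recall are both 2/3.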

Entity Disambiguation (ED) on SILD

| Model name | MAP@10 (English) | MAP@10 (German) |
| ------------------- | ---------------- | --------------- |
| mE5-base (baseline) | 57.9% | 65.6% |
| SoSSE-mE5-base | 63.2% | 68.1% |
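
To make the MAP@10 metric above concrete, here is a minimal sketch of mean average precision at a cutoff of 10. It assumes one common definition, in which average precision is normalized by `min(number of relevant items, k)`; the exact variant used in the paper may differ. The function names and the survey item IDs are hypothetical.

```python
from typing import List, Sequence, Set


def average_precision_at_k(ranked: Sequence[str], relevant: Set[str], k: int = 10) -> float:
    """Average precision over the top-k of a ranked list with unique items."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, item in enumerate(ranked[:k], start=1):
        if item in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    return precision_sum / min(len(relevant), k)


def map_at_k(rankings: List[Sequence[str]], relevants: List[Set[str]], k: int = 10) -> float:
    """Mean of average_precision_at_k over all queries."""
    if len(rankings) != len(relevants):
        raise ValueError("rankings and relevants must have the same length")
    if not rankings:
        return 0.0
    aps = [average_precision_at_k(r, rel, k) for r, rel in zip(rankings, relevants)]
    return sum(aps) / len(aps)


# Two hypothetical mention sentences, each with a ranked list of candidate items.
rankings = [["v3", "v1", "v7"], ["v2", "v9", "v4"]]
relevants = [{"v1"}, {"v2", "v4"}]
print(f"MAP@10: {map_at_k(rankings, relevants):.3f}")
```

The first query finds its only relevant item at rank 2 (AP = 1/2); the second finds its two relevant items at ranks 1 and 3 (AP = (1 + 2/3) / 2 = 5/6).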

Licensing Information

Dataset licensing can be found under the respective directories (SILD, GSIM, LLM-Gen, or S44k). This work (including the models and the annotations) is licensed under CC BY 4.0.

Owner

  • Name: ₸ornike
  • Login: e-tornike
  • Kind: user
  • Location: Germany

PhD Candidate | Computational Linguistics | NLP | Machine Learning @e_tornike@sigmoid.social

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
- family-names: "Tsereteli"
  given-names: "Tornike"
  orcid: "https://orcid.org/0000-0003-4298-3570"
- family-names: "Ruffinelli"
  given-names: "Daniel"
  orcid: "https://orcid.org/0000-0002-4831-2930"
- family-names: "Ponzetto"
  given-names: "Simone Paolo"
  orcid: "https://orcid.org/0000-0001-7484-2049"
title: "Enriching Social Science Research via Survey Item Linking"
version: 1.0.0
doi: 
date-released: 2024-06-14
url: "https://github.com/e-tornike/SIL"

GitHub Events

Total
  • Push event: 8
  • Fork event: 1
Last Year
  • Push event: 8
  • Fork event: 1

Committers

Last synced: 12 months ago

All Time
  • Total Commits: 9
  • Total Committers: 1
  • Avg Commits per committer: 9.0
  • Development Distribution Score (DDS): 0.0
Past Year
  • Commits: 8
  • Committers: 1
  • Avg Commits per committer: 8.0
  • Development Distribution Score (DDS): 0.0
Top Committers
Name Email Commits
₸ornike 2****e 9

Issues and Pull Requests

Last synced: 12 months ago

All Time
  • Total issues: 1
  • Total pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Total issue authors: 1
  • Total pull request authors: 0
  • Average comments per issue: 0.0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 1
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 1
  • Pull request authors: 0
  • Average comments per issue: 0.0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • e-tornike (1)
Pull Request Authors