systematic-review-datasets

[NeurIPS 2023] CSMeD: Bridging the Dataset Gap in Automated Citation Screening for Systematic Literature Reviews

https://github.com/wojciechkusa/systematic-review-datasets

Keywords

bigbio citation-screening cochrane cochrane-systematic-reviews datasets systematic-literature-reviews systematic-reviews

Repository

Basic Info

Host: GitHub
Owner: WojciechKusa
License: apache-2.0
Language: Python
Default Branch: main
Homepage: https://systematic-review-datasets.streamlit.app
Size: 31.4 MB

Statistics

Stars: 23
Watchers: 1
Forks: 2
Open Issues: 1
Releases: 0

Created about 3 years ago · Last pushed almost 2 years ago

Metadata Files

Readme License Citation

CSMeD: Citation Screening Meta-Dataset for systematic review automation evaluation

This package serves as the basis for the paper "CSMeD: Bridging the Dataset Gap in Automated Citation Screening for Systematic Literature Reviews" by Wojciech Kusa, Oscar E. Mendoza, Matthias Samwald, Petr Knoth, and Allan Hanbury (2023).

Table of Contents

CSMeD: Title and abstract screening datasets
CSMeD-FT: Full-text screening dataset
Installation
Examples
Visualisations
Experiments

1. CSMeD: Citation screening datasets for title and abstract screening

Original datasets used to create CSMeD are described in the table below:

TA stands for the Title + Abstract screening phase, FT for the Full-text screening phase. Avg. size is the size of a review in terms of the number of records retrieved by the search query. Avg. ratio of included (TA) is the average ratio of included records in the TA phase. Avg. ratio of included (FT) is the average ratio of included records in the FT phase.

CSMeD datasets

Beyond offering unified access to the original datasets, CSMeD provides a unified meta-dataset that combines all of them. Statistics of the CSMeD datasets are presented in the table below.

| Dataset name | #reviews | #docs | #included | Avg. #docs | Avg. %included | Avg. #words in document |
|----------------------|----------|---------|-----------|------------|----------------|-------------------------|
| CSMeD-basic | | | | | | |
| CSMeD-basic-train | 30 | 128,438 | 7,958 | 4,281 | 9.6% | 229 |
| CSMeD-cochrane | | | | | | |
| CSMeD-cochrane-train | 195 | 372,422 | 7,589 | 1,910 | 21.9% | 180 |
| CSMeD-cochrane-dev | 100 | 229,376 | 4,365 | 2,294 | 20.8% | 201 |
| CSMeD-all | 325 | 730,236 | 19,912 | 2,247 | 20.5% | 195 |
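The CSMeD-all row can be cross-checked against the per-split rows. The sketch below uses only the numbers from the table above: it recomputes the totals and the Avg. #docs column (#docs divided by #reviews), and also the pooled inclusion rate (#included divided by #docs). The function and variable names are illustrative. Reading Avg. %included as a mean of per-review ratios rather than a pooled ratio is an assumption; the README does not state it explicitly.

```python
# Per-split statistics copied from the table above: (#reviews, #docs, #included).
SPLITS = {
    "CSMeD-basic-train": (30, 128_438, 7_958),
    "CSMeD-cochrane-train": (195, 372_422, 7_589),
    "CSMeD-cochrane-dev": (100, 229_376, 4_365),
}


def aggregate(splits):
    """Sum reviews, documents, and included documents over all splits."""
    reviews = sum(r for r, _, _ in splits.values())
    docs = sum(d for _, d, _ in splits.values())
    included = sum(i for _, _, i in splits.values())
    return reviews, docs, included


def avg_docs_per_review(reviews, docs):
    """Avg. #docs column: documents per review, rounded to the nearest integer."""
    return round(docs / reviews)


def pooled_inclusion_rate(docs, included):
    """Share of included documents when all reviews are pooled together."""
    return included / docs


reviews, docs, included = aggregate(SPLITS)
print(reviews, docs, included)                          # totals for the CSMeD-all row
print(avg_docs_per_review(reviews, docs))               # Avg. #docs for CSMeD-all
print(f"{pooled_inclusion_rate(docs, included):.1%}")   # pooled rate, not per-review
```

The pooled rate comes out at about 2.7%, far below the 20.5% reported as Avg. %included. That gap is what one would expect if small reviews with high inclusion ratios pull up a per-review average.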

2. CSMeD-FT: Full-text screening dataset

| Dataset name | #reviews | #docs | #included | %included | Avg. #words in document | Avg. #words in review |
|---------------------|----------|-------|-----------|-----------|-------------------------|-----------------------|
| CSMeD-FT-train | 148 | 2,053 | 904 | 44.0% | 4,535 | 1,493 |
| CSMeD-FT-dev | 36 | 644 | 202 | 31.4% | 4,419 | 1,402 |
| CSMeD-FT-test | 29 | 636 | 278 | 43.7% | 4,957 | 2,318 |
| CSMeD-FT-test-small | 16 | 50 | 22 | 44.0% | 5,042 | 2,354 |

The '#docs' column gives the total number of documents in the dataset, and '#included' gives the number of documents included at the full-text step. CSMeD-FT-test-small is a subset of CSMeD-FT-test.
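For CSMeD-FT, the %included column matches the pooled ratio of #included to #docs within each split. A minimal sketch that recomputes it from the table values (the names are illustrative, not part of the package):

```python
# (#docs, #included) per split, copied from the CSMeD-FT table.
FT_SPLITS = {
    "CSMeD-FT-train": (2_053, 904),
    "CSMeD-FT-dev": (644, 202),
    "CSMeD-FT-test": (636, 278),
    "CSMeD-FT-test-small": (50, 22),
}


def percent_included(docs, included):
    """Percentage of documents included at the full-text step, to one decimal."""
    return round(100 * included / docs, 1)


for name, (docs, included) in FT_SPLITS.items():
    print(f"{name}: {percent_included(docs, included)}%")
```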

3. Installation

Requirements

Assuming you have conda installed, create an environment for loading CSMeD by running:

    $ conda create -n csmed python=3.10
    $ conda activate csmed
    (csmed)$ pip install -r requirements.txt

Data acquisition prerequisites

To obtain the metadata for CSMeD-Cochrane datasets, you need to configure the cookie for the Cochrane Library website.

Furthermore, to obtain full-text PDFs for CSMeD-FT, you need to configure the following:

SemanticScholar API key: https://www.semanticscholar.org/product/api
CORE API key: https://core.ac.uk/services/api
GROBID: https://grobid.readthedocs.io/en/latest/Install-Grobid/

If you have all the prerequisites, run:

    (csmed)$ python configure.py

Then follow the prompts to provide the API keys, the cookie, the email address used for the PubMed Entrez APIs, and the path to the GROBID server. Not all of this information is required: the bare minimum to construct the datasets is the cookie from the Cochrane Library and the email address for PubMed Entrez.
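The relationship between the settings and what they unlock can be pictured as a small check. This is only a sketch: the setting names (`cochrane_cookie`, `entrez_email`, `semantic_scholar_api_key`, `core_api_key`, `grobid_url`) and both functions are hypothetical and do not reflect the format actually written by the configuration script.

```python
# Hypothetical setting names; the real configuration may use different keys.
MINIMUM = {"cochrane_cookie", "entrez_email"}
FULL_TEXT = {"semantic_scholar_api_key", "core_api_key", "grobid_url"}


def missing_settings(config, required):
    """Return the required keys that are absent or empty in `config`, sorted."""
    return sorted(key for key in required if not config.get(key))


def report(config):
    """Summarise which steps the given settings would allow."""
    return {
        # The bare minimum named in the text: Cochrane cookie and Entrez email.
        "can_build_datasets": not missing_settings(config, MINIMUM),
        # Everything still needed before full-text PDFs could be fetched.
        "missing_for_full_text": missing_settings(config, MINIMUM | FULL_TEXT),
    }


example = {"cochrane_cookie": "<cookie>", "entrez_email": "me@example.org"}
print(report(example))
```

With only the two minimum settings filled in, the report marks dataset construction as possible and lists the three full-text settings as still missing.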

Downloading raw full-text datasets

First, install the additional requirements:

    (csmed)$ pip install -r dev-requirements.txt

To download the datasets, run:

    (csmed)$ python scripts/prepare_full_texts.py

4. Examples

Examples showing how to use the datasets are available in the notebooks/ directory.

5. Visualisations

To run the visualisations, first install the additional requirements:

    (csmed)$ pip install -r vis-requirements.txt

Then you can run the visualisations using streamlit:

    (csmed)$ streamlit run visualisation/_🏠_Home.py

6. Experiments

Baseline experiments from the paper are described in the WojciechKusa/CSMeD-baselines repository.

Owner

Name: Wojciech Kusa
Login: WojciechKusa
Kind: user
Company: NASK National Research Institute

Website: https://wojciechkusa.github.io
Twitter: WojciechKusa
Repositories: 28
Profile: https://github.com/WojciechKusa

NLP & IR researcher 👨‍💻 PhD @ TU Wien 🎓

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this work, please cite it as below."
authors:
  - family-names: "Kusa"
    given-names: "Wojciech"
  - family-names: "Mendoza"
    given-names: "Oscar E"
  - family-names: "Samwald"
    given-names: "Matthias"
  - family-names: "Knoth"
    given-names: "Petr"
  - family-names: "Hanbury"
    given-names: "Allan"
title: "CSMeD: Bridging the Dataset Gap in Automated Citation Screening for Systematic Literature Reviews"
version: 1.0
date-released: 2023
repository-code: "https://github.com/WojciechKusa/systematic-review-datasets"
preferred-citation:
  type: conference-paper
  authors:
    - family-names: "Kusa"
      given-names: "Wojciech"
    - family-names: "Mendoza"
      given-names: "Oscar E"
    - family-names: "Samwald"
      given-names: "Matthias"
    - family-names: "Knoth"
      given-names: "Petr"
    - family-names: "Hanbury"
      given-names: "Allan"
  title: "CSMeD: Bridging the Dataset Gap in Automated Citation Screening for Systematic Literature Reviews"
  conference:
    name: "Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track"
    city: "New Orleans"
    region: "Louisiana"
    country: "USA"
    date-start: 2023-12-10
    date-end: 2023-12-16
  year: 2023

GitHub Events

Total

Issues event: 1
Watch event: 2

Last Year

Issues event: 1
Watch event: 2

Dependencies

requirements.txt pypi

beautifulsoup4 ==4.12.2
bibtexparser *
bioc ==2.0.post4
biopython ==1.81
bokeh ==2.4.3
colorcet ==3.0.1
datasets >=2.8.0,<3.0.0
datashader ==0.15.0
evaluate *
grobid_tei_xml *
holoviews ==1.15.0
langchain *
matplotlib *
matplotlib_venn *
nltk *
numpy *
openai *
openpyxl >=3.0.9,<3.1.0
pandas *
plotly *
requests *
rich *
scikit-image ==0.21.0
scikit-learn *
setuptools *
spacy *
streamlit *
tiktoken *
tqdm ==4.65.0
transformers *
umap-learn *
wandb *

setup.py pypi

dev-requirements.txt pypi

beautifulsoup4 ==4.12.2 development
grobid_tei_xml * development
requests * development

vis-requirements.txt pypi

bokeh ==2.4.3
colorcet ==3.0.1
datashader ==0.15.0
holoviews ==1.15.0
matplotlib *
matplotlib_venn *
nltk *
plotly *
rich *
scikit-image ==0.21.0
spacy *
streamlit *
umap-learn *
