https://github.com/asreview/synergy-dataset

SYNERGY - Open machine learning dataset on study selection in systematic reviews

Science Score: 39.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 8 DOI reference(s) in README
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (10.0%) to scientific vocabulary

Keywords

citation-network dataset graphs machine-learning natural-language-processing prisma research scholarly-articles systematic-reviews-datasets utrecht-university
Last synced: 5 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: asreview
  • License: cc0-1.0
  • Language: Python
  • Default Branch: master
  • Homepage:
  • Size: 1.21 GB
Statistics
  • Stars: 84
  • Watchers: 6
  • Forks: 32
  • Open Issues: 7
  • Releases: 1
Created about 7 years ago · Last pushed 7 months ago
Metadata Files
Readme License

README.md

SYNERGY dataset

SYNERGY is a free and open dataset on study selection in systematic reviews, comprising 169,288 academic works from 26 systematic reviews. Only 2,834 (1.67%) of the academic works in the binary classified dataset are included in the systematic reviews. This makes SYNERGY a unique dataset for developing information retrieval algorithms, especially those that must cope with sparse labels. Because many variables are available per record (e.g. titles, abstracts, authors, references, topics), the dataset is useful for researchers in NLP, machine learning, network analysis, and more. In total, the dataset contains 82,668,134 trainable data points.
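To make the label sparsity concrete, the sketch below works only from the two totals quoted above. The "exclude everything" baseline is a hypothetical illustration and is not part of SYNERGY; it shows why plain accuracy is a misleading metric on data this imbalanced.

```python
# Totals quoted in the README.
TOTAL_RECORDS = 169_288
INCLUDED_RECORDS = 2_834


def inclusion_rate(included: int, total: int) -> float:
    """Fraction of records labeled as included."""
    return included / total


def exclude_all_baseline(included: int, total: int) -> dict:
    """Metrics of a hypothetical classifier that excludes every record."""
    true_negatives = total - included
    return {
        "accuracy": true_negatives / total,  # every excluded record counts as correct
        "recall": 0.0,                       # no included record is ever found
    }


rate = inclusion_rate(INCLUDED_RECORDS, TOTAL_RECORDS)
baseline = exclude_all_baseline(INCLUDED_RECORDS, TOTAL_RECORDS)

print(f"Inclusion rate: {rate:.2%}")
print(f"Exclude-all accuracy: {baseline['accuracy']:.2%}")
print(f"Exclude-all recall: {baseline['recall']:.0%}")
```

A classifier that never includes anything is right for the vast majority of records yet finds none of the relevant ones, so recall-oriented metrics are the more informative choice here.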

Get the data

The easiest way to get the SYNERGY dataset is via the synergy-dataset Python package. Install the package with:

    pip install synergy-dataset

To download and build the SYNERGY dataset, run the following command in the command line:

    python -m synergy_dataset get

To get an overview of the datasets and their properties, use `synergy_dataset list` and `synergy_dataset show <DATASET_NAME>`.
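The commands above can also be driven from a Python script. The following is a minimal sketch, not part of the package: `synergy_command` and `run_synergy` are hypothetical helper names, and it assumes that `list` and `show` can be invoked through `python -m synergy_dataset` in the same way as the documented `get` command.

```python
import subprocess
import sys
from typing import List, Optional

# Subcommands mentioned in this README.
KNOWN_ACTIONS = {"get", "list", "show"}


def synergy_command(action: str, dataset_name: Optional[str] = None) -> List[str]:
    """Build the argument list for one synergy_dataset subcommand."""
    if action not in KNOWN_ACTIONS:
        raise ValueError(f"unsupported action: {action!r}")
    # Assumption: every subcommand is reachable via `python -m synergy_dataset`.
    command = [sys.executable, "-m", "synergy_dataset", action]
    if action == "show":
        if dataset_name is None:
            raise ValueError("'show' needs a dataset name")
        command.append(dataset_name)
    return command


def run_synergy(action: str, dataset_name: Optional[str] = None) -> None:
    """Run the subcommand and raise if it exits with a non-zero status."""
    subprocess.run(synergy_command(action, dataset_name), check=True)


# Building a command has no side effects; run_synergy() would execute it.
print(synergy_command("show", "Wolters_2018"))
```

Calling `run_synergy("get")` requires the package to be installed and network access, so the example only prints the command it would run.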

Datasets and variables

The SYNERGY dataset comprises the study selection of 26 systematic reviews. The dataset contains 169,288 records, of which 2,834 were manually labeled as included by the authors of the systematic reviews. The eligibility criteria are available as block quotations.

The list of systematic reviews included with basic properties:

| Nr | Dataset                 | Topic(s)                        | Records | Included | %    |
|----|-------------------------|---------------------------------|---------|----------|------|
| 1  | Appenzeller-Herzog_2019 | Medicine                        | 2873    | 26       | 0.9  |
| 2  | Bos_2018                | Medicine                        | 4878    | 10       | 0.2  |
| 3  | Brouwer_2019            | Psychology, Medicine            | 38114   | 62       | 0.2  |
| 4  | Chou_2003               | Medicine                        | 1908    | 15       | 0.8  |
| 5  | Chou_2004               | Medicine                        | 1630    | 9        | 0.6  |
| 6  | Donners_2021            | Medicine                        | 258     | 15       | 5.8  |
| 7  | Hall_2012               | Computer science                | 8793    | 104      | 1.2  |
| 8  | Jeyaraman_2020          | Medicine                        | 1175    | 96       | 8.2  |
| 9  | Leenaars_2019           | Psychology, Chemistry, Medicine | 5812    | 17       | 0.3  |
| 10 | Leenaars_2020           | Medicine                        | 7216    | 583      | 8.1  |
| 11 | Meijboom_2021           | Medicine                        | 882     | 37       | 4.2  |
| 12 | Menon_2022              | Medicine                        | 975     | 74       | 7.6  |
| 13 | Moran_2021              | Biology, Medicine               | 5214    | 111      | 2.1  |
| 14 | Muthu_2021              | Medicine                        | 2719    | 336      | 12.4 |
| 15 | Nelson_2002             | Medicine                        | 366     | 80       | 21.9 |
| 16 | Oud_2018                | Psychology, Medicine            | 952     | 20       | 2.1  |
| 17 | Radjenovic_2013         | Computer science                | 5935    | 48       | 0.8  |
| 18 | Sep_2021                | Psychology                      | 271     | 40       | 14.8 |
| 19 | Smid_2020               | Computer science, Mathematics   | 2627    | 27       | 1.0  |
| 20 | van_de_Schoot_2018      | Psychology, Medicine            | 4544    | 38       | 0.8  |
| 21 | van_der_Valk_2021       | Medicine, Psychology            | 725     | 89       | 12.3 |
| 22 | van_der_Waal_2022       | Medicine                        | 1970    | 33       | 1.7  |
| 23 | van_Dis_2020            | Psychology, Medicine            | 9128    | 72       | 0.8  |
| 24 | Walker_2018             | Biology, Medicine               | 48375   | 762      | 1.6  |
| 25 | Wassenaar_2017          | Medicine, Biology, Chemistry    | 7668    | 111      | 1.4  |
| 26 | Wolters_2018            | Medicine                        | 4280    | 19       | 0.4  |
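As a quick consistency check, the per-review counts in the table can be summed to recover the overall totals quoted earlier. The sketch below copies the Records and Included columns by hand, keyed by the Nr column; the helper names `summarize` and `inclusion_percentages` are hypothetical and not part of the package.

```python
# (records, included) per systematic review, keyed by the "Nr" column of the table.
STUDY_COUNTS = {
    1: (2873, 26), 2: (4878, 10), 3: (38114, 62), 4: (1908, 15),
    5: (1630, 9), 6: (258, 15), 7: (8793, 104), 8: (1175, 96),
    9: (5812, 17), 10: (7216, 583), 11: (882, 37), 12: (975, 74),
    13: (5214, 111), 14: (2719, 336), 15: (366, 80), 16: (952, 20),
    17: (5935, 48), 18: (271, 40), 19: (2627, 27), 20: (4544, 38),
    21: (725, 89), 22: (1970, 33), 23: (9128, 72), 24: (48375, 762),
    25: (7668, 111), 26: (4280, 19),
}


def summarize(counts):
    """Return total records, total included, and the overall inclusion percentage."""
    total_records = sum(records for records, _ in counts.values())
    total_included = sum(included for _, included in counts.values())
    return total_records, total_included, 100 * total_included / total_records


def inclusion_percentages(counts):
    """Inclusion percentage per review, rounded to one decimal as in the table."""
    return {nr: round(100 * included / records, 1)
            for nr, (records, included) in counts.items()}


total_records, total_included, overall_pct = summarize(STUDY_COUNTS)
percentages = inclusion_percentages(STUDY_COUNTS)

# Compare reviews on the exact ratio rather than the rounded percentage.
sparsest = min(STUDY_COUNTS, key=lambda nr: STUDY_COUNTS[nr][1] / STUDY_COUNTS[nr][0])
densest = max(STUDY_COUNTS, key=lambda nr: STUDY_COUNTS[nr][1] / STUDY_COUNTS[nr][0])

print(f"{total_records} records, {total_included} included ({overall_pct:.2f}%)")
print(f"Lowest inclusion rate: Nr {sparsest}; highest: Nr {densest}")
```

The inclusion rate varies widely between reviews, so per-review evaluation is likely to be more informative than a single pooled figure.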

Each record in the dataset is an OpenAlex Work object (Copy at web.archive.org extracted on 2023-03-31).

Some of the notable variables are:

| Variable         | Type    | Description |
|------------------|---------|-------------|
| id               | String  | The OpenAlex ID for this work. |
| doi              | String  | The DOI identifier of the object, if available. |
| label_included   | Integer | 1 for included records, 0 for excluded records after full text screening. |
| title            | String  | The title of this work. |
| abstract         | String  | The abstract of this work. Stored as `abstract_inverted_index`, but available as plaintext abstract for machine learning purposes. |
| authorships      | List    | List of Authorship objects, each representing an author and their institution. |
| type             | String  | The type or genre of the work as defined by https://api.crossref.org/types. |
| publication_year | Integer | The year this work was published. |
| referenced_works | List    | List of OpenAlex IDs for works that this work cites. |
| concepts         | List    | List of wikidata concept objects (or topics). |
| best_oa_location | Object  | An object with the best available open access location for this work. |
| cited_by_count   | Integer | The number of citations to this work as of April 1st, 2023. |
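The abstract is stored by OpenAlex as an inverted index (the `abstract_inverted_index` field), which maps each word to the positions where it occurs. The sketch below shows one way such an index can be turned back into plaintext. It is an illustration of the idea, not the package's own conversion code; the function name and the toy index are made up for this example.

```python
from typing import Dict, List, Optional


def abstract_from_inverted_index(
    inverted_index: Optional[Dict[str, List[int]]],
) -> str:
    """Rebuild a plaintext abstract from a word -> positions mapping."""
    if not inverted_index:
        # Works without an abstract have no index; return an empty string.
        return ""
    # Flatten to (position, word) pairs, then order them by position.
    positioned = [
        (position, word)
        for word, positions in inverted_index.items()
        for position in positions
    ]
    positioned.sort()
    return " ".join(word for _, word in positioned)


# Toy index, invented for illustration: "the" occurs at positions 0 and 3.
toy_index = {
    "the": [0, 3],
    "review": [1],
    "screens": [2],
    "records": [4],
}

print(abstract_from_inverted_index(toy_index))
```

Because words are joined with single spaces, the original line breaks and spacing of the abstract are not recovered.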

For the full list of variables, see this persistent copy of the OpenAlex Work Object documentation: https://web.archive.org/web/20230104092916/https://docs.openalex.org/api-entities/works/work-object
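For network analysis, the list of cited works per record (the `referenced_works` field in OpenAlex) can be turned into a citation edge list. The sketch below keeps only citations between records of the same collection. The records and their shortened IDs are invented for illustration, and real OpenAlex IDs are URLs; `citation_edges` is a hypothetical helper, not part of the package.

```python
from collections import Counter
from typing import List, Tuple


def citation_edges(records: List[dict]) -> List[Tuple[str, str]]:
    """Return (citing_id, cited_id) pairs where both works are in `records`."""
    known_ids = {record["id"] for record in records}
    edges = []
    for record in records:
        # `referenced_works` may be missing or None for some records.
        for cited_id in record.get("referenced_works") or []:
            if cited_id in known_ids:
                edges.append((record["id"], cited_id))
    return edges


# Invented toy records; "W9" is cited but is not part of the collection.
toy_records = [
    {"id": "W1", "label_included": 1, "referenced_works": ["W2", "W3"]},
    {"id": "W2", "label_included": 0, "referenced_works": ["W3", "W9"]},
    {"id": "W3", "label_included": 1, "referenced_works": []},
]

edges = citation_edges(toy_records)
# Number of citations each work receives from within the collection.
in_degree = Counter(cited_id for _, cited_id in edges)

print(edges)
print(in_degree.most_common())
```

The resulting edge list can be passed to a graph library of choice, and the labels can be attached as node attributes to study how included records cluster.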

Benchmark

Work in progress.

Attribution & License

We would like to thank the following authors for openly sharing the data corresponding to their systematic review:

Marlies L.S. Heeres, Marijn Vellinga, P Whaley, Mostafa Mohseni, P.M.J. Welsing, Marleen L.M. Hermens, Richard Torkar, Holger Schielzeth, Marjan Hericko, Arnoud Arntz, Lisanne A. H. Bevers, Christian Appenzeller-Herzog, Michael J. DeVito, Juliette Legler, Rosalie W. M. Kempkes, Daniel Bos, Sanne C. Smid, Robyn B. Blain, Carin M. A. Rademaker, David De Jong, Antoine C. G. Egberts, Tijmen Geurts, Sathish Muthu, Suzanne C. van Veen, Janet D. Allan, Pamela Hartman, Eline S van der Valk, Mitzy Kennis, Wilhelmus Drinkenburg, R. Angela Sarabdjitsingh, Nicola P. Klein, Helga Gardarsdottir, Anouk A. M. T. Donners, Sonja D. Winter, Muriel A. Hagenaars, Erica L T van den Akker, Amir Abdelmoumen, Derek W. R. Gray, Kim Peterson, Eswar Ramakrishnan, Trevor J. Hall, Maurice Dematteis, Merel Ritskes-Hoitinga, Andrew A. Shapiro, Meike W. Vernooij, Maria Brouwer, Katherine E. Pelch, Milica Miočević, Eva A.M. van Dis, Ozair Abawi, Dimitrije Radjenović, Daniel McNeish, Peggy Nygren, Maikel van Berlo, Alwin D. R. Huitema, Nicholas P. Moran, Chad R. Blystone, Alishia D. Williams, Ruud N. J. M. A. Joosten, Klaus Reinhold, Pim N.H. Wassenaar, Sanne E. Hoeks, Anand Krishnan V. Iyer, Sjoerd A.A. van den Berg, Tim Kendall, Lieke H. van Huis, Rens van de Schoot, Nancy E. E. Van Loey, Julia M.L. Menon, Cathalijn H. C. Leenaars, Rogier E. J. Verhoef, Sarah Depaoli, Frank de Wolf, M.E. Hamaker, Rinske M van den Heuvel, Leonardo Trasande, Miranda Olff, Alfredo Sánchez-Tójar, M.H. Emmelot-Vonk, Kristina A. Thayer, Steven M. Teutsch, Elisabeth F.C. van Rossum, Bibian van der Voorn, Stephanie Holmgren, André Bleich, M.S. van der Waal, Frank J. Wolters, Hannah Ewald, Marian Joëls, Franck L. B. Meijboom, Yolanda B. de Rijke, Tobias Stalder, M. Arfan Ikram, P.A.L. Seghers, Marit Sijbrandij, Vincent L. Wester, Behnam Sabayan, Tim Mathes, Parvez Ahmad Ganie, Matthijs G. P. Feenstra, Abee L. Boyles, Matthijs Oud, Andrew A. Rooney, Rosanne W. Meijboom, Karl Heinz Weiss, Jan-Bas Prins, F. Struijs, David Bowes, Neeltje M. Batelaan, Reffat A. Segufa, Serena J. Counsell, Milou S. C. Sep, Aleš Živkovič, Madhan Jeyaraman, Sirwan K.L. Darweesh, Tineke Coenen-de Roo, Heidi Nelson, Roger Chou, Vickie R. Walker, Albert Hofman, Roger E. G. Schutgens, Rob B. M. de Vries, Zhongfang Fu, Pim Cuijpers, Christ Nolten, Krista Fischer, Janneke Elzinga, Roderick H. J. Houwen, Iris M. Engelhard, Linda Humphrey, Frans A. Stafleu, Simon Beecham, Mark Helfand, Thijs J. Giezen, Retha R. Newbold, Claudi L H Bockting, Sanaz Sedaghat, Elizabeth A. Clark

Run synergy_dataset attribution or see ATTRIBUTION.md for a complete attribution including references.

SYNERGY dataset is released under the CC0 1.0 license. SYNERGY consists of CC0 1.0 licensed metadata works published by OpenAlex. The Lens was used for data quality checks and imputing some missing variables.

Citing SYNERGY dataset

If you use SYNERGY in a scientific publication, we would appreciate references to:

De Bruin, Jonathan; Ma, Yongchao; Ferdinands, Gerbrich; Teijema, Jelle; Van de Schoot, Rens, 2023, "SYNERGY - Open machine learning dataset on study selection in systematic reviews", https://doi.org/10.34894/HE6NAQ, DataverseNL, V1

BibTeX reference:

    @data{HE6NAQ_2023,
      author = {De Bruin, Jonathan and Ma, Yongchao and Ferdinands, Gerbrich and Teijema, Jelle and Van de Schoot, Rens},
      publisher = {DataverseNL},
      title = {{SYNERGY - Open machine learning dataset on study selection in systematic reviews}},
      year = {2023},
      version = {V1},
      doi = {10.34894/HE6NAQ},
      url = {https://doi.org/10.34894/HE6NAQ}
    }

Contributing

We welcome contributions of all kinds. Some examples are:

  • Do you have an openly published systematic review dataset? Read about our ambition to develop SYNERGY+ (SYNERGY Plus), a much larger dataset with lots of new features.
  • Write an example or tutorial on how to use SYNERGY and all of its hidden capabilities.
  • Write an integration to load SYNERGY into existing software such as spaCy, Gensim, TensorFlow, Docker, or Hugging Face.

Contact

Reach out on the Discussion forum.

Owner

  • Name: ASReview
  • Login: asreview
  • Kind: organization
  • Email: asreview@uu.nl
  • Location: Utrecht University

ASReview - Active learning for Systematic Reviews

GitHub Events

Total
  • Create event: 3
  • Issues event: 28
  • Watch event: 21
  • Delete event: 1
  • Member event: 1
  • Issue comment event: 67
  • Push event: 84
  • Pull request review comment event: 75
  • Pull request review event: 145
  • Pull request event: 160
  • Fork event: 8
Last Year
  • Create event: 3
  • Issues event: 28
  • Watch event: 21
  • Delete event: 1
  • Member event: 1
  • Issue comment event: 67
  • Push event: 84
  • Pull request review comment event: 75
  • Pull request review event: 145
  • Pull request event: 160
  • Fork event: 8

Packages

  • Total packages: 1
  • Total downloads:
    • pypi 11,025 last-month
  • Total dependent packages: 1
  • Total dependent repositories: 1
  • Total versions: 12
  • Total maintainers: 1
pypi.org: synergy-dataset

Python package for the SYNERGY dataset

  • Versions: 12
  • Dependent Packages: 1
  • Dependent Repositories: 1
  • Downloads: 11,025 Last month
Rankings
Downloads: 2.9%
Dependent packages count: 4.7%
Average: 9.8%
Dependent repos count: 21.7%
Maintainers (1)
Last synced: 6 months ago

Dependencies

requirements.txt pypi
  • pandas *
  • requests *