rs_repos_analysis

Analysis of Research Software repositories

https://github.com/indoc-research/rs_repos_analysis

Science Score: 67.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 2 DOI reference(s) in README
  • Academic publication links
    Links to: nature.com
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (11.6%) to scientific vocabulary

Repository

Analysis of Research Software repositories

Basic Info
  • Host: GitHub
  • Owner: Indoc-Research
  • License: agpl-3.0
  • Language: Jupyter Notebook
  • Default Branch: main
  • Size: 5.52 MB
Statistics
  • Stars: 0
  • Watchers: 3
  • Forks: 0
  • Open Issues: 0
  • Releases: 1
Created over 1 year ago · Last pushed over 1 year ago
Metadata Files
Readme License Citation

README.md

Analysis of Research Software on GitHub

Scope

We investigated how GitHub is commonly used in the biomedical field. This work is inspired by the "Repository timelines on GitHub" analysis [2].

Dataset

We used the CZ Software Mentions dataset [1] as a starting point for this analysis. Specifically, we downloaded the linked folder and analysed the "metadata.tsv" file, which contains the list of repository URLs mined from biomedical publications. From the metadata we selected the portion of the dataset that came from GitHub.
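
The GitHub selection step could look roughly like the sketch below. It uses only the Python standard library. The column name `url` and the sample rows are assumptions made for illustration; the actual header of "metadata.tsv" may differ.

```python
import csv
import io
from urllib.parse import urlparse


def is_github_repo_url(url: str) -> bool:
    """Return True if the URL points at github.com and has an owner/repo path."""
    parsed = urlparse(url.strip())
    host = (parsed.hostname or "").lower()
    parts = [p for p in parsed.path.split("/") if p]
    return host in {"github.com", "www.github.com"} and len(parts) >= 2


def select_github_rows(tsv_text: str, url_column: str = "url") -> list:
    """Keep only the rows whose URL column is a GitHub repository URL.

    `url_column` is a hypothetical column name, not taken from the dataset.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [row for row in reader if is_github_repo_url(row.get(url_column) or "")]


# Made-up sample standing in for the real metadata file.
sample = (
    "name\turl\n"
    "foo\thttps://github.com/owner/foo\n"
    "bar\thttps://pypi.org/project/bar\n"
)
github_rows = select_github_rows(sample)
```

Parsing the host explicitly, instead of searching for the substring "github.com", avoids matching URLs that merely mention GitHub in their path or query string.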

Metrics

For each available repo we downloaded generic metadata (e.g. license type, creation and update time), the number of commits on the default branch (e.g. main or master), the README and the number of releases.

Results

Lifespan

We analyzed the difference between the creation date of each repo and the date of its last update. The majority of repos became inactive after a few months. Surprisingly, we found a few repos with "negative lifespans". These rare cases can arise when a repo was forked or cloned, so that it inherited the commit history, and was never updated again. In that situation the "last push" predates the creation of the new repo.

The Empirical Cumulative Distribution Function (ECDF) of the lifespans shows that about half of the repos are inactive after 12 months.
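
As a minimal sketch of the lifespan and ECDF computation, assuming ISO 8601 timestamps in the style returned by the GitHub API and an average month length of 30.44 days (both assumptions, as are the sample dates):

```python
from datetime import datetime

DAYS_PER_MONTH = 30.44  # assumed average month length


def parse_timestamp(value: str) -> datetime:
    """Parse an ISO 8601 timestamp such as 2020-01-01T00:00:00Z."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def lifespan_months(created_at: str, pushed_at: str) -> float:
    """Time between creation and last push, in months (can be negative)."""
    delta = parse_timestamp(pushed_at) - parse_timestamp(created_at)
    return delta.days / DAYS_PER_MONTH


def ecdf(values, x: float) -> float:
    """Fraction of observations that are less than or equal to x."""
    return sum(v <= x for v in values) / len(values)


# Made-up (created_at, pushed_at) pairs; the third one mimics a fork that
# inherited older history and therefore has a negative lifespan.
repos = [
    ("2020-01-01T00:00:00Z", "2020-03-01T00:00:00Z"),
    ("2019-01-01T00:00:00Z", "2021-01-01T00:00:00Z"),
    ("2021-06-01T00:00:00Z", "2021-05-01T00:00:00Z"),
    ("2018-01-01T00:00:00Z", "2018-07-01T00:00:00Z"),
]
lifespans = [lifespan_months(created, pushed) for created, pushed in repos]
share_inactive_within_year = ecdf(lifespans, 12)
```

Here `ecdf(lifespans, 12)` reads as "the share of repos whose last push came no later than 12 months after creation".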

Commits

We split the commit data into repos below the 95th percentile and repos above it. This aids the visualization, as the "above" repos (e.g. Linux) strongly skew the representation. The vast majority of repos have fewer than 10 commits, and 11% of the repos have only 1 commit. There is a weak correlation between the lifespan and the number of commits on the default branch.
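
One way to perform such a split is sketched below. The nearest-rank percentile definition, the choice to count values equal to the threshold as "below", and the sample commit counts are all assumptions made for illustration.

```python
import math


def percentile_nearest_rank(values, pct: int):
    """Nearest-rank percentile: smallest value with at least pct% of data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def split_at_percentile(values, pct: int = 95):
    """Split values into those at or below the percentile threshold and those above."""
    threshold = percentile_nearest_rank(values, pct)
    below = [v for v in values if v <= threshold]
    above = [v for v in values if v > threshold]
    return threshold, below, above


# Made-up commit counts with one extreme outlier.
commit_counts = [1, 1, 2, 3, 3, 4, 5, 5, 6, 7, 8, 9, 9, 12, 15, 20, 30, 45, 80, 1_000_000]
threshold, below, above = split_at_percentile(commit_counts)
```

Plotting `below` and `above` separately keeps a single very large repository from flattening the histogram of all the others.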

Looking more closely at the distribution of commits over the years, we see a trend towards more and more repos with very low numbers of commits, with no specific effect attributable to the introduction of the FAIR principles.

README

We classified each README as none, short (<250 words), intermediate (between 250 and 2500 words) or lengthy (>2500 words), and computed the proportion of each class in the dataset. An important metric in biomedical/research software is the connection between the software itself and the original publication. To gain insight into this relation, we computed the proportion of repos that mention a DOI in the README.
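
The classification and the DOI check can be sketched as follows. Counting words by whitespace splitting, treating an empty README as "none", and the DOI regular expression are assumptions; the pattern is a common heuristic and is not guaranteed to match every valid DOI.

```python
import re

# Heuristic DOI pattern (assumption): "10.", a 4-9 digit registrant code, "/", then a suffix.
DOI_PATTERN = re.compile(r"\b10\.\d{4,9}/[^\s\"'<>]+")


def classify_readme(text) -> str:
    """Classify a README by word count using the thresholds given in the text."""
    if text is None or not text.strip():
        return "none"
    n_words = len(text.split())
    if n_words < 250:
        return "short"
    if n_words <= 2500:
        return "intermediate"
    return "lengthy"


def mentions_doi(text) -> bool:
    """Return True if the README text contains something that looks like a DOI."""
    return bool(text) and DOI_PATTERN.search(text) is not None


# Made-up README snippets for illustration.
readmes = [None, "word " * 100, "Please cite https://doi.org/10.5061/dryad.6wwpzgn2c"]
classes = [classify_readme(r) for r in readmes]
doi_share = sum(mentions_doi(r) for r in readmes) / len(readmes)
```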

License

Repositories can have an associated license, which details if and how the code can be used. We investigated how many repos include a license at all, and what proportion of those licenses are permissive.
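
A rough sketch of this computation is shown below. The set of license keys counted as permissive is an assumption made for illustration (the text does not list which licenses were treated as permissive), and the sample keys are made up in the lowercase style shown above (e.g. agpl-3.0).

```python
# Assumed set of permissive license keys; adjust to the definition actually used.
PERMISSIVE_KEYS = {"mit", "apache-2.0", "bsd-2-clause", "bsd-3-clause", "isc"}


def license_summary(license_keys) -> dict:
    """Share of repos with any license and share with a permissive one.

    `license_keys` holds one entry per repo; None means no license was detected.
    Both shares are relative to the total number of repos.
    """
    total = len(license_keys)
    licensed = [key for key in license_keys if key]
    permissive = [key for key in licensed if key.lower() in PERMISSIVE_KEYS]
    return {
        "with_license": len(licensed) / total,
        "permissive": len(permissive) / total,
    }


# Made-up license keys for eight repos.
sample_keys = ["mit", None, "gpl-3.0", "apache-2.0", None, "agpl-3.0", "bsd-3-clause", None]
summary = license_summary(sample_keys)
```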

Releases, Forks and Stars

Package releases are stable versions of a package that are published on GitHub. We investigated how many releases, if any, each repo has and whether this correlates with the number of forks, used as a proxy for the relevance of the repo. Interestingly, we did not see a strong correlation between the number of releases and the number of forks. We hypothesize that packages with a high number of releases are stable software that users tend to use "as is" rather than modify. We then investigated the correlation between forks and stars, which can be used as a proxy for community interest, and did observe a correlation.
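
The text does not state which correlation measure was used. As one possible choice for heavily skewed count data, the sketch below computes a Spearman rank correlation with the standard library only; the fork and star counts are made up.

```python
def average_ranks(values) -> list:
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y) -> float:
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    std_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (std_x * std_y)


def spearman(x, y) -> float:
    """Spearman correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Made-up per-repo counts for illustration.
forks = [0, 1, 3, 10, 250]
stars = [0, 2, 5, 40, 900]
rho_forks_stars = spearman(forks, stars)
```

Rank-based correlation is less sensitive than Pearson correlation to a handful of extremely popular repositories, which is why it is used in this sketch.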

FAIR principles

We set out to investigate the general adherence of the repos in the dataset to the FAIR principles [3].

F

A globally unique and persistent identifier can be obtained, among other ways, through a GitHub release or a published PyPI package (pip-installable). We computed the percentage of repos that list Python as their main language and appear in the PyPI dataset, as well as the percentage of repos that have at least one GitHub release.
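
These two percentages could be computed along the lines of the sketch below. The record fields (`name`, `language`, `releases`), matching a repo to PyPI by its lowercased name, and using all repos as the denominator are assumptions made for illustration, not details taken from the analysis.

```python
def percent(part: int, whole: int) -> float:
    """Percentage of `part` in `whole`, or 0.0 for an empty dataset."""
    return 100 * part / whole if whole else 0.0


def findability_summary(repos, pypi_names) -> dict:
    """Percentages of repos with a GitHub release and of Python repos found on PyPI.

    `repos` is a list of dicts with hypothetical keys name, language, releases.
    `pypi_names` is a set of lowercased package names from a PyPI listing.
    """
    with_release = sum(1 for r in repos if r["releases"] > 0)
    python_on_pypi = sum(
        1 for r in repos
        if r["language"] == "Python" and r["name"].lower() in pypi_names
    )
    return {
        "with_release_pct": percent(with_release, len(repos)),
        "python_on_pypi_pct": percent(python_on_pypi, len(repos)),
    }


# Made-up repo records and PyPI names.
sample_repos = [
    {"name": "alpha", "language": "Python", "releases": 2},
    {"name": "beta", "language": "R", "releases": 0},
    {"name": "gamma", "language": "Python", "releases": 0},
    {"name": "delta", "language": "Jupyter Notebook", "releases": 1},
]
findability = findability_summary(sample_repos, {"alpha"})
```

Name-based matching is only a heuristic, since a repository and its PyPI package can be named differently.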

A

The software can always be downloaded from GitHub over HTTPS, which satisfies A1. However, the metadata is not necessarily independent of the software (A2).

I

Interoperability depends heavily on the field the software is developed for, as the software should exchange data "in a way that meets domain-relevant community standards" [3].

R

The software should be usable and reusable. To investigate this point, we computed the percentage of repos with a permissive license.

References

[1] Istrate, Ana-Maria; Veytsman, Boris; Li, Donghui et al. (2022). CZ Software Mentions: A large dataset of software mentions in the biomedical literature [Dataset]. Dryad. https://doi.org/10.5061/dryad.6wwpzgn2c

[2] RSE Repository Analysis https://github.com/softwaresaved/rse-repo-analysis

[3] FAIR principles https://www.nature.com/articles/s41597-022-01710-x

Owner

  • Name: Indoc-Research
  • Login: Indoc-Research
  • Kind: organization

Citation (CITATION.cff)

# This CITATION.cff file was generated with cffinit.
# Visit https://bit.ly/cffinit to generate yours today!

cff-version: 1.2.0
title: >-
  Sustainability Analysis of Research Software Repositories
  on GitHub
message: >-
  If you use this software, please cite it using the
  metadata from this file.
type: software
authors:
  - given-names: Elisa L.
    family-names: Garulli
    orcid: 'https://orcid.org/0000-0003-3909-0683'
  - given-names: Dennis
    family-names: Doll
    email: ddoll@indocresearch.org
    affiliation: Indoc Research Europe gGmbH
    orcid: 'https://orcid.org/0000-0002-3806-9324'
repository-code: 'https://github.com/Indoc-Research/RS_repos_analysis'
abstract: >-
  Analysis of the need for a Research Software
  Infrastructure that improves the sustainability of
  Research Software, based on the publicly accessible CZI
  Software Mentions dataset.
keywords:
  - Research Software
  - Research Software Infrastructure
  - Open Science
  - Software Sustainability
license: AGPL-3.0
commit: 8b010766b76fde6ae27c51f2f0f9f3759bcc70c6
date-released: '2025-01-24'

GitHub Events

Total
  • Release event: 1
  • Member event: 1
  • Push event: 18
  • Create event: 4
Last Year
  • Release event: 1
  • Member event: 1
  • Push event: 18
  • Create event: 4