rs_repos_analysis
Analysis of Research Software repositories
Science Score: 67.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: Found CITATION.cff file
- ✓ codemeta.json file: Found codemeta.json file
- ✓ .zenodo.json file: Found .zenodo.json file
- ✓ DOI references: Found 2 DOI reference(s) in README
- ✓ Academic publication links: Links to nature.com
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: Low similarity (11.6%) to scientific vocabulary
Repository
Analysis of Research Software repositories
Basic Info
- Host: GitHub
- Owner: Indoc-Research
- License: agpl-3.0
- Language: Jupyter Notebook
- Default Branch: main
- Size: 5.52 MB
Statistics
- Stars: 0
- Watchers: 3
- Forks: 0
- Open Issues: 0
- Releases: 1
Metadata Files
README.md
Analysis of Research Software on GitHub
Scope
We investigated how GitHub is commonly used in the biomedical field. This work is inspired by the "Repository timelines on GitHub" analysis [2].
Dataset
We used the CZ Software Mentions dataset [1] as a starting point for this analysis. Specifically, we downloaded the linked folder and analysed the "metadata.tsv" file, which contains the list of repo URLs mined from biomedical publications. From the metadata we selected the portion of the dataset that came from GitHub.
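The GitHub selection step could look roughly like the sketch below. It uses only the Python standard library. The column name `url` and the sample rows are hypothetical, since the actual header of "metadata.tsv" is not described here.

```python
import csv
import io
from urllib.parse import urlparse

GITHUB_HOSTS = {"github.com", "www.github.com"}


def select_github_rows(tsv_text, url_column="url"):
    """Keep only rows whose URL points at a github.com owner/repo path.

    `url_column` is a hypothetical column name; adapt it to the real header.
    URLs without a scheme (e.g. "github.com/a/b") are skipped by this sketch.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    selected = []
    for row in reader:
        parsed = urlparse(row.get(url_column) or "")
        host = (parsed.hostname or "").lower()
        parts = [p for p in parsed.path.split("/") if p]
        if host in GITHUB_HOSTS and len(parts) >= 2:
            row["owner"], row["repo"] = parts[0], parts[1]
            selected.append(row)
    return selected


# Made-up sample data, for illustration only.
sample = (
    "software\turl\n"
    "numpy\thttps://github.com/numpy/numpy\n"
    "foo\thttps://pypi.org/project/foo/\n"
)
github_rows = select_github_rows(sample)
```

Splitting the path into `owner` and `repo` at this stage makes it easier to query each repo later.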
Metrics
We downloaded generic metadata (e.g. license type, creation and update time), the number of commits on the default branch (e.g. main or master), README data and the number of releases from each of the available repos.
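As an illustration, the generic metadata can be reduced to a flat record once the repository JSON has been fetched. The sketch below assumes the payload has the shape of a GitHub REST API repository object (fields such as `created_at`, `pushed_at`, `license.spdx_id`, `forks_count`, `stargazers_count`). The download itself is left out, and the function name and example payload are ours, not taken from the notebooks.

```python
def summarize_repo(payload):
    """Flatten a repository JSON object into the fields used in the analysis."""
    # "license" is null in the API response when no license was detected.
    license_info = payload.get("license") or {}
    return {
        "full_name": payload.get("full_name"),
        "default_branch": payload.get("default_branch"),
        "created_at": payload.get("created_at"),
        "pushed_at": payload.get("pushed_at"),
        "license": license_info.get("spdx_id"),
        "forks": payload.get("forks_count", 0),
        "stars": payload.get("stargazers_count", 0),
    }


# Hand-written example payload, not real data.
example = {
    "full_name": "some-owner/some-repo",
    "default_branch": "main",
    "created_at": "2020-01-01T00:00:00Z",
    "pushed_at": "2020-03-01T00:00:00Z",
    "license": None,
    "forks_count": 2,
    "stargazers_count": 5,
}
record = summarize_repo(example)
```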
Results
Lifespan
We analyzed the difference between the creation date of each repo and the date of its last update. The majority of repos became inactive after a few months. Surprisingly, we found a few repos with "negative lifespans". These rare cases can arise when a repo was forked or cloned, inheriting the commit history, and was never updated again, so that the "last push" is dated before the creation of the new repo.

The Empirical Cumulative Distribution Function (ECDF) of the lifespans shows that about half of the repos are inactive after 12 months.
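A minimal sketch of the lifespan and ECDF computation follows. It assumes ISO 8601 timestamps as returned by GitHub and takes the last push as the "last update". The helper names are illustrative and the numbers in the usage lines are made up.

```python
from datetime import datetime


def parse_ts(ts):
    # Replace the trailing "Z" so fromisoformat also works on Python < 3.11.
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))


def lifespan_days(created_at, pushed_at):
    """Days between creation and last push; negative for inherited history."""
    return (parse_ts(pushed_at) - parse_ts(created_at)).total_seconds() / 86400


def ecdf(values):
    """Return (value, cumulative fraction) pairs sorted by value."""
    ordered = sorted(values)
    n = len(ordered)
    return [(v, (i + 1) / n) for i, v in enumerate(ordered)]


def fraction_at_or_below(values, threshold):
    """Share of values that do not exceed the threshold."""
    return sum(v <= threshold for v in values) / len(values)


# Made-up lifespans in days; half of them fall within one year.
lifespans = [100, 400, 200, 800]
share_inactive_after_year = fraction_at_or_below(lifespans, 365)
```

A repo whose last push predates its creation simply yields a negative value here, which makes the "negative lifespan" cases easy to filter or inspect.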

Commits
We split the commit data at the 95th percentile into a "below" and an "above" group. This aids the visualization, as the
"above" repos (e.g. Linux) strongly skew the representation.
The vast majority of repos have fewer than 10 commits, and 11% of the repos have only 1 commit.
There is a weak correlation between the lifespan and the number of commits on the default branch.
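The percentile split can be sketched as below. The nearest-rank definition of the percentile is our assumption, since other definitions interpolate between values, and the commit counts are invented for the example.

```python
import math


def nearest_rank_percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def split_at_percentile(values, pct=95):
    """Return the cutoff and the values at/below and above it."""
    cutoff = nearest_rank_percentile(values, pct)
    below = [v for v in values if v <= cutoff]
    above = [v for v in values if v > cutoff]
    return cutoff, below, above


# Invented commit counts with one very large outlier.
commits = [1] * 10 + [2, 3, 4, 5, 6, 7, 8, 9, 10] + [1_000_000]
cutoff, below, above = split_at_percentile(commits)
single_commit_share = sum(c == 1 for c in commits) / len(commits)
```

Plotting `below` and `above` separately keeps the outliers from flattening the histogram of the typical repos.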

Looking more closely at the distribution of commits across the years, we see a trend towards more and
more repos with very low numbers of commits, with no specific effect from the introduction of the FAIR
principles.

README
We classified each README as none, short (<250 words), intermediate (between 250 and 2500 words)
or lengthy (>2500 words), and computed the proportion of each class in the dataset.
An important metric in biomedical/research software is the connection between the software itself
and the original publication. To gain insight into this relation, we computed the proportion of
repos that mention a DOI in the README.
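The README classes and the DOI check can be expressed as the sketch below. Three things are assumptions of this sketch: words are counted by whitespace splitting, an empty README counts as "none", and the DOI regular expression is a common heuristic rather than the exact pattern used in the notebooks.

```python
import re

# Heuristic DOI pattern: "10." + registrant code + "/" + suffix.
DOI_PATTERN = re.compile(r"\b10\.\d{4,9}/[^\s\"<>]+", re.IGNORECASE)


def classify_readme(text):
    """Map README text to none / short / intermediate / lengthy."""
    if text is None or not text.strip():
        return "none"
    words = len(text.split())
    if words < 250:
        return "short"
    if words <= 2500:
        return "intermediate"
    return "lengthy"


def mentions_doi(text):
    """True if the text contains something that looks like a DOI."""
    return bool(text) and DOI_PATTERN.search(text) is not None


has_doi = mentions_doi("Data: https://doi.org/10.5061/dryad.6wwpzgn2c")
```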

License
Repositories can have an associated license, which details whether and how the code can be used. We investigated
how many repos include a license and what proportion of the licenses are permissive.
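One way to compute these proportions is sketched below. The set of SPDX identifiers treated as permissive is our assumption and may differ from the list used in the notebooks. Repos without a detected license are represented by `None`.

```python
from collections import Counter

# Assumed set of permissive licenses (SPDX identifiers); adjust as needed.
PERMISSIVE = frozenset(
    {"MIT", "Apache-2.0", "BSD-2-Clause", "BSD-3-Clause", "ISC", "Unlicense"}
)


def license_summary(spdx_ids):
    """Share of licensed repos, and share of permissive among the licensed."""
    licensed = [s for s in spdx_ids if s]
    permissive = [s for s in licensed if s in PERMISSIVE]
    return {
        "licensed_share": len(licensed) / len(spdx_ids) if spdx_ids else 0.0,
        "permissive_share_of_licensed": (
            len(permissive) / len(licensed) if licensed else 0.0
        ),
        "counts": Counter(licensed),
    }


# Invented example: three of four repos are licensed, two permissively.
summary = license_summary(["MIT", None, "GPL-3.0", "Apache-2.0"])
```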

Releases, Forks and Stars
Package releases are stable versions of a package that are published on GitHub. We investigated how many releases, if any, each repo has and whether this correlates with the number of forks, as a proxy for the repo's relevance. Interestingly, we did not see a strong correlation between the number of releases and forks. We hypothesize that packages with a high number of releases are stable software that users tend to use "as is" rather than modify. We then investigated the correlation between forks and stars, which can be used as a proxy for community interest, and did observe a correlation.
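For heavily skewed counts such as releases, forks and stars, a rank-based measure is a reasonable choice. The sketch below implements Spearman's rank correlation with the standard library only. Spearman is our choice for illustration, since the text above does not specify which correlation coefficient was used, and the sample counts are invented.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; raises ZeroDivisionError if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Invented counts: forks and stars rise together, so rho is close to 1.
forks = [0, 1, 5, 100]
stars = [0, 2, 3, 50]
rho = spearman(forks, stars)
```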

FAIR principles
We set out to investigate the general adherence of the repos in the dataset to the FAIR principles [3].
F
A global, unique and persistent identifier can be achieved, among other ways, by either having a GitHub release or a published PyPI package (pip-installable). We computed the percentage of repos that have Python listed as the main language and that appear in the PyPI dataset, as well as the percentage of repos that have at least one GitHub release.
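The two percentages can be sketched as follows. Several points are assumptions of this sketch: the record fields (`name`, `language`, `releases`) are hypothetical, a repo is matched to PyPI by its normalized name, and the PyPI share is taken over the Python repos only. The normalization follows the PyPI name normalization rule (PEP 503).

```python
import re


def normalize(name):
    """Normalize a project name the way PyPI does (PEP 503)."""
    return re.sub(r"[-_.]+", "-", name).lower()


def findability_shares(repos, pypi_names):
    """Share of Python repos found on PyPI, and share of repos with a release."""
    pypi = {normalize(n) for n in pypi_names}
    python_repos = [r for r in repos if r.get("language") == "Python"]
    on_pypi = [r for r in python_repos if normalize(r["name"]) in pypi]
    with_release = [r for r in repos if r.get("releases", 0) > 0]
    return {
        "python_on_pypi": len(on_pypi) / len(python_repos) if python_repos else 0.0,
        "with_release": len(with_release) / len(repos) if repos else 0.0,
    }


# Invented records, for illustration only.
repos = [
    {"name": "Scikit_Learn", "language": "Python", "releases": 3},
    {"name": "mytool", "language": "Python", "releases": 0},
    {"name": "rpkg", "language": "R", "releases": 1},
    {"name": "x", "language": "C", "releases": 0},
]
shares = findability_shares(repos, ["scikit-learn"])
```

Matching by repo name is only a heuristic: a repo and its PyPI package can be named differently, so the resulting share is best read as a lower bound.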

A
The software can always be downloaded from GitHub over HTTPS, which satisfies A1. However, the metadata is not necessarily independent of the software (A2).
I
Interoperability is highly dependent on the field the software is developed for, as the software should exchange "data in a way that meets domain-relevant community standards" [3].
R
The software should be usable and reusable. To investigate this point, we computed the percentage of repos with a permissive license.

References
[1] Istrate, Ana-Maria; Veytsman, Boris; Li, Donghui et al. (2022). CZ Software Mentions: A large dataset of software mentions in the biomedical literature [Dataset]. Dryad. https://doi.org/10.5061/dryad.6wwpzgn2c
[2] RSE Repository Analysis https://github.com/softwaresaved/rse-repo-analysis
[3] FAIR principles https://www.nature.com/articles/s41597-022-01710-x
Owner
- Name: Indoc-Research
- Login: Indoc-Research
- Kind: organization
- Repositories: 1
- Profile: https://github.com/Indoc-Research
Citation (CITATION.cff)
# This CITATION.cff file was generated with cffinit.
# Visit https://bit.ly/cffinit to generate yours today!
cff-version: 1.2.0
title: >-
  Sustainability Analysis of Research Software Repositories
  on GitHub
message: >-
  If you use this software, please cite it using the
  metadata from this file.
type: software
authors:
  - given-names: Elisa L.
    family-names: Garulli
    orcid: 'https://orcid.org/0000-0003-3909-0683'
  - given-names: Dennis
    family-names: Doll
    email: ddoll@indocresearch.org
    affiliation: Indoc Research Europe gGmbH
    orcid: 'https://orcid.org/0000-0002-3806-9324'
repository-code: 'https://github.com/Indoc-Research/RS_repos_analysis'
abstract: >-
  Analysis of the need for a Research Software
  Infrastructure that improves the sustainability of
  Research Software, based on the publicly accessible CZI
  Software Mentions dataset.
keywords:
  - Research Software
  - Research Software Infrastructure
  - Open Science
  - Software Sustainability
license: AGPL-3.0
commit: 8b010766b76fde6ae27c51f2f0f9f3759bcc70c6
date-released: '2025-01-24'
GitHub Events
Total
- Release event: 1
- Member event: 1
- Push event: 18
- Create event: 4
Last Year
- Release event: 1
- Member event: 1
- Push event: 18
- Create event: 4