repositories-extraction

This repository extracts GitHub and GitLab links from 6 European clusters, using custom APIs or the APIs already provided by those clusters.

https://github.com/anas-elhounsri/repositories-extraction

Science Score: 67.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 2 DOI reference(s) in README
  • Academic publication links
    Links to: zenodo.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (14.1%) to scientific vocabulary
Last synced: 10 months ago

Repository


Basic Info
  • Host: GitHub
  • Owner: Anas-Elhounsri
  • License: apache-2.0
  • Language: Python
  • Default Branch: main
  • Homepage:
  • Size: 25.2 MB
Statistics
  • Stars: 1
  • Watchers: 1
  • Forks: 0
  • Open Issues: 0
  • Releases: 1
Created about 2 years ago · Last pushed over 1 year ago
Metadata Files
Readme License Citation

README.md


[!WARNING] This is still in progress and subject to change, as additional requirements may arise.

Repositories Extraction:

  • ESCAPE OSSR (Available)
  • SSHOC (Available)
  • LS-RI (Available)
  • PANOSC (Available)
  • ENVRI (Available)
  • RSD (Available)

How to run:

  • Install the required libraries listed in requirements.txt: `pip install -r requirements.txt`
  • Run the main.py script, then choose from the list the community you wish to extract data from. (Note: for the SSHOC and ESCAPE communities, you need an access token from Zenodo.)

```bash
Choose the community you wish to extract data from (type 's' to stop):

- sshoc
- escape2020
- LS-RI
- PANOSC
- ENVRI
- RSD

Input:
```

  • Example output:

```bash
Input:sshoc
Please input your access token:

Successfully retrieved data
Opened & loaded successfully

Extracted 6 GitHub links and saved to githublinkssshoc.txt
```

  • In your local directory you will see the extracted datasets: both the `JSON` and the `txt` file of GitHub links for *SSHOC*, *ESCAPE OSSR* and *LS-RI*. For *PANOSC*, the links are extracted directly with the `bs4` library.

The Process:

For SSHOC & ESCAPE OSSR:

Zenodo offers an API that lets us extract the information we need by querying its records through the endpoint https://zenodo.org/api/records:

```python
import requests

response = requests.get(
    "https://zenodo.org/api/records",
    params={
        "communities": community,
        "type": type,
        "access_token": token,
    },
)
```

In `community`, we specify whether we want SSHOC or ESCAPE OSSR. In `type`, we specify that only software should be listed, since the remaining record types (publications, presentations, videos, etc.) are outside the scope of our project. Finally, `access_token` is the token you can obtain from Zenodo after creating an account.

This extracts all available tools for either SSHOC or ESCAPE OSSR as a JSON file. The script then reads that file and extracts only the GitHub or GitLab repository links from the "Identifier" section.

Two JSON files are stored: one holds the original metadata, and the other holds the GitHub and GitLab links.
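The link extraction step described above can be sketched as follows. This is a minimal illustration, not the repository's actual code: it assumes the response follows the layout `hits` → `hits` → `metadata` → `related_identifiers`, with each entry carrying an `identifier` string. The function name `extract_repo_links`, the `REPO_URL` pattern, and the sample records are all hypothetical.

```python
import re

# Matches GitHub URLs and GitLab URLs (gitlab.com or hosts named gitlab.<domain>).
REPO_URL = re.compile(
    r"https?://(?:www\.)?(?:github\.com|gitlab\.[\w.-]+)/\S+", re.IGNORECASE
)


def extract_repo_links(records):
    """Collect GitHub/GitLab URLs from an already-loaded records response.

    Assumed layout: records["hits"]["hits"][i]["metadata"]["related_identifiers"],
    where each entry has an "identifier" string.
    """
    links = []
    for record in records.get("hits", {}).get("hits", []):
        metadata = record.get("metadata", {})
        for entry in metadata.get("related_identifiers", []):
            identifier = entry.get("identifier", "")
            # Keep only repository URLs, skipping duplicates.
            if REPO_URL.match(identifier) and identifier not in links:
                links.append(identifier)
    return links


# Made-up sample data standing in for a parsed API response.
sample = {
    "hits": {
        "hits": [
            {"metadata": {"related_identifiers": [
                {"identifier": "https://github.com/example-org/example-tool"},
                {"identifier": "10.1234/example-doi"},
            ]}},
            {"metadata": {"related_identifiers": [
                {"identifier": "https://gitlab.example.org/group/project"},
            ]}},
            {"metadata": {}},  # record without any identifiers
        ]
    }
}

repo_links = extract_repo_links(sample)
```

Records without identifiers are skipped rather than raising an error, which seems a reasonable default when the metadata is filled in by many different contributors.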

For LS-RI (bio.tools):

Similarly, bio.tools offers an API to extract all the tools available in its repository. This time, every entry in LS-RI is a tool, and some of them include a GitHub or GitLab link. We use the `format` parameter to retrieve the data as a JSON file; the links are usually stored in the homepage section of that file.

This script extracts approximately 15,000 links and stores the results separately as two JSON files.
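Filtering the homepage field can be sketched as below. This is an illustrative example under stated assumptions: each tool entry is taken to be a dictionary with a `homepage` key, as the text above suggests, and the helper names `is_repo_url` and `repo_links_from_tools` as well as the sample entries are hypothetical.

```python
import json
from urllib.parse import urlparse


def is_repo_url(url):
    """Return True if the URL's host is github.com or starts with 'gitlab.'."""
    host = (urlparse(url).hostname or "").lower()
    return host in ("github.com", "www.github.com") or host.startswith("gitlab.")


def repo_links_from_tools(tools):
    """Return the homepage of every tool whose homepage is a repository URL."""
    return [
        tool["homepage"]
        for tool in tools
        if is_repo_url(tool.get("homepage") or "")
    ]


# Made-up entries standing in for the tool list of a parsed response.
tools = [
    {"name": "tool-a", "homepage": "https://github.com/example-org/tool-a"},
    {"name": "tool-b", "homepage": "https://example.org/tool-b"},
    {"name": "tool-c", "homepage": "https://gitlab.com/example-group/tool-c"},
    {"name": "tool-d", "homepage": None},
]

links = repo_links_from_tools(tools)
# Serialized form of the links, ready to be written to a second JSON file.
links_json = json.dumps(links, indent=2)
```

Comparing the parsed hostname instead of searching the whole URL avoids counting pages that merely mention "github.com" somewhere in their path or query string.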

For PANOSC, ENVRI & RSD:

For these clusters, since they did not offer an API, I extracted the data by web scraping with the bs4 library. The script extracts 23 links and stores them directly as JSON.
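The scraping idea can be sketched as follows. Note that this example deliberately swaps in the standard library's `html.parser` instead of bs4 so that it runs without extra dependencies; with bs4, the anchors would typically be gathered with `soup.find_all("a", href=True)`. The class name `RepoLinkParser` and the HTML snippet are made up for illustration and do not reflect the actual pages of these clusters.

```python
from html.parser import HTMLParser


class RepoLinkParser(HTMLParser):
    """Collect href values of <a> tags that point to GitHub or GitLab."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        # attrs is a list of (name, value) pairs; href may be missing.
        href = dict(attrs).get("href") or ""
        if "github.com" in href or "gitlab" in href:
            self.links.append(href)


# Made-up HTML standing in for a downloaded software catalogue page.
html = """
<ul>
  <li><a href="https://github.com/example-org/viewer">Viewer</a></li>
  <li><a href="https://example.org/docs">Docs</a></li>
  <li><a href="https://gitlab.example.org/group/pipeline">Pipeline</a></li>
  <li><a name="top">No link here</a></li>
</ul>
"""

parser = RepoLinkParser()
parser.feed(html)
scraped_links = parser.links
```

The resulting list can then be written out with `json.dump`, matching the "stores them directly as JSON" step described above.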

Acknowledgement:

The authors acknowledge the OSCARS project, which has received funding from the European Commission's Horizon Europe Research and Innovation programme under grant agreement No. 101129751.


Owner

  • Name: Papa Zodd
  • Login: Anas-Elhounsri
  • Kind: user

I'm a CE student aspiring to become a Solution Architect with a Data Science and Deep Learning orientation.

Citation (CITATION.cff)

# This CITATION.cff file was generated with cffinit.
# Visit https://bit.ly/cffinit to generate yours today!

cff-version: 1.2.0
title: Repositories Extraction
message: >-
  If you use this software, please cite it using the
  metadata from this file.
type: software
authors:
  - given-names: Anas
    family-names: El Hounsri
identifiers:
  - type: doi
    value: 10.5281/zenodo.14803009
repository-code: >-
  https://github.com/Anas-Elhounsri/Repositories-Extraction/tree/main
license: CC0-1.0
version: 1.0.0

GitHub Events

Total
  • Release event: 1
  • Push event: 7
  • Create event: 3
Last Year
  • Release event: 1
  • Push event: 7
  • Create event: 3

Dependencies

requirements.txt pypi
  • requests *