https://github.com/alpha-unito/software-heritage-analytics

The Software Heritage Analytics framework (a parallel cache for analysing Software Heritage data with Spark Streaming applications)

Science Score: 13.0%

This score indicates how likely this project is to be science-related based on various indicators:

○ CITATION.cff file
✓ codemeta.json file (found codemeta.json file)
○ .zenodo.json file
○ DOI references
○ Academic publication links
○ Academic email domains
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: low similarity (14.4%) to scientific vocabulary

Last synced: 9 months ago

Repository

Basic Info

Host: GitHub
Owner: alpha-unito
License: MIT
Language: JavaScript
Default Branch: main
Size: 4.53 MB

Statistics

Stars: 0
Watchers: 8
Forks: 0
Open Issues: 0
Releases: 0

Created almost 4 years ago · Last pushed about 2 years ago

Metadata Files

Readme, License

Software-Heritage-Analytics

General Description

Software Heritage Analytics (SHA) is a framework developed in the context of the ADMIRE project. SHA grew out of the need for a development environment for analysing the open source software that Software Heritage (SWH) preserves over time. SWH hosts over 100 million projects, amounting to more than 400 TB of data and growing constantly. This data can be turned into valuable knowledge about copyright and license violations, error propagation, programming patterns, and the evolution of coding paradigms and languages. Another important aspect is the meaning of algorithms, a fundamental prerequisite for trust in analytics and machine learning. The SHA architecture is designed both to give fast access to data and to provide a parallel computing environment based on Apache Spark Streaming. SHA is composed of three main components:

Data and metadata cache
Data orchestration layer
Web console

To speed up data transfer and data access, SHA implements a distributed cache component called Cachemire. For the parallel computing environment, SHA adopts Apache Spark, an open-source parallel computing framework for applications that analyse Big Data. Specifically, Spark Streaming was chosen to optimize the time taken to transfer data to the analysis computation. The orchestration layer and the console use the official SWH APIs to search for, retrieve, and get the git status of a given project.

SHA runs custom analysis applications written in Scala and compatible with Apache Spark Streaming. An application can analyse a set of projects, which must be specified in files (recipes) containing the SWH identifier of each project and other metadata useful for the analysis (e.g., the programming language).

The SHA web console is a browser app in which an authenticated user can:

* Search the SWH archive by name for one or more projects, which can be added to a recipe file
* Upload a custom analytics app in JAR format
* Create the correspondences between recipes and apps (which application runs with which recipe)
* Run the application

Both recipe files and application JAR files are stored in a local repository. The same application can be used with different recipes, and the same recipe with different applications. The term “recipe” is inherited from SWH, where the process of preparing a project for download is called “cooking”. Cooking essentially prepares a tar.gz archive containing all the files of the requested project.
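The recipe and cooking concepts described above can be illustrated with a minimal Python sketch. It is not part of SHA (whose analysis applications are written in Scala) and makes several assumptions: the JSON recipe layout, the `language` field, the `SUFFIX_BY_LANGUAGE` mapping, and all function names are hypothetical, and the tar.gz is built in memory as a stand-in for a cooked project rather than downloaded from SWH. The only part taken from a published specification is the core SWHID syntax (`swh:1:<type>:<40 hex digits>`).

```python
import io
import json
import re
import tarfile

# Core SWHID syntax: swh:1:<object type>:<40 lowercase hex digits>.
SWHID_RE = re.compile(r"^swh:1:(cnt|dir|rev|rel|snp):[0-9a-f]{40}$")

# Assumed mapping from a recipe's language field to a file suffix.
SUFFIX_BY_LANGUAGE = {"scala": ".scala", "python": ".py"}


def load_recipe(text):
    """Parse a recipe (hypothetical JSON layout) and reject invalid SWHIDs."""
    recipe = json.loads(text)
    for entry in recipe["projects"]:
        if not SWHID_RE.match(entry["swhid"]):
            raise ValueError(f"invalid SWHID: {entry['swhid']!r}")
    return recipe["projects"]


def make_demo_bundle(files):
    """Build an in-memory tar.gz standing in for a cooked project archive."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        for name, content in files.items():
            data = content.encode("utf-8")
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def files_with_suffix(bundle, suffix):
    """Return the sorted names of regular files in the bundle ending with suffix."""
    with tarfile.open(fileobj=io.BytesIO(bundle), mode="r:gz") as tar:
        return sorted(
            m.name for m in tar.getmembers()
            if m.isfile() and m.name.endswith(suffix)
        )


# A one-project recipe with a placeholder identifier (not a real archive object).
demo_recipe = json.dumps(
    {"projects": [{"swhid": "swh:1:dir:" + "0" * 40, "language": "scala"}]}
)
projects = load_recipe(demo_recipe)
bundle = make_demo_bundle({"src/Main.scala": "object Main", "README.md": "# demo"})

for project in projects:
    suffix = SUFFIX_BY_LANGUAGE[project["language"]]
    print(project["swhid"], files_with_suffix(bundle, suffix))
```

Validating identifiers when the recipe is loaded means a malformed entry fails early, before any archive is requested; the language metadata is then used only to select which files of the unpacked project an analysis would look at.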

Prerequisite

The three main components listed above are developed as separate modules, in three directories:

* Data and metadata cache
* Data orchestration layer
* Web console

Owner

Name: Parallel programming: Alpha group
Login: alpha-unito
Kind: organization
Location: Torino, IT

Website: http://alpha.di.unito.it
Repositories: 9
Profile: https://github.com/alpha-unito

Parallel Computing research cluster, Department of Computer Science, University of Torino

Dependencies

Webconsole/package-lock.json npm

741 dependencies

Webconsole/package.json npm

@tailwindcss/forms ^0.4.0 development
@tailwindcss/typography ^0.5.0 development
alpinejs ^3.0.6 development
axios ^0.21 development
laravel-mix ^6.0.6 development
lodash ^4.17.19 development
postcss ^8.1.14 development
postcss-import ^14.0.1 development
tailwindcss ^3.0.0 development

Webconsole/composer.json packagist

facade/ignition ^2.5 development
fakerphp/faker ^1.9.1 development
laravel/sail ^1.0.1 development
mockery/mockery ^1.4.4 development
nunomaduro/collision ^5.10 development
phpunit/phpunit ^9.5.10 development
blade-ui-kit/blade-icons ^1.2
fruitcake/laravel-cors ^2.0
guzzlehttp/guzzle ^7.0.1
laravel/framework ^8.75
laravel/jetstream ^2.7
laravel/sanctum ^2.11
laravel/tinker ^2.5
livewire/livewire ^2.5
php ^7.3|^8.0

Webconsole/composer.lock packagist

121 dependencies
