https://github.com/alpha-unito/software-heritage-analytics

The Software Heritage Analytics framework (a parallel cache for analysing Software Heritage data with Spark Streaming applications)

Science Score: 13.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (14.4%) to scientific vocabulary
Last synced: 9 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: alpha-unito
  • License: MIT
  • Language: JavaScript
  • Default Branch: main
  • Size: 4.53 MB
Statistics
  • Stars: 0
  • Watchers: 8
  • Forks: 0
  • Open Issues: 0
  • Releases: 0
Created almost 4 years ago · Last pushed about 2 years ago
Metadata Files
  • Readme
  • License

README.md

Software-Heritage-Analytics

The Software Heritage Analytics framework (a parallel cache for analysing Software Heritage data with Spark Streaming applications)

General Description

Software Heritage Analytics (SHA) is a framework developed in the context of the ADMIRE project. SHA grew out of the need for a development environment for analysing the open source software preserved over time by Software Heritage (SWH). SWH hosts over 100 million projects, amounting to over 400 TB of data and growing constantly. This data can be turned into valuable knowledge about copyright and license violations, error propagation, programming patterns, and the evolution of coding paradigms and languages. Another important aspect is the meaning of algorithms, which is a fundamental prerequisite for trust in analytics and machine learning. The SHA architecture is designed both to give fast access to the data and to provide a parallel computing environment based on Apache Spark Streaming. SHA is composed of three main components:

  • Data and metadata cache
  • Data orchestration layer
  • Web console

To speed up data transfer and data access, SHA implements a distributed cache component called Cachemire. For the parallel computing environment, SHA adopts Apache Spark, an open-source parallel computing framework for applications that analyze Big Data. Specifically, Spark Streaming was chosen to reduce the time needed to move data to the analysis computation. The orchestration layer and the console use the official SWH APIs to search for a project, retrieve it, and get its git status.

SHA runs custom analysis applications written in Scala and compatible with Apache Spark Streaming. An application can analyze a set of projects. The set must be specified in a file (a recipe) containing the SWH identifier of each project and other metadata useful for the analysis (e.g., the programming language).

The SHA web console is a browser app in which an authenticated user can:

  • Search the SWH archive by name for one or more projects, which can then be added to a recipe file
  • Upload a custom analytics app in JAR format
  • Create the correspondences between recipes and apps (which application runs with which recipe)
  • Run the application

Both recipe files and application JAR files are stored in a local repository. The same application can be used with different recipes, and the same recipe with different applications. The term "recipe" is inherited from SWH, where the process of preparing a project for download is called "cooking". Cooking essentially prepares a tar.gz archive containing all the files of the requested project.
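The relationship between recipes and applications can be illustrated with a small sketch. It is not taken from the SHA code base: the JSON layout of the recipe, the field names `swhid` and `language`, the `PairingTable` class, and the JAR and recipe names are all assumptions made for illustration, and the identifiers are placeholders rather than real projects. It only shows the idea that a recipe lists project identifiers with metadata, and that applications and recipes can be paired many-to-many.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class RecipeEntry:
    """One project listed in a recipe (field names are assumptions)."""
    swhid: str      # SWH identifier of the project
    language: str   # example of metadata useful for the analysis


def parse_recipe(text: str) -> list[RecipeEntry]:
    """Parse a hypothetical JSON recipe into a list of entries."""
    raw = json.loads(text)
    return [
        RecipeEntry(p["swhid"], p.get("language", "unknown"))
        for p in raw["projects"]
    ]


class PairingTable:
    """Many-to-many correspondences between application JARs and recipes."""

    def __init__(self) -> None:
        self._pairs: set[tuple[str, str]] = set()

    def pair(self, app_jar: str, recipe: str) -> None:
        # Adding the same pair twice has no effect because a set is used.
        self._pairs.add((app_jar, recipe))

    def recipes_for(self, app_jar: str) -> list[str]:
        return sorted(r for a, r in self._pairs if a == app_jar)

    def apps_for(self, recipe: str) -> list[str]:
        return sorted(a for a, r in self._pairs if r == recipe)


# Placeholder identifiers: the hash parts are filler, not real projects.
RECIPE_TEXT = json.dumps({
    "projects": [
        {"swhid": "swh:1:dir:" + "0" * 40, "language": "C"},
        {"swhid": "swh:1:dir:" + "1" * 40},
    ]
})

entries = parse_recipe(RECIPE_TEXT)

table = PairingTable()
table.pair("license-scan.jar", "c-projects")
table.pair("license-scan.jar", "java-projects")
table.pair("pattern-count.jar", "c-projects")

print(table.recipes_for("license-scan.jar"))
print(table.apps_for("c-projects"))
```

Storing the pairs as a set of tuples keeps the two lookups symmetric: one application can be listed against several recipes and one recipe against several applications, which is the reuse described above.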

Prerequisite

The three main components listed above are developed as separate modules, in three directories:

  • Data and metadata cache
  • Data orchestration layer
  • Web console

Owner

  • Name: Parallel programming: Alpha group
  • Login: alpha-unito
  • Kind: organization
  • Location: Torino, IT

Parallel Computing research cluster, Department of Computer Science, University of Torino

Dependencies

Webconsole/package-lock.json npm
  • 741 dependencies
Webconsole/package.json npm
  • @tailwindcss/forms ^0.4.0 development
  • @tailwindcss/typography ^0.5.0 development
  • alpinejs ^3.0.6 development
  • axios ^0.21 development
  • laravel-mix ^6.0.6 development
  • lodash ^4.17.19 development
  • postcss ^8.1.14 development
  • postcss-import ^14.0.1 development
  • tailwindcss ^3.0.0 development
Webconsole/composer.json packagist
  • facade/ignition ^2.5 development
  • fakerphp/faker ^1.9.1 development
  • laravel/sail ^1.0.1 development
  • mockery/mockery ^1.4.4 development
  • nunomaduro/collision ^5.10 development
  • phpunit/phpunit ^9.5.10 development
  • blade-ui-kit/blade-icons ^1.2
  • fruitcake/laravel-cors ^2.0
  • guzzlehttp/guzzle ^7.0.1
  • laravel/framework ^8.75
  • laravel/jetstream ^2.7
  • laravel/sanctum ^2.11
  • laravel/tinker ^2.5
  • livewire/livewire ^2.5
  • php ^7.3|^8.0
Webconsole/composer.lock packagist
  • 121 dependencies