corelay

CoRelAy is a tool to compose small-scale (single-machine) analysis pipelines.

https://github.com/virelay/corelay

Science Score: 64.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
    Links to: arxiv.org
  • Committers with academic emails
    3 of 7 committers (42.9%) from academic institutions
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (19.5%) to scientific vocabulary

Keywords

interpretability machine-learning pipeline-framework python spectral-clustering xai
Last synced: 6 months ago

Repository

CoRelAy is a tool to compose small-scale (single-machine) analysis pipelines.

Basic Info
  • Host: GitHub
  • Owner: virelay
  • License: lgpl-3.0
  • Language: Python
  • Default Branch: main
  • Homepage:
  • Size: 3 MB
Statistics
  • Stars: 28
  • Watchers: 3
  • Forks: 2
  • Open Issues: 0
  • Releases: 4
Topics
interpretability machine-learning pipeline-framework python spectral-clustering xai
Created about 5 years ago · Last pushed 7 months ago
Metadata Files
  • Readme
  • Changelog
  • License
  • Citation

README.md

# Composing Relevance Analysis

[![License](https://img.shields.io/pypi/l/corelay)](https://github.com/virelay/corelay/blob/main/COPYING.LESSER) [![GitHub Actions Workflow Status](https://github.com/virelay/corelay/actions/workflows/tests.yml/badge.svg)](https://github.com/virelay/corelay/actions/workflows/tests.yml) [![Documentation Status](https://readthedocs.org/projects/corelay/badge?version=latest)](https://corelay.readthedocs.io/en/latest) [![GitHub Release](https://img.shields.io/github/v/release/virelay/corelay)](https://github.com/virelay/corelay/releases/latest) [![PyPI Package](https://img.shields.io/pypi/v/corelay)](https://pypi.org/project/corelay/)

**CoRelAy** is a library designed for composing efficient, single-machine data analysis pipelines. It enables the rapid implementation of pipelines that can be used to analyze and process data. CoRelAy is primarily meant for use in explainable artificial intelligence (XAI), often with the goal of producing output suitable for visualization in tools like [**ViRelAy**](https://github.com/virelay/virelay).

At the core of CoRelAy are pipelines (`Pipeline`), which consist of a series of tasks (`Task`). Each task is a modular unit that can be populated with operations (`Processor`) to perform a specific data processing step. These operations, known as processors, can be customized by assigning new instances or modifying their default configurations.

Tasks in CoRelAy are highly flexible and can be tailored to meet the needs of your analysis pipeline. By leveraging a wide range of configurable processors with their respective parameters (`Param`), you can easily adapt and optimize your data processing workflow.
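
To make these concepts concrete, here is a minimal sketch of a custom processor with a single configurable parameter, modeled on the `Normalize` processor in the full example further below; the `Scale` class and its `factor` parameter are hypothetical names used purely for illustration:

```python
from typing import Annotated

from corelay.base import Param
from corelay.processor.base import Processor


class Scale(Processor):
    """A hypothetical processor that multiplies its input by a configurable factor."""

    factor: Annotated[float, Param(float, 2.0)]
    """The factor by which the input data is multiplied. Defaults to 2.0."""

    def function(self, data):
        # The function method implements the actual operation performed by the processor
        return data * self.factor


# Parameters can be set when a processor is instantiated, just like the
# n_clusters parameter in the KMeans(n_clusters=k) calls in the example below
double = Scale()            # uses the default factor of 2.0
triple = Scale(factor=3.0)  # overrides the default
```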

For more information about CoRelAy, getting started guides, in-depth tutorials, and the API reference, please refer to the [documentation](https://corelay.readthedocs.io/en/latest/).

If you find CoRelAy useful for your research, please consider citing our related paper:

```bibtex
@article{anders2021software,
    author  = {Anders, Christopher J. and Neumann, David and Samek, Wojciech and Müller, Klaus-Robert and Lapuschkin, Sebastian},
    title   = {Software for Dataset-wide XAI: From Local Explanations to Global Insights with {Zennit}, {CoRelAy}, and {ViRelAy}},
    year    = {2021},
    volume  = {abs/2106.13200},
    journal = {CoRR}
}
```

Features

  • Pipeline Composition: CoRelAy allows you to compose pipelines of processors, which can be executed in parallel or sequentially (see the sketch after this list).
  • Task-based Design: Each step in the pipeline is represented as a task, which can be easily modified or replaced.
  • Processor Library: CoRelAy comes with a library of built-in processors for common tasks, such as clustering, embedding, and dimensionality reduction.
  • Memoization: CoRelAy supports memoization of intermediate results, allowing you to reuse previously computed results and speed up your analysis.
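
As a complement to the list above, the following minimal sketch shows how the flow processors `Sequential` and `Parallel` (both used in the full example further below) compose processors; `Scale` is the same hypothetical processor as in the previous sketch, repeated here so the snippet is self-contained:

```python
from typing import Annotated

from corelay.base import Param
from corelay.processor.base import Processor
from corelay.processor.flow import Sequential, Parallel


class Scale(Processor):
    """A hypothetical processor that multiplies its input by a configurable factor."""

    factor: Annotated[float, Param(float, 2.0)]

    def function(self, data):
        return data * self.factor


# Sequential chains processors, feeding the output of Scale(factor=2.0) into Scale(factor=5.0)
chain = Sequential([Scale(factor=2.0), Scale(factor=5.0)])

# Parallel with broadcast=True copies its input to every child processor and collects
# their results, as in the clustering task of the full example below
branches = Parallel([Scale(factor=2.0), Scale(factor=5.0)], broadcast=True)
```

When assigned to a pipeline task, such as `pipeline.preprocessing` in the full example, these compositions run as a single step of the pipeline.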

Getting Started

Installation

To get started, you first have to install CoRelAy on your system. The recommended and easiest way to install CoRelAy is to use pip, the Python package manager. You can install CoRelAy using the following command:

```shell
$ pip install corelay
```

> [!NOTE]
> CoRelAy depends on the metrohash-python library, which requires a C++ compiler to be installed. This may mean that you will have to install extra packages (GCC or Clang) for the installation to succeed. For example, on Fedora, you may have to install the gcc-c++ package in order to make the c++ command available, which can be done using the following command:
>
> ```shell
> $ sudo dnf install gcc-c++
> ```

To install CoRelAy with optional HDBSCAN and UMAP support, use:

```shell
$ pip install corelay[umap,hdbscan]
```

Usage

Examples to highlight some features of CoRelAy can be found in docs/examples.

We mainly use HDF5 files to store results. If you wish to visualize your analysis results using ViRelAy, please have a look at the ViRelAy documentation to find out more about its database specification. An example to create HDF5 files which can be used with ViRelAy is shown in docs/examples/hdf5_structure.py.

For a full SpRAy analysis that can be visualized with ViRelAy, see the advanced script in docs/examples/virelay_analysis.py.

The following shows the contents of docs/examples/memoize_spectral_pipeline.py:

```python
"""An example script, which uses memoization to store (intermediate) results."""

import time
import typing
from collections.abc import Sequence
from typing import Annotated, SupportsIndex

import h5py
import numpy

from corelay.base import Param
from corelay.io.storage import HashedHDF5
from corelay.pipeline.spectral import SpectralClustering
from corelay.processor.base import Processor
from corelay.processor.clustering import KMeans
from corelay.processor.embedding import TSNEEmbedding, EigenDecomposition
from corelay.processor.flow import Sequential, Parallel


class Flatten(Processor):
    """Represents a :py:class:`~corelay.processor.base.Processor`, which flattens its input data."""

    def function(self, data: typing.Any) -> typing.Any:
        """Applies the flattening to the input data.

        Args:
            data (typing.Any): The input data that is to be flattened.

        Returns:
            typing.Any: Returns the flattened data.
        """

        input_data: numpy.ndarray[typing.Any, typing.Any] = data
        return input_data.reshape(input_data.shape[0], numpy.prod(input_data.shape[1:]))


class SumChannel(Processor):
    """Represents a :py:class:`~corelay.processor.base.Processor`, which sums its input data across channels, i.e., its second axis."""

    def function(self, data: typing.Any) -> typing.Any:
        """Applies the summation over the channels to the input data.

        Args:
            data (typing.Any): The input data that is to be summed over its channels.

        Returns:
            typing.Any: Returns the data that was summed up over its channels.
        """

        input_data: numpy.ndarray[typing.Any, typing.Any] = data
        return input_data.sum(axis=1)


class Normalize(Processor):
    """Represents a :py:class:`~corelay.processor.base.Processor`, which normalizes its input data."""

    axes: Annotated[SupportsIndex | Sequence[SupportsIndex], Param((SupportsIndex, Sequence), (1, 2))]
    """A parameter of the :py:class:`~corelay.processor.base.Processor`, which determines the axes over which the data is to be normalized.
    Defaults to the second and third axes.
    """

    def function(self, data: typing.Any) -> typing.Any:
        """Normalizes the specified input data.

        Args:
            data (typing.Any): The input data that is to be normalized.

        Returns:
            typing.Any: Returns the normalized input data.
        """

        input_data: numpy.ndarray[typing.Any, typing.Any] = data
        return input_data / input_data.sum(self.axes, keepdims=True)


def main() -> None:
    """The entrypoint to the :py:mod:`memoize_spectral_pipeline` script."""

    # Fixes the random seed for reproducibility
    numpy.random.seed(0xDEADBEEF)

    # Opens an HDF5 file in append mode for storing the results of the analysis and the memoization of intermediate pipeline results
    with h5py.File('test.analysis.h5', 'a') as analysis_file:

        # Creates a HashedHDF5 IO object, which is an IO object that stores outputs of processors based on hashes in an HDF5 file
        io_object = HashedHDF5(analysis_file.require_group('proc_data'))

        # Generates some exemplary data
        data = numpy.random.normal(size=(64, 3, 32, 32))
        number_of_clusters = range(2, 20)

        # Creates a SpectralClustering pipeline, which is one of the pre-defined built-in pipelines
        pipeline = SpectralClustering(

            # Processors, such as EigenDecomposition, can be assigned to pre-defined tasks
            embedding=EigenDecomposition(n_eigval=8, io=io_object),

            # Flow-based processors, such as Parallel, can combine multiple processors; broadcast=True copies the input as many times as
            # there are processors; broadcast=False instead attempts to match each input to a processor
            clustering=Parallel([
                Parallel([
                    KMeans(n_clusters=k, io=io_object) for k in number_of_clusters
                ], broadcast=True),

                # IO objects will be used during computation when supplied to processors; if a corresponding output value (here identified
                # by hashes) already exists, the value is not computed again but instead loaded from the IO object
                TSNEEmbedding(io=io_object)
            ], broadcast=True, is_output=True)
        )

        # Processors (and Params) can be updated by simply assigning corresponding attributes
        pipeline.preprocessing = Sequential([
            SumChannel(),
            Normalize(),
            Flatten()
        ])

        # Processors flagged with is_output=True will be accumulated in the output; the output will be a tree of tuples, with the same
        # hierarchy as the pipeline (i.e., _clusterings here contains a tuple of the k-means outputs)
        start_time = time.perf_counter()
        _clusterings, _tsne = pipeline(data)

        # Since we memoize our results in an HDF5 file, subsequent calls will not compute the values (for the same inputs), but rather load
        # them from the HDF5 file; try running the script multiple times
        duration = time.perf_counter() - start_time
        print(f'Pipeline execution time: {duration:.4f} seconds')


if __name__ == '__main__':
    main()
```
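
Since the memoized processor outputs are stored under the `proc_data` group of `test.analysis.h5` (the group created via `require_group` in the script above), you can check what a run has memoized with a few lines of h5py; this is just a minimal inspection sketch:

```python
import h5py

# Prints the path of every group and dataset that the pipeline run memoized in the HDF5 file
with h5py.File('test.analysis.h5', 'r') as analysis_file:
    analysis_file['proc_data'].visit(print)
```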

Contributing

If you would like to contribute, there are multiple ways you can help out. If you find a bug or have a feature request, please feel free to open an issue on GitHub. If you want to contribute code, please fork the repository and use a feature branch; pull requests are always welcome. Before forking, please open an issue describing what you want to do. This helps to align your ideas with ours and may prevent you from doing work that we are already planning to do. If you have contributed to the project, please add yourself to the contributors list.

To help speed up the merging of your pull request, please comment and document your code extensively, try to emulate the coding style of the project, and update the documentation if necessary.

For more information on how to contribute, please refer to our contributor's guide.

License

CoRelAy is dual-licensed under the GNU General Public License Version 3 (GPL-3.0) or later, and the GNU Lesser General Public License Version 3 (LGPL-3.0) or later. For more information see the GPL-3.0 and LGPL-3.0 license files.

Owner

  • Name: virelay
  • Login: virelay
  • Kind: organization

Citation (CITATION.cff)

cff-version: 1.2.0
title: >-
  Software for Dataset-wide XAI: From Local Explanations to
  Global Insights with Zennit, CoRelAy, and ViRelAy
message: >-
  If you use this software, please cite it using the
  metadata from this file.
type: software
authors:
  - given-names: Christopher J.
    family-names: Anders
    orcid: 'https://orcid.org/0000-0003-3295-8486'
  - given-names: David
    family-names: Neumann
    orcid: 'https://orcid.org/0000-0003-1907-8329'
  - given-names: Wojciech
    family-names: Samek
    orcid: 'https://orcid.org/0000-0002-6283-3265'
  - given-names: Klaus-Robert
    family-names: Müller
    orcid: 'https://orcid.org/0000-0002-3861-7685'
  - given-names: Sebastian
    family-names: Lapuschkin
    orcid: 'https://orcid.org/0000-0002-0762-7258'
identifiers:
  - type: doi
    value: 10.48550/arXiv.2106.13200
    description: arXiv Preprint
  - type: url
    value: 'https://arxiv.org/abs/2106.13200'
    description: arXiv Preprint
repository-code: 'https://github.com/virelay/corelay.git'
url: 'https://corelay.readthedocs.io/en/latest/'
abstract: >-
  Deep Neural Networks (DNNs) are known to be strong
  predictors, but their prediction strategies can rarely be
  understood. With recent advances in Explainable Artificial
  Intelligence (XAI), approaches are available to explore
  the reasoning behind those complex models' predictions.
  Among post-hoc attribution methods, Layer-wise Relevance
  Propagation (LRP) shows high performance. For deeper
  quantitative analysis, manual approaches exist, but
  without the right tools they are unnecessarily labor
  intensive. In this software paper, we introduce three
  software packages targeted at scientists to explore model
  reasoning using attribution approaches and beyond: (1)
  Zennit - a highly customizable and intuitive attribution
  framework implementing LRP and related approaches in
  PyTorch, (2) CoRelAy - a framework to easily and quickly
  construct quantitative analysis pipelines for dataset-wide
  analyses of explanations, and (3) ViRelAy - a
  web-application to interactively explore data,
  attributions, and analysis results. With this, we provide
  a standardized implementation solution for XAI, to
  contribute towards more reproducibility in our field.
keywords:
  - Explainable Artificial Intelligence
  - XAI
  - Layer-Wise Relevance Propagation
  - LRP
  - Spectral Relevance Analysis
  - SpRAy
  - Zennit
  - CoRelAy
  - ViRelAy
license: GPL-3.0-or-later AND LGPL-3.0-or-later

GitHub Events

Total
  • Create event: 21
  • Release event: 1
  • Issues event: 21
  • Watch event: 1
  • Delete event: 19
  • Issue comment event: 1
  • Member event: 1
  • Push event: 82
  • Pull request event: 35
Last Year
  • Create event: 21
  • Release event: 1
  • Issues event: 21
  • Watch event: 1
  • Delete event: 19
  • Issue comment event: 1
  • Member event: 1
  • Push event: 82
  • Pull request event: 35

Committers

Last synced: almost 3 years ago

All Time
  • Total Commits: 301
  • Total Committers: 7
  • Avg Commits per committer: 43.0
  • Development Distribution Score (DDS): 0.336
Top Committers
  • chrstphr (c****r@p****u): 200 commits
  • David Neumann (d****n@h****e): 43 commits
  • Talmaj Marinc (t****c@h****e): 39 commits
  • Sebastian Lapuschkin (s****n@h****e): 11 commits
  • Sebastian Lapuschkin (s****n@h****m): 6 commits
  • Pattarawat Chormai (p****i@g****m): 1 commit
  • David Neumann (d****n@l****e): 1 commit

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 18
  • Total pull requests: 30
  • Average time to close issues: 10 days
  • Average time to close pull requests: about 17 hours
  • Total issue authors: 2
  • Total pull request authors: 4
  • Average comments per issue: 0.0
  • Average comments per pull request: 0.03
  • Merged pull requests: 30
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 17
  • Pull requests: 16
  • Average time to close issues: 10 days
  • Average time to close pull requests: about 8 hours
  • Issue authors: 1
  • Pull request authors: 1
  • Average comments per issue: 0.0
  • Average comments per pull request: 0.0
  • Merged pull requests: 16
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • lecode-official (18)
  • sebastian-lapuschkin-sideprojects (1)
Pull Request Authors
  • lecode-official (31)
  • chr5tphr (18)
  • sebastian-lapuschkin (2)
  • p16i (1)
Top Labels
Issue Labels
  • Type: Documentation (10)
  • Type: Idea or Request (7)
  • Priority: High (6)
  • Priority: Low (6)
  • Type: Implementation (5)
  • Type: CI/CD (5)
  • Type: Testing & Linting (4)
  • Priority: Critical (3)
  • Type: Content (3)
  • Type: Design (3)
  • Priority: Medium (1)
Pull Request Labels
  • Type: Documentation (18)
  • Type: Idea or Request (13)
  • Type: CI/CD (12)
  • Type: Implementation (11)
  • Type: Testing & Linting (8)
  • Type: Content (7)
  • Type: Design (1)

Packages

  • Total packages: 1
  • Total downloads:
    • pypi: 379 last month
  • Total dependent packages: 0
  • Total dependent repositories: 1
  • Total versions: 3
  • Total maintainers: 1
pypi.org: corelay

CoRelAy is a tool to compose small-scale (single-machine) analysis pipelines to generate analysis data which can then be visualized using ViRelAy.

  • Documentation: https://corelay.readthedocs.io/en/latest/
  • License: GNU General Public License v3 or later (GPLv3+), GNU Lesser General Public License v3 or later (LGPLv3+)
  • Latest release: 1.0.0
    published 7 months ago
  • Versions: 3
  • Dependent Packages: 0
  • Dependent Repositories: 1
  • Downloads: 379 Last month
Rankings
  • Dependent packages count: 10.0%
  • Stargazers count: 12.9%
  • Forks count: 19.1%
  • Average: 20.2%
  • Dependent repos count: 21.7%
  • Downloads: 37.3%
Maintainers (1)
Last synced: 6 months ago