iscc-sum

Fast Single-Pass ISCC Data-Code & Instance Code

https://github.com/bio-codes/iscc-sum

Science Score: 67.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 9 DOI reference(s) in README
  • Academic publication links
    Links to: zenodo.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (16.8%) to scientific vocabulary
Last synced: 6 months ago

Repository


Basic Info
  • Host: GitHub
  • Owner: bio-codes
  • License: apache-2.0
  • Language: Python
  • Default Branch: main
  • Homepage: https://sum.iscc.codes
  • Size: 1.87 MB
Statistics
  • Stars: 8
  • Watchers: 2
  • Forks: 1
  • Open Issues: 0
  • Releases: 3
Created 9 months ago · Last pushed 7 months ago
Metadata Files
Readme Changelog License Citation Zenodo

README.md

iscc-sum

A blazing-fast ISCC Data-Code and Instance-Code hashing tool built in Rust with Python bindings. Delivers 50-130x faster performance than reference implementations, processing data at over 1 GB/s.

Originally created to handle massive microscopic imaging datasets where existing tools were too slow.

Project Status

Version 0.1.0 — Initial release for Data-Code and Instance-Code generation.

[!WARNING] This package is under active development, and breaking changes may be released at any time. Be sure to pin to specific versions if you're using this package in a production environment.

Performance

  • 950-1050 MB/s processing speed (vs 7-8 MB/s reference)
  • 50-130x faster than existing implementations
  • Consistent performance on multi-GB files

Ideal for large-scale data processing: microscopic imaging, video files, scientific datasets.
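
To put the throughput figures above in perspective, the following is a rough back-of-the-envelope sketch. The 1 TB dataset size is a hypothetical example, and the two rates are simply the midpoints of the ranges quoted above (about 1000 MB/s and 7.5 MB/s); actual timings will depend on hardware and storage.

```python
def processing_time_seconds(size_mb: float, throughput_mb_s: float) -> float:
    """Return the time in seconds needed to hash size_mb at a given rate."""
    return size_mb / throughput_mb_s


# Hypothetical dataset: 1 TB expressed in decimal megabytes.
size_mb = 1_000_000

# Midpoints of the ranges quoted in the Performance list (assumed values).
fast = processing_time_seconds(size_mb, 1000.0)  # iscc-sum
slow = processing_time_seconds(size_mb, 7.5)     # reference implementation

print(f"{fast / 60:.1f} min vs {slow / 3600:.1f} h")  # 16.7 min vs 37.0 h
```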

Installation

Python Package

The recommended way to install the iscc-sum CLI tool is using uv:

    uv tool install iscc-sum

Note: To install uv, run: curl -LsSf https://astral.sh/uv/install.sh | sh (or see other installation methods)

Usage

Command Line Interface

The iscc-sum command provides checksum generation and verification functionality similar to standard tools like md5sum or sha256sum, but using ISCC (International Standard Content Code) checksums.

Basic Usage

```bash
# Generate checksum for a file
iscc-sum document.pdf
# Output: ISCC:KACYPXW445FTYNJ3CYSXHAFJMA2HUWULUNRFE3BLHRSCXYH2XHGQY *document.pdf

# Generate checksums for multiple files
iscc-sum *.txt

# Read from standard input
echo "Hello, World!" | iscc-sum
cat document.txt | iscc-sum
```

[!NOTE] By default, this tool creates ISCC-CODEs of SubType WIDE, introduced for large-scale secure checksum support with data similarity matching capabilities. This SubType is not yet part of the ISO 24138:2024 standard but is supported by the latest version of the Iscc-Core reference implementation. For ISO 24138:2024 conformant ISCC-CODEs, use the --narrow flag in the CLI tool.

Checksum Verification

```bash
# Create a checksum file
iscc-sum *.txt > checksums.txt

# Verify checksums
iscc-sum -c checksums.txt
# Output:
# file1.txt: OK
# file2.txt: OK

# Verify with quiet mode (only show failures)
iscc-sum -c -q checksums.txt
```

Output Formats

```bash
# Default format (GNU style)
iscc-sum file.txt
# ISCC:KACYPXW445FTYNJ3CYSXHAFJMA2HUWULUNRFE3BLHRSCXYH2XHGQY *file.txt

# BSD-style format
iscc-sum --tag file.txt
# ISCC (file.txt) = ISCC:KACYPXW445FTYNJ3CYSXHAFJMA2HUWULUNRFE3BLHRSCXYH2XHGQY

# Narrow format (128-bit)
iscc-sum --narrow file.txt
# ISCC:KACYPXW445FTYNJ3CYSXHAFJMA2HU *file.txt

# Show component codes
iscc-sum --units file.txt
# ISCC:KACYPXW445FTYNJ3CYSXHAFJMA2HUWULUNRFE3BLHRSCXYH2XHGQY *file.txt
# ISCC:EAAW4BQTJSTJSHAI27AJSAGMGHNUKSKRTK3E6OZ5CXUS57SWQZXJQ
# ISCC:IABXF3ZHYL6O6PM5P2HGV677CS3RBHINZSXEJCITE3WNOTQ2CYXRA

# Process entire directory as single unit
iscc-sum --tree /path/to/project
# ISCC:KACYPXW445FTYNJ3CYSXHAFJMA2HUWULUNRFE3BLHRSCXYH2XHGQY */path/to/project/
```
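
When the CLI output is consumed by another program, the GNU-style and BSD-style lines shown above can be parsed with two regular expressions. This is a minimal sketch based only on the sample output in this README; the `ChecksumLine` type and `parse_checksum_line` helper are hypothetical names and are not part of the iscc-sum package, and the patterns do not attempt to handle escaped or unusual filenames.

```python
import re
from typing import NamedTuple, Optional


class ChecksumLine(NamedTuple):
    code: str
    filename: str


# GNU style: "<ISCC code> *<filename>" (pattern inferred from the README samples).
_GNU = re.compile(r"^(ISCC:[A-Z2-7]+) \*(.+)$")
# BSD style: "ISCC (<filename>) = <ISCC code>".
_BSD = re.compile(r"^ISCC \((.+)\) = (ISCC:[A-Z2-7]+)$")


def parse_checksum_line(line: str) -> Optional[ChecksumLine]:
    """Parse one output line; return None if it matches neither style."""
    line = line.rstrip("\r\n")
    match = _GNU.match(line)
    if match:
        return ChecksumLine(code=match.group(1), filename=match.group(2))
    match = _BSD.match(line)
    if match:
        return ChecksumLine(code=match.group(2), filename=match.group(1))
    return None


gnu = parse_checksum_line("ISCC:KACYPXW445FTYNJ3CYSXHAFJMA2HU *file.txt")
bsd = parse_checksum_line("ISCC (file.txt) = ISCC:KACYPXW445FTYNJ3CYSXHAFJMA2HU")
print(gnu == bsd)  # True: both styles carry the same code and filename
```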

Similarity Matching

Find files with similar content:

```bash
# Find similar files (default threshold: 12 bits)
iscc-sum --similar *.jpg
# Output:
# photo1.jpg
# ~8 photo2.jpg
# ~12 photo3.jpg

# Adjust similarity threshold
iscc-sum --similar --threshold 6 *.pdf
```
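
The threshold is a Hamming distance, i.e. the number of differing bits between two codes. The sketch below illustrates that measure on arbitrary byte strings; it is not the tool's implementation, the byte values are made up, and which bits of a Data-Code the CLI actually compares is not specified here. The comparison is written as inclusive (`<=`) because the sample output above lists a match at distance 12 under the default threshold of 12.

```python
def hamming_distance(a: bytes, b: bytes) -> int:
    """Count the bit positions in which two equal-length byte strings differ."""
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


def is_similar(a: bytes, b: bytes, threshold: int = 12) -> bool:
    """Treat two codes as similar when their distance is within the threshold."""
    return hamming_distance(a, b) <= threshold


# Hypothetical 2-byte values, chosen only to make the arithmetic easy to follow.
left = bytes([0b00000000, 0b00000000])
right = bytes([0b00001111, 0b00000001])
print(hamming_distance(left, right))         # 5
print(is_similar(left, right, threshold=6))  # True
```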

Complete Options

```bash
iscc-sum --help  # Show all available options

Options:
  -c, --check   Read checksums from files and check them
  --narrow      Generate shorter 128-bit checksums
  --tag         Create a BSD-style checksum
  --units       Show Data-Code and Instance-Code components
  -z, --zero    End each output line with NUL
  --similar     Find files with similar Data-Codes
  --threshold   Hamming distance threshold for similarity (default: 12)
  -t, --tree    Process directory as single unit with combined checksum
  -q, --quiet   Don't print OK for each verified file
  --status      Don't output anything, exit code shows success
  -w, --warn    Warn about improperly formatted lines
  --strict      Exit non-zero for improperly formatted lines
```

Python API

Quick Start

Generate ISCC-SUM codes for files:

```pycon
>>> from iscc_sum import code_iscc_sum

>>> # Generate extended ISCC-SUM for a file
>>> result = code_iscc_sum("LICENSE", wide=True)
>>> result.iscc
'ISCC:K4AA2G6UMXGFJAO6ZOMIFZIYO6LYMOBT7Q6JDI3Z75IJWQY5WH372QA'
>>> result.datahash
'1e203833fc3c91a379ff509b431db1f7fd40dea69a6614249f420ec62398957087b1'
>>> result.filesize
11357
```
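
The `datahash` value in the example above starts with `1e20`, which is consistent with a multihash prefix: `0x1e` is the multicodec code registered for BLAKE3 and `0x20` is a digest length of 32 bytes. Assuming that layout holds (an inference from the sample value, not something this README states), the raw digest can be separated from its prefix as sketched below. `split_datahash` is a hypothetical helper, not part of the package, and it only handles single-byte prefix fields.

```python
from typing import Tuple


def split_datahash(datahash_hex: str) -> Tuple[int, int, bytes]:
    """Split a hex string into (hash code, declared length, digest bytes).

    Assumes a multihash-style layout with one-byte code and length fields.
    """
    raw = bytes.fromhex(datahash_hex)
    code, length, digest = raw[0], raw[1], raw[2:]
    if len(digest) != length:
        raise ValueError("declared digest length does not match payload")
    return code, length, digest


# Sample value copied from the Quick Start example.
sample = "1e203833fc3c91a379ff509b431db1f7fd40dea69a6614249f420ec62398957087b1"
code, length, digest = split_datahash(sample)
print(hex(code), length, len(digest))  # 0x1e 32 32
```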

Streaming API

For large files or streaming data, use the processor classes:

```python
from iscc_sum import IsccSumProcessor

processor = IsccSumProcessor()
with open("large_file.bin", "rb") as f:
    while chunk := f.read(1024 * 1024):  # Read in 1MB chunks
        processor.update(chunk)

result = processor.result(wide=False, add_units=True)
print(f"ISCC: {result.iscc}")
print(f"Units: {result.units}")  # Individual Data-Code and Instance-Code
```
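
The chunked read loop can be factored into a small helper that works with any object exposing an `update(bytes)` method. The sketch below is one possible arrangement, not part of the package: `feed_stream` and `SupportsUpdate` are hypothetical names, and `hashlib.sha256` is used purely as a stand-in so the example runs without iscc-sum installed. An `IsccSumProcessor` instance could be passed in the same way, given that it is updated chunk by chunk in the example above.

```python
import hashlib
import io
from typing import BinaryIO, Protocol


class SupportsUpdate(Protocol):
    """Anything that accepts data incrementally via update()."""

    def update(self, data: bytes) -> None: ...


def feed_stream(
    stream: BinaryIO,
    processor: SupportsUpdate,
    chunk_size: int = 1024 * 1024,
) -> int:
    """Feed a binary stream to the processor in chunks; return bytes consumed."""
    total = 0
    while chunk := stream.read(chunk_size):
        processor.update(chunk)
        total += len(chunk)
    return total


# Stand-in processor and in-memory stream, used only for demonstration.
payload = b"x" * 3_000_000
stand_in = hashlib.sha256()
consumed = feed_stream(io.BytesIO(payload), stand_in)
print(consumed)  # 3000000
```

Keeping the chunk size as a parameter makes it easy to tune memory use against the number of read calls for a given storage backend.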

Development

Prerequisites

  • Rust (latest stable) - Install from rustup.rs
  • Python 3.10+
  • UV (for Python dependency management) - Install from astral.sh/uv

Quick Setup

```bash
# Clone the repository
git clone https://github.com/bio-codes/iscc-sum.git
cd iscc-sum

# Install Python dependencies
uv sync --all-extras

# Setup Rust development components
uv run poe setup

# Build Python extension and run all checks
uv run poe all
```

Development Commands

All development tasks are managed through poethepoet:

```bash
# One-time setup (installs Rust components)
uv run poe setup

# Pre-commit checks (format, lint, test everything)
uv run poe all

# Individual commands
uv run poe format      # Format all code (Rust + Python)
uv run poe test        # Run all tests (Rust + Python)
uv run poe typecheck   # Run Python type checking
uv run poe rust-build  # Build Rust binary
uv run poe build-ext   # Build Python extension

# Check if Rust toolchain is properly installed
uv run poe check-rust
```

Manual Setup (if needed)

```bash
# Install all dependencies including dev dependencies
uv sync --all-extras

# Install Rust components manually
rustup component add rustfmt clippy

# Build Rust extension for Python
uv run maturin develop

# Run tests manually
cargo test     # Rust tests
uv run pytest  # Python tests
```

Building

```bash
# Build Rust binary (creates isum executable)
cargo build --release

# Build Python wheels
maturin build --release
```

Funding

This work was supported through the Open Science Clusters’ Action for Research and Society (OSCARS) European project under grant agreement Nº101129751.

See: BIO-CODES project (Enhancing AI-Readiness of Bioimaging Data with Content-Based Identifiers).

License

This project is licensed under the Apache License, Version 2.0 - see the LICENSE file for details.

Citation

If you use ISCC-SUM in your research, please cite:

    @software{pan2025isccsum,
      title     = {BIO-CODES/ISCC-SUM: High-Performance ISCC Generation for Bioimaging Data - OSCARS Project},
      author    = {Pan, Titusz},
      year      = 2025,
      month     = jul,
      publisher = {Zenodo},
      doi       = {10.5281/zenodo.16541262},
      url       = {https://doi.org/10.5281/zenodo.16541262},
      note      = {Supported by OSCARS (Open Science Clusters' Action for Research and Society) under European Commission grant agreement Nº101129751},
      version   = {0.1.0}
    }

Owner

  • Name: bio-codes
  • Login: bio-codes
  • Kind: organization

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
type: software
authors:
  - family-names: Pan
    given-names: Titusz
    email: tp@iscc.io
    orcid: https://orcid.org/0000-0002-0521-4214
  - family-names: Etzrodt
    given-names: Martin
    email: etzrodt.martin@gmail.com
    orcid: https://orcid.org/0000-0003-1928-3904
title: "BIO-CODES/ISCC-SUM: High-Performance ISCC Generation for Bioimaging Data - OSCARS Project"
doi: 10.5281/zenodo.16541262
version: 0.1.0
date-released: 2025-07-28
url: "https://github.com/bio-codes/iscc-sum"
repository-code: "https://github.com/bio-codes/iscc-sum"
abstract: >-
  A high-performance ISCC Data-Code and Instance-Code hashing tool built with Python including a rust extension.
  Delivers 50-130x faster performance than reference implementations, processing data at over 1 GB/s.
  Originally created to handle massive microscopic imaging datasets where existing tools were too slow.
keywords:
  - iscc
  - identifier
  - provenance
  - hash
  - checksum
  - content-hash
  - similarity-hash
  - deduplication
license: Apache-2.0
funding:
  - name: "OSCARS - Open Science Clusters' Action for Research and Society"
    award: "101129751"

GitHub Events

Total
  • Release event: 4
  • Watch event: 6
  • Delete event: 1
  • Push event: 68
  • Pull request event: 7
  • Create event: 7
Last Year
  • Release event: 4
  • Watch event: 6
  • Delete event: 1
  • Push event: 68
  • Pull request event: 7
  • Create event: 7

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 0
  • Total pull requests: 1
  • Average time to close issues: N/A
  • Average time to close pull requests: 1 day
  • Total issue authors: 0
  • Total pull request authors: 1
  • Average comments per issue: 0
  • Average comments per pull request: 0.0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 1
  • Average time to close issues: N/A
  • Average time to close pull requests: 1 day
  • Issue authors: 0
  • Pull request authors: 1
  • Average comments per issue: 0
  • Average comments per pull request: 0.0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
Pull Request Authors
  • etzm (4)
Top Labels
Issue Labels
Pull Request Labels

Packages

  • Total packages: 1
  • Total downloads:
    • pypi 30 last-month
  • Total dependent packages: 0
  • Total dependent repositories: 0
  • Total versions: 1
  • Total maintainers: 1
pypi.org: iscc-sum

High-performance ISCC Data-Code and Instance-Code hashing

  • Versions: 1
  • Dependent Packages: 0
  • Dependent Repositories: 0
  • Downloads: 30 Last month
Rankings
Dependent packages count: 9.0%
Average: 29.7%
Dependent repos count: 50.5%
Maintainers (1)
Last synced: 6 months ago

Dependencies

.github/workflows/ci.yml actions
  • Swatinem/rust-cache v2 composite
  • actions/checkout v4 composite
  • actions/download-artifact v4 composite
  • actions/setup-python v5 composite
  • actions/upload-artifact v4 composite
  • astral-sh/setup-uv v3 composite
  • dtolnay/rust-toolchain stable composite
.github/workflows/release.yml actions
  • actions/checkout v4 composite
  • actions/download-artifact v4 composite
  • actions/upload-artifact v4 composite
  • dtolnay/rust-toolchain stable composite
  • pypa/cibuildwheel v3.0.0 composite
  • pypa/gh-action-pypi-publish release/v1 composite
  • softprops/action-gh-release v1 composite
Cargo.lock cargo
  • aho-corasick 1.1.3
  • anstream 0.6.19
  • anstyle 1.0.11
  • anstyle-parse 0.2.7
  • anstyle-query 1.1.3
  • anstyle-wincon 3.0.9
  • arrayref 0.3.9
  • arrayvec 0.7.6
  • assert_cmd 2.0.17
  • autocfg 1.4.0
  • base32 0.5.1
  • bitflags 2.9.1
  • blake3 1.8.2
  • bstr 1.12.0
  • cc 1.2.27
  • cfg-if 1.0.1
  • clap 4.5.40
  • clap_builder 4.5.40
  • clap_derive 4.5.40
  • clap_lex 0.7.5
  • colorchoice 1.0.4
  • constant_time_eq 0.3.1
  • crossbeam-deque 0.8.6
  • crossbeam-epoch 0.9.18
  • crossbeam-utils 0.8.21
  • difflib 0.4.0
  • doc-comment 0.3.3
  • either 1.15.0
  • errno 0.3.12
  • fastrand 2.3.0
  • float-cmp 0.10.0
  • getrandom 0.3.3
  • globset 0.4.16
  • heck 0.5.0
  • hex 0.4.3
  • indoc 2.0.6
  • is_terminal_polyfill 1.70.1
  • libc 0.2.172
  • linux-raw-sys 0.9.4
  • log 0.4.27
  • memchr 2.7.5
  • memoffset 0.9.1
  • normalize-line-endings 0.3.0
  • num-traits 0.2.19
  • once_cell 1.21.3
  • once_cell_polyfill 1.70.1
  • portable-atomic 1.11.1
  • predicates 3.1.3
  • predicates-core 1.0.9
  • predicates-tree 1.0.12
  • proc-macro2 1.0.95
  • pyo3 0.25.1
  • pyo3-build-config 0.25.1
  • pyo3-ffi 0.25.1
  • pyo3-macros 0.25.1
  • pyo3-macros-backend 0.25.1
  • quote 1.0.40
  • r-efi 5.2.0
  • rayon 1.10.0
  • rayon-core 1.12.1
  • regex 1.11.1
  • regex-automata 0.4.9
  • regex-syntax 0.8.5
  • rustix 1.0.7
  • same-file 1.0.6
  • serde 1.0.219
  • serde_derive 1.0.219
  • shlex 1.3.0
  • strsim 0.11.1
  • syn 2.0.103
  • target-lexicon 0.13.2
  • tempfile 3.20.0
  • termtree 0.5.1
  • tinyvec 1.9.0
  • tinyvec_macros 0.1.1
  • unicode-ident 1.0.18
  • unicode-normalization 0.1.24
  • unindent 0.2.4
  • utf8parse 0.2.2
  • wait-timeout 0.2.1
  • walkdir 2.5.0
  • wasi 0.14.2+wasi-0.2.4
  • winapi-util 0.1.9
  • windows-sys 0.59.0
  • windows-targets 0.52.6
  • windows_aarch64_gnullvm 0.52.6
  • windows_aarch64_msvc 0.52.6
  • windows_i686_gnu 0.52.6
  • windows_i686_gnullvm 0.52.6
  • windows_i686_msvc 0.52.6
  • windows_x86_64_gnu 0.52.6
  • windows_x86_64_gnullvm 0.52.6
  • windows_x86_64_msvc 0.52.6
  • wit-bindgen-rt 0.39.0
  • xxhash-rust 0.8.15
Cargo.toml cargo
  • assert_cmd 2.0 development
  • predicates 3.1 development
  • tempfile 3.10 development
  • base32 0.5.0
  • blake3 1.8.2
  • clap 4.5
  • globset 0.4
  • hex 0.4.3
  • pyo3 0.25.1
  • rayon 1.10.0
  • unicode-normalization 0.1
  • walkdir 2.5
  • xxhash-rust 0.8.15
pyproject.toml pypi
  • blake3 >=1.0.5
  • click >=8.0.0
  • pathspec >=0.12.1
  • universal-pathlib >=0.2.6
  • xxhash >=3.5.0
uv.lock pypi
  • babel 2.17.0
  • backrefs 5.8
  • bandit 1.8.5
  • blake3 1.0.5
  • build 1.2.2.post1
  • certifi 2025.6.15
  • charset-normalizer 3.4.2
  • click 8.2.1
  • colorama 0.4.6
  • coverage 7.9.1
  • exceptiongroup 1.3.0
  • fsspec 2025.5.1
  • ghp-import 2.1.0
  • idna 3.10
  • importlib-metadata 8.7.0
  • iniconfig 2.1.0
  • iscc-sum 0.1.0
  • jinja2 3.1.6
  • markdown 3.8.1
  • markdown-it-py 3.0.0
  • markupsafe 3.0.2
  • maturin 1.8.7
  • mdformat 0.7.22
  • mdformat-footnote 0.1.1
  • mdformat-gfm 0.4.1
  • mdformat-gfm-alerts 2.0.0
  • mdformat-mkdocs 4.3.0
  • mdformat-tables 1.0.0
  • mdit-py-plugins 0.4.2
  • mdurl 0.1.2
  • mergedeep 1.3.4
  • mkdocs 1.6.1
  • mkdocs-get-deps 0.2.0
  • mkdocs-glightbox 0.4.0
  • mkdocs-material 9.6.14
  • mkdocs-material-extensions 1.3.1
  • more-itertools 10.7.0
  • mypy 1.16.1
  • mypy-extensions 1.1.0
  • packaging 25.0
  • paginate 0.5.7
  • pastel 0.2.1
  • pathspec 0.12.1
  • pbr 6.1.1
  • platformdirs 4.3.8
  • pluggy 1.6.0
  • poethepoet 0.35.0
  • pyfakefs 5.8.0
  • pygments 2.19.1
  • pymdown-extensions 10.15
  • pyproject-hooks 1.2.0
  • pytest 8.4.1
  • pytest-cov 6.2.1
  • python-dateutil 2.9.0.post0
  • pyyaml 6.0.2
  • pyyaml-env-tag 1.1
  • requests 2.32.4
  • rich 14.0.0
  • ruff 0.12.0
  • setuptools 80.9.0
  • six 1.17.0
  • stevedore 5.4.1
  • tomli 2.2.1
  • typing-extensions 4.14.0
  • universal-pathlib 0.2.6
  • urllib3 2.5.0
  • watchdog 6.0.0
  • wcwidth 0.2.13
  • xxhash 3.5.0
  • zipp 3.23.0