Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (17.0%) to scientific vocabulary

Repository

Basic Info
  • Host: GitHub
  • Owner: riccardoc95
  • License: MIT
  • Language: C++
  • Default Branch: main
  • Size: 83 KB
Statistics
  • Stars: 0
  • Watchers: 1
  • Forks: 0
  • Open Issues: 0
  • Releases: 0
Created 12 months ago · Last pushed 8 months ago
Metadata Files
Readme License Citation

README.md

HyperC: Scalable Distributed Workflow for Long Reads Self-Correction

HyperC is a high-performance, distributed workflow designed to accelerate long-read self-correction pipelines. It integrates with existing tools like CONSENT to efficiently process third-generation sequencing (TGS) data at scale using a hybrid MPI + OpenMP parallelization strategy.

HyperC was developed to address the slow correction step that limits genomic studies. It enables fast, scalable analysis suitable for population-scale datasets.


Features

  • Distributed parallelization using MPI across compute nodes.
  • Intra-node multithreading using OpenMP.
  • Optimized I/O: compressed FASTQ loading (via Zstd) and balanced PAF partitioning.
  • Modular design: easily plug in any MSA-based correction tool.
  • No parallel programming knowledge is required to run it.
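
As an illustration of what "balanced PAF partitioning" can mean, here is a minimal sketch that assigns target reads to MPI ranks so that each rank receives a similar number of overlap records. It is an assumption-laden example, not HyperC's actual partitioner: the function name `balance_by_overlaps`, the per-read overlap counts used as weights, and the greedy heaviest-first strategy are all hypothetical choices made for this sketch.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical helper (not part of HyperC's API).
// weights[i] = number of PAF overlap records whose target is read i.
// Returns owner[i] = rank that read i is assigned to.
std::vector<int> balance_by_overlaps(const std::vector<std::size_t>& weights,
                                     int n_ranks) {
    assert(n_ranks > 0);

    // Visit reads from heaviest to lightest; stable sort keeps ties deterministic.
    std::vector<std::size_t> order(weights.size());
    std::iota(order.begin(), order.end(), std::size_t{0});
    std::stable_sort(order.begin(), order.end(),
                     [&](std::size_t a, std::size_t b) {
                         return weights[a] > weights[b];
                     });

    std::vector<std::size_t> load(static_cast<std::size_t>(n_ranks), std::size_t{0});
    std::vector<int> owner(weights.size(), 0);
    for (std::size_t idx : order) {
        // min_element returns the first minimum, so ties go to the lowest rank.
        auto lightest = std::min_element(load.begin(), load.end());
        owner[idx] = static_cast<int>(lightest - load.begin());
        *lightest += weights[idx];
    }
    return owner;
}
```

Each rank would then read only the PAF records of the reads it owns. A plain contiguous split is simpler, but it can leave one rank with most of the work when a few reads have far more overlaps than the rest.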

Use Case

On real datasets such as NA12878 from the Nanopore WGS Consortium, HyperC significantly reduced correction runtime and scaled efficiently across multiple nodes.


Repository Structure

  • main.cpp – Entry point for the HyperC workflow.
  • CMakeLists.txt – Build configuration for the project.
  • compile.sh, compile_and_run.sh – Scripts for compilation and execution.
  • consent.h – Interface for integrating CONSENT's correction module (provided as an example correction module).
  • robin_hood.h – Efficient hash map implementation used for data structures (required by the code in consent.h).
  • utils.h – Utility functions for data handling and decompression.
  • .gitmodules – Contains references to submodules:
    • spoa: MSA library based on partial order alignment.
    • Complete-Striped-Smith-Waterman-Library: Optimized Smith-Waterman alignment.

Installation

Prerequisites

  • GCC ≥ 10 with OpenMP support
  • MPI (e.g. OpenMPI)
  • CMake ≥ 3.10
  • Zstandard (libzstd)

Clone the Repository

Make sure to clone the repository with submodules to include required dependencies:

    git clone --recurse-submodules https://github.com/riccardoc95/hipesc.git

Build

    ./compile.sh

Or, for building and running:

    ./compile_and_run.sh


Running the Pipeline

To run the pipeline, you need:

  • A FASTQ file with long reads
  • A corresponding PAF file (generated via Minimap2)
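
For readers unfamiliar with the PAF input, the sketch below parses the 12 mandatory columns of one PAF line as written by Minimap2 (query name, length, start, end, strand, target name, length, start, end, matching bases, alignment block length, mapping quality). The `PafRecord` struct and the `parse_paf_line` and `query_span` helpers are hypothetical names used only for this example; they are not taken from HyperC's `utils.h`.

```cpp
#include <cassert>
#include <cstdint>
#include <sstream>
#include <stdexcept>
#include <string>

// Hypothetical record type holding the 12 mandatory PAF columns.
struct PafRecord {
    std::string query_name;
    std::uint64_t query_len = 0, query_start = 0, query_end = 0;
    char strand = '+';
    std::string target_name;
    std::uint64_t target_len = 0, target_start = 0, target_end = 0;
    std::uint64_t matches = 0, block_len = 0;
    int mapq = 0;
};

// Parses one PAF line. Columns are tab-separated and sequence names contain
// no whitespace, so whitespace-based extraction is sufficient here.
// Optional SAM-style tags after column 12 are ignored.
PafRecord parse_paf_line(const std::string& line) {
    std::istringstream in(line);
    PafRecord r;
    if (!(in >> r.query_name >> r.query_len >> r.query_start >> r.query_end
             >> r.strand >> r.target_name >> r.target_len >> r.target_start
             >> r.target_end >> r.matches >> r.block_len >> r.mapq)) {
        throw std::runtime_error("malformed PAF line: " + line);
    }
    return r;
}

// Number of query bases covered by the overlap (coordinates are 0-based,
// start inclusive, end exclusive).
std::uint64_t query_span(const PafRecord& r) {
    assert(r.query_end >= r.query_start);
    return r.query_end - r.query_start;
}
```

Grouping the parsed records by read name would give, for each read, the set of overlapping reads that a correction module needs.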

Example (SLURM):

    srun -n 4 ./hyperc /path/to/input.fq /path/to/input.paf

Each MPI rank will process a portion of the input, and correction results will be written to separate output files.
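
The idea of each rank processing its own portion can be sketched without any MPI calls. The example below assumes a simple contiguous split of `n` reads across `size` ranks and an OpenMP loop over the local range; `block_range` and `local_bases` are hypothetical helpers, and HyperC's real distribution logic may differ. In an actual MPI program, `rank` and `size` would come from `MPI_Comm_rank` and `MPI_Comm_size`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Half-open index range [first, last) owned by `rank` out of `size` ranks.
// The first (n % size) ranks receive one extra item.
std::pair<std::size_t, std::size_t> block_range(std::size_t n, int rank, int size) {
    assert(size > 0 && rank >= 0 && rank < size);
    const std::size_t r = static_cast<std::size_t>(rank);
    const std::size_t s = static_cast<std::size_t>(size);
    const std::size_t base = n / s;
    const std::size_t extra = n % s;
    const std::size_t first = r * base + std::min(r, extra);
    const std::size_t last = first + base + (r < extra ? 1 : 0);
    return {first, last};
}

// Stand-in for per-rank work: sums the lengths of the reads in this rank's
// range. The pragma parallelizes the loop across threads when compiled with
// OpenMP enabled (e.g. -fopenmp) and is ignored otherwise.
std::uint64_t local_bases(const std::vector<std::uint64_t>& read_lengths,
                          int rank, int size) {
    const std::pair<std::size_t, std::size_t> range =
        block_range(read_lengths.size(), rank, size);
    std::uint64_t total = 0;
    #pragma omp parallel for reduction(+ : total)
    for (std::size_t i = range.first; i < range.second; ++i) {
        total += read_lengths[i];
    }
    return total;
}
```

With 10 reads and 4 ranks, this split gives the ranks 3, 3, 2, and 2 reads respectively, and every read belongs to exactly one rank.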


Plug in Your Own Correction Module

To integrate another MSA-based correction module:

  1. Implement a C/C++ function that accepts a target read and its overlapping reads.
  2. Modify consent.h to wrap the new module.
  3. Recompile with compile.sh.

HyperC will handle all job distribution and parallelization transparently.
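
To make step 1 concrete, here is a toy correction function with the shape described above: it takes a target read and its overlapping reads and returns a corrected sequence. The signature `correct_read` is hypothetical and is not the interface declared in `consent.h`. The logic is a deliberately simplified stand-in for a real MSA-based method: it assumes every overlapping read is already aligned to the target column by column and has the same length, and it applies a per-column majority vote.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical module entry point: returns a corrected copy of `target`.
// Assumption: each overlap is pre-aligned to the target with no gaps, so
// position i in every overlap corresponds to position i in the target.
std::string correct_read(const std::string& target,
                         const std::vector<std::string>& overlaps) {
    const std::string alphabet = "ACGT";
    std::string corrected = target;

    for (std::size_t col = 0; col < target.size(); ++col) {
        // Count votes per base in this column; the target votes too.
        std::array<int, 256> votes{};
        votes[static_cast<unsigned char>(target[col])] += 1;
        for (const std::string& overlap : overlaps) {
            assert(overlap.size() == target.size());
            votes[static_cast<unsigned char>(overlap[col])] += 1;
        }

        // Replace the target base only if another base has strictly more
        // votes, so ties keep the original base.
        unsigned char best = static_cast<unsigned char>(target[col]);
        for (char base : alphabet) {
            const unsigned char b = static_cast<unsigned char>(base);
            if (votes[b] > votes[best]) {
                best = b;
            }
        }
        corrected[col] = static_cast<char>(best);
    }
    return corrected;
}
```

A real module would first build a multiple sequence alignment of the overlapping regions (for example with the spoa submodule) before deriving a consensus; the vote above only illustrates the input and output contract.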


License

MIT License


🔗 Reference

If you use HyperC in your research, please cite:

Ceccaroni, R., Di Rocco, L., Ferraro Petrillo, U., & Brutti, P. (2025). A Distributed Workflow for Long Reads Self-correction. In S. Caino-Lores, D. Zeinalipour, T. D. Doudali, D. E. Singh, G. E. M. Garzón, L. Sousa, … S. Neuwirth (Eds.), Euro-Par 2024: Parallel Processing Workshops (pp. 105–116). Cham: Springer Nature Switzerland.

Owner

  • Name: Riccardo Ceccaroni
  • Login: riccardoc95
  • Kind: user

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software, please cite the following paper:"
title: "A Distributed Workflow for Long Reads Self-correction"
authors:
  - family-names: Ceccaroni
    given-names: Riccardo
  - family-names: Di Rocco
    given-names: Lorenzo
  - family-names: Ferraro Petrillo
    given-names: Umberto
  - family-names: Brutti
    given-names: Pierpaolo
date-released: 2025-08-01
doi: "https://doi.org/10.1007/978-3-031-90203-1_10"

GitHub Events

Total
  • Public event: 1
  • Push event: 23
Last Year
  • Public event: 1
  • Push event: 23