hipesc

https://github.com/riccardoc95/hipesc

Repository

Basic Info

Host: GitHub
Owner: riccardoc95
License: mit
Language: C++
Default Branch: main
Size: 83 KB

Statistics

Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Releases: 0

Created over 1 year ago · Last pushed 12 months ago

HyperC: Scalable Distributed Workflow for Long Reads Self-Correction

HyperC is a high-performance, distributed workflow designed to accelerate long-read self-correction pipelines. It integrates with existing tools like CONSENT to efficiently process third-generation sequencing (TGS) data at scale using a hybrid MPI + OpenMP parallelization strategy.

Developed to address the limitations of slow correction processes in genomic studies, HyperC enables fast and scalable analysis suitable for population-scale datasets.

Features

Distributed parallelization using MPI across compute nodes.
Intra-node multithreading using OpenMP.
Optimized I/O: compressed FASTQ loading (via Zstd) and balanced PAF partitioning.
Modular design: easily plug in any MSA-based correction tool.
No parallel programming knowledge is required to run it.

Use Case

On real datasets such as NA12878 from the Nanopore WGS Consortium, HyperC significantly reduced correction runtime and scaled efficiently across multiple nodes.

Repository Structure

main.cpp – Entry point for the HyperC workflow.
CMakeLists.txt – Build configuration for the project.
compile.sh, compile_and_run.sh – Scripts for compilation and execution.
consent.h – Interface for integrating CONSENT's correction module (provided as an example correction module).
robin_hood.h – Efficient hash map implementation (required by the code in consent.h).
utils.h – Utility functions for data handling and decompression.
.gitmodules – Contains references to submodules:
- spoa: MSA library based on partial order alignment.
- Complete-Striped-Smith-Waterman-Library: Optimized Smith-Waterman alignment.

Installation

Prerequisites

GCC ≥ 10 with OpenMP support
MPI (e.g. OpenMPI)
CMake ≥ 3.10
Zstandard (libzstd)

Clone the Repository

Make sure to clone the repository with submodules to include required dependencies:

    git clone --recurse-submodules https://github.com/riccardoc95/hipesc.git

Build

    ./compile.sh

Or, to build and run in one step:

    ./compile_and_run.sh

Running the Pipeline

To run the pipeline, you need:

A FASTQ file with long reads
A corresponding PAF file (generated via Minimap2)
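
For readers unfamiliar with the PAF input, the sketch below shows one way to read a single PAF line in C++. It relies only on the standard PAF layout (at least 12 tab-separated columns, of which the first nine describe the query and target intervals). The names `PafRecord`, `parse_paf_line`, and `query_span` are hypothetical and are not taken from the HyperC sources, which may parse PAF differently.

```cpp
#include <cassert>
#include <cstdint>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical record type holding the first nine mandatory PAF columns.
struct PafRecord {
    std::string query_name;
    std::uint64_t query_len = 0, query_start = 0, query_end = 0;
    char strand = '+';
    std::string target_name;
    std::uint64_t target_len = 0, target_start = 0, target_end = 0;
};

// Parse one tab-separated PAF line into `out`.
// Returns false if the line has fewer than 12 columns or a column is malformed.
bool parse_paf_line(const std::string& line, PafRecord& out) {
    std::vector<std::string> f;
    std::stringstream ss(line);
    std::string field;
    while (std::getline(ss, field, '\t')) f.push_back(field);
    if (f.size() < 12 || f[4].size() != 1) return false;
    try {
        PafRecord r;
        r.query_name = f[0];
        r.query_len = std::stoull(f[1]);
        r.query_start = std::stoull(f[2]);
        r.query_end = std::stoull(f[3]);
        r.strand = f[4][0];
        r.target_name = f[5];
        r.target_len = std::stoull(f[6]);
        r.target_start = std::stoull(f[7]);
        r.target_end = std::stoull(f[8]);
        if (r.query_start > r.query_end || r.target_start > r.target_end) return false;
        out = r;
        return true;
    } catch (const std::exception&) {
        return false;  // a numeric column could not be converted
    }
}

// Length of the aligned interval on the query read.
std::uint64_t query_span(const PafRecord& r) {
    assert(r.query_end >= r.query_start);
    return r.query_end - r.query_start;
}
```

Optional tags after the twelfth column are ignored by this sketch.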

Example (SLURM):

    srun -n 4 ./hyperc /path/to/input.fq /path/to/input.paf

Each MPI rank will process a portion of the input, and correction results will be written to separate output files.
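
To illustrate what a balanced split of the input across ranks could look like, here is a minimal sketch of a greedy "heaviest item to the least loaded rank" assignment. This is an assumption made for illustration: the README does not describe the partitioning strategy HyperC actually uses, and `balance_by_weight` is a hypothetical helper. A weight could be, for example, the number of PAF overlaps of a target read.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Assign each work item to one of `num_ranks` ranks so that the summed
// weights per rank are roughly equal. Returns owner[i] = rank of item i.
std::vector<int> balance_by_weight(const std::vector<std::size_t>& weights,
                                   int num_ranks) {
    assert(num_ranks > 0);

    // Visit items from heaviest to lightest (stable for equal weights).
    std::vector<std::size_t> order(weights.size());
    std::iota(order.begin(), order.end(), std::size_t{0});
    std::stable_sort(order.begin(), order.end(),
                     [&](std::size_t a, std::size_t b) {
                         return weights[a] > weights[b];
                     });

    std::vector<std::size_t> load(static_cast<std::size_t>(num_ranks), 0);
    std::vector<int> owner(weights.size(), 0);
    for (std::size_t idx : order) {
        // Pick the rank with the smallest accumulated load so far.
        const std::size_t r = static_cast<std::size_t>(
            std::min_element(load.begin(), load.end()) - load.begin());
        owner[idx] = static_cast<int>(r);
        load[r] += weights[idx];
    }
    return owner;
}
```

In an MPI program, each rank would then keep only the items whose owner equals its own rank number.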

Plug in Your Own Correction Module

To integrate another MSA-based correction module:

Implement a C/C++ function that accepts a target read and its overlapping reads.
Modify consent.h to wrap the new module.
Recompile with compile.sh.

HyperC will handle all job distribution and parallelization transparently.
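
The steps above can be sketched with a toy correction function. The signature below (a target read plus a vector of overlapping reads, returning the corrected read) is an assumption; the actual interface expected by consent.h may differ. The body is a deliberately simple stand-in, a column-wise majority vote over reads assumed to be already aligned to the target, and not a real MSA-based corrector such as CONSENT.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical correction entry point: takes a target read and the reads
// overlapping it, returns the corrected target.
// Placeholder logic: per-column majority vote, assuming every overlap is
// already aligned to the target starting at position 0.
std::string correct_read(const std::string& target,
                         const std::vector<std::string>& overlaps) {
    std::string corrected = target;
    for (std::size_t i = 0; i < target.size(); ++i) {
        std::array<int, 256> votes{};  // zero-initialized vote counters
        ++votes[static_cast<unsigned char>(target[i])];
        for (const std::string& o : overlaps) {
            if (i < o.size()) ++votes[static_cast<unsigned char>(o[i])];
        }
        // Keep the target base unless another base has strictly more votes.
        unsigned char best = static_cast<unsigned char>(target[i]);
        for (int c = 0; c < 256; ++c) {
            if (votes[c] > votes[best]) best = static_cast<unsigned char>(c);
        }
        corrected[i] = static_cast<char>(best);
    }
    assert(corrected.size() == target.size());
    return corrected;
}
```

A real module would replace the voting loop with calls into its own alignment and consensus code while keeping the same "target plus overlaps in, corrected read out" shape.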

License

MIT License

🔗 Reference

If you use HyperC in your research, please cite:

Ceccaroni, R., Di Rocco, L., Ferraro Petrillo, U., & Brutti, P. (2025). A Distributed Workflow for Long Reads Self-correction. In S. Caino-Lores, D. Zeinalipour, T. D. Doudali, D. E. Singh, G. E. M. Garzón, L. Sousa, … S. Neuwirth (Eds.), Euro-Par 2024: Parallel Processing Workshops (pp. 105–116). Cham: Springer Nature Switzerland.

Owner

Name: Riccardo Ceccaroni
Login: riccardoc95
Kind: user

Repositories: 1
Profile: https://github.com/riccardoc95

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software, please cite the following paper:"
title: "A Distributed Workflow for Long Reads Self-correction"
authors:
  - family-names: Ceccaroni
    given-names: Riccardo
  - family-names: Di Rocco
    given-names: Lorenzo
  - family-names: Ferraro Petrillo
    given-names: Umberto
  - family-names: Brutti
    given-names: Pierpaolo
date-released: 2025-08-01
doi: "10.1007/978-3-031-90203-1_10"