longreadsum
Science Score: 67.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: found CITATION.cff file
- ✓ codemeta.json file: found codemeta.json file
- ✓ .zenodo.json file: found .zenodo.json file
- ✓ DOI references: found 1 DOI reference(s) in README
- ✓ Academic publication links: links to ncbi.nlm.nih.gov
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (15.4%) to scientific vocabulary
Repository
Basic Info
- Host: GitHub
- Owner: WGLab
- License: MIT
- Language: C++
- Default Branch: main
- Size: 2.48 MB
Statistics
- Stars: 22
- Watchers: 7
- Forks: 3
- Open Issues: 5
- Releases: 8
Metadata Files
README.md
LongReadSum: A fast and flexible QC tool for long read sequencing data
LongReadSum supports FASTA, FASTQ, BAM, FAST5, and sequencing_summary.txt file formats for quick generation of QC data in HTML and text format.
README Contents
- Installation using Anaconda (recommended)
- Installation using Docker
- Building from source
- General usage for common filetypes:
- Revision history
- Getting help
- Citing LongReadSum
Installation using Anaconda
First, install Anaconda.
Next, create a new environment. This installation has been tested with Python 3.10 on 64-bit Linux.
conda create -n longreadsum python=3.9
conda activate longreadsum
LongReadSum and its dependencies can then be installed using the following command:
conda install -c wglab -c conda-forge -c jannessp -c bioconda longreadsum=1.5.0
Installation using Docker
First, install Docker. Pull the latest image from Docker hub, which contains the latest longreadsum release and its dependencies.
docker pull genomicslab/longreadsum
Running
On Windows:
docker run -v C:/Users/.../DataDirectory:/mnt/ -it genomicslab/longreadsum bam -i /mnt/input.bam -o /mnt/output
Note that the -v command is required for Docker to find the input file. Use a directory under C:/Users/ to ensure volume files are mounted correctly. In the above example, the local directory C:/Users/.../DataDirectory containing the input file input.bam is mapped to a directory /mnt/ in the Docker container. Thus, the input file and output directory arguments are relative to the /mnt/ directory, but the output files will also be saved locally in C:/Users/.../DataDirectory under the specified subdirectory output.
Building from source
To get the latest updates in longreadsum, you can build from source. First install Anaconda. Then follow the instructions below to install LongReadSum and its dependencies:
```
# Pull the latest updates
git clone https://github.com/WGLab/LongReadSum
cd LongReadSum

# Create the longreadsum environment, install dependencies, and activate it
conda env create -f environment.yml
conda activate longreadsum

# Build the program
make
```
Running
Activate the conda environment and then run with arguments:
conda activate longreadsum
longreadsum <FILETYPE> [arguments]
General Usage
Specify the filetype followed by parameters:
longreadsum <FILETYPE> -i $INPUT_FILE -o $OUTPUT_DIRECTORY
Common parameters
To see all parameters for a filetype, run:
longreadsum <FILETYPE> --help
This section describes parameters common to all filetypes:
| Parameter | Description | Default |
| --- | --- | --- |
| -i, --input | A single input filepath | |
| -I, --inputs | Multiple comma-separated input filepaths | |
| -P, --pattern | Use pattern matching (*) to specify multiple input files. Enclose the pattern in double quotes. | |
| -g, --log | Log file path | log_output.log |
| -G, --log-level | Logging level (1: DEBUG, 2: INFO, 3: WARNING, 4: ERROR, 5: CRITICAL) | 2 |
| -o, --outputfolder | Output directory | output_longreadsum |
| -t, --threads | The number of threads used | 1 |
| -Q, --outprefix | Output file prefix | QC_ |
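The numeric `--log-level` values in the table correspond to the standard logging severity levels. As an illustration only (this is an assumed mapping, not LongReadSum's actual code), they line up with Python's `logging` constants like so:

```python
import logging

# Assumed mapping of the numeric --log-level values onto the standard
# logging severity levels (1: DEBUG ... 5: CRITICAL); illustrative only.
LOG_LEVELS = {
    1: logging.DEBUG,
    2: logging.INFO,      # the documented default
    3: logging.WARNING,
    4: logging.ERROR,
    5: logging.CRITICAL,
}

def configure_logging(level_arg: int = 2, log_path: str = "log_output.log") -> None:
    # Fall back to INFO for out-of-range values.
    logging.basicConfig(filename=log_path,
                        level=LOG_LEVELS.get(level_arg, logging.INFO))
```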
WGS BAM
This section describes how to generate QC reports for BAM files from whole-genome sequencing (WGS) with alignments to a linear reference genome such as GRCh38 (data shown is HG002 sequenced with ONT Kit V14 Promethion R10.4.1 from https://labs.epi2me.io/askenazi-kit14-2022-12/)
General usage
longreadsum bam -i $INPUT_FILE -o $OUTPUT_DIRECTORY
BAM with base modifications
This section describes how to generate QC reports for BAM files with MM, ML base modification tags (data shown is HG002 sequenced with ONT MinION R9.4.1 from https://labs.epi2me.io/gm24385-5mc/)
Parameters
| Parameter | Description | Default |
| --- | --- | --- |
| --mod | Run base modification analysis on the BAM file | False |
| --modprob | Base modification filtering threshold. At or above this value a base is considered modified; below it, unmodified. | 0.8 |
| --ref | The reference genome FASTA file to use for identifying CpG sites (optional) | |
General usage
longreadsum bam -i $INPUT_FILE -o $OUTPUT_DIRECTORY --mod --modprob 0.8 --ref $REF_GENOME
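The effect of the `--modprob` threshold can be pictured as follows. Per the SAM tags specification, BAM ML tags encode per-base modification probabilities as integers 0–255, each representing a 1/256-wide probability bin. This sketch is an illustration of the thresholding idea, not LongReadSum's actual implementation:

```python
def ml_to_probability(ml_value: int) -> float:
    # SAM-spec ML values bin the probability space into 256 intervals;
    # value n represents [n/256, (n+1)/256). Use the bin midpoint.
    return (ml_value + 0.5) / 256.0

def classify_base(ml_value: int, modprob: float = 0.8) -> str:
    # At or above the threshold the base call is counted as modified.
    return "modified" if ml_to_probability(ml_value) >= modprob else "unmodified"
```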
RRMS BAM
This section describes how to generate QC reports for ONT RRMS BAM files and associated CSVs (data shown is HG002 RRMS using ONT R9.4.1).
QC reports are generated separately for accepted and rejected reads.
Parameters
| Parameter | Description | Default |
| --- | --- | --- |
| -c, --csv | CSV file containing read IDs to extract from the BAM file* | |
The CSV file should contain a read_id column with the read IDs in the BAM
file, and a decision column with the accepted/rejected status of the read.
Accepted reads will have stop_receiving in the decision column, while rejected
reads will have unblock:
batch_time,read_number,channel,num_samples,read_id,sequence_length,decision
1675186897.6034577,93,4,4011,f943c811-3f97-4971-8aed-bb9f36ffb8d1,361,unblock
1675186897.7544408,80,68,4025,fab0c19d-8085-454c-bfb7-c375bbe237a1,462,unblock
1675186897.7544408,93,127,4028,5285e0ba-86c0-4b5d-ba27-5783acad6105,438,unblock
1675186897.7544408,103,156,4023,65d8befa-eec0-4496-bf2b-aa1a84e6dc5e,362,stop_receiving
...
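The decision column above drives which reads go into each report. A minimal sketch of that interpretation (illustrative only, not LongReadSum's code), using two rows from the sample CSV:

```python
import csv
import io

def split_by_decision(csv_text: str):
    """Split RRMS read IDs into accepted/rejected lists by the 'decision' column."""
    accepted, rejected = [], []
    for row in csv.DictReader(io.StringIO(csv_text)):
        if row["decision"] == "stop_receiving":
            accepted.append(row["read_id"])
        else:  # "unblock"
            rejected.append(row["read_id"])
    return accepted, rejected

# Two rows taken from the sample CSV above.
sample = """batch_time,read_number,channel,num_samples,read_id,sequence_length,decision
1675186897.6034577,93,4,4011,f943c811-3f97-4971-8aed-bb9f36ffb8d1,361,unblock
1675186897.7544408,103,156,4023,65d8befa-eec0-4496-bf2b-aa1a84e6dc5e,362,stop_receiving
"""
accepted, rejected = split_by_decision(sample)
```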
General usage
longreadsum rrms -i $INPUT_FILE -o $OUTPUT_DIRECTORY -c $RRMS_CSV
RNA-Seq BAM
This section describes how to generate QC reports for TIN (transcript integrity number) scores from RNA-Seq BAM files (data shown is Adult GTEx v9 long-read RNA-seq data sequenced with ONT cDNA-PCR protocol from https://www.gtexportal.org/home/downloads/adult-gtex/longreaddata).
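TIN measures how uniformly read coverage is distributed across a transcript: with per-base coverage c_1..c_k over k positions, TIN = 100 · exp(H) / k, where H is the Shannon entropy (natural log) of the normalized coverage. A minimal sketch following the definition of Wang et al. (2016); this is an illustration, not LongReadSum's implementation:

```python
import math

def tin_score(coverage):
    # TIN = 100 * exp(H) / k, where H is the Shannon entropy (natural log)
    # of the normalized per-base coverage over the k transcript positions.
    k = len(coverage)
    total = sum(coverage)
    if total == 0:
        return 0.0
    entropy = -sum((c / total) * math.log(c / total) for c in coverage if c > 0)
    return 100.0 * math.exp(entropy) / k
```

Perfectly uniform coverage yields a TIN of 100, while coverage concentrated at a few positions (e.g. a degraded 3'-biased transcript) drives the score toward 0.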
Outputs
A TSV file with scores for each transcript:
geneID chrom tx_start tx_end TIN
ENST00000456328.2 chr1 11868 14409 2.69449577083296
ENST00000450305.2 chr1 12009 13670 0.00000000000000
ENST00000488147.2 chr1 14695 24886 94.06518975035769
ENST00000619216.1 chr1 17368 17436 0.00000000000000
ENST00000473358.1 chr1 29553 31097 0.00000000000000
...
A TSV file with TIN score summary statistics:
Bam_file TIN(mean) TIN(median) TIN(stddev)
/mnt/isilon/wang_lab/perdomoj/data/GTEX/GTEX-14BMU-0526-SM-5CA2F_rep.FAK93376.bam 67.06832655372376 74.24996965188242 26.03788585287367
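Statistics of the kind reported in the summary TSV can be computed from the per-transcript column. A sketch using Python's statistics module on the five excerpt values above (these will not match the full-file summary, which covers all transcripts):

```python
import statistics

# TIN values from the per-transcript TSV excerpt above (truncated).
tins = [2.69449577083296, 0.0, 94.06518975035769, 0.0, 0.0]

mean_tin = statistics.mean(tins)
median_tin = statistics.median(tins)
stddev_tin = statistics.stdev(tins)  # sample standard deviation
```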
A summary table is also shown in the HTML report.
Parameters
| Parameter | Description | Default |
| --- | --- | --- |
| --genebed | Gene BED12 file required for calculating TIN scores | |
| --sample-size | Sample size for TIN calculation | 100 |
| --min-coverage | Minimum coverage for TIN calculation | 10 |
General usage
longreadsum bam -i $INPUT_FILE -o $OUTPUT_DIRECTORY --genebed $BED_FILE --min-coverage <COVERAGE> --sample-size <SIZE>
An example HTML report is available for download (data is Adult GTEx v9 long-read RNA-seq data sequenced with the ONT cDNA-PCR protocol from https://www.gtexportal.org/home/downloads/adult-gtex/longreaddata).
PacBio unaligned BAM
This section describes how to generate QC reports for PacBio BAM files without alignments (data shown is HG002 sequenced with PacBio Revio HiFi long reads obtained from https://www.pacb.com/connect/datasets/#WGS-datasets).
General usage
longreadsum bam -i $INPUT_FILE -o $OUTPUT_DIRECTORY
ONT POD5
This section describes how to generate QC reports for ONT POD5 (signal) files and their corresponding basecalled BAM files (data shown is HG002 using ONT R10.4.1 and LSK114 downloaded from the tutorial https://github.com/epi2me-labs/wf-basecalling).
[!NOTE] This requires basecalled BAM files generated with the move table output. For dorado, for example, the parameter is `--emit-moves`.
Parameters
[!NOTE] The interactive signal-base correspondence plots in the HTML report use a lot of memory (RAM) which can make your web browser slow. Thus by default, we randomly sample only a few reads, and the user can specify a list of read IDs as well (e.g. from a specific region of interest).
| Parameter | Description | Default |
| --- | --- | --- |
| -b, --basecalls | The basecalled BAM file to use for signal extraction | |
| -r, --read_ids | A comma-separated list of read IDs to extract from the file | |
| -R, --read-count | Set the number of reads to randomly sample from the file | 3 |
General usage
```
# Individual file:
longreadsum pod5 -i $INPUT_FILE -o $OUTPUT_DIRECTORY --basecalls $INPUT_BAM [--read-count <N>]

# Directory:
longreadsum pod5 -P "$INPUT_DIRECTORY/*.pod5" -o $OUTPUT_DIRECTORY --basecalls $INPUT_BAM [--read-count <N>]
```
ONT FAST5
Signal QC
This section describes how to generate a signal and basecalling QC report from ONT FAST5 files with signal and basecall information (data shown is HG002 sequenced with ONT MinION R9.4.1 from https://labs.epi2me.io/gm24385-5mc/).
Parameters
[!NOTE] The interactive signal-base correspondence plots in the HTML report use a lot of memory (RAM) which can make your web browser slow. Thus by default, we randomly sample only a few reads, and the user can specify a list of read IDs as well (e.g. from a specific region of interest).
| Parameter | Description | Default | | --- | --- | --- | | -r, --read_ids | A comma-separated list of read IDs to extract from the file | -R, --read-count | Set the number of reads to randomly sample from the file | 3
General usage
```
# Individual file:
longreadsum f5s -i $INPUT_FILE -o $OUTPUT_DIRECTORY [--read-count <N>]

# Directory:
longreadsum f5s -P "$INPUT_DIRECTORY/*.fast5" -o $OUTPUT_DIRECTORY [--read-count <N>]
```
Sequence QC
This section describes how to generate QC reports for sequence data from ONT FAST5 files (data shown is HG002 sequenced with ONT MinION R9.4.1 from https://labs.epi2me.io/gm24385-5mc/)
General usage
longreadsum f5 -i $INPUT_FILE -o $OUTPUT_DIRECTORY
Basecall summary
This section describes how to generate QC reports for ONT basecall summary (sequencing_summary.txt) files (data shown is HG002 sequenced with ONT PromethION R10.4 from https://labs.epi2me.io/gm24385_q20_2021.10/, filename `gm24385_q20_2021.10/analysis/20210805_1713_5C_PAH79257_0e41e938/guppy_5.0.15_sup/sequencing_summary.txt`).
General usage
longreadsum seqtxt -i $INPUT_FILE -o $OUTPUT_DIRECTORY
FASTQ
This section describes how to generate QC reports for FASTQ files (data shown is HG002 ONT 2D from GIAB FTP index)
General usage
longreadsum fq -i $INPUT_FILE -o $OUTPUT_DIRECTORY
FASTA
This section describes how to generate QC reports for FASTA files (data shown is HG002 ONT 2D from GIAB FTP index).
General usage
longreadsum fa -i $INPUT_FILE -o $OUTPUT_DIRECTORY
Revision history
For release history, please visit the Releases page on GitHub.
Getting help
Please post questions and bug reports on the LongReadSum issues page. We will respond to your questions quickly. Your comments are critical for improving our tool and will benefit other users.
Citing LongReadSum
Please cite the article below if you use our tool:
1. Perdomo, J. E., Ahsan, M. U., Liu, Q., Fang, L. & Wang, K. LongReadSum: A fast and flexible quality control and signal summarization tool for long-read sequencing data. Computational and Structural Biotechnology Journal 27, 556-563, doi:10.1016/j.csbj.2025.01.019 (2025).
Owner
- Name: Wang Genomics Lab
- Login: WGLab
- Kind: organization
- Location: Philadelphia, PA
- Website: https://wglab.org
- Repositories: 70
- Profile: https://github.com/WGLab
We develop software tools for genome analysis
Citation (CITATION.cff)
cff-version: 1.2.0
message: "If you use this software, please cite it as below."
preferred-citation:
type: article
authors:
- family-names: "Perdomo"
given-names: "Jonathan Elliot"
orcid: "https://orcid.org/0000-0001-7145-7401"
- family-names: "Ahsan"
given-names: "Mian Umair"
- family-names: "Liu"
given-names: "Qian"
- family-names: "Fang"
given-names: "Li"
- family-names: "Wang"
given-names: "Kai"
title: "LongReadSum: A fast and flexible quality control and signal summarization tool for long-read sequencing data"
doi: "10.1016/j.csbj.2025.01.019"
date-released: 2025-01-24
url: "https://github.com/WGLab/LongReadSum"
journal: "Computational and Structural Biotechnology Journal"
month: 1
start: 556
end: 563
volume: 27
year: 2025
GitHub Events
Total
- Create event: 3
- Issues event: 5
- Release event: 2
- Watch event: 10
- Delete event: 1
- Issue comment event: 9
- Push event: 40
- Pull request event: 4
- Fork event: 1
Last Year
- Create event: 3
- Issues event: 5
- Release event: 2
- Watch event: 10
- Delete event: 1
- Issue comment event: 9
- Push event: 40
- Pull request event: 4
- Fork event: 1
Issues and Pull Requests
Last synced: 6 months ago
All Time
- Total issues: 3
- Total pull requests: 3
- Average time to close issues: 10 months
- Average time to close pull requests: about 7 hours
- Total issue authors: 3
- Total pull request authors: 1
- Average comments per issue: 1.67
- Average comments per pull request: 0.0
- Merged pull requests: 1
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 2
- Pull requests: 3
- Average time to close issues: 28 days
- Average time to close pull requests: about 7 hours
- Issue authors: 2
- Pull request authors: 1
- Average comments per issue: 2.0
- Average comments per pull request: 0.0
- Merged pull requests: 1
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- jonperdomo (4)
- hy09 (1)
- sklages (1)
Pull Request Authors
- jonperdomo (11)
- gchevignon (1)
Top Labels
Issue Labels
Pull Request Labels
Dependencies
- actions/checkout v3 composite
- conda-incubator/setup-miniconda v2 composite
- dsaltares/fetch-gh-release-asset 1.0.0 composite
- continuumio/miniconda3 latest build
- hdf5
- htslib 1.20.*
- numpy
- ont_vbz_hdf_plugin
- plotly
- pod5
- pyarrow
- pytest
- python
- swig