Science Score: 54.0%
This score indicates how likely this project is to be science-related based on various indicators:
-
✓CITATION.cff file
Found CITATION.cff file -
✓codemeta.json file
Found codemeta.json file -
✓.zenodo.json file
Found .zenodo.json file -
○DOI references
-
✓Academic publication links
Links to: zenodo.org -
○Academic email domains
-
○Institutional organization owner
-
○JOSS paper metadata
-
○Scientific vocabulary similarity
Low similarity (12.8%) to scientific vocabulary
Repository
A kernel density estimation based peak caller
Basic Info
- Host: GitHub
- Owner: settylab
- License: gpl-3.0
- Language: Python
- Default Branch: main
- Size: 294 KB
Statistics
- Stars: 0
- Watchers: 2
- Forks: 0
- Open Issues: 0
- Releases: 3
Metadata Files
README.md
kdpeak
kdpeak is a Python package designed to identifying genomic peaks from genomic reads in bed format using kernel density estimation (KDE).
Installation
Install via PyPI with:
bash
pip install kdpeak
Alternatively, install directly from GitHub:
bash
pip install git+https://github.com/settylab/kdpeak.git
Using kdpeak
kdpeak allows for processing of a bed file. It applies KDE, pinpoints peaks based on the fragemnt end density, and records the results in an output bed file. The package enables customization of parameters like KDE bandwidth, sequence blacklist, minimum peak size, and others. Designed to deliver a specific fraction-in-peaks (FRiP), it defaults to 0.3.
Elementary usage:
bash
kdpeak reads.bed --frip 0.3 --out peaks.bed
Parameters
bash
usage: kdpeak [-h] [--out OUTPUT_FILE] [--summits-out SUMMITS_FILE] [--density-out DENSITY_FILE] [--chrom-sizes CHROM_SIZES] [-l LEVEL] [--logfile LOGFILE] [--blacklisted-seqs chrN [chrN ...]] [--kde-bw FLOAT] [--min-peak-size INT] [--fraction-in-peaks FLOAT] [--span INT] READS.BED
Positional Argument:
reads.bed- Path to the bed file containing the genomic reads.
Options:
-h, --help- Show this help message and exit.--out output_file.bed- Path to the output file where the results will be saved. Peaks are saved in bed format with the columns: start, end, peak name, AUC (area under the cut density curve where cut-density is in cuts per 100 base pairs). Defaults to peaks.bed.--summits-out summits_file.bed- Path to the output file where the peak summits will be saved. The file will have columns for start, end (start+1), peak name, and summit height (in cuts per 100 base pairs). If nothing is specified the summits will not be saved.--density-out density_file.bw- Path to the output file where the event density will be saved. The event density is the internally computed signal on which the peaks are called based on a cutoff. It will be saved in the bigwig format for visualization in genome browsers like IGV or JBrowse.--chrom-sizes chrom-sizes-file- Chromosome sizes file with the two columns: seqname and size. This file is only needed if --density-out is specified. You can use a script like https://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/fetchChromSizes to fetch the file.-l LEVEL, --log LEVEL- Set the logging level. Options include: DEBUG, INFO, WARNING, ERROR, CRITICAL. Default is INFO.--logfile LOGFILE- Path to the file to write a detailed log.--blacklisted-seqs chrN [chrN ...]- List of sequences (e.g., chromosomes) to exclude from peak calling. Input as space-separated values.--kde-bw FLOAT- Bandwidth (standard deviation, sigma in base pairs) for the KDE. Increase for larger features to reduce noise. Default is 200.--min-peak-size INT- Minimal size (in base pairs) for a peak to be considered valid. Default is 100.--fraction-in-peaks FLOAT, --frip FLOAT- Expected fraction of total reads to be located in peaks. Default is 0.3.--span INT- Resolution of the analysis in base pairs, determining the granularity of the KDE and peak calling. Default is 10.
Utilizing All Available Options:
bash
kdpeak reads.bed --out peaks.bed --summits-out summits.bed --density-out density.bw --chrom-sizes hg38.chrom.sizes --log DEBUG --logfile debug.log --blacklisted-seqs chr1 chr2 --kde-bw 500 --min-peak-size 50 --frip 0.5 --span 5
BigWig Output for Visualization
kdpeak can export the computed kernel density estimation as BigWig files for visualization in genome browsers:
```bash
Basic usage with BigWig output
kdpeak reads.bed --out peaks.bed --density-out density.bw --chrom-sizes hg38.chrom.sizes
Get chromosome sizes file (example for hg38)
fetchChromSizes hg38 > hg38.chrom.sizes ```
The BigWig file contains the event density values (in cuts per 100 base pairs) that are used internally for peak calling. These files can be loaded into genome browsers like IGV or JBrowse to visualize the underlying signal alongside the called peaks.
BigWig Operations Tool (bwops)
The bwops utility provides mathematical operations and regression analysis on BigWig files with comprehensive chromosome filtering capabilities:
Basic Operations
```bash
Add multiple BigWig files
bwops add file1.bw file2.bw file3.bw --out sum.bw --chrom-sizes hg38.chrom.sizes
Add files excluding contigs and mitochondrial DNA
bwops add file1.bw file2.bw --out sum.bw \ --exclude-contigs --blacklisted-seqs chrM \ --chrom-sizes hg38.chrom.sizes
Multiply BigWig files (convolution) for human autosomes only
bwops multiply signal1.bw signal2.bw --out product.bw \ --chromosome-pattern "chr[0-9]+$" \ --chrom-sizes hg38.chrom.sizes
Output to different formats
bwops add file1.bw file2.bw --out results.csv --format csv ```
Regression Analysis
The bwops regress command provides flexible regression analysis with support for complex file paths and easy-to-read formulas:
```bash
Simple regression with auto-generated formula (target ~ a + b)
bwops regress --target "signal.bw" \ --predictors "control.bw" "treatment.bw" \ --out-prediction predictions.bw \ --chrom-sizes hg38.chrom.sizes
Named variables with custom formula for clarity
bwops regress --target "response=ChIPsignal.bw" \ --predictors "input=inputcontrol.bw" "dnase=DNase_signal.bw" \ --formula "response ~ input + dnase + input*dnase" \ --out-prediction predictions.bw \ --out-residuals residuals.bw \ --out-stats stats.json \ --chrom-sizes hg38.chrom.sizes
Complex file paths with spaces (no parsing issues)
bwops regress --target "signal=/path/with spaces/signal file.bw" \ --predictors "ctrl=/data/control samples/ctrl.bw" \ --formula "signal ~ ctrl" \ --format csv --out results.csv
Regression excluding problematic sequences
bwops regress --target "target.bw" --predictors "predictor.bw" \ --exclude-contigs --blacklisted-seqs chrM chrY \ --chromosome-pattern "chr[0-9XY]+$" \ --out-prediction predictions.bw \ --chrom-sizes hg38.chrom.sizes
Logistic regression with mixed naming
bwops regress --target "binarytarget.bw" \ --predictors "pred1=predictor1.bw" "predictor2.bw" \ --formula "target ~ pred1 + a" # a = predictor2.bw \ --type logistic \ --out-prediction logisticpred.bw \ --chrom-sizes hg38.chrom.sizes ```
Variable Naming System
- Explicit naming:
"varname=file.bw"creates variablevarnameforfile.bw - Default naming: Just
"file.bw"usestargetfor target,a,b,c... for predictors - Auto-generated formulas: If no
--formulaprovided, generatestarget ~ a + b + c - Clean formulas: Use variable names instead of file paths for readability
Required Arguments
--target: Target variable file with optional name (target=file.bworfile.bw)--predictors: One or more predictor files with optional names (pred1=file1.bw pred2=file2.bw)
Optional Arguments
--formula: R-style formula using variable names (auto-generated if omitted)--type: Regression type (linearorlogistic, default:linear)--out-prediction: Output file for predictions--out-residuals: Output file for residuals--out-stats: Output file for detailed statistics (JSON format)
Filtering Options
All bwops operations support comprehensive chromosome filtering (consistent with kdpeak):
--blacklisted-seqs: Exclude specific chromosomes (e.g.,chrM chrY)--exclude-contigs: Automatically exclude contigs/scaffolds--chromosome-pattern: Include only chromosomes matching regex pattern--chromosomes: Analyze only specified chromosomes--region: Limit analysis to genomic region (chr:start-end)
Additional Options
--span INT: Resolution in base pairs (default: auto-detect from BigWig files)--format: Output format (bigwig, csv, tsv, bed, json)
The regression analysis operates across all chromosomes simultaneously and prints summary statistics including R², p-values, and fitted coefficients to stdout. It can output predictions, residuals, and detailed statistics to separate files.
Correlation Analysis
The bwops correlate command computes pairwise correlation matrices between multiple BigWig files with support for different correlation methods and analysis scopes:
```bash
Basic pairwise correlation analysis
bwops correlate file1.bw file2.bw file3.bw --out correlations.csv --chrom-sizes hg38.chrom.sizes
Spearman correlation with per-chromosome analysis
bwops correlate *.bw --out spearman_corr.csv \ --method spearman --scope per-chromosome \ --chrom-sizes hg38.chrom.sizes
Correlation with visualization and detailed statistics
bwops correlate ChIP1.bw ChIP2.bw Control1.bw Control2.bw \ --out correlationmatrix.csv \ --method pearson \ --include-stats \ --heatmap correlationheatmap.png \ --scatter-plots scatter_plots/ \ --chrom-sizes hg38.chrom.sizes
Correlation analysis excluding problematic regions
bwops correlate signal1.bw signal2.bw signal3.bw \ --out filtered_correlations.tsv --format tsv \ --exclude-contigs --blacklisted-seqs chrM chrY \ --chromosome-pattern "chr[0-9XY]+$" \ --min-overlap 5000 \ --chrom-sizes hg38.chrom.sizes
Regional correlation analysis
bwops correlate enhancermarks.bw promotermarks.bw \ --out region_correlation.json --format json \ --region chr1:1000000-2000000 \ --method kendall \ --chrom-sizes hg38.chrom.sizes ```
Correlation Methods
- Pearson: Linear correlation coefficient (default) - measures linear relationships
- Spearman: Rank-based correlation - robust to outliers and non-linear monotonic relationships
- Kendall: Tau correlation coefficient - alternative rank-based method, more robust for small samples
Analysis Scope
- Global (default): Compute single correlation matrix using all genomic positions across chromosomes
- Per-chromosome: Compute separate correlation matrices for each chromosome
Required Arguments
- Input files: Two or more BigWig files for correlation analysis
--out: Output file for correlation matrix--chrom-sizes: Chromosome sizes file for BigWig processing
Optional Arguments
--method: Correlation method (pearson,spearman,kendall)--scope: Analysis scope (global,per-chromosome)--format: Output format (csv,tsv,json)--min-overlap: Minimum number of overlapping non-zero values required (default: 1000)--include-stats: Include additional statistics (mean, std, coverage) in output--heatmap: Generate correlation heatmap plot (requires matplotlib)--scatter-plots: Directory for pairwise scatter plots
Output Formats
The correlation analysis outputs a symmetric matrix showing correlations between all file pairs:
CSV/TSV format:
csv
File1,File2,File3
File1,1.000,0.847,0.623
File2,0.847,1.000,0.751
File3,0.623,0.751,1.000
JSON format (with --include-stats):
json
{
"correlations": {
"File1_vs_File2": 0.847,
"File1_vs_File3": 0.623,
"File2_vs_File3": 0.751
},
"statistics": {
"File1": {"mean": 2.34, "std": 1.45, "coverage": 0.23},
"File2": {"mean": 1.89, "std": 1.12, "coverage": 0.31}
}
}
Per-chromosome analysis includes separate correlation matrices for each chromosome, helping identify chromosome-specific correlation patterns.
Disclaimer
kdpeak, being in its Alpha stage, encourages usage with care. We warmly welcome users to report any issues experienced during utilization. Together, we can enhance kdpeak for a better genomic analysis experience.
Owner
- Name: Setty Lab @ Fred Hutch
- Login: settylab
- Kind: organization
- Location: United States of America
- Website: setty-lab.org
- Repositories: 4
- Profile: https://github.com/settylab
We are a multi-disciplinary research group investigation differentiation trajectories using single-cell data
Citation (CITATION.cff)
cff-version: 1.1.0
message: "If you use this software, please cite it as below."
authors:
- family-names: "Otto"
given-names: "Dominik J."
orcid: "https://orcid.org/0000-0002-6116-053X"
- family-names: "Setty"
given-names: "Manu"
orcid: "https://orcid.org/0000-0002-0344-2627"
title: "settylab/kdpeak: v0.3.0"
version: v0.3.0
date-released: 2025-06-25
GitHub Events
Total
- Push event: 15
Last Year
- Push event: 15
Packages
- Total packages: 1
-
Total downloads:
- pypi 29 last-month
- Total dependent packages: 0
- Total dependent repositories: 0
- Total versions: 3
- Total maintainers: 1
pypi.org: kdpeak
A tool to identify genomic peaks based on kernel density estimation.
- Homepage: https://github.com/settylab/kdpeak
- Documentation: https://kdpeak.readthedocs.io/
- License: GNU General Public License v3 (GPLv3)
-
Latest release: 0.2.0
published almost 3 years ago
Rankings
Maintainers (1)
Dependencies
- KDEpy >=1.0.6
- numpy >=1.20.0
- pandas >=1.3.0