kdpeak

A kernel density estimation based peak caller

https://github.com/settylab/kdpeak

Last synced: 11 months ago · JSON representation ·

Repository

A kernel density estimation based peak caller

Basic Info

Host: GitHub
Owner: settylab
License: gpl-3.0
Language: Python
Default Branch: main
Size: 294 KB

Statistics

Stars: 0
Watchers: 2
Forks: 0
Open Issues: 0
Releases: 3

Created almost 3 years ago · Last pushed about 1 year ago

Metadata Files

Readme License Citation

kdpeak

kdpeak is a Python package designed to identifying genomic peaks from genomic reads in bed format using kernel density estimation (KDE).

Installation

Install via PyPI with:

bash pip install kdpeak

Alternatively, install directly from GitHub:

bash pip install git+https://github.com/settylab/kdpeak.git

Using kdpeak

kdpeak allows for processing of a bed file. It applies KDE, pinpoints peaks based on the fragemnt end density, and records the results in an output bed file. The package enables customization of parameters like KDE bandwidth, sequence blacklist, minimum peak size, and others. Designed to deliver a specific fraction-in-peaks (FRiP), it defaults to 0.3.

Elementary usage:

bash kdpeak reads.bed --frip 0.3 --out peaks.bed

Parameters

bash usage: kdpeak [-h] [--out OUTPUT_FILE] [--summits-out SUMMITS_FILE] [--density-out DENSITY_FILE] [--chrom-sizes CHROM_SIZES] [-l LEVEL] [--logfile LOGFILE] [--blacklisted-seqs chrN [chrN ...]] [--kde-bw FLOAT] [--min-peak-size INT] [--fraction-in-peaks FLOAT] [--span INT] READS.BED

Positional Argument:

reads.bed - Path to the bed file containing the genomic reads.

Options:

-h, --help - Show this help message and exit.
--out output_file.bed - Path to the output file where the results will be saved. Peaks are saved in bed format with the columns: start, end, peak name, AUC (area under the cut density curve where cut-density is in cuts per 100 base pairs). Defaults to peaks.bed.
--summits-out summits_file.bed - Path to the output file where the peak summits will be saved. The file will have columns for start, end (start+1), peak name, and summit height (in cuts per 100 base pairs). If nothing is specified the summits will not be saved.
--density-out density_file.bw - Path to the output file where the event density will be saved. The event density is the internally computed signal on which the peaks are called based on a cutoff. It will be saved in the bigwig format for visualization in genome browsers like IGV or JBrowse.
--chrom-sizes chrom-sizes-file - Chromosome sizes file with the two columns: seqname and size. This file is only needed if --density-out is specified. You can use a script like https://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/fetchChromSizes to fetch the file.
-l LEVEL, --log LEVEL - Set the logging level. Options include: DEBUG, INFO, WARNING, ERROR, CRITICAL. Default is INFO.
--logfile LOGFILE - Path to the file to write a detailed log.
--blacklisted-seqs chrN [chrN ...] - List of sequences (e.g., chromosomes) to exclude from peak calling. Input as space-separated values.
--kde-bw FLOAT - Bandwidth (standard deviation, sigma in base pairs) for the KDE. Increase for larger features to reduce noise. Default is 200.
--min-peak-size INT - Minimal size (in base pairs) for a peak to be considered valid. Default is 100.
--fraction-in-peaks FLOAT, --frip FLOAT - Expected fraction of total reads to be located in peaks. Default is 0.3.
--span INT - Resolution of the analysis in base pairs, determining the granularity of the KDE and peak calling. Default is 10.

Utilizing All Available Options:

bash kdpeak reads.bed --out peaks.bed --summits-out summits.bed --density-out density.bw --chrom-sizes hg38.chrom.sizes --log DEBUG --logfile debug.log --blacklisted-seqs chr1 chr2 --kde-bw 500 --min-peak-size 50 --frip 0.5 --span 5

BigWig Output for Visualization

kdpeak can export the computed kernel density estimation as BigWig files for visualization in genome browsers:

```bash

Basic usage with BigWig output

kdpeak reads.bed --out peaks.bed --density-out density.bw --chrom-sizes hg38.chrom.sizes

Get chromosome sizes file (example for hg38)

fetchChromSizes hg38 > hg38.chrom.sizes ```

The BigWig file contains the event density values (in cuts per 100 base pairs) that are used internally for peak calling. These files can be loaded into genome browsers like IGV or JBrowse to visualize the underlying signal alongside the called peaks.

BigWig Operations Tool (bwops)

The bwops utility provides mathematical operations and regression analysis on BigWig files with comprehensive chromosome filtering capabilities:

Basic Operations

```bash

Add multiple BigWig files

bwops add file1.bw file2.bw file3.bw --out sum.bw --chrom-sizes hg38.chrom.sizes

Add files excluding contigs and mitochondrial DNA

bwops add file1.bw file2.bw --out sum.bw \ --exclude-contigs --blacklisted-seqs chrM \ --chrom-sizes hg38.chrom.sizes

Multiply BigWig files (convolution) for human autosomes only

bwops multiply signal1.bw signal2.bw --out product.bw \ --chromosome-pattern "chr[0-9]+$" \ --chrom-sizes hg38.chrom.sizes

Output to different formats

bwops add file1.bw file2.bw --out results.csv --format csv ```

Regression Analysis

The bwops regress command provides flexible regression analysis with support for complex file paths and easy-to-read formulas:

```bash

Simple regression with auto-generated formula (target ~ a + b)

bwops regress --target "signal.bw" \ --predictors "control.bw" "treatment.bw" \ --out-prediction predictions.bw \ --chrom-sizes hg38.chrom.sizes

Named variables with custom formula for clarity

bwops regress --target "response=ChIPsignal.bw" \ --predictors "input=inputcontrol.bw" "dnase=DNase_signal.bw" \ --formula "response ~ input + dnase + input*dnase" \ --out-prediction predictions.bw \ --out-residuals residuals.bw \ --out-stats stats.json \ --chrom-sizes hg38.chrom.sizes

Complex file paths with spaces (no parsing issues)

bwops regress --target "signal=/path/with spaces/signal file.bw" \ --predictors "ctrl=/data/control samples/ctrl.bw" \ --formula "signal ~ ctrl" \ --format csv --out results.csv

Regression excluding problematic sequences

bwops regress --target "target.bw" --predictors "predictor.bw" \ --exclude-contigs --blacklisted-seqs chrM chrY \ --chromosome-pattern "chr[0-9XY]+$" \ --out-prediction predictions.bw \ --chrom-sizes hg38.chrom.sizes

Logistic regression with mixed naming

bwops regress --target "binarytarget.bw" \ --predictors "pred1=predictor1.bw" "predictor2.bw" \ --formula "target ~ pred1 + a" # a = predictor2.bw \ --type logistic \ --out-prediction logisticpred.bw \ --chrom-sizes hg38.chrom.sizes ```

Variable Naming System

Explicit naming: "varname=file.bw" creates variable varname for file.bw
Default naming: Just "file.bw" uses target for target, a, b, c... for predictors
Auto-generated formulas: If no --formula provided, generates target ~ a + b + c
Clean formulas: Use variable names instead of file paths for readability

Required Arguments

--target: Target variable file with optional name (target=file.bw or file.bw)
--predictors: One or more predictor files with optional names (pred1=file1.bw pred2=file2.bw)

Optional Arguments

--formula: R-style formula using variable names (auto-generated if omitted)
--type: Regression type (linear or logistic, default: linear)
--out-prediction: Output file for predictions
--out-residuals: Output file for residuals
--out-stats: Output file for detailed statistics (JSON format)

Filtering Options

All bwops operations support comprehensive chromosome filtering (consistent with kdpeak):

--blacklisted-seqs: Exclude specific chromosomes (e.g., chrM chrY)
--exclude-contigs: Automatically exclude contigs/scaffolds
--chromosome-pattern: Include only chromosomes matching regex pattern
--chromosomes: Analyze only specified chromosomes
--region: Limit analysis to genomic region (chr:start-end)

Additional Options

--span INT: Resolution in base pairs (default: auto-detect from BigWig files)
--format: Output format (bigwig, csv, tsv, bed, json)

The regression analysis operates across all chromosomes simultaneously and prints summary statistics including R², p-values, and fitted coefficients to stdout. It can output predictions, residuals, and detailed statistics to separate files.

Correlation Analysis

The bwops correlate command computes pairwise correlation matrices between multiple BigWig files with support for different correlation methods and analysis scopes:

```bash

Basic pairwise correlation analysis

bwops correlate file1.bw file2.bw file3.bw --out correlations.csv --chrom-sizes hg38.chrom.sizes

Spearman correlation with per-chromosome analysis

bwops correlate *.bw --out spearman_corr.csv \ --method spearman --scope per-chromosome \ --chrom-sizes hg38.chrom.sizes

Correlation with visualization and detailed statistics

bwops correlate ChIP1.bw ChIP2.bw Control1.bw Control2.bw \ --out correlationmatrix.csv \ --method pearson \ --include-stats \ --heatmap correlationheatmap.png \ --scatter-plots scatter_plots/ \ --chrom-sizes hg38.chrom.sizes

Correlation analysis excluding problematic regions

bwops correlate signal1.bw signal2.bw signal3.bw \ --out filtered_correlations.tsv --format tsv \ --exclude-contigs --blacklisted-seqs chrM chrY \ --chromosome-pattern "chr[0-9XY]+$" \ --min-overlap 5000 \ --chrom-sizes hg38.chrom.sizes

Regional correlation analysis

bwops correlate enhancermarks.bw promotermarks.bw \ --out region_correlation.json --format json \ --region chr1:1000000-2000000 \ --method kendall \ --chrom-sizes hg38.chrom.sizes ```

Correlation Methods

Pearson: Linear correlation coefficient (default) - measures linear relationships
Spearman: Rank-based correlation - robust to outliers and non-linear monotonic relationships
Kendall: Tau correlation coefficient - alternative rank-based method, more robust for small samples

Analysis Scope

Global (default): Compute single correlation matrix using all genomic positions across chromosomes
Per-chromosome: Compute separate correlation matrices for each chromosome

Required Arguments

Input files: Two or more BigWig files for correlation analysis
--out: Output file for correlation matrix
--chrom-sizes: Chromosome sizes file for BigWig processing

Optional Arguments

--method: Correlation method (pearson, spearman, kendall)
--scope: Analysis scope (global, per-chromosome)
--format: Output format (csv, tsv, json)
--min-overlap: Minimum number of overlapping non-zero values required (default: 1000)
--include-stats: Include additional statistics (mean, std, coverage) in output
--heatmap: Generate correlation heatmap plot (requires matplotlib)
--scatter-plots: Directory for pairwise scatter plots

Output Formats

The correlation analysis outputs a symmetric matrix showing correlations between all file pairs:

CSV/TSV format: csv File1,File2,File3 File1,1.000,0.847,0.623 File2,0.847,1.000,0.751 File3,0.623,0.751,1.000

JSON format (with --include-stats): json { "correlations": { "File1_vs_File2": 0.847, "File1_vs_File3": 0.623, "File2_vs_File3": 0.751 }, "statistics": { "File1": {"mean": 2.34, "std": 1.45, "coverage": 0.23}, "File2": {"mean": 1.89, "std": 1.12, "coverage": 0.31} } }

Per-chromosome analysis includes separate correlation matrices for each chromosome, helping identify chromosome-specific correlation patterns.

Disclaimer

kdpeak, being in its Alpha stage, encourages usage with care. We warmly welcome users to report any issues experienced during utilization. Together, we can enhance kdpeak for a better genomic analysis experience.

Owner

Name: Setty Lab @ Fred Hutch
Login: settylab
Kind: organization
Location: United States of America

Website: setty-lab.org
Repositories: 4
Profile: https://github.com/settylab

We are a multi-disciplinary research group investigation differentiation trajectories using single-cell data

Citation (CITATION.cff)

cff-version: 1.1.0
message: "If you use this software, please cite it as below."
authors:
  - family-names: "Otto"
    given-names: "Dominik J."
    orcid: "https://orcid.org/0000-0002-6116-053X"
  - family-names: "Setty"
    given-names: "Manu"
    orcid: "https://orcid.org/0000-0002-0344-2627"
title: "settylab/kdpeak: v0.3.0"
version: v0.3.0
date-released: 2025-06-25

GitHub Events

Total

Push event: 15

Last Year

Push event: 15

Packages

Total packages: 1
Total downloads:
- pypi 29 last-month

Total dependent packages: 0
Total dependent repositories: 0
Total versions: 3
Total maintainers: 1

pypi.org: kdpeak

A tool to identify genomic peaks based on kernel density estimation.

Homepage: https://github.com/settylab/kdpeak
Documentation: https://kdpeak.readthedocs.io/
License: GNU General Public License v3 (GPLv3)
Latest release: 0.2.0
published almost 3 years ago

Versions: 3
Dependent Packages: 0
Dependent Repositories: 0
Downloads: 29 Last month

Rankings

Dependent packages count: 7.6%

Forks count: 30.4%

Average: 36.7%

Stargazers count: 39.3%

Dependent repos count: 69.5%

Maintainers (1)

katosh

Last synced: 11 months ago

kdpeak

Science Score: 54.0%

Repository

Basic Info

Statistics

Metadata Files

README.md

kdpeak

Installation

Using kdpeak

Parameters

Utilizing All Available Options:

BigWig Output for Visualization

Basic usage with BigWig output

Get chromosome sizes file (example for hg38)

BigWig Operations Tool (bwops)

Basic Operations

Add multiple BigWig files

Add files excluding contigs and mitochondrial DNA

Multiply BigWig files (convolution) for human autosomes only

Output to different formats

Regression Analysis

Simple regression with auto-generated formula (target ~ a + b)

Named variables with custom formula for clarity

Complex file paths with spaces (no parsing issues)

Regression excluding problematic sequences

Logistic regression with mixed naming

Variable Naming System

Required Arguments

Optional Arguments

Filtering Options

Additional Options

Correlation Analysis

Basic pairwise correlation analysis

Spearman correlation with per-chromosome analysis

Correlation with visualization and detailed statistics

Correlation analysis excluding problematic regions

Regional correlation analysis

Correlation Methods

Analysis Scope

Required Arguments

Optional Arguments

Output Formats

Disclaimer

Owner

Citation (CITATION.cff)

GitHub Events

Total

Last Year

Packages

pypi.org: kdpeak

Rankings

Maintainers (1)

Dependencies