cfdna_gccorrection

https://github.com/kircherlab/cfdna_gccorrection

Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

✓
CITATION.cff file
Found CITATION.cff file
✓
codemeta.json file
Found codemeta.json file
✓
.zenodo.json file
Found .zenodo.json file
○
DOI references
○
Academic publication links
○
Academic email domains
○
Institutional organization owner
○
JOSS paper metadata
○
Scientific vocabulary similarity
Low similarity (13.4%) to scientific vocabulary

Last synced: 6 months ago · JSON representation ·

Repository

Basic Info

Host: GitHub
Owner: kircherlab
License: other
Language: Python
Default Branch: master
Size: 118 MB

Statistics

Stars: 5
Watchers: 2
Forks: 1
Open Issues: 1
Releases: 2

Created over 3 years ago · Last pushed about 1 year ago

Metadata Files

Readme Changelog Contributing License Citation

cfDNA GCcorrection

Introduction
Documentation
Example
Acknowledgment

Introduction

cfDNA_GCcorrection is an easy-to-use tool to determine and correct GC biases in cell-free DNA (cfDNA) samples. Based on fragment length and GC distributions, it calculates read-level correction values and makes them available as tags in SAM/BAM format. This additional information can be used during signal extraction in various metrics (e.g., allelic balance, coverage, read ends/midpoints, WPS), while preserving the original read coverage patterns for specific analyses.

Documentation

Installation

Install by cloning this repository:

You can install cfDNA_GCcorrection on command line (linux/mac) by cloning this git repository :

bash git clone https://github.com/kircherlab/cfDNA_GCcorrection.git cd cfDNA_GCcorrection pip install -e .

note: pybedtools needs a locally installed version of bedtools! Please install it by other means. See the official documentation.

Quickstart

The GCbiascorrection described in this repo works in two steps.

Executing computeGCBias_readlen

First the GC bias is calculated with the help of the computeGCBias_readlen script, which returns a tab-separated file containing the expected, observed GC x fragment length distribution and the ratio of these measures.

Given a .bam file with WGS data from a liquid biopsy and a genome reference in .2bit format:

bash computeGCBias_readlen -b <INPUT BAMFILE> \ -g <2BIT GENOME> \ -i \ -p <NUMBER OF CPU CORES> \ --GCbiasFrequenciesFile <GCBIAS_OUTPUT_FILE>

For more options use the --help flag.

Note: Computation of expected GC x fragment length distribution is expensive. If multiple samples are to be processed, genome specific background profiles can be reused. For more details see this section

Executing correctGCBias_readlen

The second step uses these precomputed values to compute correction values and attach them as bam/sam tag to each read according to its length and GC content. This can be done by executing the correctGCBias_readlen with the output of the previous script.

bash correctGCBias_readlen -b <INPUT BAMFILE> \ --effectiveGenomeSize <EFFECTIVEGENOMESIZE> \ -g <2BIT GENOME> \ --GCbiasFrequenciesFile <GCBIAS_OUTPUT_FILE> \ -p <NUMBER OF CPU CORES> \ -o <GCWEIGHTED_OUTPUT_BAM>

For more options use the --help flag.

Getting required Files/parameters

Effective genome size:

The effective genome size is the portion of the genome that is mappable. This parameter will be used for creating the sampling distribution.

The most used values are:

| Genome | Effective size | |--------|----------------| | GRCh37 | 2864785220 | | GRCh38 | 2913022398 |

A more extensive documentation on determining the effective genome size can be found here.

getting 2bit genome files :

2bit versions of the most common genomes can be downloaded from UCSC and are indicated by the .2bit extension.

If the genome file is not available fasta files can be converted to 2bit by using faToTwoBit.

Speeding up GC bias calculation for multiple samples

The computation of expected GC x fragment length distribution (background profile) is expensive. As background profile is only dependent on the genome, it can be precalculated and reused for multiple samples mapped to the same genome build. This approach reduces the runtime of the GC bias calculation significantly, if multiple samples are to be processed.

The background profile can be calculated using the computeGCBias_background script.

bash computeGCBias_background -b <INPUT BAMFILE> \ -g <2BIT GENOME> \ -i \ -p <NUMBER OF CPU CORES> \ --output <GCBIAS_background_FILE>

The resulting output file can be used as an additional input for the computeGCBias_readlen script by specifying it using the--precomputed_background flag.

bash computeGCBias_readlen -b <INPUT BAMFILE> \ -g <2BIT GENOME> \ -i \ --precomputed_background <GCBIAS_background_FILE> \ -p <NUMBER OF CPU CORES> \ --GCbiasFrequenciesFile <GCBIAS_OUTPUT_FILE>

If a precomputed background profile was specified, the computeGCBias_readlen script will load it, check its integrity and use it during the GC bias calculations, effectively skipping the expensive computation of the background profile.

Example

In recent years, many approaches were developed to extract information from cfDNA. Methods range from identification of allelic differences at known disease markers, coverage alterations indicative of structural changes in tumor tissues, fragmentation differences to measuring methylation state. These methods measure different signals, but all rely on the quantification of read distributions and subtle changes in read recovery affect their results.

A common bias in sequencing and DNA library preparation is related to the Guanine+Cytosine (GC) content of fragments, leading to under- or overrepresentation of certain read populations (Fig 1).

**Figure 1. Measurement of GC biases in cfDNA data.** Heatmaps show the theoretical fragment representation in genome (left), distribution in the sample (middle) and the ratio (right).

From these values we can calculate readlength and GC dependent correction values, which can be used as weights fro each read. A correction of over- and underrepresented reads in a sample with an average GC content of 43% can be seen in Figure 2, showing the uncorrected ratio of Figure 1 on the left and the corrected ratio on the right.

One potential application is signal extraction around transcription afctor binding sites (TFBS). Firgure 3 shows the coverage signal around the TFBS of LYL1, a transcription factor associated with hematopoiesis.

Figure 3 shows the compound signal of the LYL1 TFBS from normalized coverage. The expected signal is shown on the left side (sample with 41% average GC), with higher coverage in close flanking regions and a drop in coverage at the TFBS indicative of higher accessibility compared to regional background. The signal on the right side of the plot (sample with 43% average GC) is distorted due to a technical GC bias, but can be rescued by applying our correction values.

Effects of GC bias correction on coverage signals. — **Figure 3. Effects of GC bias correction on coverage signals from cell-free DNA.** Effects of GC bias correction in cfDNA samples with different GC contents illustrated for an overlay of experimentally identified binding sites of the transcription factor LYL1.

Acknowledgment

cfDNA_GCcorrection started as a fork of deepTools, which was used as a skeleton for developing a GCcorrection method tailored towards cell-free DNA samples.

Owner

Name: kircherlab
Login: kircherlab
Kind: organization

Repositories: 13
Profile: https://github.com/kircherlab

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
- family-names: "Röner"
  given-names: "Sebastian"
  orcid: "https://orcid.org/0000-0002-8578-1269"
title: "cfDNA GCcorrection"
version: 1.0.0
date-released: 2024-11-04
url: "https://github.com/kircherlab/cfDNA_GCcorrection"

GitHub Events

Total

Create event: 4
Issues event: 3
Release event: 2
Watch event: 2
Delete event: 1
Issue comment event: 4
Push event: 3
Pull request event: 4

Last Year

Create event: 4
Issues event: 3
Release event: 2
Watch event: 2
Delete event: 1
Issue comment event: 4
Push event: 3
Pull request event: 4

Issues and Pull Requests

Last synced: 6 months ago

All Time

Total issues: 2
Total pull requests: 1
Average time to close issues: N/A
Average time to close pull requests: less than a minute
Total issue authors: 1
Total pull request authors: 1
Average comments per issue: 0.0
Average comments per pull request: 0.0
Merged pull requests: 1
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 2
Pull requests: 1
Average time to close issues: N/A
Average time to close pull requests: less than a minute
Issue authors: 1
Pull request authors: 1
Average comments per issue: 0.0
Average comments per pull request: 0.0
Merged pull requests: 1
Bot issues: 0
Bot pull requests: 0

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Open Source Science