sv-channels

Deep learning-based structural variant filtering method

https://github.com/googlingthecancergenome/sv-channels


Repository

Statistics
  • Stars: 39
  • Watchers: 3
  • Forks: 6
  • Open Issues: 23
  • Releases: 2
Created over 8 years ago · Last pushed over 2 years ago
Metadata Files
Readme Changelog Contributing License Citation Zenodo

README.md

sv-channels


sv-channels is a Deep Learning workflow for filtering structural variants (SVs) in short read alignment data using a one-dimensional Convolutional Neural Network (CNN). Currently, only deletions (DEL) called with Manta are supported. The workflow includes the following key steps:

Transform read alignments into channels

For each pair of SV breakpoints, a 2D NumPy array called a window-pair is constructed. The shape of a window-pair is [`window_size`*2 + `buffer_size`, `number_of_channels`]: it consists of two windows, one per breakpoint, where the genomic interval of each window is centered on the breakpoint position with a context of [-`window_size`/2, +`window_size`/2]. `window_size` is 124 bp by default. From all the reads overlapping this genomic interval, and from the corresponding segment of the reference sequence, `number_of_channels` channels are constructed, where each channel encodes a signal that can be used for SV calling. The list of channels can be found here. The two windows are joined into a window-pair with a buffer in between, a 2D array of zeros with shape [8, `number_of_channels`], to avoid artifacts caused by the CNN kernel passing over the interface between the two windows. The window-pairs are labelled as DEL when the breakpoint positions overlap the DEL callset used as ground truth and noDEL otherwise.
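The joining step described above can be sketched with NumPy. This is a minimal illustration, not the code used by sv-channels: the function names `window_interval` and `make_window_pair` and the channel count of 4 are assumptions made for the example, and the channel values are random placeholders rather than real alignment signals.

```python
import numpy as np

WINDOW_SIZE = 124  # default window size stated in the README
BUFFER_SIZE = 8    # rows of zeros placed between the two windows
N_CHANNELS = 4     # hypothetical; the real number of channels is set by the workflow


def window_interval(breakpoint_pos, window_size=WINDOW_SIZE):
    """Half-open genomic interval centered on a breakpoint position."""
    half = window_size // 2
    return breakpoint_pos - half, breakpoint_pos + half


def make_window_pair(window_a, window_b, buffer_size=BUFFER_SIZE):
    """Join two [window_size, n_channels] windows with a zero buffer in between."""
    if window_a.shape != window_b.shape:
        raise ValueError("both windows must have the same shape")
    zero_buffer = np.zeros((buffer_size, window_a.shape[1]), dtype=window_a.dtype)
    return np.concatenate([window_a, zero_buffer, window_b], axis=0)


# Placeholder channel values standing in for signals derived from reads.
rng = np.random.default_rng(0)
window_a = rng.random((WINDOW_SIZE, N_CHANNELS))
window_b = rng.random((WINDOW_SIZE, N_CHANNELS))

pair = make_window_pair(window_a, window_b)
interval = window_interval(10_000)
print(pair.shape)  # (256, 4), i.e. 124 * 2 + 8 rows
print(interval)    # (9938, 10062)
```

With the default sizes the first axis has 124 * 2 + 8 = 256 positions, and rows 124 to 131 are the zero buffer.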

Labelling

Window-pairs are labelled as DEL (a true deletion) or noDEL (a false positive call) based on the overlap of the DEL breakpoints of the window-pair with the truth set.
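As a rough illustration of this labelling idea, the sketch below marks a call as DEL when both of its breakpoints lie close to the breakpoints of a truth deletion on the same chromosome. The exact overlap criterion used by `svchannels label` is not described here, so the `tolerance` parameter, the `Deletion` type and the function name `label_call` are all assumptions.

```python
from typing import NamedTuple


class Deletion(NamedTuple):
    chrom: str
    start: int  # position of the first breakpoint
    end: int    # position of the second breakpoint


def label_call(call, truth_set, tolerance=50):
    """Return 'DEL' if both breakpoints of `call` are within `tolerance` bp of a
    truth deletion on the same chromosome, otherwise 'noDEL'.

    `tolerance` is a hypothetical value chosen only for this example.
    """
    for truth in truth_set:
        if (call.chrom == truth.chrom
                and abs(call.start - truth.start) <= tolerance
                and abs(call.end - truth.end) <= tolerance):
            return "DEL"
    return "noDEL"


truth = [Deletion("chr1", 10_000, 12_000)]
calls = [
    Deletion("chr1", 10_020, 11_990),  # both breakpoints near the truth deletion
    Deletion("chr1", 50_000, 51_000),  # no truth deletion nearby
]
labels = [label_call(call, truth) for call in calls]
print(labels)  # ['DEL', 'noDEL']
```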

Model training

The labelled window-pairs are used to train a 1D CNN to classify Manta SVs as either DEL (true deletions) or noDEL (false positives).
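The architecture of the trained model is not described in this README. To show what a 1D convolution over a window-pair does, here is a toy forward pass in NumPy with random weights: a single convolutional layer, global max pooling and a sigmoid output. The layer sizes, the helper names and the input shape of [256, 4] are assumptions for illustration only and do not reflect the actual sv-channels model.

```python
import numpy as np


def conv1d(x, kernels, bias):
    """'Valid' 1D convolution (cross-correlation, as in most deep learning libraries).

    x:       [length, in_channels]
    kernels: [kernel_size, in_channels, out_channels]
    bias:    [out_channels]
    returns: [length - kernel_size + 1, out_channels]
    """
    kernel_size = kernels.shape[0]
    n_out = x.shape[0] - kernel_size + 1
    out = np.empty((n_out, kernels.shape[2]))
    for i in range(n_out):
        # Each output position sees kernel_size consecutive positions and all channels.
        out[i] = np.tensordot(x[i:i + kernel_size], kernels, axes=([0, 1], [0, 1])) + bias
    return out


def predict_del_probability(window_pair, kernels, bias, dense_w, dense_b):
    """Toy classifier: conv -> ReLU -> global max pooling -> sigmoid."""
    features = np.maximum(conv1d(window_pair, kernels, bias), 0.0)
    pooled = features.max(axis=0)
    logit = pooled @ dense_w + dense_b
    return 1.0 / (1.0 + np.exp(-logit))


rng = np.random.default_rng(0)
window_pair = rng.random((256, 4))         # hypothetical input: 256 positions, 4 channels
kernels = rng.normal(size=(7, 4, 8))       # 8 filters spanning 7 positions each
bias = np.zeros(8)
dense_w = rng.normal(scale=0.1, size=8)
dense_b = 0.0

feature_map = conv1d(window_pair, kernels, bias)
probability = predict_del_probability(window_pair, kernels, bias, dense_w, dense_b)
print(feature_map.shape)  # (250, 8)
```

The kernel slides along the position axis only and treats the channels as input features, which is why an all-zero buffer between the two windows keeps a kernel from mixing signals from both breakpoints at full strength.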

Scoring Manta DELs

The model is run on the window-pairs of a test sample. The SV qualities for the Manta DELs (QUAL) of the test sample are substituted with the posterior probabilities obtained by the model.
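The substitution itself can be pictured with plain text processing, since QUAL is the sixth tab-separated column of a VCF record. The sketch below is not the implementation behind `svchannels score`: matching records by their ID column, the four-decimal formatting and the function name `substitute_qual` are assumptions made for the example.

```python
def substitute_qual(vcf_lines, probabilities):
    """Replace QUAL (6th column) of records whose ID appears in `probabilities`.

    Header lines and records without a probability pass through unchanged.
    """
    out = []
    for line in vcf_lines:
        line = line.rstrip("\n")
        if line.startswith("#"):
            out.append(line)
            continue
        fields = line.split("\t")
        sv_id = fields[2]
        if sv_id in probabilities:
            fields[5] = f"{probabilities[sv_id]:.4f}"
        out.append("\t".join(fields))
    return out


# Made-up records for illustration only.
vcf = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t10000\tMantaDEL:1\tN\t<DEL>\t250\tPASS\tSVTYPE=DEL;END=12000",
    "chr1\t50000\tMantaDEL:2\tN\t<DEL>\t80\tPASS\tSVTYPE=DEL;END=51000",
]
scored = substitute_qual(vcf, {"MantaDEL:1": 0.9731})
for record in scored:
    print(record)
```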

Dependencies

1. Clone this repo.

```bash
git clone https://github.com/GooglingTheCancerGenome/sv-channels.git
cd sv-channels
```

2. Install dependencies.

```bash
# download Miniconda3 installer
wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh

# install Conda (respond by 'yes')
bash miniconda.sh

# update Conda
conda update -y conda

# install Mamba
conda install -n base -c conda-forge -y mamba

# create a new environment with dependencies & activate it
mamba env create -n sv-channels -f environment.yaml
conda activate sv-channels

# install svchannels CLI
python setup.py install
```

3. Execution.

  • input:
    • read alignment in BAM format
    • reference genome used to map the reads in FASTA format
  • output:
    • SV callset generated by Manta in VCF format

Run on test data

  1. Extract signals.

```bash
svchannels extract-signals reference.fasta sample.bam -o signals
```

  2. Convert VCF file (Manta callset) to BEDPE format.

```bash
Rscript svchannels/utils/R/vcf2bedpe.R -i manta.vcf -o manta.bedpe
```

  3. Generate channels.

```bash
svchannels generate-channels --reference reference.fasta signals channels manta.bedpe
```

  4. Use the pretrained model.

  5. Score SVs.

```bash
svchannels score channels model.keras manta.vcf sv-channels.vcf
```

Train a new model

  1. Extract signals.

```bash
svchannels extract-signals reference.fasta training_sample.bam -o signals
```

  2. Convert VCF files (Manta callset and truth set) to BEDPE format.

```bash
Rscript svchannels/utils/R/vcf2bedpe.R -i training_sample_ground_truth.vcf \
    -o training_sample_ground_truth.bedpe
Rscript svchannels/utils/R/vcf2bedpe.R -i training_sample_manta.vcf \
    -o training_sample_manta.bedpe
```

  3. Generate channels.

```bash
svchannels generate-channels --reference reference.fasta signals channels training_sample_manta.bedpe
```

  4. Label SVs.

```bash
svchannels label -f reference.fasta.fai -o labels channels/sv_positions.bedpe training_sample_ground_truth.bedpe
```

  5. Train the model.

```bash
svchannels train channels/channels.zarr.zip labels/labels.json.gz -m model.keras
```

If there are multiple training samples, repeat the first four steps (extract signals, convert VCF files, generate channels, label SVs) for each sample to generate its channels and labels. In the final training step, pass the channels and the labels of all training samples as comma-separated arguments. See an example below:

```bash
svchannels train \
    channels_sample1/channels.zarr.zip,channels_sample2/channels.zarr.zip \
    labels_sample1/labels.json.gz,labels_sample2/labels.json.gz \
    -m model.keras
```
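When the number of training samples grows, the comma-separated arguments can be assembled programmatically. The helper below is a hypothetical convenience and not part of sv-channels. It only builds the `svchannels train` command shown above from a list of per-sample directories, assuming each sample follows the `channels.zarr.zip` and `labels.json.gz` file naming used in this README.

```python
import shlex


def build_train_command(sample_dirs, model_path="model.keras"):
    """Build the argument list for `svchannels train` over several samples.

    sample_dirs: list of (channels_dir, labels_dir) tuples, one per sample.
    """
    channels = ",".join(f"{channels_dir}/channels.zarr.zip" for channels_dir, _ in sample_dirs)
    labels = ",".join(f"{labels_dir}/labels.json.gz" for _, labels_dir in sample_dirs)
    return ["svchannels", "train", channels, labels, "-m", model_path]


samples = [
    ("channels_sample1", "labels_sample1"),
    ("channels_sample2", "labels_sample2"),
]
cmd = build_train_command(samples)
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` inside the activated `sv-channels` environment.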

Note: For the purpose of CI testing, the same BAM file is used for both model training and testing.

Contributing

If you want to contribute to the development of sv-channels, have a look at the CONTRIBUTING.md.

License

Copyright (c) 2023, Netherlands eScience Center

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

Owner

  • Name: Googling the cancer genome
  • Login: GooglingTheCancerGenome
  • Kind: organization
  • Location: Netherlands

Software repositories of the Netherlands eScience Center project: Googling the cancer genome

GitHub Events

Total
  • Watch event: 7
  • Fork event: 1
Last Year
  • Watch event: 7
  • Fork event: 1

Dependencies

.github/workflows/ci.yaml actions
  • docker/login-action v1 composite
environment.yaml pypi
notebooks/benchmark/environment.yml pypi
scripts/utils/environment.yaml pypi
setup.py pypi
svchannels/cross-validations/workflow_0/envs/environment.yaml pypi
svchannels/cross-validations/workflow_1/envs/environment.yaml pypi
svchannels/cross-validations/workflow_2/envs/environment.yaml pypi
svchannels/cross-validations/workflow_3/envs/environment.yaml pypi
svchannels/cross-validations/workflow_4/envs/environment.yaml pypi
svchannels/cross-validations/workflow_sim/envs/environment.yaml pypi