sv-channels
Deep learning-based structural variant filtering method
Repository
Basic Info
- Host: GitHub
- Owner: GooglingTheCancerGenome
- License: apache-2.0
- Language: Python
- Default Branch: master
- Homepage: https://research-software.nl/software/sv-channels
- Size: 449 MB
Statistics
- Stars: 39
- Watchers: 3
- Forks: 6
- Open Issues: 23
- Releases: 2
Metadata Files
README.md
sv-channels
sv-channels is a Deep Learning workflow for filtering structural variants (SVs) in short read alignment data using a one-dimensional Convolutional Neural Network (CNN). Currently, only deletions (DEL) called with Manta are supported. The workflow includes the following key steps:
Transform read alignments into channels
For each pair of SV breakpoints, a 2D NumPy array called a window-pair is constructed. Each window covers the genomic interval centered on a breakpoint position with a context of [-`window_size`/2, +`window_size`/2]; `window_size` is 124 bp by default. From all the reads overlapping this genomic interval and from the corresponding segment of the reference sequence, `number_of_channels` channels are constructed, where each channel encodes a signal that can be used for SV calling. The list of channels can be found here. The two windows are joined into a window-pair with a buffer in between, a 2D array of zeros with shape [8, `number_of_channels`], to avoid artifacts caused by the CNN kernel passing over the interface between the two windows. The resulting window-pair has shape [`window_size`*2+`buffer_size`, `number_of_channels`]. The window-pairs are labelled as DEL when the breakpoint positions overlap the DEL callset used as ground truth and noDEL otherwise.
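The construction described above can be sketched in NumPy. This is a minimal illustration, not the package's implementation: `window_interval`, `make_window_pair`, and the channel count of 4 are hypothetical names and values, and random numbers stand in for real channel signals. Only the default window size of 124 bp and the 8-row zero buffer come from the description above.

```python
import numpy as np

WINDOW_SIZE = 124  # default window size in bp (from the text above)
BUFFER_SIZE = 8    # rows of zeros placed between the two windows
NUM_CHANNELS = 4   # hypothetical value, chosen only for this example


def window_interval(breakpoint_pos, window_size=WINDOW_SIZE):
    """Return the (start, end) interval centered on a breakpoint.

    Assumption: a half-open interval [pos - w/2, pos + w/2).
    """
    half = window_size // 2
    return breakpoint_pos - half, breakpoint_pos + half


def make_window_pair(window1, window2, buffer_size=BUFFER_SIZE):
    """Join two [window_size, n_channels] arrays with a zero buffer between them."""
    if window1.shape != window2.shape:
        raise ValueError("both windows must have the same shape")
    n_channels = window1.shape[1]
    zero_buffer = np.zeros((buffer_size, n_channels), dtype=window1.dtype)
    # Stack along the position axis: window 1, buffer, window 2.
    return np.concatenate([window1, zero_buffer, window2], axis=0)


# Random values stand in for the channel signals of each breakpoint window.
rng = np.random.default_rng(0)
w1 = rng.random((WINDOW_SIZE, NUM_CHANNELS))
w2 = rng.random((WINDOW_SIZE, NUM_CHANNELS))

pair = make_window_pair(w1, w2)
print(window_interval(1000))  # (938, 1062)
print(pair.shape)             # (256, 4), i.e. 124 * 2 + 8 rows
```

The zero rows mean that a convolution kernel sliding along the position axis never mixes the right edge of the first window directly with the left edge of the second.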
Labelling
Window-pairs are labelled as DEL (a true deletion) or noDEL (a false positive call) based on the overlap of the DEL breakpoints of the window-pair with the truth set.
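As a rough illustration of this labelling rule, the sketch below marks a call as DEL when both of its breakpoints lie close to the breakpoints of a truth-set deletion on the same chromosome. The exact overlap criterion used by `svchannels label` is not stated here, so the `Deletion` type, the `label_call` function, and the 50 bp `tolerance` are all assumptions made for the example.

```python
from typing import NamedTuple


class Deletion(NamedTuple):
    chrom: str
    start: int  # position of the first breakpoint
    end: int    # position of the second breakpoint


def label_call(call, truth_set, tolerance=50):
    """Return "DEL" if both breakpoints match a truth deletion, else "noDEL".

    Assumption: a breakpoint matches when it lies within `tolerance` bp
    of the corresponding truth breakpoint on the same chromosome.
    """
    for truth in truth_set:
        if (
            call.chrom == truth.chrom
            and abs(call.start - truth.start) <= tolerance
            and abs(call.end - truth.end) <= tolerance
        ):
            return "DEL"
    return "noDEL"


truth = [Deletion("chr1", 10_000, 12_000)]
calls = [
    Deletion("chr1", 10_020, 11_990),  # both breakpoints near the truth deletion
    Deletion("chr1", 50_000, 50_400),  # no truth deletion nearby
    Deletion("chr2", 10_000, 12_000),  # same coordinates, different chromosome
]

labels = [label_call(c, truth) for c in calls]
print(labels)  # ['DEL', 'noDEL', 'noDEL']
```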
Model training
The labelled window-pairs are used to train a 1D CNN to classify Manta SVs as either DEL (true deletions) or noDEL (false positives).
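To make the idea of a 1D CNN over a window-pair concrete, here is a NumPy sketch of a single forward pass: one convolution along the position axis, a ReLU, global max pooling, and a sigmoid output. It does not reproduce the architecture or training procedure of sv-channels; the layer sizes, the random weights, and the function names `conv1d_valid` and `predict_del_probability` are assumptions for illustration only.

```python
import numpy as np


def conv1d_valid(x, kernels, bias):
    """1D convolution without padding.

    x:       [length, in_channels]
    kernels: [kernel_size, in_channels, filters]
    returns: [length - kernel_size + 1, filters]
    """
    kernel_size = kernels.shape[0]
    # windows has shape [length - kernel_size + 1, in_channels, kernel_size]
    windows = np.lib.stride_tricks.sliding_window_view(x, kernel_size, axis=0)
    # Sum over kernel positions (k) and input channels (c) for every filter (f).
    return np.einsum("lck,kcf->lf", windows, kernels) + bias


def predict_del_probability(window_pair, kernels, bias, dense_w, dense_b):
    """Forward pass: conv -> ReLU -> global max pool -> sigmoid."""
    features = np.maximum(conv1d_valid(window_pair, kernels, bias), 0.0)
    pooled = features.max(axis=0)        # one value per filter
    logit = pooled @ dense_w + dense_b
    return 1.0 / (1.0 + np.exp(-logit))  # score in (0, 1) for the DEL class


# Hypothetical sizes: 256 positions, 4 channels, 8 filters of width 7.
rng = np.random.default_rng(0)
window_pair = rng.random((256, 4))
kernels = rng.normal(scale=0.1, size=(7, 4, 8))
bias = np.zeros(8)
dense_w = rng.normal(scale=0.1, size=8)
dense_b = 0.0

feature_map = conv1d_valid(window_pair, kernels, bias)
prob = predict_del_probability(window_pair, kernels, bias, dense_w, dense_b)
print(feature_map.shape)  # (250, 8)
```

With untrained random weights the output probability is meaningless; training adjusts the kernels and dense weights so that the output separates DEL from noDEL window-pairs.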
Scoring Manta DELs
The model is run on the window-pairs of a test sample. The SV qualities for the Manta DELs (QUAL) of the test sample are substituted with the posterior probabilities obtained by the model.
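The QUAL substitution can be pictured with plain string handling on VCF records, where QUAL is the sixth tab-separated column. This is a sketch under assumptions, not the code behind `svchannels score`: the function `substitute_qual`, the lookup by record ID, the four-decimal formatting, and the shortened record IDs are all hypothetical.

```python
def substitute_qual(vcf_lines, probabilities):
    """Replace QUAL with a model probability for records found in `probabilities`.

    Assumption: records are matched by the VCF ID column (third column).
    Header lines and records without a probability are passed through unchanged.
    """
    scored = []
    for line in vcf_lines:
        line = line.rstrip("\n")
        if line.startswith("#"):
            scored.append(line)
            continue
        fields = line.split("\t")
        record_id = fields[2]
        if record_id in probabilities:
            fields[5] = f"{probabilities[record_id]:.4f}"  # QUAL column
        scored.append("\t".join(fields))
    return scored


# Toy VCF content with shortened, made-up record IDs.
vcf = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t10000\tDEL_1\tN\t<DEL>\t999\tPASS\tEND=12000;SVTYPE=DEL",
    "chr1\t50000\tDEL_2\tN\t<DEL>\t120\tPASS\tEND=50400;SVTYPE=DEL",
    "chr2\t70000\tDEL_3\tN\t<DEL>\t300\tPASS\tEND=70900;SVTYPE=DEL",
]
probabilities = {"DEL_1": 0.9731, "DEL_2": 0.0412}

scored = substitute_qual(vcf, probabilities)
for record in scored[2:]:
    print(record.split("\t")[5])
```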
Dependencies
- Python 3
- Conda including `environment.yaml`
1. Clone this repo.
```bash
git clone https://github.com/GooglingTheCancerGenome/sv-channels.git
cd sv-channels
```
2. Install dependencies.
```bash
# download Miniconda3 installer
wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh
# install Conda (respond by 'yes')
bash miniconda.sh
# update Conda
conda update -y conda
# install Mamba
conda install -n base -c conda-forge -y mamba
# create a new environment with dependencies & activate it
mamba env create -n sv-channels -f environment.yaml
conda activate sv-channels
# install svchannels CLI
python setup.py install
```
3. Execution.
- input:
- output:
Run on test data
- Extract signals.
```bash
svchannels extract-signals reference.fasta sample.bam -o signals
```
- Convert VCF file (Manta callset) to BEDPE format.
```bash
Rscript svchannels/utils/R/vcf2bedpe.R -i manta.vcf -o manta.bedpe
```
- Generate channels.
```bash
svchannels generate-channels --reference reference.fasta signals channels manta.bedpe
```
Use the pretrained model.
- Score SVs.
```bash
svchannels score channels model.keras manta.vcf sv-channels.vcf
```
Train a new model
- Extract signals.
```bash
svchannels extract-signals reference.fasta training_sample.bam -o signals
```
- Convert VCF files (Manta callset and truth set) to BEDPE format.
```bash
Rscript svchannels/utils/R/vcf2bedpe.R -i training_sample_ground_truth.vcf \
    -o training_sample_ground_truth.bedpe
Rscript svchannels/utils/R/vcf2bedpe.R -i training_sample_manta.vcf \
    -o training_sample_manta.bedpe
```
- Generate channels.
```bash
svchannels generate-channels --reference reference.fasta signals channels training_sample_manta.bedpe
```
- Label SVs.
```bash
svchannels label -f reference.fasta.fai -o labels channels/sv_positions.bedpe training_sample_ground_truth.bedpe
```
- Train the model.
```bash
svchannels train channels/channels.zarr.zip labels/labels.json.gz -m model.keras
```
If there are multiple training samples, the steps from extracting signals through labelling SVs are repeated for each sample to generate its channels and labels. The channels and labels of all training samples are then passed as comma-separated arguments to the training step. See an example below:
```bash
svchannels train \
    channels_sample1/channels.zarr.zip,channels_sample2/channels.zarr.zip \
    labels_sample1/labels.json.gz,labels_sample2/labels.json.gz \
    -m model.keras
```
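With many training samples, the comma-separated arguments can be assembled programmatically. The helper below is a hypothetical convenience, not part of the `svchannels` CLI; it assumes the per-sample directory layout used in the example above (`channels_<sample>/` and `labels_<sample>/`) and only builds the command without running it.

```python
import shlex


def build_train_command(samples, model_path="model.keras"):
    """Build the `svchannels train` argument list for several samples.

    Assumption: each sample's outputs live in `channels_<sample>/` and
    `labels_<sample>/`, as in the example command above.
    """
    if not samples:
        raise ValueError("at least one training sample is required")
    channels = ",".join(f"channels_{s}/channels.zarr.zip" for s in samples)
    labels = ",".join(f"labels_{s}/labels.json.gz" for s in samples)
    return ["svchannels", "train", channels, labels, "-m", model_path]


cmd = build_train_command(["sample1", "sample2"])
print(shlex.join(cmd))  # the command as a single shell line
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` once the per-sample channels and labels exist.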
Note: For the purpose of CI testing, the same BAM file is used for both model training and testing.
Contributing
If you want to contribute to the development of sv-channels, have a look at the CONTRIBUTING.md.
License
Copyright (c) 2023, Netherlands eScience Center
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
Owner
- Name: Googling the cancer genome
- Login: GooglingTheCancerGenome
- Kind: organization
- Location: Netherlands
- Website: https://www.esciencecenter.nl/projects/googling-the-cancer-genome/
- Repositories: 3
- Profile: https://github.com/GooglingTheCancerGenome
Software repositories of the Netherlands eScience Center project: Googling the cancer genome
GitHub Events
Total
- Watch event: 7
- Fork event: 1
Last Year
- Watch event: 7
- Fork event: 1
Dependencies
- docker/login-action v1 composite