population-structural-var-calling-smoove
population structural variant calling with smoove
https://github.com/carolinapb/population-structural-var-calling-smoove
Science Score: 44.0%
This score indicates how likely this project is to be science-related based on various indicators:
-
✓CITATION.cff file
Found CITATION.cff file -
✓codemeta.json file
Found codemeta.json file -
✓.zenodo.json file
Found .zenodo.json file -
○DOI references
-
○Academic publication links
-
○Academic email domains
-
○Institutional organization owner
-
○JOSS paper metadata
-
○Scientific vocabulary similarity
Low similarity (11.7%) to scientific vocabulary
Keywords
Repository
population structural variant calling with smoove
Basic Info
Statistics
- Stars: 3
- Watchers: 1
- Forks: 3
- Open Issues: 0
- Releases: 1
Topics
Metadata Files
README.md
layout: page
First follow the instructions here:
Step by step guide on how to use my pipelines
Click here for an introduction to Snakemake
ABOUT
This is a pipeline to perform structural variant calling in a population using Smoove. It also runs VEP and performs PCA. In addition to the VCF with the SVs, you also get a .tsv file with some summarized information on the SVs: it includes allele frequency per population, as well as VEP annotation and depth fold change as described in duphold:
DHBFC: fold-change for the variant depth relative to bins in the genome with similar GC-content.
DHFFC: fold-change for the variant depth relative to Flanking regions.
Tools used:
- Smoove - SV calling
- VEP - determines the effect of the variants
- Plink - perform PCA
- R - plot PCA
- SURVIVOR - basic SV stats
- Python
- PyVcf - add depth to vcf and create final table
- bamgroupreads.py + samblaster - create bam files with split and discordant reads
|
|
|:--:|
|Pipeline workflow |
Edit config.yaml with the paths to your files
OUTDIR: /path/to/output
READS_DIR: /path/to/reads/ # don't add the reads files, just the directory where they are
SAMPLE_LIST: /path/to/file
REFERENCE: /path/to/assembly
CONTIGS_IGNORE: /path/to/file
SPECIES: <species_name>
PREFIX: <output name>
NUM_CHRS: <number of chromosomes>
BWA_MEM_M: Y/N
- OUTDIR - directory where snakemake will run and where the results will be written to
- READS_DIR - path to the directory that contains the reads
- SAMPLELIST - three column csv with the sample name, name of the bam files to use in the second column and the name of the corresponding population on the third column. These bams should all be in the same directory (READSDIR)
- Example:
> sample1,sample1.bam,Pop1
> sample2,sample2.bam,Pop1
> sample3,sample3.bam,Pop2
> sample4,sample4.bam,Pop2
Tip: use the name of the bam file without the .bam extension as the sample name. Ex: from sample1.bam to sample1
- REFERENCE - path to the assembly file
- CONTIGSIGNORE - contigs to be excluded from SV calling (usually the small contigs)
- SPECIES - species name to be used for VEP
- PREFIX - prefix for the created files
- NUMCHRS - number of chromosomes for your species (necessary for plink). ex: 38
- BWAMEMM - if you mapped your reads with bwa mem using the -M parameter and you want split read support in your VCF you need to run an extra step. For this write Y.
- If BWA_MEM_M: Y an extra step will be done: splitdiscreads to create the split and disc bam files before smoove call runs
- If BWA_MEM_M: N splitdiscreads will not run and the pipeline will start with the smoove_call step
For a more detailed explanation see smoove SR support
If you want the results to be written to this directory (not to a new directory), comment out or remove
OUTDIR: /path/to/outdir
ADDITIONAL SET UP
Configuring VEP
This pipeline uses VEP in offline mode, which increases performance. In order to use it in this mode, the cache for the species used needs to be installed:
For people using WUR's Anunna:
Check if the cache file for your species already exist in /lustre/nobackup/SHARED/cache/. If it doesn't, create it with
/usr/bin/perl /cm/shared/apps/SHARED/ensembl-vep/INSTALL.pl --CACHEDIR /lustre/nobackup/SHARED/cache/ --AUTO c -n --SPECIES <species>
When multiple assemblies are found you need to run it again with --ASSEMBLY <assembly name>, where "assembly name" is the name of the assembly you want to use.
For those not from WUR:
You can install VEP with
conda install -c bioconda ensembl-vep
and install the cache with
vep_install --CACHEDIR <where/to/install/cache> --AUTO c -n --SPECIES <species>
When multiple assemblies are found you need to run it again with --ASSEMBLY <assembly name>, where "assembly name" is the name of the assembly you want to use.
In the Snakefile, in rule run_vep, replace /cm/shared/apps/SHARED/ensembl-vep/vep with vep
Installing R packages
First load R:
module load R/3.6.2
Enter the R environment by writing R and clicking enter. Install the packages:
list.of.packages <- c("optparse", "data.table", "ggplot2")
new.packages <- list.of.packages[!(list.of.packages %in% installed.packages()[,"Package"])]
if(length(new.packages)) install.packages(new.packages)
If you get an error like this:
Warning in install.packages(new.packages) :
'lib = "/cm/shared/apps/R/3.6.2/lib64/R/library"' is not writable
Follow the instructions on how to install R packages locally here and try to install the packages again.
RESULTS
_files.txt Dated file with an overview of the files used to run the pipeline (for documentation purposes)- 2_merged
- {prefix}.smoove-counts.html - shows a summary of the number of reads before and after filtering
- 5_postprocessing directory that contains the final VCF file containing the structural variants found. This file has been annotated with VEP
- {prefix}.smoove.square.vep.vcf.gz - Final VCF - with VEP annotation, not filtered for quality
- {prefix}.smoove.square.vep.vcf.gz_summary.html - statistics from VEP
- {prefix}.nosex, {prefix}.log, {prefix}.eigenvec, {prefix}.eigenval - output files from the PCA
- {prefix}DUPDELINVtable.tsv - table with the most important information extracted from the VCF. Contains information about the SV, allele frequency for each population, VEP annotation and depth information. The variants have been filtered with Minimum Quality score = 30
- {prefix}DUPDEL_INV.vcf - vcf file with annotated duplications, deletions and inversions. It has been filtered with Minimum Quality score = 30 and the DEPTH* field was added
- {prefix}_BND.vcf - vcf file with variants annotated with BND
- 6_metrics directory that contains general stats about the number of SVs found
- FIGURES directory that contains the PCA plot
What you do with the results from this structural variant calling pipeline depends on your research question: a possible next step would be to explore the {prefix}DUPDELINVtable.tsv file and look at the largest SVs found (sort by SVLEN) or at a specific effect in the ANNOTATION column, such as "frameshift_variant".
See VEP effect descriptions for a short description of the effects annotated by VEP
The *DEPTH** field in the vcf has six fields, corresponding to the average depth across all samples.
DEPTH=(DHBFC_1/1, DHBFC_0/1, DHBFC_0/0, DHFFC_1/1, DHFFC_0/1, DHFFC_0/0)
Depth fold change as described in duphold:
DHBFC: fold-change for the variant depth relative to bins in the genome with similar GC-content.
DHFFC: fold-change for the variant depth relative to Flanking regions.
These fields are also in the {prefix}_DUP_DEL_INV_table.tsv file
Owner
- Login: CarolinaPB
- Kind: user
- Location: The Netherlands
- Website: https://carolinapb.github.io/
- Repositories: 7
- Profile: https://github.com/CarolinaPB
Citation (CITATION.cff)
# This CITATION.cff file was generated with cffinit.
# Visit https://bit.ly/cffinit to generate yours today!
cff-version: 1.2.0
title: population-structural-var-calling-smoove
message: >-
If you use this software, please cite it using the
metadata from this file.
type: software
authors:
- given-names: Carolina
family-names: Pita Barros
orcid: 'https://orcid.org/0000-0002-6755-0643'
identifiers:
- type: url
value: >-
https://github.com/CarolinaPB/population-structural-var-calling-smoove
description: A link to the GitHub page
repository-code: >-
https://github.com/CarolinaPB/population-structural-var-calling-smoove
url: >-
https://github.com/CarolinaPB/population-structural-var-calling-smoove
abstract: >-
This is a pipeline to perform structural variant calling
in a population using Smoove. It also runs VEP and
performs PCA. In addition to the VCF with the SVs, you
also get a .tsv file with some summarized information on
the SVs: it includes allele frequency per population, as
well as VEP annotation and depth fold change.
keywords:
- structural variance
- SV calling
- snakemake
- smoove
license: MIT
commit: bef171f
version: 0.1.0
date-released: '2021-11-01'
GitHub Events
Total
Last Year
Dependencies
- aioeasywebdav =2.4.0=py39hf3d152e_1001
- aiohttp =3.7.4.post0=py39h3811e60_0
- amply =0.1.4=py_0
- appdirs =1.4.4=pyh9f0ad1d_0
- async-timeout =3.0.1=py_1000
- atk-1.0 =2.36.0=h3371d22_4
- attmap =0.13.0=pyhd8ed1ab_0
- attrs =21.2.0=pyhd8ed1ab_0
- backports =1.0=py_2
- backports.functools_lru_cache =1.6.4=pyhd8ed1ab_0
- bcrypt =3.2.0=py39h3811e60_1
- biopython =1.79=py39h3811e60_0
- boto3 =1.18.44=pyhd8ed1ab_0
- botocore =1.21.44=pyhd8ed1ab_0
- brotlipy =0.7.0=py39h3811e60_1001
- bzip2 =1.0.8=h7f98852_4
- c-ares =1.17.2=h7f98852_0
- ca-certificates =2021.5.30=ha878542_0
- cachetools =4.2.2=pyhd8ed1ab_0
- cairo =1.16.0=h6cf1ce9_1008
- certifi =2021.5.30=py39hf3d152e_0
- cffi =1.14.6=py39he32792d_0
- chardet =4.0.0=py39hf3d152e_1
- charset-normalizer =2.0.0=pyhd8ed1ab_0
- coincbc =2.10.5=hcee13e7_1
- colorama =0.4.4=pyh9f0ad1d_0
- conda =4.10.3=py39hf3d152e_2
- conda-package-handling =1.7.3=py39h3811e60_0
- configargparse =1.5.2=pyhd8ed1ab_0
- connection_pool =0.0.3=pyhd3deb0d_0
- cryptography =3.4.7=py39hbca0aa6_0
- cycler =0.10.0=py_2
- datrie =0.8.2=py39h3811e60_2
- decorator =4.4.2=py_0
- docutils =0.17.1=py39hf3d152e_0
- dropbox =11.19.0=pyhd8ed1ab_0
- expat =2.4.1=h9c3ff4c_0
- fftw =3.3.10=nompi_hcdd671c_100
- filechunkio =1.8=py_2
- filelock =3.0.12=pyh9f0ad1d_0
- font-ttf-dejavu-sans-mono =2.37=hab24e00_0
- font-ttf-inconsolata =3.000=h77eed37_0
- font-ttf-source-code-pro =2.038=h77eed37_0
- font-ttf-ubuntu =0.83=hab24e00_0
- fontconfig =2.13.1=hba837de_1005
- fonts-conda-ecosystem =1=0
- fonts-conda-forge =1=0
- freetype =2.10.4=h0708190_1
- fribidi =1.0.10=h36c2ea0_0
- ftputil =5.0.1=pyhd8ed1ab_0
- gdk-pixbuf =2.42.6=h04a7f16_0
- gettext =0.19.8.1=h0b5b191_1005
- ghostscript =9.18=1
- giflib =5.2.1=h36c2ea0_2
- gitdb =4.0.7=pyhd8ed1ab_0
- gitpython =3.1.23=pyhd8ed1ab_1
- google-api-core =1.31.2=pyhd8ed1ab_0
- google-api-python-client =2.21.0=pyhd8ed1ab_0
- google-auth =1.35.0=pyh6c4a22f_0
- google-auth-httplib2 =0.1.0=pyhd8ed1ab_0
- google-cloud-core =1.7.2=pyh6c4a22f_0
- google-cloud-storage =1.19.0=py_0
- google-crc32c =1.1.2=py39hb81f231_0
- google-resumable-media =2.0.2=pyh6c4a22f_0
- googleapis-common-protos =1.53.0=py39hf3d152e_0
- graphite2 =1.3.13=h58526e2_1001
- graphviz =2.49.0=h85b4f2f_0
- grpcio =1.38.1=py39hff7568b_0
- gtk2 =2.24.33=h539f30e_1
- gts =0.7.6=h64030ff_2
- harfbuzz =2.9.1=h83ec7ef_0
- httplib2 =0.19.1=pyhd8ed1ab_0
- icu =68.1=h58526e2_0
- idna =3.1=pyhd3deb0d_0
- imagemagick =7.1.0_8=pl5321hb118871_0
- importlib-metadata =4.8.1=py39hf3d152e_0
- iniconfig =1.1.1=pyh9f0ad1d_0
- ipython_genutils =0.2.0=py_1
- jbig =2.1=h7f98852_2003
- jinja2 =3.0.1=pyhd8ed1ab_0
- jmespath =0.10.0=pyh9f0ad1d_0
- jpeg =9d=h36c2ea0_0
- jsonschema =3.2.0=pyhd8ed1ab_3
- jupyter_core =4.8.1=py39hf3d152e_0
- kiwisolver =1.3.2=py39h1a9c180_0
- krb5 =1.19.2=hcc1bbae_2
- lcms2 =2.12=hddcbb42_0
- ld_impl_linux-64 =2.36.1=hea4e1c9_2
- lerc =2.2.1=h9c3ff4c_0
- libarchive =3.5.2=hccf745f_1
- libblas =3.9.0=11_linux64_openblas
- libcblas =3.9.0=11_linux64_openblas
- libcrc32c =1.1.1=h9c3ff4c_2
- libcurl =7.79.1=h2574ce0_1
- libdeflate =1.7=h7f98852_5
- libedit =3.1.20191231=he28a2e2_2
- libev =4.33=h516909a_1
- libffi =3.3=h58526e2_2
- libgcc =7.2.0=h69d50b8_2
- libgcc-ng =11.2.0=h1d223b6_8
- libgd =2.3.2=h78a0170_0
- libgfortran-ng =11.2.0=h69a702a_8
- libgfortran5 =11.2.0=h5c6108e_8
- libglib =2.68.4=h3e27bee_0
- libgomp =11.2.0=h1d223b6_8
- libiconv =1.16=h516909a_0
- liblapack =3.9.0=11_linux64_openblas
- libnghttp2 =1.43.0=h812cca2_1
- libopenblas =0.3.17=pthreads_h8fe5266_1
- libpng =1.6.37=h21135ba_2
- libprotobuf =3.18.0=h780b84a_0
- librsvg =2.50.7=hc3c00ef_0
- libsodium =1.0.18=h36c2ea0_1
- libsolv =0.7.19=h780b84a_5
- libssh2 =1.10.0=ha56f1ee_2
- libstdcxx-ng =11.2.0=he4da1e4_8
- libtiff =4.3.0=hf544144_1
- libtool =2.4.6=h9c3ff4c_1008
- libuuid =2.32.1=h7f98852_1000
- libwebp =1.2.1=h3452ae3_0
- libwebp-base =1.2.1=h7f98852_0
- libxcb =1.13=h7f98852_1003
- libxml2 =2.9.12=h72842e0_0
- logmuse =0.2.6=pyh8c360ce_0
- lz4-c =1.9.3=h9c3ff4c_1
- lzo =2.10=h516909a_1000
- mamba =0.16.0=py39h951de11_0
- markupsafe =2.0.1=py39h3811e60_0
- matplotlib-base =3.4.3=py39h2fa2bec_0
- more-itertools =8.10.0=pyhd8ed1ab_0
- multidict =5.1.0=py39h3811e60_1
- nbformat =5.1.3=pyhd8ed1ab_0
- ncurses =6.2=h58526e2_4
- networkx =2.6.3=pyhd8ed1ab_0
- numpy =1.21.2=py39hdbf815f_0
- oauth2client =4.1.3=py_0
- olefile =0.46=pyh9f0ad1d_1
- openjpeg =2.4.0=hb52868f_1
- openssl =1.1.1l=h7f98852_0
- packaging =21.0=pyhd8ed1ab_0
- pandas =1.3.3=py39hde0f152_0
- pango =1.48.10=hb8ff022_0
- paramiko =2.7.2=pyh9f0ad1d_0
- pcre =8.45=h9c3ff4c_0
- peppy =0.31.1=pyhd8ed1ab_0
- perl =5.32.1=0_h7f98852_perl5
- pillow =8.3.2=py39ha612740_0
- pip =21.2.4=pyhd8ed1ab_0
- pixman =0.40.0=h36c2ea0_0
- pkg-config =0.29.2=h36c2ea0_1008
- pluggy =1.0.0=py39hf3d152e_1
- ply =3.11=py_1
- prettytable =2.2.0=pyhd8ed1ab_0
- protobuf =3.18.0=py39he80948d_0
- psutil =5.8.0=py39h3811e60_1
- pthread-stubs =0.4=h36c2ea0_1001
- pulp =2.5.0=py39hf3d152e_0
- py =1.10.0=pyhd3deb0d_0
- pyasn1 =0.4.8=py_0
- pyasn1-modules =0.2.7=py_0
- pycosat =0.6.3=py39h3811e60_1006
- pycparser =2.20=pyh9f0ad1d_2
- pygments =2.10.0=pyhd8ed1ab_0
- pygraphviz =1.7=py39h78163bd_0
- pynacl =1.4.0=py39h3811e60_2
- pyopenssl =20.0.1=pyhd8ed1ab_0
- pyparsing =2.4.7=pyh9f0ad1d_0
- pyrsistent =0.17.3=py39h3811e60_2
- pysftp =0.2.9=py_1
- pysocks =1.7.1=py39hf3d152e_3
- pytest =6.2.5=py39hf3d152e_0
- python =3.9.6=h49503c6_1_cpython
- python-dateutil =2.8.2=pyhd8ed1ab_0
- python-irodsclient =1.0.0=pyhd8ed1ab_0
- python_abi =3.9=2_cp39
- pytz =2021.1=pyhd8ed1ab_0
- pyu2f =0.1.5=pyhd8ed1ab_0
- pyvcf =0.6.8=py39hde42818_1002
- pyyaml =5.4.1=py39h3811e60_1
- ratelimiter =1.2.0=py_1002
- readline =8.1=h46c0cb4_0
- reproc =14.2.3=h7f98852_0
- reproc-cpp =14.2.3=h9c3ff4c_0
- requests =2.26.0=pyhd8ed1ab_0
- rsa =4.7.2=pyh44b312d_0
- ruamel_yaml =0.15.80=py39h3811e60_1004
- s3transfer =0.5.0=pyhd8ed1ab_0
- scipy =1.7.1=py39hee8e79c_0
- setuptools =58.0.4=py39hf3d152e_0
- simplejson =3.17.5=py39h3811e60_0
- six =1.16.0=pyh6c4a22f_0
- slacker =0.14.0=py_0
- smart_open =5.2.1=pyhd8ed1ab_0
- smmap =3.0.5=pyh44b312d_0
- snakemake =6.7.0=hdfd78af_0
- snakemake-minimal =6.7.0=pyhdfd78af_0
- sqlite =3.36.0=h9cd32fc_1
- stone =3.2.1=pyhd8ed1ab_0
- stopit =1.1.2=py_0
- tabulate =0.8.9=pyhd8ed1ab_0
- tk =8.6.11=h27826a3_1
- toml =0.10.2=pyhd8ed1ab_0
- toposort =1.6=pyhd8ed1ab_0
- tornado =6.1=py39h3811e60_1
- tqdm =4.62.3=pyhd8ed1ab_0
- traitlets =5.1.0=pyhd8ed1ab_0
- typing-extensions =3.10.0.0=hd8ed1ab_0
- typing_extensions =3.10.0.0=pyha770c72_0
- tzdata =2021a=he74cb21_1
- ubiquerg =0.6.1=pyh9f0ad1d_0
- uritemplate =3.0.1=py_0
- urllib3 =1.26.6=pyhd8ed1ab_0
- veracitools =0.1.3=py_0
- wcwidth =0.2.5=pyh9f0ad1d_2
- wheel =0.37.0=pyhd8ed1ab_1
- wrapt =1.12.1=py39h3811e60_3
- xmlrunner =1.7.7=py_0
- xorg-kbproto =1.0.7=h7f98852_1002
- xorg-libice =1.0.10=h7f98852_0
- xorg-libsm =1.2.3=hd9c2040_1000
- xorg-libx11 =1.7.2=h7f98852_0
- xorg-libxau =1.0.9=h7f98852_0
- xorg-libxdmcp =1.1.3=h7f98852_0
- xorg-libxext =1.3.4=h7f98852_1
- xorg-libxrender =0.9.10=h7f98852_1003
- xorg-libxt =1.2.1=h7f98852_2
- xorg-renderproto =0.11.1=h7f98852_1002
- xorg-xextproto =7.3.0=h7f98852_1002
- xorg-xproto =7.0.31=h7f98852_1007
- xz =5.2.5=h516909a_1
- yaml =0.2.5=h516909a_0
- yarl =1.6.3=py39h3811e60_2
- zipp =3.5.0=pyhd8ed1ab_0
- zlib =1.2.11=h516909a_1010
- zstd =1.5.0=ha95c52a_0