https://github.com/bihealth/stemcnv-check

Last synced: 9 months ago

Repository

Basic Info

Host: GitHub
Owner: bihealth
License: mit
Language: R
Default Branch: main
Size: 17 MB

Statistics

Stars: 2
Watchers: 4
Forks: 0
Open Issues: 4
Releases: 8

Created about 2 years ago · Last pushed 10 months ago

Metadata Files

Readme Changelog License

StemCNV-check

About

StemCNV-check is a tool written to simplify copy number variation (CNV) analysis of SNP array data, specifically for quality control of (pluripotent) stem cell lines. StemCNV-check uses snakemake to run the complete analysis, from raw data (.idat) up to report generation, for all defined samples with a single command. Samples need to be defined in a (tabular) sample table, and the workflow settings are defined through a yaml file.

Documentation

Please consult our documentation on read-the-docs for detailed instructions on installation, usage, interpretation, troubleshooting and technical implementation of StemCNV-check.

Installation

StemCNV-check requires a Linux environment (or WSL on Windows) and a working conda (or mamba) installation. Follow the recommended instructions to install conda or mamba.

Further runtime dependencies (conda environments and docker containers) will be pulled automatically by snakemake when running stemcnv-check.

Stable versions, recommended

It is recommended to install StemCNV-check through the bioconda channel. If you do not use conda for other things, omitting the environment name and installing into your base environment may be an option.

conda install stemcnv-check [-n stemcnv-check]

Development version, install from source

Alternatively, installation 'from source' is also possible:

- Clone this git repository
- Optional, but recommended: create a new environment, e.g. conda create -n stemcnv-check python=3.12, then activate it
  - Note: on some systems like WSL you may also need apptainer and gcc_linux-64 (<14 for recent datrie issue)
- Install the Python dependencies and stemcnv-check itself using pip: pip install -e . (the same editable install is used for development)
  - Note: To install all development dependencies (for testing and building the documentation) use pip install -e .[dev,test,doc]

Setup

StemCNV-check requires a sample table and a config file to run. Example files can be created using stemcnv-check setup-files.

The sample table (default: sample_table.tsv) is a tab-separated file describing all samples to be analyzed:
- Columns: Sample_ID, Chip_Name, Chip_Pos, Array_Name, Sex, Reference_Sample, Regions_of_Interest, Sample_Group
- The first 5 of these (Sample_ID to Sex) are required for all samples. Reference_Sample is used to track the origin of a sample (e.g. the originating fibroblast or master bank) and should be used where possible. The last two columns are optional.
- See the sample_table_example.tsv file (or the sample table created by the setup-files command) for a description of the individual columns
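As an illustration of the layout described above, the following is a minimal sketch that writes such a table with Python's standard csv module and refuses rows that lack one of the five required values. The underscore spelling of the column names, the sample IDs, chip IDs, array name and sex values are all assumptions made for the example; the file created by stemcnv-check setup-files remains the authoritative template.

```python
import csv

# Column names as listed above, spelled with underscores (an assumption;
# compare against the file produced by `stemcnv-check setup-files`).
COLUMNS = ["Sample_ID", "Chip_Name", "Chip_Pos", "Array_Name", "Sex",
           "Reference_Sample", "Regions_of_Interest", "Sample_Group"]
REQUIRED = COLUMNS[:5]  # Sample_ID through Sex


def write_sample_table(path, rows):
    """Write a list of dicts as a tab-separated sample table.

    Raises ValueError if a row has no value for a required column.
    Optional columns that are absent from a row are written as empty cells.
    """
    for number, row in enumerate(rows, start=1):
        missing = [col for col in REQUIRED if not row.get(col)]
        if missing:
            raise ValueError(f"row {number} is missing required columns: {missing}")
    with open(path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=COLUMNS,
                                delimiter="\t", restval="")
        writer.writeheader()
        writer.writerows(rows)


# Hypothetical samples: a fibroblast line and an iPSC clone derived from it.
example_rows = [
    {"Sample_ID": "fibroblast_A", "Chip_Name": "123456780000",
     "Chip_Pos": "R01C01", "Array_Name": "ExampleArray", "Sex": "female"},
    {"Sample_ID": "ipsc_A_clone1", "Chip_Name": "123456780000",
     "Chip_Pos": "R02C01", "Array_Name": "ExampleArray", "Sex": "female",
     "Reference_Sample": "fibroblast_A"},
]
```

Calling write_sample_table("sample_table.tsv", example_rows) would then produce a file with one header row and two sample rows, the second pointing back to the first through Reference_Sample.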

The config file (default: config.yaml) defines all settings for the analysis and inherits from the inbuilt default.
Required settings that are not defined by default include array definition files specific to the used array platform and genome build:
- egt_cluster_file: the Illumina cluster file (.egt) for the array platform, available from Illumina or the provider running the array
- bpm_manifest_file: the beadpool manifest file (.bpm) for the array platform, available from Illumina or the provider running the array
- csv_manifest_file (optional): the manifest file in csv format, available from Illumina or the provider running the array

Additionally, the config file needs to define the following paths:
- raw_data_folder: path to the input directory under which the raw data (.idat) can be found. This folder should contain subfolders that match the Chip_Name column in the sample table (containing the array chip IDs)
- data_path: the output of StemCNV-check will be written to this path
- log_path: the log files of StemCNV-check will be written to this path
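The expected raw data layout (one subfolder per chip ID under the raw data folder) can be checked before starting a run. The sketch below is a hypothetical helper, not part of StemCNV-check: it assumes the sample table starts with a plain header row, that the chip column is spelled Chip_Name, and that a usable chip folder contains at least one .idat file directly inside it.

```python
import csv
from pathlib import Path


def missing_chip_folders(sample_table, raw_data_folder, chip_column="Chip_Name"):
    """Return the chip IDs from the sample table that have no usable subfolder.

    A subfolder counts as usable if it exists under raw_data_folder and
    contains at least one .idat file. The result is sorted alphabetically.
    """
    raw_root = Path(raw_data_folder)
    with open(sample_table, newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        chip_ids = {row[chip_column] for row in reader}

    missing = []
    for chip_id in sorted(chip_ids):
        chip_dir = raw_root / chip_id
        if not chip_dir.is_dir() or not any(chip_dir.glob("*.idat")):
            missing.append(chip_id)
    return missing
```

An empty list means every chip referenced in the sample table has a matching folder with raw data; any returned IDs point to folders that need to be added or renamed.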

Usage

Before the first analysis, the sample table and config file need to be set up (see above). Unless otherwise specified, stemcnv-check defaults to looking for a "sample_table.tsv" (or .xlsx) and a "config.yaml" file.

Automatic generation of the additional array & genome-build specific static files can only be done if sample data for that array is available.
Notes:
- Unless specified directly in the config, this will also include the download of the fasta and gtf files for the reference genome build.
- Array-specific files and an updated array_definition block for the config will be written into the cache directory (default: '~/.cache/stemcnv-check'). However, you still need to update or remove the array_definition from your config.yaml file, otherwise the cached definitions and files will not be used.

stemcnv-check make-staticdata [-s <sample_table>] [-c <config_file>]

To start the analysis, invoke the run command:

stemcnv-check run [-s <sample_table>] [-c <config_file>]
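For scripted use, the two invocations above can be assembled from Python. This is only a sketch of a thin wrapper: the function names are hypothetical, it uses nothing beyond the -s and -c options shown above, and it defaults to printing the commands instead of executing them.

```python
import shlex
import subprocess


def build_command(subcommand, sample_table=None, config_file=None):
    """Build the argument list for a stemcnv-check subcommand.

    Only the subcommands and the -s / -c options shown above are supported.
    """
    if subcommand not in ("make-staticdata", "run"):
        raise ValueError(f"unsupported subcommand: {subcommand}")
    command = ["stemcnv-check", subcommand]
    if sample_table is not None:
        command += ["-s", str(sample_table)]
    if config_file is not None:
        command += ["-c", str(config_file)]
    return command


def invoke(subcommand, sample_table=None, config_file=None, dry_run=True):
    """Print the command and, unless dry_run is set, execute it.

    Returns the argument list so callers can log or inspect it.
    """
    command = build_command(subcommand, sample_table, config_file)
    print(shlex.join(command))
    if not dry_run:
        # check=True raises CalledProcessError on a non-zero exit code.
        subprocess.run(command, check=True)
    return command
```

The two steps are deliberately kept as separate calls, since the config file may need to be edited between make-staticdata and run.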

Example data

This repository contains example data (using data from the Genome in a Bottle samples) that can be used to test the setup. After pulling the repository and creating and activating the base StemCNV-check conda environment, test data can be downloaded via git LFS and StemCNV-check can be run with the following commands.
(Note that this will also include the download of a fasta and gtf file for the human genome. If you have suitable files available locally, it is recommended to replace the corresponding paths in the config.yaml to avoid unnecessary and time-consuming downloads):

Install git lfs and pull test data:
- sudo apt-get install git-lfs
- git lfs fetch
- git lfs checkout

Run the example data:
- cd example_data
- stemcnv-check make-staticdata
- stemcnv-check run

Output

StemCNV-check will produce the following output files for each sample when run with default settings:
- data_path/{sample}/{sample}.annotated-SNP-data.{filter}-filter.vcf.gz
  The filtered, processed and annotated SNP data of the array, in vcf format.
- data_path/{sample}/{sample}.CNV_calls.CBS.vcf.gz
  The CNV calls for the sample from the CBS (Circular Binary Segmentation) algorithm, in vcf format.
- data_path/{sample}/{sample}.CNV_calls.PennCNV.vcf.gz
  The CNV calls for the sample from the PennCNV caller, in vcf format.
- data_path/{sample}/{sample}.CNV_calls.combined-annotated.vcf.gz
  The CNV calls processed, combined and annotated by StemCNV-check, in vcf format. Annotation includes comparison against the reference sample; gene annotation; hotspots for stem cells, cancer and dosage sensitivity; call scoring; and call labelling (e.g. as Critical de-novo call).
- data_path/{sample}/extra_files
  Folder in which additional QC log files are stored.
- data_path/{sample}/{sample}.summary-stats.xlsx
  An Excel file with summary information for the sample. The first sheet contains quality summary statistics, including array quality measures, the number of CNV and LOH calls, and the number of calls above CheckScore thresholds. The further sheets have more details from individual CNV callers or sample comparisons.
- data_path/{sample}/{sample}.SNV-analysis.xlsx
  An Excel file with the results from the analysis of annotated SNVs (from the SNP probes). This includes a list of all SNVs with an annotated impact on a gene (including the gene name and categorisation based on predicted impact, known hPSC reference SNV hotspots, match and call reliability), coverage of known hotspots by the utilised array, a distance matrix of the sample to other selected samples (based on config settings), and a chromosome-based summary of where SNPs occur.
- data_path/{sample}/{sample}.StemCNV-check-report.html; ...
  Html report containing summary statistics, QC statistics, lists of CNV calls sorted by annotation score, plots of most/all CNVs and sample comparison. The default 'StemCNV-check-report' only contains plots for the top 20 calls or calls above a user-defined CheckScore threshold. A fully self-contained report can easily be enabled in the config.yaml. The content of either the default or any additional reports can also be fine-tuned through the config.yaml file.
- data_path/{sample}/{sample}.StemCNV-check-report-html_images
  Folder containing all images included in the html report.

Furthermore, the following collated summary tables can be created (optionally with a date prefix, or as tsv instead of xlsx files):
- data_path/[YYYY-MM-DD_]summary-overview.{xlsx,tsv}
  A table that contains the information of the sample-wise "summary-stats", but combined for all samples. Additionally, more information derived from the sample table columns can be included. This output is included in the default 'complete' target.
- data_path/[YYYY-MM-DD_]combined-cnv-calls.{xlsx,tsv}
  A table that contains all CNV calls from all samples that meet config-defined filter criteria (by default: everything except calls with a minimum probe/size/density flag). This output is not included in the default 'complete' target, but can be created with the 'collate-cnv-calls' target.
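After a run it can be useful to confirm that the per-sample files listed above were actually written. The following is a minimal sketch of such a check; the helper is hypothetical, it only covers the file names given in this section for default settings, and it matches the {filter} part of the SNP data file with a wildcard because that value depends on the config.

```python
from pathlib import Path

# Per-sample file name suffixes taken from the output list above.
PER_SAMPLE_SUFFIXES = [
    "CNV_calls.CBS.vcf.gz",
    "CNV_calls.PennCNV.vcf.gz",
    "CNV_calls.combined-annotated.vcf.gz",
    "summary-stats.xlsx",
    "SNV-analysis.xlsx",
    "StemCNV-check-report.html",
]


def missing_outputs(data_path, sample):
    """Return the names of expected per-sample output files that do not exist."""
    sample_dir = Path(data_path) / sample
    missing = [
        f"{sample}.{suffix}"
        for suffix in PER_SAMPLE_SUFFIXES
        if not (sample_dir / f"{sample}.{suffix}").is_file()
    ]
    # The {filter} part of this file name is set in the config, so any value matches.
    snp_pattern = f"{sample}.annotated-SNP-data.*-filter.vcf.gz"
    if not any(sample_dir.glob(snp_pattern)):
        missing.append(f"{sample}.annotated-SNP-data.{{filter}}-filter.vcf.gz")
    return missing
```

An empty result for a sample means all listed files are present; it says nothing about their content or about the optional collated tables.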

Owner

Name: Berlin Institute of Health
Login: bihealth
Kind: organization

Website: https://www.cubi.bihealth.org/
Repositories: 215
Profile: https://github.com/bihealth

BIH Core Unit Bioinformatics & BIH HPC IT

GitHub Events

Total

Create event: 10
Release event: 8
Issues event: 3
Watch event: 2
Delete event: 6
Member event: 2
Push event: 139
Pull request event: 4
Pull request review event: 1

Last Year

Create event: 10
Release event: 8
Issues event: 3
Watch event: 2
Delete event: 6
Member event: 2
Push event: 139
Pull request event: 4
Pull request review event: 1

Issues and Pull Requests

Last synced: 10 months ago

All Time

Total issues: 5
Total pull requests: 5
Average time to close issues: about 1 month
Average time to close pull requests: 2 days
Total issue authors: 1
Total pull request authors: 2
Average comments per issue: 0.2
Average comments per pull request: 0.0
Merged pull requests: 4
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 2
Pull requests: 3
Average time to close issues: 22 days
Average time to close pull requests: 3 days
Issue authors: 1
Pull request authors: 2
Average comments per issue: 0.5
Average comments per pull request: 0.0
Merged pull requests: 3
Bot issues: 0
Bot pull requests: 0


Top Authors

Issue Authors

Nicolai-vKuegelgen (5)

Pull Request Authors

Nicolai-vKuegelgen (4)
icalledmyselfmoon (1)

Top Labels

Issue Labels

wontfix (1)

Pull Request Labels

Code base (1)
