https://github.com/cancerit/nanoseq

Analysis software for Nanorate Sequencing (NanoSeq) experiments

Last synced: 9 months ago · JSON representation

Repository

Analysis software for Nanorate Sequencing (NanoSeq) experiments

Basic Info

Host: GitHub
Owner: cancerit
License: agpl-3.0
Language: C++
Default Branch: develop
Homepage:
Size: 1.24 MB

Statistics

Stars: 21
Watchers: 13
Forks: 16
Open Issues: 22
Releases: 7

Created over 5 years ago · Last pushed 10 months ago

Metadata Files

Readme Changelog License

NanoSeq

Nanorate sequencing (NanoSeq) is a DNA library preparation and sequencing protocol based on Duplex Sequencing (Schmitt et al, 2012) and BotSeqS (Hoang et al, 2016). NanoSeq allows calling mutations with single molecule resolution and extremely low error rates (Abascal et al, 2021). The pipeline and code in this repository cover the preprocessing of NanoSeq sequencing data, the assessment of data quality and efficiency, and the calling of mutations (substitutions and indels) and the estimation of mutation burdens and substitution profiles.

The wet-lab protocol is described in the original publication (Abascal et al, 2021) and on ProtocolExchange (Lensing et al, 2021).

Nextflow

The simplest way of executing NanoSeq analyses is using the provided Nextflow ( NanoSeq_main.nf ) and its modules. It carries out all the required pre-processing of input files and carries out the NanoSeq analysis itself. The provided configuration file nextflow.config been setup for usage in an LSF cluster with singularity. Other environments can be accomodated by creating a similar section (see Nextflow docs ) in the config file. Most processes contain error recovery clauses that might also require modifications for other executor environments.

There are two possible file inputs for the workflow: FASTQs and BAM files. These are passed as arguments to the workflow in a samplesheet in csv format so that it is possible to do multi sample processing with one Nextflow.

FASTQ input

This option will carry all the pre-processing of sequencing data and the NanoSeq analysis. Both duplex and normal FASTQs are treated as NanoSeq libraries.

Sample sheet format id,d_fastq1,d_fastq2,n_fastq1,n_fastq2 experiment1,1_duplex_R1.fastq.gz,1_duplex_R2.fastq.gz,1_normal_R1.fastq.gz,1_normal_R2.fastq.gz experiment2,2_duplex_R1.fastq.gz,2_duplex_R2.fastq.gz,2_normal_R1.fastq.gz,2_normal_R2.fastq.gz ...

BAM input

BAM files can be of two types: ones that require re-mapping but already have the correct tags, or BAMs that are already correctly pre-processed and mapped. The type of BAM is specified with the --remap parameter (default true).

Sample sheet format id,d_bam,n_bam experiment1,1_duplex.bam,1_normal.bam experiment2,2_duplex.bam,2_normal.bam ...

A typical execution of the workflow with the NanoSeq_main.nf and module directory in the present working directory would look as this:

``` module add singularity

nextflow run NanoSeqmain.nf -qs 300 -profile lsfsingularity -resume \ --jobs 100 --ref genome.fa \ --samplesheet ss.csv --remap true --covQ 15 --varb 0 --varn 2 --var_z 15 ``` Where parameters with a single dash are arguments for Nextflow in this case it includes the typical ones: (-qs) only allow 300 jobs to run at once, (-profile) use the lsf and signularity profile and (-resume) to allow the pipeline to resume when restarted.

Parameters with a double dash indicate parameters that are passed to the workflow. In this case: (--jobs) split NanoSeq calculations into 100 jobs, (--ref) reference genome, (--samplesheet) the sample sheet, (--remap) do a re-map of the input BAM files, (--covQ, --varb, --varn, --varz ) are all parameters for various sections of the NanoSeq analysis. The header of NanoSeqmain.nf contains the default values of all the workflow parameters (eg. param.cov_Q = 0). These values are overwritten by the values specified on the command line. The header also contains the locations of the two singularity images that are used to run the workflow, the images contain all the necesary dependancies to run all the processes.

The reference directory must contain the reference itself (genome.fa), the genome index (genome.fa.fai) and all the index files generated by bwa-mem2.

Software details for manual execution

Dependencies

Execution of the scripts from this repository requires that these dependencies are on PATH :

samtools
bcftools
R (>=3.5 )
biobambam
bwa

Installation

./setup.sh path_to_install #install code from this repository export PATH=$PATH:path_to_install/bin Rscript ./build/manualInstall.R <R libraries path> #install all the required R libraries

Preprocessing of the sequencing data

Steps overview

1) Extract the duplex barcodes from the fastq files and add them to the fastq header of each read (extract_tags.py) 2) Map reads to the reference genome using bwa with option `-C to add the barcodes as tags in the bam 3) Add rc and mc tags, mark optical duplicates, and filter the bam for unpaired reads, creating a molecule-unique read bundle (RB) tag identifier for each read pair.

1/2. Extract barcodes and map reads

Prior to mapping, fastq files must be pre-processed with extract_tags.py in order to trim adapter sequences and to add the appropiate tags (rb,mb) to the read headers. ```

(trim 3 bases, skip 4 bases, add rb & mb tags), for reads of read length 151 bps

python extract-tags.py -a R1.fastq -b R2.fastq -c extrR1.fastq -d extrR2.fastq -m 3 -s 4 -l 151

(align with bwa appending rb & mb tags)

bwa mem -C reference_genome.fa extrR1.fastq extrR2.fastq > mapped.sam ```

3. Add rc and mc tags, mark optical duplicates, create read bundle tags

A read bundle tag must be appended to each read-pair of a BAM to determine which reads are PCR duplicates. The tag consists of: chromosome, read coordinate, mate corrdinate, read rb tag, and mate mb tag (RB:rc,mc,rb,mb).

With bamsormadup the rc and mc tags are added. We also recommend running bammarkduplicatesopt with optminpixeldif=2500 to flag optical duplicates.

```

(sort sam with biobambam, append rc & mc tags)

bamsormadup inputformat=sam rcsupport=1 threads=1 < mapped.sam > mapped_od.bam ```

With bamaddreadbundles, optical duplicates and unpaired mates are filtered and the RB tag is created: ```

(append RB tag, filter OD and unpaired read mates)

bamaddreadbundles -I mapped_od.bam -O filtered.bam ```

Preparation of a matched normal

The NanoSeq analysis requires a matched normal to distinguish somatic mutations from germline SNPs. Sequencing undiluted NanoSeq libraries is the most cost-efficient solution to create a matched normal because all the coverage will concentrate in the fraction of the genome "seen" with the selected restriction enzyme. If the matched normal happens to be an undiluted NanoSeq library it must be processed further as to just keep one read-pair from each read bundle to produce a 'neat' normal (i.e. to remove PCR duplicates).

randomreadinbundle -I filtered.bam -O neat.bam

Deduplication of bams with randomreadinbundle is also required to run VerifyBamId and the efficiency estimates (efficiency_nanoseq.pl).

Note

Correct pre-processing means that duplex BAMs must have @PG tags for bamsormadup , bammarkduplicatesopt and bamaddreadbundles. A neat bulk (NanoSeq library) BAM must have a @PG tag for bamaddreadbundles & randomreadinbundle. A WGS bulk will NOT have a tag for bamaddreadbundles.

The pipeline checks that all these programs have been run on the bam, exiting with an error otherwise. At users' own risk, the bam header checking can be disabled using option --no_test in dsa.

Contamination check

It is highly recommended to carry out a contamination check of the sample pair with verifyBAMId. This contamination check must be done on a bam generated with randomreadinbundle, where only one read per read bundle is kept in the bam (see above). An alpha < 0.005 would be acceptable for most situations.

Efficiency

The script efficiency_nanoseq.pl analyses the information in the NanoSeq original bam and its deduplicated version. The output provides information on duplicate rates, read counts... Theoretically, the optimal duplicate rate in terms of efficiency (duplex bases / sequenced bases) is 81% for read bundles of size >= 2+2, with 65% and 90% yielding ≥80% of the maximum of efficiency. Empirically the optimal duplicate rate is 75-76%.

Apart of the duplicate rate, the following outputs are important to assess the quality of the experiment: F-EFF, EFFICIENCY, GCBOTH/GCSINGLE

F-EFF or strand drop out fraction: This shows the fraction of read bundles missing one of the two original strands beyond what would be expected under random sampling (assuming a binomial process). Good values are between 0.10-0.30, and larger values are likely due to DNA damage such as modified bases or internal nicks that prevent amplification of one of the two strands. Larger values do not impact the quality of the results, just reduce the efficiency of the protocol.

EFFICIENCY: This is the number of duplex bases divided by the number of sequenced bases. Efficiency is maximised to ~0.07 when duplicate rates and strand drop outs are optimal

GCBOTH and GCSINGLE: the GC content of RBs with both strands and with just one strand. The two values should be similar between them and similar to the genome average. If there are large deviations that is possibly due to biases during PCR amplification. If GCBOTH is substantially larger than GCSINGLE, DNA denaturation before dilution may have taken place.

NanoSeq analysis

For a matched normal and a duplex pair of samples (hereafter referred to as "normal" and "tumour") an analysis requires the following steps:

1) Generation of tables files (dsa) 2) Generation of variant files (variants) 3) Indel identification (indelCaller_step1.pl, indelCaller_step2.pl & indelCaller_step3.R) 4) Summarizing of results (variantcaller.R & nanoseq_results_plotter.R)

Wrapper script

The wrapper script runNanoSeq.py provides a convinient way of running all the steps of a NanoSeq analyisis. It is meant to be run as a job array or in a multithreaded environment.

The wrapper scrpt has subcommands that that are meant to roughly follow the same steps that were outlined before. The steps are: cov, part, dsa, var, indel & post. Output for each step is located in the tmpNanoSeq directory.

Coverage (cov)

A coverage histogram is computed for the bulk BAM.

runNanoSeq.py -t 10 \ -A normal.bam \ -B tumour.bam \ -R genome.fa \ cov \ -Q 0 \ --exclude "MT,GL%,NC_%,hs37d5"

Partition (part)

Divide the coverage so that each job in the NanoSeq analysis gets roughly the same work. The -n argument idicates the number of tasks that will be used in the dsa, var and indel steps.

runNanoSeq.py -t 1 \ -A normal.bam \ -B tumour.bam \ -R genome.fa \ part \ -n 60 \

dsa beds (dsa)

Compute the dsa bed files. SNP and NOISE BED files contain sites to be marked on the output VCF file.

runNanoSeq.py -t 60 \ -A normal.bam \ -B tumour.bam \ -R genome.fa \ dsa \ -C SNP.sorted.bed.gz \ -D NOISE.sorted.bed.gz \ -d 2 \ -q 30 \

Variant tables (var)

Compute variants tables.

runNanoSeq.py -t 60 \ -A normal.bam \ -B tumour.bam \ -R genome.fa \ var \ -a 50 \ -b 5 \ -c 0 \ -f 0.9 \ -i 1 \ -m 8 \ -n 3 \ -p 0 \ -q 60 \ -r 144 \ -v 0.01 \ -x 8 \ -z 12

Indel vcfs (indel)

Compute vcf files for indels.

runNanoSeq.py -t 60 \ -A normal.bam \ -B tumour.bam \ -R genome.fa \ indel \ -s sample \ --rb 2 \ --t3 135 \ --t5 10 \ --mc 16

Post processing (post)

Merge final files, produce summaries. Results can be found in tmpNanoSeq/post.

runNanoSeq.py -t 2 \ -A normal.bam \ -B tumour.bam \ -R genome.fa \ post

Note

A bash script is provided in the LSF directory as a template for execution of all the steps in an computing environment using the LSF job scheduler. Script should be edited to fit individual needs.

Genomic masks

Genomic masks for common SNP masking and detection of noisy/variable genomic sites. Masks for GRCh37 and GRCh38 are available here.

These masks have recently been updated. "RE" versions now include the TGCA restriction site. In all cases, "RE" and "WGS", the new masks now include new sites found to be prone to artefacts in NovaSeqX.

DSA arguments for masks can be omitted when running NanoSeq on a species for which no common SNP data is available.

Output

The most relevant summary files include the following.

muts.vcf.gz / muts.tsv: substitutions called in vcf and tsv format. "PASS" substitutions are those not filtered by the common SNP and noisy sites masks (see Genomic masks).
indels.vcf.gz: indel calls
burden.masked-vs-unmasked.pdf: estimated burden before and after filtering common SNPs. Provides a qualitative view on contamination.
mut_burden.tsv: estimated substitution burden with Poisson confidence intervals. The corrected burden shows the burden after normalizing observed trinucleotide frequencies to the genomic trinucleotide frequencies.
trinuc-profiles.pdf / trint_subs_obs_corrected.tsv / trint_counts_and_ratio2genome.tsv: Trinucleotide substitution profiles (observed and corrected), using the trinucleotide substitution counts in trint_subs_obs_corrected.tsv and the normalization of trinucleotide frequencies in trint_counts_and_ratio2genome.tsv. Normalization is required because NanoSeq results are depleted of trinucleotides overlapping the restriction site and of CpGs due to extensive filtering of common SNPs.
cov.bed.gz: large file containing the effective duplex coverage for each genomic site, also showing the trinucleotide context of each site. This file is required to to calculate burdens and substitution profiles in sets of specific genomic regions (e.g. highly expressed genes, heterochromatin, ...).
subst_asym.pdf / subst_asym_and_rates_binned.pdf / subst_asym_binned.tsv / subst_asym.pvals / subst_asym.tsv: These files are not generally needed for NanoSeq analysis. They were originally used to detect asymmetries in the original DuplexSeq & BotSeqS protocols.
mismatches.trinuc-profile.pdf / mismatches.subst_asym.pdf / mismatches.subst_asym.pvals / mismatches.subst_asym.tsv: These files show the asymmetries and pyrimidine/purine-based trinucleotide substitution profiles for single-strand consensus calls. These profiles are useful to understand DNA damage during library preparation.
DSC_errors_per_channel.pdf / DSC_estimated_error_rates.pdf / estimated_error_rates.tsv / SSC-mismatches-Both.triprofiles.tsv / SSC-mismatches-Purine.triprofiles.tsv / SSC-mismatches-Pyrimidine.triprofiles.tsv: Based on the independent error rates in the purine and pyrimidine channels (e.g. G>T and C>A), we calculate the probability of having independent errors affecting both strands and resulting in double-strand consensus.

Owner

Name: CASM IT
Login: cancerit
Kind: organization
Email: cgpit@sanger.ac.uk
Location: Hinxton, Cambridge, UK

Website: http://www.sanger.ac.uk/science/programmes/cancer-genetics-and-genomics
Repositories: 89
Profile: https://github.com/cancerit

CASM IT provide bioinformatic support for Cancer, Ageing and Somatic Mutation group at the Wellcome Sanger Institute

GitHub Events

Total

Issues event: 28
Watch event: 8
Delete event: 1
Issue comment event: 57
Push event: 9
Pull request review event: 3
Pull request event: 6
Fork event: 5
Create event: 6

Last Year

Issues event: 28
Watch event: 8
Delete event: 1
Issue comment event: 57
Push event: 9
Pull request review event: 3
Pull request event: 6
Fork event: 5
Create event: 6

Issues and Pull Requests

Last synced: 10 months ago

All Time

Total issues: 15
Total pull requests: 3
Average time to close issues: about 1 month
Average time to close pull requests: 1 day
Total issue authors: 8
Total pull request authors: 2
Average comments per issue: 0.93
Average comments per pull request: 0.0
Merged pull requests: 3
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 15
Pull requests: 3
Average time to close issues: about 1 month
Average time to close pull requests: 1 day
Issue authors: 8
Pull request authors: 2
Average comments per issue: 0.93
Average comments per pull request: 0.0
Merged pull requests: 3
Bot issues: 0
Bot pull requests: 0

View more stats

Top Authors

Issue Authors

gevro (16)
zhangzhen (3)
katestan (3)
monagai (2)
MadeleineOman (2)
toshwatatosh (1)
scddfmsku (1)
tomdrever (1)
Jucarvag (1)
zz311784 (1)
shanyfr (1)
lculibrk (1)

Pull Request Authors

fa8sanger (3)
lbarbon (2)
ThomasSClarke (2)
monagai (1)
rulixxx (1)

https://github.com/cancerit/nanoseq

Science Score: 39.0%

Repository

Basic Info

Statistics

Metadata Files

README.md

NanoSeq

Nextflow

Software details for manual execution

Dependencies

Installation

Preprocessing of the sequencing data

Steps overview

1/2. Extract barcodes and map reads

(trim 3 bases, skip 4 bases, add rb & mb tags), for reads of read length 151 bps

(align with bwa appending rb & mb tags)

3. Add rc and mc tags, mark optical duplicates, create read bundle tags

(sort sam with biobambam, append rc & mc tags)

(append RB tag, filter OD and unpaired read mates)

Preparation of a matched normal

Note

Contamination check

Efficiency

NanoSeq analysis

Wrapper script

Coverage (cov)

Partition (part)

dsa beds (dsa)

Variant tables (var)

Indel vcfs (indel)

Post processing (post)

Note

Genomic masks

Output

Owner

GitHub Events

Total

Last Year

Issues and Pull Requests

All Time

Past Year

Top Authors

Issue Authors

Pull Request Authors

Top Labels

Issue Labels

Pull Request Labels

Dependencies