bartab
A Nextflow pipeline to tabulate synthetic barcode counts from NGS data
Science Score: 23.0%
This score indicates how likely this project is to be science-related based on various indicators:
-
○CITATION.cff file
-
○codemeta.json file
-
○.zenodo.json file
-
✓DOI references
Found 2 DOI reference(s) in README -
✓Academic publication links
Links to: biorxiv.org, nature.com, zenodo.org -
○Academic email domains
-
○Institutional organization owner
-
○JOSS paper metadata
-
○Scientific vocabulary similarity
Low similarity (11.0%) to scientific vocabulary
Repository
A Nextflow pipeline to tabulate synthetic barcode counts from NGS data
Basic Info
- Host: GitHub
- Owner: DaneVass
- License: gpl-3.0
- Language: Nextflow
- Default Branch: main
- Size: 161 MB
Statistics
- Stars: 6
- Watchers: 1
- Forks: 1
- Open Issues: 1
- Releases: 3
Metadata Files
README.md
BARtab
A Nextflow pipeline to tabulate synthetic barcode counts from NGS data
- Parameters
- Pipeline summary
- Dependiencies
- Installing the pipeline
- Running the pipeline
- Performance
- Module Descriptions

BARtab integrates with the R package bartools for downstream QC, analysis and visualization of population-level and single-cell level cellular barcoding datasets.
Please see our preprint for application examples.
Please check NEWS.md for changes in BARtab v1.4.
Parameters
``` Usage: nextflow run DaneVass/bartab --indir --outdir
Input/output arguments:
--indir Directory containing input *.fastq.gz files.
Must contain R1 and R2 if running in mode paired-bulk or single-cell.
For single-cell mode, directory can contain BAM files.
--input_type Input file type, either fastq or bam, only relevant for single-cell mode [default = fastq]
--ref Path to a reference fasta file for the barcode / sgRNA library.
If null, reference-free workflow will be used for single-bulk and paired-bulk modes.
--mode Workflow to run. <single-bulk, paired-bulk, single-cell>
--outdir Output directory to place output [default = './']
Read merging arguments:
--mergeoverlap Length of overlap required to merge paired-end reads [default = 10]
Filtering arguments:
--minqual Minimum PHRED quality per base [default = 20]
--pctqual Percentage of bases within a read that must meet --minqual [default = 80]
--complexity_threshold Complexity filter [default = 0]
Minimum percentage of bases that are different from their next base (base[i] != base[i+1])
Trimming arguments:
--constants Which constant regions flanking barcode to search for in reads: up, down or both.
"all" runs all 3 modes and combines the results. <up, down, both, all> [default = 'up']
--upconstant Sequence of upstream constant region [default = 'CGATTGACTA'] // SPLINTR 1st gen upstream constant region
--downconstant Sequence of downstream constant region [default = 'TGCTAATGCG'] // SPLINTR 1st gen downstream constant region
--up_coverage Number of bases of the upstream constant that must be covered by the sequence [default = 3]
--down_coverage Number of bases of the downstream constant that must be covered by the sequence [default = 3]
--constantmismatches Proportion of mismatched bases allowed in constant regions [default = 0.1]
--min_readlength Minimum length of barcode sequence [default = 20]
--barcode_length Optional. Length of barcode if it is the same for all barcodes.
If constant regions are trimmed on both ends, reads are filtered for this length.
If either constant region is trimmed, this is the maximum sequence length.
If barcode_length is set, alignments to the middle of a barcode sequence are filtered out.
Mapping arguments:
--alnmismatches Number of allowed mismatches during reference mapping [default = 2]
--barcode_length (see trimming arguments)
--cluster_unmapped Cluster unmapped reads with starcode [default = false]
Reference-free arguments:
--cluster_distance Defines the Levenshtein distance for clustering lineage barcodes [default = 3].
--cluster_ratio Cluster ratio for message passing clustering.
A cluster of barcode sequences can absorb a smaller one only if it is at least x times bigger [default = 3].
Sincle-cell arguments:
--cb_umi_pattern Cell barcode and UMI pattern on read 1, required for fastq input.
N = UMI position, C = cell barcode position [default = CCCCCCCCCCCCCCCCNNNNNNNNNNNN]
--cellnumber Number of cells expected in sample, only required when fastq provided. whitelist_indir and cellnumber are mutually exclusive
--whitelist_indir Directory that contains a cell ID whitelist for each sample <sample_id>_whitelist.tsv
--umi_dist Hamming distance between UMIs to be collapsed during counting [default = 1]
--umi_count_filter Minimum number of UMIs per barcode per cell [default = 1]
--umi_fraction_filter Minimum fraction of UMIs per barcode per cell compared to dominant barcode in cell
(barcode supported by most UMIs) [default = 0.3]
--pipeline To specify if input fastq files were created by SAW pipeline
Resources:
--max_cpus Maximum number of CPUs [default = 6]
--max_memory Maximum memory [default = "14.GB"]
--max_time Maximum time [default = "40.h"]
Optional arguments:
-profile Configuration profile to use. Can use multiple (comma separated)
Available: conda, singularity, docker, slurm, lsf
--email Direct output messages to this address [default = '']
--help Print this help statement.
Author:
Dane Vassiliadis (dane.vassiliadis@petermac.org)
Henrietta Holze (henrietta.holze@petermac.org)
```
Pipeline summary
The pipeline can extract barcode counts from population-level DNA barcode amplicon sequencing, single-cell RNA-sequencing or single-cell barcode amplicon sequencing data.
BARtab can perform reference-free barcode extraction or perform alignment to a reference.
Detailed QC reports of the analysis steps are provided alongside barcode count tables.
Bulk workflow
The bulk workflow is executed with mode single-bulk and paired-bulk for single-end or paired-end reads, respectively.
- Report raw data quality using
fastqcFASTQC - [Paired-end] Merge paired end reads using
FLAShMERGE_READS - Quality filter reads using
fastpFILTER_READS - Trim 5' and/or 3' constant regions using
cutadaptCUTADAPT_READS - [Reference-based] Align to reference barcode library using
bowtieBUILDBOWTIEINDEX, BOWTIE_ALIGN - [Reference-based optional] Filter alignments to either end of a reference sequence, not the middle FILTER_ALIGNMENTS
- [Reference-based] Count number of reads aligning per barcode using
samtoolsSAMTOOLS, GETBARCODECOUNTS - [Reference-free optional] Trim barcode sequences to the same length if only one adapter was trimmed TRIMBARCODELENGTH
- [Reference-free] Cluster barcode reads to account for PCR and sequencing errors using
starcodeSTARCODE - Merge barcode count files for multiple samples COMBINEBARCODECOUNTS
- Report metrics for individual samples MULTIQC
Single-cell workflow
The single-cell workflow can extract barcodes from both single-cell amplicon-sequencing libraries and directly from processed scRNA-seq data with mode single-cell.
- [fastq] Report raw data quality using
fastqcFASTQC - [fastq] Extraction of cell barcodes (optional) and UMIs using
umi-toolsUMITOOLS_WHITELIST, UMITOOLS_EXTRACT - [BAM] Filter unmapped reads containing cell barcode and UMI and convert to fastq using
samtoolsBAMTOFASTQ - Trim 5' and/or 3' constant regions using
cutadaptCUTADAPT_READS - [Reference-based] Align to reference barcode library using
bowtieBUILDBOWTIEINDEX, BOWTIE_ALIGN - [Reference-based optional] Filter alignments to either end of a reference sequence, not the middle FILTER_ALIGNMENTS
- [Reference-free] Cluster barcodes and error-correct UMIs with
starcode-umiSTARCODE_SC - Remove reads from PCR chimerism REMOVEPCRCHIMERISM
- Error correct UMIs and count UMIs per barcode using
umi-toolsUMITOOLS_COUNT - Filter and tabulate barcodes per cell and produce QC plots PARSEBARCODESSC
- Report metrics for individual samples MULTIQC
Spatial data
BARtab allows for extraction of barcodes from stereo-seq data that was processed with the SAW pipeline (>=v6.1.0).
Input: one fastq file per sample with unmapped reads, generated by the mapping command of the SAW pipeline with the outUnMappedFq=1 flag.
Barcodes counts are tabulated by spot (coordinate) and can subsequently be binned the same way as the stereo-seq data.
- Filter barcode reads and trim 5' and/or 3' constant regions using
cutadaptCUTADAPT_READS - Align to reference barcode library using
bowtieBUILDBOWTIEINDEX, BOWTIE_ALIGN - [Optional] Filter alignments for sequences mapping to either end of a barcode FILTER_ALIGNMENTS
- Count barcodes from sam file COUNTBARCODESSAM
- Filter and tabulate barcodes per spot and produce QC plots PARSEBARCODESSC
- Report metrics for individual samples MULTIQC
Dependiencies
See citations * Nextflow * R * The tidyverse package * Python * fastqc * FLASh * fastp * cutadapt * bowtie1 * samtools * Starcode * MultiQC * umi-tools * parallel
Installing the pipeline
Install Nextflow using the instructions found here (and here) ```
download the executable
curl get.nextflow.io | bash
move the nextflow file to a directory accessible by your $PATH variable
sudo mv nextflow /usr/local/bin ```
Try out the pipeline
(this will automatically pull the pipeline, usually into~/.nextflow/assets/)nextflow run DaneVass/bartab --helpAlternatively, clone the pipeline into a directory of your choice first withnextflow clone DaneVass/bartab target_dir/Install dependencies
Docker
Download the Docker image from docker hub. ``` docker pull henriettaholze/bartab:v1.4
nextflow run DaneVass/bartab -profile docker [options] ```
Singularity
``` export NXFSINGULARITYLIBRARYDIR=MYSINGULARITYIMAGES # your singularity storage dir export NXFSINGULARITYCACHEDIR=MYSINGULARITYCACHE # your singularity cache dir singularity pull --dir $NXFSINGULARITYLIBRARYDIR henriettaholze-bartab-v1.4.img docker://henriettaholze/bartab:v1.4
nextflow run DaneVass/bartab -profile singularity [options] ```
Conda
- Install miniconda using the instructions found here: https://docs.conda.io/projects/conda/en/latest/user-guide/install/index.html
- It is recommended to use mamba to create the conda environment
conda install -c conda-forge mamba - Install BARtab dependencies by running
mamba env create -f environment.yaml(orconda env create -f environment.yaml) - Run the pipeline with
nextflow run DaneVass/bartab -profile conda [options]
The location of the conda environment is specified in
conf/conda.config.
Running the pipeline
Print the help message with nextflow run DaneVass/bartab --help.
To run a specific branch or the pipeline use -r <branch>.
Run any of the test datasets using nextflow run DaneVass/bartab -profile <test_SE,test_PE,test_SE_ref_free,test_sc,test_sc_bam,test_sc_saw_fastq>,<conda,docker,singularity>,<slurm,lsf>
To run the pipeline with your own data, create a parameter yaml file and specify the location with -params-file.
An example to run the single-end bulk workflow:
indir: "test/dat/test_SE"
ref: "test/ref/SPLINTR_mCHERRY_V2_barcode_reference_library.fasta"
mode: "single-bulk"
outdir: "test/test_out/single_end/"
upconstant: "TGACCATGTACGATTGACTA"
downconstant: "TGCTAATGCGTACTGACTAG"
constants: "up"
barcode_length: 60
min_readlength: 20
Use -w to specify the location of the work directory and -resume when only parts of the input have changed or only a subset of process has to be re-run.
nextflow run DaneVass/bartab \
-profile conda \
-params-file path/to/params/file.yaml \
-w "/scratch/work/" \
-resume
Preparing pipeline input
Reference
The reference must be a fasta file.
For SPLINTR libraries, barcodes are labelled based on their frequency in the library.
Giving barcodes labels is advantagous for plotting and comparing barcodes across samples.
Bulk data
All fastq files must be moved or linked to an input directory. This can be done as follows if files are in different subdirectories:
cd indir/
fastq_files="path/to/data/*/*fastq.gz"
for f in $fastq_files; do ln -s $f .; done
Single-cell amplicon-sequencing data
All fastq files can then be symlinked to an input directory. File names must match the pattern <sample_name>_R{1,2}*.{fastq,fq}.gz.
Parameter input_type must be set to fastq.
If scRNA-seq has already been processed with Cell Ranger or STARSolo, cell barcode whitelists can be provieded to skip cell calling.
Set the parameter whitelist_indir to the location of the folder with all whitelist files.
A list of cell IDs detected in a sample can be found in Cell Ranger results in outs/filtered_feature_bc_matrix/barcodes.tsv.gz.
For each sample, a matching whitelist must be provided following the syntax <sample_id>_whitelist.tsv with the same sample_id as in the filename of the fastq files.
The whitelist files must be in tsv format and extensions like '-1' must be removed.
Example command:
zcat outs/filtered_feature_bc_matrix/barcodes.tsv.gz | sed 's/..$//g' > whitelist_indir/4-5nMT100nMP_S4_whitelist.tsv
If no cell ID whitelist is provided (with --whitelist_indir),
cell barcodes are identified in R1 using umi-tools whitelist.
Processed scRNA-seq data
Barcodes can be extracted from scRNA-seq data, from Cell Ranger or STARSolo generated BAM files.
Reads containing barcode sequences will be in the unmapped fraction of reads after alignment.
To retain unmapped reads annotated with cell ID and UMI in the output, run STAR with the option --outSAMunmapped Within KeepPairs.
All BAM files can then be symlinked to an input directory and the parameter input_type set to bam.
To symlink files and give them individual names,
cd indir/
ln -s cellranger/results/outs/possorted_genome_bam.bam Sample_Pool-1-231114.bam
When using the reference-based workflow, it is recommended to set the parameter constants to all to get maximum recovery of barcodes from randomly fragmented reads.
This parameter setting is not compatible with the reference-free workflow since barcode fragments covering different parts of a barcode cannot be clustered.
Debugging the pipeline
If a process fails and the error message is not clear, have a look in the run folder of the process.
Each process has runs in a separate folder in the work directory, e.g. /scratch/work/cf/d61146a435010dfeb92d2aee18947a.
Check .command.err and any other log files in the folder for error messages.
Performance
We recently did a comparison of BARtab with two other pipelines, pycashier and timemachine.
Code and results are available here: https://github.com/DaneVass/bartoolsmanuscriptcode/tree/main/tools-comparison
Module Descriptions
Module descriptions contain additional information on third-party tools, parameters and output files.
General modules
SOFTWARE_CHECK
Software check is always performed as first module.
Output files:
- reports/software_check.txt: Report of all software versions.
FASTQC
QC of fastq files is performed using FastQC.
Output files:
- reports/fastqc/<sample_id>.html: html report (copy)
- reports/fastqc/zips/<sample_id>.zip (copy)
MERGE_READS
If running in mode paired-bulk, forward and reverse reads are merged using FLASh.
The minimum overlap of reads can be specified with the parameter mergeoverlap (default 10 bases).
Output files:
merged_reads/<sample_id>.extendedFrags.fastq.gz: merged reads (symlink)merged_reads/unmerged/<sample_id>.notCombined_<1,2>.fastq.gz: reads that could not be merged (symlink)merged_reads/logs/<sample_id>.flash.log: log (copy)merged_reads/logs/<sample_id>.hist: Numeric histogram of merged read lengths. (copy)merged_reads/logs/<sample_id>/<sample_id>.histogram: Visual histogram of merged read lengths. (symlink)
FILTER_READS
Reads are quality filtered using fastp.
The minimum quality score to keep can be specified with the parameter minqual.
The minimum percent of bases that must have minqual quality can be specified with the parameter pctqual.
complexity_threshold (default 0) can be set to remove reads with e.g. poly-A tails due to an empty barcode contstruct.
Complexity is defined as minimum percentage of bases that are different from their next base (base[i] != base[i+1]).
For a WSN barcode pattern, barcodes have a minimum complexity of 66%.
fastp by default removes reads with >5 ambiguous bases (N).
To skip read filtering, set minqual to 0.
Output files:
- filtered_reads/<sample_id>.filtered.fastq.gz: filtered reads (symlink)
- filtered_reads/logs/<sample_id>.filter.log: log (copy)
- filtered_reads/logs/<sample_id>.filter.fastp.json: log (copy)
- filtered_reads/logs/<sample_id>.filter.fastp.html: log (copy)
CUTADAPT_READS
Constant regions are trimmed and reads are filtered for length and N bases using cutadapt.
Constants can be specified with the parameters upconstant and downconstant.
For each, minimum coverage can be specified with up_coverage and down_coverage (default 3).
If this is smaller than the length of the constant region, partial matches at the beginning or end of the sequence are accepted.
This is particularly useful in case of random fragmentation and variable stagger length.
In bulk mode, reads can be filtered for containing either upconstant (up), downconstant (down) or both (both) with the parameter constants.
If all reads cover the upstream constant but only some cover the downstream constant (e.g. due to variable length barcodes or stagger),
set constants to both and down_coverage to 0.
When contstants is set to all, reads are filtered in all three ways and fastq files of trimmed sequences are concatenated.
This is usefult for random fragmented barcode reads from scRNA-seq data.
Example for trimming options:
upconstant="ATGGAATTG"
downconstant="CGGAACCGA"
upcoverage=6
downcoverage=6
>seq1
ATGGAATTGACATCACGCTCAAGGATCCGGAACCGA
>seq2
GAATTGACATCACGCTCAAGGATCCGGAAC
>seq3
ATGGAATTGACATCACGCTCAAGGATC
>seq4
GAATTGACATCACGCTCAAGGATC
>seq5
ACATCACGCTCAAGGATCCGGAACCGA
>seq6
ACATCACGCTCAAGGATCCGGAAC
>seq7
ACATCACGCTCAAGGATCCGGA
>seq8
ACATCACGCTCCGGAACAAGGATC
>seq9
ACATCACGCTCAAGGATC
Option both will trim sequence 1 and 2, up will trim sequence 3 and 4, down will trim sequence 5 and 6, all will trim sequence 1-6.
The minimum read length can be specified with min_readlength (default 20).
If a constant barcode length is set with barcode_length, this is set as maximum sequence length.
For both, only sequences matching exactly barcode_length will be retained.
The fraction of mismatches in the constant region can be specified with constantmismatches (default 0.1).
Output files:
- trimmed_reads/<sample_id>.trimmed.fastq: filtered and trimmed reads (symlink)
- trimmed_reads/logs/<sample_id>.cutadapt.log: log (copy)
BUILDBOWTIEINDEX
If a reference is provided, it is indexed using bowtie1.
Output files:
- reference_index/ NAME.1.bt2, NAME.2.bt2, NAME.3.bt2, NAME.4.bt2, NAME.rev.1.bt2, and NAME.rev.2.bt2 (symlink)
BOWTIE_ALIGN
If a reference is provided, trimmed and filtered reads are aligned to the indexed reference using bowtie1.
--norc is specified, bowtie will not attempt to align against the reverse-complement reference strand.
Only non-ambiguous alignments are reported with the flags -a --best --strata -m1.
This is necessary when aligning short barcode fragments from scRNA-seq.
Note: This can results in not detection of specific barcodes if only a part of the barcode is sequenced.
If for example only the first 40 bases of a barcode are sequenced and there are non-unique barcodes in the reference based on the first 40 bases, no read will unambiguously match to these barcodes.
Sequences that map with the same number of mismatches to multiple barcodes will be discarded.
The number of allowed mismatches can be specified with the parameter alnmismatches (default 2).
Output files:
- mapped_reads/<sample_id>.mapped.sam: Aligned reads (symlink)
- mapped_reads/unmapped/<sample_id>.unmapped.fastq.gz: Unaligned reads (optional) (symlink)
- mapped_reads/logs/<sample_id>.bowtie.log: log (copy)
FILTER_ALIGNMENTS
If the barcodes have a consistent length specified with barcode_length, alignments to the middle of a barcode sequence are filtered out.
Alignments that start at the first position or end at the last are retained.
This ensures confidence in barcodes detected with short mapping sequences (min_readlength), or barcode fragments from scRNA-seq.
Output files:
- mapped_reads/<sample_id>.mapped_filtered.sam: Aligned and filtered reads (symlink)
TRIMBARCODELENGTH
Before clustering barcode reads with starcode, they are trimmed to min_readlength if constants is set to up or down.
Reasoning:
If the reads do not cover the whole barcode and a stagger is used, barcode reads will have different length.
Since starcode only collapses sequence clusters of unequal sizes (--cluster_ratio), this would result in one cluster per stagger length for each barcode.
Trimming is either done on 3' or 5' end, depending on which constant was trimmed.
No trimming is performed if constants is both. all is not compatible with clustering barcode reads.
MULTIQC
MultiQC creates a report of metrics for fastqc, flash, fastp, cutadapt and bowtie for all samples.
Output files:
- reports/multiqc/multiqc_report.html: report for all samples (copy)
GETBARCODECOUNTS
The SAM file of aligned barcode reads is sorted, indexed and compressed using samtools.
Barcode counts for each sample are compiled with samtools indexstats.
Output files:
- counts/<sample_id>_rawcounts.txt: tsv containing barcode and count (copy)
Population-level specific modules
STARCODE
If no reference is provided, the consensus barcode repertoire is derived using starcode.
Starcode clusters the filtered and trimmed barcode sequences based on their Levenshtein distance (substitutions, insertions, deletions).
The default Levenshtein distance for clustering is 3 which is conservative to avoid collapsing truly distingt barcodes.
The parameter cluster_distance can be adjusted for the expected number of sequencing errors given the length of the barcode construct.
Examplary recommended cluster distances: ClonMapper (Gutierrez et al. 2021) (20bp barcode): 1 and FateMap (Goyal et al. 2023) (100bp barcode): 8.
The default cluster_ratio to use for message passing clustering is 3.
This means that a cluster can absorb a smaller one only if it is at least 3 times bigger.
When barcodes are aligned to a reference, cluster_unmapped can be set to true to cluster unmapped barcode reads.
When using this option for barcodes from direct scRNA-seq data, barcode reads have different lengths due to random fragmentation.
Clustering unmapped reads may therefore not be as informative.
From the starcode manual:
The clustering method is Message Passing. This means that clusters are built bottom-up by merging small clusters into bigger ones. The process is recursive, so sequences in a cluster may not be neighbors, i.e., they may not be within the specified Levenshtein distance.
Output files:
- starcode/<sample_id>[_unmapped]_starcode.tsv: barcode counts with sequence of centroid of each barcode cluster and read count (copy)
- starcode/logs/<sample_id>[_unmapped]_starcode.log: log (copy)
When starcode is run on reads that did not align to the barcode reference library, results are written to counts/unmapped/ and counts/logs/ instead of starcode/.
COMBINEBARCODECOUNTS
Barcode counts of all samples are combined into one table with an outer join.
Output files:
- counts/all_counts_combined.tsv: table of barcodes and counts for each sample (copy)
Single-cell specific modules
UMITOOLS_WHITELIST
If no cell ID whitelist is provided (with --whitelist_indir), Cell barcodes are identified in R1 using umi-tools whitelist.
A whitelist of cell IDs can be found in Cell Ranger results in outs/filtered_feature_bc_matrix/<sample_id>-barcodes.tsv but extensions like '-1' must be removed.
The expected number of cells should be specified with the parameter cellnumber.
This should be approximately the number of cells loaded. The command is only utilized to extract cell barcodes and not to perform cell calling.
Cell calling should be done with tools such as Cell Ranger.
Barcodes identified in droplets that do not contain cells or doublets will be removed when merging the barcode counts table with e.g. QC'd Seurat object.
If the number of cells loaded differs a lot between samples, they must be processed separately with adjusted cellnumber values.
Output files:
- umitools/whitelist/<sample_id>_whitelist.tsv: whitelisted cell barcodes and counts (symlink)
- umitools/whitelist/logs/<sample_id>_whitelist.log: log (copy)
UMITOOLS_EXTRACT
Reads that contain cell barcode and UMI are extracted using umi-tools extract.
Output files:
- umitools/extract/<sample_id>_R2_extracted.fastq: reads that contain cell barcode and UMI, both added to the read name (symlink)
- umitools/extract/logs/<sample_id>_exctract.log: log (copy)
BAMTOFASTQ
Reads are filtered with samtools for flag 4 and flags CB and UB to obtain unmapped reads that contain a cell barcode and UMI. At a later step (for efficiency), cell ID and UMI are added to the read headers with the module RENAMEREADSBAM (or RENAMEREADSSAW).
Output files:
- fastq/<sample_id>_R2.fastq.gz: reads containing cell barcode and UMI (symlink)
STARCODE_SC
If no reference is available, starcode-umi is used to cluster barcodes from scRNA-seq data.
We utilized the python script starcode-umi, from starcode/starcode-umi, written to cluster UMI-tagged sequences.
Starcode-umi performs clustering and merging of barcode sequences to find the best possible clusters of UMI and sequence pairs.
Parameters cluster_distance and cluster_ratio can be adjusted for error correction of barcode sequences (see STARCODE).
See also section Trimming barcode length before clustering
This module also clusters unmapped reads if cluster_unmapped is set to true.
Output files:
- starcode/[unmapped/]<sample_id>[_unmapped]_starcode.tsv: the output of starcode with clustered and corrected cellbarcode-umi-barcode sequence and sequence count (symlink)
- starcode/logs/<sample_id>[_unmapped]_starcode.log: starcode log (copy)
REMOVEPCRCHIMERISM
We assume that PCR chimerism happen mainly between the CB-UMI and the lineage barcode sequence.
Reads from PCR chimerism are removed by only retaining the cell barcode-umi-lineage barcode combination supported by the most reads.
This substantially decreases the amount of barcodes detected per cell.
Output files:
- counts/[unmapped/]pcr_chimerism/<sample_id>[_unmapped][_starcode]_barcodes.tsv: cell barcode-umi-lineage barcode combination, formatted as input to umi-tools counttab (symlink)
- `counts/logs/<sampleid>[unmapped][starcode]pcrchimerism.log`: log (copy)
UMITOOLS_COUNT
Trimmed, filtered and aligned barcodes are counted using umi-tools count_tab.
The Hamming distance between UMIs to be collapsed within cells during counting can be specified with parameter umi_dist (default and recommended 1).
Output files:
- counts/<sample_id>.counts.tsv: barcode UMI counts with columns barcode, cell barcode and deduplicated UMI count (copy)
- counts/logs/<sample_id>_counts.log: log (copy)
COUNTBARCODESSAM
Alternative to umitools_count for stereo-seq data processed with the SAW pipeline. Trimmed, filtered and aligned barcodes are counted from the SAM file (no pcr chimera removal or UMI error correction).
Output files:
- counts/<sample_id>.counts.tsv: barcode UMI counts with columns barcode, cell barcode and deduplicated UMI count (copy)
PARSEBARCODESSC
Since multiple barcodes can be detected in a cell, the counts table needs to be aggregated. This allows the results to be merged into the metadata of a single-cell object such as a Seurat or AnnData object.
Barcodes can be filtered based on the number of supporting UMIs (--umi_count_filter) and by the number of UMIs in comparison to the dominant barcode per cell (--umi_fraction_filter).
E.g. if barcode a has 5 supporting UMIs in a cell and a second barcode with 2 supporting UMIs and umi_fraction_filter set to 0.3, 5 / 2 = 0.25 < 0.3, so the second barcode will be discarded.
Barcodes and UMIs are semicolon-separated if multiple barcodes were detected per cell.
Output files:
- counts/<sample_id>_cell_barcode_annotation.tsv: aggregated barcode counts per cell with cell barcode as row index and barcode and UMI count as columns (copy)
- counts/qc/<sample_id>_barcodes_per_cell.pdf: QC plot, number of detected barcode per cell (copy)
- counts/qc/<sample_id>_UMIs_per_bc.pdf: QC plot, UMIs supporting the most frequent barcode per cell (copy)
- counts/qc/<sample_id>_avg_sequence_length.pdf: QC plot, average mapped sequence length per barcode (only with reference)
- counts/qc/<sample_id>_barcodes_per_cell_filtered.pdf: QC plot, number of detected barcode per cell (copy)
- counts/qc/<sample_id>_UMIs_per_bc_filtered.pdf: QC plot, UMIs supporting the most frequent barcode per cell (copy)
- counts/qc/<sample_id>_avg_sequence_length_filtered.pdf: QC plot, average mapped sequence length per barcode (only with reference) (copy)
- counts/qc/<sample_id>_avg_sequence_length.tsv: average mapped sequence length per barcode as table (only with reference) (copy)
Example output structure
Essential output files are marked with **.
Bulk
single_end/
counts
**all_counts_combined.tsv**
**test_SE_01_rawcounts.txt**
**test_SE_02_rawcounts.txt**
**test_SE_03_rawcounts.txt**
filtered_reads
logs
test_SE_01.filter.fastp.html
test_SE_01.filter.fastp.json
test_SE_01.filter.log
test_SE_02.filter.fastp.html
test_SE_02.filter.fastp.json
test_SE_02.filter.log
test_SE_03.filter.fastp.html
test_SE_03.filter.fastp.json
test_SE_03.filter.log
test_SE_01.filtered.fastq.gz -> /scratch/users/hholze/BARtab/work/4b/5aa24ebed0ac626e6f9a3e01a8cda2/test_SE_01.filtered.fastq.gz
test_SE_02.filtered.fastq.gz -> /scratch/users/hholze/BARtab/work/6b/2c6d8dbfd414d1f2cb07d636dd54ec/test_SE_02.filtered.fastq.gz
test_SE_03.filtered.fastq.gz -> /scratch/users/hholze/BARtab/work/ee/8aa047ee412dfc7d3efb5e40524770/test_SE_03.filtered.fastq.gz
mapped_reads
logs
test_SE_01.bowtie.log
test_SE_02.bowtie.log
test_SE_03.bowtie.log
test_SE_01.mapped_filtered.sam -> /scratch/users/hholze/BARtab/work/87/f338156278257ac88ff276253dcd6e/test_SE_01.mapped_filtered.sam
test_SE_01.mapped.sam -> /scratch/users/hholze/BARtab/work/89/d74f3656808410912312463972204e/test_SE_01.mapped.sam
test_SE_02.mapped_filtered.sam -> /scratch/users/hholze/BARtab/work/e2/b1d81c307389fd6a2ae690792520b5/test_SE_02.mapped_filtered.sam
test_SE_02.mapped.sam -> /scratch/users/hholze/BARtab/work/69/ded496d4631e6d2952e585b81f84bc/test_SE_02.mapped.sam
test_SE_03.mapped_filtered.sam -> /scratch/users/hholze/BARtab/work/45/c0bfc53016017909fd7bc8c6dee2dd/test_SE_03.mapped_filtered.sam
test_SE_03.mapped.sam -> /scratch/users/hholze/BARtab/work/19/84b5109b37a0e5453fc046f38b968b/test_SE_03.mapped.sam
unmapped
test_SE_01.unmapped.fastq.gz -> /scratch/users/hholze/BARtab/work/89/d74f3656808410912312463972204e/test_SE_01.unmapped.fastq.gz
test_SE_02.unmapped.fastq.gz -> /scratch/users/hholze/BARtab/work/69/ded496d4631e6d2952e585b81f84bc/test_SE_02.unmapped.fastq.gz
test_SE_03.unmapped.fastq.gz -> /scratch/users/hholze/BARtab/work/19/84b5109b37a0e5453fc046f38b968b/test_SE_03.unmapped.fastq.gz
pipeline_info
execution_report_2024-02-13_10-54-54.html
execution_timeline_2024-02-13_10-54-54.html
execution_trace_2024-02-13_10-54-54.txt
pipeline_dag_2024-02-13_10-54-54.html
software_check.txt
reference_index
test.ref.fasta.1.ebwt -> /scratch/users/hholze/BARtab/work/2b/cf68381a69d3f08ffef825f5c81e76/test.ref.fasta.1.ebwt
test.ref.fasta.2.ebwt -> /scratch/users/hholze/BARtab/work/2b/cf68381a69d3f08ffef825f5c81e76/test.ref.fasta.2.ebwt
test.ref.fasta.3.ebwt -> /scratch/users/hholze/BARtab/work/2b/cf68381a69d3f08ffef825f5c81e76/test.ref.fasta.3.ebwt
test.ref.fasta.4.ebwt -> /scratch/users/hholze/BARtab/work/2b/cf68381a69d3f08ffef825f5c81e76/test.ref.fasta.4.ebwt
test.ref.fasta.rev.1.ebwt -> /scratch/users/hholze/BARtab/work/2b/cf68381a69d3f08ffef825f5c81e76/test.ref.fasta.rev.1.ebwt
test.ref.fasta.rev.2.ebwt -> /scratch/users/hholze/BARtab/work/2b/cf68381a69d3f08ffef825f5c81e76/test.ref.fasta.rev.2.ebwt
reports
fastqc
test_SE_01_fastqc.html
test_SE_02_fastqc.html
test_SE_03_fastqc.html
zips
test_SE_01_fastqc.zip
test_SE_02_fastqc.zip
test_SE_03_fastqc.zip
multiqc
multiqc_data
multiqc_bowtie1.txt
multiqc_citations.txt
multiqc_cutadapt_cutadapt_up.txt
multiqc_data.json
multiqc_fastp_fastp_filter.txt
multiqc_fastqc.txt
multiqc_general_stats.txt
multiqc.log
multiqc_sources.txt
**multiqc_report.html**
trimmed_reads
logs
test_SE_01.cutadapt_up.log
test_SE_02.cutadapt_up.log
test_SE_03.cutadapt_up.log
test_SE_01.trimmed.fastq.gz -> /scratch/users/hholze/BARtab/work/b8/d4049037e44cb9ef2e8123c4cf8bca/test_SE_01.trimmed.fastq.gz
test_SE_02.trimmed.fastq.gz -> /scratch/users/hholze/BARtab/work/af/fe172f0a69673bc61a0791fbffc6c3/test_SE_02.trimmed.fastq.gz
test_SE_03.trimmed.fastq.gz -> /scratch/users/hholze/BARtab/work/5b/5180a8298fd66d703cc75675fa14e2/test_SE_03.trimmed.fastq.gz
Single-cell
sc
counts
logs
test_sc_count.log
test_sc_pcr_chimerism.log
pcr_chimerism
test_sc_mapped_barcodes.tsv -> /scratch/users/hholze/BARtab/work/91/80eb18d752e203a36e56c802287f1e/test_sc_mapped_barcodes.tsv
qc
**test_sc_avg_sequence_length_filtered.pdf**
**test_sc_avg_sequence_length.pdf**
test_sc_avg_sequence_length.tsv
**test_sc_barcodes_per_cell_filtered.pdf**
**test_sc_barcodes_per_cell.pdf**
**test_sc_UMIs_per_bc_filtered.pdf**
**test_sc_UMIs_per_bc.pdf**
**test_sc_cell_barcode_annotation.tsv**
**test_sc.counts.tsv**
filtered_reads
logs
test_sc.filter.fastp.html
test_sc.filter.fastp.json
test_sc.filter.log
test_sc.filtered.fastq.gz -> /scratch/users/hholze/BARtab/work/3d/5eb0602672d14f41394bb45beb8453/test_sc.filtered.fastq.gz
mapped_reads
logs
test_sc.bowtie.log
test_sc.mapped_filtered.sam -> /scratch/users/hholze/BARtab/work/74/ba40770c2814b5b8ca350f68b238a3/test_sc.mapped_filtered.sam
test_sc.mapped.sam -> /scratch/users/hholze/BARtab/work/85/342afa175dac82cc4c8c034659ec2a/test_sc.mapped.sam
unmapped
test_sc.unmapped.fastq.gz -> /scratch/users/hholze/BARtab/work/85/342afa175dac82cc4c8c034659ec2a/test_sc.unmapped.fastq.gz
pipeline_info
execution_report_2024-02-13_10-41-07.html
execution_timeline_2024-02-13_10-41-07.html
execution_trace_2024-02-13_10-41-07.txt
pipeline_dag_2024-02-13_10-41-07.html
software_check.txt
reference_index
test.ref.fasta.1.ebwt
test.ref.fasta.2.ebwt
test.ref.fasta.3.ebwt
test.ref.fasta.4.ebwt
test.ref.fasta.rev.1.ebwt
test.ref.fasta.rev.2.ebwt
reports
fastqc
test_sc_R1_fastqc.html
test_sc_R2_fastqc.html
zips
test_sc_R1_fastqc.zip
test_sc_R2_fastqc.zip
multiqc
multiqc_data
multiqc_bowtie1.txt
multiqc_citations.txt
multiqc_cutadapt_cutadapt_up.txt
multiqc_data.json
multiqc_fastp_fastp_filter.txt
multiqc_fastqc.txt
multiqc_general_stats.txt
multiqc.log
multiqc_sources.txt
**multiqc_report.html**
trimmed_reads
logs
test_sc.cutadapt_up.log
test_sc.trimmed.fastq.gz -> /scratch/users/hholze/BARtab/work/bc/0dc0335a276e124249000dfeeaaeb6/test_sc.trimmed.fastq.gz
umitools
extract
logs
test_sc_extract.log
test_sc_R2_extracted.fastq -> /scratch/users/hholze/BARtab/work/8a/1f38d4dcb11843353bcf43b4479a74/test_sc_R2_extracted.fastq
whitelist
logs
test_sc_whitelist.log
test_sc_whitelist.tsv -> /scratch/users/hholze/BARtab/work/40/89a05ef10cda4349ab4c71dcc74e50/test_sc_whitelist.tsv
Owner
- Name: Dane Vassiliadis
- Login: DaneVass
- Kind: user
- Location: Melbourne, Australia
- Company: Peter MacCallum Cancer Centre
- Website: https://twitter.com/danevass
- Twitter: DaneVass
- Repositories: 2
- Profile: https://github.com/DaneVass
GitHub Events
Total
- Watch event: 1
- Push event: 2
- Pull request review event: 1
- Pull request event: 2
Last Year
- Watch event: 1
- Push event: 2
- Pull request review event: 1
- Pull request event: 2
Issues and Pull Requests
Last synced: 10 months ago
All Time
- Total issues: 0
- Total pull requests: 2
- Average time to close issues: N/A
- Average time to close pull requests: 6 months
- Total issue authors: 0
- Total pull request authors: 2
- Average comments per issue: 0
- Average comments per pull request: 0.0
- Merged pull requests: 1
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 0
- Pull requests: 1
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Issue authors: 0
- Pull request authors: 1
- Average comments per issue: 0
- Average comments per pull request: 0.0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
Pull Request Authors
- HenriettaHolze (4)
- DaneVass (1)
- MarwaneBourdim (1)