bambu

Reference-guided transcript discovery and quantification for long read RNA-Seq data

https://github.com/goekelab/bambu

Science Score: 49.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 4 DOI reference(s) in README
  • Academic publication links
  • Committers with academic emails
    3 of 12 committers (25.0%) from academic institutions
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (13.2%) to scientific vocabulary

Keywords

bambu bioconductor long-reads nanopore nanopore-sequencing r rna-seq rna-seq-analysis transcript-quantification transcript-reconstruction transcriptomics

Keywords from Contributors

genomics bioconductor-packages ontology sequencing gene proteomics rna-seq-data promoter-annotation promoter-activity alternative-promoters
Last synced: 6 months ago

Repository

Reference-guided transcript discovery and quantification for long read RNA-Seq data

Basic Info
  • Host: GitHub
  • Owner: GoekeLab
  • License: gpl-3.0
  • Language: R
  • Default Branch: devel
  • Homepage:
  • Size: 550 MB
Statistics
  • Stars: 215
  • Watchers: 6
  • Forks: 24
  • Open Issues: 57
  • Releases: 5
Topics
bambu bioconductor long-reads nanopore nanopore-sequencing r rna-seq rna-seq-analysis transcript-quantification transcript-reconstruction transcriptomics
Created about 6 years ago · Last pushed 6 months ago
Metadata Files
Readme Changelog License

README.md

Bambu

bambu: Context-Aware Transcript Quantification from Long Read RNA-Seq data


bambu is an R package for multi-sample transcript discovery and quantification using long read RNA-Seq data. You can use bambu after read alignment to obtain expression estimates for known and novel transcripts and genes. The output from bambu can directly be used for visualization and downstream analysis such as differential gene expression or transcript usage.


Installation

bambu is available through GitHub and Bioconductor

Bioconductor:

```rscript
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("bambu")
```

GitHub:

```rscript
library(devtools)
install_github("GoekeLab/bambu")
library(bambu)
```

We can test if bambu is installed and runs correctly by using a small test set that comes with the package.

```rscript
test.bam <- system.file("extdata", "SGNex_A549_directRNA_replicate5_run1_chr9_1_1000000.bam", package = "bambu")

fa.file <- system.file("extdata", "Homo_sapiens.GRCh38.dna_sm.primary_assembly_chr9_1_1000000.fa", package = "bambu")

gtf.file <- system.file("extdata", "Homo_sapiens.GRCh38.91_chr9_1_1000000.gtf", package = "bambu")

bambuAnnotations <- prepareAnnotations(gtf.file)

se <- bambu(reads = test.bam, annotations = bambuAnnotations, genome = fa.file)
```

General Usage

The default mode to run bambu is using a set of aligned reads (bam files), reference genome annotations (gtf file, TxDb object, or bambuAnnotations object that can be obtained from the prepareAnnotations() function), and the reference genome sequence (fasta file or BSgenome). bambu will return a SummarizedExperiment object with the genomic coordinates for annotated and new transcripts and transcript expression estimates. If you do not have any data yet, or would like to test bambu with a full data-set that has been proven to work (the test data-set that comes with the package is too small for proper analysis), we recommend the SG-NEx data project. You can find this data and instructions on how to access it here. We highly recommend using the same annotations that were used for genome alignment. If you have a gtf file and fasta file you can run bambu with the following options:

```rscript
se <- bambu(reads = sample, annotations = annotations, genome = fa.file)
```

reads - a path to one or more bam files aligned to the same genome used in the genome argument, or a path to intermediate read class files (see Storing and using preprocessed files (rcFiles))

genome - is a path to the genome fasta file. This should be the same file used for read alignment. Bambu does not support alignment to the transcriptome as it requires the splice junctions from a genome alignment for transcript discovery.

annotations - takes a path to a gtf file, a txdb object, or annotations prepared by prepareAnnotations() (see Use precalculated annotation objects). When not provided, de novo transcript discovery is performed (see De-novo transcript discovery)

NDR - Novel Discovery Rate threshold. A value between 0 and 1 representing the maximum NDR threshold for candidate transcripts to be included in the analysis. By default, bambu will recommend a threshold for your analysis. For more information see Modulating the sensitivity of discovery (pre and post analysis).

For the full parameter list see Arguments

For information on the output and how to export it to a file see Output.

Transcript discovery only (no quantification)

If you are only interested in identifying novel transcripts, the quantification module of bambu can be skipped by setting quant to FALSE. Note that the output will be a GRangesList object containing the reference and novel annotations (see rowRanges() in Output). For more details on how to adjust the sensitivity and precision of the results see Modulating the sensitivity of discovery (pre and post analysis).

```rscript
se.discoveryOnly <- bambu(reads = test.bam, annotations = gtf.file, genome = fa.file, quant = FALSE)
```

Transcripts that were above the NDR threshold and filtered out (low confidence transcripts) and subset transcripts can be accessed in the metadata of the GRangesList object.

```rscript
metadata(se.discoveryOnly)$lowConfidenceTranscripts
metadata(se.discoveryOnly)$subsetTranscripts
```

Quantification of annotated transcripts and genes only (no transcript/gene discovery)

If you are only interested in quantifying transcripts, the discovery module of bambu can be skipped by setting discovery to FALSE.

```rscript
se.quantOnly <- bambu(reads = test.bam, annotations = gtf.file, genome = fa.file, discovery = FALSE)
```

Using precalculated annotation objects

Depending on the size of your reference annotations, the prepareAnnotations() step may take a few minutes. You can also use precalculated annotations, and if you plan to run bambu more frequently with the same annotations, we recommend saving the bambuAnnotations object. The bambuAnnotations object can be calculated from:

a) a .gtf file:

```rscript
annotations <- prepareAnnotations(gtf.file)
```

b) a TxDb object:

```rscript
annotations <- prepareAnnotations(txdb)
```

Save the object:

```rscript
saveRDS(annotations, "/path/to/annotations.rds")
```

This object can then be used instead of a path to your reference annotations for the annotations argument.

```rscript
annotations <- readRDS("/path/to/annotations.rds")
bambu(reads = test.bam, annotations = annotations, genome = fa.file)
```

Running multiple samples

If you have multiple replicates for a sample, or plan to do comparative analysis between conditions, it may be beneficial to run all samples together instead of individually. This can be done by providing a vector of paths to all the bam files you want to analyze together.

```rscript
se.multiSample <- bambu(reads = c(test1.bam, test2.bam, test3.bam), annotations = gtf.file, genome = fa.file)
```

The advantages of running samples together include: novel transcripts identified in multiple samples are assigned unified IDs, enabling comparative analysis between different samples, which is especially important for downstream differential expression analysis of novel transcripts; and running multiple samples can be multithreaded (see ncore). When running multiple samples, by default bambu will train a model separately on each sample and score novel transcripts in each sample separately.

If you need to combine samples in multiple configurations (for example different pairwise comparisons) we would recommend using the intermediate rcFiles to save processing time (see Storing and using preprocessed files (rcFiles))

Modulating the sensitivity of discovery (pre and post analysis)

When doing transcript discovery there is a balance between sensitivity (the number of real novel transcripts that are detected) and the precision (how many of the novel transcripts are real). To control this balance, bambu uses the novel discovery rate (NDR) as the main parameter. The NDR threshold approximates the proportion of novel candidates output by bambu, relative to the number of known transcripts it found, i.e., an NDR of 0.1 would mean that 10% of all transcripts passing the threshold are classified as novel.

If you are using a genome where you expect a high number of novel transcripts, a higher NDR threshold is recommended so that these novel transcripts are not missed. Conversely, if you are using a well annotated genome, we recommend a lower NDR threshold to reduce the presence of false positives. By default the NDR threshold is automatically chosen for the user based on the predicted level of annotation completeness when compared to the default model trained on human reference annotations (Hg38). For more information see Training a model on another species/dataset and applying it.

To manually select an NDR value, use the NDR argument in bambu:

```rscript
se.NDR_0.3 <- bambu(reads = test.bam, annotations = annotations, genome = fa.file, NDR = 0.3)
```

Alternatively, the NDR threshold can be adjusted after discovery or on the final output (note that this only affects the gtf output; for quantification to reflect the addition or removal of transcripts at the updated NDR, quantification would need to be rerun). The setNDR function adjusts the novel transcripts included in the output by removing any that are above the new threshold and adding those that are now below the threshold. setNDR takes the annotations as its first argument and the new NDR as the second argument. These annotations must have been generated by Bambu and have stored NDR values for this to work. Additionally, setNDR can be run without an NDR if you would prefer Bambu to recommend a threshold for your dataset. Refer to "Transcript discovery only" for advanced details on using setNDR.

```rscript
# after the discovery step
extendedAnnotations_0.3 = setNDR(se.discoveryOnly, 0.3)
writeAnnotationsToGTF(extendedAnnotations_0.3, "./output.gtf")

# after a complete run
extendedAnnotations_0.3 = setNDR(rowRanges(se), 0.3)
writeAnnotationsToGTF(extendedAnnotations_0.3, "./output.gtf")
```

To run quantification at a different NDR, simply provide bambu annotations alongside the new NDR threshold to bambu and it will automatically adjust the transcripts.

```rscript
se.quantOnly <- bambu(reads = test.bam, annotations = extendedAnnotations, genome = fa.file, discovery = FALSE, NDR = 0.5)
```

You can check the NDR threshold of your annotations by looking at the stored NDR value. This value is updated upon running setNDR. Annotations imported from a gtf file will not have this value until setNDR has been run for the first time.

```rscript
print(metadata(extendedAnnotations_0.3)$NDRthreshold)
```

Additionally there are other thresholds that advanced users can access through opt.discovery when running bambu (see arguments).

Output

bambu returns a SummarizedExperiment object which can be accessed as follows:

  • assays(se) returns a list of transcript abundance estimates as counts or CPM
  • rowRanges(se) returns a GRangesList with all annotated and newly discovered transcripts
  • rowData(se) returns additional information about each transcript
  • metadata(rowRanges(se)) returns a list of transcripts considered low confidence which were not included in the extended annotations.

Access transcript expression estimates by extracting a variable (such as counts or CPM) using assays():

  • assays(se)$counts - expression estimates
  • assays(se)$CPM - sequencing depth normalized estimates
  • assays(se)$fullLengthCounts - estimates of read counts mapped as full length reads for each transcript
  • assays(se)$uniqueCounts - counts of reads that are uniquely mapped to each transcript

For a full description of the other outputs see Output Description

The full output can be written to files using writeBambuOutput(). This function generates six files: four .gtf files (detailed below), and two .txt files for the expression counts at transcript and gene level.

By default bambu will write four .gtf files:

  • extendedAnnotations.gtf - Contains all transcript models from the reference annotations and any novel high confidence transcript models (below the NDR threshold) from Bambu.
  • allTranscriptModels - Contains all transcript models from the reference annotations and all novel transcript models, irrespective of their NDR score. This is useful for reloading into Bambu with prepareAnnotations() to redo the analysis or re-output the annotations at different NDR thresholds.
  • supportedTranscriptModels - Contains only transcript models that are fully supported by at least one read across the samples provided. Note that if multiple reference annotations share the same intron junctions, an arbitrary one will be selected to be included in this output.
  • novelTranscripts - Contains only novel high confidence transcript models (below the NDR threshold) from Bambu.

```rscript
writeBambuOutput(se, path = "./bambu/")
```

If you are only interested in the novel transcripts, you can filter the 'se' object first to remove reference annotations.

```rscript
se.novel = se[mcols(se)$novelTranscript,]
writeBambuOutput(se.novel, path = "./bambu/")
```

If you are only interested in full-length novel transcripts that were detected by Bambu in at least one sample:

```rscript
se.novel = se[mcols(se)$novelTranscript & (apply(assays(se)$fullLengthCounts >= 1, 1, sum) >= 1),]
writeBambuOutput(se.novel, path = "./bambu/")
```

If quant is set to FALSE, i.e. only transcript discovery is performed, only the rowRanges output of the extended annotations is returned (a GRangesList object). The equivalent rowData can be accessed with mcols(). These annotations can be written to .gtf files using writeAnnotationsToGTF(). This will output the four .gtf files mentioned above, and files can be excluded using the same arguments.

```rscript
se.discoveryOnly <- bambu(reads = sample, annotations = annotations, genome = fa.file, quant = FALSE)
writeAnnotationsToGTF(se.discoveryOnly, "./output.gtf")
```

If you would prefer to manually filter the annotations, you can also provide the resulting annotations to writeToGTF(), which will output the annotations as is.

```rscript
se.discoveryOnly.novel = se.discoveryOnly[mcols(se.discoveryOnly)$novelTranscript,]
writeToGTF(se.discoveryOnly.novel, "./output.gtf")
```

If both quant and discovery are set to FALSE, bambu will return an intermediate object (see Storing and using preprocessed files (rcFiles)).

To reimport the output of writeBambuOutput() use importBambuResults():

```rscript
se <- importBambuResults(path = "/path/to/bambu/output/")
```

Visualization

You can visualize the novel genes/transcripts using the plotBambu function. (Note that the visualizations were generated by running bambu on the three replicates of the HepG2 cell line from the SG-NEx project.)

```rscript
plotBambu(se, type = "annotation", gene_id = "ENSG00000107104")
```


```rscript
plotBambu(se, type = "annotation", transcript_id = "tx.9")
```


plotBambu can also be used to visualize the clustering of input samples by gene/transcript expression. This is only applicable when visualizing multiple samples (see Running multiple samples).

```rscript
plotBambu(se, type = "heatmap") # heatmap
```


```rscript
plotBambu(se, type = "pca") # PCA visualization
```


plotBambu can also be used to visualize the clustering of input samples by gene/transcript expression with a grouping variable:

```rscript
plotBambu(se, type = "heatmap", group.var) # heatmap

plotBambu(se, type = "pca", group.var) # PCA visualization
```

Single-Cell-and-Spatial

A single-cell and spatial pipeline, starting from fastq or demultiplexed bam files and including demultiplexing and alignment, is available at https://github.com/GoekeLab/bambu-singlecell-spatial. We recommend using this pipeline where possible.

For advanced users, see the Custom single-cell and spatial section under advanced options.

Bambu Advanced Options

Below we include several advanced options and use-cases for bambu. We recommend reading and understanding the paper before attempting to use these features.

Using a pretrained model

Bambu requires at least 1000 transcripts from the annotations to be detected in a sample in order to train a sample specific model. In use cases where this is not possible bambu will instead use a default pretrained model to calculate the transcript probability score (TPS) for each read class. Users can force this behavior if they believe their annotations are not sufficient for sample specific training (for example if they suspect a high proportion of real novel transcripts are present in their sample). This is advantageous when you want NDR calibration without the impacts of a model trained using low quality annotations.

```rscript
se <- bambu(reads = test.bam, annotations = annotations, genome = fa.file, opt.discovery = list(fitReadClassModel = FALSE))
```

The default pretrained model was trained on SGNex_HepG2_directRNA_replicate5_run1 and has the following characteristics:

Genome: Homo_sapiens.GRCh38.dna_sm.primary_assembly
Annotations: Homo_sapiens.GRCh38.91
Read count: 7,861,846
Technology: Nanopore (ONT)
Library preparation: directRNA
Base calling accuracy: 79%
Average read length: 1093

We have found the pretrained model works successfully across species borders (on Arabidopsis thaliana) and on different technologies (PacBio), with only small decreases in performance compared to using a sample specific model. The pretrained model is not always effective in samples with large differences in sequencing quality or if the library preparation results in biases in the overall structure of the transcriptome. In this case, we would recommend training a new model using similar data from a different sample that has quality reference annotations (See Training a model on another species/dataset and applying it).

De-novo transcript discovery

In cases where the organism does not yet have reference annotations, or has only unreliable annotations, bambu can be run in de-novo mode. In de-novo mode, bambu does not train a model, and instead uses the pretrained model to classify novel transcripts (see Using a pretrained model). To learn how to train a new model for a more closely related organism/sample see Training a model on another species/dataset and applying it. Without annotations bambu cannot calibrate the NDR output, nor recommend a threshold, and will instead threshold on the TPS. Therefore you should supply a manual NDR threshold (see Modulating the sensitivity of discovery (pre and post analysis)) and note that the precision of the output is unlikely to linearly match the applied threshold. The TPS threshold used is > 1-NDR. If an NDR is not provided, a default NDR threshold of 0.1 is used (an effective TPS threshold of > 0.9).

```rscript
novelAnnotations <- bambu(reads = test.bam, annotations = NULL, genome = fa.file, NDR = 0.5, quant = FALSE)
```

Storing and using preprocessed files (rcFiles)

The first step of bambu, the construction of read classes, accounts for a large fraction of the running time. This can be time-consuming when performing transcript discovery & quantification multiple times on the same dataset with different configurations (e.g. NDR, or combinations of samples), especially when the sample is large. To mitigate this, the read class information can be stored as read class files (rcFiles) during a bambu run. These can then be used as the reads argument in subsequent bambu runs.

```rscript
se <- bambu(reads = rcFiles, annotations = annotations, genome = fa.file)
```

rcFiles can be generated in two ways: either as a direct output of the bambu() function when quant and discovery are FALSE, or as written outputs when a path is provided to the rcOutDir argument. When rcFiles are output using rcOutDir this is done using BiocFileCache. For more details on how to access, use, and identify these files see here. A short example is shown below.

Example using rcOutDir to produce preprocessed files:

```rscript
se <- bambu(reads = test.bam, rcOutDir = "path/to/rcOutput/", annotations = annotations, genome = fa.file)
```

This will store a preprocessed rcFile in the provided directory for each sample file provided to reads. To access these files for future use, we recommend using the BiocFileCache package, which provides the metadata needed to identify the sample.

```rscript
library(BiocFileCache)
bfc <- BiocFileCache("path/to/rcOutput/", ask = FALSE)
info <- bfcinfo(bfc)
```

The info object is a tibble which associates the filename (fpath) with the sample (rname) to help you identify which .rds file you need.

```rscript
info

# running bambu using the first file
se <- bambu(reads = info$rpath[1], annotations = annotations, genome = fa.file)
```

This output is also generated when both quant and discovery are set to FALSE, in a list form indexed by sample.

```rscript
se <- bambu(reads = test.bam, annotations = annotations, genome = fa.file, discovery = FALSE, quant = FALSE, assignDist = FALSE)
```

As this is an intermediate object it is not suitable for general use cases. We document the object below for any potential advanced use cases that may arise.

```rscript
rowData(se[[1]])
```

|column name|description|
|---|---|
|chr.rc|The chromosome name the read class is found on|
|strand.rc|The strand of the read class|
|startSD|The standard deviation of the aligned genomic start positions of all reads assigned to the read class|
|endSD|The standard deviation of the aligned genomic end positions of all reads assigned to the read class|
|readCount.posStrand|The number of reads assigned to this read class that aligned to the positive strand|
|intronStarts|A comma separated character vector of intron start coordinates|
|intronEnds|A comma separated character vector of intron end coordinates|
|confidenceType|Category of confidence: highConfidenceJunctionReads - the read class contains no low confidence junctions; lowConfidenceJunctionReads - the read class contains low confidence junctions; unsplicedWithin - single exon read class that is within the exon boundaries of an annotation; unsplicedNew - single exon read class that does not fully overlap with annotated exons|
|readCount|The number of reads assigned to this read class|
|readIds|An integer list of bambu internal read ids that belong to the read class (see the metadata of the object for full read names)|
|sampleIds|An integer list of bambu internal sample ids based on barcodes|
|GENEID|The gene ID the transcript is associated with|
|novelGene|A logical that is true if the read class belongs to a novel gene (does not overlap with an annotated gene locus)|
|numExons|The number of exons the read class has|
|geneReadProp|The proportion of reads assigned to this read class relative to all the reads assigned to all read classes from its gene|
|geneReadCount|The number of reads assigned to the gene of this read class|
|equal|A logical that is true if the read class's exon-junctions perfectly and completely match the exon-junctions of a reference annotation|
|compatible|An integer counting the number of reference annotations in which the read class's exon-junctions are contiguously present (a subset)|
|numAstart|An integer counting the number of A nucleotides found within a 20bp window centered on the read class genomic start position|
|numAend|An integer counting the number of A nucleotides found within a 20bp window centered on the read class genomic end position|
|numTstart|An integer counting the number of T nucleotides found within a 20bp window centered on the read class genomic start position|
|numTend|An integer counting the number of T nucleotides found within a 20bp window centered on the read class genomic end position|
|txScore.noFit|The TPS generated by the pretrained model|
|txScore|The TPS generated by the sample-trained model|

Tracking read-to-transcript assignment

Some use cases require knowing which individual reads support specific transcripts (novel and annotated). By default this feature is off due to the memory overhead it introduces, but it can be turned on using the trackReads argument. The output has three columns: read_id, a list of indices of equal matches, and a list of indices of compatible matches. These indices match the annotations found in rowRanges(se).

```rscript
se <- bambu(reads = test.bam, annotations = annotations, genome = fa.file, trackReads = TRUE)
metadata(se)$readToTranscriptMaps[[1]]
```

|column name|description|
|---|---|
|readId|The read name as found in the bam file. If running from an rcFile where trackReads != TRUE, bambu will not have stored the read names; this will instead be a unique bambu-assigned numerical ID (which will not correlate with the bam file)|
|equalMatches|A list of integers with the tx ids where the exon-junctions of the read match completely and contiguously. This matches the index of the transcript found in rowRanges()|
|compatibleMatches|A list of integers with the tx ids where the exon-junctions of the read are found contiguously within the transcript (a subset). This matches the index of the transcript found in rowRanges()|
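As a sketch of how this table can be used (assuming the column structure described above, and that `se` comes from a run with trackReads = TRUE; the transcript index 3 is an arbitrary illustration), the reads whose exon-junctions exactly match a given transcript can be pulled out with base R:

```rscript
# hypothetical example: find reads whose equalMatches contain transcript index 3
readMap <- metadata(se)$readToTranscriptMaps[[1]]
supportingReads <- readMap$readId[vapply(readMap$equalMatches,
                                         function(ids) 3L %in% ids,
                                         logical(1))]
# the corresponding transcript annotation, by the same index
rowRanges(se)[3]
```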

Training a model on another species/dataset and applying it

In situations where training is not or cannot be performed, and the default model is also not suitable for the sample (e.g. the sample is generated from a different technology, species, or condition), bambu provides the option to train a new model if well-annotated, similar data is available. For example, one might train a model on Arabidopsis to apply to an unannotated plant sample.

```rscript
# first train the model using a related annotated dataset from .bam
se <- bambu(reads = sample1.bam, annotations = annotations, genome = fa.file, discovery = FALSE, quant = FALSE, opt.discovery = list(returnModel = TRUE))
# note that discovery and quant need to be set to FALSE; alternatively you can have them set to TRUE
# and retrieve the model from the rcFile as long as returnModel = TRUE (see here)
newDefaultModel = metadata(se[[1]])$model # [[1]] will select the model trained on the first sample

# alternatively train the model using an rcFile
rcFile <- readRDS(pathToRcFile)
newDefaultModel <- trainBambu(rcFile)

# use the trained model on another sample
# sample2.bam and fa.file2 represent the aligned reads and genome for the poorly annotated sample
se <- bambu(reads = sample2.bam, annotations = NULL, genome = fa.file2, opt.discovery = list(defaultModels = newDefaultModel, fitReadClassModel = FALSE))
```

trainBambu Arguments

```rscript
trainBambu(rcFile = NULL, min.readCount = 2, nrounds = 50, NDR.threshold = 0.1, verbose = TRUE)
```

|arguments|description|
|---|---|
|rcFile|A loaded rcFile sample (see Storing and using preprocessed files (rcFiles))|
|min.readCount|The minimum read count threshold used for read classes during training|
|nrounds|The number of stumps used in the xgboost tree|
|NDR.threshold|The NDR threshold that will be used for the recommended NDR calibration when using this model|
|verbose|A logical indicating if more information should be printed whilst the function is running|

Quantification of gene expression

Simply summing over all annotated transcripts will likely underestimate gene expression, as Bambu only assigns reads to transcripts they are compatible with. Reads that are incompatible with all transcripts, but can still be assigned to the gene, are tracked by Bambu to obtain a more accurate gene expression estimate.

To obtain gene expression estimates that use all reads which can be assigned to each gene (including reads that are incompatible with all existing annotations) you can run the following command:

```rscript
seGene <- transcriptToGeneExpression(se)
```

The output of this function is a SummarizedExperiment object, where

  • assays(seGene)$counts returns the estimated expression counts for each gene
  • assays(seGene)$CPM returns the estimated CPM for each gene
  • rowData(seGene) returns the gene information
  • rowRanges(seGene) returns the gene genomic ranges

Including single exons

By default bambu does not report single exon transcripts because they are known to have a high frequency of false positives and do not have splice junctions that are used by bambu to distinguish read classes. Nevertheless bambu trains a separate model on single-exon transcripts, and these predictions can be accessed and included in the annotations.

```rscript
se <- bambu(reads = sample1.bam, annotations = annotations, genome = fa.file, opt.discovery = list(min.txScore.singleExon = 0))
```

Fusion gene/isoform detection

To facilitate fusion gene/isoform detection, bambu implements a fusion mode. When fusionMode is set to TRUE, bambu will assign multiple GENEIDs to fusion transcripts, separated by ":".

To use this feature, it is recommended to first detect the fusion gene breakpoints using a fusion detection tool such as JAFFAL. A fusion chromosome fasta file can then be created by concatenating the two fusion gene sequences. Similarly, a fusion annotation gtf file can be created with the coordinates of the transcripts from the relevant genes changed to fusion chromosome coordinates. Reads originating from the fusion region must then be re-aligned to the generated fusion chromosome fasta file. Users can then apply bambu to the re-aligned bam files with the fusion chromosome fasta and gtf files.

```rscript
se <- bambu(reads = fusionAligned.bam, annotations = fusionAnnotations, genome = fusionFasta, fusionMode = TRUE)
```

Custom single-cell and spatial

If you want to run Bambu-Clump for single-cell or spatial analysis stand-alone, and not as part of the Bambu-Pipe pipeline, we recommend running it in 4 stages, which we describe separately: Read Class Construction, Transcript Discovery, Read Class Assignment, and EM Quantification. Note that this section only covers arguments that are different or unique to this analysis.

Read Class Construction:

reads: provided bam files should have barcodes in the read name or in the BC tag (and the UG tag for UMI identifiers). Where both tags and read names contain barcode information, the tags take priority. If neither is present, a delimited, headerless file containing the demultiplexing information for each read should be provided to the demultiplexed argument below. For exact requirements see https://github.com/GoekeLab/bambu-singlecell-spatial.

demultiplexed: should either be set to TRUE or to the path of a barcode mapping file. Otherwise, bambu will not look for barcodes and will separate reads by sample rather than by barcode.

Optional:

cleanReads: A logical TRUE/FALSE. Chimeric reads in samples can cause issues with barcode assignments. Setting this to TRUE will ensure only the first alignment per barcode is used (We recommend using this).

sampleNames: A vector of characters assigning names to each sample in the reads argument. By default the sample names are taken from the file names and appended to the barcodes in order to differentiate them. If your sample names are the same across multiple files, but matching barcodes between the samples should be counted separately, provide them with different sample names using this argument. Similarly, if your samples have different names, but overlapping barcodes should be counted together, give them the same sample name with this argument.

dedupUMI: A logical TRUE/FALSE. When TRUE, UMI deduplication is performed: the longest read per UMI is kept and the rest are discarded.

barcodesToFilter: A string vector indicating barcodes to be filtered out.

```rscript
readClassFile <- bambu(reads = samples, annotations = annotations, genome = fa.file,
                       ncore = 1, discovery = FALSE, quant = FALSE,
                       demultiplexed = barcode_maps, verbose = TRUE, assignDist = FALSE,
                       lowMemory = as.logical("$params.lowMemory"), yieldSize = 10000000,
                       sampleNames = ids, cleanReads = as.logical($cleanReads),
                       dedupUMI = as.logical($deduplicateUMIs))
```

Transcript Discovery:

Transcript discovery can be run as usual, as bulk-level discovery is typically suitable. However, cluster-level transcript discovery can be performed using the clusters argument, and can be redone after clustering.

```rscript
extendedAnno <- bambu(reads = readClassFile, annotations = annotations, genome = fa.file,
                      ncore = 1, discovery = TRUE, quant = FALSE, demultiplexed = TRUE,
                      verbose = FALSE, assignDist = FALSE)
```

Read Class Assignment:

This step was previously performed together with the quantification, but it can be done separately so that its output can be passed to the quantification separately with different clusterings. If you only want barcode-level gene counts or unique transcript counts you can stop here and do not need to proceed to the EM quantification.

spatial: This should be a path to your barcode whitelist that also contains the x and y coordinates as extra columns. If provided, the file should contain 3 columns with or without a header, where the first column is the barcode and the second and third columns contain the x and y coordinates respectively. Compressed file formats are accepted as well.
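A minimal sketch of such a whitelist, with hypothetical barcodes and coordinates:

```shell
# Hypothetical spatial whitelist: barcode, x coordinate, y coordinate
cat > spatial_whitelist.csv <<'EOF'
AAACCTGAGAAACCAT,10,24
AAACCTGTCAAGGCTT,11,24
AAACGGGAGTCGATAA,12,25
EOF
```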

```rscript
quantData <- bambu(reads = readClassFile, annotations = extendedAnno, genome = fa.file,
                   ncore = 1, discovery = FALSE, quant = FALSE, demultiplexed = TRUE,
                   verbose = FALSE, opt.em = list(degradationBias = FALSE),
                   assignDist = TRUE, spatial = spatial)
```

EM quantification:

If you plan to run this step with multiple processes we recommend restarting your R instance to ensure that environmental variables do not inflate the memory usage.

reads: This argument is still mandatory but is not used when performing quantification alone, as long as you provide the quantData argument (it can be set to NULL).

quantData: The SummarizedExperiment output from the Read Class Assignment step.

clusters: This is an optional argument which is either a path to a csv containing the barcode to cluster assignments or a CharacterList which can be produced using the code below.

opt.em = list(degradationBias = FALSE): We recommend including this argument when doing barcode-level EM quantification; it greatly improves runtime with only a small reduction in quantification accuracy.

```rscript
# use Seurat to generate clusters from gene counts
library(Seurat)

clusterCells <- function(counts, resolution = 0.8, dim = 15){
    cellMix <- CreateSeuratObject(counts = counts, project = "cellMix", min.cells = 1)
    cellMix <- NormalizeData(cellMix, normalization.method = "LogNormalize", scale.factor = 10000)
    cellMix <- FindVariableFeatures(cellMix, selection.method = "vst", nfeatures = 2500)
    all.genes <- rownames(cellMix)
    cellMix <- ScaleData(cellMix, features = all.genes)
    npcs <- ifelse(ncol(counts) > 50, 50, ncol(counts) - 1)
    cellMix <- RunPCA(cellMix, features = VariableFeatures(object = cellMix), npcs = npcs)
    # if the data dimension is small, use it; otherwise cap the dimension at 15
    dim <- ifelse(dim >= dim(cellMix@reductions$pca)[2], dim(cellMix@reductions$pca)[2], dim)
    cellMix <- FindNeighbors(cellMix, dims = 1:dim)
    cellMix <- FindClusters(cellMix, resolution = resolution)
    cellMix <- RunUMAP(cellMix, dims = 1:dim)
    return(cellMix)
}

quantData.gene <- transcriptToGeneExpression(quantData)
counts <- assays(quantData.gene)$counts # selecting first sample
# resolution can be customized. For larger clusters: 0.2-0.6, for higher resolution: 0.8-2
cellMix <- clusterCells(counts)
x <- setNames(names(cellMix@active.ident), cellMix@active.ident)
# make cluster names start with "cluster", for better comprehension
clusters_temp <- splitAsList(unname(x), paste0("cluster", names(x)))

se <- bambu(reads = NULL, annotations = rowRanges(quantData), genome = "$genome",
            quantData = quantData, assignDist = FALSE, ncore = $params.ncore,
            discovery = FALSE, quant = TRUE, demultiplexed = TRUE, verbose = FALSE,
            opt.em = list(degradationBias = FALSE), clusters = clusters_temp)
```
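Alternatively, for the csv form of the clusters argument, a hypothetical barcode-to-cluster mapping file could be created as follows (the column order, barcode first, is an assumption for illustration; see the arguments table below):

```shell
# Hypothetical barcode-to-cluster mapping (assumed order: barcode, cluster label)
cat > clusters.csv <<'EOF'
AAACCTGAGAAACCAT,cluster1
AAACCTGTCAAGGCTT,cluster1
AAACGGGAGTCGATAA,cluster2
EOF
```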

Bambu Arguments

|argument|description| |---|---| |reads|A string or a vector of strings specifying the paths of bam files for genomic alignments, or a BamFile object or a BamFileList object (from Rsamtools).| | rcOutDir | A string variable specifying the path to where read class files will be saved. | | annotations | A TxDb object, a path to a .gtf file, or a GRangesList object obtained by prepareAnnotations. | | genome | A fasta file or a BSGenome object. If a fa.gz is provided, the .fai and .gzi must also be present | | stranded | A boolean for strandedness, defaults to FALSE. | | ncore | specifying number of cores used when parallel processing is used, defaults to 1. | | NDR | specifying the maximum NDR rate for novel transcript output among detected transcripts, defaults to 0.1 | | yieldSize | see Rsamtools. | | opt.discovery | A list of controlling parameters for the isoform reconstruction process:
prefix specifying prefix for new gene Ids (genePrefix.number), defaults to empty
remove.subsetTx indicating whether filter to remove read classes which are a subset of known transcripts, defaults to TRUE
min.readCount specifying minimum read count to consider a read class valid in a sample, defaults to 2
min.readFractionByGene specifying minimum relative read count per gene, highly expressed genes will have many high read count low relative abundance transcripts that can be filtered, defaults to 0.05
min.sampleNumber specifying minimum sample number with minimum read count, gene read proportion, and TPS, defaults to 1
min.exonDistance specifying minimum distance to known transcript to be considered valid as new, defaults to 35bp
min.exonOverlap specifying minimum number of bases shared with annotation to be assigned to the same gene id, defaults to 10bp
min.primarySecondaryDist specifying the minimum distance threshold between a read class and the annotations' internal exons. Read classes with distances less than the threshold are not annotated as novel and are counted with the annotations for quantification, defaults to 5bp
min.primarySecondaryDistStartEnd1 specifying the minimum distance threshold between a read class and the annotations' start/end exons. Read classes with distances less than the threshold are not annotated as novel, defaults to 5bp
min.primarySecondaryDistStartEnd2 specifying the minimum distance threshold between a read class and the annotations' start/end exons. Read classes with distances less than the threshold are counted with the annotations, defaults to 5bp
min.txScore.multiExon specifying the minimum transcript probability score threshold for multi-exon transcripts for min.sampleNumber, defaults to 0
min.txScore.singleExon specifying the minimum transcript probability score threshold for single-exon transcripts for min.sampleNumber
fitReadClassModel a boolean specifying if bambu should train a model on each sample. If set to false bambu will use the default model for ranking novel transcripts. defaults to TRUE
defaultModels a bambu trained model object that bambu will use when fitReadClassModel==FALSE or the data is not suitable for training, defaults to the pretrained model in the bambu package
returnModel a boolean specifying if bambu will output the model it trained on the data, defaults to FALSE
baselineFDR a value between 0-1. Bambu uses this FDR on the trained model to recommend an equivalent NDR threshold to be used for the sample. By default, a baseline FDR of 0.1 is used. This does not impact the analysis if an NDR is set.
min.readFractionByEqClass indicating the minimum relative read count of a subset transcript compared to all superset transcripts (ie the relative read count within the minimum equivalent class). This filter is applied on the set of annotations across all samples using the total read count, this is not a per-sample filter. Please use with caution. defaults to 0 | | opt.em | A list of controlling parameters for quantification algorithm estimation process:
maxiter specifying maximum number of run iterations, defaults to 10000
degradationBias correcting for degradation bias, defaults to TRUE
conv specifying the convergence threshold control, defaults to 0.0001
minvalue specifying the minvalue for convergence consideration, defaults to 0.00000001 | | trackReads | When TRUE read names will be tracked and output as metadata in the final output as readToTranscriptMaps detailing the assignment of reads to transcripts. The output is a list with an entry for each sample. | | returnDistTable | When TRUE the calculated distance table between read classes and annotations will be output as metadata as distTables. The output is a list with an entry for each sample. | | discovery | A logical variable indicating whether annotations are to be extended for quantification, defaults to TRUE. | | quant | A logical variable indicating whether quantification will be performed, defaults to TRUE. | | verbose | A logical variable indicating whether processing messages will be printed. | | mode | A string that will set other input arguments ['bulk', 'multiplexed', 'fusion', 'debug']
bulk -
    processByBam = TRUE
    processByChromosome = FALSE
multiplexed -
    demultiplexed = TRUE
    cleanReads = TRUE
    opt.em = list(degradationBias = FALSE)
    quant = FALSE
    processByChromosome = TRUE
fusion -
    NDR = 1
    fusionMode = TRUE
debug -
    verbose = TRUE
    trackReads = TRUE
    returnDistTable = TRUE | | demultiplexed | A logical variable indicating whether the input bam file is demultiplexed. The barcode and UMI either need to be present in the read name or the $BC and $UG tags, defaults to FALSE. Alternatively a path to a csv file can be provided where column 1 is read names, column 2 is barcodes, and column 3 is UMI. | | spatial | A path to the barcode whitelist containing X and Y coordinates, defaults to null. If provided, the file should contain 3 columns with or without a header, where the first column is the barcode, and the second and third columns contain the x and y coordinates respectively. Compressed file formats are accepted as well.| | assignDist | A logical variable indicating whether read class to transcript assignment will be performed, defaults to TRUE. | | quantData | Advanced use only. A list of se outputs from the assignDist step. Used only to run quantification | | sampleNames | A vector of strings representing the sample name associated with each input bam. bam files with the same sample name will be combined | | cleanReads | A logical variable indicating whether only the first sequenced alignment in a read should be kept. This helps to remove chimeric reads, but will remove alignments from fusion genes, defaults to FALSE. | | dedupUMI | A logical variable indicating whether UMI deduplication is performed. The longest read per UMI will be used and the rest discarded, defaults to FALSE.| | barcodesToFilter | A vector of strings indicating the barcodes to be filtered out in reads.| | clusters | Either a list containing the barcodes for each cluster, or a path to a csv file containing the barcode to cluster mapping. When provided, clusters will be used during discovery and EM quant steps, defaults to null.
| | processByBam | A logical variable indicating if each input bam file will be processed separately (TRUE) or all are read in and processed together (FALSE), defaults to TRUE | | processByChromosome | A logical variable indicating if read classes will be constructed with all reads together (FALSE), or done by chromosome which uses less memory, but provides less information for the junction correction model (TRUE), defaults to FALSE |
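As a hedged illustration of the mode presets listed above (hypothetical input paths; the expansions shown are those documented in the table):

```rscript
# mode = "multiplexed" expands to demultiplexed = TRUE, cleanReads = TRUE,
# opt.em = list(degradationBias = FALSE), quant = FALSE, processByChromosome = TRUE
readClasses <- bambu(reads = "cells.bam", annotations = annotations,
                     genome = "genome.fa", mode = "multiplexed")
```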

setNDR() arguments

|argument|description|
|---|---|
|extendedAnnotations|A GRangesList object produced from bambu(quant = FALSE) or rowRanges(se), or loaded in from prepareAnnotations() of a Bambu-derived .gtf|
|NDR|The maximum NDR for novel transcripts to be in extendedAnnotations (0-1). If not provided, a recommended NDR is calculated.|
|includeRef|A boolean which if TRUE will also filter out reference annotations based on their NDR. Note that reference annotations with no NDR (because they were not detected) are not filtered and will remain, potentially impacting quantification. Use with caution. Defaults to FALSE.|
|prefix|A string which determines which transcript names are considered novel by bambu and will be filtered. Defaults to 'Bambu'.|
|baselineFDR|A value between 0-1. Bambu uses this FDR on the trained model to recommend an equivalent NDR threshold to be used for the sample. By default, a baseline FDR of 0.1 is used. This does not impact the analysis if an NDR is set. Defaults to NULL.|
|defaultModels|A bambu-trained model object used to recommend an NDR threshold if no NDR is provided. Defaults to the pretrained model in the bambu package.|
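A minimal hedged sketch of re-thresholding after discovery, assuming extendedAnnotations came from bambu(quant = FALSE):

```rscript
# Keep only novel transcripts passing a stricter NDR, without rerunning discovery
filteredAnnotations <- setNDR(extendedAnnotations, NDR = 0.05)
```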

Output Description

Access annotations that are matched to the transcript expression estimates by rowRanges()

```rscript
rowRanges(se)
```

|column|description|
|---|---|
|seqnames|The scaffold name the transcript is found on|
|ranges|An IRanges object containing the start and end coordinates of the transcript (not stranded)|
|strand|The strand of the transcript (+, -, *)|
|exonrank|The exon index of the exons in the transcript starting from the 5’ end of the transcript|
|exonendRank|The exon index of the exons in the transcript starting from the 3’ end of the transcript|

Access transcript level data which is matched to transcript expression estimates using rowData()

```rscript
mcols(rowRanges(se))
# or
mcols(se)
# or
rowData(se)
```

|column|description| |---|---| |TXNAME|The transcript name for the transcript. Will use either the transcript name from the provided annotations or tx.X if it is a novel transcript where X is a unique integer.| |GENEID|The gene name for the transcript. Will use either the gene name from the provided annotations or gene.X if it is a novel transcript where X is a unique integer.| |NDR|The NDR score calculated for the transcript| |novelGene|A logical variable that is true if transcript model is from a novel gene (does not overlap with an annotated gene loci)| |novelTranscript|A logical variable that is true if transcript model is novel (passing NDR threshold)| |txClassDescription|A concatenated string containing the classes the transcript falls under:
annotation - Transcript matches an annotation transcript
allNew - All the intron-junctions are novel
newFirstJunction - the first junction is novel and at least one other junction matches an annotated transcript
newLastJunction - the last junction is novel and at least one other junction matches an annotated transcript
newJunction - an internal junction is novel and at least one other internal junction matches an annotated transcript
newWithin - A novel transcript with matching junctions but is not a subset of an annotation
unsplicedNew - A single exon transcript that doesn’t completely overlap with annotations
compatible - Is a subset of an annotated transcript
newFirstExon - The first exon is novel
newLastExon - The last exon is novel| |readCount|The number of full length reads associated with this transcript (filtered by min.readCount)| |relReadCount|The proportion of reads this transcript has relative to all reads assigned to its gene| |relSubsetCount|The proportion of reads this transcript has relative to all reads that either fully or partially match this transcript| |txId|A bambu-specific transcript id used for indexing purposes| |eqClassById|An integer list with the transcript ids of all equivalent transcripts| |maxTxScore|The maximum model score across samples from the sample-trained model. Used internally by Bambu to calculate NDR scores| |maxTxScore.noFit|The maximum model score across samples from the pretrained model. Used internally by Bambu to recommend NDR thresholds|
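The count columns described above correspond to assays of the returned SummarizedExperiment; a sketch of accessing them (assay names as used in recent bambu releases; treat them as assumptions if your version differs):

```rscript
assays(se)$counts            # total read count estimates per transcript
assays(se)$CPM               # counts per million
assays(se)$fullLengthCounts  # full-length read counts
assays(se)$uniqueCounts      # reads assigned uniquely to a transcript
```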

```rscript
metadata(se)$incompatibleCounts
metadata(se)$warnings
```

incompatibleCounts - A table containing counts for incompatible reads that can be assigned to a gene but to none of the provided transcripts.
warnings - A list containing the warnings produced by each sample

```rscript
metadata(rowRanges(se))$NDRthreshold
metadata(rowRanges(se))$subsetTranscripts
metadata(rowRanges(se))$lowConfidenceTranscripts
metadata(rowRanges(se))$warnings
```

NDRthreshold - The NDR threshold currently applied to the novel transcripts. A number between 0 and 1.

subsetTranscripts - A GRangesList containing subset transcripts when remove.subsetTx = TRUE. readCount and txScore can be accessed from mcols.

lowConfidenceTranscripts - A GRangesList containing novel transcripts above the NDR threshold.

warnings - A list containing the warnings produced by each sample

Release History

bambu v3.8.2

Release date: 2025-02-06

Minor changes:

  • Fix large number of samples issue
  • Fix denovo bug issue

bambu v3.2.6

Release date: 2023-October-25

Minor changes:

  • Fix crash caused by de novo mode
  • Restore fusion mode functionality and added documentation
  • Fixed bug in plot function
  • Update release history

bambu v3.2.5

Release date: 2023-July-07

Minor changes:

  • Fix crash when extremely large datasets provided
  • Speed up read class construction
  • Add LongRead BiocView
  • Update release history

bambu v3.2.4

Release date: 2023-Apr-26

Minor changes:

  • Fixes crash during Low Memory Mode when there are scaffolds with no reads
  • Fixes crash on windows machines caused by DNAStringSet
  • Adds NDR metadata when running discovery mode with recommended NDR, so users do not need to look at console for the recommended NDR.
  • Re-enabled GitHub actions for new devel branch name and the windows check
  • Fixed a crash that occurs with large datasets resulting in large overflow tables during novel gene id assignment
  • Remove nested bplapply in EM
  • Remove unused eqClassById list column in the readClassDist object to reduce memory usage
  • Fixed a bug that caused identical unspliced reads to not be tracked when trackReads = TRUE

bambu version 3.0.0

Release date: 2022-10-25

Major changes:

  • Updated the input parameters of Bambu to simplify the user experience
  • Introduced NDR threshold recommendation
  • Implemented trainBambu(), allowing users to train and use models on their own data
  • Reads that cannot be assigned to any transcript are grouped as incompatible counts
  • Partial estimates are removed from output as it can be directly obtained based on total count estimates and full-length count estimates
  • The fusion mode is now available, which assigns read classes that align to multiple genes to a new combined fusion gene

Minor changes:

  • Novel transcripts and genes are now by default output with a Bambu prefix
  • Updated the documentation, messages and errors output by Bambu
  • Annotated transcripts (with unique exon-junctions) with at least 1 full-length read are assigned a NDR rank

bambu version 1.99.0

Release date: 2021-10-18

Major Changes:

  • Implemented a machine learning model to estimate transcript-level novel discovery rate
  • Implemented full length estimates, partial length estimates and unique read counts in final output
  • Improved the performance when extending annotations with simplified code
  • Improved the performance when large amounts of annotations are missing.
  • Implemented a lowMemory option to reduce the memory requirements for very large samples (>100 million reads)

Minor fixes:

  • remove the use of get() which looks into environment variables (prone to crashes if a variable of the same name exists) and directly references the functions that should be used instead.
  • bug fix when a fa file is provided as a string variable on non-windows systems
  • bug fix when no single exon read class in provided samples
  • bug fix when no splice overlaps found between read class and annotations

bambu version 1.0.2

Release date: 2020-11-10

  • bug fix for author name display
  • bug fix for calling fasta file and bam file from ExperimentHub
  • update NEWS file

bambu version 1.0.0

Release date: 2020-11-06

  • bug fix for parallel computation to avoid bplapply

bambu version 0.99.4

Release date: 2020-08-18

  • remove code using seqlevelsStyle to allow customized annotations
  • update the requirement of R version and ExperimentHub version

bambu version 0.3.0

Release date: 2020-07-27

  • bambu now runs on windows with a fasta file
  • update to the documentation (vignette)
  • prepareAnnotations now works with TxDb or gtf file
  • minor bug fixes

bambu version 0.2.0

Release date: 2020-06-18

bambu version 0.1.0

Release date: 2020-05-29

Citation

Chen, Y., Sim, A., Wan, Y.K. et al. Context-aware transcript quantification from long-read RNA-seq data with Bambu. Nat Methods (2023). https://doi.org/10.1038/s41592-023-01908-w

Contributors

This package is developed and maintained by Ying Chen, Andre Sim, Yuk Kei Wan, Keith Yeo, Min Hao Ling and Jonathan Goeke at the Genome Institute of Singapore. If you want to contribute, please leave an issue. Thank you.

Bambu

Owner

  • Name: Göke Lab
  • Login: GoekeLab
  • Kind: organization
  • Location: Genome Institute of Singapore

Computational Transcriptomics - Third Generation Sequencing

GitHub Events

Total
  • Create event: 16
  • Commit comment event: 5
  • Issues event: 42
  • Watch event: 30
  • Delete event: 31
  • Issue comment event: 92
  • Push event: 82
  • Pull request review comment event: 13
  • Pull request event: 33
  • Pull request review event: 28
  • Fork event: 3
Last Year
  • Create event: 16
  • Commit comment event: 5
  • Issues event: 42
  • Watch event: 30
  • Delete event: 31
  • Issue comment event: 92
  • Push event: 82
  • Pull request review comment event: 13
  • Pull request event: 33
  • Pull request review event: 28
  • Fork event: 3

Committers

Last synced: 11 months ago

All Time
  • Total Commits: 1,876
  • Total Committers: 12
  • Avg Commits per committer: 156.333
  • Development Distribution Score (DDS): 0.491
Past Year
  • Commits: 18
  • Committers: 3
  • Avg Commits per committer: 6.0
  • Development Distribution Score (DDS): 0.389
Top Committers
Name Email Commits
Chen Ying c****g@g****g 954
Andre a****m@g****g 447
jonathangoeke 1****e 241
Yuk Kei Wan 4****a 142
minhao l****1@g****m 41
keithyjy k****y@g****m 21
J Wokaty j****y@s****u 10
rstudio r****o@i****l 9
Nitesh Turaga n****a@g****m 8
Mike Love m****e 1
yukkei y****i@y****l 1
rstudio r****o@i****l 1

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 38
  • Total pull requests: 33
  • Average time to close issues: 3 months
  • Average time to close pull requests: 28 days
  • Total issue authors: 30
  • Total pull request authors: 4
  • Average comments per issue: 2.82
  • Average comments per pull request: 0.0
  • Merged pull requests: 19
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 35
  • Pull requests: 33
  • Average time to close issues: about 1 month
  • Average time to close pull requests: 28 days
  • Issue authors: 29
  • Pull request authors: 4
  • Average comments per issue: 2.0
  • Average comments per pull request: 0.0
  • Merged pull requests: 19
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • apsteinberg (6)
  • sparthib (4)
  • NikoLichi (2)
  • alexyfyf (2)
  • abcdtree (2)
  • tamuanand (2)
  • xhuashen (2)
  • nick-youngblut (2)
  • mikelove (2)
  • baibhav-bioinfo (2)
  • Tang-pro (2)
  • kneubehl (1)
  • ljwharbers (1)
  • ShaowenJ (1)
  • olazaro-ibri (1)
Pull Request Authors
  • SuiYue-2308 (22)
  • cying111 (7)
  • lingminhao (4)
  • andredsim (4)
  • dudududu12138 (2)
  • apsteinberg (1)
Top Labels
Issue Labels
documentation (1) enhancement (1)
Pull Request Labels

Packages

  • Total packages: 3
  • Total downloads:
    • bioconductor 26,634 total
  • Total dependent packages: 1
    (may contain duplicates)
  • Total dependent repositories: 0
    (may contain duplicates)
  • Total versions: 19
  • Total maintainers: 1
proxy.golang.org: github.com/GoekeLab/bambu
  • Versions: 5
  • Dependent Packages: 0
  • Dependent Repositories: 0
Rankings
Dependent packages count: 5.5%
Average: 5.6%
Dependent repos count: 5.8%
Last synced: 6 months ago
proxy.golang.org: github.com/goekelab/bambu
  • Versions: 5
  • Dependent Packages: 0
  • Dependent Repositories: 0
Rankings
Dependent packages count: 5.5%
Average: 5.6%
Dependent repos count: 5.8%
Last synced: 6 months ago
bioconductor.org: bambu

Context-Aware Transcript Quantification from Long Read RNA-Seq data

  • Versions: 9
  • Dependent Packages: 1
  • Dependent Repositories: 0
  • Downloads: 26,634 Total
Rankings
Dependent repos count: 0.0%
Dependent packages count: 0.0%
Average: 20.4%
Downloads: 61.2%
Maintainers (1)
Last synced: 6 months ago

Dependencies

.github/workflows/check-bioc.yml actions
  • actions/cache v2 composite
  • actions/checkout v2 composite
  • actions/upload-artifact v2 composite
  • docker/build-push-action v2 composite
  • r-lib/actions/setup-pandoc v2 composite
  • r-lib/actions/setup-r v2 composite
.github/workflows/lint.yaml actions
  • actions/checkout v2 composite
  • r-lib/actions/setup-r v2 composite
  • r-lib/actions/setup-r-dependencies v2 composite
.github/workflows/pr-commands.yaml actions
  • actions/checkout v2 composite
  • r-lib/actions/pr-fetch v2 composite
  • r-lib/actions/pr-push v2 composite
  • r-lib/actions/setup-r v2 composite
  • r-lib/actions/setup-r-dependencies v2 composite
.github/workflows/test-coverage.yaml actions
  • actions/checkout v2 composite
  • r-lib/actions/setup-r v2 composite
  • r-lib/actions/setup-r-dependencies v2 composite
DESCRIPTION cran
  • BSgenome * depends
  • IRanges * depends
  • R >= 4.1 depends
  • S4Vectors >= 0.22.1 depends
  • SummarizedExperiment >= 1.1.6 depends
  • parallel * enhances
  • BiocGenerics * imports
  • BiocParallel * imports
  • GenomeInfoDb * imports
  • GenomicAlignments * imports
  • GenomicFeatures * imports
  • GenomicRanges * imports
  • Rcpp * imports
  • Rsamtools * imports
  • data.table * imports
  • dplyr * imports
  • methods * imports
  • stats * imports
  • tidyr * imports
  • xgboost * imports
  • AnnotationDbi * suggests
  • BSgenome.Hsapiens.NCBI.GRCh38 * suggests
  • BiocFileCache * suggests
  • Biostrings * suggests
  • ComplexHeatmap * suggests
  • DESeq2 * suggests
  • DEXSeq * suggests
  • ExperimentHub >= 1.15.3 suggests
  • NanoporeRNASeq * suggests
  • TxDb.Hsapiens.UCSC.hg38.knownGene * suggests
  • apeglm * suggests
  • circlize * suggests
  • ggbio * suggests
  • ggplot2 * suggests
  • gridExtra * suggests
  • knitr * suggests
  • purrr * suggests
  • rmarkdown * suggests
  • testthat * suggests
  • utils * suggests