Science Score: 57.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 2 DOI reference(s) in README
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (9.0%) to scientific vocabulary
Last synced: 6 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: FertigLab
  • License: gpl-3.0
  • Language: R
  • Default Branch: main
  • Size: 212 MB
Statistics
  • Stars: 6
  • Watchers: 4
  • Forks: 1
  • Open Issues: 0
  • Releases: 0
Created over 3 years ago · Last pushed over 1 year ago
Metadata Files
Readme License Citation

README.md

SpliceMutr: Calculating splicing-derived neoantigen burden from splice-junctions affecting protein-coding transcripts in RNA-seq data.


SpliceMutr is a tool for the analysis of splicing-derived antigen burden. SpliceMutr first uses differential intron usage analysis through LeafCutter or LeafCutterMD to determine differential usage of splice-junctions between sets of RNA-seq samples. LeafCutter fits a Dirichlet-Multinomial generalized linear model to splice-junction counts within clusters of splice-junctions grouped by overlapping splice-site usage, comparing pairs of sample types (tumor and normal, or pre- and post-treatment in the case of the original SpliceMutr paper). Two separate Dirichlet-Multinomial models are fit to the splice-junction counts per splice-junction cluster based on the sample-type stratification, and a likelihood ratio test evaluates differential usage within the splice-junction cluster as a whole.

Each differentially-used splice junction per cluster is then used to modify the reference transcriptome. The modified transcripts are searched for canonical and/or modified open reading frames, translated from the best identified open reading frame, and kmerized into 9-mers. During kmerization, a map between each peptide kmer and the modified protein of interest is generated for lookup after MHC-peptide binding prediction. MHC-peptide binding prediction is carried out using the arcasHLA-generated HLA genotype per sample, the set of generated peptide 9-mers, and MHCnuggets or NetMHCpan. Peptide 9-mers found to bind any of a sample's HLA molecules are then mapped back to their transcript of origin using the kmer-transcript map, and peptide binding summaries are generated that contain the number of unique binding kmers per transcript and per sample. If differential intron usage analysis is performed, the unique, sample-type-specific kmers are obtained by filtering out all kmers shared between sample types within splice-junction clusters.

The number of these unique peptide kmers per splice-junction-modified transcript is used to generate a splicing antigenicity score per sample and per gene. The score is higher when the splice-junctions associated with a gene produce more peptide kmers able to bind an MHC complex and have higher RNA-seq expression. This score can then be associated with other quantitative measures.
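The kmerization and kmer-to-transcript mapping described above can be sketched as follows. This is a minimal Python illustration with hypothetical function names, not the actual implementation (which lives in the pipeline's peptide-processing scripts):

```python
def kmerize(protein, k=9):
    """Split a protein sequence into its overlapping k-mers."""
    return [protein[i:i + k] for i in range(len(protein) - k + 1)]

def build_kmer_map(proteins, k=9):
    """Map each peptide k-mer to the set of modified transcript IDs
    whose translated protein contains it. `proteins` is a dict of
    {transcript_id: protein_sequence}."""
    kmer_map = {}
    for tx_id, protein in proteins.items():
        for kmer in kmerize(protein, k):
            kmer_map.setdefault(kmer, set()).add(tx_id)
    return kmer_map
```

After MHC-binding prediction, a binding 9-mer can be looked up in this map to recover the modified transcripts it came from.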


Installation

Requirements

STAR
arcasHLA
MHCnuggets or NetMHCpan
LeafCutter
R >= 4.0.2
python >= 3.6.10

To build the SpliceMutr environment, install Miniconda and create the conda environment defined in splicemutrpackages.yml. You will also need to create the conda environment defined in leafcutterpackage.yml in order to run the LeafCutter components of the SpliceMutr pipeline. Additionally, you will need to install the most recent version of BSgenome: start R, install BiocManager, and then use it to install BSgenome. Finally, install optparse in R.

```bash
conda env create -f leafcutterpackage.yml
conda env create -f splicemutrpackages.yml
conda activate splicemutr
R
```

Then, inside the R session:

```r
if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("BSgenome")
install.packages("optparse")
```


SpliceMutr is licensed under the GNU General Public License v3

If you find SpliceMutr useful, please cite as follows:

Palmer, Theron, et al. "SpliceMutr Enables Pan-Cancer Analysis of Splicing-Derived Neoantigen Burden in Tumors." bioRxiv, 2023, https://doi.org/10.1101/2023.05.26.542165.

To learn how to use SpliceMutr and to test SpliceMutr usage, please refer to the simulation directory, where a working example, including Snakemake workflows for all critical components of the pipeline, is provided.

Running the splicemutr simulation

splicemutr_simulation.Rmd runs the polyester simulations that generate the fasta files to be processed. To run this Rmd, you will need to download a GENCODE transcripts fasta file.

Within the runningSTAR folder there are two files, run_STAR.sh and buildSTARindexhuman.sh. You will need to download the appropriate annotation and genome fasta files; the buildSTARindexhuman.sh SLURM script, once customized for your specific annotation and genome fasta file, will generate the STAR index necessary to run STAR. The run_STAR.sh SLURM script will run STAR on your fasta files. You will need to install a conda version of STAR to run this script properly, or ensure STAR is installed on the machine you will be running this pipeline on.

Within the prepreferences folder, you will find a config file and a .smk file. The .smk file is a snakemake file that prepares the references used by SpliceMutr and LeafCutter for analysis. You will need to activate the conda environment created from splicemutrpackages.yml, splicemutr, in order to run this portion of the SpliceMutr pipeline.

Within the runningleafcutter folder, you will find a config file, a groups_file, and a .smk file. You will need to activate the conda environment created from leafcutterpackage.yml within the envs folder in order to run this portion of the SpliceMutr pipeline. Once the config file is modified to fit your personal specification, the snakemake file can be run, which will modify and then run LeafCutter on the SJ.out.tab files output from the STAR alignment step.

Within the runningsplicemutr folder, you will find a config file and a .smk file. You will need to activate the conda environment created from splicemutrpackages.yml, splicemutr, in order to run this portion of the SpliceMutr pipeline. This snakemake runs the SpliceMutr pipeline.

The genotypingsamples directory is not necessary for running this SpliceMutr simulation. It is necessary, however, if you choose to run your own samples and specifically want to use arcasHLA to genotype them. The config and genotypesamples.smk files are included within this directory for your reference.

General Script Usage

1) This step takes in a GTF file and creates the sqlite database necessary for transcript formation in form_transcripts.R. This is covered under rule make_txdb in the snakemake file within this directory.

```bash
make_txdb.R -o output_file.sqlite -g GTF
```

output_file.sqlite - the sqlite file to write
GTF - the GTF file to convert into the txdb database (sqlite)

2) Prepare the BSgenome references used for transcript formation in form_transcripts.R. This is covered first by rule prep_bsgenome_reference in the snakemake file within this directory, which converts the BSgenome fasta file to a 2bit file. Building the BSgenome package is done by following these directions: https://www.bioconductor.org/packages/devel/bioc/vignettes/BSgenome/inst/doc/BSgenomeForge.pdf

3) Prepare the LeafCutter references. LeafCutter reference preparation is covered by rule prepareleafcutterreferences in the snakemake file within this directory. Follow the directions outlined in the LeafCutter documentation to create the LeafCutter references.

4) Align your bulk RNA-seq reads to the appropriate genome of choice using STAR. Use the directions outlined at LeafCutter for building your STAR index as well as for aligning your reads to the STAR index you built. For the SpliceMutr paper analysis, we used prebuilt indices that are no longer accessible at the associated LeafCutter link in the directions.

5) This step takes the SJ.out.tab files output from STAR alignment and formats them into the .junc file format that LeafCutter accepts. The rule convertSTARsj.out.tabtoleafcutter_.junc in the snakemake within this directory covers this protocol.

```bash
STAR_to_leaf.R -o output_directory -s star_file -f splicemutr_functions
```

output_directory - the directory to write the .junc file to
star_file - the SJ.out.tab file to convert into the .junc file
splicemutr_functions - the splicemutr functions.R file to load into the script
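The conversion performed by STAR_to_leaf.R can be sketched in miniature. The sketch below is a Python illustration under stated assumptions (the actual script is in R): STAR's SJ.out.tab columns are chrom, 1-based intron start, intron end, strand code (0/1/2), motif, annotated flag, unique reads, multi-mapped reads, and max overhang, and the .junc row carries a 0-based start and the unique read count as its score.

```python
# STAR encodes strand as 0 (undefined), 1 (+), 2 (-)
STRAND = {"0": ".", "1": "+", "2": "-"}

def sj_line_to_junc(line):
    """Convert one STAR SJ.out.tab row into a .junc-style row
    (chrom, 0-based start, end, name, count, strand)."""
    chrom, start, end, strand, _motif, _ann, uniq, _multi, _ov = \
        line.rstrip("\n").split("\t")
    return "\t".join([chrom, str(int(start) - 1), end, ".", uniq, STRAND[strand]])
```

For example, a uniquely supported plus-strand junction at chr1:100-200 with 15 unique reads becomes a tab-separated row starting `chr1  99  200`.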

6) Run LeafCutter on the .junc files generated for your experiment. You will need to format the groups and junctions files for your experiment. The groups file is a tab-delimited file with the .junc file name in the first column and a binary experimental grouping in the second column. The first experimental group in the groups file must be your reference group for the analysis. Further details about running LeafCutter are explained in the LeafCutter documentation and are covered in the rule runningleafcutter in the snakemake within this directory.

For this step you will need to use splicemutrleafcutterclusterregtools.py instead of leafcutter_cluster_regtools.py. This replacement script removes the script's dependence on the .bed-format blockSizes and blockCount fields, since STAR SJ.out.tab files do not rely on this bed format for specifying splice junctions; the conversion to .junc files therefore does not account for blockSizes or blockCounts in the start and end coordinates of the splice junction.
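As a concrete illustration of the groups file format described above (sample names here are invented; the reference group, normal, is listed first):

```
normal_1.junc	normal
normal_2.junc	normal
tumor_1.junc	tumor
tumor_2.junc	tumor
```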

7) This step takes the LeafCutter LeafViz output and saves the introns in a format suitable for input into SpliceMutr form_transcripts.R. This is covered in rule save_introns in the associated snakemake file within this directory.

```bash
save_introns.R -i intron_file.rds -o out_prefix
```

intron_file.rds - the intron file output from LeafCutter LeafViz differential intron usage analysis
out_prefix - the output prefix for the introns .rds files output from save_introns.R

7.1) I often use split_introns.R for analysis. This script takes similar inputs to save_introns.R, but it also takes a split_num argument for splitting the introns data.rds file output from LeafCutter LeafViz into split_num chunked intron .rds files.

```bash
split_introns.R -i intron_file.rds -o out_prefix -s split_num
```

intron_file.rds - the intron file output from LeafCutter LeafViz differential intron usage analysis
out_prefix - the output prefix for the introns .rds files output from split_introns.R
split_num - the number of chunks to split the LeafCutter LeafViz .rds file into

8) This step modifies transcripts using the identified differentially-used introns from LeafCutter. The rule form_transcripts in the snakemake within this directory covers usage.

```bash
form_transcripts.R -o out_prefix -t txdb_file -j junc_file -b bsgenome_name -m chr_map -f funcs
```

out_prefix - the output prefix for the output files
txdb_file - the sqlite database created in rule make_txdb from make_txdb.R
junc_file - the introns .rds file created in rule save_introns (save_introns.R), or one of the files from split_introns.R if batching your job
bsgenome_name - the name of the BSgenome package that you created from the reference fasta file
chr_map - logical indicating whether you want to replace "chr" in the junc_file (introns .rds) with "". This is sometimes necessary if your GTF and fasta references do not match in their chromosome naming convention.
funcs - the path to the SpliceMutr functions.R file

9) This step calculates the LGC coding potential of the transcripts generated by form_transcripts.R. The rule calculate_coding_potential in the snakemake within this directory covers usage.

```bash
calc_coding_potential.R -s splicemutr_data -t transcript_fasta -o out_dir -f funcs
```

splicemutr_data - the *_splicemutr.rds file containing the splicemutr metadata output from form_transcripts.R
transcript_fasta - the fasta output associated with the splicemutr metadata output from form_transcripts.R
out_dir - the output directory to write the coding-potential-corrected SpliceMutr metadata output from form_transcripts.R
funcs - the path to the SpliceMutr functions.R file

10) This step combines all batched (if batched) SpliceMutr metadata files and generates a file containing the set of proteins per modified transcript, along with a file containing the splicemutr metadata subset to those modified transcripts that can form proteins. The rule combine_splicemutr within the snakemake within this directory covers usage.

```bash
combine_splicemutr.R -o out_dir -s splice_files
```

out_dir - the output directory that the combined SpliceMutr files are written to
splice_files - the splicemutr files generated by form_transcripts.R, listed in a .txt file where each line is a file path, including its directory

11) This step processes the peptides.txt file from the previous step and generates a kmer-transcript location (row within the peptides.txt file) map. It also generates a pepskmerlength.txt file that contains the unique set of kmers from the kmer-transcript map. The rule process_peptides gives further details about usage.

```bash
process_peptides.py -p peptides.txt -o output_directory -k kmer_length
```

peptides.txt - the peptides.txt file output from combine_splicemutr.R
output_directory - the directory to write the kmer-transcript map dictionary and the peps9.txt file
kmer_length - the kmer length to use for the kmerization of the peptides in peptides.txt

12) This step runs arcasHLA on the STAR-generated .bam files. arcasHLA performs HLA genotyping on bulk RNA-seq data. The output of arcasHLA will need to be formatted using the genotype.txt format. The rule run_arcasHLA in the snakemake within this directory covers running arcasHLA. The genotype.txt format is a tab-delimited file with the following format:

```
HLA-A01-01,HLA-B02-01,HLA-C03-01	sample_1
HLA-A01-01,HLA-B02-01,HLA-C03-01	sample_2
HLA-A01-01,HLA-B02-01,HLA-C03-01	sample_3
```
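A small Python sketch of parsing this format (hypothetical function name; column order assumed from the example above, comma-separated alleles, a tab, then the sample name):

```python
def read_genotypes(path):
    """Parse a genotype.txt-style file into {sample: [alleles]}.
    Each line: comma-separated HLA alleles, a tab, the sample name."""
    genotypes = {}
    with open(path) as fh:
        for line in fh:
            if not line.strip():
                continue  # skip blank lines
            alleles, sample = line.rstrip("\n").split("\t")
            genotypes[sample] = alleles.split(",")
    return genotypes
```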

13) This step runs MHCnuggets on the HLA genotypes (or the superalleles associated with them) output from arcasHLA, formatted as a file with one allele per line in the genotypes.txt HLA format, together with the pepskmerlength.txt file generated by process_peptides.py.

```bash
runMHCnuggets.py -t type -k kmers -m mhc -o output
```

type - the HLA allele type, "I" or "II"
kmers - the pepskmerlength.txt file containing the set of unique peptide kmers
mhc - the MHC allele file with one allele per line
output - the output directory to write the MHCnuggets results to

14) This step processes the binding affinity prediction results output from MHCnuggets and filters the kmers for those that have predicted IC50 scores <=500 nM. The rule process_bindaffinity details usage for this step.

```bash
process_bindaff.py -b binders -p pickle_dir -o out -k kmer_length
```

binders - an MHCnuggets .txt output file
pickle_dir - the directory that the pickle file is written to
out - the output directory to write the filtered binders file to
kmer_length - the kmer length to process
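The filtering logic of this step reduces to keeping predictions at or below the 500 nM IC50 cutoff. A minimal Python sketch (hypothetical function name; the real script also handles MHCnuggets file parsing and the pickle map):

```python
IC50_CUTOFF_NM = 500.0

def filter_binders(predictions, cutoff=IC50_CUTOFF_NM):
    """Keep (kmer, ic50_nM) predictions whose IC50 is <= the cutoff,
    mirroring the <= 500 nM binder filter applied to MHCnuggets output."""
    return [(kmer, ic50) for kmer, ic50 in predictions if ic50 <= cutoff]
```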

15) This step extracts the binding location data from the pickle and from the associated allele kmers and generates a summary file per allele that contains location and binding kmer information for the specific allele. The rule extract_data in the snakemake within this directory details usage for this step.

```bash
extract_data.py -a hla_allele -p pickle_directory -b kmer_beginning_range -e kmer_end_range
```

hla_allele - the HLA allele, in genotypes.txt HLA format, to extract binding kmers for
pickle_directory - the location of the kmer-transcript location map file
kmer_beginning_range - the smallest kmer size to analyze when generating the HLA summary file. Typically set to 9.
kmer_end_range - the exclusive upper bound on kmer size; the largest kmer size analyzed is one less than this value. Typically set to 10.

16) This step generates a kmer counts file per sample analyzed that is later compiled into a counts per sample file used during splicing antigenicity calculations. The rule analyze_splicemutr within the snakemake within this directory details usage for this step.

```bash
analyze_splicemutr.py -g genotypes_file -s summary_dir -d splice_dat -o output_dir -t summary_type -n sample_num -f fasta_file
```

genotypes_file - the HLA genotypes per sample, in the genotypes.txt file format
summary_dir - the directory containing the summary files for each HLA in the genotypes.txt file
splice_dat - the datasplicemutrallpep.txt file output from combine_splicemutr.R
output_dir - the output directory to write the associated sample kmer counts file to
summary_type - 'IC50' or 'perc'. 'IC50' reads in a summary file with the format *_txdictsummary.txt, while 'perc' reads in a summary file with the format *_txdictsummaryperc.txt.
sample_num - an integer indicating the row number of the sample to process in the genotype.txt file
fasta_file - the protein-coding translated transcript protein sequences fasta

17) This step takes the kmer files generated by analyze_splicemutr.py, run batched for all samples analyzed, and compiles them into a single file where each column is a sample and each row corresponds to a row in the splicemutr *allpep.rds metadata file output from combine_splicemutr.R. The rule compile_kmer_counts in the snakemake within this directory details usage for this step.

```bash
compile_kmer_counts.py -k kmers_files -o output_dir
```

kmers_files - a file listing the kmer counts files per sample, one per line, with their directories
output_dir - the output directory to write the allkmerscounts.txt file to
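The compilation step amounts to stacking per-sample count vectors, all aligned to the same metadata rows, side by side. A toy Python sketch of that shape (hypothetical function name; the real script reads the counts from files):

```python
def compile_kmer_counts(per_sample_counts):
    """Combine per-sample kmer count vectors (each aligned to the rows
    of the combined SpliceMutr metadata file) into one table with one
    row per transcript and one column per sample, in sorted sample order."""
    samples = sorted(per_sample_counts)
    n_rows = len(next(iter(per_sample_counts.values())))
    return [[per_sample_counts[s][i] for s in samples] for i in range(n_rows)]
```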

18) This step uses the .junc files generated by STAR_to_leaf.R and creates a variance-stabilized junction expression matrix with junctions as rows and samples as columns. The rule create_junction_expression in the snakemake within this directory details usage for this step.

```bash
create_junc_expression.R -j junc_dir -f junc_files -o output_directory
```

junc_dir - the junctions file directory
junc_files - the junction files and their directories, one per line
output_directory - the output directory to write the junction expression to

19) This step calculates the splicing antigenicity per gene and sample. This script also generates the average coding potential per gene and sample. The rule calculate_gene_metric in the snakemake within this directory details usage for this step.

```bash
calculate_gene_metric_len_norm.R -s splice_dat_file -k kmer_counts -j junc_expr_file -o output_directory
```

splice_dat_file - the *allpep.rds SpliceMutr file output from combine_splicemutr.R
kmer_counts - the allkmerscounts.txt file output from compile_kmer_counts.py
junc_expr_file - the combined junction expression file output from create_junc_expression.R
output_directory - the output directory to write the SpliceMutr splicing antigenicity calculation files to
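The splicing antigenicity calculation itself is implemented in calculate_gene_metric_len_norm.R. As a purely illustrative Python sketch of the idea described in the overview, not the script's exact formula (the length normalization and aggregation performed by the R script are not reproduced here), a gene's score grows with both the number of unique binding kmers and the junction expression of its splice-junction-modified transcripts:

```python
def splicing_antigenicity(transcripts):
    """Toy per-gene, per-sample score: for each splice-junction-modified
    transcript of a gene, multiply its count of unique binding kmers by
    its junction expression, then sum over the gene's transcripts.
    Illustrative only; not the formula in calculate_gene_metric_len_norm.R."""
    return sum(n_binding_kmers * expression
               for n_binding_kmers, expression in transcripts)
```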

Owner

  • Name: FertigLab
  • Login: FertigLab
  • Kind: organization
  • Email: ejfertig@jhmi.edu

Software projects in computational biology and bioinformatics in Elana Fertig's lab in Oncology Biostatistics and Bioinformatics at JHMI

Citation (CITATION.cff)

# This CITATION.cff file was generated with cffinit.
# Visit https://bit.ly/cffinit to generate yours today!

cff-version: 1.2.0
title: splicemute
message: '"If you use this software, please cite it as below'
type: software
authors:
  - given-names: Theron
    family-names: Palmer
    email: theropalm@gmail.com
  - given-names: Elana
    family-names: Fertig
    email: ejfertig@jhmi.edu
date-released: 2021-05-27
url: "https://github.com/theron-palmer/splicemute.git"

GitHub Events

Total
  • Watch event: 5
Last Year
  • Watch event: 5

Dependencies

DESCRIPTION cran
  • AnnotationDbi * imports
  • BiocGenerics * imports
  • Biostrings * imports
  • DESeq2 * imports
  • DataCombine * imports
  • GenomicRanges * imports
  • IRanges * imports
  • S4Vectors * imports
  • TCGAutils * imports
  • annotate * imports
  • biomaRt * imports
  • dplyr * imports
  • ensembldb * imports
  • maftools * imports
  • optparse * imports
  • org.Hs.eg.db * imports
  • recount3 * imports
  • rjson * imports
  • rlist * imports
  • rtracklayer * imports
  • stringi * imports
  • stringr * imports
analysis/TCGA/pipeline/reference/DESCRIPTION cran