https://github.com/bartongroup/rosa

Antisense analysis scripts

https://github.com/bartongroup/rosa

Science Score: 23.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
  • .zenodo.json file
  • DOI references
    Found 3 DOI reference(s) in README
  • Academic publication links
    Links to: zenodo.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (13.9%) to scientific vocabulary
Last synced: 10 months ago · JSON representation

Repository

Antisense analysis scripts

Basic Info
  • Host: GitHub
  • Owner: bartongroup
  • License: gpl-3.0
  • Language: Python
  • Default Branch: master
  • Size: 113 MB
Statistics
  • Stars: 2
  • Watchers: 7
  • Forks: 1
  • Open Issues: 11
  • Releases: 0
Created almost 9 years ago · Last pushed almost 7 years ago

https://github.com/bartongroup/RoSA/blob/master/

# RoSA: a tool for the Removal of Spurious Antisense
[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.2661378.svg)](https://doi.org/10.5281/zenodo.2661378)

In stranded RNA-Seq experiments we have the opportunity to detect and measure antisense transcription, important since antisense transcripts impact gene transcription in several different ways. Stranded RNA-Seq determines the strand from which an RNA fragment originates, and so can be used to identify where antisense transcription may be implicated in gene regulation. 

However, spurious antisense reads are often present in experiments, and can manifest at levels greater than 1% of sense transcript levels. This is enough to disrupt analyses by causing false antisense counts to dominate the set of genes with high antisense transcription levels.   

The RoSA (Removal of Spurious Antisense) tool detects the presence of high levels of spurious antisense transcripts, by:
* analysing ERCC spike-in data to find the ratio of antisense:sense transcripts in the spike-ins; or
* using antisense and sense counts around splice sites to provide a set of gene-specific estimates; or
* both.

Once RoSA has an estimate of the spurious antisense, expressed as a ratio of antisense:sense counts, RoSA will calculate a correction to the antisense counts based on the ratio. Where a gene-specific estimate is available for a gene, it will be used in preference to the global estimate obtained from either spike-ins or spliced reads.

## Pre-requisites to running RoSA
### Data
RoSA depends on having a stranded RNA-Seq dataset. Ideally the data will also have spikeins, but these are not essential as RoSA can also estimate the spurious antisense in your data from looking at spliced reads at known splice junctions.

### Annotation
The data should have already been aligned to an annotation. RoSA will need the gff or gtf file for the annotation, in order to create an antisense annotation which can then be used to generate antisense counts, and to determine where known splice junctions occur.

### Dependencies

RoSA (currently) is a combination of an R package and some python scripts for preprocessing. Included in this repository is a conda environment that captures all the dependancies required for the package. To create the environment, and load it prior to installation of the pip and R rosa packages:

```
conda env create -f conda_env.yml
source activate RoSA
export TAR=`which tar`
```

(Note: setting the environment variable `tar` here is required to avoid [this issue](https://github.com/r-lib/devtools/issues/1722)).

#### R Dependencies

The R package depends on one third-party package, LSD, which you may need to install first, 
if it is not present already:

```
install.packages("LSD")
```

#### Python Dependencies

The python scripts are targetted for [python 2.7](https://www.python.org/download/releases/2.7/) and depend on:
- scipy (version 0.16.1 - 0.17.1 - see known issues)
- numpy
- pandas (not 0.20.1 - see known issues)
- six
- (optionally) drmaa

The python script to find and count spliced antisense and sense reads also depends on [sambamba](http://lomereiter.github.io/sambamba/) being installed.

## Installation

### Installing the RoSA python Package

RoSA's python scripts are set up in a python package which can be installed via:

```
pip install git+https://github.com/bartongroup/RoSA.git#subdirectory=python
```

and uninstalled via:

```
pip uninstall rosa
```

The pip installer will handle installing any additional python dependencies with the correct versions.

### Installing the RoSA R Package

To install the R rosa package directly from github use `devtools::install_github`. You will need to install devtools if you have not already done so. (Note: the `options(unzip = "internal")` call here is only required is you are using the provided conda env, see [this issue](https://github.com/r-lib/devtools/issues/1722)).

```
library(devtools)
options(unzip = "internal")
devtools::install_github("bartongroup/RoSA")
```
Load the package to get access to RoSA:
```
library("rosa")
```

## How to use RoSA

The R package takes as input datasets containing several different read counts:

1. Full read counts by gene
2. Antisense counts by gene (via RoSA's python script (*makeannotation.py*) to create an antisense annotation, and then read counting as usual)
3. At least one of:
     1. Spike-in sense and antisense counts (by aligning the reads data to the spike-ins, and generating read counts for each strand)
     2. Spliced sense and antisense counts (via RoSA's python script (*antisense.py*) to filter spliced reads which occur at known splice junctions)

Help for the R rosa functionality can be found by typing `help(rosa)` after installing and loading the RoSA R package. Since some of RoSA's inputs are non-standard, the python preprocessing scripts are supplied to make it easier to generate the inputs required by the package (specifically, input (2) and input (3(i))).

The *make_annotation* script creates an antisense annotation (as gtf) from a standard annotation (as gff or gtf), which can then be used to generate antisense read counts (input 2) via your favourite read counting tool (e.g. [featureCounts](http://subread.sourceforge.net)):
```
make_annotation -a  -o 
```
The annotation produced by *make_annotation* only contains antisense features and so cannot be used in place of a standard annotation. The *make_annotation* script can also be called from R:
```
make_annotation(, )
```

The *count_spliced* script generates sense and antisense counts of reads at splice junctions (input 3(i)). The script takes a standard annotation (as gtf/gff) and corresponding alignment (as bam) and outputs counts of spliced sense and antisense reads to a designated output file. An index file (.bai file) should also have been pre-generated and be in the same directory as the bam file. Because the script must process an entire bam file of reads, it is very slow. The script is set up to break the bam file into chunks and process each chunk separately using sambamba and some custom filtering. On a cluster with drmaa installed, the script will use drmaa to submit each chunk as a separate job. On a single machine, the script will spawn a new process to run each chunk separately. Once all of the jobs have run, the script collates the results to give a count of the spliced reads. The script is run as follows:
```
count_spliced -a  -l  -o 
```
An additional -i option allows the user to input a file containing the extents of all introns, which is otherwise calculated as part of splice counting process. This is primarily useful for testing on the same annotation multiple times, as the intron file can be calculated once and re-used, saving some time. The location of the generated intron file is output by the script.

The *count_spliced* script can also be called from R:
```
count_spliced(,,)
```

Both scripts output timestamped logfiles recording progress.

## Known issues

* RoSA is incompatible with pandas version 0.20.1, due to [bug #16921](https://github.com/pandas-dev/pandas/issues/16921)
* "command not found" errors when calling the preprocessing scripts from RStudio can be caused on Macs by a bug where RStudio uses an incorrect PATH variable. Running RStudio from the command line avoids the problem: 
`open /Applications/RStudio.app`

## Contact information

The `RoSA` R and python tool was developed within [The Barton Group](http://www.compbio.dundee.ac.uk) at [The University of Dundee](http://www.dundee.ac.uk)
by Dr. Kira Mouro and Dr. Nick Schurch.

To **report problems** or ask for **assistance**, please raise a new issue [on the project's support forum](https://github.com/bartongroup/RoSA/issues).
Providing a *reproducible working example* that demonstrates your issue is strongly encouraged.  Please also browse/search
the support forum before posting a new issue, in case your question is already answered there.

Owner

  • Name: Geoff Barton's Computational Biology Group
  • Login: bartongroup
  • Kind: organization
  • Location: Dundee, Scotland, UK

GitHub Events

Total
Last Year

Dependencies

DESCRIPTION cran
  • LSD * imports
  • stats * imports
  • utils * imports
  • boot * suggests
  • knitr * suggests
  • rmarkdown * suggests
  • testthat * suggests
python/setup.py pypi
  • drmaa *
  • numpy *
  • pandas >=0.18.0,<=0.19.2
  • scipy *
  • six *