https://github.com/boyle-lab/uncalled

Raw nanopore signal mapper that enables real-time targeted sequencing

https://github.com/boyle-lab/uncalled

Science Score: 23.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
  • .zenodo.json file
  • DOI references
    Found 4 DOI reference(s) in README
  • Academic publication links
    Links to: biorxiv.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (12.3%) to scientific vocabulary
Last synced: 10 months ago · JSON representation

Repository

Raw nanopore signal mapper that enables real-time targeted sequencing

Basic Info
  • Host: GitHub
  • Owner: Boyle-Lab
  • License: mit
  • Default Branch: master
  • Homepage:
  • Size: 5.02 MB
Statistics
  • Stars: 0
  • Watchers: 1
  • Forks: 0
  • Open Issues: 0
  • Releases: 0
Fork of skovaka/UNCALLED
Created over 6 years ago · Last pushed over 6 years ago

https://github.com/Boyle-Lab/UNCALLED/blob/master/

# UNCALLED

A **U**tility for **N**anopore **C**urrent **Al**ignment to **L**arge **E**xpanses of **D**NA

![UNCALLED logo](uncalled_logo_small.png "UNCALLED logo")

A streaming algorithm for mapping raw nanopore signal to DNA references

Enables real-time enrichment or depletion on Oxford Nanopore Technologies (ONT) MinION runs via ReadUntil

Also supports standalone signal mapping of fast5 reads

Read the [preprint on BioRxiv](https://www.biorxiv.org/content/10.1101/2020.02.03.931923v1)

## Installation

```
> git clone --recursive https://github.com/skovaka/UNCALLED.git
> cd UNCALLED
> python setup.py install
```

UNCALLED for ReadUntil sequenecing requires the [ONT ReadUntil API](https://github.com/nanoporetech/read_until_api), which is provided as a submodule but must be installed separately into the MinKNOW environment. UNCALLED must also be installed into the MinKNOW environment alongside the ReadUntil API for ReadUntil sequencing (see Real-Time ReadUntil below).

Most dependecies included via submodules, so be sure to clone with `git --recursive`

UNCALLED must be installed into a python environment. To install without root privileges use the `--user` or `--prefix=` flag when installing, or use a tool such as [virtualenv](https://virtualenv.pypa.io) or [anaconda](https://www.anaconda.com).

Requires python-dev, numpy, and GCC >= 4.8.1

[HDF5](https://www.hdfgroup.org/downloads/hdf5/) must be installed. Libraries and headers should be in system paths (ie `$LD_LIBRARY_PATH` and `$CPATH` respectively), or specified by running `python setup.py build_ext --library-dirs /lib --include-dirs /include` prior to installation.


We recommend running on a Linux machine. UNCALLED has been successfully installed and run on Mac computers, but real-time ReadUntil has not been tested on a Mac. Installing UNCALLED has not been attempted on Windows.

## Indexing

**Example:**

```
> bwa index -p E.coli E.coli.fasta
> uncalled index -i E.coli.fasta -x E.coli
```

UNCALLED requires a [BWA](https://github.com/lh3/bwa) index. You can use a previously built BWA index, or build a new one with the BWA instance provided in the `bwa/` submodule.

Before aligning, certain reference-specific parameters must be computed using `uncalled index`. The `` should be the same FASTA file which was used to build the BWA index. This will create an additional file in the same directory as the BWA index named `.uncl`.

## Fast5 Mapping

**Example:**

```
> uncalled map -t 16 -x E.coli -i fast5_list.txt > uncalled_out.paf
Loading fast5s
Mapping

> head -n 4 uncalled_out.paf
b84a48f0-9e86-47ef-9d20-38a0bded478e 3735 77 328 + Escherichia_coli_chromosome 4765434 2024611 2024838 66 228 255  ch:i:427 st:i:50085  mt:f:53.662560
77fe7f8c-32d6-4789-9d62-41ff482cf890 5500 94 130 + Escherichia_coli_chromosome 4765434 2333754 2333792 38 39  255  ch:i:131 st:i:238518 mt:f:19.497091
eee4b762-25dd-4d4a-8a59-be47065029be 2905     *      *      *      *      *      *      *      *      *       255  ch:i:44  st:i:302369 mt:f:542.985229
e175c87b-a426-4a3f-8dc1-8e7ab5fdd30d 8052 84 154 + Escherichia_coli_chromosome 4765434 1064550 1064614 41 65  255  ch:i:182 st:i:452368 mt:f:38.611683
```

Arguments:

- `-x/--bwa-prefix` the prefix of the index to align to. Should be a BWA index that `uncalled index` was run on
- `-i/--fast5-files`  a text file containing the path to one fast5 file per line
- `-t/--threads` number of threads to use for mapping (default: 1)
- `-n/--read-count` maximum number of reads to map
- `-f/--filter` text file containing subset of read IDs (one per line) to map from the fast5 files (will map all by default)
- `-e/--max-events-proc` number of events to attempt mapping before giving up on a read (default 30,000). Note that there are approximately two events per nucleotide on average.


See [example/](example/) for a simple read and reference example.

## Real-Time ReadUntil

**Example:**

```
> uncalled list-ports
MN02686 (2019-11-18 12:30:56): 8000

> /opt/ont/minknow/ont-python/bin/uncalled realtime --port 8000 -t 16 -x E.coli --enrich -c 3 --post-script basecall.sh > uncalled_out.paf 
Starting client
Starting mappers
Mapping
Running post-script: 'basecall.sh'

> head -n 4 uncalled_out.paf
81ba344d-60df-4688-b37f-9064e76a3eb8 1352 *     *     *     *      *      *      *      *      *   255 ch:i:68  st:i:29101 mt:f:375.93841 wt:f:1440.934 mx:f:0.152565
404113c1-6ace-4690-885c-9c4a47da6476 450  *     *     *     *      *      *      *      *      *   255 ch:i:106 st:i:29268 mt:f:63.272270 wt:f:1591.070 en:f:0.010086
d9acafe3-23dd-4a0f-83db-efe299ee59a4 1355 *     *     *     *      *      *      *      *      *   255 ch:i:118 st:i:29378 mt:f:239.50201 wt:f:1403.641 ej:f:0.120715
8a6ec472-a289-4c50-9a75-589d7c21ef99 451  98 369 + Escherichia_coli 4765434 3421845 3422097 56 253 255 ch:i:490 st:i:29456 mt:f:79.419411 wt:f:8.551202 kp:f:0.097424
```

You must install UNCALLED into the MinKNOW enrironment. If HDF5 is not installed in the root filesystem, you may have to specify the library and include directory by running `setup.py build_ext` prior to installation: 

```
sudo /opt/ont/minknow/ont-python/bin/python setup.py build_ext --library-dirs /path/to/hdf5/lib --include-dirs /path/to/hdf5/include
sudo /opt/ont/minknow/ont-python/bin/python setup.py install
```

The above commands assume your MinKNOW python environment is located at `/opt/ont/minknow/ont-python`.

We recommend that you try mapping fast5s with the MinKNOW installation via `uncalled map` before real-time enrichment, as runtime issues involving hdf5 libraries could come up if UNCALLED is not installed properly.

The command can be run at any time before or during a sequencing run, although if you want to chagne the chunk size you must run the commend *before* startin the run (see below). Reads will not be ejected until after the first mux scan finishes.

Arguments:

- `-x/--bwa-prefix` the prefix of the index to align to. Should be a BWA index that `uncalled index` was run on
- `-t/--threads` number of threads to use for mapping (default: 1)
- `-c/--max-chunks-proc` number of chunks to attempt mapping before giving up on a read (default: 10).
- `--chunk-size` size of chunks in seconds (default: 1). Note: this is a new feature and may not work as intended (see below)
- `--port` MinION device port. Use `uncalled list-ports` command to see all devices that have been plugged in since MinKNOW started.
- `--enrich` will *keep* reads that map to the reference if included
- `--deplete` will *eject* reads that map to the reference if included
- `--even` will only eject reads from even channels if included
- `--odd` will only eject reads from odd channels if included
- `--duration` expected duration of sequencing run in hours (default: 48)
- `--post-script` optional path to executable to run after analysis has finished. Useful to automatically basecall after sequencing is done, for example.
- `--post-time` buffer time to wait after the last read before running the post-script, in seconds (default: 60). Useful because MinKNOW's sequencing time is not always exact.

Note exactly one of `--deplete` or `--enrich` must be specified

### Altering Chunk Size

The ReadUntil API recieves signal is "chunks", which by default are one second's worth of signal. This can be changed using the `--chunk-size` parameter. Note that `--max-chunks-proc` should also be changed to compensate for changes to chunks size. *If the chunk size is changed, you must start running UNCALLED before sequencing begins.* UNCALLED is unable change the chunk size mid-seqencing-run. In general reducing the chunk size should improve enrichment, although [previous work](http://dx.doi.org/10.1101/2020.02.03.926956) has found that the API becomes unreliable with chunks sizes less than 0.4 seconds. We have not thoroughly tested this feature, and recommend using the default 1 second chunk size for most cases. In the future this default size may be reduced.


## Output Format

Both `uncalled map` and `uncalled realtime` output to stdout in a format similar to [PAF](https://github.com/lh3/miniasm/blob/master/PAF.md). Unmapped reads are output with reference-location-dependent fields replaced with \*s. Lines that begin with "#" are comments that useful for debugging.

Query coordinates, residue matches, and block lengths are estimated assuming 450bp sequenced per second. This estimate can be significantly off depending on the sequencing run. UNCALLED attempts to map a read as early as possible, so the "query end" field corresponds to the leftmost position where UNCALLED was able to confidently map the read. In many cases this may only be 450bp or 900bp into the read, even if the read is many times longer than this. This differs from aligners such as [minimap2](https://github.com/lh3/minimap2), which attempt to map the full length of the read.

For real-time mapping, read lengths are estimated by how much signal UNCALLED received, which does not necessarily correspond to how much signal was actually sequenced.

Both modes include the following custom attributes in each PAF entry:

- `mt`: **map time**. Time in milliseconds it took to map the read.
- `ch`: **channel**. MinION channel that the read came from.
- `st`: **start sample**. Global _sequencing_ start time of the read (in signal samples, 4000 samples/sec).

`uncalled realtime` also includes the following attributes:

- `ej`: **ejected**. Time that the eject signal was sent, in milliseconds since last chunk was received.
- `kp`: **kept**. Time that UNCALLED decided to keep the read, in milliseconds since last chunk was received.
- `en`: **ended**. Time that UNCALLED determined the read ended, in milliseconds since last chunk was received.
- `mx`: **mux scan**. Time that the read _would have_ been ejected, had it not have occured within a mux scan.
- `wt`: **wait time**. Time in milliseconds that the read was queued but was not actively being mapped, either due to thread delays or waiting for new chunks.

### pafstats

We have included a functionality called `uncalled pafstats` which computes speed statistcs from a PAF file output by UNCALLED. Accuracy statistics can also be included if provided a ground truth PAF file, for example based on minimap2 alignments of basecalled reads. There is also an option to output the original UNCALLED PAF annotated with comparisions to the ground truth.

**Example:**
```
> uncalled pafstats -r minimap2_alns.paf -n 5000 -a uncalled_out.paf > uncalled_ann.paf
Summary: 5000 reads, 4373 mapped (89.46%)

Comparing to reference PAF
     P     N
T  88.74  6.80
F   0.60  3.74
NA: 0.12

Speed            Mean    Median
BP per sec:   4878.24   4540.50
BP mapped:     636.29    362.00
MS to map:     140.99     89.96
```

Accuracy statistics:
- TP: true positive - percent infile reads that overlap reference read locations
- FP: false positive - percent infile reads that do not overlap reference read locations
- TN: true negative - percent of reads which were not aligned in reference or infile
- FN: false negative - percent of reads which were aligned in the reference but not in the infile
- NA: not aligned/not applicable - percent of reads aligned in infile but not in reference. Could be considered a false positive, but the truth is unkown.

## Practical Considerations

For ReadUntil sequencing, the first decision to make is whether to perform **enrichment** or **depletion** (`--enrich` or `--deplete`). 
In enrichment mode, UNCALLED will eject a read if it *does not* map to the reference, meaning your target should be the reference. 
In depletion mode, UNCALLED will eject a read if it *does* map to the reference, meaning your target should be everything except your reference.

Note that enrichment necessitates a quick decision as to whether or not a read maps, since you want to eject a read as fast as possible. Usually ~95% of reads can be mapped within three seconds for highly non-repetitive references, so setting `-c/--max-chunks-proc` to `3` generally works well for enrichment. The default value of `10` works well for depletion. Note these values assume `--chunk-size` is set to the default 1 second.

UNCALLED currently does not support large (> ~100Mb) or highly repetitive references. 
The speed and mapping rate both progressively drop as references become larger and more repetitive. 
Bacterial genomes or small collections of divergent bacterial genomes typically work well. 
Small segments of eukaryotic genomes can also be used, however the presence of any repetitvie elements will harm the performance. 
Collections of highly similar genomes wil not work well, as conserved sequences introduce repeats.
See [masking](masking/) for repeat masking scripts and guidlines.

ReadUntil works best with longer reads. Maximize your read lengths for best results. You may also need to perform a nuclease flush and reloading to achive the highest yield of on-target bases.

UNCALLED currently only supports reads sequenced with r9.4 chemistry.

## Undocumented Features

UNCALLED is a work in progress. Many parameters exist that are not documented here, but can be seen on the command line help information. Most users should leave these unchanged. They may be removed in future versions, or be replaced with hyperparameters that adjust the accuracy and speed of UNCALLED.

Owner

  • Name: The Boyle Lab
  • Login: Boyle-Lab
  • Kind: organization
  • Email: apboyle@umich.edu
  • Location: University of Michigan

GitHub Events

Total
Last Year