teloclip

A tool for the recovery of unassembled telomeres from soft-clipped read alignments.

https://github.com/adamtaranto/teloclip

Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (13.0%) to scientific vocabulary

Keywords

bioinformatics genome-assembly telomere telomere-length telomeres telomeric
Last synced: 7 months ago

Repository


Basic Info
  • Host: GitHub
  • Owner: Adamtaranto
  • License: other
  • Language: Python
  • Default Branch: main
  • Homepage:
  • Size: 783 KB
Statistics
  • Stars: 41
  • Watchers: 4
  • Forks: 4
  • Open Issues: 12
  • Releases: 3
Created over 7 years ago · Last pushed 12 months ago
Metadata Files
Readme Contributing License Code of conduct Citation

README.md

Teloclip

A tool for the recovery of unassembled telomeres from raw long-reads using soft-clipped read alignments.

About Teloclip

In most eukaryotic species, chromosomes terminate in repetitive telomeric sequences. A complete genome assembly should ideally comprise chromosome-level contigs that possess telomeric repeats at each end. However, genome assemblers frequently fail to recover these repetitive features, instead producing contigs that terminate immediately prior to their location.

Teloclip is designed to recover long-reads that can be used to extend draft contigs and resolve missing telomeres (short-read alignments may also be processed with teloclip). It does this by searching alignments of raw long-read data (e.g. PacBio or ONT reads mapped with Minimap2) for 'clipped' alignments that occur at the ends of draft contigs. A 'clipped' alignment is produced where the end of a read is not part of its best alignment. This can occur when a read extends past the end of an assembled contig.

Information about which segments of a read were aligned or clipped is stored in SAM formatted alignments as a CIGAR string. Teloclip parses these strings to determine if a read has been clipped at one or both ends of a contig.
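As a rough illustration of the CIGAR parsing described above, the following minimal sketch extracts the soft-clip lengths at each end of an alignment. The function name `soft_clip_lengths` is hypothetical and is not part of teloclip's API; the operation codes are those defined by the SAM specification.

```python
import re

# CIGAR operation codes as defined in the SAM specification.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def soft_clip_lengths(cigar):
    """Return (left, right) soft-clip lengths for a CIGAR string.

    Illustrative helper only, not teloclip's own implementation.
    """
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    # Hard clips (H) can only sit outside soft clips, so drop them first.
    ops = [(n, op) for n, op in ops if op != "H"]
    left = ops[0][0] if ops and ops[0][1] == "S" else 0
    # Require more than one operation so a lone clip is not counted twice.
    right = ops[-1][0] if len(ops) > 1 and ops[-1][1] == "S" else 0
    return left, right


print(soft_clip_lengths("120S4000M"))      # clipped on the left only
print(soft_clip_lengths("10S50M2I30M7S"))  # clipped on both sides
```

A read whose left clip is non-zero and whose alignment starts at the first base of a contig is a candidate for extending that contig end.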

Optionally, teloclip can screen overhanging reads for telomere-associated motifs (e.g. 'TTAGGG' / 'CCCTAA') and report only those containing a match.
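The motif screen can be pictured with a short sketch like the one below, which checks a clipped sequence for a motif on either strand. The helper names `revcomp` and `has_motif` are assumptions made for illustration and do not reflect teloclip's internal code.

```python
# Translation table for complementing DNA bases (upper and lower case).
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def revcomp(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def has_motif(clipped_seq, motifs, both_strands=True):
    """Return True if any motif occurs in the clipped sequence.

    Hypothetical helper: motifs are assumed to be upper-case strings.
    """
    targets = set(motifs)
    if both_strands:
        # Also search for the reverse complement of each motif.
        targets |= {revcomp(m) for m in motifs}
    seq = clipped_seq.upper()
    return any(target in seq for target in targets)


print(revcomp("TTAGGG"))
print(has_motif("ACGTCCCTAACCCTAA", ["TTAGGG"]))
```

Searching both strands matters because a telomere at the start of a contig appears as 'CCCTAA' repeats, while one at the end appears as 'TTAGGG'.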

Teloclip is based on concepts from Torsten Seemann's excellent tool samclip. Samclip can be used to remove clipped alignments from a SAM file prior to variant calling.

Options and Usage

Installation

Teloclip requires Python >= 3.8.

There are 4 options available for installing Teloclip locally:

1) Install from PyPI.
This or Bioconda will get you the latest stable release.

```bash
pip install teloclip
```

2) Install from Bioconda.

```bash
conda install -c bioconda teloclip
```

3) Pip install directly from this git repository.

This is the best way to ensure you have the latest development version.

```bash
pip install git+https://github.com/Adamtaranto/teloclip.git
```

4) Clone from this repository and install as a local Python package.

Do this if you want to edit the code.

```bash
git clone https://github.com/Adamtaranto/teloclip.git && cd teloclip && pip install -e '.[dev]'
```

Verify installation

```bash
# Print version number and exit.
teloclip --version
# > teloclip 0.1.1

# Get usage information
teloclip --help
```

Run with Gitpod

Alternatively, launch a Gitpod Workspace with teloclip, samtools, and minimap2 pre-installed.

Example Usage

Basic use case:

First index the reference assembly

```bash
# Create index of reference fasta
samtools faidx ref.fa
```
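The `.fai` index written by `samtools faidx` is a tab-separated table whose first two columns are the sequence name and its length, which is the information needed to locate contig ends. A minimal sketch of reading it follows; `read_fai_lengths` is a hypothetical helper, not a teloclip function, and the example index contents are made up.

```python
import io


def read_fai_lengths(handle):
    """Map contig name -> length from an open .fai index file."""
    lengths = {}
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        # Columns: NAME, LENGTH, OFFSET, LINEBASES, LINEWIDTH
        if len(fields) >= 2:
            lengths[fields[0]] = int(fields[1])
    return lengths


# Made-up index contents standing in for ref.fa.fai.
example = io.StringIO("chr1\t5000\t6\t60\t61\nchr2\t3200\t5096\t60\t61\n")
print(read_fai_lengths(example))
```

With a real index you would pass `open("ref.fa.fai")` instead of the in-memory example.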

Reading alignments from SAM file

```bash
# Read alignment input from sam file and write overhang-reads to stdout
teloclip --ref ref.fa.fai in.sam

# Read alignment input from stdin and write stdout to file
teloclip --ref ref.fa.fai < in.sam > out.sam
```

Reading and writing BAM alignments

BAM files are binary SAM files. They contain the same information but take up much less storage space. You can use BAM files with teloclip like this:

```bash
# Read alignments from bam file, pipe sam lines to teloclip, sort overhang-read alignments and write to bam file
samtools view -h in.bam | teloclip --ref ref.fa.fai | samtools sort > out.bam
```

Streaming SAM records from aligner

```bash
# Map PacBio long-reads to ref assembly,
# return alignments clipped at contig ends,
# write to sorted bam.
minimap2 -ax map-pb ref.fa pacbio_reads.fq.gz | teloclip --ref ref.fa.fai | samtools sort > out.bam

# Map reads to reference,
# Exclude non-primary alignments.
# Return alignments clipped at contig ends,
# write to sorted bam.
minimap2 -ax map-pb ref.fa pacbio_reads.fq.gz | samtools view -h -F 0x100 | teloclip --ref ref.fa.fai | samtools sort > out.bam
```

Report clipped alignments containing target motifs

```bash
# Report alignments which are clipped at a contig end
# AND contain >=1 copy of the telomeric repeat "TTAGGG" (or its reverse complement "CCCTAA") in the clipped region.
samtools view -h in.bam | teloclip --ref ref.fa.fai --motifs TTAGGG | samtools sort > out.bam

# Report alignments which are clipped at a contig end
# AND contain >=1 copy of the telomeric repeat "TTAGGG" (or its reverse complement "CCCTAA") ANYWHERE in the read.
samtools view -h in.bam | teloclip --ref ref.fa.fai --motifs TTAGGG --matchAny | samtools sort > out.bam

# To change the minimum number of consecutive repeats required for a match, simply extend the search motif.
# In this example 3 TTAGGG are required for a positive match.
samtools view -h in.bam | teloclip --ref ref.fa.fai --motifs TTAGGGTTAGGGTTAGGG | samtools sort > out.bam
```

Matching noisy target motifs

Raw long-reads can contain errors in the length of homopolymer tracts. If the --fuzzy option is set, motifs will be converted to regex patterns that allow the number of repeated bases to vary by +/- 1.
e.g. "TTAGGG" -> "T{1,3}AG{2,4}". This pattern will match TTAGG, TTAGGGG, TAGG, TTTAGGG, etc.
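The motif-to-regex conversion described above can be sketched as follows. This is one way to reproduce the "TTAGGG" -> "T{1,3}AG{2,4}" example and is an assumption about the approach, not teloclip's actual code; `fuzzy_pattern` is a hypothetical name. Single bases are left exact, as in the example.

```python
import re
from itertools import groupby


def fuzzy_pattern(motif):
    """Convert a motif to a regex that lets each homopolymer run vary by +/- 1."""
    parts = []
    for base, run in groupby(motif):
        n = len(list(run))
        if n == 1:
            # Single bases are matched exactly.
            parts.append(base)
        else:
            parts.append(f"{base}{{{n - 1},{n + 1}}}")
    return "".join(parts)


pattern = fuzzy_pattern("TTAGGG")
print(pattern)
for candidate in ["TTAGG", "TTAGGGG", "TAGG", "TTTAGGG"]:
    print(candidate, bool(re.fullmatch(pattern, candidate)))
```

Because the relaxed pattern is short, it will also match many non-telomeric sequences, which is why requiring several consecutive matches is advisable.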

To reduce off-target matching you can increase the minimum required number of motif matches with "--min_repeats".

```bash
# Compress homopolymers in query motifs and clipped regions to compensate for errors in raw PacBio or ONT data.
# e.g. The motif 'TTAGGGTTAGGG' becomes 'TAGTAG' and will match 'TTTTTAAAGGTTTAAGGG'.
samtools view -h in.bam | teloclip --ref ref.fa.fai --noPoly --motifs TTAGGGTTAGGG | samtools sort > out.bam
```
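Homopolymer compression of the kind described for `--noPoly` can be sketched in a few lines. The function name `compress_homopolymers` is hypothetical; the sketch only illustrates the idea of collapsing each run of identical bases to a single base before comparing motif and read.

```python
from itertools import groupby


def compress_homopolymers(seq):
    """Collapse every run of identical bases to one base."""
    return "".join(base for base, _ in groupby(seq))


motif = compress_homopolymers("TTAGGGTTAGGG")
read = compress_homopolymers("TTTTTAAAGGTTTAAGGG")
print(motif, read, motif in read)
```

Compression discards run-length information entirely, so it is more permissive than allowing runs to vary by one base.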

Extract clipped reads

teloclip-extract will write overhanging reads to separate fasta files for each reference contig end. The clipped region of each read is masked as lowercase in output fasta files.
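The lowercase masking of clipped regions can be illustrated with the sketch below. It assumes the read sequence still contains its soft-clipped bases (as the SAM SEQ field does) and that the clip lengths have already been taken from the CIGAR string; `mask_clips` is a hypothetical helper, not the function used by teloclip-extract.

```python
def mask_clips(seq, left_clip, right_clip):
    """Lowercase the soft-clipped ends of a read and uppercase the aligned core."""
    end = len(seq) - right_clip
    return seq[:left_clip].lower() + seq[left_clip:end].upper() + seq[end:].lower()


# A read with 12 clipped bases overhanging the left end of a contig.
print(mask_clips("CCCTAACCCTAAACGTACGT", 12, 0))
```

Masking in this way keeps the full read available for assembly while making the overhanging portion easy to spot in the fasta output.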

Collections of reads that overhang a contig end can be assembled with miniasm into a single segment before being used to extend the contig. The final telomere-extended assembly should be polished (e.g. with Racon or Pilon) to correct errors in the raw long-read extensions.

```bash
# Find clipped alignments containing motif 'TTAGGG' and write reads to separate fasta files for each reference contig end.
samtools view -h in.bam | teloclip --ref ref.fa.fai --motifs TTAGGG | teloclip-extract --refIdx ref.fa.fai --extractReads --extractDir SplitOverhangs
```

Optional Quality Control

Additional filters

Users may wish to exclude reads below a minimum length or read quality score to reduce the risk of incorrect alignments.

In some cases it may also be useful to prioritise primary alignments. This can be done by pre-filtering alignments with samtools view. You can decode SAM flags here.

```bash
# Exclude secondary alignments.
samtools view -h -F 0x100 in.sam | teloclip --ref ref.fa.fai > noSA.sam
```
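To see what a filter such as `-F 0x100` excludes, the SAM flag field can be decoded bit by bit. The sketch below uses the bit values from the SAM specification; the short labels and the `decode_flag` name are made up for illustration.

```python
# Bit values from the SAM specification; labels are informal.
SAM_FLAGS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse_strand",
    0x20: "mate_reverse_strand",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}


def decode_flag(flag):
    """Return the labels of all bits set in a SAM flag value."""
    return [name for bit, name in SAM_FLAGS.items() if flag & bit]


print(decode_flag(272))   # 256 + 16
print(decode_flag(2064))  # 2048 + 16
```

Note that `-F 0x100` removes secondary alignments only; supplementary alignments (0x800) are kept unless that bit is also excluded.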

Pre-corrected Data

Some assembly tools, such as Canu, perform pre-correction of long-reads through iterative overlapping and correction prior to assembly. Corrected reads are trimmed based on coverage to remove low-confidence ends.

This trimming step can result in loss of distal telomeric sequences and so these reads should NOT be used with Teloclip.

However, long-reads that have been error-corrected using Illumina data with tools such as LoRDEC or HALC should be fine.

Generally speaking, raw long-reads will be fine for extending your contigs. Any errors in the extended region can be corrected with a round of polishing with short-read data using Pilon.

Extending contigs

Before using terminal alignments identified by Teloclip to extend contigs, you should inspect the alignments in a genome browser that displays information about clipped reads, such as IGV.

Check for conflicting soft-clipped sequences. These indicate non-specific read alignments. You may need to tighten your alignment criteria or manually remove low-confidence alignments.

After manually extending contigs, the revised assembly should be re-polished using available long and short read data to correct indels present in the raw long-reads.

Finally, validate the updated assembly by re-mapping long-read data and checking for alignments that extend into revised contig ends.

Alternative use cases

Illumina data

Teloclip will also work fine with aligned short read data, which has a far lower error rate than single-molecule long-read data.

However, there are obvious limits to the distance that a contig may be extended with shorter reads.

Teloclip does not use information from paired-reads.

Merging existing assemblies

You may have assemblies for your genome generated with different assemblers/configurations or data types (e.g. Illumina, PacBio, ONT) which vary in their success in assembling individual telomeres.

These alternative assemblies can be treated as pseudo-long-reads and aligned to a reference using Minimap2.

Teloclip can identify aligned contigs that can be used to extend those in the reference set.

Be cautious of short contigs that may align to many repetitive sub-telomeric regions and result in non-specific extension of contigs.

Also beware of low-complexity telomeric regions on different chromosomes aligning to each other and resulting in end-to-end fusions.

```bash
# Align alternative assembly contigs to reference and report overhang alignments. Ignore secondary alignments.
minimap2 -ax asm5 ref.fa asm.fa | samtools view -h -F 0x100 | teloclip --ref ref.fa.fai | samtools sort > asm2ref.bam
```

Circularising Mitochondrial / Bacterial genomes

Using default settings, teloclip will report alignments with clipped regions extending past linear contig ends.

Reads can be extracted from these alignments using circlator's bam2reads and re-aligned to an assembly graph in Bandage to help identify uncircularised contigs.

Options

Teloclip Options

Run teloclip --help to view the program's most commonly used options:

```
Usage: teloclip [-h] [--version] --refIdx REFIDX [--minClip MINCLIP] [--maxBreak MAXBREAK]
                [--motifs MOTIFS] [--noRev NOREV] [--noPoly NOPOLY] [--matchAny MATCHANY]
                [samfile]

Required:
  --refIdx REFIDX  Path to fai index for reference fasta. Index fasta using `samtools faidx FASTA`

Positional arguments:
  samfile          Input SAM can be added as the first positional argument after flagged options.
                   If not set teloclip will read from stdin.

Optional:
  --minClip        Require clip to extend past ref contig end by at least N bases.
                   Default: 1
  --maxBreak       Tolerate max N unaligned bases at contig ends.
                   Default: 50
  --motifs         If set keep only reads containing given motif/s from a comma delimited list of strings.
                   By default also search for reverse complement of motifs.
                   i.e. TTAGGG,TTAAGGG will also match CCCTAA,CCCTTAA
                   Default: None
  --noRev          If set do NOT search for reverse complement of specified motifs.
                   Default: Find motifs on both strands.
  --noPoly         If set collapse homopolymer tracks within motifs before searching overhangs.
                   i.e. "TTAGGGTTAGGGTTAGGGTTAGGGTTAGGG" -> "TAGTAGTAGTAGTAG".
                   Useful for PacBio or ONT long reads homopolymer length errors.
                   Default: Off.
  --matchAny       If set motif match may occur in unclipped region of alignment.
                   Default: False
  --version        Show program's version number and exit.
```
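One plausible reading of how `--minClip` and `--maxBreak` interact is sketched below. This is an interpretation of the option descriptions, not teloclip's actual implementation, and every name in it (`overhangs`, `pos`, `ref_span`, and so on) is hypothetical. It assumes a 1-based leftmost alignment position, as in the SAM POS field.

```python
def overhangs(pos, ref_span, left_clip, right_clip, contig_len,
              min_clip=1, max_break=50):
    """Return (left, right) booleans: does the read overhang each contig end?

    pos        1-based leftmost aligned reference position
    ref_span   number of reference bases covered by the alignment
    """
    # Unaligned reference bases between the alignment and each contig end.
    left_gap = pos - 1
    right_gap = contig_len - (pos + ref_span - 1)
    # The alignment must end near the contig end, and the clipped bases
    # must reach past that end by at least min_clip bases.
    left = left_gap <= max_break and left_clip - left_gap >= min_clip
    right = right_gap <= max_break and right_clip - right_gap >= min_clip
    return left, right


# Alignment starting at base 1 with a 120 bp left clip on a 100 kb contig.
print(overhangs(1, 4000, 120, 0, 100000))
# Alignment ending 20 bp short of the contig end with a 35 bp right clip.
print(overhangs(95981, 4000, 0, 35, 100000))
```

Under this reading, raising `--maxBreak` admits alignments that stop further from the contig end, while raising `--minClip` demands a longer extension beyond it.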

Teloclip-extract Options

Run teloclip-extract --help to view the program's most commonly used options:

```
Usage: teloclip-extract [-h] --refIdx REFIDX [--prefix PREFIX] [--extractReads]
                        [--extractDir EXTRACTDIR] [--minClip MINCLIP] [--maxBreak MAXBREAK]
                        [--version] [samfile]

positional arguments:
  samfile         If not set, will read sam from stdin.

optional arguments:
  -h, --help      Show this help message and exit
  --refIdx        Path to fai index for reference fasta. Index fasta using `samtools faidx FASTA`
  --prefix        Use this prefix for output files. Default: None.
  --extractReads  If set, write overhang reads to fasta by contig.
  --extractDir    Write extracted reads to this directory. Default: cwd.
  --minClip       Require clip to extend past ref contig end by at least N bases.
  --maxBreak      Tolerate max N unaligned bases at contig ends.
  --version       Show program's version number and exit
```

Citing Teloclip

If you use Teloclip in your work please cite this git repo directly and note the release version you used.

Publications using Teloclip

van Westerhoven, A., Mehrabi, R., Talebi, R., Steentjes, M., Corcolon, B., Chong, P., Kema, G. and Seidl, M.F., 2023. A chromosome-level genome assembly of Zasmidium syzygii isolated from banana leaves. bioRxiv, pp.2023-08.

Yang, H.P., Wenzel, M., Hauser, D.A., Nelson, J.M., Xu, X., Eliáš, M. and Li, F.W., 2021. Monodopsis and Vischeria genomes shed new light on the biology of eustigmatophyte algae. Genome biology and evolution, 13(11), p.evab233.

Issues

Submit feedback to the Issue Tracker

License

Software provided under MIT license.

Owner

  • Name: Adam Taranto
  • Login: Adamtaranto
  • Kind: user
  • Location: Melbourne, Australia
  • Company: The University of Melbourne

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
title: "teloclip"
version: 0.1.0
date-released: 2024-11-13
authors:
  - family-names: Taranto
    given-names: Adam
    orcid: https://orcid.org/0000-0003-4759-3475
    affiliation: "The University of Melbourne"
repository-code: "https://github.com/Adamtaranto/teloclip"
license: MIT
abstract: >-
  A tool for the recovery of unassembled telomeres from raw long-reads using soft-clipped read alignments.
keywords:
  - genomics
  - telomeres
  - bioinformatics
preferred-citation:
  type: software
  authors:
    - family-names: Taranto
      given-names: Adam
      orcid: https://orcid.org/0000-0003-4759-3475
      affiliation: "The University of Melbourne"
  title: "teloclip: A tool for the recovery of unassembled telomeres from raw long-reads using soft-clipped read alignments."
  year: 2019
  url: "https://github.com/Adamtaranto/teloclip"
  repository-code: "https://github.com/Adamtaranto/teloclip"
  # doi: TBA

GitHub Events

Total
  • Create event: 6
  • Issues event: 2
  • Release event: 3
  • Watch event: 4
  • Delete event: 2
  • Issue comment event: 3
  • Push event: 16
  • Pull request event: 8
Last Year
  • Create event: 6
  • Issues event: 2
  • Release event: 3
  • Watch event: 4
  • Delete event: 2
  • Issue comment event: 3
  • Push event: 16
  • Pull request event: 8

Committers

Last synced: about 1 year ago

All Time
  • Total Commits: 120
  • Total Committers: 1
  • Avg Commits per committer: 120.0
  • Development Distribution Score (DDS): 0.0
Past Year
  • Commits: 15
  • Committers: 1
  • Avg Commits per committer: 15.0
  • Development Distribution Score (DDS): 0.0
Top Committers
Name Email Commits
Adam Taranto a****o@g****m 120

Issues and Pull Requests

Last synced: 8 months ago

All Time
  • Total issues: 24
  • Total pull requests: 7
  • Average time to close issues: 7 months
  • Average time to close pull requests: about 4 hours
  • Total issue authors: 10
  • Total pull request authors: 1
  • Average comments per issue: 2.29
  • Average comments per pull request: 0.14
  • Merged pull requests: 6
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 2
  • Pull requests: 4
  • Average time to close issues: 1 day
  • Average time to close pull requests: about 6 hours
  • Issue authors: 2
  • Pull request authors: 1
  • Average comments per issue: 2.5
  • Average comments per pull request: 0.0
  • Merged pull requests: 4
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • Adamtaranto (14)
  • cyycyj (1)
  • zhangwenda0518 (1)
  • JWDebler (1)
  • ufaroooq (1)
  • PengfeiInTuebingen (1)
  • xiekunwhy (1)
  • stefankusch (1)
  • ihbxiongjie (1)
  • juntkym (1)
Pull Request Authors
  • Adamtaranto (12)
Top Labels
Issue Labels
enhancement (11) bug (2) question (2) wontfix (1)
Pull Request Labels

Packages

  • Total packages: 1
  • Total downloads:
    • pypi 32 last-month
  • Total dependent packages: 0
  • Total dependent repositories: 1
  • Total versions: 6
  • Total maintainers: 1
pypi.org: teloclip

A tool for the recovery of unassembled telomeres from raw long-reads using soft-clipped read alignments.

  • Versions: 6
  • Dependent Packages: 0
  • Dependent Repositories: 1
  • Downloads: 32 Last month
Rankings
Dependent packages count: 10.1%
Dependent repos count: 21.6%
Average: 24.4%
Downloads: 41.4%
Maintainers (1)
Last synced: 8 months ago

Dependencies

.devcontainer/Dockerfile docker
  • mcr.microsoft.com/devcontainers/miniconda 0-3 build
pyproject.toml pypi
requirements.txt pypi
setup.py pypi
environment.yml conda
  • minimap2
  • pip
  • python >=3.12
  • samtools