mimeo

Scan genomes for internally repeated sequences, elements which are repetitive in another species, or high-identity HGT candidate regions between species.

https://github.com/adamtaranto/mimeo

Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (10.8%) to scientific vocabulary
Last synced: 7 months ago

Repository

Basic Info
Statistics
  • Stars: 1
  • Watchers: 2
  • Forks: 2
  • Open Issues: 2
  • Releases: 4
Created over 8 years ago · Last pushed about 1 year ago

README.md

Mimeo

A tool for finding and annotating repeats in whole-genome alignments.

Modules

Mimeo comprises three tools for parsing repeats from whole-genome alignments:

mimeo-self

Internal repeat finder. Mimeo-self aligns a genome to itself and extracts high-identity segments above a coverage threshold. This method is less sensitive to disruption by indels and repeat-directed point mutations than k-mer-based methods such as RepeatScout. Reported annotations mark overlapping segments above the coverage threshold; mimeo-self does not attempt to separate nested repeats. Use this tool to identify candidate repeat regions for curated annotation.
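The coverage-threshold idea can be illustrated with a short sweep over alignment intervals. This is a minimal sketch, not Mimeo's implementation: the function name `regions_at_depth` and the half-open `(start, end)` interval convention are assumptions made for the example.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open [start, end) on one scaffold (assumed convention)


def regions_at_depth(hits: List[Interval], min_cov: int) -> List[Interval]:
    """Return merged regions covered by at least `min_cov` aligned segments."""
    events = []
    for start, end in hits:
        events.append((start, 1))   # a hit opens here
        events.append((end, -1))    # a hit closes here
    # Tuples sort by position, then delta, so closes (-1) are handled before
    # opens (+1) at the same position and touching hits do not stack.
    events.sort()

    regions: List[Interval] = []
    depth = 0
    region_start = 0
    for pos, delta in events:
        previous = depth
        depth += delta
        if previous < min_cov <= depth:
            region_start = pos                      # depth just reached the threshold
        elif depth < min_cov <= previous and pos > region_start:
            if regions and regions[-1][1] == region_start:
                # Extend the previous region when the two are adjacent.
                regions[-1] = (regions[-1][0], pos)
            else:
                regions.append((region_start, pos))
    return regions


hits = [(0, 100), (50, 150), (80, 200), (300, 400)]
print(regions_at_depth(hits, min_cov=3))  # [(80, 100)]
print(regions_at_depth(hits, min_cov=2))  # [(50, 150)]
```

Only the stretch where three hits overlap survives a threshold of 3, which is why the reported annotation can be shorter than any single alignment.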

mimeo-x

Cross-species repeat finder. A newly acquired or low-copy transposon may slip past copy-number-based annotation tools. Mimeo-x searches for features that are abundant in an external reference genome, allowing annotation of complete elements as they occur in a horizontal-transfer donor species, or of conserved coding segments of related transposon families.

mimeo-map

Find all high-identity segments shared between genomes. Mimeo-map identifies candidate horizontally transferred segments between sufficiently diverged species. When comparing isolates of a single species, aligned segments correspond to directly homologous sequences and internally repetitive features.

Intra- or inter-genomic alignments from Mimeo-self or Mimeo-x can be reprocessed with Mimeo-map to generate annotations of unfiltered, uncollapsed alignments. These raw alignment annotations can be used to interrogate repetitive segments for coverage breakpoints corresponding to nested transposons with differing abundances across the genome.
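As an illustration of what a coverage breakpoint looks like, the sketch below builds a depth profile from raw alignment intervals and lists the positions where depth steps up or down. It is a hypothetical helper, not part of Mimeo; `depth_profile`, `breakpoints`, and the half-open interval convention are assumptions.

```python
from typing import Dict, List, Tuple

Segment = Tuple[int, int, int]  # (start, end, depth), half-open coordinates


def depth_profile(hits: List[Tuple[int, int]]) -> List[Segment]:
    """Split the covered span into runs of constant alignment depth."""
    deltas: Dict[int, int] = {}
    for start, end in hits:
        deltas[start] = deltas.get(start, 0) + 1
        deltas[end] = deltas.get(end, 0) - 1

    positions = sorted(deltas)
    profile: List[Segment] = []
    depth = 0
    for left, right in zip(positions, positions[1:]):
        depth += deltas[left]
        if depth == 0:
            continue  # uncovered gap between hits
        if profile and profile[-1][1] == left and profile[-1][2] == depth:
            # Same depth as the adjacent run: extend it instead of splitting.
            profile[-1] = (profile[-1][0], right, depth)
        else:
            profile.append((left, right, depth))
    return profile


def breakpoints(profile: List[Segment]) -> List[int]:
    """Positions where two adjacent runs meet, i.e. where depth changes."""
    return [b[0] for a, b in zip(profile, profile[1:]) if a[1] == b[0]]


# A 1 kb element hit twice, with a nested 200 bp element hit three more times.
hits = [(0, 1000), (0, 1000), (400, 600), (400, 600), (400, 600)]
profile = depth_profile(hits)
print(profile)               # [(0, 400, 2), (400, 600, 5), (600, 1000, 2)]
print(breakpoints(profile))  # [400, 600]
```

The step from depth 2 to depth 5 at positions 400 and 600 is the kind of signal that suggests a nested element with a different copy number.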

mimeo-filter

An additional tool, mimeo-filter, is included for post-filtering SSR-rich sequences from FASTA-formatted candidate-repeat libraries.

Installing Mimeo

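Requirements:

  • LASTZ
  • bedtools
  • TRF (Tandem Repeats Finder)

Each executable must be available in $PATH unless a custom path is given with --lzpath, --bedtools, or --TRFpath.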

Install from Bioconda:

```bash
conda install mimeo
```

Install from PyPI:

```bash
pip install mimeo
```

Clone and install from this repository:

```bash
git clone https://github.com/Adamtaranto/mimeo.git && cd mimeo

pip install -e '.[dev]'
```

Example usage

Demo: mimeo-self

Annotate features in genome A which are > 100bp and occur with >= 80% identity at least 3 times on other scaffolds OR at least 4 times on the same scaffold.

```bash
mimeo self --adir data/A_genome_Split --afasta data/A_genome.fasta \
  -d MS_outdir --gffout A_genome_Inter3_Intra4_id80_len_100.gff3 \
  --outfile A_genome_Self_Align.tab --label A_Rep3 --prefix A_Self --minIdt 80 \
  --minLen 100 --minCov 3 --intraCov 4 --strictSelf
```

Output:

  • MS_outdir/A_genome_Inter3_Intra4_id80_len_100.gff3
  • MS_outdir/A_genome_Self_Align.tab
  • data/A_genome_Split/*.fa

Demo: mimeo-x

Annotate features in genome A which are > 100bp and occur with >= 80% identity at least 5 times in genome B.

```bash
mimeo x --afasta data/A_genome.fasta --bfasta data/B_genome.fasta \
  -d MX_outdir --gffout B_Rep5_in_A.gff3 --outfile B_Reps_in_A_id80_len100.tab \
  --label B_Rep5 --prefix B_Rep5 --minIdt 80 --minLen 100 --minCov 5
```

Output:

  • MX_outdir/B_Rep5_in_A.gff3
  • MX_outdir/B_Reps_in_A_id80_len100.tab

Demo: mimeo-map

Annotate features in genome A which are > 100bp and occur with >= 90% identity in genome B. No coverage filter is applied; all alignments are reported.

```bash
mimeo map --afasta data/A_genome.fasta --bfasta data/B_genome.fasta \
  -d MM_outdir --gffout B_in_A_id90.gff3 --outfile B_in_A_id90.tab \
  --label B_90 --prefix B_90 --minIdt 90 --minLen 100
```

Output:

  • MM_outdir/B_in_A_id90.gff3
  • MM_outdir/B_in_A_id90.tab

mimeo-map + SSR filter

Annotate features in genome A which are > 100bp and occur with >= 98% identity in genome B. Reuse B to A-genome alignment from the previous run.

Filter out hits which are >= 40% tandem repeats. Write filtered hits as tab file and GFF3 annotation.

```bash
mimeo map --afasta data/A_genome.fasta --bfasta data/B_genome.fasta \
  -d MM_outdir --gffout B_in_A_id98_maxSSR40.gff3 --outfile B_in_A_id98.tab \
  --label B_98 --prefix B_98 --minIdt 98 --minLen 100 \
  --recycle --maxtandem 40 --writeTRF
```

Output:

  • MM_outdir/B_in_A_id98_maxSSR40.gff3
  • MM_outdir/B_in_A_id98.tab.trf
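The tandem-repeat filter comes down to asking what percentage of a hit is covered by tandem-repeat intervals. The sketch below shows that calculation under stated assumptions: `masked_percent` and `keep_hit` are hypothetical names, coordinates are half-open, and the `>=` boundary follows the wording of this demo rather than a confirmed detail of Mimeo or TRF.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open [start, end)


def masked_percent(hit: Interval, repeats: List[Interval]) -> float:
    """Percentage of `hit` covered by the union of `repeats`."""
    start, end = hit
    # Clip each repeat to the hit and drop those that do not overlap it.
    clipped = sorted(
        (max(s, start), min(e, end)) for s, e in repeats if s < end and e > start
    )
    covered = 0
    cur_start = cur_end = None
    for s, e in clipped:
        if cur_end is None or s > cur_end:
            if cur_end is not None:
                covered += cur_end - cur_start  # close the previous merged block
            cur_start, cur_end = s, e
        else:
            cur_end = max(cur_end, e)           # overlapping repeats are merged
    if cur_end is not None:
        covered += cur_end - cur_start
    return 100.0 * covered / (end - start)


def keep_hit(hit: Interval, repeats: List[Interval], max_tandem: float = 40.0) -> bool:
    """Assumed rule: discard hits that are >= max_tandem percent tandem repeat."""
    return masked_percent(hit, repeats) < max_tandem


repeats = [(10, 60), (40, 100), (190, 300)]
print(masked_percent((0, 200), repeats))  # 50.0
print(keep_hit((0, 200), repeats))        # False
```

Merging overlapping repeat intervals before summing avoids counting the same bases twice, which would otherwise inflate the masked percentage.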

Demo: mimeo-filter

Filter sequences composed of >= 40% short tandem repeats from a multifasta library of candidate transposons.

```bash
mimeo filter --infile data/candidate_TEs.fa
```

Output:

  • candidate_TEs_filtered.fa

Standard options

mimeo-self

```code
Usage: mimeo self [-h] [--adir ADIR] [--afasta AFASTA] [-r] [-d OUTDIR]
                  [--gffout GFFOUT] [--outfile OUTFILE] [--verbose]
                  [--label LABEL] [--prefix PREFIX] [--lzpath LZPATH]
                  [--bedtools BEDTOOLS] [--minIdt MINIDT] [--minLen MINLEN]
                  [--minCov MINCOV] [--hspthresh HSPTHRESH]
                  [--intraCov INTRACOV] [--strictSelf]

Internal repeat finder. Mimeo-self aligns a genome to itself and extracts
high-identity segments above a coverage threshold.

Optional arguments:
  -h, --help      Show this help message and exit.
  --adir          Name of the directory containing sequences from the genome.
                  Write split files here if providing genome as multifasta.
  --afasta        Genome as multifasta.
  -r, --recycle   Use existing alignment "--outfile" if found.
  -d, --outdir    Write output files to this directory. (Default: cwd)
  --gffout        Name of GFF3 annotation file.
  --outfile       Name of alignment result file.
  --verbose       If set report LASTZ progress.
  --label         Set annotation TYPE field in gff.
  --prefix        ID prefix for internal repeats.
  --lzpath        Custom path to LASTZ executable if not in $PATH.
  --bedtools      Custom path to bedtools executable if not in $PATH.
  --minIdt        Minimum alignment identity to report.
  --minLen        Minimum alignment length to report.
  --minCov        Minimum depth of aligned segments to report repeat feature.
  --hspthresh     Set HSP min score threshold for LASTZ.
  --intraCov      Minimum depth of aligned segments from the same scaffold to
                  report feature. Used if "--strictSelf" mode is selected.
  --strictSelf    If set process same-scaffold alignments separately with the
                  option to use higher "--intraCov" threshold. Sometimes useful
                  to avoid false repeat calls from staggered alignments over
                  SSRs or short tandem duplication.
```

mimeo-x

```code
Usage: mimeo x [-h] [--adir ADIR] [--bdir BDIR] [--afasta AFASTA]
               [--bfasta BFASTA] [-r] [-d OUTDIR] [--gffout GFFOUT]
               [--outfile OUTFILE] [--verbose] [--label LABEL]
               [--prefix PREFIX] [--lzpath LZPATH] [--bedtools BEDTOOLS]
               [--minIdt MINIDT] [--minLen MINLEN] [--minCov MINCOV]
               [--hspthresh HSPTHRESH]

Cross-species repeat finder. Mimeo-x searches for features which are abundant
in an external reference genome.

Optional arguments:
  -h, --help      Show this help message and exit.
  --adir          Name of the directory containing sequences from A genome.
  --bdir          Name of the directory containing sequences from B genome.
  --afasta        A genome as multifasta.
  --bfasta        B genome as multifasta.
  -r, --recycle   Use existing alignment "--outfile" if found.
  -d, --outdir    Write output files to this directory. (Default: cwd)
  --gffout        Name of GFF3 annotation file.
  --outfile       Name of alignment result file.
  --verbose       If set report LASTZ progress.
  --label         Set annotation TYPE field in GFF.
  --prefix        ID prefix for B-genome repeats annotated in A-genome.
  --lzpath        Custom path to LASTZ executable if not in $PATH.
  --bedtools      Custom path to bedtools executable if not in $PATH.
  --minIdt        Minimum alignment identity to report.
  --minLen        Minimum alignment length to report.
  --minCov        Minimum depth of B-genome hits to report feature in A-genome.
  --hspthresh     Set HSP min score threshold for LASTZ.
```

mimeo-map

```code
Usage: mimeo map [-h] [--adir ADIR] [--bdir BDIR] [--afasta AFASTA]
                 [--bfasta BFASTA] [-r] [-d OUTDIR] [--gffout GFFOUT]
                 [--outfile OUTFILE] [--verbose] [--label LABEL]
                 [--prefix PREFIX] [--keeptemp] [--lzpath LZPATH]
                 [--minIdt MINIDT] [--minLen MINLEN] [--hspthresh HSPTHRESH]
                 [--TRFpath TRFPATH] [--tmatch TMATCH] [--tmismatch TMISMATCH]
                 [--tdelta TDELTA] [--tPM TPM] [--tPI TPI]
                 [--tminscore TMINSCORE] [--tmaxperiod TMAXPERIOD]
                 [--maxtandem MAXTANDEM] [--writeTRF]

Find all high-identity segments shared between genomes.

Optional arguments:
  -h, --help      Show this help message and exit.
  --adir          Name of the directory containing sequences from A genome.
  --bdir          Name of the directory containing sequences from B genome.
  --afasta        A genome as multifasta.
  --bfasta        B genome as multifasta.
  -r, --recycle   Use existing alignment "--outfile" if found.
  -d, --outdir    Write output files to this directory. (Default: cwd)
  --gffout        Name of GFF3 annotation file. If not set, suppress output.
  --outfile       Name of alignment result file.
  --verbose       If set report LASTZ progress.
  --label         Set annotation TYPE field in GFF.
  --prefix        ID prefix for B-genome hits annotated in A-genome.
  --keeptemp      If set does not remove temp files.
  --lzpath        Custom path to LASTZ executable if not in $PATH.
  --minIdt        Minimum alignment identity to report.
  --minLen        Minimum alignment length to report.
  --hspthresh     Set HSP min score threshold for LASTZ.
  --TRFpath       Custom path to TRF executable if not in $PATH.
  --tmatch        TRF matching weight.
  --tmismatch     TRF mismatching penalty.
  --tdelta        TRF indel penalty.
  --tPM           TRF match probability.
  --tPI           TRF indel probability.
  --tminscore     TRF minimum alignment score to report.
  --tmaxperiod    TRF maximum period size to report.
  --maxtandem     Max percentage of an A-genome alignment which may be masked
                  by TRF. If exceeded, the alignment will be discarded.
  --writeTRF      If set write TRF filtered alignment file for use with other
                  mimeo modules.
```

mimeo-filter

```code
Usage: mimeo filter [-h] --infile INFILE [-d OUTDIR] [--outfile OUTFILE]
                    [--keeptemp] [--verbose] [--TRFpath TRFPATH]
                    [--tmatch TMATCH] [--tmismatch TMISMATCH] [--tdelta TDELTA]
                    [--tPM TPM] [--tPI TPI] [--tminscore TMINSCORE]
                    [--tmaxperiod TMAXPERIOD] [--maxtandem MAXTANDEM]

Filter SSR containing sequences from FASTA library of repeats.

Optional arguments:
  -h, --help      Show this help message and exit.
  --infile        Name of the directory containing sequences from A genome.
  -d, --outdir    Write output files to this directory. (Default: cwd)
  --outfile       Name of alignment result file.
  --keeptemp      If set does not remove temp files.
  --verbose       If set report LASTZ progress.
  --TRFpath       Custom path to TRF executable if not in $PATH.
  --tmatch        TRF matching weight.
  --tmismatch     TRF mismatching penalty.
  --tdelta        TRF indel penalty.
  --tPM           TRF match probability.
  --tPI           TRF indel probability.
  --tminscore     TRF minimum alignment score to report.
  --tmaxperiod    TRF maximum period size to report. Note: Setting this score
                  too high may exclude some LTR retrotransposons. Optimal len
                  to exclude only SSRs is 10-50bp.
  --maxtandem     Max percentage of a sequence which may be masked by TRF. If
                  exceeded, the element will be discarded.
```

Importing alignments

Whole-genome alignments generated by alternative tools (e.g. BLAT) can be provided to any of the Mimeo modules as a tab-delimited file with the columns:

```code
[1]  name1    = Name of target sequence in genome A
[2]  strand1  = Strand of alignment in target sequence
[3]  start1   = 5-prime position of alignment in target (lower value irrespective of strand)
[4]  end1     = 3-prime position of alignment in target (higher value irrespective of strand)
[5]  name2    = Name of source sequence in genome B
[6]  strand2  = Strand of alignment in source
[7]  start2+  = 5-prime position of alignment in source (lower value irrespective of strand)
[8]  end2+    = 3-prime position of alignment in source (higher value irrespective of strand)
[9]  score    = Alignment score as int
[10] identity = Identity of alignment as float
```

The file should be sorted by columns 1, 3, and 4 (name1, start1, end1).
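An alignment table from another tool can be checked and sorted before import with a few lines of Python. This is a minimal sketch under stated assumptions: the function name `sort_alignments` is hypothetical, lines starting with `#` are treated as headers and dropped, name1 is sorted lexicographically, and start1 and end1 are sorted numerically.

```python
def sort_alignments(text: str) -> str:
    """Validate a 10-column tab-delimited alignment table and sort it by
    name1 (column 1), start1 (column 3), and end1 (column 4)."""
    rows = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        if not line.strip() or line.startswith("#"):
            continue  # assumption: blank lines and '#' headers are skipped
        fields = line.split("\t")
        if len(fields) != 10:
            raise ValueError(
                f"line {line_number}: expected 10 columns, found {len(fields)}"
            )
        rows.append(fields)

    # Column 1 as text, columns 3 and 4 as integers (assumed sort semantics).
    rows.sort(key=lambda f: (f[0], int(f[2]), int(f[3])))
    return "".join("\t".join(fields) + "\n" for fields in rows)


raw = (
    "scaf2\t+\t500\t900\tB1\t+\t10\t410\t3000\t95.5\n"
    "scaf1\t+\t1200\t1500\tB2\t-\t40\t340\t2500\t91.0\n"
    "scaf1\t+\t200\t700\tB1\t+\t5\t505\t4100\t98.2\n"
)
print(sort_alignments(raw), end="")
```

Sorting start1 as an integer matters: a plain text sort would place position 1200 before position 200.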

License

Software provided under MIT license.

Owner

  • Name: Adam Taranto
  • Login: Adamtaranto
  • Kind: user
  • Location: Melbourne, Australia
  • Company: The University of Melbourne

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
title: "mimeo"
version: 1.2.0
date-released: 2025-04-06
authors:
  - family-names: Taranto
    given-names: Adam
    orcid: https://orcid.org/0000-0003-4759-3475
    affiliation: "The University of Melbourne"
repository-code: "https://github.com/Adamtaranto/mimeo"
license: MIT
abstract: >-
  A tool for finding and annotating repeats in whole-genome alignments.
  Mimeo uses the LASTZ alignment engine to find regions of similarity within
  or between genomes, we apply filtering heuristics to identify candidate
  repeats or HGT events in a reference independent manner.
keywords:
  - genomics
  - transposons
  - bioinformatics
preferred-citation:
  type: software
  authors:
    - family-names: Taranto
      given-names: Adam
      orcid: https://orcid.org/0000-0003-4759-3475
      affiliation: "The University of Melbourne"
  title: "Mimeo: A tool for finding and annotating repeats in whole-genome alignments."
  year: 2017
  url: "https://github.com/Adamtaranto/mimeo"
  repository-code: "https://github.com/Adamtaranto/mimeo"
  # doi: TBA

GitHub Events

Total
  • Create event: 5
  • Release event: 3
  • Issues event: 2
  • Delete event: 3
  • Push event: 18
  • Pull request event: 6
Last Year
  • Create event: 5
  • Release event: 3
  • Issues event: 2
  • Delete event: 3
  • Push event: 18
  • Pull request event: 6

Committers

Last synced: over 2 years ago

All Time
  • Total Commits: 29
  • Total Committers: 1
  • Avg Commits per committer: 29.0
  • Development Distribution Score (DDS): 0.0
Past Year
  • Commits: 0
  • Committers: 0
  • Avg Commits per committer: 0.0
  • Development Distribution Score (DDS): 0.0
Top Committers
Name Email Commits
Adam Taranto a****o@g****m 29

Issues and Pull Requests

Last synced: 8 months ago

All Time
  • Total issues: 2
  • Total pull requests: 3
  • Average time to close issues: N/A
  • Average time to close pull requests: 2 days
  • Total issue authors: 1
  • Total pull request authors: 1
  • Average comments per issue: 0.0
  • Average comments per pull request: 0.0
  • Merged pull requests: 3
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 2
  • Pull requests: 3
  • Average time to close issues: N/A
  • Average time to close pull requests: 2 days
  • Issue authors: 1
  • Pull request authors: 1
  • Average comments per issue: 0.0
  • Average comments per pull request: 0.0
  • Merged pull requests: 3
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • Adamtaranto (2)
Pull Request Authors
  • Adamtaranto (6)
Top Labels
Issue Labels
enhancement (1)
Pull Request Labels

Packages

  • Total packages: 1
  • Total downloads:
    • pypi 91 last-month
  • Total dependent packages: 0
  • Total dependent repositories: 1
  • Total versions: 8
  • Total maintainers: 1
pypi.org: mimeo

  • Versions: 8
  • Dependent Packages: 0
  • Dependent Repositories: 1
  • Downloads: 91 Last month
Rankings
Dependent packages count: 10.0%
Forks count: 19.1%
Dependent repos count: 21.7%
Average: 27.0%
Stargazers count: 31.9%
Downloads: 52.1%
Maintainers (1)
Last synced: 8 months ago