Dnaapler

Dnaapler: A tool to reorient circular microbial genomes - Published in JOSS (2024)

https://github.com/gbouras13/dnaapler

Science Score: 95.0%

This score indicates how likely this project is to be science-related based on various indicators:

○ CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
✓ .zenodo.json file: Found .zenodo.json file
✓ DOI references: Found 14 DOI reference(s) in README and JOSS metadata
✓ Academic publication links: Links to ncbi.nlm.nih.gov, joss.theoj.org, zenodo.org
✓ Committers with academic emails: 1 of 6 committers (16.7%) from academic institutions
○ Institutional organization owner
✓ JOSS paper metadata: Published in Journal of Open Source Software

Scientific Fields

Engineering Computer Science - 40% confidence

Last synced: 11 months ago · JSON representation

Repository

Reorients assembled microbial sequences

Basic Info

Host: GitHub
Owner: gbouras13
License: mit
Language: Python
Default Branch: main
Homepage:
Size: 30.5 MB

Statistics

Stars: 125
Watchers: 3
Forks: 6
Open Issues: 5
Releases: 18

Created almost 4 years ago · Last pushed 11 months ago

Metadata Files

Readme Changelog Contributing License Code of conduct Authors

dnaapler

Dnaapler is a simple tool that reorients complete circular microbial genomes.

Quick Start

```
# creates empty conda environment
conda create -n dnaapler_env

# activates conda environment
conda activate dnaapler_env

# installs dnaapler
conda install -c bioconda dnaapler

# runs dnaapler all
dnaapler all -i inputmixedcontigs.fasta -o outputdirectorypath -p mybacterianame -t 8

# runs dnaapler all with a gfa file from e.g. Flye, Unicycler or Autocycler
dnaapler all -i assembly.gfa -o outputdirectorypath -p mybacterianame -t 8
```

If you have a macOS machine with Apple Silicon (M1/M2/M3/M4) and are having installation issues, please try:

```
conda create --platform osx-64 -n dnaapler_env dnaapler

conda activate dnaapler_env

dnaapler all -i inputmixedcontigs.fasta -o outputdirectorypath -p mybacterianame -t 8
```

Paper

Dnaapler has been published in JOSS here. If you use Dnaapler in your work, please cite it as follows:

```

George Bouras, Susanna R. Grigson, Bhavya Papudeshi, Vijini Mallawaarachchi, Michael J. Roach (2024). Dnaapler: A tool to reorient circular microbial genomes. Journal of Open Source Software, 9(93), 5968, https://doi.org/10.21105/joss.05968

```

Additionally, please consider citing the dependencies where relevant:

```
Altschul S.F., Gish W., Miller W., Myers E.W., Lipman D.J. Basic local alignment search tool. J Mol Biol. 1990 Oct 5;215(3):403-10. doi: 10.1016/S0022-2836(05)80360-2. PMID: 2231712.

Steinegger M, Söding J. MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nat Biotechnol. 2017 Nov;35(11):1026-1028. doi: 10.1038/nbt.3988.

Larralde, M., (2022). Pyrodigal: Python bindings and interface to Prodigal, an efficient method for gene prediction in prokaryotes. Journal of Open Source Software, 7(72), 4296, https://doi.org/10.21105/joss.04296.

Hyatt, D., Chen, GL., LoCascio, P.F. et al. Prodigal: prokaryotic gene recognition and translation initiation site identification. BMC Bioinformatics 11, 119 (2010). https://doi.org/10.1186/1471-2105-11-119.
```

v1 and other recent changes

1.3.0

Thanks @mbhall88 for extending the functionality of --ignore
If your input FASTA or GFA is mixed (e.g. has chromosome and plasmids), you can also use dnaapler all, with the option to ignore some contigs with the --ignore parameter. The --ignore parameter accepts one of:

- A file path containing contig names to ignore (one per line)
- A comma-separated list of contig names (e.g., chr1,chr2,chr3)
- A single dash (-) to read contig names from stdin (one per line)

1.2.0

Thanks to the one and only @rrwick, Dnaapler now supports the GFA format as input. This was done to ensure support for Ryan's new bacterial genome assembly tool Autocycler, the successor to Trycycler, but may also be useful if you have GFA files from e.g. Unicycler, Flye, SPAdes or other assemblers.
- If you run dnaapler with GFA input, you will get a GFA output as well.
- If you run dnaapler with GFA input, only circular contigs will be reoriented.
Relaxes the MMSeqs2 dependency to >=13.45111

1.1.0

Adds support for reorienting contigs where the gene of interest spans the contig ends - fixes this issue. Thanks @marade @oschwengers.
- Specifically, this is done by rotating each contig in the input by half the genome length, then running MMseqs2 for both the original and rotated contigs. The MMseqs2 hit with the highest bitscore across the original and rotated contigs will be chosen as the top hit to rotate by, therefore enabling detection of hits that span the contig ends (which appear only as partial hits on the original contig).
This has only been implemented for dnaapler all (this should be the command used by 99% of users).
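
The half-rotation idea described above can be sketched in a few lines of Python. This is an illustrative sketch only, not dnaapler's actual code: the function names and the (start, bitscore) hit tuples are assumptions made for the example, and positions are 0-based.

```python
def rotate(seq: str, offset: int) -> str:
    """Rotate a circular sequence so that index `offset` becomes index 0."""
    offset %= len(seq)
    return seq[offset:] + seq[:offset]


def to_original_coord(pos_in_rotated: int, offset: int, length: int) -> int:
    """Map a 0-based position on the rotated contig back to the original contig."""
    return (pos_in_rotated + offset) % length


def pick_best_hit(hits_original, hits_rotated, length):
    """Return the highest-bitscore hit across both searches, or None.

    Each hit is a hypothetical (start, bitscore) tuple. Hits found on the
    half-rotated contig are translated back to original coordinates first.
    """
    half = length // 2
    candidates = list(hits_original)
    candidates += [
        (to_original_coord(start, half, length), score)
        for start, score in hits_rotated
    ]
    return max(candidates, key=lambda hit: hit[1]) if candidates else None
```

For a 15 bp toy contig whose gene starts at index 9 and wraps around the end, the rotated copy (offset 7) contains the gene intact at index 2, and the full-length hit maps back to index 9 on the original.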

v1.0

BREAKING CHANGE - dnaapler now uses MMseqs2 v13.45111 rather than BLAST. You will need to install MMseqs2 if you upgrade (if you use conda, it should be handled for you). The CLI is identical.
There are 2 reasons for this:
1. Users reported problems installing BLAST on macOS with Apple Silicon (see e.g. here). MMseqs2 works on all platforms and is diligently maintained.
2. MMseqs2 is much faster than BLAST (what took BLAST a few minutes takes MMseqs2 seconds). We probably should have written dnaapler with MMseqs2 to begin with. MMseqs2 v13.45111 was chosen to ensure interoperability with pharokka.
The alignment results may not be identical to dnaapler v0.8.1 (i.e. they might find different top hits), but the actual reorientation is likely to be identical (at least in my tests). Please reach out or make an issue if you notice any discrepancies.

For example - on my machine (Ubuntu 20.04, Intel i9 13th gen 13900 CPU with 32 threads), for a Staphylococcus aureus genome with 1 small plasmid, dnaapler -i staph.fasta -o staph_dnaapler -t 8 took ~129 seconds wallclock with v0.8.1 using BLAST, while it took ~3 seconds wallclock with v1.0.0 using MMseqs2.

Google Colab Notebooks

If you don't want to install dnaapler locally, you can run dnaapler all without any code using the Google Colab notebook.

Table of Contents

- dnaapler
- 1.3.0
- 1.2.0
- 1.1.0
- v1.0
- Google Colab Notebooks
- Description
- Documentation
- Commands
- Installation
- Conda
- Pip
- Usage
- Example Usage
- Databases
- Motivation
- Contributing
- Acknowledgements

Description

Dnaapler Figure

dnaapler is a simple python program that takes a single nucleotide input sequence (in FASTA or GFA format), finds the desired start gene using MMseqs2 against an amino acid sequence database, checks that the start codon of this gene is found, and if so, then reorients the chromosome to begin with this gene on the forward strand.
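
As a rough illustration of the final reorientation step, the sketch below rotates a circular sequence so that it begins with a gene on the forward strand. It is not taken from dnaapler's source: the function names are hypothetical, coordinates are 0-based, and for a reverse-strand gene `gene_start` is assumed to be the index of the base that pairs with the first base of the start codon.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string (A/C/G/T only)."""
    return seq.translate(_COMPLEMENT)[::-1]


def reorient(seq: str, gene_start: int, strand: int) -> str:
    """Rotate a circular sequence to begin with the gene on the forward strand.

    strand is +1 or -1. For -1 the whole contig is reverse-complemented
    first, which moves original index i to index len(seq) - 1 - i.
    """
    if strand == 1:
        return seq[gene_start:] + seq[:gene_start]
    flipped = reverse_complement(seq)
    new_start = len(seq) - 1 - gene_start
    return flipped[new_start:] + flipped[:new_start]
```

With the toy contig GGGATGAAATTT, `reorient(seq, 3, 1)` yields ATGAAATTTGGG, and the same result is obtained from its reverse complement AAATTTCATCCC with `gene_start=8` and `strand=-1`.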

It was originally designed to replicate the reorientation functionality of Unicycler with dnaA, but for long-read first assembled chromosomes. We have extended it to work with plasmids (dnaapler plasmid) and phages (dnaapler phage), or for any input FASTA or GFA desired with dnaapler custom, dnaapler mystery or dnaapler nearest.

For bacterial chromosomes, dnaapler chromosome should ensure the chromosome breakpoint never interrupts genes or mobile genetic elements like prophages. It is intended to be used with good-quality completed bacterial genomes, generated with methods such as Autocycler, Dragonflye or my own pipeline hybracter.

Additionally, you can also reorient multiple bacterial chromosomes/plasmids/phages at once using the dnaapler bulk subcommand.

If your input FASTA or GFA is mixed (e.g. has chromosome and plasmids), you can also use dnaapler all, with the option to ignore some contigs with the --ignore parameter. The --ignore parameter accepts one of:

- A file path containing contig names to ignore (one per line)
- A comma-separated list of contig names (e.g., chr1,chr2,chr3)
- A single dash (-) to read contig names from stdin (one per line)
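
To make the three accepted forms of --ignore concrete, here is a minimal Python sketch of how such a value could be resolved into a set of contig names. The function `parse_ignore` and the order of the checks are assumptions for illustration; dnaapler's own implementation may differ.

```python
import os
import sys
from typing import Iterable, Set


def _clean(lines: Iterable[str]) -> Set[str]:
    """Strip whitespace and drop empty entries."""
    return {line.strip() for line in lines if line.strip()}


def parse_ignore(value: str, stdin: Iterable[str] = sys.stdin) -> Set[str]:
    """Resolve an --ignore style value into a set of contig names.

    '-'              -> read names from stdin, one per line
    existing file    -> read names from the file, one per line
    anything else    -> treat as a comma-separated list of names
    """
    if value == "-":
        return _clean(stdin)
    if os.path.isfile(value):
        with open(value) as handle:
            return _clean(handle)
    return _clean(value.split(","))
```

In this sketch a value is only treated as a file when a file of that name exists, so a single contig name that happens to match a local filename would be read as a file.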

As of v1, in practice, dnaapler all is the only command you will likely need, as it contains all the functionality of bulk, chromosome, plasmid and phage but with much more flexibility and user-friendliness.

When provided with a GFA file, dnaapler will process only circular sequences – those with a single circularising link and no additional links – while leaving all other sequences unchanged. The output format will match the input: FASTA input produces FASTA output, and GFA input produces GFA output.
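
The circularity rule above can be sketched as follows. This is an assumption-laden illustration rather than dnaapler's parser: it reads only GFA S (segment) and L (link) lines, and it interprets "a single circularising link" as exactly one link that joins a segment to itself in the same orientation.

```python
from collections import defaultdict
from typing import Set


def circular_segments(gfa_text: str) -> Set[str]:
    """Return names of segments whose only link is a same-orientation self-link."""
    segments = set()
    links = defaultdict(list)
    for line in gfa_text.splitlines():
        fields = line.split("\t")
        if fields[0] == "S" and len(fields) >= 2:
            segments.add(fields[1])
        elif fields[0] == "L" and len(fields) >= 5:
            src, src_orient, dst, dst_orient = fields[1:5]
            link = (src, src_orient, dst, dst_orient)
            links[src].append(link)
            if dst != src:
                # A link between two segments counts against both of them.
                links[dst].append(link)
    circular = set()
    for name in segments:
        seg_links = links.get(name, [])
        if len(seg_links) == 1:
            src, src_orient, dst, dst_orient = seg_links[0]
            if src == dst and src_orient == dst_orient:
                circular.add(name)
    return circular
```

Segments with no links, or with any link to another segment, are left out of the returned set and would therefore be passed through unchanged.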

Documentation

The full documentation for dnaapler can be found here.

Commands

dnaapler all: Reorients 1 or more contigs to begin with any of dnaA, terL, repA or COG1474.
- Practically, this should be the most useful command for most users.
dnaapler chromosome: Reorients your sequence to begin with the dnaA chromosomal replication initiator gene
dnaapler plasmid: Reorients your sequence to begin with the repA plasmid replication initiation gene
dnaapler phage: Reorients your sequence to begin with the terL large terminase subunit gene
dnaapler archaea: Reorients your sequence to begin with the COG1474 archaeal Orc1/cdc6 gene.
dnaapler custom: Reorients your sequence to begin with a custom amino acid FASTA format gene that you specify
dnaapler mystery: Reorients your sequence to begin with a random CDS
dnaapler largest: Reorients your sequence to begin with the largest CDS
dnaapler nearest: Reorients your sequence to begin with the first CDS (nearest to the start). Designed for fixing sequences where a CDS spans the breakpoint.
dnaapler bulk: Reorients multiple contigs to begin with the desired start gene - either dnaA, terL, repA or a custom gene.
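
The difference between mystery, largest and nearest comes down to which predicted CDS is chosen as the new start. The sketch below shows that selection over a hypothetical list of CDS records; the `CDS` type and `pick_cds` function are invented for illustration, and gene prediction itself (done with pyrodigal in dnaapler) is not shown. The default seed of 13 mirrors the --seed_value default.

```python
import random
from typing import List, NamedTuple


class CDS(NamedTuple):
    start: int   # 0-based start coordinate on the contig
    end: int     # end coordinate (exclusive)
    strand: int  # +1 or -1


def pick_cds(cds_list: List[CDS], mode: str, seed: int = 13) -> CDS:
    """Choose the CDS to reorient by, according to the selection mode."""
    if not cds_list:
        raise ValueError("no CDS predicted")
    if mode == "largest":
        return max(cds_list, key=lambda c: c.end - c.start)
    if mode == "nearest":
        return min(cds_list, key=lambda c: c.start)
    if mode == "mystery":
        # A seeded generator makes the random pick reproducible.
        return random.Random(seed).choice(cds_list)
    raise ValueError(f"unknown mode: {mode}")
```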

Installation

dnaapler requires only MMseqs2 v13.45111 as an external dependency.

Installation from conda is highly recommended as this will install MMseqs2 automatically.

Conda

dnaapler is available on bioconda.

conda install -c bioconda dnaapler

Pip

You can also install dnaapler with pip.

pip install dnaapler

If you install dnaapler with pip, you will then need to install MMseqs2 v13.45111 separately. It will need to be available in the $PATH or else dnaapler will not work.
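
After a pip install, a quick way to confirm that MMseqs2 is discoverable is to look up its executable on the $PATH. The helper below is a small sketch with a hypothetical function name; it assumes the MMseqs2 executable is called mmseqs and does not check the installed version.

```python
import shutil


def mmseqs_available(executable: str = "mmseqs") -> bool:
    """Return True if the given executable can be found on the PATH."""
    return shutil.which(executable) is not None


if __name__ == "__main__":
    if mmseqs_available():
        print("MMseqs2 found on PATH")
    else:
        print("MMseqs2 not found: install it before running dnaapler")
```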

Usage

```
Usage: dnaapler [OPTIONS] COMMAND [ARGS]...

Options:
  -h, --help     Show this message and exit.
  -V, --version  Show the version and exit.

Commands:
  all         Reorients contigs to begin with any of dnaA, repA...
  archaea     Reorients your genome to begin with the archaeal COG1474...
  bulk        Reorients multiple genomes to begin with the same gene
  chromosome  Reorients your genome to begin with the dnaA chromosomal...
  citation    Print the citation(s) for this tool
  custom      Reorients your genome with a custom database
  largest     Reorients your genome the begin with the largest CDS as...
  mystery     Reorients your genome with a random CDS
  nearest     Reorients your genome the begin with the first CDS as...
  phage       Reorients your genome to begin with the terL large...
  plasmid     Reorients your genome to begin with the repA replication...
```

```
Usage: dnaapler all [OPTIONS]

  Reorients contigs to begin with any of dnaA, repA, terL or archaeal COG1474
  Orc1/cdc6

Options:
  -h, --help               Show this message and exit.
  -V, --version            Show the version and exit.
  -i, --input PATH         Path to input file in FASTA or GFA format
                           [required]
  -o, --output PATH        Output directory  [default: output.dnaapler]
  -t, --threads INTEGER    Number of threads to use with MMseqs2  [default: 1]
  -p, --prefix TEXT        Prefix for output files  [default: dnaapler]
  -f, --force              Force overwrites the output directory
  -e, --evalue TEXT        e value for MMseqs2  [default: 1e-10]
  --ignore TEXT            Text file listing contigs (one per row) that are
                           to be ignored OR comma separated list of contig
                           names to ignore OR '-' to read from stdin
  -a, --autocomplete TEXT  Choose an option to autocomplete reorientation if
                           MMseqs2 based approach fails. Must be one of:
                           none, mystery, largest, or nearest  [default: none]
  --seed_value INTEGER     Random seed to ensure reproducibility.  [default: 13]
```

The reoriented output will be {prefix}_reoriented.fasta in the specified output directory. If the input file was in GFA format, then the output will be named {prefix}_reoriented.gfa.
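
For scripting around dnaapler, the expected output location can be derived from the naming rule above. The helper below is a hypothetical convenience function, and it guesses the input format from the file extension, which is an assumption: dnaapler itself may detect the format differently.

```python
from pathlib import Path


def reoriented_path(input_file: str, out_dir: str = "output.dnaapler",
                    prefix: str = "dnaapler") -> Path:
    """Build the expected {prefix}_reoriented.fasta/.gfa path in out_dir."""
    is_gfa = Path(input_file).suffix.lower() == ".gfa"
    suffix = ".gfa" if is_gfa else ".fasta"
    return Path(out_dir) / f"{prefix}_reoriented{suffix}"
```

The defaults for `out_dir` and `prefix` mirror the defaults shown in the help text for -o and -p.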

Example Usage

For more detailed example usage, please see the examples section of the documentation.

dnaapler all -i input.fasta -o output_directory_path -p my_genome_name --ignore list_of_contigs_to_ignore.txt

dnaapler all -i input.fasta -o output_directory_path -p my_genome_name --ignore chr1,chr2,chr3

echo -e "chr1\nchr2\nchr3" | dnaapler all -i input.fasta -o output_directory_path -p my_genome_name --ignore -

dnaapler chromosome -i input.fasta -o output_directory_path -p my_bacteria_name -t 8

dnaapler phage -i input.fasta -o output_directory_path -p my_phage_name -t 8

dnaapler plasmid -i input.fasta -o output_directory_path -p my_plasmid_name -t 8

dnaapler archaea -i input.fasta -o output_directory_path -p my_archaea_name -t 8

dnaapler custom -i input.fasta -o output_directory_path -p my_genome_name -t 8 -c my_custom_database_file

dnaapler mystery -i input.fasta -o output_directory_path -p my_genome_name

dnaapler nearest -i input.fasta -o output_directory_path -p my_genome_name

dnaapler largest -i input.fasta -o output_directory_path -p my_genome_name

```
# to reorient multiple bacterial chromosomes
dnaapler bulk -i inputfilewithmultiplechromosomes.fasta -m chromosome -o outputdirectorypath -p mygenomename
```

Databases

dnaapler chromosome uses 584 proteins downloaded from Swissprot with the query "Chromosomal replication initiator protein DnaA" on 24 May 2023 as its database for dnaA. All hits from the query were also filtered to ensure "GN=dnaA" was included in the header of the FASTA entry.

dnaapler plasmid uses the repA database curated by Ryan Wick in Unicycler.

dnaapler phage uses a terL database curated using PHROGs. All the AA sequences of the 55 phrogs annotated as 'large terminase subunit' were downloaded, combined and deduplicated using seqkit: seqkit rmdup -s -o terL.faa phrog_terL.faa.
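
The effect of deduplicating by sequence (rather than by name) can be illustrated with a short Python equivalent. This is only a sketch of the idea behind seqkit rmdup -s, operating on in-memory (name, sequence) pairs; the function name is hypothetical and FASTA parsing is left out.

```python
from typing import Iterable, List, Tuple


def rmdup_by_sequence(records: Iterable[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Keep the first record for each distinct sequence, preserving input order."""
    seen = set()
    unique = []
    for name, seq in records:
        if seq not in seen:
            seen.add(seq)
            unique.append((name, seq))
    return unique
```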

dnaapler archaea uses a database of 403 archaeal COG1474 Orc1/cdc6 genes curated from here.

dnaapler all uses all four databases combined into one.

dnaapler custom uses a custom amino acid FASTA format file that you specify using -c.

The matching is strict - it requires a strong MMseqs2 match (default e-value 1E-10), and the first amino acid of the MMseqs2 hit gene must be Methionine, Valine or Leucine, the amino acids encoded by the 3 most used start codons in bacteria/phages.
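
The two conditions can be written as a single predicate. The sketch below is illustrative only: the function name is hypothetical, and whether the e-value comparison is inclusive is an assumption.

```python
VALID_START_RESIDUES = {"M", "V", "L"}  # Methionine, Valine, Leucine


def passes_strict_match(evalue: float, first_residue: str,
                        max_evalue: float = 1e-10) -> bool:
    """Accept a hit only if it is strong enough and begins with M, V or L."""
    strong_enough = evalue <= max_evalue
    valid_start = first_residue.upper() in VALID_START_RESIDUES
    return strong_enough and valid_start
```

A hit with e-value 1e-50 beginning with M passes, while the same hit beginning with A, or an M-initial hit with e-value 1e-5, does not.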

For the most commonly studied microbes (ESKAPE pathogens, etc), the dnaA database should suffice.

If you try dnaapler on a more novel or under-studied microbe with a dnaA gene that has little sequence similarity to the database, you may need to provide your own dnaA gene(s) in amino acid FASTA format using dnaapler custom.

After this issue, dnaapler mystery was added. It predicts all ORFs in the input using pyrodigal, then picks a random gene to re-orient your sequence with.

Motivation

I couldn't get Circlator to work and it is no longer supported.
berokka doesn't orient chromosomes to begin with dnaA.
After reading Ryan Wick's masterful bacterial genome assembly tutorial, I realised that it is probably optimal to run 2 polishing steps, once before and once after rotating the chromosome, to ensure the breakpoint is polished. Further, for some "complete" long read bacterial assemblies that didn't circularise properly, I figured that as long as you have a complete assembly (even if not marked as "circular" by Flye), polishing after a re-orientation would be likely to circularise the chromosome. A bit like Ryan's rotatecirculargfa.py script, without the requirement of strict circularity.
While researching MGEs in S. aureus whole genome sequences, I repeatedly found instances where MGEs were interrupted by the chromosome breakpoint. So I thought I'd add a tool to automate it in my pipeline.
It's probably good to have all your sequences start at the same location for synteny analyses.

Contributing

If you would like to help improve dnaapler you are very welcome!

For changes to be accepted, they must pass the CI checks.

Please see CONTRIBUTING.md for more details.

Acknowledgements

Thanks to Torsten Seemann, Ryan Wick and the Circlator team for their existing work in the space. Also to Michael Hall, whose repository tbpore we took and adapted a lot of scaffolding code from because he writes really nice code.

Owner

Name: George Bouras
Login: gbouras13
Kind: user
Location: Adelaide, SA

Twitter: GB13Faithless
Repositories: 8
Profile: https://github.com/gbouras13

Bioinformatics at the Basil Hetzel Institute, University of Adelaide. Phages, microbes and more. george.bouras@adelaide.edu.au

JOSS Publication

Dnaapler: A tool to reorient circular microbial genomes

Published

January 11, 2024

DOI

10.21105/joss.05968

Volume 9, Issue 93, Page 5968

Authors

George Bouras

Adelaide Medical School, Faculty of Health and Medical Sciences, The University of Adelaide, Adelaide, South Australia 5005, Australia, The Department of Surgery – Otolaryngology Head and Neck Surgery, Central Adelaide Local Health Network, Adelaide, South Australia 5000, Australia

Susanna R. Grigson

Flinders Accelerator for Microbiome Exploration, College of Science and Engineering, Flinders University, Bedford Park, Adelaide, South Australia 5042, Australia

Bhavya Papudeshi

Flinders Accelerator for Microbiome Exploration, College of Science and Engineering, Flinders University, Bedford Park, Adelaide, South Australia 5042, Australia

Vijini Mallawaarachchi

Flinders Accelerator for Microbiome Exploration, College of Science and Engineering, Flinders University, Bedford Park, Adelaide, South Australia 5042, Australia

Michael J. Roach

Flinders Accelerator for Microbiome Exploration, College of Science and Engineering, Flinders University, Bedford Park, Adelaide, South Australia 5042, Australia, Adelaide Centre for Epigenetics and South Australian Immunogenomics Cancer Institute, The University of Adelaide, Adelaide, South Australia 5005, Australia

Editor

Frederick Boehm

GitHub Events

Total

Create event: 8
Issues event: 6
Release event: 5
Watch event: 25
Delete event: 4
Issue comment event: 28
Push event: 39
Pull request review event: 1
Pull request event: 24
Fork event: 3

Last Year

Create event: 8
Issues event: 6
Release event: 5
Watch event: 25
Delete event: 4
Issue comment event: 28
Push event: 39
Pull request review event: 1
Pull request event: 24
Fork event: 3

Committers

Last synced: 12 months ago

All Time

Total Commits: 253
Total Committers: 6
Avg Commits per committer: 42.167
Development Distribution Score (DDS): 0.091

Past Year

Commits: 52
Committers: 3
Avg Commits per committer: 17.333
Development Distribution Score (DDS): 0.288

Top Committers

Name	Email	Commits
gbouras13	g**s@a**u	230
Ryan Wick	r**k@g**m	14
Vijini Mallawaarachchi	v**i@g**m	5
Sam Nooij	s****j	2
Michael Roach	b**e@g**m	1
Antônio Camargo	a****o	1

Committer Domains (Top 20 + Academic)

adelaide.edu.au: 1

Issues and Pull Requests

Last synced: 11 months ago

All Time

Total issues: 31
Total pull requests: 71
Average time to close issues: 18 days
Average time to close pull requests: about 9 hours
Total issue authors: 21
Total pull request authors: 7
Average comments per issue: 3.32
Average comments per pull request: 0.37
Merged pull requests: 70
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 7
Pull requests: 16
Average time to close issues: 18 days
Average time to close pull requests: 1 day
Issue authors: 7
Pull request authors: 4
Average comments per issue: 2.86
Average comments per pull request: 0.69
Merged pull requests: 15
Bot issues: 0
Bot pull requests: 0

View more stats

Top Authors

Issue Authors

mkerin (4)
gbouras13 (3)
schorlton-bugseq (2)
vinisalazar (2)
MostafaYA (2)
erinyoung (2)
jsgounot (1)
jchorl (1)
samnooij (1)
oschwengers (1)
alexweisberg (1)
KR0manova (1)
MonicaSteffi (1)
rpetit3 (1)
marade (1)

Pull Request Authors

gbouras13 (70)
rrwick (4)
Vini2 (2)
apcamargo (2)
mbhall88 (1)
samnooij (1)
beardymcjohnface (1)

Top Labels

Issue Labels

enhancement (7) question (2) bug (1)

Pull Request Labels

Packages

Total packages: 1
Total downloads:
- pypi 230 last-month

Total dependent packages: 0
Total dependent repositories: 0
Total versions: 18
Total maintainers: 1

pypi.org: dnaapler

Reorients assembled microbial sequences

Homepage: https://github.com/gbouras13/dnaapler
Documentation: https://dnaapler.readthedocs.io/
License: MIT
Latest release: 1.3.0
published 11 months ago

Versions: 18
Dependent Packages: 0
Dependent Repositories: 0
Downloads: 230 Last month

Rankings

Dependent packages count: 7.3%

Stargazers count: 15.4%

Average: 23.6%

Forks count: 30.5%

Dependent repos count: 41.4%

Maintainers (1)

gbouras13

Last synced: 11 months ago

Dependencies

.github/workflows/ci.yaml actions

actions/checkout v3 composite
codecov/codecov-action v3 composite
conda-incubator/setup-miniconda v2 composite

.github/workflows/release.yaml actions

actions/checkout v2 composite
conda-incubator/setup-miniconda v2 composite
pypa/gh-action-pypi-publish master composite

build/environment.yaml pypi

poetry.lock pypi

biopython 1.81
black 23.3.0
click 8.1.3
colorama 0.4.6
coverage 7.2.6
exceptiongroup 1.1.1
flake8 5.0.4
iniconfig 2.0.0
isort 5.12.0
loguru 0.7.0
mccabe 0.7.0
mypy-extensions 1.0.0
numpy 1.24.3
packaging 23.1
pandas 2.0.2
pathspec 0.11.1
platformdirs 3.5.1
pluggy 1.0.0
pycodestyle 2.9.1
pyflakes 2.5.0
pyrodigal 2.1.0
pytest 7.3.1
pytest-cov 4.1.0
python-dateutil 2.8.2
pytz 2023.3
pyyaml 6.0
six 1.16.0
tomli 2.0.1
typing-extensions 4.6.2
tzdata 2023.3
win32-setctime 1.1.0

pyproject.toml pypi

black >=22.3.0 develop
flake8 >=3.0.1 develop
isort >=5.10.1 develop
pytest >=6.2.5 develop
pytest-cov >=3.0.0 develop
biopython >=1.76
click >=8.0.0
loguru >=0.5.3
pandas >=1.4.2
pyrodigal >=2.0.0
python >=3.8,<4.0
pyyaml >=6.0

requirements.txt pypi

Click >=8.0.0
biopython >=1.76
loguru >=0.5.3
pandas >=1.4.1
pyrodigal >=2.0.0
pytest >=6.2.5
pytest-cov >=3.0.0
pytest-runner >=5.0.0
pyyaml >=6.0
