Corekaburra

Corekaburra: pan-genome post-processing using core gene synteny - Published in JOSS (2022)

https://github.com/milnus/corekaburra

Science Score: 93.0%

This score indicates how likely this project is to be science-related based on various indicators:

○ CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
✓ .zenodo.json file: Found .zenodo.json file
✓ DOI references: Found 4 DOI reference(s) in README and JOSS metadata
✓ Academic publication links: Links to: plos.org
○ Committers with academic emails
○ Institutional organization owner
✓ JOSS paper metadata: Published in Journal of Open Source Software

Keywords from Contributors

annotations

Scientific Fields

Sociology Social Sciences - 87% confidence

Biology Life Sciences - 63% confidence

Last synced: 6 months ago

Repository

Program to find core gene consensus synteny, hotspots between these, and construct spatially aware multiple sequence alignments of core genes

Basic Info

Host: GitHub
Owner: milnus
License: mit
Language: Python
Default Branch: main
Size: 2.52 MB

Statistics

Stars: 5
Watchers: 1
Forks: 4
Open Issues: 2
Releases: 5

Created over 5 years ago · Last pushed over 1 year ago

Metadata Files

Readme Contributing License Code of conduct

Overview

Corekaburra looks at the gene synteny across genomes used to build a pan-genome. Using syntenic information Corekaburra identifies regions between core genes. Regions are described in terms of their content of accessory genes and number of nucleotides between core genes. Information from neighboring core genes is further used to identify stretches of core gene clusters that appear in all genomes given as input. Corekaburra is compatible with outputs from 'standard' pan-genome pipelines: Roary and Panaroo, and can be extended to others if desired.

Why and When to use Corekaburra

Corekaburra fits into existing bioinformatics pipelines for pan-genomes as a downstream analysis tool. It does not reinvent the pan-genome pipeline, but builds on existing ones. Corekaburra is therefore a natural extension to pan-genome analysis: it summarises information and infers relationships in the pan-genome that are otherwise not easily accessible via pan-genome graphs. Other tools provide similar outputs or information, but within their own stand-alone pan-genome analysis framework or pipeline. Examples of such frameworks/pipelines are PPanGGOLiN and Panakeia. By building on top of existing tools, Corekaburra frees users from cross-referencing between pan-genomes, which is itself a challenging task. Corekaburra's workflow also allows it to be extended to any pan-genome tool with an output similar to the gene_presence_absence.csv produced by Roary, making Corekaburra versatile for future pan-genome pipelines.

Installation

Corekaburra is written in Python 3.9, and can be installed via pip and conda. A Docker container is also available.

Conda install

conda install -c bioconda -c conda-forge corekaburra

pip

pip install corekaburra

Docker

Pull from DockerHub or see the Wiki for more information

Help

```
usage: Corekaburra -ig file.gff [file.gff ...] -ip path/to/pan_genome
                   [-cg complete_genomes.txt] [-cc 1.0] [-lc 0.05]
                   [-o path/to/output] [-p OUTPUT_PREFIX] [-c int]
                   [-l | -q] [-h] [-v]

Welcome to Corekaburra! An extension to pan-genome analyses that summarise
genomic regions between core genes and segments of neighbouring core genes
using gene synteny from a set of input genomes and a pan-genome folder.

Required arguments:
  -ig file.gff [file.gff ...], --input_gffs file.gff [file.gff ...]
                        Path to gff files used for pan-genome
  -ip path/to/pan_genome, --input_pangenome path/to/pan_genome
                        Path to the folder produced by Panaroo or Roary

Analysis modifiers:
  -cg complete_genomes.txt, --complete_genomes complete_genomes.txt
                        text file containing names of genomes that are to be
                        handled as complete genomes
  -cc 1.0, --core_cutoff 1.0
                        Percentage of isolates in which a core gene must be
                        present [default: 1.0]
  -lc 0.05, --low_cutoff 0.05
                        Percentage of isolates where genes found in less than
                        these are seen as low-frequency genes [default: 0.05]

Output control:
  -o path/to/output, --output path/to/output
                        Path to where output files will be placed [default:
                        current folder]
  -p OUTPUT_PREFIX, --prefix OUTPUT_PREFIX
                        Prefix for output files, if any is desired

Other arguments:
  -c int, --cpu int     Give max number of CPUs [default: 1]
  -l, --log             Record program progress in for debugging purpose
  -q, --quiet           Only print warnings
  -h, --help            Show help function
  -v, --version         show program's version number and exit
```
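For scripted runs, the options in the help text can be assembled programmatically. The following is a minimal sketch that uses only the short flags listed in the help text; the function name `build_corekaburra_command` and the example file and folder names are hypothetical, and launching the command assumes Corekaburra is installed and on the PATH.

```python
def build_corekaburra_command(gffs, pangenome_dir, output_dir=None,
                              complete_genomes=None, core_cutoff=None,
                              low_cutoff=None, prefix=None, cpus=None):
    """Assemble a Corekaburra argument list from the documented short flags."""
    if not gffs:
        raise ValueError("at least one Gff file is required (-ig)")
    cmd = ["Corekaburra", "-ig", *map(str, gffs), "-ip", str(pangenome_dir)]
    optional = [("-cg", complete_genomes), ("-cc", core_cutoff),
                ("-lc", low_cutoff), ("-o", output_dir),
                ("-p", prefix), ("-c", cpus)]
    for flag, value in optional:
        if value is not None:  # leave unset options to the tool's defaults
            cmd += [flag, str(value)]
    return cmd


# Hypothetical inputs, for illustration only.
cmd = build_corekaburra_command(
    ["genome_a.gff", "genome_b.gff"], "panaroo_output",
    output_dir="corekaburra_output", core_cutoff=0.99, cpus=4)
print(" ".join(cmd))

# To launch the run:
#   import subprocess
#   subprocess.run(cmd, check=True)
```

Passing the command as a list rather than a single string avoids shell quoting problems with paths that contain spaces.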

Example workflow

We have made an example workflow using three Streptococcus pyogenes genomes, Panaroo and Corekaburra.

Inputs

Gff files

Input Gff files must be included in the pan-genome gene_presence_absence.csv-style file.
The Gffs are also required to contain a ##FASTA line, dividing the file into annotations at the top and the genome sequence at the bottom of the file.
All coding sequences (CDS) annotated in the GFF must also carry an ID and a locus_tag.
Input Gff files can be in gzipped format, if desired.
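The requirements above can be checked before a run. The sketch below is a hypothetical pre-flight check, not part of Corekaburra: the helper names `open_gff` and `check_gff_lines` are made up for illustration, and the check assumes the standard nine-column, tab-separated Gff layout with `key=value` attributes separated by semicolons.

```python
import gzip


def open_gff(path):
    """Open a plain or gzipped Gff file in text mode."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "r")


def check_gff_lines(lines):
    """Return a list of problems found in an iterable of Gff lines."""
    problems = []
    has_fasta = False
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        if line.startswith("##FASTA"):
            has_fasta = True
            break  # everything below this line is sequence, not annotation
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) != 9 or fields[2] != "CDS":
            continue  # only CDS features are checked here
        keys = {item.split("=", 1)[0]
                for item in fields[8].split(";") if "=" in item}
        for required in ("ID", "locus_tag"):
            if required not in keys:
                problems.append(f"line {number}: CDS lacks {required}")
    if not has_fasta:
        problems.append("no ##FASTA line found")
    return problems


# Usage with a file on disk (hypothetical path):
#   with open_gff("genome_a.gff.gz") as handle:
#       print(check_gff_lines(handle))
```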

Pan-genome folder

This is the output folder from a Roary or Panaroo run. If you are building the input from another pan-genome tool, the folder must at minimum contain a gene_presence_absence.csv file with every field quoted, as in the gene_presence_absence.csv produced by Roary.
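When converting the output of another pan-genome tool, the quoting requirement can be met with Python's `csv` module. The sketch below only demonstrates quoting every field; the three-column header and the locus tags are toy values, not Roary's full column layout, which a real file would need to mirror.

```python
import csv
import io


def write_quoted_csv(handle, header, rows):
    """Write rows with every field wrapped in double quotes."""
    writer = csv.writer(handle, quoting=csv.QUOTE_ALL, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)


# Toy content for illustration only; a real file needs Roary's columns.
header = ["Gene", "genome_a", "genome_b"]
rows = [["geneA", "a_00001", "b_00001"],
        ["geneB", "a_00002", ""]]  # empty field: gene absent in genome_b

buffer = io.StringIO()
write_quoted_csv(buffer, header, rows)
print(buffer.getvalue())
```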

Complete genomes

If some input Gffs are to be processed as complete or closed genomes, a plain text file listing their filenames can be provided.
Example:
complete_genome.gff
complete_genome.gff.gz
/paths/are/allowed/complete_genome.gff
complete_genome

All files listed as complete genomes in the plain text file must be found in the given gene presence/absence file, but they are not required to be among the input Gffs. A single plain text file of complete genomes can therefore be reused when analysing subsets of genomes in the pan-genome.
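The accepted spellings in the example all refer to the same genome. One way such entries could be reduced to a common name is sketched below; this illustrates the idea and is not Corekaburra's internal implementation, and the helper names are hypothetical.

```python
import os


def genome_name(entry):
    """Reduce a complete-genome entry to its bare genome name."""
    name = os.path.basename(entry.strip())  # drop any leading path
    if name.endswith(".gz"):
        name = name[:-3]
    if name.endswith(".gff"):
        name = name[:-4]
    return name


def read_complete_genomes(lines):
    """Collect the genome names from the lines of a complete-genomes file."""
    return {genome_name(line) for line in lines if line.strip()}


entries = [
    "complete_genome.gff",
    "complete_genome.gff.gz",
    "/paths/are/allowed/complete_genome.gff",
    "complete_genome",
]
print(read_complete_genomes(entries))
```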

Adjusting cutoffs

To comply with common practice when handling pan-genomes, the cutoff at which a pan-genome cluster (gene) is considered core can be changed using the -cc argument, given as the ratio of genomes in which the gene must be present. By default, this is set to a conservative 100% presence.
A second argument, -lc, divides accessory genes into two groups (low frequency and intermediate frequency). Its ratio sets the maximum presence at which a gene cluster is considered low frequency in the pan-genome. The division into low and intermediate frequency can be disabled with -lc 0, in which case all accessory genes are considered intermediate.
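The effect of the two cutoffs can be sketched as a small classification function. The boundary handling is an assumption drawn from the help text: a cluster is treated as core when its presence is at least the -cc value, and as low frequency when its presence is strictly less than the -lc value. The function name and labels are hypothetical.

```python
def classify_cluster(n_present, n_genomes, core_cutoff=1.0, low_cutoff=0.05):
    """Classify a gene cluster by the fraction of genomes that carry it."""
    if n_genomes <= 0:
        raise ValueError("n_genomes must be positive")
    presence = n_present / n_genomes
    if presence >= core_cutoff:   # assumed: 'at least' the core cutoff
        return "core"
    if presence < low_cutoff:     # assumed: 'less than' the low cutoff
        return "low-frequency"
    return "intermediate-frequency"


print(classify_cluster(100, 100))                  # present in every genome
print(classify_cluster(99, 100))                   # misses the default 1.0 cutoff
print(classify_cluster(99, 100, core_cutoff=0.99)) # relaxed core cutoff
print(classify_cluster(4, 100))                    # below the default 0.05
print(classify_cluster(1, 100, low_cutoff=0))      # low-frequency group disabled
```

With `low_cutoff=0` no presence can fall below the cutoff, so every accessory cluster lands in the intermediate group, matching the behaviour described for -lc 0.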

Outputs

Corekaburra outputs multiple files, ranging from summaries to more fine-grained outputs. This is aimed at giving the user easy access to information, while still allowing for tailored or deep exploration. See the wiki for a description of the outputs and how to query them.

Core regions

A core region is defined by two core genes flanking a stretch of the genome in at least one input genome. A core region can be described by the distance between the flanking core genes (positive if nucleotides are found between them, negative if the two genes overlap). A region can also be described by the number of accessory genes encoded in it. Using core gene clusters as reference points, the same region, or its presence, can be compared across genomes. Additionally, with the distance, the number of encoded accessory genes, or both, it is possible to identify regions of variability caused by horizontal genetic transfer, deletion or other genomic processes.
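The two descriptors of a core region can be illustrated on a single contig. The sketch below is a simplified model, not Corekaburra's implementation: it assumes 1-based inclusive Gff coordinates, genes sorted by start position, and a hypothetical input of `(name, start, end, is_core)` tuples.

```python
def core_regions(genes):
    """Describe the regions between consecutive core genes on one contig.

    genes: list of (name, start, end, is_core), sorted by start position.
    """
    regions = []
    previous_core = None      # (name, end) of the last core gene seen
    accessory_between = 0     # accessory genes since the last core gene
    for name, start, end, is_core in genes:
        if not is_core:
            accessory_between += 1
            continue
        if previous_core is not None:
            prev_name, prev_end = previous_core
            regions.append({
                "pair": (prev_name, name),
                # nucleotides strictly between the genes; negative on overlap
                "distance": start - prev_end - 1,
                "accessory_genes": accessory_between,
            })
        previous_core = (name, end)
        accessory_between = 0
    return regions


contig = [
    ("coreA", 1, 100, True),
    ("acc1", 150, 300, False),
    ("coreB", 351, 500, True),
    ("coreC", 497, 600, True),   # overlaps coreB by four nucleotides
]
for region in core_regions(contig):
    print(region)
```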

core_pair_summary.csv summarises the core regions identified across the input genomes (Gff files). It reports the occurrence and co-occurrence of each core gene pair, as well as the occurrence of each individual core gene. Distance and accessory gene counts are summarised for each core pair with the minimum, maximum, mean, and median.
This file is a good entry point to the results in most analyses, and should give a good indication of which core regions could be of interest.
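The four summary statistics can be reproduced for any core pair from its per-genome values. This is a minimal sketch with made-up distances; the function name and the dictionary keys are hypothetical and do not reflect the column names of core_pair_summary.csv.

```python
from statistics import mean, median


def summarise_core_pair(values):
    """Summarise per-genome values (distances or accessory counts) for one pair."""
    return {
        "min": min(values),
        "max": max(values),
        "mean": mean(values),
        "median": median(values),
    }


# Made-up distances for one core pair observed in three genomes.
distances = [250, 300, 1250]
print(summarise_core_pair(distances))
```

A large gap between the minimum and maximum, as in this toy example, is the kind of signal that marks a core region as variable.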

core_core_accessory_gene_content.tsv gives the placement of each accessory gene found in a core region across all genomes (Gff). Frequency of accessory genes (low- or intermediate frequency) is given.

low_frequency_gene_placement.tsv summarises each core region across all genomes (Gff) with the distance between core gene clusters, and the number of accessory genes found in the region.

Core segments

The two following files are only given if any core gene is found to have more than two different core genes as neighbours across all input genomes (Gff), meaning there is structural heterogeneity across the genomes.

The file core_segments.csv contains all segments of at least two core genes identified in a pan-genome. The start and end of a segment are defined by core gene clusters with more than two neighbours, meaning they could be a breakpoint of a genomic inversion in at least one input genome (Gff), or a misassembly.
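The idea of a core gene with more than two neighbours can be sketched by collecting, for each core gene, every distinct core gene found next to it in any genome. The sketch treats each genome as one linear contig and ignores the wrap-around of complete circular genomes, which is a simplifying assumption; the function name is hypothetical.

```python
from collections import defaultdict


def multi_neighbour_core_genes(genome_orders):
    """Return core genes that have more than two distinct core neighbours.

    genome_orders: one list of core gene names per genome, in genomic order.
    """
    neighbours = defaultdict(set)
    for order in genome_orders:
        for left, right in zip(order, order[1:]):
            neighbours[left].add(right)
            neighbours[right].add(left)
    return {gene for gene, found in neighbours.items() if len(found) > 2}


orders = [
    ["A", "B", "C", "D", "E"],
    ["A", "B", "D", "C", "E"],   # C and D are inverted relative to genome 1
]
print(sorted(multi_neighbour_core_genes(orders)))
```

In this toy input, B, C and D each gain a third neighbour through the inversion, while A and E keep two or fewer.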

no_accessory_core_segments.csv divides the segments identified in core_segments.csv into potential smaller segments where core genes must form regions with no accessory genes between them across all genomes. These segments could indicate potential operon structures or other stable genomic features that could be disturbed by insertion of accessory genes.
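Splitting a segment wherever accessory genes occur can be sketched as follows. The input format is hypothetical: `max_accessory` maps each unordered core pair to the largest number of accessory genes seen between that pair in any genome. Keeping only sub-segments of at least two core genes is an assumption carried over from the definition of a segment.

```python
def split_on_accessory(segment, max_accessory):
    """Split a core segment at every pair with accessory genes in any genome.

    segment: ordered list of core gene names.
    max_accessory: {frozenset({a, b}): max accessory count across genomes}.
    """
    if not segment:
        return []
    sub_segments = []
    current = [segment[0]]
    for left, right in zip(segment, segment[1:]):
        if max_accessory.get(frozenset((left, right)), 0) == 0:
            current.append(right)          # still accessory-free, extend
        else:
            if len(current) >= 2:          # assumed minimum of two core genes
                sub_segments.append(current)
            current = [right]              # start a new sub-segment
    if len(current) >= 2:
        sub_segments.append(current)
    return sub_segments


segment = ["A", "B", "C", "D", "E"]
counts = {
    frozenset(("A", "B")): 0,
    frozenset(("B", "C")): 2,   # accessory genes found here in some genome
    frozenset(("C", "D")): 0,
    frozenset(("D", "E")): 0,
}
print(split_on_accessory(segment, counts))
```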

Core-less contigs

coreless_contig_accessory_gene_content.tsv gives all contigs identified in genomes (Gff) that do not contain a core gene cluster, but only accessory genes. Each contig is given by contig name, its Gff file, and number of low- and intermediate frequency genes found on the contig.
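The content of this file can be pictured with a small sketch that filters contigs by the classes of the gene clusters they carry. The input structures, class labels and function name are hypothetical and chosen for illustration; they are not Corekaburra's internal data model.

```python
def coreless_contigs(contigs, cluster_class):
    """List contigs without core genes, with their accessory gene counts.

    contigs: {(gff_name, contig_name): [cluster names on that contig]}.
    cluster_class: {cluster name: "core", "low" or "intermediate"}.
    Returns (contig, gff, n_low, n_intermediate) tuples.
    """
    rows = []
    for (gff, contig), clusters in contigs.items():
        classes = [cluster_class[cluster] for cluster in clusters]
        if "core" in classes:
            continue  # contig holds at least one core gene, so skip it
        rows.append((contig, gff,
                     classes.count("low"), classes.count("intermediate")))
    return rows


contigs = {
    ("genome_a.gff", "contig_1"): ["c1", "a1"],
    ("genome_a.gff", "contig_2"): ["a1", "a2", "a3"],
}
classes = {"c1": "core", "a1": "low", "a2": "intermediate", "a3": "low"}
print(coreless_contigs(contigs, classes))
```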

For more info

For more info on Corekaburra, its workings, inputs, outputs and more, see the wiki: https://github.com/milnus/Corekaburra/wiki

Bug reporting and feature requests

Please submit bug reports and feature requests to the issue tracker on GitHub: Corekaburra issue tracker

Licence

This program is released as open source software under the terms of the MIT License.

Owner

Name: Magnus G. Jespersen
Login: milnus
Kind: user
Location: Melbourne, Australia

Twitter: magnus_gj
Repositories: 3
Profile: https://github.com/milnus

PhD student at The University of Melbourne | Davies Lab | Microbial genomics | Streptococcus Pyogenes | Multi omics analysis

JOSS Publication

Corekaburra: pan-genome post-processing using core gene synteny

Published

November 30, 2022

DOI

10.21105/joss.04910

Volume 7, Issue 79, Page 4910

Authors

Magnus G. Jespersen

Department of Microbiology and Immunology, University of Melbourne at the Peter Doherty Institute for Infection and Immunity, Melbourne, VIC, Australia

Andrew Hayes

Department of Microbiology and Immunology, University of Melbourne at the Peter Doherty Institute for Infection and Immunity, Melbourne, VIC, Australia

Mark R. Davies

Department of Microbiology and Immunology, University of Melbourne at the Peter Doherty Institute for Infection and Immunity, Melbourne, VIC, Australia

Editor

Charlotte Soneson

GitHub Events

Total

Issues event: 1

Last Year

Issues event: 1

Committers

Last synced: 7 months ago

All Time

Total Commits: 349
Total Committers: 3
Avg Commits per committer: 116.333
Development Distribution Score (DDS): 0.011

Past Year

Commits: 0
Committers: 0
Avg Commits per committer: 0.0
Development Distribution Score (DDS): 0.0

Top Committers

Name	Email	Commits
milnus	4****s	345
Charlotte Soneson	c**n@g**m	3
Andrew Hayes	2****t	1

Issues and Pull Requests

Last synced: 6 months ago

All Time

Total issues: 10
Total pull requests: 14
Average time to close issues: 3 months
Average time to close pull requests: about 13 hours
Total issue authors: 4
Total pull request authors: 3
Average comments per issue: 0.8
Average comments per pull request: 0.0
Merged pull requests: 14
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 1
Pull requests: 0
Average time to close issues: N/A
Average time to close pull requests: N/A
Issue authors: 1
Pull request authors: 0
Average comments per issue: 0.0
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0


Top Authors

Issue Authors

milnus (4)
iferres (3)
BiologicalScientist (2)
asafpr (1)

Pull Request Authors

milnus (12)
BiologicalScientist (1)
csoneson (1)

Top Labels

Issue Labels

enhancement (2) bug (2)

Pull Request Labels

Packages

Total packages: 1
Total downloads:
- pypi 18 last-month

Total dependent packages: 0
Total dependent repositories: 0
Total versions: 5
Total maintainers: 1

pypi.org: corekaburra

A commandline bioinformatics tool to utilize syntenic information from genomes in the context of pan-genomes

Homepage: https://github.com/milnus/Corekaburra
Documentation: https://corekaburra.readthedocs.io/
License: LICENSE
Latest release: 0.0.5
published over 3 years ago

Versions: 5
Dependent Packages: 0
Dependent Repositories: 0
Downloads: 18 Last month

Rankings

Dependent packages count: 10.1%

Forks count: 15.3%

Stargazers count: 23.1%

Average: 30.7%

Downloads: 37.6%

Dependent repos count: 67.4%

Maintainers (1)

MagnusGJ

Last synced: 6 months ago

Dependencies

.github/workflows/Build_n_publish.yml actions

actions/checkout main composite
actions/setup-python v1 composite
pypa/gh-action-pypi-publish master composite

.github/workflows/Test.yml actions

actions/checkout v2 composite

.github/workflows/code_cov_check.yml actions

actions/checkout v3 composite
codecov/codecov-action v3 composite

.github/workflows/test_dev.yml actions

actions/checkout v2 composite

Dockerfile docker

python 3.9.7-buster build

requirements-dev.txt pypi

biopython ==1.79 development
gffutils >=0.10.1 development
networkx >=2.6.3 development
numpy >=1.23.4 development
pylint * development

setup.py pypi

biopython ==1.79
