Corekaburra

Corekaburra: pan-genome post-processing using core gene synteny - Published in JOSS (2022)

https://github.com/milnus/corekaburra

Science Score: 93.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 4 DOI reference(s) in README and JOSS metadata
  • Academic publication links
    Links to: plos.org
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
    Published in Journal of Open Source Software

Keywords from Contributors

annotations

Scientific Fields

Sociology Social Sciences - 87% confidence
Biology Life Sciences - 63% confidence
Last synced: 4 months ago

Repository

Program to find core gene consensus synteny, hotspots between these, and construct spatially aware multiple sequence alignments of core genes

Basic Info
  • Host: GitHub
  • Owner: milnus
  • License: mit
  • Language: Python
  • Default Branch: main
  • Size: 2.52 MB
Statistics
  • Stars: 5
  • Watchers: 1
  • Forks: 4
  • Open Issues: 2
  • Releases: 5
Created about 5 years ago · Last pushed over 1 year ago
Metadata Files
Readme Contributing License Code of conduct

README.md

Overview

Corekaburra looks at the gene synteny across the genomes used to build a pan-genome. Using syntenic information, Corekaburra identifies regions between core genes. Regions are described by their accessory gene content and by the number of nucleotides between the core genes. Information from neighbouring core genes is further used to identify stretches of core gene clusters that appear in all input genomes. Corekaburra is compatible with outputs from the 'standard' pan-genome pipelines Roary and Panaroo, and can be extended to others if desired.

Why and When to use Corekaburra

Corekaburra fits into the existing frameworks of bioinformatics pipelines for pan-genomes as a downstream analysis tool. It does not reinvent the pan-genome pipeline, but leverages the existing ones. Because of this, Corekaburra is built to be a natural extension to the analysis of pan-genomes, summarising information and inferring relationships in the pan-genome that are otherwise not easily accessible via pan-genome graphs. Other tools provide similar outputs or information, but only within their own stand-alone pan-genome analysis framework or pipeline. Examples of such frameworks/pipelines are PPanGGOLiN and Panakeia. By building on top of existing tools, Corekaburra frees users from having to cross-reference between pan-genomes, which is in itself a challenging task. Corekaburra's workflow also allows it to be extended to any pan-genome tool whose output is similar to the gene_presence_absence.csv produced by Roary, making Corekaburra versatile for future implementations of pan-genome pipelines.

Installation

Corekaburra is written in Python 3.9, and can be installed via pip and conda. A Docker container is also available.

Conda install

conda install -c bioconda -c conda-forge corekaburra

pip

pip install corekaburra

Docker

Pull from DockerHub or see the Wiki for more information

Help

```
usage: Corekaburra -ig file.gff [file.gff ...] -ip path/to/pan_genome
                   [-cg complete_genomes.txt] [-cc 1.0] [-lc 0.05]
                   [-o path/to/output] [-p OUTPUT_PREFIX] [-c int]
                   [-l | -q] [-h] [-v]

Welcome to Corekaburra! An extension to pan-genome analyses that summarise
genomic regions between core genes and segments of neighbouring core genes
using gene synteny from a set of input genomes and a pan-genome folder.

Required arguments:
  -ig file.gff [file.gff ...], --input_gffs file.gff [file.gff ...]
                        Path to gff files used for pan-genome
  -ip path/to/pan_genome, --input_pangenome path/to/pan_genome
                        Path to the folder produced by Panaroo or Roary

Analysis modifiers:
  -cg complete_genomes.txt, --complete_genomes complete_genomes.txt
                        text file containing names of genomes that are to be
                        handled as complete genomes
  -cc 1.0, --core_cutoff 1.0
                        Percentage of isolates in which a core gene must be
                        present [default: 1.0]
  -lc 0.05, --low_cutoff 0.05
                        Percentage of isolates where genes found in less than
                        these are seen as low-frequency genes [default: 0.05]

Output control:
  -o path/to/output, --output path/to/output
                        Path to where output files will be placed [default:
                        current folder]
  -p OUTPUT_PREFIX, --prefix OUTPUT_PREFIX
                        Prefix for output files, if any is desired

Other arguments:
  -c int, --cpu int     Give max number of CPUs [default: 1]
  -l, --log             Record program progress in for debugging purpose
  -q, --quiet           Only print warnings
  -h, --help            Show help function
  -v, --version         show program's version number and exit
```
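As an illustration of the usage line above, the following sketch assembles a Corekaburra call from Python using only the short flags shown in the help text. The helper name `corekaburra_command` and all file and folder names are hypothetical placeholders; the sketch only builds and prints the command, it does not run it.

```python
import shlex


def corekaburra_command(gffs, pan_genome_dir, output_dir=None,
                        core_cutoff=None, cpus=None):
    """Build an argument list for Corekaburra (hypothetical helper)."""
    # -ig and -ip are the two required arguments in the help text.
    command = ["Corekaburra", "-ig", *gffs, "-ip", pan_genome_dir]
    if output_dir is not None:
        command += ["-o", output_dir]
    if core_cutoff is not None:
        command += ["-cc", str(core_cutoff)]
    if cpus is not None:
        command += ["-c", str(cpus)]
    return command


# Placeholder inputs; replace with real Gff files and a real pan-genome folder.
command = corekaburra_command(
    ["genome_1.gff", "genome_2.gff"],
    "panaroo_output",
    output_dir="corekaburra_output",
    core_cutoff=0.99,
)
print(shlex.join(command))
```

The resulting list can be handed to `subprocess.run(command, check=True)` once Corekaburra is installed and the inputs exist.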

Example workflow

We have made an example workflow using three Streptococcus pyogenes genomes, Panaroo and Corekaburra.

Inputs

Gff files

Input Gff files must be included in the pan-genome gene_presence_absence.csv-style file.
The Gffs are also required to contain a ##FASTA line, dividing the file into annotations at the top and the genome sequence at the bottom of the file.
All coding sequences (CDS) annotated in the GFF must also carry an ID and a locus_tag.
Input Gff files can be in gzipped format, if desired.
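The requirements above can be checked before a run. The following is a minimal sketch, not Corekaburra's own validation: the function names `open_gff` and `check_gff` are hypothetical, and the sketch assumes standard nine-column GFF3 lines with `key=value` attributes in the last column.

```python
import gzip
from pathlib import Path


def open_gff(path):
    """Open a plain or gzipped Gff file in text mode."""
    path = Path(path)
    if path.suffix == ".gz":
        return gzip.open(path, "rt")
    return open(path, "r")


def check_gff(path):
    """Return a list of problems found in a Gff file (empty if none)."""
    problems = []
    has_fasta = False
    with open_gff(path) as handle:
        for line_number, line in enumerate(handle, start=1):
            line = line.rstrip("\n")
            if line.startswith("##FASTA"):
                has_fasta = True
                break  # everything below is sequence, not annotation
            if not line or line.startswith("#"):
                continue
            fields = line.split("\t")
            if len(fields) != 9 or fields[2] != "CDS":
                continue
            # Column 9 holds semicolon-separated key=value attributes.
            attributes = dict(
                item.split("=", 1)
                for item in fields[8].split(";")
                if "=" in item
            )
            for key in ("ID", "locus_tag"):
                if key not in attributes:
                    problems.append(f"line {line_number}: CDS lacks {key}")
    if not has_fasta:
        problems.append("no ##FASTA line found")
    return problems
```

Calling `check_gff("genome_1.gff")` on a file that meets all three requirements returns an empty list.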

Pan-genome folder

This is the output folder from a Roary or Panaroo run. If you are constructing a gene_presence_absence.csv from another pan-genome tool, the folder must at minimum contain a gene_presence_absence.csv file with each field quoted, as in the gene_presence_absence.csv from Roary.
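The quoting requirement can be met with Python's `csv` module. The sketch below only demonstrates quoting every field; the header is a deliberately shortened placeholder, whereas a real Roary file carries additional metadata columns before the genome columns. The cluster names, genome names, and the function `write_quoted_table` are hypothetical.

```python
import csv
import io

# Hypothetical clusters: gene cluster name -> {genome name: locus_tag}.
clusters = {
    "geneA": {"genome_1": "g1_00001", "genome_2": "g2_00010"},
    "geneB": {"genome_1": "g1_00002", "genome_2": ""},
}
genomes = ["genome_1", "genome_2"]


def write_quoted_table(clusters, genomes, handle):
    """Write a presence/absence style table with every field quoted."""
    writer = csv.writer(handle, quoting=csv.QUOTE_ALL, lineterminator="\n")
    # Placeholder header: NOT the full set of columns found in Roary output.
    writer.writerow(["Gene", *genomes])
    for gene, members in clusters.items():
        writer.writerow([gene, *(members.get(g, "") for g in genomes)])


buffer = io.StringIO()
write_quoted_table(clusters, genomes, buffer)
print(buffer.getvalue())
```

Using `csv.QUOTE_ALL` also quotes empty fields, so an absent gene is written as `""` rather than as a bare empty cell.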

Complete genomes

If some input Gffs are to be processed as complete or closed genomes, a plain text file can be provided with the filename of these.
example:
    complete_genome.gff
    complete_genome.gff.gz
    /paths/are/allowed/complete_genome.gff
    complete_genome
All files given in the plain text file as complete genomes must be found in the given gene presence/absence file, but are not required to be among the input Gffs, meaning that a single plain text file of complete genomes can be used for analysing subsets of genomes in the pan-genome.
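The four spellings in the example all refer to the same genome. One way to see this is to reduce each entry to a bare genome name, as in the sketch below. This illustrates the idea only and is an assumption about how such names could be matched, not Corekaburra's own code; `genome_name` and `read_complete_genomes` are hypothetical helpers.

```python
from pathlib import PurePosixPath


def genome_name(entry):
    """Strip any leading path and a trailing .gff or .gff.gz from an entry."""
    name = PurePosixPath(entry.strip()).name
    # Remove .gz first so that .gff.gz and .gff are handled alike.
    return name.removesuffix(".gz").removesuffix(".gff")


def read_complete_genomes(lines):
    """Return genome names from the lines of a complete-genomes text file."""
    return [genome_name(line) for line in lines if line.strip()]


entries = [
    "complete_genome.gff",
    "complete_genome.gff.gz",
    "/paths/are/allowed/complete_genome.gff",
    "complete_genome",
]
print(read_complete_genomes(entries))
```

`str.removesuffix` requires Python 3.9 or newer.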

Adjusting cutoffs

To comply with common practice when handling pan-genomes, the cutoff at which a pan-genome cluster (gene) is considered core can be changed using the -cc argument, given as the ratio of genomes in which the gene must be present. By default, this is set to a conservative 100% presence of core genes.
A second argument dividing accessory genes into two groups (Low frequency and Intermediate frequency) can be controlled using the -lc argument, with the ratio indicating the maximum presence of a gene cluster to be identified as having a low frequency in the pan-genome. This division of low- and intermediate frequency can be disabled by -lc 0, resulting in all genes being considered as intermediate.
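The effect of the two cutoffs can be sketched as a small classification function. This is an illustration under stated assumptions rather than Corekaburra's implementation: the function name is hypothetical, and the handling of values exactly at a cutoff (`>=` for core, strictly less than for low frequency, following the wording of the help text) is an assumption.

```python
def classify_cluster(n_present, n_genomes, core_cutoff=1.0, low_cutoff=0.05):
    """Classify a gene cluster by the fraction of genomes that carry it."""
    presence = n_present / n_genomes
    if presence >= core_cutoff:
        return "core"
    # Assumed boundary: strictly below the cutoff counts as low frequency.
    if presence < low_cutoff:
        return "low frequency"
    return "intermediate frequency"


print(classify_cluster(100, 100))               # present in all genomes
print(classify_cluster(4, 100))                 # rare accessory gene
print(classify_cluster(4, 100, low_cutoff=0))   # division disabled
```

With `low_cutoff=0` no presence can fall below the cutoff, so every accessory gene is reported as intermediate frequency, matching the behaviour described for -lc 0.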

Outputs

Corekaburra outputs multiple files, ranging from summaries to more fine-grained outputs. The aim is to give the user easy access to information while still allowing for tailored or deep exploration. See the wiki for a description of the outputs and how to query them.

Core regions

A core region is defined by two core genes flanking a stretch of the genome in at least one input genome. A core region can be described by the distance between the flanking core genes (positive if nucleotides are found between them, negative if the two genes overlap). A region can also be described by the number of encoded accessory genes. Using core gene clusters as a reference for a region, it is possible to compare the same region, or its presence, across genomes. Additionally, with the distance, the number of encoded accessory genes, or both, it is possible to identify regions of variability caused by horizontal genetic transfer, deletion or other genomic processes.
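The sign convention for the distance can be made concrete with a short sketch. It assumes 1-based, inclusive start and end coordinates as used in Gff files and two genes on the same contig; the function name is hypothetical, and Corekaburra's exact arithmetic may differ.

```python
def core_gene_distance(first, second):
    """Distance between two genes given as (start, end), 1-based inclusive.

    Positive: number of nucleotides between the genes.
    Zero: the genes are directly adjacent.
    Negative: number of overlapping nucleotides, with a minus sign.
    """
    upstream, downstream = sorted([first, second])
    return downstream[0] - upstream[1] - 1


print(core_gene_distance((100, 200), (251, 400)))  # 50 nucleotides apart
print(core_gene_distance((100, 200), (191, 400)))  # overlap of 10 nucleotides
```

Sorting the two coordinate pairs first means the genes can be passed in either order.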

core_pair_summary.csv is a file that summarises the core regions identified across the input genomes (Gff files). It contains the occurrence and co-occurrence of each core gene pair, as well as the occurrence of each individual core gene. Distance and accessory gene summary statistics (minimum, maximum, mean, and median) are given for each core pair.
This file is a good entry point to the results in most analyses, and should give a good indication of which core regions could be of interest.
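The four summary statistics named above are straightforward to reproduce for a single core pair. The sketch below uses made-up distances for one hypothetical core gene pair observed in three genomes; it shows the statistics themselves, not the layout of the output file.

```python
from statistics import mean, median


def summarise(values):
    """Return minimum, maximum, mean and median of a list of numbers."""
    return {
        "min": min(values),
        "max": max(values),
        "mean": mean(values),
        "median": median(values),
    }


# Made-up distances (in nucleotides) for one core pair in three genomes.
distances = [120, 118, 4310]
print(summarise(distances))
```

A large gap between the median and the maximum, as in this made-up example, is the kind of signal that can point to an insertion in one genome.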

core_core_accessory_gene_content.tsv gives the placement of each accessory gene found in a core region across all genomes (Gff). Frequency of accessory genes (low- or intermediate frequency) is given.

low_frequency_gene_placement.tsv summarises each core region across all genomes (Gff) with the distance between core gene clusters, and the number of accessory genes found in the region.

Core segments

The two following files are only given if any core gene is found to have more than two different core genes as neighbours across all input genomes (Gff), meaning there is structural heterogeneity across the genomes.

The file core_segments.csv contains all segments of at least two core genes identified in a pan-genome, where the start and end of a segment are defined by core gene clusters with more than two neighbours, meaning they could be a potential breakpoint of a genomic inversion in at least one input genome (Gff), or a misassembly.
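The idea of a core gene with more than two neighbours can be illustrated with a small neighbour table. This is a simplified sketch, not Corekaburra's algorithm: it treats every genome as a single linear contig, the gene and genome names are invented, and the function names are hypothetical.

```python
from collections import defaultdict


def core_neighbours(core_orders):
    """Map each core gene to the set of core genes seen next to it."""
    neighbours = defaultdict(set)
    for order in core_orders.values():
        for left, right in zip(order, order[1:]):
            neighbours[left].add(right)
            neighbours[right].add(left)
    return neighbours


def candidate_breakpoints(core_orders):
    """Core genes with more than two distinct neighbours across genomes."""
    neighbours = core_neighbours(core_orders)
    return sorted(gene for gene, seen in neighbours.items() if len(seen) > 2)


# Invented core gene orders; genome_3 carries an inversion of C-D-E.
core_orders = {
    "genome_1": ["A", "B", "C", "D", "E", "F"],
    "genome_2": ["A", "B", "C", "D", "E", "F"],
    "genome_3": ["A", "B", "E", "D", "C", "F"],
}
print(candidate_breakpoints(core_orders))
```

In this toy example the genes at the edges of the inversion gain a third neighbour, while gene D in the middle of the inverted stretch keeps the same two neighbours.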

no_accessory_core_segments.csv divides the segments identified in core_segments.csv into potential smaller segments where core genes must form regions with no accessory genes between them across all genomes. These segments could indicate potential operon structures or other stable genomic features that could be disturbed by insertion of accessory genes.
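Splitting a segment wherever accessory genes occur can be sketched as follows. The input format is an assumption made for illustration: `accessory_between` maps a pair of neighbouring core genes to the largest number of accessory genes found between them in any genome. Keeping only sub-segments of at least two core genes is likewise an assumption, and the function name is hypothetical.

```python
def split_on_accessory(segment, accessory_between):
    """Split a core segment at every pair with accessory genes in between."""
    subsegments = []
    current = [segment[0]]
    for left, right in zip(segment, segment[1:]):
        if accessory_between.get(frozenset((left, right)), 0) > 0:
            # Accessory genes interrupt the stretch: close it, start anew.
            subsegments.append(current)
            current = [right]
        else:
            current.append(right)
    subsegments.append(current)
    # Assumption: a sub-segment needs at least two core genes to be kept.
    return [sub for sub in subsegments if len(sub) >= 2]


segment = ["A", "B", "C", "D", "E"]
accessory_between = {frozenset(("B", "C")): 3}  # invented count
print(split_on_accessory(segment, accessory_between))
```

Using `frozenset` as the key makes the lookup independent of the order in which the two core genes are named.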

Core-less contigs

coreless_contig_accessory_gene_content.tsv gives all contigs identified in genomes (Gff) that do not contain a core gene cluster, but only accessory genes. Each contig is given by contig name, its Gff file, and number of low- and intermediate frequency genes found on the contig.
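The selection of core-less contigs can be pictured with the sketch below. The data structures are assumptions chosen for illustration: `contig_genes` maps a (Gff file, contig) pair to the gene clusters on that contig, and `gene_class` maps each cluster to "core", "low" or "intermediate". All names and values are invented.

```python
def coreless_contigs(contig_genes, gene_class):
    """Return (contig, gff, n_low, n_intermediate) for contigs without core genes."""
    rows = []
    for (gff, contig), genes in contig_genes.items():
        classes = [gene_class[gene] for gene in genes]
        if "core" in classes:
            continue  # contig carries at least one core gene
        rows.append(
            (contig, gff, classes.count("low"), classes.count("intermediate"))
        )
    return rows


# Invented example data.
gene_class = {"A": "core", "x1": "low", "x2": "low", "y1": "intermediate"}
contig_genes = {
    ("genome_1", "contig_1"): ["A", "x1"],
    ("genome_1", "contig_2"): ["x1", "x2", "y1"],
}
print(coreless_contigs(contig_genes, gene_class))
```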

For more info

For more info on Corekaburra, its workings, inputs, outputs and more, see the [wiki](https://github.com/milnus/Corekaburra/wiki).

Bug reporting and feature requests

Please submit bug reports and feature requests to the issue tracker on GitHub: Corekaburra issue tracker

Licence

This program is released as open source software under the terms of the MIT License.

Owner

  • Name: Magnus G. Jespersen
  • Login: milnus
  • Kind: user
  • Location: Melbourne, Australia

PhD student at The University of Melbourne | Davies Lab | Microbial genomics | Streptococcus Pyogenes | Multi omics analysis

JOSS Publication

Corekaburra: pan-genome post-processing using core gene synteny
Published
November 30, 2022
Volume 7, Issue 79, Page 4910
Authors
Magnus G. Jespersen ORCID
Department of Microbiology and Immunology, University of Melbourne at the Peter Doherty Institute for Infection and Immunity, Melbourne, VIC, Australia
Andrew Hayes ORCID
Department of Microbiology and Immunology, University of Melbourne at the Peter Doherty Institute for Infection and Immunity, Melbourne, VIC, Australia
Mark R. Davies ORCID
Department of Microbiology and Immunology, University of Melbourne at the Peter Doherty Institute for Infection and Immunity, Melbourne, VIC, Australia
Editor
Charlotte Soneson ORCID
Tags
microbiology genomics genome pan-genome

GitHub Events

Total
  • Issues event: 1
Last Year
  • Issues event: 1

Committers

Last synced: 5 months ago

All Time
  • Total Commits: 349
  • Total Committers: 3
  • Avg Commits per committer: 116.333
  • Development Distribution Score (DDS): 0.011
Past Year
  • Commits: 0
  • Committers: 0
  • Avg Commits per committer: 0.0
  • Development Distribution Score (DDS): 0.0
Top Committers
Name Email Commits
milnus 4****s 345
Charlotte Soneson c****n@g****m 3
Andrew Hayes 2****t 1

Issues and Pull Requests

Last synced: 4 months ago

All Time
  • Total issues: 10
  • Total pull requests: 14
  • Average time to close issues: 3 months
  • Average time to close pull requests: about 13 hours
  • Total issue authors: 4
  • Total pull request authors: 3
  • Average comments per issue: 0.8
  • Average comments per pull request: 0.0
  • Merged pull requests: 14
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 1
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 1
  • Pull request authors: 0
  • Average comments per issue: 0.0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • milnus (4)
  • iferres (3)
  • BiologicalScientist (2)
  • asafpr (1)
Pull Request Authors
  • milnus (12)
  • BiologicalScientist (1)
  • csoneson (1)
Top Labels
Issue Labels
enhancement (2) bug (2)
Pull Request Labels

Packages

  • Total packages: 1
  • Total downloads:
    • pypi 18 last-month
  • Total dependent packages: 0
  • Total dependent repositories: 0
  • Total versions: 5
  • Total maintainers: 1
pypi.org: corekaburra

A commandline bioinformatics tool to utilize syntenic information from genomes in the context of pan-genomes

  • Versions: 5
  • Dependent Packages: 0
  • Dependent Repositories: 0
  • Downloads: 18 Last month
Rankings
Dependent packages count: 10.1%
Forks count: 15.3%
Stargazers count: 23.1%
Average: 30.7%
Downloads: 37.6%
Dependent repos count: 67.4%
Maintainers (1)
Last synced: 4 months ago

Dependencies

.github/workflows/Build_n_publish.yml actions
  • actions/checkout main composite
  • actions/setup-python v1 composite
  • pypa/gh-action-pypi-publish master composite
.github/workflows/Test.yml actions
  • actions/checkout v2 composite
.github/workflows/code_cov_check.yml actions
  • actions/checkout v3 composite
  • codecov/codecov-action v3 composite
.github/workflows/test_dev.yml actions
  • actions/checkout v2 composite
Dockerfile docker
  • python 3.9.7-buster build
requirements-dev.txt pypi
  • biopython ==1.79 development
  • gffutils >=0.10.1 development
  • networkx >=2.6.3 development
  • numpy >=1.23.4 development
  • pylint * development
setup.py pypi
  • biopython ==1.79