gecco-tool

GEne Cluster prediction with COnditional random fields.

https://github.com/zellerlab/gecco

Science Score: 67.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 4 DOI reference(s) in README
  • Academic publication links
    Links to: biorxiv.org
  • Committers with academic emails
    3 of 5 committers (60.0%) from academic institutions
  • Institutional organization owner
    Organization zellerlab has institutional domain (www.embl.de)
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (12.7%) to scientific vocabulary

Keywords

bioinformatics biosynthetic-gene-clusters genomics metagenomics natural-products python secondary-metabolites
Last synced: 6 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: zellerlab
  • License: gpl-3.0
  • Language: Python
  • Default Branch: master
  • Homepage: https://gecco.embl.de
  • Size: 28.8 MB
Statistics
  • Stars: 70
  • Watchers: 4
  • Forks: 8
  • Open Issues: 7
  • Releases: 53
Created about 5 years ago · Last pushed 7 months ago
Metadata Files
Readme Changelog Contributing License Citation

README.md

Hi, I'm GECCO!

Overview

GECCO (Gene Cluster prediction with Conditional Random Fields) is a fast and scalable method for identifying putative novel Biosynthetic Gene Clusters (BGCs) in genomic and metagenomic data using Conditional Random Fields (CRFs).


Installing GECCO

GECCO is implemented in Python and supports Python 3.7 and later. It requires additional libraries that can be installed directly from PyPI, the Python Package Index.

Use pip to install GECCO on your machine:

    $ pip install gecco

If you'd rather use Conda, a package is available in the bioconda channel. You can install it with:

    $ conda install -c bioconda gecco

This will install GECCO, its dependencies, and the data needed to run predictions. Around 40 MB of data must be downloaded, so installation can take some time depending on your Internet connection. Once done, you will have a gecco command available in your $PATH.

Note that GECCO uses HMMER3, which can only run on PowerPC and recent x86-64 machines running a POSIX operating system. Therefore, GECCO will work on Linux and macOS, but not on Windows.

Running GECCO

Once gecco is installed, you can run it from the terminal by giving it a FASTA or GenBank file with the genomic sequence you want to analyze, as well as an output directory:

    $ gecco run --genome some_genome.fna -o some_output_dir
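
For scripted use, the invocation above can be assembled from Python. The following is a minimal sketch, not part of GECCO itself: `build_gecco_run` is a hypothetical helper name, and it only uses the flags mentioned in this README (`--genome`, `-o`, and the optional parameters listed under "Additional parameters of interest").

```python
from typing import List, Optional


def build_gecco_run(
    genome: str,
    output_dir: str,
    jobs: Optional[int] = None,
    cds: Optional[int] = None,
    threshold: Optional[float] = None,
    cds_feature: Optional[str] = None,
) -> List[str]:
    """Assemble the argument list for `gecco run` (hypothetical helper)."""
    argv = ["gecco", "run", "--genome", genome, "-o", output_dir]
    # Optional flags are only added when explicitly set, so that GECCO's
    # own defaults apply otherwise.
    if jobs is not None:
        argv += ["--jobs", str(jobs)]
    if cds is not None:
        argv += ["--cds", str(cds)]
    if threshold is not None:
        argv += ["--threshold", str(threshold)]
    if cds_feature is not None:
        argv += ["--cds-feature", cds_feature]
    return argv


if __name__ == "__main__":
    argv = build_gecco_run("some_genome.fna", "some_output_dir", jobs=4)
    print(" ".join(argv))
    # To actually run GECCO (requires `gecco` in $PATH):
    #   import subprocess
    #   subprocess.run(argv, check=True)
```

Passing the arguments as a list rather than a single shell string avoids quoting problems when file names contain spaces.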

Additional parameters of interest are:

  • --jobs, which controls the number of threads that will be spawned by GECCO whenever a step can be parallelized. The default, 0, will autodetect the number of CPUs on the machine using os.cpu_count.
  • --cds, controlling the minimum number of consecutive genes a BGC region must have to be detected by GECCO. The default is 3.
  • --threshold, controlling the minimum probability for a gene to be considered part of a BGC region. Using a lower number will increase the number (and possibly length) of predictions, but reduce accuracy. The default of 0.8 was selected to optimize precision/recall on a test set of 364 BGCs from MIBiG 2.0.
  • --cds-feature, which takes the name of the feature type to extract genes from when the input file already contains gene annotations, instead of predicting genes with Pyrodigal. A common value for records downloaded from GenBank is --cds-feature CDS.
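
To build intuition for how --threshold and --cds interact, here is a deliberately simplified sketch. It is an illustration only and not GECCO's actual implementation: the function name `candidate_regions`, the use of `>=` for the comparison, and the example probabilities are all assumptions made for this example.

```python
from typing import List, Tuple


def candidate_regions(
    probabilities: List[float],
    threshold: float = 0.8,
    min_cds: int = 3,
) -> List[Tuple[int, int]]:
    """Find runs of consecutive genes whose probability passes the threshold.

    Returns (start, end) gene index pairs, with `end` exclusive. Runs
    shorter than `min_cds` genes are discarded. Toy model only.
    """
    regions = []
    start = None  # index of the first gene of the current run, if any
    for i, p in enumerate(probabilities):
        if p >= threshold:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_cds:
                regions.append((start, i))
            start = None
    # Close a run that extends to the last gene.
    if start is not None and len(probabilities) - start >= min_cds:
        regions.append((start, len(probabilities)))
    return regions


if __name__ == "__main__":
    # Made-up per-gene probabilities, for illustration only.
    probs = [0.1, 0.9, 0.95, 0.85, 0.2, 0.9, 0.9, 0.1]
    print(candidate_regions(probs))                  # default settings
    print(candidate_regions(probs, min_cds=2))       # shorter runs allowed
    print(candidate_regions(probs, threshold=0.15))  # lower threshold
```

In this toy model, lowering the threshold merges the two runs into one longer region, which mirrors the trade-off described for --threshold above.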

Results

GECCO will create the following files:

  • {genome}.genes.tsv: The genes file, containing the genes extracted or predicted from the input file, and per-gene BGC probabilities predicted by the CRF.
  • {genome}.features.tsv: The features file, containing the identified domains in the input sequences, in tabular format.
  • {genome}.clusters.tsv: If any were found, a clusters file, containing the coordinates of the predicted clusters along with their putative biosynthetic type, in tabular format.
  • {genome}_cluster_{N}.gbk: If any were found, a GenBank file per cluster, containing the cluster sequence annotated with its member proteins and domains.
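
The output files listed above can be collected programmatically. The sketch below relies only on the file naming pattern given in this README. `find_gecco_outputs` and `read_tsv` are hypothetical helper names, the caller is assumed to know the `{genome}` prefix, and no column names are assumed, since they are not listed here.

```python
import csv
from pathlib import Path
from typing import Dict, List


def find_gecco_outputs(output_dir: str, genome: str) -> Dict[str, List[Path]]:
    """Return the GECCO result files that exist in `output_dir`, by kind."""
    out = Path(output_dir)
    candidates = {
        "genes": [out / f"{genome}.genes.tsv"],
        "features": [out / f"{genome}.features.tsv"],
        "clusters": [out / f"{genome}.clusters.tsv"],
        # One GenBank file per predicted cluster.
        "genbank": sorted(out.glob(f"{genome}_cluster_*.gbk")),
    }
    # The clusters table and GenBank files are only written when clusters
    # were found, so keep only the paths that actually exist.
    return {
        kind: [path for path in paths if path.is_file()]
        for kind, paths in candidates.items()
    }


def read_tsv(path: Path) -> List[dict]:
    """Load a tab-separated table as a list of rows keyed by column header."""
    with open(path, newline="") as handle:
        return list(csv.DictReader(handle, delimiter="\t"))


if __name__ == "__main__":
    found = find_gecco_outputs("some_output_dir", "some_genome")
    for kind, paths in found.items():
        print(kind, [path.name for path in paths])
```

An empty list under "clusters" or "genbank" then simply means that no cluster was predicted for that input.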

GECCO can also convert results to other formats that may be more convenient depending on the downstream usage. GECCO can convert results into:

  • GFF3 format so they can be loaded into a genomic viewer (gecco convert clusters --format gff).
  • GenBank files with antiSMASH-style features so they can be loaded into BiG-SLiCE for further analysis (gecco convert gbk --format bigslice).
  • FASTA files with the sequences of all the predicted BGCs (gecco convert gbk --format fna) or with the sequences of all their proteins (gecco convert gbk --format faa).
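
When several conversions are needed in a pipeline, the subcommands above can be kept in a small lookup table. This is a sketch under stated assumptions: `CONVERT_COMMANDS` and `convert_argv` are hypothetical names, and only the arguments shown in this README are included. Any option needed to point the command at a results directory is not documented here and is therefore left out; consult the command's built-in help before running it.

```python
from typing import Dict, List

# Target format -> argument list, copied from the commands listed above.
CONVERT_COMMANDS: Dict[str, List[str]] = {
    "gff": ["gecco", "convert", "clusters", "--format", "gff"],
    "bigslice": ["gecco", "convert", "gbk", "--format", "bigslice"],
    "fna": ["gecco", "convert", "gbk", "--format", "fna"],
    "faa": ["gecco", "convert", "gbk", "--format", "faa"],
}


def convert_argv(fmt: str) -> List[str]:
    """Return a copy of the argument list for the requested target format."""
    try:
        return list(CONVERT_COMMANDS[fmt])
    except KeyError:
        known = ", ".join(sorted(CONVERT_COMMANDS))
        raise ValueError(f"unknown format {fmt!r}, expected one of: {known}") from None


if __name__ == "__main__":
    for fmt in CONVERT_COMMANDS:
        print(" ".join(convert_argv(fmt)))
```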

For a more visual way of exploring the predictions, you can open the GenBank files in genome editing software such as UGENE. Alternatively, you can load the results into an antiSMASH report: check the Integrations page of the documentation for a step-by-step guide.

Reference

GECCO can be cited using the following preprint:

Accurate de novo identification of biosynthetic gene clusters with GECCO. Laura M Carroll, Martin Larralde, Jonas Simon Fleck, Ruby Ponnudurai, Alessio Milanese, Elisa Cappio Barazzone, Georg Zeller. bioRxiv 2021.05.03.442509; doi:10.1101/2021.05.03.442509

Feedback

Issue Tracker

Found a bug? Have an enhancement request? Head over to the GitHub issue tracker if you need to report or ask something. If you are filing a bug report, please include as much information as you can about the issue, and try to recreate the same bug in a simple, easily reproducible situation.

Contributing

Contributions are more than welcome! See CONTRIBUTING.md for more details.

Acknowledgments

We thank Maarten van Gompel (@proycon), the author of the homonymous Gecco package (Generic Environment for Context-Aware Correction of Orthography), for allowing us to take ownership of the PyPI package name. The Gecco releases (up to 0.3.0) can still be downloaded from the PyPI project page of the same name.

License

This software is provided under the GNU General Public License v3.0 or later. GECCO is developed by the Zeller Team at the European Molecular Biology Laboratory in Heidelberg.

Owner

  • Name: Zeller Lab
  • Login: zellerlab
  • Kind: organization
  • Location: European Molecular Biology Laboratory, Heidelberg, Germany

Projects Relating to the Zeller Team's Research of Host-Microbiota Interactions

GitHub Events

Total
  • Release event: 1
  • Issues event: 2
  • Watch event: 13
  • Issue comment event: 6
  • Push event: 16
  • Fork event: 1
Last Year
  • Release event: 1
  • Issues event: 2
  • Watch event: 13
  • Issue comment event: 6
  • Push event: 16
  • Fork event: 1

Committers

Last synced: almost 3 years ago

All Time
  • Total Commits: 1,001
  • Total Committers: 5
  • Avg Commits per committer: 200.2
  • Development Distribution Score (DDS): 0.518
Top Committers
Name Email Commits
Martin Larralde m****e@e****r 482
Martin Larralde m****e@e****e 427
Jonas Simon Fleck j****k@e****e 90
astair j****k@g****m 1
Laura Michelle Carroll c****l@n****e 1

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 17
  • Total pull requests: 1
  • Average time to close issues: 17 days
  • Average time to close pull requests: about 20 hours
  • Total issue authors: 14
  • Total pull request authors: 1
  • Average comments per issue: 4.06
  • Average comments per pull request: 1.0
  • Merged pull requests: 1
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 2
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 2
  • Pull request authors: 0
  • Average comments per issue: 1.5
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • tamuanand (2)
  • xvazquezc (2)
  • OwenNaicker (2)
  • valentynbez (1)
  • ivandatasci (1)
  • sherinesaber (1)
  • apcamargo (1)
  • igibek (1)
  • Starcommits (1)
  • jolespin (1)
  • danudwary (1)
  • smb20200615 (1)
  • raufs (1)
  • franciscozorrilla (1)
Pull Request Authors
  • lmc297 (2)
Top Labels
Issue Labels
bug (7) question (3) external (2) documentation (2) enhancement (1)

Packages

  • Total packages: 1
  • Total downloads:
    • pypi 169 last-month
  • Total dependent packages: 0
  • Total dependent repositories: 1
  • Total versions: 29
  • Total maintainers: 2
pypi.org: gecco-tool

Gene cluster prediction with Conditional random fields.

  • Versions: 29
  • Dependent Packages: 0
  • Dependent Repositories: 1
  • Downloads: 169 Last month
Rankings
Stargazers count: 9.8%
Dependent packages count: 10.0%
Forks count: 14.2%
Average: 14.9%
Downloads: 18.8%
Dependent repos count: 21.7%
Maintainers (2)
Last synced: 7 months ago

Dependencies

docs/requirements.txt pypi
  • pygments-style-monokailight *
  • recommonmark *
  • semantic_version *
  • sphinx *
  • sphinx-bootstrap-theme *
.github/workflows/galaxy.yml actions
  • actions/checkout v2 composite
.github/workflows/package.yml actions
  • actions/checkout v2 composite
  • actions/checkout v1 composite
  • actions/download-artifact v2 composite
  • actions/setup-python v2 composite
  • actions/setup-python v1 composite
  • actions/upload-artifact v2 composite
  • pypa/gh-action-pypi-publish master composite
  • rasmus-saks/release-a-changelog-action v1.0.1 composite
  • softprops/action-gh-release v1 composite
.github/workflows/test.yml actions
  • actions/checkout v2 composite
  • actions/setup-python v2 composite
  • codecov/codecov-action v1 composite