dense

Identifies genes that have emerged de novo (from non-coding DNA).

https://github.com/i2bc/dense

Science Score: 67.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 10 DOI reference(s) in README
  • Academic publication links
    Links to: ncbi.nlm.nih.gov
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (8.2%) to scientific vocabulary
Last synced: 6 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: i2bc
  • License: mit
  • Language: Python
  • Default Branch: master
  • Homepage:
  • Size: 28.9 MB
Statistics
  • Stars: 6
  • Watchers: 1
  • Forks: 0
  • Open Issues: 2
  • Releases: 5
Created about 2 years ago · Last pushed 8 months ago

README.md

DENSE

Introduction

DENSE is a pipeline that detects genes that have emerged de novo (from non-coding DNA regions), based on phylostratigraphy and synteny.

Figure 1 from Roginski et al 2024 (https://doi.org/10.1093/gbe/evae159)

DENSE uses a genome of interest (focal) and its phylogenetic neighbors (genomes FASTA and GFF3 annotation files).

It has two main parts :

  • 1 : searches for the taxonomically restricted genes (TRGs) among the annotated genes of the focal genome (A and B),
  • 2 : identifies, through a cascade of filters, the TRGs with homology traces in the neighbor genomes (C and D).

More precisely, the pipeline includes the following steps :

  • A : extracts the coding sequences of all protein-coding genes in the focal genome and searches for homologs in the Refseq Non-redundant protein database (NR) and in the neighbor genomes.
  • B : based on the previous step, selects genes that are taxonomically restricted.
  • C : selects TRGs with homology in non-coding regions of neighbor genomes. If a phylogenetic tree is provided, DENSE can require these genomes to be 'outgroup', meaning that they are more distant from the focal genome than any neighbor actually sharing the gene.
  • D : finally determines whether the homologous non-coding regions are in synteny with their TRGs (this step can be switched off).
    It generates a file containing all the genes that have emerged de novo.

Flexibility : DENSE lets you use different "strategies" (combinations of filters) to detect de novo genes :

  • 1 : any TRG with an outgroup (see below) non-coding hit
  • 2 : any TRG with a non-coding hit
  • 3 : any orphan gene with a non-coding hit
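
As a rough illustration of how the three strategies combine the filters, here is a minimal Python sketch. It is not DENSE's actual code: the `GeneEvidence` class, its field names and the `passes_strategy` function are hypothetical, and the orphan status is simply taken as an input flag.

```python
from dataclasses import dataclass


@dataclass
class GeneEvidence:
    """Hypothetical per-gene summary; the field names are illustrative only."""
    is_trg: bool                  # taxonomically restricted gene (steps A and B)
    is_orphan: bool               # orphan status, supplied by the caller
    noncoding_hit: bool           # homology in a non-coding region of a neighbor
    outgroup_noncoding_hit: bool  # such a hit located in an "outgroup" neighbor


def passes_strategy(gene: GeneEvidence, strategy: int) -> bool:
    """Apply the filter combination described for strategies 1, 2 and 3."""
    if strategy == 1:
        return gene.is_trg and gene.outgroup_noncoding_hit
    if strategy == 2:
        return gene.is_trg and gene.noncoding_hit
    if strategy == 3:
        return gene.is_orphan and gene.noncoding_hit
    raise ValueError(f"unknown strategy: {strategy}")


# A TRG whose only non-coding hit is in a neighbor that is not an outgroup.
example = GeneEvidence(is_trg=True, is_orphan=False,
                       noncoding_hit=True, outgroup_noncoding_hit=False)
print([s for s in (1, 2, 3) if passes_strategy(example, s)])  # [2]
```

In this reading, strategy 1 is the most stringent because it needs a tree to decide which neighbors are outgroups.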

Key concepts

Outgroup

A genome labeled as "outgroup" for a given gene is a genome where that gene is absent and which branches off in the tree beyond the last genome where the gene is present, i.e. it is more distant from the focal genome than any neighbor sharing the gene.
Figure : outgroup illustration, from Roginski et al 2024 (https://doi.org/10.1093/gbe/evae159)
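
The definition can be made concrete with a small Python sketch. This is only one possible reading of it, not the way DENSE computes outgroups: the tree is given as nested tuples rather than Newick, and the function names (`leaves`, `divergence_levels`, `outgroup_genomes`) are hypothetical. Each neighbor gets a "level" that grows with its distance from the focal genome, and the outgroups are the genomes lacking the gene that sit beyond the most distant genome having it.

```python
def leaves(node):
    """Return the set of leaf names under a node (a name or a tuple of nodes)."""
    if isinstance(node, str):
        return {node}
    names = set()
    for child in node:
        names |= leaves(child)
    return names


def divergence_levels(tree, focal):
    """Map each genome to a level: 0 for the focal, larger = branched off earlier."""
    sibling_sets = []  # genomes that split off at each node, from root to focal
    node = tree
    while not isinstance(node, str):
        next_node, siblings = None, set()
        for child in node:
            child_leaves = leaves(child)
            if focal in child_leaves:
                next_node = child
            else:
                siblings |= child_leaves
        if next_node is None:
            raise ValueError(f"{focal!r} is not in the tree")
        sibling_sets.append(siblings)
        node = next_node
    levels = {focal: 0}
    for level, names in enumerate(reversed(sibling_sets), start=1):
        for name in names:
            levels[name] = level
    return levels


def outgroup_genomes(tree, focal, genomes_with_gene):
    """Genomes lacking the gene and more distant than any genome sharing it."""
    levels = divergence_levels(tree, focal)
    present = set(genomes_with_gene) | {focal}
    last_level = max(levels[name] for name in present if name in levels)
    return sorted(name for name, level in levels.items()
                  if name not in present and level > last_level)


# Toy tree, equivalent to the Newick string (((focal,A),B),C);
tree = ((("focal", "A"), "B"), "C")
print(outgroup_genomes(tree, "focal", {"B"}))  # ['C']
```

In the toy example the gene is shared with B, so A (closer than B) is not an outgroup even though it lacks the gene, while C is.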

Set-up

1. Nextflow

Before anything, you need to have a recent version of Nextflow installed.

[!TIP] If you do not have Nextflow yet, you can find simple instructions here : this page.

In order to use the latest Nextflow version, you should use:

nextflow self-update

[!IMPORTANT] DENSE requires Nextflow >=23.04.3. A previous version could lead to errors.
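
If you want to check this requirement in a script, a minimal Python sketch could look like the following. The function names are hypothetical, and the version text is assumed to be something you obtained yourself (for instance by copying the output of `nextflow -version`); only the first `X.Y.Z` pattern found in the text is compared.

```python
import re

# Minimum version stated above (23.04.3), as a tuple of integers.
MIN_NEXTFLOW = (23, 4, 3)


def parse_version(text):
    """Extract the first X.Y.Z pattern from a string as a tuple of integers."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    return tuple(int(part) for part in match.groups())


def nextflow_is_recent_enough(version_text, minimum=MIN_NEXTFLOW):
    """Return True if the version found in version_text is >= minimum."""
    return parse_version(version_text) >= minimum


print(nextflow_is_recent_enough("23.04.3"))  # True
print(nextflow_is_recent_enough("22.10.8"))  # False
```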

To test your Nextflow installation you can use :

nextflow run hello

2. Container manager

In order to use DENSE in a fully-ready and reproducible environment, you need to have a container manager installed on your machine.
You can use any of the following :

  • Docker
  • Apptainer
  • Singularity

You can now test DENSE on the example data with the following command :

nextflow run i2bc/dense -profile <DOCKER|APPTAINER|SINGULARITY>,test

For example, if you have Docker installed on your machine, your command could be :

nextflow run i2bc/dense -profile docker,test

[!NOTE] The very first time you run DENSE, Nextflow will download the repository along with the appropriate container images from DockerHub. It takes about a minute and does not need to be repeated.

[!WARNING] Docker users may encounter the following error :

docker: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?. See 'docker run --help'.

In that case, restart Docker desktop (if appropriate) or follow this fix.

3. Download the NR (not mandatory)

In order to detect taxonomically restricted genes (TRGs), DENSE uses GenEra to search the Refseq Non-redundant protein database (NR).

To download and properly install the NR along with taxonomic data, you can follow these instructions.

[!WARNING] Do not install the nr database in the root directory of your device (i.e. "/nr.dmnd").

[!NOTE] The downloading step can take a couple of hours, but it is necessary to assess the absence of homology between your candidate genes and any other known protein-coding gene.
You can skip this step if you want to use your own user-defined TRG list instead (see Usage).

Input files

To run DENSE you always need a directory (--gendir) that contains a genomic FASTA file ('.fna', '.fasta') and a GFF3 annotation file ('.gff', '.gff3') for each genome (focal and neighbors, e.g. the mouse and some close rodents).

GFF3 files must have a classical CDS < mRNA < gene parent relationship between features.
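
Before launching a run, the directory layout described above can be checked with a few lines of Python. This is a minimal sketch under one assumption taken from the examples in this README: a genome's name is its file name without the extension (e.g. `favorite.fna` and `favorite.gff3` for the genome `favorite`). The function name is hypothetical.

```python
from pathlib import PurePath

FASTA_SUFFIXES = {".fna", ".fasta"}
GFF3_SUFFIXES = {".gff", ".gff3"}


def missing_genome_files(file_names):
    """Map each genome name to the kinds of file ("fasta", "gff3") it lacks."""
    kinds_by_genome = {}
    for file_name in file_names:
        path = PurePath(file_name)
        suffix = path.suffix.lower()
        if suffix in FASTA_SUFFIXES:
            kinds_by_genome.setdefault(path.stem, set()).add("fasta")
        elif suffix in GFF3_SUFFIXES:
            kinds_by_genome.setdefault(path.stem, set()).add("gff3")
    required = {"fasta", "gff3"}
    return {name: sorted(required - kinds)
            for name, kinds in kinds_by_genome.items()
            if kinds != required}


files = ["favorite.fna", "favorite.gff3", "cousin1.fna", "cousin1.gff3", "cousin2.fna"]
print(missing_genome_files(files))  # {'cousin2': ['gff3']}
```

On a real directory, the list of names can come from `os.listdir(gendir)`.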

If you want to use DENSE in the most complete way, you also need :

  • The phylogenetic tree that shows the relationships between the genomes, in Newick format (--tree)
  • A '.tsv' file with two columns : col1 = genome name, col2 = taxid. It must include all genomes, focal and neighbors (--taxids)

--taxids

[!TIP] Get the taxid of your species:

  • GFF3 files from NCBI have a header line of the form : ##species https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=<TAXID>
  • Find your organism on the NCBI Taxonomy browser.

Here is an example of a --taxids TSV file :

Droso_melanogaster	7227
Droso_virilis	7244
Droso_simulans	7240
...
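
Since the file must list every genome, a quick consistency check can save a failed run. The Python sketch below is an illustration only, with hypothetical function names; it assumes a tab-separated file without a header line, as in the example above.

```python
import csv


def parse_taxids(lines):
    """Parse 'genome<TAB>taxid' lines into a {genome_name: taxid} dictionary."""
    taxids = {}
    for row in csv.reader(lines, delimiter="\t"):
        if not row or not row[0].strip():
            continue  # skip empty lines
        if len(row) < 2 or not row[1].strip().isdigit():
            raise ValueError(f"expected 'genome<TAB>taxid', got: {row!r}")
        taxids[row[0].strip()] = int(row[1].strip())
    return taxids


def genomes_without_taxid(genome_names, taxids):
    """Return the genomes (focal or neighbors) that have no taxid entry."""
    return sorted(set(genome_names) - set(taxids))


lines = ["Droso_melanogaster\t7227", "Droso_virilis\t7244", "Droso_simulans\t7240"]
taxids = parse_taxids(lines)
print(genomes_without_taxid(["Droso_melanogaster", "Droso_virilis"], taxids))  # []
```

To read an actual file, pass an open file handle (opened with `newline=""`) instead of the list of lines.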

--trgs

Here is an example of a --trgs file :

rna-NM_181684.3
rna-XM_047421177.1
rna-XM_047438033.1
rna-XM_047438039.1
rna-NM_001012416.1
...

GFF3 files

The GFF3 annotation files must be compliant with the gff3 specifications.
In addition, they must have 'CDS' features with 'mRNA' parents, and these 'mRNA' features must have 'gene' features as 'Parent'.
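
The parent relationship required above can be verified with a short Python sketch. It is not part of DENSE and does not replace a full GFF3 validator: it only looks at the type column and at the `ID` and `Parent` attributes, and the function name is hypothetical.

```python
def check_gff3_hierarchy(lines):
    """Return a list of messages for CDS features lacking a CDS < mRNA < gene chain."""
    feature_type = {}  # feature ID -> feature type (column 3)
    parent_ids = {}    # feature ID -> list of Parent IDs
    cds_records = []   # (line number, Parent IDs) for every CDS line
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        if line.startswith("##FASTA"):
            break  # embedded sequences follow, no more features
        if not line or line.startswith("#"):
            continue
        columns = line.split("\t")
        if len(columns) != 9:
            continue  # malformed lines are out of scope for this sketch
        attributes = dict(item.split("=", 1)
                          for item in columns[8].split(";") if "=" in item)
        parents = [p for p in attributes.get("Parent", "").split(",") if p]
        if "ID" in attributes:
            feature_type[attributes["ID"]] = columns[2]
            parent_ids[attributes["ID"]] = parents
        if columns[2] == "CDS":
            cds_records.append((number, parents))

    problems = []
    reported_mrnas = set()
    for number, parents in cds_records:
        mrnas = [p for p in parents if feature_type.get(p) == "mRNA"]
        if not mrnas:
            problems.append(f"line {number}: CDS has no mRNA parent")
        for mrna in mrnas:
            has_gene = any(feature_type.get(p) == "gene" for p in parent_ids[mrna])
            if not has_gene and mrna not in reported_mrnas:
                reported_mrnas.add(mrna)
                problems.append(f"{mrna}: mRNA has no gene parent")
    return problems
```

An empty list means that every CDS line has an mRNA parent which itself has a gene parent.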

Usage

Quick start

Full pipeline (steps A,B,C,D) with phylostratigraphy

[!WARNING]
The phylostratigraphy is performed with GenEra. It requires several dozen GB of RAM (and as many CPUs as possible).

command

nextflow run i2bc/dense -profile <DOCKER|APPTAINER|SINGULARITY> -c yourparams.config

config file

```
params {

    gendir    = "/PATH/"           // a directory that contains the genome FASTA and GFF3
    focal     = "name_of_the_focal_genome"

    tree      = "/PATH/tree.nwk"   // a tree with the same names as the genome files
    genera_db = "/PATH/nr.dmnd"    // the diamond database
    taxids    = "/PATH/taxids.tsv" // see the Input files section

}
```

Short pipeline (steps C,D) without phylostratigraphy

command

nextflow run i2bc/dense -profile <DOCKER|APPTAINER|SINGULARITY> -c yourparams.config

config file

```
params {

    gendir    = "/PATH/"           // a directory that contains the genome FASTA and GFF3
    focal     = "name_of_the_focal_genome"

    tree      = "/PATH/tree.nwk"   // a tree with the same names as the genome files
    trgs      = "/PATH/trgs.txt"   // a single column file with CDS IDs (from GFF3). Their parent genes are assumed to be taxonomically restricted. See the Input files section

}
```

[!NOTE] DENSE runs a tBLASTn of all translated TRG CDS against the neighbor genomes. For some large genomes, a few queries can have several million hits (e.g. repeated elements), which can slow down the analysis.
For example, with 16 CPUs per task (neighbor), the human (GRCh38.p14) tBLASTn finishes in about 10 days, whereas the mouse genome takes about 30 hours and the yeast (S. cer) only a few minutes.

Lucy example

Lucy has a favorite species. She wants to collect genes from that species with the best indications of de novo emergence.
Therefore, she runs a complete DENSE analysis.
Since her HPC's admin does not like Docker (they all do), she uses Apptainer.

command

nextflow run i2bc/dense -profile apptainer -c Lucy.config

config file

Lucy.config content :

```
params {

    gendir    = "../GENOMES/"      // a directory that contains favorite.fna, favorite.gff3, cousin1.fna, cousin1.gff3, cousin2.fna, cousin2.gff3, ...
    focal     = "favorite"         // the name of the focal genome

    tree      = "family_tree.nwk"  // a tree with the same names as the genome files
    genera_db = "/PATH/nr.dmnd"    // the diamond database
    taxids    = "taxids.tsv"       // see the Input files section

}
```

Luca example

Luca has a dozen annotated strains of his most cherished yeast. He wants to know whether his first strain has genes that seem to have emerged de novo, by comparison with the eleven other strains.
He already has a list of orphan genes for this yeast, so he provides it to DENSE with trgs (which basically skips steps A and B).
He does not know the evolutionary relationship between the genomes (no tree and strategy = 2).
He does not care about checking the synteny (synteny = false).
He changed his mind about these options in the middle of a first analysis, so this time he uses -resume to reuse pre-computed steps.

command

nextflow run i2bc/dense -profile docker -c Luca.config -resume

config file

Luca.config content :

```
params {

    gendir   = "input_file/"              // a directory that contains strain1.fna, strain1.gff3, strain2.fna, strain2.gff3, etc...
    focal    = "strain1"                  // the name of the focal genome

    trgs     = "list_of_orphan_genes.txt" // see the Input files section
    strategy = 2                          // select any TRG with a non-coding homolog region (and no coding homolog) of a neighbor.
    synteny  = false                      // turn off synteny checking

}
```

[!TIP] Find out more ways to use options in Nextflow : configs
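
The parameter combinations used in the examples above can be summarized as a small sanity check. The Python sketch below encodes one reading of this README, not DENSE's own validation: the function name is hypothetical, the rule that strategy 1 needs a tree is inferred from the outgroup definition, and PARAMETERS.md remains the reference for defaults and other options.

```python
def check_params(params):
    """Return a list of problems found in a dict mirroring the 'params' block."""
    problems = []
    # Every example config provides the genome directory and the focal name.
    for key in ("gendir", "focal"):
        if not params.get(key):
            problems.append(f"'{key}' is missing")
    # Without a user-defined TRG list, the phylostratigraphy inputs are needed.
    if not params.get("trgs"):
        for key in ("genera_db", "taxids"):
            if not params.get(key):
                problems.append(f"'{key}' is needed when no 'trgs' list is given")
    # Inferred rule: outgroups can only be defined with a tree.
    if params.get("strategy") == 1 and not params.get("tree"):
        problems.append("strategy 1 relies on outgroups, which need a 'tree'")
    return problems


luca = {"gendir": "input_file/", "focal": "strain1",
        "trgs": "list_of_orphan_genes.txt", "strategy": 2, "synteny": False}
print(check_params(luca))  # []
```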

Options

see PARAMETERS.md

Pipeline output

  • denovogenes.tsv : this is the main output. A two-column TSV file (col1: gene, col2: CDS).
  • TRGmatchmatrix.tsv : summarizes the presence/absence of a homolog of every TRG coding sequence in each of the provided genomes.
  • TRG_table.tsv : details all homolog names/coordinates
  • directories with some useful precomputed intermediate files :
    • genera_results
    • diamondblast_out
    • orthologs
    • blast_out
    • synteny
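
For downstream analyses, the main output can be loaded with a few lines of Python. This is a minimal sketch with a hypothetical function name; it assumes the two columns described above (gene, then CDS) separated by tabs, and no header line. If your file starts with a header, skip its first line.

```python
import csv
from collections import defaultdict


def group_cds_by_gene(lines):
    """Group the CDS identifiers (col2) under their gene identifier (col1)."""
    cds_by_gene = defaultdict(list)
    for row in csv.reader(lines, delimiter="\t"):
        if len(row) < 2:
            continue  # skip empty or incomplete lines
        gene, cds = row[0].strip(), row[1].strip()
        cds_by_gene[gene].append(cds)
    return dict(cds_by_gene)


# Usage on a real run (file name taken from the list above):
# with open("denovogenes.tsv", newline="") as handle:
#     candidates = group_cds_by_gene(handle)
#     print(len(candidates), "candidate de novo genes")
```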

Credits

We thank the following people for their extensive assistance in the development of this pipeline:

Ambre Baumann
Simon Herman

Citations

If you use DENSE for your analysis, please cite it using the following doi: https://doi.org/10.1093/gbe/evae159

An extensive list of references for the tools used by the pipeline can be found in the CITATIONS.md file.

Owner

  • Name: i2bc
  • Login: i2bc
  • Kind: organization
  • Location: France

Citation (CITATIONS.md)

# proginski/dense: Citations

## [Nextflow](https://pubmed.ncbi.nlm.nih.gov/28398311/)

> Di Tommaso P, Chatzou M, Floden EW, Barja PP, Palumbo E, Notredame C. Nextflow enables reproducible computational workflows. Nat Biotechnol. 2017 Apr 11;35(4):316-319. doi: 10.1038/nbt.3820. PubMed PMID: 28398311.

## Pipeline tools

- [AWK](https://collaborate.princeton.edu/en/publications/awk-a-pattern-scanning-and-processing-language)

  > Aho, A. V., Kernighan, B. W., & Weinberger, P. J. (1979). Awk — a pattern scanning and processing language. Software: Practice and Experience, 9(4), 267-279.

- [BEDTools](https://academic.oup.com/bioinformatics/article/26/6/841/244688)

  > Aaron R. Quinlan, Ira M. Hall, BEDTools: a flexible suite of utilities for comparing genomic features, Bioinformatics, Volume 26, Issue 6, March 2010, Pages 841–842

- [GffRead](https://f1000research.com/articles/9-304/v2)

  > Pertea G and Pertea M. GFF Utilities: GffRead and GffCompare [version 2; peer review: 3 approved]. F1000Research 2020, 9:304

- [GenEra](https://doi.org/10.1186/s13059-023-02895-z)

  > Barrera-Redondo, J., Lotharukpong, J.S., Drost, H.G., Coelho, S.M. (2023). Uncovering gene-family founder events during major evolutionary transitions in animals, plants and fungi using GenEra. Genome Biology, 24, 54.

- [BLAST](https://www.sciencedirect.com/science/article/abs/pii/S0022283605803602?via%3Dihub)

  > Altschul, S.F., Gish, W., Miller, W., Myers, E.W., Lipman, D.J. (1990) “Basic local alignment search tool.” J. Mol. Biol. 215:403-410.

- [MultiQC](https://pubmed.ncbi.nlm.nih.gov/27312411/)

  > Ewels P, Magnusson M, Lundin S, Käller M. MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics. 2016 Oct 1;32(19):3047-8. doi: 10.1093/bioinformatics/btw354. Epub 2016 Jun 16. PubMed PMID: 27312411; PubMed Central PMCID: PMC5039924.

## Software packaging/containerisation tools

- [Docker](https://dl.acm.org/doi/10.5555/2600239.2600241)

  > Merkel, D. (2014). Docker: lightweight linux containers for consistent development and deployment. Linux Journal, 2014(239), 2. doi: 10.5555/2600239.2600241.

- [Singularity](https://pubmed.ncbi.nlm.nih.gov/28494014/)

  > Kurtzer GM, Sochat V, Bauer MW. Singularity: Scientific containers for mobility of compute. PLoS One. 2017 May 11;12(5):e0177459. doi: 10.1371/journal.pone.0177459. eCollection 2017. PubMed PMID: 28494014; PubMed Central PMCID: PMC5426675.

GitHub Events

Total
  • Issues event: 4
  • Watch event: 4
  • Issue comment event: 10
  • Push event: 2
  • Create event: 1
Last Year
  • Issues event: 4
  • Watch event: 4
  • Issue comment event: 10
  • Push event: 2
  • Create event: 1

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 2
  • Total pull requests: 0
  • Average time to close issues: 3 months
  • Average time to close pull requests: N/A
  • Total issue authors: 2
  • Total pull request authors: 0
  • Average comments per issue: 2.0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 2
  • Pull requests: 0
  • Average time to close issues: 3 months
  • Average time to close pull requests: N/A
  • Issue authors: 2
  • Pull request authors: 0
  • Average comments per issue: 2.0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • Elmirareza (1)
  • MargauxAubel (1)
  • Galaxy-228 (1)
Pull Request Authors
  • Galaxy-228 (1)
Top Labels
Issue Labels
Pull Request Labels

Dependencies

modules/nf-core/custom/dumpsoftwareversions/meta.yml cpan
bin/requirements.txt pypi
  • PyYAML *
  • configargparse *
  • csv *
  • dendropy *
  • numpy *
  • pandas *
  • prettytable *
  • pybedtools *