fair_genome_indexer

Download and index Ensembl sequences and annotations, remove non-canonical chromosomes, remove transcripts with a low transcript support level (TSL), and index with multiple tools

https://github.com/tdayris/fair_genome_indexer

Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (6.9%) to scientific vocabulary

Keywords

ensembl fair snakemake snakemake-workflow snakemake-wrappers
Last synced: 6 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: tdayris
  • License: mit
  • Language: Python
  • Default Branch: main
  • Homepage:
  • Size: 1.02 MB
Statistics
  • Stars: 0
  • Watchers: 1
  • Forks: 0
  • Open Issues: 0
  • Releases: 38
Created over 2 years ago · Last pushed 8 months ago
Metadata Files
Readme Changelog License Citation

README.md

Snakemake workflow used to deploy genome sequences and annotations and to build basic indexes for them.

It is intended for teaching, as an example of FAIR principles applied with Snakemake.

Usage

The usage of this workflow is described in the Snakemake workflow catalog. It is also available locally on a single page.

Results

The expected results of this pipeline are described here.

Material and methods

The tools used in this pipeline are described here textually.

Step by step

Get DNA sequences

| Step | Commands |
| -------------------------------- | ---------------- |
| Download DNA Fasta from Ensembl | ensembl-sequence |
| Remove non-canonical chromosomes | pyfaidx |
| Index DNA sequence | samtools |
| Create sequence dictionary | picard |

┌────────────────────────────────────────┐
│Download Ensembl Sequence (wget + gzip) │
└──────────────────┬─────────────────────┘
                   │
                   │
┌──────────────────▼────────────────────────┐
│Remove non-canonical chromosomes (pyfaidx) │
└──────────────────┬──────────────────────┬─┘
                   │                      │
                   │                      │
┌──────────────────▼──────────┐         ┌─▼───────────────────────────────────┐
│Index DNA Sequence (samtools)│         │Create sequence dictionary (Picard) │
└─────────────────────────────┘         └─────────────────────────────────────┘
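The chromosome filtering step can be illustrated with a minimal, pure-Python sketch. It is not the workflow's implementation (which relies on pyfaidx). The function name `keep_canonical` and the `CANONICAL` pattern are hypothetical, and the pattern assumes a human-like genome where canonical sequences are named 1 to 22, X, Y and MT.

```python
import re
from typing import Iterable, Iterator

# Assumption: canonical sequences are autosomes 1-22 plus X, Y and MT.
# Other organisms need a different pattern.
CANONICAL = re.compile(r"^(?:[1-9]|1[0-9]|2[0-2]|X|Y|MT)$")


def keep_canonical(lines: Iterable[str], pattern: re.Pattern = CANONICAL) -> Iterator[str]:
    """Yield only the FASTA records whose sequence name matches `pattern`."""
    keep = False
    for line in lines:
        if line.startswith(">"):
            # The sequence name is the first whitespace-delimited token of the header.
            fields = line[1:].split()
            keep = bool(fields) and pattern.match(fields[0]) is not None
        if keep:
            yield line


# Toy input: one chromosome, one unplaced scaffold, one mitochondrial record.
fasta = [
    ">1 dna:chromosome\n",
    "ACGTACGTACGT\n",
    ">KI270728.1 dna:scaffold\n",
    "TTTTGGGG\n",
    ">MT dna:chromosome\n",
    "GATTACA\n",
]
filtered = list(keep_canonical(fasta))
```

Streaming line by line keeps memory use flat, which matters for whole-genome FASTA files. An indexed reader such as pyfaidx makes the same selection without scanning every record.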

Get genome annotation (GTF)

| Step | Commands |
| ---------------------------------------------------------- | ------------------ |
| Download GTF annotation | ensembl-annotation |
| Fix format errors | Agat |
| Remove non-canonical chromosomes, based on above DNA Fasta | Agat |
| Remove <NA> Transcript support levels | Agat |
| Convert GTF to GenePred format | gtf2genepred |

┌─────────────────────────────────────────┐
│Download Ensembl Annotation (wget + gzip)│
└─────────────┬───────────────────────────┘
              │
              │
┌─────────────▼─────────┐
│Fix format Error (Agat)│
└─────────────┬─────────┘
              │
              │
┌─────────────▼─────────────────────────┐           ┌────────────────────────────────────────┐
│Remove non-canonical chromosomes (Agat)◄───────────┤Fasta sequence index (see Get DNA Fasta)│
└─────────────┬─────────────────────────┘           └────────────────────────────────────────┘
              │
              │
┌─────────────▼───────────────────────┐
│Remove <NA> transcript levels (Agat) │
└─────────────┬───────────────────────┘
              │
              │
┌─────────────▼────────────────┐
│Convert GTF to GenePred (UCSC)│
└──────────────────────────────┘
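As a rough illustration of the transcript support level filter, the sketch below drops GTF records whose `transcript_support_level` attribute is `NA`. It is a stand-in for the Agat step, not its implementation. The helper name `has_known_tsl` is hypothetical, and keeping records that lack the attribute (gene records, for instance) is an assumption made for this example.

```python
import re

# Ensembl GTF attributes look like: key "value"; key "value";
TSL = re.compile(r'transcript_support_level "([^"]*)"')


def has_known_tsl(gtf_line: str) -> bool:
    """Return False only for records whose transcript_support_level is NA."""
    if gtf_line.startswith("#"):
        return True  # header and comment lines are always kept
    match = TSL.search(gtf_line)
    if match is None:
        return True  # assumption: records without the attribute are kept
    return not match.group(1).startswith("NA")


prefix = "1\thavana\ttranscript\t1\t100\t.\t+\t.\t"
records = [
    "#!genome-build GRCh38",
    prefix + 'gene_id "G1"; transcript_id "T1"; transcript_support_level "1";',
    prefix + 'gene_id "G1"; transcript_id "T2"; transcript_support_level "NA";',
    "1\thavana\tgene\t1\t100\t.\t+\t.\t" + 'gene_id "G1";',
]
kept = [line for line in records if has_known_tsl(line)]
```

A real filter also has to drop the exons and CDS features that belong to a removed transcript, which is one reason to prefer a dedicated tool over line matching.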

Get transcripts sequence

| Step | Commands |
| --------------------------------------------------------- | -------- |
| Extract transcript sequences from above DNA Fasta and GTF | gffread |
| Index DNA sequence | samtools |
| Create sequence dictionary | picard |

┌───────────────────────────────┐       ┌─────────────────────────────┐
│GTF (see get genome annotation)│       │DNA Fasta (See get dna fasta)│
└────────────────────┬──────────┘       └────────┬────────────────────┘
                     │                           │
                     │                           │
              ┌──────▼───────────────────────────▼─────┐
              │Extract transcripts sequences (gffread) │
              └──────┬───────────────────────────┬─────┘
                     │                           │
                     │                           │
┌────────────────────▼────┐             ┌────────▼───────────────────────────┐
│Index sequence (samtools)│             │Create sequence dictionary (Picard) │
└─────────────────────────┘             └────────────────────────────────────┘

Get cDNA sequences

| Step | Commands |
| ----------------------------------------------------- | -------- |
| Extract coding transcripts from above GTF | Agat |
| Extract coding sequences from above DNA Fasta and GTF | gffread |
| Index DNA sequence | samtools |
| Create sequence dictionary | picard |

┌───────────────────────────────┐       ┌─────────────────────────────┐
│GTF (see get genome annotation)│       │DNA Fasta (See get dna fasta)│
└────────────────────┬──────────┘       └────────┬────────────────────┘
                     │                           │
                     │                           │
              ┌──────▼───────────────────────────▼─────┐
              │Extract cDNA        sequences (gffread) │
              └──────┬───────────────────────────┬─────┘
                     │                           │
                     │                           │
┌────────────────────▼────┐             ┌────────▼───────────────────────────┐
│Index sequence (samtools)│             │Create sequence dictionary (Picard) │
└─────────────────────────┘             └────────────────────────────────────┘

Get dbSNP variants

| Step | Commands |
| -------------------------------- | ------------------ |
| Download dbSNP variants | ensembl-variation |
| Filter non-canonical chromosomes | pyfaidx + BCFTools |
| Index variants | tabix |

```
┌──────────────────────────────────────────┐
│Download dbSNP variants (wget + bcftools) │
└──────────┬───────────────────────────────┘
           │
           │
┌──────────▼───────────────────────────────────────────┐
│Remove non-canonical chromosomes (bcftools + bedtools)│
└──────────┬───────────────────────────────────────────┘
           │
           │
┌──────────▼─────────────┐
│Index variants (tabix)  │
└────────────────────────┘
```

Get transcript_id, gene_id, and gene_name correspondence

| Step | Commands |
| ------------------------------------------------- | ---------- |
| Extract gene_id <-> gene_name correspondence | pyroe |
| Extract transcript_id <-> gene_id <-> gene_name | Agat + XSV |

┌────────────────────────────────┐
│Genome annotation (see get GTF) ├──────────────────┐
└──────┬─────────────────────────┘                  │
       │                                            │
       │                                            │
┌──────▼──────────────────────────────┐    ┌────────▼─────────────────────────────────────────────┐
│Extract gene_id <-> gene_name (pyroe)│    │Extract gene_id <-> gene_name <-> transcript_id (Agat)│
└──────┬──────────────────────────────┘    └────────┬─────────────────────────────────────────────┘
       │                                            │
       │                                            │
┌──────▼─────┐                             ┌────────▼────┐
│Format (XSV)│                             │Format (XSV) │
└────────────┘                             └─────────────┘
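The identifier table produced by this step can be approximated with a short sketch that reads the attribute column of each `transcript` record in a GTF. The workflow itself uses pyroe, Agat and XSV, so this is only an illustration. The function name `tx2gene` and the choice to emit an empty string for a missing `gene_name` are assumptions.

```python
import re

# Matches GTF attributes of the form: key "value";
ATTR = re.compile(r'(\w+) "([^"]*)"')


def tx2gene(gtf_lines):
    """Return sorted, de-duplicated (transcript_id, gene_id, gene_name) rows."""
    rows = set()
    for line in gtf_lines:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        # A GTF record has 9 tab-separated columns; column 3 is the feature type.
        if len(fields) < 9 or fields[2] != "transcript":
            continue
        attrs = dict(ATTR.findall(fields[8]))
        rows.add((
            attrs.get("transcript_id", ""),
            attrs.get("gene_id", ""),
            attrs.get("gene_name", ""),  # assumption: empty when absent
        ))
    return sorted(rows)


gtf = [
    "#!genome-build GRCh38\n",
    '1\thavana\tgene\t1\t100\t.\t+\t.\tgene_id "G1"; gene_name "ABC";\n',
    '1\thavana\ttranscript\t1\t100\t.\t+\t.\tgene_id "G1"; transcript_id "T1"; gene_name "ABC";\n',
    '1\thavana\ttranscript\t1\t90\t.\t+\t.\tgene_id "G2"; transcript_id "T2";\n',
]
table = tx2gene(gtf)
tsv = "\n".join("\t".join(row) for row in table)
```

Restricting the scan to `transcript` records avoids emitting one row per exon, and the set removes any remaining duplicates before formatting.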

Get blacklisted regions

| Step | Commands |
| ---------------------------- | ------------- |
| Download blacklisted regions | Github source |
| Merge overlapping intervals | bedtools |

```
┌────────────────────────────────┐
│Download known blacklists (wget)│
└────────────┬───────────────────┘
             │
             │
┌────────────▼──────────────────────────┐
│Merge overlapping intervals (bedtools) │
└───────────────────────────────────────┘
```
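The interval merging step follows a classic sort-and-sweep pattern. The sketch below shows the idea on in-memory BED-like tuples. The function name `merge_intervals` is hypothetical, and this is not how bedtools is invoked by the workflow. Like `bedtools merge` with its default distance of 0, it also merges book-ended intervals.

```python
def merge_intervals(intervals):
    """Merge overlapping or book-ended (chrom, start, end) intervals.

    Coordinates are assumed to be BED-style: 0-based start, exclusive end.
    """
    merged = []
    # Sorting by chromosome, then start, puts candidates for merging next to each other.
    for chrom, start, end in sorted(intervals):
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            # Overlaps or touches the previous interval: extend it.
            previous = merged[-1]
            merged[-1] = (chrom, previous[1], max(previous[2], end))
        else:
            merged.append((chrom, start, end))
    return merged


blacklist = [
    ("1", 10, 20),
    ("1", 15, 30),
    ("1", 30, 40),
    ("2", 5, 8),
    ("1", 50, 60),
]
merged = merge_intervals(blacklist)
```

After sorting, a single pass is enough, because any interval that overlaps an earlier one must overlap the most recently merged interval on the same chromosome.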

GenePred format

| Step | Commands |
| --------------- | ---------- |
| GTF to GenePred | UCSC-tools |

```
┌────────────────────────────────┐
│Genome annotation (see get GTF) │
└────────────┬───────────────────┘
             │
             │
┌────────────▼──────────────┐
│GTFtoGenePred (UCSC-tools) │
└───────────────────────────┘
```

2bit format

| Step | Commands |
| ------------- | ---------- |
| Fasta to 2bit | UCSC-tools |

```
┌────────────────────────────────┐
│Genome sequence (see get Fasta) │
└────────────┬───────────────────┘
             │
             │
┌────────────▼──────────────┐
│FaToTwoBit (UCSC-tools)    │
└───────────────────────────┘
```

STAR index

| Step | Commands |
| ---------- | -------- |
| STAR index | STAR |

```
┌────────────────────────────────┐
│Genome sequence (see get DNA)   │
└────────────┬───────────────────┘
             │
             │
     ┌───────▼────┐
     │ STAR index │
     └────────────┘
```

Bowtie2 index

| Step | Commands |
| ------------- | ------------- |
| Bowtie2 build | Bowtie2 build |

```
┌────────────────────────────────┐
│Genome sequence (see get DNA)   │
└────────────┬───────────────────┘
             │
             │
     ┌───────▼───────┐
     │ Bowtie2 index │
     └───────────────┘
```
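For readers who want to see what the STAR and Bowtie2 indexing steps amount to on the command line, the sketch below builds the argument lists for `STAR --runMode genomeGenerate` and `bowtie2-build`. The workflow itself delegates to Snakemake wrappers, so the helper names, file paths and thread counts here are hypothetical. Only the genome sequence is passed, matching the diagrams above.

```python
def star_index_command(fasta, index_dir, threads=1):
    """Argument list for building a STAR genome index from a FASTA file."""
    return [
        "STAR",
        "--runMode", "genomeGenerate",
        "--genomeDir", str(index_dir),
        "--genomeFastaFiles", str(fasta),
        "--runThreadN", str(threads),
    ]


def bowtie2_index_command(fasta, prefix, threads=1):
    """Argument list for building a Bowtie2 index with the given prefix."""
    return ["bowtie2-build", "--threads", str(threads), str(fasta), str(prefix)]


# Hypothetical paths, for illustration only.
star_cmd = star_index_command("genome.fasta", "star_index", threads=4)
bowtie2_cmd = bowtie2_index_command("genome.fasta", "bowtie2_index/genome", threads=4)
```

Each list can be handed to `subprocess.run(cmd, check=True)` once the corresponding tool is installed. Building lists rather than shell strings avoids quoting problems with unusual file names.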

Salmon decoy-aware gentrome index

| Step | Commands |
| -------------- | -------- |
| Generate decoy | Bash |
| Salmon index | Salmon |

┌─────────────────────────────┐        ┌─────────────────────────────────────┐
│Genome sequence (see get DNA)│        │Transcriptome sequence (see get cDNA)│
└──────────────────────────┬──┘        └─────┬───────────────────────────────┘
                           │                 │
                           │                 │
                           │                 │
                      ┌────▼─────────────────▼────┐
                      │Generate decoy and gentrome│
                      └─────────────┬─────────────┘
                                    │
 ┌─────────────────┐                │     ┌───────────────┐
 │Gentrome sequence◄────────────────┴─────►Decoy sequences│
 └────────────┬────┘                      └────┬──────────┘
              │                                │
              │                                │
              │       ┌──────────────┐         │
              └───────► Salmon index ◄─────────┘
                      └──────────────┘
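The decoy and gentrome generation can be sketched in a few lines of Python. The workflow does this in Bash, so this is an illustration only, and the function name `build_gentrome` is hypothetical. It follows the usual Salmon convention: the decoy list holds the names of the genome sequences, and the gentrome is the transcriptome followed by the genome.

```python
def build_gentrome(transcriptome_lines, genome_lines):
    """Return (gentrome_lines, decoy_names) for a decoy-aware Salmon index."""
    genome_lines = list(genome_lines)
    # Decoy names are the first token of each genome FASTA header.
    decoys = [
        line[1:].split()[0]
        for line in genome_lines
        if line.startswith(">") and line[1:].split()
    ]
    # Transcripts come first, decoy (genome) sequences last.
    gentrome = list(transcriptome_lines) + genome_lines
    return gentrome, decoys


transcriptome = [">T1 gene=G1\n", "ACGT\n"]
genome = [">1 dna:chromosome\n", "ACGTACGT\n", ">MT dna:chromosome\n", "GATTACA\n"]
gentrome, decoys = build_gentrome(transcriptome, genome)
```

The two outputs correspond to what `salmon index` expects through its `--transcripts` and `--decoys` options. Order matters here, because decoy sequences are expected to come after the transcripts in the gentrome.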

Owner

  • Name: tdayris
  • Login: tdayris
  • Kind: user
  • Company: Institut Gustave Roussy

Bioinformatician

Citation (CITATION.cff)

authors:
- family-names: Dayris
  given-names: Thibault
  orcid: https://orcid.org/0009-0009-2758-8450
cff-version: 1.2.0
date-released: '2025-06-13'
message: If you use this software, please cite it as below.
title: fair-genome-indexer
url: https://github.com/tdayris/fair_genome_indexer
version: 3.10.0

GitHub Events

Total
  • Release event: 6
  • Push event: 16
  • Create event: 6
Last Year
  • Release event: 6
  • Push event: 16
  • Create event: 6

Committers

Last synced: about 2 years ago

All Time
  • Total Commits: 72
  • Total Committers: 2
  • Avg Commits per committer: 36.0
  • Development Distribution Score (DDS): 0.347
Past Year
  • Commits: 72
  • Committers: 2
  • Avg Commits per committer: 36.0
  • Development Distribution Score (DDS): 0.347
Top Committers
Name Email Commits
tdayris t****s@g****r 47
tdayris t****s@o****r 25
Committer Domains (Top 20 + Academic)

Issues and Pull Requests

Last synced: about 2 years ago

All Time
  • Total issues: 0
  • Total pull requests: 1
  • Average time to close issues: N/A
  • Average time to close pull requests: 1 minute
  • Total issue authors: 0
  • Total pull request authors: 1
  • Average comments per issue: 0
  • Average comments per pull request: 0.0
  • Merged pull requests: 1
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 1
  • Average time to close issues: N/A
  • Average time to close pull requests: 1 minute
  • Issue authors: 0
  • Pull request authors: 1
  • Average comments per issue: 0
  • Average comments per pull request: 0.0
  • Merged pull requests: 1
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
Pull Request Authors
  • tdayris (1)
Top Labels
Issue Labels
Pull Request Labels
enhancement (1)