https://github.com/clavellab/genome-assembly

A Snakemake workflow assembling bacterial genomes according to the standard operating procedure in the Clavel Lab

Science Score: 13.0%

This score indicates how likely this project is to be science-related based on various indicators:

○
CITATION.cff file
✓
codemeta.json file
Found codemeta.json file
○
.zenodo.json file
○
DOI references
○
Academic publication links
○
Academic email domains
○
Institutional organization owner
○
JOSS paper metadata
○
Scientific vocabulary similarity
Low similarity (13.6%) to scientific vocabulary

Keywords

bacterial-genomes genome-assembly snakemake snakemake-workflows

Last synced: 9 months ago · JSON representation

Repository

A Snakemake workflow assembling bacterial genomes according to the standard operating procedure in the Clavel Lab

Basic Info

Host: GitHub
Owner: ClavelLab
License: mit
Language: Python
Default Branch: main
Homepage:
Size: 114 KB

Statistics

Stars: 1
Watchers: 1
Forks: 1
Open Issues: 0
Releases: 0

Topics

bacterial-genomes genome-assembly snakemake snakemake-workflows

Created about 3 years ago · Last pushed about 1 year ago

Metadata Files

Readme License

Snakemake workflow: genome-assembly-clavel

A Snakemake workflow for assembling bacterial genomes according to the standard operating procedure in the Clavel Lab and assessing their quality in accordance to MIMAG and SeqCode criteria.

Usage

First, the different conda environments can be installed using the following command:

snakemake --use-conda --conda-create-envs-only

Then, the workflow can be used with the default configuration file and 9 threads as follows:

snakemake -c 9 --use-conda

Or using a custom one:

snakemake --configfile config/er2.yaml -c 9 --use-conda

Notes

The default rerun behavior of Snakemake will rerun the workflow for any possible changes in input, code, parameters, modification time or software environment. However, this implies that the workflow downstream of the plasmid extraction will be rerun everytime because of the presence of checkpoints. In a situation where the user knows that some steps do not need to be run again (e.g., plasmid extraction, contigs curation), the flag --rerun-triggers {mtime,params,software-env,code} (instead of the default: --rerun-triggers {mtime,params,input,software-env,code}) can be added at the user's responsibility.
The assembly workflow is done by default only with the paired reads, but it is now possible to use the unpaired reads (quality-filtered, adapters- and phix-removed) when needed (e.g., low quality reverse reads that bring the paired reads number too low). To enable this feature, one must change the default git branch with the following command: git checkout with-unpaired and run the workflow according to the Usage section.

Input/Output

The documentation provides more details on the configuration of the workflow, but in summary:

Input:

A tab-separated table indicating the isolate name and the path to paired-end FASTQ files
A configuration file

Output:

The genome of the isolate (with the plasmids sequences if any) in a gzipped FASTA file format in results/genome/isolate.combined.fa.gz
The plasmids sequences of the isolates in a gzipped FASTA file format in results/plasmids/isolate.plasmids.fa.gz
A comma-separated table (in results) summarising the metadata on the generated genomes with for example:
- paths to the previous files and their md5sums
- genome quality metrics and the software used to calculate them (e.g., completion, contamination, tRNAs, 16S, 23S, 5S)
- boolean flags indicating which genome quality tests passed or failed

The column of the output table are detailed in the next section with examples.

Output table metadata details

| Metadata | Definition | Example | |----------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|----------------------------------------------| | isolate | Identifier of the isolate provided in the sample table | CLAXXH123 | | genome_md5 | Checksum of the FASTA file of the genome | db2cd4eff6d40c463998851f43b15670 | | assembly_qual | Assembly quality, either “High-quality draft” if all quality flags are valid or “Manual review:” with the quality flags that were invalid. | Manual review: arecontigsless100 | | `genomelength| Size of the genome in nucleotide based on the length of all contigs in the assembly | 4839229 | |numbercontig| Number of contigs in the assembly | 134 | |N50| Median length of the contigs in the assembly | 83304 | |numbercontigbelow1kb| Number of small contigs, under 1000 nucleotides-long | 202 | |maxcontiglength| Length of the longest contig | 291966 | |coverage| Ratio of the basepairs count of raw sequences over the basepairs count of the assembly | 394 | |assemblysoftware| Name of the software used for the assembly | SPAdes | |complscore| Percentage of estimated marker-based completion of the assembly | 99.46 | |complsoftware| Name of the software used for assessing the assembly completion | checkm | |contamscore| Percentage of estimated marker-based contamination of the assembly | 0 | |contamsoftware| Name of the software used for assessing the assembly contamination | checkm | |16SSSUrRNAlength| Semicolon list of lengths of detected 16S rRNA sequences. Lengths are sorted decreasing or a value of 0 when no sequences are detected. | 1393;1037;597;528 | |SSUrecoversoftware| Name of the software used for detecting the 16S rRNA sequences | metaxa2 | |23SLSUrRNAlength| Semicolon list of lengths of detected 23S rRNA sequences. Lengths are sorted decreasing or a value of 0 when no sequences are detected. | 1637;795;528 | |LSUrecoversoftware| Name of the software used for detecting the 23S rRNA sequences | metaxa2 | |trnas| Number of unique essential tRNAs detected in the assembly among the following twenty: Ala, Arg, Asn, Asp, Cys, Gln, Glu, Gly, His, Ile, Leu, Lys, Met, Phe, Pro, Ser, Thr, Trp, Tyr and Val. | 20 | |trnaextsoftware| Name of the software used for detecting tRNAs sequences | tRNAscan-SE | |5SrRNAlength| Semicolon list of lengths of detected 5S rRNA sequences. Lengths are sorted decreasing or a value of 0 when no sequences are detected. | 110;110;106;105 | |archivefile| Path to the archive of the genome file | results/genome/CLAJMH5.genome.fa.gz | |archivefilemd5| Checksum of the archive of the genome file | db2cd4eff6d40c463998851f43b15670 | |forwardfile` | Path to the forward/R1 raw sequence file | /DATA/hibcfastq/CLAJMH5S16R1001.fastq.gz | | `forwardfilemd5| Checksum of the archive of the forward raw sequence file | 1ff2885fe846135980b481996d8e1da6 | |reversefile| Path to the reverse/R2 raw sequence file | /DATA/hibc_fastq/CLAJMH5_S16_R2_001.fastq.gz | |reversefilemd5| Checksum of the archive of the reverse raw sequence file | 3acd93ba96b97aac4da7756c83f2f990 | |sequencecount| Number of reads in the library (sequencing depth) | 6314677 | |basepairscount| Number of basepairs in nucleotides | 1907032454 | |averagelength` | The average read length estimated as basepairscount divided by sequencecount | 151 | | `sequencecountqual| Number of reads in the library (sequencing depth) after quality filtering | 4837051 | |basepairscountqual| Number of basepairs in nucleotides after quality filtering | 1324608613 | |adaptersfile| Path to the FASTA file containing the adapters sequences | resources/NexteraPE-PE.fa | |plasmidlength| Semicolon list of lengths of detected plasmid sequences. Lengths are sorted decreasing or a value of 0 when no sequences are detected. | 7059 | |iscomplgrtr90| Logical flag indicating whether the completion, in compl_score, is greater than 90% | TRUE | |iscontamless5` | Logical flag indicating whether the contamination, contamscore, is greater than 5% | TRUE | | is_coverage_grtr_10 | Logical flag indicating whether the coverage is greater than 10x | TRUE | | are_contigs_less_100 | Logical flag indicating whether the number of contigs, in numbercontig, is below 100 | FALSE | | `isN50grtr25kb| Logical flag indicating whether the N50 is longer than 25 000 nucleotides | TRUE | |ismaxcontiggrtr100kb| Logical flag indicating whether the longest contig, in max_contig_length, is longer than 100 000 nucleotides | TRUE | |istrnasgrtr18| Logical flag indicating whether the number of tRNAs is greater than 18 | TRUE | |isSSUgrtr0| Logical flag indicating whether any 16S/SSU rRNA sequence longer than 0 nucleotide is detected | TRUE | |isLSUgrtr0| Logical flag indicating whether any 23S/LSU rRNA sequence longer than 0 nucleotide is detected | TRUE | |is5Sgrtr0| Logical flag indicating whether any 5S rRNA sequence longer than 0 nucleotide is detected | TRUE | |assemblydate| Date on which the assembly was generated | 2022-10-12 | |workflowversion` | Git tag of the genome assembly workflow used. If not a round version like v5.2, then the commit hash past the round version is added | v5.2-1-g4dae043 |

Changelog

v7.2.3: Fix SPades dependencies of dependencies on Python < 3.12.
v7.2.2: Fix tRNAs and 5S parsing from updated Bakta annotation.
v7.2.1: Fix bugs related to Bakta version bump (force outdir and explicit column names)
v7.2: Update Bakta to 1.9.3 and the associated database should be minimum v5.0
v7.0: Concatenate genome and plasmids sequences (if any). Add plasmids proteins annotations.
v6.1: Fix conda environments setup by updating snakemake wrappers.
v6.0: Checksums for both the genome archive and FASTA file are computed. Improved the table. Assembly with unpaired reads is possible.
v5.3.2: Fixes for conda environments
v5.3: Update bakta (fix protein predictions) and minor updates ofcheckm, metaxa2 and SPades. Better documentation of the column of the output table.
v5.2: Recovered the behavior of length of 0bp when LSU/SSU/plasmids are not detected.
v5.1: Fix issue in plasmid length summary. Remove bakta compliant mode.
v5.0: Better summary with lengths of (multiple) 16S, 23S, 5S and plasmids. Set the minimum contig size to 500bp. Remove MDMcleaner step. Fix incorrect N50 and contig number by QUAST. Note on tRNAs.
v4.2: Handle exceptions from MDMcleaner. Fix plasmid workflow to not trigger unneeded rules.
v4.1: Fix issue with the input function to keep the workflow flexible with the plasmid assembly
v4.0: Fix bugs such as issue with Recycler failing with unconnected plasmid assembly graph or missing gene names in annotations. Add assembly date and workflow version to the summary table. Better documentation regarding the rerun behavior of Snakemake
v3.2: Make sure to use at least the version v1.5 of bakta
v3.1: Reduce disk space used by flagging fast-to-generate output files as temporary (e.g., trimmed sequences, BAM file)
v3.0: Improved README (especially for the configuration in config) and fixed bugs. Failed quality criteria are correctly displayed along the aggregated table, plasmids are produced only when detected and a length of 0 bp is outputted when LSU/SSU are not detected.
v2.0: Extended the assembly quality assessment to evaluate the compliance to the MIMAG and SeqCode criteria and output an aggregated summary table
v1.1: Added a quality control by CheckM
v1.0: Initial conversion of the Standard Operating Procedure used in the lab to a Snakemake workflow

Owner

Name: The Clavel lab
Login: ClavelLab
Kind: organization
Location: Germany

Website: https://www.ukaachen.de/kliniken-institute/institut-fuer-medizinische-mikrobiologie/forschung/ag-clavel/
Twitter: clavellab
Repositories: 1
Profile: https://github.com/ClavelLab

This is the official GitHub account for the research group of Prof. Thomas Clavel.

GitHub Events

Total

Release event: 1
Push event: 1
Fork event: 1
Create event: 2

Last Year

Release event: 1
Push event: 1
Fork event: 1
Create event: 2

Issues and Pull Requests

Last synced: over 1 year ago

All Time

Total issues: 0
Total pull requests: 0
Average time to close issues: N/A
Average time to close pull requests: N/A
Total issue authors: 0
Total pull request authors: 0
Average comments per issue: 0
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 0
Pull requests: 0
Average time to close issues: N/A
Average time to close pull requests: N/A
Issue authors: 0
Pull request authors: 0
Average comments per issue: 0
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0

View more stats

Top Authors

Issue Authors

Pull Request Authors

Top Labels

Issue Labels

Pull Request Labels

Dependencies

.github/workflows/conventional-prs.yml actions

amannn/action-semantic-pull-request v3.4.0 composite

.github/workflows/main.yml actions

actions/checkout v2 composite
github/super-linter v4 composite
snakemake/snakemake-github-action v1.24.0 composite

.github/workflows/release-please.yml actions

GoogleCloudPlatform/release-please-action v2 composite

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Open Source Science

https://github.com/clavellab/genome-assembly

Science Score: 13.0%

Keywords

Repository

Basic Info

Statistics

Topics

Metadata Files

README.md

Snakemake workflow: genome-assembly-clavel

Usage

Notes

Input/Output

Output table metadata details

Changelog

Owner

GitHub Events

Total

Last Year

Issues and Pull Requests

All Time

Past Year

Top Authors

Issue Authors

Pull Request Authors

Top Labels

Issue Labels

Pull Request Labels

Dependencies