cps_extractor
Nextflow pipeline that extracts capsular locus sequences (CPS) and checks for disruptive mutations
README.md
CPS Extractor Pipeline
This pipeline is in the early stages of development and is not fully tested!
The CPS extractor pipeline is a Nextflow pipeline designed for processing Streptococcus pneumoniae reads to extract the capsular locus sequence (CPS) and check for disruptive mutations.
The pipeline is designed to be easy to set up and use, and is suitable for use on local machines and high-performance computing (HPC) clusters alike. Once you have downloaded the necessary Docker/Singularity images, the pipeline can be used offline unless you have changed the selection of any database or container image.
The development of this pipeline is part of the GPS Project (Global Pneumococcal Sequencing Project).
Workflow
The current pipeline workflow is as follows:
The pipeline takes S. pneumoniae reads and uses SeroBA to determine their serotype. It then assembles the reads using Shovill. Following this, a BLAST search compares the assembly to a database of reference CPS sequences, and Python code extracts the CPS sequence with the best BLAST hit for the given serotype. If any gaps are found in the extracted sequence, they are filled in using a consensus sequence method. The CPS sequence is annotated using Bakta and checked for disruptive mutations. A gene comparison plot for each sample versus the reference is made using clinker. Finally, Panaroo is used to assess gene content differences for individual genes in the CPS sequence.
Optionally, if you know the serotype of your samples, you can provide it with the --serotype option. In that case serotyping via SeroBA is skipped, a pangenome analysis of all your samples is performed using Panaroo, and all amino acid sequences for each gene are concatenated for easy alignment and tree building. Additionally, a gene comparison plot is generated for all samples using clinker.
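The best-hit extraction step of the workflow can be pictured with a minimal Python sketch. This is not the pipeline's actual code: the `Hit` record, its field names, and the rule of ranking hits by bit score are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Hit:
    """Hypothetical summary of one BLAST hit (field names are assumptions)."""
    reference_serotype: str  # serotype label of the reference CPS sequence
    contig: str              # assembly contig that produced the hit
    start: int               # 0-based start of the hit on the contig
    end: int                 # end of the hit on the contig (exclusive)
    bit_score: float


def best_hit_for_serotype(hits: Iterable[Hit], serotype: str) -> Optional[Hit]:
    """Return the highest-scoring hit against a reference of the given serotype."""
    candidates = [h for h in hits if h.reference_serotype == serotype]
    if not candidates:
        return None
    return max(candidates, key=lambda h: h.bit_score)


def extract_region(contig_sequence: str, hit: Hit) -> str:
    """Cut the hit region out of the contig sequence."""
    return contig_sequence[hit.start:hit.end]
```

Filtering by serotype before ranking keeps a strong hit against an unrelated reference from being selected; whether the pipeline ranks hits this way is not stated in this README.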
Output
Each sample will have its own results folder.
For example the sample 11826_1#37:
11826_1#37
├── 11826_1#37_blast_results.xml
├── 11826_1#37_cps.fa
├── 11826_1#37_cps.gff3
├── 11826_1#37_cps_mutations.csv
├── 11826_1#37_gene_comparison.csv
├── 11826_1#37_plot.html
├── cps_extractor.log
├── seroba_serotype_report.csv
├── proteins
│ ├── 11826_1#37-cpsC_protein.fa
│ ├── 11826_1#37-cpsD_protein.fa
└── snp_dists
    ├── cps4B_snp_dists.csv
    ├── cps4D_snp_dists.csv
Each results folder will contain the following:
- BLAST result XML file (sample_blast_results.xml)
- The CPS sequence (sample_cps.fa)
- The CPS annotation (sample_cps.gff3)
- Disruptive mutations file (sample_cps_mutations.csv)
- A log file (cps_extractor.log) containing the logs from the CPS extraction
- A gene comparison plot (sample_plot.html) generated by clinker
- A serotype report (seroba_serotype_report.csv)
- A gene comparison file (sample_gene_comparison.csv) showing any differences in gene order between the sample and reference
- A snp_dists folder which contains SNP distance CSV files for each gene in the sample and reference CPS annotations
- A proteins folder which contains the amino acid sequence for each gene in the sample
If you run the pipeline using the --serotype argument, the pangenome analysis results will be in the panaroo_pangenome_results folder and there will be a proteins folder containing amino acid sequences per gene for all samples. There will also be a plot.html file generated by clinker in the output folder.
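As a quick sanity check after a run, the per-sample layout listed above can be verified with a small Python sketch. The function names are hypothetical, and the sketch assumes the results folder is named after the sample, as in the example. For runs that use --serotype, SeroBA is skipped, so the serotype report is presumably absent and would be reported here as missing.

```python
from pathlib import Path
from typing import List


def expected_outputs(sample: str) -> List[str]:
    """Per-sample files and folders listed in the Output section."""
    return [
        f"{sample}_blast_results.xml",
        f"{sample}_cps.fa",
        f"{sample}_cps.gff3",
        f"{sample}_cps_mutations.csv",
        f"{sample}_gene_comparison.csv",
        f"{sample}_plot.html",
        "cps_extractor.log",
        "seroba_serotype_report.csv",
        "proteins",
        "snp_dists",
    ]


def missing_outputs(results_dir: Path) -> List[str]:
    """Return the expected entries that do not exist in a sample's results folder.

    Assumption: the folder name is the sample name (e.g. 11826_1#37).
    """
    sample = results_dir.name
    return [
        name
        for name in expected_outputs(sample)
        if not (results_dir / name).exists()
    ]
```

For example, `missing_outputs(Path("output/11826_1#37"))` returns an empty list when every expected entry is present.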
Usage
Requirements
- A POSIX-compatible system (e.g. Linux, macOS, Windows with WSL) with Bash 3.2 or later
- Java 11 or later (up to 21) (OpenJDK/Oracle Java)
- Docker or Singularity/Apptainer
- For Linux, Singularity/Apptainer or Docker Engine is recommended over Docker Desktop for Linux. The latter is known to cause permission issues when running the pipeline on Linux.
- Nextflow >= 23.04
Accepted Inputs
- Only Illumina paired-end short reads are supported
- Each sample is expected to be a pair of raw reads following this file name pattern:
      *_{,R}{1,2}{,_001}.{fq,fastq}{,.gz}
  - example 1: SampleName_R1_001.fastq.gz, SampleName_R2_001.fastq.gz
  - example 2: SampleName_1.fastq.gz, SampleName_2.fastq.gz
  - example 3: SampleName_R1.fq, SampleName_R2.fq
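To check in advance which files in a directory would be picked up as pairs, the file name pattern above can be approximated with a regular expression. The sketch below is an illustration only: the regex and the `pair_reads` helper are assumptions, and Nextflow's own matching may differ in edge cases.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Regex rendering of *_{,R}{1,2}{,_001}.{fq,fastq}{,.gz}
READ_FILE = re.compile(
    r"^(?P<sample>.+)_R?(?P<read>[12])(?:_001)?\.(?:fq|fastq)(?:\.gz)?$"
)


def pair_reads(filenames: Iterable[str]) -> Dict[str, Tuple[str, str]]:
    """Group file names by sample and keep samples that have both reads."""
    found: Dict[str, Dict[str, str]] = defaultdict(dict)
    for name in filenames:
        match = READ_FILE.match(name)
        if match:
            found[match["sample"]][match["read"]] = name
    # Samples with only one matching file are dropped.
    return {
        sample: (reads["1"], reads["2"])
        for sample, reads in found.items()
        if "1" in reads and "2" in reads
    }
```

Files without an underscore before the read number, such as SampleName1.fastq.gz, do not match the pattern and are ignored by this sketch.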
Setup
Clone the repository (if Git is installed on your system)
    git clone https://github.com/sanger-bentley-group/cps_extractor.git
or
Download and unzip/extract the latest release
Go into the local directory of the pipeline and it is ready to use without installation (the directory name might be different)
    cd cps_extractor
Run the database setup to download all required additional files and container images, so the pipeline can be used at any time with or without the Internet afterwards.
⚠️ Docker or Singularity must be running, and an Internet connection is required.
- Using Docker as the container engine
      ./run_cps_extractor --setup
- Using Singularity as the container engine
      ./run_cps_extractor --setup -profile singularity
Run
⚠️ Docker or Singularity must be running.
ℹ️ By default, Docker is used as the container engine and all the processes are executed by the local machine. See Profile for details on running the pipeline with Singularity or on an HPC cluster.
- You can run the pipeline without options. It will attempt to get the raw reads from the default location (i.e. the input directory inside the cps_extractor local directory)
      ./run_cps_extractor
- You can also specify the location of the raw reads by adding the --input option
      ./run_cps_extractor --input /path/to/raw-reads-directory
Options
    Usage:
    ./run_cps_extractor [option] [value]

    --input [PATH]      Path to the input directory that contains reads to be processed. Default: ./input
    --output [PATH]     Path to the output directory that save the results. Default: output
    --serotype [STR]    Serotype (if known). Default: None
    --setup             Alternative workflow for setting up the required databases.
    --version           Alternative workflow for getting versions of pipeline, container images, tools and databases
    --help              Print this help message
Profile
- By default, Docker is used as the container engine and all the processes are executed by the local machine. To change this, you can use Nextflow's built-in -profile option to switch to other available profiles
  > ℹ️ -profile is a built-in Nextflow option, it only has one leading -
      nextflow run . -profile [profile name]
- Available profiles:

| Profile Name | Details |
| --- | --- |
| standard (Default) | Docker is used as the container engine. Processes are executed locally. |
| singularity | Singularity is used as the container engine. Processes are executed locally. |
| lsf | The pipeline should be launched from an LSF cluster head node with this profile. Singularity is used as the container engine. Processes are submitted to your LSF cluster via bsub by the pipeline. (Tested on Wellcome Sanger Institute farm5 LSF cluster only) |
Resume
- If the pipeline is interrupted mid-run, Nextflow's built-in -resume option can be used to resume the pipeline execution instead of starting from scratch again
- You should use the same command as the original run and only add -resume at the end (i.e. all pipeline options should be identical)
  > ℹ️ -resume is a built-in Nextflow option, it only has one leading -
  - If the original command is
        ./run_cps_extractor --input /path/to/raw-reads-directory
  - The command to resume the pipeline execution should be
        ./run_cps_extractor --input /path/to/raw-reads-directory -resume
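When launching the pipeline from a script, it is easy to mix up the two dash conventions. The following Python sketch assembles the argument list so that pipeline options get two leading dashes and Nextflow built-in options get one. The `build_command` helper is hypothetical; only the option names come from this README.

```python
from typing import List, Optional


def build_command(
    input_dir: Optional[str] = None,
    output_dir: Optional[str] = None,
    serotype: Optional[str] = None,
    profile: Optional[str] = None,
    resume: bool = False,
) -> List[str]:
    """Assemble a run_cps_extractor command as an argument list."""
    command = ["./run_cps_extractor"]
    # Pipeline options take two leading dashes.
    pipeline_options = (
        ("--input", input_dir),
        ("--output", output_dir),
        ("--serotype", serotype),
    )
    for flag, value in pipeline_options:
        if value is not None:
            command += [flag, value]
    # Nextflow built-in options take a single leading dash.
    if profile is not None:
        command += ["-profile", profile]
    if resume:
        command.append("-resume")
    return command
```

The resulting list can be passed to `subprocess.run(..., check=True)` from inside the cps_extractor local directory. Resuming a run then only requires calling the helper again with the same arguments plus `resume=True`.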
Clean Up
- During the run of the pipeline, Nextflow generates a considerable amount of intermediate files
- If the run has been completed and you do not intend to use the -resume option or those intermediate files, you can remove the intermediate files using one of the following ways:
  - Run the included clean_pipeline script
    - It runs the commands in manual removal for you
    - It removes the work directory and log files within the cps_extractor local directory
          ./clean_pipeline
  - Manual removal
    - Remove the work directory and log files within the cps_extractor local directory
          rm -rf work
          rm -rf .nextflow.log*
  - Run nextflow clean command
    - This built-in command cleans up cache and work directories
    - By default, it only cleans up the latest run
    - For details and available options of nextflow clean, refer to the Nextflow documentation
          ./nextflow clean
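For environments where the clean-up needs to happen from Python, the manual removal commands above can be mirrored as a short sketch. It is an illustration, not the bundled clean_pipeline script, and the `clean_pipeline_dir` name is hypothetical.

```python
import shutil
from pathlib import Path
from typing import List


def clean_pipeline_dir(pipeline_dir: Path) -> List[str]:
    """Remove the work directory and .nextflow.log* files; return what was removed."""
    removed: List[str] = []
    work = pipeline_dir / "work"
    if work.is_dir():
        shutil.rmtree(work)  # equivalent of: rm -rf work
        removed.append(work.name)
    # equivalent of: rm -rf .nextflow.log*
    for log in sorted(pipeline_dir.glob(".nextflow.log*")):
        if log.is_file():
            log.unlink()
            removed.append(log.name)
    return removed
```

As with the shell commands, removing the work directory means a later -resume can no longer reuse cached results.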
Pipeline Options
- The tables below contain the available options that can be used when you run the pipeline
- Usage:
      ./run_cps_extractor [option] [value]
> ℹ️ To permanently change the value of an option, edit the nextflow.config file inside the cps_extractor local directory.

> ℹ️ $projectDir is a Nextflow built-in implicit variable, it is defined as the local directory of cps_extractor.

> ℹ️ Pipeline options are not built-in Nextflow options, they are led with -- instead of -
Alternative Workflows
| Option | Values | Description |
| --- | --- | --- |
| --setup | true or false (Default: false) | Use alternative workflow for initialisation, which means downloading all required additional files and container images, and creating databases. Can be enabled by including --setup without value. |
| --version | true or false (Default: false) | Use alternative workflow for showing versions of pipeline, container images, tools and databases. Can be enabled by including --version without value. (This workflow pulls the required container images if they are not yet available locally) |
| --help | true or false (Default: false) | Show help message. Can be enabled by including --help without value. |
General options
| Option | Values | Description |
| --- | --- | --- |
| --input | Any valid path containing paired-end fastq.gz files (Default: $projectDir/input) | Input folder containing S. pneumoniae reads |
| --output | Any valid path (Default: $projectDir/output) | Output folder which stores the pipeline results |
| --blastdb | Any valid blast database path .n* (Default: $projectDir/cps_reference_database/cps_blastdb) | Path to blast database containing CPS references |
| --prodigal_training_file | Any valid path containing a prodigal training file (Default: $projectDir/cps_reference_database/all.trn) | Training file for improved annotation |
| --bakta_db | Any valid path containing a bakta database (Default: $projectDir/cps_reference_database/bakta_db) | Path to bakta database used for annotation |
| --bakta_threads | Any valid integer value (Default: 4) | Threads used for bakta annotation |
| --reference_database | Any valid reference database path (Default: $projectDir/cps_reference_database) | Full reference database used by the pipeline |
| --serotype | Any valid serotype string (Default: None) | Manually set the serotype of your input sequences instead of having it determined by SeroBA |
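To see how the defaults in the table relate to $projectDir, the sketch below rebuilds them in Python for a given pipeline directory and applies overrides on top. It mirrors the table only; it does not read nextflow.config, and the function names are hypothetical.

```python
from pathlib import Path
from typing import Any, Dict


def default_options(project_dir: Path) -> Dict[str, Any]:
    """Defaults from the General options table, resolved against project_dir."""
    reference_database = project_dir / "cps_reference_database"
    return {
        "input": project_dir / "input",
        "output": project_dir / "output",
        "blastdb": reference_database / "cps_blastdb",
        "prodigal_training_file": reference_database / "all.trn",
        "bakta_db": reference_database / "bakta_db",
        "bakta_threads": 4,
        "reference_database": reference_database,
        "serotype": None,
    }


def resolve_options(project_dir: Path, overrides: Dict[str, Any]) -> Dict[str, Any]:
    """Apply user overrides, rejecting option names that are not in the table."""
    options = default_options(project_dir)
    unknown = sorted(set(overrides) - set(options))
    if unknown:
        raise KeyError(f"Unknown option(s): {', '.join(unknown)}")
    options.update(overrides)
    return options
```

Rejecting unknown names catches typos such as a misspelled option early, before a long run starts with an unintended default.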
Default database
The default database is stored at: https://github.com/Oliver-Lorenz-dev/cpsreferencedatabase
Credits
See Citations.MD for the full list of citations.
Thanks to Harry Hung for his excellent Nextflow code architecture, which this pipeline also uses.
Owner
- Name: sanger-bentley-group
- Login: sanger-bentley-group
- Kind: organization
- Repositories: 3
- Profile: https://github.com/sanger-bentley-group
Citation (CITATIONS.md)
- Schwengers, O., Jelonek, L., Dieckmann, M. A., Beyvers, S., Blom, J., & Goesmann, A. (2021). Bakta: rapid and standardized annotation of bacterial genomes via alignment-free sequence identification. Microbial Genomics, 7(11). https://doi.org/10.1099/mgen.0.000685
- Altschul, S. F., Gish, W., Miller, W., Myers, E. W., & Lipman, D. J. (1990). Basic Local Alignment Search Tool. Journal of Molecular Biology, 215(3), 403–410. https://doi.org/10.1016/S0022-2836(05)80360-2
- Li, H., & Durbin, R. (2009). Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics, 25(14), 1754–1760. https://doi.org/10.1093/bioinformatics/btp324
- Li, H., Handsaker, B., Wysoker, A., Fennell, T., Ruan, J., Homer, N., ... & Durbin, R. (2009). The Sequence Alignment/Map format and SAMtools. Bioinformatics, 25(16), 2078–2079. https://doi.org/10.1093/bioinformatics/btp352
- Merkel, D. (2014). Docker: Lightweight Linux Containers for Consistent Development and Deployment. Linux Journal, 2014(239), 2.
- Kurtzer, G. M., Sochat, V., & Bauer, M. W. (2017). Singularity: Scientific containers for mobility of compute. PLoS ONE, 12(5), e0177459. https://doi.org/10.1371/journal.pone.0177459
- Epping, L., van Tonder, A. J., Gladstone, R. A., GPS Consortium, Bentley, S. D., Page, A. J., & Keane, J. A. (2018). SeroBA: rapid high-throughput serotyping of Streptococcus pneumoniae from whole genome sequence data. Microbial Genomics. https://doi.org/10.1099/mgen.0.000186
- Seemann, T. Shovill. https://github.com/tseemann/shovill
Dependencies
- actions/checkout v4 composite
- actions/setup-python v4 composite
- biopython ==1.81
- numpy ==1.26.1
- pybedtools ==0.9.1
- pysam ==0.22.0
- pytest *
- pytest-cov *
- six ==1.16.0