https://github.com/bioinfo-pf-curie/raw-qc
Nextflow pipeline for quality controls and trimming of raw sequencing data
Science Score: 36.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file (found codemeta.json file)
- ○ .zenodo.json file
- ✓ DOI references (found 1 DOI reference(s) in README)
- ✓ Academic publication links (links to: zenodo.org)
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity (low similarity, 11.9%, to scientific vocabulary)
Repository
Basic Info
- Host: GitHub
- Owner: bioinfo-pf-curie
- License: other
- Language: HTML
- Default Branch: master
- Size: 83.5 MB
Statistics
- Stars: 2
- Watchers: 2
- Forks: 4
- Open Issues: 0
- Releases: 4
Metadata Files
README.md
Raw-QC
Institut Curie - Nextflow raw-qc analysis pipeline
Introduction
The main goal of the raw-qc pipeline is to perform quality controls on raw sequencing reads, regardless of the sequencing application.
It was designed to help sequencing facilities validate the quality of the data they generate.
The pipeline is built using Nextflow, a workflow tool to run tasks across multiple compute infrastructures in a very portable manner. It comes with Docker / Singularity containers, making installation trivial and results highly reproducible.
3'/5' adapter Trimming
Several steps of trimming can be performed according to the specified options.
- 3' adapter trimming (with TrimGalore! or fastp)
- 5' adapter trimming (with cutadapt)
- PolyA tail trimming (with cutadapt or fastp)
Additional options can be specified to define the type of sequencing and the minimum quality/length thresholds.
In addition, raw-qc also provides a few presets for automatic clipping/trimming:
- --picoV2, add 3/5prime end clipping
- --rnaLig, add 3/5prime end clipping
- --smartSeqV4, remove 3/5prime adapters
See the usage page for details.
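As an illustration of how these options combine, here is a minimal sketch that assembles a command line using fastp for 3' adapter trimming together with the polyA preset. The flags are taken from the help message; the input glob and output directory are the placeholders used in the cluster example, and the thresholds (20) are assumptions chosen for the example, not the pipeline defaults. The command is printed rather than launched, so it can be reviewed first.

```bash
# Sketch: combine trimming options from the help message on one command line.
# The input glob and MYOUTPUTDIR are placeholders; the thresholds (20) are
# illustrative assumptions, not the pipeline defaults.
trim_opts="--trimTool fastp --adapter auto --minLen 20 --qualTrim 20 --polyA"
trim_cmd="nextflow run main.nf --reads '*.R{1,2}.fastq.gz' --outDir MYOUTPUTDIR $trim_opts -profile singularity"

# Print the command for review instead of launching it.
printf '%s\n' "$trim_cmd"
```

Replacing `--polyA` with one of the presets above (for instance `--smartSeqV4`) should follow the same pattern; see the usage page for which options can be combined.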
PDX model
For mouse xenograft samples, it is strongly recommended to distinguish mouse reads from human reads in order to avoid misalignment.
To do so, raw-qc runs the xengsort tool (--pdx), which writes separate fastq files for each genome.
These new fastq files can then be used for downstream alignment and analysis.
Pipeline summary
- Run quality control of raw sequencing reads (fastqc)
- Trim sequencing adapters (TrimGalore! / fastp)
- Run quality control of trimmed sequencing reads (fastqc)
- Run a first mapping screen on known references and sources of contamination (fastq Screen)
- Separate host/graft reads for PDX model (xengsort)
- Present all QC results in a final report (MultiQC)
Quick help
```bash
N E X T F L O W ~ version 21.10.6
Launching main.nf [distracted_curie] - revision: dc75952132
| ___ \ | _ / __ \
| |/ /_ ___ _____| | | | / \/
| // _ \ \ /\ / /| | | | |
| |\ \ (| |\ V V / \ \/ / _/\
_| __,_| _/_/ _/_\__/
v3.0.0
Usage:

The typical command for running the pipeline is as follows:

nextflow run main.nf --samplePlan PATH -profile STRING OPTIONS

MANDATORY ARGUMENTS:
  --reads PATH                 Path to input data (must be surrounded with quotes)
  --samplePlan PATH            Path to sample plan (csv format) with raw reads (if --reads is not specified)

INPUTS:
  --pdx                        Deconvolute host/graft reads for PDX samples
  --singleEnd                  For single-end input data

TRIMMING:
  --adapter STRING [auto, truseg, nextera, smallrna, *]   Type of 3prime adapter to trim
  --adapter5 STRING            Specified cutadapt options for 5prime adapter trimming
  --minLen INTEGER             Minimum length of trimmed sequences
  --nTrim                      Trim poly-N sequence at the end of the reads
  --qualTrim INTEGER           Minimum mapping quality for trimming
  --twoColour                  Trimming for NextSeq/NovaSeq sequencers
  --trimTool STRING [trimgalore, fastp]   Tool for 3prime adapter trimming and auto-detection

PRESET:
  --picoV2                     Preset of clipping parameters for picoV2 protocol
  --polyA                      Preset for polyA tail trimming
  --rnaLig                     Preset for RNA ligation protocol
  --smartSeqV4                 Preset for smartSeqV4 RNA-seq protocol

REFERENCES:
  --genomeAnnotationPath PATH  Path to genome annotations folder

SKIP OPTIONS:
  --skipFastqcRaw              Disable FastQC
  --skipFastqScreen            Disable FastqScreen
  --skipFastqcTrim             Disable FastQC
  --skipMultiqc                Disable MultiQC
  --skipTrimming               Disable Trimming

OTHER OPTIONS:
  --metadata PATH              Specify a custom metadata file for MultiQC
  --multiqcConfig PATH         Specify a custom config file for MultiQC
  --name STRING                Name for the pipeline run. If not specified, Nextflow will automatically generate a random mnemonic
  --outDir PATH                The output directory where the results will be saved

=======================================================
Available Profiles
-profile test Run the test dataset
-profile conda Build a new conda environment before running the pipeline. Use --condaCacheDir to define the conda cache path
-profile multiconda Build a new conda environment per process before running the pipeline. Use --condaCacheDir to define the conda cache path
-profile path Use the installation path defined for all tools. Use --globalPath to define the insallation path
-profile multipath Use the installation paths defined for each tool. Use --globalPath to define the insallation path
-profile docker Use the Docker images for each process
-profile singularity Use the Singularity images for each process. Use --singularityImagePath to define the insallation path
-profile cluster Run the workflow on the cluster, instead of locally
```
Quick run
The pipeline can be run on any infrastructure, either from a list of input files or from a sample plan, as follows.
Run the pipeline on the test dataset
See conf/test.conf to configure your test dataset.
```
nextflow run main.nf -profile conda,test --genomeAnnotationPath 'ANNOTATION_FOLDER'
```
Run the pipeline from a sample plan
```
nextflow run main.nf --samplePlan MYSAMPLEPLAN --outDir MYOUTPUTDIR -profile conda --genomeAnnotationPath 'ANNOTATION_FOLDER'
```
Run the pipeline on a cluster
```
echo "nextflow run main.nf --reads '*.R{1,2}.fastq.gz' --outDir MYOUTPUTDIR -profile singularity,cluster" | qsub -N rawqc
```
Defining the '-profile'
By default (without any profile), Nextflow will execute the pipeline locally, expecting all tools to be available from your PATH variable.
In addition, we set up a few profiles that should allow you i/ to use containers instead of a local installation, ii/ to run the pipeline on a cluster instead of on a local machine. The description of each profile is available in the help message (see above).
Here are a few examples of how to set the profile option. See the full documentation for details.
```
# Run the pipeline locally, using the paths defined in the configuration for each tool (see conf/path.config)
-profile path --globalPath INSTALLATION_PATH

# Run the pipeline on the cluster, using the Singularity containers
-profile cluster,singularity --singularityImagePath SINGULARITYIMAGEPATH

# Run the pipeline on the cluster, building a new conda environment
-profile cluster,conda --condaCacheDir CONDA_CACHE
```
Sample Plan
A sample plan is a comma-separated csv file, with no header, that lists all samples with their biological IDs. Each line has the following columns (the last one is optional):
SAMPLEID,SAMPLENAME,PATHTOR1FASTQ,[PATHTOR2FASTQ]
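A sample plan can be written by hand, but for many samples a small helper may be convenient. The sketch below is not part of raw-qc: `make_sample_plan` and `check_sample_plan` are hypothetical helper names, the `<name>.R1.fastq.gz` / `<name>.R2.fastq.gz` naming convention is an assumption borrowed from the glob in the cluster example, and the `S1`, `S2`, ... sample IDs are placeholders to be replaced by your own identifiers.

```bash
# Sketch: build a header-less sample plan from a directory of FASTQ files.
# Assumes files are named <name>.R1.fastq.gz / <name>.R2.fastq.gz and that
# paths contain no commas. The body runs in a subshell, so its variables
# do not leak into the calling shell.
make_sample_plan() (
  fastq_dir=$1
  i=0
  for r1 in "$fastq_dir"/*.R1.fastq.gz; do
    [ -e "$r1" ] || continue          # no match: the pattern is left as-is
    i=$((i + 1))
    name=$(basename "$r1" .R1.fastq.gz)
    r2="$fastq_dir/$name.R2.fastq.gz"
    if [ -e "$r2" ]; then
      printf 'S%s,%s,%s,%s\n' "$i" "$name" "$r1" "$r2"
    else
      printf 'S%s,%s,%s\n' "$i" "$name" "$r1"   # single-end: no R2 column
    fi
  done
)

# Sketch: check that every line has 3 or 4 comma-separated fields.
# Returns a non-zero status and reports the offending lines otherwise.
check_sample_plan() {
  awk -F',' '
    NF < 3 || NF > 4 {
      printf "line %d: expected 3 or 4 fields, found %d\n", NR, NF
      bad = 1
    }
    END { exit (bad ? 1 : 0) }
  ' "$1"
}
```

A typical use would be `make_sample_plan /path/to/fastq > samples.csv` followed by `check_sample_plan samples.csv`, before passing the file to `--samplePlan`.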
Full Documentation
- Installation
- Reference genomes
- Running the pipeline
- Output and how to interpret the results
- Troubleshooting
Credits
This pipeline has been set up and written by the sequencing and the bioinformatics core facilities of the Institut Curie (T. Alaeitabar, D. Desvillechabrol, F. Martin, S. Baulande, N. Servant).
Citation
If you use this pipeline for your project, please cite it using the following DOI: 10.5281/zenodo.7515639.
Do not hesitate to use the Zenodo DOI corresponding to the version you used!
Contacts
For any question, bug report or suggestion, please contact the bioinformatics core facility.
Owner
- Name: Institut Curie, Bioinformatics Core Facility
- Login: bioinfo-pf-curie
- Kind: organization
- Location: Paris, France
- Website: https://bioinfo-pf-curie.github.io/
- Repositories: 11
- Profile: https://github.com/bioinfo-pf-curie
bioinformatics platform of the Institut Curie
GitHub Events
Total
- Release event: 1
- Push event: 1
- Create event: 2
Last Year
- Release event: 1
- Push event: 1
- Create event: 2
Dependencies
- biopython ==1.79
- xengsort-cubic ==1.1.0