scepigenome

Processing of single-cell epigenomic data

https://github.com/bioinfo-pf-curie/scepigenome

Science Score: 67.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 3 DOI reference(s) in README
  • Academic publication links
    Links to: zenodo.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (12.0%) to scientific vocabulary

Repository


Basic Info
  • Host: GitHub
  • Owner: bioinfo-pf-curie
  • License: other
  • Language: Python
  • Default Branch: main
  • Size: 83 MB
Statistics
  • Stars: 0
  • Watchers: 1
  • Forks: 0
  • Open Issues: 0
  • Releases: 2
Created over 1 year ago · Last pushed over 1 year ago
Metadata Files
Readme Changelog License Citation

README.md

scEpigenome

Institut Curie - single-cell Epigenomics analysis pipeline


Introduction

The pipeline is built using Nextflow, a workflow tool to run tasks across multiple compute infrastructures in a very portable manner. It comes with conda / singularity containers making installation easier and results highly reproducible.

The goal of this pipeline is to process multiple types of single-cell epigenomics profiles, including scCut&Tag (10X, inDrop, plate), scATACseq in plate and scChIP-seq (inDrop).

Pipeline summary

This pipeline can be run on any single-cell epigenomics data, i.e. scChIPseq, scCUT&Tag and scATACseq, generated with various protocols including 10X barcoding, inDrop microfluidics protocols or plate systems.

The pipeline goes from raw reads (fastq, paired end) to genomic count matrices as follows:

  1. Align barcode read parts on barcode index libraries
  2. Align genomic read parts on the genome
  3. Assignment of cell barcodes to aligned reads
  4. Removal of duplicates (PCR & extra duplicates)
  5. Removal of blacklisted regions (repeated regions, low mappability regions)
  6. Counting (generation of count matrix) by TSS (transcription start sites) as an approximation of genes
  7. Optional: counting (generation of count matrix) in bins
  8. Generation of fragment file
  9. Generation of coverage file (bigwig) (CPM normalization)
  10. Optional: Peak Calling (pseudo-bulk)
  11. Reporting
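
To make the counting steps more concrete, here is a minimal Python sketch of binned counting (step 7) and counts-per-million scaling (the normalization named in step 9). It is an illustration only, not the pipeline's code: the function names, the tuple layout of the fragments and the choice to assign a fragment to the bin containing its midpoint are all assumptions. The default bin size and the cell filter mirror the `--binSize` and `--minReadsPerCellmatrix` options described in the help message.

```python
from collections import Counter, defaultdict


def bin_counts(fragments, bin_size=50000, min_reads_per_cell=0):
    """Count fragments per (chromosome, bin index) for each cell barcode.

    fragments: iterable of (chrom, start, end, barcode) tuples.
    Assumption: a fragment belongs to the bin that contains its midpoint.
    """
    per_cell = defaultdict(Counter)
    for chrom, start, end, barcode in fragments:
        midpoint = (start + end) // 2
        per_cell[barcode][(chrom, midpoint // bin_size)] += 1
    # Cells with fewer fragments than the threshold are removed.
    return {
        barcode: dict(counts)
        for barcode, counts in per_cell.items()
        if sum(counts.values()) >= min_reads_per_cell
    }


def cpm(counts):
    """Scale a {feature: count} mapping to counts per million."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {feature: value * 1e6 / total for feature, value in counts.items()}


# Hypothetical fragments: three for barcode AAAC, one for barcode GGGT.
fragments = [
    ("chr1", 100, 300, "AAAC"),
    ("chr1", 49900, 50300, "AAAC"),  # midpoint 50100 falls in bin 1
    ("chr1", 120, 280, "AAAC"),
    ("chr2", 10, 200, "GGGT"),
]
matrix = bin_counts(fragments, bin_size=50000, min_reads_per_cell=2)
normalized = cpm(matrix["AAAC"])
```

With a threshold of 2, the GGGT cell is dropped because it has a single fragment, and the AAAC cell keeps two bins on chr1.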

Quick help

```bash
nextflow run main.nf --help
N E X T F L O W ~ version 22.10.6

Launching main.nf [sad_magritte] DSL2 - revision: 59d670a8d1

Usage:

The typical command for running the pipeline is as follows:

nextflow run main.nf --reads PATH --samplePlan PATH --genome STRING --protocol STRING

MANDATORY ARGUMENTS:
  --genome                 STRING    Name of the reference genome.
  --protocol               STRING    [scchip_indrop, sccuttag_indrop, sccuttag_10X, scepigenome_plate] Specify which protocol to run
  --reads                  PATH      Path to input data (must be surrounded with quotes)
  --samplePlan             PATH      Path to sample plan (csv format) with raw reads (if --reads is not specified)

REFERENCES:
  --genomeAnnotationPath   PATH      Path to genome annotations folder
  --effGenomeSize          INTEGER   Effective genome size
  --fasta                  PATH      Path to genome fasta file
  --geneBed                PATH      Path to gene file (BED)
  --genomeAnnotationPath   PATH      Path to genome annotations folder
  --gtf                    PATH      Path to GTF annotation file. Used in HOMER peak annotation
  --starIndex              PATH      Indexes for STAR aligner
  --bwaIndex               PATH      Indexes for Bwa-mem aligner

INPUTS:
  --sampleDescription      PATH      Path to sample description (csv format) with biological names of each cell
  --batchSize              INTEGER   Number of cells to merge together to work in batch (only for plate protocols)

ALIGNMENT:
  --aligner                STRING    Aligner to use ('star' or 'bwa-mem2')

BARCODES:
  --mapqBarcode            INTEGER   Mapping quality for the barcode alignment (40)
  --barcodeTag             STRING    Barcode tag ('XB')

FILTERING:
  --blackList              PATH      Path to black list regions (.bed). See the genome.config for details
  --mapq                   INTEGER   Minimum mapping quality after reads alignment (20)
  --rmSingleton                      Remove singleton
  --extraDup                         Remove extra duplicates (RT and window)
  --keepRTdup                        Keep RT duplicates (if --extraDup is specified)
  --keepDups                         Keep all duplicated reads
  --keepBlackList                    Keep reads in blacklist regions
  --distDup                INTEGER   Genomic distance to consider a read as a window duplicate (for scChIP only)

MATRICES
  --minReadsPerCellmatrix  INTEGER   Cells having less than this number are removed from final matrices
  --binSize                INTEGER   [50000] Size of bins to create matrices

PEAK CALLING
  --peakCalling                      Run bulk peak calling analysis
  --macs2Opts              STRING    MACS2 parameters
  --peakDist               INTEGER   Maximum distance between peaks to be merged
  --tssWindow              INTEGER   Distance (upstream/downstream) to transcription start point to consider

SKIP OPTIONS:
  --skipBigWig                       Disable BigWig
  --skipMultiQC                      Disable MultiQC
  --skipSoftVersions                 Disable Soft Versions

OTHER OPTIONS:
  --metadata               PATH      Specify a custom metadata file for MultiQC
  --multiqcConfig          PATH      Specify a custom config file for MultiQC
  --name                   STRING    Name for the pipeline run. If not specified, Nextflow will automatically generate a random mnemonic

OUTPUTs:
  --outDir                 PATH      The output directory where the results will be saved
  --saveIntermediates                Save intermediates files
  --cleanup                STRING    [none, auto, success] Cleaning strategy of the work/ directory

======================================================
Available Profiles
  -profile test          Run the test dataset
  -profile conda         Build a new conda environment before running the pipeline. Use --condaCacheDir to define the conda cache path
  -profile multiconda    Build a new conda environment per process before running the pipeline. Use --condaCacheDir to define the conda cache path
  -profile path          Use the installation path defined for all tools. Use --globalPath to define the insallation path
  -profile multipath     Use the installation paths defined for each tool. Use --globalPath to define the insallation path
  -profile docker        Use the Docker images for each process
  -profile singularity   Use the Singularity images for each process. Use --singularityPath to define the insallation path
  -profile cluster       Run the workflow on the cluster, instead of locally
```

Quick run

The pipeline can be run on any infrastructure from a list of input files or from a sample plan as follows:

Run the pipeline on a test dataset

See conf/test.conf to set up your test dataset.

nextflow run main.nf -profile test,conda --protocol 'scchip_indrop'

Run the pipeline from a sample plan

nextflow run main.nf --samplePlan MY_SAMPLE_PLAN --genome 'hg38' --outDir MY_OUTPUT_DIR -profile conda

Run the pipeline on a computational cluster

echo "nextflow run main.nf --reads '*.R{1,2}.fastq.gz' --genome 'hg19' --outDir MY_OUTPUT_DIR -profile singularity,cluster" | qsub -N scchip

Defining the '-profile'

By default (without any profile), Nextflow will execute the pipeline locally, expecting that all tools are available from your PATH variable.

In addition, we set up a few profiles that should allow you i/ to use containers instead of a local installation, ii/ to run the pipeline on a cluster instead of on a local architecture. The description of each profile is available in the help message (see above).

Here are a few examples of how to set the profile option. See the full documentation for details.

```
# Run the pipeline locally, using the paths defined in the configuration for each tool (see conf/path.config)
-profile path --globalPath INSTALLATION_PATH

# Run the pipeline on the cluster, using the Singularity containers
-profile cluster,singularity --singularityImagePath SINGULARITY_IMAGE_PATH

# Run the pipeline on the cluster, building a new conda environment
-profile cluster,conda --condaCacheDir CONDA_CACHE
```

Sample Plan

A sample plan is a csv file (comma separated) that lists all samples with their biological IDs, with no header.

SAMPLE_ID,SAMPLE_NAME,PATH_TO_R1_FASTQ,[PATH_TO_R2_FASTQ]

The sample plan can vary a bit according to the protocol.
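
As an illustration of the format, the following Python sketch reads a header-less sample plan and checks the number of columns against the protocol. It is not part of the pipeline: the function name, the returned dictionary layout and the example paths are hypothetical, and the column counts are simply read off the protocol subsections of this README.

```python
import csv
import io

# Expected number of columns per protocol, as described in this README
# (an interpretation of the documentation, not pipeline code).
EXPECTED_COLUMNS = {
    "scchip_indrop": 4,    # ID, name, R1, R2 (barcode on R2)
    "sccuttag_indrop": 5,  # ID, name, R1, R2 (barcode), R3
    "sccuttag_10X": 5,     # ID, name, R1, R2 (barcode), R3
    "sccuttag_plate": 3,   # ID, name, directory of per-cell fastq files
}


def read_sample_plan(handle, protocol):
    """Parse a header-less, comma-separated sample plan for one protocol."""
    expected = EXPECTED_COLUMNS[protocol]
    samples = []
    for line_no, row in enumerate(csv.reader(handle), start=1):
        if not row:
            continue  # skip blank lines
        if len(row) != expected:
            raise ValueError(
                f"line {line_no}: expected {expected} columns "
                f"for {protocol}, found {len(row)}"
            )
        samples.append({"id": row[0], "name": row[1], "paths": row[2:]})
    return samples


# Hypothetical one-line sample plan for the scchip_indrop protocol.
plan = io.StringIO("S1,control,/data/S1.R1.fastq.gz,/data/S1.R2.fastq.gz\n")
samples = read_sample_plan(plan, "scchip_indrop")
```

Checking the column count up front gives a clear error message before any alignment job is launched.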

scchip_indrop

Paired-end reads with R1 and R2 for each sample. The barcode information is expected to be on the R2 reads.

SAMPLE_ID,SAMPLE_NAME,PATH_TO_R1_FASTQ,PATH_TO_R2_FASTQ

The barcode is then extracted from the R2 reads, and the remaining bases are aligned on the genome.

sccuttag_indrop

Three fastq files are expected for one sample. The barcode information is expected to be on the R2 reads. Only the R1/R3 reads are aligned on the genome.

SAMPLE_ID,SAMPLE_NAME,PATH_TO_R1_FASTQ,PATH_TO_R2_FASTQ,PATH_TO_R3_FASTQ

sccuttag_10X

Three fastq files are expected for one sample. The barcode information is expected to be on the R2 reads. Only the R1/R3 reads are aligned on the genome.

SAMPLE_ID,SAMPLE_NAME,PATH_TO_R1_FASTQ,PATH_TO_R2_FASTQ,PATH_TO_R3_FASTQ

sccuttag_plate

For this protocol, demultiplexing usually yields one pair of fastq files per cell.
The sample plan therefore gives, for each sample, the path to the folder that contains all of its per-cell fastq files.

SAMPLE_ID,SAMPLE_NAME,DIRECTORY_TO_R1_R2_FASTQ

Note that the per-cell fastq files are merged (--batchSize) and processed as batches of cells.
Sequencing files must match the R[1/2] pattern to be detected.
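
The pairing and batching described above can be sketched in Python as follows. This is a hedged illustration rather than the pipeline's implementation: the exact file-name pattern the pipeline accepts is not specified here, so the regular expression (an "R1" or "R2" tag delimited by ".", "_" or "-"), the function names and the example file names are assumptions.

```python
import re
from itertools import islice

# Assumed naming convention: the read number appears as "R1" or "R2",
# delimited on both sides by ".", "_" or "-".
READ_TAG = re.compile(r"(?<=[._-])R([12])(?=[._-])")


def pair_fastqs(filenames):
    """Group file names into (cell, R1 file, R2 file) tuples."""
    pairs = {}
    for name in sorted(filenames):
        match = READ_TAG.search(name)
        if match is None:
            continue  # files without an R1/R2 tag are not detected
        cell = name[:match.start()].rstrip("._-")
        pairs.setdefault(cell, {})[match.group(1)] = name
    # Keep only cells for which both mates were found.
    return [
        (cell, reads["1"], reads["2"])
        for cell, reads in pairs.items()
        if "1" in reads and "2" in reads
    ]


def batches(items, batch_size):
    """Yield successive lists of at most batch_size items."""
    iterator = iter(items)
    while True:
        chunk = list(islice(iterator, batch_size))
        if not chunk:
            return
        yield chunk


# Hypothetical content of one sample directory.
files = [
    "L500C01.R1.fastq.gz", "L500C01.R2.fastq.gz",
    "L500C02.R1.fastq.gz", "L500C02.R2.fastq.gz",
    "L500C03.R1.fastq.gz", "L500C03.R2.fastq.gz",
    "notes.txt",
]
cells = pair_fastqs(files)
cell_batches = list(batches(cells, 2))
```

With a batch size of 2, the three detected cells are split into one batch of two cells and one batch of one cell; `notes.txt` is ignored because it carries no read tag.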

Optionally, you can provide a --sampleDescription file with cell IDs, which will be used in the output files.
This file must contain the IDs from the --samplePlan with the cell names, separated by a "|".

L500C01_batch1|cell1
L500C02_batch1|cell2
...
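
For illustration, a sample description in this "ID|name" layout could be loaded into a lookup table with a few lines of Python. The function name and the error handling are hypothetical and only sketch the idea; the pipeline may parse the file differently.

```python
def read_sample_description(lines):
    """Map each ID to its cell name from 'ID|name' lines."""
    mapping = {}
    for line_no, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # skip blank lines
        cell_id, separator, cell_name = line.partition("|")
        if not separator or not cell_id or not cell_name:
            raise ValueError(f"line {line_no}: expected 'ID|name', got {line!r}")
        mapping[cell_id] = cell_name
    return mapping


names = read_sample_description([
    "L500C01_batch1|cell1",
    "L500C02_batch1|cell2",
])
```

The same function accepts an open file handle, since iterating over a text file yields its lines.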

Full Documentation

  1. Installation
  2. Reference genomes
  3. Running the pipeline
  4. Output and how to interpret the results
  5. Troubleshooting

Credits

This pipeline was written by the single-cell custom and bioinformatics facilities of the Institut Curie (L. Hadj-Abed, P. Prompsy, C. Vallot, N. Servant).

Citation

If you use this pipeline for your project, please cite it using the following DOI:
Do not hesitate to use the Zenodo DOI corresponding to the version you used!

Contacts

For any question, bug or suggestion, please use the issues system or contact the bioinformatics core facility.

Owner

  • Name: Institut Curie, Bioinformatics Core Facility
  • Login: bioinfo-pf-curie
  • Kind: organization
  • Location: Paris, France

bioinformatics platform of the Institut Curie

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
- family-names: Hadj-Abed
  given-names: Louisa
- family-names: Vallot
  given-names: Celine
- family-names: Servant
  given-names: Nicolas
orcid: https://orcid.org/0000-0000-0000-0000
title: scEpigenome
version: v2.2.0
date-released: 2025-03-06

GitHub Events

Total
  • Release event: 3
  • Delete event: 2
  • Push event: 3
  • Create event: 9
Last Year
  • Release event: 3
  • Delete event: 2
  • Push event: 3
  • Create event: 9

Dependencies

environment.yml pypi