https://github.com/cellgeni/nf-processed2irods
Pipeline to move processed datasets to iRODS
Basic Info
- Host: GitHub
- Owner: cellgeni
- License: mit
- Language: Nextflow
- Default Branch: main
- Size: 27.3 KB
https://github.com/cellgeni/nf-processed2irods/blob/main/
# nf-processed2irods
## Pipeline Workflow
1. **Dataset/Sample Discovery** - Reads dataset or sample information from CSV input file
2. **Public Dataset Detection** - Identifies public datasets (GSE*, E-MTAB-*, PRJEB* patterns)
3. **Metadata Parsing** - Extracts metadata from public repositories for public datasets
4. **Quality Control** - Generates mapping QC statistics from STARsolo output (if not already present)
5. **File Collection** - Gathers all data files and metadata files for upload
6. **iRODS Upload** - Transfers files to iRODS with checksums
7. **Metadata Attachment** - Attaches comprehensive metadata to iRODS collections
## Dry Run Mode
The pipeline supports a dry run mode (`--dry_run true`) that lets you test and validate your pipeline configuration without uploading any data to iRODS.
### What Dry Run Does:
- **Validates input parameters** - Checks that all required parameters are provided and correctly formatted
- **Processes metadata** - Runs all metadata collection and parsing steps
- **Generates QC reports** - Creates mapping statistics and quality control files
- **Creates output files** - Generates `sample_metadata.csv` and `dataset_metadata.csv` files locally
- **Shows intended operations** - Displays what files would be uploaded and where
### What Dry Run Skips:
- **iRODS file uploads** - No files are actually transferred to iRODS storage
- **iRODS metadata attachment** - No metadata is attached to iRODS collections
- **MD5 checksum validation** - No checksums are computed or compared with iRODS
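The split between what a dry run does and what it skips can be pictured as a guard around every command that touches iRODS. The following is only a sketch of that idea, not the pipeline's actual implementation: `run_or_print` and the `DRY_RUN` variable are hypothetical names, and the use of `iput -K` for the upload is an assumption, since the README does not show the underlying commands.

```bash
# Hypothetical wrapper: run a command, or only print it when DRY_RUN=true.
DRY_RUN="${DRY_RUN:-false}"

run_or_print() {
    if [ "$DRY_RUN" = "true" ]; then
        echo "[dry run] would run: $*"
    else
        "$@"
    fi
}

# With DRY_RUN=true the command is printed and nothing is transferred.
# (Assumption: the upload step resembles `iput -K`, which verifies checksums.)
DRY_RUN=true
run_or_print iput -K sample_metadata.csv /archive/cellgeni/sanger/GSE123456/
```

Local steps such as metadata parsing would run unguarded, which is why their output files still appear in a dry run.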
## Overview
This pipeline processes datasets containing single-cell RNA sequencing results (typically from STARsolo/CellRanger), extracts metadata, performs quality control analysis, and uploads the data with comprehensive metadata to iRODS for long-term storage and data management.
## Contents of Repo:
* `main.nf` the main Nextflow pipeline that orchestrates data processing and iRODS upload
* `nextflow.config` configuration script for IBM LSF submission on Sanger's HPC with Singularity containers and global parameters
* `modules/local/reprocess10x/parsemetadata/` module for parsing metadata from public repositories (GEO, ENA)
* `modules/local/reprocess10x/mappingqc/` module for extracting mapping quality control statistics
* `modules/local/irods/storefile/` module for uploading files to iRODS
* `modules/local/irods/attachcollectionmeta/` module for attaching metadata to iRODS collections
## Examples
### Basic Usage with Datasets
Upload processed datasets to iRODS:
```bash
nextflow run main.nf \
--datasets datasets.csv \
--irodspath "/archive/cellgeni/sanger/"
```
### Basic Usage with Individual Samples
Upload individual samples to iRODS:
```bash
nextflow run main.nf \
--samples samples.csv \
--irodspath "/archive/cellgeni/sanger/"
```
### Custom File Filtering
Specify which file types to ignore during upload:
```bash
nextflow run main.nf \
--datasets datasets.csv \
--irodspath "/archive/cellgeni/sanger/" \
--ignore_pattern ".bam,.fastq.gz"
```
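To preview which files such a list would exclude, a small shell check can help. This sketch assumes that each entry in `--ignore_pattern` is matched as a file-name suffix; the README does not state the exact matching rule, and `is_ignored` is a hypothetical helper, not part of the pipeline.

```bash
# Hypothetical helper: succeed (exit 0) if the file name ends with any
# suffix in a comma-separated list. Suffix matching is an assumption.
is_ignored() {
    file=$1
    patterns=$2
    old_ifs=$IFS
    IFS=,
    for suffix in $patterns; do
        case "$file" in
            *"$suffix") IFS=$old_ifs; return 0 ;;
        esac
    done
    IFS=$old_ifs
    return 1
}

# Show the decision for a few file names from a STARsolo output directory.
for f in Log.final.out Aligned.sortedByCoord.out.bam reads_R1.fastq.gz; do
    if is_ignored "$f" ".bam,.fastq.gz"; then
        echo "skip   $f"
    else
        echo "upload $f"
    fi
done
```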
### Disable Public Metadata Collection
Skip automatic metadata retrieval for public datasets:
```bash
nextflow run main.nf \
--datasets datasets.csv \
--irodspath "/archive/cellgeni/sanger/" \
--collect_public_metadata false
```
### Enable Verbose Output
Get detailed logging information:
```bash
nextflow run main.nf \
--datasets datasets.csv \
--irodspath "/archive/cellgeni/sanger/" \
--verbose true
```
### Custom Output Directory
Specify a different output directory:
```bash
nextflow run main.nf \
--datasets datasets.csv \
--irodspath "/archive/cellgeni/sanger/" \
--output_dir "my_results"
```
### Dry Run (Test Mode)
Test the pipeline without actually uploading files to iRODS:
```bash
nextflow run main.nf \
--datasets datasets.csv \
--irodspath "/archive/cellgeni/sanger/" \
--dry_run true
```
This mode will:
- Validate all input parameters and files
- Process metadata and generate QC reports
- Show what would be uploaded without actual iRODS operations
- Create local output files (metadata CSV files) for review
## Pipeline Parameters
### Required Parameters:
Provide either `--datasets` or `--samples`; `--irodspath` is required in both cases.
* `--datasets` Path to a CSV file containing dataset information with columns: `id` (dataset identifier) and `path` (local filesystem path to processed data directory)
**OR**
* `--samples` Path to a CSV file containing sample information with columns: `id` (sample identifier), `path` (local filesystem path), and optionally `dataset_id`
* `--irodspath` Base path in iRODS where datasets will be stored (e.g., "/archive/cellgeni/sanger/")
### Optional Parameters:
* `--output_dir` Output directory for pipeline results (`default: "results"`)
* `--publish_mode` File publishing mode (`default: "copy"`)
* `--ignore_pattern` Comma-separated list of file patterns to ignore during upload (`default: ".bam,.bai,.cram,.crai,.fastq.gz,.fq.gz,.fastq,.fq,.mate1.bz2,.mate2.bz2,.sh,.bsub,.pl"`)
* `--collect_public_metadata` Collect metadata from public repositories for public datasets (`default: true`)
* `--parse_mapper_metrics` Parse mapping QC metrics from STARsolo output (`default: true`)
* `--verbose` Enable verbose output (`default: false`)
* `--dry_run` Perform a dry run without uploading files to iRODS (`default: false`)
## Input File Format
The pipeline supports two input modes:
### Option 1: Dataset-based input (`--datasets`)
CSV file with the following structure:
```csv
id,path
GSE123456,/path/to/processed/GSE123456
PRJEB12345,/path/to/processed/PRJEB12345
EGA_DATASET,/path/to/processed/EGA_DATASET
```
Where:
- `id`: Unique dataset identifier (can be GEO accession, ENA project, or custom ID)
- `path`: Absolute path to the directory containing processed single-cell data with sample subdirectories
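When every dataset lives in its own subdirectory of one parent directory, the CSV can be generated rather than typed by hand. The sketch below assumes that the directory name doubles as the dataset `id`; `make_datasets_csv` is a hypothetical helper and not part of the pipeline.

```bash
# Hypothetical helper: print an `id,path` CSV with one row per
# subdirectory of the given parent directory.
make_datasets_csv() {
    parent=$1
    echo "id,path"
    for dir in "$parent"/*/; do
        [ -d "$dir" ] || continue      # no subdirectories: the glob stays literal
        dir=${dir%/}                    # drop the trailing slash
        abs=$(cd "$dir" && pwd)         # absolute path, as the `path` column expects
        printf '%s,%s\n' "${dir##*/}" "$abs"
    done
}

# Example: make_datasets_csv /path/to/processed > datasets.csv
```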
### Option 2: Sample-based input (`--samples`)
CSV file with the following structure:
```csv
id,path,dataset_id
SAMPLE1,/path/to/processed/SAMPLE1,GSE123456
SAMPLE2,/path/to/processed/SAMPLE2,GSE123456
SAMPLE3,/path/to/processed/SAMPLE3,
```
Where:
- `id`: Unique sample identifier
- `path`: Absolute path to the sample directory containing processed single-cell data
- `dataset_id`: (Optional) Dataset identifier to group samples under. If omitted, the sample is uploaded directly to `irodspath/id`
## iRODS Path Structure
The pipeline uploads data to different iRODS paths depending on the input type:
- **Datasets (`--datasets`)**: Uploaded to `irodspath/id`
  - Example: `/archive/cellgeni/sanger/GSE123456/`
- **Samples with dataset_id (`--samples`)**: Uploaded to `irodspath/dataset_id/id`
  - Example: `/archive/cellgeni/sanger/GSE123456/SAMPLE1/`
- **Samples without dataset_id (`--samples`)**: Uploaded to `irodspath/id`
  - Example: `/archive/cellgeni/sanger/SAMPLE1/`
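The three rules above reduce to one decision: insert `dataset_id` into the path when it is present. A minimal sketch of that mapping follows. `irods_target` is a hypothetical helper for working out destinations in advance, and stripping a trailing slash from `irodspath` is an assumption about how the base path is normalised.

```bash
# Hypothetical helper: build the iRODS collection path for one input row.
# Usage: irods_target <irodspath> <id> [dataset_id]
irods_target() {
    base=${1%/}              # tolerate a trailing slash on irodspath
    id=$2
    dataset_id=${3:-}
    if [ -n "$dataset_id" ]; then
        echo "$base/$dataset_id/$id/"
    else
        echo "$base/$id/"
    fi
}

irods_target /archive/cellgeni/sanger/ GSE123456           # dataset row
irods_target /archive/cellgeni/sanger/ SAMPLE1 GSE123456   # sample with dataset_id
irods_target /archive/cellgeni/sanger/ SAMPLE1             # sample without dataset_id
```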
## Expected Data Structure
### For Dataset-based Input
Each dataset directory should contain:
- **Sample directories**: Named with sample identifiers containing STARsolo output files
- **QC files**: `*solo_qc.tsv` files (generated automatically if missing)
- **Metadata files**: For public datasets, metadata will be automatically retrieved
Example directory structure:
```
GSE123456/
  SAMPLE1/
    Aligned.sortedByCoord.out.bam
    Log.final.out
    Solo.out/
    ...
  SAMPLE2/
    ...
  GSE123456_solo_qc.tsv
```
### For Sample-based Input
Each sample directory should contain STARsolo output files:
```
SAMPLE1/
  Aligned.sortedByCoord.out.bam
  Log.final.out
  Solo.out/
    Gene/
    GeneFull/
    ...
  ...
```
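Before launching the pipeline, it may be worth confirming that each sample directory matches this layout. The check below is a sketch under an assumption: it treats `Log.final.out` and `Solo.out/` as the required entries, based only on the example trees in this section, and `check_sample_dir` is a hypothetical helper.

```bash
# Hypothetical pre-flight check: report expected STARsolo outputs that are
# missing from a sample directory. Returns non-zero if anything is missing.
check_sample_dir() {
    sample=$1
    missing=0
    [ -f "$sample/Log.final.out" ] || { echo "missing: $sample/Log.final.out"; missing=1; }
    [ -d "$sample/Solo.out" ]      || { echo "missing: $sample/Solo.out/"; missing=1; }
    return "$missing"
}

# Example: check_sample_dir /path/to/processed/SAMPLE1
```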
## Metadata Handling
### Public Datasets
For datasets matching patterns `GSE*`, `E-MTAB-*`, or `PRJEB*`, the pipeline automatically:
- Retrieves sample metadata from public repositories
- Generates accession mapping files
- Creates sample relationship files
- Downloads study metadata
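The prefix test that decides whether a dataset counts as public can be sketched with a shell `case` statement. This mirrors the three patterns named above but is not taken from the pipeline source; `is_public_dataset` is a hypothetical helper and `E-MTAB-0000` is a made-up accession used only for illustration.

```bash
# Hypothetical helper: classify a dataset ID by the accession prefixes
# listed above (GSE*, E-MTAB-*, PRJEB*).
is_public_dataset() {
    case "$1" in
        GSE*|E-MTAB-*|PRJEB*) echo "public" ;;
        *)                    echo "private" ;;
    esac
}

# E-MTAB-0000 is an illustrative ID, not a real accession.
for id in GSE123456 E-MTAB-0000 PRJEB12345 EGA_DATASET; do
    printf '%s\t%s\n' "$id" "$(is_public_dataset "$id")"
done
```

Under this rule a custom ID such as `EGA_DATASET` is treated as private, so no public metadata lookup would be attempted for it.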
### All Datasets
The pipeline extracts and stores:
- **Sample-level metadata**: Species, sequencing type (paired/single), strand information, read counts, whitelist version
- **Dataset-level metadata**: Study accession numbers, dataset identifiers
- **Technical metadata**: File checksums, upload timestamps
## iRODS Structure
Data is organized in iRODS based on the input type:
### For Dataset-based Input (`--datasets`)
```
/archive/cellgeni/sanger/
  GSE123456/                             # Dataset uploaded to irodspath/id
    SAMPLE1/                             # Individual samples within dataset
      [STARsolo output files]
      [metadata attached to collection]
    SAMPLE2/
    [dataset metadata files]
  PRJEB12345/                            # Another dataset
    ...
```
### For Sample-based Input (`--samples`)
**With dataset_id specified:**
```
/archive/cellgeni/sanger/
  GSE123456/                             # Dataset ID from CSV
    SAMPLE1/                             # Sample uploaded to irodspath/dataset_id/id
      [STARsolo output files]
      [metadata attached to collection]
    SAMPLE2/
      ...
```
**Without dataset_id specified:**
```
/archive/cellgeni/sanger/
  SAMPLE1/                               # Sample uploaded directly to irodspath/id
    [STARsolo output files]
    [metadata attached to collection]
  SAMPLE2/
    ...
```
## System Requirements
- **Nextflow**: Version 25.04.4 or higher
- **Singularity**: For containerized execution
- **LSF**: For job scheduling on HPC
- **iRODS**: Client tools for data upload
- **Storage**: Sufficient space in `/lustre` and `/nfs` mount points
## Configuration
The pipeline is configured for Sanger's HPC environment with:
- LSF executor with per-job memory limits
- Singularity containers from `/nfs/cellgeni/singularity/images/`
- Bind mounts for `/lustre` and `/nfs` filesystems
- Retry strategy with up to 5 attempts per process
## Monitoring
The pipeline generates comprehensive reports:
- Timeline report: `reports/YYYYMMDD-HH-mm-ss_timeline.html`
- Execution report: `reports/YYYYMMDD-HH-mm-ss_report.html`
- Trace file: `reports/YYYYMMDD-HH-mm-ss_trace.tsv`
## Pipeline Outputs
- **iRODS Collections**: Organized dataset collections with attached metadata
- **QC Reports**: Mapping statistics and quality metrics
- **Metadata Files**: Comprehensive sample and dataset annotations
- **Upload Logs**: Records of transferred files with checksums
## Troubleshooting
### Common Issues:
1. **Missing QC files**: The pipeline will generate them automatically
2. **iRODS connection**: Ensure iRODS client is properly configured
3. **File permissions**: Check read access to input directories
4. **Storage space**: Ensure sufficient space for temporary files
### Log Files:
Check the Nextflow work directory (`nf-work/`) for detailed process logs and error messages.
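Nextflow records each task's exit status in a `.exitcode` file inside its task directory, next to `.command.log` and `.command.err`. One way to find the directories worth opening is sketched below; `failed_tasks` is a hypothetical helper, not something the pipeline provides.

```bash
# Sketch: list task directories under a Nextflow work directory whose
# recorded exit status is non-zero, so that their .command.err and
# .command.log files can be inspected.
failed_tasks() {
    workdir=$1
    find "$workdir" -name .exitcode -type f | while IFS= read -r f; do
        code=$(cat "$f")
        if [ "$code" != "0" ]; then
            printf '%s\t%s\n' "$code" "$(dirname "$f")"
        fi
    done
}

# Example: failed_tasks nf-work
```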
## Version
Current version: 0.0.1
## License
See repository for license information.