https://github.com/cellgeni/nf-processed2irods

Pipeline to move processed datasets to irods

https://github.com/cellgeni/nf-processed2irods

Science Score: 26.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (8.2%) to scientific vocabulary
Last synced: 9 months ago · JSON representation

Repository

Pipeline to move processed datasets to irods

Basic Info
  • Host: GitHub
  • Owner: cellgeni
  • License: mit
  • Language: Nextflow
  • Default Branch: main
  • Size: 27.3 KB
Statistics
  • Stars: 0
  • Watchers: 0
  • Forks: 0
  • Open Issues: 0
  • Releases: 0
Created 11 months ago · Last pushed 10 months ago

https://github.com/cellgeni/nf-processed2irods/blob/main/

# nf-processed2i## Pipeline Workflow

1. **Dataset/Sample Discovery** - Reads dataset or sample information from CSV input file
2. **Public Dataset Detection** - Identifies public datasets (GSE*, E-MTAB-*, PRJEB* patterns) 
3. **Metadata Parsing** - Extracts metadata from public repositories for public datasets
4. **Quality Control** - Generates mapping QC statistics from STARsolo output (if not already present)
5. **File Collection** - Gathers all data files and metadata files for upload
6. **iRODS Upload** - Transfers files to iRODS with checksums
7. **Metadata Attachment** - Attaches comprehensive metadata to iRODS collections

## Dry Run Mode

The pipeline supports a dry run mode (`--dry_run true`) that allows you to test and validate your pipeline configuration without actually uploading data to iRODS. This is particularly useful for:

### What Dry Run Does:
- **Validates input parameters** - Checks that all required parameters are provided and correctly formatted
- **Processes metadata** - Runs all metadata collection and parsing steps
- **Generates QC reports** - Creates mapping statistics and quality control files
- **Creates output files** - Generates `sample_metadata.csv` and `dataset_metadata.csv` files locally
- **Shows intended operations** - Displays what files would be uploaded and where

### What Dry Run Skips:
- **iRODS file uploads** - No files are actually transferred to iRODS storage
- **iRODS metadata attachment** - No metadata is attached to iRODS collections
- **MD5 checksum validation** - No checksums are computed or compared with iRODS

## Overview

This pipeline processes datasets containing single-cell RNA sequencing results (typically from STARsolo/CellRanger), extracts metadata, performs quality control analysis, and uploads the data with comprehensive metadata to iRODS for long-term storage and data management.

## Contents of Repo:
* `main.nf`  the main Nextflow pipeline that orchestrates data processing and iRODS upload
* `nextflow.config`  configuration script for IBM LSF submission on Sanger's HPC with Singularity containers and global parameters
* `modules/local/reprocess10x/parsemetadata/`  module for parsing metadata from public repositories (GEO, ENA)
* `modules/local/reprocess10x/mappingqc/`  module for extracting mapping quality control statistics
* `modules/local/irods/storefile/`  module for uploading files to iRODS
* `modules/local/irods/attachcollectionmeta/`  module for attaching metadata to iRODS collections

## Pipeline Workflow

1. **Dataset Discovery**: Reads dataset information from CSV input file
2. **Public Dataset Detection**: Identifies public datasets (GSE*, E-MTAB-*, PRJEB* patterns) 
3. **Metadata Parsing**: Extracts metadata from public repositories for public datasets
4. **Quality Control**: Generates mapping QC statistics from STARsolo output (if not already present)
5. **File Collection**: Gathers all data files and metadata files for upload
6. **iRODS Upload**: Transfers files to iRODS with checksums
7. **Metadata Attachment**: Attaches comprehensive metadata to iRODS collections

## Examples

### Basic Usage with Datasets
Upload processed datasets to iRODS:
```bash
nextflow run main.nf \
    --datasets datasets.csv \
    --irodspath "/archive/cellgeni/sanger/"
```

### Basic Usage with Individual Samples
Upload individual samples to iRODS:
```bash
nextflow run main.nf \
    --samples samples.csv \
    --irodspath "/archive/cellgeni/sanger/"
```

### Custom File Filtering
Specify which file types to ignore during upload:
```bash
nextflow run main.nf \
    --datasets datasets.csv \
    --irodspath "/archive/cellgeni/sanger/" \
    --ignore_pattern ".bam,.fastq.gz"
```

### Disable Public Metadata Collection
Skip automatic metadata retrieval for public datasets:
```bash
nextflow run main.nf \
    --datasets datasets.csv \
    --irodspath "/archive/cellgeni/sanger/" \
    --collect_public_metadata false
```

### Enable Verbose Output
Get detailed logging information:
```bash
nextflow run main.nf \
    --datasets datasets.csv \
    --irodspath "/archive/cellgeni/sanger/" \
    --verbose true
```

### Custom Output Directory
Specify a different output directory:
```bash
nextflow run main.nf \
    --datasets datasets.csv \
    --irodspath "/archive/cellgeni/sanger/" \
    --output_dir "my_results"
```

### Dry Run (Test Mode)
Test the pipeline without actually uploading files to iRODS:
```bash
nextflow run main.nf \
    --datasets datasets.csv \
    --irodspath "/archive/cellgeni/sanger/" \
    --dry_run true
```

This mode will:
- Validate all input parameters and files
- Process metadata and generate QC reports
- Show what would be uploaded without actual iRODS operations
- Create local output files (metadata CSV files) for review

## Pipeline Parameters

### Required Parameters:
* `--datasets`  Path to a CSV file containing dataset information with columns: `id` (dataset identifier) and `path` (local filesystem path to processed data directory)
  **OR**
* `--samples`  Path to a CSV file containing sample information with columns: `id` (sample identifier), `path` (local filesystem path), and optionally `dataset_id`
* `--irodspath`  Base path in iRODS where datasets will be stored (e.g., "/archive/cellgeni/sanger/")

### Optional Parameters:
* `--output_dir`  Output directory for pipeline results (`default: "results"`)
* `--publish_mode`  File publishing mode (`default: "copy"`)
* `--ignore_pattern`  Comma-separated list of file patterns to ignore during upload (`default: ".bam,.bai,.cram,.crai,.fastq.gz,.fq.gz,.fastq,.fq,.mate1.bz2,.mate2.bz2,.sh,.bsub,.pl"`)
* `--collect_public_metadata`  Collect metadata from public repositories for public datasets (`default: true`)
* `--parse_mapper_metrics`  Parse mapping QC metrics from STARsolo output (`default: true`)
* `--verbose`  Enable verbose output (`default: false`)
* `--dry_run`  Perform a dry run without uploading files to iRODS (`default: false`)

## Input File Format

The pipeline supports two input modes:

### Option 1: Dataset-based input (`--datasets`)
CSV file with the following structure:

```csv
id,path
GSE123456,/path/to/processed/GSE123456
PRJEB12345,/path/to/processed/PRJEB12345
EGA_DATASET,/path/to/processed/EGA_DATASET
```

Where:
- `id`: Unique dataset identifier (can be GEO accession, ENA project, or custom ID)
- `path`: Absolute path to the directory containing processed single-cell data with sample subdirectories

### Option 2: Sample-based input (`--samples`)
CSV file with the following structure:

```csv
id,path,dataset_id
SAMPLE1,/path/to/processed/SAMPLE1,GSE123456
SAMPLE2,/path/to/processed/SAMPLE2,GSE123456
SAMPLE3,/path/to/processed/SAMPLE3,
```

Where:
- `id`: Unique sample identifier 
- `path`: Absolute path to the sample directory containing processed single-cell data
- `dataset_id`: (Optional) Dataset identifier to group samples under. If omitted, samples will be uploaded directly to irodspath/id

## iRODS Path Structure

The pipeline uploads data to different iRODS paths depending on the input type:

- **Datasets (`--datasets`)**: Uploaded to `irodspath/id`
  - Example: `/archive/cellgeni/sanger/GSE123456/`
  
- **Samples with dataset_id (`--samples`)**: Uploaded to `irodspath/dataset_id/id`
  - Example: `/archive/cellgeni/sanger/GSE123456/SAMPLE1/`
  
- **Samples without dataset_id (`--samples`)**: Uploaded to `irodspath/id`
  - Example: `/archive/cellgeni/sanger/SAMPLE1/`

## Expected Data Structure

### For Dataset-based Input
Each dataset directory should contain:
- **Sample directories**: Named with sample identifiers containing STARsolo output files
- **QC files**: `*solo_qc.tsv` files (generated automatically if missing)
- **Metadata files**: For public datasets, metadata will be automatically retrieved

Example directory structure:
```
GSE123456/
 SAMPLE1/
    Aligned.sortedByCoord.out.bam
    Log.final.out
    Solo.out/
    ...
 SAMPLE2/
    ...
 GSE123456_solo_qc.tsv
```

### For Sample-based Input
Each sample directory should contain STARsolo output files:
```
SAMPLE1/
 Aligned.sortedByCoord.out.bam
 Log.final.out
 Solo.out/
    Gene/
    GeneFull/
    ...
 ...
```

## Metadata Handling

### Public Datasets
For datasets matching patterns `GSE*`, `E-MTAB-*`, or `PRJEB*`, the pipeline automatically:
- Retrieves sample metadata from public repositories
- Generates accession mapping files
- Creates sample relationship files
- Downloads study metadata

### All Datasets
The pipeline extracts and stores:
- **Sample-level metadata**: Species, sequencing type (paired/single), strand information, read counts, whitelist version
- **Dataset-level metadata**: Study accession numbers, dataset identifiers
- **Technical metadata**: File checksums, upload timestamps

## iRODS Structure

Data is organized in iRODS based on the input type:

### For Dataset-based Input (`--datasets`)
```
/archive/cellgeni/sanger/
 GSE123456/                    # Dataset uploaded to irodspath/id
    SAMPLE1/                  # Individual samples within dataset
       [STARsolo output files]
       [metadata attached to collection]
    SAMPLE2/
    [dataset metadata files]
 PRJEB12345/                   # Another dataset
     ...
```

### For Sample-based Input (`--samples`)
**With dataset_id specified:**
```
/archive/cellgeni/sanger/
 GSE123456/                    # Dataset ID from CSV
    SAMPLE1/                  # Sample uploaded to irodspath/dataset_id/id
       [STARsolo output files]
       [metadata attached to collection]
    SAMPLE2/
 ...
```

**Without dataset_id specified:**
```
/archive/cellgeni/sanger/
 SAMPLE1/                      # Sample uploaded directly to irodspath/id
    [STARsolo output files]
    [metadata attached to collection]
 SAMPLE2/
 ...
```

## System Requirements

- **Nextflow**: Version 25.04.4 or higher
- **Singularity**: For containerized execution
- **LSF**: For job scheduling on HPC
- **iRODS**: Client tools for data upload
- **Storage**: Sufficient space in `/lustre` and `/nfs` mount points

## Configuration

The pipeline is configured for Sanger's HPC environment with:
- LSF executor with per-job memory limits
- Singularity containers from `/nfs/cellgeni/singularity/images/`
- Bind mounts for `/lustre` and `/nfs` filesystems
- Retry strategy with up to 5 attempts per process

## Monitoring

The pipeline generates comprehensive reports:
- Timeline report: `reports/YYYYMMDD-HH-mm-ss_timeline.html`
- Execution report: `reports/YYYYMMDD-HH-mm-ss_report.html`
- Trace file: `reports/YYYYMMDD-HH-mm-ss_trace.tsv`

## Pipeline Outputs

- **iRODS Collections**: Organized dataset collections with attached metadata
- **QC Reports**: Mapping statistics and quality metrics
- **Metadata Files**: Comprehensive sample and dataset annotations
- **Upload Logs**: Records of transferred files with checksums

## Troubleshooting

### Common Issues:
1. **Missing QC files**: The pipeline will generate them automatically
2. **iRODS connection**: Ensure iRODS client is properly configured
3. **File permissions**: Check read access to input directories
4. **Storage space**: Ensure sufficient space for temporary files

### Log Files:
Check Nextflow work directory (`nf-work/`) for detailed process logs and error messages.

## Version

Current version: 0.0.1

## License

See repository for license information.

GitHub Events

Total
  • Push event: 8
Last Year
  • Push event: 8