https://github.com/bge-barcoding/bold-library-curation
Science Score: 23.0%
This score indicates how likely this project is to be science-related based on various indicators:
-
○CITATION.cff file
-
○codemeta.json file
-
○.zenodo.json file
-
✓DOI references
Found 9 DOI reference(s) in README -
✓Academic publication links
Links to: zenodo.org -
○Academic email domains
-
○Institutional organization owner
-
○JOSS paper metadata
-
○Scientific vocabulary similarity
Low similarity (12.5%) to scientific vocabulary
Last synced: 9 months ago
·
JSON representation
Repository
Basic Info
- Host: GitHub
- Owner: bge-barcoding
- License: gpl-3.0
- Language: Python
- Default Branch: main
- Size: 10.1 MB
Statistics
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 3
- Releases: 0
Fork of FabianDeister/Library_curation_BOLD
Created about 2 years ago
· Last pushed 11 months ago
https://github.com/bge-barcoding/bold-library-curation/blob/main/
# BOLD Library Curation Pipeline  [](https://doi.org/10.5281/zenodo.11267905) [](https://doi.org/10.48546/WORKFLOWHUB.WORKFLOW.833.1) [](https://www.gnu.org/licenses/gpl-3.0)## Overview A comprehensive Snakemake pipeline for processing and curating [BOLD Systems](https://boldsystems.org) barcode reference sequence data. This tool implements standardized quality assessment criteria developed by the [Biodiversity Genomics Europe (BGE)](https://biodiversitygenomics.eu) consortium to evaluate and rank DNA barcode sequences for library curation. ### Key Features - **Comprehensive Quality Assessment**: Evaluates specimens against 16 standardized criteria including metadata completeness, voucher information, sequence quality, and phylogenetic analyses - **Advanced Phylogenetic Analysis**: Includes haplotype identification and OTU clustering for genetic diversity assessment - **BAGS Species Assessment**: Automated species-level quality grading system with subspecies inheritance - **Geographic Representation**: Country representative selection for balanced geographic sampling - **Scalable Architecture**: Family-level database splitting for efficient analysis of large datasets - **FAIR Compliance**: Built with reproducibility and provenance tracking using Snakemake workflows ## Background The classification criteria are actively developed by the BGE consortium and documented in this [living document](https://docs.google.com/document/d/18m-7UnoJTG49TbvTsq_VncKMYZbYVbau98LE_q4rQvA/edit). The pipeline evaluates sequences based on multiple quality dimensions to support evidence-based curation decisions. ## Pipeline Workflow [DETAILED WORKFLOW](https://github.com/bge-barcoding/bold-library-curation/blob/main/workflow/README.md) [Example RUN OUTPUT](https://github.com/bge-barcoding/bold-library-curation/blob/main/workflow/example_run_report.md) The pipeline processes BOLD data through six main phases: 1. **Data Preparation**: Optional pre-filtering by taxa, geography, or genetic markers 2. **Database Setup**: SQLite database creation with taxonomic enrichment 3. **Quality Assessment**: Evaluation against 16 standardized criteria plus phylogenetic analyses 4. **BAGS Assessment**: Species-level quality grading with database optimization 5. **Data Integration**: Ranking system combining all assessments with country representative selection 6. **Family Splitting**: Creation of family-level databases for scalable downstream analysis ### Assessment Criteria **Specimen Metadata**: Collection date, collectors, identifier, identification method **Geographic Data**: Country, region, site, sector, coordinates **Repository Info**: Institution, museum ID, public voucher **Sequence Quality**: DNA quality metrics, species ID, type specimen, images **Phylogenetic**: Haplotype identification, OTU clustering ## Requirements - **Operating System**: Linux, macOS, or Windows with WSL - **Package Manager**: Mamba (recommended) or Conda - **Memory**: Minimum 8GB RAM (16GB+ recommended for large datasets) - **Storage**: Sufficient space for BOLD data and family databases - **Dependencies**: VSEARCH (for OTU clustering), SQLite, Perl, Python Data**: Country, region, site, sector, coordinates **Repository Info**: Institution, museum ID, public voucher **Sequence Quality**: DNA quality metrics, species ID, type specimen, images **Phylogenetic**: Haplotype identification, OTU clustering ## Requirements - **Operating System**: Linux, macOS, or Windows with WSL - **Package Manager**: Mamba (recommended) or Conda - **Memory**: Minimum 8GB RAM (16GB+ recommended for large datasets) - **Storage**: Sufficient space for BOLD data and family databases - **Dependencies**: VSEARCH (for OTU clustering), SQLite, Perl, Python ## Installation 1. **Clone the repository:** ```bash git clone https://github.com/bge-barcoding/bold-library-curation.git cd bold-library-curation ``` 2. **Set up the environment:** ```bash mamba env create -f environment.yml mamba activate bold-curation ``` ## Quick Start 1. **Configure your analysis** in `config/config.yml`: ```yaml # Input data BOLD_TSV: "resources/your_bold_data.tsv" # Optional filtering ENABLE_PRESCORING_FILTER: false USE_TARGET_LIST: false # Analysis parameters OTU_CLUSTERING_THRESHOLD: 0.99 FAMILY_SIZE_THRESHOLD: 10000 ``` 2. **Run the pipeline:** ```bash snakemake --cores 4 --use-conda ``` 3. **Check results** in `results/result_output.tsv` and `results/family_databases/` ## Configuration Options ### Core Settings - **BOLD_TSV**: Path to BOLD data dump (BCDM TSV format) - **RESULTS_DIR/LOG_DIR**: Customizable output directories - **TAXONOMY_CHUNK_SIZE**: Memory optimization (default: 10,000) ### Optional Filtering - **Pre-scoring Filter**: Early dataset reduction by taxa, countries, markers, or BIN sharing - **Target Lists**: Focus on specific species of interest - **Geographic Filtering**: Country-based specimen filtering ### Analysis Parameters - **OTU_CLUSTERING_THRESHOLD**: Genetic similarity for clustering (default: 0.99) - **OTU_CLUSTERING_THREADS**: Parallel processing threads (default: 8) - **FAMILY_SIZE_THRESHOLD**: Minimum records for individual family databases (default: 10,000) ## Usage Examples ### Basic Analysis ```bash # Standard pipeline execution snakemake --cores 8 --use-conda ``` ### Large Dataset with Pre-filtering ```bash # Configure pre-filtering in config.yml first # ENABLE_PRESCORING_FILTER: true # FILTER_TAXA: true # FILTER_TAXA_LIST: "resources/target_taxa.txt" snakemake --cores 16 --use-conda --resources mem_mb=32000 ``` ### HPC Cluster Execution (SLURM) ```bash #!/bin/bash #SBATCH --partition=day #SBATCH --mem=32G #SBATCH --cpus-per-task=16 source activate bold-curation snakemake --cores 16 --use-conda ``` ### Target Species Focus ```bash # Enable target list in config.yml # USE_TARGET_LIST: true # TARGET_LIST: "resources/target_species.csv" snakemake --cores 8 --use-conda ``` ## Output Files ### Primary Outputs - **`results/result_output.tsv`**: Final scored and ranked specimens with all assessments - **`results/family_databases/`**: Family-level SQLite databases organized by phylum - **`results/pipeline_summary.txt`**: Comprehensive execution summary ### Assessment Results - **Individual criteria files**: `assessed_*.tsv` for each quality criterion - **BAGS assessment**: Species-level quality grades with BIN sharing analysis - **Phylogenetic analyses**: Haplotype IDs and OTU clustering results - **Database**: Complete SQLite database with specialized tables for complex queries ### Performance Monitoring - **Comprehensive logging**: Step-by-step execution logs in `logs/` directory - **Progress tracking**: Real-time monitoring for long-running operations - **Error handling**: Detailed debugging information for troubleshooting ## Performance Guidelines ### Dataset Size Recommendations - **Small** (< 10K records): 30-60 minutes, 8GB RAM - **Medium** (10K-100K records): 2-6 hours, 16GB RAM - **Large** (100K+ records): 6-24 hours, 32GB+ RAM - **Very large** (1M+ records): 1-3 days, consider pre-filtering ### Optimization Tips 1. **Use pre-scoring filter** for very large datasets 2. **Adjust memory settings** based on available resources 3. **Configure threading** for OTU clustering based on CPU cores 4. **Use SSD storage** for database operations when possible ## Project Structure ``` bold-library-curation/ config/ # Configuration files workflow/ # Snakemake workflow bold-ranker.smk # Main workflow definition scripts/ # Analysis scripts envs/ # Conda environments README.md # Detailed workflow documentation resources/ # Input data and references results/ # Output files and databases logs/ # Execution logs ``` ## Advanced Features ### Phylogenetic Integration - **Haplotype Analysis**: Identifies genetic variants within species/BIN groups - **OTU Clustering**: VSEARCH-based clustering with configurable similarity thresholds - **Multi-threaded Processing**: Parallel execution for computationally intensive steps ### Geographic Representation - **Country Representatives**: Systematic selection of optimal specimens per region - **Balanced Sampling**: Maintains geographic diversity while optimizing quality ### Database Architecture - **Hierarchical Organization**: Family-level databases organized by phylum - **Specialized Tables**: Optimized schema for different data types - **Efficient Querying**: Comprehensive indexing for complex analyses ## Contributing We welcome contributions! Please see our [contributing guidelines](CONTRIBUTING.md) and: 1. Fork the repository 2. Create a feature branch 3. Make changes with appropriate tests 4. Submit a pull request ## License This project is licensed under the GNU General Public License v3.0. See [LICENSE](LICENSE) for details. ## Citation If you use this tool in your research, please cite: ``` [Citation information - to be added when published] ``` **DOI**: [10.5281/zenodo.11267905](https://doi.org/10.5281/zenodo.11267905) ## Support - **Issues**: Report bugs and request features via [GitHub Issues](https://github.com/bge-barcoding/bold-library-curation/issues) - **Documentation**: Detailed workflow documentation in `workflow/README.md` - **BGE Consortium**: Visit [biodiversitygenomics.eu](https://biodiversitygenomics.eu) for project updates ## Acknowledgments This work is supported by the Biodiversity Genomics Europe (BGE) consortium and contributes to the International Barcode of Life (IBOL) initiative. Biodiversity Genomics Europe (Grant no.101059492) is funded by Horizon Europe under the Biodiversity, Circular Economy and Environment call (REA.B.3); co-funded by the Swiss State Secretariat for Education, Research and Innovation (SERI) under contract numbers 22.00173 and 24.00054; and by the UK Research and Innovation (UKRI) under the Department for Business, Energy and Industrial Strategy's Horizon Europe Guarantee Scheme.
Owner
- Name: BGE barcoding
- Login: bge-barcoding
- Kind: organization
- Website: https://biodiversitygenomics.eu/
- Twitter: BioGenEurope
- Repositories: 1
- Profile: https://github.com/bge-barcoding
Biodiversity Genomics Europe (BGE) - (meta)barcoding software artifacts
GitHub Events
Total
- Create event: 6
- Release event: 1
- Issues event: 17
- Watch event: 1
- Delete event: 2
- Member event: 3
- Issue comment event: 6
- Push event: 64
- Pull request event: 9
Last Year
- Create event: 6
- Release event: 1
- Issues event: 17
- Watch event: 1
- Delete event: 2
- Member event: 3
- Issue comment event: 6
- Push event: 64
- Pull request event: 9
## Overview
A comprehensive Snakemake pipeline for processing and curating [BOLD Systems](https://boldsystems.org) barcode reference sequence data. This tool implements standardized quality assessment criteria developed by the [Biodiversity Genomics Europe (BGE)](https://biodiversitygenomics.eu) consortium to evaluate and rank DNA barcode sequences for library curation.
### Key Features
- **Comprehensive Quality Assessment**: Evaluates specimens against 16 standardized criteria including metadata completeness, voucher information, sequence quality, and phylogenetic analyses
- **Advanced Phylogenetic Analysis**: Includes haplotype identification and OTU clustering for genetic diversity assessment
- **BAGS Species Assessment**: Automated species-level quality grading system with subspecies inheritance
- **Geographic Representation**: Country representative selection for balanced geographic sampling
- **Scalable Architecture**: Family-level database splitting for efficient analysis of large datasets
- **FAIR Compliance**: Built with reproducibility and provenance tracking using Snakemake workflows
## Background
The classification criteria are actively developed by the BGE consortium and documented in this [living document](https://docs.google.com/document/d/18m-7UnoJTG49TbvTsq_VncKMYZbYVbau98LE_q4rQvA/edit). The pipeline evaluates sequences based on multiple quality dimensions to support evidence-based curation decisions.
## Pipeline Workflow
[DETAILED WORKFLOW](https://github.com/bge-barcoding/bold-library-curation/blob/main/workflow/README.md)
[Example RUN OUTPUT](https://github.com/bge-barcoding/bold-library-curation/blob/main/workflow/example_run_report.md)
The pipeline processes BOLD data through six main phases:
1. **Data Preparation**: Optional pre-filtering by taxa, geography, or genetic markers
2. **Database Setup**: SQLite database creation with taxonomic enrichment
3. **Quality Assessment**: Evaluation against 16 standardized criteria plus phylogenetic analyses
4. **BAGS Assessment**: Species-level quality grading with database optimization
5. **Data Integration**: Ranking system combining all assessments with country representative selection
6. **Family Splitting**: Creation of family-level databases for scalable downstream analysis
### Assessment Criteria
**Specimen Metadata**: Collection date, collectors, identifier, identification method
**Geographic Data**: Country, region, site, sector, coordinates
**Repository Info**: Institution, museum ID, public voucher
**Sequence Quality**: DNA quality metrics, species ID, type specimen, images
**Phylogenetic**: Haplotype identification, OTU clustering
## Requirements
- **Operating System**: Linux, macOS, or Windows with WSL
- **Package Manager**: Mamba (recommended) or Conda
- **Memory**: Minimum 8GB RAM (16GB+ recommended for large datasets)
- **Storage**: Sufficient space for BOLD data and family databases
- **Dependencies**: VSEARCH (for OTU clustering), SQLite, Perl, Python
Data**: Country, region, site, sector, coordinates
**Repository Info**: Institution, museum ID, public voucher
**Sequence Quality**: DNA quality metrics, species ID, type specimen, images
**Phylogenetic**: Haplotype identification, OTU clustering
## Requirements
- **Operating System**: Linux, macOS, or Windows with WSL
- **Package Manager**: Mamba (recommended) or Conda
- **Memory**: Minimum 8GB RAM (16GB+ recommended for large datasets)
- **Storage**: Sufficient space for BOLD data and family databases
- **Dependencies**: VSEARCH (for OTU clustering), SQLite, Perl, Python
## Installation
1. **Clone the repository:**
```bash
git clone https://github.com/bge-barcoding/bold-library-curation.git
cd bold-library-curation
```
2. **Set up the environment:**
```bash
mamba env create -f environment.yml
mamba activate bold-curation
```
## Quick Start
1. **Configure your analysis** in `config/config.yml`:
```yaml
# Input data
BOLD_TSV: "resources/your_bold_data.tsv"
# Optional filtering
ENABLE_PRESCORING_FILTER: false
USE_TARGET_LIST: false
# Analysis parameters
OTU_CLUSTERING_THRESHOLD: 0.99
FAMILY_SIZE_THRESHOLD: 10000
```
2. **Run the pipeline:**
```bash
snakemake --cores 4 --use-conda
```
3. **Check results** in `results/result_output.tsv` and `results/family_databases/`
## Configuration Options
### Core Settings
- **BOLD_TSV**: Path to BOLD data dump (BCDM TSV format)
- **RESULTS_DIR/LOG_DIR**: Customizable output directories
- **TAXONOMY_CHUNK_SIZE**: Memory optimization (default: 10,000)
### Optional Filtering
- **Pre-scoring Filter**: Early dataset reduction by taxa, countries, markers, or BIN sharing
- **Target Lists**: Focus on specific species of interest
- **Geographic Filtering**: Country-based specimen filtering
### Analysis Parameters
- **OTU_CLUSTERING_THRESHOLD**: Genetic similarity for clustering (default: 0.99)
- **OTU_CLUSTERING_THREADS**: Parallel processing threads (default: 8)
- **FAMILY_SIZE_THRESHOLD**: Minimum records for individual family databases (default: 10,000)
## Usage Examples
### Basic Analysis
```bash
# Standard pipeline execution
snakemake --cores 8 --use-conda
```
### Large Dataset with Pre-filtering
```bash
# Configure pre-filtering in config.yml first
# ENABLE_PRESCORING_FILTER: true
# FILTER_TAXA: true
# FILTER_TAXA_LIST: "resources/target_taxa.txt"
snakemake --cores 16 --use-conda --resources mem_mb=32000
```
### HPC Cluster Execution (SLURM)
```bash
#!/bin/bash
#SBATCH --partition=day
#SBATCH --mem=32G
#SBATCH --cpus-per-task=16
source activate bold-curation
snakemake --cores 16 --use-conda
```
### Target Species Focus
```bash
# Enable target list in config.yml
# USE_TARGET_LIST: true
# TARGET_LIST: "resources/target_species.csv"
snakemake --cores 8 --use-conda
```
## Output Files
### Primary Outputs
- **`results/result_output.tsv`**: Final scored and ranked specimens with all assessments
- **`results/family_databases/`**: Family-level SQLite databases organized by phylum
- **`results/pipeline_summary.txt`**: Comprehensive execution summary
### Assessment Results
- **Individual criteria files**: `assessed_*.tsv` for each quality criterion
- **BAGS assessment**: Species-level quality grades with BIN sharing analysis
- **Phylogenetic analyses**: Haplotype IDs and OTU clustering results
- **Database**: Complete SQLite database with specialized tables for complex queries
### Performance Monitoring
- **Comprehensive logging**: Step-by-step execution logs in `logs/` directory
- **Progress tracking**: Real-time monitoring for long-running operations
- **Error handling**: Detailed debugging information for troubleshooting
## Performance Guidelines
### Dataset Size Recommendations
- **Small** (< 10K records): 30-60 minutes, 8GB RAM
- **Medium** (10K-100K records): 2-6 hours, 16GB RAM
- **Large** (100K+ records): 6-24 hours, 32GB+ RAM
- **Very large** (1M+ records): 1-3 days, consider pre-filtering
### Optimization Tips
1. **Use pre-scoring filter** for very large datasets
2. **Adjust memory settings** based on available resources
3. **Configure threading** for OTU clustering based on CPU cores
4. **Use SSD storage** for database operations when possible
## Project Structure
```
bold-library-curation/
config/ # Configuration files
workflow/ # Snakemake workflow
bold-ranker.smk # Main workflow definition
scripts/ # Analysis scripts
envs/ # Conda environments
README.md # Detailed workflow documentation
resources/ # Input data and references
results/ # Output files and databases
logs/ # Execution logs
```
## Advanced Features
### Phylogenetic Integration
- **Haplotype Analysis**: Identifies genetic variants within species/BIN groups
- **OTU Clustering**: VSEARCH-based clustering with configurable similarity thresholds
- **Multi-threaded Processing**: Parallel execution for computationally intensive steps
### Geographic Representation
- **Country Representatives**: Systematic selection of optimal specimens per region
- **Balanced Sampling**: Maintains geographic diversity while optimizing quality
### Database Architecture
- **Hierarchical Organization**: Family-level databases organized by phylum
- **Specialized Tables**: Optimized schema for different data types
- **Efficient Querying**: Comprehensive indexing for complex analyses
## Contributing
We welcome contributions! Please see our [contributing guidelines](CONTRIBUTING.md) and:
1. Fork the repository
2. Create a feature branch
3. Make changes with appropriate tests
4. Submit a pull request
## License
This project is licensed under the GNU General Public License v3.0. See [LICENSE](LICENSE) for details.
## Citation
If you use this tool in your research, please cite:
```
[Citation information - to be added when published]
```
**DOI**: [10.5281/zenodo.11267905](https://doi.org/10.5281/zenodo.11267905)
## Support
- **Issues**: Report bugs and request features via [GitHub Issues](https://github.com/bge-barcoding/bold-library-curation/issues)
- **Documentation**: Detailed workflow documentation in `workflow/README.md`
- **BGE Consortium**: Visit [biodiversitygenomics.eu](https://biodiversitygenomics.eu) for project updates
## Acknowledgments
This work is supported by the Biodiversity Genomics Europe (BGE) consortium and contributes to the International Barcode of Life (IBOL) initiative. Biodiversity Genomics Europe (Grant no.101059492) is funded by Horizon Europe under the Biodiversity, Circular Economy and Environment call (REA.B.3); co-funded by the Swiss State Secretariat for Education, Research and Innovation (SERI) under contract numbers 22.00173 and 24.00054; and by the UK Research and Innovation (UKRI) under the Department for Business, Energy and Industrial Strategy's Horizon Europe Guarantee Scheme.