https://github.com/bge-barcoding/fasta_compare
A Python script for analysing DNA barcode sequences from multiple FASTA files, evaluating sequence quality based on multiple metrics, and selecting the best sequences according to configurable BOLD BIN quality criteria.
Science Score: 26.0%
This score indicates how likely this project is to be science-related based on various indicators:
-
○CITATION.cff file
-
✓codemeta.json file
Found codemeta.json file -
✓.zenodo.json file
Found .zenodo.json file -
○DOI references
-
○Academic publication links
-
○Academic email domains
-
○Institutional organization owner
-
○JOSS paper metadata
-
○Scientific vocabulary similarity
Low similarity (10.3%) to scientific vocabulary
Repository
A Python script for analysing DNA barcode sequences from multiple FASTA files, evaluating sequence quality based on multiple metrics, and selecting the best sequences according to configurable BOLD BIN quality criteria.
Basic Info
- Host: GitHub
- Owner: bge-barcoding
- License: apache-2.0
- Language: Python
- Default Branch: main
- Size: 26.4 KB
Statistics
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
fasta_compare
A Python script for analysing DNA barcode sequences from multiple FASTA files, evaluating sequence quality based on multiple metrics, and selecting the best sequences according to configurable BOLD BIN quality criteria.
Overview
This tool can process multiple FASTA files containing DNA sequences for barcode markers (cox1/COI, rbcl, matk), perform quality assessment, and output filtered sequences to separate FASTA files along with detailed analysis reports. It's specifically designed for barcode sequence analysis and BOLD (Barcode of Life Data) BIN compliance.
Features
- Quality assessment: Evaluate sequences based on gaps, ambiguous bases, and continuous stretches (without gaps or ambiguous bases).
- Flexible ranking systems: 'Standard' and 'relaxed' barcode quality criteria.
- Target-specific processing: Handles and extracts different barcode sequences using known barcoding regions.
- Comprehensive output and reporting: Produces detailed CSV output with the following metrics:
| Column | Description |
|--------|-------------|
| file | Source FASTA file path |
| processid | Extracted process identifier (e.g. BSNHM001-24) |
| parameters | Extracted BGEE run parameters (e.g. r1s100) |
| seqid | Full sequence identifier/found from fasta header (e.g. BSNHM001-24r1s100BSNHM001-24) |
| length | Full sequence length (including gaps ('-') and ambiguous bases ('N') |
| leadinggaps | Count of leading gap ('-') characters |
| trailinggaps | Count of trailing gap ('-') characters |
| internalgaps | Count of internal gap ('-') characters |
| ambiguousbases | Count of 'N' characters |
| longeststretch | Longest continuous sequence without gaps or ambiguous bases |
| barcodelength | Length of barcode region (fixed by --target) |
| barcodeambiguousbases | Count of N in barcode region |
| barcodelongeststretch | Longest continuous barcode sequence without gaps or ambiguous bases |
| barcoderank | Barcode quality rank (1-6) |
| fullrank | Full sequence quality rank (1-3) |
| bestsequence | Whether this is the best sequence for the processid |
| selectedfullfasta | Whether this sequence was output to full FASTA |
| selectedbarcodefasta | Whether this sequence was output to barcode FASTA |
Installation
Requirements/Dependencies
```bash pip install biopython
+ Standard Python libraries (csv, re, os, argparse, logging, datetime)
```
Usage
Basic Usage
bash
python fasta_compare.py \
--output-csv results.csv \
--output-fasta best_sequences.fasta \
--output-barcode best_barcodes.fasta \
--input file1.fasta file2.fasta \
--target cox1
Advanced Usage
```bash
With custom rank threshold and relaxed criteria
python fastacompare.py \ --output-csv results.csv \ --output-fasta bestsequences.fasta \ --output-barcode best_barcodes.fasta \ --input *.fasta \ --target cox1 \ --rank 2 \ --relaxed \ --verbose ```
Command Line Arguments
Required Arguments
| Argument | Short | Description |
|----------|-------|-------------|
| --output-csv | -o | Path to output CSV analysis file |
| --output-fasta | -of | Path to output FASTA file for best full sequences |
| --output-barcode | -ob | Path to output FASTA file for best barcode regions |
| --input | -i | Input FASTA files to analyze (space-separated) |
| --target | -t | Target genetic marker (cox1/COI/CO1, rbcl/RBCL, or matk/MATK) |
Optional Arguments
| Argument | Description | Default |
|----------|-------------|---------|
| --rank | Maximum acceptable barcode rank for selection (see below for rank explanation) | 3 |
| --relaxed | Use relaxed barcode ranking criteria (see below for explanation) | False |
| --log-file | Custom path for log file | Auto-generated timestamp |
| --verbose -v | Enable detailed debug logging | False |
Barcode Regions by Target
| Target | Barcode Region | Description | |--------|----------------|-------------| | cox1/COI | 40-700 | Cytochrome c oxidase subunit I | | rbcl | 1-700 | RuBisCO large subunit | | matk | 1-900 | Maturase K | * ITS(2) region to be added in the near future
Ranking Systems
Standard Barcode Ranking (1-6, lower is better)
| Rank | Criteria | |------|----------| | 1 | No ambiguous bases, longest stretch ≥ 650 | | 2 | No ambiguous bases, longest stretch ≥ 500 | | 3 | No ambiguous bases, 300 ≤ longest stretch ≤ 499 | | 4 | No ambiguous bases, 1 ≤ longest stretch ≤ 299 | | 5 | Has ambiguous bases | | 6 | Other cases |
Relaxed Barcode Ranking (1-6, lower is better)
| Rank | Criteria | Notes | |------|----------|-------| | 1 | No ambiguous bases, longest stretch ≥ 500 | BIN compliant | | 2 | No ambiguous bases, 200 ≤ longest stretch ≤ 499 | Good but no BIN | | 3 | No ambiguous bases, 100 ≤ longest stretch ≤ 199 | Not great, better than nothing | | 4 | <6 ambiguous bases, longest stretch ≥ 500 | | | 5 | <6 ambiguous bases, longest stretch ≥ 300 | | | 6 | Other cases | |
Full Sequence Ranking (1-3, lower is better)
| Rank | Criteria | |------|----------| | 1 | No ambiguous bases | | 2 | Has ambiguous bases | | 3 | Other cases |
Processing Logic
1. Sequence Analysis Loop
For each FASTA file:
For each sequence record:
├── Parse sequence ID and extract process_id/parameters
├── Calculate basic metrics:
│ ├── Length, leading/trailing gaps, internal gaps
│ ├── Ambiguous bases (N count)
│ └── Longest continuous stretch without gaps
├── Extract barcode region based on target marker
├── Calculate barcode-specific metrics:
│ ├── Barcode length, ambiguous bases
│ └── Longest barcode stretch without gaps
├── Determine rankings:
│ ├── Barcode rank (standard or relaxed)
│ └── Full sequence rank
└── Store all metrics and sequence record
2. Best Sequence Selection
For each unique processid:
```
Selection criteria (in order of priority):
├── Barcode rank (lower is better)
├── Full sequence rank (lower is better)
├── Barcode longest stretch (higher is better)
├── Full sequence longest stretch (higher is better)
├── Internal gaps (lower is better)
├── Ambiguous bases (lower is better)
└── Unique identifier (for deterministic selection)
Result = One "best" sequence per processid
```
3. Output Sequence Selection
The script uses a two-tier selection process:
For each process_id:
├── First attempt: Find sequences where:
│ ├── full_rank == 1 (no ambiguous bases)
│ └── barcode_rank <= rank_threshold
├── If found: Use for BOTH full sequence and barcode output
├── Second attempt: Find sequences where:
│ ├── full_rank == 2 (has ambiguous bases)
│ └── barcode_rank <= rank_threshold
├── If found: Use ONLY for barcode output (no full sequence)
└── If none found: No output for this process_id
4. Sequence Formatting
Full Sequences
Original (full) sequence
├── Trim leading and trailing gaps ('-', '~')
├── Remove internal '~' characters (stitch sequence)
├── Replace internal '-' characters with 'N'
└── Trim leading and trailing 'N' characters
Barcode Sequences
Extract barcode region
├── Remove internal '~' characters (stitch sequence)
├── For gaps marked with '-':
│ ├── If ≤ 6 bases: fill with 'N'
│ └── If > 6 bases: keep longest fragment
└── Trim leading and trailing 'N' characters
5. Output Generation
Generate three output files:
├── CSV report: All sequences with complete metrics
├── Full sequences FASTA: Best full sequences (one per process_id)
└── Barcode FASTA: Best barcode regions (one per process_id)
Valid FASTA header formats
The script expects sequence IDs to be in a particular format to correctly parse processid and parameters, e.g.: ``` For BOLD123r1s100BOLD123: - processid = BOLD123 - parameters = r1s100_BOLD123
For SAMPLEArtrial1: - processid = SAMPLEA - parameters = rtrial1
For complexnamehererNsNotherstuff: - processid = complexnamehere - parameters = rNsNotherstuff
For ConsensusBOLD123rparam1param2:
- processid = BOLD123
- parameters = `rparam1_param2
``
Gap Character Handling
| Character | Type | Processing |
|-----------|------|------------|
| - | Standard gap | Converted to N (if ≤6) or causes fragmentation (if >6) |
| ~ | Sequencing gap | Removed, sequences stitched together |
| N | Ambiguous base | Counted in quality metrics, trimmed from ends |
Performance Considerations
- Memory usage: Sequences are held in memory during processing
- Progress reporting: Logs progress every 100 sequences for large files
- File handling: Processes files sequentially, not in parallel
- Output formatting: FASTA sequences written without line breaks for consistency
Troubleshooting
Common Issues
- "Input file does not exist": Check file paths and permissions
- "Invalid target marker": Use cox1/COI/CO1, rbcl/RBCL, or matk/MATK
- "Invalid rank value": Rank must be between 1-6
- Empty output files: No sequences met the quality criteria; try
--relaxedor a higher--rank
Authors
Created by Ben Price & Dan Parsons @ Natural History Museum UK (NHMUK)
Contributing
- Please suggest any improvements in 'Issues' - contributions welcome!
Owner
- Name: BGE barcoding
- Login: bge-barcoding
- Kind: organization
- Website: https://biodiversitygenomics.eu/
- Twitter: BioGenEurope
- Repositories: 1
- Profile: https://github.com/bge-barcoding
Biodiversity Genomics Europe (BGE) - (meta)barcoding software artifacts
GitHub Events
Total
- Push event: 3
- Create event: 2
Last Year
- Push event: 3
- Create event: 2