https://github.com/bge-barcoding/fasta-cleaner
Science Score: 13.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file: found codemeta.json file
- ○ .zenodo.json file
- ○ DOI references
- ○ Academic publication links
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (9.8%) to scientific vocabulary
Repository
Basic Info
- Host: GitHub
- Owner: bge-barcoding
- License: mit
- Language: Python
- Default Branch: main
- Size: 102 KB
Statistics
- Stars: 0
- Watchers: 2
- Forks: 1
- Open Issues: 1
- Releases: 0
Metadata Files
README.md
FASTA Cleaner
A Python tool for cleaning and analyzing FASTA sequence alignments (e.g. from https://github.com/cmayer/MitoGeneExtractor) using multiple filtering approaches. This tool is designed to help researchers identify and remove problematic sequences from their alignments while maintaining data quality and integrity.
Features
The FASTA Sequence Cleaner implements a multi-stage filtering pipeline that includes:
Human COX1 Contamination Detection
- Identifies sequences with high similarity to human COX1
- Configurable similarity threshold
- Uses efficient local alignment for comparison
- Helps prevent contamination from human DNA
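As a rough illustration of similarity scoring by local alignment, the sketch below implements a small Smith-Waterman scorer in pure Python and reports the fraction of identical positions in the best local alignment. The function name `local_identity`, the scoring values, and the normalisation by alignment length are assumptions made for this example; the tool itself may compute similarity differently (for example through BioPython's aligners).

```python
def local_identity(query, target, match=2, mismatch=-1, gap=-2):
    """Identity (0-1) of the best Smith-Waterman local alignment.

    Hypothetical helper: the scoring values are illustrative only.
    Alignment gap characters ('-') are stripped before scoring.
    """
    q = query.upper().replace("-", "")
    t = target.upper().replace("-", "")
    rows, cols = len(q) + 1, len(t) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Fill the dynamic-programming matrix, remembering the best cell
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if q[i - 1] == t[j - 1] else mismatch
            s = max(0,
                    score[i - 1][j - 1] + sub,
                    score[i - 1][j] + gap,
                    score[i][j - 1] + gap)
            score[i][j] = s
            if s > best:
                best, best_pos = s, (i, j)

    # Trace back from the best cell until the score drops to zero
    i, j = best_pos
    identical = aligned = 0
    while i > 0 and j > 0 and score[i][j] > 0:
        sub = match if q[i - 1] == t[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + sub:
            identical += int(q[i - 1] == t[j - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            i -= 1  # gap in target
        else:
            j -= 1  # gap in query
        aligned += 1
    return identical / aligned if aligned else 0.0
```

Under these assumptions, a sequence would be flagged when `local_identity(sequence, human_cox1_reference)` meets or exceeds the configured similarity threshold, where `human_cox1_reference` stands for whatever reference sequence is supplied.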
AT Content Analysis
- Compares AT content between sequences and consensus
- Identifies sequences with divergent nucleotide composition
- Supports multiple filtering modes (absolute, higher, lower)
- Customizable difference threshold
- Only considers overlapping regions for comparison
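The overlap-restricted comparison can be sketched as follows. This is a minimal example, assuming that "overlapping regions" means alignment columns where both the sequence and the consensus carry an unambiguous base; the function name `at_content_overlap` is hypothetical and not taken from the tool.

```python
def at_content_overlap(seq, consensus):
    """AT fraction of a sequence and of the consensus over shared columns.

    Assumption: a column counts only if both strings have A, C, G or T
    there, so gaps and ambiguity codes are ignored on either side.
    Returns (None, None) when there is no overlap at all.
    """
    pairs = [(s, c) for s, c in zip(seq.upper(), consensus.upper())
             if s in "ACGT" and c in "ACGT"]
    if not pairs:
        return None, None
    seq_at = sum(s in "AT" for s, _ in pairs) / len(pairs)
    cons_at = sum(c in "AT" for _, c in pairs) / len(pairs)
    return seq_at, cons_at
```

Restricting both values to the same columns keeps a short fragment from being penalised for the composition of consensus regions it does not cover.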
Statistical Outlier Detection
- Uses both weighted and unweighted deviation scores
- Position-specific residue frequency analysis
- Conservation-weighted sequence comparison
- Adjustable percentile threshold for outlier detection
- Robust handling of gaps and missing data
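One way such a detector could work is sketched below in pure Python. The exact formulas used by the tool are not given here, so everything in this example is an assumption: conservation is taken as the frequency of the most common residue in a column, the unweighted score is the fraction of covered columns where a sequence disagrees with that residue, and the weighted score counts each disagreement by the column's conservation. All function names are hypothetical, and input is assumed to be upper-case and of equal length.

```python
from collections import Counter

def column_frequencies(alignment):
    """Per-column residue frequencies, ignoring gaps ('-') and 'N'."""
    freqs = []
    for column in zip(*alignment):
        counts = Counter(b for b in column if b not in "-N")
        total = sum(counts.values())
        freqs.append({b: n / total for b, n in counts.items()} if total else {})
    return freqs

def deviation_scores(seq, freqs):
    """Return (unweighted, weighted) deviation from the column majority."""
    mismatches = covered = 0
    w_mismatch = w_total = 0.0
    for base, col in zip(seq, freqs):
        if base in "-N" or not col:
            continue  # skip gaps, missing data and empty columns
        top_base, conservation = max(col.items(), key=lambda kv: kv[1])
        covered += 1
        w_total += conservation
        if base != top_base:
            mismatches += 1
            w_mismatch += conservation
    if not covered:
        return 0.0, 0.0
    return mismatches / covered, w_mismatch / w_total

def percentile(values, pct):
    """Percentile with linear interpolation between closest ranks."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * pct / 100.0
    lo = int(rank)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (rank - lo)

def find_outliers(alignment, pct=90.0):
    """Indices of sequences whose score exceeds the percentile cut-off."""
    freqs = column_frequencies(alignment)
    scores = [deviation_scores(s, freqs) for s in alignment]
    cut_unweighted = percentile([u for u, _ in scores], pct)
    cut_weighted = percentile([w for _, w in scores], pct)
    return [i for i, (u, w) in enumerate(scores)
            if u > cut_unweighted or w > cut_weighted]
```

With `alignment = ["ACGT", "ACGT", "ACGT", "TCGA"]`, only the last sequence disagrees with the column majority, so it is the only candidate for removal at the 90th percentile.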
Reference Sequence Comparison
- Optional comparison against known reference sequences
- Supports multiple reference sequence files
- Additional metrics for reference-based filtering
- Weighted deviation scoring based on conservation
Installation
Prerequisites
- Python 3.6 or higher
- BioPython
- NumPy
- typing (for type hints; part of the standard library since Python 3.5, so no separate install is needed)
bash
pip install biopython numpy
Installation Steps
Clone this repository:
bash
git clone https://github.com/bge-barcoding/fasta-cleaner.git
cd fasta-cleaner
Install dependencies:
bash
pip install biopython numpy
Usage
Basic Usage
bash
python fasta_cleaner_combined.py -i input_dir -o output_dir
Advanced Usage
bash
python fasta_cleaner_combined.py \
-i input_dir \
-o output_dir \
-r reference_dir \
--human_threshold 0.95 \
--at_difference 0.1 \
--at_mode absolute \
--percentile_threshold 90.0 \
--consensus_threshold 0.5
Command Line Arguments
| Argument | Description | Default |
|----------|-------------|---------|
| -i, --input_dir | Directory containing input FASTA files | Required |
| -o, --output_dir | Output directory for processed files | Required |
| -r, --reference_dir | Directory containing reference sequences | Optional |
| -u, --human_threshold | Human COX1 similarity threshold (0-1) | 0.95 |
| -d, --at_difference | Maximum allowed AT content difference | 0.1 |
| -m, --at_mode | AT content filtering mode (absolute/higher/lower) | absolute |
| -p, --percentile_threshold | Percentile for outlier detection (0-100) | 90.0 |
| -c, --consensus_threshold | Consensus sequence generation threshold | 0.5 |
AT Content Filtering Modes
The tool supports three modes for AT content filtering:
- absolute: Removes sequences if AT content differs from consensus by more than threshold in either direction
- higher: Removes only sequences with AT content above consensus + threshold (i.e. AT is too high)
- lower: Removes only sequences with AT content below consensus - threshold (i.e. AT is too low)
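The three modes reduce to a comparison on the signed difference between a sequence's AT content and the consensus AT content. The sketch below shows that decision; the function name `fails_at_filter` is hypothetical, and the strict inequalities are an assumption about how the boundary case is treated.

```python
def fails_at_filter(seq_at, consensus_at, threshold=0.1, mode="absolute"):
    """Return True if a sequence should be removed by the AT filter.

    seq_at and consensus_at are AT fractions in the range 0-1.
    """
    diff = seq_at - consensus_at
    if mode == "absolute":
        return abs(diff) > threshold   # too far in either direction
    if mode == "higher":
        return diff > threshold        # AT is too high
    if mode == "lower":
        return diff < -threshold       # AT is too low
    raise ValueError(f"unknown AT mode: {mode!r}")
```

For example, a sequence with AT content 0.80 against a consensus of 0.60 is removed in absolute and higher modes but kept in lower mode.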
Filter Control Flags
| Flag | Description |
|------|-------------|
| --disable_human | Disable human COX1 similarity filtering |
| --disable_at | Disable AT content difference filtering |
| --disable_outliers | Disable statistical outlier detection |
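For readers who want to script around the tool, the arguments and flags tabulated above could be declared with `argparse` roughly as follows. This is a sketch reconstructed from the tables only; the actual parser in `fasta_cleaner_combined.py` may differ in help text, validation, or types, and `build_parser` is a hypothetical name.

```python
import argparse

def build_parser():
    """Build a parser mirroring the documented options (illustrative only)."""
    p = argparse.ArgumentParser(description="Clean FASTA sequence alignments")
    p.add_argument("-i", "--input_dir", required=True,
                   help="Directory containing input FASTA files")
    p.add_argument("-o", "--output_dir", required=True,
                   help="Output directory for processed files")
    p.add_argument("-r", "--reference_dir", default=None,
                   help="Directory containing reference sequences")
    p.add_argument("-u", "--human_threshold", type=float, default=0.95,
                   help="Human COX1 similarity threshold (0-1)")
    p.add_argument("-d", "--at_difference", type=float, default=0.1,
                   help="Maximum allowed AT content difference")
    p.add_argument("-m", "--at_mode", default="absolute",
                   choices=["absolute", "higher", "lower"],
                   help="AT content filtering mode")
    p.add_argument("-p", "--percentile_threshold", type=float, default=90.0,
                   help="Percentile for outlier detection (0-100)")
    p.add_argument("-c", "--consensus_threshold", type=float, default=0.5,
                   help="Consensus sequence generation threshold")
    # Filter control flags: each one switches a filtering stage off
    p.add_argument("--disable_human", action="store_true")
    p.add_argument("--disable_at", action="store_true")
    p.add_argument("--disable_outliers", action="store_true")
    return p
```

Calling `build_parser().parse_args(["-i", "in", "-o", "out", "--disable_at"])` yields the documented defaults for every threshold while skipping the AT content stage.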
Output Files
For each input FASTA file, the tool generates:
- *_cleaned.fasta: Sequences that passed all filters, ordered by start position
- *_removed_all.fasta: All removed sequences combined into one file
- *_removed_human.fasta: Sequences removed due to human similarity
- *_removed_at.fasta: Sequences removed due to AT content
- *_removed_outlier.fasta: Sequences removed as statistical outliers
- *_removed_reference.fasta: Sequences removed as reference-based outliers
- *_consensus.fasta: Final consensus sequence
- *_metrics.csv: Detailed metrics for all sequences
- *_log.txt: Processing log with parameters and statistics
- *_ordered_annotated.fasta: All original sequences with fate annotations, ordered by start position
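When collecting results in a downstream script, the expected output paths for one input file can be derived from the suffixes listed above. The sketch assumes the `*` prefix is the input file name without its extension and that all outputs sit directly in the output directory; neither detail is confirmed here, and `expected_outputs` is a hypothetical helper.

```python
from pathlib import Path

OUTPUT_SUFFIXES = [
    "_cleaned.fasta", "_removed_all.fasta", "_removed_human.fasta",
    "_removed_at.fasta", "_removed_outlier.fasta",
    "_removed_reference.fasta", "_consensus.fasta",
    "_metrics.csv", "_log.txt", "_ordered_annotated.fasta",
]

def expected_outputs(input_fasta, output_dir):
    """Map each documented suffix to its assumed output path."""
    stem = Path(input_fasta).stem  # assumption: '*' is the input file stem
    return {suffix: Path(output_dir) / f"{stem}{suffix}"
            for suffix in OUTPUT_SUFFIXES}
```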
Metrics and Analysis
The tool calculates comprehensive metrics for each sequence:
- Sequence length and composition
- AT content and deviation from consensus
- Human COX1 similarity scores using local alignment
- Position-specific conservation scores
- Weighted and unweighted deviation measures
- Conservation-based statistical scores
- Reference-based metrics (if enabled)
- Gap handling and position-specific frequencies
All metrics are saved in the CSV report for further analysis.
Sequence Processing Pipeline
The filtering pipeline processes sequences in this specific order:
- Remove sequences with high human COX1 similarity
- Filter sequences with divergent AT content
- Remove statistical outliers
- Compare against reference sequences (if provided)
After each filtering step:
- A new consensus sequence is generated from remaining sequences
- New position-specific frequencies are calculated
- New metrics are computed for all remaining sequences
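The filter-then-recompute loop described above can be sketched as follows. This is a simplified illustration, not the tool's implementation: the consensus rule (most common non-gap base, or `N` when its share falls below the threshold) is an assumed reading of the consensus threshold, and `consensus` and `run_pipeline` are hypothetical names. Each filter is a pair of a label and a function that takes a sequence and the current consensus.

```python
from collections import Counter

def consensus(alignment, threshold=0.5):
    """Majority-rule consensus over equal-length aligned sequences.

    Assumption: a column yields its most common non-gap base when that
    base reaches `threshold` of the non-gap residues, otherwise 'N'.
    """
    out = []
    for column in zip(*alignment):
        counts = Counter(b for b in column if b != "-")
        if not counts:
            out.append("-")  # column contains only gaps
            continue
        base, n = counts.most_common(1)[0]
        out.append(base if n / sum(counts.values()) >= threshold else "N")
    return "".join(out)

def run_pipeline(alignment, filters, threshold=0.5):
    """Apply filters in order, rebuilding the consensus after each step."""
    kept = list(alignment)
    removed = {}
    cons = consensus(kept, threshold)
    for name, should_remove in filters:
        step_removed, step_kept = [], []
        for seq in kept:
            (step_removed if should_remove(seq, cons) else step_kept).append(seq)
        removed[name] = step_removed
        kept = step_kept
        cons = consensus(kept, threshold)  # later filters see the new consensus
    return kept, removed, cons
```

Because the consensus is rebuilt after every step, a contaminant removed early no longer skews the reference that later filters compare against.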
Example
```bash
# Process a directory of FASTA files with custom thresholds and AT mode lower
python fasta_cleaner_combined.py \
-i /path/to/fasta/files \
-o /path/to/output \
-r /path/to/references \
--human_threshold 0.90 \
--at_difference 0.15 \
--at_mode lower \
--percentile_threshold 95.0
```
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
License
This project is licensed under the MIT License - see the LICENSE file for details.
Citation
If you use this tool in your research, please cite:
bibtex
@software{fasta_cleaner,
author = {Ben Price AND Daniel Parsons AND Jordan Beasley AND Claude Sonnet},
title = {FASTA Sequence Cleaner},
version = {1.0.0},
year = {2024},
url = {https://github.com/bge-barcoding/fasta-cleaner},
note = {Implements multiple sequence filtering approaches with position-specific analysis}
}
Acknowledgments
- Uses BioPython for sequence analysis
- Implements methods inspired by various sequence quality control approaches
- Developed to address common contamination and quality issues in sequence data
Support
For bugs, feature requests, or questions, please open an issue on GitHub.
Owner
- Name: BGE barcoding
- Login: bge-barcoding
- Kind: organization
- Website: https://biodiversitygenomics.eu/
- Twitter: BioGenEurope
- Repositories: 1
- Profile: https://github.com/bge-barcoding
Biodiversity Genomics Europe (BGE) - (meta)barcoding software artifacts
GitHub Events
Total
- Issues event: 10
- Issue comment event: 3
- Member event: 1
- Push event: 22
- Pull request event: 10
- Fork event: 2
Last Year
- Issues event: 10
- Issue comment event: 3
- Member event: 1
- Push event: 22
- Pull request event: 10
- Fork event: 2
Issues and Pull Requests
Last synced: 7 months ago
All Time
- Total issues: 6
- Total pull requests: 7
- Average time to close issues: 23 days
- Average time to close pull requests: 15 days
- Total issue authors: 1
- Total pull request authors: 1
- Average comments per issue: 0.5
- Average comments per pull request: 0.0
- Merged pull requests: 1
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 6
- Pull requests: 7
- Average time to close issues: 23 days
- Average time to close pull requests: 15 days
- Issue authors: 1
- Pull request authors: 1
- Average comments per issue: 0.5
- Average comments per pull request: 0.0
- Merged pull requests: 1
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- bwprice (6)
Pull Request Authors
- SchistoDan (7)