https://github.com/bge-barcoding/fasta-cleaner


Repository

Basic Info

Host: GitHub
Owner: bge-barcoding
License: MIT
Language: Python
Default Branch: main
Size: 102 KB

Statistics

Stars: 0
Watchers: 2
Forks: 1
Open Issues: 1
Releases: 0

Created over 1 year ago · Last pushed over 1 year ago


FASTA Cleaner

A Python tool for cleaning and analyzing FASTA sequence alignments (e.g. from https://github.com/cmayer/MitoGeneExtractor) using multiple filtering approaches. This tool is designed to help researchers identify and remove problematic sequences from their alignments while maintaining data quality and integrity.

Features

The FASTA Sequence Cleaner implements a filtering pipeline with four stages:

Human COX1 Contamination Detection
- Identifies sequences with high similarity to human COX1
- Configurable similarity threshold
- Uses efficient local alignment for comparison
- Helps prevent contamination from human DNA
AT Content Analysis
- Compares AT content between sequences and consensus
- Identifies sequences with divergent nucleotide composition
- Supports multiple filtering modes (absolute, higher, lower)
- Customizable difference threshold
- Only considers overlapping regions for comparison
Statistical Outlier Detection
- Uses both weighted and unweighted deviation scores
- Position-specific residue frequency analysis
- Conservation-weighted sequence comparison
- Adjustable percentile threshold for outlier detection
- Robust handling of gaps and missing data
Reference Sequence Comparison
- Optional comparison against known reference sequences
- Supports multiple reference sequence files
- Additional metrics for reference-based filtering
- Weighted deviation scoring based on conservation
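
As a rough illustration of the human COX1 check listed above, the sketch below scores a sequence against a reference with a small Smith-Waterman local alignment and normalises the score to the 0-1 range. The scoring values, the normalisation, and the names `local_alignment_score` and `human_similarity` are assumptions made for this example; the tool's own implementation may differ.

```python
def local_alignment_score(a: str, b: str, match: int = 2,
                          mismatch: int = -1, gap: int = -2) -> int:
    """Best Smith-Waterman local alignment score (linear gap penalty)."""
    best = 0
    prev = [0] * (len(b) + 1)          # previous row of the score matrix
    for ca in a:
        curr = [0]                     # first column is always 0
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


def human_similarity(seq: str, human_ref: str, match: int = 2) -> float:
    """Local score divided by the best score the shorter sequence could reach.

    Assumption: this normalisation is chosen for the example only.
    """
    seq = seq.replace("-", "").upper()            # drop alignment gaps
    human_ref = human_ref.replace("-", "").upper()
    shortest = min(len(seq), len(human_ref))
    if shortest == 0:
        return 0.0
    return local_alignment_score(seq, human_ref, match=match) / (match * shortest)
```

A sequence would then be flagged when `human_similarity(seq, human_ref)` reaches the configured threshold (0.95 by default, per the arguments table).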

Installation

Prerequisites

Python 3.6 or higher
BioPython
NumPy
typing (for type hints; part of the standard library since Python 3.5)

```bash
pip install biopython numpy typing
```

Installation Steps

1. Clone this repository:

```bash
git clone https://github.com/bge-barcoding/fasta-cleaner.git
cd fasta-cleaner
```

2. Install dependencies:

```bash
pip install biopython numpy typing
```

Usage

Basic Usage

```bash
python fasta_cleaner_combined.py -i input_dir -o output_dir
```

Advanced Usage

```bash
python fasta_cleaner_combined.py \
    -i input_dir \
    -o output_dir \
    -r reference_dir \
    --human_threshold 0.95 \
    --at_difference 0.1 \
    --at_mode absolute \
    --percentile_threshold 90.0 \
    --consensus_threshold 0.5
```

Command Line Arguments

| Argument | Description | Default |
|----------|-------------|---------|
| -i, --input_dir | Directory containing input FASTA files | Required |
| -o, --output_dir | Output directory for processed files | Required |
| -r, --reference_dir | Directory containing reference sequences | Optional |
| -u, --human_threshold | Human COX1 similarity threshold (0-1) | 0.95 |
| -d, --at_difference | Maximum allowed AT content difference | 0.1 |
| -m, --at_mode | AT content filtering mode (absolute/higher/lower) | absolute |
| -p, --percentile_threshold | Percentile for outlier detection (0-100) | 90.0 |
| -c, --consensus_threshold | Consensus sequence generation threshold | 0.5 |
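
To make the consensus threshold more concrete, here is a minimal sketch of one common way such a threshold is applied: a column gets its most frequent residue only if that residue reaches the threshold fraction of the non-gap residues, and an ambiguity character otherwise. This rule, the function name `consensus`, and the use of `N` for unresolved columns are assumptions for illustration, not a description of the tool's code.

```python
from collections import Counter
from typing import List


def consensus(alignment: List[str], threshold: float = 0.5,
              ambiguous: str = "N") -> str:
    """Column-wise majority consensus over equal-length aligned sequences."""
    if not alignment:
        return ""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")

    out = []
    for column in zip(*alignment):
        # Gaps and N are treated as missing data and ignored (assumption).
        residues = [c.upper() for c in column if c not in "-Nn"]
        if not residues:
            out.append("-")            # nothing observed in this column
            continue
        residue, count = Counter(residues).most_common(1)[0]
        out.append(residue if count / len(residues) >= threshold else ambiguous)
    return "".join(out)
```

With the default of 0.5, a column split evenly between four residues would be reported as ambiguous, while a column where three of four sequences agree would take the majority residue.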

AT Content Filtering Modes

The tool supports three modes for AT content filtering:

- absolute: Removes sequences if AT content differs from consensus by more than threshold in either direction
- higher: Removes only sequences with AT content above consensus + threshold (i.e. AT is too high)
- lower: Removes only sequences with AT content below consensus - threshold (i.e. AT is too low)
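
The three modes can be sketched as follows. The comparison is restricted to positions where both the sequence and the consensus carry an unambiguous base, which is one reading of "overlapping regions"; the function names are hypothetical and the code is an illustration rather than the tool's implementation.

```python
def overlap_at_difference(seq: str, consensus: str) -> float:
    """AT content of seq minus AT content of consensus, over shared positions."""
    pairs = [
        (s, c)
        for s, c in zip(seq.upper(), consensus.upper())
        if s in "ACGT" and c in "ACGT"     # skip gaps and ambiguous bases
    ]
    if not pairs:
        return 0.0
    seq_at = sum(s in "AT" for s, _ in pairs) / len(pairs)
    cons_at = sum(c in "AT" for _, c in pairs) / len(pairs)
    return seq_at - cons_at


def fails_at_filter(difference: float, threshold: float = 0.1,
                    mode: str = "absolute") -> bool:
    """Return True if a sequence with this AT difference would be removed."""
    if mode == "absolute":
        return abs(difference) > threshold
    if mode == "higher":
        return difference > threshold      # AT is too high
    if mode == "lower":
        return difference < -threshold     # AT is too low
    raise ValueError(f"unknown AT mode: {mode!r}")
```

For example, a sequence whose AT content is 0.2 below the consensus fails in `absolute` and `lower` mode but passes in `higher` mode at the default threshold of 0.1.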

Filter Control Flags

| Flag | Description |
|------|-------------|
| --disable_human | Disable human COX1 similarity filtering |
| --disable_at | Disable AT content difference filtering |
| --disable_outliers | Disable statistical outlier detection |
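
For readers wrapping the tool in their own scripts, the documented arguments and flags could be declared with `argparse` roughly as below. This mirrors the two tables above and is not copied from `fasta_cleaner_combined.py`; the helper name `build_parser` and the help strings are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Parser mirroring the documented options (illustrative only)."""
    parser = argparse.ArgumentParser(description="Clean FASTA alignments")
    parser.add_argument("-i", "--input_dir", required=True)
    parser.add_argument("-o", "--output_dir", required=True)
    parser.add_argument("-r", "--reference_dir", default=None)
    parser.add_argument("-u", "--human_threshold", type=float, default=0.95)
    parser.add_argument("-d", "--at_difference", type=float, default=0.1)
    parser.add_argument("-m", "--at_mode", default="absolute",
                        choices=["absolute", "higher", "lower"])
    parser.add_argument("-p", "--percentile_threshold", type=float, default=90.0)
    parser.add_argument("-c", "--consensus_threshold", type=float, default=0.5)
    # Each flag switches one filter off; all filters are on by default.
    parser.add_argument("--disable_human", action="store_true")
    parser.add_argument("--disable_at", action="store_true")
    parser.add_argument("--disable_outliers", action="store_true")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(["-i", "in", "-o", "out", "--disable_at"])
    print(args.at_mode, args.disable_at)
```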

Output Files

For each input FASTA file, the tool generates:

*_cleaned.fasta: Sequences that passed all filters, ordered by start position
*_removed_all.fasta: All removed sequences combined into one file
*_removed_human.fasta: Sequences removed due to human similarity
*_removed_at.fasta: Sequences removed due to AT content
*_removed_outlier.fasta: Sequences removed as statistical outliers
*_removed_reference.fasta: Sequences removed as reference-based outliers
*_consensus.fasta: Final consensus sequence
*_metrics.csv: Detailed metrics for all sequences
*_log.txt: Processing log with parameters and statistics
*_ordered_annotated.fasta: All original sequences with fate annotations, ordered by start position
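
Two of the outputs above are ordered by start position. A minimal sketch of such an ordering, assuming "start position" means the index of the first non-gap character in the aligned sequence (the README does not define it, and the function names are hypothetical):

```python
from typing import Dict, List, Tuple


def start_position(aligned_seq: str, gap_chars: str = "-") -> int:
    """Index of the first non-gap character, or the length if all gaps."""
    for index, char in enumerate(aligned_seq):
        if char not in gap_chars:
            return index
    return len(aligned_seq)


def order_by_start(records: Dict[str, str]) -> List[Tuple[str, str]]:
    """Sort (name, sequence) pairs by start position, then by name for ties."""
    return sorted(
        records.items(),
        key=lambda item: (start_position(item[1]), item[0]),
    )
```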

Metrics and Analysis

The tool calculates comprehensive metrics for each sequence:

Sequence length and composition
AT content and deviation from consensus
Human COX1 similarity scores using local alignment
Position-specific conservation scores
Weighted and unweighted deviation measures
Conservation-based statistical scores
Reference-based metrics (if enabled)
Gap handling and position-specific frequencies

All metrics are saved in the CSV report for further analysis.
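
The sketch below shows one way position-specific frequencies, conservation weighting, and a percentile cut-off can fit together. The exact formulas are assumptions: deviation at a column is taken as one minus the frequency of the sequence's residue, conservation as the frequency of the most common residue, and sequences scoring above the percentile are treated as outliers. It uses only the standard library so it runs without NumPy.

```python
from collections import Counter
from typing import Dict, List, Tuple

GAP_CHARS = "-N"   # treated as missing data (assumption)


def column_frequencies(alignment: List[str]) -> List[Dict[str, float]]:
    """Per-column residue frequencies, ignoring gaps and N."""
    freqs = []
    for column in zip(*alignment):
        residues = [c for c in column if c not in GAP_CHARS]
        total = len(residues)
        counts = Counter(residues)
        freqs.append({r: n / total for r, n in counts.items()} if total else {})
    return freqs


def deviation_scores(seq: str, freqs: List[Dict[str, float]]) -> Tuple[float, float]:
    """Return (unweighted, conservation-weighted) mean deviation for one sequence."""
    unweighted = weighted = weight_sum = 0.0
    compared = 0
    for residue, col in zip(seq, freqs):
        if residue in GAP_CHARS or not col:
            continue                         # skip gaps and empty columns
        conservation = max(col.values())     # frequency of the commonest residue
        deviation = 1.0 - col.get(residue, 0.0)
        unweighted += deviation
        weighted += conservation * deviation
        weight_sum += conservation
        compared += 1
    if compared == 0:
        return 0.0, 0.0
    return unweighted / compared, weighted / weight_sum


def percentile(values: List[float], q: float) -> float:
    """Linear-interpolation percentile, q in the range 0-100."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("percentile of an empty list")
    rank = (len(ordered) - 1) * q / 100.0
    low = int(rank)
    high = min(low + 1, len(ordered) - 1)
    return ordered[low] + (ordered[high] - ordered[low]) * (rank - low)


def find_outliers(alignment: Dict[str, str], q: float = 90.0) -> List[str]:
    """Names whose weighted deviation lies above the q-th percentile."""
    freqs = column_frequencies(list(alignment.values()))
    scores = {name: deviation_scores(seq, freqs)[1] for name, seq in alignment.items()}
    cutoff = percentile(list(scores.values()), q)
    return [name for name, score in scores.items() if score > cutoff]
```

Weighting by conservation means a mismatch at a highly conserved column counts for more than a mismatch at a variable one, which is the usual motivation for reporting both scores.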

Sequence Processing Pipeline

The filtering pipeline processes sequences in this specific order:

1. Remove sequences with high human COX1 similarity
2. Filter sequences with divergent AT content
3. Remove statistical outliers
4. Compare against reference sequences (if provided)

After each filtering step:

- A new consensus sequence is generated from the remaining sequences
- New position-specific frequencies are calculated
- New metrics are computed for all remaining sequences
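
The control flow described above, with the consensus rebuilt after every stage, might look like the following sketch. `run_pipeline`, `majority_consensus`, and `max_mismatches` are hypothetical names, and the toy filters stand in for the real human, AT, outlier, and reference checks.

```python
from collections import Counter
from typing import Callable, Dict, List, Tuple

Alignment = Dict[str, str]                        # name -> aligned sequence
Filter = Callable[[Alignment, str], List[str]]    # returns names to remove


def majority_consensus(alignment: Alignment) -> str:
    """Toy consensus: most common character in each column."""
    if not alignment:
        return ""
    return "".join(
        Counter(col).most_common(1)[0][0] for col in zip(*alignment.values())
    )


def max_mismatches(limit: int) -> Filter:
    """Toy filter: flag sequences with more than `limit` mismatches."""
    def find_failures(alignment: Alignment, consensus: str) -> List[str]:
        return [
            name for name, seq in alignment.items()
            if sum(a != b for a, b in zip(seq, consensus)) > limit
        ]
    return find_failures


def run_pipeline(alignment: Alignment, stages: List[Tuple[str, Filter]],
                 make_consensus: Callable[[Alignment], str] = majority_consensus):
    """Apply stages in order, recomputing the consensus after each one."""
    remaining = dict(alignment)
    removed: Dict[str, str] = {}                  # name -> stage that removed it
    consensus = make_consensus(remaining)
    for stage_name, find_failures in stages:
        for name in find_failures(remaining, consensus):
            if name in remaining:
                removed[name] = stage_name
                del remaining[name]
        consensus = make_consensus(remaining)     # refresh before the next stage
    return remaining, removed, consensus
```

Because each stage sees a consensus built only from the survivors of earlier stages, a contaminant removed early no longer influences how later filters judge the remaining sequences.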

Example

```bash
# Process a directory of FASTA files with custom thresholds and the "lower" AT mode
python fasta_cleaner_combined.py \
    -i /path/to/fasta/files \
    -o /path/to/output \
    -r /path/to/references \
    --human_threshold 0.90 \
    --at_difference 0.15 \
    --at_mode lower \
    --percentile_threshold 95.0
```

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Citation

If you use this tool in your research, please cite:

```bibtex
@software{fasta_cleaner,
  author = {Ben Price AND Daniel Parsons AND Jordan Beasley AND Claude Sonnet},
  title = {FASTA Sequence Cleaner},
  version = {1.0.0},
  year = {2024},
  url = {https://github.com/bge-barcoding/fasta-cleaner},
  note = {Implements multiple sequence filtering approaches with position-specific analysis}
}
```

Acknowledgments

Uses BioPython for sequence analysis
Implements methods inspired by various sequence quality control approaches
Developed to address common contamination and quality issues in sequence data

Support

For bugs, feature requests, or questions, please open an issue on GitHub.

Owner

Name: BGE barcoding
Login: bge-barcoding
Kind: organization

Website: https://biodiversitygenomics.eu/
Twitter: BioGenEurope
Repositories: 1
Profile: https://github.com/bge-barcoding

Biodiversity Genomics Europe (BGE) - (meta)barcoding software artifacts

GitHub Events

Total

Issues event: 10
Issue comment event: 3
Member event: 1
Push event: 22
Pull request event: 10
Fork event: 2

Last Year

Issues event: 10
Issue comment event: 3
Member event: 1
Push event: 22
Pull request event: 10
Fork event: 2

Issues and Pull Requests

Last synced: 10 months ago

All Time

Total issues: 6
Total pull requests: 7
Average time to close issues: 23 days
Average time to close pull requests: 15 days
Total issue authors: 1
Total pull request authors: 1
Average comments per issue: 0.5
Average comments per pull request: 0.0
Merged pull requests: 1
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 6
Pull requests: 7
Average time to close issues: 23 days
Average time to close pull requests: 15 days
Issue authors: 1
Pull request authors: 1
Average comments per issue: 0.5
Average comments per pull request: 0.0
Merged pull requests: 1
Bot issues: 0
Bot pull requests: 0