extracttaxafromfastadatabase

This script extracts sequences from a FASTA file that belong to a specific taxon and writes them to a new file.

https://github.com/rebentisu/extracttaxafromfastadatabase

Science Score: 67.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 5 DOI reference(s) in README
  • Academic publication links
    Links to: zenodo.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (10.9%) to scientific vocabulary
Last synced: 9 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: Rebentisu
  • License: mit
  • Language: Python
  • Default Branch: main
  • Size: 44.9 KB
Statistics
  • Stars: 0
  • Watchers: 1
  • Forks: 0
  • Open Issues: 0
  • Releases: 2
Created over 2 years ago · Last pushed over 1 year ago
Metadata Files
Readme License Citation

README.md

Extract Specific Taxa Sequences

This Python script extracts the sequences in a FASTA file that belong to a specific taxon and writes them to a new file. It also generates a summary file containing the count and names of the extracted sequences. The input file, output file, summary file, and target taxon are passed as command-line arguments, and the script reports helpful error messages when arguments are incorrect or other errors occur.
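The behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the script's actual code: the function names `extract_taxon` and `write_summary` are hypothetical, the summary layout is invented for the example, and plain case-sensitive substring matching on the header line is an assumption about how the taxon is detected.

```python
import io


def extract_taxon(fasta_in, fasta_out, target_taxon):
    """Copy every record whose header contains target_taxon.

    Returns the list of matched header names (without the leading '>').
    Assumption: a record matches when the taxon string occurs anywhere
    in its header line (case-sensitive substring test).
    """
    matched = []
    keep = False  # True while we are inside a matching record
    for line in fasta_in:
        if line.startswith(">"):
            keep = target_taxon in line
            if keep:
                matched.append(line[1:].strip())
        if keep:
            fasta_out.write(line)
    return matched


def write_summary(summary_out, target_taxon, matched):
    """Write the count and names of the extracted sequences (hypothetical layout)."""
    summary_out.write(f"Taxon: {target_taxon}\n")
    summary_out.write(f"Sequences extracted: {len(matched)}\n")
    for name in matched:
        summary_out.write(name + "\n")


# Small in-memory demonstration with made-up records.
demo_in = io.StringIO(
    ">s1 Bacteria;Pseudomonas\nACGT\nTTAA\n"
    ">s2 Bacteria;Escherichia\nGGCC\n"
    ">s3 Bacteria;Pseudomonas\nAAAA\n"
)
demo_out = io.StringIO()
demo_summary = io.StringIO()
demo_matched = extract_taxon(demo_in, demo_out, "Pseudomonas")
write_summary(demo_summary, "Pseudomonas", demo_matched)
```

Reading the input line by line keeps memory use flat, which matters for databases the size of SILVA. With real files, the `io.StringIO` objects would be replaced by handles from `open(...)`.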

Purpose and Aim of the Script:

The primary aim of this script is to extract sequences belonging to a specific taxon from large FASTA databases, a crucial step in preparing data for phylogenetic placement.

Accurate phylogenetic analyses require well-curated reference sequence databases. Manual curation ensures that only relevant, known sequences are selected, which yields higher-quality results for a specific project. The process is often labor-intensive, because sequences must be hand-picked from reference databases such as SILVA, NCBI, Greengenes, or RDP. Despite the effort required, this approach often produces the highest-quality curated datasets, which are critical for building reliable reference sets (RSs) tailored to the needs of a given project.

This script automates part of the curation process by filtering sequences on their taxonomic assignments, reducing the manual burden while maintaining the integrity of the reference data for downstream phylogenetic analyses. For further information, refer to Czech et al. (2022).

Usage

```sh
python extract_specific_taxa_sequences.py -f <input_fasta_file> -o <output_fasta_file> -s <summary_file> -t <target_taxon>
```

Example

To extract sequences for the genus Pseudomonas from fasta_db.fasta and write them to output_Pseudomonas.fasta, with a summary in summary.txt, run:

```sh
python extract_specific_taxa_sequences.py -f fasta_db.fasta -o output_Pseudomonas.fasta -s summary.txt -t Pseudomonas
```

Note: The fasta_db.fasta file could be a FASTA database containing 16S rRNA full-length sequences from widely used resources such as SILVA or Greengenes.
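When working with databases like these, it helps to know how the taxon is matched against the header. The README does not say whether the script uses a substring or an exact-rank comparison, so the sketch below only illustrates the difference. It assumes a SILVA-style header of the form `>ID rank1;rank2;...`; the accession IDs and the helper name `header_has_taxon` are made up for the example.

```python
def header_has_taxon(header, target_taxon):
    """Return True if target_taxon equals one whole rank of the lineage.

    Assumption: the header looks like '>ID rank1;rank2;...;rankN',
    i.e. a sequence ID, whitespace, then a ';'-delimited lineage.
    """
    # Drop '>' and split the sequence ID from the lineage string.
    parts = header.lstrip(">").strip().split(None, 1)
    lineage = parts[1] if len(parts) > 1 else ""
    ranks = [rank.strip() for rank in lineage.split(";")]
    return target_taxon in ranks


# Hypothetical headers (IDs invented for illustration).
genus_hit = (">XX000001.1.1500 Bacteria;Proteobacteria;Gammaproteobacteria;"
             "Vibrionales;Vibrionaceae;Vibrio;Vibrio cholerae")
family_only = (">XX000002.1.1480 Bacteria;Proteobacteria;Gammaproteobacteria;"
               "Vibrionales;Vibrionaceae;Photobacterium;Photobacterium sp.")

# A plain substring test accepts both headers, because "Vibrio" is a
# prefix of "Vibrionaceae"; the rank-aware test accepts only the first.
substring_hits = ["Vibrio" in genus_hit, "Vibrio" in family_only]
rank_hits = [header_has_taxon(genus_hit, "Vibrio"),
             header_has_taxon(family_only, "Vibrio")]
```

If the extracted file contains unexpected genera, checking the summary file for headers where the target name appears only as part of a longer rank name is a quick way to spot this kind of over-matching.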

Usage instructions

```sh
optional arguments:
  -h, --help            show this help message and exit
  -f INPUT_FILE, --input_file INPUT_FILE
                        Path to the input FASTA file containing sequences with taxonomic information in headers.
  -o OUTPUT_FILE, --output_file OUTPUT_FILE
                        Path to the output file where sequences of the specified taxon will be written.
  -s SUMMARY_FILE, --summary_file SUMMARY_FILE
                        Path to the summary file where the count and names of the sequences will be written.
  -t TARGET_TAXON, --target_taxon TARGET_TAXON
                        The taxon to search for in the headers of the input sequences.
```
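For readers who want to reproduce this interface in their own tooling, the help text above corresponds to an `argparse` setup along the following lines. The flag names and help strings are taken from the listing; the function name `build_parser`, the description string, and marking every option as required are assumptions, not details confirmed by the source.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented options (illustrative sketch)."""
    parser = argparse.ArgumentParser(
        description="Extract sequences of a specific taxon from a FASTA file."
    )
    # required=True is an assumption; the original script may instead
    # validate missing arguments with its own error messages.
    parser.add_argument(
        "-f", "--input_file", required=True,
        help="Path to the input FASTA file containing sequences with "
             "taxonomic information in headers.",
    )
    parser.add_argument(
        "-o", "--output_file", required=True,
        help="Path to the output file where sequences of the specified "
             "taxon will be written.",
    )
    parser.add_argument(
        "-s", "--summary_file", required=True,
        help="Path to the summary file where the count and names of the "
             "sequences will be written.",
    )
    parser.add_argument(
        "-t", "--target_taxon", required=True,
        help="The taxon to search for in the headers of the input sequences.",
    )
    return parser


# Parse the arguments from the README example without touching sys.argv.
example_args = build_parser().parse_args(
    ["-f", "fasta_db.fasta", "-o", "output_Pseudomonas.fasta",
     "-s", "summary.txt", "-t", "Pseudomonas"]
)
```

Passing an explicit list to `parse_args` makes the parser easy to exercise in tests; a command-line entry point would call `parse_args()` with no arguments instead.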

References

Czech, L., Stamatakis, A., Dunthorn, M., & Barbera, P. (2022). Metagenomic analysis using phylogenetic placement: a review of the first decade. Frontiers in Bioinformatics, 2, 871393.

Script created by Georgios Leventis

Owner

  • Name: Rebentisu
  • Login: Rebentisu
  • Kind: user

Citation (CITATION.cff)

cff-version: 1.2.0
title: Python Script for Extracting Taxon-Specific Sequences from FASTA Files with Summary Report Generation
message: >-
  If you use this software, please cite it using the
  metadata from this file.
type: software
authors:
  - given-names: 'Georgios'
    family-names: Leventis
    orcid: 'https://orcid.org/0000-0002-2062-6292'
    affiliation: Agricultural University of Athens
identifiers:
  - type: doi
    value: 10.5281/zenodo.13874732
repository-code: 'https://github.com/Rebentisu/ExtractTaxafromFastaDatabase'
abstract: >-
  This script is a Python program that extracts sequences
  from a FASTA file that belong to a specific taxon and
  writes them to a new file. It also generates a summary
  file containing the count and names of the sequences that
  were extracted. The script accepts command-line arguments
  for the input file, output file, summary file, and target
  taxon. The script also includes error handling to provide
  helpful messages in case of incorrect arguments or other
  errors.
keywords:
  - FASTA file
  - Sequence extraction
  - Bioinformatics
  - Taxon filtering
  - Sequence processing
  - Python script
  - Sequence identification
  - Taxon-specific extraction
license: MIT
version: 1.1.3
date-released: '2024-10-01'

GitHub Events

Total
  • Push event: 2
Last Year
  • Push event: 2