datafishing

DataFishing is a Python tool that automates searches in genomic databases for biodiversity research. It's faster and more efficient than R packages, streamlining the retrieval of DNA sequences, common names, synonyms, conservation status, and species occurrence data.

https://github.com/luanrabelo/datafishing

Science Score: 67.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 6 DOI reference(s) in README
  • Academic publication links
    Links to: ncbi.nlm.nih.gov
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (16.4%) to scientific vocabulary
Last synced: 9 months ago

Repository


Basic Info
Statistics
  • Stars: 4
  • Watchers: 1
  • Forks: 2
  • Open Issues: 0
  • Releases: 4
Created over 2 years ago · Last pushed 11 months ago
Metadata Files
Readme Funding License Code of conduct Citation

README.md


Contents Overview


System Overview

:rocket: Go to Contents Overview


dataFishing is an efficient Python tool and user-friendly web form for mining mitochondrial and chloroplast sequences and biodiversity data. It facilitates and automates access to six databases: NCBI GenBank, BOLD Systems, GBIF, WoRMS, IUCN Red List, and Eschmeyer's Catalog of Fishes. dataFishing is faster and more efficient than other tools for obtaining taxonomic information from these databases. It also retrieves DNA sequences, common names, synonyms, conservation status, and occurrence points of species. The dataFishing repository, hosted on GitHub and licensed under MIT, is a freely accessible resource for the scientific community.

Key Features

🌍 Multiple Database Support: Access 6 major biodiversity databases
🧬 Sequence Download: Automated download of mitochondrial and chloroplast sequences
📊 Performance Benchmarking: Built-in performance analysis and visualization
Asynchronous Processing: High-speed concurrent API requests
📋 Comprehensive Results: Excel, CSV, and TSV output formats
🔧 Easy Configuration: Simple command-line interface with helpful documentation


How to cite dataFishing


When referencing the dataFishing tool, please cite it appropriately in your academic or professional work:

Rabelo, L., Sodré, D., Balcázar, O. D. A., do Rosário, M. F., Guimarães-Costa, A. J., Gomes, G., Sampaio, I., & Vallinoto, M. (2025). dataFishing: An efficient Python tool and user-friendly web-form for mining mitochondrial and chloroplast sequences, taxonomic, and biodiversity data. Ecological Informatics, 85, 102970. https://doi.org/10.1016/j.ecoinf.2024.102970


License

dataFishing is released under the MIT License. This license permits reuse within proprietary software provided that all copies of the licensed software include a copy of the MIT License terms and the copyright notice.

For more details, please see the MIT License.


The Hitchhiker's Guide to dataFishing

Change Log

  • Version 1.6.1 (2025-01-30)

    • Added asynchronous processing with aiohttp for improved performance
    • Implemented comprehensive IUCN Red List data extraction
    • Added performance benchmarking and visualization
    • Enhanced command-line interface with better argument descriptions
    • Added API key configuration system
    • Improved error handling and logging
    • Added support for Eschmeyer's Catalog of Fishes
  • Version 1.0.1 (2024-10-15)

    • Added the ability to download sequence data from BOLD System and/or GenBank
    • Added the ability to obtain data of Threats from the IUCN database
  • Version 1.0.0 (2024-10-01)

    • Initial release of dataFishing

Getting Started


Prerequisites

Before you run dataFishing, make sure you have the following prerequisites installed:

Python Environment

  • Python version 3.8 or higher
  • pip (Python package installer)
  • conda (optional but recommended)

System Requirements

  • Internet connection for API access
  • Minimum 4GB RAM (8GB recommended for large datasets)
  • 1GB free disk space for results and sequences

Installation


Option 1: Install from PyPI (Recommended)

```bash
pip install dataFishing
```

Option 2: Install from Source

```bash
git clone https://github.com/luanrabelo/dataFishing.git
cd dataFishing
pip install -r requirements.txt
pip install -e .
```

Option 3: Using Conda Environment

```bash
conda create -n dataFishing python=3.11
conda activate dataFishing
pip install dataFishing
```

API Keys Configuration


Some databases require API keys for access. Create an apikeys.env file in your working directory:

```bash
# Create apikeys.env file
touch apikeys.env
```

Add your API keys to the file:

```env
# NCBI Configuration (Required for NCBI database)
NCBIEMAIL=your-email@university.edu
NCBIAPI_KEY=your-ncbi-api-key-here

# IUCN Configuration (Required for IUCN database)
IUCNAPIKEY=your-iucn-api-token-here
```

How to Obtain API Keys:

NCBI GenBank:
  1. Register at: https://account.ncbi.nlm.nih.gov/signup/
  2. Email is required; an API key is optional but increases rate limits
  3. Get API key at: https://www.ncbi.nlm.nih.gov/account/settings/

IUCN Red List:
  1. Request token at: https://api.iucnredlist.org/
  2. Academic use is usually free
  3. Commercial use requires subscription

Other databases (GBIF, WoRMS, BOLD, Eschmeyer) do not require API keys.
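To illustrate the file format, the sketch below parses `KEY=value` text such as the contents of apikeys.env using only the Python standard library and reports which keys are missing. It is not how dataFishing itself reads the file. The helper names `parse_env`, `load_env_file`, and `missing_keys` are hypothetical, and the key names in the usage note are copied as they appear in the example above.

```python
from pathlib import Path


def parse_env(text):
    """Parse KEY=value lines, skipping blank lines and # comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip("'\"")
    return values


def load_env_file(path):
    """Read a file such as apikeys.env and parse it (hypothetical helper)."""
    return parse_env(Path(path).read_text(encoding="utf-8"))


def missing_keys(values, required):
    """Return the required keys that are absent or empty."""
    return [key for key in required if not values.get(key)]
```

For example, `missing_keys(load_env_file("apikeys.env"), ["NCBIEMAIL", "IUCNAPIKEY"])` would list whichever of those two entries still needs a value before running with `--ncbi` or `--iucn`.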

Usage


Basic syntax:

```bash
dataFishing --input SPECIES_FILE --output RESULTS_DIR [OPTIONS]
```

Command Line Arguments


📁 Input and Output Arguments

  • --input, -i PATH (required): Path to species list file (.txt or .tsv)
  • --output, -o PATH (required): Output directory for results

🌍 Biodiversity Databases Arguments

  • --all: Query all available databases
  • --iucn: Query IUCN Red List (requires API key)
  • --ncbi: Query NCBI GenBank (requires email)
  • --bold: Query BOLD Systems
  • --gbif: Query GBIF
  • --worms: Query WoRMS
  • --eschmeyer: Query Eschmeyer's Catalog

🧬 NCBI GenBank Arguments

  • --email, -e EMAIL: Email address for NCBI access (required for NCBI)
  • --ncbi-api-key KEY: NCBI API key for higher rate limits

⬇️ Sequence Download Arguments

  • --download-sequences: Enable sequence download
  • --genes-list FILE: File containing gene names (one per line)

📊 Performance and Logging Arguments

  • --benchmark: Enable performance benchmarking
  • --plot-benchmark TSV_FILE: Generate plots from benchmark data
  • --verbose, -v: Enable detailed logging
  • --log-file: Save logs to files

🔧 API Configuration Arguments

  • --max-concurrent N: Maximum concurrent requests
  • --rate-limit SECONDS: Delay between requests
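Options like these usually correspond to a familiar pattern in asynchronous clients: a semaphore caps how many requests are in flight, and a short pause spaces them out. The sketch below shows that pattern with `asyncio` alone. It is an assumption about what such options typically mean, not dataFishing's actual implementation, and `fake_request`, `run_all`, `max_concurrent`, and `pause` are hypothetical names.

```python
import asyncio


async def fake_request(name):
    """Stand-in for one API call; a real client would send an HTTP request here."""
    await asyncio.sleep(0.001)
    return f"result for {name}"


async def run_all(names, max_concurrent=5, pause=0.0):
    """Query every name while keeping at most max_concurrent calls in flight."""
    semaphore = asyncio.Semaphore(max_concurrent)
    in_flight = 0
    peak = 0  # highest number of simultaneous calls observed

    async def worker(name):
        nonlocal in_flight, peak
        async with semaphore:
            in_flight += 1
            peak = max(peak, in_flight)
            try:
                result = await fake_request(name)
                # Hold the slot briefly so the next call starts after a pause.
                await asyncio.sleep(pause)
                return result
            finally:
                in_flight -= 1

    # gather returns results in the same order as the input names.
    results = await asyncio.gather(*(worker(n) for n in names))
    return results, peak
```

Calling `asyncio.run(run_all(species, max_concurrent=10, pause=1.5))` would mirror the shape of `--max-concurrent 10 --rate-limit 1.5`, under the assumptions stated above.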

Examples


Basic Usage - All Databases

```bash
dataFishing --input species.txt --output results/ --all --email your@email.com
```

Specific Databases Only

```bash
dataFishing --input species.txt --output results/ --iucn --worms --gbif --verbose
```

Download Sequences from NCBI

```bash
dataFishing --input species.txt --output results/ --ncbi \
  --email your@email.com --download-sequences --genes-list genes.txt
```

Enable Performance Benchmarking

```bash
dataFishing --input species.txt --output results/ --all \
  --email your@email.com --benchmark --verbose
```

Generate Plots from Existing Benchmark

```bash
dataFishing --plot-benchmark results/benchmark_results.tsv
```

Input File Formats


Text File (.txt)

```
Panthera tigris
Canis lupus
Ursus americanus
Ailuropoda melanoleuca
```
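Before a long run it can help to clean a species list like the one above. The sketch below normalises whitespace, drops blank lines and duplicates, and flags names that do not look like a binomial. Whether dataFishing performs any of these checks itself is not stated here, so treat this as an optional pre-processing step; `parse_species_list` and `looks_like_binomial` are hypothetical helper names.

```python
def parse_species_list(text):
    """Return unique, whitespace-normalised names, one per non-empty line."""
    seen = set()
    names = []
    for line in text.splitlines():
        name = " ".join(line.split())  # collapse runs of spaces and tabs
        if name and name not in seen:
            seen.add(name)
            names.append(name)
    return names


def looks_like_binomial(name):
    """Loose check: capitalised genus followed by a lowercase epithet."""
    parts = name.split()
    return len(parts) >= 2 and parts[0][:1].isupper() and parts[1].islower()
```

Names that fail `looks_like_binomial` are worth checking by hand, since a misspelled or miscapitalised name is one plausible cause of the "No species found" message.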

TSV File from BOLD Systems

Download TSV data from BOLD Systems:
  1. Search for your taxonomic group
  2. Click "Combined: TSV" to download

Gene List File Example

```
COI
COII
COIII
ND5
CYTB
Control Region
16S
12S
```

Supported Genes


| Category | Mitochondrial Genes | Chloroplast Genes |
|----------|---------------------|-------------------|
| rRNA | 12S, 16S | - |
| Complex I | ND1, ND2, ND3, ND4, ND4L, ND5, ND6 | - |
| Complex III | CYTB | - |
| Complex IV | COI, COII, COIII | - |
| Complex V | ATP6, ATP8 | - |
| Control Region | Control Region | - |
| ATP Synthase | - | atpA, atpB, atpE, atpF, atpH, atpI |
| Cytochrome | - | petA, petB, petD, petE, petG, petL, petN |
| RNA Polymerase | - | rpoA, rpoB, rpoC1, rpoC2 |
| Ribosome (Large) | - | rpl2, rpl14, rpl16, rpl20, rpl22, rpl23, rpl32, rpl33, rpl36 |
| NADH-dehydrogenase | - | ndhA, ndhB, ndhC, ndhD, ndhE, ndhF, ndhG, ndhH, ndhI, ndhJ, ndhK |
| PhotoSystem | - | psaA, psaB, psaC, psaI, psaJ, psaM, psbA, psbB, psbC, psbD, psbE, psbF, psbH, psbI, psbJ, psbK, psbL, psbM, psbN, psbZ |
| Ribosome (Small) | - | rps2, rps3, rps4, rps7, rps8, rps11, rps12, rps14, rps15, rps16, rps18, rps19 |
| Rubisco | - | rbcL |
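A gene list file can be checked against the table above before starting a download. The sketch below copies the gene names from the table into two sets and splits a list of entries into recognised and unrecognised names. Matching here is exact and case-sensitive, which is an assumption; how dataFishing itself matches gene names is not stated. `check_genes` is a hypothetical helper.

```python
# Gene names copied from the Supported Genes table.
MITOCHONDRIAL = {
    "12S", "16S", "ND1", "ND2", "ND3", "ND4", "ND4L", "ND5", "ND6",
    "CYTB", "COI", "COII", "COIII", "ATP6", "ATP8", "Control Region",
}

CHLOROPLAST = set(
    "atpA atpB atpE atpF atpH atpI "
    "petA petB petD petE petG petL petN "
    "rpoA rpoB rpoC1 rpoC2 "
    "rpl2 rpl14 rpl16 rpl20 rpl22 rpl23 rpl32 rpl33 rpl36 "
    "ndhA ndhB ndhC ndhD ndhE ndhF ndhG ndhH ndhI ndhJ ndhK "
    "psaA psaB psaC psaI psaJ psaM "
    "psbA psbB psbC psbD psbE psbF psbH psbI psbJ psbK psbL psbM psbN psbZ "
    "rps2 rps3 rps4 rps7 rps8 rps11 rps12 rps14 rps15 rps16 rps18 rps19 "
    "rbcL".split()
)

SUPPORTED = MITOCHONDRIAL | CHLOROPLAST


def check_genes(lines):
    """Split gene entries (one per line) into (known, unknown) lists."""
    known, unknown = [], []
    for line in lines:
        gene = line.strip()
        if not gene:
            continue  # ignore blank lines
        (known if gene in SUPPORTED else unknown).append(gene)
    return known, unknown
```

Passing the lines of genes.txt to `check_genes` and inspecting the second list gives a quick view of entries that do not appear in the table.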

Database Information


| Database | Data Retrieved | Requirements |
|----------|----------------|--------------|
| IUCN Red List | Conservation status, threats, habitats, population trends | API token |
| NCBI GenBank | Genetic sequences, taxonomy | Email (+ API key optional) |
| BOLD Systems | DNA barcodes, specimen records, taxonomy | None |
| GBIF | Occurrence records, taxonomy, synonyms | None |
| WoRMS | Marine taxonomy, habitat preferences | None |
| Eschmeyer's Catalog | Fish taxonomy, nomenclature, synonyms | None |

Output Files


dataFishing generates several output files:

Main Results

  • dataFishing_results.xlsx: Formatted Excel file with all data
  • Individual CSV/TSV files for each database
  • Consolidated summary files

Sequence Files (if downloaded)

  • FASTA files organized by species and gene
  • Separate folders for NCBI and BOLD sequences

Performance Files (if benchmarking enabled)

  • benchmark_results.tsv: Raw performance data
  • benchmark_performance_analysis.png: Performance plots
  • benchmark_summary.tsv: Summary statistics

Log Files (if enabled)

  • IUCNLog.txt, NCBILog.txt, etc.

IUCN Example

| Tax ID | Kingdom | Phylum | Class | Order | Family | Genus | Species | Red List Category | Population Trend | Main Common Name |
|:----------:|:-----------:|:----------:|:---------:|:---------:|:----------:|:---------:|:-----------:|:---------------------:|:--------------------:|:--------------------:|
| 15955 | Animalia | Chordata | Mammalia | Carnivora | Felidae | Panthera | Panthera tigris | Endangered | Decreasing | Tiger (eng) |
| 22823 | Animalia | Chordata | Mammalia | Carnivora | Canidae | Canis | Canis lupus | Least Concern | Stable | Gray Wolf (eng) |
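Results shaped like the IUCN example can be summarised with a few lines of standard-library Python. The sketch below counts rows per Red List category in tab-separated text. It assumes the TSV output uses the same column headers as the example table, which is not confirmed here; `count_by_category` is a hypothetical helper.

```python
import csv
import io
from collections import Counter


def count_by_category(tsv_text, column="Red List Category"):
    """Count rows per value of one column in tab-separated text."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return Counter(row[column] for row in reader)
```

To use it on a real file, read the file contents first, for example with `pathlib.Path(path).read_text(encoding="utf-8")`, and pass the text to `count_by_category`.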

WoRMS Example

| AphiaID | Kingdom | Phylum | Class | Order | Family | Genus | Species | Status | Authority |
|:-----------:|:-----------:|:----------:|:---------:|:---------:|:----------:|:---------:|:-----------:|:----------:|:-------------:|
| 137205 | Animalia | Chordata | Actinopteri | Carangiformes | Carangidae | Caranx | Caranx hippos | accepted | (Linnaeus, 1766) |

Performance Optimization


For Large Datasets (>1000 species):

```bash
dataFishing --input large_species.txt --output results/ --all \
  --max-concurrent 10 --rate-limit 1.5 --benchmark
```

For Slow Connections:

```bash
dataFishing --input species.txt --output results/ --all \
  --max-concurrent 5 --rate-limit 2.0
```

For Fast Processing:

```bash
dataFishing --input species.txt --output results/ --all \
  --max-concurrent 50 --rate-limit 0.5
```

Troubleshooting


Common Issues:

"API token required"
  • Ensure apikeys.env file is in your working directory
  • Check API key format and validity

"Connection timeout"
  • Increase --rate-limit value
  • Decrease --max-concurrent value
  • Check internet connection

"No species found"
  • Verify input file format
  • Check species name spelling
  • Ensure one species per line in text files

"Permission denied"
  • Check output directory permissions
  • Ensure disk space is available

Advanced Usage


Custom Configuration File

Create a configuration script for repeated usage:

```bash
#!/bin/bash
# datafishing_config.sh

EMAIL="your@email.com"
INPUTDIR="/path/to/species/files"
OUTPUTDIR="/path/to/results"
GENES_LIST="/path/to/genes.txt"

dataFishing --input "$INPUTDIR/species.txt" \
  --output "$OUTPUTDIR" \
  --all \
  --email "$EMAIL" \
  --download-sequences \
  --genes-list "$GENES_LIST" \
  --benchmark \
  --verbose \
  --log-file
```

Batch Processing Multiple Files

```bash
#!/bin/bash
# Process multiple species files

for file in species*.txt; do
  echo "Processing $file..."
  # Quote "$file" so file names containing spaces are handled correctly
  dataFishing --input "$file" \
    --output "results$(basename "$file" .txt)" \
    --all \
    --email your@email.com \
    --verbose
done
```
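The same batch loop can be driven from Python, which is convenient when runs need to be logged or retried. The sketch below only builds the argument list for one run, using flags documented above; it does not execute anything. `build_command` is a hypothetical helper, and writing each run to a subdirectory named after the input file is an illustrative choice rather than dataFishing behaviour.

```python
from pathlib import Path


def build_command(species_file, output_root, email):
    """Build the argument list for one dataFishing run (not executed here)."""
    species_file = Path(species_file)
    # One output directory per input file, named after the file stem.
    output_dir = Path(output_root) / species_file.stem
    return [
        "dataFishing",
        "--input", str(species_file),
        "--output", str(output_dir),
        "--all",
        "--email", email,
        "--verbose",
    ]
```

Each command can then be passed to `subprocess.run(cmd, check=True)` inside a loop over `sorted(Path(".").glob("species*.txt"))`.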


dataFishing Development Team

  • Luan Rabelo (Lead Developer)
  • Clayton Sodré
  • Oscar Balcázar
  • Murilo Furtado
  • Aurycéia Guimarães-Costa
  • Iracilda Sampaio
  • Marcelo Vallinoto

Contact


For reporting bugs, requesting assistance, or providing feedback, please reach out to:

Primary Contact: Luan Rabelo
  • Email: luanrabelo@outlook.com
  • GitHub: @luanrabelo

Issues and Bug Reports:
  • GitHub Issues: https://github.com/luanrabelo/dataFishing/issues

Documentation and Wiki:
  • GitHub Wiki: https://github.com/luanrabelo/dataFishing/wiki


⭐ If you find dataFishing useful, please consider giving it a star on GitHub!



Owner

  • Name: Luan Rabelo
  • Login: luanrabelo
  • Kind: user
  • Location: Bragança, Pará, Brasil
  • Company: Universidade Federal do Pará

I am a PhD candidate in Environmental Biology with a focus on Bioinformatics. I develop and use scripts to understand evolutionary processes and biodiversity.

Citation (CITATION.cff)

cff-version: 1.2.0
message: If you use this software, please cite it using the metadata below.
title: 'dataFishing: An efficient Python tool and user-friendly web-form for mining mitochondrial and chloroplast sequences, taxonomic, and biodiversity data'
authors:
- given-names: Luan
  family-names: Rabelo
  orcid: https://orcid.org/0000-0002-1223-8943
  affiliation: Universidade Federal do Pará (UFPA) / Instituto Tecnológico Vale (ITV)
- given-names: Davidson
  family-names: Sodré
  affiliation: Universidade Federal Rural da Amazônia (UFRA)
- given-names: Oscar David Albito
  family-names: Balcázar
  affiliation: Universidade Federal do Pará (UFPA)
- given-names: Murilo Furtado do
  family-names: Rosário
  affiliation: Universidade Federal do Pará (UFPA)
- given-names: Aurycéia Jaquelyne
  family-names: Guimarães-Costa
  affiliation: AFYA, Faculdade de Ciências Médicas
- given-names: Grazielle
  family-names: Gomes
  affiliation: Universidade Federal do Pará (UFPA)
- given-names: Iracilda
  family-names: Sampaio
  affiliation: Universidade Federal do Pará (UFPA)
- given-names: Marcelo
  family-names: Vallinoto
  orcid: https://orcid.org/0000-0002-3465-3830
  affiliation: Universidade Federal do Pará (UFPA)
date-released: '2024-12-22'
doi: 10.1016/j.ecoinf.2024.102970
url: https://github.com/luanrabelo/dataFishing
repository-code: https://github.com/luanrabelo/dataFishing
license: MIT
abstract: dataFishing is a Python-based tool and web form designed to facilitate the
  mining, retrieval, and processing of organelle DNA sequences and biodiversity information
  from public databases such as GenBank, BOLD Systems, GBIF, IUCN, and WoRMS. It supports
  both command-line and web-based usage, enabling users with varying levels of technical
  expertise to access taxonomic, conservation, and occurrence data efficiently. The
  tool outputs results in Excel, CSV, and TSV formats and is optimized for performance
  and usability.
type: software
keywords:
- biodiversity
- taxonomy
- genetic data
- bioinformatics
- data mining
- Python
- conservation
- DNA sequences
- open science
contact:
- name: Luan Rabelo
  email: luanrabelo@outlook.com
- name: Marcelo Vallinoto
  email: mvallino@ufpa.br
preferred-citation:
  type: article
  title: 'dataFishing: An efficient Python tool and user-friendly web-form for mining mitochondrial and chloroplast sequences, taxonomic, and biodiversity data'
  authors:
  - Luan Rabelo
  - Davidson Sodré
  - Oscar David Albito Balcázar
  - Murilo Furtado do Rosário
  - Aurycéia Jaquelyne Guimarães-Costa
  - Grazielle Gomes
  - Iracilda Sampaio
  - Marcelo Vallinoto
  journal: Ecological Informatics
  volume: '85'
  pages: '102970'
  year: '2025'
  doi: 10.1016/j.ecoinf.2024.102970

GitHub Events

Total
  • Watch event: 6
  • Push event: 24
  • Fork event: 1
Last Year
  • Watch event: 6
  • Push event: 24
  • Fork event: 1