distributed-downloader

MPI-based distributed downloading tool for retrieving data from diverse domains.

https://github.com/imageomics/distributed-downloader

Science Score: 52.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
    Organization imageomics has institutional domain (imageomics.osu.edu)
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (18.8%) to scientific vocabulary

Keywords

dataset-generation downloader mpi-applications
Last synced: 4 months ago

Repository

MPI-based distributed downloading tool for retrieving data from diverse domains.

Basic Info
  • Host: GitHub
  • Owner: Imageomics
  • License: MIT
  • Language: Python
  • Default Branch: main
  • Homepage:
  • Size: 998 KB
Statistics
  • Stars: 3
  • Watchers: 9
  • Forks: 0
  • Open Issues: 4
  • Releases: 2
Topics
dataset-generation downloader mpi-applications
Created over 1 year ago · Last pushed 7 months ago
Metadata Files
Readme License Citation

README.md

Distributed Downloader

A high-performance, MPI-based distributed downloading tool for retrieving large-scale image datasets from diverse web sources.

Overview

The Distributed Downloader was initially developed to download all images from the monthly GBIF occurrence snapshot, which contains approximately 200 million images distributed across 545 servers. The tool is general-purpose, however, and can efficiently process any collection of URLs.

Why Build This Tool?

We chose to develop this custom solution instead of using existing tools like img2dataset for several key reasons:

  • Server-friendly operation: Implements sophisticated rate limiting to avoid overloading source servers
  • Enhanced control: Provides fine-grained control over dataset construction and metadata management
  • Scalability: Handles massive datasets that exceed the capabilities of single-machine solutions
  • Fault tolerance: Robust checkpoint and recovery system for long-running downloads
  • Flexibility: Supports diverse output formats and custom processing pipelines

Installation

Prerequisites

  • Python 3.10 or 3.11
  • MPI implementation (OpenMPI or Intel MPI)
  • High-performance computing environment with Slurm (recommended)

The downloader can be installed directly with pip. Alternatively, clone the repo and install with Conda. Either way, installing inside a virtual environment is recommended.

Conda Installation

  1. Install Miniconda
  2. Create the environment:

```bash
conda env create -f environment.yaml --solver=libmamba -y
```

  3. Activate the environment:

```bash
conda activate distributed-downloader
```

  4. Install the package:

```bash
pip install -e .[dev]
```

Pip Installation (Recommended)

  1. Install the package:

```bash
# For general use
pip install distributed-downloader
```

```bash
# For development
pip install -e .[dev]
```

Script Configuration

After installation, create the necessary Slurm scripts for your environment. See the Scripts Documentation for detailed instructions.

Configuration

The downloader uses YAML configuration files to specify all operational parameters:

```yaml
# Core paths
path_to_input: "/path/to/input/urls.csv"
path_to_output: "/path/to/output"

# Output structure
output_structure:
  urls_folder: "urls"
  logs_folder: "logs"
  images_folder: "images"
  schedules_folder: "schedules"
  profiles_table: "profiles.csv"
  ignored_table: "ignored.csv"
  inner_checkpoint_file: "checkpoint.json"
  tools_folder: "tools"

# Downloader parameters
downloader_parameters:
  num_downloads: 1
  max_nodes: 20
  workers_per_node: 20
  cpu_per_worker: 1
  header: true
  image_size: 224
  logger_level: "INFO"
  batch_size: 10000
  rate_multiplier: 0.5
  default_rate_limit: 3

# Tools parameters
tools_parameters:
  num_workers: 1
  max_nodes: 10
  workers_per_node: 20
  cpu_per_worker: 1
  threshold_size: 10000
  new_resize_size: 224
```
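
The package consumes this file directly, but it can be useful to sanity-check a new configuration before submitting a large job. A minimal sketch, assuming PyYAML is installed and the top-level key names shown above; this helper is not part of the package:

```python
import yaml

REQUIRED_KEYS = [
    "path_to_input",
    "path_to_output",
    "output_structure",
    "downloader_parameters",
    "tools_parameters",
]

with open("/path/to/config.yaml") as f:
    config = yaml.safe_load(f)

# Fail early if a required section is missing
missing = [key for key in REQUIRED_KEYS if key not in config]
if missing:
    raise KeyError(f"config is missing required keys: {missing}")

print("input URLs: ", config["path_to_input"])
print("output root:", config["path_to_output"])
```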

Usage

Primary Downloading Interface

Python API

```python
from distributed_downloader import download_images

# Start or continue downloading process
download_images("/path/to/config.yaml")
```

Command-Line Interface

```bash
# Continue from current state
distributed_downloader /path/to/config.yaml

# Reset and restart from initialization
distributed_downloader /path/to/config.yaml --reset_batched

# Restart from profiling step
distributed_downloader /path/to/config.yaml --reset_profiled
```

CLI Options:

  • No flags: Resume from current checkpoint
  • --reset_batched: Restart completely, including file initialization and partitioning
  • --reset_profiled: Keep partitioned files but redo server profiling

Tools Pipeline

After completing the download process, use the tools pipeline to perform post-processing operations on downloaded images.

Python API

```python
from distributed_downloader import apply_tools

# Apply a specific tool
apply_tools("/path/to/config.yaml", "resize")
```

Command-Line Interface

```bash
# Continue tool pipeline from current state
distributed_downloader_tools /path/to/config.yaml resize

# Reset pipeline stages
distributed_downloader_tools /path/to/config.yaml resize --reset_filtering
distributed_downloader_tools /path/to/config.yaml resize --reset_scheduling
distributed_downloader_tools /path/to/config.yaml resize --reset_runners

# Use custom tools not in registry
distributed_downloader_tools /path/to/config.yaml my_custom_tool --tool_name_override
```

CLI Options:

  • No flags: Continue from current tool state
  • --reset_filtering: Restart entire tool pipeline
  • --reset_scheduling: Keep filtered data, redo scheduling
  • --reset_runners: Keep scheduling, restart runner jobs
  • --tool_name_override: Allow unregistered custom tools

Available Tools

The following built-in tools are available; a sketch chaining them through the Python API follows the list:

  • resize: Resizes images to specified dimensions
  • image_verification: Validates image integrity and identifies corruption
  • duplication_based: Removes duplicate images using MD5 hash comparison
  • size_based: Filters out images below specified size thresholds
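
As referenced above, the built-in tools can be chained through the Python API using the documented apply_tools entry point. A sketch; the order appropriate for a given dataset may differ:

```python
from distributed_downloader import apply_tools

CONFIG = "/path/to/config.yaml"

# Verify images first, drop small and duplicate files, then resize.
# Each call continues from the tool pipeline's current state.
for tool in ("image_verification", "size_based", "duplication_based", "resize"):
    apply_tools(CONFIG, tool)
```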

Custom Tool Development

Create custom tools by implementing three pipeline stages:

```python
from distributed_downloader.tools import (
    FilterRegister,
    SchedulerRegister,
    RunnerRegister,
    PythonFilterToolBase,
    MPIRunnerTool,
    DefaultScheduler,
)


@FilterRegister("my_custom_tool")
class MyCustomToolFilter(PythonFilterToolBase):
    def run(self):
        # Filter implementation
        pass


@SchedulerRegister("my_custom_tool")
class MyCustomToolScheduler(DefaultScheduler):
    def run(self):
        # Scheduling implementation
        pass


@RunnerRegister("my_custom_tool")
class MyCustomToolRunner(MPIRunnerTool):
    def run(self):
        # Processing implementation
        pass
```
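
Once all three stages are registered under the same name, the custom tool can presumably be invoked like a built-in one ("my_custom_tool" here matches the registration name used above):

```python
from distributed_downloader import apply_tools

# Runs the filter, scheduler, and runner registered as "my_custom_tool"
apply_tools("/path/to/config.yaml", "my_custom_tool")
```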

Data Format and Storage

Input Requirements

Input files must be tab-delimited or CSV files containing URLs, with the following required columns (a sketch that produces a minimal valid file follows the column lists):

  • uuid: Unique internal identifier
  • identifier: Image URL
  • source_id: Source-specific identifier

Optional columns:

  • license: License URL
  • source: Source attribution
  • title: Image title
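
For illustration, a minimal valid input file can be written with the standard library alone; the column names come from this README, while the URLs and source IDs below are placeholders:

```python
import csv
import uuid

rows = [
    {"uuid": str(uuid.uuid4()),
     "identifier": "https://example.org/image1.jpg",
     "source_id": "occ-0001"},
    {"uuid": str(uuid.uuid4()),
     "identifier": "https://example.org/image2.jpg",
     "source_id": "occ-0002"},
]

# The header row matches the downloader's `header: true` setting
with open("urls.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["uuid", "identifier", "source_id"])
    writer.writeheader()
    writer.writerows(rows)
```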

Output Structure

Downloaded data is stored in the configured images_folder, partitioned by server name and partition ID:

Success Records (successes.parquet)

  • uuid: Dataset internal identifier
  • source_id: Source-provided identifier
  • identifier: Original image URL
  • is_license_full: Boolean indicating complete license information
  • license, source, title: Attribution information
  • hashsum_original, hashsum_resized: MD5 hashes
  • original_size, resized_size: Image dimensions
  • image: Binary image data

Error Records (errors.parquet)

  • uuid: Dataset internal identifier
  • identifier: Failed image URL
  • retry_count: Number of download attempts
  • error_code: HTTP or internal error code
  • error_msg: Detailed error description
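
Both record types are plain Parquet files, so they can be inspected with standard tooling. A sketch assuming pandas with a Parquet engine such as pyarrow; the path is illustrative, since files are partitioned by server name and partition ID:

```python
import pandas as pd

errors = pd.read_parquet(
    "/path/to/output/images/<server>/<partition>/errors.parquet"
)

# Most frequent failure modes (e.g. HTTP 429 from rate limiting)
print(errors["error_code"].value_counts().head())

# URLs that still failed after multiple attempts
print(errors.loc[errors["retry_count"] > 1, "identifier"].head())
```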

Supported Image Formats

The downloader supports common web image formats:

  • JPEG/JPG
  • PNG
  • GIF (first frame extraction)
  • BMP
  • TIFF

Logging and Monitoring

Logging Levels

  • INFO: Essential information including batch progress and errors
  • DEBUG: Detailed information including individual download events

Log Organization

Logs are organized hierarchically by:

  • Pipeline stage (initialization, profiling, downloading)
  • Batch number and iteration
  • Worker process ID

See Structure Documentation for detailed log organization.

Performance and Troubleshooting

Common Performance Issues

  1. Rate limiting errors (429, 403):
  • Reduce default_rate_limit in configuration
  • Increase rate_multiplier for longer delays (see the sketch after this list)
  2. Memory constraints:
  • Reduce batch_size or workers_per_node
  • Monitor system memory usage
  3. Network timeouts:
  • Check connectivity to source servers
  • Review firewall and proxy settings
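
For the rate-limiting case, the relevant knobs can be adjusted in place. A hypothetical helper, assuming PyYAML and the parameter names from the configuration example above; the package itself does not ship this script:

```python
import yaml

CONFIG = "/path/to/config.yaml"

with open(CONFIG) as f:
    config = yaml.safe_load(f)

params = config["downloader_parameters"]
params["default_rate_limit"] = max(1, params["default_rate_limit"] // 2)  # fewer requests per server
params["rate_multiplier"] *= 2  # longer delays between requests

with open(CONFIG, "w") as f:
    yaml.safe_dump(config, f, sort_keys=False)
```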

Error Recovery

The system automatically resumes from checkpoints. For manual intervention:

  • Review error distributions in parquet files
  • Check server-specific error patterns
  • Use ignored server list for problematic hosts

See Troubleshooting Guide for comprehensive error resolution.

Environment Variables

The system exports numerous environment variables for script coordination:

General Parameters:

  • CONFIG_PATH, PATH_TO_INPUT, PATH_TO_OUTPUT
  • OUTPUT_*_FOLDER variables for each output component

Downloader-Specific:

  • DOWNLOADER_MAX_NODES, DOWNLOADER_WORKERS_PER_NODE
  • DOWNLOADER_BATCH_SIZE, DOWNLOADER_RATE_MULTIPLIER

Tools-Specific:

  • TOOLS_MAX_NODES, TOOLS_WORKERS_PER_NODE
  • TOOLS_THRESHOLD_SIZE, TOOLS_NEW_RESIZE_SIZE
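
Custom wrapper scripts can read these variables directly. A minimal sketch using the documented names; the fallback values are only illustrative:

```python
import os

max_nodes = int(os.environ.get("DOWNLOADER_MAX_NODES", "20"))
workers = int(os.environ.get("DOWNLOADER_WORKERS_PER_NODE", "20"))
batch_size = int(os.environ.get("DOWNLOADER_BATCH_SIZE", "10000"))

print(f"requesting {max_nodes} nodes x {workers} workers, "
      f"batches of {batch_size} URLs")
```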

System Requirements

Minimum Requirements

  • Multi-node HPC cluster with Slurm
  • High-bandwidth network connectivity
  • Substantial storage capacity for downloaded datasets
  • MPI-capable compute environment

Recommended Configuration

  • 20+ compute nodes with 20+ cores each
  • High-speed interconnect (InfiniBand recommended)
  • Parallel file system (Lustre, GPFS)
  • Dedicated network bandwidth for external downloads

License

This project is licensed under the MIT License. See the LICENSE file for details.

Contributing

We welcome contributions! Please see our contributing guidelines and ensure all tests pass before submitting pull requests.

Owner

  • Name: Imageomics Institute
  • Login: Imageomics
  • Kind: organization

Citation (CITATION.cff)

# This CITATION.cff file was generated with cffinit.
# Visit https://bit.ly/cffinit to generate yours today!

cff-version: 1.2.0
title: Distributed-downloader
message: >-
  If you use this software, please cite it using the
  metadata from this file.
type: software
authors:
  - given-names: Andrei
    family-names: Kopanev
    email: kopanev.1@buckeyemail.osu.edu
    affiliation: Imageomics Institute
  - given-names: 'Matthew J'
    family-names: Thompson
    affiliation: Imageomics Institute
    email: thompson.4509@osu.edu
    orcid: 'https://orcid.org/0000-0003-0583-8585'
  - given-names: 'Elizabeth G'
    family-names: Campolongo
    email: campolongo.4@osu.edu
    affiliation: Imageomics Institute
    orcid: 'https://orcid.org/0000-0003-0846-2413'
repository-code: 'https://github.com/Imageomics/distributed-downloader'
identifiers:
  - description: "The GitHub release URL of tag v0.2.0-beta."
    type: url
    value: "https://github.com/Imageomics/distributed-downloader/releases/tag/v0.2.0-beta"
  - description: "The GitHub URL of the commit tagged with v0.2.0-beta."
    type: url
    value: "https://github.com/Imageomics/distributed-downloader/tree/<commit-hash>" # update post release
abstract: >-
  MPI-based distributed downloading tool for retrieving data
  from diverse domains.
keywords:
  - parallel
  - distributed
  - download
  - url
  - "dataset generation"
  - "MPI application"
license: MIT
version: 0.2.0-beta
date-released: '2025-04-17'

GitHub Events

Total (all activity falls within the last year)
  • Create event: 7
  • Issues event: 10
  • Release event: 4
  • Delete event: 13
  • Issue comment event: 11
  • Push event: 34
  • Pull request review comment event: 34
  • Pull request event: 9
  • Pull request review event: 30

Issues and Pull Requests

Last synced: 4 months ago

All Time
  • Total issues: 8
  • Total pull requests: 8
  • Average time to close issues: 8 months
  • Average time to close pull requests: 3 months
  • Total issue authors: 2
  • Total pull request authors: 2
  • Average comments per issue: 0.25
  • Average comments per pull request: 0.38
  • Merged pull requests: 4
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 6
  • Pull requests: 6
  • Average time to close issues: 2 months
  • Average time to close pull requests: 5 days
  • Issue authors: 2
  • Pull request authors: 2
  • Average comments per issue: 0.0
  • Average comments per pull request: 0.17
  • Merged pull requests: 3
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • egrace479 (10)
  • tms2003 (1)
  • Andrey170170 (1)
Pull Request Authors
  • Andrey170170 (10)
  • dependabot[bot] (9)
  • egrace479 (3)
Top Labels
Issue Labels
documentation (4) structure (4) question (2) bug (2) dependencies (1)
Pull Request Labels
dependencies (10) bug (1) documentation (1) enhancement (1) structure (1)

Packages

  • Total packages: 1
  • Total downloads: 13 last month (PyPI)
  • Total dependent packages: 0
  • Total dependent repositories: 0
  • Total versions: 1
  • Total maintainers: 1
pypi.org: distributed-downloader

MPI-based tool for downloading images from a list of URLs in parallel.

  • Versions: 1
  • Dependent Packages: 0
  • Dependent Repositories: 0
  • Downloads: 13 last month
Rankings
Dependent packages count: 9.3%
Stargazers count: 30.1%
Average: 30.8%
Forks count: 31.5%
Dependent repos count: 52.3%
Maintainers (1)
Last synced: 5 months ago