distributed-downloader
MPI-based distributed downloading tool for retrieving data from diverse domains.
Science Score: 52.0%
This score indicates how likely this project is to be science-related, based on the following indicators:
- ✓ CITATION.cff file: found
- ✓ codemeta.json file: found
- ✓ .zenodo.json file: found
- ○ DOI references
- ○ Academic publication links
- ○ Academic email domains
- ✓ Institutional organization owner: Organization imageomics has institutional domain (imageomics.osu.edu)
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (18.8%) to scientific vocabulary
Repository
MPI-based distributed downloading tool for retrieving data from diverse domains.
Basic Info
Statistics
- Stars: 3
- Watchers: 9
- Forks: 0
- Open Issues: 4
- Releases: 2
Metadata Files
README.md
Distributed Downloader
A high-performance, MPI-based distributed downloading tool for retrieving large-scale image datasets from diverse web sources.
Overview
The Distributed Downloader was initially developed to handle the massive scale of downloading all images from the monthly GBIF occurrence snapshot, which contains approximately 200 million images distributed across 545 servers. The tool is designed with general-purpose capabilities and can efficiently process any collection of URLs.
Why Build This Tool?
We chose to develop this custom solution instead of using existing tools like img2dataset for several key reasons:
- Server-friendly operation: Implements sophisticated rate limiting to avoid overloading source servers
- Enhanced control: Provides fine-grained control over dataset construction and metadata management
- Scalability: Handles massive datasets that exceed the capabilities of single-machine solutions
- Fault tolerance: Robust checkpoint and recovery system for long-running downloads
- Flexibility: Supports diverse output formats and custom processing pipelines
Installation
Prerequisites
- Python 3.10 or 3.11
- MPI implementation (OpenMPI or Intel MPI)
- High-performance computing environment with Slurm (recommended)
The downloader can be installed directly with pip. Alternatively, clone the repo and install with conda. Either way, installing within a virtual environment is recommended.
Conda Installation
1. Install Miniconda
2. Create the environment:
```bash
conda env create -f environment.yaml --solver=libmamba -y
```
3. Activate the environment:
```bash
conda activate distributed-downloader
```
4. Install the package:
```bash
pip install -e .[dev]
```
Pip Installation (Recommended)
Install the package:
```bash
# For general use
pip install distributed-downloader
```
```bash
# For development
pip install -e .[dev]
```
Script Configuration
After installation, create the necessary Slurm scripts for your environment. See the Scripts Documentation for detailed instructions.
Configuration
The downloader uses YAML configuration files to specify all operational parameters:
```yaml
# Core paths
path_to_input: "/path/to/input/urls.csv"
path_to_output: "/path/to/output"

# Output structure
output_structure:
  urls_folder: "urls"
  logs_folder: "logs"
  images_folder: "images"
  schedules_folder: "schedules"
  profiles_table: "profiles.csv"
  ignored_table: "ignored.csv"
  inner_checkpoint_file: "checkpoint.json"
  tools_folder: "tools"

# Downloader parameters
downloader_parameters:
  num_downloads: 1
  max_nodes: 20
  workers_per_node: 20
  cpu_per_worker: 1
  header: true
  image_size: 224
  logger_level: "INFO"
  batch_size: 10000
  rate_multiplier: 0.5
  default_rate_limit: 3

# Tools parameters
tools_parameters:
  num_workers: 1
  max_nodes: 10
  workers_per_node: 20
  cpu_per_worker: 1
  threshold_size: 10000
  new_resize_size: 224
```
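Before submitting a long-running job, it can help to sanity-check the configuration file. A minimal sketch, assuming PyYAML is installed; the required-key list here is illustrative and not the tool's own validation:
```python
# Pre-flight check (sketch): load the YAML config and confirm the
# top-level keys from the example above are present.
import yaml

REQUIRED_KEYS = ("path_to_input", "path_to_output",
                 "output_structure", "downloader_parameters")

with open("/path/to/config.yaml") as f:
    config = yaml.safe_load(f)

missing = [key for key in REQUIRED_KEYS if key not in config]
if missing:
    raise KeyError(f"config is missing required keys: {missing}")

print("default_rate_limit:", config["downloader_parameters"]["default_rate_limit"])
```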
Usage
Primary Downloading Interface
Python API
```python
from distributed_downloader import download_images

# Start or continue the downloading process
download_images("/path/to/config.yaml")
```
Command-Line Interface
```bash
# Continue from current state
distributed_downloader /path/to/config.yaml

# Reset and restart from initialization
distributed_downloader /path/to/config.yaml --reset_batched

# Restart from profiling step
distributed_downloader /path/to/config.yaml --reset_profiled
```
CLI Options:
- No flags: Resume from current checkpoint
- --reset_batched: Restart completely, including file initialization and partitioning
- --reset_profiled: Keep partitioned files but redo server profiling
Tools Pipeline
After completing the download process, use the tools pipeline to perform post-processing operations on downloaded images.
Python API
```python
from distributed_downloader import apply_tools

# Apply a specific tool
apply_tools("/path/to/config.yaml", "resize")
```
Command-Line Interface
```bash
# Continue tool pipeline from current state
distributed_downloader_tools /path/to/config.yaml resize

# Reset pipeline stages
distributed_downloader_tools /path/to/config.yaml resize --reset_filtering
distributed_downloader_tools /path/to/config.yaml resize --reset_scheduling
distributed_downloader_tools /path/to/config.yaml resize --reset_runners

# Use custom tools not in registry
distributed_downloader_tools /path/to/config.yaml my_custom_tool --tool_name_override
```
CLI Options:
- No flags: Continue from current tool state
- --reset_filtering: Restart entire tool pipeline
- --reset_scheduling: Keep filtered data, redo scheduling
- --reset_runners: Keep scheduling, restart runner jobs
- --tool_name_override: Allow unregistered custom tools
Available Tools
The following built-in tools are available:
- resize: Resizes images to specified dimensions
- image_verification: Validates image integrity and identifies corruption
- duplication_based: Removes duplicate images using MD5 hash comparison
- size_based: Filters out images below specified size thresholds
Custom Tool Development
Create custom tools by implementing three pipeline stages:
```python
from distributed_downloader.tools import (
    FilterRegister,
    SchedulerRegister,
    RunnerRegister,
    PythonFilterToolBase,
    MPIRunnerTool,
    DefaultScheduler,
)


@FilterRegister("my_custom_tool")
class MyCustomToolFilter(PythonFilterToolBase):
    def run(self):
        # Filter implementation
        pass


@SchedulerRegister("my_custom_tool")
class MyCustomToolScheduler(DefaultScheduler):
    def run(self):
        # Scheduling implementation
        pass


@RunnerRegister("my_custom_tool")
class MyCustomToolRunner(MPIRunnerTool):
    def run(self):
        # Processing implementation
        pass
```
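Once all three stages are registered under the same tool name, the tool can be invoked through the CLI shown above; because it is not in the built-in registry, pass --tool_name_override.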
Data Format and Storage
Input Requirements
Input files must be in tab-delimited or CSV format and contain URLs with the following required columns:
- uuid: Unique internal identifier
- identifier: Image URL
- source_id: Source-specific identifier
Optional columns:
- license: License URL
- source: Source attribution
- title: Image title
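As an illustration, a short sketch that builds a valid input file with pandas; the URL, identifiers, and attribution values below are hypothetical:
```python
# Hypothetical example rows: uuid, identifier, and source_id are required;
# license, source, and title are optional attribution columns.
import uuid

import pandas as pd

rows = [{
    "uuid": str(uuid.uuid4()),
    "identifier": "https://example.org/images/specimen_001.jpg",
    "source_id": "occ-0001",
    "license": "https://creativecommons.org/licenses/by/4.0/",
    "source": "Example Provider",
    "title": "Specimen 001",
}]

# Write a tab-delimited file; plain CSV works as well.
pd.DataFrame(rows).to_csv("urls.csv", sep="\t", index=False)
```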
Output Structure
Downloaded data is stored in the configured images_folder, partitioned by server name and partition ID:
Success Records (successes.parquet)
- uuid: Dataset internal identifier
- source_id: Source-provided identifier
- identifier: Original image URL
- is_license_full: Boolean indicating complete license information
- license, source, title: Attribution information
- hashsum_original, hashsum_resized: MD5 hashes
- original_size, resized_size: Image dimensions
- image: Binary image data
Error Records (errors.parquet)
- uuid: Dataset internal identifier
- identifier: Failed image URL
- retry_count: Number of download attempts
- error_code: HTTP or internal error code
- error_msg: Detailed error description
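To inspect the output quickly, a sketch that loads one partition's records with pandas (pyarrow assumed); the partition path is a placeholder, since the actual directory names depend on the configured images_folder:
```python
# Sketch: load success and error records from a single partition.
# "server/0" stands in for the actual server-name/partition-ID directories.
import pandas as pd

successes = pd.read_parquet("/path/to/output/images/server/0/successes.parquet")
errors = pd.read_parquet("/path/to/output/images/server/0/errors.parquet")

print(successes[["uuid", "identifier", "original_size", "resized_size"]].head())
print(errors[["identifier", "error_code", "retry_count"]].head())
```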
Supported Image Formats
The downloader supports common web image formats:
- JPEG/JPG
- PNG
- GIF (first frame extraction)
- BMP
- TIFF
Logging and Monitoring
Logging Levels
- INFO: Essential information including batch progress and errors
- DEBUG: Detailed information including individual download events
Log Organization
Logs are organized hierarchically by:
- Pipeline stage (initialization, profiling, downloading)
- Batch number and iteration
- Worker process ID
See Structure Documentation for detailed log organization.
Performance and Troubleshooting
Common Performance Issues
- Rate limiting errors (429, 403):
  - Reduce default_rate_limit in configuration
  - Increase rate_multiplier for longer delays
- Memory constraints:
  - Reduce batch_size or workers_per_node
  - Monitor system memory usage
- Network timeouts:
  - Check connectivity to source servers
  - Review firewall and proxy settings
Error Recovery
The system automatically resumes from checkpoints. For manual intervention:
- Review error distributions in parquet files (a sketch follows this list)
- Check server-specific error patterns
- Use ignored server list for problematic hosts
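A hedged sketch of that first step, aggregating error codes across all partitions with pandas; the recursive glob is an assumption about the output layout:
```python
# Sketch: summarize error distributions to spot rate-limited or failing hosts.
from pathlib import Path

import pandas as pd

paths = sorted(Path("/path/to/output/images").rglob("errors.parquet"))
errors = pd.concat((pd.read_parquet(p) for p in paths), ignore_index=True)

# Frequent 429/403 codes point to rate limiting on specific servers.
print(errors["error_code"].value_counts().head(10))
print(errors.groupby("error_code")["retry_count"].mean())
```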
See Troubleshooting Guide for comprehensive error resolution.
Environment Variables
The system exports numerous environment variables for script coordination:
General Parameters:
- CONFIG_PATH, PATH_TO_INPUT, PATH_TO_OUTPUT
- OUTPUT_*_FOLDER variables for each output component
Downloader-Specific:
- DOWNLOADER_MAX_NODES, DOWNLOADER_WORKERS_PER_NODE
- DOWNLOADER_BATCH_SIZE, DOWNLOADER_RATE_MULTIPLIER
Tools-Specific:
- TOOLS_MAX_NODES, TOOLS_WORKERS_PER_NODE
- TOOLS_THRESHOLD_SIZE, TOOLS_NEW_RESIZE_SIZE
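A minimal sketch of reading a few of these variables from a helper script; the names come from the list above, and the fallback defaults are illustrative, mirroring the sample configuration:
```python
# Sketch: consume coordination variables exported for the Slurm scripts.
import os

config_path = os.environ.get("CONFIG_PATH", "/path/to/config.yaml")
batch_size = int(os.environ.get("DOWNLOADER_BATCH_SIZE", "10000"))
workers_per_node = int(os.environ.get("DOWNLOADER_WORKERS_PER_NODE", "20"))

print(f"config={config_path} batch_size={batch_size} workers={workers_per_node}")
```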
System Requirements
Minimum Requirements
- Multi-node HPC cluster with Slurm
- High-bandwidth network connectivity
- Substantial storage capacity for downloaded datasets
- MPI-capable compute environment
Recommended Configuration
- 20+ compute nodes with 20+ cores each
- High-speed interconnect (InfiniBand recommended)
- Parallel file system (Lustre, GPFS)
- Dedicated network bandwidth for external downloads
License
This project is licensed under the MIT License. See the LICENSE file for details.
Documentation
- Process Overview — High-level workflow description
- Output Structure — Detailed output organization
- Example Output — Example output files for schedule and log generation processes
- Scripts Documentation — Slurm script configuration
- Troubleshooting Guide — Common issues and solutions
Contributing
We welcome contributions! Please see our contributing guidelines and ensure all tests pass before submitting pull requests.
Owner
- Name: Imageomics Institute
- Login: Imageomics
- Kind: organization
- Website: https://imageomics.osu.edu
- Twitter: imageomics
- Repositories: 4
- Profile: https://github.com/Imageomics
Citation (CITATION.cff)
# This CITATION.cff file was generated with cffinit.
# Visit https://bit.ly/cffinit to generate yours today!
cff-version: 1.2.0
title: Distributed-downloader
message: >-
If you use this software, please cite it using the
metadata from this file.
type: software
authors:
- given-names: Andrei
family-names: Kopanev
email: kopanev.1@buckeyemail.osu.edu
affiliation: Imageomics Institute
- given-names: 'Matthew J'
family-names: Thompson
affiliation: Imageomics Institute
email: thompson.4509@osu.edu
orcid: 'https://orcid.org/0000-0003-0583-8585'
- given-names: 'Elizabeth G'
family-names: Campolongo
email: campolongo.4@osu.edu
affiliation: Imageomics Institute
orcid: 'https://orcid.org/0000-0003-0846-2413'
repository-code: 'https://github.com/Imageomics/distributed-downloader'
identifiers:
- description: "The GitHub release URL of tag v0.2.0-beta."
type: url
value: "https://github.com/Imageomics/distributed-downloader/releases/tag/v0.2.0-beta"
- description: "The GitHub URL of the commit tagged with v0.2.0-beta."
type: url
value: "https://github.com/Imageomics/distributed-downloader/tree/<commit-hash>" # update post release
abstract: >-
MPI-based distributed downloading tool for retrieving data
from diverse domains.
keywords:
- parallel
- distributed
- download
- url
- "dataset generation"
- "MPI application"
license: MIT
version: 0.2.0-beta
date-released: '2025-04-17'
GitHub Events
Total
- Create event: 7
- Issues event: 10
- Release event: 4
- Delete event: 13
- Issue comment event: 11
- Push event: 34
- Pull request review comment event: 34
- Pull request event: 9
- Pull request review event: 30
Last Year
- Create event: 7
- Issues event: 10
- Release event: 4
- Delete event: 13
- Issue comment event: 11
- Push event: 34
- Pull request review comment event: 34
- Pull request event: 9
- Pull request review event: 30
Issues and Pull Requests
Last synced: 4 months ago
All Time
- Total issues: 8
- Total pull requests: 8
- Average time to close issues: 8 months
- Average time to close pull requests: 3 months
- Total issue authors: 2
- Total pull request authors: 2
- Average comments per issue: 0.25
- Average comments per pull request: 0.38
- Merged pull requests: 4
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 6
- Pull requests: 6
- Average time to close issues: 2 months
- Average time to close pull requests: 5 days
- Issue authors: 2
- Pull request authors: 2
- Average comments per issue: 0.0
- Average comments per pull request: 0.17
- Merged pull requests: 3
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- egrace479 (10)
- tms2003 (1)
- Andrey170170 (1)
Pull Request Authors
- Andrey170170 (10)
- dependabot[bot] (9)
- egrace479 (3)
Packages
- Total packages: 1
- Total downloads: 13 last month (PyPI)
- Total dependent packages: 0
- Total dependent repositories: 0
- Total versions: 1
- Total maintainers: 1
pypi.org: distributed-downloader
MPI-based tool for downloading images from a list of URLs in parallel.
- Homepage: https://github.com/Imageomics/distributed-downloader
- Documentation: https://distributed-downloader.readthedocs.io/
- License: MIT License
- Latest release: 0.2.0b0 (published 9 months ago)