cautious-robot
Simple images from CSV downloader that runs and records checksums on downloaded image folder.
Statistics
- Stars: 4
- Watchers: 10
- Forks: 0
- Open Issues: 8
- Releases: 4
README.md
cautious-robot 

I am a simple downloader that downloads images from URLs in a CSV and names them by the given column (after ensuring all its values are unique). I can organize your images into subfolders based on any column in your CSV and will warn you if the parent image folder already exists before overwriting it. If you need square images for modeling, I'll create a second directory (organized in the same format) with downsized copies of your images. Patience is a virtue, so I will wait a designated time before re-requesting an image after receiving an error on my retry list; if all retries are expended or I receive another error, I log that for your review and move on. I also keep a log of all successful responses. After download, sum-buddy helps me gather and record checksums for all downloaded images. If the source CSV has a checksum column, I can then do a buddy-check to verify all expected images are downloaded intact. At a minimum, I check the number of expected images matches the number sum-buddy counts.
The Cautious Robot Logo was designed using Canva Magic Media.
Requirements
Python 3.10+
Installation
```bash
pip install cautious-robot
```
How it Works
Cautious-robot checks the provided CSV for the IMG_NAME, URL, and SUBFOLDERS (if provided) columns, then downloads every image that has a value in the IMG_NAME column. Image filenames must be unique; cautious-robot will refuse the request if the selected filename column is not unique within the dataset. It also checks whether the provided OUTPUT folder already exists and asks the user before proceeding. Images that have a filename but no URL are recorded in the error log; the user is prompted whether to ignore or address the missing URLs prior to downloading. Logs are saved in the same directory as the source CSV. Logging appends to an existing JSON file, so a restarted download will not overwrite existing logs with the same name. Please note that if the streamed response is interrupted before an image is downloaded in its entirety, the error may not be recorded in the error log, but the verifier would register the image as missing.
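The pre-download checks described above can be illustrated with a minimal sketch. This is not the package's source: `check_rows` is a hypothetical helper, the column names are only illustrative defaults, and the standard-library `csv` module stands in for whatever CSV handling cautious-robot actually uses.

```python
import csv
import io


def check_rows(rows, img_name_col="filename", url_col="file_url"):
    """Return (downloadable_rows, rows_missing_url); raise if filenames repeat.

    Hypothetical helper for illustration only.
    """
    # Only rows with a filename are candidates for download.
    named = [r for r in rows if (r.get(img_name_col) or "").strip()]

    # Refuse the request when the filename column is not unique.
    names = [r[img_name_col].strip() for r in named]
    duplicates = sorted({n for n in names if names.count(n) > 1})
    if duplicates:
        raise ValueError(f"{img_name_col} values are not unique: {duplicates}")

    # Rows with a filename but no URL would go to the error log.
    missing_url = [r for r in named if not (r.get(url_col) or "").strip()]
    downloadable = [r for r in named if (r.get(url_col) or "").strip()]
    return downloadable, missing_url


# Made-up CSV: one complete row, one without a URL, one without a filename.
sample = io.StringIO(
    "filename,file_url\n"
    "a.jpg,https://example.org/a.jpg\n"
    "b.jpg,\n"
    ",https://example.org/c.jpg\n"
)
ok, missing = check_rows(list(csv.DictReader(sample)))
print(len(ok), len(missing))  # 1 1
```

The row without a filename is skipped silently, while the row without a URL is reported separately so it can be logged or fixed before downloading.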
If desired, a secondary output directory (OUTPUT_downsized) will be created with square copies of the images downsized to the specified size (e.g., 256 x 256). The folder structure of this secondary output directory will match that of the unprocessed images. The time to wait between retries on a failed download, the maximum number of times to retry downloading an image, and the index of the CSV at which to start can also be passed as parameters. Cautious-robot will retry image downloads when receiving one of the following HTTP response status codes: 429, 500, 502, 503, 504.
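As a rough sketch of the retry behavior, the loop below retries only on the status codes listed above and waits between attempts. It is an illustration under assumptions, not the package's implementation: `download_with_retries` and the injected `fetch` callable are hypothetical, and treating `max_retries` as "retries after the first attempt" is a guess.

```python
import time

RETRY_STATUS_CODES = {429, 500, 502, 503, 504}  # codes listed in the text above


def download_with_retries(fetch, url, wait_time=3, max_retries=5, sleep=time.sleep):
    """Call fetch(url) -> (status_code, content) until success or retries run out.

    Returns (content, statuses); content is None if the download failed.
    Hypothetical helper for illustration only.
    """
    statuses = []
    for attempt in range(max_retries + 1):  # first try plus up to max_retries retries
        status, content = fetch(url)
        statuses.append(status)
        if status == 200:
            return content, statuses
        if status not in RETRY_STATUS_CODES:
            break  # e.g. 404: retrying will not help, so record it and move on
        if attempt < max_retries:
            sleep(wait_time)  # be patient before asking again
    return None, statuses


# Stand-in for a real HTTP call: fails twice with 503, then succeeds.
responses = iter([(503, b""), (503, b""), (200, b"image-bytes")])
content, log = download_with_retries(
    lambda url: next(responses),
    "https://example.org/a.jpg",
    sleep=lambda seconds: None,  # skip real waiting in this demo
)
print(log)  # [503, 503, 200]
```

Passing `fetch` and `sleep` in as arguments keeps the sketch runnable without network access; a real downloader would call an HTTP client there instead.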
After downloading the images, cautious-robot calls sum-buddy to calculate and record checksums of the OUTPUT folder contents. It prints the number of images contained in the OUTPUT folder along with the expected number (based on a count of the unique, non-null filenames in the source file). If provided a column with checksums in the source file, it will then further verify that all expected images are downloaded through an inner merge on the checksum and filename columns of the source file with the checksum CSV (thus avoiding confusion in case of duplicate images).
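The verification idea can be sketched with the standard library alone. Note the substitution: instead of the inner merge described above, this sketch compares sets of (filename, checksum) pairs, which flags the same unmatched rows in the simple case. `checksum_dir` and `find_unverified` are hypothetical names, and the files are made-up stand-ins for downloaded images.

```python
import hashlib
import tempfile
from pathlib import Path


def checksum_dir(folder, algorithm="md5"):
    """Return {(filename, hexdigest)} for every file under folder."""
    found = set()
    for path in Path(folder).rglob("*"):
        if path.is_file():
            digest = hashlib.new(algorithm, path.read_bytes()).hexdigest()
            found.add((path.name, digest))
    return found


def find_unverified(expected, folder, algorithm="md5"):
    """expected: iterable of (filename, checksum) pairs from the source CSV.

    Returns the expected pairs with no exact (filename, checksum) match on disk.
    """
    return sorted(set(expected) - checksum_dir(folder, algorithm))


with tempfile.TemporaryDirectory() as tmp:
    # Two fake "images", one of them inside a subfolder.
    (Path(tmp) / "erato").mkdir()
    (Path(tmp) / "erato" / "a.jpg").write_bytes(b"first image")
    (Path(tmp) / "b.jpg").write_bytes(b"second image")
    expected = [
        ("a.jpg", hashlib.md5(b"first image").hexdigest()),
        ("b.jpg", "mismatch"),  # deliberately wrong checksum
    ]
    unverified = find_unverified(expected, tmp)

print(unverified)  # [('b.jpg', 'mismatch')]
```

Matching on the pair rather than on the checksum alone is what keeps two files with identical content but different names from being confused with each other.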
Command Line Usage
```
usage: cautious-robot [-h] -i [INPUT_FILE] -o [OUTPUT_DIR] [-s [SUBDIR_COL]] [-n [IMG_NAME_COL]] [-u [URL_COL]] [-w WAIT_TIME]
                      [-r MAX_RETRIES] [-l SIDE_LENGTH] [-x STARTING_IDX] [-a CHECKSUM_ALGORITHM] [-v [VERIFIER_COL]]

options:
  -h, --help            show this help message and exit

required arguments:
  -i [INPUT_FILE], --input-file [INPUT_FILE]
                        path to CSV file with urls.
  -o [OUTPUT_DIR], --output-dir [OUTPUT_DIR]
                        main directory to download images into.

optional arguments:
  -s [SUBDIR_COL], --subdir-col [SUBDIR_COL]
                        name of column to use for subfolders in image directory (defaults to flat directory if left blank)
  -n [IMG_NAME_COL], --img-name-col [IMG_NAME_COL]
                        column to use for image filename (default: filename)
  -u [URL_COL], --url-col [URL_COL]
                        column with URLs to download (default: file_url)
  -w WAIT_TIME, --wait-time WAIT_TIME
                        seconds to wait between retries for an image (default: 3)
  -r MAX_RETRIES, --max-retries MAX_RETRIES
                        max times to retry download on a single image (default: 5)
  -l SIDE_LENGTH, --side-length SIDE_LENGTH
                        number of pixels per side for resized square images (default: no resized images created)
  -x STARTING_IDX, --starting-idx STARTING_IDX
                        index of CSV at which to start download (default: 0)
  -a CHECKSUM_ALGORITHM, --checksum-algorithm CHECKSUM_ALGORITHM
                        checksum algorithm to use on images (default: md5, available: sha256, sha384, md5-sha1, blake2b, sha512, sha1, sm3,
                        sha3_256, sha512_256, sha224, sha3_224, ripemd160, sha3_384, shake_128, blake2s, md5, sha3_512, sha512_224, shake_256)
  -v [VERIFIER_COL], --verifier-col [VERIFIER_COL]
                        name of column in source CSV with checksums (same hash as -a) to verify download
```
CLI Examples
Sample CSVs [1] are provided in the examples/ directory to test the CLI.
Defaults:

```bash
cautious-robot --input-file examples/HCGSD_testNA.csv --output-dir examples/test_images
```

Output:

```console
100%|██████████████████████████████████████████████████████████████████| 8/8 [00:01<00:00, 4.18it/s]
Images downloaded from examples/HCGSD_testNA.csv to examples/test_images.
Download logs are in examples/HCGSD_testNA_log.jsonl and examples/HCGSD_testNA_error_log.jsonl.
Calculating md5 checksums on examples/test_images: 100%|███████████████████████████████████████████| 16/16 [00:00<00:00, 3133.00it/s]
md5 checksums for examples/test_images written to examples/HCGSD_testNA_checksums.csv
8 images were downloaded to examples/test_images of the 8 expected.
```

```bash
head -n 9 examples/HCGSD_testNA_checksums.csv
```

Output:

```console
filepath,filename,md5
examples/test_images/10429021_V_lowres.jpg,10429021_V_lowres.jpg,c6aeb9d2f6db412ff5be0eb0b5435b83
examples/test_images/10428595_D_lowres.jpg,10428595_D_lowres.jpg,55882a0f3fdf8a68579c07254395653b
examples/test_images/10428972_V_lowres.jpg,10428972_V_lowres.jpg,0047e7454ce444f67fee1c90cc3ba9cb
examples/test_images/10428803_D_lowres.jpg,10428803_D_lowres.jpg,d8bfb73f2d3556390de04aa98822b815
examples/test_images/10428169_V_lowres.jpg,10428169_V_lowres.jpg,042c9dc294d589ce3f140f14ddab0166
examples/test_images/10428321_D_lowres.jpg,10428321_D_lowres.jpg,fbeeed30274e424831b06360b587ceb3
examples/test_images/10428140_V_lowres.jpg,10428140_V_lowres.jpg,c11538f2de5a5e2d6013fc800848d43a
examples/test_images/10428250_V_lowres.jpg,10428250_V_lowres.jpg,14ac99b1a9913a9d420f21b94d6136d6
```

Download Images to Subfolders Based on Column Value:

```bash
cautious-robot -i examples/HCGSD_testNA.csv -o examples/test_images_subdirs --subdir-col Species
```

Output:

```console
100%|██████████████████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:02<00:00, 3.47it/s]
Images downloaded from examples/HCGSD_testNA.csv to examples/test_images_subdirs.
Download logs are in examples/HCGSD_testNA_log.jsonl and examples/HCGSD_testNA_error_log.jsonl.
Calculating md5 checksums on examples/test_images_subdirs: 100%|█████████████████████████████████████████████| 8/8 [00:00<00:00, 3106.60it/s]
md5 checksums for examples/test_images_subdirs written to examples/HCGSD_testNA_checksums.csv
8 images were downloaded to examples/test_images_subdirs of the 8 expected.
```

```bash
ls examples/test_images_subdirs
```

Output:

```console
erato melpomene
```

```bash
head -n 9 examples/HCGSD_testNA_checksums.csv
```

Output:

```console
filepath,filename,md5
examples/test_images_subdirs/erato/10429021_V_lowres.jpg,10429021_V_lowres.jpg,c6aeb9d2f6db412ff5be0eb0b5435b83
examples/test_images_subdirs/erato/10428595_D_lowres.jpg,10428595_D_lowres.jpg,55882a0f3fdf8a68579c07254395653b
examples/test_images_subdirs/erato/10428972_V_lowres.jpg,10428972_V_lowres.jpg,0047e7454ce444f67fee1c90cc3ba9cb
examples/test_images_subdirs/erato/10428803_D_lowres.jpg,10428803_D_lowres.jpg,d8bfb73f2d3556390de04aa98822b815
examples/test_images_subdirs/melpomene/10428169_V_lowres.jpg,10428169_V_lowres.jpg,042c9dc294d589ce3f140f14ddab0166
examples/test_images_subdirs/melpomene/10428321_D_lowres.jpg,10428321_D_lowres.jpg,fbeeed30274e424831b06360b587ceb3
examples/test_images_subdirs/melpomene/10428140_V_lowres.jpg,10428140_V_lowres.jpg,c11538f2de5a5e2d6013fc800848d43a
examples/test_images_subdirs/melpomene/10428250_V_lowres.jpg,10428250_V_lowres.jpg,14ac99b1a9913a9d420f21b94d6136d6
```

Image Checksum Mismatch: one value is intentionally altered in the source CSV

```bash
cautious-robot -i examples/HCGSD_test_MD5_mismatch.csv -o examples/test_images_md5_mismatch --subdir-col Species -v "md5"
```

Output:

```console
100%|██████████████████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:01<00:00, 4.23it/s]
Images downloaded from examples/HCGSD_test_MD5_mismatch.csv to examples/test_images_md5_mismatch.
Download logs are in examples/HCGSD_test_MD5_mismatch_log.jsonl and examples/HCGSD_test_MD5_mismatch_error_log.jsonl.
Calculating md5 checksums on examples/test_images_md5_mismatch: 100%|████████████████████████████████| 8/8 [00:00<00:00, 4159.98it/s]
md5 checksums for examples/test_images_md5_mismatch written to examples/HCGSD_test_MD5_mismatch_checksums.csv
8 images were downloaded to examples/test_images_md5_mismatch of the 8 expected.
Image mismatch: 1 image(s) not aligned, see examples/HCGSD_test_MD5_mismatch_missing.csv for missing image info and check logs.
```

```bash
# Check on that mis-aligned image
head -n 2 examples/HCGSD_test_MD5_mismatch_missing.csv
```

Output:

```console
nhm_specimen,species,subspecies,sex,file_url,filename,md5
10428972,erato,petiverana,male,https://github.com/Imageomics/dashboard-prototype/raw/main/test_data/images/ventral_images/10428972_V_lowres.png,10428972_V_lowres.jpg,mismatch
```
Development
To develop the package further:
- Clone the repository and create a branch.
- Install with dev dependencies:
  ```bash
  pip install -e ".[dev]"
  ```
- Install pre-commit hook:
  ```bash
  pre-commit install
  pre-commit autoupdate # optionally update
  ```
- Run tests:
  ```bash
  pytest
  ```
[1] The test images are from the Cuthill Gold Standard Dataset, which was processed from Cuthill et al. (original dataset available at doi:10.5061/dryad.2hp1978).
Owner
- Name: Imageomics Institute
- Login: Imageomics
- Kind: organization
- Website: https://imageomics.osu.edu
- Twitter: imageomics
- Repositories: 4
- Profile: https://github.com/Imageomics
Citation (CITATION.cff)
abstract: "Images from CSV sequential downloader that runs and records checksums on downloaded image folder."
authors:
  - family-names: "Campolongo"
    given-names: "Elizabeth G."
    orcid: "https://orcid.org/0000-0003-0846-2413"
  - family-names: "Thompson"
    given-names: "Matthew J."
    orcid: "https://orcid.org/0000-0003-0583-8585"
  - family-names: "Duan"
    given-names: "Zoe"
    orcid: "https://orcid.org/0000-0002-8547-5907"
  - family-names: "Lapp"
    given-names: "Hilmar"
    orcid: "https://orcid.org/0000-0001-9107-0714"
cff-version: 1.2.0
date-released: "2025-04-03"
identifiers:
  - description: "The GitHub release URL of tag v1.0.0."
    type: url
    value: "https://github.com/Imageomics/cautious-robot/releases/tag/v1.0.0"
  - description: "The GitHub URL of the commit tagged with v1.0.0."
    type: url
    value: "https://github.com/Imageomics/cautious-robot/tree/e7d6718ce33fb4c1d71fd4e4bc1b7b809d5133b0"
keywords:
  - imageomics
  - metadata
  - CSV
  - images
  - download
  - "sequential download"
  - verifier
  - checksums
  - downsized
  - downloader
  - url
license: MIT
message: "If you use this software, please cite it using the metadata from this file."
repository-code: "https://github.com/Imageomics/cautious-robot"
title: "Cautious Robot"
version: "1.0.0"
doi: "10.5281/zenodo.15133580"
type: software
Packages
- Total packages: 1
- Total downloads: pypi 10 last-month
- Total dependent packages: 0
- Total dependent repositories: 0
- Total versions: 1
- Total maintainers: 2
pypi.org: cautious-robot
Simple downloader that downloads images from URLs in a CSV and names them by the given column. Then uses sum-buddy to gather and record checksums for all downloaded images.
- Homepage: https://github.com/Imageomics/cautious-robot
- Documentation: https://cautious-robot.readthedocs.io/
- License: MIT License
- Latest release: 1.0.0, published 9 months ago
Dependencies
- argparse *
- pandas *
- pillow *
- requests *
- sum-buddy @ git+https://github.com/Imageomics/sum-buddy.git@v0.1.0-alpha
- actions/checkout v4 composite
- actions/setup-python v5 composite