2018-neon-beetles-processing
Processing code and exploratory notebooks for BeetlePalooza dataset (2018 NEON Ethanol-preserved Ground Beetles).
Science Score: 75.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: Found CITATION.cff file
- ✓ codemeta.json file: Found codemeta.json file
- ✓ .zenodo.json file: Found .zenodo.json file
- ✓ DOI references: Found 2 DOI reference(s) in README
- ✓ Academic publication links: Links to zenodo.org
- ○ Academic email domains
- ✓ Institutional organization owner: Organization imageomics has institutional domain (imageomics.osu.edu)
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: Low similarity (10.1%) to scientific vocabulary
Repository
Basic Info
- Host: GitHub
- Owner: Imageomics
- License: mit
- Language: Jupyter Notebook
- Default Branch: main
- Homepage: https://huggingface.co/datasets/imageomics/2018-NEON-beetles
- Size: 63.6 MB
Statistics
- Stars: 0
- Watchers: 5
- Forks: 2
- Open Issues: 4
- Releases: 1
Metadata Files
README.md
BeetlePalooza Dataset Code 
This repository hosts the code and notebooks used to explore and process the BeetlePalooza dataset: 2018 NEON Ethanol-preserved Ground Beetles.
Data Exploration and Analysis
Getting Started
In a fresh Python environment, run:
pip install -r requirements.txt
The CSVs explored in the notebooks are read directly from Hugging Face by URL; each URL points to a specific commit so the data version is fixed. Adjusted CSVs are saved to a data/ folder, which is ignored by git because the files are too large (versioning them would require git lfs, so they are stored on Hugging Face instead).
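As a rough illustration of pinning a CSV to a commit, the sketch below builds a Hugging Face `resolve` URL for a fixed revision and saves a local copy under data/. The helper names, the `<commit-hash>` placeholder, and the location of BeetleMeasurements.csv at the top level of the dataset are assumptions for illustration, not values taken from the notebooks.

```python
from pathlib import Path

HF_BASE = "https://huggingface.co/datasets"


def pinned_csv_url(repo_id: str, revision: str, filename: str) -> str:
    """Build a URL to a dataset file at a fixed revision (commit hash, tag, or branch)."""
    return f"{HF_BASE}/{repo_id}/resolve/{revision}/{filename}"


def load_and_save(url: str, out_name: str, data_dir: str = "data"):
    """Read a CSV from a URL with pandas and save a copy under data_dir."""
    import pandas as pd  # imported lazily so the URL helper works without pandas

    df = pd.read_csv(url, low_memory=False)
    out_dir = Path(data_dir)
    out_dir.mkdir(exist_ok=True)  # data/ is git-ignored in this repository
    df.to_csv(out_dir / out_name, index=False)
    return df


# "<commit-hash>" is a placeholder; substitute the commit you want to pin to.
url = pinned_csv_url(
    "imageomics/2018-NEON-beetles", "<commit-hash>", "BeetleMeasurements.csv"
)
```

Pinning to a commit rather than to `main` means a rerun of the notebook reads the same bytes even after the dataset is updated.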
Notebooks
Note: The first two notebooks are exploratory, while EDA-0-3 and EDA-0-4 are largely data curation rather than exploration. Each notebook has a paired .py file generated using jupytext.
- EDA-0-1 gives an initial exploration of the data. It adds and renames some columns in the metadata file for the dataset.
- EDA-0-2 explores the variation in the measurements of individuals (provides graphs). It also checks the potential outliers and creates a measurement ID, providing a unique ID for the beetle measurement CSV.
- EDA-0-3 fixes the outliers that were mislabeled, then generates individual-based CSVs for segmentation and connection to the individual images to be created from the segmentation process.
- EDA-0-4 adds "scientificName", "genus", "species", "NEONsampleID", and "siteID" columns to the resized beetle metadata file to display alongside the resized images in the dataset viewer on HF. It also adds metadata files for `group_images` and `group_images_masks` for the dataset viewer and fixes a mislabeled image.
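The column additions described for EDA-0-4 amount to a left join from the full metadata onto the resized-image metadata. A minimal dependency-free sketch follows; the join key `pictureID`, the function name, and the toy rows are assumptions made up for illustration, and the notebook itself may do this differently (for example with a pandas merge).

```python
COLUMNS_TO_ADD = ["scientificName", "genus", "species", "NEONsampleID", "siteID"]


def add_columns(resized_rows, source_rows, key="pictureID"):
    """Left-join COLUMNS_TO_ADD from source_rows onto resized_rows by `key`.

    Rows without a match keep their original fields and get None for the
    added columns, so unmatched images are easy to spot afterwards.
    """
    lookup = {row[key]: row for row in source_rows}
    merged = []
    for row in resized_rows:
        match = lookup.get(row[key], {})
        extra = {col: match.get(col) for col in COLUMNS_TO_ADD}
        merged.append({**row, **extra})
    return merged


# Toy rows (invented values, not real dataset records).
source = [{
    "pictureID": "img_001.jpg", "scientificName": "Examplus demo",
    "genus": "Examplus", "species": "demo",
    "NEONsampleID": "SAMPLE-1", "siteID": "SITE",
}]
resized = [{"pictureID": "img_001.jpg"}, {"pictureID": "img_002.jpg"}]
merged = add_columns(resized, source)
```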
Metadata
- all_measurements is a CSV with all the measurements done by each annotator (each row is a pair of measurements for a single beetle).
- group_images_sb and group_images_masks_sb are intermediate CSVs generated to align metadata files for the dataset viewer for those folders in the 2018-NEON-beetles Dataset. They were generated using sum-buddy as described in EDA-0-4.ipynb.
- individual_metadata_full is a CSV with all the measurements done by Isadora Fluck (each row represents an individual beetle with its pair of elytra measurements). This was created for the segmentation process.
- multi_annotator_count is a CSV with counts of annotations per image, the expected number (based on the number of rows and annotators associated with that image), and the maximum individual number provided for that image (if max_individual is less than 99, that is the number of individuals in that image; if it is 99 or greater, there may be more individuals based on the individual count and numeric export from Zooniverse).
Note that all_measurements.csv and individual_metadata_full.csv are supersets of the individual_metadata.csv in 2018 NEON Ethanol-preserved Ground Beetles (they contributed to its creation from BeetleMeasurements.csv), and are thus reproduced here under the CC BY-SA 4.0 license and should be cited appropriately if re-used.
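The rule for reading the max_individual value in the annotation-count CSV can be written down as a small helper. This is a sketch of the rule as stated above only; the function and type names are hypothetical and do not come from the notebooks.

```python
from typing import NamedTuple


class IndividualCount(NamedTuple):
    count: int      # the max_individual value recorded for the image
    is_exact: bool  # False when the value hit the cap and may undercount


def interpret_max_individual(max_individual: int, cap: int = 99) -> IndividualCount:
    """Below the cap the value is the number of beetles in the image.

    At or above the cap the image may contain more individuals, so the
    individual count and the Zooniverse numeric export should be checked.
    """
    return IndividualCount(max_individual, max_individual < cap)


small = interpret_max_individual(12)
capped = interpret_max_individual(99)
```

Images where `is_exact` is False are the ones worth checking by hand.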
Segmentation
The segmentation folder contains scripts that use the elytra length and width coordinates together with Meta's Segment Anything Model to segment beetles out of the images.
To configure your environment using conda, run:
cd segmentation
conda env create --file environment.yaml
conda activate beetles
To predict segmentation masks for the beetle images, run:
python3 predict_masks.py --images <path to images> --csv <path to image metadata csv> --results <optional; name for csv of segmentation results>
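For batch runs it can help to assemble this command from Python. The sketch below uses only the three flags documented above; the helper name and the example paths are hypothetical, and `sys.executable` stands in for `python3` so the call uses the active environment.

```python
import sys


def predict_masks_command(images, csv, results=None):
    """Build the argument list for predict_masks.py; --results is optional."""
    cmd = [sys.executable, "predict_masks.py", "--images", str(images), "--csv", str(csv)]
    if results is not None:
        cmd += ["--results", str(results)]
    return cmd


# Example paths are placeholders.
cmd = predict_masks_command("group_images", "image_metadata.csv", results="masks.csv")

# To actually run it from the repository root (not executed here):
#   import subprocess
#   subprocess.run(cmd, check=True, cwd="segmentation")
```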
To remove the background of beetle images using their segmentation masks, run:
python3 remove_background.py --images <path to images> --masks <path to segmentation masks>
To crop out individual beetles from images, run:
python3 individual_beetles.py --images <path to group_images> --csv <path to metadata/individual_metadata_full.csv>
FYI: The script to crop out individual beetles works well for images whose coords_pix_length and coords_pix_width values are correctly aligned to the beetles. However, there are a couple of images where this is not the case, and for those the segmentation will not produce a clean crop of the individual beetles.
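To see why misaligned coordinates give poor crops, consider a simplified bounding-box crop driven by the two measurement lines. This is not the repository's implementation (which relies on segmentation); it assumes, for illustration only, that each measurement is a pair of (x, y) pixel endpoints, and the function name and padding value are hypothetical.

```python
def crop_box(length_line, width_line, image_size, pad=50):
    """Return (left, top, right, bottom) enclosing both lines, padded and clamped.

    length_line / width_line: ((x1, y1), (x2, y2)) pixel endpoints (assumed format).
    image_size: (width, height) of the source image in pixels.
    """
    points = list(length_line) + list(width_line)
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    img_w, img_h = image_size
    left = max(min(xs) - pad, 0)
    top = max(min(ys) - pad, 0)
    right = min(max(xs) + pad, img_w)
    bottom = min(max(ys) + pad, img_h)
    return left, top, right, bottom


# Invented coordinates: a vertical length line crossed by a horizontal width line.
box = crop_box(((100, 200), (100, 400)), ((60, 300), (140, 300)), (1000, 800))
```

Because the box is derived entirely from the line endpoints, any offset between the recorded coordinates and the actual beetle shifts the crop by the same amount.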
To remove the background from the individual images, run:
python3 remove_individual_background.py --images <path to group_images> --result <path to folder where results will be saved>
To crop out elytra from the individual images, run:
python3 segment_elytra.py --images <path to images> --result <path to folder where results will be saved>
Owner
- Name: Imageomics Institute
- Login: Imageomics
- Kind: organization
- Website: https://imageomics.osu.edu
- Twitter: imageomics
- Repositories: 4
- Profile: https://github.com/Imageomics
Citation (CITATION.cff)
abstract: "This repository hosts the code and notebooks used to explore and process the [BeetlePalooza](https://github.com/Imageomics/BeetlePalooza-2024/wiki) dataset: [2018 NEON Ethanol-preserved Ground Beetles](https://huggingface.co/datasets/imageomics/2018-NEON-beetles)."
authors:
  - family-names: Ramirez
    given-names: Michelle
    orcid: "https://orcid.org/0009-0008-8162-5729"
  - family-names: Nepovinnykh
    given-names: Ekaterina
    orcid: "https://orcid.org/0000-0002-5045-5041"
  - family-names: Ali
    given-names: Sarwan
    orcid: "https://orcid.org/0000-0001-8121-2168"
  - family-names: Campolongo
    given-names: "Elizabeth G."
    orcid: "https://orcid.org/0000-0003-0846-2413"
cff-version: 1.2.0
date-released: "2025-08-28"
keywords:
- imageomics
- biology
- image
- animals
- CV
- segmentation
- EDA
- annotation
- beetles
- "elytra measurements"
- "body size"
license: MIT
message: "If you find this software helpful in your research, please cite both the software and data."
repository-code: "https://github.com/Imageomics/2018-NEON-beetles-processing"
title: "2018 NEON Ethanol-preserved Ground Beetles Processing"
version: 2.0.0
doi: "10.5281/zenodo.16989738"
type: software
references:
  - authors:
      - family-names: Fluck
        given-names: "Isadora E."
      - family-names: Baiser
        given-names: Benjamin
      - family-names: Wolcheski
        given-names: Riley
      - family-names: Chinniah
        given-names: Isha
      - family-names: Record
        given-names: Sydne
    title: "2018 NEON Ethanol-preserved Ground Beetles"
    #version:
    type: dataset
    doi: "10.57967/hf/5252"
    url: "https://huggingface.co/datasets/imageomics/2018-NEON-beetles"
    date-released: 2025
GitHub Events
Total
- Create event: 7
- Release event: 1
- Issues event: 2
- Delete event: 3
- Issue comment event: 9
- Push event: 10
- Pull request review comment event: 1
- Pull request review event: 5
- Pull request event: 10
Last Year
- Create event: 7
- Release event: 1
- Issues event: 2
- Delete event: 3
- Issue comment event: 9
- Push event: 10
- Pull request review comment event: 1
- Pull request review event: 5
- Pull request event: 10
Issues and Pull Requests
Last synced: 4 months ago
All Time
- Total issues: 1
- Total pull requests: 6
- Average time to close issues: 6 months
- Average time to close pull requests: about 1 month
- Total issue authors: 1
- Total pull request authors: 2
- Average comments per issue: 1.0
- Average comments per pull request: 1.0
- Merged pull requests: 2
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 1
- Pull requests: 6
- Average time to close issues: 6 months
- Average time to close pull requests: about 1 month
- Issue authors: 1
- Pull request authors: 2
- Average comments per issue: 1.0
- Average comments per pull request: 1.0
- Merged pull requests: 2
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- egrace479 (1)
Pull Request Authors
- egrace479 (7)
- ramirezmichelle (2)
- hlapp (2)
- kwadraterry (2)
Top Labels
Issue Labels
Pull Request Labels
Dependencies
- jupyter *
- jupytext ==1.16.4
- pandas ==2.2.2
- seaborn ==0.13.2
- _libgcc_mutex 0.1
- _openmp_mutex 5.1
- bzip2 1.0.8
- ca-certificates 2024.7.2
- ld_impl_linux-64 2.38
- libffi 3.4.4
- libgcc-ng 11.2.0
- libgomp 11.2.0
- libstdcxx-ng 11.2.0
- libuuid 1.41.5
- ncurses 6.4
- openssl 3.0.14
- pip 24.0
- python 3.10.14
- readline 8.2
- setuptools 69.5.1
- sqlite 3.45.3
- tk 8.6.14
- wheel 0.43.0
- xz 5.4.6
- zlib 1.2.13