2018-neon-beetles-processing

Processing code and exploratory notebooks for BeetlePalooza dataset (2018 NEON Ethanol-preserved Ground Beetles).

https://github.com/imageomics/2018-neon-beetles-processing

Science Score: 75.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 2 DOI reference(s) in README
  • Academic publication links
    Links to: zenodo.org
  • Academic email domains
  • Institutional organization owner
    Organization imageomics has institutional domain (imageomics.osu.edu)
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (10.1%) to scientific vocabulary

Keywords

annotation beetles eda images segmentation
Last synced: 4 months ago

Repository

Basic Info
Statistics
  • Stars: 0
  • Watchers: 5
  • Forks: 2
  • Open Issues: 4
  • Releases: 1
Created over 1 year ago · Last pushed 5 months ago
Metadata Files
Readme License Citation

README.md

BeetlePalooza Dataset Code

This repository hosts the code and notebooks used to explore and process the BeetlePalooza dataset: 2018 NEON Ethanol-preserved Ground Beetles.

Data Exploration and Analysis

Getting Started

In a fresh Python environment, run `pip install -r requirements.txt`.

The CSVs explored in the notebooks are read directly from Hugging Face by URL; each URL points to a specific commit, so the data version is fixed. Adjusted CSVs are saved to a `data/` folder that is ignored by git because the files are too large (versioning them would require git lfs), so they are stored on Hugging Face instead.
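As a minimal sketch of the commit-pinned loading described above: Hugging Face serves dataset files at `resolve/<revision>/<path>` URLs, where the revision can be a branch name or a commit hash. The helper names `pinned_csv_url` and `load_and_save` are hypothetical, and the assumption that `BeetleMeasurements.csv` sits at the root of the dataset repository is not confirmed by this README.

```python
from pathlib import Path
from urllib.parse import quote

HF_DATASET = "imageomics/2018-NEON-beetles"  # dataset repository named in the citation


def pinned_csv_url(filename: str, revision: str, repo_id: str = HF_DATASET) -> str:
    """Build a Hugging Face 'resolve' URL for one file at a fixed revision."""
    return (
        f"https://huggingface.co/datasets/{repo_id}"
        f"/resolve/{quote(revision, safe='')}/{quote(filename)}"
    )


def load_and_save(filename: str, revision: str, out_dir: str = "data"):
    """Read a pinned CSV and write a local copy under out_dir (needs network access)."""
    import pandas as pd  # imported lazily; pandas is pinned in requirements.txt

    df = pd.read_csv(pinned_csv_url(filename, revision))
    out_path = Path(out_dir)
    out_path.mkdir(exist_ok=True)
    df.to_csv(out_path / filename, index=False)
    return df


# "main" is only a stand-in here; pass a commit hash to pin the data version.
url = pinned_csv_url("BeetleMeasurements.csv", "main")
```

Passing a commit hash instead of a branch name keeps a notebook reproducible even after the dataset is updated.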

Notebooks

Note: The first two notebooks are exploratory, while EDA-0-3 and EDA-0-4 are largely data curation rather than exploration. Each notebook has a paired .py file generated using jupytext.

  • EDA-0-1 gives an initial exploration of the data. It adds and renames some columns in the metadata file for the dataset.
  • EDA-0-2 explores the variation in the measurements of individuals (with graphs). It also checks potential outliers and creates a measurement ID, providing a unique ID for each row of the beetle measurement CSV.
  • EDA-0-3 fixes the outliers that were mislabeled, then generates individual-based CSVs for segmentation and for linking to the individual images that the segmentation process creates.
  • EDA-0-4 adds "scientificName", "genus", "species", "NEONsampleID", and "siteID" columns to the resized beetle metadata file to display alongside the resized images in the dataset viewer on HF. It also adds metadata files for `group_images` and `group_images_masks` for the dataset viewer and fixes a mislabeled image.
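One way to picture the measurement ID step in EDA-0-2 is the sketch below, which tags every measurement row with a random UUID. This is an illustration only: the notebook's actual ID scheme is not described here, and the column names (`pictureID`, `individual`, `structure`, `measureID`) and the helper `add_measurement_ids` are assumptions.

```python
import uuid


def add_measurement_ids(rows, id_column="measureID"):
    """Return copies of the rows, each with a new unique ID; the input rows are untouched."""
    return [{**row, id_column: str(uuid.uuid4())} for row in rows]


# Toy rows standing in for the beetle measurement CSV (values are made up).
measurements = [
    {"pictureID": "img_001.jpg", "individual": 1, "structure": "ElytraLength"},
    {"pictureID": "img_001.jpg", "individual": 1, "structure": "ElytraWidth"},
    {"pictureID": "img_002.jpg", "individual": 1, "structure": "ElytraLength"},
]

with_ids = add_measurement_ids(measurements)

# Rows that share a picture and individual can now be told apart by their ID.
unique_ids = {row["measureID"] for row in with_ids}
```

A per-row ID of this kind makes it possible to refer to a single measurement when correcting an outlier, without relying on row position.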

Metadata

  • all_measurements is a CSV with all the measurements done by each annotator (each row is a pair of measurements for a single beetle).
  • group_images_sb and group_images_masks_sb are intermediate CSVs generated to align metadata files for the dataset viewer for those folders in the 2018-NEON-beetles Dataset. They were generated using sum-buddy as described in EDA-0-4.ipynb.
  • individual_metadata_full is a CSV with all the measurements done by Isadora Fluck (each row represents an individual beetle with its pair of elytra measurements). This was created for the segmentation process.
  • multi_annotator_count is a CSV with counts of annotations per image, the expected number (based on the number of rows and annotators associated with that image), and the maximum individual number provided for that image (if max_individual is less than 99, that is the number of individuals in that image; if it's 99 or greater, then there may be more individuals based on the individual count and numeric export from Zooniverse).
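The per-image counts and the 99 threshold described above can be sketched roughly as follows. The input layout (one `(picture_id, individual_number)` pair per annotation row) and the names `summarize_annotations` and `count_is_exact` are assumptions for illustration; the expected-count calculation is left out because its exact formula is not given here.

```python
from collections import defaultdict

INDIVIDUAL_CAP = 99  # at or above this value, the image may hold more individuals


def summarize_annotations(rows):
    """rows: iterable of (picture_id, individual_number), one pair per annotation row."""
    summary = defaultdict(lambda: {"annotations": 0, "max_individual": 0})
    for picture_id, individual in rows:
        entry = summary[picture_id]
        entry["annotations"] += 1
        entry["max_individual"] = max(entry["max_individual"], individual)
    for entry in summary.values():
        # Below the cap, max_individual is taken as the number of beetles in the image.
        entry["count_is_exact"] = entry["max_individual"] < INDIVIDUAL_CAP
    return dict(summary)


# Toy annotation rows (values are made up).
rows = [("img_001.jpg", 1), ("img_001.jpg", 2), ("img_001.jpg", 2), ("img_002.jpg", 99)]
result = summarize_annotations(rows)
```

Images flagged with `count_is_exact` set to `False` are the ones that would need a manual check against the Zooniverse export.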

Note that all_measurements.csv and individual_metadata_full.csv are supersets of the individual_metadata.csv in 2018 NEON Ethanol-preserved Ground Beetles (they contributed to its creation from BeetleMeasurements.csv). They are therefore reproduced here under the CC BY-SA 4.0 license and should be cited appropriately if reused.

Segmentation

The segmentation folder contains scripts to leverage the elytra length and width coordinates and Meta's Segment-Anything model to segment beetles out.
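To illustrate how measurement coordinates could drive a promptable model such as Segment Anything, the sketch below turns the two elytra line segments into foreground point prompts. It assumes each measurement is stored as a pair of (x, y) endpoints in pixels, which may not match the repository's actual format, and `elytra_point_prompts` is a hypothetical helper rather than a function from `predict_masks.py`.

```python
def midpoint(segment):
    """Midpoint of a line segment given as ((x1, y1), (x2, y2))."""
    (x1, y1), (x2, y2) = segment
    return ((x1 + x2) / 2, (y1 + y2) / 2)


def elytra_point_prompts(length_segment, width_segment):
    """Return (points, labels): both segment midpoints, labelled 1 for foreground."""
    points = [midpoint(length_segment), midpoint(width_segment)]
    labels = [1] * len(points)
    return points, labels


# Toy coordinates: a vertical length line and a horizontal width line.
points, labels = elytra_point_prompts(
    length_segment=((100, 40), (100, 120)),
    width_segment=((80, 60), (120, 60)),
)
```

With the `segment_anything` package, points and labels like these would be converted to arrays and passed to `SamPredictor.predict` through its `point_coords` and `point_labels` arguments.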

To configure your environment using conda, run `cd segmentation`, then `conda env create --file environment.yaml`, then `conda activate beetles`.

To predict segmentation masks for beetles imaged, run: `python3 predict_masks.py --images <path to images> --csv <path to image metadata csv> --results <optional; name for csv of segmentation results>`

To remove the background of beetle images using their segmentation masks, run: `python3 remove_background.py --images <path to images> --masks <path to segmentation masks>`

To crop out individual beetles from images, run: `python3 individual_beetles.py --images <path to group_images> --csv <path to metadata/individual_metadata_full.csv>`

FYI: The script to crop out individual beetles works well for images whose coords_pix_length and coords_pix_width information is correctly aligned to the beetles. However, there are a couple of images where this is not the case, and for those the segmentation will not produce a clean crop of the individual beetles.
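The dependence on well-aligned coordinates can be seen in a simple crop-box calculation. The sketch below takes the bounding box of the length and width endpoints, pads it, and clamps it to the image; if the coordinates sit off the beetle, the box will too. The endpoint format, the `margin` value, and the helper `crop_box` are assumptions for illustration, not the logic of `individual_beetles.py`.

```python
def crop_box(length_segment, width_segment, image_size, margin=10):
    """Return (left, top, right, bottom) around both segments, clamped to the image."""
    segments = (length_segment, width_segment)
    xs = [x for segment in segments for x, _ in segment]
    ys = [y for segment in segments for _, y in segment]
    image_width, image_height = image_size
    left = max(min(xs) - margin, 0)
    top = max(min(ys) - margin, 0)
    right = min(max(xs) + margin, image_width)
    bottom = min(max(ys) + margin, image_height)
    return left, top, right, bottom


# Toy coordinates near the right edge of a 125 x 200 pixel image.
box = crop_box(
    length_segment=((100, 40), (100, 120)),
    width_segment=((80, 60), (120, 60)),
    image_size=(125, 200),
)
```

The returned tuple follows the (left, upper, right, lower) order that Pillow's `Image.crop` expects, so a box like this could be handed to it directly.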

To remove the background from the individual images, run: `python3 remove_individual_background.py --images <path to group_images> --result <path to folder where results will be saved>`

To crop out elytra from the individual images, run: `python3 segment_elytra.py --images <path to images> --result <path to folder where results will be saved>`

Owner

  • Name: Imageomics Institute
  • Login: Imageomics
  • Kind: organization

Citation (CITATION.cff)

abstract: "This repository hosts the code and notebooks used to explore and process the [BeetlePalooza](https://github.com/Imageomics/BeetlePalooza-2024/wiki) dataset: [2018 NEON Ethanol-preserved Ground Beetles](https://huggingface.co/datasets/imageomics/2018-NEON-beetles)."
authors:
- family-names: Ramirez
  given-names: Michelle
  orcid: "https://orcid.org/0009-0008-8162-5729"
- family-names: Nepovinnykh
  given-names: Ekaterina
  orcid: "https://orcid.org/0000-0002-5045-5041"
- family-names: Ali
  given-names: Sarwan
  orcid: "https://orcid.org/0000-0001-8121-2168"
- family-names: Campolongo
  given-names: "Elizabeth G."
  orcid: "https://orcid.org/0000-0003-0846-2413"
cff-version: 1.2.0
date-released: "2025-08-28"
keywords:
  - imageomics
  - biology
  - image
  - animals
  - CV
  - segmentation
  - EDA
  - annotation
  - beetles
  - "elytra measurements"
  - "body size"
license: MIT
message: "If you find this software helpful in your research, please cite both the software and data."
repository-code: "https://github.com/Imageomics/2018-NEON-beetles-processing"
title: "2018 NEON Ethanol-preserved Ground Beetles Processing"
version: 2.0.0
doi: "10.5281/zenodo.16989738"
type: software
references:
  - authors:
      - family-names: Fluck
        given-names: "Isadora E."
      - family-names: Baiser
        given-names: Benjamin
      - family-names: Wolcheski
        given-names: Riley
      - family-names: Chinniah
        given-names: Isha
      - family-names: Record
        given-names: Sydne
    title: "2018 NEON Ethanol-preserved Ground Beetles"
    #version:
    type: dataset
    doi: "10.57967/hf/5252"
    url: "https://huggingface.co/datasets/imageomics/2018-NEON-beetles"
    date-released: 2025

GitHub Events

Total
  • Create event: 7
  • Release event: 1
  • Issues event: 2
  • Delete event: 3
  • Issue comment event: 9
  • Push event: 10
  • Pull request review comment event: 1
  • Pull request review event: 5
  • Pull request event: 10
Last Year
  • Create event: 7
  • Release event: 1
  • Issues event: 2
  • Delete event: 3
  • Issue comment event: 9
  • Push event: 10
  • Pull request review comment event: 1
  • Pull request review event: 5
  • Pull request event: 10

Issues and Pull Requests

All Time
  • Total issues: 1
  • Total pull requests: 6
  • Average time to close issues: 6 months
  • Average time to close pull requests: about 1 month
  • Total issue authors: 1
  • Total pull request authors: 2
  • Average comments per issue: 1.0
  • Average comments per pull request: 1.0
  • Merged pull requests: 2
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 1
  • Pull requests: 6
  • Average time to close issues: 6 months
  • Average time to close pull requests: about 1 month
  • Issue authors: 1
  • Pull request authors: 2
  • Average comments per issue: 1.0
  • Average comments per pull request: 1.0
  • Merged pull requests: 2
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • egrace479 (1)
Pull Request Authors
  • egrace479 (7)
  • ramirezmichelle (2)
  • hlapp (2)
  • kwadraterry (2)
Top Labels
Issue Labels
wontfix (1)
Pull Request Labels
documentation (4) enhancement (1)

Dependencies

requirements.txt pypi
  • jupyter *
  • jupytext ==1.16.4
  • pandas ==2.2.2
  • seaborn ==0.13.2
segmentation/environment.yaml conda
  • _libgcc_mutex 0.1
  • _openmp_mutex 5.1
  • bzip2 1.0.8
  • ca-certificates 2024.7.2
  • ld_impl_linux-64 2.38
  • libffi 3.4.4
  • libgcc-ng 11.2.0
  • libgomp 11.2.0
  • libstdcxx-ng 11.2.0
  • libuuid 1.41.5
  • ncurses 6.4
  • openssl 3.0.14
  • pip 24.0
  • python 3.10.14
  • readline 8.2
  • setuptools 69.5.1
  • sqlite 3.45.3
  • tk 8.6.14
  • wheel 0.43.0
  • xz 5.4.6
  • zlib 1.2.13