minnow_segmented_traits
Using trait segmentation to understand minnow trait evolution across an ecological gradient
Repository
Basic Info
- Host: GitHub
- Owner: hdr-bgnn
- License: mit
- Language: R
- Default Branch: main
- Size: 144 MB
Statistics
- Stars: 1
- Watchers: 2
- Forks: 2
- Open Issues: 9
- Releases: 2
Metadata Files
README.md
Minnow Segmented Traits
We use a segmentation model to extract traits from minnows (Family: Cyprinidae).
This repository serves as a case study of an automated workflow and extraction of morphological traits using machine learning on image data.
We expand upon work already done by BGNN, including metadata collection by the Tulane Team and the Drexel Team (see Leipzig et al. 2021, Pepper et al. 2021, and Karnani et al. 2022), and a segmentation model developed by the Virginia Tech Team. We developed morphology extraction tools (Morphology-analysis) with the help of the Tulane Team. We incorporate these tools into BGNN_Core_Workflow.
Finally, with the help of the Duke Team, we create an automated workflow.

Goals
- Provide a use case for an automated workflow
- Show best practices for interacting with other repositories
- Show the utility of a machine learning segmentation model for accelerating trait extraction from images of specimens
Organization
Scripts
- Data_Manipulation.R: code for manipulating and merging data files
- MinnowSelectionImageQualityMetadata.R: code for image selection
- PresenceAbsenceAnalysis.R: code for analyzing machine learning outputs
- init.R: code to load functions in Functions
Files
- Previous_Measurements: a file of minnow trait measurements by Burress et al., found in that paper's supplemental information. See Burress.md for more details.
Results
- a folder for the outputs from the workflow
  1. tables of results from analyses
  2. /Figures contains all figures created from analyses
Config
- contains the config.yaml file
- the user can change the input files, or the number of images under limit_images
Inputs
Data Files
The Previous_Measurements file is included in this repository.
The Fish-AIR input files will be downloaded from the Fish-AIR API.
This requires adding a Fish-AIR API key to Fish_AIR_API_Key in config/config.yaml.
Alternatively, you can download the Fish-AIR input files from Zenodo and place them in the Files/Fish-AIR/Tulane directory.
Components
The total size of the components is 5.6 GB (as of 5 May 2023).
All weights and dependencies for all components of the workflow are uploaded to Hugging Face or Zenodo.
Metadata by Drexel Team
- Object detection of fish and ruler from fish images
- Repository
- Model Archive
Reformatting of metadata
- Trims the output of the Metadata step to only the values necessary for this project
- Repository
- Code Archive
Crop Image
- Extracts bounding box information from the metadata file
- Resizes and crops fish from image
- Repository
- Code Archive
Segmentation Model by Virginia Tech Team
- Segments fish traits from fish images
- Repository
- Model Archive
Morphology analysis by Tulane Team and Battelle Team
- Tool to calculate presence of traits
- Repository
- Code Archive
Machine Learning Workflow by Battelle Team and Duke Team
- Calls all the above containers
- Repository
- Code Archive
Images
The fish images are from the Great Lakes Invasives Network (GLIN) and stored on Fish-AIR. We are using images specifically from the Illinois Natural History Survey (INHS images).
Image Selection
R code (MinnowSelectionImageQualityMetadata.R) was used to select high-quality minnow images using the IQM and IM metadata files.
All image metadata files are downloaded from Fish-AIR, and the version used is stored on the OSC data commons under the Fish Traits dataverse. The metadata files were generated using the Tulane workflow.
Selection criteria were based on findings from Pepper et al. 2021.
Criteria chosen:
- family == "Cyprinidae"
- specimenView == "left"
- specimenCurved == "straight"
- allPartsVisible == "True"
- partsOverlapping == "True"
- partsFolded == "False"
- uniformBackground == "True"
- partsMissing == "False"
- brightness == "normal"
- onFocus == "True"
- colorIssues == "none"
- containsScaleBar == "True"
- from either INHS or UWZM institutions
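The criteria above amount to a row filter over the image metadata. The following is a minimal Python sketch of that idea, not the repository's R script (MinnowSelectionImageQualityMetadata.R). It assumes each metadata record is a dictionary keyed by the column names listed above; the `institution` column name and the example records are hypothetical.

```python
# Required value for each quality column, taken from the criteria list above.
CRITERIA = {
    "family": "Cyprinidae",
    "specimenView": "left",
    "specimenCurved": "straight",
    "allPartsVisible": "True",
    "partsOverlapping": "True",
    "partsFolded": "False",
    "uniformBackground": "True",
    "partsMissing": "False",
    "brightness": "normal",
    "onFocus": "True",
    "colorIssues": "none",
    "containsScaleBar": "True",
}
# Institutions accepted by the last criterion.
INSTITUTIONS = {"INHS", "UWZM"}


def passes_criteria(record, institution_key="institution"):
    """Return True if a metadata record meets every selection criterion.

    `institution_key` is a hypothetical column name; adjust it to the
    actual metadata schema.
    """
    if record.get(institution_key) not in INSTITUTIONS:
        return False
    return all(record.get(col) == value for col, value in CRITERIA.items())


def select_images(records):
    """Keep only the records that satisfy all criteria."""
    return [r for r in records if passes_criteria(r)]


# Made-up example records, for illustration only.
good = dict(CRITERIA, institution="INHS", id="img-1")
blurry = dict(CRITERIA, institution="INHS", id="img-2", onFocus="False")
other_museum = dict(CRITERIA, institution="XYZ", id="img-3")

selected = select_images([good, blurry, other_museum])
print([r["id"] for r in selected])  # prints ['img-1']
```

Keeping the criteria in one dictionary makes it easy to compare the filter against the list above when a criterion changes.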
Analysis
See more details in Morphology-analysis.
Each segmented image has the following traits: trunk, head, eye, dorsal fin, caudal fin, anal fin, pelvic fin, and pectoral fin. For each segmented trait, there may be more than one "blob", or group of pixels identifying a trait. We record the results in a matrix, presence.absence.matrix.csv.
For each trait, we counted the number of "blobs" and calculated the size of the largest blob as a percentage of the combined size of all blobs for that trait.
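To make the two per-trait numbers concrete, here is a minimal Python sketch that counts blobs in a binary mask and reports the largest blob's share. It is an illustration, not the Morphology-analysis implementation: it assumes blob size is measured in pixels and that pixels are joined by 4-connectivity, neither of which is stated above, and the function names and example mask are hypothetical.

```python
from collections import deque


def blob_sizes(mask):
    """Return the pixel count of each 4-connected blob in a binary mask.

    `mask` is a list of rows; a truthy value marks a pixel that the
    segmentation assigned to the trait.
    """
    rows = len(mask)
    cols = len(mask[0]) if rows else 0
    seen = [[False] * cols for _ in range(rows)]
    sizes = []
    for r in range(rows):
        for c in range(cols):
            if not mask[r][c] or seen[r][c]:
                continue
            # Breadth-first flood fill from an unvisited trait pixel.
            queue = deque([(r, c)])
            seen[r][c] = True
            size = 0
            while queue:
                y, x = queue.popleft()
                size += 1
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    inside = 0 <= ny < rows and 0 <= nx < cols
                    if inside and mask[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            sizes.append(size)
    return sizes


def summarize_trait(mask):
    """Count the blobs and compute the largest blob's share of all trait pixels."""
    sizes = blob_sizes(mask)
    total = sum(sizes)
    percentage = 100.0 * max(sizes) / total if total else 0.0
    return {"number": len(sizes), "percentage": percentage}


# Made-up 3x4 mask with one 3-pixel blob and one 1-pixel blob.
eye_mask = [
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 1],
]
print(summarize_trait(eye_mask))  # prints {'number': 2, 'percentage': 75.0}
```

A trait with a single clean blob gives a count of 1 and a percentage of 100, so lower percentages flag fragmented segmentations.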
All intermediate tables will be saved in the folder "Results".
Figures
We created a heat map to show the success of the segmentation to detect traits from the images.
Figures are in the folder "Results".
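One way to quantify segmentation success per trait is the fraction of images in which the trait was detected at all. The Python sketch below is an assumption-laden illustration, not the code that produced the heat map: it assumes each row holds the blob count per trait for one image, and the row layout and example values are hypothetical.

```python
# Trait names as listed in the Analysis section.
TRAITS = [
    "trunk", "head", "eye", "dorsal fin",
    "caudal fin", "anal fin", "pelvic fin", "pectoral fin",
]


def detection_rates(rows, traits=TRAITS):
    """Return, per trait, the fraction of images with at least one blob."""
    if not rows:
        return {trait: 0.0 for trait in traits}
    return {
        trait: sum(1 for row in rows if row.get(trait, 0) > 0) / len(rows)
        for trait in traits
    }


# Made-up blob counts for four images.
all_found = dict.fromkeys(TRAITS, 1)
rows = [
    all_found,
    {**all_found, "pelvic fin": 0},
    {**all_found, "pelvic fin": 0, "eye": 2},
    all_found,
]
rates = detection_rates(rows)
print(rates["pelvic fin"], rates["eye"])  # prints 0.5 1.0
```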
Running the Workflow
Instructions are provided for running the workflow on a single computer or a SLURM cluster.
The run time for 20 images is about 45 minutes and the run time for all the images is about 2 hours.
Software Requirements
To run the workflow, conda and Singularity (now Apptainer) must be installed.
Component Software Dependencies
This workflow will automatically download and setup the software dependencies required by the workflow components. These dependencies are provided using either Singularity Containers or Conda Environments. Singularity Containers are used to provide the machine learning components essential to this workflow. Singularity Containers enable highly reproducible and portable software components. However, using Singularity Containers can pose challenges for script development by domain scientists. Therefore, we use Conda Environments for the domain scientist scripts included in this workflow.
Hardware Requirements
At minimum, the workflow requires 1 CPU, 5 GB of memory, and 30 GB of disk space. A Linux machine is required because the workflow relies on Singularity containerization.
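Before starting a run, it can help to check the software and disk requirements above. The following Python sketch is a hypothetical helper, not part of the repository. It assumes the executables are named `conda`, `singularity`, and `apptainer`, and it checks only disk space, since the standard library offers no portable memory check.

```python
import shutil


def check_prerequisites(path=".", min_free_gb=30, which=shutil.which):
    """Return a list of problems found; an empty list means all checks passed.

    `which` is injectable so the check can be exercised without the tools
    installed; by default it searches PATH.
    """
    problems = []
    if which("conda") is None:
        problems.append("conda not found on PATH")
    if which("singularity") is None and which("apptainer") is None:
        problems.append("neither singularity nor apptainer found on PATH")
    # Free space on the filesystem that holds `path`, in gigabytes.
    free_gb = shutil.disk_usage(path).free / 1e9
    if free_gb < min_free_gb:
        problems.append(
            f"only {free_gb:.1f} GB free at {path!r}; about {min_free_gb} GB needed"
        )
    return problems


for problem in check_prerequisites():
    print("WARNING:", problem)
```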
Install Workflow Runner
To run the workflow Snakemake v7 with mamba must be installed. (The workflow definition is not compatible with Snakemake v8+.) To handle this we create a new conda environment named "snakemake".
If you are running the workflow on a cluster that provides a conda environment module, you should load that module (e.g. module load miniconda3).
Run the following command to create a conda environment named "snakemake" with the required workflow dependencies.
```console
conda create -c conda-forge -c bioconda -n snakemake snakemake=7 mamba
```
Enter "Y" when prompted to install snakemake and mamba.
If you loaded an environment module, you should unload it (e.g. module purge).
See the official instructions for installing snakemake for more options.
Limit images
In the config/config.yaml file, the user can limit the number of images for a test run by changing the integer under limit_images, or run all images by entering "". Be aware that downloading all the images and running the workflow takes a couple of hours.
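The effect of limit_images can be pictured with a short Python sketch. This is a hypothetical illustration of the behavior described above, not the workflow's own code; in particular, taking the first N images (rather than a random sample) is an assumption, as are the function name and example identifiers.

```python
def apply_image_limit(image_ids, limit_images):
    """Return the images to process for a given limit_images setting.

    An empty string (or None) means "no limit"; otherwise the value is
    read as a non-negative integer and the first that many images are kept.
    """
    image_ids = list(image_ids)
    if limit_images in ("", None):
        return image_ids
    limit = int(limit_images)
    if limit < 0:
        raise ValueError("limit_images must be a non-negative integer or empty")
    return image_ids[:limit]


# Made-up image identifiers.
ids = ["img-1", "img-2", "img-3", "img-4"]
print(apply_image_limit(ids, 2))   # prints ['img-1', 'img-2']
print(apply_image_limit(ids, ""))  # prints ['img-1', 'img-2', 'img-3', 'img-4']
```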
Run snakemake
Run the following commands to activate the conda environment and run the workflow:
```console
source activate snakemake
snakemake --jobs 1 --use-singularity --use-conda
```
The --jobs argument specifies how many jobs Snakemake can run at a time.
Run snakemake on a SLURM Cluster
Running the workflow on a SLURM cluster enables scaling beyond a single machine. The run-workflow.sh sbatch script is provided to run the workflow using sbatch and will process up to 20 jobs simultaneously.
If your SLURM cluster provides a conda environment module, you should load that module before running the next step (e.g. module load miniconda3).
Run the following command to activate the snakemake conda environment:
```console
source activate snakemake
```
Run the workflow in the background:
```console
sbatch run-workflow.sh
```
Then you can monitor the job progress as you would with any SLURM background job.
Some SLURM clusters require providing sbatch a SLURM account name via the --account command line argument.
See the Run-on-OSC wiki article for the commands used to run the workflow on OSC.
Run on Docker
In some cases it is possible to run the workflow using Docker. See the experimental Docker Instructions for more details.
Owner
- Name: Biology Guided Neural Networks
- Login: hdr-bgnn
- Kind: organization
- Repositories: 10
- Profile: https://github.com/hdr-bgnn
Citation (CITATION.cff)
cff-version: 1.0.0
date-released: "2023-05-23"
abstract: "This repository serves as a case study of an automated workflow and extraction of morphological traits using machine learning on image data. We use a segmentation model to extract traits from minnows (Family: Cyprinidae). We expand upon work already done by BGNN, including metadata collection by the Tulane Team and the Drexel Team (see Leipzig et al. 2021, Pepper et al. 2021, and Karnani et al. 2022), and a segmentation model developed by the Virginia Tech Team. We developed morphology extraction tools (Morphology-analysis) with the help of the Tulane Team. We incorporate these tools into BGNN_Core_Workflow. Finally, with the help of the Duke Team, we create an automated workflow."
authors:
  - family-names: Balk
    given-names: Meghan
    orcid: "https://orcid.org/0000-0003-2699-3066"
  - family-names: Bradley
    given-names: John
    orcid: "https://orcid.org/0000-0003-3858-848X"
  - family-names: Tabarin
    given-names: Thibault
    orcid: "https://orcid.org/0000-0003-4256-849X"
  - family-names: Lapp
    given-names: Hilmar
    orcid: "https://orcid.org/0000-0001-9107-0714"
identifiers:
  - description: "Snapshot of Minnow_Segmented_Traits Version 1.0.0"
    type: doi
    value: 10.5281/zenodo.7963343
keywords:
  - "imageomics"
  - "workflow"
  - "machine learning"
license: MIT
message: "If you use this software, please cite it using these metadata."
references:
  - authors:
      - family-names: Tabarin
        given-names: Thibault
      - family-names: Bradley
        given-names: John
      - family-names: Balk
        given-names: Meghan
      - family-names: Lapp
        given-names: Hilmar
    doi: 10.5281/zenodo.7987705
    repository-code: "https://github.com/hdr-bgnn/BGNN_Core_Workflow"
    title: BGNN_Core_Workflow
    type: software
    version: 1.0.1
  - authors:
      - family-names: Karnani
        given-names: Kevin
      - family-names: Pepper
        given-names: Joel
      - family-names: Bakis
        given-names: Yasin
      - family-names: Wang
        given-names: Xiaojun
      - family-names: "Bart, Jr."
        given-names: Henry
      - family-names: Breen
        given-names: "David E"
      - family-names: Greenberg
        given-names: Jane
    doi: 10.57967/hf/0904
    repository-code: "https://github.com/hdr-bgnn/drexel_metadata"
    title: Drexel-metadata-generator
    type: software
    version: 0.6
  - authors:
      - family-names: Tabarin
        given-names: Thibault
      - family-names: Bradley
        given-names: John
      - family-names: Lapp
        given-names: Hilmar
    doi: 10.5281/zenodo.7987576
    repository-code: "https://github.com/hdr-bgnn/drexel_metadata_formatter"
    title: drexel_metadata_formatter
    type: software
    version: 0.0.1
  - authors:
      - family-names: Tabarin
        given-names: Thibault
      - family-names: Bradley
        given-names: John
      - family-names: Lapp
        given-names: Hilmar
    doi: 10.5281/zenodo.7987485
    repository-code: "https://github.com/hdr-bgnn/Crop_image"
    title: Crop_Image
    type: software
    version: 0.0.4
  - authors:
      - family-names: Maruf
        given-names: "M."
      - family-names: Karpatne
        given-names: Anuj
    doi: 10.57967/hf/0832
    repository-code: "https://github.com/hdr-bgnn/BGNN-trait-segmentation"
    title: BGNN-trait-segmentation
    type: software
    version: 0.0.7
  - authors:
      - family-names: Tabarin
        given-names: Thibault
      - family-names: Bradley
        given-names: John
      - family-names: Balk
        given-names: Meghan
      - family-names: Lapp
        given-names: Hilmar
    doi: 10.5281/zenodo.7987697
    repository-code: "https://github.com/hdr-bgnn/Morphology-analysis"
    title: Morphology-analysis
    type: software
    version: 1.0.0
  - authors:
      - family-names: Burress
        given-names: "E.D."
      - family-names: Holcomb
        given-names: "J.M."
      - family-names: Tan
        given-names: "M."
      - family-names: Armbruster
        given-names: "J.W."
    doi: 10.1111/jeb.13024
    title: "Ecological diversification associated with the benthic-to-pelagic transition by North American minnows"
    type: paper
  - authors:
      - family-names: "Bart, Jr."
        given-names: "Henry L."
      - family-names: Bakis
        given-names: Yasin
      - family-names: Altintas
        given-names: Bahadir
      - family-names: Wang
        given-names: Xiaojun
      - family-names: Jebbia
        given-names: Dom
    repository-code: "https://fishair.org/"
    title: "Fish-AIR"
    type: data
repository-code: "https://github.com/hdr-bgnn/Minnow_Segmented_Traits"
title: "Minnow Segmented Traits"
version: 1.0.0