https://github.com/bioscan-ml/bioscan-1m

A Step Towards Worldwide Biodiversity Assessment: The BIOSCAN-1M Insect Dataset

BIOSCAN-1M

Overview

This repository contains the code and data for the BIOSCAN-1M-Insect project. The project introduces the BIOSCAN-1M Insect dataset, which can be downloaded via the provided links. The repository includes code for data sampling and splitting, dataset statistics analysis, and image-based classification experiments on the taxonomic classification of insects.

If you use the BIOSCAN-1M Insect dataset and/or this code repository, please cite the paper:

@inproceedings{gharaee2023step,
    title={A Step Towards Worldwide Biodiversity Assessment: The {BIOSCAN-1M} Insect Dataset},
    booktitle={Advances in Neural Information Processing Systems},
    author={Gharaee, Z. and Gong, Z. and Pellegrino, N. and Zarubiieva, I. and Haurum, J. B. and Lowe, S. C. and McKeown, J. T. A. and Ho, C. Y. and McLeod, J. and Wei, Y. C. and Agda, J. and Ratnasingham, S. and Steinke, D. and Chang, A. X. and Taylor, G. W. and Fieguth, P.},
    editor={A. Oh and T. Neumann and A. Globerson and K. Saenko and M. Hardt and S. Levine},
    pages={43593--43619},
    publisher={Curran Associates, Inc.},
    year={2023},
    volume={36},
    url={https://proceedings.neurips.cc/paper_files/paper/2023/file/87dbbdc3a685a97ad28489a1d57c45c1-Paper-Datasets_and_Benchmarks.pdf},
}

ℹ️ Note
The samples from the BIOSCAN-1M dataset are included in the larger BIOSCAN-5M dataset.
BIOSCAN-5M enhances the original with additional attributes such as:
- Geographic coordinates
- Specimen size information
- Cleaned and pruned taxonomic labels
- A new split strategy optimized for multimodal learning

For more details, see the BIOSCAN-5M repository.

Dataset Access

The BIOSCAN-1M Insect dataset is available on GoogleDrive, Zenodo, Kaggle, and HuggingFace. To download a file from GoogleDrive run the following:

    python main.py --file_to_download <file_name>

The following files are available for download from GoogleDrive:
- Metadata (TSV file format): BIOSCAN1MInsectDatasetmetadata.tsv
- Metadata (JSONLD file format): BIOSCAN1MInsectDatasetmetadata.jsonld
- Original images resized to 256 on smaller dimension (ZIP file format): original256.zip
- Original images resized to 256 on smaller dimension (HDF5 file format): original256.hdf5
- Cropped images resized to 256 on smaller dimension (ZIP file format): cropped256.zip
- Cropped images resized to 256 on smaller dimension (HDF5 file format): cropped256.hdf5
- Original full size images (113 ZIP files): bioscanimagesoriginalfullpart{1:113}.zip
- Cropped images (113 ZIP files): bioscanimagescroppedfullpart{1:113}.zip

Dataset

The BIOSCAN dataset provides researchers with information about insects. Each record of the BIOSCAN-1M Insect dataset contains four primary attributes:
- DNA Barcode Sequence
- Barcode Index Number (BIN)
- Biological Taxonomy Classification
- RGB image

I. DNA Barcode Sequence

The DNA barcode sequence gives the order of the nucleotides, Adenine (A), Thymine (T), Cytosine (C), and Guanine (G), within a designated gene region, such as the mitochondrial cytochrome c oxidase subunit I (COI) gene. The example sequence below is visualized as blocks of distinct colors:

TTTATATTTTATTTTTGGAGCATGATCAGGAATAGTTGGAACTTCAATAAGTTTATTAATTCGAACAGAATTAAGCCAACCAGGAATTTTTATTGGTAATGACCAAATTTATAATGTAATTGTTACAGCTCATGCCTTTATTATAATTTTTTTTATAGTTATACCTATTATAATTGGAGGATTCGGAAATTGACTAGTCCCATTAATATTAGGAGCTCCTGATATAGCTTTCCCTCGAATAAATAATATAAGTTTTTGAATGTTACCTCCTTCATTAACTCTATTATTATCAAGAAGAATAGTTGAAAATGGAGCTGGAACAGGATGAACTGTTTATCCCCCTTTATCCTCAGGAACTGCTCATGCAGGAGCTTCTGTTGATCTTGCTATTTTCTCTTTACATTTAGCAGGAATTTCTTCAATTCTTGGAGCTGTAAATTTTATTACAACAATTATTAATATACGATCTTCAGGAATTACACTTGATCGAATACCTTTATTTGTTTGATCTGTAATTATTACAGCTATTCTACTTTTACTGTCTCTTCCAGTATTAGCTGGAGCTATTACAATATTATTAACTGATCGTAATTTAAATACATCTTTTTTTGACCCAATTGGAGGAGGAGATCCAATTCTATATCAACATTTAT

In the visualization of the sequence, each nucleotide is assigned a color as follows:

  • Adenine (A): Red
  • Thymine (T): Blue
  • Cytosine (C): Green
  • Guanine (G): Yellow

These nucleotides, represented by their respective colors, play a pivotal role in defining the genetic information encoded within the DNA sequence.
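The color scheme above can be expressed as a small lookup table. The following is a minimal sketch, not code from the repository: the function names `sequence_to_colors` and `base_composition` are hypothetical, and mapping unknown symbols to gray is an assumption made here for illustration.

```python
from collections import Counter

# Color scheme taken from the list above.
NUCLEOTIDE_COLORS = {"A": "red", "T": "blue", "C": "green", "G": "yellow"}


def sequence_to_colors(sequence):
    """Return one color name per nucleotide.

    Symbols outside A/T/C/G map to 'gray' (an assumption of this sketch).
    """
    return [NUCLEOTIDE_COLORS.get(base, "gray") for base in sequence.upper()]


def base_composition(sequence):
    """Count how often each symbol occurs in the sequence."""
    return Counter(sequence.upper())


if __name__ == "__main__":
    # First 21 bases of the example barcode shown above.
    barcode_prefix = "TTTATATTTTATTTTTGGAGC"
    print(sequence_to_colors(barcode_prefix)[:6])
    print(base_composition(barcode_prefix))
```

The list of color names could be handed to any plotting routine to draw the colored blocks; the composition counter is a quick way to see how strongly a barcode is dominated by A and T.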

II. Barcode Index Number (BIN)

Organisms are grouped into Operational Taxonomic Units (OTUs) through genetic similarity, forming a genetic proxy for species. Each OTU is assigned a unique Barcode Index Number (BIN), serving as a Uniform Resource Identifier (URI). This BIN ensures that genetically identical taxa share the same identifier, registered in the Barcode Of Life Data system (BOLD).

Example BIN: BOLD:AER5166

BINs, acting as an alternative to Linnean names, provide a genetic-centric classification for organisms, emphasizing the significance of genetic code in taxonomy.
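As an illustration of how BINs can act as grouping keys, the sketch below collects specimens that share a BIN. It is not part of the repository: the specimen IDs are made up, and the assumed identifier shape ("BOLD:" followed by three letters and four digits) is inferred from the example BOLD:AER5166 rather than taken from a specification.

```python
import re
from collections import defaultdict

# Assumed BIN shape, modelled on the example "BOLD:AER5166".
BIN_PATTERN = re.compile(r"^BOLD:[A-Z]{3}\d{4}$")


def is_valid_bin(bin_uri):
    """Check a string against the assumed BIN shape."""
    return bool(BIN_PATTERN.match(bin_uri))


def group_by_bin(records):
    """Map each well-formed BIN to the list of specimen IDs that carry it."""
    groups = defaultdict(list)
    for specimen_id, bin_uri in records:
        if is_valid_bin(bin_uri):
            groups[bin_uri].append(specimen_id)
    return dict(groups)


if __name__ == "__main__":
    # Hypothetical (specimen ID, BIN) pairs for demonstration only.
    records = [
        ("specimen-1", "BOLD:AER5166"),
        ("specimen-2", "BOLD:AER5166"),
        ("specimen-3", "BOLD:AAA0001"),
        ("specimen-4", "not-a-bin"),
    ]
    for bin_uri, specimens in group_by_bin(records).items():
        print(bin_uri, specimens)
```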

III. Biological Taxonomy Classification (Linnean names)

Taxonomic rank annotations categorize organisms hierarchically according to their evolutionary relationships, organizing species into groups based on shared characteristics and genetic relatedness.

The figure illustrates the taxonomic classifications of five distinct organisms within the insect class.

IV. RGB Images

We have published six packages, each containing the 1,128,313 images of the BIOSCAN-1M Insect dataset. The packages follow a consistent data structure in which the images are divided into 113 data chunks. Each chunk consists of 10,000 images, except for chunk 113, which contains 8,313 images. The packages come in four variants, two of which are offered in two file formats:
- (1) Original JPEG images (113 zip files).
- (2) Cropped JPEG images (113 zip files).
- (3) Original JPEG images resized to 256 on the smaller dimension (ZIP and HDF5).
- (4) Cropped JPEG images resized to 256 on their smaller dimension (ZIP and HDF5).
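The chunk layout described above can be checked with a few lines of arithmetic. This is a sketch only: the helper names are hypothetical, and `chunk_of` assumes images are assigned to chunks sequentially by a 0-based index, which the text above does not state.

```python
TOTAL_IMAGES = 1_128_313
CHUNK_SIZE = 10_000


def num_chunks(total=TOTAL_IMAGES, chunk_size=CHUNK_SIZE):
    """Number of chunks needed to hold `total` images (ceiling division)."""
    return -(-total // chunk_size)


def images_in_chunk(chunk_number, total=TOTAL_IMAGES, chunk_size=CHUNK_SIZE):
    """Number of images in a 1-based chunk; only the last chunk may be smaller."""
    last = num_chunks(total, chunk_size)
    if not 1 <= chunk_number <= last:
        raise ValueError(f"chunk_number must be in 1..{last}")
    if chunk_number < last:
        return chunk_size
    return total - (last - 1) * chunk_size


def chunk_of(image_index, chunk_size=CHUNK_SIZE):
    """1-based chunk for a 0-based image index (assumes sequential assignment)."""
    return image_index // chunk_size + 1


if __name__ == "__main__":
    print(num_chunks())          # 113
    print(images_in_chunk(113))  # 8313
    print(chunk_of(10_000))      # 2
```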

[Figure: example original insect images from 16 orders of the BIOSCAN-1M Insect dataset.]

| Order | Number of images |
|:--------------|-----------------:|
| Diptera | 896,324 |
| Hymenoptera | 89,311 |
| Coleoptera | 47,328 |
| Hemiptera | 46,970 |
| Lepidoptera | 32,538 |
| Psocodea | 9,635 |
| Thysanoptera | 2,088 |
| Trichoptera | 1,296 |
| Orthoptera | 1,057 |
| Blattodea | 824 |
| Neuroptera | 676 |
| Ephemeroptera | 96 |
| Dermaptera | 66 |
| Archaeognatha | 63 |
| Plecoptera | 30 |
| Embioptera | 6 |

The figure shows original insect images from 16 orders of the BIOSCAN-1M Insect dataset. The number given for each order is the number of images in that order group, and the counts clearly illustrate the degree of class imbalance in the BIOSCAN-1M Insect dataset.
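To put the imbalance into numbers, the per-order counts can be turned into shares and class weights. The sketch below is illustrative and not the authors' method: the weights use the common "balanced" heuristic, total / (number of classes × class count), which is swapped in here as one possible way to compensate for imbalance. Shares are computed over the 16 listed orders only.

```python
# Image counts per order, as listed above.
ORDER_COUNTS = {
    "Diptera": 896_324, "Hymenoptera": 89_311, "Coleoptera": 47_328,
    "Hemiptera": 46_970, "Lepidoptera": 32_538, "Psocodea": 9_635,
    "Thysanoptera": 2_088, "Trichoptera": 1_296, "Orthoptera": 1_057,
    "Blattodea": 824, "Neuroptera": 676, "Ephemeroptera": 96,
    "Dermaptera": 66, "Archaeognatha": 63, "Plecoptera": 30, "Embioptera": 6,
}


def class_shares(counts):
    """Fraction of all counted images that belongs to each class."""
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


def balanced_class_weights(counts):
    """'Balanced' heuristic: total / (num_classes * count) for each class."""
    total = sum(counts.values())
    num_classes = len(counts)
    return {name: total / (num_classes * n) for name, n in counts.items()}


if __name__ == "__main__":
    shares = class_shares(ORDER_COUNTS)
    weights = balanced_class_weights(ORDER_COUNTS)
    for name in ("Diptera", "Embioptera"):
        print(f"{name}: share={shares[name]:.4%}, weight={weights[name]:.3f}")
```

With these counts, Diptera alone accounts for roughly 79% of the listed images, so any weighting scheme of this kind assigns it a weight far below 1 and the rarest orders a weight in the thousands.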

Metadata

In addition to the image dataset, we have published a corresponding metadata file, named BIOSCANInsectDataset_metadata. It is available in both dataframe format (.tsv) and JSON-LD format (.jsonld). The metadata file contains taxonomy annotations, DNA barcode sequences, and indexes and labels for each data sample. It also includes the image names and the unique IDs that reference the storage location of each image. Finally, it records the role of each image in the split sets: whether the image is used for training, validation, or testing in each of the six experiments conducted in our paper.

To run the following steps, first download the dataset and the metadata file, and set the paths accordingly.
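A TSV metadata file of this kind can be read with Python's standard `csv` module. The sketch below runs on a tiny in-memory excerpt; the column names `image_file`, `order`, and `split` are hypothetical placeholders chosen for illustration, and the real file's columns should be checked before adapting it.

```python
import csv
import io
from collections import Counter

# Hypothetical three-row excerpt; the real column names may differ.
SAMPLE_TSV = (
    "image_file\torder\tsplit\n"
    "img_0001.jpg\tDiptera\ttrain\n"
    "img_0002.jpg\tDiptera\ttest\n"
    "img_0003.jpg\tColeoptera\ttrain\n"
)


def read_metadata(handle):
    """Read tab-separated rows into a list of dictionaries keyed by column name."""
    return list(csv.DictReader(handle, delimiter="\t"))


def count_by(rows, column):
    """Count the rows per distinct value of one column."""
    return Counter(row[column] for row in rows)


if __name__ == "__main__":
    rows = read_metadata(io.StringIO(SAMPLE_TSV))
    print(count_by(rows, "split"))
    print(count_by(rows, "order"))
```

For the real file, `io.StringIO(SAMPLE_TSV)` would be replaced by an opened file handle, for example `open(path, newline="")`.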

Dataset Statistics

To see the statistics of the BIOSCAN-1M Insect dataset, run the following:

    python main.py --print_statistics --exp_name <experiment_name>

Dataset Sampling

To split the BIOSCAN-1M Insect dataset into Train, Validation, and Test sets using stratified class-based sampling, run the following:

    python main.py --make_split

To see the statistics of the BIOSCAN-1M Insect dataset split sets, run the following:

    python main.py --print_split_statistics --exp_name <experiment_name>
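The idea behind a stratified class-based split is that each class is divided separately, so every split keeps roughly the same class proportions. The following is a minimal sketch of that idea using only the standard library. It is not the repository's implementation, and the 70/10/20 percentages and the function name are assumptions chosen for illustration.

```python
import random
from collections import defaultdict


def stratified_split(labels, percentages=(70, 10, 20), seed=0):
    """Split sample indices into train/validation/test, class by class.

    `percentages` are hypothetical integer percentages for train and
    validation; whatever remains of each class goes to test.
    """
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    rng = random.Random(seed)
    splits = {"train": [], "validation": [], "test": []}
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        n_train = len(indices) * percentages[0] // 100
        n_val = len(indices) * percentages[1] // 100
        splits["train"].extend(indices[:n_train])
        splits["validation"].extend(indices[n_train:n_train + n_val])
        splits["test"].extend(indices[n_train + n_val:])
    return splits


if __name__ == "__main__":
    labels = ["Diptera"] * 20 + ["Coleoptera"] * 10
    result = stratified_split(labels)
    print({name: len(indices) for name, indices in result.items()})
```

Note that with integer division, very small classes can end up with no training or validation samples, so a real pipeline needs a rule for rare classes.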

Preprocessing

To save time and computational resources when running experiments on the RGB images of the BIOSCAN-1M Insect dataset, we implemented an offline preprocessing step composed of two main modules:
- Resize tool
- Crop tool

The resize tool and the crop tool are applied to the original RGB images ahead of time, so that the subsequent experiments can work directly with the smaller preprocessed images.

[Figure: four example images labeled "Original".]

[Figure: four example images labeled "Cropped".]
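Resizing "to 256 on the smaller dimension" means scaling an image so that its shorter side becomes 256 pixels while the aspect ratio is preserved. The sketch below computes the target size only; the function name is hypothetical and the rounding rule is an assumption, so the repository's resize tool may produce sizes that differ by a pixel.

```python
def resized_dimensions(width, height, target=256):
    """Return (new_width, new_height) with the smaller side equal to `target`.

    The longer side is scaled by the same factor and rounded to the
    nearest integer (the rounding rule is an assumption of this sketch).
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    if width <= height:
        return target, round(height * target / width)
    return round(width * target / height), target


if __name__ == "__main__":
    for size in [(1024, 768), (768, 1024), (500, 500)]:
        print(size, "->", resized_dimensions(*size))
```

The resulting size could then be passed to an image library's resize call, for example Pillow's `Image.resize`.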

To resize and save original full size images, run the following:

    python main.py --resize_image --resized_image_path <path_to_resized_images> --resized_hdf5_path <path_to_resized_hdf5>

To use our cropping tool, download the checkpoint BIOSCANInsectcroptoolcheckpoint.ckpt, which is stored in the directory BIOSCAN1MInsectcheckpoints/croptool_checkpoint on the project's GoogleDrive, and make sure its path is configured correctly in the main.py script. Then run the following to create and save the cropped images as well as their resized versions:

    python main.py --crop_image --cropped_image_path <path_to_cropped_images> --resized_cropped_image_path <path_to_resized_cropped_images>

By additionally setting --cropped_hdf5_path and --resized_cropped_hdf5_path, the cropped images and resized cropped images are saved in HDF5 file format as well.

Classification Experiments

Two image-based classification experiments were conducted, focusing on the taxonomy ranking of insects. The first set of experiments involved classifying BIOSCAN-1M Insect dataset's images into 16 orders. The second set of experiments specifically targeted the Order Diptera and aimed to classify its members into 40 families, which constitute a significant portion of the order.

The figure depicts the class distribution and class imbalance in the BIOSCAN-1M Insect dataset. We focus on the 16 most densely populated orders (top) and the 40 most densely populated Diptera families (bottom). The figure shows that class imbalance is an inherent characteristic of the insect community.

Train

To train a baseline model on a classification task, run the following command, setting the name of the experiment:

    python main.py --loader --train --data_format <hdf5/folder> --exp_name <experiment_name>

Both the folder and HDF5 data formats are supported, making it convenient to conduct experiments using the dataset packages.

Test

To evaluate our top-performing models, which were trained in the experiments described in the BIOSCAN-1M-Insect paper, download the available checkpoints, which are stored in the directory BIOSCAN1MInsectcheckpoints/classificationcheckpoints on the GoogleDrive, and make sure the paths to the dataset images, the metadata file, and the results are configured correctly in the main.py script.

Then, for order-level classification using the resized and cropped images of the BIOSCAN-1M Insect Large dataset, run the following:

    python main.py --loader --test --exp_name large_insect_order --best_model large_insect_order_vit_base_patch16_224_CE_s2 --model vit_base_patch16_224 --loss CE --seed 2

The figure presents the per-class top-1 test accuracy of the Insect-Order and Diptera-Family classification experiments on the Large dataset.
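Per-class top-1 accuracy is the fraction of test samples of each class whose highest-scoring prediction matches the true label. The sketch below shows that computation on plain label lists. It is illustrative rather than the repository's evaluation code, and the function names are hypothetical.

```python
from collections import defaultdict


def per_class_top1_accuracy(true_labels, predicted_labels):
    """Return {class: fraction of its samples predicted correctly}."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    correct = defaultdict(int)
    total = defaultdict(int)
    for true, predicted in zip(true_labels, predicted_labels):
        total[true] += 1
        if true == predicted:
            correct[true] += 1
    return {label: correct[label] / total[label] for label in total}


def macro_average(per_class):
    """Unweighted mean over classes, so rare classes count as much as common ones."""
    return sum(per_class.values()) / len(per_class)


if __name__ == "__main__":
    # Made-up labels for demonstration only.
    true = ["Diptera", "Diptera", "Coleoptera", "Coleoptera"]
    predicted = ["Diptera", "Coleoptera", "Coleoptera", "Coleoptera"]
    accuracy = per_class_top1_accuracy(true, predicted)
    print(accuracy)
    print(macro_average(accuracy))
```

On a heavily imbalanced dataset, reporting accuracy per class in this way keeps a dominant class from hiding poor performance on rare ones.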

Generalization

To assess the generalization capabilities of our models trained on the BIOSCAN-1M-Insect dataset, specifically for order-level classification of resized and cropped images from the BIOSCAN-1M Insect Large dataset, first make sure that the paths to the new images and to the trained model are configured correctly in the generalization.py script. Then run:

    python generalization.py

Requirements

The packages required to run the experiments are listed in the requirements.txt file.

Copyright and License

The images included in the BIOSCAN-1M Insect dataset available through this repository are subject to the following copyright and licensing restrictions:

  • Copyright Holder: CBG Photography Group
  • Copyright Institution: Centre for Biodiversity Genomics (email: CBGImaging@gmail.com)
  • Photographer: CBG Robotic Imager
  • Copyright License: Creative Commons Attribution 3.0 Unported (CC BY 3.0)
  • Copyright Contact: collectionsBIO@gmail.com
  • Copyright Year: 2021

Owner

  • Name: BIOSCAN
  • Login: bioscan-ml
  • Kind: organization
  • Email: contact@bioscancanada.org

Illuminating biodiversity with DNA-based identification systems

Dependencies

requirements.txt pypi
  • JPype1 ==1.4.1
  • Jinja2 ==3.1.2
  • Markdown ==3.4.3
  • MarkupSafe ==2.1.1
  • Pillow ==9.4.0
  • PySocks ==1.7.1
  • PyYAML ==6.0
  • Werkzeug ==2.2.3
  • absl-py ==1.4.0
  • brotlipy ==0.7.0
  • cachetools ==5.3.0
  • certifi ==2022.12.7
  • cffi ==1.15.1
  • charset-normalizer ==2.0.4
  • coco-eval ==0.0.4
  • contourpy ==1.0.7
  • cryptography ==39.0.1
  • cycler ==0.11.0
  • filelock ==3.9.0
  • flit_core ==3.8.0
  • fonttools ==4.39.2
  • gmpy2 ==2.1.2
  • google-auth ==2.16.3
  • google-auth-oauthlib ==0.4.6
  • grpcio ==1.51.3
  • h5py ==3.8.0
  • huggingface-hub ==0.13.3
  • idna ==3.4
  • jupyter *
  • kiwisolver ==1.4.4
  • matplotlib ==3.7.1
  • mkl-fft ==1.3.1
  • mkl-random ==1.2.2
  • mkl-service ==2.4.0
  • mpmath ==1.2.1
  • networkx ==2.8.4
  • numpy ==1.23.5
  • oauthlib ==3.2.2
  • opencv-python ==4.6.0.66
  • packaging ==23.0
  • pandas ==1.5.3
  • pip ==23.0.1
  • protobuf ==4.22.1
  • pyOpenSSL ==23.0.0
  • pyasn1 ==0.4.8
  • pyasn1-modules ==0.2.8
  • pycocotools ==2.0.6
  • pycparser ==2.21
  • pyparsing ==3.0.9
  • python-dateutil ==2.8.2
  • pytorch-lightning ==1.9.3
  • pytz ==2023.2
  • requests ==2.28.1
  • requests-oauthlib ==1.3.1
  • rsa ==4.9
  • scikit-learn ==1.2.0
  • setuptools ==65.6.3
  • six ==1.16.0
  • sympy ==1.11.1
  • tensorboard ==2.12.0
  • tensorboard-data-server ==0.7.0
  • tensorboard-plugin-wit ==1.8.1
  • timm ==0.6.13
  • torch ==1.12.0
  • torchaudio ==0.12.0
  • torchvision ==0.13.0
  • tqdm ==4.65.0
  • transformers ==4.26.1
  • typing_extensions ==4.4.0
  • urllib3 ==1.26.14
  • wget ==3.2
  • wheel ==0.38.4