biosignature-detection-with-py-gc-ms-data-using-machine-learning

https://github.com/ghystad/biosignature-detection-with-py-gc-ms-data-using-machine-learning

Science Score: 67.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 3 DOI reference(s) in README
  • Academic publication links
    Links to: zenodo.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (8.6%) to scientific vocabulary
Last synced: 6 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: ghystad
  • License: gpl-3.0
  • Language: R
  • Default Branch: main
  • Size: 80.1 KB
Statistics
  • Stars: 0
  • Watchers: 2
  • Forks: 0
  • Open Issues: 0
  • Releases: 2
Created over 1 year ago · Last pushed 9 months ago
Metadata Files
Readme License Citation

README.md

Biosignature Detection with Py GC-MS Data using Machine Learning

These R scripts accompany the paper:

Detecting Biosignatures in Complex Molecular Mixtures from pyrolysis Gas Chromatography Mass Spectrometry Data using Machine Learning

Grethe Hystad¹, H. James Cleaves II²,³,⁴, Collin A. Garmon¹,⁵*, Michael L. Wong⁶,⁷, Anirudh Prabhu⁶, George D. Cody⁶, and Robert M. Hazen⁶

1. Department of Mathematics and Statistics, Purdue University Northwest, Hammond, IN, 46323, USA.

2. Department of Chemistry, Howard University, Washington, D.C. 20059, USA.

3. Earth Life Science Institute, Tokyo Institute of Technology, Tokyo, Japan.

4. Blue Marble Space Institute for Science, Seattle, WA 98104, USA.

5. Current Address: Department of Mathematical Sciences, Purdue University Fort Wayne, Fort Wayne, IN, 46805, USA.*

6. Earth and Planets Laboratory, Carnegie Science, Washington, DC 20015, USA.

7. NHFP Sagan Fellow, NASA Hubble Fellowship Program, Space Telescope Science Institute, Baltimore, MD 21218, USA.

Introduction

Three-dimensional data (scan number / mass-to-charge ratio / intensity) from biotic and abiotic samples are obtained by pyrolysis-gas chromatography-mass spectrometry (pyr-GC-MS). The R scripts preprocess these data and use machine learning to predict whether a sample is biotic or abiotic. Nested resampling is used to estimate the prediction performance of the model. The pattern of features that are "important" in distinguishing abiotic from biotic samples is then determined and shown graphically. The following machine learning classification methods are used: random forest, logistic regression with an elastic net penalty, support vector machines (SVM), and eXtreme Gradient Boosting (XGBoost). The Benjamini-Hochberg procedure is used to adjust for multiple hypothesis testing.
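The nested resampling idea mentioned above can be sketched in base R. This is a minimal illustration, not the scripts' actual implementation: the data are synthetic, a hand-written k-nearest-neighbour classifier stands in for the four models named above, and the names `knn_predict`, `make_folds`, `cv_accuracy`, and `nested_cv` are hypothetical. The inner folds choose the tuning parameter, and the outer folds score the tuned model on samples that played no part in that choice.

```r
# Illustrative sketch only: synthetic data and a hand-written k-NN classifier
# stand in for the real pyr-GC-MS features and the models used in the scripts.
set.seed(1)
n <- 60; p <- 5
y <- rep(c("abiotic", "biotic"), each = n / 2)
x <- matrix(rnorm(n * p), nrow = n, ncol = p)
x[y == "biotic", 1:2] <- x[y == "biotic", 1:2] + 2  # make two features informative

# Predict each test row by majority vote among its k nearest training rows.
knn_predict <- function(train_x, train_y, test_x, k) {
  apply(test_x, 1, function(row) {
    d <- sqrt(colSums((t(train_x) - row)^2))  # Euclidean distance to every training row
    votes <- table(train_y[order(d)[seq_len(k)]])
    names(votes)[which.max(votes)]
  })
}

# Randomly assign n observations to v folds (not stratified by class).
make_folds <- function(n, v) sample(rep(seq_len(v), length.out = n))

# Mean cross-validated accuracy for one candidate value of k.
cv_accuracy <- function(x, y, k, folds) {
  mean(sapply(sort(unique(folds)), function(f) {
    test <- folds == f
    pred <- knn_predict(x[!test, , drop = FALSE], y[!test],
                        x[test, , drop = FALSE], k)
    mean(pred == y[test])
  }))
}

# Outer loop estimates performance; inner loop selects k on the outer training set only.
nested_cv <- function(x, y, k_grid = c(1, 3, 5, 7), outer_v = 5, inner_v = 4) {
  outer <- make_folds(nrow(x), outer_v)
  sapply(seq_len(outer_v), function(f) {
    test <- outer == f
    x_tr <- x[!test, , drop = FALSE]
    y_tr <- y[!test]
    inner <- make_folds(nrow(x_tr), inner_v)
    inner_acc <- sapply(k_grid, function(k) cv_accuracy(x_tr, y_tr, k, inner))
    best_k <- k_grid[which.max(inner_acc)]
    pred <- knn_predict(x_tr, y_tr, x[test, , drop = FALSE], best_k)
    mean(pred == y[test])
  })
}

outer_acc <- nested_cv(x, y)
mean(outer_acc)  # estimate of out-of-sample accuracy
```

Because the tuning parameter is selected using only the outer training folds, the averaged outer accuracy is not inflated by the tuning step, which is the point of nesting the two loops.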

Data

The 150 pyr-GC-MS samples can be found at https://osf.io/8sywr/?view_only=7d450ad4f9af48dfab5e194d041c0c13 with reference:

Cleaves, H. J. (2023). A robust molecular biosignature based on machine learning (Version 1) [Dataset]. OSF. DOI 10.17605/OSF.IO/EMBH8

The 16 new samples are provided in the archive "HystadEtAl16newFiles.zip". The other 134 samples are provided in the archive "Cleavesetal.pyrGCMSData.zip".
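As a rough sketch of how the two archives might be unpacked before running the scripts, the following base R snippet extracts each one into its own subdirectory. It assumes both archives have already been downloaded into the working directory. The `data` output directory and the helper name `extract_archives` are assumptions for illustration, and the layout of the files inside the archives is not specified here.

```r
# Archive names are taken from the README; the "data" output directory and
# the helper below are hypothetical and only illustrate one possible setup.
archives <- c("HystadEtAl16newFiles.zip", "Cleavesetal.pyrGCMSData.zip")

extract_archives <- function(archives, out_dir = "data") {
  found <- archives[file.exists(archives)]
  for (a in found) {
    # One subdirectory per archive, named after the archive without ".zip".
    target <- file.path(out_dir, tools::file_path_sans_ext(basename(a)))
    dir.create(target, recursive = TRUE, showWarnings = FALSE)
    utils::unzip(a, exdir = target)
  }
  missing <- setdiff(archives, found)
  if (length(missing) > 0) {
    message("Not found in working directory: ", paste(missing, collapse = ", "))
  }
  invisible(found)  # names of the archives that were actually extracted
}

extracted <- extract_archives(archives)
sample_files <- list.files("data", recursive = TRUE, full.names = TRUE)
length(sample_files)  # number of extracted files (0 if no archive was found)
```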

The outputs of the R Markdown files are displayed on:

https://ghystad.github.io/Biosignature-Detection-with-Py-GC-MS-Data-using-Machine-Learning/

The outputs of the R Markdown files are also displayed on RPubs:

https://rpubs.com/ghystad/nestedresamplingXGBoost

https://rpubs.com/ghystad/nestedresamplingsupportvectormachines

https://rpubs.com/ghystad/nestedresamplingrandom_forest

https://rpubs.com/ghystad/nestedresamplingelastic_net

https://rpubs.com/ghystad/MonteCarlosimulationsrandomforest

https://rpubs.com/ghystad/igraphsrandomforest

https://rpubs.com/ghystad/graphsrandomforest

https://rpubs.com/ghystad/graphselasticnetandBenjamini_Hochberg

https://rpubs.com/ghystad/graphselasticnet

https://rpubs.com/ghystad/graphsBenjaminiHochberg

https://rpubs.com/ghystad/correlationgraphsrandom_forest

https://rpubs.com/ghystad/chromatograms

License

The application is released under the GNU GPL version 3 license.

Author of the R-scripts

Grethe Hystad

Session info

R version 4.3.3 (2024-02-29 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 11 x64 (build 22621)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.utf8 LC_CTYPE=English_United States.utf8 LC_MONETARY=English_United States.utf8
[4] LC_NUMERIC=C LC_TIME=English_United States.utf8

time zone: America/Chicago
tzcode source: internal

attached base packages:
[1] stats graphics grDevices datasets utils methods base

loaded via a namespace (and not attached):
[1] compiler_4.3.3 fastmap_1.1.1 cli_3.6.2 htmltools_0.5.8.1 tools_4.3.3 yaml_2.3.8 rmarkdown_2.26
[8] knitr_1.46 xfun_0.43 digest_0.6.35 rlang_1.1.3 renv_1.0.7 evaluate_0.23


Owner

  • Login: ghystad
  • Kind: user

Citation (CITATION.cff)

# This CITATION.cff file was generated with cffinit.
# Visit https://bit.ly/cffinit to generate yours today!

cff-version: 1.1.2
title: >-
  Using Machine Learning for Biosignature Detection with
  Pyrolysis Gas Chromatography-Mass Spectrometry Data
message: >-
  If you use this software, please cite it using the
  metadata from this file.
type: software
authors:
  - given-names: Grethe
    family-names: Hystad
    name-particle: Grethe
    email: ghystad@pnw.edu
    affiliation: Purdue University Northwest
    orcid: 'https://orcid.org/0000-0001-9572-1019'
repository-code: >-
  https://github.com/ghystad/Biosignature-Detection-with-Py-GC-MS-Data-using-Machine-Learning.git
license: GPL-3.0
version: '1.1.2'
date-released: '2025-06-07'

GitHub Events

Total
  • Release event: 3
  • Push event: 15
  • Create event: 5
Last Year
  • Release event: 3
  • Push event: 15
  • Create event: 5