wapiaw3

Public repository for the Personomics Lab's third annual "write a paper in a week" festival!

https://github.com/tyo8/wapiaw3

Science Score: 49.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 4 DOI reference(s) in README
  • Academic publication links
    Links to: pubmed.ncbi, ncbi.nlm.nih.gov
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (10.6%) to scientific vocabulary

Repository

Basic Info
  • Host: GitHub
  • Owner: tyo8
  • License: mit
  • Language: Shell
  • Default Branch: main
  • Homepage:
  • Size: 197 MB
Statistics
  • Stars: 1
  • Watchers: 1
  • Forks: 1
  • Open Issues: 0
  • Releases: 2
Created almost 3 years ago · Last pushed about 1 year ago

README.md

Opaque Ontology: Neuroimaging Classification of ICD-10 Diagnostic Groups in the UK Biobank

Source and cluster submission code for transdiagnostic patient-control classification in 17 ICD-10 diagnostic groups in the UK Biobank dataset.

Rough Overview

This repository contains the code for, and records the directory tree structure of, the project published in GigaScience (with a bioRxiv preprint) as:

- T. Easley, X. Luo, K. Hannon, P. Lenzini, and J. Bijsterbosch, Opaque Ontology: Neuroimaging Classification of ICD-10 Diagnostic Groups in the UK Biobank, bioRxiv, p. 2024.04.15.589555, Apr. 2024, doi: 10.1101/2024.04.15.589555.

The directories of this repository are roughly subdivided into three categories of function:

1. data selection and pre-processing
2. specification and deployment of classification models
3. visualization and post-hoc statistical analysis of classification results

Structure of the repository

The contents of each function category are described below, directory by directory.

Data Selection and Pre-processing Pipeline

The data generation pipeline splits into two main components: developing patient/control lists based on inclusion/matching criteria and generating neuroimaging features.

brainrep_data

Note that no subject data of any kind is included in this public repository! Instead, the following directories contain the extraction/computation/processing code used to create different types of feature sets within each ICD-10 diagnostic group. Wherever relevant, group-level analyses were computed anew for each group.

- gradientdata: code to compute diffusion-network-based gradient representations from connectivity data at both the subject and group level
- ICAdata: sub-repository to execute the ICA dual-regression (ICA-DR) procedure within each diagnostic group; contains melodic group regression and ICA dual regression, both implemented in FSL, and additional code for extracting secondary features (FC network matrices, partial FC matrices, and amplitudes)
- PROFUMOdata: sub-repository for the computation of the PROFUMO parcellation for each diagnostic group, and additional code for extracting secondary features (FC network matrices, spatial correlation matrices, and amplitudes)
- Schaeferdata: code for extracting parcellation-level timeseries from data and computing FC network matrices, partial FC matrices, and amplitude features from Schaefer-parcellated data
- T1_data: storage location for Freesurfer-extracted structural volume and cortical surface features from T1-weighted structural MRI scans
- sociodemographic: data cleaning and feature extraction code to pull sociodemographic features from the UK Biobank for all subjects
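To make the secondary features named above concrete, the following is a minimal Python sketch, not the repository's own extraction code. It assumes a subject's parcellated data arrive as a (timepoints x nodes) array, takes "amplitude" to mean the temporal standard deviation of each node (a common convention, assumed here), and computes partial FC from an unregularized inverse covariance (the actual pipeline may regularize). The function names `secondary_features` and `upper_triangle` are hypothetical, and the input is synthetic noise.

```python
import numpy as np


def secondary_features(timeseries):
    """Compute FC, partial FC, and amplitudes from a (timepoints x nodes) array."""
    ts = np.asarray(timeseries, dtype=float)

    # Amplitude (assumed definition): temporal standard deviation of each node.
    amplitudes = ts.std(axis=0, ddof=1)

    # Full FC: Pearson correlation between every pair of node timeseries.
    full_fc = np.corrcoef(ts, rowvar=False)

    # Partial FC: rescale the precision matrix (inverse covariance) so that
    # entry (i, j) is the correlation of nodes i and j given all other nodes.
    precision = np.linalg.inv(np.cov(ts, rowvar=False))
    scale = np.sqrt(np.diag(precision))
    partial_fc = -precision / np.outer(scale, scale)
    np.fill_diagonal(partial_fc, 1.0)

    return full_fc, partial_fc, amplitudes


def upper_triangle(matrix):
    """Flatten the unique off-diagonal entries of a symmetric matrix."""
    rows, cols = np.triu_indices_from(matrix, k=1)
    return matrix[rows, cols]


rng = np.random.default_rng(0)
demo_ts = rng.standard_normal((200, 5))  # synthetic: 200 timepoints, 5 nodes
full_fc, partial_fc, amplitudes = secondary_features(demo_ts)
features = upper_triangle(full_fc)  # one feature vector per subject
```

Flattening only the upper triangle avoids feeding a classifier duplicate entries from a symmetric matrix.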

subject_lists

Repository containing lists of electronic ID numbers (eIDs) of patients in each diagnostic group, code for selecting corresponding matched healthy controls, and matched patient/control lists. Also contains some code and eID lists used to troubleshoot problems encountered in the UKB with incomplete or corrupted imaging, diagnostic, or sociodemographic data.
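As an illustration of what selecting matched healthy controls can look like, here is a minimal Python sketch. It is not the repository's matching code, and the matching criteria used in the project are not stated in this README: the sketch assumes exact matching on sex and nearest-neighbor matching on age within a tolerance. The record keys (`eid`, `sex`, `age`), the `max_age_gap` parameter, and all values are hypothetical.

```python
def match_controls(patients, candidates, max_age_gap=2):
    """Greedily pair each patient with one unused control of the same sex and nearest age.

    `patients` and `candidates` are lists of dicts with the hypothetical keys
    "eid", "sex", and "age". Returns (patient_eid, control_eid) pairs; patients
    with no eligible control are left unmatched.
    """
    available = list(candidates)
    pairs = []
    for patient in patients:
        eligible = [
            c for c in available
            if c["sex"] == patient["sex"]
            and abs(c["age"] - patient["age"]) <= max_age_gap
        ]
        if not eligible:
            continue
        best = min(eligible, key=lambda c: abs(c["age"] - patient["age"]))
        pairs.append((patient["eid"], best["eid"]))
        available.remove(best)  # each control is used at most once
    return pairs


# Made-up records for illustration only.
patients = [
    {"eid": 1001, "sex": "F", "age": 61},
    {"eid": 1002, "sex": "M", "age": 55},
]
candidates = [
    {"eid": 2001, "sex": "F", "age": 64},
    {"eid": 2002, "sex": "F", "age": 60},
    {"eid": 2003, "sex": "M", "age": 70},
]
pairs = match_controls(patients, candidates)
```

Greedy matching is order-dependent; it is used here only because it is short, not because the project is known to use it.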

Subject Classification Pipeline

The subject classification code specifies, parameterizes, and propagates the classification model. This pipeline assumes that classification is deployed massively in parallel on a distributed system (e.g., a high-performance computing cluster) operating under a SLURM queue manager.

classification_model

This directory contains the two most important pieces of code in the repository: classify_patients.py and model_specification.py. These Python mini-modules are the central workhorses of the classification project.

multiclass

Adapts the binary prediction engine to a multiclass setting in multiclass.py.

jobsubmissionportal

Central hub for infrastructural bash scripts to assign, distribute, and organize the submission of compute jobs to the job manager. Split into three classes of jobs:

- cross-prediction: multiclass classification jobs
- extraction: jobs extracting neuroimaging predictive features from raw scan data
- prediction: jobs classifying patients vs. controls within diagnostic groups
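The repository's submission scripts are written in bash; purely as an illustration of the fan-out pattern described above (one job per diagnostic group and feature set), the Python sketch below builds `sbatch` command lines without submitting them. The group labels, feature-set names, log paths, and the `--group`/`--features` arguments passed to classify_patients.py are all assumptions, not the script's documented interface. Only the `sbatch` options `--job-name`, `--output`, and `--wrap` are standard SLURM.

```python
import shlex
from itertools import product


def build_sbatch_commands(groups, feature_sets, script="classify_patients.py"):
    """Return one sbatch command (as an argument list) per group/feature-set pair."""
    commands = []
    for group, features in product(groups, feature_sets):
        # Hypothetical command-line interface; the real script's arguments may differ.
        payload = shlex.join(
            ["python", script, "--group", group, "--features", features]
        )
        commands.append([
            "sbatch",
            f"--job-name=clf_{group}_{features}",
            f"--output=logs/clf_{group}_{features}.out",
            f"--wrap={payload}",
        ])
    return commands


# Placeholder labels for two diagnostic groups and two feature sets.
commands = build_sbatch_commands(["F", "G"], ["ICA", "Schaefer"])
for cmd in commands:
    print(shlex.join(cmd))  # inspect before handing each list to subprocess.run
```

Building the argument lists first and printing them makes it easy to dry-run a large submission before anything reaches the queue.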

utils

General-purpose bash code to perform basic bookkeeping functions while navigating large collections of UK Biobank data on the compute cluster.

Statistical Testing and Visualization Pipeline

Both the figures (results visualization) and stat_testing (post-hoc statistical analysis of results) directories contain code referencing directories not present in this repository (i.e., prediction_outputs and cross-prediction_outputs). These outputs would need to be generated in order to test the replicability of our findings.

figures

Visualization code producing summary swarm plots of prediction outputs under varying experimental conditions.

stat_testing

Code to perform statistical significance testing with family-wise error correction and to compute summary statistics of the multiclass prediction's confusion matrix.
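This README does not say which family-wise error correction the project uses. As one common example of the idea, the sketch below implements Holm's step-down procedure in plain Python; it is not taken from the repository. The function name `holm_bonferroni` and the p-values are made up for illustration.

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Return a list of booleans: True where the null is rejected under Holm's procedure."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        # Compare the (rank + 1)-th smallest p-value against alpha / (m - rank).
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # step-down: once one test fails, all larger p-values fail too
    return reject


p_values = [0.001, 0.04, 0.03, 0.012]  # illustrative values only
decisions = holm_bonferroni(p_values)
```

Holm's procedure controls the family-wise error rate at `alpha` and is never less powerful than a plain Bonferroni correction.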

Preparing the Computing Environment

As much as possible, we confined our architecture to commonly used and publicly available code and packages. However, because of the large-scale nature of the problem at hand, our code reflects our use of a high-performance computing cluster (managed, in our case, with SLURM).

Software Dependencies

The analyses in this repository have several dependencies; they are listed below according to their functional role.

Neuroimaging Data Pre-processing:

- nibabel
- FSL
- PROFUMO

Classification and Processing:

- numpy
- scipy
- pandas
- scikit-learn

Figures:

- seaborn
- matplotlib

Queue Management and Cluster Computing

As stated above, the pipelines in this repository were designed for use on a high-performance computing cluster with SLURM job management.

Academic use

This code is freely available and can be adapted for individual use. If you use our methods, please cite the following:

T. Easley, X. Luo, K. Hannon, P. Lenzini, and J. Bijsterbosch, Opaque Ontology: Neuroimaging Classification of ICD-10 Diagnostic Groups in the UK Biobank, bioRxiv, p. 2024.04.15.589555, Apr. 2024, doi: 10.1101/2024.04.15.589555.

Owner

  • Name: Ty Easley
  • Login: tyo8
  • Kind: user

GitHub Events

Total
  • Release event: 1
  • Watch event: 1
  • Push event: 8
  • Create event: 1
Last Year
  • Release event: 1
  • Watch event: 1
  • Push event: 8
  • Create event: 1