https://github.com/acg-team/mutation-tme-crc

Machine learning analysis of genetic mutations (STRs, SNPs, indels) and tumor microenvironment (TME) features in colorectal cancer using TCGA data.

Science Score: 13.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (10.7%) to scientific vocabulary
Last synced: 9 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: acg-team
  • Default Branch: main
  • Size: 1.82 MB
Statistics
  • Stars: 0
  • Watchers: 1
  • Forks: 0
  • Open Issues: 0
  • Releases: 0
Created over 1 year ago · Last pushed over 1 year ago
Metadata Files
Readme

README.md

Machine learning of genotype-phenotype associations in colorectal cancer tumors from mutation

Introduction

This project explores how genetic mutations relate to the tumor microenvironment (TME) in colorectal cancer. We focus on three types of mutations:

  • short tandem repeats (STRs) – short DNA motifs repeated in tandem.
  • single nucleotide polymorphisms (SNPs) – single base-pair changes in the DNA.
  • insertions and deletions (indels) – small additions or removals of DNA bases.

By studying these mutation types, we aim to understand their connection to mucin production and immune cell presence in tumors.

To apply machine learning, we need structured ways to represent mutations. Here are two approaches:

  1. Mutation counting: for each sample, count the number of mutations of each type (e.g., number of SNPs, number of indels) and use these counts as features.

  2. Dimensionality reduction: since mutation data is high-dimensional, apply methods like principal component analysis (PCA) to reduce the number of features. This will be done separately for each mutation type before combining them in the ML model.
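The mutation-counting representation can be illustrated with a minimal sketch. The `calls` list, the sample IDs, and the `count_features` helper are hypothetical and only stand in for preprocessed TCGA mutation calls; this is not the repository's actual code.

```python
from collections import Counter, defaultdict

# Hypothetical mutation calls as (sample_id, mutation_type) pairs. In practice
# these would come from the TCGA mutation data after preprocessing.
calls = [
    ("S1", "SNP"), ("S1", "SNP"), ("S1", "indel"),
    ("S2", "STR"), ("S2", "SNP"),
    ("S3", "indel"), ("S3", "indel"), ("S3", "STR"),
]

MUTATION_TYPES = ("STR", "SNP", "indel")


def count_features(calls, mutation_types=MUTATION_TYPES):
    """Return {sample_id: [count per mutation type]} in a fixed column order."""
    per_sample = defaultdict(Counter)
    for sample_id, mutation_type in calls:
        per_sample[sample_id][mutation_type] += 1
    # A Counter returns 0 for types a sample does not carry.
    return {
        sample_id: [counts[t] for t in mutation_types]
        for sample_id, counts in per_sample.items()
    }


features = count_features(calls)
for sample_id in sorted(features):
    print(sample_id, features[sample_id])
```

Keeping the column order fixed (STR, SNP, indel) means every sample yields a feature vector of the same length, which is what a downstream model expects.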

Steps

  1. Literature overview. Review existing studies on genetic mutations and their effect on the tumor microenvironment, focusing on mucin production and immune cell composition in colorectal cancer.

  2. Data preparation. Extract mutation data from TCGA and preprocess it for machine learning. Determine the best way to represent different mutation types for predictive modeling.

  3. Machine learning. Train models to predict mucin levels and immune cell presence based on mutation data. The goal is to find which mutation types are most important for understanding the tumor microenvironment.

3.1 Mutation representation

  • use mutation counting and dimensionality reduction as described above.
  • normalize and preprocess features.
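As a rough sketch of the per-type reduction, the snippet below standardizes each mutation-type matrix, runs PCA through an SVD, and concatenates the reduced blocks. The matrices are random placeholders, and the matrix sizes, the five components per type, and the helper names `standardize` and `pca_reduce` are illustrative assumptions rather than choices made in the project.

```python
import numpy as np


def standardize(X):
    """Center each column and scale it to unit variance (constant columns stay 0)."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    std[std == 0] = 1.0  # avoid division by zero for constant features
    return (X - mean) / std


def pca_reduce(X, n_components):
    """Project standardized X onto its top principal components via SVD."""
    Z = standardize(X)
    # Rows of Vt are the principal axes, ordered by decreasing singular value.
    _, _, Vt = np.linalg.svd(Z, full_matrices=False)
    return Z @ Vt[:n_components].T


rng = np.random.default_rng(0)
n_samples = 40
# Hypothetical per-type matrices (samples x loci); sizes are illustrative only.
blocks = {
    "STR": rng.normal(size=(n_samples, 200)),
    "SNP": rng.poisson(1.0, size=(n_samples, 500)).astype(float),
    "indel": rng.poisson(0.3, size=(n_samples, 300)).astype(float),
}

# Reduce each mutation type separately, then concatenate for the ML model.
reduced = [pca_reduce(X, n_components=5) for X in blocks.values()]
X_combined = np.hstack(reduced)
print(X_combined.shape)  # (40, 15)
```

Reducing each type on its own keeps a large block (here the 500 SNP columns) from dominating the components of the smaller ones.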

3.2 Model selection

Since the dataset has 300-500 samples, we will use models that work well with small datasets:

  • random forest
  • support vector machines (SVM) – useful for small datasets too
  • gradient boosting
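One possible way to set up these three model families is sketched below with scikit-learn. It assumes a binary classification target (for example, high versus low mucin), which the README does not specify. The hyperparameters and the synthetic data are placeholders, not tuned or project values.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Candidate models; the SVM is wrapped in a pipeline because it is sensitive
# to feature scale, while the tree-based models are not.
candidate_models = {
    "random_forest": RandomForestClassifier(n_estimators=300, random_state=0),
    "svm": make_pipeline(
        StandardScaler(),
        SVC(kernel="rbf", probability=True, random_state=0),
    ),
    "gradient_boosting": GradientBoostingClassifier(random_state=0),
}

# Synthetic stand-in for a (samples x mutation features) matrix and labels.
X, y = make_classification(
    n_samples=60, n_features=15, n_informative=5, random_state=0
)

predictions = {}
for name, model in candidate_models.items():
    model.fit(X, y)
    predictions[name] = model.predict(X)
    print(name, "fitted,", len(predictions[name]), "predictions")
```

For a continuous target, the regressor counterparts (`RandomForestRegressor`, `SVR`, `GradientBoostingRegressor`) could be swapped in with the same structure.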

3.3 Cross-validation and model evaluation

  • use k-fold cross-validation (e.g., 5-fold) for reliable evaluation.
  • apply metrics like ROC-AUC, accuracy, and F1-score for classification, and RMSE or a similar error metric for regression.
  • analyze feature importance to interpret model predictions.
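The evaluation loop might look like the following sketch, which runs stratified 5-fold cross-validation with scikit-learn's `cross_validate` and then inspects impurity-based feature importances. The synthetic data, the binary target, and the random forest settings are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic stand-in for the mutation feature matrix and a binary phenotype.
X, y = make_classification(
    n_samples=120, n_features=15, n_informative=5, random_state=0
)

model = RandomForestClassifier(n_estimators=100, random_state=0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_validate(model, X, y, cv=cv, scoring=["roc_auc", "accuracy", "f1"])

# Report mean and spread across the five folds for each metric.
for metric in ("roc_auc", "accuracy", "f1"):
    fold_scores = scores[f"test_{metric}"]
    print(f"{metric}: {fold_scores.mean():.3f} +/- {fold_scores.std():.3f}")

# Impurity-based importances from a fit on all data, largest first.
model.fit(X, y)
top = np.argsort(model.feature_importances_)[::-1][:5]
print("top feature indices:", top.tolist())
```

For a regression target, `scoring="neg_root_mean_squared_error"` with a plain `KFold` splitter would play the same role. Impurity-based importances can favor high-cardinality features, so permutation importance is a common cross-check.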

  4. Interpretation. Examine model results to find key genetic factors affecting mucin and immune cell levels. Compare predictions with existing research for validation.

Potential analysis

Understanding correlations between gene mutations and mucin expression may help reveal how mutations influence mucin production.
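A simple starting point for such a correlation analysis is a rank correlation between a per-sample mutation count and a mucin expression value. The sketch below implements Spearman's coefficient in plain Python. The two input lists are made-up numbers, and choosing Spearman over Pearson is an assumption, not something the README prescribes.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Made-up example values, one entry per sample.
mutation_counts = [2, 5, 1, 8, 3, 5]
mucin_expression = [0.4, 1.1, 0.2, 2.5, 0.9, 1.3]
rho = spearman(mutation_counts, mucin_expression)
print(round(rho, 3))
```

A rank-based measure is less sensitive to the skewed distributions typical of mutation counts. A strong correlation would still not by itself show that the mutations cause the change in mucin production.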

Owner

  • Name: Applied Computational Genomics Team
  • Login: acg-team
  • Kind: organization
  • Location: Wädenswil, Switzerland

Computational Genomics tools from Maria Anisimova and collaborators

GitHub Events

Total
  • Push event: 4
Last Year
  • Push event: 4