lc25000-cancer-classification
LC25000 Cancer Classification - Full ML Pipeline
This repository provides a complete, reproducible pipeline for training, evaluating, and interpreting a deep learning model on the LC25000 histopathology dataset.
Table of Contents
- Project Directory Setup
- Environment Setup
- Data Download & Extraction
- 2.1 Set Working Directory
- 2.2 Place kaggle.json
- 2.3 Download and Extract Dataset
- Dataset Splitting
- Model Summary
- Training
- Plotting Training Metrics
- Animated Training Curves
- Evaluation on Test Set
- Visualize Predictions as an Image Grid
- Visualize Misclassifications
- Grad-CAM Visualization
- Grad-CAM Grid (side-by-side)
0. Project Directory Setup
Purpose: Ensure all required folders exist so scripts and outputs work without errors.
Process:
- Run the setup cell in the notebook or execute the following in Python:
python
import os
folders = ['data', 'notebooks', 'outputs', 'results', 'sample_images', 'saved_models', 'scripts']
for folder in folders:
    os.makedirs(folder, exist_ok=True)
Result:
- The data, notebooks, outputs, results, sample_images, saved_models, and scripts folders exist in the current working directory.
1. Environment Setup
Purpose: Install all required Python dependencies.
Process:
- Run:
bash
pip install -r requirements.txt
- On Apple Silicon (M1/M2) Macs, GPU acceleration uses PyTorch's MPS backend; install PyTorch with:
bash
pip install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cpu
Result:
- All packages required by the pipeline are installed.
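After installation, it can help to confirm which accelerator PyTorch will actually use. The following is a minimal sketch and not part of the repository's scripts: `pick_device` is a hypothetical helper name, and it falls back to the CPU when PyTorch is not importable or no GPU backend is available.

```python
def pick_device() -> str:
    """Return "cuda", "mps", or "cpu" depending on what PyTorch reports."""
    try:
        import torch
    except ImportError:
        # PyTorch is not installed yet, so there is nothing to probe.
        return "cpu"
    if torch.cuda.is_available():
        return "cuda"  # NVIDIA GPU
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"  # Apple Silicon GPU (M1/M2)
    return "cpu"


print("Selected device:", pick_device())
```

If this prints cpu on a machine that has a GPU, the installed PyTorch build is the first thing to check.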
2. Data Download & Extraction (Kaggle)
2.1 Set Working Directory
Purpose: Ensure your notebook or script is running from the project root so all file paths work correctly.
Instructions:
- In your notebook, run:
python
import os
os.chdir('/.../.../cancer_clasification_lc25000')
print("Current working directory:", os.getcwd())
- Replace /.../.../cancer_clasification_lc25000 with the actual path to your project root if needed.
2.2 Place kaggle.json
Purpose: Provide your Kaggle API credentials for dataset download.
Instructions:
- Go to Kaggle Account Settings and click "Create New API Token" to download kaggle.json.
- Place kaggle.json in your project root directory (the same directory as your notebook or script).
2.3 Download and Extract Dataset
Purpose: Download the LC25000 dataset from Kaggle and extract it for use in the pipeline.
Instructions:
- Install the Kaggle CLI:
bash
pip install kaggle
- Move kaggle.json to the correct location and set permissions:
bash
mkdir -p ~/.kaggle
mv kaggle.json ~/.kaggle/kaggle.json
chmod 600 ~/.kaggle/kaggle.json
- Download and unzip the dataset:
bash
mkdir -p data
kaggle datasets download andrewmvd/lung-and-colon-cancer-histopathological-images -p data/
unzip -q data/lung-and-colon-cancer-histopathological-images.zip -d data/Lung_and_Colon_Cancer
- Skip this step if data/Lung_and_Colon_Cancer/ already exists.
Result:
- Raw images are available in data/Lung_and_Colon_Cancer/.
Troubleshooting:
- If you get a FileNotFoundError for kaggle.json, make sure your working directory is set to your project root and that kaggle.json is present there before running the commands above.
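The troubleshooting check above can also be done from Python before calling the Kaggle CLI. This is a small sketch that looks in the two locations mentioned in this section (~/.kaggle and the project root); `locate_kaggle_json` is a hypothetical helper, not something the repository provides.

```python
from pathlib import Path
from typing import Optional


def locate_kaggle_json(project_root: Path, home: Optional[Path] = None) -> Optional[Path]:
    """Return the first kaggle.json found in ~/.kaggle or the project root."""
    home = home if home is not None else Path.home()
    candidates = [home / ".kaggle" / "kaggle.json", project_root / "kaggle.json"]
    for candidate in candidates:
        if candidate.is_file():
            return candidate
    return None


found = locate_kaggle_json(Path.cwd())
print("kaggle.json:", found if found else "not found; see step 2.2")
```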
3. Dataset Splitting
Purpose: Split the raw dataset into train/val/test sets for reproducible experiments.
Process:
- Run:
bash
python -m scripts.split_dataset
- This creates data/lc25000_split/ with train/, val/, and test/ subfolders.
Result:
- Data is organized into training, validation, and test sets.
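The internals of scripts.split_dataset are not shown in this README. As an illustration of what makes a split reproducible, here is one common approach: shuffle each class's files with a fixed seed, then cut the list by percentage. The 70/15/15 ratios, the seed, and the function name are assumptions for this sketch, and it only computes the assignment without copying any files.

```python
import random
from pathlib import Path
from typing import Dict, List


def split_class_files(files: List[Path], seed: int = 42,
                      train_pct: int = 70, val_pct: int = 15) -> Dict[str, List[Path]]:
    """Shuffle one class's files with a fixed seed and cut them into three subsets."""
    shuffled = sorted(files)  # sort first so the result does not depend on listing order
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) * train_pct // 100
    n_val = len(shuffled) * val_pct // 100
    return {
        "train": shuffled[:n_train],
        "val": shuffled[n_train:n_train + n_val],
        "test": shuffled[n_train + n_val:],  # whatever remains
    }


# Example with 20 made-up file names for a single class.
demo = split_class_files([Path(f"img_{i:02d}.jpeg") for i in range(20)])
print({name: len(paths) for name, paths in demo.items()})
```

Splitting per class rather than over the whole dataset keeps the class balance the same in all three subsets.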
4. Model Summary
Purpose: Review the architecture, output shapes, and parameter counts of the model.
Process:
- Run:
bash
python -m scripts.model_summary --num_classes 5 --input_size 1 3 224 224
- Adjust arguments if your data/model shape is different.
Result:
- A detailed summary table of the model is printed.
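As a sanity check on the printed summary, the total parameter count can be worked out by hand. This sketch assumes the model follows the standard ResNet18 layout (as in torchvision) with the final fully connected layer resized to `num_classes`; if the repository's model differs, the numbers will too. All function names here are illustrative.

```python
def conv_params(c_in: int, c_out: int, kernel: int) -> int:
    """Weights of a bias-free 2D convolution."""
    return c_in * c_out * kernel * kernel


def bn_params(channels: int) -> int:
    """Learnable scale and shift of a batch-norm layer."""
    return 2 * channels


def basic_block_params(c_in: int, c_out: int) -> int:
    """Two 3x3 conv + BN pairs, plus a 1x1 projection when channels change."""
    total = conv_params(c_in, c_out, 3) + bn_params(c_out)
    total += conv_params(c_out, c_out, 3) + bn_params(c_out)
    if c_in != c_out:
        total += conv_params(c_in, c_out, 1) + bn_params(c_out)
    return total


def resnet18_param_count(num_classes: int) -> int:
    total = conv_params(3, 64, 7) + bn_params(64)  # stem
    c_in = 64
    for c_out in (64, 128, 256, 512):  # four stages, two blocks each
        total += basic_block_params(c_in, c_out)
        total += basic_block_params(c_out, c_out)
        c_in = c_out
    total += 512 * num_classes + num_classes  # final linear layer with bias
    return total


print(f"{resnet18_param_count(5):,}")  # prints 11,179,077 under these assumptions
```

With 1000 classes the same function gives 11,689,512, the familiar ResNet18 total, so only the final layer changes when the number of classes does.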
5. Training
Purpose: Train the ResNet18 model on the LC25000 dataset.
Process:
- Run:
bash
python -m scripts.train
- The script will save the best model to saved_models/ and training metrics to results/.
Result:
- Trained model weights and training metrics are saved for later use.
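The bookkeeping described in this step (track the best epoch, record per-epoch metrics) can be sketched without any training code. The per-epoch numbers below are made up, and the JSON field names and the `record_training` helper are assumptions rather than the format scripts.train actually writes; only the training_metrics_<timestamp>.json naming pattern comes from this README.

```python
import json
import tempfile
import time
from pathlib import Path


def record_training(history, results_dir: Path) -> dict:
    """Pick the best epoch by validation accuracy and write all metrics to JSON."""
    best = max(history, key=lambda epoch: epoch["val_acc"])
    results_dir.mkdir(parents=True, exist_ok=True)
    timestamp = time.strftime("%Y%m%d_%H%M%S")
    out_path = results_dir / f"training_metrics_{timestamp}.json"
    payload = {"epochs": history, "best_epoch": best["epoch"]}
    out_path.write_text(json.dumps(payload, indent=2))
    return {"best_epoch": best["epoch"], "path": out_path}


# Made-up numbers for three epochs, written to a throwaway directory.
history = [
    {"epoch": 1, "val_loss": 0.60, "val_acc": 0.81},
    {"epoch": 2, "val_loss": 0.35, "val_acc": 0.93},
    {"epoch": 3, "val_loss": 0.38, "val_acc": 0.91},
]
with tempfile.TemporaryDirectory() as tmp:
    summary = record_training(history, Path(tmp) / "results")
    print("Best epoch:", summary["best_epoch"])
    print("Metrics file:", summary["path"].name)
```

In a real training loop the model checkpoint would typically be written at the moment a new best validation score is observed, rather than selected afterwards.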
6. Plotting Training Metrics
Purpose: Visualize loss and accuracy curves to monitor training progress.
Process:
- Run:
bash
python -m scripts.plot --json_path results/training_metrics_<timestamp>.json
- Replace <timestamp> with the timestamp in the name of your metrics file.
Result:
- Plots are saved in outputs/ for loss and accuracy.
7. Animated Training Curves
Purpose: See an animated visualization of how loss and accuracy evolve over epochs.
Process:
- Run:
bash
python -m scripts.animate_training_curves
- The latest metrics file is used automatically.
- This step animates the training and validation loss and accuracy curves over epochs, so you can see how the model improves during training.
How to use:
- Run the Python script directly in a terminal with python -m scripts.animate_training_curves. This displays the animation in a separate window (best for local use).
- Or copy and run the provided code cell in your notebook to see the animation inline in the notebook output (recommended for Jupyter/Colab).
Tip: The notebook cell version is best for interactive exploration. The script version is useful for automated runs or when working outside a notebook.
Result:
- An animation of the training curves is displayed.
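How the script picks the latest metrics file is not documented here. One plausible way, shown as a sketch under that caveat, is to glob for the training_metrics_*.json pattern in results/ and take the most recently modified match; `latest_metrics_file` is a hypothetical name.

```python
from pathlib import Path
from typing import Optional


def latest_metrics_file(results_dir: Path) -> Optional[Path]:
    """Return the most recently modified training_metrics_*.json, or None."""
    if not results_dir.is_dir():
        return None
    candidates = list(results_dir.glob("training_metrics_*.json"))
    if not candidates:
        return None
    return max(candidates, key=lambda path: path.stat().st_mtime)


print("Latest metrics file:", latest_metrics_file(Path("results")))
```

If the wrong run gets animated, listing results/ by modification time is a quick way to see which file such a rule would select.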
8. Evaluation on Test Set
Purpose: Evaluate the trained model on the test set and save detailed results.
Process:
- Run:
bash
python -m scripts.evaluate_on_test
- Outputs:
- outputs/classification_report.txt: Precision, recall, F1-score per class
- outputs/confusion_matrix.png: Confusion matrix plot
- outputs/test_predictions.csv: Per-image predictions (filename, true label, predicted label)
Result:
- Quantitative evaluation and per-image predictions are available for further analysis.
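To make the numbers in the classification report concrete, the sketch below computes precision, recall, and F1 per class directly from true and predicted labels. It is a dependency-free illustration of the definitions, not the code the evaluation script uses, and the label values in the example are placeholders.

```python
from collections import Counter


def per_class_metrics(y_true, y_pred):
    """Compute precision, recall, and F1 for every label that appears."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted this class, but it was wrong
            fn[truth] += 1  # missed an image of this class
    report = {}
    for label in sorted(set(y_true) | set(y_pred)):
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        report[label] = {"precision": precision, "recall": recall, "f1": f1}
    return report


# Placeholder labels: one image of class "a" is mistaken for class "b".
report = per_class_metrics(["a", "a", "b", "b"], ["a", "b", "b", "b"])
for label, scores in report.items():
    print(label, {name: round(value, 3) for name, value in scores.items()})
```

In this example class "a" has perfect precision but a recall of 0.5, which is the kind of asymmetry the per-class report is meant to expose.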
9. Visualize Predictions as an Image Grid
Purpose: Visually inspect a random sample of test predictions.
Process:
- Run:
bash
python -m scripts.visualize_predictions --csv_path outputs/test_predictions.csv --n_images 9 --cols 3 --output_path outputs/prediction_grid.png
- Adjust --n_images and --cols as desired.
Result:
- A grid of test images with true and predicted labels is displayed and saved.
10. Visualize Misclassifications
Purpose: Focus on and analyze the images the model got wrong.
Process:
- Run:
bash
python -m scripts.visualize_misclassifications --csv_path outputs/test_predictions.csv --n_images 9 --cols 3 --output_path outputs/misclassified_grid.png
Result:
- A grid of misclassified images is displayed and saved for error analysis.
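Selecting the misclassified rows from the predictions CSV amounts to keeping rows where the two label columns differ. The sketch below shows this with the standard csv module. The column names filename, true_label, and predicted_label are assumptions based on the description in step 8, so check the header of your own outputs/test_predictions.csv first.

```python
import csv
import io


def misclassified_rows(csv_file, true_col="true_label", pred_col="predicted_label"):
    """Return the rows whose predicted label differs from the true label."""
    reader = csv.DictReader(csv_file)
    return [row for row in reader if row[true_col] != row[pred_col]]


# In-memory stand-in for the predictions file, with made-up rows.
sample = io.StringIO(
    "filename,true_label,predicted_label\n"
    "a.jpeg,lung_n,lung_n\n"
    "b.jpeg,lung_aca,lung_scc\n"
    "c.jpeg,colon_n,colon_n\n"
)
errors = misclassified_rows(sample)
print([row["filename"] for row in errors])
```

For the real file, pass an open file handle, for example `open("outputs/test_predictions.csv", newline="")`, instead of the in-memory sample.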
11. Grad-CAM Visualization
Purpose: Interpret model predictions by visualizing which parts of the image influenced the decision.
Process:
- Run:
bash
python -m scripts.gradcam --image_path <path_to_image> --model_path <path_to_model>
- Replace <path_to_image> and <path_to_model> as needed.
Result:
- Grad-CAM heatmap is saved in outputs/ for the selected image.
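For readers new to the technique: Grad-CAM weights each feature map of a convolutional layer by the spatial average of the gradient of the class score with respect to that map, sums the weighted maps, and applies a ReLU. The sketch below shows that arithmetic on plain nested lists so it runs without PyTorch. It illustrates the formula only and is not the code in scripts.gradcam, where activations and gradients would normally be tensors captured from the model.

```python
def grad_cam(activations, gradients):
    """Combine K feature maps (each H x W) into one normalized heatmap."""
    n_maps = len(activations)
    height = len(activations[0])
    width = len(activations[0][0])
    # Channel weights: global average of each gradient map.
    weights = [sum(sum(row) for row in grad) / (height * width) for grad in gradients]
    # Weighted sum over channels, followed by ReLU.
    cam = [
        [
            max(0.0, sum(weights[k] * activations[k][i][j] for k in range(n_maps)))
            for j in range(width)
        ]
        for i in range(height)
    ]
    # Scale to [0, 1] so the map can be overlaid on the image.
    peak = max(max(row) for row in cam)
    if peak > 0:
        cam = [[value / peak for value in row] for row in cam]
    return cam


# Two tiny 2x2 feature maps with made-up values.
activations = [[[1, 2], [3, 4]], [[1, 1], [1, 1]]]
gradients = [[[1, 1], [1, 1]], [[-2, -2], [-2, -2]]]
heatmap = grad_cam(activations, gradients)
print(heatmap)  # [[0.0, 0.0], [0.5, 1.0]]
```

The ReLU keeps only regions that push the class score up, which is why the top row of this example ends up at zero.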
12. Grad-CAM Grid (side-by-side)
Purpose: Compare original images and Grad-CAM heatmaps for a set of (optionally misclassified) images.
Process:
- Run:
bash
python -m scripts.visualize_gradcam_grid --csv_path outputs/test_predictions.csv --model_path <path_to_model> --n_images 4 --cols 2 --only_misclassified --output_path outputs/gradcam_grid.png
- Adjust arguments as needed.
Result:
- A grid of original and Grad-CAM images is displayed and saved for qualitative analysis.
Reproducibility & Tips
- Always run the steps in order for a clean workflow.
- If you change the dataset or scripts, re-run the relevant steps.
- All outputs are saved in the appropriate folders for easy access and sharing.
Run the Project on Google Colab
If you prefer running the LC25000 cancer classification workflow step by step in a cloud environment (no local setup required), use the dedicated Google Colab notebook below:
Open the LC25000 Classification Colab Notebook
Features:
- No installation needed
- GPU support available on Colab
- All steps: dataset download, preprocessing, model training, evaluation, Grad-CAM
How to Use:
- Click the link to open the notebook in Google Colab.
- Follow each cell in order, from environment setup to final visualization.
- Upload your kaggle.json when prompted to enable dataset download from Kaggle.
- Run all cells to reproduce the results and visualizations.
Make sure you are signed in to your Google account to use Colab, and enable GPU under Runtime > Change runtime type > Hardware Accelerator.
Citation
If you use this pipeline, please cite the original LC25000 dataset and this repository.
License
This project is for academic and research use. Dataset usage must comply with original terms of use.
Dependencies
- jupyterlab >=4.0
- matplotlib >=3.7
- numpy >=1.24,<2.0
- opencv-python >=4.0.0
- pillow >=9.5
- scikit-learn >=1.3
- seaborn >=0.12
- torch >=2.0
- torchinfo >=1.7
- torchvision >=0.15