spec-cnn

https://github.com/genteml/spec-cnn

Repository

Basic Info

Host: GitHub
Owner: GenTeML
Language: Jupyter Notebook
Default Branch: master
Size: 11.1 MB

Statistics

Stars: 2
Watchers: 0
Forks: 2
Open Issues: 0
Releases: 2

Created almost 6 years ago · Last pushed almost 3 years ago

Spec-CNN

The code runs in any local or browser-based Python environment on Mac, Windows, or Linux. Source code and datasets are licensed under Creative Commons BY-NC. The latest script versions and corresponding DOIs can be found on Zenodo.

Quickstart:

Both Python files and Google Colaboratory Python files are provided. There are two basic programs to run: one uses the raw Raman test and training spectra, and one uses the continuous wavelet transform (CWT) processed Raman test and training spectra. Training and testing take ~10-45 minutes depending on the settings used. Results and metrics, along with classification uncertainties, are displayed at the bottom of your Python or Google Colaboratory environment. Because results vary naturally from run to run, run the program several times and average the results.

The Machine Learning Raman Open Dataset (MLROD) used by this code is available at The Open Repository (DOI) and as part of NASA’s AHED. For further information on the software, training, and test datasets, please refer to the Earth and Space Sciences publication: Berlanga, G., Williams, Q., & Temiquel, N. (2022). Convolutional Neural Networks as a Tool for Raman Spectral Mineral Classification Under Low Signal, Dusty Mars Conditions. Please forward code questions to the authors.

Both the raw and CWT data include two test sets: the clean datasets with 0% dust cover (the default) and the dusty datasets with 50% dust cover. To switch between the raw test sets, edit “testinpath” in the “Classify Test Spectra with Trained CNN” code block of the corresponding Python script so that it points to “Data/Raw Data/Labeled Test/” (clean) or “Data/Raw Data/Labeled Test/Dusty/” (dusty). To switch between the CWT test sets, edit “testinpath” in the same code block of the corresponding Python script so that it points to “Data/Preprocessed/Continuous Wavelet Transformation/Test Set/Labeled/” (clean) or “Data/Preprocessed/Continuous Wavelet Transformation/Test Set/Labeled/Dusty/” (dusty). The scripts can also be directed to the RRUFF test set files used to externally verify CNN performance: point the directory to “RRUFF_test” instead of “Dusty”.
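The path choices above can be kept in one place so that switching test sets is a single edit. This is a minimal sketch, not code from the repository: the directory strings are copied from this README, while the `TEST_PATHS` table and the `select_test_path` helper are hypothetical names introduced for illustration.

```python
# Directory strings come from this README; the lookup table and helper
# below are hypothetical conveniences, not part of the original scripts.
TEST_PATHS = {
    ("raw", "clean"): "Data/Raw Data/Labeled Test/",
    ("raw", "dusty"): "Data/Raw Data/Labeled Test/Dusty/",
    ("cwt", "clean"): "Data/Preprocessed/Continuous Wavelet Transformation/Test Set/Labeled/",
    ("cwt", "dusty"): "Data/Preprocessed/Continuous Wavelet Transformation/Test Set/Labeled/Dusty/",
}


def select_test_path(kind, dust):
    """Return the test directory for kind in {'raw', 'cwt'} and dust in {'clean', 'dusty'}."""
    try:
        return TEST_PATHS[(kind, dust)]
    except KeyError:
        raise ValueError(f"unknown test set: {kind!r}, {dust!r}") from None


# Value to assign to testinpath in the "Classify Test Spectra with Trained CNN" block.
testinpath = select_test_path("cwt", "dusty")
```

Failing loudly on an unknown combination is deliberate: a mistyped path would otherwise only surface later as a file-not-found error during testing.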

Technology used:

Python 3.x as the general programming language
Pandas and Numpy used for data processing and manipulation
SciKit Learn used for traditional models, signal processing (continuous wavelet transform), PCA, Scaling, and splitting data into test and dev sets
Tensorflow for CNN model
Jupyter Notebooks on Google Colaboratory for the development environment

How to use the programs:

Adding new data

1. Point training or testing data to the corresponding finpath or testinpath data paths in the CNN Raw and CNN CWT core scripts.
2. Preprocess the data using DataProcessingBatch.ipynb. This processes all raw data in the data folders. The program is described below in “What Happens In Preprocessing”. The DataProcessingBatch program was made to point to "/.../Peaks Only/" and "/.../Continuous Wavelet Transformation/" folders as fout folders. Whatever folder it points to needs to have a subfolder called "/Labeled/" or "/Unlabeled/".

Training and Testing Models

Run CNN Raw or CNN CWT for all model training and testing within your local Python environment or Google Colaboratory. These are self-contained scripts that run as is, or the user can edit the CNN parameters as desired.
Upon completion, the confusion matrix, precision, recall, accuracy, and F1 scores are reported at the bottom of each model. Google Colaboratory should take ~10 minutes to execute on the full dataset, depending on the settings. Execution times in a local Python environment vary from system to system and can reach ~45 minutes.
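For readers who want to check the reported scores by hand, the relationship between a confusion matrix and the other metrics can be sketched as follows. This is an illustrative stdlib-only sketch, not the code used by the scripts; the function name `classification_metrics` and the toy matrix are hypothetical, and the matrix is made up rather than taken from any CNN run.

```python
def classification_metrics(confusion):
    """Per-class precision, recall and F1, plus overall accuracy.

    confusion[i][j] counts spectra whose true label is i and predicted label is j.
    """
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[k][k] for k in range(n))
    per_class = []
    for k in range(n):
        true_pos = confusion[k][k]
        predicted = sum(confusion[i][k] for i in range(n))  # column sum
        actual = sum(confusion[k])                          # row sum
        precision = true_pos / predicted if predicted else 0.0
        recall = true_pos / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        per_class.append({"precision": precision, "recall": recall, "f1": f1})
    accuracy = correct / total if total else 0.0
    return {"accuracy": accuracy, "per_class": per_class}


# Toy 3-class matrix, invented for illustration only.
toy = [[5, 1, 0],
       [0, 4, 2],
       [1, 0, 7]]
metrics = classification_metrics(toy)
```

Precision is read down a column (everything predicted as a class), recall along a row (everything that truly belongs to it), which is why a class can score well on one and poorly on the other.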

When in doubt, check all input and output file paths.

What Happens In Preprocessing:

Columns with non-numeric values are dropped
The data set is trimmed to wavenumbers in the range [150,1100]
Data shape is standardized: data is grouped in bands of 5 wavenumbers and the minimum value of each band is taken. For example, all data between 150 and 155 is grouped, and the minimum of those values is used for the new datapoint “150”. This ensures that the data entering downstream processing is standardized and cleans up the noise created by cosmic rays.
Continuous wavelet transformation (CWT) is performed on the data to smooth, baseline correct, and highlight the peaks. Once transformed, the data is no longer spectral data but a processed form of it. The transform pushes some non-peak data into negative values, which allows the peak-finding algorithm to better identify peaks using signal-to-noise ratio and improves the speed with which the CNN model converges.
A peak-finding algorithm uses signal-to-noise ratio (SNR) to identify the peaks in the data. An artifact of CWT performed on Raman spectra with Ricker wavelets is that peaks always appear at the extreme ends of the data, so the first and last peaks are dropped before the largest peaks are extracted.
Two datasets are output from preprocessing:
A 10-feature dataset in the format [highest peak, second highest peak, ..., fifth highest peak, highest peak relative intensity, ..., fifth highest peak intensity]. This data is used in all models.
The CWT of each spectrum with 190 features (one for every 5 wavenumbers between 150 and 1100). This data is used only in the neural network model: due to the large number of features and significant noise, it is not appropriate for traditional models like SVC or logistic regression.
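The trimming, minimum-binning, and Ricker-wavelet steps above can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the code in DataProcessingBatch.ipynb: the function names, the half-open range [150, 1100) used to obtain exactly 190 bins, the single wavelet width `a=2`, the zero padding at the edges, and the synthetic spectrum are all hypothetical choices.

```python
import math


def bin_min(wavenumbers, intensities, lo=150, hi=1100, step=5):
    """Trim to [lo, hi) and keep the minimum intensity in each step-wide band."""
    n_bins = (hi - lo) // step
    bins = [None] * n_bins  # a band with no data stays None
    for w, y in zip(wavenumbers, intensities):
        if lo <= w < hi:
            k = int((w - lo) // step)
            bins[k] = y if bins[k] is None else min(bins[k], y)
    return bins


def ricker(half_width, a):
    """Ricker (Mexican hat) wavelet of width a, sampled at integer offsets."""
    amp = 2.0 / (math.sqrt(3.0 * a) * math.pi ** 0.25)
    return [amp * (1.0 - (t / a) ** 2) * math.exp(-(t * t) / (2.0 * a * a))
            for t in range(-half_width, half_width + 1)]


def cwt_row(signal, a):
    """One row of a CWT: the signal convolved with a Ricker wavelet (zero padded)."""
    half = int(5 * a)
    kernel = ricker(half, a)
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for j, weight in enumerate(kernel):
            idx = i + j - half
            if 0 <= idx < len(signal):
                acc += signal[idx] * weight
        out.append(acc)
    return out


# Synthetic spectrum (invented): flat baseline plus one Gaussian peak at 500.
wavenumbers = list(range(150, 1100))
intensities = [10.0 + 100.0 * math.exp(-((w - 500) ** 2) / 200.0) for w in wavenumbers]

binned = bin_min(wavenumbers, intensities)   # 190 values, one per 5 wavenumbers
transformed = cwt_row(binned, a=2)           # baseline suppressed, peak highlighted
peak_bin = max(range(len(transformed)), key=transformed.__getitem__)
peak_wavenumber = 150 + 5 * peak_bin         # label of the band holding the peak
```

Because the wavelet sums to roughly zero, a flat baseline is cancelled in the interior of the spectrum, while the regions flanking a peak go negative, matching the behavior described above.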

What is going on in the CNN:

The convolutional neural network can be viewed here and is made up of the following layers:

Layer 1
1-Dimensional Convolution (16 filters, filter size 13)
Batch Normalization
Leaky ReLU activation function
1-Dimensional Maxpooling (pool size 3)

Layer 2
1-Dimensional Convolution (32 filters, filter size 5)
Batch Normalization
Leaky ReLU activation function
1-Dimensional Maxpooling (pool size 2)

Layer 3
1-Dimensional Convolution (64 filters, filter size 3)
Batch Normalization
Leaky ReLU activation function
1-Dimensional Maxpooling (pool size 2)
Flattening - concatenate all channels into one to prepare for the fully connected layers

Layer 4
Dense - a “fully connected layer” or traditional network layer, 2048 nodes (W dot X + b: except with no b - the bias parameter was not included because the following batch normalization would negate it)
Batch Normalization
Tanh activation function
55% dropout (to reduce overfitting)

Layer 5
Dense with 8 nodes (to account for the labels that fall in the range [0,7])
Batch Normalization
Softmax activation layer - outputs an array of probabilities (one for each possible label [0,7]) that total to 1
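To see how the layer sizes fit together, the tensor shapes can be traced through the network with a few lines of arithmetic. This sketch assumes the 190-feature CWT input, unpadded ("valid") convolutions with stride 1, and pooling with stride equal to the pool size; the README does not state padding or strides, so the intermediate lengths are illustrative only. The helper names are hypothetical.

```python
def conv1d_len(n, kernel, stride=1):
    """Output length of an unpadded ('valid') 1-D convolution."""
    return (n - kernel) // stride + 1


def pool1d_len(n, pool):
    """Output length of unpadded max pooling whose stride equals the pool size."""
    return (n - pool) // pool + 1


def trace_shapes(n_features=190):
    """Return (layer name, output shape) pairs for the architecture listed above."""
    blocks = [(16, 13, 3), (32, 5, 2), (64, 3, 2)]  # (filters, filter size, pool size)
    shapes = [("input", (n_features, 1))]
    length = n_features
    for idx, (filters, kernel, pool) in enumerate(blocks, start=1):
        length = conv1d_len(length, kernel)
        shapes.append((f"conv{idx}", (length, filters)))
        length = pool1d_len(length, pool)
        shapes.append((f"pool{idx}", (length, filters)))
    shapes.append(("flatten", (length * blocks[-1][0],)))
    shapes.append(("dense", (2048,)))   # Layer 4
    shapes.append(("output", (8,)))     # Layer 5, one probability per label
    return shapes


for name, shape in trace_shapes():
    print(f"{name:8s} {shape}")
```

Batch normalization, the activations, and dropout do not change shapes, so they are omitted from the trace.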

Citation (CITATION.cff)

Please cite code as:

Berlanga, G., Temiquel, N., Williams, Q. (2022) MLROD Raman CNN (Version 1.0.0-beta) [Source code]. Zenodo. https://doi.org/10.5281/zenodo.7036374

Dependencies

requirements.txt pypi

Jinja2 ==2.11.2
Keras ==2.4.3
Keras-Preprocessing ==1.1.2
Markdown ==3.2.2
MarkupSafe ==1.1.1
PyYAML ==5.3.1
Pygments ==2.7.1
Send2Trash ==1.5.0
Werkzeug ==1.0.1
absl-py ==0.10.0
appnope ==0.1.0
argon2-cffi ==20.1.0
astunparse ==1.6.3
async-generator ==1.10
attrs ==20.2.0
backcall ==0.2.0
bleach ==3.2.1
cachetools ==4.1.1
certifi ==2020.6.20
cffi ==1.14.3
chardet ==3.0.4
decorator ==4.4.2
defusedxml ==0.6.0
entrypoints ==0.3
gast ==0.3.3
google-auth ==1.21.2
google-auth-oauthlib ==0.4.1
google-pasta ==0.2.0
grpcio ==1.32.0
h5py ==2.10.0
idna ==2.10
importlib-metadata ==1.7.0
ipykernel ==5.3.4
ipython ==7.18.1
ipython-genutils ==0.2.0
jedi ==0.17.2
joblib ==0.16.0
json5 ==0.9.5
jsonschema ==3.2.0
jupyter-client ==6.1.7
jupyter-core ==4.6.3
jupyterlab ==2.2.8
jupyterlab-pygments ==0.1.1
jupyterlab-server ==1.2.0
mistune ==0.8.4
nbclient ==0.5.0
nbconvert ==6.0.5
nbformat ==5.0.7
nest-asyncio ==1.4.0
notebook ==6.1.4
numpy ==1.18.5
oauthlib ==3.1.0
opt-einsum ==3.3.0
packaging ==20.4
pandas ==1.1.2
pandocfilters ==1.4.2
parso ==0.7.1
pexpect ==4.8.0
pickleshare ==0.7.5
prometheus-client ==0.8.0
prompt-toolkit ==3.0.7
protobuf ==3.13.0
ptyprocess ==0.6.0
pyasn1 ==0.4.8
pyasn1-modules ==0.2.8
pycparser ==2.20
pyparsing ==2.4.7
pyrsistent ==0.17.3
python-dateutil ==2.8.1
pytz ==2020.1
pyzmq ==19.0.2
requests ==2.24.0
requests-oauthlib ==1.3.0
rsa ==4.6
scikit-learn ==0.23.2
scipy ==1.4.1
six ==1.15.0
tensorboard ==2.3.0
tensorboard-plugin-wit ==1.7.0
tensorflow ==2.3.0
tensorflow-estimator ==2.3.0
termcolor ==1.1.0
terminado ==0.9.1
testpath ==0.4.4
threadpoolctl ==2.1.0
tornado ==6.0.4
traitlets ==5.0.4
urllib3 ==1.25.10
wcwidth ==0.2.5
webencodings ==0.5.1
wrapt ==1.12.1
zipp ==3.2.0
