csfs

A cut-and-solve based feature selection for continuous data

https://github.com/climerlab/csfs

Science Score: 52.0%

This score indicates how likely this project is to be science-related based on various indicators:

✓ CITATION.cff file: Found CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
✓ .zenodo.json file: Found .zenodo.json file
○ DOI references
○ Academic publication links
○ Academic email domains
✓ Institutional organization owner: Organization climerlab has institutional domain (www.cs.umsl.edu)
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (8.2%) to scientific vocabulary

Keywords

cplex cpp discrete-optimization feature-selection integer-linear-programming openmpi

Repository

Basic Info

Host: GitHub
Owner: ClimerLab
License: bsd-3-clause
Language: C++
Default Branch: main
Homepage:
Size: 11.8 MB

Statistics

Stars: 0
Watchers: 0
Forks: 0
Open Issues: 0
Releases: 1

Created over 2 years ago · Last pushed almost 2 years ago

Metadata Files

Readme License Citation

CSFS

A cut-and-solve based feature selection algorithm for continuous data.

To Use

Configure the Makefile with the location of the IBM ILOG CPLEX and Open MPI libraries and binaries.

Compile with the Makefile by navigating to the root directory and entering: make

Update the configuration file.

Run the program. For example, enter: mpirun -np 4 ./csfs

Configuration

DATA_FILE - Tab-separated file where the first NUMCASES columns are cases and the next NUM_CTRLS columns are controls. The rows indicate features.

RISK - Boolean that indicates if risk patterns (true) or protective patterns (false) should be found.

NUMCASES - The number of cases in DATA_FILE.

NUMCTRLS - The number of controls in DATA_FILE.

NUMEXPRS - The number of features in DATA_FILE.

NUMHEADROWS - The number of header rows in DATA_FILE.

NUMHEADCOLS - The number of header columns in DATA_FILE.

PATTERN_SIZE - The number of marker states in the pattern(s) to be found.

USELOWERCUTOFF - Boolean that indicates whether STARTINGLOWERBOUND is used. This lower bound can be updated during the search. USELOWERCUTOFF and USESOLUTIONPOOL_THRESHOLD cannot both be set to true at the same time.

USESOLUTIONPOOLTHRESHOLD - Boolean that indicates whether SOLUTIONPOOLTHRESHOLD is used. When it is used, all patterns with a better objective value than the threshold are retained. USELOWERCUTOFF and USESOLUTIONPOOLTHRESHOLD cannot both be set to true at the same time.

SOLUTIONPOOLTHRESHOLD - Threshold used to retain solutions in the pool. Used if USESOLUTIONPOOL_THRESHOLD is true.

STARTINGLOWERBOUND - Starting lower bound, used when USELOWERCUTOFF is true.

STARTINGUPPERBOUND - Starting upper bound, used when USELOWERCUTOFF is true.

QUIET - Boolean that limits the output to only the most important items when true. QUIET and VERBOSE cannot both be set to true at the same time.

VERBOSE - Boolean that controls whether all outputs are displayed. QUIET and VERBOSE cannot both be set to true at the same time.

PRINTCPLEXOUTPUT - Boolean controlling if the CPLEX output is displayed.

TOL - Tolerance value used for rounding decimals to integers in CPLEX.

IDPREFIX - Prefix of the ID column in DATA_FILE.

MISSINGSYMBOL - String used to indicate missing data in DATA_FILE.

CPLEX_SEED - Seed provided to CPLEX.

USESPARSECONTRAINTS - Boolean that indicates whether additional constraints for the sparse problem are used.

NUMBINS - The number of bins (chosen from HIGH, NORM, LOW, NOT_HIGH, and NOT_LOW) to be used.

USE_HIGH - Set to true if HIGH variable will be used in pattern.

USE_NORM - Set to true if NORM variable will be used in pattern.

USE_LOW - Set to true if LOW variable will be used in pattern.

USENOTHIGH - Set to true if NOT_HIGH variable will be used in pattern.

USENOTLOW - Set to true if NOT_LOW variable will be used in pattern.

HIGHVALUE - Value in DATA_FILE that indicates high expression.

NORMVALUE - Value in DATA_FILE that indicates normal expression.

LOWVALUE - Value in DATA_FILE that indicates low expression.

NOTLOWVALUE - Value in DATA_FILE that indicates not low expression.

NOTHIGHVALUE - Value in DATA_FILE that indicates not high expression.

SETNATRUE - Boolean used to indicate if missing data is treated as both high and low.
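
Several of the options above are described as mutually exclusive. The following is a minimal C++ sketch of how such a check could look before a run is started. The Flags struct and the findConflict function are hypothetical names introduced only for illustration; they are not taken from the CSFS source, and the program's own validation may differ.

```cpp
#include <string>

// Hypothetical container for four of the boolean options described above.
// The struct and its field names are assumptions made for this sketch.
struct Flags {
  bool use_lower_cutoff = false;             // USELOWERCUTOFF
  bool use_solution_pool_threshold = false;  // USESOLUTIONPOOLTHRESHOLD
  bool quiet = false;                        // QUIET
  bool verbose = false;                      // VERBOSE
};

// Returns a description of the first conflicting pair of options,
// or an empty string when the combination is allowed.
std::string findConflict(const Flags& f) {
  if (f.use_lower_cutoff && f.use_solution_pool_threshold) {
    return "USELOWERCUTOFF and USESOLUTIONPOOLTHRESHOLD cannot both be true";
  }
  if (f.quiet && f.verbose) {
    return "QUIET and VERBOSE cannot both be true";
  }
  return "";
}
```

A caller could print the returned message and stop when it is non-empty, so that an invalid combination is reported before any MPI processes start solving.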

Output

*.log - File containing the collection of patterns

Notes

Requires Open MPI and IBM ILOG CPLEX

DATA_FILE should be tab-separated; the columns represent individuals and the rows represent features.

Owner

Name: Climer Lab
Login: ClimerLab
Kind: organization
Location: Saint Louis Missouri

Website: http://www.cs.umsl.edu/~climer/
Repositories: 1
Profile: https://github.com/ClimerLab

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
- family-names: "Smith"
  given-names: "Ken"
  orcid: "https://orcid.org/0000-0002-7292-8268"
title: "Cut-and-Solve Feature Selection"
version: 1.0.0
license: BSD-3-Clause
license-url: "https://github.com/ClimerLab/CSFS/blob/main/LICENSE"
repository-code: "https://github.com/ClimerLab/CSFS/"
keywords:
  - feature selection
  - youden j
  - Cut-and-Solve
type: software
url: "https://github.com/ClimerLab/CSFS/"
