csfs

A cut-and-solve based feature selection algorithm for continuous data

https://github.com/climerlab/csfs

Science Score: 52.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
    Organization climerlab has institutional domain (www.cs.umsl.edu)
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (8.2%) to scientific vocabulary

Keywords

cplex cpp discrete-optimization feature-selection integer-linear-programming openmpi

Repository

A cut-and-solve based feature selection algorithm for continuous data

Basic Info
  • Host: GitHub
  • Owner: ClimerLab
  • License: bsd-3-clause
  • Language: C++
  • Default Branch: main
  • Homepage:
  • Size: 11.8 MB
Statistics
  • Stars: 0
  • Watchers: 0
  • Forks: 0
  • Open Issues: 0
  • Releases: 1
Created about 2 years ago · Last pushed almost 2 years ago

README.md

CSFS

A cut-and-solve based feature selection algorithm for continuous data.

To Use

Configure the Makefile with the locations of the IBM ILOG CPLEX and Open MPI libraries and binaries.

Compile with the Makefile by navigating to the root directory and entering: make

Update the configuration file.

Run the program. For example, enter: mpirun -np 4 ./csfs

Configuration

DATA_FILE - Tab-separated file where the first NUM_CASES columns are cases and the next NUM_CTRLS columns are controls. The rows indicate features.

RISK - Boolean that indicates if risk patterns (true) or protective patterns (false) should be found.

NUM_CASES - The number of cases in DATA_FILE.

NUM_CTRLS - The number of controls in DATA_FILE.

NUM_EXPRS - The number of features in DATA_FILE.

NUM_HEAD_ROWS - The number of header rows in DATA_FILE.

NUM_HEAD_COLS - The number of header columns in DATA_FILE.

PATTERN_SIZE - The number of marker states in the pattern(s) to be found.

USE_LOWER_CUTOFF - Boolean that indicates if STARTING_LOWER_BOUND is used. This lower bound can be updated during the search. USE_LOWER_CUTOFF and USE_SOLUTION_POOL_THRESHOLD cannot both be set to true at the same time.

USE_SOLUTION_POOL_THRESHOLD - Boolean that indicates if SOLUTION_POOL_THRESHOLD is used. When SOLUTION_POOL_THRESHOLD is used, all patterns with a better objective value are retained. USE_LOWER_CUTOFF and USE_SOLUTION_POOL_THRESHOLD cannot both be set to true at the same time.

SOLUTION_POOL_THRESHOLD - Threshold used to retain solutions in the pool. Used if USE_SOLUTION_POOL_THRESHOLD is true.

STARTING_LOWER_BOUND - Starting lower bound when USE_LOWER_CUTOFF is true.

STARTING_UPPER_BOUND - Starting upper bound when USE_LOWER_CUTOFF is true.

QUIET - Boolean that limits the output to only the most important items when true. QUIET and VERBOSE cannot both be set to true at the same time.

VERBOSE - Boolean that controls if all outputs are displayed. QUIET and VERBOSE cannot both be set to true at the same time.

PRINT_CPLEX_OUTPUT - Boolean controlling if the CPLEX output is displayed.

TOL - Tolerance value used for rounding decimals to integers in CPLEX.

ID_PREFIX - Prefix of the ID column in DATA_FILE.

MISSING_SYMBOL - String used to indicate missing data in DATA_FILE.

CPLEX_SEED - Seed provided to CPLEX.

USE_SPARSE_CONTRAINTS - Boolean that indicates if additional constraints for the sparse problem are used.

NUM_BINS - The number of bins, from HIGH, NORM, LOW, NOT_HIGH, and NOT_LOW, to be used.

USE_HIGH - Set to true if HIGH variable will be used in pattern.

USE_NORM - Set to true if NORM variable will be used in pattern.

USE_LOW - Set to true if LOW variable will be used in pattern.

USE_NOT_HIGH - Set to true if NOT_HIGH variable will be used in pattern.

USE_NOT_LOW - Set to true if NOT_LOW variable will be used in pattern.

HIGH_VALUE - Value in DATA_FILE that indicates high expression.

NORM_VALUE - Value in DATA_FILE that indicates normal expression.

LOW_VALUE - Value in DATA_FILE that indicates low expression.

NOT_LOW_VALUE - Value in DATA_FILE that indicates not low expression.

NOT_HIGH_VALUE - Value in DATA_FILE that indicates not high expression.

SET_NA_TRUE - Boolean used to indicate if missing data is treated as both high and low.
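
The two mutual-exclusion rules in the list above (lower cutoff versus solution pool threshold, and QUIET versus VERBOSE) could be checked before a run starts. The following is a minimal C++ sketch of such a check, not code taken from the CSFS sources; the `CsfsFlags` struct and the `check_flags` function are hypothetical names introduced only for illustration.

```cpp
#include <string>

// Hypothetical holder for four of the boolean options described above.
// The field names are illustrative and are not taken from the CSFS sources.
struct CsfsFlags {
    bool use_lower_cutoff = false;
    bool use_solution_pool_threshold = false;
    bool quiet = false;
    bool verbose = false;
};

// Returns an empty string when the flags are consistent,
// otherwise a short message describing the first conflict found.
inline std::string check_flags(const CsfsFlags& f) {
    if (f.use_lower_cutoff && f.use_solution_pool_threshold) {
        return "lower cutoff and solution pool threshold cannot both be enabled";
    }
    if (f.quiet && f.verbose) {
        return "QUIET and VERBOSE cannot both be enabled";
    }
    return "";
}
```

A caller would fill a `CsfsFlags` value from the parsed configuration and stop with the returned message if it is not empty.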

Output

*.log - File containing the collection of patterns

Notes

Requires Open MPI and IBM ILOG CPLEX

DATA_FILE should be tab separated; the columns represent individuals and the rows represent features.

Owner

  • Name: Climer Lab
  • Login: ClimerLab
  • Kind: organization
  • Location: Saint Louis Missouri

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
- family-names: "Smith"
  given-names: "Ken"
  orcid: "https://orcid.org/0000-0002-7292-8268"
title: "Cut-and-Solve Feature Selection"
version: 1.0.0
license: BSD-3-Clause
license-url: "https://github.com/ClimerLab/CSFS/blob/main/LICENSE"
repository-code: "https://github.com/ClimerLab/CSFS/"
keywords:
  - feature selection
  - youden j
  - Cut-and-Solve
type: software
url: "https://github.com/ClimerLab/CSFS/"
