mrclean

Two Mixed Integer Programs for cleaning a data file.

https://github.com/climerlab/mrclean

Science Score: 52.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
    Organization climerlab has institutional domain (www.cs.umsl.edu)
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (6.1%) to scientific vocabulary

Keywords

cplex cpp data-cleaning data-cleansing discrete-optimization integer-linear-programming mixed-integer-programming

Repository

Basic Info
  • Host: GitHub
  • Owner: ClimerLab
  • License: bsd-3-clause
  • Language: C++
  • Default Branch: main
  • Homepage:
  • Size: 43.9 KB
Statistics
  • Stars: 0
  • Watchers: 0
  • Forks: 0
  • Open Issues: 0
  • Releases: 1
Created about 2 years ago · Last pushed over 1 year ago
Metadata Files
Readme License Citation

README.md

mrclean

Two Mixed Integer Programs for cleaning a data file.

To Use

Configure the Makefile with the location of the IBM ILOG CPLEX libraries and binary.

Compile with the Makefile by navigating to the root directory and entering: make

Update the configuration file.

Run the program. For example enter: ./mrclean data_file.tsv 0.05 NA 100 200 1 1

or ./mrclean data_file.tsv 0.05 NA 100 200 1 1 greedy.sol

Inputs

- Tab-separated data file to clean.

- Maximum percent of missing data allowed in each row and column in the cleaned data file.

- Minimum number of rows allowed in a solution

- Minimum number of columns allowed in a solution

- String used to indicate missing data in the data file.

- Number of header rows in the data file.

- Number of header columns in the data file.

- Optional input used to provide a starting integer solution to the MIPs. The file should contain two binary vectors, each represented as a line of tab-separated numbers. The first line represents the retained rows in the solution and the second line represents the retained columns in the solution.
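The starting-solution file described above could be read with something like the following. This is a minimal sketch, not the parser used by mrclean: the names `StartingSolution`, `parseBinaryLine`, and `readStartingSolution` are hypothetical, and the sketch assumes each line holds only the values `0` and `1` separated by tabs.

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical container for a starting solution (not taken from the mrclean source).
struct StartingSolution {
  std::vector<int> keepRows;  // 1 = row retained, 0 = row removed
  std::vector<int> keepCols;  // 1 = column retained, 0 = column removed
};

// Parse one line of tab-separated 0/1 values into a vector.
std::vector<int> parseBinaryLine(const std::string& line) {
  std::vector<int> values;
  std::istringstream fields(line);
  std::string field;
  while (std::getline(fields, field, '\t')) {
    // Tolerate Windows line endings on the last field.
    if (!field.empty() && field.back() == '\r') field.pop_back();
    if (field == "0") {
      values.push_back(0);
    } else if (field == "1") {
      values.push_back(1);
    } else {
      throw std::runtime_error("expected 0 or 1, got '" + field + "'");
    }
  }
  return values;
}

// Read the two-line format: first line = retained rows, second line = retained columns.
StartingSolution readStartingSolution(std::istream& in) {
  std::string rowLine;
  std::string colLine;
  if (!std::getline(in, rowLine) || !std::getline(in, colLine)) {
    throw std::runtime_error("starting solution needs two lines");
  }
  return {parseBinaryLine(rowLine), parseBinaryLine(colLine)};
}
```

Passing a `std::ifstream` opened on a file such as `greedy.sol` to `readStartingSolution` yields the two vectors. A stricter reader might also check that their lengths match the number of data rows and columns.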

Configuration File

PRINT_SUMMARY - Boolean controlling if a summary of the cleaning results is printed.

WRITE_STATS - Boolean controlling if the statistics of the cleaning algorithms are written to a file.

RUN_ROW_COL - Boolean controlling if the RowCol IP is executed.

RUN_ELEMENT - Boolean controlling if the Element IP is executed.

Outputs

RowColsummary.csv - Statistics file for the RowCol IP containing <datafile>, , , run time, number of valid elements kept, number of rows kept, and number of columns kept.

Elementsummary.csv - Statistics file for the Element IP containing <datafile>, , , run time, number of valid elements kept, number of rows kept, and number of columns kept.

_cleaned.tsv - Tab-separated cleaned data file.
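To make the reported counts concrete, the sketch below computes the number of rows kept, columns kept, and valid elements kept for a given selection, and checks the missing-data limit. It is an illustration under stated assumptions, not the code mrclean runs: `Table`, `CleaningSummary`, and `summarize` are hypothetical names, header rows and columns are assumed to be already stripped, and the limit is assumed to be a fraction (such as 0.05) measured against the retained entries of each row and column.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Data cells as strings, with header rows and columns already removed (assumption).
using Table = std::vector<std::vector<std::string>>;

// Hypothetical summary record mirroring the counts listed in the statistics files.
struct CleaningSummary {
  std::size_t rowsKept = 0;
  std::size_t colsKept = 0;
  std::size_t validElementsKept = 0;
  bool withinMissingLimit = true;  // false if any kept row or column exceeds the limit
};

CleaningSummary summarize(const Table& data,
                          const std::vector<int>& keepRows,
                          const std::vector<int>& keepCols,
                          const std::string& missingSymbol,
                          double maxMissingFraction) {
  assert(keepRows.size() == data.size());
  CleaningSummary s;
  for (int k : keepRows) s.rowsKept += (k != 0);
  for (int k : keepCols) s.colsKept += (k != 0);

  std::vector<std::size_t> missingInCol(keepCols.size(), 0);
  for (std::size_t i = 0; i < data.size(); ++i) {
    if (!keepRows[i]) continue;
    assert(data[i].size() == keepCols.size());
    std::size_t missingInRow = 0;
    for (std::size_t j = 0; j < keepCols.size(); ++j) {
      if (!keepCols[j]) continue;
      if (data[i][j] == missingSymbol) {
        ++missingInRow;
        ++missingInCol[j];
      } else {
        ++s.validElementsKept;
      }
    }
    // Row check: missing entries relative to the number of kept columns.
    if (missingInRow > maxMissingFraction * s.colsKept) s.withinMissingLimit = false;
  }
  // Column check: missing entries relative to the number of kept rows.
  for (std::size_t j = 0; j < keepCols.size(); ++j) {
    if (keepCols[j] && missingInCol[j] > maxMissingFraction * s.rowsKept) {
      s.withinMissingLimit = false;
    }
  }
  return s;
}
```

A check like this can be useful for sanity-testing a starting solution before handing it to the MIPs, since an incumbent that violates the limit would presumably be rejected by the solver.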

Notes

We recommend using [mrclean-greedy] to provide an incumbent solution for the MIPs.

Requires IBM ILOG CPLEX

The data file should be tab-separated.

Owner

  • Name: Climer Lab
  • Login: ClimerLab
  • Kind: organization
  • Location: Saint Louis Missouri

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
- family-names: "Smith"
  given-names: "Ken"
  orcid: "https://orcid.org/0000-0002-7292-8268"
title: "Mr. Clean"
version: 1.0.0
license: BSD-3-Clause
license-url: "https://github.com/ClimerLab/mrclean/blob/main/LICENSE"
repository-code: "https://github.com/ClimerLab/mrclean/"
keywords:
  - data cleaning
  - integer program
type: software
url: "https://github.com/ClimerLab/mrclean/"
