mrclean-greedy

A greedy algorithm for cleaning a data file.

https://github.com/climerlab/mrclean-greedy

Keywords

cpp data-cleaning discrete-optimization greedy-algorithms

Repository

Basic Info
  • Host: GitHub
  • Owner: ClimerLab
  • License: bsd-3-clause
  • Language: C++
  • Default Branch: main
  • Size: 53.7 KB
Statistics
  • Stars: 0
  • Watchers: 0
  • Forks: 0
  • Open Issues: 0
  • Releases: 2
Created over 2 years ago · Last pushed over 1 year ago

README.md

mrclean-greedy

Cleans a text file of data by removing rows and columns until the percentage of missing data in each remaining row and column is below a certain threshold.
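The stopping condition described above can be illustrated with a small sketch. This is not code from the repository: the `Matrix` alias and the functions `rowMissingFraction`, `colMissingFraction`, and `isClean` are hypothetical names, the matrix is assumed to be rectangular, and the threshold is assumed to be an inclusive upper bound expressed as a fraction.

```cpp
#include <cstddef>
#include <string>
#include <vector>

using Matrix = std::vector<std::vector<std::string>>;

// Fraction of cells in row r that equal the missing-value symbol `na`.
double rowMissingFraction(const Matrix& m, std::size_t r, const std::string& na) {
    if (m[r].empty()) return 0.0;
    std::size_t missing = 0;
    for (const auto& cell : m[r])
        if (cell == na) ++missing;
    return static_cast<double>(missing) / static_cast<double>(m[r].size());
}

// Fraction of cells in column c that equal `na` (assumes a rectangular matrix).
double colMissingFraction(const Matrix& m, std::size_t c, const std::string& na) {
    if (m.empty()) return 0.0;
    std::size_t missing = 0;
    for (const auto& row : m)
        if (row[c] == na) ++missing;
    return static_cast<double>(missing) / static_cast<double>(m.size());
}

// True when no row or column exceeds maxMissing.
// Assumption: the bound is inclusive; the README only says "below a threshold".
bool isClean(const Matrix& m, const std::string& na, double maxMissing) {
    for (std::size_t r = 0; r < m.size(); ++r)
        if (rowMissingFraction(m, r, na) > maxMissing) return false;
    const std::size_t cols = m.empty() ? 0 : m[0].size();
    for (std::size_t c = 0; c < cols; ++c)
        if (colMissingFraction(m, c, na) > maxMissing) return false;
    return true;
}
```

For a 2x2 matrix with a single missing cell, each affected row and column is 50% missing, so `isClean` accepts a threshold of 0.5 and rejects 0.4.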

A greedy algorithm determines which rows and columns to remove. First, the percentage of missing data in each retained row and column is calculated, and the row or column with the largest percentage is selected.

If a row is selected, the algorithm determines how many of the columns that have missing data in that row must be removed to bring the row's missing data below the threshold. Those columns are chosen so that the fewest valid elements are removed. If the row contains fewer valid elements than the selected columns do in total, the row is removed; otherwise, the columns are removed. If a column is selected instead, the same process applies with the roles of rows and columns swapped.

After the row(s) or column(s) are removed, the number of valid elements in each remaining row and column is recalculated, and the process repeats until every remaining row and column has an acceptable amount of missing data.
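The loop described above might look roughly like the following. This is an illustrative sketch under stated assumptions, not the repository's implementation: `Grid`, `Kept`, and `greedyClean` are hypothetical names, the input is a rectangular grid of validity flags, the threshold is a fraction in [0, 1] treated as inclusive, ties are broken in favor of rows and lower indices, and the minimum row and column limits are ignored.

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

using Grid = std::vector<std::vector<bool>>;  // true = valid cell, false = missing

struct Kept {
    std::vector<bool> rows, cols;  // true = retained
};

// Hypothetical helper: greedy cleaning of a rectangular validity grid.
// Assumes 0 <= maxMissing <= 1 and treats the threshold as inclusive.
Kept greedyClean(const Grid& valid, double maxMissing) {
    // Dimension 0 indexes rows, dimension 1 indexes columns.
    const std::size_t n[2] = {valid.size(), valid.empty() ? 0 : valid[0].size()};
    std::vector<bool> kept[2] = {std::vector<bool>(n[0], true),
                                 std::vector<bool>(n[1], true)};

    // Entry j of line i, where a line is a row (d == 0) or a column (d == 1).
    auto cell = [&](int d, std::size_t i, std::size_t j) -> bool {
        return d == 0 ? valid[i][j] : valid[j][i];
    };
    // Valid entries of line i, counted over the retained crossing lines only.
    auto validCount = [&](int d, std::size_t i) {
        std::size_t v = 0;
        for (std::size_t j = 0; j < n[1 - d]; ++j)
            if (kept[1 - d][j] && cell(d, i, j)) ++v;
        return v;
    };
    auto keptCount = [&](int d) {
        return static_cast<std::size_t>(std::count(kept[d].begin(), kept[d].end(), true));
    };
    auto exceeds = [&](std::size_t missing, std::size_t len) {
        return static_cast<double>(missing) > maxMissing * static_cast<double>(len);
    };

    while (true) {
        // Select the retained line with the largest missing fraction above the threshold.
        int d = -1;
        std::size_t line = 0;
        double worst = -1.0;
        for (int dim = 0; dim < 2; ++dim) {
            const std::size_t len = keptCount(1 - dim);
            for (std::size_t i = 0; len > 0 && i < n[dim]; ++i) {
                if (!kept[dim][i]) continue;
                const std::size_t miss = len - validCount(dim, i);
                const double frac = static_cast<double>(miss) / static_cast<double>(len);
                if (exceeds(miss, len) && frac > worst) { worst = frac; d = dim; line = i; }
            }
        }
        if (d < 0) break;  // every retained row and column is acceptable

        const int x = 1 - d;  // the crossing dimension
        const std::size_t len = keptCount(x);
        const std::size_t lineValid = validCount(d, line);
        const std::size_t missing = len - lineValid;

        // Crossing lines that are missing in the selected line, fewest valid elements first.
        std::vector<std::pair<std::size_t, std::size_t>> cand;  // (valid count, index)
        for (std::size_t j = 0; j < n[x]; ++j)
            if (kept[x][j] && !cell(d, line, j)) cand.emplace_back(validCount(x, j), j);
        std::sort(cand.begin(), cand.end());

        // Smallest k such that removing k of them brings the line within the threshold.
        std::size_t k = 0;
        while (k < missing && exceeds(missing - k, len - k)) ++k;

        std::size_t cost = 0;  // valid elements lost if the k crossing lines are removed
        for (std::size_t t = 0; t < k; ++t) cost += cand[t].first;

        if (lineValid < cost) {
            kept[d][line] = false;  // removing the selected line loses fewer valid elements
        } else {
            for (std::size_t t = 0; t < k; ++t) kept[x][cand[t].second] = false;
        }
    }
    return Kept{kept[0], kept[1]};
}
```

Every iteration removes at least one row or column, so the loop terminates. The sketch recomputes counts from scratch on each pass for clarity; a real implementation would likely maintain them incrementally.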

To Use

Compile with the Makefile by navigating to the root directory and entering: make

Run the program by entering ./mrclean-greedy followed by the inputs listed below. The last two inputs are optional.

Inputs

- Path to data file

- Decimal value indicating maximum percentage of missing data in each row and column of the cleaned matrix

- Minimum number of rows allowed in a solution

- Minimum number of columns allowed in a solution

- Symbol used to indicate missing values in the data_file

- Directory to record results

- (Optional) Number of header rows in the data file. Defaults to 1 if no value is provided

- (Optional) Number of header columns in the data file. Defaults to 1 if no value is provided
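As an illustration of how these inputs and their defaults could be handled, here is a minimal sketch. It assumes the arguments are positional and appear in the order listed above, which this README does not confirm; the `Options` struct, its field names, and `parseArgs` are hypothetical and are not taken from the repository.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical container for the inputs listed above.
struct Options {
    std::string dataFile;
    double maxPercMissing = 0.0;
    std::size_t minRows = 0;
    std::size_t minCols = 0;
    std::string naSymbol;
    std::string outputDir;
    std::size_t headerRows = 1;  // default stated in the README
    std::size_t headerCols = 1;  // default stated in the README
};

// Parses the arguments that follow the program name.
// Assumption: positional arguments in the order of the Inputs list.
Options parseArgs(const std::vector<std::string>& args) {
    if (args.size() < 6 || args.size() > 8)
        throw std::invalid_argument("expected 6 required and up to 2 optional arguments");
    Options o;
    o.dataFile = args[0];
    o.maxPercMissing = std::stod(args[1]);
    o.minRows = std::stoul(args[2]);
    o.minCols = std::stoul(args[3]);
    o.naSymbol = args[4];
    o.outputDir = args[5];
    if (args.size() > 6) o.headerRows = std::stoul(args[6]);
    if (args.size() > 7) o.headerCols = std::stoul(args[7]);
    return o;
}
```

A caller would typically build the vector from `argv + 1` to `argv + argc`. When the two optional values are omitted, both header counts stay at 1.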

Outputs

Greedy Summary

Greedy_summary.csv - File containing details of the cleaning result. The following columns are recorded each time the program runs.

data_file - Input data file

maxpercmissing - Maximum percentage of missing data that the file was cleaned to

time - Run time of mrclean-greedy

numvalelements - Number of valid elements in cleaned file

numrowskept - Number of data rows in cleaned matrix

numcolskept - Number of data columns in cleaned matrix
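The three count columns can be derived from a validity grid and the retained rows and columns. The sketch below shows one way to do that; `Grid`, `Summary`, and `summarize` are hypothetical names, and the grid is assumed to be rectangular with dimensions matching the two masks.

```cpp
#include <cstddef>
#include <vector>

using Grid = std::vector<std::vector<bool>>;  // true = valid cell, false = missing

// Hypothetical record mirroring the count columns of the summary file.
struct Summary {
    std::size_t numValElements = 0;
    std::size_t numRowsKept = 0;
    std::size_t numColsKept = 0;
};

// Counts retained rows, retained columns, and valid cells inside the retained block.
Summary summarize(const Grid& valid,
                  const std::vector<bool>& keepRow,
                  const std::vector<bool>& keepCol) {
    Summary s;
    for (bool k : keepRow) if (k) ++s.numRowsKept;
    for (bool k : keepCol) if (k) ++s.numColsKept;
    for (std::size_t r = 0; r < valid.size(); ++r) {
        if (!keepRow[r]) continue;
        for (std::size_t c = 0; c < valid[r].size(); ++c)
            if (keepCol[c] && valid[r][c]) ++s.numValElements;
    }
    return s;
}
```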

Cleaned Data File

_gamma<maxmissing>_cleaned.tsv - File containing the cleaned data, along with the retained header rows and header columns.

Retained Rows and Columns File

_gamma<maxmissing>_cleaned.sol - File containing two binary vectors indicating which rows and columns were retained. The first vector corresponds to rows and the second to columns.
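Once the two binary vectors are in memory, they can be applied to the original data to reproduce the cleaned matrix. The following sketch assumes the vectors have already been read into `std::vector<bool>` (the on-disk layout of the .sol file is not described here); `Matrix` and `applyMask` are hypothetical names.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

using Matrix = std::vector<std::vector<std::string>>;

// Returns the sub-matrix made of the rows and columns flagged true.
// Rows or columns beyond the end of a mask are treated as not retained.
Matrix applyMask(const Matrix& data,
                 const std::vector<bool>& keepRow,
                 const std::vector<bool>& keepCol) {
    Matrix out;
    for (std::size_t r = 0; r < data.size() && r < keepRow.size(); ++r) {
        if (!keepRow[r]) continue;
        std::vector<std::string> row;
        for (std::size_t c = 0; c < data[r].size() && c < keepCol.size(); ++c)
            if (keepCol[c]) row.push_back(data[r][c]);
        out.push_back(std::move(row));
    }
    return out;
}
```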

Notes

The data file should be tab-separated.

The original data file is unaltered.
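For readers preparing input, a tab-separated line can be split into fields as in the sketch below. `splitTabs` is a hypothetical helper, not a function from the repository. It keeps empty fields, so consecutive tabs and a trailing tab each yield an empty string.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Splits one line on tab characters, preserving empty fields.
std::vector<std::string> splitTabs(const std::string& line) {
    std::vector<std::string> fields;
    std::size_t start = 0;
    while (true) {
        const std::size_t tab = line.find('\t', start);
        if (tab == std::string::npos) {
            fields.push_back(line.substr(start));  // last field, possibly empty
            break;
        }
        fields.push_back(line.substr(start, tab - start));
        start = tab + 1;
    }
    return fields;
}
```

Preserving empty fields keeps every row the same width, which matters when a missing value is written as nothing between two tabs rather than as an explicit symbol.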

Owner

  • Name: Climer Lab
  • Login: ClimerLab
  • Kind: organization
  • Location: Saint Louis Missouri

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
- family-names: "Smith"
  given-names: "Ken"
  orcid: "https://orcid.org/0000-0002-7292-8268"
title: "Mr.Clean Greedy"
version: 1.1.0
license: BSD-3-Clause
license-url: "https://github.com/ClimerLab/mrclean-greedy/blob/main/LICENSE"
repository-code: "https://github.com/ClimerLab/mrclean-greedy/"
keywords:
  - data cleaning
  - greedy algorithm
type: software
url: "https://github.com/ClimerLab/mrclean-greedy/"
