mrclean-greedy
A greedy algorithm for cleaning a data file.
Statistics
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
- Releases: 2
README.md
mrclean-greedy
Cleans a text file of data by removing rows and columns until the percentage of missing data in each remaining row and column is below a given threshold.
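The stopping condition described above can be sketched in a few lines of Python. This is an illustrative sketch, not the program's own code: the names `missing_percentages` and `is_clean`, the use of `None` as the missing-value marker, and treating a percentage equal to the threshold as acceptable are all assumptions.

```python
def missing_percentages(data, missing=None):
    """Return (row_pct, col_pct): percent of missing entries per row and per column.

    `data` is a list of equal-length rows; `missing` is the (assumed) marker
    for a missing entry.
    """
    n_rows = len(data)
    n_cols = len(data[0]) if data else 0
    row_pct = [100.0 * sum(v == missing for v in row) / n_cols for row in data] if n_cols else []
    col_pct = [
        100.0 * sum(data[r][c] == missing for r in range(n_rows)) / n_rows
        for c in range(n_cols)
    ]
    return row_pct, col_pct


def is_clean(data, max_perc_missing, missing=None):
    """True when every row and column is at or under the threshold (assumed inclusive)."""
    row_pct, col_pct = missing_percentages(data, missing)
    return all(p <= max_perc_missing for p in row_pct + col_pct)
```

For example, `is_clean([[1, None], [3, 4]], 50)` holds because the worst row and the worst column are each exactly 50% missing, while a threshold of 40 would require removing something.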
A greedy algorithm determines which rows and columns to remove. First, the percentage of missing data in each retained row and column is calculated, and the row or column with the largest percentage is selected. If a row is selected, the algorithm determines how many of the columns that have missing data in that row must be removed to bring the row's missing data below the threshold, and chooses those columns so that the fewest valid elements are removed. If the row contains fewer valid elements than the chosen columns, the row is removed; otherwise, the columns are removed. If a column is selected instead, the same procedure applies with the roles of rows and columns swapped. After the row(s) or column(s) are removed, the number of valid elements in each remaining row and column is recalculated, and the process repeats until every remaining row and column has an acceptable amount of missing data.
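The procedure above can be sketched in Python as follows. This is a minimal sketch written from the description, not the repository's implementation. The function name `greedy_clean`, the `None` missing-value marker, the tie-breaking order (rows before columns, lowest index first), and treating a percentage equal to the threshold as acceptable are all assumptions.

```python
def greedy_clean(data, max_perc_missing, missing=None):
    """Return (kept_rows, kept_cols) as sorted index lists."""
    n_cols = len(data[0]) if data else 0
    # kept[0] holds retained row indices, kept[1] retained column indices
    kept = [set(range(len(data))), set(range(n_cols))]

    def is_missing(axis, i, j):
        # axis 0: i is a row and j a column; axis 1: i is a column and j a row
        r, c = (i, j) if axis == 0 else (j, i)
        return data[r][c] == missing

    def perc_missing(axis, i):
        others = kept[1 - axis]
        return 100.0 * sum(is_missing(axis, i, j) for j in others) / len(others)

    while kept[0] and kept[1]:
        # Select the retained row or column with the largest missing percentage
        cands = [(perc_missing(a, i), a, i) for a in (0, 1) for i in sorted(kept[a])]
        worst, axis, line = max(cands, key=lambda t: t[0])
        if worst <= max_perc_missing:
            break  # every remaining row and column is acceptable
        others = kept[1 - axis]

        def valid_count(j):
            # Valid elements in crossing line j, counted over retained lines only
            return sum(not is_missing(1 - axis, j, k) for k in kept[axis])

        # Crossing lines where the selected line is missing data, cheapest first
        gaps = sorted((j for j in others if is_missing(axis, line, j)),
                      key=lambda j: (valid_count(j), j))
        n, m = len(others), len(gaps)
        # Smallest k whose removal brings the selected line under the threshold
        k = 1
        while k < m and 100.0 * (m - k) / (n - k) > max_perc_missing:
            k += 1
        chosen = gaps[:k]
        # Remove whichever option discards fewer valid elements
        if n - m < sum(valid_count(j) for j in chosen):
            kept[axis].discard(line)
        else:
            kept[1 - axis].difference_update(chosen)
    return sorted(kept[0]), sorted(kept[1])
```

With `data = [[1, 2, None], [4, None, None], [7, 8, 9]]` and a 25% threshold, the sketch first drops row 1 (one valid element, against three in the two columns it would otherwise cost), then drops column 2, leaving rows `[0, 2]` and columns `[0, 1]`.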
To Use
Compile with the Makefile by navigating to the root directory and entering: make
Run the program by entering: ./mrclean-greedy
Inputs
Outputs
Greedy Summary
Greedy_summary.csv - File containing details of the cleaning result. The following columns are recorded each time the program runs:
- data_file - Input data file
- maxpercmissing - Maximum percentage of missing data that the file was cleaned to
- time - Run time of mrclean-greedy
- numvalelements - Number of valid elements in the cleaned file
- numrowskept - Number of data rows in the cleaned matrix
- numcolskept - Number of data columns in the cleaned matrix
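To compare results across runs, the summary file can be loaded with Python's standard `csv` module. The sketch below assumes the file is comma-separated and begins with a header row naming the columns listed above; neither detail is confirmed here, and `parse_summary` is a hypothetical helper name.

```python
import csv


def parse_summary(lines):
    """Parse summary rows from an iterable of CSV lines (for example, an open file).

    Assumes the first line is a header row. Values that look numeric are
    converted to int or float; everything else is kept as a string.
    """
    def convert(text):
        for cast in (int, float):
            try:
                return cast(text)
            except (ValueError, TypeError):
                pass
        return text

    return [{key: convert(value) for key, value in row.items()}
            for row in csv.DictReader(lines)]
```

A typical call would be `with open("Greedy_summary.csv", newline="") as f: rows = parse_summary(f)`, after which each entry in `rows` is a dictionary keyed by column name.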
Cleaned Data File
Retained Rows and Columns File
Notes
The original data file is unaltered.
Owner
- Name: Climer Lab
- Login: ClimerLab
- Kind: organization
- Location: Saint Louis Missouri
- Website: http://www.cs.umsl.edu/~climer/
- Repositories: 1
- Profile: https://github.com/ClimerLab
Citation (CITATION.cff)
cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
  - family-names: "Smith"
    given-names: "Ken"
    orcid: "https://orcid.org/0000-0002-7292-8268"
title: "Mr.Clean Greedy"
version: 1.1.0
license: BSD-3-Clause
license-url: "https://github.com/ClimerLab/mrclean-greedy/blob/main/LICENSE"
repository-code: "https://github.com/ClimerLab/mrclean-greedy/"
keywords:
  - data cleaning
  - greedy algorithm
type: software
url: "https://github.com/ClimerLab/mrclean-greedy/"