https://github.com/climerlab/mrclean-nomiss
NoMiss
An extension of the MrClean program suite that specializes in creating clean data sets with no missing data.
Overview
NoMiss is an ensemble of three algorithms that clean a data matrix, along with a few helper programs. First, the Orient program checks the dimensions of the data matrix and creates a transposed version if the matrix contains more rows than columns. Next, the AddRowGreedy and RowColLP cleaning programs can be executed. If the user wishes to run the ElementIP algorithm, the CalcPairs helper program must be run first. After the desired cleaning programs have run, the WriteCleanedMatrix program checks their results and writes a cleaned data matrix to file.
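The orientation step can be pictured with a minimal Python sketch. This is not the Orient program's actual code: the function name `orient` and the list-of-lists representation are assumptions made for illustration, and header rows and columns are ignored.

```python
def orient(matrix):
    """Return (matrix, transposed) so the result never has more rows than columns."""
    num_rows = len(matrix)
    num_cols = len(matrix[0]) if matrix else 0
    if num_rows > num_cols:
        # zip(*matrix) swaps rows and columns.
        return [list(col) for col in zip(*matrix)], True
    return matrix, False


tall = [[1, 2], [3, 4], [5, 6]]  # 3 rows, 2 columns
oriented, transposed = orient(tall)
print(oriented, transposed)  # [[1, 3, 5], [2, 4, 6]] True
```

Returning the flag alongside the matrix mirrors the need to remember that a transpose occurred, since header rows and header columns swap roles afterwards.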
Compile Program
The user may need to update four parameters in the Makefile to specify the version and location of CPLEX:
SYSTEM =
LIBFORMAT =
CPLEXDIR =
CONCERTDIR =
Example:
SYSTEM = x86-64linux
LIBFORMAT = staticpic
CPLEXDIR = /opt/ibm/ILOG/CPLEXStudio221/cplex
CONCERTDIR = /opt/ibm/ILOG/CPLEXStudio221/concert
To compile the program, navigate to the directory containing the download and type 'make' (no quotes). The following executables will be created: CheckMatrixOrientation, addRowGreedy, rowColLP, calcPairs, elementIp, and writeCleanedMatrix.
Running NoMiss
Included in the repo is a shell script named cleandata.sh. To use the script, set these four variables:
1. ARR - array containing the data files to clean. All data files must share the same number of header rows, number of header columns, and NASYMBOL.
2. NASYMBOL - the string that represents missing data in the data file. This value must be a string; spaces or tabs will cause unexpected behavior.
3. NUMHEADERROWS - number of header rows for each file in ARR.
4. NUMHEADER_COLS - number of header columns for each file in ARR.
The shell script updates the data file information if a transpose occurs and removes all temporary files created during the cleaning process.
If you prefer to execute the programs separately, all four inputs listed above are required for each program. Note that checkMatrixOrientation must be executed before any cleaning program, and calcPairs must be run before elementIp.
The calcPairs and elementIp programs use Open MPI to distribute the work, so mpirun should be used to call them. The rankfile can be used to specify the number of desired processes. A minimum of two processors (or threads) is required to run calcPairs and elementIp.
NOTE: Double-check the number of header rows and columns. The program will likely run without error if an incorrect number of header rows or header columns is provided.
NOTE: CPLEX is required to run the two Integer Programs (IP). The user must provide the CPLEX directories in the Makefile.
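Because a wrong header count may go unnoticed, an independent check of the data region before cleaning can help. The Python sketch below is not part of NoMiss; the function name `summarize` and the tab delimiter are assumptions. It reports the number of data rows, data columns, and missing entries found once the stated header rows and columns are skipped, so the result can be compared with what you expect from the file.

```python
def summarize(lines, na_symbol, num_header_rows, num_header_cols, sep="\t"):
    """Return (data_rows, data_cols, missing) for the region past the headers."""
    rows = [line.rstrip("\n").split(sep) for line in lines if line.strip()]
    # Drop the header rows, then the header columns of every remaining row.
    data = [row[num_header_cols:] for row in rows[num_header_rows:]]
    data_rows = len(data)
    data_cols = len(data[0]) if data else 0
    missing = sum(cell == na_symbol for row in data for cell in row)
    return data_rows, data_cols, missing


# Made-up example file: one header row, one header column, "NA" for missing data.
example = [
    "id\ts1\ts2\ts3\n",
    "g1\t0.5\tNA\t1.2\n",
    "g2\tNA\tNA\t0.7\n",
]
print(summarize(example, "NA", 1, 1))  # (2, 3, 3)
```

If the reported shape is off by one from the expected number of samples or features, the header counts are the first thing to recheck.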
Configuration File
The configuration file allows the user to turn several features of the program on/off.
PRINTSUMMARY - determines if a summary is printed to the screen for each algorithm.
WRITESTATS - determines if the statistics are recorded to a file. Each algorithm has a separate file.
LARGEMATRIX - sets the number of elements at which the constraints of an elementIp problem will be reduced.
The program expects a file named _config.cfg in the same directory as the executable, and the file must include all of the flags above. If a flag is missing, the program exits with an error.
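The "every flag is required" rule can be illustrated with a short Python sketch. The file syntax is not documented here, so the whitespace-separated `NAME value` layout, the function name `parse_config`, and the example values are all assumptions; only the three flag names come from the list above.

```python
REQUIRED_FLAGS = ("PRINTSUMMARY", "WRITESTATS", "LARGEMATRIX")


def parse_config(text):
    """Parse 'NAME value' lines (assumed format) and require every flag."""
    settings = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) >= 2:
            settings[parts[0]] = parts[1]
    missing = [flag for flag in REQUIRED_FLAGS if flag not in settings]
    if missing:
        # Mirrors the documented behavior: a missing flag is an error.
        raise ValueError("missing configuration flag(s): " + ", ".join(missing))
    return settings


# Hypothetical file contents; the values are illustrative only.
sample = "PRINTSUMMARY true\nWRITESTATS false\nLARGEMATRIX 1000000\n"
print(parse_config(sample))
```

Failing on a missing flag, rather than falling back to a default, keeps a run from silently using settings the user never chose.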
Program Output
If PRINTSUMMARY is set to true, a summary of each executed cleaning program will be printed to the screen for each data file. If WRITESTATS is set to true, a CSV file will be created for each cleaning program. The file will contain the data file, run time, number of valid elements, number of rows, and number of columns resulting from the algorithm. Among the executed cleaning algorithms, the solution with the most valid elements will be used to create a cleaned data matrix for each input file. The cleaned files will be written to the same directory as the original data files and will be named <datafile>cleaned.tsv
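The selection rule (keep the solution with the most valid elements) can be sketched in Python as follows. The `Result` type, the function name `pick_best`, and the numbers are hypothetical; how the real program breaks ties is not stated, so the first-wins behavior of `max` here is an assumption.

```python
from typing import NamedTuple


class Result(NamedTuple):
    algorithm: str
    valid_elements: int
    num_rows: int
    num_cols: int


def pick_best(results):
    """Return the result with the most valid elements (first one wins a tie)."""
    return max(results, key=lambda r: r.valid_elements)


# Made-up outcomes for a single data file.
results = [
    Result("addRowGreedy", 120, 10, 12),
    Result("rowColLP", 150, 10, 15),
    Result("elementIp", 144, 12, 12),
]
print(pick_best(results).algorithm)  # rowColLP
```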
Owner
- Name: Climer Lab
- Login: ClimerLab
- Kind: organization
- Location: Saint Louis Missouri
- Website: http://www.cs.umsl.edu/~climer/
- Repositories: 1
- Profile: https://github.com/ClimerLab