https://github.com/daffidwilde/kmodes-init-paper
A repository to accompany the paper entitled "A novel initialisation based on hospital-resident assignment for the k-modes algorithm"
Statistics
- Stars: 1
- Watchers: 3
- Forks: 0
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
Source code, notebooks, and data
This directory contains all of the source code, notebooks and data needed to reproduce the results and plots in the paper "A novel initialisation based on hospital-resident assignment for the $k$-modes algorithm" by Henry Wilde et al.
At the top level of this directory is an environment.yml file used to create
a virtual conda environment. This environment will ensure that the code
herein will reproduce the data and plots used in the paper exactly. Instructions
on how to create, use and otherwise manage conda environments can be found
in the conda documentation.
The remainder of the directory is made up of three ZIP archives that should be
decompressed: data.zip, nbs.zip, and src.zip.
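The decompression step can be scripted if preferred. The following is a minimal sketch using only Python's standard library; it assumes the three archives sit in the directory passed as `root`, and the helper name `extract_archives` is illustrative rather than part of this repository.

```python
import zipfile
from pathlib import Path

# Archive names as listed in this README
ARCHIVES = ("data.zip", "nbs.zip", "src.zip")


def extract_archives(root=".", archives=ARCHIVES):
    """Extract each archive found in `root` into `root`.

    Returns the names of the archives that were actually extracted.
    """
    root = Path(root)
    extracted = []
    for name in archives:
        archive = root / name
        if not archive.is_file():
            # Skip archives that are not present instead of failing
            continue
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(root)
        extracted.append(name)
    return extracted
```

Calling `extract_archives()` from the top level of this directory should leave the data, nbs and src subdirectories alongside the archives, provided each archive stores its contents under a folder of the same name.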
The data
The data subdirectory contains the data associated with each subsection
in the results of the paper. The data/knee and data/nclasses
subdirectories contain the results for the benchmark datasets (stored at the
top level of data) where the number of clusters was chosen via knee point
detection and by the number of classes, respectively. The data/artificial
subdirectory contains two subdirectories -- one for each experiment in the
final analysis. Each experiment directory contains three trial subdirectories
that are structured as follows:
- data.tar.gz: a tarball of all the data associated with that experiment.
- summary: a directory containing a summary of the data in data.tar.gz, i.e.:
  - main.csv: a CSV detailing the index, dimensions, memory consumption, generation and fitness of every individual in the trial.
  - max: a directory containing the dataset and metadata of the individual with the highest fitness score in the trial.
  - median: a directory containing the dataset and metadata of the individual with the closest-to-median fitness score in the trial.
  - min: a directory containing the dataset and metadata of the individual with the lowest fitness score in the trial.
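To illustrate how the max, median and min individuals relate to the rows of a summary main.csv, here is a small sketch that picks them out of a table of fitness records. It assumes columns named `index` and `fitness`; the actual headers in main.csv may differ, and both the function name and the tie-breaking rule (first match wins) are assumptions made for this example.

```python
import csv
import io
import statistics


def pick_representatives(rows, key="fitness"):
    """Return the rows with the highest, closest-to-median and lowest fitness."""
    scored = [(float(row[key]), row) for row in rows]
    median = statistics.median(score for score, _ in scored)
    return {
        "max": max(scored, key=lambda pair: pair[0])[1],
        "median": min(scored, key=lambda pair: abs(pair[0] - median))[1],
        "min": min(scored, key=lambda pair: pair[0])[1],
    }


# Made-up table standing in for a real summary/main.csv (hypothetical values)
example = io.StringIO("index,fitness\n0,0.9\n1,0.1\n2,0.4\n3,0.5\n4,0.7\n")
chosen = pick_representatives(list(csv.DictReader(example)))
print({name: row["index"] for name, row in chosen.items()})
```

To use it on real data, replace the in-memory table with an open file handle on the relevant summary/main.csv and adjust `key` to match the actual column header.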
In addition to this, each experiment directory contains a top directory
which describes the top-performing percentile of datasets across all trials.
Specifically, top/main.csv contains a subset of all the summary/main.csv
files corresponding to the top percentile, and the remainder of the directory is
a copy of the datasets in the top percentile in their original
<seed>/<generation>/<index>/main.csv structure.
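As an illustration of the <seed>/<generation>/<index>/main.csv layout, the sketch below walks a directory and recovers the three path components for each dataset. The function name `list_datasets` is hypothetical, and the code assumes that the components appear as directory names exactly as laid out above.

```python
from pathlib import Path


def list_datasets(top):
    """Yield (seed, generation, index, path) for each dataset under `top`."""
    top = Path(top)
    # Only files three levels down match, so top/main.csv itself is ignored
    for path in sorted(top.glob("*/*/*/main.csv")):
        seed, generation, index = path.relative_to(top).parts[:3]
        yield seed, generation, index, path
```

The components are returned as strings; convert them with `int` if numeric ordering is needed.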
The notebooks
The nbs subdirectory contains the Jupyter Notebooks used to produce the
data, tables and plots in the results section of the paper. Each notebook has
some brief documentation within.
It may be helpful to add the conda environment as a kernel to Jupyter. To do
this, run the following command once it is installed:
python -m ipykernel install --user --name kmodes-init --display-name "k-modes (Wilde)"
The source code
The src subdirectory contains the source code used to produce the data in
data/artificial. The only object in src is the artificial directory.
Within that, there are three files: two experiment files that define a
fitness function, and main.py. To reproduce the associated data, use the
following command:
python main.py <experiment> <cores> 500 100 0.2 0.01 3
where <experiment> is the experiment file's name (without the .py
extension) and <cores> is the number of cores to use.
Note that the generation of this data is costly and will likely require a machine that is capable of running undisturbed for a number of days or even weeks.
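The command above can also be assembled from Python, for instance when scripting runs of both experiments. This sketch only builds, and optionally launches, the same command line. The helper names `build_command` and `run_experiment` are placeholders, the default working directory assumes src.zip was extracted in place, and the five trailing numbers are copied verbatim from the command above without interpreting them.

```python
import subprocess
import sys

# Trailing arguments copied verbatim from the command in this README
FIXED_ARGS = ["500", "100", "0.2", "0.01", "3"]


def build_command(experiment, cores):
    """Return the argument list for `python main.py <experiment> <cores> ...`."""
    return [sys.executable, "main.py", experiment, str(cores), *FIXED_ARGS]


def run_experiment(experiment, cores, cwd="src/artificial"):
    """Run main.py from the directory that contains it (path is an assumption)."""
    return subprocess.run(build_command(experiment, cores), cwd=cwd, check=True)
```

Using `sys.executable` keeps the run inside whichever interpreter is active, which should be the conda environment described earlier.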
Once the data has been generated, it can be easily summarised using the
edo_exp library that can be found at
github.com/daffidwilde/edo_exp. To
summarise an experiment whose data is stored at <path/to/experiment/data>,
do the following:
- Clone the repository:
git clone https://github.com/daffidwilde/edo_exp
- Move to its source code directory:
cd edo_exp/src/edo_exp
- Run the command:
python summarise.py <path/to/experiment/data>
Again, this may take some time to complete.
Owner
- Name: Henry Wilde
- Login: daffidwilde
- Kind: user
- Location: Cardiff, UK
- Company: Dŵr Cymru Welsh Water
- Repositories: 29
- Profile: https://github.com/daffidwilde
Data scientist and advocate for open-source, sustainably developed software 🛸 🐐 🦆
Issues and Pull Requests
Last synced: over 1 year ago
All Time
- Total issues: 13
- Total pull requests: 47
- Average time to close issues: 8 months
- Average time to close pull requests: 4 days
- Total issue authors: 1
- Total pull request authors: 1
- Average comments per issue: 1.0
- Average comments per pull request: 0.0
- Merged pull requests: 46
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 0
- Pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Issue authors: 0
- Pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- daffidwilde (13)
Pull Request Authors
- daffidwilde (46)
Dependencies
- edo ==0.2.1
- matching ==1.1
- tqdm ==4.36.1
- yellowbrick ==1.0.1