https://github.com/daffidwilde/kmodes-init-paper

A repository to accompany the paper entitled "A novel initialisation based on hospital-resident assignment for the k-modes algorithm"

Science Score: 13.0%

This score indicates how likely this project is to be science-related based on various indicators:

○ CITATION.cff file
✓ codemeta.json file (found codemeta.json file)
○ .zenodo.json file
○ DOI references
○ Academic publication links
○ Academic email domains
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: low similarity (12.4%) to scientific vocabulary

Last synced: 10 months ago

Repository

Basic Info

Host: GitHub
Owner: daffidwilde
Language: TeX
Default Branch: main
Size: 9.82 MB

Statistics

Stars: 1
Watchers: 3
Forks: 0
Open Issues: 0
Releases: 0

Created over 8 years ago · Last pushed almost 6 years ago

Metadata Files

Readme

Source code, notebooks, and data

This directory contains all of the source code, notebooks and data needed to reproduce the results and plots in the paper "A novel initialisation based on hospital-resident assignment for the $k$-modes algorithm" by Henry Wilde et al.

At the top level of this directory is an environment.yml file used to create a conda virtual environment. This environment ensures that the code herein reproduces the data and plots used in the paper exactly. Instructions on how to create, use and otherwise manage conda environments can be found in the conda documentation.

The remainder of the directory is made up of three ZIP archives that should be decompressed: data.zip, nbs.zip, and src.zip.
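Any ZIP tool can unpack the three archives. As a minimal sketch using only Python's standard library, the helper below extracts each archive next to where it sits. The function name `extract_archives` is illustrative and not part of the repository, and the sketch assumes each archive unpacks into its own subdirectory (data, nbs, src), as the sections that follow suggest.

```python
import zipfile
from pathlib import Path

# Archive names as listed in the README.
ARCHIVES = ("data.zip", "nbs.zip", "src.zip")


def extract_archives(root=".", archives=ARCHIVES):
    """Extract each archive found in `root` into `root`.

    Returns the names of the archives that were actually extracted.
    """
    root = Path(root)
    extracted = []
    for name in archives:
        archive = root / name
        if not archive.is_file():
            # Skip archives that are absent rather than failing outright.
            continue
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(root)
        extracted.append(name)
    return extracted
```

Calling `extract_archives()` from the top level of the directory should leave the extracted contents alongside the original ZIP files.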

The data

The data subdirectory contains the data associated with each subsection in the results of the paper. The data/knee and data/nclasses subdirectories contain the results for the benchmark datasets stored at the top of data, where the number of clusters has been chosen via knee point detection and the number of classes, respectively. The data/artificial subdirectory contains two subdirectories, one for each experiment in the final analysis. Each experiment directory contains three trial subdirectories that are structured as follows:

- data.tar.gz: a tarball of all the data associated with that experiment.
- summary: a directory containing a summary of the data in data.tar.gz, i.e.:
  - main.csv: a CSV detailing the index, dimensions, memory consumption, generation and fitness of every individual in the trial.
  - max: a directory containing the dataset and metadata of the individual with the highest fitness score in the trial.
  - median: a directory containing the dataset and metadata of the individual with the closest-to-median fitness score in the trial.
  - min: a directory containing the dataset and metadata of the individual with the lowest fitness score in the trial.
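To illustrate how the max, median and min individuals relate to summary/main.csv, here is a small sketch that picks the three representative rows from such a file. It is not the code used to build the summaries. The column name `fitness` is an assumption about the CSV header, and the function names are hypothetical.

```python
import csv
import statistics


def load_fitness(path, column="fitness"):
    """Read (row, fitness) pairs from a summary CSV.

    `column` is an assumed header name; adjust it to match the real file.
    """
    with open(path, newline="") as f:
        return [(row, float(row[column])) for row in csv.DictReader(f)]


def pick_representatives(pairs):
    """Return the rows with the highest, closest-to-median and lowest fitness."""
    fitnesses = [fitness for _, fitness in pairs]
    median = statistics.median(fitnesses)
    return {
        "max": max(pairs, key=lambda pair: pair[1])[0],
        # Ties in distance to the median resolve to the first row encountered.
        "median": min(pairs, key=lambda pair: abs(pair[1] - median))[0],
        "min": min(pairs, key=lambda pair: pair[1])[0],
    }
```

For example, `pick_representatives(load_fitness("summary/main.csv"))` would return one row per category, which can then be compared against the contents of the max, median and min directories.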

In addition to this, each experiment directory contains a top directory which describes the top-performing percentile of datasets across all trials. Specifically, top/main.csv contains a subset of all the summary/main.csv files corresponding to the top percentile, and the remainder of the directory is a copy of the datasets in the top percentile in their original <seed>/<generation>/<index>/main.csv structure.

The notebooks

The nbs subdirectory contains the Jupyter Notebooks used to produce the data, tables and plots in the results section of the paper. Each notebook has some brief documentation within.

It may be helpful to add the conda environment as a kernel to Jupyter. To do this, run the following command once the environment is installed: python -m ipykernel install --user --name kmodes-init --display-name "k-modes (Wilde)".

The source code

The src subdirectory contains the source code used to produce the data in data/artificial. The only object in src is the artificial directory. Within that, there are three files: two experiment files that each define a fitness function, and main.py. To reproduce the associated data, use the following command: python main.py <experiment> <cores> 500 100 0.2 0.01 3 where <experiment> is the experiment file's name (without the .py extension) and <cores> is the number of cores to use.
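When scripting runs for both experiments, it can help to assemble the command programmatically. The sketch below builds the argument list only and runs nothing. The helper `build_command` is hypothetical, and the five trailing values are copied verbatim from the command above without interpreting them, since their meaning is not documented here.

```python
import sys

# Positional arguments copied verbatim from the README command. Their meaning
# is not documented there, so they are passed through unchanged.
FIXED_ARGS = ("500", "100", "0.2", "0.01", "3")


def build_command(experiment, cores):
    """Return the argv list for one run of main.py."""
    if experiment.endswith(".py"):
        # The README asks for the experiment name without the .py extension.
        experiment = experiment[:-3]
    if cores < 1:
        raise ValueError("cores must be a positive integer")
    return [sys.executable, "main.py", experiment, str(cores), *FIXED_ARGS]
```

The resulting list could then be passed to `subprocess.run(cmd, cwd="src/artificial", check=True)`, where the `cwd` value assumes src.zip was extracted at the top level of the directory.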

Note that the generation of this data is costly and will likely require a machine that is capable of running undisturbed for a number of days or even weeks.

Once the data has been generated, it can be easily summarised using the edo_exp library that can be found at github.com/daffidwilde/edo_exp. To summarise an experiment whose data is stored at <path/to/experiment/data>, do the following:

1. Clone the repository: git clone https://github.com/daffidwilde/edo_exp
2. Move to its source code directory: cd edo_exp/src/edo_exp
3. Run the command: python summarise.py <path/to/experiment/data>

Again, this may take some time to complete.

Owner

Name: Henry Wilde
Login: daffidwilde
Kind: user
Location: Cardiff, UK
Company: Dŵr Cymru Welsh Water

Repositories: 29
Profile: https://github.com/daffidwilde

Data scientist and advocate for open-source, sustainably developed software 🛸 🐐 🦆

Issues and Pull Requests

Last synced: over 1 year ago

All Time

Total issues: 13
Total pull requests: 47
Average time to close issues: 8 months
Average time to close pull requests: 4 days
Total issue authors: 1
Total pull request authors: 1
Average comments per issue: 1.0
Average comments per pull request: 0.0
Merged pull requests: 46
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 0
Pull requests: 0
Average time to close issues: N/A
Average time to close pull requests: N/A
Issue authors: 0
Pull request authors: 0
Average comments per issue: 0
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0

Top Authors

Issue Authors

daffidwilde (13)

Pull Request Authors

daffidwilde (46)

Top Labels

Issue Labels

maybe (2) example (2) enhancement (1) question (1) ongoing (1)

Dependencies

environment.yml pypi

edo ==0.2.1
matching ==1.1
tqdm ==4.36.1
yellowbrick ==1.0.1
