https://github.com/amazon-science/omnimatch

OmniMatch: Joinability Discovery in Data Products

Science Score: 36.0%

This score indicates how likely this project is to be science-related based on various indicators:

- ○ CITATION.cff file
- ✓ codemeta.json file: Found codemeta.json file
- ✓ .zenodo.json file: Found .zenodo.json file
- ○ DOI references
- ✓ Academic publication links: Links to zenodo.org
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: Low similarity (10.9%) to scientific vocabulary

Last synced: 10 months ago

Repository

OmniMatch: Joinability Discovery in Data Products

Basic Info

Host: GitHub
Owner: amazon-science
License: other
Language: Python
Default Branch: main
Homepage:
Size: 27.3 KB

Statistics

Stars: 3
Watchers: 0
Forks: 2
Open Issues: 0
Releases: 0

Created about 1 year ago · Last pushed about 1 year ago

Metadata Files

Readme, Contributing, License, Code of conduct

OmniMatch

This repo includes the code used for implementing OmniMatch, as described in "OmniMatch: Joinability Discovery in Data Products".

Repo structure

src Contains Python source files for developing OmniMatch and the baselines used in the paper:
- training_generator.py contains the code for generating training dataset pairs for self-supervision.
- featurizer.py contains the code for computing column pairwise similarity metrics.
- omnimatch_predictors.py contains code for training and testing OmniMatch models, as described in the paper.
- rf_predictor.py contains code for training and testing the Random Forest model baseline.
- other source files needed for execution.
config_files Contains configuration files for each Python script included in src.
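
To make "column pairwise similarity metrics" concrete, here is a minimal sketch of two set-based measures whose names appear in the example configuration (Jaccard similarity and Jaccard containment). It is an illustration under assumptions, not the code in featurizer.py: the function names are hypothetical, and the repository may restrict the computation to frequent values or preprocess the columns differently.

```python
def jaccard_similarity(a, b):
    """|A ∩ B| / |A ∪ B| over the distinct values of two columns."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        return 0.0  # convention chosen here for two empty columns
    return len(set_a & set_b) / len(union)


def jaccard_containment(a, b):
    """|A ∩ B| / |A|: the fraction of A's distinct values found in B."""
    set_a, set_b = set(a), set(b)
    if not set_a:
        return 0.0  # convention chosen here for an empty column
    return len(set_a & set_b) / len(set_a)


# Made-up column values, for illustration only.
city_a = ["Boston", "Austin", "Denver", "Boston"]
city_b = ["Austin", "Denver", "Miami", "Seattle", "Tampa"]

sim = jaccard_similarity(city_a, city_b)    # 2 shared / 6 distinct overall
cont = jaccard_containment(city_a, city_b)  # 2 shared / 3 distinct in city_a
```

Containment is asymmetric, which is why it can flag a small column that is fully covered by a much larger one even when the symmetric Jaccard score is low.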

Datasets and other files

The dataset can be downloaded from this location: https://zenodo.org/records/15705578

Details:

data-products-matching/datasets contains test and train datasets for both our join benchmarks.
data-products-matching/assets/features contains column-pairwise similarity metrics for each measure used in the paper (in .pickle format) for both our join benchmarks and their corresponding test and train datasets.
data-products-matching/assets/samples contains samples of training datasets that can be used for training, for both join benchmarks.
data-products-matching/assets/matches contains all join and non-join pairs of training and test datasets for both our join benchmarks (in .pickle format).
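
The pickled match files can be inspected with the standard library. The sketch below assumes each file holds a list of ((filename, column), (filename, column)) tuples, the format described under "Running OmniMatch"; the helper names and the demo file are hypothetical, and the actual file names inside the Zenodo archive are not shown here. Only unpickle files from a source you trust.

```python
import pickle
import tempfile
from collections import Counter
from pathlib import Path


def load_pairs(path):
    """Load a pickled list of ((file, column), (file, column)) pairs."""
    with open(path, "rb") as handle:
        pairs = pickle.load(handle)
    for pair in pairs:
        # Unpacking fails loudly if an entry does not have the expected shape.
        (file_a, col_a), (file_b, col_b) = pair
    return pairs


def pairs_per_file(pairs):
    """Count how many pairs each CSV file takes part in."""
    counts = Counter()
    for (file_a, _), (file_b, _) in pairs:
        counts[file_a] += 1
        counts[file_b] += 1
    return counts


# Demo with a throwaway file; the file and column names are made up.
demo_pairs = [
    (("a.csv", "city"), ("b.csv", "city_name")),
    (("a.csv", "zip"), ("c.csv", "postal_code")),
]
demo_path = Path(tempfile.mkdtemp()) / "demo_joins.pickle"
demo_path.write_bytes(pickle.dumps(demo_pairs))

loaded = load_pairs(demo_path)
counts = pairs_per_file(loaded)
```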

Running OmniMatch

In the absence of training data, use src/training_generator.py to generate training dataset pairs based on the test data. After generating the data, compute the full lists of join and non-join pairs between the generated dataset pairs, in the format [((filename1.csv, column1), (filename2.csv, column2)), ...], and store them in two separate pickle files (like the ones we provide for our benchmarks). Parameters can be set through the `config_files/training_generator_config.ini` file.
Make sure that you also have full lists of all join and non-join column pairs for the test datasets, again in .pickle format (like the ones we provide).
Run src/featurizer.py once for each metric you want to compute over all join and non-join pairs in the training/test datasets. For example, to compute embedding similarity based on frequent values, set value_embeddings: True in the corresponding configuration file (config_files/featurizer_config.ini) and set all other metrics to False.
Run src/omnimatch_predictors.py after setting the parameters in config_files/omnimatch_predictors_config.ini appropriately. To run OmniMatch, set model: rgcn_margin or rgcn_cross_entropy, depending on the loss function you want to use.
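
The first step asks for two pickle files holding the join and non-join pairs. A minimal sketch of producing them for one generated dataset pair is shown below. It assumes that every cross-table column pair not known to be joinable counts as a non-join pair, and that the joinable pairs are already available from the generation step; the helper names, file names, and column names are all hypothetical.

```python
import pickle
import tempfile
from itertools import product
from pathlib import Path


def split_pairs(file_a, columns_a, file_b, columns_b, joinable):
    """Split all cross-table column pairs into join and non-join lists.

    `joinable` is a set of (column_in_a, column_in_b) tuples. How it is
    obtained depends on the generator and is an assumption here.
    """
    joins, non_joins = [], []
    for col_a, col_b in product(columns_a, columns_b):
        pair = ((file_a, col_a), (file_b, col_b))
        if (col_a, col_b) in joinable:
            joins.append(pair)
        else:
            non_joins.append(pair)
    return joins, non_joins


def save_pairs(pairs, path):
    """Write a list of pairs to `path` in pickle format."""
    with open(path, "wb") as handle:
        pickle.dump(pairs, handle)


joins, non_joins = split_pairs(
    "generated_1.csv", ["id", "city"],
    "generated_2.csv", ["city_name", "population"],
    joinable={("city", "city_name")},
)

out_dir = Path(tempfile.mkdtemp())  # stand-in for a real output directory
save_pairs(joins, out_dir / "joins.pickle")
save_pairs(non_joins, out_dir / "non_joins.pickle")
```

For several generated dataset pairs, the per-pair lists can be concatenated before saving so that each of the two files holds one flat list.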

Example run

python src/omnimatch_predictors.py -cf config_files/omnimatch_predictors_config.ini
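
The featurizer step in the previous section needs one run per metric, each with a single flag enabled. The sketch below only generates the per-metric configuration text and does not invoke any script. Apart from value_embeddings, which the text above names, the metric spellings are assumptions, as is the [FEATURES] section name (borrowed from the predictor configuration); check config_files/featurizer_config.ini for the real keys.

```python
import configparser
import io

# Only value_embeddings is confirmed by the README text; the other
# spellings are assumed for illustration.
METRICS = [
    "jaccard_frequent",
    "value_embeddings",
    "value_distribution",
    "jaccard_containment",
]


def one_metric_configs(metrics, section="FEATURES"):
    """Yield (metric, ini_text) pairs where only `metric` is set to True."""
    for active in metrics:
        parser = configparser.ConfigParser(delimiters=(":",))
        parser[section] = {name: str(name == active) for name in metrics}
        buffer = io.StringIO()
        parser.write(buffer)
        yield active, buffer.getvalue()


configs = dict(one_metric_configs(METRICS))
```

Each generated text could be merged into a copy of the featurizer configuration and saved under its own name, so that every metric gets a separate run and output.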

Example config file (omnimatch_predictors_config.ini)

```
traindatasetspath: [root]/datasets/citygovernment/traintables
trainfeaturespath: [root]/features/citygovernment/traintables/
trainnodefeatures: [root]/features/citygovernment/individualfeatures.pickle
testfeaturespath: [root]/features/citygovernment/testtables/
testnodefeatures: [root]/features/citygovernment/testtables/individualfeatures.pickle
resultspath: [root]/results
samplespath: [root]/samples/citygovernment
sampled_datasets: (nothing as we will sample the training data in this run)

[PARAMETERS]
benchmark: citygovernment (run the citygovernment benchmark)
graphconstruction: topk (keep topk edges per node)
modelloss: rgcn_margin (run OmniMatch with triplet margin loss)
k: 3
numberofdatasets: 2 (2 generated datasets per test dataset - should be generated beforehand with training_generator.py)
numberofsources: 20 (use 20 of the test datasets to pick generated training pairs - should be generated beforehand with training_generator.py)
dimension: 256
epochs: 30
learningrate: 0.001
margin: 0.5
norm: 2 (doesn't matter since we picked margin loss, would matter if we picked rgcn_cross_entropy)

[FEATURES]
jaccardfrequent: True
valueembeddings: True
valuedistribution: True
jaccardcontainment: True

[SAVEFILES]
writeembeddings: False - we don't want to store produced embeddings
writeresults: True - we want to store results
```
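
A file in this key: value style can be read with Python's standard configparser module. The sketch below is not how omnimatch_predictors.py loads its settings, which is not shown here. It assumes the parenthetical notes in the example are explanatory and absent from the real file, since typed getters such as getint would fail on them. Key names are copied as displayed in the example and may differ in the actual file.

```python
import configparser

# Trimmed sample using key names as displayed in the example above.
SAMPLE = """
[PARAMETERS]
k: 3
dimension: 256
epochs: 30
margin: 0.5
norm: 2

[FEATURES]
jaccardfrequent: True
valueembeddings: True
"""


def read_settings(text):
    """Parse INI text into typed parameters and boolean feature flags."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    params = parser["PARAMETERS"]
    settings = {
        "k": params.getint("k"),
        "dimension": params.getint("dimension"),
        "epochs": params.getint("epochs"),
        "margin": params.getfloat("margin"),
        "norm": params.getint("norm"),
    }
    section = parser["FEATURES"]
    features = {name: section.getboolean(name) for name in section}
    return settings, features


settings, features = read_settings(SAMPLE)
```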

Owner

Name: Amazon Science
Login: amazon-science
Kind: organization

Website: https://amazon.science
Twitter: AmazonScience
Repositories: 80
Profile: https://github.com/amazon-science

GitHub Events

Total

Watch event: 2
Delete event: 4
Issue comment event: 2
Push event: 2
Public event: 1
Pull request event: 3

Last Year

Watch event: 2
Delete event: 4
Issue comment event: 2
Push event: 2
Public event: 1
Pull request event: 3

Dependencies

requirements.txt pypi

dgl ==0.5.3
fasttext ==0.9.2
nltk ==3.6.3
numpy ==1.21.5
pandas ==1.3.5
scikit_learn ==1.0.2
scipy ==1.5.4
torch ==1.7.0
torchmetrics ==0.9.3
tqdm ==4.64.0
wordninja ==2.0.0
