https://github.com/amazon-science/omnimatch
OmniMatch: Joinability Discovery in Data Products
Science Score: 36.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file: Found codemeta.json file
- ✓ .zenodo.json file: Found .zenodo.json file
- ○ DOI references
- ✓ Academic publication links: Links to zenodo.org
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: Low similarity (10.9%) to scientific vocabulary
Repository
Statistics
- Stars: 3
- Watchers: 0
- Forks: 2
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
OmniMatch
This repo includes the code used for implementing OmniMatch, as described in "OmniMatch: Joinability Discovery in Data Products".
Repo structure
- `src`: Contains Python source files for developing OmniMatch and the baselines used in the paper:
  - `training_generator.py` contains the code for generating training dataset pairs for self-supervision.
  - `featurizer.py` contains the code for computing column pairwise similarity metrics.
  - `omnimatch_predictors.py` contains code for training and testing OmniMatch models, as described in the paper.
  - `rf_predictor.py` contains code for training and testing the Random Forest model baseline.
  - Other source files needed for execution.
- `config_files`: Contains configuration files for each Python script included in `src`.
Datasets and other files
The dataset can be downloaded from this location: https://zenodo.org/records/15705578
Details:
- `data-products-matching/datasets` contains test and train datasets for both our join benchmarks.
- `data-products-matching/assets/features` contains column-pairwise similarity metrics for each measure used in the paper (in .pickle format) for both our join benchmarks and their corresponding test and train datasets.
- `data-products-matching/assets/samples` contains samples of training datasets that can be used for training, for both join benchmarks.
- `data-products-matching/assets/matches` contains all join and non-join pairs of training and test datasets for both our join benchmarks (in .pickle format).
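The files under `assets/matches` are pickles. Going by the pair format given in the running instructions, each entry pairs two `(filename, column)` tuples. A minimal sketch of writing and reading such a list, assuming plain Python lists of tuples; the file names, column names, and helper names are made up for illustration, and the exact object layout in the released pickles may differ:

```python
import pickle
import tempfile
from pathlib import Path

# Hypothetical join pairs in the form ((file, column), (file, column)).
join_pairs = [
    (("permits.csv", "zip_code"), ("inspections.csv", "zip")),
    (("permits.csv", "permit_id"), ("fees.csv", "permit_id")),
]


def save_pairs(pairs, path):
    """Pickle a list of column pairs to `path`."""
    with open(path, "wb") as handle:
        pickle.dump(pairs, handle)


def load_pairs(path):
    """Load a list of column pairs and check that each entry has the expected shape."""
    with open(path, "rb") as handle:
        pairs = pickle.load(handle)
    for left, right in pairs:
        if len(left) != 2 or len(right) != 2:
            raise ValueError(f"unexpected pair shape: {(left, right)!r}")
    return pairs


# Round trip through a temporary file to show the format survives unchanged.
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "joins.pickle"
    save_pairs(join_pairs, target)
    loaded_pairs = load_pairs(target)
```

Only unpickle files from sources you trust, since `pickle.load` can execute arbitrary code.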
Running OmniMatch
- In the absence of training data, use `src/training_generator.py` to generate training dataset pairs based on the test data. After generating the data, compute the full lists of join/non-join pairs between the generated dataset pairs in the format `[((filename1.csv, column1), (filename2.csv, column2)), etc.]` and store them in two separate pickle files (like the ones we provide for our benchmarks). Parameters can be set through the `config_files/training_generator_config.ini` file.
- Make sure that you also have full lists of all join and non-join column pairs for the test datasets, again in .pickle format (like the ones we provide).
- Run `src/featurizer.py` once for each metric you want to compute, over all join and non-join pairs in the training/test datasets. For example, to compute embedding similarity based on frequent values, set `value_embeddings: True` in the corresponding configuration file (`config_files/featurizer_config.ini`) and set all other metrics to `False`.
- Run `src/omnimatch_predictors.py` after setting the parameters in `config_files/omnimatch_predictors_config.ini` appropriately. To run OmniMatch, set `model: rgcn_margin` or `rgcn_cross_entropy`, depending on the loss function you want to use.
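Because the featurizer is run once per metric, it can help to generate one configuration per metric with exactly one flag enabled. The sketch below assumes the metric flags live in a section named `FEATURES`. Only `value_embeddings` is spelled as in the instructions; the other three names are copied as they appear in the example predictor config and may be spelled differently in the real file (for instance in underscore placement). Check the section and key names against `config_files/featurizer_config.ini` before relying on it.

```python
import configparser

# Assumed metric flags; only `value_embeddings` is confirmed by the instructions.
METRICS = ["value_embeddings", "jaccardfrequent", "valuedistribution", "jaccardcontainment"]
SECTION = "FEATURES"  # assumed section name


def single_metric_config(active, metrics=METRICS, section=SECTION):
    """Return INI text in which only `active` is True and every other metric is False."""
    if active not in metrics:
        raise ValueError(f"unknown metric: {active}")
    lines = [f"[{section}]"]
    lines.extend(f"{name}: {name == active}" for name in metrics)
    return "\n".join(lines) + "\n"


def enabled_metrics(ini_text, section=SECTION):
    """Parse INI text and list the metrics whose flag is True."""
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    return [name for name in parser[section] if parser.getboolean(section, name)]


# One configuration text per metric, ready to be written to separate files.
configs = {metric: single_metric_config(metric) for metric in METRICS}
```

Each generated text covers only the metric flags; any paths or other settings the featurizer expects would still have to be added.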
Example run
python src/omnimatch_predictors.py -cf config_files/omnimatch_predictors_config.ini
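The `-cf` flag passes the path of an INI file to the script. As a rough sketch of that pattern (not the repository's actual argument handling), a script can read the flag with `argparse` and load typed values with `configparser`. The long option `--config-file` and the helper name `load_parameters` are hypothetical; the parameter names are taken from the example config in this README.

```python
import argparse
import configparser


def load_parameters(argv):
    """Parse `-cf <path>` and return selected [PARAMETERS] values with types."""
    arg_parser = argparse.ArgumentParser()
    # `--config-file` is a hypothetical long name; only `-cf` appears in the README.
    arg_parser.add_argument("-cf", "--config-file", required=True)
    args = arg_parser.parse_args(argv)

    config = configparser.ConfigParser()
    # ConfigParser.read() silently skips missing files, so fail loudly instead.
    if not config.read(args.config_file):
        raise FileNotFoundError(args.config_file)

    params = config["PARAMETERS"]
    return {
        "benchmark": params.get("benchmark"),
        "k": params.getint("k"),
        "epochs": params.getint("epochs"),
        "margin": params.getfloat("margin"),
    }
```

With this sketch, values must be bare (`k: 3`); a value followed by an explanatory note in parentheses would fail the integer or float conversion.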
Example config file (omnimatch_predictors_config.ini)
```
traindatasetspath: [root]/datasets/citygovernment/traintables
trainfeaturespath: [root]/features/citygovernment/traintables/
trainnodefeatures: [root]/features/citygovernment/individualfeatures.pickle
testfeaturespath: [root]/features/citygovernment/testtables/
testnodefeatures: [root]/features/citygovernment/testtables/individualfeatures.pickle
resultspath: [root]/results
samplespath: [root]/samples/citygovernment
sampled_datasets: (nothing as we will sample the training data in this run)

[PARAMETERS]
benchmark: citygovernment (run the citygovernment benchmark)
graphconstruction: topk (keep topk edges per node)
modelloss: rgcn_margin (run OmniMatch with triplet margin loss)
k: 3
numberofdatasets: 2 (2 generated datasets per test dataset - should be generated beforehand with training_generator.py)
numberofsources: 20 (use 20 of the test datasets to pick generated training pairs - should be generated beforehand with training_generator.py)
dimension: 256
epochs: 30
learningrate: 0.001
margin: 0.5
norm: 2 (doesn't matter since we picked margin loss, would matter if we picked rgcn_cross_entropy)

[FEATURES]
jaccardfrequent: True
valueembeddings: True
valuedistribution: True
jaccardcontainment: True

[SAVEFILES]
writeembeddings: False - we don't want to store produced embeddings
writeresults: True - we want to store results
```
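The `graphconstruction: topk` setting with `k: 3` keeps a limited number of edges per node. One plausible reading, sketched below in plain Python, keeps for every column the `k` neighbours with the highest similarity score and takes the union of those edges. The function name, the input layout, and the toy scores are assumptions, and the repository's own graph construction may differ.

```python
from collections import defaultdict


def top_k_edges(similarities, k):
    """Keep, for each node, its k highest-scoring incident edges.

    `similarities` maps an unordered node pair (a, b) to a score.
    Returns the union of the kept edges as a set of sorted pairs.
    """
    incident = defaultdict(list)
    for (a, b), score in similarities.items():
        incident[a].append((score, b))
        incident[b].append((score, a))

    kept = set()
    for node, neighbours in incident.items():
        # Sort by score descending; ties are broken by neighbour name for determinism.
        ranked = sorted(neighbours, key=lambda item: (-item[0], item[1]))
        for _, other in ranked[:k]:
            kept.add(tuple(sorted((node, other))))
    return kept


# Toy similarity scores between four hypothetical columns.
scores = {
    ("a", "b"): 0.9,
    ("a", "c"): 0.2,
    ("a", "d"): 0.1,
    ("b", "c"): 0.8,
    ("b", "d"): 0.3,
    ("c", "d"): 0.7,
}
edges = top_k_edges(scores, k=1)
```

Because the result is a union over nodes, a node can end up with more than `k` edges when other nodes select it as one of their top neighbours; in the toy data, `b` keeps two edges even with `k=1`.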
Owner
- Name: Amazon Science
- Login: amazon-science
- Kind: organization
- Website: https://amazon.science
- Twitter: AmazonScience
- Repositories: 80
- Profile: https://github.com/amazon-science
GitHub Events
Total
- Watch event: 2
- Delete event: 4
- Issue comment event: 2
- Push event: 2
- Public event: 1
- Pull request event: 3
Last Year
- Watch event: 2
- Delete event: 4
- Issue comment event: 2
- Push event: 2
- Public event: 1
- Pull request event: 3
Dependencies
- dgl ==0.5.3
- fasttext ==0.9.2
- nltk ==3.6.3
- numpy ==1.21.5
- pandas ==1.3.5
- scikit_learn ==1.0.2
- scipy ==1.5.4
- torch ==1.7.0
- torchmetrics ==0.9.3
- tqdm ==4.64.0
- wordninja ==2.0.0