Science Score: 57.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: Found CITATION.cff file
- ✓ codemeta.json file: Found codemeta.json file
- ✓ .zenodo.json file: Found .zenodo.json file
- ✓ DOI references: Found 3 DOI reference(s) in README
- ○ Academic publication links
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: Low similarity (9.3%) to scientific vocabulary
Repository
Basic Info
- Host: GitHub
- Owner: PierreWoL
- License: apache-2.0
- Language: Python
- Default Branch: main
- Size: 26.4 MB
Statistics
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
Dataset Discovery and Exploration: State-of-the-art, Challenges, and Opportunities
This repository hosts the Python notebooks that demonstrate the methodologies and approaches covered in the tutorial titled "Dataset Discovery and Exploration: State-of-the-art, Challenges, and Opportunities," scheduled for presentation at EDBT 2024. The slides are also included in this repo. The four dataset discovery and exploration functionalities reviewed in the tutorial are:
- Dataset Discovery.
- Dataset Navigation.
- Dataset Annotation.
- Schema Inference.
As an example of how the four functionalities of data discovery and exploration work in practice, this repository includes the following frameworks, or parts of them:
- D3L [1] - Dataset Discovery.
- Aurum [2] - Dataset Navigation.
- TableMiner+ [3] - Dataset Annotation.
- Starmie [4] - Schema Inference.
NOTE: SEMPROP [5] is the core function of Aurum that is currently implemented here. Our version is a simplified one based on the description in Aurum [2].
[1] A. Bogatu, A. A. A. Fernandes, N. W. Paton and N. Konstantinou, "Dataset Discovery in Data Lakes," 2020 IEEE 36th International Conference on Data Engineering (ICDE), Dallas, TX, USA, 2020, pp. 709-720, doi: 10.1109/ICDE48307.2020.00067.
[2] R. Castro Fernandez, Z. Abedjan, F. Koko, G. Yuan, S. Madden and M. Stonebraker, "Aurum: A Data Discovery System," 2018 IEEE 34th International Conference on Data Engineering (ICDE), Paris, France, 2018, pp. 1001-1012, doi: 10.1109/ICDE.2018.00094.
[3] Zhang, Ziqi. “Effective and efficient Semantic Table Interpretation using TableMiner+.” Semantic Web 8 (2017): 921-957.
[4] Grace Fan, Jin Wang, Yuliang Li, Dan Zhang, and Renée J. Miller. 2023. Semantics-Aware Dataset Discovery from Data Lakes with Contextualized Column-Based Representation Learning. Proc. VLDB Endow. 16, 7 (March 2023), 1726–1739.
[5] R. Castro Fernandez et al., "Seeping Semantics: Linking Datasets Using Word Embeddings for Data Discovery," 2018 IEEE 34th International Conference on Data Engineering (ICDE), Paris, France, 2018, pp. 989-1000, doi: 10.1109/ICDE.2018.00093.
Dataset Overview
This project utilizes structured data derived from the Web Data Commons project, focusing on:
- T2Dv2 Gold Standard for Matching Web Tables to DBpedia: 108 tables from 9 entity classes. Link.
- Schema.org Table Corpus 2023: 92 tables from 8 entity classes. Link.
Modifications are applied to the Schema.org Table Corpus data, which include:
- Conversion from JSON to CSV format for easier processing and analysis.
- Flattening of nested JSON structures (e.g., address objects) to create clear, tabular data with columns.
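The two modifications above can be sketched with pandas. This is a minimal illustration, not the script used to prepare Datasets.zip: the function names `flatten_records` and `to_csv_text`, the sample records, and the choice of `.` as the column separator are all assumptions made for the example.

```python
import pandas as pd


def flatten_records(records):
    """Flatten a list of (possibly nested) JSON objects into a flat table.

    Nested objects such as {"address": {"postalCode": ...}} become
    columns named "address.postalCode".
    """
    return pd.json_normalize(records, sep=".")


def to_csv_text(records):
    """Return the flattened records as CSV text (no index column)."""
    # Pass a file path to DataFrame.to_csv instead to write to disk.
    return flatten_records(records).to_csv(index=False)


# Hypothetical records shaped like schema.org entities (illustration only).
records = [
    {"name": "Hotel A", "address": {"addressLocality": "Paris", "postalCode": "75001"}},
    {"name": "Hotel B", "address": {"addressLocality": "Dallas"}},
]

frame = flatten_records(records)
csv_text = to_csv_text(records)
print(csv_text)
```

Fields missing from a record (here the second hotel's postal code) are left empty in the CSV output rather than dropping the row.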
The datasets have been uploaded to this project as Datasets.zip. Simply extract this file in the current folder to access the data.
Prerequisites
Requirements (following the requirements of all included frameworks)
- Python 3.7.10 or higher. Preferred version: 3.11.0.
- All needed packages are specified in requirements.txt.
This repo includes the D3L and Starmie code used in the framework. If you would like to download the original framework, please go to the Origin framework, which includes D3L and Starmie as submodules.
Ensure you have the following installed:
- Python 3.7+ (Python 3.11 is recommended)
- Jupyter Notebook
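A quick way to confirm that the interpreter meets the version requirement is sketched below. The helper name `check_python` and its three return labels are hypothetical, and treating any 3.11.x release as "preferred" is an assumption based on the stated preference for 3.11.0.

```python
import sys

MINIMUM = (3, 7, 10)  # lowest version named in the requirements
PREFERRED = (3, 11)   # assumption: any 3.11.x release counts as preferred


def check_python(version=None):
    """Classify a Python version as 'unsupported', 'supported' or 'preferred'."""
    version = tuple(version or sys.version_info[:3])
    if version < MINIMUM:
        return "unsupported"
    if version[:2] == PREFERRED:
        return "preferred"
    return "supported"


# Report the status of the interpreter running this script.
print(check_python())
```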
Installation
Clone the repository to your local machine:
bash
git clone https://github.com/PierreWoL/EDBTDemo
Install required packages.
bash
pip install -r requirements.txt
Unzip the datasets.
bash
unzip Datasets.zip
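On systems without the unzip command, the same step can be done with Python's standard library. This is a sketch under the assumption that Datasets.zip sits in the current folder; the function name `extract_datasets` is hypothetical.

```python
import zipfile
from pathlib import Path


def extract_datasets(archive="Datasets.zip", target="."):
    """Extract the archive into the target folder and return the member names."""
    archive = Path(archive)
    if not archive.is_file():
        raise FileNotFoundError(f"{archive} not found; run this from the repository root")
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(target)
        return zf.namelist()


# Only extract when the archive is actually present in the current folder.
if Path("Datasets.zip").is_file():
    names = extract_datasets()
    print(f"Extracted {len(names)} entries")
```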
Install the Jupyter notebook using pip.
bash
pip install jupyter notebook
Run the tutorial demo via
bash
jupyter notebook DataDiscovery.ipynb
Optional steps
Subject Column Detection in TableMiner+: This step uses the Google API for the Google Custom Search Engine. You can build your own engine via Here. After setting up the engine, change cse_id and myapikey in webSearchAPI.py to your own values.
Currently, the Notebook uses the SearchingWeb parameter of the TableColumnAnnotation class in TableAnnotation.py to disable the web search function.
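For orientation, a query against a Custom Search Engine with google-api-python-client can look roughly like the sketch below. It is not the code in webSearchAPI.py: the function name `web_search`, its `num` and `service` parameters, and the returned (title, link) pairs are assumptions for illustration. Only the `cse_id` and API key correspond to the values mentioned above.

```python
def web_search(query, api_key, cse_id, num=5, service=None):
    """Return (title, link) pairs for a Custom Search query.

    `service` can be injected for testing; by default a client is built
    with google-api-python-client, which needs network access.
    """
    if service is None:
        # Imported lazily so the module loads without the client installed.
        from googleapiclient.discovery import build

        service = build("customsearch", "v1", developerKey=api_key)
    # The API accepts between 1 and 10 results per request.
    response = service.cse().list(q=query, cx=cse_id, num=num).execute()
    return [(item.get("title"), item.get("link")) for item in response.get("items", [])]


# Example call (requires your own credentials, so it is left commented out):
# results = web_search("Manchester", api_key="YOUR_API_KEY", cse_id="YOUR_CSE_ID")
```

Keeping the key and engine id as function arguments, rather than hard-coding them, makes it easier to keep credentials out of version control.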
For the schema inference part, embeddings first need to be generated by Starmie. If you have CUDA, feel free to run
bash
./starmie/cmd.sh
to generate the embeddings.
We also uploaded a sample embedding file for the datasets.
Copyright and License Notice
This project incorporates data used under the Apache License 2.0. We respect all original data copyrights and license requirements and share our work on this basis.
Citation
If you are using the code in this repo, please cite the following in your work:
bibtex
Norman W. Paton, Zhenyu Wu, Dataset Discovery and Exploration: State-of-the-art, Challenges and Opportunities.
Proceedings 27th International Conference on Extending Database Technology (EDBT), pp. 854–857, 2024.
https://openproceedings.org/2024/conf/edbt/tutorial-3.pdf
Owner
- Name: PierreWoL
- Login: PierreWoL
- Kind: user
- Repositories: 1
- Profile: https://github.com/PierreWoL
Citation (CITATION.cff)
type: dataset
cff-version: 1.2.0
message: "This is the dataset that this tutorial uses."
title: "Dataset Discovery and Exploration: State-of-the-art, Challenges, and Opportunities"
version: 1.0.0
date-released: 2024-03-05
references:
  - title: "Web Data Commons - Schema.org Table Corpus 2023"
    year: 2023
    url: "https://webdatacommons.org/structureddata/schemaorgtables/2023/index.html#toc3"
    type: dataset
  - title: "T2Dv2 Gold Standard for Matching Web Tables to DBpedia"
    year: 2023
    url: "https://webdatacommons.org/webtables/goldstandardV2.html#Limaye2010"
    type: dataset
Dependencies
- SPARQLWrapper ===2.0.0
- country-list ===1.0.0
- google-api-python-client ===2.122.0
- jsonlines ===1.2.0
- mmh3 ===4.0.1
- networkx ===2.8.8
- nltk ===3.8.1
- numpy ===1.26.0
- pandas ===2.2.1
- plotly ===5.17.0
- regex ===2023.10.3
- requests ===2.31.0
- scikit-learn ===1.3.1
- scipy ===1.11.3
- sentencepiece ===0.2.0
- spacy ===4.0.0.dev2
- sqlalchemy ===2.0.22
- tensorboardX ===2.0
- torch ===2.2.1
- tqdm ===4.66.1
- transformers ===4.36.0
- urllib3 ===2.0.7