Science Score: 57.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 3 DOI reference(s) in README
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (9.3%) to scientific vocabulary

Repository

Basic Info
  • Host: GitHub
  • Owner: PierreWoL
  • License: apache-2.0
  • Language: Python
  • Default Branch: main
  • Size: 26.4 MB
Statistics
  • Stars: 1
  • Watchers: 1
  • Forks: 0
  • Open Issues: 0
  • Releases: 0
Created over 2 years ago · Last pushed over 2 years ago
Metadata Files
Readme License Citation

README.md

Dataset Discovery and Exploration: State-of-the-art, Challenges, and Opportunities

This repository hosts the Python Notebooks that demonstrate the methodologies and approaches covered in the tutorial "Dataset Discovery and Exploration: State-of-the-art, Challenges, and Opportunities," scheduled for presentation at EDBT 2024. The slides are also included in this repo. The tutorial reviews four dataset discovery and exploration functionalities:

  • Dataset Discovery.
  • Dataset Navigation.
  • Dataset Annotation.
  • Schema Inference.

As an example of how the four functionalities of data discovery and exploration work in practice, this repository includes the following frameworks, or parts of them:

  • D3L [1] - Dataset Discovery.
  • Aurum [2] - Dataset Navigation.
  • TableMiner+ [3] - Dataset Annotation.
  • Starmie [4] - Schema Inference.

NOTE: SEMPROP [5] is the core function of Aurum that is currently implemented. Our version is a simplified one based on the description in the Aurum paper [2].
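To give a feel for what dataset discovery involves, the sketch below ranks candidate columns by the Jaccard similarity of their values against a query column. This is only an illustration of value overlap as one possible signal; it is not the D3L implementation, and the function names, file names, and sample values are made up for the example.

```python
def jaccard(a, b):
    """Jaccard similarity between two collections of cell values."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def rank_related_columns(query_column, candidate_columns):
    """Return (column name, score) pairs, most similar first."""
    scores = {
        name: jaccard(query_column, values)
        for name, values in candidate_columns.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical columns, keyed as "<table>:<column>".
query = ["Paris", "Rome", "Berlin"]
candidates = {
    "cities.csv:city": ["Rome", "Berlin", "Madrid"],
    "films.csv:title": ["Alien", "Heat"],
}

ranking = rank_related_columns(query, candidates)
for name, score in ranking:
    print(f"{name}: {score:.2f}")
```

Exact value overlap misses columns that are related but share few literal values, which is one reason discovery systems typically combine several kinds of evidence.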

[1] A. Bogatu, A. A. A. Fernandes, N. W. Paton and N. Konstantinou, "Dataset Discovery in Data Lakes," 2020 IEEE 36th International Conference on Data Engineering (ICDE), Dallas, TX, USA, 2020, pp. 709-720, doi: 10.1109/ICDE48307.2020.00067.

[2] R. Castro Fernandez, Z. Abedjan, F. Koko, G. Yuan, S. Madden and M. Stonebraker, "Aurum: A Data Discovery System," 2018 IEEE 34th International Conference on Data Engineering (ICDE), Paris, France, 2018, pp. 1001-1012, doi: 10.1109/ICDE.2018.00094.

[3] Z. Zhang, "Effective and efficient Semantic Table Interpretation using TableMiner+," Semantic Web, vol. 8, pp. 921-957, 2017.

[4] G. Fan, J. Wang, Y. Li, D. Zhang and R. J. Miller, "Semantics-Aware Dataset Discovery from Data Lakes with Contextualized Column-Based Representation Learning," Proc. VLDB Endow., vol. 16, no. 7, pp. 1726-1739, 2023.

[5] R. Castro Fernandez et al., "Seeping Semantics: Linking Datasets Using Word Embeddings for Data Discovery," 2018 IEEE 34th International Conference on Data Engineering (ICDE), Paris, France, 2018, pp. 989-1000, doi: 10.1109/ICDE.2018.00093.

Dataset Overview

This project utilizes structured data derived from the Web Data Commons project, focusing on:

  • T2Dv2 Gold Standard for Matching Web Tables to DBpedia: 108 tables from 9 entity classes. Link.
  • Schema.org Table Corpus 2023: 92 tables from 8 entity classes. Link.

The Schema.org Table Corpus data has been modified as follows:

  • Conversion from JSON to CSV format for easier processing and analysis.
  • Flattening of nested JSON structures (e.g., address objects) to create clear, tabular data with columns.
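The conversion described above can be sketched roughly as follows. This is a minimal illustration using only the standard library, assuming one JSON object per input line; it is not the script used to prepare Datasets.zip, and the helper names, the "." separator, and the sample records are assumptions made for the example.

```python
import csv
import io
import json


def flatten(record, parent_key="", sep="."):
    """Flatten nested dicts, e.g. {"address": {"postalCode": "M1"}}
    becomes {"address.postalCode": "M1"}."""
    flat = {}
    for key, value in record.items():
        name = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten(value, name, sep))
        else:
            flat[name] = value
    return flat


def json_lines_to_csv(lines):
    """Convert JSON lines to CSV text, one column per flattened key."""
    rows = [flatten(json.loads(line)) for line in lines if line.strip()]
    # Collect the union of keys in first-seen order to form the header.
    columns = []
    for row in rows:
        for key in row:
            if key not in columns:
                columns.append(key)
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, restval="")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Made-up records in the spirit of schema.org entities.
sample = [
    '{"name": "Cafe A", "address": {"addressLocality": "Manchester", "postalCode": "M1"}}',
    '{"name": "Cafe B", "telephone": "123"}',
]
csv_text = json_lines_to_csv(sample)
print(csv_text)
```

Records that lack a key get an empty cell in that column. List values are written as their string form, so they would need separate handling if they matter.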

The datasets have been uploaded to this project as Datasets.zip. Simply extract this file in the current folder to access the data.
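If the unzip command is not available (for example on Windows), the archive can also be extracted from Python. The sketch below is one way to do it with the standard library; the function name is hypothetical, and the internal folder layout of Datasets.zip is not assumed.

```python
import zipfile


def extract_and_list_csvs(archive="Datasets.zip", target="."):
    """Extract the archive into `target` and return the CSV paths it contains."""
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(target)
        names = zf.namelist()
    return sorted(name for name in names if name.lower().endswith(".csv"))
```

Calling extract_and_list_csvs() from the repository root extracts the data into the current folder and returns the relative paths of the CSV files found in the archive.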

Prerequisites

Requirements (following the requirements of all included frameworks)

  • Python 3.7.10 or higher. Preferred version: 3.11.0. All needed packages are specified in requirements.txt.

This repo includes the full D3L and Starmie code. If you would like to download the original framework, please go to the Origin framework, which includes D3L and Starmie as submodules.

Ensure you have the following installed:

  • Python 3.7+ (Python 3.11 is recommended)
  • Jupyter Notebook

Installation

Clone the repository to your local machine:

    git clone https://github.com/PierreWoL/EDBTDemo

Install the required packages:

    pip install -r requirements.txt

Unzip the datasets:

    unzip Datasets.zip

Install Jupyter Notebook using pip:

    pip install jupyter notebook

Run the tutorial demo:

    jupyter notebook DataDiscovery.ipynb

Optional steps

Subject Column Detection in TableMiner+: This step uses the Google API for Google Custom Search Engine; you can build your own engine via Here. After setting up the engine, change cse_id and myapikey in webSearchAPI.py.
Currently, the Notebook disables the web search function through the SearchingWeb parameter of the TableColumnAnnotation class in TableAnnotation.py.
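For orientation, a call to the Custom Search JSON API with google-api-python-client (listed in requirements.txt) typically looks like the sketch below. This is not the code in webSearchAPI.py: the function name search_web and its enabled flag are hypothetical, and the flag only mirrors the idea of switching web search off. With enabled=False the function returns immediately and no credentials are needed.

```python
def search_web(query, api_key, cse_id, enabled=False, num=5):
    """Return result links for `query`, or [] when web search is disabled.

    `api_key` and `cse_id` play the role of myapikey and cse_id
    mentioned above; this function itself is a hypothetical helper.
    """
    if not enabled:
        return []
    # Imported lazily so the disabled path works without the package.
    from googleapiclient.discovery import build

    service = build("customsearch", "v1", developerKey=api_key)
    response = service.cse().list(q=query, cx=cse_id, num=num).execute()
    return [item["link"] for item in response.get("items", [])]


# Disabled by default, so placeholder credentials are enough here.
print(search_web("TableMiner+", api_key="YOUR_API_KEY", cse_id="YOUR_CSE_ID"))
```

Passing enabled=True with a valid key and engine ID sends a real request, which counts against the API quota of the key.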

In the schema inference part, embeddings first need to be generated by Starmie. If you have CUDA, feel free to run bash ./starmie/cmd.sh to generate the embeddings. We also uploaded a sample embedding file for the datasets.

Copyright and License Notice

This project incorporates data used under the Apache License 2.0. We respect all original data copyrights and license requirements and share our work on this basis.

Citation

If you are using the code in this repo, please cite the following in your work:

Norman W. Paton, Zhenyu Wu, Dataset Discovery and Exploration: State-of-the-art, Challenges and Opportunities. Proceedings 27th International Conference on Extending Database Technology (EDBT), pp. 854–857, 2024. https://openproceedings.org/2024/conf/edbt/tutorial-3.pdf

Owner

  • Name: PierreWoL
  • Login: PierreWoL
  • Kind: user

Citation (CITATION.cff)

type: dataset
cff-version: 1.2.0
message: "This is the dataset that this tutorial uses."
title: "Dataset Discovery and Exploration: State-of-the-art, Challenges, and Opportunities"
version: 1.0.0
date-released: 2024-03-05
references:
  - title: "Web Data Commons - Schema.org Table Corpus 2023"
    year: 2023
    url: "https://webdatacommons.org/structureddata/schemaorgtables/2023/index.html#toc3"
    type: dataset
  - title: "T2Dv2 Gold Standard for Matching Web Tables to DBpedia"
    year: 2023
    url: "https://webdatacommons.org/webtables/goldstandardV2.html#Limaye2010"
    type: dataset

Dependencies

requirements.txt pypi
  • SPARQLWrapper ===2.0.0
  • country-list ===1.0.0
  • google-api-python-client ===2.122.0
  • jsonlines ===1.2.0
  • mmh3 ===4.0.1
  • networkx ===2.8.8
  • nltk ===3.8.1
  • numpy ===1.26.0
  • pandas ===2.2.1
  • plotly ===5.17.0
  • regex ===2023.10.3
  • requests ===2.31.0
  • scikit-learn ===1.3.1
  • scipy ===1.11.3
  • sentencepiece ===0.2.0
  • spacy ===4.0.0.dev2
  • sqlalchemy ===2.0.22
  • tensorboardX ===2.0
  • torch ===2.2.1
  • tqdm ===4.66.1
  • transformers ===4.36.0
  • urllib3 ===2.0.7