Science Score: 67.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 5 DOI reference(s) in README
  • Academic publication links
    Links to: zenodo.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (14.9%) to scientific vocabulary

Repository

Basic Info
  • Host: GitHub
  • Owner: INFORMSJoC
  • License: apache-2.0
  • Language: Python
  • Default Branch: main
  • Size: 3.49 MB
Statistics
  • Stars: 0
  • Watchers: 1
  • Forks: 1
  • Open Issues: 0
  • Releases: 1
Created almost 2 years ago · Last pushed almost 2 years ago
Metadata Files
Readme License Citation

README.md

pyJedAI: A Library with Resolution-related Structures and Procedures for Products

This project is distributed in association with the INFORMS Journal on Computing under the Apache 2.0 License.

The software and data in this repository are associated with the paper pyJedAI: A Library with Resolution-related Structures and Procedures for Products by Ekaterini Ioannou, Konstantinos Nikoletos and George Papadakis.

Version

The version used in the paper is

Release

Cite

To cite this software, please cite the paper and the software, using the following DOI.

@misc{pyjedaiProductMatching,
  author =    {Ekaterini Ioannou and Konstantinos Nikoletos and George Papadakis},
  publisher = {INFORMS Journal on Computing},
  title =     {pyJedAI: A Library with Resolution-related Structures and Procedures for Products},
  year =      {2024},
  doi =       {10.1287/ijoc.2023.0410.cd},
  note =      {Available for download at https://github.com/INFORMSJoC/2023.0410},
}

Authors

Description

This work presents an open-source Python library, named pyJedAI, which provides functionalities supporting the creation of algorithms related to Product Entity Resolution. Building on existing state-of-the-art resolution algorithms (Papadakis et al. 2021a), the tool supports a wide range of tasks required for processing product data collections. It can be easily used by researchers and practitioners to create algorithms that analyze products, for applications such as real-time ads bidding, sponsored search, or pricing determination. In essence, it allows users to import product data from the available sources, compare products in order to detect either similar or identical products, generate a graph representation using the products and desired relationships, and either visualize or export the outcome in various forms. Our experimental evaluation on data from well-known online retailers illustrates high accuracy and low execution time for the supported tasks. To the best of our knowledge, this is the first Python package to focus on product entities and provide this range of Product Entity Resolution functionalities.

Building

On Linux, to build the version used for this paper, execute the following two steps.

  1. Create and then activate a conda environment with Python 3.9 or 3.10:

         conda create --name pyJedAI_env python==3.9
         conda activate pyJedAI_env

  2. Then, install the tool using either:

         pip install pyjedai==0.1.7

     or, from the root directory of a cloned copy:

         git clone https://github.com/AI-team-UoA/pyJedAI.git
         cd pyJedAI
         pip install .

Please note that the tool requires pystringmatching, which can be installed (before step 2) using the command:

```
conda install conda-forge::pystringmatching
```

Usage

pLibTool

As described in the paper, the tool implements a comprehensive end-to-end process for realizing possible similarity relation operators. The process, shown in the figure above, consists of four steps: 1. data reading, 2. filtering, 3. verification, and 4. data writing and evaluation.
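To make the four steps concrete, here is a minimal, standard-library-only sketch of such a process on toy product titles. It does not use pyJedAI's API: every function name, the token-blocking filter, the Jaccard threshold of 0.5, and the toy data are assumptions chosen purely for illustration.

```python
from itertools import combinations

# Illustrative sketch only: these helpers are hypothetical and are NOT pyJedAI's API.

def read_products(rows):
    """Step 1 (data reading): map each product id to a set of lowercase tokens."""
    return {pid: set(title.lower().split()) for pid, title in rows}

def filter_candidates(products):
    """Step 2 (filtering): token blocking, i.e. pair products that share a token."""
    blocks = {}
    for pid, tokens in products.items():
        for tok in tokens:
            blocks.setdefault(tok, set()).add(pid)
    candidates = set()
    for ids in blocks.values():
        candidates.update(combinations(sorted(ids), 2))
    return candidates

def verify(products, candidates, threshold=0.5):
    """Step 3 (verification): keep pairs whose Jaccard similarity meets the threshold."""
    matches = set()
    for a, b in candidates:
        ta, tb = products[a], products[b]
        if len(ta & tb) / len(ta | tb) >= threshold:
            matches.add((a, b))
    return matches

def evaluate(matches, ground_truth):
    """Step 4 (evaluation): precision and recall against known duplicate pairs."""
    tp = len(matches & ground_truth)
    precision = tp / len(matches) if matches else 0.0
    recall = tp / len(ground_truth) if ground_truth else 0.0
    return precision, recall

# Toy data (made up for this sketch).
rows = [
    ("a1", "Sony Bravia 40 inch LCD TV"),
    ("b1", "sony bravia 40 inch tv"),
    ("a2", "Canon PowerShot camera"),
    ("b2", "Nikon Coolpix camera"),
]
ground_truth = {("a1", "b1")}

products = read_products(rows)
candidates = filter_candidates(products)
matches = verify(products, candidates)
precision, recall = evaluate(matches, ground_truth)
print(sorted(matches), precision, recall)
```

The filtering step exists to avoid comparing every pair of products: only pairs that share at least one token reach the more expensive verification step.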

Google Colab Hands-on demo:

The simplest way to reproduce and view the results of this paper is to use the Colab notebook here:

Alternatively, first complete the installation, then go to the src directory and run:

  • Blocking-based workflow: python blocking_workflow.py --dataset 'Abt - Buy'
  • Similarity join-based workflow: python similarity_joins_workflow.py --dataset 'Amazon - Google Products'
  • Nearest neighbor-based workflow: python nn_workflow.py --dataset 'Abt - Buy' --schema 'schema-agnostic'

where the available flag values are:

  • --dataset: 'Abt - Buy', 'Amazon - Google Products', 'Wallmart - Amazon'
  • --schema: 'schema-agnostic', 'schema-based' (available only for the NN workflow)
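As an illustration of how flags with a fixed set of allowed values can be validated, the following sketch uses Python's argparse. It is not the repository's actual parsing code: the function name, the choice to make --dataset required, and the default schema are all assumptions.

```python
import argparse

# Allowed values, copied from the flag descriptions in this README.
DATASETS = ["Abt - Buy", "Amazon - Google Products", "Wallmart - Amazon"]
SCHEMAS = ["schema-agnostic", "schema-based"]

def build_parser():
    """Hypothetical parser; the real workflow scripts may differ."""
    parser = argparse.ArgumentParser(description="Illustrative workflow launcher")
    # Assumption: the dataset must always be given explicitly.
    parser.add_argument("--dataset", choices=DATASETS, required=True)
    # Assumption: the default schema setting is invented for this sketch.
    parser.add_argument("--schema", choices=SCHEMAS, default="schema-agnostic")
    return parser

# Parse an explicit argument list instead of sys.argv so the sketch is self-contained.
args = build_parser().parse_args(["--dataset", "Abt - Buy", "--schema", "schema-based"])
print(args.dataset, args.schema)
```

With `choices`, argparse rejects any value outside the listed set and exits with a usage message, so a misspelled dataset name fails before any data is loaded.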

For the scalability test:

python dbpedia_scalability.py

Ongoing Development

The main tool is under ongoing development at the authors' GitHub site.

Documentation page

For more examples of this software, visit the Read the Docs website.

Support

For support in using this software, submit an issue.

Owner

  • Name: INFORMS Journal on Computing
  • Login: INFORMSJoC
  • Kind: organization

Repository for software and data associated with papers published in the INFORMS Journal on Computing

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
- family-names: "YOUR_NAME_HERE"
  given-names: "YOUR_NAME_HERE"
  orcid: "https://orcid.org/0000-0000-0000-0000"
- family-names: "Lisa"
  given-names: "Mona"
  orcid: "https://orcid.org/0000-0000-0000-0000"
title: "2023.0410"
version: 1.0.0
doi: 10.5281/zenodo.1234
date-released: 2024-07-12
url: "https://github.com/Nikoletos-K/2023.0410"

Dependencies

pyproject.toml pypi
  • faiss-cpu *
  • gensim *
  • matplotlib *
  • networkx *
  • nltk *
  • numpy >= 1.7.0,<2.0
  • ordered-set *
  • pandas *
  • py-stringmatching *
  • scipy ==1.12
  • seaborn *
  • sentence-transformers *
  • shapely *
  • tqdm *
  • transformers *
  • valentine python_version > '3.7'