2023.0410

https://github.com/informsjoc/2023.0410

Repository

Basic Info

Host: GitHub
Owner: INFORMSJoC
License: apache-2.0
Language: Python
Default Branch: main
Size: 3.49 MB

Statistics

Stars: 0
Watchers: 1
Forks: 1
Open Issues: 0
Releases: 1

Created almost 2 years ago · Last pushed almost 2 years ago

pyJedAI: A Library with Resolution-related Structures and Procedures for Products

This project is distributed in association with the INFORMS Journal on Computing under the Apache 2.0 License.

The software and data in this repository are associated with the paper pyJedAI: A Library with Resolution-related Structures and Procedures for Products by Ekaterini Ioannou, Konstantinos Nikoletos and George Papadakis.

Version

The version used in the paper is

Cite

To cite this software, please cite the paper and the software, using the following DOI.

@misc{pyjedaiProductMatching,
  author =    {Ekaterini Ioannou and Konstantinos Nikoletos and George Papadakis},
  publisher = {INFORMS Journal on Computing},
  title =     {pyJedAI: A Library with Resolution-related Structures and Procedures for Products},
  year =      {2024},
  doi =       {10.1287/ijoc.2023.0410.cd},
  note =      {Available for download at https://github.com/INFORMSJoC/2023.0410},
}

Authors

Ekaterini Ioannou, Assistant Professor at Tilburg University, The Netherlands
Konstantinos Nikoletos, Research Associate at University of Athens, Greece
George Papadakis, Senior Researcher at University of Athens, Greece

Description

This work presents an open-source Python library, named pyJedAI, which provides functionality for building algorithms related to Product Entity Resolution. Building on existing state-of-the-art resolution algorithms (Papadakis et al. 2021a), the tool supports many of the important tasks required for processing product data collections. It can be easily used by researchers and practitioners to create algorithms that analyze products, for applications such as real-time ad bidding, sponsored search, or pricing determination. In essence, it allows users to import product data from various sources, compare products in order to detect either similar or identical products, generate a graph representation using the products and the desired relationships, and either visualize or export the outcome in various forms. Our experimental evaluation on data from well-known online retailers illustrates high accuracy and low execution time for the supported tasks. To the best of our knowledge, this is the first Python package to focus on product entities and provide this range of Product Entity Resolution functionality.
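To make the "compare products, then generate a graph" idea concrete, here is a minimal, self-contained sketch. It does not use the pyJedAI API: the function names, the token-based Jaccard similarity, the threshold value, and the sample product titles are all assumptions chosen for illustration.

```python
from itertools import combinations


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity between the lowercase token sets of two strings."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def build_similarity_graph(products: dict, threshold: float) -> dict:
    """Return an adjacency map linking products whose similarity >= threshold."""
    graph = {pid: set() for pid in products}
    for (p, title_p), (q, title_q) in combinations(products.items(), 2):
        if jaccard(title_p, title_q) >= threshold:
            graph[p].add(q)
            graph[q].add(p)
    return graph


def connected_components(graph: dict) -> list:
    """Group products into clusters by traversing the similarity graph."""
    seen, components = set(), []
    for start in graph:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            component.add(node)
            stack.extend(graph[node] - seen)
        components.append(component)
    return components


# Hypothetical product titles, not taken from the paper's datasets.
products = {
    "a1": "Sony Bravia 40 inch LED TV",
    "b1": "sony bravia led tv 40 inch",
    "a2": "Canon PowerShot camera",
    "b2": "Apple iPod nano 8GB",
}
graph = build_similarity_graph(products, threshold=0.8)
clusters = connected_components(graph)
print(clusters)
```

Treating each connected component as one real-world product is only one possible clustering choice; the comparison of all pairs is also quadratic, which is why real workflows filter candidates first.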

Building

On Linux, to build the version used for this paper, execute the following two steps.

1. Create and then activate a conda environment with Python 3.9 or 3.10:

    conda create --name pyJedAI_env python==3.9
    conda activate pyJedAI_env

2. Then, install the tool using either:

    pip install pyjedai==0.1.7

or by cloning the repository and installing from its root directory:

    git clone https://github.com/AI-team-UoA/pyJedAI.git
    cd pyJedAI
    pip install .

Please note that it requires pystringmatching, which can be installed (before step 2) using the command:

```
conda install conda-forge::pystringmatching
```
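After installing, it can be useful to confirm that the environment holds the version used in the paper. The following is a small sketch using only the standard library; the helper name `installed_version` is hypothetical, and the distribution name `pyjedai` is taken from the pip command above.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


EXPECTED = "0.1.7"  # version pinned in the installation step above
found = installed_version("pyjedai")
if found is None:
    print("pyjedai is not installed in this environment")
elif found != EXPECTED:
    print(f"pyjedai {found} is installed, but the paper used {EXPECTED}")
else:
    print(f"pyjedai {found} matches the version used in the paper")
```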

Usage

As described in the paper, the tool implements a comprehensive end-to-end process for realizing possible similarity relation operators. The process, shown in the figure above, consists of four steps: 1. data reading, 2. filtering, 3. verification, and 4. data writing and evaluation.
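The four steps can be illustrated with a toy, standard-library-only pipeline. This is a sketch under stated assumptions rather than the pyJedAI implementation: token blocking stands in for the filtering step, a Jaccard threshold stands in for verification, and the records, ground truth, and function names are all made up for the example.

```python
import csv
import io
from collections import defaultdict

# Hypothetical input data; in practice these would be files on disk.
D1 = "id,name\na1,Sony Bravia 40 inch LED TV\na2,Canon PowerShot SX530 camera\n"
D2 = "id,name\nb1,Sony Bravia LED TV 40 inch black\nb2,Canon EOS camera bag\n"
GROUND_TRUTH = {("a1", "b1")}


def read_records(text: str) -> dict:
    """Step 1 (data reading): parse CSV text into an id -> name mapping."""
    return {row["id"]: row["name"] for row in csv.DictReader(io.StringIO(text))}


def tokens(value: str) -> set:
    return set(value.lower().split())


def token_blocking(left: dict, right: dict) -> set:
    """Step 2 (filtering): keep only cross-dataset pairs sharing a token."""
    index = defaultdict(set)
    for rid, name in right.items():
        for tok in tokens(name):
            index[tok].add(rid)
    return {(lid, rid)
            for lid, name in left.items()
            for tok in tokens(name)
            for rid in index.get(tok, ())}


def verify(candidates: set, left: dict, right: dict, threshold: float) -> set:
    """Step 3 (verification): accept candidates with Jaccard >= threshold."""
    matches = set()
    for lid, rid in candidates:
        a, b = tokens(left[lid]), tokens(right[rid])
        if len(a & b) / len(a | b) >= threshold:
            matches.add((lid, rid))
    return matches


def evaluate(matches: set, truth: set) -> tuple:
    """Step 4 (evaluation): precision and recall against the ground truth."""
    true_positives = len(matches & truth)
    precision = true_positives / len(matches) if matches else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall


left, right = read_records(D1), read_records(D2)
candidates = token_blocking(left, right)
matches = verify(candidates, left, right, threshold=0.5)
precision, recall = evaluate(matches, GROUND_TRUTH)
print(sorted(matches), precision, recall)
```

Filtering reduces the number of pairs that the more expensive verification step has to score; here it drops the two cross pairs that share no token at all.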

Google Colab Hands-on demo:

The simplest way to reproduce and view the results of this paper is to use the Colab notebook here:

Alternatively, first complete the installation, then go to the src directory and run:

Blocking-based workflow: python blocking_workflow.py --dataset 'Abt - Buy'
Similarity join-based workflow: python similarity_joins_workflow.py --dataset 'Amazon - Google Products'
Nearest neighbor-based workflow: python nn_workflow.py --dataset 'Abt - Buy' --schema 'schema-agnostic'

where the available values for the --dataset flag are {'Abt - Buy', 'Amazon - Google Products', 'Wallmart - Amazon'} and the available values for the --schema flag are {'schema-agnostic', 'schema-based'}. The --schema flag is available only for the NN workflow.
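For readers adapting the scripts, the flag behavior described above could be expressed with argparse roughly as follows. This is a hypothetical sketch, not the code of the repository's scripts: the `build_parser` helper, making `--dataset` required, and the default value of `--schema` are assumptions; only the flag names and allowed values come from the text above.

```python
import argparse

# Allowed values as listed above (spelling kept exactly as documented).
DATASETS = ["Abt - Buy", "Amazon - Google Products", "Wallmart - Amazon"]
SCHEMAS = ["schema-agnostic", "schema-based"]


def build_parser(with_schema: bool = False) -> argparse.ArgumentParser:
    """Build a parser; with_schema=True mimics the NN workflow's extra flag."""
    parser = argparse.ArgumentParser(description="Run a product ER workflow.")
    parser.add_argument("--dataset", choices=DATASETS, required=True,
                        help="name of the dataset pair to process")
    if with_schema:
        # Assumption: default to schema-agnostic when the flag is omitted.
        parser.add_argument("--schema", choices=SCHEMAS,
                            default="schema-agnostic",
                            help="schema setting (NN workflow only)")
    return parser


args = build_parser(with_schema=True).parse_args(
    ["--dataset", "Abt - Buy", "--schema", "schema-based"]
)
print(args.dataset, args.schema)
```

Using `choices` makes argparse reject unknown dataset names with a usage message instead of failing later inside the workflow.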

For the scalability test:

python dbpedia_scalability.py

Ongoing Development

The main tool is being developed on an ongoing basis at the authors' GitHub site.

Documentation page

To view more examples of this software, visit the Read the Docs website.

Support

For support in using this software, submit an issue.

Owner

Name: INFORMS Journal on Computing
Login: INFORMSJoC
Kind: organization

Website: https://pubsonline.informs.org/journal/ijoc
Repositories: 32
Profile: https://github.com/INFORMSJoC

Repository for software and data associated with papers published in the INFORMS Journal on Computing

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
- family-names: "YOUR_NAME_HERE"
  given-names: "YOUR_NAME_HERE"
  orcid: "https://orcid.org/0000-0000-0000-0000"
- family-names: "Lisa"
  given-names: "Mona"
  orcid: "https://orcid.org/0000-0000-0000-0000"
title: "2023.0410"
version: 1.0.0
doi: 10.5281/zenodo.1234
date-released: 2024-07-12
url: "https://github.com/Nikoletos-K/2023.0410"

Dependencies

pyproject.toml pypi

faiss-cpu *
gensim *
matplotlib *
networkx *
nltk *
numpy >= 1.7.0,<2.0
ordered-set *
pandas *
py-stringmatching *
scipy ==1.12
seaborn *
sentence-transformers *
shapely *
tqdm *
transformers *
valentine python_version > '3.7'
