refactor-negative-sampler

Repository for paper "Enhancing PyKeen with Multiple Negative Sampling Solutions for Knowledge Graph Embedding Models"

https://github.com/ivandiliso/refactor-negative-sampler

Science Score: 49.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 2 DOI reference(s) in README
  • Academic publication links
    Links to: arxiv.org, zenodo.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (13.3%) to scientific vocabulary
Last synced: 7 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: ivandiliso
  • License: mit
  • Language: Python
  • Default Branch: main
  • Homepage:
  • Size: 552 MB
Statistics
  • Stars: 0
  • Watchers: 1
  • Forks: 0
  • Open Issues: 0
  • Releases: 1
Created 11 months ago · Last pushed 11 months ago
Metadata Files
Readme License Citation

README.md

Enhancing PyKeen with Multiple Negative Sampling Solutions for Knowledge Graph Embedding Models

Documentation

In-depth documentation and tutorials are available on the dedicated GitHub Pages site: https://ivandiliso.github.io/refactor-negative-sampler/

Folder Structure

    data -> Datasets used during training, validation and testing
        YAGO4-20
        FB15K
        WN18
        DB50K
    doc -> Documentation and logs
    cached -> Cached Negative Sampler subsets for faster computation
    model
        embedding -> Embedding model checkpoints
        sampling -> Checkpoints for models used in dynamic sampling
    experiments -> Experiment results after the HPO pipeline
    script -> Single execution files, settings, etc.
    src -> Source code
        extension -> Extensions of PyKEEN classes for negative sampling
        utils -> Utility files, libraries, logging
        notebooks -> Testing, single execution and code evaluation notebooks
    temp -> Temporary files

Dataset Structure

Each dataset is provided with the following folder structure:

    dataset_name
        mapping
            entity_to_id.json -> Dictionary mapping entity names (string) to IDs (integer)
            relation_to_id.json -> Dictionary mapping relation names (string) to IDs (integer)
        metadata
            entity_classes.json -> Dictionary mapping entity names (string) to classes (list of strings)
            relation_domain_range.json -> Dictionary mapping relation names (string) to domain and range classes (string)
        owl -> Additional schema-level information in OWL format
        train.txt -> Training split triples in TSV format (using string names)
        test.txt -> Testing split triples in TSV format (using string names)
        valid.txt -> Validation split triples in TSV format (using string names)
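As an illustration of how the mapping files and the TSV splits fit together, the following is a minimal sketch that reads one split and converts its string triples to integer IDs. The function name `load_split` is hypothetical and not part of the repository, and the column order (head, relation, tail) with no header row is an assumption based on the usual PyKEEN convention.

```python
import json
from pathlib import Path


def load_split(dataset_dir, split="train"):
    """Read one split file and map its string triples to integer ID triples.

    Hypothetical helper for illustration; not part of the repository.
    """
    root = Path(dataset_dir)
    # Mapping files, located as described in the dataset layout above
    entity_to_id = json.loads(
        (root / "mapping" / "entity_to_id.json").read_text(encoding="utf-8")
    )
    relation_to_id = json.loads(
        (root / "mapping" / "relation_to_id.json").read_text(encoding="utf-8")
    )

    triples = []
    with open(root / f"{split}.txt", encoding="utf-8") as handle:
        for line in handle:
            line = line.rstrip("\n")
            if not line:
                continue  # skip blank lines
            # Assumption: head<TAB>relation<TAB>tail, one triple per line
            head, relation, tail = line.split("\t")
            triples.append(
                (entity_to_id[head], relation_to_id[relation], entity_to_id[tail])
            )
    return triples
```

Calling `load_split("data/WN18", "valid")` would, under these assumptions, return a list of `(head_id, relation_id, tail_id)` tuples for the validation split.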

Extension Structure

```
src/extension
    constants.py -> Constant variables used across the whole library
    dataset.py -> Implementation of OnMemoryDataset
    filtering.py -> Implementation of NullPytonSetFilterer
    sampling.py -> Implementation of SubsetNegativeSampler and all the specific sampling strategies
    utils.py -> Utility functions
```

Instructions

A fully detailed tutorial is provided in src/tutorial.ipynb. Detailed instructions are available at https://ivandiliso.github.io/refactor-negative-sampler/

  1. Unzip the dataset files
  2. Install the dependencies found in the requirements.txt file
  3. Manually run the example Python files, or use one of the provided scripts in the scripts folder

The library is fully integrated into the PyKEEN ecosystem. If you need a quick bootstrap for using the library on the fly, follow this guide: three example files can be used to run, in order, an HPO pipeline, a normal pipeline, and the negative sampler evaluation. If you want to directly run an example configuration, you can find

hpo_pipeline.py

Runs a hyperparameter optimization (HPO) pipeline for the chosen model. It can be run with CLI arguments:

    python src/hpo_pipeline.py --dataset dataset_name --model model_name --sampler sampler_name --negatives number_negatives
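To make the shape of these arguments concrete, here is a minimal sketch of how the four flags shown above could be parsed with `argparse`. This is not the repository's actual parser: the function name `build_parser`, the help texts, and the choice to mark every flag as required are assumptions, and the sampler value `random` in the usage line is only a placeholder.

```python
import argparse


def build_parser():
    """Build a parser for the four flags shown in the command above (illustrative)."""
    parser = argparse.ArgumentParser(description="HPO pipeline launcher (sketch)")
    parser.add_argument("--dataset", required=True, help="dataset name")
    parser.add_argument("--model", required=True, help="embedding model name")
    parser.add_argument("--sampler", required=True, help="negative sampler name")
    # Assumption: the number of negatives is an integer count
    parser.add_argument("--negatives", type=int, required=True,
                        help="number of negatives per positive triple")
    return parser


# Usage: parse an explicit argument list instead of sys.argv
args = build_parser().parse_args(
    ["--dataset", "WN18", "--model", "TransE", "--sampler", "random", "--negatives", "5"]
)
print(args.dataset, args.model, args.sampler, args.negatives)
```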

pipeline.py

Runs a standard pipeline with the chosen model and statically defined parameters. It can be run with CLI arguments:

    python src/pipeline.py --dataset dataset_name --model model_name --sampler sampler_name --negatives number_negatives --l2 regularizer_weight --lr learning_rate --margin loss_margin
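For readers wondering how such static parameters relate to PyKEEN, the sketch below builds a keyword-argument dictionary in the form accepted by `pykeen.pipeline.pipeline`. The helper `build_pipeline_kwargs` is hypothetical, and the mapping of `--l2` to an Lp regularizer weight and of `--margin` to a margin ranking loss is an assumption; the repository's script may wire these values differently.

```python
def build_pipeline_kwargs(dataset, model, sampler, negatives, l2, lr, margin):
    """Translate CLI-style values into keyword arguments for PyKEEN's pipeline().

    Hypothetical helper for illustration; the loss and regularizer choices
    below are assumptions, not taken from the repository.
    """
    return {
        "dataset": dataset,
        "model": model,
        "negative_sampler": sampler,
        "negative_sampler_kwargs": {"num_negs_per_pos": negatives},
        # Assumption: --l2 is the weight of an L2 (p=2) regularizer
        "regularizer": "lp",
        "regularizer_kwargs": {"weight": l2, "p": 2.0},
        "optimizer_kwargs": {"lr": lr},
        # Assumption: --margin is the margin of a margin ranking loss
        "loss": "marginranking",
        "loss_kwargs": {"margin": margin},
    }


kwargs = build_pipeline_kwargs(
    dataset="WN18", model="TransE", sampler="basic",
    negatives=5, l2=1e-4, lr=1e-3, margin=1.0,
)
# With PyKEEN installed, the dictionary could be passed on as:
#   from pykeen.pipeline import pipeline
#   result = pipeline(**kwargs)
```

Here `basic` is PyKEEN's built-in uniform negative sampler; a sampler class from this repository's extension could be passed in its place.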

negative_evaluation.py

Example code showing how to compute negative sampler statistics for a specific dataset. The file also contains usage examples of dynamic sampling with a TransE model pretrained on YAGO4-20, and provides pre-written prediction functions that work with the provided model.

Cite our paper

    @misc{damato2025enhancingpykeenmultiplenegative,
      title={Enhancing PyKEEN with Multiple Negative Sampling Solutions for Knowledge Graph Embedding Models},
      author={Claudia d'Amato and Ivan Diliso and Nicola Fanizzi and Zafar Saeed},
      year={2025},
      eprint={2508.05587},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2508.05587},
    }

Owner

  • Name: Ivan Diliso
  • Login: ivandiliso
  • Kind: user
  • Location: Italy, Bari
  • Company: University of Bari Aldo Moro

PhD Student @ University of Bari Aldo Moro

GitHub Events

Total
  • Release event: 4
  • Push event: 14
  • Create event: 5
Last Year
  • Release event: 4
  • Push event: 14
  • Create event: 5

Dependencies

requirements.txt pypi
  • Jinja2 ==3.1.6
  • Mako ==1.3.10
  • MarkupSafe ==3.0.2
  • PySocks ==1.7.1
  • PyYAML ==6.0.2
  • Pygments ==2.19.1
  • SQLAlchemy ==2.0.40
  • alembic ==1.15.2
  • asttokens ==3.0.0
  • beautifulsoup4 ==4.13.4
  • certifi ==2025.4.26
  • charset-normalizer ==3.4.1
  • class-resolver ==0.6.0
  • click ==8.1.8
  • click-default-group ==1.2.4
  • colorlog ==6.9.0
  • comm ==0.2.2
  • dataclasses-json ==0.6.7
  • debugpy ==1.8.14
  • decorator ==5.2.1
  • docdata ==0.0.5
  • executing ==2.2.0
  • filelock ==3.18.0
  • fsspec ==2025.3.2
  • gdown ==5.2.0
  • greenlet ==3.2.1
  • idna ==3.10
  • ipykernel ==6.29.5
  • ipython ==9.2.0
  • ipython_pygments_lexers ==1.1.1
  • jedi ==0.19.2
  • joblib ==1.4.2
  • jupyter_client ==8.6.3
  • jupyter_core ==5.7.2
  • marshmallow ==3.26.1
  • matplotlib-inline ==0.1.7
  • more-click ==0.1.2
  • more-itertools ==10.7.0
  • mpmath ==1.3.0
  • mypy_extensions ==1.1.0
  • nest-asyncio ==1.6.0
  • networkx ==3.4.2
  • numpy ==2.2.5
  • nvidia-cublas-cu12 ==12.6.4.1
  • nvidia-cuda-cupti-cu12 ==12.6.80
  • nvidia-cuda-nvrtc-cu12 ==12.6.77
  • nvidia-cuda-runtime-cu12 ==12.6.77
  • nvidia-cudnn-cu12 ==9.5.1.17
  • nvidia-cufft-cu12 ==11.3.0.4
  • nvidia-cufile-cu12 ==1.11.1.6
  • nvidia-curand-cu12 ==10.3.7.77
  • nvidia-cusolver-cu12 ==11.7.1.2
  • nvidia-cusparse-cu12 ==12.5.4.2
  • nvidia-cusparselt-cu12 ==0.6.3
  • nvidia-nccl-cu12 ==2.26.2
  • nvidia-nvjitlink-cu12 ==12.6.85
  • nvidia-nvtx-cu12 ==12.6.77
  • optuna ==4.3.0
  • packaging ==25.0
  • pandas ==2.2.3
  • parso ==0.8.4
  • pexpect ==4.9.0
  • platformdirs ==4.3.8
  • prompt_toolkit ==3.0.51
  • psutil ==7.0.0
  • ptyprocess ==0.7.0
  • pure_eval ==0.2.3
  • pykeen ==1.11.1
  • pystow ==0.7.0
  • python-dateutil ==2.9.0.post0
  • pytz ==2025.2
  • pyzmq ==26.4.0
  • requests ==2.32.3
  • scikit-learn ==1.6.1
  • scipy ==1.15.2
  • setuptools ==80.0.1
  • six ==1.17.0
  • soupsieve ==2.7
  • stack-data ==0.6.3
  • sympy ==1.14.0
  • tabulate ==0.9.0
  • threadpoolctl ==3.6.0
  • torch ==2.7.0
  • torch-max-mem ==0.1.4
  • torch-ppr ==0.0.8
  • tornado ==6.4.2
  • tqdm ==4.67.1
  • traitlets ==5.14.3
  • triton ==3.3.0
  • typing-inspect ==0.9.0
  • typing_extensions ==4.13.2
  • tzdata ==2025.2
  • urllib3 ==2.4.0
  • wcwidth ==0.2.13