dataset_recommendation_using_ensemble_approach-coauthor-

https://github.com/xuwang0010/dataset_recommendation_using_ensemble_approach-coauthor-

Science Score: 57.0%

This score indicates how likely this project is to be science-related based on various indicators:

✓ CITATION.cff file: found
✓ codemeta.json file: found
✓ .zenodo.json file: found
✓ DOI references: found 8 DOI reference(s) in README
○ Academic publication links
○ Academic email domains
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: low similarity (10.3%) to scientific vocabulary


Repository

Basic Info

Host: GitHub
Owner: xuwang0010
License: gpl-3.0
Language: Python
Default Branch: main
Size: 57.6 KB

Statistics

Stars: 0
Watchers: 1
Forks: 1
Open Issues: 1
Releases: 0

Created about 4 years ago · Last pushed almost 4 years ago


Dataset_Recommendation_using_ensembel_approach-CoAuthor-

Introduction

This repository contains the required dataset and the Python implementation for the paper "Recommending Scientific Datasets Using Author Networks in Ensemble Methods" by Xu Wang, Frank van Harmelen, and Zhisheng Huang.

Requirements before running the experiment

Make sure your Python version is >= 3.6. Use pip to install the following libraries in your Python environment:

- pybind11
- pyHDT or rdflib-hdt
- numpy
- gensim
- tqdm
- rank_bm25
- networkx

Alternatively, run `pip install -r requirements.txt` to install all required libraries.

Dataset

The datasets needed for our ensemble dataset recommendation algorithm:

- MAKG coauthor RDF/HDT file (download link)
- MAKG paper/dataset title RDF/HDT file (download link)
- MAKG paper/dataset abstract RDF/HDT file (download link)
- MAKG pretrained author-entity embedding (download link)
- Seed dataset/paper txt file, one dataset per line
- Candidate dataset/paper txt file, one dataset per line
- Gold standard links between seeds and candidates, as an RDF/HDT file

Python Implementation of Dataset Recommendation with Co-author network in ensemble methods

The algorithm in the paper is implemented in Recommend_walk_embed_bm.py:

- Graph walk implementation
  - graphwalk function in line 47
  - line 217-220 of step function
- Author entity embedding similarity
  - clean_candidate_with_ent_embed in line 107
  - line 221-222 of step function
- BM25
  - line 253-260 of step function
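To make the first two components more concrete, here is a minimal sketch of a hop-limited walk over a co-author graph followed by an embedding-similarity filter. It is not the code from the repository: the function names, the adjacency-dict graph, the toy author ids, and the use of cosine similarity are all assumptions made for illustration, and it uses only the standard library rather than networkx or gensim.

```python
import math
from collections import deque


def authors_within_hops(coauthors, start_authors, max_hops):
    """Breadth-first walk: return every author reachable in at most max_hops steps."""
    seen = set(start_authors)
    frontier = deque((author, 0) for author in start_authors)
    while frontier:
        author, depth = frontier.popleft()
        if depth == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbour in coauthors.get(author, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                frontier.append((neighbour, depth + 1))
    return seen


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (assumed metric)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def filter_by_embedding(seed_author, reached, embeddings, threshold):
    """Keep reached authors whose embedding is similar enough to the seed author's."""
    seed_vec = embeddings[seed_author]
    return {
        author
        for author in reached
        if author != seed_author
        and author in embeddings
        and cosine(seed_vec, embeddings[author]) >= threshold
    }


# Toy data, invented for illustration only.
coauthors = {"a1": ["a2"], "a2": ["a1", "a3"], "a3": ["a2", "a4"], "a4": ["a3"]}
embeddings = {"a1": [1.0, 0.0], "a2": [0.9, 0.1], "a3": [0.0, 1.0]}

reached = authors_within_hops(coauthors, ["a1"], max_hops=2)
similar = filter_by_embedding("a1", reached, embeddings, threshold=0.8)
```

In this toy graph the walk with two hops stops before reaching a4, and the filter then drops a3 because its vector is orthogonal to a1's. The hop limit and the similarity threshold play the roles of the HOP and THRESHOLD arguments of the script.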

Usage

```
usage: Recommend_walk_embed_bm.py [-h] -th THRESHOLD -bth BM25_THRESHOLD -hp HOP -sd SEED -cd CANDIDATE -gd STANDARD [-d DIR]

optional arguments:
  -h, --help            show this help message and exit
  -th THRESHOLD, --threshold THRESHOLD
                        Threshold for similarity between entity(author) embedding
  -bth BM25_THRESHOLD, --bm25_threshold BM25_THRESHOLD
                        Threshold for BM25 ranking
  -hp HOP, --hop HOP    Hop number for graph walk
  -sd SEED, --seed SEED
                        Path to [seed file].txt
  -cd CANDIDATE, --candidate CANDIDATE
                        Path to [candidate file].txt
  -gd STANDARD, --standard STANDARD
                        Path to [standard file].hdt
  -d DIR, --dir DIR     Directory to read all needed files and to store all results. Default is directory of this python file
```
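The help text above corresponds to a parser along these lines. This is a sketch reconstructed from the help output, not the script's actual source: the argument types (float and int) are assumptions, the long spelling `--bm25_threshold` is inferred from the `BM25_THRESHOLD` metavar, and the file names in the example call are hypothetical.

```python
import argparse


def build_parser():
    """Build a parser that mirrors the documented options (types are assumed)."""
    parser = argparse.ArgumentParser(prog="Recommend_walk_embed_bm.py")
    parser.add_argument("-th", "--threshold", type=float, required=True,
                        help="Threshold for similarity between entity(author) embedding")
    # Long option spelling inferred from the BM25_THRESHOLD metavar.
    parser.add_argument("-bth", "--bm25_threshold", type=float, required=True,
                        help="Threshold for BM25 ranking")
    parser.add_argument("-hp", "--hop", type=int, required=True,
                        help="Hop number for graph walk")
    parser.add_argument("-sd", "--seed", required=True,
                        help="Path to [seed file].txt")
    parser.add_argument("-cd", "--candidate", required=True,
                        help="Path to [candidate file].txt")
    parser.add_argument("-gd", "--standard", required=True,
                        help="Path to [standard file].hdt")
    # None here stands in for "directory of this python file".
    parser.add_argument("-d", "--dir", default=None,
                        help="Directory to read all needed files and to store all results")
    return parser


# Hypothetical invocation; parse_args only reads the strings, it opens no files.
args = build_parser().parse_args([
    "-th", "0.8", "-bth", "5", "-hp", "2",
    "-sd", "seeds.txt", "-cd", "candidates.txt", "-gd", "standard.hdt",
])
```

All six options other than `-d` are required, which matches the usage line, where only `[-h]` and `[-d DIR]` appear in brackets.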

Sample experiment

The sample experiment uses 41 seed datasets and 116 candidate datasets, with 117 gold standard links.

python Recommend_walk_embed_bm.py -th [threshold of embedding similarity] -bth [threshold of bm25] -hp [hop number of graph walk] -sd [path_to_seed] -cd [path_to_candidate] -gd [path_to_standard.hdt] -d [path_to_dir_of_all_datasets]

After the script finishes, it writes a result file to the directory, with one line per seed dataset in the following format:

seed_dataset_id[Tab Separated]Correct_Count[Tab Separated]Standard_Count[Tab Separated]Recommended_Count[Tab Separated]Recall[Tab Separated]Precision

where Standard_Count is the number of gold standard datasets linked to the seed dataset; Recommended_Count is the number of datasets returned by the recommendation algorithm for the seed dataset; Correct_Count is the size of the intersection between the gold standard datasets and the datasets returned by the recommendation algorithm for the seed dataset; Recall is Correct_Count divided by Standard_Count; Precision is Correct_Count divided by Recommended_Count.
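As a worked example of these definitions, the sketch below computes the six fields for one seed dataset and joins them with tabs. The function name and the dataset ids are hypothetical, and returning 0.0 when a denominator is zero is an assumption, since the README does not say how the script handles empty sets.

```python
def result_line(seed_id, standard, recommended):
    """Format one tab-separated result line from two collections of dataset ids."""
    standard, recommended = set(standard), set(recommended)
    correct = len(standard & recommended)  # Correct_Count
    # Assumption: report 0.0 instead of dividing by zero.
    recall = correct / len(standard) if standard else 0.0
    precision = correct / len(recommended) if recommended else 0.0
    fields = [seed_id, correct, len(standard), len(recommended), recall, precision]
    return "\t".join(str(field) for field in fields)


# Hypothetical ids: 4 gold standard links, 3 recommendations, 2 of them correct.
line = result_line("seed_1", {"d1", "d2", "d3", "d4"}, {"d2", "d3", "d9"})
```

Here Recall is 2 / 4 and Precision is 2 / 3, so a recommender that returns fewer but better-targeted datasets raises precision without necessarily changing recall.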

License

This repository is licensed under GNU General Public License v3.0.

The Microsoft Academic Knowledge Graph, the linked data description files, and the ontology are licensed under the Open Data Commons Attribution License (ODC-By) v1.0.

Citation of Data

Wang, Xu, 2022, "Data For "Recommending Scientific Datasets Using Author Networks in Ensemble Methods"", https://doi.org/10.34894/W6C7P7, DataverseNL, V1

Owner

Name: XuWang1991
Login: xuwang0010
Kind: user

Repositories: 2
Profile: https://github.com/xuwang0010

Citation (CITATION.cff)

# This CITATION.cff file was generated with cffinit.
# Visit https://bit.ly/cffinit to generate yours today!

cff-version: 1.2.0
title: >-
  Dataset_Recommendation_using_ensembel_approach-CoAuthor
message: >-
  Python implementation of paper "Recommending
  Scientific Datasets Using Author Networks in
  Ensemble Methods"
type: software
authors:
  - given-names: Xu
    family-names: Wang
    email: xu.wang@vu.nl
    affiliation: Vrije Universiteit Amsterdam
    orcid: 'https://orcid.org/0000-0002-7585-759X'


Dependencies

requirements.txt pypi

gensim *
hdt *
networkx *
numpy *
pybind11 *
rank_bm25 *
tqdm *
