dataset_recommendation_using_ensemble_approach-coauthor-
https://github.com/xuwang0010/dataset_recommendation_using_ensemble_approach-coauthor-
Science Score: 57.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: Found CITATION.cff file
- ✓ codemeta.json file: Found codemeta.json file
- ✓ .zenodo.json file: Found .zenodo.json file
- ✓ DOI references: Found 8 DOI reference(s) in README
- ○ Academic publication links
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: Low similarity (10.3%) to scientific vocabulary
Repository
Basic Info
- Host: GitHub
- Owner: xuwang0010
- License: gpl-3.0
- Language: Python
- Default Branch: main
- Size: 57.6 KB
Statistics
- Stars: 0
- Watchers: 1
- Forks: 1
- Open Issues: 1
- Releases: 0
Metadata Files
README.md
Dataset_Recommendation_using_ensembel_approach-CoAuthor-
Introduction
This repository contains the required dataset and the Python implementation for the paper "Recommending Scientific Datasets Using Author Networks in Ensemble Methods" by Xu Wang, Frank van Harmelen and Zhisheng Huang.
Requirements before running the experiment
Make sure your Python version is >= 3.6. Install the following libraries in your Python environment with pip:
- pybind11
- pyHDT or rdflib-hdt
- numpy
- gensim
- tqdm
- rank_bm25
- networkx

Or simply run pip install -r requirements.txt to install all required libraries.
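Before running the experiment, it can help to confirm that the environment actually has these libraries. The following is a minimal sketch and not part of the repository; it assumes the import names `hdt` for pyHDT and `rdflib_hdt` for rdflib-hdt, and the helper names are hypothetical.

```python
import importlib.util
import sys

# Import names are assumptions: pyHDT is imported as `hdt`,
# rdflib-hdt as `rdflib_hdt`; the others match their package names.
REQUIRED = ["pybind11", "numpy", "gensim", "tqdm", "rank_bm25", "networkx"]
HDT_ALTERNATIVES = ["hdt", "rdflib_hdt"]  # either one is enough


def is_installed(module_name):
    """Return True if the top-level module can be located without importing it."""
    return importlib.util.find_spec(module_name) is not None


def missing_requirements():
    """List the required modules that cannot be found in this environment."""
    missing = [name for name in REQUIRED if not is_installed(name)]
    if not any(is_installed(name) for name in HDT_ALTERNATIVES):
        missing.append("hdt or rdflib_hdt")
    return missing


if __name__ == "__main__":
    if sys.version_info < (3, 6):
        print("Python >= 3.6 is required")
    for name in missing_requirements():
        print("missing:", name)
```

An empty list from `missing_requirements()` means every listed library was found.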
Dataset
The datasets needed for our ensemble dataset recommendation algorithm:
- MAKG coauthor RDF/HDT file download link
- MAKG paper/dataset title RDF/HDT file download link
- MAKG paper/dataset abstract RDF/HDT file download link
- MAKG pretrained author-entity embedding download link
- Seed dataset/paper txt file, one dataset per line
- Candidate dataset/paper txt file, one dataset per line
- Gold standard links between seeds and candidates, as an RDF/HDT file
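The seed and candidate files are plain text with one dataset per line, so they can be read with a few lines of Python. This is a sketch under the assumption that blank lines and surrounding whitespace should be ignored; `parse_ids` and `read_ids` are hypothetical helper names, not functions from the repository.

```python
def parse_ids(text):
    """Split text into one identifier per line, dropping blank lines."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def read_ids(path):
    """Read a seed or candidate .txt file that holds one dataset per line."""
    with open(path, encoding="utf-8") as handle:
        return parse_ids(handle.read())
```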
Python Implementation of Dataset Recommendation with Co-author network in ensemble methods
The algorithm in the paper is implemented in Recommend_walk_embed_bm.py:
- Graph walk implementation
  - graphwalk function in line 47
  - lines 217-220 of step function
- Author entity embedding similarity
  - clean_candidate_with_ent_embed in line 107
  - lines 221-222 of step function
- BM25
  - lines 253-260 of step function
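The three components listed above can be illustrated with small, dependency-free building blocks. This is only a sketch of the general techniques, not the repository's code: the function names, the dictionary-based co-author graph, and the hand-written Okapi BM25 variant (used here instead of rank_bm25) are all assumptions, and the way the script chains the components is not reproduced.

```python
import math
from collections import Counter, deque


def graph_walk(coauthors, start_authors, hops):
    """Breadth-first walk: authors reachable within `hops` co-author links."""
    seen = set(start_authors)
    frontier = deque((author, 0) for author in start_authors)
    while frontier:
        author, depth = frontier.popleft()
        if depth == hops:
            continue  # do not expand beyond the hop limit
        for neighbour in coauthors.get(author, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                frontier.append((neighbour, depth + 1))
    return seen


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Okapi BM25 score of each tokenised document for a tokenised query."""
    if not docs:
        return []
    n = len(docs)
    avgdl = sum(len(doc) for doc in docs) / n
    df = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # "+ 1" inside the log keeps the idf non-negative (an assumed variant).
            idf = math.log((n - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            length_norm = 1 - b + b * len(doc) / avgdl
            score += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * length_norm)
        scores.append(score)
    return scores
```

In this sketch, the hop number bounds `graph_walk`, the embedding threshold would be compared against `cosine` values of author vectors, and the BM25 threshold would be compared against `bm25_scores` computed over titles or abstracts.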
Usage
```
usage: Recommend_walk_embed_bm.py [-h] -th THRESHOLD -bth BM25_THRESHOLD -hp HOP
                                  -sd SEED -cd CANDIDATE -gd STANDARD [-d DIR]

optional arguments:
  -h, --help            show this help message and exit
  -th THRESHOLD, --threshold THRESHOLD
                        Threshold for similarity between entity(author) embedding
  -bth BM25_THRESHOLD, --bm25threshold BM25_THRESHOLD
                        Threshold for BM25 ranking
  -hp HOP, --hop HOP    Hop number for graph walk
  -sd SEED, --seed SEED
                        Path to [seed file].txt
  -cd CANDIDATE, --candidate CANDIDATE
                        Path to [candidate file].txt
  -gd STANDARD, --standard STANDARD
                        Path to [standard file].hdt
  -d DIR, --dir DIR     Directory to read all needed files and to store all results.
                        Default is directory of this python file
```
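For readers who want to see how such a command line maps onto `argparse`, here is a sketch that mirrors the usage text. The flag names and help strings come from the usage message; the argument types (`float` for the thresholds, `int` for the hop number) and the `None` placeholder for the `--dir` default are assumptions, since the actual script may define them differently.

```python
import argparse


def build_parser():
    """Build a parser that mirrors the usage message (types are assumed)."""
    parser = argparse.ArgumentParser(prog="Recommend_walk_embed_bm.py")
    parser.add_argument("-th", "--threshold", type=float, required=True,
                        help="Threshold for similarity between entity(author) embedding")
    parser.add_argument("-bth", "--bm25threshold", type=float, required=True,
                        metavar="BM25_THRESHOLD", help="Threshold for BM25 ranking")
    parser.add_argument("-hp", "--hop", type=int, required=True,
                        help="Hop number for graph walk")
    parser.add_argument("-sd", "--seed", required=True,
                        help="Path to [seed file].txt")
    parser.add_argument("-cd", "--candidate", required=True,
                        help="Path to [candidate file].txt")
    parser.add_argument("-gd", "--standard", required=True,
                        help="Path to [standard file].hdt")
    # The real script defaults to its own directory; None is a stand-in here.
    parser.add_argument("-d", "--dir", default=None,
                        help="Directory to read all needed files and to store all results")
    return parser


if __name__ == "__main__":
    print(build_parser().parse_args())
```

Marking the six flags as `required=True` is what makes `argparse` list them under "optional arguments" while still showing them without brackets in the usage line.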
Sample experiment
The sample experiment uses 41 seed datasets and 116 candidate datasets, with 117 gold standard links.
python Recommend_walk_embed_bm.py -th [threshold of embedding similarity] -bth [threshold of bm25] -hp [hop number of graph walk] -sd [path_to_seed] -cd [path_to_candidate] -gd [path_to_standard.hdt] -d [path_to_dir_of_all_datasets]
After the script finishes, it writes a result file to the directory, with one line per seed dataset in the following format:
seed_dataset_id[Tab Separated]Correct_Count[Tab Separated]Standard_Count[Tab Separated]Recommended_Count[Tab Separated]Recall[Tab Separated]Precision
where:
- Standard_Count is the number of gold standard datasets linked to the seed dataset;
- Recommended_Count is the number of datasets returned by the recommendation algorithm for the seed dataset;
- Correct_Count is the size of the intersection between the gold standard datasets and the datasets returned by the recommendation algorithm for the seed dataset;
- Recall is Correct_Count divided by Standard_Count;
- Precision is Correct_Count divided by Recommended_Count.
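The definitions above translate directly into set operations. The sketch below computes the six fields for one seed dataset and joins them with tabs; the function names are hypothetical, and returning 0.0 when a denominator is zero is an assumption, since the result format does not say how that case is handled.

```python
def evaluate(seed_id, standard, recommended):
    """Return the six result fields for one seed dataset."""
    standard, recommended = set(standard), set(recommended)
    correct = len(standard & recommended)
    # Assumption: report 0.0 instead of dividing by zero.
    recall = correct / len(standard) if standard else 0.0
    precision = correct / len(recommended) if recommended else 0.0
    return (seed_id, correct, len(standard), len(recommended), recall, precision)


def format_line(fields):
    """Join the fields into one tab-separated result line."""
    return "\t".join(str(field) for field in fields)


if __name__ == "__main__":
    row = evaluate("seed_1", {"a", "b", "c", "d"}, {"a", "b", "x", "y"})
    print(format_line(row))
```

With 2 correct datasets out of 4 in the gold standard and 4 recommended, both recall and precision are 0.5.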
License
This repository is licensed under GNU General Public License v3.0.
The Microsoft Academic Knowledge Graph, the linked data description files, and the ontology are licensed under the Open Data Commons Attribution License (ODC-By) v1.0.
Citation of Data
Wang, Xu, 2022, "Data For "Recommending Scientific Datasets Using Author Networks in Ensemble Methods"", https://doi.org/10.34894/W6C7P7, DataverseNL, V1
Owner
- Name: XuWang1991
- Login: xuwang0010
- Kind: user
- Repositories: 2
- Profile: https://github.com/xuwang0010
Citation (CITATION.cff)
# This CITATION.cff file was generated with cffinit.
# Visit https://bit.ly/cffinit to generate yours today!
cff-version: 1.2.0
title: >-
  Dataset_Recommendation_using_ensembel_approach-CoAuthor
message: >-
  Python implementation of paper "Recommending
  Scientific Datasets Using Author Networks in
  Ensemble Methods"
type: software
authors:
  - given-names: Xu
    family-names: Wang
    email: xu.wang@vu.nl
    affiliation: Vrije Universiteit Amsterdam
    orcid: 'https://orcid.org/0000-0002-7585-759X'
Dependencies
- gensim *
- hdt *
- networkx *
- numpy *
- pybind11 *
- rank_bm25 *
- tqdm *