Repository
2021SP FORWARD Lab Project by Haozhe Si
Basic Info
- Host: GitHub
- Owner: Ehzoahis
- Language: Python
- Default Branch: main
- Size: 30.9 MB
Statistics
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
AcademicRank
2021SP FORWARD Lab Project
Introduction
The goal of the project is to rank academic works for a given keyword. The rank is calculated according to the Field of Study of each paper. The ranking algorithm is inspired by "The PageRank Citation Ranking: Bringing Order to the Web", with the assumption that the similarity between a paper and the target keywords can only be distributed once. Currently, the program only handles multi-word keywords, to ensure the accuracy of the ranking.
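As a rough illustration of the PageRank-style idea, the sketch below runs a personalized PageRank power iteration on a toy citation graph, using keyword similarity as the teleport distribution. The function name `personalized_pagerank`, the toy graph, the similarity values, and the damping factor of 0.85 are all assumptions made for illustration. This is not the code in academic_rank.py, and it does not model the "distributed once" assumption.

```python
from collections import defaultdict


def personalized_pagerank(edges, similarity, damping=0.85, iterations=50):
    """Power iteration over (citing, cited) edges.

    `similarity` maps paper ID -> non-negative keyword similarity and is
    normalized into the teleport (personalization) distribution.
    """
    nodes = set(similarity)
    for src, dst in edges:
        nodes.update((src, dst))

    total = sum(similarity.values())  # assumed to be > 0
    teleport = {n: similarity.get(n, 0.0) / total for n in nodes}

    out_links = defaultdict(list)
    for src, dst in edges:
        out_links[src].append(dst)

    rank = dict(teleport)
    for _ in range(iterations):
        new_rank = {n: (1.0 - damping) * teleport[n] for n in nodes}
        dangling = 0.0
        for n in nodes:
            if out_links[n]:
                # Split this paper's score evenly among the papers it cites.
                share = damping * rank[n] / len(out_links[n])
                for dst in out_links[n]:
                    new_rank[dst] += share
            else:
                # Papers citing nothing return their score via the teleport vector.
                dangling += damping * rank[n]
        for n in nodes:
            new_rank[n] += dangling * teleport[n]
        rank = new_rank
    return rank


# Hypothetical toy data: p1 and p2 cite p3, and p3 cites p4.
edges = [("p1", "p3"), ("p2", "p3"), ("p3", "p4")]
similarity = {"p1": 0.2, "p2": 0.2, "p3": 0.9, "p4": 0.1}
scores = personalized_pagerank(edges, similarity)
for paper in sorted(scores, key=scores.get, reverse=True):
    print(paper, round(scores[paper], 4))
```

In this toy graph the highly similar and well-cited paper p3 ends up on top, and the scores sum to 1 because both the teleport step and the dangling-node step redistribute all of the score mass.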
Installation
Install the dependencies listed in requirements.txt:
pip3 install -r requirements.txt
Datasets
Microsoft Academic Graph
The Microsoft Academic Graph is a heterogeneous graph containing scientific publication records, citation relationships between those publications, and fields of study. The schema of the dataset can be found here. Among the dataset files, we use:
- FieldsOfStudy
- PaperFieldsOfStudy
- PaperReferences
The downloaded data can be found on owl3 server, path.
Springer-83K CS Keywords
CS keywords collected from Springer by Yanghui Pang. The dataset can be found here.
word2vec Model
The word2vec model is trained on the abstracts of papers in the arXiv dataset by Edward Ma. The model can be found here.
Usage
Build the Pruned MAG Dataset
To speed up the ranking algorithm, first prune the Fields of Study (FoS) that are not CS keywords:
python3 prune_fos.py
The resulting FoS list will be in pruned_FOS.txt.
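A minimal sketch of this kind of pruning is shown here, assuming FoS records are (field ID, name) pairs and the CS keywords are a plain list of strings. The function name `prune_fos`, the sample rows, and the case-insensitive matching are illustrative assumptions and may differ from what prune_fos.py actually does.

```python
def prune_fos(fos_rows, cs_keywords):
    """Keep only the (fid, name) rows whose name is a known CS keyword."""
    keyword_set = {kw.strip().lower() for kw in cs_keywords}
    return [(fid, name) for fid, name in fos_rows
            if name.strip().lower() in keyword_set]


# Hypothetical sample data.
fos_rows = [("1", "Data mining"), ("2", "Organic chemistry"), ("3", "Machine learning")]
cs_keywords = ["data mining", "machine learning", "computer vision"]
pruned = prune_fos(fos_rows, cs_keywords)
print(pruned)
```

Building a set of keywords first keeps each membership check constant-time, which matters when the FoS table is large.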
Next, prune the papers and references that are not related to CS:
python3 prune_paper_edge.py
The resulting files are cspapers.txt and pruned_PR.txt.
If prune_fos.py or prune_paper_edge.py fails, check the original code, which is more stable.
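The paper and reference pruning can be sketched in the same spirit. The example assumes paper-to-FoS records are (paper ID, field ID) pairs, references are (citing, cited) pairs, and an edge is kept only when both endpoints are CS papers. That last rule and the name `prune_papers_and_edges` are assumptions, not a description of prune_paper_edge.py.

```python
def prune_papers_and_edges(paper_fos, references, kept_fids):
    """Return the set of CS paper IDs and the reference edges among them."""
    kept_fids = set(kept_fids)
    cs_papers = {pid for pid, fid in paper_fos if fid in kept_fids}
    pruned_refs = [(src, dst) for src, dst in references
                   if src in cs_papers and dst in cs_papers]
    return cs_papers, pruned_refs


# Hypothetical sample data: field "10" is a CS field, field "20" is not.
paper_fos = [("p1", "10"), ("p2", "20"), ("p3", "10")]
references = [("p1", "p3"), ("p2", "p3"), ("p3", "p1")]
cs_papers, pruned_refs = prune_papers_and_edges(paper_fos, references, {"10"})
print(sorted(cs_papers), pruned_refs)
```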
Perform AcademicRank
The preparation steps only need to be done once. To calculate the rank of papers for the given keywords, run:
python3 academic_rank.py [keyword1,keyword2,...]
Separate keywords with ',' and join the words of a multi-word keyword with '_'. For example:
python3 academic_rank.py computer_science,data_mining
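The keyword argument format can be decoded with a short helper. This sketch mirrors the `split(',')` and `replace('_', ' ')` handling visible in citation_cnt.py. The helper name `parse_keywords` is hypothetical, and it is an assumption that academic_rank.py parses its argument the same way.

```python
def parse_keywords(arg):
    """Split a comma-separated argument and turn '_' back into spaces."""
    return [kw.replace("_", " ") for kw in arg.split(",") if kw]


keywords = parse_keywords("computer_science,data_mining")
print(keywords)  # ['computer science', 'data mining']
```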
Visualization
academic_rank.py outputs a list of paper IDs. The titles of the papers can be looked up from those IDs using the MAG API. See the methods and examples in visualization.ipynb for more information.
Reservation
The accuracy of this program is not guaranteed, because the vocabulary of the word2vec model is not large enough and the keyword similarity therefore cannot be calculated in most cases. Currently, the program assigns a dummy similarity to keywords that are not in the word2vec model.
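The fallback behaviour described here can be sketched with plain Python vectors standing in for the word2vec model. The names `keyword_similarity` and `DUMMY_SIMILARITY`, the placeholder value 0.0, and the toy vectors are assumptions; the value the project actually assigns is not stated in this README.

```python
import math

DUMMY_SIMILARITY = 0.0  # assumed placeholder; the project's real value is not documented here


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def keyword_similarity(vectors, word_a, word_b, default=DUMMY_SIMILARITY):
    """Return cosine similarity, or `default` if either word is out of vocabulary."""
    if word_a not in vectors or word_b not in vectors:
        return default
    return cosine(vectors[word_a], vectors[word_b])


# Toy vocabulary standing in for the trained model.
vectors = {"data": [1.0, 0.0], "mining": [1.0, 1.0]}
in_vocab = keyword_similarity(vectors, "data", "mining")
out_of_vocab = keyword_similarity(vectors, "data", "blockchain")
print(in_vocab, out_of_vocab)
```

Because every out-of-vocabulary keyword receives the same placeholder, papers whose keywords are missing from the model cannot be distinguished from one another by similarity, which is the source of the accuracy caveat above.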
Author
- Haozhe Si
- Instructed by Professor Kevin Chang
Owner
- Name: Haozhe Si
- Login: Ehzoahis
- Kind: user
- Location: Champaign, IL
- Company: University of Illinois Urbana-Champaign
- Website: https://ehzoahis.github.io/
- Repositories: 1
- Profile: https://github.com/Ehzoahis
Citation (citation_cnt.py)
# Naively rank papers by their citation counts.
# Used as a baseline to compare against the AcademicRank result.
from heapq import nlargest
import sys
from tqdm import tqdm
from collections import defaultdict
from sqlalchemy import create_engine
EDGE_CNT = 1094935127  # total number of edges
PAPER_CNT = 355977380  # number of papers about CS
db_url = 'mysql+pymysql://USER:PASSWORD@owl2.cs.illinois.edu/haozhes3_refs?charset=utf8'  # substitute your own credentials
engine = create_engine(db_url)
mag_db = './pruned_PR_83k.txt'
mid2fos = './cspaper.txt'
# Build the dictionary for translating FOS to FId
def get_fid_dict():
    print('querying DB')
    q = ('select fid, fos'
         ' from fid2fos_83k')
    tuples = engine.execute(q).fetchall()
    fos2fid = dict()
    for item in tuples:
        fid = item[0]
        fos = item[1]
        fos2fid[fos] = fid
    return fos2fid
# Build the dictionary for checking FId given Paper ID
def generate_mid2fid():
    mid2fid = dict()
    with open(mid2fos, 'r') as f:
        for i, line in tqdm(enumerate(f), total=PAPER_CNT):
            item = line.strip('\n').split('\t')
            mid = item[0]
            fid = item[2]
            mid2fid[mid] = fid
    return mid2fid
# Count citations by accumulating the edges that point at each paper
def rank(mid2fid, fos2fid, fname, keyword):
    R = defaultdict(int)
    # Look up the target field ID once instead of on every edge
    target_fid = fos2fid[keyword.replace('_', ' ')]
    print("Read Edges...")
    with open(fname, 'r') as f:
        print('Read {} edges.'.format(EDGE_CNT))
        print('Iterate through edges...')
        for i, line in tqdm(enumerate(f), total=EDGE_CNT):
            _, dst = line.strip('\n').split('\t')
            # .get() avoids a KeyError for cited papers missing from mid2fid
            if mid2fid.get(dst) == target_fid:
                R[dst] += 1
    return R
# Order the rank and output the top_k results
def top_ranks(R, top_k=10):
    key_list = nlargest(top_k, R, key=R.get)
    key_rank = list()
    for key in key_list:
        key_rank.append((key, R[key]))
    return key_rank
# Write result into target file
def write(fname, key_rank):
    with open(fname, 'w') as f:
        for key, rank in key_rank:
            line = '{}\n'.format('\t'.join([key, str(rank)]))
            f.write(line)
# Accepts multiple keywords, separated by ','
if __name__ == "__main__":
    keywords = sys.argv[1]
    keywords = keywords.split(',')
    mid2fid = generate_mid2fid()
    fos2fid = get_fid_dict()
    for keyword in keywords:
        o_fname = keyword + '_83k_cont.txt'
        R = rank(mid2fid, fos2fid, mag_db, keyword)
        top_rank = top_ranks(R)
        write(o_fname, top_rank)
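The counting and top-k selection in this baseline can be tried without the database or the large edge file. The sketch below uses small in-memory stand-ins; the helper name `count_citations` and the sample data are hypothetical, while `defaultdict` and `heapq.nlargest` are the same standard-library tools the script imports.

```python
from collections import defaultdict
from heapq import nlargest


def count_citations(edges, paper_field, target_field):
    """Count incoming citations for papers that belong to target_field."""
    counts = defaultdict(int)
    for _, cited in edges:
        if paper_field.get(cited) == target_field:
            counts[cited] += 1
    return counts


# Hypothetical sample data: (citing, cited) edges and a paper -> field map.
edges = [("a", "x"), ("b", "x"), ("c", "y"), ("a", "z")]
paper_field = {"x": "f1", "y": "f1", "z": "f2"}
counts = count_citations(edges, paper_field, "f1")
top = nlargest(2, counts.items(), key=lambda kv: kv[1])
print(top)  # [('x', 2), ('y', 1)]
```

`nlargest` avoids sorting the whole dictionary, which is useful when only the top few of many papers are needed.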
Dependencies
- gensim *
- sqlalchemy *
- tqdm *