academicrank

2021SP FORWARD Lab Project by Haozhe Si

https://github.com/ehzoahis/academicrank

Science Score: 28.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
  • .zenodo.json file
  • DOI references
  • Academic publication links
    Links to: arxiv.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (14.1%) to scientific vocabulary
Last synced: 10 months ago

Repository


Basic Info
  • Host: GitHub
  • Owner: Ehzoahis
  • Language: Python
  • Default Branch: main
  • Size: 30.9 MB
Statistics
  • Stars: 1
  • Watchers: 1
  • Forks: 0
  • Open Issues: 0
  • Releases: 0
Created about 5 years ago · Last pushed about 5 years ago

README.md

AcademicRank

2021SP FORWARD Lab Project

Introduction

The goal of this project is to rank academic works for a given keyword. Ranks are computed with respect to each paper's Field of Study. The ranking algorithm is inspired by The PageRank Citation Ranking: Bringing Order to the Web, with the assumption that the similarity between the papers and the target keywords can only be distributed once. Currently, the program can only handle keywords with multiple words to ensure the accuracy of the ranking.
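As a rough illustration of the PageRank-style idea, the sketch below runs a standard personalized PageRank over a toy citation graph, with the teleport distribution weighted by a per-paper keyword similarity. It is not the implementation in academic_rank.py: the function name `personalized_pagerank`, the damping factor of 0.85, the iteration count, and the toy similarity scores are all assumptions made for illustration.

```python
from collections import defaultdict


def personalized_pagerank(edges, similarity, damping=0.85, iterations=50):
    """Hypothetical sketch: PageRank over citation edges (src cites dst),
    with teleportation weighted by keyword similarity."""
    nodes = set(similarity)
    for src, dst in edges:
        nodes.update((src, dst))

    # Normalize the similarity scores into a teleport distribution.
    # Assumes at least one paper has a positive similarity.
    total = sum(similarity.get(n, 0.0) for n in nodes)
    teleport = {n: similarity.get(n, 0.0) / total for n in nodes}

    out_links = defaultdict(list)
    for src, dst in edges:
        out_links[src].append(dst)

    rank = dict(teleport)
    for _ in range(iterations):
        new_rank = {n: (1.0 - damping) * teleport[n] for n in nodes}
        dangling = 0.0
        for n in nodes:
            targets = out_links.get(n)
            if targets:
                # A citing paper splits its rank evenly among its references.
                share = damping * rank[n] / len(targets)
                for dst in targets:
                    new_rank[dst] += share
            else:
                dangling += damping * rank[n]
        # Papers that cite nothing return their rank via the teleport vector.
        for n in nodes:
            new_rank[n] += dangling * teleport[n]
        rank = new_rank
    return rank


# Toy data (assumed): p1 and p2 cite p3, and p3 cites p4.
edges = [("p1", "p3"), ("p2", "p3"), ("p3", "p4")]
similarity = {"p1": 0.2, "p2": 0.2, "p3": 0.9, "p4": 0.5}
rank = personalized_pagerank(edges, similarity)
```

The scores always sum to 1, so they can be compared across papers for the same keyword.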

Installation

Install the dependencies listed in requirements.txt:

    pip3 install -r requirements.txt

Datasets

Microsoft Academic Graph

The Microsoft Academic Graph is a heterogeneous graph containing scientific publication records, citation relationships between those publications, and fields of study. The schema of the dataset can be found here. Among the dataset files, we use:
  • FieldsOfStudy
  • PaperFieldsOfStudy
  • PaperReferences

The downloaded data can be found on owl3 server, path.

Springer-83K CS Keywords

CS keywords collected from Springer by Yanghui Pang. The dataset can be found here.

word2vec Model

The word2vec model is trained on the abstracts of papers in the arXiv dataset by Edward Ma. The model can be found here.

Usage

Build the Pruned MAG Dataset

To speed up the ranking algorithm, first prune the Fields of Study (FoS) that are not CS keywords:

    python3 prune_fos.py

The resulting FoS list is written to pruned_FOS.txt.
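A minimal sketch of this kind of filtering, assuming the FoS file is tab-separated with the FId in the first column and the normalized name in the third. The helper name `prune_fos`, the column index, and the in-memory sample rows are assumptions for illustration and do not reproduce the logic of prune_fos.py.

```python
def prune_fos(fos_lines, cs_keywords, name_col=2):
    """Hypothetical helper: keep only FoS rows whose name is a known CS keyword."""
    keywords = {k.strip().lower() for k in cs_keywords}
    kept = []
    for line in fos_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) <= name_col:
            continue  # skip malformed rows
        fid, name = fields[0], fields[name_col].lower()
        if name in keywords:
            kept.append((fid, name))
    return kept


# Toy rows (assumed layout): FId, rank, normalized name, display name.
fos_lines = [
    "1\t100\tdata mining\tData mining\n",
    "2\t200\torganic chemistry\tOrganic chemistry\n",
]
pruned = prune_fos(fos_lines, ["Data Mining", "machine learning"])
```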

Next, prune the papers and references that are not related to CS:

    python3 prune_paper_edge.py

The resulting files are cspapers.txt and pruned_PR.txt.
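The sketch below shows one way such a pruning pass could work on in-memory rows: keep the papers tagged with a retained FId, then keep the citation edges between retained papers. The helper name `prune_papers_and_edges` is hypothetical, and requiring both endpoints of an edge to be CS papers is an assumption; the README does not state how prune_paper_edge.py treats edges with only one CS endpoint.

```python
def prune_papers_and_edges(paper_fos_rows, reference_rows, kept_fids):
    """Hypothetical helper: keep papers tagged with a kept FId, then keep only
    citation edges whose two endpoints are both kept papers (an assumption)."""
    cs_papers = {pid for pid, fid in paper_fos_rows if fid in kept_fids}
    pruned_edges = [
        (src, dst)
        for src, dst in reference_rows
        if src in cs_papers and dst in cs_papers
    ]
    return cs_papers, pruned_edges


# Toy data: (paper ID, FId) pairs and (citing paper, cited paper) pairs.
paper_fos_rows = [("p1", "1"), ("p2", "1"), ("p3", "2")]
reference_rows = [("p1", "p2"), ("p1", "p3"), ("p3", "p2")]
cs_papers, pruned_edges = prune_papers_and_edges(
    paper_fos_rows, reference_rows, kept_fids={"1"}
)
```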

If you run into issues with prune_fos.py or prune_paper_edge.py, please check the original code, which is more stable.

Perform AcademicRank

The preparation steps only need to be done once. To calculate the rank of papers for given keywords, run:

    python3 academic_rank.py [keyword1,keyword2,...]

Keywords must be separated by ',' and the words of a multi-word keyword must be joined by '_'. For example:

    python3 academic_rank.py computer_science,data_mining
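The keyword argument format can be illustrated with a small parsing sketch. The helper name `parse_keywords` is hypothetical; the underscore-to-space replacement mirrors what citation_cnt.py does before looking up a keyword.

```python
def parse_keywords(arg):
    """Split a comma-separated argument and turn '_' into spaces."""
    return [kw.replace("_", " ") for kw in arg.split(",") if kw]


parsed = parse_keywords("computer_science,data_mining")
# parsed == ["computer science", "data mining"]
```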

Visualization

Since academic_rank.py outputs a list of paper IDs, the paper titles can be looked up from those IDs using the MAG API. See the methods and examples in visualization.ipynb for more information.

Limitations

The accuracy of this program is not guaranteed: the vocabulary of the word2vec model is not large enough, so in most cases the keyword similarity cannot be calculated. Currently, the program assigns a dummy similarity to keywords that are not in the word2vec model.
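The fallback behaviour can be sketched with a plain dictionary standing in for the word2vec vocabulary. The names `keyword_similarity` and `DUMMY_SIMILARITY`, the value 0.5, and the toy vectors are assumptions for illustration; the dummy value actually used by the program is not stated here.

```python
import math

DUMMY_SIMILARITY = 0.5  # assumed placeholder value, not taken from the project


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def keyword_similarity(vectors, w1, w2, default=DUMMY_SIMILARITY):
    """Return the cosine similarity, or a dummy value for unknown words."""
    if w1 not in vectors or w2 not in vectors:
        return default
    return cosine(vectors[w1], vectors[w2])


# Toy vocabulary standing in for the word2vec model.
vectors = {"mining": [1.0, 0.0], "learning": [1.0, 0.0], "graph": [0.0, 1.0]}
known = keyword_similarity(vectors, "mining", "learning")
unknown = keyword_similarity(vectors, "mining", "quantum")
```

Because every out-of-vocabulary keyword receives the same score, papers matched only through unknown keywords cannot be told apart by similarity, which is why the ranking degrades when the vocabulary is small.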

Author

Owner

  • Name: Haozhe Si
  • Login: Ehzoahis
  • Kind: user
  • Location: Champaign, IL
  • Company: University of Illinois Urbana-Champaign

Citation (citation_cnt.py)

# Naively rank papers based on their citation counts.
# Used as a baseline to compare with the AcademicRank result.

from heapq import nlargest
import sys
from tqdm import tqdm
from collections import defaultdict
from sqlalchemy import create_engine

EDGE_CNT = 1094935127 # tot edges
PAPER_CNT = 355977380 # papers about CS

db_url = 'mysql+pymysql://haozhes3:hank20si@owl2.cs.illinois.edu/haozhes3_refs?charset=utf8'
engine = create_engine(db_url)

mag_db = './pruned_PR_83k.txt'
mid2fos = './cspaper.txt'

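# Build the dictionary for translating FoS names to FIds
def get_fid_dict():
    print('querying DB')
    q = 'select fid, fos from fid2fos_83k'

    # Engine.execute() was removed in SQLAlchemy 2.0, so run the raw SQL
    # string on an explicit connection instead (requires SQLAlchemy 1.4+).
    with engine.connect() as conn:
        tuples = conn.exec_driver_sql(q).fetchall()

    fos2fid = dict()
    for fid, fos in tuples:
        fos2fid[fos] = fid
    return fos2fid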

# Build the dictionary for checking FId given Paper ID
def generate_mid2fid():
    mid2fid = dict()
    with open(mid2fos, 'r') as f:
        for i, line in tqdm(enumerate(f), total=PAPER_CNT):
            item = line.strip('\n').split('\t')
            mid = item[0]
            fid = item[2]
            mid2fid[mid] = fid
    return mid2fid

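# Count citations by accumulating the incoming edges
def rank(mid2fid, fos2fid, fname, keyword):
    R = defaultdict(int)

    # Look up the target FId once instead of on every edge. FIds parsed from
    # the paper file are strings, so normalize the DB value in case it is numeric.
    target_fid = str(fos2fid[keyword.replace('_', ' ')])

    print('Iterating through {} edges...'.format(EDGE_CNT))
    with open(fname, 'r') as f:
        for line in tqdm(f, total=EDGE_CNT):
            _, dst = line.strip('\n').split('\t')

            # Skip cited papers that are missing from the paper list
            # instead of raising KeyError.
            if mid2fid.get(dst) == target_fid:
                R[dst] += 1
    return R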

# Order the rank and output the top_k results
def top_ranks(R, top_k=10):
    key_list = nlargest(top_k, R, key = R.get)
    key_rank = list()
    for key in key_list:
        key_rank.append((key, R[key]))
    return key_rank

# Write result into target file
def write(fname, key_rank):
    with open(fname, 'w') as f:
        for key, rank in key_rank:
            line = '{}\n'.format('\t'.join([key, str(rank)]))
            f.write(line)

# Accepts multiple keywords, separated by ','
if __name__ == "__main__":
    keywords = sys.argv[1]
    keywords = keywords.split(',')

    mid2fid = generate_mid2fid()
    fos2fid = get_fid_dict()
    for keyword in keywords:
        o_fname = keyword+'_83k_cont.txt'

        R = rank(mid2fid, fos2fid, mag_db, keyword)
        top_rank = top_ranks(R)
        write(o_fname, top_rank)
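The citation-count baseline can be tried on toy data without the database or the MAG files. The sketch below is a condensed, hypothetical variant (`top_cited` is not part of the script) that combines the counting and top-k steps using in-memory edges and a hand-written paper-to-FId mapping.

```python
from collections import Counter
from heapq import nlargest


def top_cited(edges, mid2fid, target_fid, top_k=10):
    """Count incoming citations for papers in the target field, return the top k."""
    counts = Counter(dst for _, dst in edges if mid2fid.get(dst) == target_fid)
    return nlargest(top_k, counts.items(), key=lambda kv: kv[1])


# Toy data: (citing paper, cited paper) pairs and a paper-to-FId mapping.
edges = [("a", "c"), ("b", "c"), ("a", "d"), ("c", "e")]
mid2fid = {"c": "7", "d": "7", "e": "9"}
top = top_cited(edges, mid2fid, "7", top_k=2)
# top == [("c", 2), ("d", 1)]
```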

    


Dependencies

requirements.txt pypi
  • gensim *
  • sqlalchemy *
  • tqdm *