citationranker

This code helps to retrieve all papers from conferences and rank them by the number of (Google Scholar) citations.

https://github.com/yaoming95/citationranker

Science Score: 18.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
  • .zenodo.json file
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (11.0%) to scientific vocabulary
Last synced: 10 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: Yaoming95
  • License: MIT
  • Language: Python
  • Default Branch: main
  • Size: 8.79 KB
Statistics
  • Stars: 12
  • Watchers: 1
  • Forks: 1
  • Open Issues: 0
  • Releases: 0
Created over 5 years ago · Last pushed over 4 years ago
Metadata Files
Readme, License, Citation

readme.md

Conference Citation Ranker

This code helps to retrieve all papers from conferences and rank them by the number of (Google Scholar) citations.

It will save the retrieved paper information to a .csv file containing the Title, Number of Citations, Web Link, Conference, and Year. The .csv file is sorted according to the Citation number.
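As a rough illustration of working with the output, the sketch below reads such a file with Python's standard csv module and lists the most cited rows. The column names (title, citation, link, conf, year) follow the DataFrame columns used in citationRanker.py; the sample rows and the helper name top_cited are made up for this example.

```python
import csv
import io


def top_cited(csv_text, k=3):
    """Return the k rows with the highest citation count from CSV text."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    # Values are read as strings, so convert before comparing.
    rows.sort(key=lambda row: int(row["citation"]), reverse=True)
    return rows[:k]


# Made-up sample data that only mirrors the column layout of the output file.
SAMPLE = """title,citation,link,conf,year
Paper A,12,https://example.org/a,sigir,2018
Paper B,150,https://example.org/b,sigir,2018
Paper C,47,https://example.org/c,kdd,2018
"""

for row in top_cited(SAMPLE, k=2):
    print(row["citation"], row["title"])
```

For a real run, the text of the generated .csv file would be passed in place of SAMPLE.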

This tool can help find the most cited papers in a conference and discover hot research topics.

Requirements

Python 3

ChromeDriver

Installation

  1. Download the Git repo: git clone https://github.com/Yaoming95/CitationRanker.git

  2. Install the requirements: pip install -r requirements.txt

  3. Download ChromeDriver so that you can enter the Captcha when Google's robot check is triggered. After downloading it, rename the executable to chromedriver and put it in the current folder, or pass its path with --driver.

  4. Run the command (e.g., python citationRanker.py -c <conference abbr> -y <year>).

Usage

General usage:

python citationRanker.py -c <conference abbr> -y <year_start> \
    -e <year_end, optional> -o <output_path, optional> -kw <keywords, optional> \
    --driver <path for Chrome Driver, optional>

The code supports multiple conferences and multiple keywords; separate the values with commas.
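A minimal sketch of how such comma-separated values can be handled, modeled on the lower-casing, splitting, and stripping done in PaperTitle.__init__ and on the keyword test in get_paper_title_list. The helper names parse_comma_list and title_matches are hypothetical and do not exist in the script.

```python
def parse_comma_list(value):
    """Split a comma-separated argument into lower-cased, stripped items."""
    return [item.strip() for item in value.lower().split(",")]


def title_matches(title, keywords):
    """Return True only if every keyword occurs in the lower-cased title."""
    title_lower = title.lower()
    return all(keyword in title_lower for keyword in keywords)


conferences = parse_comma_list("SIGIR, KDD")
keywords = parse_comma_list("machine,translation")
print(conferences)
print(title_matches("Neural Machine Translation with Attention", keywords))
```

Because all keywords have to occur in the title, -kw machine,translation narrows the result to titles that contain both words.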

The conference abbreviation is case-insensitive, but it must match the one used by dblp. For example, for the Conference on Neural Information Processing Systems, use nips for papers up to 2017 and neurips for papers from 2018 onward.
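The abbreviation matters because it ends up in the dblp search query. The sketch below rebuilds the query URL in the same shape as the one assembled in get_paper_title_list; the function name dblp_query_url and its parameter names are assumptions made for this example.

```python
DBLP_API = "https://dblp.org/search/publ/api"


def dblp_query_url(conf, year, first_hit=0, page_size=1000):
    """Build the dblp search URL for one conference, year, and result offset."""
    return (DBLP_API + "?q=conf%2F" + conf.lower() + "%20" + str(year)
            + "&h=" + str(page_size) + "&f=" + str(first_hit) + "&format=json")


# First page of results for SIGIR 2018, then the next page of up to 1000 hits.
print(dblp_query_url("sigir", 2018))
print(dblp_query_url("sigir", 2018, first_hit=1000))
```

The script then keeps only hits whose venue contains the abbreviation and whose year matches, which is why an abbreviation that dblp does not use returns nothing.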

To get help:

python citationRanker.py -h

Simple Tutorial and Examples

1. Retrieve the publications of a single conference in a certain year

```bash
python citationRanker.py -c <conference abbr> -y <year>

# e.g., to retrieve the publications of SIGIR'18
python citationRanker.py -c sigir -y 2018
```

2. Retrieve the publications of multiple conferences in a certain year

```bash
python citationRanker.py -c <conference abbr 1>,<conference abbr 2> -y <year>

# e.g., to retrieve the publications of SIGIR'18 and KDD'18
python citationRanker.py -c sigir,kdd -y 2018
```

3. Retrieve the publications of a conference over a span of several years

```bash
python citationRanker.py -c <conference abbr> -y <year_start> -e <year_end>

# e.g., to retrieve the publications of SIGIR from 2018 to 2020
python citationRanker.py -c sigir -y 2018 -e 2020

# e.g., to retrieve the publications of NIPS from 2017 to 2020
python citationRanker.py -c nips,neurips -y 2017 -e 2020
```

4. Retrieve the publications that match keywords

```bash
python citationRanker.py -c <conference abbr> -y <year> -kw <keywords>

# e.g., to retrieve the publications of SIGIR'18 about search
python citationRanker.py -c sigir -y 2018 -kw search

# e.g., to retrieve the publications of EMNLP, ACL, and NAACL about machine translation in 2019
python citationRanker.py -c emnlp,acl,naacl -y 2019 -kw machine,translation
```

5. Specify the output file

```bash
python citationRanker.py -c <conference abbr> -y <year> -o <output_path>

# e.g., to retrieve the publications of SIGIR'18 about search and save them to search.csv
python citationRanker.py -c sigir -y 2018 -kw search -o search.csv
```
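When -o is omitted, the script derives a file name from the conferences and the year range. The following sketch mirrors the string concatenation in get_paper_list; the function name default_output_name is hypothetical and used here only for illustration.

```python
def default_output_name(conferences, year_start, year_end):
    """Compose the fallback CSV name from the conference list and year range."""
    return ("_".join(conferences) + "__" + str(year_start)
            + "to" + str(year_end) + ".csv")


print(default_output_name(["sigir"], 2018, 2018))
print(default_output_name(["nips", "neurips"], 2017, 2020))
```

In the listing, year_end falls back to the current year when -e is not given, so the name produced by a run without -e would end with the current year rather than the start year.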

About Captcha

Sometimes you may encounter the following message in the terminal:

Solve captcha manually and press enter here to continue...

No worries, this is caused by Google's robot detection. Please complete the Captcha in the pop-up Chrome window, and then press Enter in the terminal. Please do not close the pop-up window after you finish the Captcha.
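The prompt is triggered by a simple phrase check on the returned page. A minimal sketch of that check, reusing the two phrases from ROBOT_KW in citationRanker.py; the function name looks_like_robot_check is an assumption for this example.

```python
# Phrases copied from ROBOT_KW in citationRanker.py.
ROBOT_KW = ['unusual traffic from your computer network', 'not a robot']


def looks_like_robot_check(page_text):
    """Return True when the page text contains one of the robot-check phrases."""
    return any(kw in page_text for kw in ROBOT_KW)


print(looks_like_robot_check("Please show you're not a robot"))
print(looks_like_robot_check("Cited by 42"))
```

The match is case-sensitive and purely textual, so it may miss a robot page if Google changes the wording.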

Acknowledgment

WittmannF's repo helped my development.

Owner

  • Name: Yaoming
  • Login: Yaoming95
  • Kind: user
  • Location: Shanghai, China
  • Company: ByteDance AI Lab

Citation (citationRanker.py)

import json
import time

import nltk
import pandas as pd
import requests
from bs4 import BeautifulSoup
import argparse
from tqdm import tqdm
import datetime

SLEEP_TIME = 1.5

# Websession Parameters
GSCHOLAR_URL = 'https://scholar.google.com/scholar?start={}&q={}&hl=en&as_sdt=0,5'
ROBOT_KW = ['unusual traffic from your computer network', 'not a robot']


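def get_citations(content):
    """Return the number after the last 'Cited by ' marker in content, or 0."""
    marker = 'Cited by '
    pos = content.rfind(marker)
    if pos == -1:
        return 0
    start = pos + len(marker)
    end = start
    # Consume every consecutive digit instead of assuming at most five.
    while end < len(content) and content[end].isdigit():
        end += 1
    return int(content[start:end]) if end > start else 0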


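def get_year(content):
    """Return the four-digit year that precedes the last '-' in content, or 0."""
    out = ''  # Defined up front so content without '-' no longer raises.
    for char in range(0, len(content)):
        if content[char] == '-':
            out = content[char - 5:char - 1]
    if not out.isdigit():
        out = 0
    return int(out)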


def setup_driver(driver_path=None):
    """Start the Chrome session that is used to solve captchas manually."""
    try:
        from selenium import webdriver
        from selenium.webdriver.chrome.options import Options
        from selenium.webdriver.chrome.service import Service
    except ImportError:
        print("Please install Selenium and chrome webdriver for manual checking of captchas")
        # Re-raise: continuing would only fail later with a NameError.
        raise

    print('Loading...')
    chrome_options = Options()
    chrome_options.add_argument("disable-infobars")
    if driver_path is None:
        driver_path = "./chromedriver"
    # Selenium 4 takes the driver path through a Service object and the options
    # through `options`; `executable_path` and `chrome_options` were removed.
    driver = webdriver.Chrome(service=Service(driver_path), options=chrome_options)
    return driver


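def get_author(content):
    """Return the text from index 2 up to the first '-', or '' if there is none."""
    out = ''  # Defined up front so content without '-' no longer raises.
    for char in range(0, len(content)):
        if content[char] == '-':
            out = content[2:char - 1]
            break
    return out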


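def get_element(driver, xpath, attempts=5, _count=0):
    '''Safe get_element method with multiple attempts'''
    try:
        # "xpath" is the value of Selenium's By.XPATH; find_element_by_xpath
        # was removed in Selenium 4.3.
        element = driver.find_element("xpath", xpath)
        return element
    except Exception as e:
        if _count < attempts:
            time.sleep(1)
            # Return the retry's result; without `return`, a successful retry yielded None.
            return get_element(driver, xpath, attempts=attempts, _count=_count + 1)
        else:
            print("Element not found")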


def get_content_with_selenium(url, drive_path=None):
    if 'driver' not in globals():
        global driver
        driver = setup_driver(drive_path)
    driver.get(url)

    # Get element from page
    el = get_element(driver, "/html/body")
    c = el.get_attribute('innerHTML')

    if any(kw in el.text for kw in ROBOT_KW):
        input("Solve captcha manually and press enter here to continue...")
        el = get_element(driver, "/html/body")
        c = el.get_attribute('innerHTML')

    return c.encode('utf-8')


class PaperTitle():
    def __init__(self, conf_name, year_start=2010, year_end=None,
                 output_file=None, keyword=None, driver_path=None):
        self._conf_name = conf_name.lower().split(",")
        self._conf_name = list(map(lambda x: x.strip(), self._conf_name))
        if year_end is None:
            year_end = datetime.datetime.now().year
        self._years = range(year_start, year_end + 1)
        self._output_file = output_file
        self._driver_path = driver_path

        self._session = requests.Session()
        # Always define _keyword so get_paper_title_list can compare it with None.
        self._keyword = None
        if keyword is not None:
            self._keyword = keyword.lower().split(",")
            self._keyword = list(map(lambda x: x.strip(), self._keyword))
        self.paper_info_pd = None

    def save_paper_info(self, paper_info_list):
        paper_info_pd = pd.DataFrame(paper_info_list, columns=["title", "citation", "link", "conf", "year"])
        paper_info_pd["citation"] = paper_info_pd["citation"].astype(int)
        paper_info_pd["year"] = paper_info_pd["year"].astype(int)
        paper_info_pd = paper_info_pd.sort_values(by=["citation", "year"], ascending=False)
        if self.paper_info_pd is None:
            self.paper_info_pd = paper_info_pd
        else:
            self.paper_info_pd = pd.concat([self.paper_info_pd, paper_info_pd])

    def get_paper_info(self, paper_title):
        url = GSCHOLAR_URL.format(str(0), paper_title.replace(' ', '+'))
        while True:
            try:
                page = self._session.get(url)
                break
            except requests.exceptions.ConnectionError:
                # Wait before retrying so a dropped connection does not cause a busy loop.
                time.sleep(SLEEP_TIME)
        c = page.content
        if any(kw in c.decode('ISO-8859-1') for kw in ROBOT_KW):
            # print("Robot checking detected, handling with selenium (if installed)")
            try:
                c = get_content_with_selenium(url, drive_path=self._driver_path)
            except Exception as e:
                print("No success. The following error was raised:")
                print(e)
        # Create parser
        soup = BeautifulSoup(c, 'html.parser')
        # Get stuff
        mydivs = soup.findAll("div", {"class": "gs_r"})

        try:
            div = mydivs[0]
            link = div.find('h3').find('a').get('href')
            title = div.find('h3').find('a').text
            # Pass the result's HTML itself, not the repr of the bound method format_string.
            citation = get_citations(str(div))
        except Exception as e:
            return None
        if nltk.edit_distance(paper_title.lower(), title.lower()) > 10:
            return None
        return title, citation, link

    def get_paper_title_list(self, conf, year):
        paper_title_list = []
        first_hit_index = 0
        while True:
            # Index of the first hit to fetch. DBLP returns at most 1000 items per request.
            url = "https://dblp.org/search/publ/api?q=conf%2F" + conf + "%20" + \
                  str(year) + "&h=1000&f=" + str(first_hit_index) + "&format=json"
            r = requests.get(url)
            # Parse the response once and reuse it for the count and the hits.
            hits = json.loads(r.content)['result']['hits']
            num = hits["@sent"]
            if not int(num) > 0:
                break
            first_hit_index += 1000
            content = hits['hit']
            for info in content:
                try:
                    paper_info = info["info"]
                    if type(paper_info["venue"]) is list:
                        paper_info["venue"] = " ".join(paper_info["venue"])
                    if conf.lower() in paper_info["venue"].lower() and paper_info["year"] == str(year):
                        if self._keyword is not None:
                            # Keep the paper only if every keyword occurs in its title.
                            title_lower = paper_info["title"].lower()
                            if not all(keyword in title_lower for keyword in self._keyword):
                                continue
                        paper_title_list.append(paper_info["title"])
                except KeyError:
                    continue
        return paper_title_list

    def get_paper_list_by_conf_year(self, conf, year):
        paper_title_list = self.get_paper_title_list(conf, year)
        paper_info_list = []
        # pd.DataFrame(paper_info_list, columns=["title", "citation", "link"])
        for paper_title in tqdm(paper_title_list, desc=conf+"_"+str(year), ):
            paper_info = self.get_paper_info(paper_title)
            if paper_info is not None:
                paper_info = list(paper_info) + [conf, year]
                paper_info_list.append(paper_info)
            time.sleep(SLEEP_TIME)
        self.save_paper_info(paper_info_list)
        return paper_info_list

    def get_paper_list(self):
        for conf in self._conf_name:
            for year in self._years:
                self.get_paper_list_by_conf_year(conf, year)
        if self._output_file is None:
            self._output_file = "_".join(self._conf_name) + "__" + str(self._years[0]) + "to" + str(
                self._years[-1]) + ".csv"
        self.paper_info_pd = self.paper_info_pd.sort_values(by=["citation", "year"], ascending=False)
        self.paper_info_pd.to_csv(self._output_file, index=False)
        print("save the output to %s" % self._output_file)


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("-c", "--conferences", default="",
                        type=str, help="The abbr. of conferences, separated by comma(,)")
    # "./chromedriver"
    parser.add_argument("-y", "--year_start", default=2020,
                        type=int, help="The start year")
    parser.add_argument("-e", "--year_end", default=None,
                        type=int, help="The end year")
    parser.add_argument("-o", "--output_file", default=None,
                        required=False,
                        type=str, help="The output file name. ")
    parser.add_argument("-kw", "--keyword", default=None,
                        required=False,
                        type=str, help="Search paper titles by keywords, separated by comma(,)")
    parser.add_argument("--driver", default=None,
                        required=False,
                        type=str, help="The path for chromedriver, default is under current folder './chromedriver' ")
    args = parser.parse_args()
    pt = PaperTitle(conf_name=args.conferences, year_start=args.year_start, year_end=args.year_end,
                    output_file=args.output_file, keyword=args.keyword, driver_path=args.driver)
    pt.get_paper_list()
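The listing accepts a Google Scholar hit only when its title is within an edit distance of 10 from the dblp title. The sketch below shows that check in isolation. It uses a plain dynamic-programming Levenshtein distance as a stand-in for nltk.edit_distance so that it runs without extra packages; the function names levenshtein and is_same_title are hypothetical.

```python
def levenshtein(a, b):
    """Edit distance counting insertions, deletions, and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # delete char_a
                               current[j - 1] + 1,       # insert char_b
                               previous[j - 1] + cost))  # substitute or match
        previous = current
    return previous[-1]


def is_same_title(query_title, result_title, max_distance=10):
    """Accept the hit when the lower-cased titles differ by at most max_distance edits."""
    return levenshtein(query_title.lower(), result_title.lower()) <= max_distance


# dblp titles usually end with a period, which costs a single edit.
print(is_same_title("Attention Is All You Need.", "Attention is all you need"))
print(is_same_title("Deep Learning", "A Completely Different Paper Title"))
```

A fixed threshold is lenient for short titles, so two short but unrelated titles could still pass; a threshold relative to the title length would be one possible alternative.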

GitHub Events

Total
  • Watch event: 1
Last Year
  • Watch event: 1

Issues and Pull Requests

Last synced: about 1 year ago

All Time
  • Total issues: 0
  • Total pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Total issue authors: 0
  • Total pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0

Dependencies

requirements.txt pypi
  • beautifulsoup4 *
  • nltk *
  • pandas *
  • requests *
  • selenium *
  • tqdm *