citationranker
This code helps to retrieve all papers from conferences and rank them by the number of (Google Scholar) citations.
Science Score: 18.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: Found CITATION.cff file
- ○ codemeta.json file
- ○ .zenodo.json file
- ○ DOI references
- ○ Academic publication links
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: Low similarity (11.0%) to scientific vocabulary
Repository
Basic Info
- Host: GitHub
- Owner: Yaoming95
- License: mit
- Language: Python
- Default Branch: main
- Size: 8.79 KB
Statistics
- Stars: 12
- Watchers: 1
- Forks: 1
- Open Issues: 0
- Releases: 0
Metadata Files
readme.md
Conference Citation Ranker
This code helps to retrieve all papers from conferences and rank them by the number of (Google Scholar) citations.
The script saves the retrieved paper information to a .csv file containing the title, number of citations, web link, conference, and year. The .csv file is sorted by citation count, highest first.
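As a rough illustration of working with the output, the sketch below loads such a file with the standard `csv` module and lists the most cited entries. The column names (`title`, `citation`, `link`, `conf`, `year`) are taken from the script listed further down this page; `SAMPLE_CSV`, `top_cited`, and the sample rows are made up for the example.

```python
import csv
import io

# Stand-in for a file written by citationRanker.py; the rows are invented.
SAMPLE_CSV = (
    "title,citation,link,conf,year\n"
    "Paper A,120,https://example.org/a,sigir,2018\n"
    "Paper B,45,https://example.org/b,sigir,2018\n"
    "Paper C,300,https://example.org/c,kdd,2018\n"
)


def top_cited(csv_file, n=2):
    """Return the n rows with the highest citation count (ties broken by year)."""
    rows = list(csv.DictReader(csv_file))
    rows.sort(key=lambda r: (int(r["citation"]), int(r["year"])), reverse=True)
    return rows[:n]


top = top_cited(io.StringIO(SAMPLE_CSV))
for row in top:
    print(row["citation"], row["title"])
```

For a real run, an opened file (for example `open("search.csv", newline="")`) would take the place of the `io.StringIO` object.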
This tool can help find the most cited papers in a conference and discover hot research topics.
Requirements
Python3
Installation
1. Download the Git repo:
   git clone https://github.com/Yaoming95/CitationRanker.git
2. Install the requirements:
   pip install -r requirements.txt
3. Download ChromeDriver so that you can enter the Captcha when Google's robot check is triggered. After downloading chromedriver, rename it to chromedriver and put it in the current folder.
4. Run the command (e.g., python citationRanker.py -c <conference abbr> -y <year>).
Usage
General Usage
python citationRanker.py -c <conference abbr> -y <year_start> \
-e <year_end, optional> -o <output_path, optional> -kw <keywords, optional> \
--driver <path for Chrome Driver, optional>
The code supports multiple conferences and keywords, which must be separated by commas.
The conference abbreviation is case insensitive, but must be consistent with dblp.
For example, for the Conference on Neural Information Processing Systems,
nips is for papers up to 2017, and neurips is for papers from 2018 onward.
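To see why the abbreviation has to follow dblp, it may help to look at the query that is sent. The sketch below rebuilds the dblp search URL in the same layout as the string assembled in `get_paper_title_list` in the script listed on this page; `DBLP_API` and `build_dblp_url` are illustrative names, not part of the repository.

```python
from urllib.parse import quote

# Endpoint used by the script's dblp query.
DBLP_API = "https://dblp.org/search/publ/api"


def build_dblp_url(conf, year, first_hit=0, page_size=1000):
    """Build a dblp publication search URL for one conference and year.

    The query text is "conf/<abbr> <year>", percent-encoded so that
    '/' becomes %2F and the space becomes %20.
    """
    query = quote(f"conf/{conf.lower()} {year}", safe="")
    return f"{DBLP_API}?q={query}&h={page_size}&f={first_hit}&format=json"


url = build_dblp_url("NeurIPS", 2019)
print(url)
```

Because the abbreviation becomes part of the search text and is later compared against the venue field, an abbreviation that dblp does not use will most likely return no papers.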
To get help
python citationRanker.py -h
Simple Tutorial and Examples
1. Retrieve the publications of a single conference in a certain year
```bash
python citationRanker.py -c
# e.g. If you want to retrieve the publications of SIGIR’18
python citationRanker.py -c sigir -y 2018
```
2. Retrieve the publications of multiple conferences in a certain year
```bash
python citationRanker.py -c
# e.g. If you want to retrieve the publications of SIGIR’18 and KDD'18
python citationRanker.py -c sigir,kdd -y 2018
```
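The comma-separated value is split into individual names before any query is made. A minimal sketch of that normalization, modeled on the `lower().split(",")` and `strip()` calls in the script's `PaperTitle.__init__` (the helper name `split_csv_arg` is hypothetical):

```python
def split_csv_arg(value):
    """Lower-case a comma-separated argument and strip spaces around each item."""
    return [item.strip() for item in value.lower().split(",")]


confs = split_csv_arg("SIGIR, KDD")
print(confs)
```

Spaces after the commas are therefore harmless, although an argument containing spaces needs quoting in the shell.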
3. Retrieve the publications of a conference over a span of several years
```bash
python citationRanker.py -c
# e.g. If you want to retrieve the publications of SIGIR from 2018 to 2020
python citationRanker.py -c sigir -y 2018 -e 2020
# e.g. If you want to retrieve the publications of NIPS from 2017 to 2020
python citationRanker.py -c nips,neurips -y 2017 -e 2020
```
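The year span is inclusive at both ends, and every conference is queried for every year in it. The sketch below mirrors how the script listed on this page builds its range and loops over it; in that listing, leaving out the end year makes it fall back to the current year. `year_span` and `conf_year_pairs` are illustrative helper names.

```python
import datetime


def year_span(year_start, year_end=None):
    """Inclusive list of years; the end year falls back to the current year."""
    if year_end is None:
        year_end = datetime.datetime.now().year
    return list(range(year_start, year_end + 1))


def conf_year_pairs(confs, years):
    """Every (conference, year) combination, looping over conferences first."""
    return [(conf, year) for conf in confs for year in years]


pairs = conf_year_pairs(["nips", "neurips"], year_span(2017, 2020))
print(len(pairs))
print(pairs[:2])
```

For the NIPS example this gives eight conference/year queries, most of which return nothing because each abbreviation only matches its own years.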
4. Retrieve the publications with keywords
```bash
python citationRanker.py -c
# e.g. If you want to retrieve the publications of SIGIR'18 about search
python citationRanker.py -c sigir -y 2018 -kw search
# e.g. If you want to retrieve the publications of EMNLP&ACL&NAACL about machine translation in 2019
python citationRanker.py -c emnlp,acl,naacl -y 2019 -kw machine,translation
```
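Judging from the filter loop in the script listed on this page, a paper is kept only when every keyword occurs in its lower-cased title, and the match is a plain substring test. A minimal sketch of that rule, with `title_matches` as a hypothetical helper name:

```python
def title_matches(title, keywords):
    """True when every keyword occurs as a substring of the lower-cased title."""
    lowered = title.lower()
    return all(keyword in lowered for keyword in keywords)


keywords = ["machine", "translation"]
print(title_matches("Neural Machine Translation with Attention", keywords))
print(title_matches("Machine Learning for Search", keywords))
```

Since this is substring matching, a keyword such as search would also match a title containing "Research".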
5. Specify the output file
```bash
python citationRanker.py -c
# e.g. If you want to retrieve the publications about search of SIGIR’18 and save them to search.csv
python citationRanker.py -c sigir -y 2018 -kw search -o search.csv
```
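When no output file is given, the script listed on this page derives a name from the conferences and the year span. The sketch below reproduces that naming pattern; `default_output_name` is an illustrative name, not a function in the repository.

```python
def default_output_name(confs, year_start, year_end):
    """Join conference names with '_' and append '__<start>to<end>.csv'."""
    return "_".join(confs) + "__" + str(year_start) + "to" + str(year_end) + ".csv"


name = default_output_name(["sigir", "kdd"], 2018, 2020)
print(name)
```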
About Captcha
Sometimes you may encounter the following message in the terminal:
Solve captcha manually and press enter here to continue...
No worries, this is caused by Google's robot detection. Please complete the Captcha in the pop-up Chrome window, and then press Enter in the terminal. Please do not close the pop-up window after you finish the Captcha.
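The prompt appears when the returned page contains one of the phrases Google shows on its robot-check page. A minimal sketch of that test, reusing the `ROBOT_KW` phrases from the script listed on this page (`looks_like_robot_check` is a hypothetical helper name):

```python
# Phrases the script looks for to recognize Google's robot-check page.
ROBOT_KW = ['unusual traffic from your computer network', 'not a robot']


def looks_like_robot_check(page_text):
    """True when the page text contains any of the robot-check phrases."""
    return any(keyword in page_text for keyword in ROBOT_KW)


print(looks_like_robot_check("Please show you're not a robot"))
print(looks_like_robot_check("<div class='gs_r'>Cited by 12</div>"))
```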
Acknowledgment
WittmannF's repo helped my development.
Owner
- Name: Yaoming
- Login: Yaoming95
- Kind: user
- Location: Shanghai, China
- Company: ByteDance AI Lab
- Repositories: 3
- Profile: https://github.com/Yaoming95
Citation (citationRanker.py)
import json
import time
import nltk
import pandas as pd
import requests
from bs4 import BeautifulSoup
import argparse
from tqdm import tqdm
import datetime
SLEEP_TIME = 1.5
# Websession Parameters
GSCHOLAR_URL = 'https://scholar.google.com/scholar?start={}&q={}&hl=en&as_sdt=0,5'
ROBOT_KW = ['unusual traffic from your computer network', 'not a robot']
def get_citations(content):
    out = 0
    for char in range(0, len(content)):
        if content[char:char + 9] == 'Cited by ':
            init = char + 9
            # Look a few characters ahead for the '<' that ends the number.
            for end in range(init + 1, init + 6):
                if content[end] == '<':
                    break
            out = content[init:end]
    return int(out)
def get_year(content):
    out = ''  # stays empty when no '-' is found, so 0 is returned
    for char in range(0, len(content)):
        if content[char] == '-':
            out = content[char - 5:char - 1]
    if not out.isdigit():
        out = 0
    return int(out)
def setup_driver(driver_path=None):
    try:
        from selenium import webdriver
        from selenium.webdriver.chrome.options import Options
        from selenium.webdriver.chrome.service import Service
    except Exception as e:
        print(e)
        print("Please install Selenium and chrome webdriver for manual checking of captchas")
        raise  # the code below cannot run without Selenium
    print('Loading...')
    chrome_options = Options()
    chrome_options.add_argument("disable-infobars")
    if driver_path is None:
        driver_path = "./chromedriver"
    # Selenium 4 style: the driver path goes through a Service object and the
    # options through `options`; recent releases reject executable_path/chrome_options.
    driver = webdriver.Chrome(service=Service(executable_path=driver_path), options=chrome_options)
    return driver
def get_author(content):
    out = ''  # returned unchanged when no '-' is found
    for char in range(0, len(content)):
        if content[char] == '-':
            out = content[2:char - 1]
            break
    return out
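def get_element(driver, xpath, attempts=5, _count=0):
    '''Safe get_element method with multiple attempts'''
    from selenium.webdriver.common.by import By
    try:
        # find_element(By.XPATH, ...) replaces the removed find_element_by_xpath.
        element = driver.find_element(By.XPATH, xpath)
        return element
    except Exception as e:
        if _count < attempts:
            time.sleep(1)
            # Return the retry's result so the caller receives the element.
            return get_element(driver, xpath, attempts=attempts, _count=_count + 1)
        else:
            print("Element not found")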
def get_content_with_selenium(url, drive_path=None):
    global driver
    if 'driver' not in globals():
        driver = setup_driver(drive_path)
    driver.get(url)
    # Get element from page
    el = get_element(driver, "/html/body")
    c = el.get_attribute('innerHTML')
    if any(kw in el.text for kw in ROBOT_KW):
        input("Solve captcha manually and press enter here to continue...")
        el = get_element(driver, "/html/body")
        c = el.get_attribute('innerHTML')
    return c.encode('utf-8')
class PaperTitle():
    def __init__(self, conf_name, year_start=2010, year_end=None,
                 output_file=None, keyword=None, driver_path=None):
        self._conf_name = conf_name.lower().split(",")
        self._conf_name = list(map(lambda x: x.strip(), self._conf_name))
        if year_end is None:
            year_end = datetime.datetime.now().year
        self._years = range(year_start, year_end + 1)
        self._output_file = output_file
        self._driver_path = driver_path
        self._session = requests.Session()
        # Always define _keyword so the title filter can test it against None.
        self._keyword = None
        if keyword is not None:
            self._keyword = keyword.lower().split(",")
            self._keyword = list(map(lambda x: x.strip(), self._keyword))
        self.paper_info_pd = None
    def save_paper_info(self, paper_info_list):
        paper_info_pd = pd.DataFrame(paper_info_list, columns=["title", "citation", "link", "conf", "year"])
        paper_info_pd["citation"] = paper_info_pd["citation"].astype(int)
        paper_info_pd["year"] = paper_info_pd["year"].astype(int)
        paper_info_pd = paper_info_pd.sort_values(by=["citation", "year"], ascending=False)
        if self.paper_info_pd is None:
            self.paper_info_pd = paper_info_pd
        else:
            self.paper_info_pd = pd.concat([self.paper_info_pd, paper_info_pd])
    def get_paper_info(self, paper_title):
        url = GSCHOLAR_URL.format(str(0), paper_title.replace(' ', '+'))
        while True:
            try:
                page = self._session.get(url)
                break
            except requests.exceptions.ConnectionError:
                # Wait before retrying instead of looping at full speed.
                time.sleep(SLEEP_TIME)
        c = page.content
        if any(kw in c.decode('ISO-8859-1') for kw in ROBOT_KW):
            # print("Robot checking detected, handling with selenium (if installed)")
            try:
                c = get_content_with_selenium(url, drive_path=self._driver_path)
            except Exception as e:
                print("No success. The following error was raised:")
                print(e)
        # Create parser
        soup = BeautifulSoup(c, 'html.parser')
        # Get stuff
        mydivs = soup.findAll("div", {"class": "gs_r"})
        try:
            div = mydivs[0]
            link = div.find('h3').find('a').get('href')
            title = div.find('h3').find('a').text
            # str(div) is the result's HTML, which contains the "Cited by N" link.
            citation = get_citations(str(div))
        except Exception as e:
            return None
        # Reject the hit when its title differs too much from the requested one.
        if nltk.edit_distance(paper_title.lower(), title.lower()) > 10:
            return None
        return title, citation, link
    def get_paper_title_list(self, conf, year):
        paper_title_list = []
        first_hit_index = 0
        while True:
            # The first hit index. DBLP returns at most 1000 items per request.
            url = "https://dblp.org/search/publ/api?q=conf%2F" + conf + "%20" + \
                  str(year) + "&h=1000&f=" + str(first_hit_index) + "&format=json"
            r = requests.get(url)
            c = r.content
            num = json.loads(c)['result']['hits']["@sent"]
            if not int(num) > 0:
                break
            first_hit_index += 1000
            content = json.loads(c)['result']['hits']['hit']
            for info in content:
                try:
                    paper_info = info["info"]
                    if type(paper_info["venue"]) is list:
                        paper_info["venue"] = " ".join(paper_info["venue"])
                    if conf.lower() in paper_info["venue"].lower() and paper_info["year"] == str(year):
                        if self._keyword is not None:
                            # Skip the paper unless every keyword appears in its title.
                            keyword_flag = False
                            for keyword in self._keyword:
                                if keyword not in paper_info["title"].lower():
                                    keyword_flag = True
                                    break
                            if keyword_flag:
                                continue
                        paper_title_list.append(paper_info["title"])
                except KeyError:
                    continue
        return paper_title_list
    def get_paper_list_by_conf_year(self, conf, year):
        paper_title_list = self.get_paper_title_list(conf, year)
        paper_info_list = []
        # pd.DataFrame(paper_info_list, columns=["title", "citation", "link"])
        for paper_title in tqdm(paper_title_list, desc=conf + "_" + str(year)):
            paper_info = self.get_paper_info(paper_title)
            if paper_info is not None:
                paper_info = list(paper_info) + [conf, year]
                paper_info_list.append(paper_info)
            time.sleep(SLEEP_TIME)
        self.save_paper_info(paper_info_list)
        return paper_info_list
    def get_paper_list(self):
        for conf in self._conf_name:
            for year in self._years:
                self.get_paper_list_by_conf_year(conf, year)
        if self._output_file is None:
            self._output_file = "_".join(self._conf_name) + "__" + str(self._years[0]) + "to" + str(
                self._years[-1]) + ".csv"
        self.paper_info_pd = self.paper_info_pd.sort_values(by=["citation", "year"], ascending=False)
        self.paper_info_pd.to_csv(self._output_file, index=False)
        print("save the output to %s" % self._output_file)
if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("-c", "--conferences", default="",
                        type=str, help="The abbr. of conferences, separated by comma(,)")
    # "./chromedriver"
    parser.add_argument("-y", "--year_start", default=2020,
                        type=int, help="The start year")
    parser.add_argument("-e", "--year_end", default=None,
                        type=int, help="The end year")
    parser.add_argument("-o", "--output_file", default=None,
                        required=False,
                        type=str, help="The output file name. ")
    parser.add_argument("-kw", "--keyword", default=None,
                        required=False,
                        type=str, help="Search paper titles by keywords, separated by comma(,)")
    parser.add_argument("--driver", default=None,
                        required=False,
                        type=str, help="The path for chromedriver, default is under current folder './chromedriver' ")
    args = parser.parse_args()
    pt = PaperTitle(conf_name=args.conferences, year_start=args.year_start, year_end=args.year_end,
                    output_file=args.output_file, keyword=args.keyword, driver_path=args.driver)
    pt.get_paper_list()
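The character-by-character scan in `get_citations` can also be expressed with a regular expression. The sketch below is an alternative formulation, not the repository's code: `CITED_BY` and `parse_citations` are hypothetical names, and unlike the listing above it reads the first "Cited by" occurrence rather than the last.

```python
import re

# Matches the "Cited by 123" link text found in a Google Scholar result.
CITED_BY = re.compile(r"Cited by (\d+)")


def parse_citations(html):
    """Return the citation count from a result's HTML, or 0 when none is shown."""
    match = CITED_BY.search(html)
    return int(match.group(1)) if match else 0


print(parse_citations('<a href="#">Cited by 123</a>'))
print(parse_citations('<div class="gs_r">no citations yet</div>'))
```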
GitHub Events
Total
- Watch event: 1
Last Year
- Watch event: 1
Issues and Pull Requests
Last synced: about 1 year ago
All Time
- Total issues: 0
- Total pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Total issue authors: 0
- Total pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 0
- Pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Issue authors: 0
- Pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Dependencies
- beautifulsoup4 *
- nltk *
- pandas *
- requests *
- selenium *
- tqdm *