citations-bargraph-gscholar
A simple Python script to scrap Google Scholar profiles and generate a bargraph with citation count per year.
Science Score: 41.0%
This score indicates how likely this project is to be science-related based on various indicators:
-
✓CITATION.cff file
Found CITATION.cff file -
✓codemeta.json file
Found codemeta.json file -
○.zenodo.json file
-
○DOI references
-
✓Academic publication links
Links to: scholar.google -
○Academic email domains
-
○Institutional organization owner
-
○JOSS paper metadata
-
○Scientific vocabulary similarity
Low similarity (10.3%) to scientific vocabulary
Repository
A simple Python script to scrap Google Scholar profiles and generate a bargraph with citation count per year.
Basic Info
- Host: GitHub
- Owner: ricardojrdez
- License: gpl-3.0
- Language: Python
- Default Branch: main
- Size: 182 KB
Statistics
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 2
- Releases: 0
Metadata Files
README.md
citations-bargraph-GScholar
citations-bargraph-GScholar is a Python3 script to scrap a Google Scholar web profile, get citations per year, and generate a bargraph (similar to the one you can get in a Google Scholar profile when clicking on the image at the right-side panel). It only needs the Google Scholar ID of the profile that you want to retrieve!
This script relies on BeautifulSoup to scrap the HMTL and matplotlib to generate the bargraph. In addition, the bargraph also shows other statistical data, such as the citation average of last 5 years, the total number of citations, h-index, and i10-index. It also shows the URL scrapped and the date when data was fetched on.
Requirements
- Python 3.5 (tested on Python 3.8.4rc1 on top of Debian GNU/Linux bullseye/sid 8.11)
- BeautifulSoup
Usage
``` usage: citations-bargraph-GScholar.py [-h] [-f FILENAME] [-s startyear] [-e endyear] [-u GSCHOLARURL] GSCHOLARID Gets the citation bargraph of a Google Scholar profile
Options: -h, --help List all available options. -f, --filename=STARTYEARENDYEARSCHOLARNAME.pdf Output filename of the bargraph (default value is composed by STARTYEAR, ENDYEAR, and abbreviated SCHOLARNAME). -s, --year-start=2017 Init year of the bargraph -e, --year-end=2021 End year of the bargraph -u, --gscholar-url=https://scholar.google.es/citations?user= Google Scholar URL. If it changes, you can use it this parameter ```
Usage example
$ python3 citations-bargraph-GScholar.py HlQC1OcAAAAJ
It will generate a PDF file, containing the bargraph with citations per year of the Google Scholar profile indicated as parameter. For instance, the previous command will generate a PDF as this one:
License
Licensed under the GNU GPLv3 license.
Owner
- Name: Ricardo J. Rodríguez
- Login: ricardojrdez
- Kind: user
- Location: Zaragoza, Spain
- Company: University of Zaragoza
- Website: http://www.ricardojrodriguez.es
- Twitter: RicardoJRdez
- Repositories: 1
- Profile: https://github.com/ricardojrdez
Associate Professor @universidad-zaragoza (Spain). I research at @reverseame: program binary analysis, digital forensics, RFID/NFC security
Citation (citations-bargraph-GScholar.py)
#!/usr/bin/python3
# -*- coding: utf-8 -*-
'''Gets the citation bargraph of a Google Scholar profile
This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
You should have received a copy of the GNU General Public License
along with this program. If not, see <https://www.gnu.org/licenses/>.
'''
import os
import re
import requests
from bs4 import BeautifulSoup
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import statistics
import logging
import configparser
import getopt
import sys
from datetime import datetime
from matplotlib import rc
rc('font',**{'family':'serif','serif':['Times']})
matplotlib.rcParams['text.usetex'] = True
__author__ = "Ricardo J. Rodríguez"
__copyright__ = "Copyright 2021, University of Zaragoza, Spain"
__credits__ = ["Ricardo J. Rodríguez"]
__license__ = "GPL"
__version__ = "1.0"
__maintainer__ = "Ricardo J. Rodríguez"
__email__ = "rjrodriguez@unizar.es"
__status__ = "Production"
default_log_level = logging.INFO
logger = logging.getLogger('main')
script_name = os.path.basename(__file__)
GSCHOLAR_ID=''
GSCHOLAR_URL="https://scholar.google.es/citations?user="
help= f'''usage: {script_name} [-h] [-f FILENAME] [-s start_year] [-e end_year] [-u GSCHOLAR_URL] GSCHOLAR_ID
Gets the citation bargraph of a Google Scholar profile
Options:
-h, --help
List all available options.
-f, --filename=STARTYEAR_ENDYEAR_SCHOLARNAME.pdf
Output filename of the bargraph (default value is composed by STARTYEAR, ENDYEAR, and abbreviated SCHOLARNAME).
-s, --year-start=2017
Init year of the bargraph
-e, --year-end=2021
End year of the bargraph
-u, --gscholar-url={GSCHOLAR_URL}
Google Scholar URL. If it changes, you can use it this parameter
'''
def usage_error(exit_code=0):
print(help, file=sys.stderr)
sys.exit(exit_code)
def main(argv):
logging.basicConfig(level=default_log_level)
filename = ''
year_start = -1
year_end = -1
gscholar_url = GSCHOLAR_URL
gscholar_id = GSCHOLAR_ID
try:
opts, _args = getopt.getopt(argv, "e:f:hs:u:",["year-end=", "filename=", "help=", "year-start=", "gscholar-url="])
except getopt.GetoptError as e:
logger.exception(e)
usage_error(2)
for opt, arg in opts:
if opt in ('-h', "--help"):
usage_error()
elif opt in ("-s", "--year-start"):
year_start = arg
elif opt in ("-e", "--year-end"):
year_end = arg
elif opt in ("-f", "--filename"):
filename = arg
elif opt in ("-u", "--gscholar-url="):
gscholar_url = arg
try:
gscholar_id = _args[0]
except:
logger.error('Required arguments missed')
usage_error(2)
soup, cit_years, years = get_citations_years(gscholar_url, gscholar_id)
researcher_name = get_name(soup)
researcher_affiliation = get_affiliation(soup)
if year_start == -1:
year_start = years[0]
if year_end == -1:
year_end= years[-1]
if filename == '':
filename = '{}_{}_{}.pdf'.format(year_start, year_end,
''.join(s[0] for s in researcher_name.split(' ')[:-1]) + researcher_name.split(' ')[-1])
idx_start = years.index(year_start)
idx_end = years.index(year_end) + 1
years = years[idx_start:idx_end]
cit_summary = get_citation_summary(soup)
generate_citation_bargraph(filename, years=years, cit_years=cit_years, name=researcher_name, university=researcher_affiliation, summary=cit_summary, gscholar_id=gscholar_id)
def get_citations_years(gscholar_url=GSCHOLAR_URL, gscholar_id=GSCHOLAR_ID):
# build URL appropriately
_req = requests.get("{}{}".format(gscholar_url, gscholar_id))
soup = BeautifulSoup(_req.text, features="lxml") # make request
_div = str(soup.find(attrs={'class': 'gsc_md_hist_b'}))
_year_regex = 'px">\d+</span>'
_citation_year_regex = 'class="gsc_g_al">\d+<'
_cit_year = get_values_from(_citation_year_regex, _div)
_years = get_values_from(_year_regex, _div)
return soup, _cit_year, _years
def get_citation_summary(soup: BeautifulSoup) -> str:
_itms = soup.findAll(attrs={'class': 'gsc_rsb_std'})
_indexes = []
for _itm in _itms:
_soup = BeautifulSoup(str(_itm), features="lxml")
_indexes.append(int(_soup.td.text))
return _indexes
def get_name(soup: BeautifulSoup) -> str:
_div = str(soup.find(attrs={'id': 'gsc_prf_in'}))
_soup = BeautifulSoup(_div, features="lxml")
return _soup.div.text
def get_affiliation(soup: BeautifulSoup) -> str:
_div = str(soup.find(attrs={'class': 'gsc_prf_il'}))
_soup = BeautifulSoup(_div, features="lxml")
return _soup.a.text
def get_values_from(_regex_str, html=''):
_regex = re.compile(_regex_str)
_list = []
for _itm in re.findall(_regex, html):
[_val] = re.findall("\d+", _itm) # get the single element from list
_list.append(int(_val))
return _list
def generate_citation_bargraph(filename, years: list, cit_years: list, name: str, university: str, summary: str, gscholar_id: str):
x = np.arange(len(years)) # the label locations
width = 0.5 # the width of the bars
fig, ax = plt.subplots()
rects1 = ax.bar(x, cit_years, width, label='Citations')
_mean_str = ''
if len(cit_years) >= 6:
_mean_last5yrs = statistics.mean(cit_years[-6:-1])
_mean_str = ' (average of last 5 years, {}/{}: {:.2f})'.format(years[-6],years[-2], _mean_last5yrs)
summary_str = '[all] Citations: {}; h-index: {}; i10-index: {}'.format(summary[0], summary[2], summary[4])
ax.set_ylabel('Citations')
ax.set_xlabel('Years\n' +
'{\\small (data source: {\\tt ' + GSCHOLAR_URL + gscholar_id + '}; fetched on ' + datetime.now().strftime("%d %b, %Y") + ')}')
ax.set_title('{\\bf Citations per year' + _mean_str +
'}\nScholar: {\\em ' + name + '} ' + '(' + university +
')\n' + summary_str
)
ax.set_xticks(x)
ax.set_xticklabels(years)
autolabel(ax, rects1)
fig.tight_layout()
#plt.show()
plt.savefig(os.path.join(os.getcwd(), filename), bbox_inches='tight')
plt.clf()
def autolabel(ax, rects):
"""Attach a text label above each bar in *rects*, displaying its height."""
for rect in rects:
height = rect.get_height()
ax.annotate('{}'.format(height),
xy=(rect.get_x() + rect.get_width() / 2, height),
xytext=(0, 3), # 3 points vertical offset
textcoords="offset points",
ha='center', va='bottom', size=9)
if __name__ == "__main__":
main(sys.argv[1:])
GitHub Events
Total
Last Year
Dependencies
- beautifulsoup4 ==4.9.3
- lxml ==4.6.2
- matplotlib ==3.3.3
- numpy ==1.19.4
- requests ==2.32.2