biorxiv_speaker_finder

Extract a list of authors who have published preprints containing a user-defined keyword

https://github.com/emilyasterjones/biorxiv_speaker_finder

Last synced: 10 months ago · JSON representation ·

Repository

Extract a list of authors who have published preprints containing a user-defined keyword

Basic Info

Host: GitHub
Owner: emilyasterjones
License: gpl-3.0
Language: Jupyter Notebook
Default Branch: master
Size: 41 KB

Statistics

Stars: 4
Watchers: 1
Forks: 0
Open Issues: 0
Releases: 1

Created almost 6 years ago · Last pushed almost 2 years ago

Metadata Files

Readme License Citation

BioRxiv Speaker Finder

This iPython notebook extracts first and last authors who have published bioRxiv preprints or Pubmed manuscripts relevant to an inputted subject area. You can use it to find researchers outside of your network to invite them as a speaker or cite their work.

March 2023 update: Rxvist, the API required for bioRxiv search portion of this tool, has shut down. The remaining bioRxiv API doesn't have keyword search functions. The Pubmed search portion still works.

Inputs: keyword (in title or abstract) & whether you are looking for trainees (first authors) or PIs (last authors)

Outputs: Printed data frame & CSV with a list of authors who have published manuscripts containing all keywords, with the following attributes: * institution * email * ORCID link * # manuscripts with provided keywords * source (bioRxiv or Pubmed) * # total preprints * # total downloads * # total works (from ORCID).

Lists are sorted in order of # of keyword manuscripts so the most relevant authors will be at the top.

Optional: code cells at the end use APIs to predict the gender & ethnicity of the authors. Predicted female & minority authors are printed and the predictions are appended as columns to the CSV.
If you use these cells, please read the important caveats at the top of the notebook and treat these predictions with caution.

If you use this tool, please cite the original paper which created the Rxivist API:
Abdill RJ, Blekhman R. "Tracking the popularity and outcomes of all bioRxiv preprints." eLife (2019). doi: 10.7554/eLife.45133.

Citation Overrepresentation Tool

This second iPython notebook identifies overrepresented authors, journals, & institutions from a citation list.

Inputs: .bib file extracted from a paper

Outputs: prints sorted list most-cited last authors, journals, & institutions (as extracted from ORCID)

Launch both notebooks

Owner

Name: Emily A. Aery Jones
Login: emilyasterjones
Kind: user
Company: Stanford University

Website: emilyjon.es
Repositories: 3
Profile: https://github.com/emilyasterjones

Postdoc studying spatial memory 🐭🧠 at Stanford Incoming Assistant Professor at University of Maryland, Baltimore starting Jan 2026

Citation (citation_overrepresentation_tool.ipynb)

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Citation Overrepresentation Tool\n",
    "This tool extracts last authors & journals of papers you have cited and creates frequency tables so you can see who you cite the most often. It then attempts to map authors to instutions via the ORCID database. You can use it to find which journals, labs, and universities receive most of your attention.\n",
    "\n",
    "1. Extract your citations as a .bib file. Either extract all refs from a single folder in your citation manager or extract them straight from a Word doc if you use Mendeley or Zotero using [this tool](https://rintze.zelle.me/ref-extractor/).\n",
    "2. Upload your .bib file to the binder.\n",
    "3. Run each cell (Shift+Enter or Play button).\n",
    "4. Optional: extract institutions from Google Scholar\n",
    "5. Optional: get an ORCID API key (instructions below) to extract institutions."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#imports\n",
    "!pip install pybtex\n",
    "!pip install scholarly\n",
    "\n",
    "from pybtex.database.input import bibtex\n",
    "import glob\n",
    "import pandas as pd\n",
    "import requests\n",
    "import re\n",
    "from scholarly import scholarly"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#read .bib file\n",
    "ID = glob.glob('*bib')\n",
    "parser = bibtex.Parser()\n",
    "try:\n",
    "    bib_data = parser.parse_file(ID[0])\n",
    "except:\n",
    "    raise ValueError(\"Your .bib file has non-UTF8 characters it in (like smart quotes). Please remove them & try again.\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#extract author & journal names from each citation\n",
    "authors = list()\n",
    "for key in bib_data.entries:\n",
    "    author = bib_data.entries[key].persons['author']\n",
    "    first_name = author[-1].rich_first_names\n",
    "    last_name = author[-1].rich_last_names\n",
    "    first_name = str(first_name)[7:-3]\n",
    "    last_name = str(last_name)[7:-3]\n",
    "\n",
    "    try:\n",
    "        journal = bib_data.entries[key].fields['journal']\n",
    "    except:\n",
    "        journal = 'Book'\n",
    "    authors.append([first_name, last_name, journal])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#build data frame & print\n",
    "auth_df = pd.DataFrame(authors, columns=['First Name','Last Name', 'Journal'])\n",
    "print('Overcited Authors')\n",
    "names_grouped = auth_df.groupby(['First Name','Last Name']).size()\n",
    "print(names_grouped.sort_values(ascending=False).head(10))\n",
    "print('\\nOvercited Journals')\n",
    "print(auth_df.groupby(['Journal']).size().sort_values(ascending=False).head(10))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Optional: Get institutions from Google Scholar\n",
    "Using the [Scholarly package](https://pypi.org/project/scholarly/), query Google Scholar for institutions. Limitations: assumes first hit is correct, may need to use method get_proxy if too many requests."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#match each author to an institutions\n",
    "#NB this section calls Google Scholar once for each author, so it takes a while\n",
    "inst_list = list()\n",
    "for name in names_grouped.axes[0]:\n",
    "    search_query = scholarly.search_author(name[0]+' '+name[1])\n",
    "    try:\n",
    "        author = next(search_query).fill()\n",
    "        affil = re.sub('Professor .+, ','',author.affiliation)\n",
    "        inst_list.append(affil)\n",
    "    except:\n",
    "        inst_list.append('Undetermined')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#add to author counts series and print\n",
    "inst_series = pd.Series(names_grouped.values, index = inst_list)\n",
    "print('Overcited Institutions')\n",
    "print(inst_series.sort_values(ascending=False).head(10))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Optional: Get institutions from ORCID\n",
    "Using the ORCID API, query ORCID for institutions. Limitations: assumes first hit is correct, many authors have empty ORCIDs.\n",
    "\n",
    "Get an [ORCID API client ID & secret](https://support.orcid.org/hc/en-us/articles/360006897174)\n",
    "\n",
    "You can learn more about how to [search for an ORCID](https://members.orcid.org/api/tutorial/search-orcid-registry) and [find info about an author given their ORCID](https://members.orcid.org/api/tutorial/read-orcid-records) in the API documentation."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#input your client ID & key\n",
    "ORCIDAPI_ID = 'YOUR ACCOUNT ID HERE'\n",
    "ORCIDAPI_key = 'YOUR ACCOUNT KEY HERE'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# build a request for a token\n",
    "payload = {'client_id': ORCIDAPI_ID,\n",
    "                   'client_secret': ORCIDAPI_key,\n",
    "                   'scope': '/read-public',\n",
    "                   'grant_type': 'client_credentials'\n",
    "                   }\n",
    "url = 'https://orcid.org/oauth/token'\n",
    "headers = {'Accept': 'application/json'}\n",
    "response = requests.post(url, data=payload, headers=headers, timeout=None)\n",
    "response.raise_for_status()\n",
    "token = response.json()['access_token']\n",
    "\n",
    "# set up headers for searches\n",
    "headers = {'Accept': 'application/vnd.orcid+json',\n",
    "           'Authorization type': 'Bearer',\n",
    "           'Access token': token}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#find ORCIDs\n",
    "#NB this section calls the API once for each author, so it takes a while\n",
    "orcid_list = list()\n",
    "for name in names_grouped.axes[0]:\n",
    "    given = re.sub(' ','%20',name[0])\n",
    "    family = re.sub(' ','%20',name[1])\n",
    "\n",
    "    #build search\n",
    "    url = \"https://pub.orcid.org/v3.0/search/?q=\" \\\n",
    "        + \"family-name:\" + family + \"+AND+given-names:\" + given \\\n",
    "        + \"&rows=1\"\n",
    "    auth_id = requests.get(url, headers=headers, timeout=None)\n",
    "\n",
    "    #get first returned ORCID\n",
    "    if auth_id.json()['result'] is not None:\n",
    "        orcid_list.append(auth_id.json()['result'][0]['orcid-identifier']['path'])\n",
    "    else:\n",
    "        orcid_list.append('')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# get institution from ORCID\n",
    "#NB this section calls the API once for each author, so it takes a while\n",
    "inst_list = list()\n",
    "for orcid in orcid_list:\n",
    "    if len(orcid)>0:\n",
    "        url = \"https://pub.orcid.org/v2.1/\" + orcid + \"/record\"\n",
    "        orcid_request = requests.get(url, headers=headers, timeout=None)\n",
    "        affil = orcid_request.json()['activities-summary']['employments']['employment-summary']\n",
    "        if len(affil)>0:\n",
    "#         print(json.dumps(var, indent=2, separators=(',', ':')))\n",
    "            inst_list.append(affil[0]['organization']['name'])\n",
    "        else:\n",
    "            inst_list.append('Undetermined')\n",
    "    else:\n",
    "        inst_list.append('Undetermined')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#add to author counts series and print\n",
    "inst_series = pd.Series(names_grouped.values, index = inst_list)\n",
    "print('Overcited Institutions')\n",
    "print(inst_series.sort_values(ascending=False).head(10))"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Open Source Science

biorxiv_speaker_finder

Science Score: 67.0%

Repository

Basic Info

Statistics

Metadata Files

README.md

BioRxiv Speaker Finder

Citation Overrepresentation Tool

Launch both notebooks

Owner

Citation (citation_overrepresentation_tool.ipynb)

GitHub Events

Total

Last Year