biorxiv_speaker_finder
Extract a list of authors who have published preprints containing a user-defined keyword
Science Score: 67.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: Found CITATION.cff file
- ✓ codemeta.json file: Found codemeta.json file
- ✓ .zenodo.json file: Found .zenodo.json file
- ✓ DOI references: Found 2 DOI reference(s) in README
- ✓ Academic publication links: Links to: biorxiv.org, zenodo.org
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: Low similarity (8.4%) to scientific vocabulary
Repository
Basic Info
- Host: GitHub
- Owner: emilyasterjones
- License: gpl-3.0
- Language: Jupyter Notebook
- Default Branch: master
- Size: 41 KB
Statistics
- Stars: 4
- Watchers: 1
- Forks: 0
- Open Issues: 0
- Releases: 1
Metadata Files
README.md
BioRxiv Speaker Finder 
This iPython notebook extracts the first and last authors of bioRxiv preprints or PubMed manuscripts relevant to a subject area you supply. You can use it to find researchers outside your network to invite as speakers or to cite.
March 2023 update: Rxivist, the API required for the bioRxiv search portion of this tool, has shut down. The remaining bioRxiv API has no keyword search function. The PubMed search portion still works.
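As a rough illustration of a title-or-abstract keyword query against PubMed, the sketch below builds a search URL for NCBI's public E-utilities `esearch` endpoint. It is an assumption that the notebook queries PubMed this way; the helper name `build_pubmed_search_url` and its `max_results` parameter are hypothetical.

```python
from urllib.parse import urlencode

# Public NCBI E-utilities search endpoint (assumed; the notebook may query PubMed differently)
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_pubmed_search_url(keywords, max_results=100):
    """Hypothetical helper: build an esearch URL matching all keywords in title or abstract."""
    # [Title/Abstract] is PubMed's field tag for title-or-abstract matches;
    # quoting each keyword keeps multi-word phrases together
    term = " AND ".join(f'"{kw}"[Title/Abstract]' for kw in keywords)
    params = {"db": "pubmed", "term": term, "retmode": "json", "retmax": max_results}
    return f"{ESEARCH_URL}?{urlencode(params)}"


url = build_pubmed_search_url(["grid cells", "entorhinal"])
print(url)
```

Fetching this URL (for example with `requests.get`) would return a JSON list of matching PubMed IDs, which then have to be resolved to author lists in a separate step.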
Inputs: keyword (in title or abstract) & whether you are looking for trainees (first authors) or PIs (last authors)
Outputs: Printed data frame & CSV with a list of authors who have published manuscripts containing all keywords, with the following attributes:
* institution
* email
* ORCID link
* # manuscripts with provided keywords
* source (bioRxiv or PubMed)
* # total preprints
* # total downloads
* # total works (from ORCID)

Lists are sorted in order of # of keyword manuscripts so the most relevant authors will be at the top.
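The ranking step can be sketched with the standard library alone. This is a minimal illustration, not the notebook's own code: the helper `rank_authors` and the sample author lists are made up, and it assumes each matching manuscript is available as an ordered list of author names.

```python
from collections import Counter


def rank_authors(manuscripts, position="last"):
    """Hypothetical helper: count keyword-matching manuscripts per first or last author."""
    # Trainees are taken to be first authors, PIs to be last authors
    index = 0 if position == "first" else -1
    counts = Counter(authors[index] for authors in manuscripts if authors)
    # most_common() sorts by descending count, so the most relevant authors come first
    return counts.most_common()


# Made-up author lists, one per manuscript that matched the keywords
manuscripts = [
    ["A. Trainee", "B. Postdoc", "C. PI"],
    ["D. Student", "C. PI"],
    ["A. Trainee", "E. Head"],
]
print(rank_authors(manuscripts, position="last"))
print(rank_authors(manuscripts, position="first"))
```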
Optional: code cells at the end use APIs to predict the gender & ethnicity of the authors. Predicted female & minority authors are printed and the predictions are appended as columns to the CSV.
If you use these cells, please read the important caveats at the top of the notebook and treat these predictions with caution.
If you use this tool, please cite the original paper which created the Rxivist API:
Abdill RJ, Blekhman R. "Tracking the popularity and outcomes of all bioRxiv preprints." eLife (2019). doi: 10.7554/eLife.45133.
Citation Overrepresentation Tool
This second iPython notebook identifies overrepresented authors, journals, & institutions from a citation list.
Inputs: .bib file extracted from a paper
Outputs: printed, sorted lists of the most-cited last authors, journals, & institutions (as extracted from Google Scholar or ORCID)
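To make "overrepresented" concrete, the sketch below turns a citation list into frequency tables that also report each entry's share of all citations. It is an illustration under assumptions: the helper `frequency_table` is hypothetical, the sample data is made up, and the notebook itself reports raw counts with pandas rather than shares.

```python
from collections import Counter


def frequency_table(values, top=10):
    """Hypothetical helper: return (value, count, share of all citations), most frequent first."""
    counts = Counter(values)
    total = sum(counts.values())
    return [(value, n, n / total) for value, n in counts.most_common(top)]


# Made-up (last author, journal) pairs standing in for parsed .bib entries
citations = [
    ("Author A", "Journal X"),
    ("Author A", "Journal Y"),
    ("Author B", "Journal Z"),
    ("Author A", "Journal X"),
]
author_table = frequency_table(author for author, _ in citations)
journal_table = frequency_table(journal for _, journal in citations)
print(author_table)
print(journal_table)
```

A share well above what you would expect for a single lab or journal is a prompt to look for alternative sources to cite.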
Launch both notebooks
Owner
- Name: Emily A. Aery Jones
- Login: emilyasterjones
- Kind: user
- Company: Stanford University
- Website: emilyjon.es
- Repositories: 3
- Profile: https://github.com/emilyasterjones
Postdoc studying spatial memory 🐭🧠 at Stanford. Incoming Assistant Professor at University of Maryland, Baltimore starting Jan 2026.
Citation (citation_overrepresentation_tool.ipynb)
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Citation Overrepresentation Tool\n",
"This tool extracts last authors & journals of papers you have cited and creates frequency tables so you can see who you cite the most often. It then attempts to map authors to institutions via the ORCID database. You can use it to find which journals, labs, and universities receive most of your attention.\n",
"\n",
"1. Extract your citations as a .bib file. Either extract all refs from a single folder in your citation manager or extract them straight from a Word doc if you use Mendeley or Zotero using [this tool](https://rintze.zelle.me/ref-extractor/).\n",
"2. Upload your .bib file to the binder.\n",
"3. Run each cell (Shift+Enter or Play button).\n",
"4. Optional: extract institutions from Google Scholar\n",
"5. Optional: get an ORCID API key (instructions below) to extract institutions."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#imports\n",
"!pip install pybtex\n",
"!pip install scholarly\n",
"\n",
"from pybtex.database.input import bibtex\n",
"import glob\n",
"import pandas as pd\n",
"import requests\n",
"import re\n",
"from scholarly import scholarly"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#read .bib file\n",
"ID = glob.glob('*bib')\n",
"if not ID:\n",
"    raise FileNotFoundError(\"No .bib file found. Please upload one & try again.\")\n",
"parser = bibtex.Parser()\n",
"try:\n",
"    bib_data = parser.parse_file(ID[0])\n",
"except Exception as err:\n",
"    raise ValueError(\"Your .bib file has non-UTF8 characters in it (like smart quotes). Please remove them & try again.\") from err"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#extract last author & journal names from each citation\n",
"authors = list()\n",
"for key in bib_data.entries:\n",
"    #the last author is the final entry in the author list\n",
"    last_author = bib_data.entries[key].persons['author'][-1]\n",
"    #join name parts so multi-part names (e.g. 'Aery Jones') stay intact\n",
"    first_name = ' '.join(str(part) for part in last_author.rich_first_names)\n",
"    last_name = ' '.join(str(part) for part in last_author.rich_last_names)\n",
"\n",
"    try:\n",
"        journal = bib_data.entries[key].fields['journal']\n",
"    except KeyError:\n",
"        #entries without a journal field are labelled as books\n",
"        journal = 'Book'\n",
"    authors.append([first_name, last_name, journal])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#build data frame & print\n",
"auth_df = pd.DataFrame(authors, columns=['First Name','Last Name', 'Journal'])\n",
"print('Overcited Authors')\n",
"names_grouped = auth_df.groupby(['First Name','Last Name']).size()\n",
"print(names_grouped.sort_values(ascending=False).head(10))\n",
"print('\\nOvercited Journals')\n",
"print(auth_df.groupby(['Journal']).size().sort_values(ascending=False).head(10))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Optional: Get institutions from Google Scholar\n",
"Using the [Scholarly package](https://pypi.org/project/scholarly/), query Google Scholar for institutions. Limitations: assumes first hit is correct, may need to use method get_proxy if too many requests."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#match each author to an institution\n",
"#NB this section calls Google Scholar once for each author, so it takes a while\n",
"inst_list = list()\n",
"for name in names_grouped.axes[0]:\n",
"    search_query = scholarly.search_author(name[0]+' '+name[1])\n",
"    try:\n",
"        #scholarly >= 1.0 returns each author as a dict\n",
"        author = next(search_query)\n",
"        affil = re.sub('Professor .+, ','',author['affiliation'])\n",
"        inst_list.append(affil)\n",
"    except Exception:\n",
"        inst_list.append('Undetermined')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#sum author citation counts per institution and print\n",
"inst_series = pd.Series(names_grouped.values, index = inst_list).groupby(level=0).sum()\n",
"print('Overcited Institutions')\n",
"print(inst_series.sort_values(ascending=False).head(10))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Optional: Get institutions from ORCID\n",
"Using the ORCID API, query ORCID for institutions. Limitations: assumes first hit is correct, many authors have empty ORCIDs.\n",
"\n",
"Get an [ORCID API client ID & secret](https://support.orcid.org/hc/en-us/articles/360006897174)\n",
"\n",
"You can learn more about how to [search for an ORCID](https://members.orcid.org/api/tutorial/search-orcid-registry) and [find info about an author given their ORCID](https://members.orcid.org/api/tutorial/read-orcid-records) in the API documentation."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#input your client ID & key\n",
"ORCIDAPI_ID = 'YOUR ACCOUNT ID HERE'\n",
"ORCIDAPI_key = 'YOUR ACCOUNT KEY HERE'"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# build a request for a token\n",
"payload = {'client_id': ORCIDAPI_ID,\n",
"           'client_secret': ORCIDAPI_key,\n",
"           'scope': '/read-public',\n",
"           'grant_type': 'client_credentials'\n",
"          }\n",
"url = 'https://orcid.org/oauth/token'\n",
"headers = {'Accept': 'application/json'}\n",
"response = requests.post(url, data=payload, headers=headers, timeout=None)\n",
"response.raise_for_status()\n",
"token = response.json()['access_token']\n",
"\n",
"# set up headers for searches\n",
"# the token is sent as a standard OAuth bearer token\n",
"headers = {'Accept': 'application/vnd.orcid+json',\n",
"           'Authorization': 'Bearer ' + token}"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#find ORCIDs\n",
"#NB this section calls the API once for each author, so it takes a while\n",
"orcid_list = list()\n",
"for name in names_grouped.axes[0]:\n",
"    given = re.sub(' ','%20',name[0])\n",
"    family = re.sub(' ','%20',name[1])\n",
"\n",
"    #build search\n",
"    url = \"https://pub.orcid.org/v3.0/search/?q=\" \\\n",
"        + \"family-name:\" + family + \"+AND+given-names:\" + given \\\n",
"        + \"&rows=1\"\n",
"    auth_id = requests.get(url, headers=headers, timeout=None)\n",
"\n",
"    #get first returned ORCID\n",
"    if auth_id.json()['result'] is not None:\n",
"        orcid_list.append(auth_id.json()['result'][0]['orcid-identifier']['path'])\n",
"    else:\n",
"        orcid_list.append('')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# get institution from ORCID\n",
"#NB this section calls the API once for each author, so it takes a while\n",
"inst_list = list()\n",
"for orcid in orcid_list:\n",
"    if len(orcid)>0:\n",
"        url = \"https://pub.orcid.org/v2.1/\" + orcid + \"/record\"\n",
"        orcid_request = requests.get(url, headers=headers, timeout=None)\n",
"        affil = orcid_request.json()['activities-summary']['employments']['employment-summary']\n",
"        if len(affil)>0:\n",
"#            print(json.dumps(var, indent=2, separators=(',', ':')))\n",
"            inst_list.append(affil[0]['organization']['name'])\n",
"        else:\n",
"            inst_list.append('Undetermined')\n",
"    else:\n",
"        inst_list.append('Undetermined')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#sum author citation counts per institution and print\n",
"inst_series = pd.Series(names_grouped.values, index = inst_list).groupby(level=0).sum()\n",
"print('Overcited Institutions')\n",
"print(inst_series.sort_values(ascending=False).head(10))"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.3"
}
},
"nbformat": 4,
"nbformat_minor": 4
}
GitHub Events
Total
- Watch event: 1
Last Year
- Watch event: 1