findpapers

Find academic papers on PubMed and cluster by similarity.

https://github.com/ahl27/findpapers


Repository

Basic Info
  • Host: GitHub
  • Owner: ahl27
  • License: gpl-3.0
  • Language: Python
  • Default Branch: main
  • Homepage:
  • Size: 69.3 KB
Statistics
  • Stars: 1
  • Watchers: 1
  • Forks: 0
  • Open Issues: 0
  • Releases: 0
Created over 4 years ago · Last pushed over 4 years ago

README.md

Running the Script

Run the script with python3 main.py. It is implemented almost entirely in base Python 3; the only dependencies should be numpy and Python >= 3.7. Note that citation_network.py, shown below, also imports requests and xmltodict.

See definition of user parameters below.

NCBI enforces an API access limit of 3 requests/second (10,800 calls per hour), which is the main speed bottleneck. This script makes approximately 7,000 calls per hour, about 65% of that ceiling. I might go back and make it more efficient later.

You can generate an API key from your NCBI account and pass it in the api_key argument. This is supposed to raise the limit to 10 requests/second, which should improve processing time. Even with a key, though, my observed rate is still roughly 2 abstracts processed per second.
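The rate limits above can also be respected on the client side by spacing requests out, rather than waiting for a rejected request and retrying. The following is a minimal sketch under that assumption; `Throttle` and `calls_per_hour` are hypothetical names for illustration and are not part of this repository.

```python
import time


class Throttle:
    """Hypothetical helper: allows at most `rate` calls per second."""

    def __init__(self, rate, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / rate
        self.clock = clock
        self.sleep = sleep
        # No call has been made yet, so the first call never waits.
        self.next_allowed = float('-inf')

    def wait(self):
        """Block until the next call is allowed, then reserve the slot."""
        now = self.clock()
        if now < self.next_allowed:
            self.sleep(self.next_allowed - now)
            now = self.clock()
        self.next_allowed = max(now, self.next_allowed) + self.min_interval


def calls_per_hour(rate_per_second):
    """Upper bound on calls per hour for a given per-second limit."""
    return int(rate_per_second * 3600)


print(calls_per_hour(3))   # 10800 (no API key)
print(calls_per_hour(10))  # 36000 (with API key)
```

Calling `throttle.wait()` immediately before each request would keep the request rate at or below the chosen limit; the `clock` and `sleep` arguments exist only so the behavior can be checked without real delays.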

Tested and working on Pythonista 3 for iPad.

User Parameters

Parameters are defined within main.py.

Example usage:

```
toolname = 'ahl27litreview'
email = 'example@example.com'
init_pmids = ['24349035']
api_key = None
outfile_name = 'filesfound.txt'
verbose = True
depth = 1
nclust = 2
terms = [['coevolution', 'coevolutionary', 'cooccurence'], ['phylogenetic', 'profile', 'phylogeny', 'mirrortree', 'contexttree']]
```

Explanations:

Required params to use NCBI API:

  • toolname: name of tool, set it to some string corresponding with your project.
    • ex. 'alakshmantool'
  • email: email address for contacting if there's a problem.
    • ex. 'a@gmail.com'
  • init_pmids: PubMed IDs to build a network from, as a comma separated list of numbers.
    • Should be fine as strings or integers.
    • If only using a single ID, still put it into a list (like ['1'])
    • ex. ['001', '002', '003']

Note that if you do not provide an email and toolname your requests may be blocked. Supplying an invalid email address will mean NCBI cannot contact you if there's a problem, and can result in your IP address being blacklisted from using any NCBI API commands.
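As a rough illustration of how these required parameters end up in an E-utilities request, the sketch below builds the query for an elink call without sending it. It mirrors the params dictionary used in citation_network.py; `build_elink_params` is a hypothetical helper name, not a function from this repository.

```python
from urllib.parse import urlencode

ELINK_URL = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi'


def build_elink_params(pmid, toolname, email, api_key=None):
    """Hypothetical helper: query parameters for a 'cited in' lookup."""
    if not toolname or not email:
        raise ValueError('toolname and email must both be provided')
    params = {
        'dbfrom': 'pubmed',
        'linkname': 'pubmed_pubmed_citedin',
        'id': str(pmid),  # strings and integers are both accepted here
        'tool': toolname,
        'email': email,
    }
    # The API key is optional and only sent when one is supplied.
    if api_key is not None:
        params['api_key'] = api_key
    return params


params = build_elink_params(24349035, 'ahl27litreview', 'example@example.com')
print(ELINK_URL + '?' + urlencode(params))
```

Printing the URL instead of requesting it makes it easy to confirm that tool and email are attached to every call before any traffic is sent to NCBI.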

Additional parameters:

  • verbose: True to print out progress, False to suppress most output
    • default: True
  • api_key: Input your API key as a string, or use None if you don't have one.
    • ex. None or '123456789abcdef'
  • outfile_name: file to save results to. Set to None to print out output instead of saving.
    • ex. 'fileout.txt' or None
  • depth: How far into the network to go.
    • ex. 3
    • Depth=n means that every paper returned is within distance n of at least one paper in init_pmids. Two papers are at distance 1 if one cites the other.
    • Papers are filtered by search term before expanding to a new depth.
    • CAUTION: The number of papers returned grows rapidly with depth. The example returns 35 papers at depth 1, and 750 at depth 2.
  • nclust: Number of clusters to cluster into.
    • ex. 3
  • terms: Search terms, organized as a list of lists.
    • The abstract must contain at least one word from each nested list.
    • The format is essentially: [ [1, 2], [3, 4] ] => (1 OR 2) AND (3 OR 4)
    • Leave empty ( [] ) to just grab everything.
    • Only use lowercase letters: the abstract is lowercased and hyphens are removed before it is filtered.
    • For example, "Co-Evolution" becomes "coevolution".
    • ex. [['streptomyces', 'pseudomonas'], ['antibiotics']]
    • This returns abstracts that include both 'antibiotics' and at least one term from (streptomyces, pseudomonas)
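The interaction between depth and terms described above can be sketched on a toy citation network. This is a simplified illustration rather than the repository's implementation: the `LINKS` and `ABSTRACTS` data are invented, and `tokenize`, `matches`, and `expand` are hypothetical names. It shows hyphen removal and lowercasing, the (1 OR 2) AND (3 OR 4) matching rule, and filtering before expanding to the next depth.

```python
import re

# Toy data, invented for illustration: citation links and abstracts.
LINKS = {
    'A': ['B', 'C'],
    'B': ['A', 'D'],
    'C': ['A', 'E'],
    'D': ['B'],
    'E': ['C'],
}
ABSTRACTS = {
    'A': 'Co-Evolution of proteins inferred from phylogenetic profiles.',
    'B': 'Coevolutionary signals in a phylogeny of bacteria.',
    'C': 'A survey of antibiotics.',
    'D': 'Mirror-tree methods detect coevolution.',
    'E': 'Phylogenetic analysis of coevolution.',
}


def tokenize(text):
    """Remove hyphens, lowercase, and split on non-alphanumeric characters."""
    text = text.replace('-', '')
    return re.sub('[^0-9a-zA-Z]+', ' ', text).lower().split()


def matches(tokens, terms):
    """Every term group must have at least one term among the tokens."""
    return all(any(term in tokens for term in group) for group in terms)


def expand(seeds, terms, depth):
    """Collect matching papers within `depth` links of the seeds."""
    kept = {p for p in seeds if matches(tokenize(ABSTRACTS[p]), terms)}
    frontier = list(seeds)
    for _ in range(depth):
        neighbours = {n for p in frontier for n in LINKS[p]} - kept - set(seeds)
        # Filter first: only matching papers are kept and expanded further.
        frontier = sorted(n for n in neighbours
                          if matches(tokenize(ABSTRACTS[n]), terms))
        kept.update(frontier)
    return sorted(kept)


terms = [['coevolution', 'coevolutionary'],
         ['phylogenetic', 'phylogeny', 'mirrortree']]
print(expand(['A'], terms, depth=1))  # ['A', 'B']
print(expand(['A'], terms, depth=2))  # ['A', 'B', 'D']
```

Paper E matches the terms but is never returned at depth 2, because it is only reachable through C, which was filtered out before the second expansion. With terms set to [], every reachable paper is kept.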

Example Output:

```
Finding papers from initial paper(s)...

Search depth of 1
1 element(s) to search.
=========================
35 papers found.

Finding abstracts...
=========================

8 total abstracts matched search criteria.

Clustering with k-means (k=2)...

Clusters found:

Cluster 1 (4 items): [18930732, 18199838, 16139301, 24349035]

Cluster 2 (4 items): [23458856, 18818697, 20363731, 32043173]
```
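The clustering step itself is not shown on this page. As a rough idea of what k-means over tokenized abstracts could look like with numpy, here is a minimal sketch. The bag-of-words features, the random initialization, and the names `bag_of_words` and `kmeans` are all assumptions made for illustration, and the toy abstracts are invented; the repository may represent and cluster abstracts differently.

```python
import numpy as np


def bag_of_words(token_lists):
    """Turn tokenized abstracts into a (papers x vocabulary) count matrix."""
    vocab = sorted({tok for tokens in token_lists for tok in tokens})
    index = {tok: j for j, tok in enumerate(vocab)}
    counts = np.zeros((len(token_lists), len(vocab)))
    for i, tokens in enumerate(token_lists):
        for tok in tokens:
            counts[i, index[tok]] += 1
    return counts


def kmeans(points, k, n_iter=50, seed=0):
    """Plain Lloyd's algorithm; returns one cluster label per row."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random.
    centers = points[rng.choice(len(points), size=k, replace=False)]
    labels = np.zeros(len(points), dtype=int)
    for _ in range(n_iter):
        # Squared Euclidean distance from every point to every center.
        dists = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        new_centers = centers.copy()
        for c in range(k):
            members = points[labels == c]
            if len(members) > 0:  # keep the old center if a cluster empties
                new_centers[c] = members.mean(axis=0)
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels


# Toy tokenized abstracts, invented for illustration.
abstracts = {
    'p1': ['coevolution', 'protein', 'coevolution'],
    'p2': ['coevolution', 'protein'],
    'p3': ['antibiotic', 'resistance'],
    'p4': ['antibiotic', 'resistance', 'antibiotic'],
}
pmids = list(abstracts)
labels = kmeans(bag_of_words([abstracts[p] for p in pmids]), k=2)
for c in range(2):
    print('Cluster', c + 1, [p for p, lab in zip(pmids, labels) if lab == c])
```

Which group ends up as cluster 1 depends on the random initialization, so only the grouping itself is meaningful, not the cluster numbers.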

Owner

  • Name: Aidan Lakshman
  • Login: ahl27
  • Kind: user
  • Location: Pittsburgh, PA
  • Company: University of Pittsburgh

Citation (citation_network.py)

import requests
import xmltodict
import re
from time import sleep
from stopwords import stopwords

def find_citing_articles(pmid, toolname, email, apikey, return_pmids=True, use_pmcid=False):
	# By default returns a list of PMIDs that cite the given article in PubMed
	params = {'id': pmid, 'tool': toolname, 'email': email,
						'linkname': 'pubmed_pmc_refs', 'dbfrom': 'pubmed'}
	if apikey is not None:
		params['api_key'] = apikey
	if (return_pmids):
		params['linkname'] = 'pubmed_pubmed_citedin'
	elif (use_pmcid):
		# pmc_pmc_citedby links start from PMC, so the id must be a PMCID
		params['linkname'] = 'pmc_pmc_citedby'
		params['dbfrom'] = 'pmc'
	
	return(make_request(params))


def find_cited_articles(pmid, toolname, email, apikey, pmc_refs_only=False, id_by_pmid=True):
	# Returns a list of all articles cited by the article
	params = {'id': pmid, 'tool': toolname, 'email': email,
						'linkname': 'pmc_refs_pubmed', 'dbfrom': 'pmc'}
	if apikey is not None:
		params['api_key'] = apikey
	if (pmc_refs_only):
		params['linkname'] = 'pmc_pmc_cites'
	elif (id_by_pmid):
		params['linkname'] = 'pubmed_pubmed_refs'
		params['dbfrom'] = 'pubmed'
	
	return(make_request(params))
	
	
def make_request(params):
	request = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi'
	r = requests.get(request, params=params)
	
	# Only 3 requests are allowed per second (10 with an API key);
	# wait and retry if the request was rejected
	while r.status_code != 200:
		sleep(1)
		r = requests.get(request, params=params)
	ids = []
	parsed = xmltodict.parse(r.text)
	
	# No LinkSetDb element means no linked articles were found
	if ('LinkSetDb' not in parsed['eLinkResult']['LinkSet'].keys()):
		return ids
		
	cites = parsed['eLinkResult']['LinkSet']['LinkSetDb']['Link']
	# xmltodict returns a single dict, not a list, when there is only one Link
	if isinstance(cites, dict):
		cites = [cites]
	for elem in cites:
		# Skip links that carry extra fields besides the Id
		if len(elem) == 1:
			ids.append(elem['Id'])
		
	return ids
	

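def get_abstract(id, toolname, email, apikey):
	request = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi'
	params = {'db': 'pubmed', 'id': id, 'tool': toolname, 'email': email,
						'retmode': 'JSON', 'rettype': 'abstract'}
	if apikey is not None:
		params['api_key'] = apikey
	r = requests.get(request, params=params)
	# Wait and retry if the request was rejected (e.g. rate limit exceeded)
	while r.status_code != 200:
		sleep(1)
		r = requests.get(request, params=params)
	# The response is split into blank-line-separated blocks; the abstract
	# is taken from the fifth one
	blocks = r.text.split('\n\n')
	if len(blocks) < 5:
		# No abstract block in the response; treat it as an empty abstract
		return []
	abstract = blocks[4].replace('\n', ' ')

	return tokenize_abstract(abstract)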

def tokenize_abstract(abstract):
	alph_abs = abstract.replace('-', '')
	alph_abs = re.sub("[^0-9a-zA-Z]+", " ", alph_abs).lower().strip()
	# split() with no argument avoids an empty-string token for empty input
	tokens = alph_abs.split()
	tokens = [word for word in tokens if word not in stopwords]

	return(tokens)
	
def get_unique_wordslist(list_of_abstracts):
	# Set of every distinct word across all tokenized abstracts
	return {x for i in list_of_abstracts for x in i}

def print_progress_bar(k, maxk, barwidth=25, forcesame=False, time_per_k = 0.5):
	# 0.5 is about how fast it can compute based on my machine
	# at the end of the day it's really just an estimate
	if k == 0 and not forcesame:
			print('[' + (' '*barwidth) + '] (0/' + str(maxk) + ')', end='')
	else:
		num_bars = int((k/maxk) * barwidth)
		prog = '=' * num_bars
		spacer = ' ' * (barwidth - num_bars)
		remaining_sec = int((maxk - k) * time_per_k + 0.5)
		remaining_hr = remaining_sec // 3600
		remaining_min = (remaining_sec % 3600) // 60
		remaining_sec = remaining_sec % 60
		timestring = "{hr}:{min:02d}:{sec:02d}".format(hr=remaining_hr, min=remaining_min, sec=remaining_sec)
		print('\r' + '[' + prog + spacer + '] (' + str(k) + '/' + str(maxk) + ', ' + timestring + ')', end='')
	
	if k == maxk:
		print()
		
	
def gen_paper_network(pmids, toolname, email, terms, apikey=None, depth=1, verbose=True):
	if any(i is None for i in [toolname, email]):
		raise Exception("Tool and Email name must be specified")
	if verbose:
		print("Finding papers from initial paper(s)...\n")
		print('Gathering initial papers (' + str(len(pmids)) + ' to search)')
	network = abstracts_from_network(pmids, toolname, email, terms, apikey, verbose)
	temp = pmids
	
	for j in range(1,(depth+1)):
		print("\nSearch depth of " + str(j))
		num_elements = len(temp)
		print(str(num_elements) + " element(s) to search.")
		cur = set(temp)
		temp = []
		if verbose:
			k = 0
			print_progress_bar(0, num_elements)
		for item in cur: 
			temp = temp + find_citing_articles(item, toolname, email, apikey) + find_cited_articles(item, toolname, email, apikey)
			if verbose:
				k += 1 
				print_progress_bar(k, num_elements)
		temp = list(set([paperid for paperid in temp if paperid not in network.keys()]))
		if verbose:
			print(str(len(temp)) + " papers found.\n")
			print("Finding abstracts...")
		abstracts = abstracts_from_network(temp, toolname, email, terms, apikey, verbose)
		if verbose:
			print(str(len(abstracts.keys())) + " papers met search criteria.\n")
		temp = list(abstracts.keys())
		# Stop early when no new papers matched ('temp is []' is never True)
		if not temp:
			break
		network.update(abstracts)
				
	return(network)
	
def abstracts_from_network(network, toolname, email, terms, apikey, verbose=True):
	abstracts = {}
	maxlen = len(network)
	if verbose:
		print_progress_bar(0, maxlen)
		k = 0
	for item in network:
		abstract = get_abstract(item, toolname, email, apikey)
		flag = True
		# Only filter when search terms were given ('terms is not []' is always True)
		if terms:
			flag = find_search_terms(abstract, terms)
		if flag:
			abstracts[item] = abstract
		if verbose:
			k += 1
			print_progress_bar(k, maxlen, forcesame=True)
		
	return(abstracts)
	
def find_search_terms(corpus, terms):
	state = True
	for term_list in terms:
		state = state and any([term in corpus for term in term_list])
	return(state)		
			
if __name__ == '__main__':
	toolname = 'ahl27litreview'
	email = 'example@example.com'
	init_pmids = ['24349035']
	search_terms = [['coevolution', 'coevolutionary', 'cooccurence'],
								['phylogenetic', 'profile', 'phylogeny', 'mirrortree', 'contexttree']]
		
	print(gen_paper_network(init_pmids, toolname, email, terms=search_terms, depth=2))
