findpapers
Find academic papers on PubMed and cluster by similarity.
Science Score: 31.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: Found CITATION.cff file
- ✓ codemeta.json file: Found codemeta.json file
- ○ .zenodo.json file
- ○ DOI references
- ○ Academic publication links
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: Low similarity (13.5%) to scientific vocabulary
Repository
Basic Info
Statistics
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
Running the Script
Run this script with python3 main.py.
This script is implemented almost entirely in base Python 3.
The only dependencies should be numpy and Python >= 3.7.
See definition of user parameters below.
NCBI enforces an API access limit of 3 requests/second, which is the main speed bottleneck. This script queries at a rate of approximately 7,000 calls per hour, about two-thirds of the theoretical maximum of 10,800 calls per hour (3 requests/second × 3,600 seconds). I might go back and make it more efficient later.
You can generate an API key from your NCBI account and pass it in the api_key argument.
This is supposed to increase the limit to 10 requests/second, which should improve processing time.
Even with a key, though, my rate is still roughly 2 abstracts processed per second.
Tested and working on Pythonista 3 for iPad.
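The rate limits above translate directly into a minimum spacing between requests. As a rough sketch of how that spacing could be enforced, the hypothetical `Throttle` helper below (not part of this script, which instead retries after a failed request) waits just long enough between calls to stay under a given requests/second limit. The `clock` and `sleep` arguments are assumptions added only to make the helper testable.

```python
import time


class Throttle:
    """Hypothetical helper: keeps calls at least 1/rate seconds apart."""

    def __init__(self, requests_per_second, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / requests_per_second
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call, if any

    def wait(self):
        """Block until enough time has passed since the previous call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


def max_calls_per_hour(requests_per_second):
    """Theoretical ceiling implied by a requests/second limit."""
    return requests_per_second * 3600


if __name__ == '__main__':
    print(max_calls_per_hour(3))   # limit without an API key
    print(max_calls_per_hour(10))  # limit with an API key
    throttle = Throttle(3)
    for _ in range(3):
        throttle.wait()  # call this right before each request
```

Calling `throttle.wait()` before each request would avoid most rejected requests up front, at the cost of never exceeding the configured rate.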
User Parameters
Parameters are defined within main.py.
Example usage:
```
toolname = 'ahl27litreview'
email = 'example@example.com'
init_pmids = ['24349035']
api_key = None
outfile_name = 'filesfound.txt'
verbose = True
depth = 1
nclust = 2
terms = [['coevolution', 'coevolutionary', 'cooccurence'], ['phylogenetic', 'profile', 'phylogeny', 'mirrortree', 'contexttree']]
```
Explanations:
Required params to use NCBI API:
- toolname: name of tool, set it to some string corresponding with your project.
  - ex. 'alakshmantool'
- email: email address for contacting if there's a problem.
  - ex. 'a@gmail.com'
- init_pmids: PubMed IDs to build a network from, as a comma separated list of numbers.
  - Should be fine as strings or integers.
  - If only using a single ID, still put it into a list (like ['1'])
  - ex. ['001', '002', '003']
Note that if you do not provide an email and toolname your requests may be blocked. Supplying an invalid email address will mean NCBI cannot contact you if there's a problem, and can result in your IP address being blacklisted from using any NCBI API commands.
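To illustrate how these required values end up in a request, here is a minimal sketch that mirrors the parameter dictionaries built in citation_network.py. The helper names `build_eutils_params` and `normalize_pmids` are hypothetical and exist only for this example; `tool`, `email`, and `api_key` are the query parameter names the script itself sends.

```python
def normalize_pmids(init_pmids):
    """Accept PMIDs as strings or integers and return them as strings."""
    return [str(pmid) for pmid in init_pmids]


def build_eutils_params(toolname, email, api_key=None, **extra):
    """Build the query parameters shared by every request (hypothetical helper)."""
    if not toolname or not email:
        raise ValueError("toolname and email are both required")
    params = {'tool': toolname, 'email': email}
    if api_key is not None:
        # Only send api_key when the user actually has one
        params['api_key'] = api_key
    params.update(extra)
    return params


if __name__ == '__main__':
    pmids = normalize_pmids([24349035, '18930732'])
    print(build_eutils_params('ahl27litreview', 'example@example.com',
                              id=pmids[0], dbfrom='pubmed',
                              linkname='pubmed_pubmed_citedin'))
```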
Additional parameters:
- verbose: True to print out progress, False to suppress most output
  - default: True
- api_key: Input your API key as a string, or use None if you don't have one.
  - ex. None or '123456789abcdef'
- outfile_name: file to save results to. Set to None to print out output instead of saving.
  - ex. 'fileout.txt' or None
- depth: How far into the network to go.
  - ex. 3
  - Depth=n means that all papers returned will be within n distance from at least one paper in init_pmids. A paper that cites or is cited by a given paper is distance 1 away from it.
  - Papers are filtered by search term before expanding to a new depth.
  - CAUTION: The number of papers returned grows rapidly with depth. The example returns 35 papers at depth 1, and 750 at depth 2.
- nclust: Number of clusters to cluster into.
  - ex. 3
- terms: Search terms, organized as a list of lists.
  - Within each nested list, the abstract must contain at least one word from it.
  - The format is essentially: [ [1, 2], [3, 4] ] => (1 OR 2) AND (3 OR 4)
  - Leave empty ([]) to just grab everything.
  - Only use lowercase letters: the abstract is lowercased before it's filtered. Additionally, hyphens are removed.
    - "Co-Evolution" becomes "coevolution"
  - ex. [['streptomyces', 'pseudomonas'], ['antibiotics']]
    - This returns abstracts that include both 'antibiotics' and at least one term from (streptomyces, pseudomonas)
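The terms logic described above can be sketched in a few lines. The example below follows the same normalization as tokenize_abstract in citation_network.py (hyphens removed, non-alphanumerics replaced by spaces, lowercased) but skips stopword removal, since the stopword list is not shown here. The function names `normalize` and `matches` are assumptions made for this illustration.

```python
import re


def normalize(text):
    """Strip hyphens, replace other punctuation with spaces, lowercase, split."""
    text = text.replace('-', '')
    text = re.sub('[^0-9a-zA-Z]+', ' ', text).lower().strip()
    return text.split(' ')


def matches(tokens, terms):
    """(a OR b) AND (c OR d): every group needs at least one whole-word hit."""
    return all(any(term in tokens for term in group) for group in terms)


if __name__ == '__main__':
    terms = [['streptomyces', 'pseudomonas'], ['antibiotics']]
    print(normalize('Co-Evolution of Streptomyces'))
    print(matches(normalize('Streptomyces produce antibiotics.'), terms))
    print(matches(normalize('Pseudomonas biofilms'), terms))
    print(matches(normalize('Anything at all'), []))  # empty terms keep everything
```

Because matching is done against whole tokens, 'antibiotic' would not match a term of 'antibiotics'; each variant needs its own entry in the group.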
Example Output:
```
Finding papers from initial paper(s)...

Search depth of 1
1 element(s) to search.
=========================
35 papers found.

Finding abstracts...
=========================

8 total abstracts matched search criteria.

Clustering with k-means (k=2)...

Clusters found:
Cluster 1 (4 items): [18930732, 18199838, 16139301, 24349035]
Cluster 2 (4 items): [23458856, 18818697, 20363731, 32043173]
```
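The clustering code itself is not shown here, so the following is only a sketch of how tokenized abstracts could be grouped with k-means using numpy. It assumes a plain bag-of-words representation and standard Lloyd iterations; the script's actual features, initialization, and distance measure may differ, and the names `bag_of_words` and `kmeans` are hypothetical.

```python
import numpy as np


def bag_of_words(token_lists):
    """Turn tokenized abstracts into a (n_abstracts, n_words) count matrix."""
    vocab = sorted({tok for tokens in token_lists for tok in tokens})
    index = {tok: i for i, tok in enumerate(vocab)}
    X = np.zeros((len(token_lists), len(vocab)))
    for row, tokens in enumerate(token_lists):
        for tok in tokens:
            X[row, index[tok]] += 1
    return X, vocab


def kmeans(X, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm; returns one cluster label per row of X."""
    rng = np.random.default_rng(seed)
    # Start from k distinct rows chosen at random
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    labels = None
    for _ in range(n_iter):
        # Euclidean distance from every row to every centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # assignments stopped changing
        labels = new_labels
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:
                centroids[j] = members.mean(axis=0)
    return labels


if __name__ == '__main__':
    abstracts = [
        ['coevolution', 'protein', 'network'],
        ['coevolution', 'protein', 'interaction'],
        ['antibiotic', 'resistance', 'soil'],
        ['antibiotic', 'resistance', 'bacteria'],
    ]
    X, vocab = bag_of_words(abstracts)
    print(kmeans(X, k=2))
```

Cluster numbering depends on the random start, so only the grouping is meaningful, not which group is labeled 0 or 1.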
Owner
- Name: Aidan Lakshman
- Login: ahl27
- Kind: user
- Location: Pittsburgh, PA
- Company: University of Pittsburgh
- Website: www.ahl27.com
- Twitter: ahlakshman
- Repositories: 8
- Profile: https://github.com/ahl27
Citation (citation_network.py)
import requests
import xmltodict
import re
from time import sleep
from stopwords import stopwords
def find_citing_articles(pmid, toolname, email, apikey, return_pmids=True, use_pmcid=False):
    # By default returns a list of PMIDs that cite the given article in PubMed
    params = {'id': pmid, 'tool': toolname, 'email': email,
              'linkname': 'pubmed_pmc_refs', 'dbfrom': 'pubmed'}
    if apikey is not None:
        params['api_key'] = apikey
    if (return_pmids):
        params['linkname'] = 'pubmed_pubmed_citedin'
    elif (use_pmcid):
        params['linkname'] = 'pmc_pmc_citedby'
    return(make_request(params))

def find_cited_articles(pmid, toolname, email, apikey, pmc_refs_only=False, id_by_pmid=True):
    # Returns a list of all articles cited by the article
    params = {'id': pmid, 'tool': toolname, 'email': email,
              'linkname': 'pmc_refs_pubmed', 'dbfrom': 'pmc'}
    if apikey is not None:
        params['api_key'] = apikey
    if (pmc_refs_only):
        params['linkname'] = 'pmc_pmc_cites'
    elif (id_by_pmid):
        params['linkname'] = 'pubmed_pubmed_refs'
        params['dbfrom'] = 'pubmed'
    return(make_request(params))

def make_request(params):
    request = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi'
    r = requests.get(request, params=params)

    # Only 3 requests are allowed per second, delay if we're sending too many
    while r.status_code != 200:
        sleep(1)
        r = requests.get(request, params=params)

    ids = []
    parsed = xmltodict.parse(r.text)
    if ('LinkSetDb' not in parsed['eLinkResult']['LinkSet'].keys()):
        return(ids)

    cites = parsed['eLinkResult']['LinkSet']['LinkSetDb']['Link']
    # xmltodict returns a single dict instead of a list when there is only one Link
    if isinstance(cites, dict):
        cites = [cites]
    for elem in cites:
        if len(elem) > 1:
            pass
        else:
            ids.append(elem['Id'])
    return(ids)

def get_abstract(id, toolname, email, apikey):
    request = 'http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi'
    params = {'db': 'pubmed', 'id': id, 'tool': toolname, 'email': email,
              'retmode': 'JSON', 'rettype': 'abstract'}
    if apikey is not None:
        params['api_key'] = apikey
    r = requests.get(request, params=params)
    # Retry once per second until the request succeeds
    while r.status_code != 200:
        sleep(1)
        r = requests.get(request, params=params)
    r = r.text
    # Assumes the abstract is the fifth blank-line-separated block of the response
    r = r.split('\n\n')[4]
    r = r.replace('\n', ' ')
    return(tokenize_abstract(r))

def tokenize_abstract(abstract):
    alph_abs = abstract.replace('-', '')
    alph_abs = re.sub("[^0-9a-zA-Z]+", " ", alph_abs).lower().strip()
    tokens = alph_abs.split(' ')
    tokens = [word for word in tokens if word not in stopwords]
    return(tokens)

def get_unique_wordslist(list_of_abstracts):
    # Set of every distinct token across all abstracts
    res = {x for i in list_of_abstracts for x in i}
    return(res)

def print_progress_bar(k, maxk, barwidth=25, forcesame=False, time_per_k = 0.5):
    # 0.5 is about how fast it can compute based on my machine
    # at the end of the day it's really just an estimate
    if k == 0 and not forcesame:
        print('[' + (' '*barwidth) + '] (0/' + str(maxk) + ')', end='')
    else:
        num_bars = int((k/maxk) * barwidth)
        prog = '=' * num_bars
        spacer = ' ' * (barwidth - num_bars)
        remaining_sec = int((maxk - k) * time_per_k + 0.5)
        remaining_hr = remaining_sec // 3600
        remaining_min = (remaining_sec % 3600) // 60
        remaining_sec = remaining_sec % 60
        timestring = "{hr}:{min:02d}:{sec:02d}".format(hr=remaining_hr, min=remaining_min, sec=remaining_sec)
        print('\r' + '[' + prog + spacer + '] (' + str(k) + '/' + str(maxk) + ', ' + timestring + ')', end='')
    if k == maxk:
        print()

def gen_paper_network(pmids, toolname, email, terms, apikey=None, depth=1, verbose=True):
    if any(i is None for i in [toolname, email]):
        raise Exception("Tool name and email must be specified")
    if verbose:
        print("Finding papers from initial paper(s)...\n")
        print('Gathering initial papers (' + str(len(pmids)) + ' to search)')
    network = abstracts_from_network(pmids, toolname, email, terms, apikey, verbose)
    temp = pmids
    for j in range(1,(depth+1)):
        print("\nSearch depth of " + str(j))
        num_elements = len(temp)
        print(str(num_elements) + " element(s) to search.")
        cur = set(temp)
        temp = []
        if verbose:
            k = 0
            print_progress_bar(0, num_elements)
        for item in cur:
            temp = temp + find_citing_articles(item, toolname, email, apikey) + find_cited_articles(item, toolname, email, apikey)
            if verbose:
                k += 1
                print_progress_bar(k, num_elements)
        # Drop duplicates and papers already in the network
        temp = list(set([paperid for paperid in temp if paperid not in network.keys()]))
        if verbose:
            print(str(len(temp)) + " papers found.\n")
            print("Finding abstracts...")
        abstracts = abstracts_from_network(temp, toolname, email, terms, apikey, verbose)
        if verbose:
            print(str(len(abstracts.keys())) + " papers met search criteria.\n")
        temp = list(abstracts.keys())
        # Stop early when nothing new matched ("temp is []" is never true)
        if not temp:
            break
        network.update(abstracts)
    return(network)

def abstracts_from_network(network, toolname, email, terms, apikey, verbose=True):
    abstracts = {}
    maxlen = len(network)
    if verbose:
        print_progress_bar(0, maxlen)
        k = 0
    for item in network:
        abstract = get_abstract(item, toolname, email, apikey)
        flag = True
        # Only filter when search terms were given ("terms is not []" is always true)
        if terms:
            flag = find_search_terms(abstract, terms)
        if flag:
            abstracts[item] = abstract
        if verbose:
            k += 1
            print_progress_bar(k, maxlen, forcesame=True)
    return(abstracts)

def find_search_terms(corpus, terms):
    state = True
    for term_list in terms:
        state = state and any([term in corpus for term in term_list])
    return(state)

if __name__ == '__main__':
    toolname = 'ahl27litreview'
    email = 'example@example.com'
    init_pmids = ['24349035']
    search_terms = [['coevolution', 'coevolutionary', 'cooccurence'],
                    ['phylogenetic', 'profile', 'phylogeny', 'mirrortree', 'contexttree']]
    print(gen_paper_network(init_pmids, toolname, email, terms=search_terms, depth=2))
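The depth-limited expansion in gen_paper_network can be tried offline with a toy citation graph. The sketch below is a simplified stand-in, not the script's code: `expand_network`, the `links` dictionary, and the `keep` filter are hypothetical replacements for the NCBI lookups and the search-term filter. It follows the same order of operations, filtering newly found papers before they are expanded at the next depth.

```python
def expand_network(seeds, neighbors, keep, depth=1):
    """Collect papers within `depth` citation links of the seeds.

    neighbors(pid) returns papers that cite or are cited by pid;
    keep(pid) plays the role of the search-term filter.
    """
    network = {pid for pid in seeds if keep(pid)}
    frontier = list(seeds)
    for _ in range(depth):
        found = set()
        for pid in frontier:
            found.update(neighbors(pid))
        found -= network  # skip papers already collected
        # Filter before expanding: rejected papers are never followed further
        frontier = [pid for pid in found if keep(pid)]
        if not frontier:
            break
        network.update(frontier)
    return network


if __name__ == '__main__':
    # Toy undirected citation graph (hypothetical IDs)
    links = {'A': ['B', 'C'], 'B': ['A', 'D'], 'C': ['A'],
             'D': ['B', 'E'], 'E': ['D']}

    def neighbors(pid):
        return links.get(pid, [])

    def keep(pid):
        return pid != 'C'  # pretend C fails the search terms

    print(sorted(expand_network(['A'], neighbors, keep, depth=1)))
    print(sorted(expand_network(['A'], neighbors, keep, depth=2)))
```

Each extra level of depth multiplies the number of lookups by the typical number of citation links per paper, which is why the result count grows so quickly with depth.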
Issues and Pull Requests
Last synced: over 1 year ago
All Time
- Total issues: 0
- Total pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Total issue authors: 0
- Total pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 0
- Pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Issue authors: 0
- Pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0