Pubmed Parser

Pubmed Parser: A Python Parser for PubMed Open-Access XML Subset and MEDLINE XML Dataset XML Dataset - Published in JOSS (2020)

https://github.com/titipata/pubmed_parser

Keywords

article doi medline-xml nlp parse parser pmid pubmed-central pubmed-parser python xml

Last synced: 6 months ago

Repository

:clipboard: A Python Parser for PubMed Open-Access XML Subset and MEDLINE XML Dataset

Basic Info

Host: GitHub
Owner: titipata
License: mit
Language: Python
Default Branch: master
Homepage: http://titipata.github.io/pubmed_parser/
Size: 60.4 MB

Statistics

Stars: 691
Watchers: 21
Forks: 177
Open Issues: 14
Releases: 8

Topics

article doi medline-xml nlp parse parser pmid pubmed-central pubmed-parser python xml

Created almost 11 years ago · Last pushed 7 months ago

Metadata Files

Readme Contributing License Citation

Pubmed Parser: A Python Parser for PubMed Open-Access XML Subset and MEDLINE XML Dataset

Pubmed Parser is a Python library for parsing the PubMed Open-Access (OA) subset, MEDLINE XML repositories, and Entrez Programming Utilities (E-utils). It uses the lxml library to parse this information into a Python dictionary, which can be easily used for research such as text mining and natural language processing pipelines.

For the available APIs and details about the datasets, please see our wiki page or documentation page. Below, we list some of the core functionalities with code examples.

Available Parsers

The path provided to a function can point to a compressed or uncompressed XML file. We provide example files in the data folder.
For website parsing, you should pause between requests. Please see the copyright notice, because your IP can get blocked if you try to download in bulk.

Below, we list available parsers from pubmed_parser.

Parse PubMed OA XML information
Parse PubMed OA citation references
Parse PubMed OA images and captions
Parse PubMed OA Paragraph
Parse PubMed OA Table [WIP]
Parse MEDLINE XML
Parse MEDLINE Grant ID
Parse MEDLINE XML from eutils website
Parse MEDLINE XML citations from website
Parse Outgoing XML citations from website

Parse PubMed OA XML information

We created a simple parser for the PubMed Open Access Subset: give an XML path or string to the function parse_pubmed_xml, and it returns a dictionary with the following information:

full_title : article's title
abstract : abstract
journal : Journal name
pmid : PubMed ID
pmc : PubMed Central ID
doi : DOI of the article
publisher_id : publisher ID
author_list : list of authors with affiliation keys in the following format

python [['last_name_1', 'first_name_1', 'aff_key_1'], ['last_name_1', 'first_name_1', 'aff_key_2'], ['last_name_2', 'first_name_2', 'aff_key_1'], ...]

affiliation_list : list of affiliation keys and affiliation strings in the following format

python [['aff_key_1', 'affiliation_1'], ['aff_key_2', 'affiliation_2'], ...]

publication_year : publication year
subjects : list of subjects listed in the article separated by semicolon. Sometimes, it only contains the type of the article, such as a research article, review proceedings, etc.

python
import pubmed_parser as pp
dict_out = pp.parse_pubmed_xml(path)
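The author and affiliation lists described above are linked through affiliation keys. As a minimal sketch of how they might be joined, the helper below groups affiliation strings under each author. The function name `authors_with_affiliations` and the sample dictionary are illustrative assumptions, not part of the library; the sample is hand-written to match the documented format.

```python
from collections import defaultdict


def authors_with_affiliations(article):
    """Group affiliation strings under each author of a parsed article."""
    # Map affiliation key -> affiliation string, e.g. {'I1': 'Department of ...'}
    aff_lookup = {key: text for key, text in article["affiliation_list"]}
    grouped = defaultdict(list)
    # Each author entry is [last_name, first_name, aff_key]; an author with
    # several affiliations appears once per affiliation key.
    for last_name, first_name, aff_key in article["author_list"]:
        grouped[f"{first_name} {last_name}"].append(aff_lookup.get(aff_key, ""))
    return dict(grouped)


# Hand-written sample shaped like the documented output (not real parser output)
sample = {
    "author_list": [
        ["Dennehy", "John J", "I1"],
        ["Dennehy", "John J", "I2"],
        ["Wang", "Ing-Nang", "I1"],
    ],
    "affiliation_list": [
        ["I1", "Department of Biological Sciences"],
        ["I2", "Biology Department"],
    ],
}
print(authors_with_affiliations(sample))
```

With real data, the same helper could be applied to the dictionary returned by `pp.parse_pubmed_xml(path)`.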

Parse PubMed OA citation references

The function parse_pubmed_references processes a PubMed Open Access XML file and returns a list of dictionaries, one per reference the article cites. Each dictionary has the following keys:

pmid : PubMed ID of the article
pmc : PubMed Central ID of the article
article_title : title of cited article
journal : journal name
journal_type : type of journal
pmid_cited : PubMed ID of article that article cites
doi_cited : DOI of article that article cites
year : publication year as it appears in the reference (may include a letter suffix, e.g. 2007a)

python dicts_out = pp.parse_pubmed_references(path) # return list of dictionary
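Two small post-processing steps often follow reference parsing: collecting the distinct cited PMIDs and turning the `year` string (which may carry a letter suffix) into a number. The sketch below shows one way to do both. The helper names and the sample list are assumptions made for illustration; only the keys `pmid_cited` and `year` come from the list above.

```python
import re


def cited_pmids(references):
    """Return the unique, non-empty cited PMIDs in first-seen order."""
    seen = []
    for ref in references:
        pmid = ref.get("pmid_cited")
        if pmid and pmid not in seen:
            seen.append(pmid)
    return seen


def reference_year(ref):
    """Extract a four-digit year from values such as '2007a'; None if absent."""
    match = re.match(r"\d{4}", ref.get("year") or "")
    return int(match.group()) if match else None


# Hand-written sample shaped like the documented output (not real parser output)
sample_refs = [
    {"pmid_cited": "100", "year": "2007a"},
    {"pmid_cited": "", "year": "2005"},
    {"pmid_cited": "100", "year": "2007"},
    {"pmid_cited": "200", "year": ""},
]
print(cited_pmids(sample_refs))
print([reference_year(ref) for ref in sample_refs])
```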

Parse PubMed OA images and captions

The function parse_pubmed_caption parses image captions from a given path to an XML file. It returns reference indices that you can use to refer back to the actual images. The function returns a list of dictionaries, each with the following keys:

pmid : PubMed ID
pmc : PubMed Central ID
fig_caption : string of caption
fig_id : reference id for figure (use to refer in XML article)
fig_label : label of the figure
graphic_ref : reference to image file name provided from Pubmed OA

python dicts_out = pp.parse_pubmed_caption(path) # return list of dictionary

Parse PubMed OA Paragraph

If you are interested in parsing the text surrounding a citation, the library also provides that functionality. You can use parse_pubmed_paragraph to parse text and reference PMIDs. This function returns a list of dictionaries, where each entry has the following keys:

pmid : PubMed ID
pmc : PubMed Central ID
text : full text of the paragraph
reference_ids : list of reference codes within that paragraph. These IDs can be merged with the output from parse_pubmed_references.
section : section of the paragraph (e.g. Background, Discussion, Appendix, etc.)

python dicts_out = pp.parse_pubmed_paragraph('data/6605965a.nxml', all_paragraph=False)
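The merge between paragraph `reference_ids` and the reference output could look like the sketch below. It assumes that each reference dictionary carries a `ref_id` key holding the same reference code; that key is not listed in the reference output above, so treat it as an assumption and adjust `ref_key` to whatever your parsed output contains. The helper name and sample data are illustrative.

```python
def resolve_paragraph_citations(paragraphs, references, ref_key="ref_id"):
    """Attach the PMIDs cited in each paragraph, using reference codes as the join key."""
    # ASSUMPTION: every reference dict exposes its reference code under `ref_key`
    lookup = {ref[ref_key]: ref for ref in references if ref.get(ref_key)}
    resolved = []
    for para in paragraphs:
        pmids = [
            lookup[code]["pmid_cited"]
            for code in para["reference_ids"]
            if code in lookup and lookup[code].get("pmid_cited")
        ]
        resolved.append(
            {"section": para["section"], "text": para["text"], "pmids_cited": pmids}
        )
    return resolved


# Hand-written samples shaped like the documented output (not real parser output)
sample_paragraphs = [
    {"text": "Prior work [1, 2].", "reference_ids": ["B1", "B2"], "section": "Background"},
]
sample_references = [
    {"ref_id": "B1", "pmid_cited": "100"},
    {"ref_id": "B2", "pmid_cited": ""},  # reference without a PMID is skipped
]
print(resolve_paragraph_citations(sample_paragraphs, sample_references))
```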

Parse PubMed OA Table [WIP]

You can use parse_pubmed_table to parse tables from an XML file. This function returns a list of dictionaries, each with the following keys:

pmid : PubMed ID
pmc : PubMed Central ID
caption : caption of the table
label : label of the table
table_columns : list of column names
table_values : list of values inside the table
table_xml : raw XML text of the table (returned if return_xml=True)

python dicts_out = pp.parse_pubmed_table('data/6605965a.nxml', return_xml=False)
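To work with a parsed table row by row, the column names can be zipped with the values. This is only a sketch: it assumes `table_values` is a list of rows, each row a list aligned with `table_columns`, which the description above does not state explicitly. The helper name and the sample table are made up for illustration.

```python
def table_to_records(table):
    """Turn one parsed table into a list of {column: value} dictionaries."""
    columns = table["table_columns"]
    # ASSUMPTION: table_values is a list of rows aligned with table_columns
    return [dict(zip(columns, row)) for row in table["table_values"]]


# Hand-written sample shaped like the documented output (not real parser output)
sample_table = {
    "label": "Table 1",
    "caption": "Example counts",
    "table_columns": ["Gene", "Count"],
    "table_values": [["A", "3"], ["B", "5"]],
}
print(table_to_records(sample_table))
```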

Parse MEDLINE XML

MEDLINE XML has a different format than PubMed Open Access. The structure of the XML files is described in the MEDLINE/PubMed DTD. You can use the function parse_medline_xml to parse that format. This function returns a list of dictionaries, where each element contains:

pmid : PubMed ID
pmc : PubMed Central ID
doi : DOI
other_id : Other IDs found, each separated by ;
title : title of the article
abstract : abstract of the article
authors : authors, each separated by ;
mesh_terms : list of MeSH terms with corresponding MeSH IDs, each separated by ; e.g. 'D000161:Acoustic Stimulation; D000328:Adult; ...'
publication_types : list of publication types, each separated by ; e.g. 'D016428:Journal Article'
keywords : list of keywords, each separated by ;
chemical_list : list of chemical terms, each separated by ;
pubdate : Publication date. Defaults to year information only.
journal : journal of the given paper
medline_ta : abbreviation of the journal name
nlm_unique_id : NLM unique identification
issn_linking : ISSN linkage, typically used to link with the Web of Science dataset
country : country extracted from the journal information field
reference : string of PMIDs, each separated by ; or list of references made to the article
delete : boolean; if False, the paper got updated, so you might have two XMLs for the same paper. You can delete the record of a deleted paper because it got updated.
languages : list of languages, separated by ;
vernacular_title : vernacular title. Defaults to an empty string when not available.

python dicts_out = pp.parse_medline_xml('data/medline16n0902.xml.gz', year_info_only=False, nlm_category=False, author_list=False, reference_list=False) # return list of dictionary

To extract month and day information from PubDate, set year_info_only=False. We also allow parsing structured abstracts, and you can control the display of each section or label by changing the nlm_category argument.
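Several MEDLINE fields above are returned as single strings with items separated by `;`, and MeSH terms additionally pair an ID with a name using `:`. A minimal sketch for splitting them is shown below; the helper names are illustrative and not part of the library.

```python
def split_field(value):
    """Split a ';'-separated field into a list of trimmed, non-empty items."""
    return [item.strip() for item in value.split(";") if item.strip()]


def parse_mesh_terms(mesh_terms):
    """Split 'D000161:Acoustic Stimulation; D000328:Adult' into (id, term) pairs."""
    pairs = []
    for item in split_field(mesh_terms):
        # partition splits on the first ':' only, so terms may contain colons
        mesh_id, _, term = item.partition(":")
        pairs.append((mesh_id.strip(), term.strip()))
    return pairs


print(parse_mesh_terms("D000161:Acoustic Stimulation; D000328:Adult"))
print(split_field("Smith J; Doe A"))
```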

Parse MEDLINE Grant ID

Use parse_grant_id to parse MEDLINE grant IDs from an XML file. This returns a list of dictionaries, each containing:

pmid : PubMed ID
grant_id : Grant ID
grant_acronym : Acronym of the grant
country : Country where the grant funding comes from
agency : Grant agency

If no Grant ID is found, it returns None.
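Because the result can be None rather than an empty list, downstream code needs to guard against that case. The sketch below groups grant IDs by agency and treats None as "no grants". The helper name and sample records are assumptions for illustration; only the keys `grant_id` and `agency` come from the list above.

```python
from collections import defaultdict


def grants_by_agency(grants):
    """Group grant IDs by agency; a None input is treated as no grants."""
    grouped = defaultdict(list)
    for grant in grants or []:  # `or []` covers the documented None return
        grouped[grant["agency"]].append(grant["grant_id"])
    return dict(grouped)


# Hand-written sample shaped like the documented output (not real parser output)
sample_grants = [
    {"pmid": "1", "grant_id": "G-001", "agency": "Agency A"},
    {"pmid": "1", "grant_id": "G-002", "agency": "Agency B"},
    {"pmid": "2", "grant_id": "G-003", "agency": "Agency A"},
]
print(grants_by_agency(sample_grants))
print(grants_by_agency(None))
```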

Parse MEDLINE XML from eutils website

You can use Pubmed Parser to parse an XML file from E-Utilities using parse_xml_web. For this function, you provide a single pmid as input and get a dictionary with the following keys:

title : title
abstract : abstract
journal : journal
affiliation : affiliation of first author
authors : string of authors, separated by ;
year : Publication year
keywords : keywords or MeSH terms of the article

python dict_out = pp.parse_xml_web(pmid, save_xml=False)
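Since website parsing should pause between requests, a small wrapper can enforce that when looking up several PMIDs. The sketch below is not part of the library: `fetch_with_pause`, its `pause_seconds` default, and the injectable `sleep` argument are illustrative choices, and the one-second default is an arbitrary placeholder rather than a documented limit.

```python
import time


def fetch_with_pause(pmids, fetch, pause_seconds=1.0, sleep=time.sleep):
    """Call `fetch` for each PMID, sleeping between consecutive requests."""
    results = {}
    for index, pmid in enumerate(pmids):
        if index > 0:
            sleep(pause_seconds)  # pause before every request except the first
        results[pmid] = fetch(pmid)
    return results


# Demonstration with a stand-in fetch function so nothing is downloaded here
def fake_fetch(pmid):
    return {"title": f"title for {pmid}"}


print(fetch_with_pause(["1", "2"], fake_fetch, pause_seconds=0))
```

With the library installed, the stand-in could be replaced by the real function, for example `fetch_with_pause(['21810267'], pp.parse_xml_web)`.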

Parse MEDLINE XML citations from website

The function parse_citation_web allows you to get the citations to a given PubMed ID or PubMed Central ID. This will return a dictionary which contains the following keys

pmc : PubMed Central ID
pmid : PubMed ID
doi : DOI of the article
n_citations : number of citations for given articles
pmc_cited : list of PMCs that cite the given PMC

python dict_out = pp.parse_citation_web(doc_id, id_type='PMC')

Parse Outgoing XML citations from website

The function parse_outgoing_citation_web allows you to get the articles a given article cites, given a PubMed ID or PubMed Central ID. This will return a dictionary which contains the following keys

n_citations : number of cited articles
doc_id : the document identifier given
id_type : the type of identifier given. Either 'PMID' or 'PMC'
pmid_cited : list of PMIDs cited by the article

python dict_out = pp.parse_outgoing_citation_web(doc_id, id_type='PMID')

Identifiers should be passed as strings. PubMed Central IDs are the default and should be passed as strings without the 'PMC' prefix. If no citations are found, or if no article matching doc_id is found in the indicated database, it returns None.
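Identifiers often arrive as integers or with a 'PMC' prefix, so a small normalization step before calling the web functions can help. The helper below is a sketch under that assumption; `normalize_doc_id` is a hypothetical name, not a library function.

```python
def normalize_doc_id(doc_id, id_type="PMC"):
    """Return the identifier as a string, without a leading 'PMC' for PMC IDs."""
    text = str(doc_id).strip()
    if id_type == "PMC" and text.upper().startswith("PMC"):
        text = text[3:]  # drop the 'PMC' prefix, keep the numeric part
    return text


print(normalize_doc_id("PMC3166277"))
print(normalize_doc_id(21810267, id_type="PMID"))
```

The normalized value could then be passed on, for example as `pp.parse_citation_web(normalize_doc_id(doc_id), id_type='PMC')`, remembering that the result may be None.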

Installation

You can install the most up-to-date version of the package directly from the repository

bash pip install git+https://github.com/titipata/pubmed_parser.git

or install a recent release from PyPI using

bash pip install pubmed-parser

or clone the repository and install using pip

bash
git clone https://github.com/titipata/pubmed_parser
pip install ./pubmed_parser

You can test your installation by running pytest --cov=pubmed_parser tests/ --verbose in the root of the repository.

Example snippet to parse PubMed OA dataset

An example usage is shown as follows

```python
import pubmed_parser as pp

path_xml = pp.list_xml_path('data')  # list all XML paths under the directory
pubmed_dict = pp.parse_pubmed_xml(path_xml[0])  # dictionary output
print(pubmed_dict)

# Example output:
# {'abstract': u"Background Despite identical genotypes and ...",
#  'affiliation_list': [['I1', 'Department of Biological Sciences, ...'],
#                       ['I2', 'Biology Department, Queens College, and the Graduate Center ...']],
#  'author_list': [['Dennehy', 'John J', 'I1'],
#                  ['Dennehy', 'John J', 'I2'],
#                  ['Wang', 'Ing-Nang', 'I1']],
#  'full_title': u'Factors influencing lysis time stochasticity in bacteriophage \u03bb',
#  'journal': 'BMC Microbiology',
#  'pmc': '3166277',
#  'pmid': '21810267',
#  'publication_year': '2011',
#  'publisher_id': '1471-2180-11-174',
#  'subjects': 'Research Article'}
```

Example Usage with PySpark

This is a snippet to parse the entire PubMed Open Access subset using PySpark 2.1

```python
import os
import pubmed_parser as pp
from pyspark.sql import Row

# `spark` is assumed to be an existing SparkSession (e.g. from the pyspark shell)
path_all = pp.list_xml_path('/path/to/xml/folder/')
path_rdd = spark.sparkContext.parallelize(path_all, numSlices=10000)
parse_results_rdd = path_rdd.map(
    lambda x: Row(filename=os.path.basename(x), **pp.parse_pubmed_xml(x))
)
pubmed_oa_df = parse_results_rdd.toDF()  # Spark dataframe
pubmed_oa_df_sel = pubmed_oa_df[['full_title', 'abstract', 'doi',
                                 'filename', 'pmc', 'pmid',
                                 'publication_year', 'publisher_id',
                                 'journal', 'subjects']]  # select columns
pubmed_oa_df_sel.write.parquet('pubmed_oa.parquet', mode='overwrite')  # write dataframe
```

See scripts folder for more information.

Core Members

and contributors

Dependencies

Citation

If you use Pubmed Parser, please cite it from JOSS as follows

Achakulvisut et al., (2020). Pubmed Parser: A Python Parser for PubMed Open-Access XML Subset and MEDLINE XML Dataset XML Dataset. Journal of Open Source Software, 5(46), 1979, https://doi.org/10.21105/joss.01979

or using BibTex

@article{Achakulvisut2020, doi = {10.21105/joss.01979}, url = {https://doi.org/10.21105/joss.01979}, year = {2020}, publisher = {The Open Journal}, volume = {5}, number = {46}, pages = {1979}, author = {Titipat Achakulvisut and Daniel Acuna and Konrad Kording}, title = {Pubmed Parser: A Python Parser for PubMed Open-Access XML Subset and MEDLINE XML Dataset XML Dataset}, journal = {Journal of Open Source Software} }

Contributions

We welcome contributions from anyone who would like to improve Pubmed Parser. You can create GitHub issues to discuss questions or issues relating to the repository. We suggest reading our Contributing Guidelines before creating issues, reporting bugs, or making a contribution to the repository.

Acknowledgement

This package is developed in Konrad Kording's Lab at the University of Pennsylvania. We would like to thank reviewers and the editor from JOSS including tleonardi, timClicks, and majensen. They made our repository much better!

License

Owner

Name: Titipat Achakulvisut
Login: titipata
Kind: user
Location: Bangkok, Thailand
Company: Mahidol University

Website: titipata.github.io
Twitter: titipat_a
Repositories: 47
Profile: https://github.com/titipata

Applied ML & Science of Science @biodatlab Mahidol University | Former @KordingLab UPenn, intern @allenai, organizer/co-founder @neuromatch

JOSS Publication

Pubmed Parser: A Python Parser for PubMed Open-Access XML Subset and MEDLINE XML Dataset XML Dataset

Published

February 08, 2020

DOI

10.21105/joss.01979

Volume 5, Issue 46, Page 1979

Authors

Titipat Achakulvisut
University of Pennsylvania

Daniel E. Acuna
Syracuse University

Konrad Kording
University of Pennsylvania

Editor

Mark A. Jensen

Citation (CITATION.cff)

cff-version: 1.2.0
message: "Citation for Pubmed Parser library"
authors:
  - family-names: Achakulvisut
    given-names: Titipat
  - family-names: Acuna
    given-names: Daniel
  - family-names: Kording
    given-names: Konrad
title: "Pubmed Parser: A Python Parser for PubMed Open-Access XML Subset and MEDLINE XML Dataset XML Dataset"
date-released: 2019-12-15
doi: 10.21105/joss.01979
url: https://github.com/titipata/pubmed_parser
preferred-citation:
  type: article
  authors:
    - family-names: Achakulvisut
      given-names: Titipat
    - family-names: Acuna
      given-names: Daniel
    - family-names: Kording
      given-names: Konrad
  doi: 10.21105/joss.01979
  journal: "Journal of Open Source Software"
  publisher: The Open Journal
  month: 9
  year: 2020
  number: 46
  volume: 5
  start: 1979
  title: "Pubmed Parser: A Python Parser for PubMed Open-Access XML Subset and MEDLINE XML Dataset XML Dataset"
  url: https://doi.org/10.21105/joss.01979

GitHub Events

Total

Issues event: 8
Watch event: 99
Issue comment event: 22
Push event: 19
Pull request review event: 5
Pull request review comment event: 1
Pull request event: 10
Fork event: 9

Last Year

Issues event: 8
Watch event: 99
Issue comment event: 22
Push event: 19
Pull request review event: 5
Pull request review comment event: 1
Pull request event: 10
Fork event: 9

Committers

Last synced: 7 months ago

All Time

Total Commits: 322
Total Committers: 39
Avg Commits per committer: 8.256
Development Distribution Score (DDS): 0.689

Past Year

Commits: 17
Committers: 7
Avg Commits per committer: 2.429
Development Distribution Score (DDS): 0.471

Top Committers

Name	Email	Commits
titipata	t**a@u**u	100
titipata	m**t@g**m	94
Nils Herrmann	n**8@l**x	24
Ray Pereda	r**a@g**m	11
Daniel Acuna	d**a@s**u	10
Michael E. Rose	M**e@g**m	9
tulakann	t**r@g**m	8
Daniel Acuna	d**a@n**u	7
titipata	t**a@a**g	7
Ted Cybulski	t**i@g**m	6
Simon Wörpel	s**l@m**e	6
Tommaso Leonardi	t**m@i**z	4
Kevin Henner	k****r	3
Daniel Mietchen	d**n@g**m	2
Julien Tourille	j**e@g**m	2
Mark A. Jensen	m**t@f**s	2
TariqAHassan	t**n@g**m	2
Tiansu	t**0@i**m	2
jim	z**3@s**u	2
patrusso2	p**2@g**m	2
tanganyao	t**9@1**m	1
iacopo	i****y	1
ZhangWoW123	1****3	1
Ray Pereda	r**a@G**e	1
Ubuntu	u**u@i**l	1
titipata	t**a@u**u	1
Vincent Batts	v**s@h**m	1
Thomas Pan	t**n@g**m	1
The Gitter Badger	b**r@g**m	1
Sean Davis	s**i@g**m	1
and 9 more...

Committer Domains (Top 20 + Academic)

syr.edu: 2 u.northwestern.edu: 2 morgan.harvard.edu: 1 yale.edu: 1 gitter.im: 1 hashbangbash.com: 1 ip-172-31-11-208.us-west-2.compute.internal: 1 genesiss-mbp.home: 1 163.com: 1 fortinbras.us: 1 itm6.xyz: 1 medienrevolte.de: 1 allenai.org: 1 northwestern.edu: 1 genesisrg.com: 1 live.com.mx: 1

Issues and Pull Requests

Last synced: 6 months ago

All Time

Total issues: 87
Total pull requests: 56
Average time to close issues: 8 months
Average time to close pull requests: 13 days
Total issue authors: 52
Total pull request authors: 29
Average comments per issue: 2.99
Average comments per pull request: 1.02
Merged pull requests: 49
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 12
Pull requests: 9
Average time to close issues: about 1 month
Average time to close pull requests: 9 days
Issue authors: 8
Pull request authors: 5
Average comments per issue: 2.67
Average comments per pull request: 0.33
Merged pull requests: 7
Bot issues: 0
Bot pull requests: 0


Top Authors

Issue Authors

titipata (14)
deakkon (6)
nils-herrmann (5)
tleonardi (5)
uludag (2)
shrimonmuke0202 (2)
ZhangWoW123 (2)
Michael-E-Rose (2)
ghost (2)
qm-intel (2)
octotus (2)
callebalik (2)
soupstandstop (1)
schnobi1990 (1)
RengarAndKhz (1)

Pull Request Authors

nils-herrmann (31)
titipata (4)
kjhenner (3)
patrusso2 (2)
ZhangWoW123 (2)
tleonardi (2)
thomascpan (2)
jtourille (2)
enjalot (2)
iacopy (2)
raypereda (2)
Daniel-Mietchen (2)
zyi103 (2)
ethandrower (1)
grivaz (1)

Top Labels

Issue Labels

bug (22) enhancement (11) question (5) help wanted (4) invalid (2) feature request (1) duplicate (1)

Pull Request Labels

Packages

Total packages: 1
Total downloads:
- pypi 3,572 last-month
Total docker downloads: 5,783

Total dependent packages: 1
Total dependent repositories: 4
Total versions: 5
Total maintainers: 1

pypi.org: pubmed-parser

A python parser for Pubmed Open-Access Subset and MEDLINE XML repository

Documentation: https://pubmed-parser.readthedocs.io/
License: MIT (c) 2015 - 2024 Titipat Achakulvisut, Daniel E. Acuna
Latest release: 0.5.1
published over 1 year ago

Versions: 5
Dependent Packages: 1
Dependent Repositories: 4
Downloads: 3,572 Last month
Docker Downloads: 5,783

Rankings

Docker downloads count: 1.6%

Stargazers count: 2.8%

Forks count: 4.0%

Average: 4.3%

Dependent packages count: 4.7%

Downloads: 5.2%

Dependent repos count: 7.5%

Maintainers (1)

titipata

Last synced: 6 months ago

Pubmed Parser

Science Score: 100.0%
