Pubmed Parser
Pubmed Parser: A Python Parser for PubMed Open-Access XML Subset and MEDLINE XML Dataset XML Dataset - Published in JOSS (2020)
Science Score: 100.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: found CITATION.cff file
- ✓ codemeta.json file: found codemeta.json file
- ✓ .zenodo.json file: found .zenodo.json file
- ✓ DOI references: found 13 DOI reference(s) in README and JOSS metadata
- ✓ Academic publication links: links to ncbi.nlm.nih.gov, science.org, joss.theoj.org, zenodo.org
- ✓ Committers with academic emails: 7 of 39 committers (17.9%) from academic institutions
- ○ Institutional organization owner
- ✓ JOSS paper metadata: published in Journal of Open Source Software
Repository
A Python Parser for PubMed Open-Access XML Subset and MEDLINE XML Dataset
Basic Info
- Host: GitHub
- Owner: titipata
- License: mit
- Language: Python
- Default Branch: master
- Homepage: http://titipata.github.io/pubmed_parser/
- Size: 60.4 MB
Statistics
- Stars: 691
- Watchers: 21
- Forks: 177
- Open Issues: 14
- Releases: 8
Metadata Files
README.md
Pubmed Parser: A Python Parser for PubMed Open-Access XML Subset and MEDLINE XML Dataset
Pubmed Parser is a Python library for parsing the PubMed Open-Access (OA) subset, MEDLINE XML repositories, and output from the Entrez Programming Utilities (E-utils). It uses the lxml library to parse this information into a Python dictionary which can be easily used for research, such as in text mining and natural language processing pipelines.
For available APIs and details about the dataset, please see our wiki page or documentation page. Below, we list some of the core functionalities and code examples.
Available Parsers
- The path provided to a function can be the path to a compressed or uncompressed XML file. We provide example files in the data folder.
- For website parsing, pause between requests. Please see the copyright notice, because your IP address can get blocked if you try to download in bulk.
Below, we list available parsers from pubmed_parser.
- Parse PubMed OA XML information
- Parse PubMed OA citation references
- Parse PubMed OA images and captions
- Parse PubMed OA Paragraph
- Parse PubMed OA Table [WIP]
- Parse MEDLINE XML
- Parse MEDLINE Grant ID
- Parse MEDLINE XML from eutils website
- Parse MEDLINE XML citations from website
- Parse Outgoing XML citations from website
Parse PubMed OA XML information
We created a simple parser for the PubMed Open Access Subset: give an XML path or string to the function parse_pubmed_xml, and it returns a dictionary with the following information:
- full_title: article's title
- abstract: abstract
- journal: journal name
- pmid: PubMed ID
- pmc: PubMed Central ID
- doi: DOI of the article
- publisher_id: publisher ID
- author_list: list of authors with affiliation keys in the following format
```python
[['last_name_1', 'first_name_1', 'aff_key_1'],
 ['last_name_1', 'first_name_1', 'aff_key_2'],
 ['last_name_2', 'first_name_2', 'aff_key_1'], ...]
```
- affiliation_list: list of affiliation keys and affiliation strings in the following format
```python
[['aff_key_1', 'affiliation_1'],
 ['aff_key_2', 'affiliation_2'], ...]
```
- publication_year: publication year
- subjects: list of subjects listed in the article, separated by semicolons. Sometimes, it only contains the type of the article, such as a research article, review proceedings, etc.
```python
import pubmed_parser as pp
dict_out = pp.parse_pubmed_xml(path)
```
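Because author_list stores affiliation keys rather than affiliation strings, a common next step is to resolve each key against affiliation_list. The following is a minimal sketch of that join on a hand-written dictionary shaped like the output described above. The helper name group_author_affiliations is illustrative and is not part of pubmed_parser.

```python
from collections import OrderedDict


def group_author_affiliations(author_list, affiliation_list):
    """Map each (last_name, first_name) pair to its affiliation strings."""
    # Build a lookup from affiliation key to affiliation string.
    aff_lookup = {key: text for key, text in affiliation_list}
    grouped = OrderedDict()
    for last_name, first_name, aff_key in author_list:
        # An author appears once per affiliation key, so collect them all.
        affiliations = grouped.setdefault((last_name, first_name), [])
        if aff_key in aff_lookup:
            affiliations.append(aff_lookup[aff_key])
    return grouped


# Hand-written sample shaped like the parse_pubmed_xml output.
dict_out = {
    'author_list': [['Dennehy', 'John J', 'I1'],
                    ['Dennehy', 'John J', 'I2'],
                    ['Wang', 'Ing-Nang', 'I1']],
    'affiliation_list': [['I1', 'Department of Biological Sciences, ...'],
                         ['I2', 'Biology Department, Queens College, ...']],
}

authors = group_author_affiliations(dict_out['author_list'],
                                    dict_out['affiliation_list'])
for (last_name, first_name), affiliations in authors.items():
    print(first_name, last_name, '->', affiliations)
```

Keys that do not appear in affiliation_list are skipped silently in this sketch; whether that is the right behavior depends on how complete the source XML is.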
Parse PubMed OA citation references
The function parse_pubmed_references will process a PubMed Open Access XML file and return a list of dictionaries, one per reference the article cites. Each dictionary has the following keys:
- pmid: PubMed ID of the article
- pmc: PubMed Central ID of the article
- article_title: title of the cited article
- journal: journal name
- journal_type: type of journal
- pmid_cited: PubMed ID of the article that the article cites
- doi_cited: DOI of the article that the article cites
- year: publication year as it appears in the reference (may include a letter suffix, e.g. 2007a)
```python
dicts_out = pp.parse_pubmed_references(path)  # returns a list of dictionaries
```
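Since the year field may carry a letter suffix, it usually needs normalizing before any counting by year. The sketch below works on hand-written records that contain only the keys it uses. It assumes missing values arrive as empty strings, which is an assumption rather than documented behavior, and normalize_year is a hypothetical helper.

```python
import re
from collections import Counter


def normalize_year(year):
    """Return the first four-digit year as an int, or None if absent."""
    match = re.search(r'\d{4}', year or '')
    return int(match.group()) if match else None


# Hand-written records (illustrative values, not real parser output).
dicts_out = [
    {'pmid_cited': '111', 'year': '2007a'},
    {'pmid_cited': '222', 'year': '2007'},
    {'pmid_cited': '', 'year': '1998'},
    {'pmid_cited': '333', 'year': ''},
]

# Count cited references per normalized publication year.
years = [normalize_year(ref['year']) for ref in dicts_out]
citations_per_year = Counter(year for year in years if year is not None)

# Keep only references that resolved to a PubMed ID.
cited_pmids = [ref['pmid_cited'] for ref in dicts_out if ref['pmid_cited']]
print(citations_per_year, cited_pmids)
```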
Parse PubMed OA images and captions
The function parse_pubmed_caption parses image captions from a given path to an XML file. It returns a reference index that you can use to refer back to the actual images. The function returns a list of dictionaries with the following keys:
- pmid: PubMed ID
- pmc: PubMed Central ID
- fig_caption: string of the caption
- fig_id: reference ID for the figure (used to refer to it in the XML article)
- fig_label: label of the figure
- graphic_ref: reference to the image file name provided by PubMed OA
```python
dicts_out = pp.parse_pubmed_caption(path)  # returns a list of dictionaries
```
Parse PubMed OA Paragraph
If you are interested in parsing the text surrounding a citation, the library also provides that functionality. You can use parse_pubmed_paragraph to parse text and reference PMIDs. This function returns a list of dictionaries, where each entry has the following keys:
- pmid: PubMed ID
- pmc: PubMed Central ID
- text: full text of the paragraph
- reference_ids: list of reference codes within that paragraph. These IDs can be merged with the output from parse_pubmed_references.
- section: section of the paragraph (e.g. Background, Discussion, Appendix, etc.)
```python
dicts_out = pp.parse_pubmed_paragraph('data/6605965a.nxml', all_paragraph=False)
```
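The merge between paragraph reference codes and the reference list can be sketched as a dictionary lookup. This example assumes each reference record carries a ref_id key that matches the codes in reference_ids. That key is not in the key list given for parse_pubmed_references in this README, so check the actual output before relying on it. The helper name and sample values are illustrative.

```python
def cited_pmids_per_paragraph(paragraphs, references):
    """Attach the PMIDs cited in each paragraph, resolved through reference IDs."""
    # ASSUMPTION: each reference record has a 'ref_id' key matching the codes
    # found in a paragraph's 'reference_ids' list.
    ref_to_pmid = {ref['ref_id']: ref['pmid_cited']
                   for ref in references if ref.get('pmid_cited')}
    merged = []
    for paragraph in paragraphs:
        pmids = [ref_to_pmid[ref_id] for ref_id in paragraph['reference_ids']
                 if ref_id in ref_to_pmid]
        merged.append({'section': paragraph['section'],
                       'text': paragraph['text'],
                       'pmid_cited': pmids})
    return merged


# Hand-written samples (illustrative values, not real parser output).
paragraphs = [
    {'section': 'Background', 'text': 'Lysis timing varies [1,2].',
     'reference_ids': ['B1', 'B2']},
    {'section': 'Discussion', 'text': 'No citations here.',
     'reference_ids': []},
]
references = [
    {'ref_id': 'B1', 'pmid_cited': '111'},
    {'ref_id': 'B2', 'pmid_cited': ''},  # reference without a PubMed ID
]

merged = cited_pmids_per_paragraph(paragraphs, references)
print(merged)
```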
Parse PubMed OA Table [WIP]
You can use parse_pubmed_table to parse tables from an XML file. This function returns a list of dictionaries, each with the following keys:
- pmid: PubMed ID
- pmc: PubMed Central ID
- caption: caption of the table
- label: label of the table
- table_columns: list of column names
- table_values: list of values inside the table
- table_xml: raw XML text of the table (returned if return_xml=True)
```python
dicts_out = pp.parse_pubmed_table(path, return_xml=False)  # path to a PubMed OA XML file
```
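For downstream analysis it is often convenient to pair each value with its column name. The sketch below assumes table_values is a list of rows, each row listing cell values in the same order as table_columns. The README does not spell out that layout, so treat it as an assumption. The sample table and the helper name table_to_records are made up for illustration.

```python
def table_to_records(table):
    """Turn one parsed table into a list of {column: value} dictionaries."""
    columns = table['table_columns']
    # ASSUMPTION: 'table_values' is a list of rows, each row a list of cell
    # values in the same order as 'table_columns'.
    return [dict(zip(columns, row)) for row in table['table_values']]


# Hand-written sample (illustrative values, not real parser output).
table = {
    'label': 'Table 1',
    'caption': 'Example caption',
    'table_columns': ['Strain', 'Mean lysis time'],
    'table_values': [['A', '45.7'], ['B', '52.3']],
}

records = table_to_records(table)
print(records)
```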
Parse MEDLINE XML
MEDLINE XML has a different XML format than PubMed Open Access. The structure of the XML files is described in the MEDLINE/PubMed DTD. You can use the function parse_medline_xml to parse that format. This function returns a list of dictionaries, where each element contains:
- pmid: PubMed ID
- pmc: PubMed Central ID
- doi: DOI
- other_id: other IDs found, each separated by ;
- title: title of the article
- abstract: abstract of the article
- authors: authors, each separated by ;
- mesh_terms: list of MeSH terms with corresponding MeSH ID, each separated by ; e.g. 'D000161:Acoustic Stimulation; D000328:Adult; ...'
- publication_types: list of publication types, each separated by ; e.g. 'D016428:Journal Article'
- keywords: list of keywords, each separated by ;
- chemical_list: list of chemical terms, each separated by ;
- pubdate: publication date. Defaults to year information only.
- journal: journal of the given paper
- medline_ta: abbreviation of the journal name
- nlm_unique_id: NLM unique identification
- issn_linking: ISSN linkage, typically used to link with the Web of Science dataset
- country: country extracted from the journal information field
- reference: string of PMIDs, each separated by ; or list of references made to the article
- delete: boolean. False means the paper got updated, so you might have two XML records for the same paper. You can delete the record of the deleted paper because it got updated.
- languages: list of languages, separated by ;
- vernacular_title: vernacular title. Defaults to an empty string when not available.
```python
dicts_out = pp.parse_medline_xml('data/medline16n0902.xml.gz',
                                 year_info_only=False,
                                 nlm_category=False,
                                 author_list=False,
                                 reference_list=False)  # returns a list of dictionaries
```
To extract month and day information from PubDate, set year_info_only=False, as in the call above. The parser also handles structured abstracts; the nlm_category argument controls how each section or label is displayed.
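Several MEDLINE fields are returned as semicolon-separated strings, so they usually need splitting before use. As a minimal sketch, the helper below (a hypothetical name, not part of pubmed_parser) splits a mesh_terms string in the 'ID:Term; ID:Term' form shown above into (MeSH ID, term) pairs.

```python
def split_mesh_terms(mesh_terms):
    """Split an 'ID:Term; ID:Term' string into (mesh_id, term) tuples."""
    pairs = []
    for item in mesh_terms.split(';'):
        item = item.strip()
        if not item:
            continue  # skip empty fragments, e.g. from an empty field
        # Split on the first colon only, in case a term itself contains one.
        mesh_id, _, term = item.partition(':')
        pairs.append((mesh_id.strip(), term.strip()))
    return pairs


# Hand-written record containing only the key used here.
record = {'mesh_terms': 'D000161:Acoustic Stimulation; D000328:Adult'}

mesh = split_mesh_terms(record['mesh_terms'])
print(mesh)
```

The same split-and-strip pattern applies to the other ;-separated fields such as keywords or chemical_list, without the colon step.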
Parse MEDLINE Grant ID
Use parse_grant_id to parse MEDLINE grant IDs from an XML file. This returns a list of dictionaries, each containing
- pmid: PubMed ID
- grant_id: grant ID
- grant_acronym: acronym of the grant
- country: country where the grant funding is from
- agency: grant agency

If no grant ID is found, it returns None.
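Because the result can be None instead of an empty list, code that iterates over it should guard against that case. The sketch below groups grant IDs by agency and treats None as "no grants". The records use placeholder values and grants_by_agency is a hypothetical helper.

```python
from collections import defaultdict


def grants_by_agency(grants):
    """Group grant IDs by agency, treating a None result as no grants."""
    grouped = defaultdict(list)
    for grant in grants or []:  # 'or []' covers the None return value
        grouped[grant['agency']].append(grant['grant_id'])
    return dict(grouped)


# Placeholder records shaped like the grant output described above.
grants = [
    {'pmid': '111', 'grant_id': 'GRANT-1', 'grant_acronym': 'AA',
     'country': 'Country A', 'agency': 'Agency A'},
    {'pmid': '111', 'grant_id': 'GRANT-2', 'grant_acronym': 'AA',
     'country': 'Country A', 'agency': 'Agency A'},
    {'pmid': '222', 'grant_id': 'GRANT-3', 'grant_acronym': 'BB',
     'country': 'Country B', 'agency': 'Agency B'},
]

print(grants_by_agency(grants))
print(grants_by_agency(None))
```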
Parse MEDLINE XML from eutils website
You can use Pubmed Parser to parse XML from E-Utilities using parse_xml_web. For this function, you provide a single pmid as input and get a dictionary with the following keys:
- title: title
- abstract: abstract
- journal: journal
- affiliation: affiliation of the first author
- authors: string of authors, separated by ;
- year: publication year
- keywords: keywords or MeSH terms of the article
```python
dict_out = pp.parse_xml_web(pmid, save_xml=False)
```
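When fetching many records from the website, the advice to pause between requests can be applied with a small loop. This is a sketch under stated assumptions: fetch_with_pause is a hypothetical helper, the one-second default is an arbitrary choice rather than a documented limit, and a stand-in fetch function is used so the example runs offline.

```python
import time


def fetch_with_pause(doc_ids, fetch, pause_seconds=1.0):
    """Call fetch(doc_id) for each ID, sleeping between consecutive calls."""
    results = {}
    for index, doc_id in enumerate(doc_ids):
        if index > 0:
            time.sleep(pause_seconds)  # wait before every request but the first
        results[doc_id] = fetch(doc_id)
    return results


def fake_fetch(pmid):
    """Stand-in for a web parser so the sketch runs without network access."""
    return {'title': 'title of ' + pmid}


results = fetch_with_pause(['111', '222'], fake_fetch, pause_seconds=0.01)
print(results)
```

With the library installed, a function such as pp.parse_xml_web could be passed as fetch in place of the stand-in.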
Parse MEDLINE XML citations from website
The function parse_citation_web allows you to get the citations to a given PubMed ID or PubMed Central ID. This returns a dictionary which contains the following keys:
- pmc: PubMed Central ID
- pmid: PubMed ID
- doi: DOI of the article
- n_citations: number of citations for the given article
- pmc_cited: list of PMCs that cite the given PMC
```python
dict_out = pp.parse_citation_web(doc_id, id_type='PMC')
```
Parse Outgoing XML citations from website
The function parse_outgoing_citation_web allows you to get the articles a given article cites, given a PubMed ID or PubMed Central ID. This returns a dictionary which contains the following keys:
- n_citations: number of cited articles
- doc_id: the document identifier given
- id_type: the type of identifier given, either 'PMID' or 'PMC'
- pmid_cited: list of PMIDs cited by the article
```python
dict_out = pp.parse_outgoing_citation_web(doc_id, id_type='PMID')
```
Identifiers should be passed as strings. PubMed Central IDs are the default, and should be passed as strings without the 'PMC' prefix. If no citations are found, or if no article matching doc_id is found in the indicated database, it returns None.
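The identifier rules above (string type, no 'PMC' prefix, possible None result) can be wrapped in two small helpers. Both are hypothetical and not part of pubmed_parser; they only illustrate the conventions stated in this section.

```python
def normalize_pmc_id(doc_id):
    """Return a PubMed Central ID as a string without the 'PMC' prefix."""
    text = str(doc_id).strip()
    if text.upper().startswith('PMC'):
        text = text[3:]
    return text


def citation_count(result):
    """Read n_citations from a result dictionary, treating None as zero."""
    return 0 if result is None else result['n_citations']


print(normalize_pmc_id('PMC3166277'))
print(citation_count(None))
```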
Installation
You can install the most up-to-date version of the package directly from the repository
```bash
pip install git+https://github.com/titipata/pubmed_parser.git
```
or install the most recent release from PyPI using
```bash
pip install pubmed-parser
```
or clone the repository and install using pip
```bash
git clone https://github.com/titipata/pubmed_parser
pip install ./pubmed_parser
```
You can test your installation by running pytest --cov=pubmed_parser tests/ --verbose in the root of the repository.
Example snippet to parse PubMed OA dataset
An example usage is shown as follows
```python
import pubmed_parser as pp

path_xml = pp.list_xml_path('data')  # list all xml paths under directory
pubmed_dict = pp.parse_pubmed_xml(path_xml[0])  # dictionary output
print(pubmed_dict)

{'abstract': u"Background Despite identical genotypes and ...",
 'affiliation_list': [['I1', 'Department of Biological Sciences, ...'],
                      ['I2', 'Biology Department, Queens College, and the Graduate Center ...']],
 'author_list': [['Dennehy', 'John J', 'I1'],
                 ['Dennehy', 'John J', 'I2'],
                 ['Wang', 'Ing-Nang', 'I1']],
 'full_title': u'Factors influencing lysis time stochasticity in bacteriophage \u03bb',
 'journal': 'BMC Microbiology',
 'pmc': '3166277',
 'pmid': '21810267',
 'publication_year': '2011',
 'publisher_id': '1471-2180-11-174',
 'subjects': 'Research Article'}
```
Example Usage with PySpark
This snippet parses the entire PubMed Open Access subset using PySpark 2.1
```python
import os
import pubmed_parser as pp
from pyspark.sql import Row

# Assumes an active SparkSession named `spark`.
path_all = pp.list_xml_path('/path/to/xml/folder/')
path_rdd = spark.sparkContext.parallelize(path_all, numSlices=10000)
parse_results_rdd = path_rdd.map(lambda x: Row(filename=os.path.basename(x),
                                               **pp.parse_pubmed_xml(x)))
pubmed_oa_df = parse_results_rdd.toDF()  # Spark dataframe
pubmed_oa_df_sel = pubmed_oa_df[['full_title', 'abstract', 'doi',
                                 'filename', 'pmc', 'pmid',
                                 'publication_year', 'publisher_id',
                                 'journal', 'subjects']]  # select columns
pubmed_oa_df_sel.write.parquet('pubmedoa.parquet', mode='overwrite')  # write dataframe
```
See scripts folder for more information.
Core Members
and contributors
Dependencies
Citation
If you use Pubmed Parser, please cite it from JOSS as follows
Achakulvisut et al., (2020). Pubmed Parser: A Python Parser for PubMed Open-Access XML Subset and MEDLINE XML Dataset XML Dataset. Journal of Open Source Software, 5(46), 1979, https://doi.org/10.21105/joss.01979
or using BibTeX
@article{Achakulvisut2020,
doi = {10.21105/joss.01979},
url = {https://doi.org/10.21105/joss.01979},
year = {2020},
publisher = {The Open Journal},
volume = {5},
number = {46},
pages = {1979},
author = {Titipat Achakulvisut and Daniel Acuna and Konrad Kording},
title = {Pubmed Parser: A Python Parser for PubMed Open-Access XML Subset and MEDLINE XML Dataset XML Dataset},
journal = {Journal of Open Source Software}
}
Contributions
We welcome contributions from anyone who would like to improve Pubmed Parser. You can create GitHub issues to discuss questions or issues relating to the repository. We suggest you read our Contributing Guidelines before creating issues, reporting bugs, or making a contribution to the repository.
Acknowledgement
This package is developed in Konrad Kording's Lab at the University of Pennsylvania. We would like to thank reviewers and the editor from JOSS including tleonardi, timClicks, and majensen. They made our repository much better!
License
MIT License Copyright (c) 2015-2020 Titipat Achakulvisut, Daniel E. Acuna
Owner
- Name: Titipat Achakulvisut
- Login: titipata
- Kind: user
- Location: Bangkok, Thailand
- Company: Mahidol University
- Website: titipata.github.io
- Twitter: titipat_a
- Repositories: 47
- Profile: https://github.com/titipata
Applied ML & Science of Science @biodatlab Mahidol University | Former @KordingLab UPenn, intern @allenai, organizer/co-founder @neuromatch
JOSS Publication
Pubmed Parser: A Python Parser for PubMed Open-Access XML Subset and MEDLINE XML Dataset XML Dataset
Authors
University of Pennsylvania
Syracuse University
University of Pennsylvania
Tags
MEDLINE, PubMed, Biomedical corpus, Natural Language Processing
Citation (CITATION.cff)
cff-version: 1.2.0
message: "Citation for Pubmed Parser library"
authors:
  - family-names: Achakulvisut
    given-names: Titipat
  - family-names: Acuna
    given-names: Daniel
  - family-names: Kording
    given-names: Konrad
title: "Pubmed Parser: A Python Parser for PubMed Open-Access XML Subset and MEDLINE XML Dataset XML Dataset"
date-released: 2019-12-15
doi: 10.21105/joss.01979
url: https://github.com/titipata/pubmed_parser
preferred-citation:
  type: article
  authors:
    - family-names: Achakulvisut
      given-names: Titipat
    - family-names: Acuna
      given-names: Daniel
    - family-names: Kording
      given-names: Konrad
  doi: 10.21105/joss.01979
  journal: "Journal of Open Source Software"
  publisher: The Open Journal
  month: 9
  year: 2020
  number: 46
  volume: 5
  start: 1979
  title: "Pubmed Parser: A Python Parser for PubMed Open-Access XML Subset and MEDLINE XML Dataset XML Dataset"
  url: https://doi.org/10.21105/joss.01979
GitHub Events
Total
- Issues event: 8
- Watch event: 99
- Issue comment event: 22
- Push event: 19
- Pull request review event: 5
- Pull request review comment event: 1
- Pull request event: 10
- Fork event: 9
Last Year
- Issues event: 8
- Watch event: 99
- Issue comment event: 22
- Push event: 19
- Pull request review event: 5
- Pull request review comment event: 1
- Pull request event: 10
- Fork event: 9
Committers
Last synced: 5 months ago
Top Committers
| Name | Email | Commits |
|---|---|---|
| titipata | t****a@u****u | 100 |
| titipata | m****t@g****m | 94 |
| Nils Herrmann | n****8@l****x | 24 |
| Ray Pereda | r****a@g****m | 11 |
| Daniel Acuna | d****a@s****u | 10 |
| Michael E. Rose | M****e@g****m | 9 |
| tulakann | t****r@g****m | 8 |
| Daniel Acuna | d****a@n****u | 7 |
| titipata | t****a@a****g | 7 |
| Ted Cybulski | t****i@g****m | 6 |
| Simon Wörpel | s****l@m****e | 6 |
| Tommaso Leonardi | t****m@i****z | 4 |
| Kevin Henner | k****r | 3 |
| Daniel Mietchen | d****n@g****m | 2 |
| Julien Tourille | j****e@g****m | 2 |
| Mark A. Jensen | m****t@f****s | 2 |
| TariqAHassan | t****n@g****m | 2 |
| Tiansu | t****0@i****m | 2 |
| jim | z****3@s****u | 2 |
| patrusso2 | p****2@g****m | 2 |
| tanganyao | t****9@1****m | 1 |
| iacopo | i****y | 1 |
| ZhangWoW123 | 1****3 | 1 |
| Ray Pereda | r****a@G****e | 1 |
| Ubuntu | u****u@i****l | 1 |
| titipata | t****a@u****u | 1 |
| Vincent Batts | v****s@h****m | 1 |
| Thomas Pan | t****n@g****m | 1 |
| The Gitter Badger | b****r@g****m | 1 |
| Sean Davis | s****i@g****m | 1 |
| and 9 more... | ||
Committer Domains (Top 20 + Academic)
Issues and Pull Requests
Last synced: 4 months ago
All Time
- Total issues: 87
- Total pull requests: 56
- Average time to close issues: 8 months
- Average time to close pull requests: 13 days
- Total issue authors: 52
- Total pull request authors: 29
- Average comments per issue: 2.99
- Average comments per pull request: 1.02
- Merged pull requests: 49
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 12
- Pull requests: 9
- Average time to close issues: about 1 month
- Average time to close pull requests: 9 days
- Issue authors: 8
- Pull request authors: 5
- Average comments per issue: 2.67
- Average comments per pull request: 0.33
- Merged pull requests: 7
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- titipata (14)
- deakkon (6)
- nils-herrmann (5)
- tleonardi (5)
- uludag (2)
- shrimonmuke0202 (2)
- ZhangWoW123 (2)
- Michael-E-Rose (2)
- ghost (2)
- qm-intel (2)
- octotus (2)
- callebalik (2)
- soupstandstop (1)
- schnobi1990 (1)
- RengarAndKhz (1)
Pull Request Authors
- nils-herrmann (31)
- titipata (4)
- kjhenner (3)
- patrusso2 (2)
- ZhangWoW123 (2)
- tleonardi (2)
- thomascpan (2)
- jtourille (2)
- enjalot (2)
- iacopy (2)
- raypereda (2)
- Daniel-Mietchen (2)
- zyi103 (2)
- ethandrower (1)
- grivaz (1)
Top Labels
Issue Labels
Pull Request Labels
Packages
- Total packages: 1
- Total downloads: 3,572 last month (pypi)
- Total docker downloads: 5,783
- Total dependent packages: 1
- Total dependent repositories: 4
- Total versions: 5
- Total maintainers: 1
pypi.org: pubmed-parser
A python parser for Pubmed Open-Access Subset and MEDLINE XML repository
- Documentation: https://pubmed-parser.readthedocs.io/
- License: MIT (c) 2015 - 2024 Titipat Achakulvisut, Daniel E. Acuna
- Latest release: 0.5.1 (published over 1 year ago)
Rankings
Maintainers (1)
Dependencies
- sphinx *
- sphinx-gallery *
- sphinx_rtd_theme *
- lxml *
- numpy *
- pytest *
- pytest-cov *
- requests *
- six *
- unidecode *
