citation_normalisation
The goal of this repository is to build a tool that produces a normalised citation from any identifier that points to a publication. The idea is to gather the required information from different freely available APIs.
Science Score: 44.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: Found CITATION.cff file
- ✓ codemeta.json file: Found codemeta.json file
- ○ .zenodo.json file
- ✓ DOI references: Found 5 DOI reference(s) in README
- ○ Academic publication links
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: Low similarity (6.0%) to scientific vocabulary
Repository
Basic Info
- Host: GitHub
- Owner: OBrink
- License: MIT
- Language: Jupyter Notebook
- Default Branch: main
- Size: 1.76 MB
Statistics
- Stars: 2
- Watchers: 3
- Forks: 0
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
citation_normalisation
Note: This repository is a construction site! In general, things should work and are tested, but at this stage I cannot guarantee anything.
The goal of this repository is to build a tool that produces a normalised citation from any identifier that refers to a publication. The idea is to gather the required information from different freely available APIs.
Usage:
```
import citation_normalisation as cn

# COCONUT Examples
test_list = ['Morvan-Bertrand,Physiol Plant,111,(2001),225', '20512739', '10.1021/ol502216j']

references = []
for testID in test_list:
    print(cn.getfinaldictfromref_str(testID))

{'Morvan-Bertrand,Physiol Plant,111,(2001),225': {'reference': 'Morvan-Bertrand et al., Physiologia Plantarum, 2001, 111 (2), 225', 'DOI': '10.1034/j.1399-3054.2001.1110214.x', 'PMID': None}}
{'20512739': {'reference': 'Sheu et al., J Environ Sci Health B, 2010, 45 (5), 478', 'DOI': '10.1080/03601231003800347', 'PMID': '20512739'}}
{'10.1021/ol502216j': {'reference': 'Grudniewska et al., Organic Letters, 2014, 16 (18), 4695', 'DOI': '10.1021/ol502216j', 'PMID': None}}
```
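The normalised `reference` strings in the example output appear to follow the shape `Surname et al., Journal, Year, Volume (Issue), First page`. As a minimal sketch of how such a string could be assembled from metadata that has already been retrieved (the function name `format_reference` and its parameters are hypothetical and not taken from this repository):

```python
from typing import Optional


def format_reference(
    first_author_surname: str,
    journal: str,
    year: str,
    volume: Optional[str] = None,
    issue: Optional[str] = None,
    first_page: Optional[str] = None,
) -> str:
    """Assemble a citation string shaped like the example output.

    Hypothetical helper: optional fields that are missing are left out.
    """
    parts = [f"{first_author_surname} et al.", journal, year]
    if volume and issue:
        parts.append(f"{volume} ({issue})")
    elif volume:
        parts.append(volume)
    if first_page:
        parts.append(first_page)
    return ", ".join(parts)


print(format_reference("Grudniewska", "Organic Letters", "2014", "16", "18", "4695"))
# prints: Grudniewska et al., Organic Letters, 2014, 16 (18), 4695
```

Leaving out missing fields instead of printing placeholders keeps the string readable for references without an issue number, which is an assumption about the desired behaviour rather than something the README states.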
What works:
Workflow:
- Detect DOI:
  - Retrieval of information via Metapub or Crossref with the DOI
- Detect PMID (only works if the input consists of nothing but the PMID; a RegEx that matches any number would be too unspecific and dangerous):
  - Retrieval of information via Metapub with the PMID
- No success with DOI/PMID:
  - Retrieval of information via Crossref with the given keywords
  - A result of the string-query-based retrieval is only accepted if all of its information is also found in the parsed input string
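The workflow above can be sketched as a small dispatcher. This is only an illustration under stated assumptions, not the repository's implementation: the names `classify_identifier` and `accept_string_query_hit` are hypothetical, the DOI regular expression is a common approximation rather than the pattern used here, and no API is actually queried.

```python
import re
from typing import Dict, Optional, Tuple

# Assumption: a simple approximation of the DOI syntax, not the repository's regex.
DOI_PATTERN = re.compile(r"10\.\d{4,9}/\S+")


def classify_identifier(ref_str: str) -> Tuple[str, str]:
    """Return (kind, value), where kind is 'DOI', 'PMID' or 'string_query'."""
    stripped = ref_str.strip()
    doi_match = DOI_PATTERN.search(stripped)
    if doi_match:
        return "DOI", doi_match.group()
    # A PMID is only accepted if the whole input is a number; matching
    # any number inside a longer string would be too unspecific.
    if stripped.isdigit():
        return "PMID", stripped
    return "string_query", stripped


def accept_string_query_hit(
    parsed: Dict[str, Optional[str]], hit: Dict[str, str]
) -> bool:
    """Accept a keyword-search hit only if every parsed field agrees with it."""
    for key, value in parsed.items():
        if value is None:
            continue  # this field was not present in the input string
        if key not in hit or value.lower() not in hit[key].lower():
            return False
    return True


print(classify_identifier("10.1021/ol502216j"))
print(classify_identifier("20512739"))
print(classify_identifier("Morvan-Bertrand,Physiol Plant,111,(2001),225")[0])
```

Checking the DOI first and treating only purely numeric input as a PMID mirrors the order of the list; the substring comparison in `accept_string_query_hit` is a deliberately strict stand-in for whatever matching the tool really performs.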
Owner
- Name: Otto Brinkhaus
- Login: OBrink
- Kind: user
- Repositories: 5
- Profile: https://github.com/OBrink
Citation (citation_normalisation.ipynb)
{
"cells": [
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"import time\n",
"import pandas as pd\n",
"import re\n",
"from typing import Dict\n",
"import citation_normalisation as cn\n",
"import reference_parser as rp\n",
"import retrieve_COCONUT_references as rCr\n",
"import importlib\n",
"import datetime"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Read COCONUT references from file"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"import ast\n",
"\n",
"# Read data from file\n",
"coconut_references = pd.read_csv('./coconut_references.csv')\n",
"unstructured_references = coconut_references['citationDOI']\n",
"COCONUT_IDs = coconut_references['coconut_id']\n",
"\n",
"# The references are read as str and need to be converted to lists;\n",
"# ast.literal_eval does this without executing arbitrary code.\n",
"ID_ref_tuples = [(coconut_id, ast.literal_eval(ref_str))\n",
"                 for coconut_id, ref_str in zip(COCONUT_IDs, unstructured_references)]\n",
"\n",
"# Filter empty reference lists\n",
"ID_ref_tuples = [(coconut_id, refs)\n",
"                 for coconut_id, refs in ID_ref_tuples\n",
"                 if len(refs) != 0 # Don't include empty reference lists\n",
"                 if refs != ['NA']] # Don't include reference list with 'NA' as only element\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Test parsers for different reference notations\n",
"\n",
"The COCONUT references use a variety of notation styles. Most of these styles can be described by specific regular expressions, which makes it possible to identify their sub-units (authors, journal, volume, year, pages). The parser functions and the regular expressions they use can be found in reference_parser.py.\n",
"\n",
"The individual parsing functions can be used separately after creating an instance of reference_parser. \n",
"\n",
"Example: \n",
"> parser = rp.reference_parser()\n",
"\n",
"> parser.parse_general_pattern('Haba,Phytochem.,68,(2007),1255')\n",
"\n",
"If the pattern is unknown, the instance of reference_parser can simply be called as a function to try all available parsing functions.\n",
"\n",
"Example:\n",
"> parser = rp.reference_parser()\n",
"\n",
"> parser('Haba,Phytochem.,68,(2007),1255')\n",
"\n",
"All parsing functions return a dictionary which contains the parsed information or None if the predefined pattern could not be matched."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"importlib.reload(rp)\n",
"parser = rp.reference_parser()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Pattern N°1"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" Original reference: Haba,Phytochem.,68,(2007),1255 \n",
" Resulting dict:\n",
" {'authors': 'Haba', 'first_author_surname': 'Haba', 'journal': 'Phytochem.', 'volume': '68', 'issue': None, 'year': '2007', 'pages': '1255', 'first_page': '1255'}\n",
"\n",
" Original reference: El-Sayed,Phytochem.,30,(1991),2442 \n",
" Resulting dict:\n",
" {'authors': 'El-Sayed', 'first_author_surname': 'El-Sayed', 'journal': 'Phytochem.', 'volume': '30', 'issue': None, 'year': '1991', 'pages': '2442', 'first_page': '2442'}\n",
"\n",
" Original reference: Fujita,J.Nat.Prod.,49,(1986),1122-1125 \n",
" Resulting dict:\n",
" {'authors': 'Fujita', 'first_author_surname': 'Fujita', 'journal': 'J.Nat.Prod.', 'volume': '49', 'issue': None, 'year': '1986', 'pages': '1122-1125', 'first_page': '1122'}\n",
"\n",
" Original reference: Kim, et al., Chem Pharm Bull, 52, (2004), 1466 \n",
" Resulting dict:\n",
" {'authors': 'Kim, ', 'first_author_surname': 'Kim', 'journal': 'Chem Pharm Bull', 'volume': '52', 'issue': None, 'year': '2004', 'pages': '1466', 'first_page': '1466'}\n",
"\n",
" Original reference: Lansky et al.,J.Ethnopharmacol.,19,(2007),177-206 \n",
" Resulting dict:\n",
" {'authors': 'Lansky et al', 'first_author_surname': 'Lansky', 'journal': 'J.Ethnopharmacol.', 'volume': '19', 'issue': None, 'year': '2007', 'pages': '177-206', 'first_page': '177'}\n",
"\n",
" Original reference: Imperato,Chim.Ind.(Milan),71,(1989),86 \n",
" Resulting dict:\n",
" {'authors': 'Imperato', 'first_author_surname': 'Imperato', 'journal': 'Chim.Ind.', 'volume': '71', 'issue': None, 'year': '1989', 'pages': '86', 'first_page': '86'}\n",
"\n",
" Original reference: Cole,R.J.et al.,Can.J.Microbiol.,20(1974),1159 \n",
" Resulting dict:\n",
" {'authors': 'Cole,R.J.', 'first_author_surname': 'Cole', 'journal': 'Can.J.Microbiol.', 'volume': '20', 'issue': None, 'year': '1974', 'pages': '1159', 'first_page': '1159'}\n",
"\n",
" Original reference: Mathews.,J. Biol. Chem.,241(21),(1966),5008 \n",
" Resulting dict:\n",
" {'authors': 'Mathews', 'first_author_surname': 'Mathews', 'journal': 'J. Biol. Chem.', 'volume': '241', 'issue': '21', 'year': '1966', 'pages': '5008', 'first_page': '5008'}\n",
"\n",
" Original reference: Fang,Chung Ts'ao Yao,12,(1981),1 \n",
" Resulting dict:\n",
" {'authors': 'Fang', 'first_author_surname': 'Fang', 'journal': \"Chung Ts'ao Yao\", 'volume': '12', 'issue': None, 'year': '1981', 'pages': '1', 'first_page': '1'}\n",
"\n",
" Original reference: N.V.Thu,Pharmazie,26,(1971),504 \n",
" Resulting dict:\n",
" {'authors': 'N.V.Thu', 'first_author_surname': 'Thu', 'journal': 'Pharmazie', 'volume': '26', 'issue': None, 'year': '1971', 'pages': '504', 'first_page': '504'}\n",
"\n",
" Original reference: Ruan,Yun-Nan Chih Wu Yen Chiu,13,(1991),225 \n",
" Resulting dict:\n",
" {'authors': 'Ruan', 'first_author_surname': 'Ruan', 'journal': 'Yun-Nan Chih Wu Yen Chiu', 'volume': '13', 'issue': None, 'year': '1991', 'pages': '225', 'first_page': '225'}\n",
"\n",
" Original reference: Peng J.-P.,Phytochem.,41,(1996),283-285 \n",
" Resulting dict:\n",
" {'authors': 'Peng J.-P.', 'first_author_surname': 'Peng ', 'journal': 'Phytochem.', 'volume': '41', 'issue': None, 'year': '1996', 'pages': '283-285', 'first_page': '283'}\n",
"\n",
" Original reference: Hussain,J.Nat.Prod.51.,(1988),809 \n",
" Resulting dict:\n",
" {'authors': 'Hussain', 'first_author_surname': 'Hussain', 'journal': 'J.Nat.Prod.', 'volume': '51', 'issue': None, 'year': '1988', 'pages': '809', 'first_page': '809'}\n",
"\n",
" Original reference: Bondarenko,Khim.Prir.Soedin,(1983),243 \n",
" Resulting dict:\n",
" {'authors': 'Bondarenko', 'first_author_surname': 'Bondarenko', 'journal': 'Khim.Prir.Soedin', 'volume': None, 'issue': None, 'year': '1983', 'pages': '243', 'first_page': '243'}\n",
"\n",
" Original reference: Haba,Phytochem.,68,82007),1255 \n",
" Resulting dict:\n",
" {'authors': 'Haba', 'first_author_surname': 'Haba', 'journal': 'Phytochem.', 'volume': '68', 'issue': None, 'pages': '1255', 'first_page': '1255', 'year': '2007'}\n",
"\n",
" Original reference: Ingham,Phytochem.,15,819769,1489 \n",
" Resulting dict:\n",
" {'authors': 'Ingham', 'first_author_surname': 'Ingham', 'journal': 'Phytochem.', 'volume': '15', 'issue': None, 'pages': '1489', 'first_page': '1489', 'year': '1976'}\n",
"\n",
" Original reference: Bondarenko,Khim.Prir.Soedin,(1983),243 \n",
" Resulting dict:\n",
" {'authors': 'Bondarenko', 'first_author_surname': 'Bondarenko', 'journal': 'Khim.Prir.Soedin', 'volume': None, 'issue': None, 'year': '1983', 'pages': '243', 'first_page': '243'}\n",
"\n"
]
}
],
"source": [
"## Check regex for typical pattern: \n",
"# Pattern: Author,? (et al.)?, Journal, volume(issue)?, (year), page(-page)?\n",
"\n",
"# Examples that are supposed to be matched\n",
"general_pattern_references = [\n",
" 'Haba,Phytochem.,68,(2007),1255',\n",
" 'El-Sayed,Phytochem.,30,(1991),2442',\n",
" 'Fujita,J.Nat.Prod.,49,(1986),1122-1125',\n",
" 'Kim, et al., Chem Pharm Bull, 52, (2004), 1466',\n",
" 'Lansky et al.,J.Ethnopharmacol.,19,(2007),177-206',\n",
" 'Imperato,Chim.Ind.(Milan),71,(1989),86',\n",
" 'Cole,R.J.et al.,Can.J.Microbiol.,20(1974),1159',\n",
" 'Mathews.,J. Biol. Chem.,241(21),(1966),5008',\n",
" \" Fang,Chung Ts'ao Yao,12,(1981),1\",\n",
" 'N.V.Thu,Pharmazie,26,(1971),504',\n",
" 'Ruan,Yun-Nan Chih Wu Yen Chiu,13,(1991),225',\n",
" 'Peng J.-P.,Phytochem.,41,(1996),283-285',\n",
" ' Hussain,J.Nat.Prod.51.,(1988),809',\n",
" 'Bondarenko,Khim.Prir.Soedin,(1983),243',\n",
" 'Haba,Phytochem.,68,82007),1255',\n",
" 'Ingham,Phytochem.,15,819769,1489',\n",
" 'Bondarenko,Khim.Prir.Soedin,(1983),243',]\n",
"\n",
"\n",
"# Test\n",
"for ref in general_pattern_references:\n",
" assert parser.parse_general_pattern(ref)\n",
" print(' Original reference: {} \\n Resulting dict:\\n {}\\n'.format(ref, parser.parse_general_pattern(ref)))\n",
" \n",
" \n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Pattern N°2"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" Original reference: J_Agric_Food_Chem_2016_64_(21):4255-4263 \n",
" Resulting dict:\n",
" {'journal': 'J Agric Food Chem', 'year': '2016', 'volume': '64', 'issue': '21', 'pages': '4255-4263', 'first_page': '4255'}\n",
"\n",
" Original reference: J_Nat_Prod_2015_78_(4):730-735 \n",
" Resulting dict:\n",
" {'journal': 'J Nat Prod', 'year': '2015', 'volume': '78', 'issue': '4', 'pages': '730-735', 'first_page': '730'}\n",
"\n",
" Original reference: Phytochemistry_2003;64:285-291 \n",
" Resulting dict:\n",
" {'journal': 'Phytochemistry', 'year': '2003', 'volume': '64', 'issue': None, 'pages': '285-291', 'first_page': '285'}\n",
"\n",
" Original reference: J_Ethnopharmacol_2008;118(3):448-54 \n",
" Resulting dict:\n",
" {'journal': 'J Ethnopharmacol', 'year': '2008', 'volume': '118', 'issue': '3', 'pages': '448-54', 'first_page': '448'}\n",
"\n",
" Original reference: \"J_Nat_Prod_2002;65_(7):1030-1032\" \n",
" Resulting dict:\n",
" {'journal': 'J Nat Prod', 'year': '2002', 'volume': '65', 'issue': '7', 'pages': '1030-1032', 'first_page': '1030'}\n",
"\n"
]
}
],
"source": [
"# Test parser for a rarer structured reference pattern\n",
"underscore_pattern_references = ['J_Agric_Food_Chem_2016_64_(21):4255-4263',\n",
" 'J_Nat_Prod_2015_78_(4):730-735',\n",
" \"Phytochemistry_2003;64:285-291\",\n",
" 'J_Ethnopharmacol_2008;118(3):448-54',\n",
" '\"J_Nat_Prod_2002;65_(7):1030-1032\"']\n",
"\n",
"\n",
"for ref in underscore_pattern_references:\n",
" assert parser.parse_underscore_pattern(ref)\n",
" print(' Original reference: {} \\n Resulting dict:\\n {}\\n'.format(ref, parser.parse_underscore_pattern(ref)))\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Pattern N°3"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" Original reference: Gunasekera,J.Chem.Soc.,Perkin 1,(1975),2447 \n",
" Resulting dict:\n",
" {'authors': 'Gunasekera', 'first_author_surname': 'Gunasekera', 'journal': 'J.Chem.Soc., Perkin 1', 'year': '1975', 'pages': '2447', 'first_page': '2447'}\n",
"\n",
" Original reference: Locksley,J.Chem.Soc.,C,(1971),1332 \n",
" Resulting dict:\n",
" {'authors': 'Locksley', 'first_author_surname': 'Locksley', 'journal': 'J.Chem.Soc., C', 'year': '1971', 'pages': '1332', 'first_page': '1332'}\n",
"\n"
]
}
],
"source": [
"# Regex for J.Chem.Soc. references \n",
"\n",
"jchemsoc_pattern_references = [\n",
" 'Gunasekera,J.Chem.Soc.,Perkin 1,(1975),2447',\n",
" 'Locksley,J.Chem.Soc.,C,(1971),1332',]\n",
"\n",
"# Test\n",
"for ref in jchemsoc_pattern_references:\n",
" assert parser.parse_jchemsoc_pattern(ref)\n",
" print(' Original reference: {} \\n Resulting dict:\\n {}\\n'.format(ref, parser.parse_jchemsoc_pattern(ref)))\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Pattern for Harborne´s Handbook of Natural Flavonoids"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" Original reference: Harborne, The Handbook of Natural Flavonoids, 2, (1999), 115,Chalcones,dihydrochalcones and aurones \n",
" Resulting dict:\n",
" {'authors': 'Harborne, J.B., Baxter, H.', 'title': 'The Handbook of Natural Flavonoids', 'volume': '2', 'year': '1999', 'chapter_no': '115', 'chapter_title': 'Chalcones,dihydrochalcones and aurones', 'publisher': 'Wiley', 'doi': '10.1016/S0039-9140(00)00629-9', 'isbn': '0-471-95893-2', 'original_str': 'Harborne, The Handbook of Natural Flavonoids, 2, (1999), 115,Chalcones,dihydrochalcones and aurones'}\n",
"\n",
" Original reference: Harborne, The Handbook of Natural Flavonoids, 1, (1999), 181.Flavonols \n",
" Resulting dict:\n",
" {'authors': 'Harborne, J.B., Baxter, H.', 'title': 'The Handbook of Natural Flavonoids', 'volume': '1', 'year': '1999', 'chapter_no': '181', 'chapter_title': 'Flavonols', 'publisher': 'Wiley', 'doi': '10.1016/S0039-9140(00)00629-9', 'isbn': '0-471-95893-1', 'original_str': 'Harborne, The Handbook of Natural Flavonoids, 1, (1999), 181.Flavonols'}\n",
"\n",
" Original reference: Harborne, The Handbook of Natural Flavonoids, 1, (1999), 3.Flavone O-glycosides, John Wiley & Son \n",
" Resulting dict:\n",
" {'authors': 'Harborne, J.B., Baxter, H.', 'title': 'The Handbook of Natural Flavonoids', 'volume': '1', 'year': '1999', 'chapter_no': '3', 'chapter_title': 'Flavone O-glycosides', 'publisher': 'Wiley', 'doi': '10.1016/S0039-9140(00)00629-9', 'isbn': '0-471-95893-1', 'original_str': 'Harborne, The Handbook of Natural Flavonoids, 1, (1999), 3.Flavone O-glycosides, John Wiley & Son'}\n",
"\n"
]
}
],
"source": [
"# Check regex for the Handbook of Natural Flavonoids (Harborne)\n",
"\n",
"harborne_flavonoid_references = [\n",
" 'Harborne, The Handbook of Natural Flavonoids, 2, (1999), 115,Chalcones,dihydrochalcones and aurones',\n",
" 'Harborne, The Handbook of Natural Flavonoids, 1, (1999), 181.Flavonols',\n",
" 'Harborne, The Handbook of Natural Flavonoids, 1, (1999), 3.Flavone O-glycosides, John Wiley & Son',]\n",
"\n",
"for ref in harborne_flavonoid_references:\n",
" assert parser.parse_harborne_flavonoid_pattern(ref)\n",
" print(' Original reference: {} \\n Resulting dict:\\n {}\\n'.format(ref, parser.parse_harborne_flavonoid_pattern(ref)))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Pattern for Harborne´s Phytochemical Dictionary"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" Original reference: Harborne,Phytochemical Dictionary Second Edition,Taylor and Francis,(1999),Chapter54 \n",
" Resulting dict:\n",
" {'year': '1999', 'chapter_no': '54', 'authors': 'Harborne, J.B., Baxter, H., Moss, G.P.', 'publisher': 'Taylor & Francis', 'title': 'Phytochemical Dictionary. A Handbook of Bioactive Compounds from Plants (Second Edition)', 'doi': 'https://doi.org/10.4324/9780203483756', 'isbn': '9780748406203', 'original_str': 'Harborne,Phytochemical Dictionary Second Edition,Taylor and Francis,(1999),Chapter54'}\n",
"\n"
]
}
],
"source": [
"# Check regex for the Phytochemical Dictionary (Harborne)\n",
"\n",
"harborne_phytochemdict_references = [\n",
" 'Harborne,Phytochemical Dictionary Second Edition,Taylor and Francis,(1999),Chapter54',]\n",
"\n",
"\n",
"\n",
"for ref in harborne_phytochemdict_references:\n",
" assert parser.parse_harborne_phytochemdict_pattern(ref)\n",
" print(' Original reference: {} \\n Resulting dict:\\n {}\\n'.format(ref, parser.parse_harborne_phytochemdict_pattern(ref)))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Analysis of COCONUT reference composition"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [],
"source": [
"# How many references are there and what type of references are we dealing with?\n",
"\n",
"total_number = 0 # Total number of references\n",
"unique_number = 0 # Number of unique references\n",
"PMID_number = 0 # Number of PubMed IDs\n",
"DOI_number = 0 # Number of DOIs\n",
"general_pattern_number = 0 # Number of references that can be matched exactly by the pattern above.\n",
"\n",
"harborne_flavonoid_number = 0\n",
"harborne_phytochemdict_number = 0\n",
"underscore_pattern_number = 0\n",
"j_chem_soc_number = 0\n",
"no_digits_number = 0\n",
"suspiciously_short_number = 0\n",
"unmatched_references = []\n",
"no_digits_references = []\n",
"suspiciously_short_references = []\n",
"\n",
"references = []\n",
"for tup in ID_ref_tuples:\n",
"    for ref in tup[1]:\n",
"        if ref != \"NA\":\n",
"            total_number += 1\n",
"            if ref not in references:\n",
"                unique_number += 1\n",
"                references.append(ref)\n",
"                # Check for DOI\n",
"                if cn.contains_DOI(ref):\n",
"                    DOI_number += 1\n",
"                # Check for PMID (reference str consists only of digits and is longer than 3 characters)\n",
"                elif ref.isdigit():\n",
"                    if len(ref) > 3:\n",
"                        PMID_number += 1\n",
"                # Check for the most frequent reference notation pattern\n",
"                elif parser.parse_general_pattern(ref):\n",
"                    general_pattern_number += 1\n",
"                # Check for other reference notation patterns\n",
"                elif parser.parse_underscore_pattern(ref):\n",
"                    underscore_pattern_number += 1\n",
"                elif parser.parse_jchemsoc_pattern(ref):\n",
"                    j_chem_soc_number += 1\n",
"                # Check for Harborne's Handbook of Natural Flavonoids\n",
"                elif parser.parse_harborne_flavonoid_pattern(ref):\n",
"                    harborne_flavonoid_number += 1\n",
"                # Check for Harborne's Phytochemical Dictionary Second Edition\n",
"                elif parser.parse_harborne_phytochemdict_pattern(ref):\n",
"                    harborne_phytochemdict_number += 1\n",
"                # Check for (useless) references that contain no digits\n",
"                elif re.fullmatch(r'\\D+', ref):\n",
"                    no_digits_number += 1\n",
"                    no_digits_references.append(ref)\n",
"                elif len(ref) < 10:\n",
"                    suspiciously_short_number += 1\n",
"                    suspiciously_short_references.append(ref)\n",
"                else:\n",
"                    unmatched_references.append((tup[0], ref))\n"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"There are 406747 COCONUT entries\n",
"70969 of them have a total of 158824 references (66151 of them are unique).\n",
"20953 of them are PMIDs.\n",
"12232 of them are DOIs.\n",
"Another 29297 of them follow a very specific pattern (as in: Haba,Phytochem.,68,(2007),1255). This would offer enough information\n",
"Another 182 of them follow a different specific pattern (as in: J_Nat_Prod_2015_78_(4):730-735).\n",
"Another 111 of them follow a different pattern that only occurs in J Chem Soc references (as in: Locksley,J.Chem.Soc.,C,(1971),1332)\n",
"Another 589 of them come from Harborne´s Handbook of Natural Flavonoids\n",
"Another 59 of them come from Harborne´s Phytochemical Dictionary Second Edition\n",
"That leaves us with 2728 unique references that do not match a specific pattern (that we know).\n",
"285 of them do not contain any digit.\n",
"155 of the remaining references are shorter than 10 characters.\n"
]
}
],
"source": [
"non_specific_ref = unique_number - DOI_number - PMID_number - general_pattern_number - harborne_flavonoid_number - harborne_phytochemdict_number - underscore_pattern_number - j_chem_soc_number\n",
"\n",
"print('There are {} COCONUT entries'.format(len(COCONUT_IDs)))\n",
"print('{} of them have a total of {} references ({} of them are unique).'.format(len(ID_ref_tuples), total_number, unique_number))\n",
"print('{} of them are PMIDs.'.format(PMID_number))\n",
"print('{} of them are DOIs.'.format(DOI_number))\n",
"print('Another {} of them follow a very specific pattern (as in: Haba,Phytochem.,68,(2007),1255). This would offer enough information'.format(general_pattern_number))\n",
"print('Another {} of them follow a different specific pattern (as in: J_Nat_Prod_2015_78_(4):730-735).'.format(underscore_pattern_number))\n",
"print('Another {} of them follow a different pattern that only occurs in J Chem Soc references (as in: Locksley,J.Chem.Soc.,C,(1971),1332)'.format(j_chem_soc_number))\n",
"print('Another {} of them come from Harborne´s Handbook of Natural Flavonoids'.format(harborne_flavonoid_number))\n",
"print('Another {} of them come from Harborne´s Phytochemical Dictionary Second Edition'.format(harborne_phytochemdict_number))\n",
"print('That leaves us with {} unique references that do not match a specific pattern (that we know).'.format(non_specific_ref))\n",
"print('{} of them do not contain any digit.'.format(no_digits_number))\n",
"print('{} of the remaining references are shorter than 10 characters.'.format(suspiciously_short_number))\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Excerpt of remaining references"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"('CNP0106606', 'Cole,Handbook of Secondary Fungal Metabolites,Volume I,(2003)')\n",
"('CNP0309481', 'Yin, et al., Modern Study of Chinese Drugs and Clinical Applications (1), Xueyuan Press, Beijing, (1993).')\n",
"('CNP0309481', 'Buckingham(Executive Editor), Dictionary of Natural Products, Chapman & Hall, 1994, Vol1-7')\n",
"('CNP0309481', 'Buckingham(Executive Editor)Dictionary of Natural ProductsChapman & Hall 1995, Vol8')\n",
"('CNP0309481', 'Buckingham(Executive Editor)Dictionary of Natural ProductsChapman & Hall 1996, Vol9')\n",
"('CNP0309481', 'Buckingham(Executive Editor)Dictionary of Natural ProductsChapman & Hall 1997, Vol10')\n",
"('CNP0309481', 'Buckingham(Executive Editor)Dictionary of Natural ProductsChapman & Hall 1998, Vol11.')\n",
"('CNP0251815', 'Ohmiya,The Alkaloids,47,(1995),1,Lupine alkaloids')\n",
"('CNP0251815', 'Ji, et al., Pharmacological Action and Application of Available Composition of Traditional Chinese Medicine, Heilongjiang Science and technology Press, Heilongjiang, (1995).')\n",
"('CNP0251815', 'Sun, et al., Brief Handbook of Natural Active Compounds, Medicinal Science and Technology Press of China, Beijing, (1998).')\n",
"('CNP0251815', \"Ou, et al., Brief Handbook of Components of Traditional Chinese Medicines, The People's Medical Publishing House, Beijing, (2003).\")\n",
"('CNP0251815', \"Chen, Liu, et al., Determination of Effective Components in Traditional Chinese medicines, People's Medical Publishing House, Beijing, (2009)\")\n",
"('CNP0235814', 'Wang, et al., Handbook of Effective Components in Vegetal Medicines, People Health Press, Beijing, (1986).')\n",
"('CNP0235814', 'Edited by Jiangsu New Medicinal College, Chinese Medicine Dictionary, Shanghai Science and technology Press, Shanghai, (1979).')\n",
"('CNP0326009', 'Chinese Materia Medica Editing Committee of the National Chinese Medicine and Pharmacology Bureau, Chinese Materia Medica (ZHONG HUA BEN CAO), Vol.1-Vol.30, Shanghai Science and technology Press, Shanghai, (1999).')\n",
"('CNP0205647', 'Chinese Materia Medica Editing Committee of the National Chinese Medicine and Pharmacology Bureau, Chinese Materia Medica (ZHONG HUA BEN CAO), Vol.1-Vol.30, Shanghai Science and technology Press, Shanghai, (1999)')\n",
"('CNP0205647', 'Takabayashi,J.,The Ecology of Symbiosis,(1995)Jpn,ISBN4-582-50034X')\n",
"('CNP0108045', 'Asakawa,Chemical sonstituents of Chilean liverworts,in Studies on Cryptogams in Southern Chile,Tokai University Press,(1984),109')\n",
"('CNP0186061', 'Chem Abstr, 111, (1989), 70925q')\n",
"('CNP0186061', 'MED:23234130')\n",
"('CNP0258993', 'Edited by Jiangsu New Medicinal College, Chinese Medicine Dictionary, Shanghai Science and technology Press, Shanghai, (1979)')\n",
"('CNP0258993', 'de Ville,J.Chem.Soc.,Chem.Commun.,(1969),1311')\n",
"('CNP0236994', 'Cole,Handbook of Toxic Fungal Metabolites,Academic Press, (1981),386-389')\n",
"('CNP0426957', '33-10-/t28-')\n",
"('CNP0352050', 'Keller,P.A. and Nugraha, A.S., Research Online Open Access Int.Univ. Rep. Univ. Wollongong, 6, (2011),1953-1966.')\n",
"('CNP0226964', 'Asakawa,Chemical constituents of Hepaticae, in Progress in the Chemistry of Organic Natural Products,Springer,Vienna,(1982),1')\n",
"('CNP0121737', 'PLoS ONE., 2013, 8(9), e73076')\n",
"('CNP0159086', 'Backer,H.J.Recl.Trav.Chim.(J.R.Neth.Chem.Soc.),60,(1941),661-667')\n",
"('CNP0294939', 'Valant-Vetschera,Biochem.,Syst.Ecol.,27,(1999),27')\n",
"('CNP0423725', 'Strack,Z.Naturforsch.,C.,41,(1986),707')\n",
"('CNP0423725', 'Strack,28,(1989),2127')\n",
"('CNP0199385', 'Aust.J.Chem.,23,(1970),2343-2351')\n",
"('CNP0199385', 'Biochem.J.,67,(1957),390-399')\n",
"('CNP0199385', 'Kameda, K., Aoki, H., Namiki, M. and Overeem, J.C. (1974) An alternative structure for botrallin a metabolite of botrytis allii. Tetrahedron Lett., 15(1), 103-106.')\n",
"('CNP0196088', 'Phytochem.,28,(1989),2717')\n",
"('CNP0226356', 'Zhou, et al., Chinese Pharmaceutical Journal(Zhongguo Yaoxue Zazhi), 38, (2003), 81')\n",
"('CNP0117445', 'Si, et al., China Journal of Chinese Materia Medica(Zhongguo Zhongyao Zazhi), 20, (1995), 295.')\n",
"('CNP0421207', 'Sun, et al., Brief Handbook of Natural Active Compounds, Medicinal Science and Technology Press of China, Beijing, (1998)')\n",
"('CNP0112595', 'Chexal,Chem.Ind.(London.),(1970),28')\n",
"('CNP0112595', 'Ji, et al., Pharmacological Action and Application of Available Antitumor Composition of Traditional Chinese Medicine, Heilongjiang Science and technology Press, Heilongjiang, (1998).')\n",
"('CNP0112595', 'Karikome, Wen-ben Yang translated, Phytochemistry, Science Press, Beijing, (1985)')\n",
"('CNP0171089', 'McFeeters,CA,65,(1966),4545b')\n",
"('CNP0171089', 'Bakowski,CA62,(1964),11073a')\n",
"('CNP0143230', 'Dean,F.M.Naturally Occurring Oxygen Ring Compounds,Butterworths,(1963),337')\n",
"('CNP0262237', 'Chang, et al., Dictionary of Chemistry, Science Press, Beijing, (2008).')\n",
"('CNP0423737', 'Ni, et al., Acta Pharmaceutica Sinica(Yaoxue Xuebao), 35, (2000), 115.')\n",
"('CNP0120326', 'Ho.J.Chem.Soc.Perkin Trans.,1,(1973),2579')\n",
"('CNP0356790', ' Shabrawy,Abstr.23rd Annual Meeting American Society of Pharmacognosy.Pittsburgh,U.S.A,(1982),23')\n",
"('CNP0295949', ' 26, (1978), 1453')\n",
"('CNP0224900', \"Dell'Agli, et al., Planta Med, 69, (2003), 162.\")\n",
"('CNP0224900', 'Guo, et al., Acta Pharmaceutica Sinica(Yaoxue Xuebao), 22, (1987), 28.')\n",
"('CNP0224900', 'Kasajima,Phytochem.,69,(2008).3080')\n",
"('CNP0243611', 'Sun, et al., Diterpenoids from Isodon Species, Science Press, Beijing, (2001)')\n",
"('CNP0245342', 'Uvarova,N.I.et al.,Chem.Nat.Compd.(Engl.Transl.),1,(1965),63-66')\n",
"('CNP0102727', 'PLoS ONE., 2011, 6(2), e16957')\n",
"('CNP0423758', 'Wang, et al., Handbook of Effective Components in Vegetal Medicines, People Health Press, Beijing, (1986)')\n",
"('CNP0386613', 'Buckingham(Executive Editor)Dictionary of Natural ProductsChapman & Hall 1998, Vol11')\n",
"('CNP0206087', 'Zhao, et al., Natural Product Research and Development(Tianran Chanwu Yanjiu Yu Kaifa), 14, (2002), 29.')\n",
"('CNP0389032', 'Munro,J.Chem.Soc.,(C),(1971),685-688')\n",
"('CNP0145249', 'CBA:464361')\n",
"('CNP0144084', 'Beckmann,Phytochem.,10,(1971),1465.Enoki,Chem.Abstr.,88 71421r,71422s:Braun,Tetrahedron,33,(1977),145')\n",
"('CNP0135748', 'Natural Products Vol.1 ISBN978-3-642-22143-9')\n",
"('CNP0298404', 'Karikome, Wen-ben Yang translated, Phytochemistry, Science Press, Beijing, (1985).')\n",
"('CNP0165536', 'Aldrich Library of FT-IR Spectra,1st edn.,2,(1985),660C')\n",
"('CNP0159776', 'Stuart,J.Am.Chem.Soc.,C,(1970),1228')\n",
"('CNP0260173', 'Buckingham,Dictionary of Natural Products,S-01280.Champan & Hall,London,(1993)')\n",
"('CNP0255093', 'Seto,PCT Int.Appl.WO98,(1998),41.503')\n",
"('CNP0140046', 'Rao, et al., China Journal of Chinese Materia Medica(Zhongguo Zhongyao Zazhi), 18, (1993), 681.')\n",
"('CNP0140046', 'Jumar,Planta Med.,B 30,(1976),291')\n",
"('CNP0206373', \"D'Auria,J.Nat.Prod.,60,(1997),814\")\n",
"('CNP0131323', 'Chen, et al., Lexicon of Active Componentsin in Plants, Vol1, Medicinal Science and Technology Press of China, Beijing, (2001).')\n",
"('CNP0310325', 'Gottlieb,An. Acad. Bras. Cienc.,42(Suppl.),(1970),65')\n",
"('CNP0147332', 'Chen, et al., Lexicon of Active Componentsin in Plants, Vol 3, Medicinal Science and Technology Press of China, Beijing, (2001)')\n",
"('CNP0360040', 'Sachdev,Indian,J.Chem.Sect.B,21,(1982),789')\n",
"('CNP0182212', 'Huang, et al., China Journal of Chinese Materia Medica(Zhongguo Zhongyao Zazhi), 22, (1997), 247.')\n",
"('CNP0202496', 'Experientia,28,(1972),1150-1151')\n",
"('CNP0344781', 'Govindachari,J.Chem.Soc.,(195),3839')\n",
"('CNP0344781', 'Saxena,J.Inst.Chem.,(India),49,(1977),107')\n",
"('CNP0168428', 'Li, et al., Acta Pharmaceutica Sinica(Yaoxue Xuebao), 37, (2002), 69.')\n",
"('CNP0168428', 'Li, et al., Acta Pharmaceutica Sinica(Yaoxue Xuebao), 36, (2001), 944.')\n",
"('CNP0168428', 'Manson,The leucoanthocyanin from black spruce inner bark. Tappi,43,(19854),59')\n",
"('CNP0168428', 'Shibanova,Akad.Nauk,SSSR,Ser.Khim.Nauk.,(1977),153')\n",
"('CNP0253850', 'Buckingham,Dictionary of Natural Products,C-02216.Champan & Hall,London,(1993)')\n",
"('CNP0253850', 'Sugimoto,Tokkyo Koho,JP02,(1990),138984')\n",
"('CNP0423781', 'Rettig,Biochem.Syst.Ecol.,18,(1990).,393')\n",
"('CNP0181070', 'Sun, et al., Diterpenoids from Isodon Species, Science Press, Beijing, (2001).')\n",
"('CNP0227768', 'Burreson,J.Chem.Soc,Chem.Commun,(1975),486')\n",
"('CNP0265600', 'Geiger,Z.Naturforsch.,C.,44,(1989),559')\n",
"('CNP0329876', 'J.Kolar and I.Machackova, J.Pineal Res.,39,(2005),333-341')\n",
"('CNP0329876', 'D.Tan,J.Pineal.Res.,61,(2016),27-40[Pathway]')\n",
"('CNP0329876', 'J.Dai et al.,Molecules,21,(2016),493.1-13')\n",
"('CNP0329876', 'L.Erland,Neurotransmitter,5,(2017),e1538.1-12')\n",
"('CNP0315425', 'Wink,Z. Naturforsch.,C 42,(1982),197')\n",
"('CNP0315425', 'Kinghorn,Phytochem.,21,(1082),2269')\n",
"('CNP0342321', 'Li, et al., China Journal of Chinese Materia Medica(Zhongguo Zhongyao Zazhi), 28, (2003), 1120.')\n",
"('CNP0380914', 'Schmelz,Plant J.,2004,39,790')\n",
"('CNP0283094', 'ClarkJ.Am.Chem.Soc.,53,(1931),729')\n",
"('CNP0152773', 'Quang,Chem.Pharm.Bull,,51,(2003),1441')\n",
"('CNP0243796', 'Achenbach,Z.Naturforsch.,B:Anorg.Chem.Org.Chem.,35B,(1980),219')\n",
"('CNP0241052', 'Nielsen,Phytochem.,49,(2171),1998')\n",
"('CNP0257394', 'Saito,Microb,Toxins,7,(1971),293')\n",
"('CNP0427196', '3S)-N-((S)-5-guanidino-1-hydroxypentan-2-yl)-2-(4-(4-hydroxyphenyl)butanamido)-3-methylpentanamide\"')\n",
"('CNP0112729', 'Takeda,Bryophytes:Their Chemistry and Chemical Taxonomy.Clarendon Press.Oxford,(1990),p 201')\n",
"('CNP0112729', 'Brown,Tetrahedron,51,(1995),13 061')\n",
"('CNP0353638', 'Academic Press Inc.,New York,Ny,341 pp.(1989)')\n",
"('CNP0353638', 'Academic Press,New York,NY,631 pp.(1983)')\n",
"('CNP0170129', 'Academic Press,New York,NY,316 pp.(1983)')\n",
"('CNP0210327', 'Tetrahedron Lett.,32,(1991),5915')\n",
"('CNP0189068', 'Turner,Fungal Metabolites II,Academic Press,New York,NY,308 pp.(1983)')\n",
"('CNP0189068', 'Turner,Fungal Metabolites II,Academic Press, (1983),631')\n",
"('CNP0322491', 'Daniewski,Bull.Acad.Pol.Sci.,Ser.Sci.Chim.,18,(1970),585')\n",
"('CNP0280892', 'Guo, et al., China Journal of Chinese Materia Medica(Zhongguo Zhongyao Zazhi), 21, (1996), 290.')\n",
"('CNP0423799', 'San Feliciano,28,(1989),2717')\n",
"('CNP0268879', 'Bartley,J.Biol.Chem,264,(1989),13 109')\n",
"('CNP0205963', 'Liu, et al., China Journal of Chinese Materia Medica(Zhongguo Zhongyao Zazhi), 20, (1995), 738.')\n",
"('CNP0420898', 'Chumbalov,Khim.Prir,Soedin,12,(1976),658')\n",
"('CNP0279184', 'Aldrich Library of FT-IR Spectra,1st edn.,1,(1985),309D')\n",
"('CNP0302588', 'Delle Monache,Gazz,Chim.Ital.,104,(1974),861')\n",
"('CNP0270079', 'Michael,Nat.Prod.Rep.,18,( 2001),520')\n",
"('CNP0169938', \"D'Ari,J. Biol. Chem.,266,(1991),23953\")\n",
"('CNP0217588', 'CBA:300080')\n",
"('CNP0186309', 'Flury,J.Chem.Soc.,Chem.Commun.,26,(1965),27')\n",
"('CNP0239197', 'Karrer,Konstitution und Vorkommen der Organischen Pflanzenstoffe,2nd.Ed.,Birkhauser Verlag, Basel,1972-1985,no 1929')\n",
"('CNP0229365', 'Perkins,Bacillus subtilis and Other Gram-positive Bacteria.Biochenistry,Physiology,and Morecular Genetics,American Society for Microbiology,Washington DC,(1993),p319')\n",
"('CNP0229365', 'DeMoll,Escherichia coli and Salmonella.Cellular and Molecular Biology,ASM Press.Washington DC,vol 1,(1996),p704')\n",
"('CNP0164629', 'Stockigt,Z.Naturforsch.,Teil C,37,(1982),857')\n",
"('CNP0121399', 'Liang, et al., Fundamental Research on Common Traditional Chinese Medicines. Vol.1. Science Press, Beijing, (2003)')\n",
"('CNP0211423', 'Robeson,Z.Naturforsch.,Teil C,36,(1981),1081-1083')\n",
"('CNP0405299', 'Achenbach,Z.Naturforsch.,B:Anorg.Chem.Or.Chem.,35B,(1980),885')\n",
"('CNP0103732', 'Gottlieb,Natural Products og Woody Plants-Chemicals Extraneous to the Lignocellulosic Cell Wall.Springer-Verlag Berlin,(1989),p439')\n",
"('CNP0236491', 'Weete,Structure and Function of Sterols in Fungi,Advances in Lipid Research,23,(1989),115-167')\n",
"('CNP0266186', 'Oritani,Agric.Biol.,Chem.,49,(1984),245')\n",
"('CNP0266186', 'Agric.Biol.,52,(1988),2119')\n",
"('CNP0266186', 'AgricBiol.Chem.,54,(1990),125')\n",
"('CNP0302062', \"O'Neill,Z. Naturforsch.,C 38,(1983),693\")\n",
"('CNP0302062', \"O'Neill,Phytochem.,25,(1986),1315\")\n",
"('CNP0140264', 'Haensel,R.et al.,Arch.Pharm.(Weinheim,Ger.),305,(1972),33')\n",
"('CNP0212606', 'Oikawa,J.Chem.Soc,Chem.Commun,(1989),1284')\n",
"('CNP0314423', \"D'Agostino,Phytochem.,31,(1992),4387\")\n",
"('CNP0279481', 'Aldrich Library of FT-IR Spectra,1st edn.,1,(1985),1289C')\n",
"('CNP0195903', 'Aldrich Library of 13C and 1H FT NMR Spectra,1,(1992),712A')\n",
"('CNP0232327', 'Aldrich Library of FT-IR Spectra,1st edn.,2,(1985),160C')\n",
"('CNP0298204', 'Geiger,Z.Naturforsch.,C.,43,(1988),1')\n",
"('CNP0161993', 'Phadnis,J.Chem.Soc.,Perkin Trans. 1,(1984),937')\n",
"('CNP0289490', 'Yu, et al., Natural Product Research and Development(Tianran Chanwu Yanjiu Yu Kaifa), 6, (1994), 1.')\n",
"('CNP0300329', 'Toth,J.O.et al.,J.Chem.Res.,Miniprint,(1983),2722-2787')\n",
"('CNP0194074', 'Zhu,Planta Med.,63,(1986)')\n",
"('CNP0264571', 'Sugiura,Chem.Lett,471,(1975)')\n",
"('CNP0396732', 'Rodriguez,J.Nat.Prod.,60<(1997),1331')\n",
"('CNP0228197', 'J.Chem.Soc,Chem.Commun,(1980),183')\n"
]
}
],
"source": [
"for ref in unmatched_references[:150]:\n",
"    print(ref)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Retrieve information for all COCONUT references \n",
"#### (or read the already retrieved information from a file)"
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {},
"outputs": [],
"source": [
"if os.path.exists('./COCONUT_reference_retrieval_raw_output.tsv'):\n",
"    with open('COCONUT_reference_retrieval_raw_output.tsv', 'r') as retrieved_file:\n",
"        # Skip the first line and split every row at the tab character\n",
"        retrieved_data = [line.split('\\t') for line in retrieved_file.readlines()[1:]]\n",
"else:\n",
"    if os.path.exists('./coconut_references.csv'):\n",
"        # Warning: This may take multiple days.\n",
"        retrieval_data = rCr.retrieval_coordination('./coconut_references.csv')\n",
"    else:\n",
"        print('The COCONUT reference file was not found at the given path!')"
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"66150\n"
]
}
],
"source": [
"# Count retrieved entries\n",
"print(len(retrieved_data))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Sort retrieved reference data by query type (DOI, PMID or keyword) [or read sorted data from file]"
]
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"There are 11676 retrieved references based on DOI queries, 20860 based on PMID queries and 33227 on keyword queries with the reference strings.\n"
]
}
],
"source": [
"# Sort retrieved data by query string type (PMID, DOI, keyword)\n",
"PMID_based_dicts = []\n",
"DOI_based_dicts = []\n",
"keyword_based_dicts = []\n",
"failed_queries = []\n",
"\n",
"# Check if the retrieved dicts have already been filtered and saved before\n",
"if os.path.exists('./retrieved_dicts_filtered.csv'):\n",
"    with open('./retrieved_dicts_filtered.csv') as filtered_retrieved_dicts:\n",
"        for entry in filtered_retrieved_dicts.readlines():\n",
"            query_type, retrieved_dict = entry.split(', ', 1)\n",
"            if query_type == 'PMID':\n",
"                PMID_based_dicts.append(eval(retrieved_dict))\n",
"            elif query_type == 'DOI':\n",
"                DOI_based_dicts.append(eval(retrieved_dict))\n",
"            elif query_type == 'KEYWORD':\n",
"                keyword_based_dicts.append(eval(retrieved_dict))\n",
"# If no file with filtered dicts exists, filter the dictionaries from xml str and PubMed objects\n",
"else:\n",
"    for ref_str, retrieved_dict in retrieved_data:\n",
"        # Get rid of some xml str and other elements that eval() does not agree with\n",
"        xml_str = re.search(r\"'xml':.+>',\", retrieved_dict)\n",
"        if xml_str:\n",
"            retrieved_dict = retrieved_dict.replace(xml_str.group(), '')\n",
"        for what_makes_eval_unhappy in re.findall(r'<<?(?:metapub|bound|function|Element).+?>>?', retrieved_dict):\n",
"            retrieved_dict = retrieved_dict.replace(what_makes_eval_unhappy, 'False')\n",
"        retrieved_dict = eval(retrieved_dict)\n",
"\n",
"        if retrieved_dict:\n",
"            if retrieved_dict['query_str_type'] == 'PMID':\n",
"                PMID_based_dicts.append(retrieved_dict)\n",
"            elif retrieved_dict['query_str_type'] == 'DOI':\n",
"                DOI_based_dicts.append(retrieved_dict)\n",
"            elif retrieved_dict['query_str_type'] == 'unstructured_ID':\n",
"                keyword_based_dicts.append(retrieved_dict)\n",
"        else:\n",
"            failed_queries.append(ref_str)\n",
"\n",
"    # Write filtered dicts to file\n",
"    with open(\"retrieved_dicts_filtered.csv\", \"a\") as output:\n",
"        for retrieved_dict in DOI_based_dicts:\n",
"            output.write(\"DOI, \" + str(retrieved_dict) + '\\n')\n",
"        for retrieved_dict in PMID_based_dicts:\n",
"            output.write(\"PMID, \" + str(retrieved_dict) + '\\n')\n",
"        for retrieved_dict in keyword_based_dicts:\n",
"            output.write(\"KEYWORD, \" + str(retrieved_dict) + '\\n')\n",
"\n",
"print('There are {} retrieved references based on DOI queries, '.format(len(DOI_based_dicts))\n",
"      + '{} based on PMID queries '.format(len(PMID_based_dicts))\n",
"      + 'and {} on keyword queries with the reference strings.'.format(len(keyword_based_dicts)))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Example comparison of reference str with retrieved data based on keyword query:"
]
},
{
"cell_type": "code",
"execution_count": 38,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'title': 'Two new coumarins from Murraya plants.',\n",
" 'abstract': None,\n",
" 'DOI': '10.1248/cpb.37.819',\n",
" 'issue': '3',\n",
" 'volume': '37',\n",
" 'year': 1989,\n",
" 'journal': 'Chemical and Pharmaceutical Bulletin',\n",
" 'authors': ['Ito, C.', 'Furukawa, H.'],\n",
" 'first_author_surname': 'Ito',\n",
" 'pages': '819-820',\n",
" 'first_page': '819',\n",
" 'reference_retrieved_from': 'Crossref',\n",
" 'query_str_type': 'unstructured_ID',\n",
" 'query_str': 'Ito,Chem. Pharm. Bull.,37,(1989),819',\n",
" 'PMID': None}"
]
},
"execution_count": 38,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"example_dict = keyword_based_dicts[0]\n",
"norm_example_dict = cn.normalize_crossref_dict(example_dict)\n",
"norm_example_dict"
]
},
{
"cell_type": "code",
"execution_count": 39,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'authors': 'Ito',\n",
" 'first_author_surname': 'Ito',\n",
" 'journal': 'Chem. Pharm. Bull.',\n",
" 'volume': '37',\n",
" 'issue': None,\n",
" 'year': '1989',\n",
" 'pages': '819',\n",
" 'first_page': '819'}"
]
},
"execution_count": 39,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"parsed_info = parser(norm_example_dict['query_str'])\n",
"parsed_info"
]
},
{
"cell_type": "code",
"execution_count": 40,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"True"
]
},
"execution_count": 40,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"importlib.reload(cn)\n",
"cn.is_same_publication(parsed_info, norm_example_dict)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Match all retrieved data based on keyword queries with parsed data from reference strings"
]
},
{
"cell_type": "code",
"execution_count": 41,
"metadata": {},
"outputs": [],
"source": [
"importlib.reload(cn)\n",
"confirmed_retrieved_dicts = []\n",
"harborne_dicts = []\n",
"falsified_retrieved_dicts = []\n",
"for retrieved_dict in keyword_based_dicts:\n",
" norm_keyword_dict = cn.normalize_crossref_dict(retrieved_dict)\n",
" if norm_keyword_dict:\n",
" parsed_info = parser(norm_keyword_dict['query_str'])\n",
" if parsed_info:\n",
" # If a reference can be confirmed as in the reference str: Good.\n",
" if cn.is_same_publication(parsed_info, norm_keyword_dict):\n",
" confirmed_retrieved_dicts.append(retrieved_dict)\n",
" # If a reference can be identified as one of the known books: Good.\n",
" elif parser.parse_harborne_flavonoid_pattern(norm_keyword_dict['query_str']):\n",
" harborne_dicts.append(parser.parse_harborne_flavonoid_pattern(norm_keyword_dict['query_str']))\n",
" elif parser.parse_harborne_phytochemdict_pattern(norm_keyword_dict['query_str']):\n",
" harborne_dicts.append(parser.parse_harborne_phytochemdict_pattern(norm_keyword_dict['query_str']))\n",
" # Retrieved info does not overlap with parsed info and does not refer to known book: Bad.\n",
" else:\n",
" falsified_retrieved_dicts.append(retrieved_dict)"
]
},
{
"cell_type": "code",
"execution_count": 42,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Out of the 33227 references that were retrieved based on keyword queries, 16837 belong to the original publication or can be allocated to one of the parsed book references.\n",
"That leaves us with 13209 keyword_based queries that led to False information retrieval.\n"
]
}
],
"source": [
"print('Out of the {} references that were retrieved based on keyword queries, {} belong to the original publication or can be allocated to one of the parsed book references.'.format(len(keyword_based_dicts), len(confirmed_retrieved_dicts)+len(harborne_dicts)))\n",
"print('That leaves us with {} keyword_based queries that led to False information retrieval.'.format(len(falsified_retrieved_dicts)))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Show examples of original references and references based on queries"
]
},
{
"cell_type": "code",
"execution_count": 43,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" Original reference: \n",
" Ito,Chem. Pharm. Bull.,37,(1989),819 \n",
" Normalised reference: \n",
" Ito, Chemical and Pharmaceutical Bulletin, 1989, 37 (3), 819 \n",
"\n",
" Original reference: \n",
" Lambden,J. Bacteriol.,115,(1973),992 \n",
" Normalised reference: \n",
" Lambden, Journal of Bacteriology, 1973, 115 (3), 992 \n",
"\n",
" Original reference: \n",
" Nash,R.J.et al.,Tet.Lett.,35,(1994),7849-7852 \n",
" Normalised reference: \n",
" Nash, Tetrahedron Letters, 1994, 35 (41), 7849 \n",
"\n",
" Original reference: \n",
" Fujimoto,Chem.Pharm.Bull.,54,(2006),550 \n",
" Normalised reference: \n",
" Fujimoto et al., Chemical and Pharmaceutical Bulletin, 2006, 54 (4), 550 \n",
"\n",
" Original reference: \n",
" Westley,J. Antibiotics,27,(1974),744 \n",
" Normalised reference: \n",
" Westley et al., The Journal of Antibiotics, 1974, 27 (10), 744 \n",
"\n",
" Original reference: \n",
" Dong,Chem.Pharm.Bull.,56,(2008),1600 \n",
" Normalised reference: \n",
" Dong et al., Chemical and Pharmaceutical Bulletin, 2008, 56 (11), 1600 \n",
"\n",
" Original reference: \n",
" Ezaki,J. Antibiotics,34,(1981),1363 \n",
" Normalised reference: \n",
" Ezaki et al., The Journal of Antibiotics, 1981, 34 (10), 1363 \n",
"\n",
" Original reference: \n",
" Masuda,Phytochem.,30,(1991),2391 \n",
" Normalised reference: \n",
" Masuda et al., Phytochemistry, 1991, 30 (7), 2391 \n",
"\n",
" Original reference: \n",
" Ikeshiro,Planta Med.,50,(1984),485 \n",
" Normalised reference: \n",
" Ikeshiro, Planta Medica, 1984, 50 (06), 485 \n",
"\n",
" Original reference: \n",
" Connell,D.W.et al.,Aust.J.Chem.,23,(1970),369-376 \n",
" Normalised reference: \n",
" Connell, Australian Journal of Chemistry, 1970, 23 (2), 369 \n",
"\n"
]
}
],
"source": [
"for confirmed_info in confirmed_retrieved_dicts[:10]:\n",
"    confirmed_info = cn.normalize_crossref_dict(confirmed_info)\n",
"    original_ref_str = confirmed_info['query_str']\n",
"    improved_ref_str = cn.create_normalized_reference_str(confirmed_info)\n",
"    print(' Original reference: \\n {} \\n Normalised reference: \\n {} \\n'.format(original_ref_str, improved_ref_str))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Extended query for unmatched references\n",
"\n",
"References that could not be confirmed are used for a second Crossref query. This time, the query string is cleaned up and the first 200 results are checked instead of only the first one.\n",
"\n",
"#### Example"
]
},
{
"cell_type": "code",
"execution_count": 44,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'title': None,\n",
" 'abstract': None,\n",
" 'DOI': '10.1111/ppl.2001.111.issue-1',\n",
" 'issue': '1',\n",
" 'volume': '111',\n",
" 'year': 2001,\n",
" 'pages': False,\n",
" 'first_page': False,\n",
" 'reference_retrieved_from': 'Crossref',\n",
" 'query_str_type': 'unstructured_ID',\n",
" 'query_str': 'Morvan-Bertrand,Physiol Plant,111,(2001),225',\n",
" 'PMID': None}"
]
},
"execution_count": 44,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"example_wrong_data = falsified_retrieved_dicts[0]\n",
"cn.normalize_crossref_dict(example_wrong_data)"
]
},
{
"cell_type": "code",
"execution_count": 45,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'authors': 'Morvan-Bertrand',\n",
" 'first_author_surname': 'Morvan-Bertrand',\n",
" 'journal': 'Physiol Plant',\n",
" 'volume': '111',\n",
" 'issue': None,\n",
" 'year': '2001',\n",
" 'pages': '225',\n",
" 'first_page': '225'}"
]
},
"execution_count": 45,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Parse original query_str\n",
"example_dict = parser(example_wrong_data['query_str'])\n",
"example_dict"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Second information retrieval\n",
"\n",
"Warning, this may take ~ 1 day (if the data has not been saved yet)."
]
},
{
"cell_type": "code",
"execution_count": 46,
"metadata": {},
"outputs": [],
"source": [
"if os.path.exists('./COCONUT_reference_second_retrieval_raw_output.tsv'):\n",
"    with open('COCONUT_reference_second_retrieval_raw_output.tsv', 'r') as retrieved_data:\n",
"        second_retrieved_data = [line.split('\\t') for line in retrieved_data.readlines()[1:]]\n",
"else:\n",
"    if os.path.exists('./retrieved_dicts_filtered.csv'):\n",
"        # Warning: This may take multiple days.\n",
"        second_retrieval_data = rCr.retrieval_coordination('./retrieved_dicts_filtered.csv')\n",
"    else:\n",
"        print('The file with the filtered retrieved dicts was not found at the given path!')"
]
},
{
"cell_type": "code",
"execution_count": 47,
"metadata": {},
"outputs": [],
"source": [
"falsified_twice_dicts = []\n",
"second_retrieval_confirmed_dicts = []\n",
"for data in second_retrieved_data:\n",
"    norm_dict = cn.normalize_crossref_dict(eval(data[1]))\n",
"    parsed_ref_dict = parser(data[0])\n",
"    if norm_dict:\n",
"        match = cn.is_same_publication(parsed_ref_dict, norm_dict)\n",
"        if match:\n",
"            second_retrieval_confirmed_dicts.append(eval(data[1]))\n",
"        else:\n",
"            falsified_twice_dicts.append(data)\n",
"confirmed_retrieved_dicts += second_retrieval_confirmed_dicts\n"
]
},
{
"cell_type": "code",
"execution_count": 48,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"During the second information retrieval, an additional 2344 confirmed references have been retrieved There are now 20880 confirmed dicts and 10829 falsified_dicts.\n"
]
}
],
"source": [
"print('During the second information retrieval, an additional {} confirmed references have been retrieved'.format(len(second_retrieval_confirmed_dicts)),\n",
" 'There are now {} confirmed dicts and {} falsified_dicts.'.format(len(confirmed_retrieved_dicts)+len(second_retrieval_confirmed_dicts), len(falsified_twice_dicts)))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Create normalised reference strings"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Generate dict that maps all \"old\" references to a dictionary containing a normalised reference str, the DOI and the PMID"
]
},
{
"cell_type": "code",
"execution_count": 49,
"metadata": {},
"outputs": [],
"source": [
"importlib.reload(cn)\n",
"references = {}\n",
"# Map retrieved info (ref str, DOI, PMID) to original ref str\n",
"retrieved_data = confirmed_retrieved_dicts + DOI_based_dicts + PMID_based_dicts\n",
"for ref in retrieved_data:\n",
"    if ref[\"reference_retrieved_from\"] == \"Crossref\":\n",
"        norm_dict = cn.normalize_crossref_dict(ref)\n",
"        if norm_dict['query_str'][0] == '{':\n",
"            norm_dict['query_str'] = eval(norm_dict['query_str'])['query_str']\n",
"    elif ref['reference_retrieved_from'] == 'MetaPub':\n",
"        norm_dict = cn.normalize_metapub_dict(ref)\n",
"    if cn.create_normalized_reference_str(norm_dict):\n",
"        references[norm_dict['query_str']] = {'reference': cn.create_normalized_reference_str(norm_dict),\n",
"                                              'DOI': norm_dict['DOI'],\n",
"                                              'PMID': norm_dict['PMID']}\n",
"# Map old Harborne ref str to same kind of dict\n",
"for ref in harborne_dicts:\n",
"    if 'volume' in ref.keys():\n",
"        harborne_str = 'Harborne, {}, {}, ({}), Chapter {}'.format(ref['title'],\n",
"                                                                   ref['volume'],\n",
"                                                                   ref['year'],\n",
"                                                                   ref['chapter_no'])\n",
"    else:\n",
"        harborne_str = 'Harborne, {}, ({}), Chapter {}'.format(ref['title'],\n",
"                                                               ref['year'],\n",
"                                                               ref['chapter_no'])\n",
"    if 'chapter_title' in ref.keys():\n",
"        harborne_str += ' - {}'.format(ref['chapter_title'])\n",
"    references[ref['original_str']] = {'reference': harborne_str,\n",
"                                       'DOI': ref['doi'],\n",
"                                       'PMID': None}"
]
},
{
"cell_type": "code",
"execution_count": 53,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"_____\n",
"Ito,Chem. Pharm. Bull.,37,(1989),819\n",
"{'reference': 'Ito, Chemical and Pharmaceutical Bulletin, 1989, 37 (3), 819', 'DOI': '10.1248/cpb.37.819', 'PMID': None}\n",
"_____\n",
"Cao,J.Nat.Prod.,67,(2004),986\n",
"{'reference': 'Cao et al., Journal of Natural Products, 2004, 67 (6), 986', 'DOI': '10.1021/np040058h', 'PMID': None}\n",
"_____\n",
"Westley,J. Antibiotics,32,(1979),874\n",
"{'reference': 'Westley et al., The Journal of Antibiotics, 1979, 32 (9), 874', 'DOI': '10.7164/antibiotics.32.874', 'PMID': None}\n",
"_____\n",
"Evans,Phytochem.,12,(1973),2505\n",
"{'reference': 'Evans, Phytochemistry, 1973, 12 (10), 2505', 'DOI': '10.1016/0031-9422(73)80464-9', 'PMID': None}\n",
"_____\n",
"10.1021/np070664n\n",
"{'reference': 'Chen et al., Journal of Natural Products, 2008, 71 (3), 431', 'DOI': '10.1021/np070664n', 'PMID': None}\n",
"_____\n",
"10.1021/ol502216j\n",
"{'reference': 'Grudniewska et al., Organic Letters, 2014, 16 (18), 4695', 'DOI': '10.1021/ol502216j', 'PMID': None}\n",
"_____\n",
"10.7164/antibiotics.28.83\n",
"{'reference': 'Shimura et al., The Journal of Antibiotics, 1975, 28 (1), 83', 'DOI': '10.7164/antibiotics.28.83', 'PMID': None}\n",
"_____\n",
"17701557\n",
"{'reference': 'Liu et al., J Asian Nat Prod Res, 2007, 9 (6-8), 689', 'DOI': '10.1080/17415990500209064', 'PMID': '17701557'}\n",
"_____\n",
"20512739\n",
"{'reference': 'Sheu et al., J Environ Sci Health B, 2010, 45 (5), 478', 'DOI': '10.1080/03601231003800347', 'PMID': '20512739'}\n",
"_____\n",
"17327465\n",
"{'reference': 'Hu et al., Mol Pharmacol, 2007, 71 (6), 1475', 'DOI': '10.1124/mol.106.032748', 'PMID': '17327465'}\n"
]
}
],
"source": [
"for key in list(references.keys())[:50000:5000]:\n",
"    print('_____\\n{}'.format(key))\n",
"    print(references[key])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Write reference dict to json file"
]
},
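{
"cell_type": "code",
"execution_count": 56,
"metadata": {},
"outputs": [],
"source": [
"import json\n",
"\n",
"# json.dump writes valid JSON (double quotes, null), unlike str(references)\n",
"with open('COCONUT_reference_dict.json', 'w') as output:\n",
"    json.dump(references, output)"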
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": 71,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"{'title': 'Endogenous gibberellins in Lolium perenne\\n and influence of defoliation on their contents in elongating leaf bases and in leaf sheaths', 'abstract': None, 'DOI': '10.1034/j.1399-3054.2001.1110214.x', 'issue': '2', 'volume': '111', 'year': 2001, 'journal': 'Physiologia Plantarum', 'authors': ['Morvan-Bertrand, A.', 'Ernstsen, A.', 'Lindgård, B.', 'Koshioka, M.', 'Le Saos, J.', 'Boucaud, J.', \"Prud'homme, M.\", 'Junttila, O.'], 'first_author_surname': 'Morvan-Bertrand', 'pages': '225-231', 'first_page': '225', 'reference_retrieved_from': 'Crossref', 'query_str_type': 'Crossref_extended_query', 'query_str': \"{'authors': 'Morvan-Bertrand', 'first_author_surname': 'Morvan-Bertrand', 'journal': 'Physiol Plant', 'volume': '111', 'issue': None, 'year': '2001', 'pages': '225', 'first_page': '225', 'reference_retrieved_from': 'Crossref', 'query_str_type': 'unstructured_ID', 'query_str': 'Morvan-Bertrand,Physiol Plant,111,(2001),225'}\", 'PMID': None}\n",
"{'Morvan-Bertrand,Physiol Plant,111,(2001),225': {'reference': 'Morvan-Bertrand et al., Physiologia Plantarum, 2001, 111 (2), 225', 'DOI': '10.1034/j.1399-3054.2001.1110214.x', 'PMID': None}}\n",
"{'20512739': {'reference': 'Sheu et al., J Environ Sci Health B, 2010, 45 (5), 478', 'DOI': '10.1080/03601231003800347', 'PMID': '20512739'}}\n",
"{'title': 'Opaliferin, a New Polyketide from Cultures of Entomopathogenic Fungus Cordyceps sp. NBRC 106954', 'abstract': None, 'DOI': '10.1021/ol502216j', 'issue': '18', 'volume': '16', 'year': 2014, 'journal': 'Organic Letters', 'authors': ['Grudniewska, A.', 'Hayashi, S.', 'Shimizu, M.', 'Kato, M.', 'Suenaga, M.', 'Imagawa, H.', 'Ito, T.', 'Asakawa, Y.', 'Ban, S.', 'Kumada, T.', 'Hashimoto, T.', 'Umeyama, A.'], 'first_author_surname': 'Grudniewska', 'pages': '4695-4697', 'first_page': '4695', 'reference_retrieved_from': 'Crossref', 'query_str_type': 'DOI', 'query_str': '10.1021/ol502216j', 'PMID': None}\n",
"{'10.1021/ol502216j': {'reference': 'Grudniewska et al., Organic Letters, 2014, 16 (18), 4695', 'DOI': '10.1021/ol502216j', 'PMID': None}}\n"
]
}
],
"source": [
"importlib.reload(cn)\n",
"\n",
"for test_ref in ['Morvan-Bertrand,Physiol Plant,111,(2001),225',\n",
"                 '20512739',\n",
"                 '10.1021/ol502216j']:\n",
"    print(cn.get_final_dict_from_ref_str(test_ref))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "base",
"language": "python",
"name": "base"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.5"
}
},
"nbformat": 4,
"nbformat_minor": 4
}
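The notebook above confirms a keyword-based retrieval by parsing the original reference string and comparing it with the retrieved metadata. The following is a minimal sketch of that idea only, not the repository's implementation: `REF_PATTERN`, `parse_reference` and `looks_like_same_publication` are hypothetical names, the regular expression covers just the simple `Author,Journal,Volume,(Year),Page` layout, and comparing year, volume and first page is an assumed matching rule.

```python
import re

# Hypothetical pattern for strings such as 'Ito,Chem. Pharm. Bull.,37,(1989),819'.
# It is an assumption for this sketch and does not cover books, patents or typos.
REF_PATTERN = re.compile(
    r"^(?P<author>[^,]+),\s*(?P<journal>.+?),\s*(?P<volume>\d+),\s*"
    r"\((?P<year>\d{4})\),\s*(?P<first_page>\d+)(?:-\d+)?$"
)


def parse_reference(ref_str):
    """Return the parsed fields as a dict, or None if the pattern does not fit."""
    match = REF_PATTERN.match(ref_str.strip())
    return match.groupdict() if match else None


def looks_like_same_publication(parsed, retrieved):
    """Compare year, volume and first page of parsed and retrieved data.

    Values are cast to str because retrieved metadata may hold the year as
    an int or a missing field as False.
    """
    if not parsed or not retrieved:
        return False
    return all(
        str(parsed[key]) == str(retrieved.get(key))
        for key in ("year", "volume", "first_page")
    )


if __name__ == "__main__":
    parsed = parse_reference("Ito,Chem. Pharm. Bull.,37,(1989),819")
    retrieved = {"year": 1989, "volume": "37", "first_page": "819"}
    print(looks_like_same_publication(parsed, retrieved))
```

Strings that do not fit the pattern, such as `CBA:300080`, yield `None` and therefore never count as confirmed. A retrieved record without page information (as in the `Morvan-Bertrand` example, where `first_page` is `False`) is rejected as well, which is the case the extended second query is meant to recover.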
Issues and Pull Requests
Last synced: 11 months ago
All Time
- Total issues: 0
- Total pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Total issue authors: 0
- Total pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 0
- Pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Issue authors: 0
- Pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0