office-action-citations
Science Score: 44.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file (found)
- ✓ codemeta.json file (found)
- ○ .zenodo.json file
- ✓ DOI references (found 4 DOI reference(s) in README)
- ○ Academic publication links
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity (low similarity, 11.7%, to scientific vocabulary)
Repository
Basic Info
- Host: GitHub
- Owner: moturesearch
- Language: Jupyter Notebook
- Default Branch: main
- Size: 968 KB
Statistics
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
office-action-citations
This repository contains code for consolidating office-action citations using OpenAlex (or Crossref when a match wasn’t found with OpenAlex).
Approach
We outline our approach below, together with a pipeline diagram that summarises it.
We used OpenAlex to consolidate citations. If a citation contained a title, we searched using the title. If a citation did not contain a title (or no match was found using the title), we followed the steps below. We also followed these steps if the relevance score of a title-matched citation was below 600, a threshold we selected by validating a sample of 100 title-matched citations.
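The title search and thresholding step can be sketched as follows. This is an illustrative sketch, not our production code: it assumes the OpenAlex `/works?search=` endpoint and its `relevance_score` field, and the 600 threshold comes from the validation described above.

```python
import json
import urllib.parse
import urllib.request

RELEVANCE_THRESHOLD = 600  # chosen by validating a sample of 100 title-matched citations

def accept_match(results, threshold=RELEVANCE_THRESHOLD):
    """Return the top search result only if its relevance score clears
    the threshold; OpenAlex orders title-search results by relevance."""
    if results and results[0].get("relevance_score", 0) >= threshold:
        return results[0]
    return None

def match_by_title(title):
    """Search OpenAlex for a work by title and apply the threshold."""
    url = ("https://api.openalex.org/works?"
           + urllib.parse.urlencode({"search": title, "per-page": 1}))
    with urllib.request.urlopen(url, timeout=30) as resp:
        return accept_match(json.load(resp).get("results", []))
```

Citations whose best title match falls below the threshold drop through to the identifier-based steps below.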
- We filtered out citations that were missing information for first author, journal, or publication year. To ensure a low false positive match rate, we further limited our sample to citations that contained at least one of the following pieces of information: volume number, issue number, first page number, last page number.
- OpenAlex has a unique identifier for each author and journal. We searched for these unique identifiers, and only proceeded if a unique identifier was found for both the first author and the journal.
- We used these unique identifiers along with publication year and any extra information (namely, volume number, issue number, first page number, last page number) to search OpenAlex. Because volume, issue, and page numbers are not formatted consistently across citations, it is possible for Grobid to assign these values incorrectly (e.g., volume and issue number could be swapped). For this reason, we searched OpenAlex using all permutations of the values assigned to these four fields. Note that we permuted the values for all fields, even when a value for a given field was missing.
- Unlike title searches, these searches do not return a relevance score, so we cannot rank candidates when more than one permutation returns a match. We therefore selected the first result returned by OpenAlex; multiple matches are unlikely in any case, given the strict filtering criteria described above.
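The permutation step can be sketched as below. The filter names (`authorships.author.id`, `primary_location.source.id`, `publication_year`, `biblio.volume`, `biblio.issue`, `biblio.first_page`, `biblio.last_page`) are OpenAlex's documented filters, but treat the query construction itself as an assumption rather than our exact implementation.

```python
from itertools import permutations

# OpenAlex filter names for the four biblio fields
BIBLIO_FIELDS = ["biblio.volume", "biblio.issue",
                 "biblio.first_page", "biblio.last_page"]

def candidate_filters(author_id, source_id, year, values):
    """Yield one OpenAlex filter string per permutation of the four
    biblio values; missing (None) values are permuted like the rest
    but omitted from the resulting filter."""
    base = (f"authorships.author.id:{author_id}"
            f",primary_location.source.id:{source_id}"
            f",publication_year:{year}")
    for perm in permutations(values):
        extras = [f"{field}:{value}"
                  for field, value in zip(BIBLIO_FIELDS, perm)
                  if value is not None]
        yield ",".join([base] + extras)
```

Each yielded string would be sent as the `filter` parameter of an OpenAlex `/works` request; the first permutation that returns a result is taken as the match.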
If no match was found using OpenAlex, we fell back to Crossref (accessible via Grobid, which extracts and consolidates citations using Crossref). We sent only a reasonably small number of citations to Crossref; this matters because Crossref is unsuitable for processing large numbers of citations. If Crossref consolidated the citation (i.e., a DOI was found), we sent the DOI to OpenAlex to retrieve the citation's metadata.
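Retrieving the OpenAlex record for a Crossref-resolved DOI is a single lookup; a minimal sketch, using OpenAlex's documented DOI-based URL form:

```python
import json
import urllib.request

def openalex_work_url(doi):
    """Build the OpenAlex lookup URL for a DOI (accepts the bare DOI
    or the full https://doi.org/ form)."""
    doi = doi.removeprefix("https://doi.org/")
    return f"https://api.openalex.org/works/https://doi.org/{doi}"

def fetch_work_by_doi(doi):
    """Retrieve the OpenAlex metadata record for a consolidated DOI."""
    with urllib.request.urlopen(openalex_work_url(doi), timeout=30) as resp:
        return json.load(resp)
```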
The pipeline diagram is below. Note that "metadata" here means the citation has information for author, year, and journal, plus at least one of volume number, issue number, first page number, or last page number.
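That metadata requirement can be expressed as a small predicate; a sketch under assumed field names (ours, not OpenAlex's or Grobid's):

```python
def has_required_metadata(citation):
    """True when the citation names a first author, journal, and year,
    plus at least one of volume, issue, first page, or last page."""
    required = ("first_author", "journal", "year")
    any_of = ("volume", "issue", "first_page", "last_page")
    return (all(citation.get(f) is not None for f in required)
            and any(citation.get(f) is not None for f in any_of))
```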

Code
Below is a flowchart of our workflow, showing which code files we ran and in what order.

Data
Our data is available on figshare.
- Office-action citation data: https://doi.org/10.6084/m9.figshare.25874452.v1
- Classification data: https://doi.org/10.6084/m9.figshare.25874464.v1
Owner
- Name: Motu Economic and Public Policy Research
- Login: moturesearch
- Kind: organization
- Location: Wellington, New Zealand
- Website: https://motu.nz
- Repositories: 1
- Profile: https://github.com/moturesearch
Motu is New Zealand’s leading economic research institute.
Citation (citations_classification.ipynb)
{
"cells": [
{
"cell_type": "markdown",
"id": "80017f5f-1329-4bb7-ab2f-7f61ceebabde",
"metadata": {},
"source": [
"# This notebook provides the code to:\n",
"- evaluate GPT-3's and GPT-4's accuracy in classifying the office-action citations.\n",
"- use GPT-4 to classify a sample of 5000 citations\n",
"- train an LLM to classify the full set of citations \n",
"- deploy the model on the full set of citations \n",
"- sample the set of citations to manually label and classify them.\n",
"\n",
"If you have any questions about this notebook, please feel free to contact me by email: scharfmann.emma@gmail.com"
]
},
{
"cell_type": "markdown",
"id": "a7681cce-4a61-4b3a-88cc-f74a2f0aedc4",
"metadata": {},
"source": [
"## Load packages "
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "03819842-487e-4ec0-ae7b-dec30262d71b",
"metadata": {},
"outputs": [],
"source": [
"import glob \n",
"import pandas as pd\n",
"import random \n",
"import openai\n",
"import time\n",
"from tqdm import tqdm\n",
"from collections import Counter\n",
"from sklearn.metrics import multilabel_confusion_matrix\n",
"from multiprocessing import Pool\n",
"from functools import partial\n",
"\n",
"import warnings\n",
"warnings.filterwarnings(\"ignore\")\n",
"\n",
"\n",
"path_base = \"/home/fs01/spec1142/Emma/test/\"\n",
"\n",
"f = open(path_base + \"openai_key.txt\", \"r\")\n",
"openai.api_key = f.read()\n"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "fea846d0-0a46-4f84-93c8-591c879d5cb0",
"metadata": {},
"outputs": [],
"source": [
"# Below is the code for calculating the accuracy, TPR, and FPR.\n",
"def accuracy(cm):\n",
" accuracy = (cm.ravel()[0]+cm.ravel()[3])/sum(cm.ravel())\n",
" return accuracy\n",
"\n",
"def TPR(cm):\n",
" TPR = cm[1][1]/(cm[1][1]+cm[1][0])\n",
" return TPR\n",
"\n",
"def FPR(cm):\n",
" FPR = cm[0][1]/(cm[0][1]+cm[0][0])\n",
" return FPR"
]
},
{
"cell_type": "markdown",
"id": "6645e18e-8350-49a0-9897-8566b4ce08ac",
"metadata": {
"jp-MarkdownHeadingCollapsed": true
},
"source": [
"## Sample oa citations"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "cc64398d-58c5-4f16-b6af-4489832580cd",
"metadata": {},
"outputs": [],
"source": [
"## load oa citations and store the citations into a dictionary \n",
"\n",
"files= glob.glob(path_base + 'oa_data_v1/*')\n",
"\n",
"dic_result = {}\n",
"count = 0\n",
"\n",
"for k in range(len(files)):\n",
" file = files[k]  ## reuse the list globbed above\n",
" \n",
" with open(file) as lines:\n",
" for line_ in lines: \n",
" \n",
" dic_result[count] = line_.replace('\\n','')\n",
" count += 1\n",
"\n",
"## count number of citations\n",
"print(len(dic_result))"
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "c4fe0510-11e6-4d8b-a19c-9ab9dc4c39a2",
"metadata": {},
"outputs": [],
"source": [
"## store data into a dataframe\n",
"\n",
"table_oa_citations = pd.DataFrame()\n",
"table_oa_citations['citation'] = dic_result.values()\n"
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "0391b601-c786-4ca8-8038-d87ca459df3c",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>citation</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>666891</th>\n",
" <td>Gao Et Al Us Publication No 2018/0293445</td>\n",
" </tr>\n",
" <tr>\n",
" <th>726424</th>\n",
" <td>Gb-2291949-B</td>\n",
" </tr>\n",
" <tr>\n",
" <th>619051</th>\n",
" <td>Jp-2012103941-A</td>\n",
" </tr>\n",
" <tr>\n",
" <th>636592</th>\n",
" <td>English Translation Of Kr 10-1328742</td>\n",
" </tr>\n",
" <tr>\n",
" <th>199107</th>\n",
" <td>Ep-1919136-A1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>609844</th>\n",
" <td>Jp-08244048-A</td>\n",
" </tr>\n",
" <tr>\n",
" <th>151055</th>\n",
" <td>Walter De 102007050797 A1 – Translation Used ...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>797624</th>\n",
" <td>Oxford Dictionary, Https://En.Oxforddictionar...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>536083</th>\n",
" <td>Wo-2017200295-A1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>33670</th>\n",
" <td>Systemic Antibiotics Recommendations From The...</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>100 rows × 1 columns</p>\n",
"</div>"
],
"text/plain": [
" citation\n",
"666891 Gao Et Al Us Publication No 2018/0293445\n",
"726424 Gb-2291949-B\n",
"619051 Jp-2012103941-A\n",
"636592 English Translation Of Kr 10-1328742\n",
"199107 Ep-1919136-A1\n",
"... ...\n",
"609844 Jp-08244048-A\n",
"151055 Walter De 102007050797 A1 – Translation Used ...\n",
"797624 Oxford Dictionary, Https://En.Oxforddictionar...\n",
"536083 Wo-2017200295-A1\n",
"33670 Systemic Antibiotics Recommendations From The...\n",
"\n",
"[100 rows x 1 columns]"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"## sample citations (100 citations sample)\n",
"\n",
"table_oa_citations.sample(n=100)"
]
},
{
"cell_type": "markdown",
"id": "838d4c83-0491-4e88-ad7e-206fa63df32e",
"metadata": {
"jp-MarkdownHeadingCollapsed": true
},
"source": [
"## Evaluate GPT-3's accuracy"
]
},
{
"cell_type": "code",
"execution_count": 19,
"id": "dc600de7-568a-46d7-bcf4-2bdb1b27574e",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>manual check</th>\n",
" <th>GPT4 check</th>\n",
" <th>GPT3.5 check</th>\n",
" <th>Bib subcategory</th>\n",
" <th>npl_biblio</th>\n",
" <th>md5</th>\n",
" <th>language_is_reliable</th>\n",
" <th>language_code</th>\n",
" <th>npl_cat</th>\n",
" <th>npl_cat_score</th>\n",
" <th>npl_cat_language_flag</th>\n",
" <th>patcit_id</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>y</td>\n",
" <td>y</td>\n",
" <td>y</td>\n",
" <td>NaN</td>\n",
" <td>Watanabe Et Al Us Patent Application Publicat...</td>\n",
" <td>2ca504f11c3b378ce7be4619e2ee843f</td>\n",
" <td>True</td>\n",
" <td>en</td>\n",
" <td>PATENT</td>\n",
" <td>0.51</td>\n",
" <td>False</td>\n",
" <td>2ca504f11c3b378ce7be4619e2ee843f</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>y</td>\n",
" <td>y</td>\n",
" <td>y</td>\n",
" <td>JOURNAL ARTICLE</td>\n",
" <td>Dzulkafli Et Al., \"Effects Of Talc On Fire Re...</td>\n",
" <td>0685ae955c71d728f69046458ac1db0f</td>\n",
" <td>True</td>\n",
" <td>en</td>\n",
" <td>BIBLIOGRAPHICAL_REFERENCE</td>\n",
" <td>0.98</td>\n",
" <td>False</td>\n",
" <td>0685ae955c71d728f69046458ac1db0f</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>y</td>\n",
" <td>y</td>\n",
" <td>y</td>\n",
" <td>NaN</td>\n",
" <td>Zdepski Pub No Us 2017-0201784\\n</td>\n",
" <td>c6de38a1aa0a879105ced194459f343e</td>\n",
" <td>True</td>\n",
" <td>en</td>\n",
" <td>PATENT</td>\n",
" <td>0.33</td>\n",
" <td>False</td>\n",
" <td>c6de38a1aa0a879105ced194459f343e</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>y</td>\n",
" <td>y</td>\n",
" <td>y</td>\n",
" <td>JOURNAL ARTICLE</td>\n",
" <td>Nobori Et Al. (Cancer Research, 1997, 51:3193...</td>\n",
" <td>29e27156420faa44ae01a9e1a6363781</td>\n",
" <td>True</td>\n",
" <td>en</td>\n",
" <td>BIBLIOGRAPHICAL_REFERENCE</td>\n",
" <td>0.60</td>\n",
" <td>False</td>\n",
" <td>29e27156420faa44ae01a9e1a6363781</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9</th>\n",
" <td>y</td>\n",
" <td>y</td>\n",
" <td>y</td>\n",
" <td>JOURNAL ARTICLE</td>\n",
" <td>Fach Et Al, Neonatal Ovine Pulmonary Dendriti...</td>\n",
" <td>d2bea0db1c51b5ff13072c66202ba3fe</td>\n",
" <td>True</td>\n",
" <td>en</td>\n",
" <td>BIBLIOGRAPHICAL_REFERENCE</td>\n",
" <td>0.98</td>\n",
" <td>False</td>\n",
" <td>d2bea0db1c51b5ff13072c66202ba3fe</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" manual check GPT4 check GPT3.5 check Bib subcategory \\\n",
"1 y y y NaN \n",
"4 y y y JOURNAL ARTICLE \n",
"5 y y y NaN \n",
"7 y y y JOURNAL ARTICLE \n",
"9 y y y JOURNAL ARTICLE \n",
"\n",
" npl_biblio \\\n",
"1 Watanabe Et Al Us Patent Application Publicat... \n",
"4 Dzulkafli Et Al., \"Effects Of Talc On Fire Re... \n",
"5 Zdepski Pub No Us 2017-0201784\\n \n",
"7 Nobori Et Al. (Cancer Research, 1997, 51:3193... \n",
"9 Fach Et Al, Neonatal Ovine Pulmonary Dendriti... \n",
"\n",
" md5 language_is_reliable language_code \\\n",
"1 2ca504f11c3b378ce7be4619e2ee843f True en \n",
"4 0685ae955c71d728f69046458ac1db0f True en \n",
"5 c6de38a1aa0a879105ced194459f343e True en \n",
"7 29e27156420faa44ae01a9e1a6363781 True en \n",
"9 d2bea0db1c51b5ff13072c66202ba3fe True en \n",
"\n",
" npl_cat npl_cat_score npl_cat_language_flag \\\n",
"1 PATENT 0.51 False \n",
"4 BIBLIOGRAPHICAL_REFERENCE 0.98 False \n",
"5 PATENT 0.33 False \n",
"7 BIBLIOGRAPHICAL_REFERENCE 0.60 False \n",
"9 BIBLIOGRAPHICAL_REFERENCE 0.98 False \n",
"\n",
" patcit_id \n",
"1 2ca504f11c3b378ce7be4619e2ee843f \n",
"4 0685ae955c71d728f69046458ac1db0f \n",
"5 c6de38a1aa0a879105ced194459f343e \n",
"7 29e27156420faa44ae01a9e1a6363781 \n",
"9 d2bea0db1c51b5ff13072c66202ba3fe "
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"## load sample of ~300 citations manually classified by Kyle \n",
"\n",
"data1 = pd.read_excel(path_base + 'test_files/Copy of oa_300_sample_checked_kyle.xlsx')\n",
"data1 = data1[data1['manual check'] == 'y']\n",
"data1.head()"
]
},
{
"cell_type": "code",
"execution_count": 21,
"id": "587b6c31-0241-48d5-b6fa-20e53c3909d1",
"metadata": {},
"outputs": [],
"source": [
"## clean the labels \n",
"\n",
"data1['category'] = [ elem[0] if pd.isna(elem[0]) == False else elem[1] for elem in data1[['Bib subcategory','npl_cat']].to_numpy()]\n",
"data1 = data1[ ( data1['category'] != 'BIBLIOGRAPHICAL_REFERENCE' ) & ( data1['category'] != '?' ) ]\n",
"data1['category'] = data1['category'].replace('JOURNAL ARTICLE', 'JOURNAL_ARTICLE')\n",
"data1['category'] = data1['category'].replace('CONFERENCE PROCEEDINGS', 'CONFERENCE_PROCEEDINGS')\n",
"data1['category'] = data1['category'].replace('PREPRINT/WORKING PAPER/TECHNICAL REPORT', 'PREPRINT/WORKING_PAPER/TECHNICAL_REPORT')\n"
]
},
{
"cell_type": "code",
"execution_count": 27,
"id": "e5a233de-e66a-4e6c-bd52-db0d93404dbf",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"100%|███████████████████████████████████████████| 20/20 [00:28<00:00, 1.44s/it]\n"
]
}
],
"source": [
"number = 10\n",
"\n",
"## prompt for GPT 3\n",
"prompt = \"\"\"I am going to give you \"\"\" + str(number) + \"\"\" cited documents that have been made in office actions by the US patent office. I want you to classify each cited document as being one of the following:\n",
"WEBPAGE: Website\n",
"PATENT: A patent or patent application\n",
"PREPRINT/WORKING_PAPER/TECHNICAL_REPORT: Any public, non-peer reviewed technical document. These can be published on preprint servers, institute/personal websites, or even governmental archives.\n",
"JOURNAL_ARTICLE: A peer reviewed article published in a journal.\n",
"CONFERENCE_PROCEEDINGS: An article published as part of conference proceedings. The peer review process for such proceedings varies significantly, and differs from journal article in that it is a one-off publication.\n",
"BOOK: A book or chapter in a book. Book chapters are a common outlet for academic research, but are often not peer reviewed by independent parties, and are usually less accessible than the average journal article.\n",
"THESIS: A thesis, usually archived by the degree-granting institution.\n",
"NORM_STANDARD: An industrial norm or standard\n",
"PRODUCT_DOCUMENTATION: documentation for a product, such as a user manual or catalogue\n",
"OFFICE_ACTION: A different office action sent by the patent office\n",
"WIKI: A wikipedia page (a subset of webpage)\n",
"DATABASE: A database, such as a genetic or corporate database\n",
"LITIGATION: A court case or formal opposition proceeding within the patent office\n",
"SEARCH_REPORT: A search report issued by a patent office\n",
"Only list the classes and the first word of the cited document. \n",
"Be VERY careful not to forget cited documents!\n",
"\"\"\"\n",
"\n",
"true_labels = []\n",
"results = []\n",
"\n",
"## ask GPT to classify the chunks of citations. Note that GPT tends to forget some citations. \n",
"for k in tqdm(range(20)):\n",
"\n",
" citations = data1[['npl_biblio','category']].to_numpy()[number*k:number*(k+1)]\n",
" texts = \"; \".join([ str(k+1) + \": \" +citations[:,0][k] for k in range(len(citations[:,0])) ] )\n",
" \n",
" completion = openai.ChatCompletion.create(\n",
" model='gpt-3.5-turbo-0125',\n",
" messages=[{\"role\": \"system\", \"content\": prompt},\n",
" {\"role\": \"user\", \"content\": texts}],\n",
" temperature= 0.1)\n",
" \n",
" true_labels += list(citations[:,1])\n",
" res = completion['choices'][0]['message']['content'].split('\\n')\n",
" \n",
" results += res\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 38,
"id": "3890f407-5cbc-417a-899f-68b5b108f717",
"metadata": {},
"outputs": [],
"source": [
"## clean GPT's classification\n",
"\n",
"labels = set(data1['category'])\n",
"\n",
"predicted_labels = []\n",
"for elem in results:\n",
"    if elem == '':\n",
"        continue\n",
"    cleaned = elem.replace('\\r', '').replace(':', '').replace('TECHNICAL_REPORT/WORKING_PAPER', 'PREPRINT/WORKING_PAPER/TECHNICAL_REPORT')\n",
"    matched = set(cleaned.split()) & labels\n",
"    predicted_labels.append(list(matched)[0] if matched else 'OTHER')\n"
]
},
{
"cell_type": "code",
"execution_count": 39,
"id": "7f8299ff-c739-427f-b401-0ebc47362eca",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>class</th>\n",
" <th>number of elements</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>PATENT</td>\n",
" <td>43</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>JOURNAL_ARTICLE</td>\n",
" <td>111</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>CONFERENCE_PROCEEDINGS</td>\n",
" <td>14</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>BOOK</td>\n",
" <td>4</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>PRODUCT_DOCUMENTATION</td>\n",
" <td>13</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>WEBPAGE</td>\n",
" <td>6</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>PREPRINT/WORKING_PAPER/TECHNICAL_REPORT</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>OFFICE_ACTION</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8</th>\n",
" <td>THESIS</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9</th>\n",
" <td>DATABASE</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>10</th>\n",
" <td>NORM_STANDARD</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>11</th>\n",
" <td>SEARCH_REPORT</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>12</th>\n",
" <td>WIKI</td>\n",
" <td>2</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" class number of elements\n",
"0 PATENT 43\n",
"1 JOURNAL_ARTICLE 111\n",
"2 CONFERENCE_PROCEEDINGS 14\n",
"3 BOOK 4\n",
"4 PRODUCT_DOCUMENTATION 13\n",
"5 WEBPAGE 6\n",
"6 PREPRINT/WORKING_PAPER/TECHNICAL_REPORT 1\n",
"7 OFFICE_ACTION 1\n",
"8 THESIS 1\n",
"9 DATABASE 2\n",
"10 NORM_STANDARD 1\n",
"11 SEARCH_REPORT 1\n",
"12 WIKI 2"
]
},
"execution_count": 39,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"## count citations in each class\n",
"\n",
"df_counter = pd.DataFrame()\n",
"df_counter['class'] = Counter(true_labels).keys()\n",
"df_counter['number of elements'] = Counter(true_labels).values()\n",
"df_counter\n"
]
},
{
"cell_type": "code",
"execution_count": 40,
"id": "8d685efe-03e2-40e1-9bab-29c90b3c2b7b",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Overall accuracy: 92.5\n"
]
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>class</th>\n",
" <th>accuracy</th>\n",
" <th>TPR</th>\n",
" <th>FPR</th>\n",
" <th>number of elements</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>JOURNAL_ARTICLE</td>\n",
" <td>0.980</td>\n",
" <td>0.972973</td>\n",
" <td>0.011236</td>\n",
" <td>111</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>WEBPAGE</td>\n",
" <td>0.965</td>\n",
" <td>1.000000</td>\n",
" <td>0.036082</td>\n",
" <td>6</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>BOOK</td>\n",
" <td>0.990</td>\n",
" <td>1.000000</td>\n",
" <td>0.010204</td>\n",
" <td>4</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>SEARCH_REPORT</td>\n",
" <td>0.995</td>\n",
" <td>1.000000</td>\n",
" <td>0.005025</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>OFFICE_ACTION</td>\n",
" <td>0.995</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>CONFERENCE_PROCEEDINGS</td>\n",
" <td>0.990</td>\n",
" <td>0.928571</td>\n",
" <td>0.005376</td>\n",
" <td>14</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>PREPRINT/WORKING_PAPER/TECHNICAL_REPORT</td>\n",
" <td>0.990</td>\n",
" <td>0.000000</td>\n",
" <td>0.005025</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>PATENT</td>\n",
" <td>0.980</td>\n",
" <td>0.953488</td>\n",
" <td>0.012739</td>\n",
" <td>43</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8</th>\n",
" <td>WIKI</td>\n",
" <td>0.995</td>\n",
" <td>0.500000</td>\n",
" <td>0.000000</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9</th>\n",
" <td>PRODUCT_DOCUMENTATION</td>\n",
" <td>0.970</td>\n",
" <td>0.538462</td>\n",
" <td>0.000000</td>\n",
" <td>13</td>\n",
" </tr>\n",
" <tr>\n",
" <th>10</th>\n",
" <td>NORM_STANDARD</td>\n",
" <td>0.995</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>11</th>\n",
" <td>THESIS</td>\n",
" <td>1.000</td>\n",
" <td>1.000000</td>\n",
" <td>0.000000</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>12</th>\n",
" <td>DATABASE</td>\n",
" <td>1.000</td>\n",
" <td>1.000000</td>\n",
" <td>0.000000</td>\n",
" <td>2</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" class accuracy TPR FPR \\\n",
"0 JOURNAL_ARTICLE 0.980 0.972973 0.011236 \n",
"1 WEBPAGE 0.965 1.000000 0.036082 \n",
"2 BOOK 0.990 1.000000 0.010204 \n",
"3 SEARCH_REPORT 0.995 1.000000 0.005025 \n",
"4 OFFICE_ACTION 0.995 0.000000 0.000000 \n",
"5 CONFERENCE_PROCEEDINGS 0.990 0.928571 0.005376 \n",
"6 PREPRINT/WORKING_PAPER/TECHNICAL_REPORT 0.990 0.000000 0.005025 \n",
"7 PATENT 0.980 0.953488 0.012739 \n",
"8 WIKI 0.995 0.500000 0.000000 \n",
"9 PRODUCT_DOCUMENTATION 0.970 0.538462 0.000000 \n",
"10 NORM_STANDARD 0.995 0.000000 0.000000 \n",
"11 THESIS 1.000 1.000000 0.000000 \n",
"12 DATABASE 1.000 1.000000 0.000000 \n",
"\n",
" number of elements \n",
"0 111 \n",
"1 6 \n",
"2 4 \n",
"3 1 \n",
"4 1 \n",
"5 14 \n",
"6 1 \n",
"7 43 \n",
"8 2 \n",
"9 13 \n",
"10 1 \n",
"11 1 \n",
"12 2 "
]
},
"execution_count": 40,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"## evaluate GPT's accuracy \n",
"\n",
"labels = list(set(list(data1['category'])))\n",
"conf_matrix = multilabel_confusion_matrix(true_labels , predicted_labels, labels=labels)\n",
"\n",
"\n",
"df_metrics = pd.DataFrame()\n",
"\n",
"misclassification_rate = 100*sum( [ conf_matrix[k][0][1] for k in range(len(conf_matrix)) ]) / len(predicted_labels)\n",
"print('Overall accuracy: ', 100 - misclassification_rate)\n",
"\n",
"\n",
"acc = []\n",
"tpr = []\n",
"fpr = []\n",
"for elem in conf_matrix:\n",
" acc.append(accuracy(elem))\n",
" tpr.append(TPR(elem))\n",
" fpr.append(FPR(elem))\n",
"\n",
"\n",
"df_metrics['class'] = labels\n",
"df_metrics['accuracy'] = acc\n",
"df_metrics['TPR'] = tpr\n",
"df_metrics['FPR'] = fpr\n",
"\n",
"df_metrics.merge(df_counter, on='class')\n",
" "
]
},
{
"cell_type": "markdown",
"id": "9dedb7bb-0c7a-46bc-bd20-e3173a58cfc5",
"metadata": {
"jp-MarkdownHeadingCollapsed": true
},
"source": [
"## Evaluate GPT-4's accuracy"
]
},
{
"cell_type": "code",
"execution_count": 43,
"id": "d8bb40d5-1eab-4b37-9334-0ce751ef6b59",
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>manual check</th>\n",
" <th>GPT4 check</th>\n",
" <th>GPT3.5 check</th>\n",
" <th>Bib subcategory</th>\n",
" <th>npl_biblio</th>\n",
" <th>md5</th>\n",
" <th>language_is_reliable</th>\n",
" <th>language_code</th>\n",
" <th>npl_cat</th>\n",
" <th>npl_cat_score</th>\n",
" <th>npl_cat_language_flag</th>\n",
" <th>patcit_id</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>y</td>\n",
" <td>y</td>\n",
" <td>y</td>\n",
" <td>NaN</td>\n",
" <td>Watanabe Et Al Us Patent Application Publicat...</td>\n",
" <td>2ca504f11c3b378ce7be4619e2ee843f</td>\n",
" <td>True</td>\n",
" <td>en</td>\n",
" <td>PATENT</td>\n",
" <td>0.51</td>\n",
" <td>False</td>\n",
" <td>2ca504f11c3b378ce7be4619e2ee843f</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>y</td>\n",
" <td>y</td>\n",
" <td>y</td>\n",
" <td>JOURNAL ARTICLE</td>\n",
" <td>Dzulkafli Et Al., \"Effects Of Talc On Fire Re...</td>\n",
" <td>0685ae955c71d728f69046458ac1db0f</td>\n",
" <td>True</td>\n",
" <td>en</td>\n",
" <td>BIBLIOGRAPHICAL_REFERENCE</td>\n",
" <td>0.98</td>\n",
" <td>False</td>\n",
" <td>0685ae955c71d728f69046458ac1db0f</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>y</td>\n",
" <td>y</td>\n",
" <td>y</td>\n",
" <td>NaN</td>\n",
" <td>Zdepski Pub No Us 2017-0201784\\n</td>\n",
" <td>c6de38a1aa0a879105ced194459f343e</td>\n",
" <td>True</td>\n",
" <td>en</td>\n",
" <td>PATENT</td>\n",
" <td>0.33</td>\n",
" <td>False</td>\n",
" <td>c6de38a1aa0a879105ced194459f343e</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>y</td>\n",
" <td>y</td>\n",
" <td>y</td>\n",
" <td>JOURNAL ARTICLE</td>\n",
" <td>Nobori Et Al. (Cancer Research, 1997, 51:3193...</td>\n",
" <td>29e27156420faa44ae01a9e1a6363781</td>\n",
" <td>True</td>\n",
" <td>en</td>\n",
" <td>BIBLIOGRAPHICAL_REFERENCE</td>\n",
" <td>0.60</td>\n",
" <td>False</td>\n",
" <td>29e27156420faa44ae01a9e1a6363781</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9</th>\n",
" <td>y</td>\n",
" <td>y</td>\n",
" <td>y</td>\n",
" <td>JOURNAL ARTICLE</td>\n",
" <td>Fach Et Al, Neonatal Ovine Pulmonary Dendriti...</td>\n",
" <td>d2bea0db1c51b5ff13072c66202ba3fe</td>\n",
" <td>True</td>\n",
" <td>en</td>\n",
" <td>BIBLIOGRAPHICAL_REFERENCE</td>\n",
" <td>0.98</td>\n",
" <td>False</td>\n",
" <td>d2bea0db1c51b5ff13072c66202ba3fe</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>294</th>\n",
" <td>y</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>Machine Translation Of Jp-2007224953 (Year: 2...</td>\n",
" <td>3e8f555055a762f69f1349d5df9887be</td>\n",
" <td>True</td>\n",
" <td>en</td>\n",
" <td>PATENT</td>\n",
" <td>0.34</td>\n",
" <td>False</td>\n",
" <td>3e8f555055a762f69f1349d5df9887be</td>\n",
" </tr>\n",
" <tr>\n",
" <th>295</th>\n",
" <td>y</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>Ibm, Translucent Drag Icons (Tdb Acc. No. Nn9...</td>\n",
" <td>02bf7bfcd8b0c278b2f170a4863b1963</td>\n",
" <td>True</td>\n",
" <td>en</td>\n",
" <td>BIBLIOGRAPHICAL_REFERENCE</td>\n",
" <td>0.65</td>\n",
" <td>False</td>\n",
" <td>02bf7bfcd8b0c278b2f170a4863b1963</td>\n",
" </tr>\n",
" <tr>\n",
" <th>296</th>\n",
" <td>y</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>JOURNAL ARTICLE</td>\n",
" <td>Mastaloudis, A., Et Al., “Antioxidant Supplem...</td>\n",
" <td>e994b3f8571d466a279ba05b87eacf87</td>\n",
" <td>True</td>\n",
" <td>en</td>\n",
" <td>BIBLIOGRAPHICAL_REFERENCE</td>\n",
" <td>0.97</td>\n",
" <td>False</td>\n",
" <td>e994b3f8571d466a279ba05b87eacf87</td>\n",
" </tr>\n",
" <tr>\n",
" <th>297</th>\n",
" <td>y</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>CONFERENCE PROCEEDINGS</td>\n",
" <td>Chang Et Al., Motion Registration And Correct...</td>\n",
" <td>6e88f4dabe30e52db23a420f8203a433</td>\n",
" <td>True</td>\n",
" <td>en</td>\n",
" <td>BIBLIOGRAPHICAL_REFERENCE</td>\n",
" <td>0.97</td>\n",
" <td>False</td>\n",
" <td>6e88f4dabe30e52db23a420f8203a433</td>\n",
" </tr>\n",
" <tr>\n",
" <th>299</th>\n",
" <td>y</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>JOURNAL ARTICLE</td>\n",
" <td>Bevan Br. J. Pharmacol. (1992), Vol. 107, Pp....</td>\n",
" <td>891695ca1e05a56afc5987ea1ab7feb0</td>\n",
" <td>True</td>\n",
" <td>en</td>\n",
" <td>BIBLIOGRAPHICAL_REFERENCE</td>\n",
" <td>0.96</td>\n",
" <td>False</td>\n",
" <td>891695ca1e05a56afc5987ea1ab7feb0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>228 rows × 12 columns</p>\n",
"</div>"
],
"text/plain": [
" manual check GPT4 check GPT3.5 check Bib subcategory \\\n",
"1 y y y NaN \n",
"4 y y y JOURNAL ARTICLE \n",
"5 y y y NaN \n",
"7 y y y JOURNAL ARTICLE \n",
"9 y y y JOURNAL ARTICLE \n",
".. ... ... ... ... \n",
"294 y NaN NaN NaN \n",
"295 y NaN NaN NaN \n",
"296 y NaN NaN JOURNAL ARTICLE \n",
"297 y NaN NaN CONFERENCE PROCEEDINGS \n",
"299 y NaN NaN JOURNAL ARTICLE \n",
"\n",
" npl_biblio \\\n",
"1 Watanabe Et Al Us Patent Application Publicat... \n",
"4 Dzulkafli Et Al., \"Effects Of Talc On Fire Re... \n",
"5 Zdepski Pub No Us 2017-0201784\\n \n",
"7 Nobori Et Al. (Cancer Research, 1997, 51:3193... \n",
"9 Fach Et Al, Neonatal Ovine Pulmonary Dendriti... \n",
".. ... \n",
"294 Machine Translation Of Jp-2007224953 (Year: 2... \n",
"295 Ibm, Translucent Drag Icons (Tdb Acc. No. Nn9... \n",
"296 Mastaloudis, A., Et Al., “Antioxidant Supplem... \n",
"297 Chang Et Al., Motion Registration And Correct... \n",
"299 Bevan Br. J. Pharmacol. (1992), Vol. 107, Pp.... \n",
"\n",
" md5 language_is_reliable language_code \\\n",
"1 2ca504f11c3b378ce7be4619e2ee843f True en \n",
"4 0685ae955c71d728f69046458ac1db0f True en \n",
"5 c6de38a1aa0a879105ced194459f343e True en \n",
"7 29e27156420faa44ae01a9e1a6363781 True en \n",
"9 d2bea0db1c51b5ff13072c66202ba3fe True en \n",
".. ... ... ... \n",
"294 3e8f555055a762f69f1349d5df9887be True en \n",
"295 02bf7bfcd8b0c278b2f170a4863b1963 True en \n",
"296 e994b3f8571d466a279ba05b87eacf87 True en \n",
"297 6e88f4dabe30e52db23a420f8203a433 True en \n",
"299 891695ca1e05a56afc5987ea1ab7feb0 True en \n",
"\n",
" npl_cat npl_cat_score npl_cat_language_flag \\\n",
"1 PATENT 0.51 False \n",
"4 BIBLIOGRAPHICAL_REFERENCE 0.98 False \n",
"5 PATENT 0.33 False \n",
"7 BIBLIOGRAPHICAL_REFERENCE 0.60 False \n",
"9 BIBLIOGRAPHICAL_REFERENCE 0.98 False \n",
".. ... ... ... \n",
"294 PATENT 0.34 False \n",
"295 BIBLIOGRAPHICAL_REFERENCE 0.65 False \n",
"296 BIBLIOGRAPHICAL_REFERENCE 0.97 False \n",
"297 BIBLIOGRAPHICAL_REFERENCE 0.97 False \n",
"299 BIBLIOGRAPHICAL_REFERENCE 0.96 False \n",
"\n",
" patcit_id \n",
"1 2ca504f11c3b378ce7be4619e2ee843f \n",
"4 0685ae955c71d728f69046458ac1db0f \n",
"5 c6de38a1aa0a879105ced194459f343e \n",
"7 29e27156420faa44ae01a9e1a6363781 \n",
"9 d2bea0db1c51b5ff13072c66202ba3fe \n",
".. ... \n",
"294 3e8f555055a762f69f1349d5df9887be \n",
"295 02bf7bfcd8b0c278b2f170a4863b1963 \n",
"296 e994b3f8571d466a279ba05b87eacf87 \n",
"297 6e88f4dabe30e52db23a420f8203a433 \n",
"299 891695ca1e05a56afc5987ea1ab7feb0 \n",
"\n",
"[228 rows x 12 columns]"
]
},
"execution_count": 43,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"## load sample of ~300 citations manually classified by Kyle \n",
"\n",
"data1 = pd.read_excel('/home/fs01/spec1142/Emma/test/test_files/Copy of oa_300_sample_checked_kyle.xlsx')\n",
"data1 = data1[data1['manual check'] == 'y']\n",
"data1"
]
},
{
"cell_type": "code",
"execution_count": 46,
"id": "610e08e9-ba78-45f9-9b77-7d124c9f6a77",
"metadata": {},
"outputs": [],
"source": [
"## clean the labels \n",
"\n",
    "data1['category'] = data1['Bib subcategory'].fillna(data1['npl_cat'])\n",
"data1 = data1[ ( data1['category'] != 'BIBLIOGRAPHICAL_REFERENCE' ) & ( data1['category'] != '?' ) ]\n",
"data1['category'] = data1['category'].replace('JOURNAL ARTICLE', 'JOURNAL_ARTICLE')\n",
"data1['category'] = data1['category'].replace('CONFERENCE PROCEEDINGS', 'CONFERENCE_PROCEEDINGS')\n",
"data1['category'] = data1['category'].replace('PREPRINT/WORKING PAPER/TECHNICAL REPORT', 'PREPRINT/WORKING_PAPER/TECHNICAL_REPORT')\n"
]
  },
},
{
"cell_type": "code",
"execution_count": 47,
"id": "79c939e4-6d33-4bbf-8753-4a88a6f7bed0",
"metadata": {
"scrolled": true
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"100%|███████████████████████████████████████████| 20/20 [00:58<00:00, 2.94s/it]\n"
]
}
],
"source": [
"import openai\n",
"import time\n",
"from tqdm import tqdm\n",
"\n",
"number = 10\n",
"\n",
"\n",
    "## classification prompt\n",
    "prompt = \"\"\"I am going to give you \"\"\" + str(number) + \"\"\" cited documents that have been made in office actions by the US patent office. I want you to classify each cited document as being one of the following:\n",
"WEBPAGE: Website\n",
"PATENT: A patent or patent application\n",
"PREPRINT/WORKING_PAPER/TECHNICAL_REPORT: Any public, non-peer reviewed technical document. These can be published on preprint servers, institute/personal websites, or even governmental archives.\n",
"JOURNAL_ARTICLE: A peer reviewed article published in a journal.\n",
    "CONFERENCE_PROCEEDINGS: An article published as part of conference proceedings. The peer review process for such proceedings varies significantly, and differs from a journal article in that it is a one-off publication.\n",
"BOOK: A book or chapter in a book. Book chapters are a common outlet for academic research, but are often not peer reviewed by independent parties, and are usually less accessible than the average journal article.\n",
    "THESIS: A thesis, usually archived by the degree-granting institution.\n",
"NORM_STANDARD: An industrial norm or standard\n",
"PRODUCT_DOCUMENTATION: documentation for a product, such as a user manual or catalogue\n",
"OFFICE_ACTION: A different office action sent by the patent office\n",
"WIKI: A wikipedia page (a subset of webpage)\n",
"DATABASE: A database, such as a genetic or corporate database\n",
"LITIGATION: A court case or formal opposition proceeding within the patent office\n",
"SEARCH_REPORT: A search report issued by a patent office\n",
"Only list the classes and the first word of the cited document. \n",
    "Be VERY careful not to forget cited documents!\n",
"\"\"\"\n",
"\n",
"true_labels = []\n",
"results = []\n",
"\n",
"## ask GPT to classify the chunks of citations. Note that GPT tends to forget some citations. \n",
"for k in tqdm(range(20)):\n",
"\n",
    "    citations = data1[['npl_biblio','category']].to_numpy()[number*k:number*(k+1)]\n",
    "    texts = \"; \".join([ str(j+1) + \": \" + citations[:,0][j] for j in range(len(citations)) ] )\n",
" \n",
    "    completion = openai.ChatCompletion.create(\n",
    "        model='gpt-4-0125-preview',\n",
    "        messages=[{\"role\": \"system\", \"content\": prompt},\n",
    "                  {\"role\": \"user\", \"content\": texts}],\n",
    "        temperature=0.1)\n",
" \n",
" true_labels += list(citations[:,1])\n",
" res = completion['choices'][0]['message']['content'].split('\\n')\n",
" \n",
" results += res\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 48,
"id": "8f82ef1b-39f2-4632-aaa0-d3259b171e9d",
"metadata": {},
"outputs": [],
"source": [
"## clean GPT's classification\n",
"\n",
    "labels = set(data1['category'])\n",
    "\n",
    "def extract_label(line):\n",
    "    # normalise the raw GPT output line, then keep a recognised label if one is present\n",
    "    tokens = set(line.replace('\\r', '').replace(':', '').replace('TECHNICAL_REPORT/WORKING_PAPER', 'PREPRINT/WORKING_PAPER/TECHNICAL_REPORT').split())\n",
    "    matched = tokens & labels\n",
    "    return matched.pop() if matched else 'OTHER'\n",
    "\n",
    "predicted_labels = [extract_label(elem) for elem in results if elem != '']\n"
]
},
{
"cell_type": "code",
"execution_count": 49,
"id": "8fb4f549-715a-49d8-908b-52c1261a1470",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>class</th>\n",
" <th>number of elements</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>PATENT</td>\n",
" <td>43</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>JOURNAL_ARTICLE</td>\n",
" <td>111</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>CONFERENCE_PROCEEDINGS</td>\n",
" <td>14</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>BOOK</td>\n",
" <td>4</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>PRODUCT_DOCUMENTATION</td>\n",
" <td>13</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>WEBPAGE</td>\n",
" <td>6</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>PREPRINT/WORKING_PAPER/TECHNICAL_REPORT</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>OFFICE_ACTION</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8</th>\n",
" <td>THESIS</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9</th>\n",
" <td>DATABASE</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>10</th>\n",
" <td>NORM_STANDARD</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>11</th>\n",
" <td>SEARCH_REPORT</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>12</th>\n",
" <td>WIKI</td>\n",
" <td>2</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" class number of elements\n",
"0 PATENT 43\n",
"1 JOURNAL_ARTICLE 111\n",
"2 CONFERENCE_PROCEEDINGS 14\n",
"3 BOOK 4\n",
"4 PRODUCT_DOCUMENTATION 13\n",
"5 WEBPAGE 6\n",
"6 PREPRINT/WORKING_PAPER/TECHNICAL_REPORT 1\n",
"7 OFFICE_ACTION 1\n",
"8 THESIS 1\n",
"9 DATABASE 2\n",
"10 NORM_STANDARD 1\n",
"11 SEARCH_REPORT 1\n",
"12 WIKI 2"
]
},
"execution_count": 49,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"## count citations in each class\n",
"\n",
"df_counter = pd.DataFrame()\n",
"df_counter['class'] = Counter(true_labels).keys()\n",
"df_counter['number of elements'] = Counter(true_labels).values()\n",
"df_counter"
]
},
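  {
   "cell_type": "markdown",
   "id": "metrics-helpers-note",
   "metadata": {},
   "source": [
    "The evaluation cell below relies on helper functions `accuracy`, `TPR`, and `FPR` defined earlier in the notebook. As a hedged sketch (a plausible reconstruction, not necessarily the original implementation), each can be computed from a 2×2 confusion matrix in the `[[TN, FP], [FN, TP]]` layout returned by `multilabel_confusion_matrix`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "metrics-helpers-sketch",
   "metadata": {},
   "outputs": [],
   "source": [
    "## hedged sketch of the metric helpers used below (assumed to be defined earlier)\n",
    "\n",
    "def accuracy(cm):\n",
    "    # fraction of correct predictions for one class (binary view)\n",
    "    (tn, fp), (fn, tp) = cm\n",
    "    return (tp + tn) / (tp + tn + fp + fn)\n",
    "\n",
    "def TPR(cm):\n",
    "    # true positive rate (recall); 0.0 when the class has no positives\n",
    "    (tn, fp), (fn, tp) = cm\n",
    "    return tp / (tp + fn) if (tp + fn) > 0 else 0.0\n",
    "\n",
    "def FPR(cm):\n",
    "    # false positive rate; 0.0 when the class has no negatives\n",
    "    (tn, fp), (fn, tp) = cm\n",
    "    return fp / (fp + tn) if (fp + tn) > 0 else 0.0"
   ]
  },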
{
"cell_type": "code",
"execution_count": 50,
"id": "20999d6b-0191-4d58-add1-7a1222d338a4",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Overall accuracy: 93.5\n"
]
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>class</th>\n",
" <th>accuracy</th>\n",
" <th>TPR</th>\n",
" <th>FPR</th>\n",
" <th>number of elements</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>JOURNAL_ARTICLE</td>\n",
" <td>0.985</td>\n",
" <td>0.972973</td>\n",
" <td>0.000000</td>\n",
" <td>111</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>WEBPAGE</td>\n",
" <td>0.975</td>\n",
" <td>1.000000</td>\n",
" <td>0.025773</td>\n",
" <td>6</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>BOOK</td>\n",
" <td>0.990</td>\n",
" <td>1.000000</td>\n",
" <td>0.010204</td>\n",
" <td>4</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>SEARCH_REPORT</td>\n",
" <td>1.000</td>\n",
" <td>1.000000</td>\n",
" <td>0.000000</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>OFFICE_ACTION</td>\n",
" <td>0.995</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>CONFERENCE_PROCEEDINGS</td>\n",
" <td>0.985</td>\n",
" <td>0.857143</td>\n",
" <td>0.005376</td>\n",
" <td>14</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>PREPRINT/WORKING_PAPER/TECHNICAL_REPORT</td>\n",
" <td>0.980</td>\n",
" <td>1.000000</td>\n",
" <td>0.020101</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>PATENT</td>\n",
" <td>0.995</td>\n",
" <td>1.000000</td>\n",
" <td>0.006369</td>\n",
" <td>43</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8</th>\n",
" <td>WIKI</td>\n",
" <td>1.000</td>\n",
" <td>1.000000</td>\n",
" <td>0.000000</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9</th>\n",
" <td>PRODUCT_DOCUMENTATION</td>\n",
" <td>0.970</td>\n",
" <td>0.538462</td>\n",
" <td>0.000000</td>\n",
" <td>13</td>\n",
" </tr>\n",
" <tr>\n",
" <th>10</th>\n",
" <td>NORM_STANDARD</td>\n",
" <td>0.995</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>11</th>\n",
" <td>THESIS</td>\n",
" <td>1.000</td>\n",
" <td>1.000000</td>\n",
" <td>0.000000</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>12</th>\n",
" <td>DATABASE</td>\n",
" <td>1.000</td>\n",
" <td>1.000000</td>\n",
" <td>0.000000</td>\n",
" <td>2</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" class accuracy TPR FPR \\\n",
"0 JOURNAL_ARTICLE 0.985 0.972973 0.000000 \n",
"1 WEBPAGE 0.975 1.000000 0.025773 \n",
"2 BOOK 0.990 1.000000 0.010204 \n",
"3 SEARCH_REPORT 1.000 1.000000 0.000000 \n",
"4 OFFICE_ACTION 0.995 0.000000 0.000000 \n",
"5 CONFERENCE_PROCEEDINGS 0.985 0.857143 0.005376 \n",
"6 PREPRINT/WORKING_PAPER/TECHNICAL_REPORT 0.980 1.000000 0.020101 \n",
"7 PATENT 0.995 1.000000 0.006369 \n",
"8 WIKI 1.000 1.000000 0.000000 \n",
"9 PRODUCT_DOCUMENTATION 0.970 0.538462 0.000000 \n",
"10 NORM_STANDARD 0.995 0.000000 0.000000 \n",
"11 THESIS 1.000 1.000000 0.000000 \n",
"12 DATABASE 1.000 1.000000 0.000000 \n",
"\n",
" number of elements \n",
"0 111 \n",
"1 6 \n",
"2 4 \n",
"3 1 \n",
"4 1 \n",
"5 14 \n",
"6 1 \n",
"7 43 \n",
"8 2 \n",
"9 13 \n",
"10 1 \n",
"11 1 \n",
"12 2 "
]
},
"execution_count": 50,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"## evaluate GPT's accuracy \n",
"\n",
"labels = list(set(list(data1['category'])))\n",
"conf_matrix = multilabel_confusion_matrix(true_labels , predicted_labels, labels=labels)\n",
"\n",
"df_metrics = pd.DataFrame()\n",
"\n",
"\n",
    "misclassification_rate = 100*sum( [ conf_matrix[k][0][1] for k in range(len(conf_matrix)) ]) / len(predicted_labels)\n",
    "print('Overall accuracy: ', 100 - misclassification_rate)\n",
"\n",
"\n",
"acc = []\n",
"tpr = []\n",
"fpr = []\n",
"for elem in conf_matrix:\n",
" acc.append(accuracy(elem))\n",
" tpr.append(TPR(elem))\n",
" fpr.append(FPR(elem))\n",
"\n",
"\n",
"df_metrics['class'] = labels\n",
"df_metrics['accuracy'] = acc\n",
"df_metrics['TPR'] = tpr\n",
"df_metrics['FPR'] = fpr\n",
"\n",
"df_metrics.merge(df_counter, on='class')\n",
" \n",
" "
]
},
{
"cell_type": "markdown",
"id": "04682790-430e-4b6c-af80-40a9572e450e",
"metadata": {},
"source": [
    "## Use GPT-4 to classify a sample of 5,000 citations"
]
},
{
"cell_type": "code",
"execution_count": 51,
"id": "4f24a6b9-8d66-42e1-b2ba-719a1a271908",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"5000\n"
]
}
],
"source": [
"dic_result = {}\n",
"count = 0\n",
"path = '/home/fs01/spec1142/Emma/test/for_grobid_all_v0/'\n",
"file = path + 'oa_crosswalk_without_sp_levi1_bq_citations0.txt'\n",
"\n",
"with open(file) as lines:\n",
" for line_ in lines: \n",
" \n",
" dic_result[count] = line_.replace('\\n','')\n",
" count += 1\n",
"\n",
"print(len(dic_result))"
]
},
{
"cell_type": "code",
"execution_count": 52,
"id": "b48d1f79-cbed-40ab-8125-17735a84cfb4",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>oa_citation</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Myositis Association Retrieved From On-Line We...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>Hirsh Et Al. Weekly Nab-Paclitaxel In Combina...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>Vaidyanathan Et Al. Bioconjugate Chem. 1990, ...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>Wiegert ` 259</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>Trakadis, Y.J. \"Patient-Controlled Encrypted ...</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" oa_citation\n",
"0 Myositis Association Retrieved From On-Line We...\n",
"1 Hirsh Et Al. Weekly Nab-Paclitaxel In Combina...\n",
"2 Vaidyanathan Et Al. Bioconjugate Chem. 1990, ...\n",
"3 Wiegert ` 259\n",
"4 Trakadis, Y.J. \"Patient-Controlled Encrypted ..."
]
},
"execution_count": 52,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df5000 = pd.DataFrame()\n",
"df5000['oa_citation'] = dic_result.values()\n",
"df5000.head()"
]
},
{
"cell_type": "code",
"execution_count": 60,
"id": "e3d6bbaf-e110-4429-afb7-d9e11475c5e2",
"metadata": {},
"outputs": [],
"source": [
"## function to classify a chunk of 50 citations using GPT-4\n",
"\n",
"def classify_50_oa_citations(citations, openai_api_key):\n",
"\n",
" \"\"\"\n",
" This function uses the OpenAI API to classify a series of 50 citations made in office actions by the US patent office.\n",
"\n",
" Parameters:\n",
" citations (list): A list of citations to be classified.\n",
    "    openai_api_key (str): The API key for the OpenAI API.\n",
"\n",
" Note:\n",
" - The function constructs a query string that includes the citations to be classified and a set of instructions for the OpenAI API.\n",
" - The function returns a list of classification results, with each result including the class of the cited document and the number of the cited document.\n",
" \"\"\"\n",
" \n",
    "    openai.api_key = openai_api_key  # ensure this worker process uses the supplied key\n",
    "    number = 50\n",
" results = []\n",
" \n",
" query = \"\"\"I am going to give you a series of \"\"\" + str(number) + \"\"\" citations that have been made in office actions by the US patent office. I want you to classify each cited document as being one of the following:\n",
" WEBPAGE: Website\n",
" PATENT: A patent or patent application\n",
" PREPRINT/WORKING_PAPER/TECHNICAL_REPORT: Any public, non-peer reviewed technical document. These can be published on preprint servers, institute/personal websites, or even governmental archives.\n",
" JOURNAL_ARTICLE: A peer reviewed article published in a journal.\n",
    "    CONFERENCE_PROCEEDINGS: An article published as part of conference proceedings. The peer review process for such proceedings varies significantly, and differs from a journal article in that it is a one-off publication.\n",
" BOOK: A book or chapter in a book. Book chapters are a common outlet for academic research, but are often not peer reviewed by independent parties, and are usually less accessible than the average journal article.\n",
    "    THESIS: A thesis, usually archived by the degree-granting institution.\n",
" NORM_STANDARD: An industrial norm or standard\n",
" PRODUCT_DOCUMENTATION: documentation for a product, such as a user manual or catalogue\n",
" OFFICE_ACTION: A different office action sent by the patent office\n",
" WIKI: A wikipedia page (a subset of webpage)\n",
" DATABASE: A database, such as a genetic or corporate database\n",
" LITIGATION: A court case or formal opposition proceeding within the patent office\n",
" SEARCH_REPORT: A search report issued by a patent office\n",
" Only list the classes of the cited document and the number of the cited document.\"\"\"\n",
" \n",
" \n",
" texts = \"; \".join([ str(k) + ': ' + citations[k] for k in range(len(citations)) ] )\n",
" \n",
" completion = openai.ChatCompletion.create(\n",
" model=\"gpt-4-0125-preview\", \n",
" messages=[{\"role\": \"system\", \"content\": query},\n",
" {\"role\": \"user\", \"content\": texts}],\n",
" temperature= 0.2)\n",
" \n",
" results += completion['choices'][0]['message']['content'].split('\\n')\n",
" \n",
"\n",
" return results "
]
},
{
"cell_type": "code",
"execution_count": 61,
"id": "d517a584-2ec5-4817-93fb-f8918f9824fc",
"metadata": {},
"outputs": [],
"source": [
    "## classify one 200-citation block of df5000 (four batches of 50) using GPT-4\n",
"\n",
"def multi_gpt(openai_api_key,i):\n",
"\n",
" \"\"\"\n",
" This function uses the OpenAI API to classify a series of citations in a given dataframe, in batches of 50.\n",
"\n",
" Parameters:\n",
    "    openai_api_key (str): The API key for the OpenAI API.\n",
    "    i (int): The index of the 200-row block of df5000 to classify.\n",
"\n",
" Note:\n",
" - The function selects a subset of the dataframe `df5000` based on the given index `i`.\n",
" - The function then divides the subset into smaller batches of 50 citations and uses the `classify_50_oa_citations` function to classify each batch.\n",
" - The function returns a list of classification results for the entire subset of the dataframe.\n",
" \"\"\"\n",
" \n",
" result = []\n",
" medium_df = df5000[200*i:200*(i+1)]\n",
" for k in range(4):\n",
" small_df = medium_df[50*k:50*(k+1)]\n",
" citations = list(small_df['oa_citation'])\n",
" res = classify_50_oa_citations(citations,openai_api_key)\n",
"\n",
    "        if len(res) != 50:\n",
    "            # retry once when GPT returns the wrong number of labels;\n",
    "            # keep whatever the retry returns so the batch is not silently dropped\n",
    "            res = classify_50_oa_citations(citations,openai_api_key)\n",
    "        result += res\n",
    "    return result"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "60ea422b-c18e-4eef-affd-76e08eff500b",
"metadata": {},
"outputs": [],
"source": [
    "## classify the citations (12 parallel worker processes)\n",
"\n",
"openai_api_key = openai.api_key\n",
"\n",
"p = Pool(processes=12)\n",
"func = partial(multi_gpt,openai_api_key)\n",
"results = p.map(func, [ i for i in range(12)])\n",
"p.close()"
]
},
{
"cell_type": "code",
"execution_count": 102,
"id": "160f18ab-87a4-4e3a-b043-ed1e9ddc7db0",
"metadata": {
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"1450\n",
"2349\n",
"4748\n"
]
}
],
"source": [
"## clean the results. Note that GPT-4 tends to forget some citations. \n",
"\n",
    "labels = []\n",
    "clean_list = []\n",
    "count = 0\n",
    "\n",
    "for elem in results:\n",
    "    k = 0\n",
    "    for line in elem:\n",
    "        count += 1\n",
    "        if line.split(':')[0] != str(k):\n",
    "            # GPT skipped a citation number: insert a placeholder for the missing index\n",
    "            labels.append('None')\n",
    "            k += 1\n",
    "            if k == 50:\n",
    "                k = 0\n",
    "        # record the label for the current line ('None' if the line is malformed)\n",
    "        parts = line.split(': ')\n",
    "        labels.append(parts[1] if len(parts) > 1 else 'None')\n",
    "        k += 1\n",
    "        if k == 50:\n",
    "            k = 0"
]
},
{
"cell_type": "code",
"execution_count": 119,
"id": "2d3a6c15-c963-48fe-8bbf-b75f35f7d94d",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>oa_citation</th>\n",
" <th>labels</th>\n",
" <th>flag</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Myositis Association Retrieved From On-Line We...</td>\n",
" <td>WEBPAGE</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>Hirsh Et Al. Weekly Nab-Paclitaxel In Combina...</td>\n",
" <td>JOURNAL_ARTICLE</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>Vaidyanathan Et Al. Bioconjugate Chem. 1990, ...</td>\n",
" <td>JOURNAL_ARTICLE</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>Wiegert ` 259</td>\n",
" <td>PATENT</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>Trakadis, Y.J. \"Patient-Controlled Encrypted ...</td>\n",
" <td>JOURNAL_ARTICLE</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4995</th>\n",
" <td>Us 0057553 A</td>\n",
" <td>PATENT</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4996</th>\n",
" <td>De-102017004043-A1</td>\n",
" <td>PATENT</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4997</th>\n",
" <td>Ting Et Al. (Cn 105151567) Machine Translatio...</td>\n",
" <td>PATENT</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4998</th>\n",
" <td>Ca2829631</td>\n",
" <td>PATENT</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4999</th>\n",
" <td>Wo-2017079461-A2</td>\n",
" <td>PATENT</td>\n",
" <td>0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>5000 rows × 3 columns</p>\n",
"</div>"
],
"text/plain": [
" oa_citation labels flag\n",
"0 Myositis Association Retrieved From On-Line We... WEBPAGE 0\n",
"1 Hirsh Et Al. Weekly Nab-Paclitaxel In Combina... JOURNAL_ARTICLE 0\n",
"2 Vaidyanathan Et Al. Bioconjugate Chem. 1990, ... JOURNAL_ARTICLE 0\n",
"3 Wiegert ` 259 PATENT 0\n",
"4 Trakadis, Y.J. \"Patient-Controlled Encrypted ... JOURNAL_ARTICLE 0\n",
"... ... ... ...\n",
"4995 Us 0057553 A PATENT 0\n",
"4996 De-102017004043-A1 PATENT 0\n",
"4997 Ting Et Al. (Cn 105151567) Machine Translatio... PATENT 0\n",
"4998 Ca2829631 PATENT 0\n",
"4999 Wo-2017079461-A2 PATENT 0\n",
"\n",
"[5000 rows x 3 columns]"
]
},
"execution_count": 119,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
    "## attach the GPT-4 labels and flag the batches with potential alignment errors\n",
    "\n",
    "df5000['labels'] = labels\n",
    "df5000['flag'] = [ 1 if (1400 <= index < 1450) or (2300 <= index < 2350) or (4700 <= index < 4750) else 0 for index in df5000.index]\n",
"df5000"
]
},
{
"cell_type": "code",
"execution_count": 121,
"id": "8a0fabc3-3887-4eba-ba3d-bd2955fc4e46",
"metadata": {},
"outputs": [],
"source": [
    "## save the sample of 5,000 classified citations\n",
"\n",
"df5000.to_csv('/home/fs01/spec1142/Emma/test/' + 'gpt4_5000sample.csv', index = False)"
]
},
{
"cell_type": "markdown",
"id": "18ddb37b-1d95-487c-970d-feaca841d433",
"metadata": {
"jp-MarkdownHeadingCollapsed": true
},
"source": [
"## Train our own model"
]
},
{
"cell_type": "markdown",
"id": "1e75ebd3-54e6-4516-aece-443dc4e90c0b",
"metadata": {
"jp-MarkdownHeadingCollapsed": true
},
"source": [
"### Load data"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "2559990b-f03e-4382-a577-64dae9da6cf5",
"metadata": {},
"outputs": [],
"source": [
"## load manually classified citations and clean the labels \n",
"\n",
"files = glob.glob('/home/fs01/spec1142/Emma/test/test_files/oa/*')\n",
"data = pd.concat( [ pd.read_excel(elem) for elem in files])\n",
"data = data[ ( data['manual check'] == 'y' ) | ( data['manual_check'] == 'y')]\n",
"\n",
    "data['category'] = data['npl_cat']\n",
"\n",
"data = data[ ( data['category'] != 'BIBLIOGRAPHICAL_REFERENCE' ) & ( data['category'] != '?' ) ]\n",
"data['category'] = data['category'].replace('JOURNAL ARTICLE', 'JOURNAL_ARTICLE')\n",
"data['category'] = data['category'].replace('CONFERENCE PROCEEDINGS', 'CONFERENCE_PROCEEDINGS')\n",
"data['category'] = data['category'].replace('PREPRINT/WORKING PAPER/TECHNICAL REPORT', 'PREPRINT/WORKING_PAPER/TECHNICAL_REPORT')\n"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "8a5bb789-c973-4cfa-b5ea-5ae2f63e3482",
"metadata": {},
"outputs": [],
"source": [
"## load citations classified by GPT-4 and clean the labels \n",
"\n",
"gpt_data = pd.read_csv('/home/fs01/spec1142/Emma/test/test_files/gpt4_5000sample.csv')\n",
"gpt_data = gpt_data[gpt_data['flag'] == 0]\n",
"gpt_data = gpt_data.rename(columns = { 'oa_citation':'npl_biblio' , 'labels' : 'category' })\n",
"gpt_data = gpt_data[['npl_biblio','category']]\n"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "4c584e21-706c-42bd-b734-fa440571a409",
"metadata": {},
"outputs": [],
"source": [
"## merge the two files\n",
"\n",
"data = data[['npl_biblio','category']]\n",
"data = pd.concat([data,gpt_data])\n",
"data = data[(data['category'] != 'None Cited')&(data['category'] != 'GOVERNMENT_REPORT') ]"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "4ce5185a-d8f2-43c3-93cf-e5fa041de256",
"metadata": {},
"outputs": [],
"source": [
    "## downsample the PATENT, JOURNAL_ARTICLE, and WEBPAGE classes to obtain a more balanced dataset\n",
"\n",
"data_patents = data[data['category'] == 'PATENT'].sample(frac=0.12)\n",
"data_articles = data[data['category'] == 'JOURNAL_ARTICLE'].sample(frac=0.4)\n",
"data_webpage = data[data['category'] == 'WEBPAGE'].sample(frac=0.3)\n",
"\n",
"data_no_patents = data[(data['category'] != 'PATENT') & (data['category'] != 'JOURNAL_ARTICLE') & (data['category'] != 'WEBPAGE')]\n",
"data = pd.concat([data_patents,data_articles,data_webpage,data_no_patents]).sample(frac=1)"
]
},
{
"cell_type": "code",
"execution_count": 21,
"id": "def53362-505f-4345-b9ef-eeaa683d2a0b",
"metadata": {},
"outputs": [],
"source": [
"## keep classes with more than 20 datapoints\n",
"\n",
"df = data.groupby('category').count()\n",
"data = data[data['category'].isin(list(df[df['npl_biblio'] > 20].index))]\n",
"data['labels'] = pd.factorize(data['category'], sort=True)[0]\n",
"data = data.sample(frac=1)\n",
"dic_labels = { elem[1] : elem[0] for elem in data[['category','labels']].drop_duplicates().to_numpy() } \n"
]
},
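  {
   "cell_type": "markdown",
   "id": "factorize-note",
   "metadata": {},
   "source": [
    "`pd.factorize(..., sort=True)` assigns integer codes in sorted label order, and `dic_labels` inverts that mapping so integer model predictions can be translated back into category names. A toy illustration (with made-up categories):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "factorize-sketch",
   "metadata": {},
   "outputs": [],
   "source": [
    "## toy illustration of the factorize/dic_labels round-trip above\n",
    "\n",
    "toy = pd.DataFrame({'category': ['PATENT', 'BOOK', 'PATENT', 'WIKI']})\n",
    "toy['labels'] = pd.factorize(toy['category'], sort=True)[0]\n",
    "dic_toy = { elem[1] : elem[0] for elem in toy[['category','labels']].drop_duplicates().to_numpy() }\n",
    "dic_toy  # maps 0 -> 'BOOK', 1 -> 'PATENT', 2 -> 'WIKI'"
   ]
  },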
{
"cell_type": "code",
"execution_count": 22,
"id": "f39daf5f-efcd-4577-9763-61317fc9fcc8",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"1908\n"
]
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>npl_biblio</th>\n",
" <th>labels</th>\n",
" </tr>\n",
" <tr>\n",
" <th>category</th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>BOOK</th>\n",
" <td>86</td>\n",
" <td>86</td>\n",
" </tr>\n",
" <tr>\n",
" <th>CONFERENCE_PROCEEDINGS</th>\n",
" <td>229</td>\n",
" <td>229</td>\n",
" </tr>\n",
" <tr>\n",
" <th>DATABASE</th>\n",
" <td>141</td>\n",
" <td>141</td>\n",
" </tr>\n",
" <tr>\n",
" <th>JOURNAL_ARTICLE</th>\n",
" <td>399</td>\n",
" <td>399</td>\n",
" </tr>\n",
" <tr>\n",
" <th>LITIGATION</th>\n",
" <td>34</td>\n",
" <td>34</td>\n",
" </tr>\n",
" <tr>\n",
" <th>NORM_STANDARD</th>\n",
" <td>51</td>\n",
" <td>51</td>\n",
" </tr>\n",
" <tr>\n",
" <th>OFFICE_ACTION</th>\n",
" <td>103</td>\n",
" <td>103</td>\n",
" </tr>\n",
" <tr>\n",
" <th>PATENT</th>\n",
" <td>347</td>\n",
" <td>347</td>\n",
" </tr>\n",
" <tr>\n",
" <th>PREPRINT/WORKING_PAPER/TECHNICAL_REPORT</th>\n",
" <td>101</td>\n",
" <td>101</td>\n",
" </tr>\n",
" <tr>\n",
" <th>PRODUCT_DOCUMENTATION</th>\n",
" <td>103</td>\n",
" <td>103</td>\n",
" </tr>\n",
" <tr>\n",
" <th>SEARCH_REPORT</th>\n",
" <td>94</td>\n",
" <td>94</td>\n",
" </tr>\n",
" <tr>\n",
" <th>THESIS</th>\n",
" <td>41</td>\n",
" <td>41</td>\n",
" </tr>\n",
" <tr>\n",
" <th>WEBPAGE</th>\n",
" <td>104</td>\n",
" <td>104</td>\n",
" </tr>\n",
" <tr>\n",
" <th>WIKI</th>\n",
" <td>75</td>\n",
" <td>75</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" npl_biblio labels\n",
"category \n",
"BOOK 86 86\n",
"CONFERENCE_PROCEEDINGS 229 229\n",
"DATABASE 141 141\n",
"JOURNAL_ARTICLE 399 399\n",
"LITIGATION 34 34\n",
"NORM_STANDARD 51 51\n",
"OFFICE_ACTION 103 103\n",
"PATENT 347 347\n",
"PREPRINT/WORKING_PAPER/TECHNICAL_REPORT 101 101\n",
"PRODUCT_DOCUMENTATION 103 103\n",
"SEARCH_REPORT 94 94\n",
"THESIS 41 41\n",
"WEBPAGE 104 104\n",
"WIKI 75 75"
]
},
"execution_count": 22,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
    "## show the number of citations in each class\n",
"\n",
"print(len(data['category']))\n",
"data.groupby('category').count()"
]
},
{
"cell_type": "markdown",
"id": "2ebb5e95-c21b-4cfe-8db2-bee25a91e008",
"metadata": {
"jp-MarkdownHeadingCollapsed": true
},
"source": [
"### Train model"
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "ffb924a0-0fdc-4baf-96ef-0fe3788bfcf3",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/home/fs01/spec1142/anaconda3/envs/patents2/lib/python3.9/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n",
" from .autonotebook import tqdm as notebook_tqdm\n",
"Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-multilingual-cased and are newly initialized: ['classifier.bias', 'classifier.weight']\n",
"You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.\n",
"/home/fs01/spec1142/anaconda3/envs/patents2/lib/python3.9/site-packages/transformers/optimization.py:411: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning\n",
" warnings.warn(\n",
"Epoch 1: 100%|██████████████████████████████████| 24/24 [04:02<00:00, 10.08s/it]\n",
"Validation - Epoch 1: 100%|███████████████████████| 6/6 [00:08<00:00, 1.40s/it]\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Epoch 1, Loss: 53.3975, Validation Accuracy: 0.5131\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"Epoch 2: 100%|██████████████████████████████████| 24/24 [02:19<00:00, 5.83s/it]\n",
"Validation - Epoch 2: 100%|███████████████████████| 6/6 [00:09<00:00, 1.57s/it]\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Epoch 2, Loss: 38.5001, Validation Accuracy: 0.6047\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"Epoch 3: 100%|██████████████████████████████████| 24/24 [02:28<00:00, 6.19s/it]\n",
"Validation - Epoch 3: 100%|███████████████████████| 6/6 [00:08<00:00, 1.38s/it]\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Epoch 3, Loss: 30.2190, Validation Accuracy: 0.6990\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"Epoch 4: 100%|██████████████████████████████████| 24/24 [02:33<00:00, 6.38s/it]\n",
"Validation - Epoch 4: 100%|███████████████████████| 6/6 [00:08<00:00, 1.36s/it]\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Epoch 4, Loss: 23.6015, Validation Accuracy: 0.7592\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"Epoch 5: 100%|██████████████████████████████████| 24/24 [02:19<00:00, 5.83s/it]\n",
"Validation - Epoch 5: 100%|███████████████████████| 6/6 [00:09<00:00, 1.61s/it]\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Epoch 5, Loss: 18.6547, Validation Accuracy: 0.7775\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"Epoch 6: 100%|██████████████████████████████████| 24/24 [02:37<00:00, 6.58s/it]\n",
"Validation - Epoch 6: 100%|███████████████████████| 6/6 [00:08<00:00, 1.38s/it]\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Epoch 6, Loss: 15.1288, Validation Accuracy: 0.7592\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"Epoch 7: 100%|██████████████████████████████████| 24/24 [02:23<00:00, 5.97s/it]\n",
"Validation - Epoch 7: 100%|███████████████████████| 6/6 [00:08<00:00, 1.48s/it]\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Epoch 7, Loss: 12.4134, Validation Accuracy: 0.7775\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"Epoch 8: 100%|██████████████████████████████████| 24/24 [02:22<00:00, 5.93s/it]\n",
"Validation - Epoch 8: 100%|███████████████████████| 6/6 [00:08<00:00, 1.45s/it]\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Epoch 8, Loss: 10.0054, Validation Accuracy: 0.7984\n"
]
},
{
"data": {
"text/plain": [
"('fine_tuned_model/tokenizer_config.json',\n",
" 'fine_tuned_model/special_tokens_map.json',\n",
" 'fine_tuned_model/vocab.txt',\n",
" 'fine_tuned_model/added_tokens.json')"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import torch\n",
"from transformers import BertTokenizer, BertForSequenceClassification, AdamW\n",
"from torch.utils.data import DataLoader, Dataset\n",
"from sklearn.model_selection import train_test_split\n",
"from sklearn.preprocessing import LabelEncoder\n",
"from sklearn.metrics import accuracy_score\n",
"from tqdm import tqdm\n",
"\n",
"# Dummy data (replace this with your dataset)\n",
"\n",
"texts = list(data['npl_biblio'])\n",
"labels = list(data['labels'])\n",
"\n",
"# Encoding labels\n",
"label_encoder = LabelEncoder()\n",
"encoded_labels = label_encoder.fit_transform(labels)\n",
"\n",
"# Splitting the data\n",
"train_texts, val_texts, train_labels, val_labels = train_test_split(\n",
" texts, encoded_labels, test_size=0.2, random_state=42\n",
")\n",
"\n",
"# Custom Dataset class\n",
"class CustomDataset(Dataset):\n",
" def __init__(self, texts, labels, tokenizer, max_length=128):\n",
" self.texts = texts\n",
" self.labels = labels\n",
" self.tokenizer = tokenizer\n",
" self.max_length = max_length\n",
"\n",
" def __len__(self):\n",
" return len(self.texts)\n",
"\n",
" def __getitem__(self, idx):\n",
" text = str(self.texts[idx])\n",
" label = self.labels[idx]\n",
"\n",
" encoding = self.tokenizer(\n",
" text,\n",
" truncation=True,\n",
" padding='max_length',\n",
" max_length=self.max_length,\n",
" return_tensors='pt',\n",
" )\n",
"\n",
" return {\n",
" 'input_ids': encoding['input_ids'].flatten(),\n",
" 'attention_mask': encoding['attention_mask'].flatten(),\n",
" 'labels': torch.tensor(label, dtype=torch.long)\n",
" }\n",
"\n",
"# Load BERT model and tokenizer\n",
"model_name = 'bert-base-multilingual-cased'\n",
"tokenizer = BertTokenizer.from_pretrained(model_name)\n",
"model = BertForSequenceClassification.from_pretrained(model_name, num_labels=len(set(encoded_labels)))\n",
"\n",
"# Create Dataset instances\n",
"train_dataset = CustomDataset(train_texts, train_labels, tokenizer)\n",
"val_dataset = CustomDataset(val_texts, val_labels, tokenizer)\n",
"\n",
"# DataLoader instances\n",
"train_dataloader = DataLoader(train_dataset, batch_size=64, shuffle=True)\n",
"val_dataloader = DataLoader(val_dataset, batch_size=64, shuffle=False)\n",
"\n",
"# Set device\n",
"device = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")\n",
"model.to(device)\n",
"\n",
"# Optimizer and loss function\n",
"optimizer = AdamW(model.parameters(), lr=2e-5)\n",
"loss_fn = torch.nn.CrossEntropyLoss()\n",
"\n",
"# Training loop\n",
"num_epochs = 8\n",
"for epoch in range(num_epochs):\n",
" model.train()\n",
" total_loss = 0\n",
" for batch in tqdm(train_dataloader, desc=f\"Epoch {epoch + 1}\"):\n",
" optimizer.zero_grad()\n",
" input_ids = batch['input_ids'].to(device)\n",
" attention_mask = batch['attention_mask'].to(device)\n",
" labels = batch['labels'].to(device)\n",
" outputs = model(input_ids, attention_mask=attention_mask, labels=labels)\n",
" loss = outputs.loss\n",
" total_loss += loss.item()\n",
" loss.backward()\n",
" optimizer.step()\n",
"\n",
" # Validation\n",
" model.eval()\n",
" val_predictions = []\n",
" val_true_labels = []\n",
" with torch.no_grad():\n",
" for batch in tqdm(val_dataloader, desc=f\"Validation - Epoch {epoch + 1}\"):\n",
" input_ids = batch['input_ids'].to(device)\n",
" attention_mask = batch['attention_mask'].to(device)\n",
" labels = batch['labels'].to(device)\n",
" outputs = model(input_ids, attention_mask=attention_mask)\n",
" logits = outputs.logits\n",
" predictions = torch.argmax(logits, dim=1)\n",
" val_predictions.extend(predictions.cpu().numpy())\n",
" val_true_labels.extend(labels.cpu().numpy())\n",
"\n",
" # Calculate accuracy\n",
" accuracy = accuracy_score(val_true_labels, val_predictions)\n",
" print(f\"Epoch {epoch + 1}, Loss: {total_loss:.4f}, Validation Accuracy: {accuracy:.4f}\")\n",
"\n",
"# Save the fine-tuned model\n",
"model.save_pretrained(\"fine_tuned_model\")\n",
"tokenizer.save_pretrained(\"fine_tuned_model\")\n"
]
},
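    {
     "cell_type": "code",
     "execution_count": null,
     "id": "3f7a2b1c-9d4e-4c5a-8b6f-1a2b3c4d5e6f",
     "metadata": {},
     "outputs": [],
     "source": [
      "## sketch: LabelEncoder assigns integer ids alphabetically, so persisting the\n",
      "## id->name mapping (from label_encoder.classes_) would avoid hand-coding the\n",
      "## dic_labels dictionary later. Standalone illustration with made-up labels:\n",
      "\n",
      "from sklearn.preprocessing import LabelEncoder\n",
      "\n",
      "le = LabelEncoder()\n",
      "ids = le.fit_transform(['PATENT', 'BOOK', 'PATENT', 'WIKI'])\n",
      "id_to_name = {i: str(c) for i, c in enumerate(le.classes_)}\n",
      "print(id_to_name)    # {0: 'BOOK', 1: 'PATENT', 2: 'WIKI'}\n",
      "print(ids.tolist())  # [1, 0, 1, 2]"
     ]
    },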
{
"cell_type": "markdown",
"id": "78b87c8f-8135-4fff-b09a-80f2b156746b",
"metadata": {
"jp-MarkdownHeadingCollapsed": true
},
"source": [
"### Evaluate the model"
]
},
{
"cell_type": "code",
"execution_count": 14,
"id": "06a7e3d9-a9f4-45e0-be31-627817ddf56c",
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>manual check</th>\n",
" <th>GPT4 check</th>\n",
" <th>GPT3.5 check</th>\n",
" <th>Bib subcategory</th>\n",
" <th>npl_biblio</th>\n",
" <th>md5</th>\n",
" <th>language_is_reliable</th>\n",
" <th>language_code</th>\n",
" <th>npl_cat</th>\n",
" <th>npl_cat_score</th>\n",
" <th>npl_cat_language_flag</th>\n",
" <th>patcit_id</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>y</td>\n",
" <td>y</td>\n",
" <td>y</td>\n",
" <td>NaN</td>\n",
" <td>Watanabe Et Al Us Patent Application Publicat...</td>\n",
" <td>2ca504f11c3b378ce7be4619e2ee843f</td>\n",
" <td>True</td>\n",
" <td>en</td>\n",
" <td>PATENT</td>\n",
" <td>0.51</td>\n",
" <td>False</td>\n",
" <td>2ca504f11c3b378ce7be4619e2ee843f</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>y</td>\n",
" <td>y</td>\n",
" <td>y</td>\n",
" <td>JOURNAL ARTICLE</td>\n",
" <td>Dzulkafli Et Al., \"Effects Of Talc On Fire Re...</td>\n",
" <td>0685ae955c71d728f69046458ac1db0f</td>\n",
" <td>True</td>\n",
" <td>en</td>\n",
" <td>BIBLIOGRAPHICAL_REFERENCE</td>\n",
" <td>0.98</td>\n",
" <td>False</td>\n",
" <td>0685ae955c71d728f69046458ac1db0f</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>y</td>\n",
" <td>y</td>\n",
" <td>y</td>\n",
" <td>NaN</td>\n",
" <td>Zdepski Pub No Us 2017-0201784\\n</td>\n",
" <td>c6de38a1aa0a879105ced194459f343e</td>\n",
" <td>True</td>\n",
" <td>en</td>\n",
" <td>PATENT</td>\n",
" <td>0.33</td>\n",
" <td>False</td>\n",
" <td>c6de38a1aa0a879105ced194459f343e</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>y</td>\n",
" <td>y</td>\n",
" <td>y</td>\n",
" <td>JOURNAL ARTICLE</td>\n",
" <td>Nobori Et Al. (Cancer Research, 1997, 51:3193...</td>\n",
" <td>29e27156420faa44ae01a9e1a6363781</td>\n",
" <td>True</td>\n",
" <td>en</td>\n",
" <td>BIBLIOGRAPHICAL_REFERENCE</td>\n",
" <td>0.60</td>\n",
" <td>False</td>\n",
" <td>29e27156420faa44ae01a9e1a6363781</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9</th>\n",
" <td>y</td>\n",
" <td>y</td>\n",
" <td>y</td>\n",
" <td>JOURNAL ARTICLE</td>\n",
" <td>Fach Et Al, Neonatal Ovine Pulmonary Dendriti...</td>\n",
" <td>d2bea0db1c51b5ff13072c66202ba3fe</td>\n",
" <td>True</td>\n",
" <td>en</td>\n",
" <td>BIBLIOGRAPHICAL_REFERENCE</td>\n",
" <td>0.98</td>\n",
" <td>False</td>\n",
" <td>d2bea0db1c51b5ff13072c66202ba3fe</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>294</th>\n",
" <td>y</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>Machine Translation Of Jp-2007224953 (Year: 2...</td>\n",
" <td>3e8f555055a762f69f1349d5df9887be</td>\n",
" <td>True</td>\n",
" <td>en</td>\n",
" <td>PATENT</td>\n",
" <td>0.34</td>\n",
" <td>False</td>\n",
" <td>3e8f555055a762f69f1349d5df9887be</td>\n",
" </tr>\n",
" <tr>\n",
" <th>295</th>\n",
" <td>y</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>Ibm, Translucent Drag Icons (Tdb Acc. No. Nn9...</td>\n",
" <td>02bf7bfcd8b0c278b2f170a4863b1963</td>\n",
" <td>True</td>\n",
" <td>en</td>\n",
" <td>BIBLIOGRAPHICAL_REFERENCE</td>\n",
" <td>0.65</td>\n",
" <td>False</td>\n",
" <td>02bf7bfcd8b0c278b2f170a4863b1963</td>\n",
" </tr>\n",
" <tr>\n",
" <th>296</th>\n",
" <td>y</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>JOURNAL ARTICLE</td>\n",
" <td>Mastaloudis, A., Et Al., “Antioxidant Supplem...</td>\n",
" <td>e994b3f8571d466a279ba05b87eacf87</td>\n",
" <td>True</td>\n",
" <td>en</td>\n",
" <td>BIBLIOGRAPHICAL_REFERENCE</td>\n",
" <td>0.97</td>\n",
" <td>False</td>\n",
" <td>e994b3f8571d466a279ba05b87eacf87</td>\n",
" </tr>\n",
" <tr>\n",
" <th>297</th>\n",
" <td>y</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>CONFERENCE PROCEEDINGS</td>\n",
" <td>Chang Et Al., Motion Registration And Correct...</td>\n",
" <td>6e88f4dabe30e52db23a420f8203a433</td>\n",
" <td>True</td>\n",
" <td>en</td>\n",
" <td>BIBLIOGRAPHICAL_REFERENCE</td>\n",
" <td>0.97</td>\n",
" <td>False</td>\n",
" <td>6e88f4dabe30e52db23a420f8203a433</td>\n",
" </tr>\n",
" <tr>\n",
" <th>299</th>\n",
" <td>y</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>JOURNAL ARTICLE</td>\n",
" <td>Bevan Br. J. Pharmacol. (1992), Vol. 107, Pp....</td>\n",
" <td>891695ca1e05a56afc5987ea1ab7feb0</td>\n",
" <td>True</td>\n",
" <td>en</td>\n",
" <td>BIBLIOGRAPHICAL_REFERENCE</td>\n",
" <td>0.96</td>\n",
" <td>False</td>\n",
" <td>891695ca1e05a56afc5987ea1ab7feb0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>228 rows × 12 columns</p>\n",
"</div>"
],
"text/plain": [
" manual check GPT4 check GPT3.5 check Bib subcategory \\\n",
"1 y y y NaN \n",
"4 y y y JOURNAL ARTICLE \n",
"5 y y y NaN \n",
"7 y y y JOURNAL ARTICLE \n",
"9 y y y JOURNAL ARTICLE \n",
".. ... ... ... ... \n",
"294 y NaN NaN NaN \n",
"295 y NaN NaN NaN \n",
"296 y NaN NaN JOURNAL ARTICLE \n",
"297 y NaN NaN CONFERENCE PROCEEDINGS \n",
"299 y NaN NaN JOURNAL ARTICLE \n",
"\n",
" npl_biblio \\\n",
"1 Watanabe Et Al Us Patent Application Publicat... \n",
"4 Dzulkafli Et Al., \"Effects Of Talc On Fire Re... \n",
"5 Zdepski Pub No Us 2017-0201784\\n \n",
"7 Nobori Et Al. (Cancer Research, 1997, 51:3193... \n",
"9 Fach Et Al, Neonatal Ovine Pulmonary Dendriti... \n",
".. ... \n",
"294 Machine Translation Of Jp-2007224953 (Year: 2... \n",
"295 Ibm, Translucent Drag Icons (Tdb Acc. No. Nn9... \n",
"296 Mastaloudis, A., Et Al., “Antioxidant Supplem... \n",
"297 Chang Et Al., Motion Registration And Correct... \n",
"299 Bevan Br. J. Pharmacol. (1992), Vol. 107, Pp.... \n",
"\n",
" md5 language_is_reliable language_code \\\n",
"1 2ca504f11c3b378ce7be4619e2ee843f True en \n",
"4 0685ae955c71d728f69046458ac1db0f True en \n",
"5 c6de38a1aa0a879105ced194459f343e True en \n",
"7 29e27156420faa44ae01a9e1a6363781 True en \n",
"9 d2bea0db1c51b5ff13072c66202ba3fe True en \n",
".. ... ... ... \n",
"294 3e8f555055a762f69f1349d5df9887be True en \n",
"295 02bf7bfcd8b0c278b2f170a4863b1963 True en \n",
"296 e994b3f8571d466a279ba05b87eacf87 True en \n",
"297 6e88f4dabe30e52db23a420f8203a433 True en \n",
"299 891695ca1e05a56afc5987ea1ab7feb0 True en \n",
"\n",
" npl_cat npl_cat_score npl_cat_language_flag \\\n",
"1 PATENT 0.51 False \n",
"4 BIBLIOGRAPHICAL_REFERENCE 0.98 False \n",
"5 PATENT 0.33 False \n",
"7 BIBLIOGRAPHICAL_REFERENCE 0.60 False \n",
"9 BIBLIOGRAPHICAL_REFERENCE 0.98 False \n",
".. ... ... ... \n",
"294 PATENT 0.34 False \n",
"295 BIBLIOGRAPHICAL_REFERENCE 0.65 False \n",
"296 BIBLIOGRAPHICAL_REFERENCE 0.97 False \n",
"297 BIBLIOGRAPHICAL_REFERENCE 0.97 False \n",
"299 BIBLIOGRAPHICAL_REFERENCE 0.96 False \n",
"\n",
" patcit_id \n",
"1 2ca504f11c3b378ce7be4619e2ee843f \n",
"4 0685ae955c71d728f69046458ac1db0f \n",
"5 c6de38a1aa0a879105ced194459f343e \n",
"7 29e27156420faa44ae01a9e1a6363781 \n",
"9 d2bea0db1c51b5ff13072c66202ba3fe \n",
".. ... \n",
"294 3e8f555055a762f69f1349d5df9887be \n",
"295 02bf7bfcd8b0c278b2f170a4863b1963 \n",
"296 e994b3f8571d466a279ba05b87eacf87 \n",
"297 6e88f4dabe30e52db23a420f8203a433 \n",
"299 891695ca1e05a56afc5987ea1ab7feb0 \n",
"\n",
"[228 rows x 12 columns]"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"## load sample of ~300 citations manually classified by Kyle \n",
"\n",
"data1 = pd.read_excel('/home/fs01/spec1142/Emma/test/test_files/Copy of oa_300_sample_checked_kyle.xlsx')\n",
"data1 = data1[data1['manual check'] == 'y']\n",
"data1"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "755539d6-76bc-4fb0-ab4b-5d90c3afc009",
"metadata": {},
"outputs": [],
"source": [
"## clean the labels \n",
"\n",
"data1['category'] = [ elem[0] if pd.isna(elem[0]) == False else elem[1] for elem in data1[['Bib subcategory','npl_cat']].to_numpy()]\n",
"data1 = data1[ ( data1['category'] != 'BIBLIOGRAPHICAL_REFERENCE' ) & ( data1['category'] != '?' ) ]\n",
"data1['category'] = data1['category'].replace('JOURNAL ARTICLE', 'JOURNAL_ARTICLE')\n",
"data1['category'] = data1['category'].replace('CONFERENCE PROCEEDINGS', 'CONFERENCE_PROCEEDINGS')\n",
"data1['category'] = data1['category'].replace('PREPRINT/WORKING PAPER/TECHNICAL REPORT', 'PREPRINT/WORKING_PAPER/TECHNICAL_REPORT')\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 16,
"id": "e9414d15-ec71-475a-8f65-7118af5d6928",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"2024-05-20 18:07:01.923779: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered\n",
"2024-05-20 18:07:01.923908: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered\n",
"2024-05-20 18:07:02.599782: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered\n",
"2024-05-20 18:07:03.449559: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.\n",
"To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.\n",
"2024-05-20 18:07:34.661954: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT\n"
]
}
],
"source": [
"## load our own model \n",
"\n",
"from transformers import TextClassificationPipeline\n",
"\n",
"model_name = 'fine_tuned_model'\n",
"tokenizer = BertTokenizer.from_pretrained(model_name)\n",
"model = BertForSequenceClassification.from_pretrained(model_name, num_labels=len(set(encoded_labels)))\n",
"\n",
"pipe2 = TextClassificationPipeline(model=model, tokenizer=tokenizer, return_all_scores=False)\n"
]
},
{
"cell_type": "code",
"execution_count": 32,
"id": "5aebc9ae-7b9e-4498-9dc7-76a292824fdd",
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"## classify the testing set with our model\n",
"\n",
"test = data1[data1['category'].isin(dic_labels.values())][['npl_biblio','category']].to_numpy()\n",
"\n",
"start = time.time()\n",
"pred_label = []\n",
"true_label = []\n",
"pred_label_raw = pipe2(list(test[:,0]), batch_size = 8)\n",
"\n",
"for k in tqdm(range(len(test))):\n",
" pred_label.append(dic_labels[int(pred_label_raw[k]['label'][6:])])\n",
" true_label.append(test[k][1])\n",
" \n",
"\n",
"end = time.time()\n",
"print(end - start)\n",
"\n",
"labels = list(set(true_label))"
]
},
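    {
     "cell_type": "code",
     "execution_count": null,
     "id": "a94c0d2e-6b71-4f38-9c05-2d4e6f8a0b1c",
     "metadata": {},
     "outputs": [],
     "source": [
      "## helper metrics over the per-class 2x2 matrices returned by\n",
      "## multilabel_confusion_matrix ([[TN, FP], [FN, TP]]); a minimal sketch in\n",
      "## case accuracy/TPR/FPR are not already defined earlier in the notebook\n",
      "\n",
      "def accuracy(cm):\n",
      "    (tn, fp), (fn, tp) = cm\n",
      "    return (tp + tn) / (tn + fp + fn + tp)\n",
      "\n",
      "def TPR(cm):\n",
      "    (tn, fp), (fn, tp) = cm\n",
      "    return tp / (tp + fn) if (tp + fn) > 0 else 0.0\n",
      "\n",
      "def FPR(cm):\n",
      "    (tn, fp), (fn, tp) = cm\n",
      "    return fp / (fp + tn) if (fp + tn) > 0 else 0.0"
     ]
    },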
{
"cell_type": "code",
"execution_count": 36,
"id": "6a3fc670-904f-45cd-8f42-d57b134f5126",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Overall accuracy: 83.02752293577981\n"
]
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>labels</th>\n",
" <th>accuracy</th>\n",
" <th>TPR</th>\n",
" <th>FPR</th>\n",
" <th>true</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>WEBPAGE</td>\n",
" <td>0.940367</td>\n",
" <td>0.777778</td>\n",
" <td>0.052632</td>\n",
" <td>9</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>JOURNAL_ARTICLE</td>\n",
" <td>0.949541</td>\n",
" <td>0.905172</td>\n",
" <td>0.000000</td>\n",
" <td>116</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>THESIS</td>\n",
" <td>0.995413</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>PRODUCT_DOCUMENTATION</td>\n",
" <td>0.954128</td>\n",
" <td>0.533333</td>\n",
" <td>0.014778</td>\n",
" <td>15</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>SEARCH_REPORT</td>\n",
" <td>0.995413</td>\n",
" <td>1.000000</td>\n",
" <td>0.004608</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>BOOK</td>\n",
" <td>0.990826</td>\n",
" <td>0.833333</td>\n",
" <td>0.004717</td>\n",
" <td>6</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>DATABASE</td>\n",
" <td>0.995413</td>\n",
" <td>1.000000</td>\n",
" <td>0.004630</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>NORM_STANDARD</td>\n",
" <td>0.995413</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8</th>\n",
" <td>PREPRINT/WORKING_PAPER/TECHNICAL_REPORT</td>\n",
" <td>0.972477</td>\n",
" <td>0.500000</td>\n",
" <td>0.023148</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9</th>\n",
" <td>OFFICE_ACTION</td>\n",
" <td>0.995413</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>10</th>\n",
" <td>PATENT</td>\n",
" <td>0.940367</td>\n",
" <td>0.760870</td>\n",
" <td>0.011628</td>\n",
" <td>46</td>\n",
" </tr>\n",
" <tr>\n",
" <th>11</th>\n",
" <td>WIKI</td>\n",
" <td>0.990826</td>\n",
" <td>1.000000</td>\n",
" <td>0.009259</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>12</th>\n",
" <td>CONFERENCE_PROCEEDINGS</td>\n",
" <td>0.944954</td>\n",
" <td>0.937500</td>\n",
" <td>0.054455</td>\n",
" <td>16</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" labels accuracy TPR FPR \\\n",
"0 WEBPAGE 0.940367 0.777778 0.052632 \n",
"1 JOURNAL_ARTICLE 0.949541 0.905172 0.000000 \n",
"2 THESIS 0.995413 0.000000 0.000000 \n",
"3 PRODUCT_DOCUMENTATION 0.954128 0.533333 0.014778 \n",
"4 SEARCH_REPORT 0.995413 1.000000 0.004608 \n",
"5 BOOK 0.990826 0.833333 0.004717 \n",
"6 DATABASE 0.995413 1.000000 0.004630 \n",
"7 NORM_STANDARD 0.995413 0.000000 0.000000 \n",
"8 PREPRINT/WORKING_PAPER/TECHNICAL_REPORT 0.972477 0.500000 0.023148 \n",
"9 OFFICE_ACTION 0.995413 0.000000 0.000000 \n",
"10 PATENT 0.940367 0.760870 0.011628 \n",
"11 WIKI 0.990826 1.000000 0.009259 \n",
"12 CONFERENCE_PROCEEDINGS 0.944954 0.937500 0.054455 \n",
"\n",
" true \n",
"0 9 \n",
"1 116 \n",
"2 1 \n",
"3 15 \n",
"4 1 \n",
"5 6 \n",
"6 2 \n",
"7 1 \n",
"8 2 \n",
"9 1 \n",
"10 46 \n",
"11 2 \n",
"12 16 "
]
},
"execution_count": 36,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"## evaluate the model \n",
"\n",
"conf_matrix = multilabel_confusion_matrix(true_label , pred_label,labels=labels)#, labels=labels)\n",
"\n",
"df_metrics = pd.DataFrame()\n",
"\n",
"\n",
"missclassifications_rate = 100*sum( [ conf_matrix[k][0][1] for k in range(len(conf_matrix)) ]) / len(pred_label)\n",
"print('Overall accuracy: ', 100 - missclassifications_rate)\n",
"\n",
"\n",
"acc = []\n",
"tpr = []\n",
"fpr = []\n",
"count_true = [] \n",
"for elem in conf_matrix:\n",
" acc.append(accuracy(elem))\n",
" tpr.append(TPR(elem))\n",
" fpr.append(FPR(elem))\n",
" count_true.append(elem[1][1] + elem[1][0])\n",
"\n",
"\n",
"df_metrics['labels'] = [ k for k in labels]\n",
"\n",
"df_metrics['accuracy'] = acc\n",
"df_metrics['TPR'] = tpr\n",
"df_metrics['FPR'] = fpr\n",
"df_metrics['true'] = count_true\n",
"\n",
"df_metrics\n",
" "
]
},
{
"cell_type": "markdown",
"id": "2604672e-df2f-4674-9fcf-b5ee21a37352",
"metadata": {
"jp-MarkdownHeadingCollapsed": true
},
"source": [
"## Classify the citations with the model"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c2e29cc0-da34-4274-b700-a5236e427284",
"metadata": {},
"outputs": [],
"source": [
"# Load BERT model and tokenizer\n",
"\n",
"from transformers import BertTokenizer, BertForSequenceClassification, AdamW\n",
"from transformers import TextClassificationPipeline\n",
"\n",
"\n",
"## load classes names \n",
"dic_labels = {7: 'PATENT',\n",
" 4: 'LITIGATION',\n",
" 2: 'DATABASE',\n",
" 13: 'WIKI',\n",
" 12: 'WEBPAGE',\n",
" 1: 'CONFERENCE_PROCEEDINGS',\n",
" 3: 'JOURNAL_ARTICLE',\n",
" 6: 'OFFICE_ACTION',\n",
" 10: 'SEARCH_REPORT',\n",
" 9: 'PRODUCT_DOCUMENTATION',\n",
" 5: 'NORM_STANDARD',\n",
" 0: 'BOOK',\n",
" 8: 'PREPRINT/WORKING_PAPER/TECHNICAL_REPORT',\n",
" 11: 'THESIS'}\n",
"\n",
"encoded_labels = list(encoded_labels.values())\n",
"\n",
"\n",
"## load classification model \n",
"model_name = 'fine_tuned_model'\n",
"tokenizer = BertTokenizer.from_pretrained(model_name)\n",
"model = BertForSequenceClassification.from_pretrained(model_name, num_labels=len(set(encoded_labels)))\n",
"\n",
"pipe = TextClassificationPipeline(model=model, tokenizer=tokenizer, return_all_scores=False,truncation=True)\n",
"\n",
"\n",
"\n",
"## classify the citations and save the classes labels \n",
"files = glob.glob(path_base + 'oa_data_v1/*')\n",
"\n",
"\n",
"for k in range(162,len(files)):\n",
"\n",
" ## load citations\n",
" dic_result = {}\n",
" count = 0\n",
" file= files[k]\n",
" \n",
" with open(file) as lines:\n",
" for line_ in lines: \n",
" \n",
" dic_result[count] = line_.replace('\\n','')\n",
" count += 1\n",
"\n",
" print(len(dic_result))\n",
" \n",
" df = pd.DataFrame()\n",
" df['oa_citation'] = dic_result.values()\n",
" \n",
" ## classify citations\n",
" start = time.time()\n",
" result = pipe(list(df['oa_citation']), batch_size = 128)\n",
" end = time.time()\n",
" print(end - start)\n",
" \n",
" list_pred = [] \n",
" for elem in result:\n",
" list_pred.append(dic_labels[int(elem['label'][6:])])\n",
" \n",
" \n",
" \n",
" ## save classified citations\n",
" df['label'] = list_pred\n",
" df.to_csv(path_base + 'oa_data_v1_classified/' + file.split('/')[-1].split('.')[0] + '.tsv', sep = \"\\t\", index = False)\n",
" \n",
"\n"
]
},
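    {
     "cell_type": "code",
     "execution_count": null,
     "id": "c5e8f1a3-7d29-4b6c-8e0f-3a5b7c9d1e2f",
     "metadata": {},
     "outputs": [],
     "source": [
      "## sketch: the pipeline emits labels of the form 'LABEL_<id>'; the loops above\n",
      "## recover the id by slicing with [6:], which this helper makes explicit\n",
      "\n",
      "def decode_label(raw_label, id_to_name):\n",
      "    return id_to_name[int(raw_label.split('_')[-1])]\n",
      "\n",
      "decode_label('LABEL_7', {7: 'PATENT', 3: 'JOURNAL_ARTICLE'})  # -> 'PATENT'"
     ]
    },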
{
"cell_type": "markdown",
"id": "1a0da027-0880-49a3-a98d-ad99659b8440",
"metadata": {
"jp-MarkdownHeadingCollapsed": true
},
"source": [
"## Sample classified citations"
]
},
{
"cell_type": "code",
"execution_count": 39,
"id": "5d1725bf-f0e4-4b93-82ec-3134f5749db3",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>oa_citation</th>\n",
" <th>label</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>An English Machine Translation Of Александр Ви...</td>\n",
" <td>DATABASE</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>Dash And Konkimalla, Poly-Є-Caprolactone Based...</td>\n",
" <td>JOURNAL_ARTICLE</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>2014045792 Wo A1 淳</td>\n",
" <td>PATENT</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>А. А. Королев, Office Action For Russian Pate...</td>\n",
" <td>OFFICE_ACTION</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>Xu Cn 104741552A, Cited In Ids Filed 6/29/18</td>\n",
" <td>PATENT</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>Legagneur Et Al \"Limbo3 (M = Mn, Fe, Co): Syn...</td>\n",
" <td>JOURNAL_ARTICLE</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>Wang Cn 1037663314</td>\n",
" <td>PATENT</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>Haeley Wo 02/41801</td>\n",
" <td>PATENT</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8</th>\n",
" <td>Skoglund Wo 2010/027317</td>\n",
" <td>PATENT</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9</th>\n",
" <td>Li Us Patent No 6,719,697</td>\n",
" <td>PATENT</td>\n",
" </tr>\n",
" <tr>\n",
" <th>10</th>\n",
" <td>Us 0101398 A</td>\n",
" <td>PATENT</td>\n",
" </tr>\n",
" <tr>\n",
" <th>11</th>\n",
" <td>Olivieri Ep 1257118</td>\n",
" <td>PATENT</td>\n",
" </tr>\n",
" <tr>\n",
" <th>12</th>\n",
" <td>Olowinsky Et Al Us Patent Application Publica...</td>\n",
" <td>PATENT</td>\n",
" </tr>\n",
" <tr>\n",
" <th>13</th>\n",
" <td>2003-249598 Jp A</td>\n",
" <td>PATENT</td>\n",
" </tr>\n",
" <tr>\n",
" <th>14</th>\n",
" <td>Therapeutic Potential Of Natural Killer Cells...</td>\n",
" <td>BOOK</td>\n",
" </tr>\n",
" <tr>\n",
" <th>15</th>\n",
" <td>Sato Jp H11330112 With English Machine Transl...</td>\n",
" <td>PATENT</td>\n",
" </tr>\n",
" <tr>\n",
" <th>16</th>\n",
" <td>M. Caissy Et Al., Coming Soon: The Internatio...</td>\n",
" <td>CONFERENCE_PROCEEDINGS</td>\n",
" </tr>\n",
" <tr>\n",
" <th>17</th>\n",
" <td>Sugimoto ` 612</td>\n",
" <td>CONFERENCE_PROCEEDINGS</td>\n",
" </tr>\n",
" <tr>\n",
" <th>18</th>\n",
" <td>Pentair, Fairbanks Nijhuis, Vertical Turbine ...</td>\n",
" <td>PRODUCT_DOCUMENTATION</td>\n",
" </tr>\n",
" <tr>\n",
" <th>19</th>\n",
" <td>Liu Et Al Jp 57-023006</td>\n",
" <td>PATENT</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" oa_citation label\n",
"0 An English Machine Translation Of Александр Ви... DATABASE\n",
"1 Dash And Konkimalla, Poly-Є-Caprolactone Based... JOURNAL_ARTICLE\n",
"2 2014045792 Wo A1 淳 PATENT\n",
"3 А. А. Королев, Office Action For Russian Pate... OFFICE_ACTION\n",
"4 Xu Cn 104741552A, Cited In Ids Filed 6/29/18 PATENT\n",
"5 Legagneur Et Al \"Limbo3 (M = Mn, Fe, Co): Syn... JOURNAL_ARTICLE\n",
"6 Wang Cn 1037663314 PATENT\n",
"7 Haeley Wo 02/41801 PATENT\n",
"8 Skoglund Wo 2010/027317 PATENT\n",
"9 Li Us Patent No 6,719,697 PATENT\n",
"10 Us 0101398 A PATENT\n",
"11 Olivieri Ep 1257118 PATENT\n",
"12 Olowinsky Et Al Us Patent Application Publica... PATENT\n",
"13 2003-249598 Jp A PATENT\n",
"14 Therapeutic Potential Of Natural Killer Cells... BOOK\n",
"15 Sato Jp H11330112 With English Machine Transl... PATENT\n",
"16 M. Caissy Et Al., Coming Soon: The Internatio... CONFERENCE_PROCEEDINGS\n",
"17 Sugimoto ` 612 CONFERENCE_PROCEEDINGS\n",
"18 Pentair, Fairbanks Nijhuis, Vertical Turbine ... PRODUCT_DOCUMENTATION\n",
"19 Liu Et Al Jp 57-023006 PATENT"
]
},
"execution_count": 39,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"## load classified citations\n",
"\n",
"table = pd.read_csv(path_base + \"classificed_oa_data_v1.tsv\", delimiter = \"\\t\")\n",
"table.head(20)"
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "7977d705-f669-42ce-becd-d484806eaaa6",
"metadata": {},
"outputs": [],
"source": [
"## save a sample of citations frm each class\n",
"\n",
"labels = list(set(list(table['label'])))\n",
"for label in labels:\n",
" sm_table = table[table['label'] == label].sample(n=30)\n",
" label = label.replace('/','')\n",
" sm_table.to_excel(path_base + 'test_files/oa_v2/sample_30_cat_' + label + '.xlsx')"
]
},
{
"cell_type": "code",
"execution_count": 15,
"id": "e60c0822-9b90-4af5-9451-ba2eebfd151a",
"metadata": {},
"outputs": [],
"source": [
"## save a random sample of 300 citations\n",
"\n",
"sm_table = table.sample(n=300)\n",
"sm_table.to_excel(path_base + 'test_files/oa_v2/sample_300_all_cat.xlsx')"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.12"
}
},
"nbformat": 4,
"nbformat_minor": 5
}