Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
  • DOI references
    Found 4 DOI reference(s) in README
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (11.7%) to scientific vocabulary

Repository

Basic Info
  • Host: GitHub
  • Owner: moturesearch
  • Language: Jupyter Notebook
  • Default Branch: main
  • Size: 968 KB
Statistics
  • Stars: 0
  • Watchers: 0
  • Forks: 0
  • Open Issues: 0
  • Releases: 0
Created almost 2 years ago · Last pushed over 1 year ago

README.md

office-action-citations

This repository contains code for consolidating office-action citations using OpenAlex (or Crossref when a match wasn’t found with OpenAlex).

Approach

We outline our approach below. We also provide a pipeline diagram summarising our approach (see below).

We used OpenAlex to consolidate citations. If a citation contained a title, we searched using the title. If a citation did not contain a title (or a match was not found using the title), we followed the steps below. We also followed these steps if the relevance score of a title-matched citation was below 600. We selected 600 as our threshold after validating a sample of 100 title-matched citations.

  • We filtered out citations that were missing information for first author, journal, or publication year. To ensure a low false positive match rate, we further limited our sample to citations that contained at least one of the following pieces of information: volume number, issue number, first page number, last page number.
  • OpenAlex has a unique identifier for each author and journal. We searched for these unique identifiers, and only proceeded if a unique identifier was found for both the first author and the journal.
  • We used these unique identifiers along with publication year and any extra information (namely, volume number, issue number, first page number, last page number) to search OpenAlex. Because volume number, issue number, and page numbers are not formatted consistently across citations, it is possible for Grobid to assign these values incorrectly (e.g., volume and issue number could be swapped). For this reason, we searched OpenAlex using all permutations of the values assigned to these four pieces of information. Note that we permuted the values for all fields (even if a value for a given field was missing).
  • Unlike title searches, these searches do not return a relevance score, so we cannot determine the best match when more than one permutation returns a match. We therefore selected the first result returned by OpenAlex (multiple matches are unlikely given the strict filtering criteria described above).
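
The permutation step above can be sketched as follows. This is a minimal illustration rather than the repository's code: the field names (`volume`, `issue`, `first_page`, `last_page`), the function `candidate_assignments`, and the example values are hypothetical, and the sketch only builds the candidate field assignments without querying OpenAlex.

```python
from itertools import permutations

# Hypothetical field names; the repository may label these differently.
FIELDS = ("volume", "issue", "first_page", "last_page")


def candidate_assignments(citation):
    """Return every distinct way of assigning the extracted values to FIELDS.

    Missing values (None) are permuted along with the rest, so a value that
    was extracted as the volume can also be tried as the issue or a page number.
    """
    values = [citation.get(field) for field in FIELDS]
    seen = set()
    candidates = []
    for ordering in permutations(values):
        if ordering in seen:  # repeated or missing values produce duplicate orderings
            continue
        seen.add(ordering)
        candidates.append(dict(zip(FIELDS, ordering)))
    return candidates


# Illustrative citation with two extracted values and two missing ones.
example = {"volume": "51", "issue": None, "first_page": "3193", "last_page": None}
candidates = candidate_assignments(example)
```

A citation with four distinct values yields 24 candidate assignments; missing or repeated values reduce that number because duplicate orderings are dropped. Each candidate would then be combined with the author identifier, journal identifier, and publication year to form one search.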

If a match was not found using OpenAlex, we used Crossref (which can be used via Grobid). Grobid extracts and consolidates citations using Crossref. We sent a reasonably small number of citations to Crossref. This is important because Crossref is unsuitable for processing large numbers of citations. If Crossref was able to consolidate the citation (i.e., a DOI was found), we sent the DOI to OpenAlex to retrieve the meta-data for the citation.
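
The order in which the sources are tried can be summarised in a short sketch. It is an illustration under assumptions, not the repository's implementation: the function `consolidate`, the four callables passed to it, the `relevance_score` key, and the source labels are all hypothetical stand-ins, and each callable is assumed to return `None` when it finds nothing (including when a citation fails the filtering criteria for the identifier-based search).

```python
TITLE_SCORE_THRESHOLD = 600  # threshold stated in the text above


def consolidate(citation, search_by_title, search_by_metadata,
                lookup_crossref_doi, fetch_by_doi):
    """Try each source in the order described above; return (record, source).

    All four callables are hypothetical placeholders for the real lookups.
    """
    # 1. Title search, accepted only if the relevance score reaches the threshold.
    if citation.get("title"):
        hit = search_by_title(citation["title"])
        if hit is not None and hit["relevance_score"] >= TITLE_SCORE_THRESHOLD:
            return hit, "openalex_title"

    # 2. Identifier-based search (author, journal, year, extra fields).
    hit = search_by_metadata(citation)
    if hit is not None:
        return hit, "openalex_metadata"

    # 3. Crossref fallback: resolve a DOI, then fetch the record by DOI.
    doi = lookup_crossref_doi(citation)
    if doi is not None:
        return fetch_by_doi(doi), "crossref_doi"

    return None, "unmatched"


# Stubbed example: a low-scoring title match falls through to the Crossref fallback.
result = consolidate(
    {"title": "Some title"},
    search_by_title=lambda title: {"id": "W1", "relevance_score": 450},
    search_by_metadata=lambda citation: None,
    lookup_crossref_doi=lambda citation: "10.1234/example",
    fetch_by_doi=lambda doi: {"id": "W2", "doi": doi},
)
```

Passing the lookups in as arguments keeps the sketch free of network calls and makes the fallback order easy to check with stubs.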

The pipeline diagram is below. Note that meta-data refers to a citation having information for: author, year, journal, and at least one of volume number, issue number, first or last page number.

[Figure: pipeline diagram summarising the consolidation approach]
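
The definition of meta-data used in the diagram can be written as a small predicate. This is a sketch with hypothetical names: the field names and the helpers `is_present` and `has_metadata` are assumptions made for illustration and do not come from the repository.

```python
# Hypothetical field names for a parsed citation.
REQUIRED_FIELDS = ("first_author", "journal", "year")
EXTRA_FIELDS = ("volume", "issue", "first_page", "last_page")


def is_present(value):
    """Treat None and blank strings as missing."""
    return value is not None and str(value).strip() != ""


def has_metadata(citation):
    """True if author, journal and year are present plus at least one extra field."""
    has_required = all(is_present(citation.get(f)) for f in REQUIRED_FIELDS)
    has_extra = any(is_present(citation.get(f)) for f in EXTRA_FIELDS)
    return has_required and has_extra


complete = {"first_author": "Nobori", "journal": "Cancer Research", "year": 1997, "volume": "51"}
no_extra = {"first_author": "Nobori", "journal": "Cancer Research", "year": 1997}
```

Here `has_metadata(complete)` is true, while `has_metadata(no_extra)` is false because none of the volume, issue, or page fields is filled in.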

Code

The flowchart below shows which code files we ran and in what order.

[Figure: flowchart of the code files and the order in which they were run]

Data

Our data is available on figshare.

Office-action citation data. https://doi.org/10.6084/m9.figshare.25874452.v1

Classification data. https://doi.org/10.6084/m9.figshare.25874464.v1

Owner

  • Name: Motu Economic and Public Policy Research
  • Login: moturesearch
  • Kind: organization
  • Location: Wellington, New Zealand

Motu is New Zealand’s leading economic research institute.

Citation (citations_classification.ipynb)

{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "80017f5f-1329-4bb7-ab2f-7f61ceebabde",
   "metadata": {},
   "source": [
    "# This notebook provides the code to:\n",
    "- evaluate GPT-3 and GPT-4's accuracy to classify the office action citations. \n",
    "- use GPT-4 to classify a sample of 5000 citations\n",
    "- train an LLM to classify the full set of citations \n",
    "- deploy the model on the full set of citations \n",
    "- sample the set of citations to manually label and classify them.\n",
    "\n",
    "If you have any questions on this notebook, please, feel free to contact me by email: scharfmann.emma@gmail.com"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a7681cce-4a61-4b3a-88cc-f74a2f0aedc4",
   "metadata": {},
   "source": [
    "## Load packages "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "03819842-487e-4ec0-ae7b-dec30262d71b",
   "metadata": {},
   "outputs": [],
   "source": [
    "import glob \n",
    "import pandas as pd\n",
    "import random \n",
    "import openai\n",
    "import time\n",
    "from tqdm import tqdm\n",
    "from collections import Counter\n",
    "from sklearn.metrics import multilabel_confusion_matrix\n",
    "from multiprocessing import Pool\n",
    "from functools import partial\n",
    "\n",
    "import warnings\n",
    "warnings.filterwarnings(\"ignore\")\n",
    "\n",
    "\n",
    "path_base = \"/home/fs01/spec1142/Emma/test/\"\n",
    "\n",
    "f = open(path_base + \"openai_key.txt\", \"r\")\n",
    "openai.api_key = f.read()\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "fea846d0-0a46-4f84-93c8-591c879d5cb0",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Below is the code for calcultating the accuracy, the TPR and the FPR.\n",
    "def accuracy(cm):\n",
    "    accuracy = (cm.ravel()[0]+cm.ravel()[3])/sum(cm.ravel())\n",
    "    return accuracy\n",
    "\n",
    "def TPR(cm):\n",
    "    TPR = cm[1][1]/(cm[1][1]+cm[1][0])\n",
    "    return TPR\n",
    "\n",
    "def FPR(cm):\n",
    "    FPR = cm[0][1]/(cm[0][1]+cm[0][0])\n",
    "    return FPR"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6645e18e-8350-49a0-9897-8566b4ce08ac",
   "metadata": {
    "jp-MarkdownHeadingCollapsed": true
   },
   "source": [
    "## Sample oa citations"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "cc64398d-58c5-4f16-b6af-4489832580cd",
   "metadata": {},
   "outputs": [],
   "source": [
    "## load oa citations and store the citations into a dictionary \n",
    "\n",
    "files= glob.glob(path_base + 'oa_data_v1/*')\n",
    "\n",
    "dic_result = {}\n",
    "count = 0\n",
    "\n",
    "for k in range(len(files)):\n",
    "    file= glob.glob(path_base + 'oa_data_v1/*')[k]\n",
    "    \n",
    "    with open(file) as lines:\n",
    "        for line_ in lines: \n",
    "            \n",
    "            dic_result[count] = line_.replace('\\n','')\n",
    "            count += 1\n",
    "\n",
    "## count number of citations\n",
    "print(len(dic_result))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "id": "c4fe0510-11e6-4d8b-a19c-9ab9dc4c39a2",
   "metadata": {},
   "outputs": [],
   "source": [
    "## store data into a dataframe\n",
    "\n",
    "table_oa_citations = pd.DataFrame()\n",
    "table_oa_citations['citation'] = dic_result.values()\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "0391b601-c786-4ca8-8038-d87ca459df3c",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>citation</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>666891</th>\n",
       "      <td>Gao Et Al Us Publication No 2018/0293445</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>726424</th>\n",
       "      <td>Gb-2291949-B</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>619051</th>\n",
       "      <td>Jp-2012103941-A</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>636592</th>\n",
       "      <td>English Translation Of Kr 10-1328742</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>199107</th>\n",
       "      <td>Ep-1919136-A1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>609844</th>\n",
       "      <td>Jp-08244048-A</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>151055</th>\n",
       "      <td>Walter De 102007050797 A1 – Translation Used ...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>797624</th>\n",
       "      <td>Oxford Dictionary, Https://En.Oxforddictionar...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>536083</th>\n",
       "      <td>Wo-2017200295-A1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>33670</th>\n",
       "      <td>Systemic Antibiotics Recommendations From The...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>100 rows × 1 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "                                                 citation\n",
       "666891           Gao Et Al Us Publication No 2018/0293445\n",
       "726424                                       Gb-2291949-B\n",
       "619051                                    Jp-2012103941-A\n",
       "636592               English Translation Of Kr 10-1328742\n",
       "199107                                      Ep-1919136-A1\n",
       "...                                                   ...\n",
       "609844                                      Jp-08244048-A\n",
       "151055   Walter De 102007050797 A1 – Translation Used ...\n",
       "797624   Oxford Dictionary, Https://En.Oxforddictionar...\n",
       "536083                                   Wo-2017200295-A1\n",
       "33670    Systemic Antibiotics Recommendations From The...\n",
       "\n",
       "[100 rows x 1 columns]"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "## sample citations (100 citations sample)\n",
    "\n",
    "table_oa_citations.sample(n=100)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "838d4c83-0491-4e88-ad7e-206fa63df32e",
   "metadata": {
    "jp-MarkdownHeadingCollapsed": true
   },
   "source": [
    "## Evaluate GPT-3's accuracy"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "id": "dc600de7-568a-46d7-bcf4-2bdb1b27574e",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>manual check</th>\n",
       "      <th>GPT4 check</th>\n",
       "      <th>GPT3.5 check</th>\n",
       "      <th>Bib subcategory</th>\n",
       "      <th>npl_biblio</th>\n",
       "      <th>md5</th>\n",
       "      <th>language_is_reliable</th>\n",
       "      <th>language_code</th>\n",
       "      <th>npl_cat</th>\n",
       "      <th>npl_cat_score</th>\n",
       "      <th>npl_cat_language_flag</th>\n",
       "      <th>patcit_id</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>y</td>\n",
       "      <td>y</td>\n",
       "      <td>y</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Watanabe Et Al Us Patent Application Publicat...</td>\n",
       "      <td>2ca504f11c3b378ce7be4619e2ee843f</td>\n",
       "      <td>True</td>\n",
       "      <td>en</td>\n",
       "      <td>PATENT</td>\n",
       "      <td>0.51</td>\n",
       "      <td>False</td>\n",
       "      <td>2ca504f11c3b378ce7be4619e2ee843f</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>y</td>\n",
       "      <td>y</td>\n",
       "      <td>y</td>\n",
       "      <td>JOURNAL ARTICLE</td>\n",
       "      <td>Dzulkafli Et Al., \"Effects Of Talc On Fire Re...</td>\n",
       "      <td>0685ae955c71d728f69046458ac1db0f</td>\n",
       "      <td>True</td>\n",
       "      <td>en</td>\n",
       "      <td>BIBLIOGRAPHICAL_REFERENCE</td>\n",
       "      <td>0.98</td>\n",
       "      <td>False</td>\n",
       "      <td>0685ae955c71d728f69046458ac1db0f</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>y</td>\n",
       "      <td>y</td>\n",
       "      <td>y</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Zdepski Pub No Us 2017-0201784\\n</td>\n",
       "      <td>c6de38a1aa0a879105ced194459f343e</td>\n",
       "      <td>True</td>\n",
       "      <td>en</td>\n",
       "      <td>PATENT</td>\n",
       "      <td>0.33</td>\n",
       "      <td>False</td>\n",
       "      <td>c6de38a1aa0a879105ced194459f343e</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>y</td>\n",
       "      <td>y</td>\n",
       "      <td>y</td>\n",
       "      <td>JOURNAL ARTICLE</td>\n",
       "      <td>Nobori Et Al. (Cancer Research, 1997, 51:3193...</td>\n",
       "      <td>29e27156420faa44ae01a9e1a6363781</td>\n",
       "      <td>True</td>\n",
       "      <td>en</td>\n",
       "      <td>BIBLIOGRAPHICAL_REFERENCE</td>\n",
       "      <td>0.60</td>\n",
       "      <td>False</td>\n",
       "      <td>29e27156420faa44ae01a9e1a6363781</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>y</td>\n",
       "      <td>y</td>\n",
       "      <td>y</td>\n",
       "      <td>JOURNAL ARTICLE</td>\n",
       "      <td>Fach Et Al, Neonatal Ovine Pulmonary Dendriti...</td>\n",
       "      <td>d2bea0db1c51b5ff13072c66202ba3fe</td>\n",
       "      <td>True</td>\n",
       "      <td>en</td>\n",
       "      <td>BIBLIOGRAPHICAL_REFERENCE</td>\n",
       "      <td>0.98</td>\n",
       "      <td>False</td>\n",
       "      <td>d2bea0db1c51b5ff13072c66202ba3fe</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  manual check GPT4 check GPT3.5 check  Bib subcategory  \\\n",
       "1            y          y            y              NaN   \n",
       "4            y          y            y  JOURNAL ARTICLE   \n",
       "5            y          y            y              NaN   \n",
       "7            y          y            y  JOURNAL ARTICLE   \n",
       "9            y          y            y  JOURNAL ARTICLE   \n",
       "\n",
       "                                          npl_biblio  \\\n",
       "1   Watanabe Et Al Us Patent Application Publicat...   \n",
       "4   Dzulkafli Et Al., \"Effects Of Talc On Fire Re...   \n",
       "5                   Zdepski Pub No Us 2017-0201784\\n   \n",
       "7   Nobori Et Al. (Cancer Research, 1997, 51:3193...   \n",
       "9   Fach Et Al, Neonatal Ovine Pulmonary Dendriti...   \n",
       "\n",
       "                                md5  language_is_reliable language_code  \\\n",
       "1  2ca504f11c3b378ce7be4619e2ee843f                  True            en   \n",
       "4  0685ae955c71d728f69046458ac1db0f                  True            en   \n",
       "5  c6de38a1aa0a879105ced194459f343e                  True            en   \n",
       "7  29e27156420faa44ae01a9e1a6363781                  True            en   \n",
       "9  d2bea0db1c51b5ff13072c66202ba3fe                  True            en   \n",
       "\n",
       "                     npl_cat  npl_cat_score  npl_cat_language_flag  \\\n",
       "1                     PATENT           0.51                  False   \n",
       "4  BIBLIOGRAPHICAL_REFERENCE           0.98                  False   \n",
       "5                     PATENT           0.33                  False   \n",
       "7  BIBLIOGRAPHICAL_REFERENCE           0.60                  False   \n",
       "9  BIBLIOGRAPHICAL_REFERENCE           0.98                  False   \n",
       "\n",
       "                          patcit_id  \n",
       "1  2ca504f11c3b378ce7be4619e2ee843f  \n",
       "4  0685ae955c71d728f69046458ac1db0f  \n",
       "5  c6de38a1aa0a879105ced194459f343e  \n",
       "7  29e27156420faa44ae01a9e1a6363781  \n",
       "9  d2bea0db1c51b5ff13072c66202ba3fe  "
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "## load sample of ~300 citations manually classified by Kyle \n",
    "\n",
    "data1 = pd.read_excel(path_base + 'test_files/Copy of oa_300_sample_checked_kyle.xlsx')\n",
    "data1 = data1[data1['manual check'] == 'y']\n",
    "data1.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "id": "587b6c31-0241-48d5-b6fa-20e53c3909d1",
   "metadata": {},
   "outputs": [],
   "source": [
    "## clean the labels \n",
    "\n",
    "data1['category'] = [ elem[0] if pd.isna(elem[0]) == False else elem[1] for elem in data1[['Bib subcategory','npl_cat']].to_numpy()]\n",
    "data1 = data1[ ( data1['category'] != 'BIBLIOGRAPHICAL_REFERENCE' ) &  ( data1['category'] != '?' ) ]\n",
    "data1['category'] = data1['category'].replace('JOURNAL ARTICLE', 'JOURNAL_ARTICLE')\n",
    "data1['category'] = data1['category'].replace('CONFERENCE PROCEEDINGS', 'CONFERENCE_PROCEEDINGS')\n",
    "data1['category'] = data1['category'].replace('PREPRINT/WORKING PAPER/TECHNICAL REPORT', 'PREPRINT/WORKING_PAPER/TECHNICAL_REPORT')\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "id": "e5a233de-e66a-4e6c-bd52-db0d93404dbf",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "100%|███████████████████████████████████████████| 20/20 [00:28<00:00,  1.44s/it]\n"
     ]
    }
   ],
   "source": [
    "number = 10\n",
    "\n",
    "## prompt for GPT 3\n",
    "prompt = \"\"\"I am going to give you\"\"\" + str(number) + \"\"\" cited documents that have been made in office actions by the US patent office. I want you to classify each cited document as being one of the following:\n",
    "WEBPAGE: Website\n",
    "PATENT: A patent or patent application\n",
    "PREPRINT/WORKING_PAPER/TECHNICAL_REPORT: Any public, non-peer reviewed technical document. These can be published on preprint servers, institute/personal websites, or even governmental archives.\n",
    "JOURNAL_ARTICLE: A peer reviewed article published in a journal.\n",
    "CONFERENCE_PROCEEDINGS: An article published as part of conference proceedings. The peer review process for such proceedings varies significantly, and differs from journal article in that it is a one-off publication.\n",
    "BOOK: A book or chapter in a book. Book chapters are a common outlet for academic research, but are often not peer reviewed by independent parties, and are usually less accessible than the average journal article.\n",
    "THESIS: Thesis, sually archived by the degree-granting institution.\n",
    "NORM_STANDARD: An industrial norm or standard\n",
    "PRODUCT_DOCUMENTATION: documentation for a product, such as a user manual or catalogue\n",
    "OFFICE_ACTION: A different office action sent by the patent office\n",
    "WIKI: A wikipedia page (a subset of webpage)\n",
    "DATABASE: A database, such as a genetic or corporate database\n",
    "LITIGATION: A court case or formal opposition proceeding within the patent office\n",
    "SEARCH_REPORT: A search report issued by a patent office\n",
    "Only list the classes and the first word of the cited document. \n",
    "Be VERY carefull not to forget cited documents!\n",
    "\"\"\"\n",
    "\n",
    "true_labels = []\n",
    "results = []\n",
    "\n",
    "## ask GPT to classify the chunks of citations. Note that GPT tends to forget some citations. \n",
    "for k in tqdm(range(20)):\n",
    "\n",
    "    citations = data1[['npl_biblio','category']].to_numpy()[number*k:number*(k+1)]\n",
    "    texts = \"; \".join([ str(k+1) + \": \"  +citations[:,0][k] for k  in range(len(citations[:,0])) ] )\n",
    "    \n",
    "    completion = openai.ChatCompletion.create(\n",
    "        model='gpt-3.5-turbo-0125',\n",
    "        messages=[{\"role\": \"system\", \"content\": prompt},\n",
    "                  {\"role\": \"user\", \"content\": texts}],\n",
    "        temperature= 0.1)\n",
    "    \n",
    "    true_labels += list(citations[:,1])\n",
    "    res = completion['choices'][0]['message']['content'].split('\\n')\n",
    "    \n",
    "    results += res\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 38,
   "id": "3890f407-5cbc-417a-899f-68b5b108f717",
   "metadata": {},
   "outputs": [],
   "source": [
    "## clean GPT's classification\n",
    "\n",
    "labels = set(list(data1['category']))\n",
    "predicted_labels = [ list(set(elem.replace('\\r', '').replace(':','').replace('TECHNICAL_REPORT/WORKING_PAPER','PREPRINT/WORKING_PAPER/TECHNICAL_REPORT').split()) &  labels)[0] if list(set(elem.replace('\\r', '').replace(':','').replace('TECHNICAL_REPORT/WORKING_PAPER','PREPRINT/WORKING_PAPER/TECHNICAL_REPORT').split()) &  labels) != [] else 'OTHER' for elem in results if elem != '']\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 39,
   "id": "7f8299ff-c739-427f-b401-0ebc47362eca",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>class</th>\n",
       "      <th>number of elements</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>PATENT</td>\n",
       "      <td>43</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>JOURNAL_ARTICLE</td>\n",
       "      <td>111</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>CONFERENCE_PROCEEDINGS</td>\n",
       "      <td>14</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>BOOK</td>\n",
       "      <td>4</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>PRODUCT_DOCUMENTATION</td>\n",
       "      <td>13</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>WEBPAGE</td>\n",
       "      <td>6</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>PREPRINT/WORKING_PAPER/TECHNICAL_REPORT</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>OFFICE_ACTION</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>THESIS</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>DATABASE</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10</th>\n",
       "      <td>NORM_STANDARD</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>11</th>\n",
       "      <td>SEARCH_REPORT</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>12</th>\n",
       "      <td>WIKI</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                      class  number of elements\n",
       "0                                    PATENT                  43\n",
       "1                           JOURNAL_ARTICLE                 111\n",
       "2                    CONFERENCE_PROCEEDINGS                  14\n",
       "3                                      BOOK                   4\n",
       "4                     PRODUCT_DOCUMENTATION                  13\n",
       "5                                   WEBPAGE                   6\n",
       "6   PREPRINT/WORKING_PAPER/TECHNICAL_REPORT                   1\n",
       "7                             OFFICE_ACTION                   1\n",
       "8                                    THESIS                   1\n",
       "9                                  DATABASE                   2\n",
       "10                            NORM_STANDARD                   1\n",
       "11                            SEARCH_REPORT                   1\n",
       "12                                     WIKI                   2"
      ]
     },
     "execution_count": 39,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "## count citations in each class\n",
    "\n",
    "df_counter = pd.DataFrame()\n",
    "df_counter['class'] = Counter(true_labels).keys()\n",
    "df_counter['number of elements'] = Counter(true_labels).values()\n",
    "df_counter\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 40,
   "id": "8d685efe-03e2-40e1-9bab-29c90b3c2b7b",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Overall accuracy:  92.5\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>class</th>\n",
       "      <th>accuracy</th>\n",
       "      <th>TPR</th>\n",
       "      <th>FPR</th>\n",
       "      <th>number of elements</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>JOURNAL_ARTICLE</td>\n",
       "      <td>0.980</td>\n",
       "      <td>0.972973</td>\n",
       "      <td>0.011236</td>\n",
       "      <td>111</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>WEBPAGE</td>\n",
       "      <td>0.965</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>0.036082</td>\n",
       "      <td>6</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>BOOK</td>\n",
       "      <td>0.990</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>0.010204</td>\n",
       "      <td>4</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>SEARCH_REPORT</td>\n",
       "      <td>0.995</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>0.005025</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>OFFICE_ACTION</td>\n",
       "      <td>0.995</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>CONFERENCE_PROCEEDINGS</td>\n",
       "      <td>0.990</td>\n",
       "      <td>0.928571</td>\n",
       "      <td>0.005376</td>\n",
       "      <td>14</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>PREPRINT/WORKING_PAPER/TECHNICAL_REPORT</td>\n",
       "      <td>0.990</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.005025</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>PATENT</td>\n",
       "      <td>0.980</td>\n",
       "      <td>0.953488</td>\n",
       "      <td>0.012739</td>\n",
       "      <td>43</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>WIKI</td>\n",
       "      <td>0.995</td>\n",
       "      <td>0.500000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>PRODUCT_DOCUMENTATION</td>\n",
       "      <td>0.970</td>\n",
       "      <td>0.538462</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>13</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10</th>\n",
       "      <td>NORM_STANDARD</td>\n",
       "      <td>0.995</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>11</th>\n",
       "      <td>THESIS</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>12</th>\n",
       "      <td>DATABASE</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                      class  accuracy       TPR       FPR  \\\n",
       "0                           JOURNAL_ARTICLE     0.980  0.972973  0.011236   \n",
       "1                                   WEBPAGE     0.965  1.000000  0.036082   \n",
       "2                                      BOOK     0.990  1.000000  0.010204   \n",
       "3                             SEARCH_REPORT     0.995  1.000000  0.005025   \n",
       "4                             OFFICE_ACTION     0.995  0.000000  0.000000   \n",
       "5                    CONFERENCE_PROCEEDINGS     0.990  0.928571  0.005376   \n",
       "6   PREPRINT/WORKING_PAPER/TECHNICAL_REPORT     0.990  0.000000  0.005025   \n",
       "7                                    PATENT     0.980  0.953488  0.012739   \n",
       "8                                      WIKI     0.995  0.500000  0.000000   \n",
       "9                     PRODUCT_DOCUMENTATION     0.970  0.538462  0.000000   \n",
       "10                            NORM_STANDARD     0.995  0.000000  0.000000   \n",
       "11                                   THESIS     1.000  1.000000  0.000000   \n",
       "12                                 DATABASE     1.000  1.000000  0.000000   \n",
       "\n",
       "    number of elements  \n",
       "0                  111  \n",
       "1                    6  \n",
       "2                    4  \n",
       "3                    1  \n",
       "4                    1  \n",
       "5                   14  \n",
       "6                    1  \n",
       "7                   43  \n",
       "8                    2  \n",
       "9                   13  \n",
       "10                   1  \n",
       "11                   1  \n",
       "12                   2  "
      ]
     },
     "execution_count": 40,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "## evaluate GPT's accuracy \n",
    "\n",
    "labels = list(set(list(data1['category'])))\n",
    "conf_matrix = multilabel_confusion_matrix(true_labels , predicted_labels, labels=labels)\n",
    "\n",
    "\n",
    "df_metrics = pd.DataFrame()\n",
    "\n",
    "missclassifications_rate = 100*sum( [ conf_matrix[k][0][1] for k in range(len(conf_matrix)) ]) / len(predicted_labels)\n",
    "print('Overall accuracy: ', 100 - missclassifications_rate)\n",
    "\n",
    "\n",
    "acc = []\n",
    "tpr = []\n",
    "fpr = []\n",
    "for elem in conf_matrix:\n",
    "    acc.append(accuracy(elem))\n",
    "    tpr.append(TPR(elem))\n",
    "    fpr.append(FPR(elem))\n",
    "\n",
    "\n",
    "df_metrics['class'] = labels\n",
    "df_metrics['accuracy'] = acc\n",
    "df_metrics['TPR'] = tpr\n",
    "df_metrics['FPR'] = fpr\n",
    "\n",
    "df_metrics.merge(df_counter, on='class')\n",
    "    "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9dedb7bb-0c7a-46bc-bd20-e3173a58cfc5",
   "metadata": {
    "jp-MarkdownHeadingCollapsed": true
   },
   "source": [
    "## Evaluate GPT-4's accuracy"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 43,
   "id": "d8bb40d5-1eab-4b37-9334-0ce751ef6b59",
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>manual check</th>\n",
       "      <th>GPT4 check</th>\n",
       "      <th>GPT3.5 check</th>\n",
       "      <th>Bib subcategory</th>\n",
       "      <th>npl_biblio</th>\n",
       "      <th>md5</th>\n",
       "      <th>language_is_reliable</th>\n",
       "      <th>language_code</th>\n",
       "      <th>npl_cat</th>\n",
       "      <th>npl_cat_score</th>\n",
       "      <th>npl_cat_language_flag</th>\n",
       "      <th>patcit_id</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>y</td>\n",
       "      <td>y</td>\n",
       "      <td>y</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Watanabe Et Al Us Patent Application Publicat...</td>\n",
       "      <td>2ca504f11c3b378ce7be4619e2ee843f</td>\n",
       "      <td>True</td>\n",
       "      <td>en</td>\n",
       "      <td>PATENT</td>\n",
       "      <td>0.51</td>\n",
       "      <td>False</td>\n",
       "      <td>2ca504f11c3b378ce7be4619e2ee843f</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>y</td>\n",
       "      <td>y</td>\n",
       "      <td>y</td>\n",
       "      <td>JOURNAL ARTICLE</td>\n",
       "      <td>Dzulkafli Et Al., \"Effects Of Talc On Fire Re...</td>\n",
       "      <td>0685ae955c71d728f69046458ac1db0f</td>\n",
       "      <td>True</td>\n",
       "      <td>en</td>\n",
       "      <td>BIBLIOGRAPHICAL_REFERENCE</td>\n",
       "      <td>0.98</td>\n",
       "      <td>False</td>\n",
       "      <td>0685ae955c71d728f69046458ac1db0f</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>y</td>\n",
       "      <td>y</td>\n",
       "      <td>y</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Zdepski Pub No Us 2017-0201784\\n</td>\n",
       "      <td>c6de38a1aa0a879105ced194459f343e</td>\n",
       "      <td>True</td>\n",
       "      <td>en</td>\n",
       "      <td>PATENT</td>\n",
       "      <td>0.33</td>\n",
       "      <td>False</td>\n",
       "      <td>c6de38a1aa0a879105ced194459f343e</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>y</td>\n",
       "      <td>y</td>\n",
       "      <td>y</td>\n",
       "      <td>JOURNAL ARTICLE</td>\n",
       "      <td>Nobori Et Al. (Cancer Research, 1997, 51:3193...</td>\n",
       "      <td>29e27156420faa44ae01a9e1a6363781</td>\n",
       "      <td>True</td>\n",
       "      <td>en</td>\n",
       "      <td>BIBLIOGRAPHICAL_REFERENCE</td>\n",
       "      <td>0.60</td>\n",
       "      <td>False</td>\n",
       "      <td>29e27156420faa44ae01a9e1a6363781</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>y</td>\n",
       "      <td>y</td>\n",
       "      <td>y</td>\n",
       "      <td>JOURNAL ARTICLE</td>\n",
       "      <td>Fach Et Al, Neonatal Ovine Pulmonary Dendriti...</td>\n",
       "      <td>d2bea0db1c51b5ff13072c66202ba3fe</td>\n",
       "      <td>True</td>\n",
       "      <td>en</td>\n",
       "      <td>BIBLIOGRAPHICAL_REFERENCE</td>\n",
       "      <td>0.98</td>\n",
       "      <td>False</td>\n",
       "      <td>d2bea0db1c51b5ff13072c66202ba3fe</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>294</th>\n",
       "      <td>y</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Machine Translation Of Jp-2007224953 (Year: 2...</td>\n",
       "      <td>3e8f555055a762f69f1349d5df9887be</td>\n",
       "      <td>True</td>\n",
       "      <td>en</td>\n",
       "      <td>PATENT</td>\n",
       "      <td>0.34</td>\n",
       "      <td>False</td>\n",
       "      <td>3e8f555055a762f69f1349d5df9887be</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>295</th>\n",
       "      <td>y</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Ibm, Translucent Drag Icons (Tdb Acc. No. Nn9...</td>\n",
       "      <td>02bf7bfcd8b0c278b2f170a4863b1963</td>\n",
       "      <td>True</td>\n",
       "      <td>en</td>\n",
       "      <td>BIBLIOGRAPHICAL_REFERENCE</td>\n",
       "      <td>0.65</td>\n",
       "      <td>False</td>\n",
       "      <td>02bf7bfcd8b0c278b2f170a4863b1963</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>296</th>\n",
       "      <td>y</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>JOURNAL ARTICLE</td>\n",
       "      <td>Mastaloudis, A., Et Al., “Antioxidant Supplem...</td>\n",
       "      <td>e994b3f8571d466a279ba05b87eacf87</td>\n",
       "      <td>True</td>\n",
       "      <td>en</td>\n",
       "      <td>BIBLIOGRAPHICAL_REFERENCE</td>\n",
       "      <td>0.97</td>\n",
       "      <td>False</td>\n",
       "      <td>e994b3f8571d466a279ba05b87eacf87</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>297</th>\n",
       "      <td>y</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>CONFERENCE PROCEEDINGS</td>\n",
       "      <td>Chang Et Al., Motion Registration And Correct...</td>\n",
       "      <td>6e88f4dabe30e52db23a420f8203a433</td>\n",
       "      <td>True</td>\n",
       "      <td>en</td>\n",
       "      <td>BIBLIOGRAPHICAL_REFERENCE</td>\n",
       "      <td>0.97</td>\n",
       "      <td>False</td>\n",
       "      <td>6e88f4dabe30e52db23a420f8203a433</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>299</th>\n",
       "      <td>y</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>JOURNAL ARTICLE</td>\n",
       "      <td>Bevan Br. J. Pharmacol. (1992), Vol. 107, Pp....</td>\n",
       "      <td>891695ca1e05a56afc5987ea1ab7feb0</td>\n",
       "      <td>True</td>\n",
       "      <td>en</td>\n",
       "      <td>BIBLIOGRAPHICAL_REFERENCE</td>\n",
       "      <td>0.96</td>\n",
       "      <td>False</td>\n",
       "      <td>891695ca1e05a56afc5987ea1ab7feb0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>228 rows × 12 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "    manual check GPT4 check GPT3.5 check         Bib subcategory  \\\n",
       "1              y          y            y                     NaN   \n",
       "4              y          y            y         JOURNAL ARTICLE   \n",
       "5              y          y            y                     NaN   \n",
       "7              y          y            y         JOURNAL ARTICLE   \n",
       "9              y          y            y         JOURNAL ARTICLE   \n",
       "..           ...        ...          ...                     ...   \n",
       "294            y        NaN          NaN                     NaN   \n",
       "295            y        NaN          NaN                     NaN   \n",
       "296            y        NaN          NaN         JOURNAL ARTICLE   \n",
       "297            y        NaN          NaN  CONFERENCE PROCEEDINGS   \n",
       "299            y        NaN          NaN         JOURNAL ARTICLE   \n",
       "\n",
       "                                            npl_biblio  \\\n",
       "1     Watanabe Et Al Us Patent Application Publicat...   \n",
       "4     Dzulkafli Et Al., \"Effects Of Talc On Fire Re...   \n",
       "5                     Zdepski Pub No Us 2017-0201784\\n   \n",
       "7     Nobori Et Al. (Cancer Research, 1997, 51:3193...   \n",
       "9     Fach Et Al, Neonatal Ovine Pulmonary Dendriti...   \n",
       "..                                                 ...   \n",
       "294   Machine Translation Of Jp-2007224953 (Year: 2...   \n",
       "295   Ibm, Translucent Drag Icons (Tdb Acc. No. Nn9...   \n",
       "296   Mastaloudis, A., Et Al., “Antioxidant Supplem...   \n",
       "297   Chang Et Al., Motion Registration And Correct...   \n",
       "299   Bevan Br. J. Pharmacol. (1992), Vol. 107, Pp....   \n",
       "\n",
       "                                  md5  language_is_reliable language_code  \\\n",
       "1    2ca504f11c3b378ce7be4619e2ee843f                  True            en   \n",
       "4    0685ae955c71d728f69046458ac1db0f                  True            en   \n",
       "5    c6de38a1aa0a879105ced194459f343e                  True            en   \n",
       "7    29e27156420faa44ae01a9e1a6363781                  True            en   \n",
       "9    d2bea0db1c51b5ff13072c66202ba3fe                  True            en   \n",
       "..                                ...                   ...           ...   \n",
       "294  3e8f555055a762f69f1349d5df9887be                  True            en   \n",
       "295  02bf7bfcd8b0c278b2f170a4863b1963                  True            en   \n",
       "296  e994b3f8571d466a279ba05b87eacf87                  True            en   \n",
       "297  6e88f4dabe30e52db23a420f8203a433                  True            en   \n",
       "299  891695ca1e05a56afc5987ea1ab7feb0                  True            en   \n",
       "\n",
       "                       npl_cat  npl_cat_score  npl_cat_language_flag  \\\n",
       "1                       PATENT           0.51                  False   \n",
       "4    BIBLIOGRAPHICAL_REFERENCE           0.98                  False   \n",
       "5                       PATENT           0.33                  False   \n",
       "7    BIBLIOGRAPHICAL_REFERENCE           0.60                  False   \n",
       "9    BIBLIOGRAPHICAL_REFERENCE           0.98                  False   \n",
       "..                         ...            ...                    ...   \n",
       "294                     PATENT           0.34                  False   \n",
       "295  BIBLIOGRAPHICAL_REFERENCE           0.65                  False   \n",
       "296  BIBLIOGRAPHICAL_REFERENCE           0.97                  False   \n",
       "297  BIBLIOGRAPHICAL_REFERENCE           0.97                  False   \n",
       "299  BIBLIOGRAPHICAL_REFERENCE           0.96                  False   \n",
       "\n",
       "                            patcit_id  \n",
       "1    2ca504f11c3b378ce7be4619e2ee843f  \n",
       "4    0685ae955c71d728f69046458ac1db0f  \n",
       "5    c6de38a1aa0a879105ced194459f343e  \n",
       "7    29e27156420faa44ae01a9e1a6363781  \n",
       "9    d2bea0db1c51b5ff13072c66202ba3fe  \n",
       "..                                ...  \n",
       "294  3e8f555055a762f69f1349d5df9887be  \n",
       "295  02bf7bfcd8b0c278b2f170a4863b1963  \n",
       "296  e994b3f8571d466a279ba05b87eacf87  \n",
       "297  6e88f4dabe30e52db23a420f8203a433  \n",
       "299  891695ca1e05a56afc5987ea1ab7feb0  \n",
       "\n",
       "[228 rows x 12 columns]"
      ]
     },
     "execution_count": 43,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "## load sample of ~300 citations manually classified by Kyle \n",
    "\n",
    "data1 = pd.read_excel('/home/fs01/spec1142/Emma/test/test_files/Copy of oa_300_sample_checked_kyle.xlsx')\n",
    "data1 = data1[data1['manual check'] == 'y']\n",
    "data1"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 46,
   "id": "610e08e9-ba78-45f9-9b77-7d124c9f6a77",
   "metadata": {},
   "outputs": [],
   "source": [
    "## clean the labels \n",
    "\n",
    "data1['category'] = [ elem[0] if pd.isna(elem[0]) == False else elem[1] for elem in data1[['Bib subcategory','npl_cat']].to_numpy()]\n",
    "data1 = data1[ ( data1['category'] != 'BIBLIOGRAPHICAL_REFERENCE' ) &  ( data1['category'] != '?' ) ]\n",
    "data1['category'] = data1['category'].replace('JOURNAL ARTICLE', 'JOURNAL_ARTICLE')\n",
    "data1['category'] = data1['category'].replace('CONFERENCE PROCEEDINGS', 'CONFERENCE_PROCEEDINGS')\n",
    "data1['category'] = data1['category'].replace('PREPRINT/WORKING PAPER/TECHNICAL REPORT', 'PREPRINT/WORKING_PAPER/TECHNICAL_REPORT')\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 47,
   "id": "79c939e4-6d33-4bbf-8753-4a88a6f7bed0",
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "100%|███████████████████████████████████████████| 20/20 [00:58<00:00,  2.94s/it]\n"
     ]
    }
   ],
   "source": [
    "import openai\n",
    "import time\n",
    "from tqdm import tqdm\n",
    "\n",
    "number = 10\n",
    "\n",
    "\n",
    "## prompt for GPT 3\n",
    "prompt = \"\"\"I am going to give you\"\"\" + str(number) + \"\"\" cited documents that have been made in office actions by the US patent office. I want you to classify each cited document as being one of the following:\n",
    "WEBPAGE: Website\n",
    "PATENT: A patent or patent application\n",
    "PREPRINT/WORKING_PAPER/TECHNICAL_REPORT: Any public, non-peer reviewed technical document. These can be published on preprint servers, institute/personal websites, or even governmental archives.\n",
    "JOURNAL_ARTICLE: A peer reviewed article published in a journal.\n",
    "CONFERENCE_PROCEEDINGS: An article published as part of conference proceedings. The peer review process for such proceedings varies significantly, and differs from journal article in that it is a one-off publication.\n",
    "BOOK: A book or chapter in a book. Book chapters are a common outlet for academic research, but are often not peer reviewed by independent parties, and are usually less accessible than the average journal article.\n",
    "THESIS: Thesis, sually archived by the degree-granting institution.\n",
    "NORM_STANDARD: An industrial norm or standard\n",
    "PRODUCT_DOCUMENTATION: documentation for a product, such as a user manual or catalogue\n",
    "OFFICE_ACTION: A different office action sent by the patent office\n",
    "WIKI: A wikipedia page (a subset of webpage)\n",
    "DATABASE: A database, such as a genetic or corporate database\n",
    "LITIGATION: A court case or formal opposition proceeding within the patent office\n",
    "SEARCH_REPORT: A search report issued by a patent office\n",
    "Only list the classes and the first word of the cited document. \n",
    "Be VERY carefull not to forget cited documents!\n",
    "\"\"\"\n",
    "\n",
    "true_labels = []\n",
    "results = []\n",
    "\n",
    "## ask GPT to classify the chunks of citations. Note that GPT tends to forget some citations. \n",
    "for k in tqdm(range(20)):\n",
    "\n",
    "    citations = data1[['npl_biblio','category']].to_numpy()[number*k:number*(k+1)]\n",
    "    texts = \"; \".join([ str(k+1) + \": \"  +citations[:,0][k] for k  in range(len(citations[:,0])) ] )\n",
    "    \n",
    "    completion = openai.ChatCompletion.create(\n",
    "        #model=\"gpt-4-0125-preview\", \n",
    "        model='gpt-4-0125-preview',\n",
    "        messages=[{\"role\": \"system\", \"content\": prompt},\n",
    "                  {\"role\": \"user\", \"content\": texts}],\n",
    "        temperature= 0.1)\n",
    "    \n",
    "    true_labels += list(citations[:,1])\n",
    "    res = completion['choices'][0]['message']['content'].split('\\n')\n",
    "    \n",
    "    results += res\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 48,
   "id": "8f82ef1b-39f2-4632-aaa0-d3259b171e9d",
   "metadata": {},
   "outputs": [],
   "source": [
    "## clean GPT's classification\n",
    "\n",
    "labels = set(list(data1['category']))\n",
    "predicted_labels = [ list(set(elem.replace('\\r', '').replace(':','').replace('TECHNICAL_REPORT/WORKING_PAPER','PREPRINT/WORKING_PAPER/TECHNICAL_REPORT').split()) &  labels)[0] if list(set(elem.replace('\\r', '').replace(':','').replace('TECHNICAL_REPORT/WORKING_PAPER','PREPRINT/WORKING_PAPER/TECHNICAL_REPORT').split()) &  labels) != [] else 'OTHER' for elem in results if elem != '']\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 49,
   "id": "8fb4f549-715a-49d8-908b-52c1261a1470",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>class</th>\n",
       "      <th>number of elements</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>PATENT</td>\n",
       "      <td>43</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>JOURNAL_ARTICLE</td>\n",
       "      <td>111</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>CONFERENCE_PROCEEDINGS</td>\n",
       "      <td>14</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>BOOK</td>\n",
       "      <td>4</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>PRODUCT_DOCUMENTATION</td>\n",
       "      <td>13</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>WEBPAGE</td>\n",
       "      <td>6</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>PREPRINT/WORKING_PAPER/TECHNICAL_REPORT</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>OFFICE_ACTION</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>THESIS</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>DATABASE</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10</th>\n",
       "      <td>NORM_STANDARD</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>11</th>\n",
       "      <td>SEARCH_REPORT</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>12</th>\n",
       "      <td>WIKI</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                      class  number of elements\n",
       "0                                    PATENT                  43\n",
       "1                           JOURNAL_ARTICLE                 111\n",
       "2                    CONFERENCE_PROCEEDINGS                  14\n",
       "3                                      BOOK                   4\n",
       "4                     PRODUCT_DOCUMENTATION                  13\n",
       "5                                   WEBPAGE                   6\n",
       "6   PREPRINT/WORKING_PAPER/TECHNICAL_REPORT                   1\n",
       "7                             OFFICE_ACTION                   1\n",
       "8                                    THESIS                   1\n",
       "9                                  DATABASE                   2\n",
       "10                            NORM_STANDARD                   1\n",
       "11                            SEARCH_REPORT                   1\n",
       "12                                     WIKI                   2"
      ]
     },
     "execution_count": 49,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "## count citations in each class\n",
    "\n",
    "df_counter = pd.DataFrame()\n",
    "df_counter['class'] = Counter(true_labels).keys()\n",
    "df_counter['number of elements'] = Counter(true_labels).values()\n",
    "df_counter"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 50,
   "id": "20999d6b-0191-4d58-add1-7a1222d338a4",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Overall accuracy:  93.5\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>class</th>\n",
       "      <th>accuracy</th>\n",
       "      <th>TPR</th>\n",
       "      <th>FPR</th>\n",
       "      <th>number of elements</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>JOURNAL_ARTICLE</td>\n",
       "      <td>0.985</td>\n",
       "      <td>0.972973</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>111</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>WEBPAGE</td>\n",
       "      <td>0.975</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>0.025773</td>\n",
       "      <td>6</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>BOOK</td>\n",
       "      <td>0.990</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>0.010204</td>\n",
       "      <td>4</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>SEARCH_REPORT</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>OFFICE_ACTION</td>\n",
       "      <td>0.995</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>CONFERENCE_PROCEEDINGS</td>\n",
       "      <td>0.985</td>\n",
       "      <td>0.857143</td>\n",
       "      <td>0.005376</td>\n",
       "      <td>14</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>PREPRINT/WORKING_PAPER/TECHNICAL_REPORT</td>\n",
       "      <td>0.980</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>0.020101</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>PATENT</td>\n",
       "      <td>0.995</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>0.006369</td>\n",
       "      <td>43</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>WIKI</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>PRODUCT_DOCUMENTATION</td>\n",
       "      <td>0.970</td>\n",
       "      <td>0.538462</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>13</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10</th>\n",
       "      <td>NORM_STANDARD</td>\n",
       "      <td>0.995</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>11</th>\n",
       "      <td>THESIS</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>12</th>\n",
       "      <td>DATABASE</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                      class  accuracy       TPR       FPR  \\\n",
       "0                           JOURNAL_ARTICLE     0.985  0.972973  0.000000   \n",
       "1                                   WEBPAGE     0.975  1.000000  0.025773   \n",
       "2                                      BOOK     0.990  1.000000  0.010204   \n",
       "3                             SEARCH_REPORT     1.000  1.000000  0.000000   \n",
       "4                             OFFICE_ACTION     0.995  0.000000  0.000000   \n",
       "5                    CONFERENCE_PROCEEDINGS     0.985  0.857143  0.005376   \n",
       "6   PREPRINT/WORKING_PAPER/TECHNICAL_REPORT     0.980  1.000000  0.020101   \n",
       "7                                    PATENT     0.995  1.000000  0.006369   \n",
       "8                                      WIKI     1.000  1.000000  0.000000   \n",
       "9                     PRODUCT_DOCUMENTATION     0.970  0.538462  0.000000   \n",
       "10                            NORM_STANDARD     0.995  0.000000  0.000000   \n",
       "11                                   THESIS     1.000  1.000000  0.000000   \n",
       "12                                 DATABASE     1.000  1.000000  0.000000   \n",
       "\n",
       "    number of elements  \n",
       "0                  111  \n",
       "1                    6  \n",
       "2                    4  \n",
       "3                    1  \n",
       "4                    1  \n",
       "5                   14  \n",
       "6                    1  \n",
       "7                   43  \n",
       "8                    2  \n",
       "9                   13  \n",
       "10                   1  \n",
       "11                   1  \n",
       "12                   2  "
      ]
     },
     "execution_count": 50,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "## evaluate GPT's accuracy \n",
    "\n",
    "labels = list(set(list(data1['category'])))\n",
    "conf_matrix = multilabel_confusion_matrix(true_labels , predicted_labels, labels=labels)\n",
    "\n",
    "df_metrics = pd.DataFrame()\n",
    "\n",
    "\n",
    "missclassifications_rate = 100*sum( [ conf_matrix[k][0][1] for k in range(len(conf_matrix)) ]) / len(predicted_labels)\n",
    "print('Overall accuracy: ', 100 - missclassifications_rate)\n",
    "\n",
    "\n",
    "acc = []\n",
    "tpr = []\n",
    "fpr = []\n",
    "for elem in conf_matrix:\n",
    "    acc.append(accuracy(elem))\n",
    "    tpr.append(TPR(elem))\n",
    "    fpr.append(FPR(elem))\n",
    "\n",
    "\n",
    "df_metrics['class'] = labels\n",
    "df_metrics['accuracy'] = acc\n",
    "df_metrics['TPR'] = tpr\n",
    "df_metrics['FPR'] = fpr\n",
    "\n",
    "df_metrics.merge(df_counter, on='class')\n",
    "    \n",
    "    "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "04682790-430e-4b6c-af80-40a9572e450e",
   "metadata": {},
   "source": [
    "## Use GPT-4 to classify a 5000 citations sample"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 51,
   "id": "4f24a6b9-8d66-42e1-b2ba-719a1a271908",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "5000\n"
     ]
    }
   ],
   "source": [
    "dic_result = {}\n",
    "count = 0\n",
    "path = '/home/fs01/spec1142/Emma/test/for_grobid_all_v0/'\n",
    "file = path + 'oa_crosswalk_without_sp_levi1_bq_citations0.txt'\n",
    "\n",
    "with open(file) as lines:\n",
    "    for line_ in lines: \n",
    "        \n",
    "        dic_result[count] = line_.replace('\\n','')\n",
    "        count += 1\n",
    "\n",
    "print(len(dic_result))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 52,
   "id": "b48d1f79-cbed-40ab-8125-17735a84cfb4",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>oa_citation</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Myositis Association Retrieved From On-Line We...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Hirsh Et Al. Weekly Nab-Paclitaxel In Combina...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Vaidyanathan Et Al. Bioconjugate Chem. 1990, ...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Wiegert ` 259</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Trakadis, Y.J. \"Patient-Controlled Encrypted ...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                         oa_citation\n",
       "0  Myositis Association Retrieved From On-Line We...\n",
       "1   Hirsh Et Al. Weekly Nab-Paclitaxel In Combina...\n",
       "2   Vaidyanathan Et Al. Bioconjugate Chem. 1990, ...\n",
       "3                                      Wiegert ` 259\n",
       "4   Trakadis, Y.J. \"Patient-Controlled Encrypted ..."
      ]
     },
     "execution_count": 52,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df5000 = pd.DataFrame()\n",
    "df5000['oa_citation'] = dic_result.values()\n",
    "df5000.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 60,
   "id": "e3d6bbaf-e110-4429-afb7-d9e11475c5e2",
   "metadata": {},
   "outputs": [],
   "source": [
    "## function to classify a chunk of 50 citations using GPT-4\n",
    "\n",
    "def classify_50_oa_citations(citations, openai_api_key):\n",
    "\n",
    "    \"\"\"\n",
    "    This function uses the OpenAI API to classify a series of 50 citations made in office actions by the US patent office.\n",
    "\n",
    "    Parameters:\n",
    "    citations (list): A list of citations to be classified.\n",
    "    openai.api_key (str): The API key for the OpenAI API.\n",
    "\n",
    "    Note:\n",
    "    - The function constructs a query string that includes the citations to be classified and a set of instructions for the OpenAI API.\n",
    "    - The function returns a list of classification results, with each result including the class of the cited document and the number of the cited document.\n",
    "    \"\"\"\n",
    "    \n",
    "    number = 50\n",
    "    results = []\n",
    "    \n",
    "    query = \"\"\"I am going to give you a series of \"\"\" + str(number) + \"\"\" citations that have been made in office actions by the US patent office. I want you to classify each cited document as being one of the following:\n",
    "    WEBPAGE: Website\n",
    "    PATENT: A patent or patent application\n",
    "    PREPRINT/WORKING_PAPER/TECHNICAL_REPORT: Any public, non-peer reviewed technical document. These can be published on preprint servers, institute/personal websites, or even governmental archives.\n",
    "    JOURNAL_ARTICLE: A peer reviewed article published in a journal.\n",
    "    CONFERENCE_PROCEEDINGS: An article published as part of conference proceedings. The peer review process for such proceedings varies significantly, and differs from journal article in that it is a one-off publication.\n",
    "    BOOK: A book or chapter in a book. Book chapters are a common outlet for academic research, but are often not peer reviewed by independent parties, and are usually less accessible than the average journal article.\n",
    "    THESIS: Thesis, sually archived by the degree-granting institution.\n",
    "    NORM_STANDARD: An industrial norm or standard\n",
    "    PRODUCT_DOCUMENTATION: documentation for a product, such as a user manual or catalogue\n",
    "    OFFICE_ACTION: A different office action sent by the patent office\n",
    "    WIKI: A wikipedia page (a subset of webpage)\n",
    "    DATABASE: A database, such as a genetic or corporate database\n",
    "    LITIGATION: A court case or formal opposition proceeding within the patent office\n",
    "    SEARCH_REPORT: A search report issued by a patent office\n",
    "    Only list the classes of the cited document and the number of the cited document.\"\"\"\n",
    "    \n",
    "    \n",
    "    texts = \"; \".join([ str(k) + ': ' + citations[k] for k  in range(len(citations)) ] )\n",
    "        \n",
    "    completion = openai.ChatCompletion.create(\n",
    "            model=\"gpt-4-0125-preview\", \n",
    "            messages=[{\"role\": \"system\", \"content\": query},\n",
    "                      {\"role\": \"user\", \"content\": texts}],\n",
    "            temperature= 0.2)\n",
    "        \n",
    "    results += completion['choices'][0]['message']['content'].split('\\n')\n",
    "        \n",
    "\n",
    "    return results "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 61,
   "id": "d517a584-2ec5-4817-93fb-f8918f9824fc",
   "metadata": {},
   "outputs": [],
   "source": [
    "## function \n",
    "\n",
    "def multi_gpt(openai_api_key,i):\n",
    "\n",
    "    \"\"\"\n",
    "    This function uses the OpenAI API to classify a series of citations in a given dataframe, in batches of 50.\n",
    "\n",
    "    Parameters:\n",
    "    openai.api_key (str): The API key for the OpenAI API.\n",
    "    i (int): The index of the dataframe to be classified.\n",
    "\n",
    "    Note:\n",
    "    - The function selects a subset of the dataframe `df5000` based on the given index `i`.\n",
    "    - The function then divides the subset into smaller batches of 50 citations and uses the `classify_50_oa_citations` function to classify each batch.\n",
    "    - The function returns a list of classification results for the entire subset of the dataframe.\n",
    "    \"\"\"\n",
    "    \n",
    "    result = []\n",
    "    medium_df = df5000[200*i:200*(i+1)]\n",
    "    for k in range(4):\n",
    "        small_df = medium_df[50*k:50*(k+1)]\n",
    "        citations = list(small_df['oa_citation'])\n",
    "        res = classify_50_oa_citations(citations,openai_api_key)\n",
    "\n",
    "        if len(res) == 50:\n",
    "            result += res\n",
    "        else:\n",
    "            res = classify_50_oa_citations(citations,openai_api_key)\n",
    "    return result"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "60ea422b-c18e-4eef-affd-76e08eff500b",
   "metadata": {},
   "outputs": [],
   "source": [
    "## classify the citations (uses 12 cpus)\n",
    "\n",
    "openai_api_key = openai.api_key\n",
    "\n",
    "p = Pool(processes=12)\n",
    "func = partial(multi_gpt,openai_api_key)\n",
    "results = p.map(func, [ i  for i in range(12)])\n",
    "p.close()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 102,
   "id": "160f18ab-87a4-4e3a-b043-ed1e9ddc7db0",
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "1450\n",
      "2349\n",
      "4748\n"
     ]
    }
   ],
   "source": [
    "## clean the results. Note that GPT-4 tends to forget some citations. \n",
    "\n",
    "labels = []\n",
    "clean_list = []\n",
    "count = 0 \n",
    "\n",
    "for elem in results:\n",
    "    k = 0 \n",
    "    for line in elem:\n",
    "        count += 1\n",
    "        if line.split(':')[0] == str(k):\n",
    "            k += 1\n",
    "            if len(line.split(': ')) == 1:\n",
    "                labels.append('None')\n",
    "            else:\n",
    "                labels.append(line.split(': ')[1]) \n",
    "            \n",
    "            if k == 50:\n",
    "                k = 0\n",
    "        else:\n",
    "            \n",
    "            labels.append('None') \n",
    "            k += 1\n",
    "            if len(line.split(': ')) == 1:\n",
    "                labels.append('None')\n",
    "            else:\n",
    "                labels.append(line.split(': ')[1]) \n",
    "            \n",
    "            if k == 50:\n",
    "                k = 0\n",
    "            k += 1\n",
    "            if k == 50:\n",
    "                k = 0\n",
    "            \n",
    "            \n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 119,
   "id": "2d3a6c15-c963-48fe-8bbf-b75f35f7d94d",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>oa_citation</th>\n",
       "      <th>labels</th>\n",
       "      <th>flag</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Myositis Association Retrieved From On-Line We...</td>\n",
       "      <td>WEBPAGE</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Hirsh Et Al. Weekly Nab-Paclitaxel In Combina...</td>\n",
       "      <td>JOURNAL_ARTICLE</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Vaidyanathan Et Al. Bioconjugate Chem. 1990, ...</td>\n",
       "      <td>JOURNAL_ARTICLE</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Wiegert ` 259</td>\n",
       "      <td>PATENT</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Trakadis, Y.J. \"Patient-Controlled Encrypted ...</td>\n",
       "      <td>JOURNAL_ARTICLE</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4995</th>\n",
       "      <td>Us 0057553 A</td>\n",
       "      <td>PATENT</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4996</th>\n",
       "      <td>De-102017004043-A1</td>\n",
       "      <td>PATENT</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4997</th>\n",
       "      <td>Ting Et Al. (Cn 105151567) Machine Translatio...</td>\n",
       "      <td>PATENT</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4998</th>\n",
       "      <td>Ca2829631</td>\n",
       "      <td>PATENT</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4999</th>\n",
       "      <td>Wo-2017079461-A2</td>\n",
       "      <td>PATENT</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>5000 rows × 3 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "                                            oa_citation           labels  flag\n",
       "0     Myositis Association Retrieved From On-Line We...          WEBPAGE     0\n",
       "1      Hirsh Et Al. Weekly Nab-Paclitaxel In Combina...  JOURNAL_ARTICLE     0\n",
       "2      Vaidyanathan Et Al. Bioconjugate Chem. 1990, ...  JOURNAL_ARTICLE     0\n",
       "3                                         Wiegert ` 259           PATENT     0\n",
       "4      Trakadis, Y.J. \"Patient-Controlled Encrypted ...  JOURNAL_ARTICLE     0\n",
       "...                                                 ...              ...   ...\n",
       "4995                                       Us 0057553 A           PATENT     0\n",
       "4996                                 De-102017004043-A1           PATENT     0\n",
       "4997   Ting Et Al. (Cn 105151567) Machine Translatio...           PATENT     0\n",
       "4998                                          Ca2829631           PATENT     0\n",
       "4999                                   Wo-2017079461-A2           PATENT     0\n",
       "\n",
       "[5000 rows x 3 columns]"
      ]
     },
     "execution_count": 119,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "## save the data classified by GPT-4 and flag the potential errors\n",
    "\n",
    "df5000['labels'] = labels\n",
    "df5000[df5000['labels'] == 'None']\n",
    "df5000['flag'] = [ 1 if index in range(1400,1450) else 1 if  index in range(2300,2350) else 1 if index in range(4700,4750) else 0 for index in df5000.index]\n",
    "df5000"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 121,
   "id": "8a0fabc3-3887-4eba-ba3d-bd2955fc4e46",
   "metadata": {},
   "outputs": [],
   "source": [
    "## save the 5000 citations sample\n",
    "\n",
    "df5000.to_csv('/home/fs01/spec1142/Emma/test/' + 'gpt4_5000sample.csv', index = False)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "18ddb37b-1d95-487c-970d-feaca841d433",
   "metadata": {
    "jp-MarkdownHeadingCollapsed": true
   },
   "source": [
    "## Train our own model"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1e75ebd3-54e6-4516-aece-443dc4e90c0b",
   "metadata": {
    "jp-MarkdownHeadingCollapsed": true
   },
   "source": [
    "### Load data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "2559990b-f03e-4382-a577-64dae9da6cf5",
   "metadata": {},
   "outputs": [],
   "source": [
    "## load manually classified citations and clean the labels \n",
    "\n",
    "files = glob.glob('/home/fs01/spec1142/Emma/test/test_files/oa/*')\n",
    "data = pd.concat( [ pd.read_excel(elem) for elem in files])\n",
    "data = data[ ( data['manual check'] == 'y' ) |  ( data['manual_check'] == 'y')]\n",
    "\n",
    "data['category'] = [  elem for elem in data['npl_cat']]\n",
    "\n",
    "data = data[ ( data['category'] != 'BIBLIOGRAPHICAL_REFERENCE' ) &  ( data['category'] != '?' ) ]\n",
    "data['category'] = data['category'].replace('JOURNAL ARTICLE', 'JOURNAL_ARTICLE')\n",
    "data['category'] = data['category'].replace('CONFERENCE PROCEEDINGS', 'CONFERENCE_PROCEEDINGS')\n",
    "data['category'] = data['category'].replace('PREPRINT/WORKING PAPER/TECHNICAL REPORT', 'PREPRINT/WORKING_PAPER/TECHNICAL_REPORT')\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "8a5bb789-c973-4cfa-b5ea-5ae2f63e3482",
   "metadata": {},
   "outputs": [],
   "source": [
    "## load  citations classified by GPT-4 and clean the labels \n",
    "\n",
    "gpt_data = pd.read_csv('/home/fs01/spec1142/Emma/test/test_files/gpt4_5000sample.csv')\n",
    "gpt_data = gpt_data[gpt_data['flag'] == 0]\n",
    "gpt_data = gpt_data.rename(columns = { 'oa_citation':'npl_biblio' , 'labels' : 'category' })\n",
    "gpt_data  = gpt_data[['npl_biblio','category']]\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "4c584e21-706c-42bd-b734-fa440571a409",
   "metadata": {},
   "outputs": [],
   "source": [
    "## merge the two files\n",
    "\n",
    "data  = data[['npl_biblio','category']]\n",
    "data = pd.concat([data,gpt_data])\n",
    "data = data[(data['category'] != 'None Cited')&(data['category'] != 'GOVERNMENT_REPORT') ]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "4ce5185a-d8f2-43c3-93cf-e5fa041de256",
   "metadata": {},
   "outputs": [],
   "source": [
    "## sample the citations classified as patent, journal article and webpage to have a more balanced dataset.\n",
    "\n",
    "data_patents = data[data['category'] == 'PATENT'].sample(frac=0.12)\n",
    "data_articles = data[data['category'] == 'JOURNAL_ARTICLE'].sample(frac=0.4)\n",
    "data_webpage = data[data['category'] == 'WEBPAGE'].sample(frac=0.3)\n",
    "\n",
    "data_no_patents = data[(data['category'] != 'PATENT') & (data['category'] != 'JOURNAL_ARTICLE') & (data['category'] != 'WEBPAGE')]\n",
    "data = pd.concat([data_patents,data_articles,data_webpage,data_no_patents]).sample(frac=1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "id": "def53362-505f-4345-b9ef-eeaa683d2a0b",
   "metadata": {},
   "outputs": [],
   "source": [
    "## keep classes with more than 20 datapoints\n",
    "\n",
    "df = data.groupby('category').count()\n",
    "data = data[data['category'].isin(list(df[df['npl_biblio'] > 20].index))]\n",
    "data['labels'] = pd.factorize(data['category'], sort=True)[0]\n",
    "data = data.sample(frac=1)\n",
    "dic_labels = { elem[1] : elem[0] for elem in data[['category','labels']].drop_duplicates().to_numpy() } \n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "id": "f39daf5f-efcd-4577-9763-61317fc9fcc8",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "1908\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>npl_biblio</th>\n",
       "      <th>labels</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>category</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>BOOK</th>\n",
       "      <td>86</td>\n",
       "      <td>86</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>CONFERENCE_PROCEEDINGS</th>\n",
       "      <td>229</td>\n",
       "      <td>229</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>DATABASE</th>\n",
       "      <td>141</td>\n",
       "      <td>141</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>JOURNAL_ARTICLE</th>\n",
       "      <td>399</td>\n",
       "      <td>399</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>LITIGATION</th>\n",
       "      <td>34</td>\n",
       "      <td>34</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>NORM_STANDARD</th>\n",
       "      <td>51</td>\n",
       "      <td>51</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>OFFICE_ACTION</th>\n",
       "      <td>103</td>\n",
       "      <td>103</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>PATENT</th>\n",
       "      <td>347</td>\n",
       "      <td>347</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>PREPRINT/WORKING_PAPER/TECHNICAL_REPORT</th>\n",
       "      <td>101</td>\n",
       "      <td>101</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>PRODUCT_DOCUMENTATION</th>\n",
       "      <td>103</td>\n",
       "      <td>103</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>SEARCH_REPORT</th>\n",
       "      <td>94</td>\n",
       "      <td>94</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>THESIS</th>\n",
       "      <td>41</td>\n",
       "      <td>41</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>WEBPAGE</th>\n",
       "      <td>104</td>\n",
       "      <td>104</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>WIKI</th>\n",
       "      <td>75</td>\n",
       "      <td>75</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                         npl_biblio  labels\n",
       "category                                                   \n",
       "BOOK                                             86      86\n",
       "CONFERENCE_PROCEEDINGS                          229     229\n",
       "DATABASE                                        141     141\n",
       "JOURNAL_ARTICLE                                 399     399\n",
       "LITIGATION                                       34      34\n",
       "NORM_STANDARD                                    51      51\n",
       "OFFICE_ACTION                                   103     103\n",
       "PATENT                                          347     347\n",
       "PREPRINT/WORKING_PAPER/TECHNICAL_REPORT         101     101\n",
       "PRODUCT_DOCUMENTATION                           103     103\n",
       "SEARCH_REPORT                                    94      94\n",
       "THESIS                                           41      41\n",
       "WEBPAGE                                         104     104\n",
       "WIKI                                             75      75"
      ]
     },
     "execution_count": 22,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "## print dataset by classes\n",
    "\n",
    "print(len(data['category']))\n",
    "data.groupby('category').count()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2ebb5e95-c21b-4cfe-8db2-bee25a91e008",
   "metadata": {
    "jp-MarkdownHeadingCollapsed": true
   },
   "source": [
    "### Train model"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "ffb924a0-0fdc-4baf-96ef-0fe3788bfcf3",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/home/fs01/spec1142/anaconda3/envs/patents2/lib/python3.9/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n",
      "  from .autonotebook import tqdm as notebook_tqdm\n",
      "Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-multilingual-cased and are newly initialized: ['classifier.bias', 'classifier.weight']\n",
      "You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.\n",
      "/home/fs01/spec1142/anaconda3/envs/patents2/lib/python3.9/site-packages/transformers/optimization.py:411: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning\n",
      "  warnings.warn(\n",
      "Epoch 1: 100%|██████████████████████████████████| 24/24 [04:02<00:00, 10.08s/it]\n",
      "Validation - Epoch 1: 100%|███████████████████████| 6/6 [00:08<00:00,  1.40s/it]\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Epoch 1, Loss: 53.3975, Validation Accuracy: 0.5131\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Epoch 2: 100%|██████████████████████████████████| 24/24 [02:19<00:00,  5.83s/it]\n",
      "Validation - Epoch 2: 100%|███████████████████████| 6/6 [00:09<00:00,  1.57s/it]\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Epoch 2, Loss: 38.5001, Validation Accuracy: 0.6047\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Epoch 3: 100%|██████████████████████████████████| 24/24 [02:28<00:00,  6.19s/it]\n",
      "Validation - Epoch 3: 100%|███████████████████████| 6/6 [00:08<00:00,  1.38s/it]\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Epoch 3, Loss: 30.2190, Validation Accuracy: 0.6990\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Epoch 4: 100%|██████████████████████████████████| 24/24 [02:33<00:00,  6.38s/it]\n",
      "Validation - Epoch 4: 100%|███████████████████████| 6/6 [00:08<00:00,  1.36s/it]\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Epoch 4, Loss: 23.6015, Validation Accuracy: 0.7592\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Epoch 5: 100%|██████████████████████████████████| 24/24 [02:19<00:00,  5.83s/it]\n",
      "Validation - Epoch 5: 100%|███████████████████████| 6/6 [00:09<00:00,  1.61s/it]\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Epoch 5, Loss: 18.6547, Validation Accuracy: 0.7775\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Epoch 6: 100%|██████████████████████████████████| 24/24 [02:37<00:00,  6.58s/it]\n",
      "Validation - Epoch 6: 100%|███████████████████████| 6/6 [00:08<00:00,  1.38s/it]\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Epoch 6, Loss: 15.1288, Validation Accuracy: 0.7592\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Epoch 7: 100%|██████████████████████████████████| 24/24 [02:23<00:00,  5.97s/it]\n",
      "Validation - Epoch 7: 100%|███████████████████████| 6/6 [00:08<00:00,  1.48s/it]\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Epoch 7, Loss: 12.4134, Validation Accuracy: 0.7775\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Epoch 8: 100%|██████████████████████████████████| 24/24 [02:22<00:00,  5.93s/it]\n",
      "Validation - Epoch 8: 100%|███████████████████████| 6/6 [00:08<00:00,  1.45s/it]\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Epoch 8, Loss: 10.0054, Validation Accuracy: 0.7984\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "('fine_tuned_model/tokenizer_config.json',\n",
       " 'fine_tuned_model/special_tokens_map.json',\n",
       " 'fine_tuned_model/vocab.txt',\n",
       " 'fine_tuned_model/added_tokens.json')"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import torch\n",
    "from transformers import BertTokenizer, BertForSequenceClassification, AdamW\n",
    "from torch.utils.data import DataLoader, Dataset\n",
    "from sklearn.model_selection import train_test_split\n",
    "from sklearn.preprocessing import LabelEncoder\n",
    "from sklearn.metrics import accuracy_score\n",
    "from tqdm import tqdm\n",
    "\n",
    "# Dummy data (replace this with your dataset)\n",
    "\n",
    "texts = list(data['npl_biblio'])\n",
    "labels = list(data['labels'])\n",
    "\n",
    "# Encoding labels\n",
    "label_encoder = LabelEncoder()\n",
    "encoded_labels = label_encoder.fit_transform(labels)\n",
    "\n",
    "# Splitting the data\n",
    "train_texts, val_texts, train_labels, val_labels = train_test_split(\n",
    "    texts, encoded_labels, test_size=0.2, random_state=42\n",
    ")\n",
    "\n",
    "# Custom Dataset class\n",
    "class CustomDataset(Dataset):\n",
    "    def __init__(self, texts, labels, tokenizer, max_length=128):\n",
    "        self.texts = texts\n",
    "        self.labels = labels\n",
    "        self.tokenizer = tokenizer\n",
    "        self.max_length = max_length\n",
    "\n",
    "    def __len__(self):\n",
    "        return len(self.texts)\n",
    "\n",
    "    def __getitem__(self, idx):\n",
    "        text = str(self.texts[idx])\n",
    "        label = self.labels[idx]\n",
    "\n",
    "        encoding = self.tokenizer(\n",
    "            text,\n",
    "            truncation=True,\n",
    "            padding='max_length',\n",
    "            max_length=self.max_length,\n",
    "            return_tensors='pt',\n",
    "        )\n",
    "\n",
    "        return {\n",
    "            'input_ids': encoding['input_ids'].flatten(),\n",
    "            'attention_mask': encoding['attention_mask'].flatten(),\n",
    "            'labels': torch.tensor(label, dtype=torch.long)\n",
    "        }\n",
    "\n",
    "# Load BERT model and tokenizer\n",
    "model_name = 'bert-base-multilingual-cased'\n",
    "tokenizer = BertTokenizer.from_pretrained(model_name)\n",
    "model = BertForSequenceClassification.from_pretrained(model_name, num_labels=len(set(encoded_labels)))\n",
    "\n",
    "# Create Dataset instances\n",
    "train_dataset = CustomDataset(train_texts, train_labels, tokenizer)\n",
    "val_dataset = CustomDataset(val_texts, val_labels, tokenizer)\n",
    "\n",
    "# DataLoader instances\n",
    "train_dataloader = DataLoader(train_dataset, batch_size=64, shuffle=True)\n",
    "val_dataloader = DataLoader(val_dataset, batch_size=64, shuffle=False)\n",
    "\n",
    "# Set device\n",
    "device = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")\n",
    "model.to(device)\n",
    "\n",
    "# Optimizer and loss function\n",
    "optimizer = AdamW(model.parameters(), lr=2e-5)\n",
    "loss_fn = torch.nn.CrossEntropyLoss()\n",
    "\n",
    "# Training loop\n",
    "num_epochs = 8\n",
    "for epoch in range(num_epochs):\n",
    "    model.train()\n",
    "    total_loss = 0\n",
    "    for batch in tqdm(train_dataloader, desc=f\"Epoch {epoch + 1}\"):\n",
    "        optimizer.zero_grad()\n",
    "        input_ids = batch['input_ids'].to(device)\n",
    "        attention_mask = batch['attention_mask'].to(device)\n",
    "        labels = batch['labels'].to(device)\n",
    "        outputs = model(input_ids, attention_mask=attention_mask, labels=labels)\n",
    "        loss = outputs.loss\n",
    "        total_loss += loss.item()\n",
    "        loss.backward()\n",
    "        optimizer.step()\n",
    "\n",
    "    # Validation\n",
    "    model.eval()\n",
    "    val_predictions = []\n",
    "    val_true_labels = []\n",
    "    with torch.no_grad():\n",
    "        for batch in tqdm(val_dataloader, desc=f\"Validation - Epoch {epoch + 1}\"):\n",
    "            input_ids = batch['input_ids'].to(device)\n",
    "            attention_mask = batch['attention_mask'].to(device)\n",
    "            labels = batch['labels'].to(device)\n",
    "            outputs = model(input_ids, attention_mask=attention_mask)\n",
    "            logits = outputs.logits\n",
    "            predictions = torch.argmax(logits, dim=1)\n",
    "            val_predictions.extend(predictions.cpu().numpy())\n",
    "            val_true_labels.extend(labels.cpu().numpy())\n",
    "\n",
    "    # Calculate accuracy\n",
    "    accuracy = accuracy_score(val_true_labels, val_predictions)\n",
    "    print(f\"Epoch {epoch + 1}, Loss: {total_loss:.4f}, Validation Accuracy: {accuracy:.4f}\")\n",
    "\n",
    "# Save the fine-tuned model\n",
    "model.save_pretrained(\"fine_tuned_model\")\n",
    "tokenizer.save_pretrained(\"fine_tuned_model\")\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "78b87c8f-8135-4fff-b09a-80f2b156746b",
   "metadata": {
    "jp-MarkdownHeadingCollapsed": true
   },
   "source": [
    "### Evaluate the model"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "id": "06a7e3d9-a9f4-45e0-be31-627817ddf56c",
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>manual check</th>\n",
       "      <th>GPT4 check</th>\n",
       "      <th>GPT3.5 check</th>\n",
       "      <th>Bib subcategory</th>\n",
       "      <th>npl_biblio</th>\n",
       "      <th>md5</th>\n",
       "      <th>language_is_reliable</th>\n",
       "      <th>language_code</th>\n",
       "      <th>npl_cat</th>\n",
       "      <th>npl_cat_score</th>\n",
       "      <th>npl_cat_language_flag</th>\n",
       "      <th>patcit_id</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>y</td>\n",
       "      <td>y</td>\n",
       "      <td>y</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Watanabe Et Al Us Patent Application Publicat...</td>\n",
       "      <td>2ca504f11c3b378ce7be4619e2ee843f</td>\n",
       "      <td>True</td>\n",
       "      <td>en</td>\n",
       "      <td>PATENT</td>\n",
       "      <td>0.51</td>\n",
       "      <td>False</td>\n",
       "      <td>2ca504f11c3b378ce7be4619e2ee843f</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>y</td>\n",
       "      <td>y</td>\n",
       "      <td>y</td>\n",
       "      <td>JOURNAL ARTICLE</td>\n",
       "      <td>Dzulkafli Et Al., \"Effects Of Talc On Fire Re...</td>\n",
       "      <td>0685ae955c71d728f69046458ac1db0f</td>\n",
       "      <td>True</td>\n",
       "      <td>en</td>\n",
       "      <td>BIBLIOGRAPHICAL_REFERENCE</td>\n",
       "      <td>0.98</td>\n",
       "      <td>False</td>\n",
       "      <td>0685ae955c71d728f69046458ac1db0f</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>y</td>\n",
       "      <td>y</td>\n",
       "      <td>y</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Zdepski Pub No Us 2017-0201784\\n</td>\n",
       "      <td>c6de38a1aa0a879105ced194459f343e</td>\n",
       "      <td>True</td>\n",
       "      <td>en</td>\n",
       "      <td>PATENT</td>\n",
       "      <td>0.33</td>\n",
       "      <td>False</td>\n",
       "      <td>c6de38a1aa0a879105ced194459f343e</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>y</td>\n",
       "      <td>y</td>\n",
       "      <td>y</td>\n",
       "      <td>JOURNAL ARTICLE</td>\n",
       "      <td>Nobori Et Al. (Cancer Research, 1997, 51:3193...</td>\n",
       "      <td>29e27156420faa44ae01a9e1a6363781</td>\n",
       "      <td>True</td>\n",
       "      <td>en</td>\n",
       "      <td>BIBLIOGRAPHICAL_REFERENCE</td>\n",
       "      <td>0.60</td>\n",
       "      <td>False</td>\n",
       "      <td>29e27156420faa44ae01a9e1a6363781</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>y</td>\n",
       "      <td>y</td>\n",
       "      <td>y</td>\n",
       "      <td>JOURNAL ARTICLE</td>\n",
       "      <td>Fach Et Al, Neonatal Ovine Pulmonary Dendriti...</td>\n",
       "      <td>d2bea0db1c51b5ff13072c66202ba3fe</td>\n",
       "      <td>True</td>\n",
       "      <td>en</td>\n",
       "      <td>BIBLIOGRAPHICAL_REFERENCE</td>\n",
       "      <td>0.98</td>\n",
       "      <td>False</td>\n",
       "      <td>d2bea0db1c51b5ff13072c66202ba3fe</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>294</th>\n",
       "      <td>y</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Machine Translation Of Jp-2007224953 (Year: 2...</td>\n",
       "      <td>3e8f555055a762f69f1349d5df9887be</td>\n",
       "      <td>True</td>\n",
       "      <td>en</td>\n",
       "      <td>PATENT</td>\n",
       "      <td>0.34</td>\n",
       "      <td>False</td>\n",
       "      <td>3e8f555055a762f69f1349d5df9887be</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>295</th>\n",
       "      <td>y</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Ibm, Translucent Drag Icons (Tdb Acc. No. Nn9...</td>\n",
       "      <td>02bf7bfcd8b0c278b2f170a4863b1963</td>\n",
       "      <td>True</td>\n",
       "      <td>en</td>\n",
       "      <td>BIBLIOGRAPHICAL_REFERENCE</td>\n",
       "      <td>0.65</td>\n",
       "      <td>False</td>\n",
       "      <td>02bf7bfcd8b0c278b2f170a4863b1963</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>296</th>\n",
       "      <td>y</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>JOURNAL ARTICLE</td>\n",
       "      <td>Mastaloudis, A., Et Al., “Antioxidant Supplem...</td>\n",
       "      <td>e994b3f8571d466a279ba05b87eacf87</td>\n",
       "      <td>True</td>\n",
       "      <td>en</td>\n",
       "      <td>BIBLIOGRAPHICAL_REFERENCE</td>\n",
       "      <td>0.97</td>\n",
       "      <td>False</td>\n",
       "      <td>e994b3f8571d466a279ba05b87eacf87</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>297</th>\n",
       "      <td>y</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>CONFERENCE PROCEEDINGS</td>\n",
       "      <td>Chang Et Al., Motion Registration And Correct...</td>\n",
       "      <td>6e88f4dabe30e52db23a420f8203a433</td>\n",
       "      <td>True</td>\n",
       "      <td>en</td>\n",
       "      <td>BIBLIOGRAPHICAL_REFERENCE</td>\n",
       "      <td>0.97</td>\n",
       "      <td>False</td>\n",
       "      <td>6e88f4dabe30e52db23a420f8203a433</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>299</th>\n",
       "      <td>y</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>JOURNAL ARTICLE</td>\n",
       "      <td>Bevan Br. J. Pharmacol. (1992), Vol. 107, Pp....</td>\n",
       "      <td>891695ca1e05a56afc5987ea1ab7feb0</td>\n",
       "      <td>True</td>\n",
       "      <td>en</td>\n",
       "      <td>BIBLIOGRAPHICAL_REFERENCE</td>\n",
       "      <td>0.96</td>\n",
       "      <td>False</td>\n",
       "      <td>891695ca1e05a56afc5987ea1ab7feb0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>228 rows × 12 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "    manual check GPT4 check GPT3.5 check         Bib subcategory  \\\n",
       "1              y          y            y                     NaN   \n",
       "4              y          y            y         JOURNAL ARTICLE   \n",
       "5              y          y            y                     NaN   \n",
       "7              y          y            y         JOURNAL ARTICLE   \n",
       "9              y          y            y         JOURNAL ARTICLE   \n",
       "..           ...        ...          ...                     ...   \n",
       "294            y        NaN          NaN                     NaN   \n",
       "295            y        NaN          NaN                     NaN   \n",
       "296            y        NaN          NaN         JOURNAL ARTICLE   \n",
       "297            y        NaN          NaN  CONFERENCE PROCEEDINGS   \n",
       "299            y        NaN          NaN         JOURNAL ARTICLE   \n",
       "\n",
       "                                            npl_biblio  \\\n",
       "1     Watanabe Et Al Us Patent Application Publicat...   \n",
       "4     Dzulkafli Et Al., \"Effects Of Talc On Fire Re...   \n",
       "5                     Zdepski Pub No Us 2017-0201784\\n   \n",
       "7     Nobori Et Al. (Cancer Research, 1997, 51:3193...   \n",
       "9     Fach Et Al, Neonatal Ovine Pulmonary Dendriti...   \n",
       "..                                                 ...   \n",
       "294   Machine Translation Of Jp-2007224953 (Year: 2...   \n",
       "295   Ibm, Translucent Drag Icons (Tdb Acc. No. Nn9...   \n",
       "296   Mastaloudis, A., Et Al., “Antioxidant Supplem...   \n",
       "297   Chang Et Al., Motion Registration And Correct...   \n",
       "299   Bevan Br. J. Pharmacol. (1992), Vol. 107, Pp....   \n",
       "\n",
       "                                  md5  language_is_reliable language_code  \\\n",
       "1    2ca504f11c3b378ce7be4619e2ee843f                  True            en   \n",
       "4    0685ae955c71d728f69046458ac1db0f                  True            en   \n",
       "5    c6de38a1aa0a879105ced194459f343e                  True            en   \n",
       "7    29e27156420faa44ae01a9e1a6363781                  True            en   \n",
       "9    d2bea0db1c51b5ff13072c66202ba3fe                  True            en   \n",
       "..                                ...                   ...           ...   \n",
       "294  3e8f555055a762f69f1349d5df9887be                  True            en   \n",
       "295  02bf7bfcd8b0c278b2f170a4863b1963                  True            en   \n",
       "296  e994b3f8571d466a279ba05b87eacf87                  True            en   \n",
       "297  6e88f4dabe30e52db23a420f8203a433                  True            en   \n",
       "299  891695ca1e05a56afc5987ea1ab7feb0                  True            en   \n",
       "\n",
       "                       npl_cat  npl_cat_score  npl_cat_language_flag  \\\n",
       "1                       PATENT           0.51                  False   \n",
       "4    BIBLIOGRAPHICAL_REFERENCE           0.98                  False   \n",
       "5                       PATENT           0.33                  False   \n",
       "7    BIBLIOGRAPHICAL_REFERENCE           0.60                  False   \n",
       "9    BIBLIOGRAPHICAL_REFERENCE           0.98                  False   \n",
       "..                         ...            ...                    ...   \n",
       "294                     PATENT           0.34                  False   \n",
       "295  BIBLIOGRAPHICAL_REFERENCE           0.65                  False   \n",
       "296  BIBLIOGRAPHICAL_REFERENCE           0.97                  False   \n",
       "297  BIBLIOGRAPHICAL_REFERENCE           0.97                  False   \n",
       "299  BIBLIOGRAPHICAL_REFERENCE           0.96                  False   \n",
       "\n",
       "                            patcit_id  \n",
       "1    2ca504f11c3b378ce7be4619e2ee843f  \n",
       "4    0685ae955c71d728f69046458ac1db0f  \n",
       "5    c6de38a1aa0a879105ced194459f343e  \n",
       "7    29e27156420faa44ae01a9e1a6363781  \n",
       "9    d2bea0db1c51b5ff13072c66202ba3fe  \n",
       "..                                ...  \n",
       "294  3e8f555055a762f69f1349d5df9887be  \n",
       "295  02bf7bfcd8b0c278b2f170a4863b1963  \n",
       "296  e994b3f8571d466a279ba05b87eacf87  \n",
       "297  6e88f4dabe30e52db23a420f8203a433  \n",
       "299  891695ca1e05a56afc5987ea1ab7feb0  \n",
       "\n",
       "[228 rows x 12 columns]"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "## load sample of ~300 citations manually classified by Kyle \n",
    "\n",
    "data1 = pd.read_excel('/home/fs01/spec1142/Emma/test/test_files/Copy of oa_300_sample_checked_kyle.xlsx')\n",
    "data1 = data1[data1['manual check'] == 'y']\n",
    "data1"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "755539d6-76bc-4fb0-ab4b-5d90c3afc009",
   "metadata": {},
   "outputs": [],
   "source": [
    "## clean the labels \n",
    "\n",
    "data1['category'] = [ elem[0] if pd.isna(elem[0]) == False else elem[1] for elem in data1[['Bib subcategory','npl_cat']].to_numpy()]\n",
    "data1 = data1[ ( data1['category'] != 'BIBLIOGRAPHICAL_REFERENCE' ) &  ( data1['category'] != '?' ) ]\n",
    "data1['category'] = data1['category'].replace('JOURNAL ARTICLE', 'JOURNAL_ARTICLE')\n",
    "data1['category'] = data1['category'].replace('CONFERENCE PROCEEDINGS', 'CONFERENCE_PROCEEDINGS')\n",
    "data1['category'] = data1['category'].replace('PREPRINT/WORKING PAPER/TECHNICAL REPORT', 'PREPRINT/WORKING_PAPER/TECHNICAL_REPORT')\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "id": "e9414d15-ec71-475a-8f65-7118af5d6928",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "2024-05-20 18:07:01.923779: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered\n",
      "2024-05-20 18:07:01.923908: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered\n",
      "2024-05-20 18:07:02.599782: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered\n",
      "2024-05-20 18:07:03.449559: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.\n",
      "To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.\n",
      "2024-05-20 18:07:34.661954: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT\n"
     ]
    }
   ],
   "source": [
    "## load our own model \n",
    "\n",
    "from transformers import TextClassificationPipeline\n",
    "\n",
    "model_name = 'fine_tuned_model'\n",
    "tokenizer = BertTokenizer.from_pretrained(model_name)\n",
    "model = BertForSequenceClassification.from_pretrained(model_name, num_labels=len(set(encoded_labels)))\n",
    "\n",
    "pipe2 = TextClassificationPipeline(model=model, tokenizer=tokenizer, return_all_scores=False)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "id": "5aebc9ae-7b9e-4498-9dc7-76a292824fdd",
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "## classify the testing set with our model\n",
    "\n",
    "test = data1[data1['category'].isin(dic_labels.values())][['npl_biblio','category']].to_numpy()\n",
    "\n",
    "start = time.time()\n",
    "pred_label = []\n",
    "true_label = []\n",
    "pred_label_raw = pipe2(list(test[:,0]), batch_size = 8)\n",
    "\n",
    "for k in tqdm(range(len(test))):\n",
    "    pred_label.append(dic_labels[int(pred_label_raw[k]['label'][6:])])\n",
    "    true_label.append(test[k][1])\n",
    "    \n",
    "\n",
    "end = time.time()\n",
    "print(end - start)\n",
    "\n",
    "labels = list(set(true_label))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "id": "6a3fc670-904f-45cd-8f42-d57b134f5126",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Overall accuracy:  83.02752293577981\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>labels</th>\n",
       "      <th>accuracy</th>\n",
       "      <th>TPR</th>\n",
       "      <th>FPR</th>\n",
       "      <th>true</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>WEBPAGE</td>\n",
       "      <td>0.940367</td>\n",
       "      <td>0.777778</td>\n",
       "      <td>0.052632</td>\n",
       "      <td>9</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>JOURNAL_ARTICLE</td>\n",
       "      <td>0.949541</td>\n",
       "      <td>0.905172</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>116</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>THESIS</td>\n",
       "      <td>0.995413</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>PRODUCT_DOCUMENTATION</td>\n",
       "      <td>0.954128</td>\n",
       "      <td>0.533333</td>\n",
       "      <td>0.014778</td>\n",
       "      <td>15</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>SEARCH_REPORT</td>\n",
       "      <td>0.995413</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>0.004608</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>BOOK</td>\n",
       "      <td>0.990826</td>\n",
       "      <td>0.833333</td>\n",
       "      <td>0.004717</td>\n",
       "      <td>6</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>DATABASE</td>\n",
       "      <td>0.995413</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>0.004630</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>NORM_STANDARD</td>\n",
       "      <td>0.995413</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>PREPRINT/WORKING_PAPER/TECHNICAL_REPORT</td>\n",
       "      <td>0.972477</td>\n",
       "      <td>0.500000</td>\n",
       "      <td>0.023148</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>OFFICE_ACTION</td>\n",
       "      <td>0.995413</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10</th>\n",
       "      <td>PATENT</td>\n",
       "      <td>0.940367</td>\n",
       "      <td>0.760870</td>\n",
       "      <td>0.011628</td>\n",
       "      <td>46</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>11</th>\n",
       "      <td>WIKI</td>\n",
       "      <td>0.990826</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>0.009259</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>12</th>\n",
       "      <td>CONFERENCE_PROCEEDINGS</td>\n",
       "      <td>0.944954</td>\n",
       "      <td>0.937500</td>\n",
       "      <td>0.054455</td>\n",
       "      <td>16</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                     labels  accuracy       TPR       FPR  \\\n",
       "0                                   WEBPAGE  0.940367  0.777778  0.052632   \n",
       "1                           JOURNAL_ARTICLE  0.949541  0.905172  0.000000   \n",
       "2                                    THESIS  0.995413  0.000000  0.000000   \n",
       "3                     PRODUCT_DOCUMENTATION  0.954128  0.533333  0.014778   \n",
       "4                             SEARCH_REPORT  0.995413  1.000000  0.004608   \n",
       "5                                      BOOK  0.990826  0.833333  0.004717   \n",
       "6                                  DATABASE  0.995413  1.000000  0.004630   \n",
       "7                             NORM_STANDARD  0.995413  0.000000  0.000000   \n",
       "8   PREPRINT/WORKING_PAPER/TECHNICAL_REPORT  0.972477  0.500000  0.023148   \n",
       "9                             OFFICE_ACTION  0.995413  0.000000  0.000000   \n",
       "10                                   PATENT  0.940367  0.760870  0.011628   \n",
       "11                                     WIKI  0.990826  1.000000  0.009259   \n",
       "12                   CONFERENCE_PROCEEDINGS  0.944954  0.937500  0.054455   \n",
       "\n",
       "    true  \n",
       "0      9  \n",
       "1    116  \n",
       "2      1  \n",
       "3     15  \n",
       "4      1  \n",
       "5      6  \n",
       "6      2  \n",
       "7      1  \n",
       "8      2  \n",
       "9      1  \n",
       "10    46  \n",
       "11     2  \n",
       "12    16  "
      ]
     },
     "execution_count": 36,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "## evaluate the model \n",
    "\n",
    "conf_matrix = multilabel_confusion_matrix(true_label , pred_label,labels=labels)#, labels=labels)\n",
    "\n",
    "df_metrics = pd.DataFrame()\n",
    "\n",
    "\n",
    "missclassifications_rate = 100*sum( [ conf_matrix[k][0][1] for k in range(len(conf_matrix)) ]) / len(pred_label)\n",
    "print('Overall accuracy: ', 100 - missclassifications_rate)\n",
    "\n",
    "\n",
    "acc = []\n",
    "tpr = []\n",
    "fpr = []\n",
    "count_true = [] \n",
    "for elem in conf_matrix:\n",
    "    acc.append(accuracy(elem))\n",
    "    tpr.append(TPR(elem))\n",
    "    fpr.append(FPR(elem))\n",
    "    count_true.append(elem[1][1] + elem[1][0])\n",
    "\n",
    "\n",
    "df_metrics['labels'] = [ k for k in labels]\n",
    "\n",
    "df_metrics['accuracy'] = acc\n",
    "df_metrics['TPR'] = tpr\n",
    "df_metrics['FPR'] = fpr\n",
    "df_metrics['true'] = count_true\n",
    "\n",
    "df_metrics\n",
    "    "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2604672e-df2f-4674-9fcf-b5ee21a37352",
   "metadata": {
    "jp-MarkdownHeadingCollapsed": true
   },
   "source": [
    "## Classify the citations with the model"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c2e29cc0-da34-4274-b700-a5236e427284",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Load BERT model and tokenizer\n",
    "\n",
    "from transformers import BertTokenizer, BertForSequenceClassification, AdamW\n",
    "from transformers import TextClassificationPipeline\n",
    "\n",
    "\n",
    "## load classes names \n",
    "dic_labels = {7: 'PATENT',\n",
    " 4: 'LITIGATION',\n",
    " 2: 'DATABASE',\n",
    " 13: 'WIKI',\n",
    " 12: 'WEBPAGE',\n",
    " 1: 'CONFERENCE_PROCEEDINGS',\n",
    " 3: 'JOURNAL_ARTICLE',\n",
    " 6: 'OFFICE_ACTION',\n",
    " 10: 'SEARCH_REPORT',\n",
    " 9: 'PRODUCT_DOCUMENTATION',\n",
    " 5: 'NORM_STANDARD',\n",
    " 0: 'BOOK',\n",
    " 8: 'PREPRINT/WORKING_PAPER/TECHNICAL_REPORT',\n",
    " 11: 'THESIS'}\n",
    "\n",
    "encoded_labels = list(encoded_labels.values())\n",
    "\n",
    "\n",
    "## load classification model \n",
    "model_name = 'fine_tuned_model'\n",
    "tokenizer = BertTokenizer.from_pretrained(model_name)\n",
    "model = BertForSequenceClassification.from_pretrained(model_name, num_labels=len(set(encoded_labels)))\n",
    "\n",
    "pipe = TextClassificationPipeline(model=model, tokenizer=tokenizer, return_all_scores=False,truncation=True)\n",
    "\n",
    "\n",
    "\n",
    "## classify the citations and save the classes labels \n",
    "files = glob.glob(path_base + 'oa_data_v1/*')\n",
    "\n",
    "\n",
    "for k in range(162,len(files)):\n",
    "\n",
    "    ## load citations\n",
    "    dic_result = {}\n",
    "    count = 0\n",
    "    file= files[k]\n",
    "    \n",
    "    with open(file) as lines:\n",
    "        for line_ in lines: \n",
    "            \n",
    "            dic_result[count] = line_.replace('\\n','')\n",
    "            count += 1\n",
    "\n",
    "    print(len(dic_result))\n",
    "    \n",
    "    df = pd.DataFrame()\n",
    "    df['oa_citation'] = dic_result.values()\n",
    "    \n",
    "    ## classify citations\n",
    "    start = time.time()\n",
    "    result = pipe(list(df['oa_citation']), batch_size = 128)\n",
    "    end = time.time()\n",
    "    print(end - start)\n",
    "    \n",
    "    list_pred = [] \n",
    "    for elem in result:\n",
    "        list_pred.append(dic_labels[int(elem['label'][6:])])\n",
    "    \n",
    "    \n",
    "    \n",
    "    ## save classified citations\n",
    "    df['label'] = list_pred\n",
    "    df.to_csv(path_base + 'oa_data_v1_classified/' + file.split('/')[-1].split('.')[0] + '.tsv', sep = \"\\t\", index = False)\n",
    "    \n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1a0da027-0880-49a3-a98d-ad99659b8440",
   "metadata": {
    "jp-MarkdownHeadingCollapsed": true
   },
   "source": [
    "## Sample classified citations"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 39,
   "id": "5d1725bf-f0e4-4b93-82ec-3134f5749db3",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>oa_citation</th>\n",
       "      <th>label</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>An English Machine Translation Of Александр Ви...</td>\n",
       "      <td>DATABASE</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Dash And Konkimalla, Poly-Є-Caprolactone Based...</td>\n",
       "      <td>JOURNAL_ARTICLE</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>2014045792 Wo A1 淳</td>\n",
       "      <td>PATENT</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>А. А. Королев, Office Action For Russian Pate...</td>\n",
       "      <td>OFFICE_ACTION</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Xu Cn 104741552A, Cited In Ids Filed 6/29/18</td>\n",
       "      <td>PATENT</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>Legagneur Et Al \"Limbo3 (M = Mn, Fe, Co): Syn...</td>\n",
       "      <td>JOURNAL_ARTICLE</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>Wang Cn 1037663314</td>\n",
       "      <td>PATENT</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>Haeley Wo 02/41801</td>\n",
       "      <td>PATENT</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>Skoglund Wo 2010/027317</td>\n",
       "      <td>PATENT</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>Li Us Patent No 6,719,697</td>\n",
       "      <td>PATENT</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10</th>\n",
       "      <td>Us 0101398 A</td>\n",
       "      <td>PATENT</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>11</th>\n",
       "      <td>Olivieri Ep 1257118</td>\n",
       "      <td>PATENT</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>12</th>\n",
       "      <td>Olowinsky Et Al Us Patent Application Publica...</td>\n",
       "      <td>PATENT</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>13</th>\n",
       "      <td>2003-249598 Jp A</td>\n",
       "      <td>PATENT</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>14</th>\n",
       "      <td>Therapeutic Potential Of Natural Killer Cells...</td>\n",
       "      <td>BOOK</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>15</th>\n",
       "      <td>Sato Jp H11330112 With English Machine Transl...</td>\n",
       "      <td>PATENT</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>16</th>\n",
       "      <td>M. Caissy Et Al., Coming Soon: The Internatio...</td>\n",
       "      <td>CONFERENCE_PROCEEDINGS</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>17</th>\n",
       "      <td>Sugimoto ` 612</td>\n",
       "      <td>CONFERENCE_PROCEEDINGS</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>18</th>\n",
       "      <td>Pentair, Fairbanks Nijhuis, Vertical Turbine ...</td>\n",
       "      <td>PRODUCT_DOCUMENTATION</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>19</th>\n",
       "      <td>Liu Et Al Jp 57-023006</td>\n",
       "      <td>PATENT</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                          oa_citation                   label\n",
       "0   An English Machine Translation Of Александр Ви...                DATABASE\n",
       "1   Dash And Konkimalla, Poly-Є-Caprolactone Based...         JOURNAL_ARTICLE\n",
       "2                                  2014045792 Wo A1 淳                  PATENT\n",
       "3    А. А. Королев, Office Action For Russian Pate...           OFFICE_ACTION\n",
       "4        Xu Cn 104741552A, Cited In Ids Filed 6/29/18                  PATENT\n",
       "5    Legagneur Et Al \"Limbo3 (M = Mn, Fe, Co): Syn...         JOURNAL_ARTICLE\n",
       "6                                  Wang Cn 1037663314                  PATENT\n",
       "7                                  Haeley Wo 02/41801                  PATENT\n",
       "8                             Skoglund Wo 2010/027317                  PATENT\n",
       "9                           Li Us Patent No 6,719,697                  PATENT\n",
       "10                                       Us 0101398 A                  PATENT\n",
       "11                                Olivieri Ep 1257118                  PATENT\n",
       "12   Olowinsky Et Al Us Patent Application Publica...                  PATENT\n",
       "13                                   2003-249598 Jp A                  PATENT\n",
       "14   Therapeutic Potential Of Natural Killer Cells...                    BOOK\n",
       "15   Sato Jp H11330112 With English Machine Transl...                  PATENT\n",
       "16   M. Caissy Et Al., Coming Soon: The Internatio...  CONFERENCE_PROCEEDINGS\n",
       "17                                     Sugimoto ` 612  CONFERENCE_PROCEEDINGS\n",
       "18   Pentair, Fairbanks Nijhuis, Vertical Turbine ...   PRODUCT_DOCUMENTATION\n",
       "19                             Liu Et Al Jp 57-023006                  PATENT"
      ]
     },
     "execution_count": 39,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "## load classified citations\n",
    "\n",
    "table = pd.read_csv(path_base + \"classificed_oa_data_v1.tsv\", delimiter = \"\\t\")\n",
    "table.head(20)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "7977d705-f669-42ce-becd-d484806eaaa6",
   "metadata": {},
   "outputs": [],
   "source": [
    "## save a sample of citations frm each class\n",
    "\n",
    "labels = list(set(list(table['label'])))\n",
    "for label in labels:\n",
    "    sm_table = table[table['label'] == label].sample(n=30)\n",
    "    label = label.replace('/','')\n",
    "    sm_table.to_excel(path_base + 'test_files/oa_v2/sample_30_cat_' + label + '.xlsx')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "id": "e60c0822-9b90-4af5-9451-ba2eebfd151a",
   "metadata": {},
   "outputs": [],
   "source": [
    "## save a random sample of 300 citations\n",
    "\n",
    "sm_table = table.sample(n=300)\n",
    "sm_table.to_excel(path_base + 'test_files/oa_v2/sample_300_all_cat.xlsx')"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
