Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (10.3%) to scientific vocabulary

Repository

Basic Info
  • Host: GitHub
  • Owner: Manojh23
  • Language: Jupyter Notebook
  • Default Branch: main
  • Size: 50.8 KB
Statistics
  • Stars: 0
  • Watchers: 1
  • Forks: 0
  • Open Issues: 0
  • Releases: 0
Created over 1 year ago · Last pushed 10 months ago
Metadata Files
Readme Citation

README.md

AnomCite

This repository contains two primary Jupyter Notebooks for analyzing citations in research papers:

  1. citationcontext.ipynb

    • Description: Extracts citation contexts from PDF files and stores the results in a CSV file. For each reference, the CSV records how many times it is cited, all of its citation contexts, the title of the cited paper, and its author names.
  2. detection.ipynb

    • Description: Classifies each self-citation as essential or non-essential based on its citation context and summaries of the paper. This notebook requires your own OpenAI API key.
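The citation-context extraction described for citationcontext.ipynb can be sketched in a few lines. This is a minimal illustration, not the notebook's code: it assumes numeric bracketed citations such as [3] or [2-4,6], uses a naive regex sentence splitter in place of NLTK's sent_tokenize, and the names expand_citation, citation_contexts, and window are hypothetical.

```python
import re
from collections import defaultdict

# Matches bracketed numeric citations such as [3], [2, 5] or [8-12].
CITATION = re.compile(r"\[(\d+(?:\s*[,\-]\s*\d+)*)\]")


def expand_citation(group):
    """Expand a citation group like '2-4,6' into ['2', '3', '4', '6']."""
    numbers = []
    for part in group.replace(" ", "").split(","):
        range_match = re.fullmatch(r"(\d+)-(\d+)", part)
        if range_match:
            start, end = int(range_match.group(1)), int(range_match.group(2))
            numbers.extend(str(n) for n in range(start, end + 1))
        elif part.isdigit():
            numbers.append(part)
    return numbers


def citation_contexts(text, window=1):
    """Map each cited reference number to its count and surrounding sentences."""
    # Naive sentence splitter (assumption); the notebook uses NLTK instead.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    result = defaultdict(lambda: {"count": 0, "contexts": []})
    for i, sentence in enumerate(sentences):
        for group in CITATION.findall(sentence):
            context = " ".join(sentences[max(0, i - window): i + window + 1])
            for number in expand_citation(group):
                result[number]["count"] += 1
                if context not in result[number]["contexts"]:
                    result[number]["contexts"].append(context)
    return dict(result)
```

Calling citation_contexts(text) on the body of a paper returns a dictionary keyed by reference number; a larger window keeps more neighbouring sentences around each citing sentence.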

Dataset Availability

  • The dataset required for this project has not been uploaded to the repository due to its large size. To obtain the dataset, please email me at hmanojnarayan@gmail.com.

How to Use

Environment Setup

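  • Ensure all required libraries are installed. citationcontext.ipynb imports PyMuPDF (fitz), requests, pandas, nltk, and google-generativeai (google.generativeai).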
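One way to check the environment before opening the notebooks is to test whether the modules can be located. The sketch below is illustrative only: the module list is taken from the import cell of citationcontext.ipynb, the helper names is_available and missing_modules are hypothetical, and detection.ipynb may need additional packages that are not listed here.

```python
import importlib.util


def is_available(name):
    """Return True if the module `name` can be located without importing it."""
    try:
        return importlib.util.find_spec(name) is not None
    except (ImportError, ValueError):
        # Raised when a parent package (e.g. 'google') is itself missing.
        return False


def missing_modules(names):
    """Return the module names from `names` that cannot be located."""
    return [name for name in names if not is_available(name)]


# Module names from the import cell of citationcontext.ipynb.
REQUIRED = ["fitz", "requests", "pandas", "nltk", "google.generativeai"]

if __name__ == "__main__":
    for name in missing_modules(REQUIRED):
        print(f"Missing module: {name}")
```

Note that the module name and the package name can differ: fitz is provided by the PyMuPDF package, and google.generativeai by google-generativeai.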

Running the Notebooks

  1. citationcontext.ipynb

    • Purpose: Extracts citation contexts from PDF files.
    • Instructions:
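      • Open the citationcontext.ipynb notebook in Jupyter Notebook or JupyterLab.
      • In the first cell, set genai_api_key (a Google Gemini API key), input_dir, output_dir, and pdf_file (set pdf_file to None to process every PDF in input_dir).
      • Execute the notebook cells sequentially to extract citation contexts. The notebook version included here writes one CSV per processed PDF, named after the PDF, into output_dir.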
  2. detection.ipynb

    • Purpose: Detects anomalous self-citations based on extracted contexts and paper summaries.
    • Instructions:
      • Open the detection.ipynb notebook in Jupyter Notebook or JupyterLab.
      • Insert your OpenAI API key in the designated section of the notebook.
      • Execute the notebook cells sequentially to analyze and detect anomalous self-citations.

    Note: Replace the placeholder for the OpenAI API key with your actual key to enable the anomaly detection functionality.
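Instead of pasting the key into the notebook, it can be read from an environment variable so that it is never saved with the file. This is a sketch under assumptions: the helper load_api_key is hypothetical, and OPENAI_API_KEY is an assumed variable name rather than something the notebook is known to read.

```python
import os


def load_api_key(env_var="OPENAI_API_KEY"):
    """Return the API key stored in `env_var`; raise if it is unset or empty."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(
            f"Set the {env_var} environment variable before running the notebook."
        )
    return key


# Usage inside the notebook (hypothetical):
#     api_key = load_api_key()
```

The same pattern would apply to the Gemini key used by citationcontext.ipynb, with a different variable name.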

Results

  • citationcontext.ipynb:

    • Outputs a CSV file containing citation contexts, the number of citations per reference, paper titles, and author names. The notebook version included here names each CSV after the processed PDF rather than citation_contexts.csv.
  • detection.ipynb:

    • Outputs an analysis indicating whether specific self-citations are essential or non-essential based on the provided contexts and summaries.
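The CSV produced by citationcontext.ipynb can be inspected with the standard library. The sketch below ranks references by citation count; the column names come from the notebook's save_citation_contexts_to_csv function, while the helper top_cited and the sample rows are made up for illustration.

```python
import csv
import io


def top_cited(csv_file, n=3):
    """Return (reference_id, title, count) for the n most-cited references."""
    rows = list(csv.DictReader(csv_file))
    rows.sort(key=lambda row: int(row["Citation Count"]), reverse=True)
    return [
        (row["Reference ID"], row["Title"], int(row["Citation Count"]))
        for row in rows[:n]
    ]


# Made-up rows that use the column names the notebook writes.
SAMPLE = io.StringIO(
    "Reference ID,Title,Authors,Citation Count,Citation Contexts\n"
    "1,Paper A,Author One,2,context text\n"
    "2,Paper B,Author Two; Author Three,5,context text\n"
    "3,Paper C,Author Four,0,No citation contexts found.\n"
)

print(top_cited(SAMPLE, n=2))
```

For a real output file, pass open(path, newline="", encoding="utf-8") in place of the in-memory sample.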

Contact

For any queries or to request access to the dataset, please reach out to me at:

Email: hmanojnarayan@gmail.com

Owner

  • Name: Manoj h
  • Login: Manojh23
  • Kind: user
  • Location: BANGALORE

Citation (citationcontext.ipynb)

{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "e:\\os\\envs\\project\\lib\\site-packages\\tqdm\\auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n",
      "  from .autonotebook import tqdm as notebook_tqdm\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Using Gemini to get titles and authors\n",
      "OK, citation context extraction is done\n",
      "Mapping and storing done\n",
      "Citation contexts were successfully extracted for all processed PDFs.\n"
     ]
    }
   ],
   "source": [
    "import re\n",
    "import fitz  # PyMuPDF\n",
    "import requests\n",
    "import os\n",
    "import xml.etree.ElementTree as ET\n",
    "from collections import defaultdict\n",
    "import pandas as pd\n",
    "import nltk\n",
    "from nltk.tokenize import sent_tokenize\n",
    "import unicodedata\n",
    "import google.generativeai as genai\n",
    "import time\n",
    "import json\n",
    "\n",
    "# Ensure NLTK's punkt tokenizer is available\n",
    "try:\n",
    "    nltk.data.find('tokenizers/punkt')\n",
    "except LookupError:\n",
    "    nltk.download('punkt')\n",
    "\n",
    "\n",
    "# Set your GenAI API key\n",
    "genai_api_key = 'XYZ'  # Replace 'YOUR_API_KEY' with your actual API key\n",
    "genai.configure(api_key=genai_api_key)\n",
    "\n",
    "# Set the path to the input directory containing PDFs\n",
    "input_dir = r\"E:\\capstone\\ParsCit-master\\pp\\numbered\"\n",
    "\n",
    "# Set the path to the output directory for CSV files\n",
    "output_dir = r\"E:\\capstone\\ParsCit-master\\pp\\lebron20\"\n",
    "\n",
    "# Set the path to a specific PDF file to process, or set to None\n",
    "pdf_file = r\"E:\\capstone\\ParsCit-master\\pp\\pdf_numbered\\10.1002@jgt.20600.pdf\"  # Set to None to process all PDFs in the input_dir\n",
    "# pdf_file = None  # Uncomment and set to None to process all PDFs in the input_dir\n",
    "\n",
    "\n",
    "def read_pdf_file(file_path):\n",
    "    \"\"\"\n",
    "    Reads a PDF file and extracts its text content using PyMuPDF.\n",
    "\n",
    "    Parameters:\n",
    "        file_path (str): Path to the input PDF file.\n",
    "\n",
    "    Returns:\n",
    "        str: Extracted text from the PDF.\n",
    "\n",
    "    Raises:\n",
    "        Exception: If there is an error reading the PDF file.\n",
    "    \"\"\"\n",
    "    try:\n",
    "        with fitz.open(file_path) as doc:\n",
    "            text = \"\"\n",
    "            for page in doc:\n",
    "                page_text = page.get_text()\n",
    "                if page_text:\n",
    "                    text += page_text + \"\\n\"\n",
    "        return text\n",
    "    except Exception as e:\n",
    "        raise Exception(f\"Error reading PDF file: {e}\")\n",
    "\n",
    "\n",
    "def pre_process_text(text):\n",
    "    \"\"\"\n",
    "    Preprocesses the extracted text by removing hyphenations, extra spaces,\n",
    "    correcting common OCR errors, and trimming whitespace.\n",
    "    Also normalizes dash characters to standard hyphen '-'.\n",
    "\n",
    "    Parameters:\n",
    "        text (str): Raw text extracted from the PDF.\n",
    "\n",
    "    Returns:\n",
    "        str: Cleaned and preprocessed text.\n",
    "    \"\"\"\n",
    "    # Remove hyphenation at line breaks\n",
    "    text = re.sub(r'-\\s*\\n\\s*', '', text)\n",
    "    # Normalize various dash characters to standard hyphen\n",
    "    text = re.sub(r'[–—−‑‒―]', '-', text)\n",
    "    # Replace multiple spaces and tabs with a single space\n",
    "    text = re.sub(r'[ \\t]+', ' ', text)\n",
    "    # Replace multiple newlines with double newline to preserve paragraphs\n",
    "    text = re.sub(r'\\n{2,}', '\\n\\n', text)\n",
    "    # Correct common OCR errors (e.g., 'l' misread for '1')\n",
    "    text = re.sub(r'\\[l\\]', '[1]', text, flags=re.IGNORECASE)\n",
    "    text = re.sub(r'\\[I\\]', '[1]', text, flags=re.IGNORECASE)\n",
    "    # Normalize Unicode characters\n",
    "    text = unicodedata.normalize('NFKC', text)\n",
    "    # Strip leading and trailing whitespace\n",
    "    text = text.strip()\n",
    "    return text\n",
    "\n",
    "\n",
    "def find_reference_section(text):\n",
    "    \"\"\"\n",
    "    Identifies the references section in the text based on common headings.\n",
    "    If not found, assumes references start after 70% of the text.\n",
    "    Returns a tuple of (reference_section, main_text).\n",
    "\n",
    "    Parameters:\n",
    "        text (str): The preprocessed text extracted from the PDF.\n",
    "\n",
    "    Returns:\n",
    "        tuple: (reference_section (str), main_text (str))\n",
    "    \"\"\"\n",
    "    # Patterns to detect the \"References\" heading\n",
    "    reference_section_patterns = [\n",
    "        r'(?i)^\\s*References\\s*$',\n",
    "        r'(?i)^\\s*Bibliography\\s*$',\n",
    "        r'(?i)^\\s*Works Cited\\s*$',\n",
    "        r'(?i)^\\s*Literature Cited\\s*$',\n",
    "    ]\n",
    "    ref_start = None\n",
    "    for pattern in reference_section_patterns:\n",
    "        match = re.search(pattern, text, re.MULTILINE)\n",
    "        if match:\n",
    "            ref_start = match.end()\n",
    "            break\n",
    "    if ref_start is None:\n",
    "        # Assume references start after 70% of the text\n",
    "        ref_start = int(len(text) * 0.7)\n",
    "    # Reference section is from ref_start to the end\n",
    "    reference_section = text[ref_start:].strip()\n",
    "    main_text = text[:ref_start].strip()\n",
    "    return reference_section, main_text\n",
    "\n",
    "\n",
    "def detect_reference_style(reference_section):\n",
    "    \"\"\"\n",
    "    Detects the style of references used in the reference section.\n",
    "    Returns one of 'numbered_brackets', 'numbered', or 'unknown'.\n",
    "\n",
    "    Parameters:\n",
    "        reference_section (str): The extracted references section from the text.\n",
    "\n",
    "    Returns:\n",
    "        str: The detected reference style.\n",
    "    \"\"\"\n",
    "    # Check for numbered references with brackets [1]\n",
    "    if re.search(r'(?m)^\\s*\\[\\d+\\]', reference_section):\n",
    "        return 'numbered_brackets'\n",
    "    # Check for numbered references without brackets 1. or 1)\n",
    "    elif re.search(r'(?m)^\\s*\\d+[\\.\\)]\\s+', reference_section):\n",
    "        return 'numbered'\n",
    "    else:\n",
    "        return 'unknown'\n",
    "\n",
    "\n",
    "def segment_references_numbered_brackets(reference_section):\n",
    "    \"\"\"\n",
    "    Segments references that start with [number].\n",
    "    Returns a list of tuples (number, reference).\n",
    "\n",
    "    Parameters:\n",
    "        reference_section (str): The extracted references section from the text.\n",
    "\n",
    "    Returns:\n",
    "        list of tuple: List containing tuples of (reference_number, reference_text).\n",
    "    \"\"\"\n",
    "    references = []\n",
    "    # Correct common OCR errors in reference numbers\n",
    "    reference_section = re.sub(r'\\[l\\]', '[1]', reference_section, flags=re.IGNORECASE)\n",
    "    reference_section = re.sub(r'\\[I\\]', '[1]', reference_section, flags=re.IGNORECASE)\n",
    "    # Split references based on [number]\n",
    "    split_refs = re.split(r'\\[\\d+\\]', reference_section)\n",
    "    numbers = re.findall(r'\\[(\\d+)\\]', reference_section)\n",
    "    for num, ref in zip(numbers, split_refs[1:]):  # first split_refs[0] is before first [number]\n",
    "        ref = ref.replace('\\n', ' ').strip()\n",
    "        references.append((num, ref))\n",
    "    return references\n",
    "\n",
    "\n",
    "def segment_references_numbered(reference_section):\n",
    "    \"\"\"\n",
    "    Segments references that start with number. or number)\n",
    "    Returns a list of tuples (number, reference).\n",
    "\n",
    "    Parameters:\n",
    "        reference_section (str): The extracted references section from the text.\n",
    "\n",
    "    Returns:\n",
    "        list of tuple: List containing tuples of (reference_number, reference_text).\n",
    "    \"\"\"\n",
    "    references = []\n",
    "    # Split references based on numbers followed by dot or parenthesis\n",
    "    pattern = r'(?m)^\\s*(\\d+)[\\.\\)]\\s+(.+?)(?=^\\s*\\d+[\\.\\)]\\s+|\\Z)'\n",
    "    matches = re.findall(pattern, reference_section, re.DOTALL)\n",
    "    for num, ref in matches:\n",
    "        ref = ref.replace('\\n', ' ').strip()\n",
    "        references.append((num, ref))\n",
    "    return references\n",
    "\n",
    "\n",
    "def segment_references(reference_section, reference_style):\n",
    "    \"\"\"\n",
    "    Segments the references section into individual references based on the reference style.\n",
    "\n",
    "    Parameters:\n",
    "        reference_section (str): The extracted references section from the text.\n",
    "        reference_style (str): Detected reference style ('numbered_brackets', 'numbered', or 'unknown').\n",
    "\n",
    "    Returns:\n",
    "        list of tuple: List containing tuples of (reference_number, reference_text).\n",
    "    \"\"\"\n",
    "    if not reference_section:\n",
    "        return []\n",
    "    if reference_style == 'numbered_brackets':\n",
    "        return segment_references_numbered_brackets(reference_section)\n",
    "    elif reference_style == 'numbered':\n",
    "        return segment_references_numbered(reference_section)\n",
    "    else:\n",
    "        # Attempt both segmentation methods if reference style is unknown\n",
    "        segmented_refs = segment_references_numbered_brackets(reference_section)\n",
    "        if not segmented_refs:\n",
    "            segmented_refs = segment_references_numbered(reference_section)\n",
    "        return segmented_refs\n",
    "\n",
    "\n",
    "def handle_multiple_citations(citation_numbers):\n",
    "    \"\"\"\n",
    "    Expands citation ranges and lists like [2-4,6] or [8]-[12] into ['2', '3', '4', '6', '8', '9', '10', '11', '12']\n",
    "    Handles various dash types.\n",
    "\n",
    "    Parameters:\n",
    "        citation_numbers (list of str): List of citation strings to expand.\n",
    "\n",
    "    Returns:\n",
    "        list of str: Expanded list of individual citation numbers as strings.\n",
    "    \"\"\"\n",
    "    expanded = []\n",
    "    for citation in citation_numbers:\n",
    "        citation = citation.replace(' ', '')  # Remove any spaces\n",
    "        # Split by comma\n",
    "        parts = citation.split(',')\n",
    "        for part in parts:\n",
    "            # Check if the part is a range (e.g., '8-12')\n",
    "            range_match = re.match(r'^(\\d+)[\\-–—](\\d+)$', part)\n",
    "            if range_match:\n",
    "                start, end = int(range_match.group(1)), int(range_match.group(2))\n",
    "                if start <= end:\n",
    "                    expanded.extend([str(num) for num in range(start, end + 1)])\n",
    "            else:\n",
    "                # Single citation\n",
    "                if part.isdigit():\n",
    "                    expanded.append(part)\n",
    "    return expanded\n",
    "\n",
    "\n",
    "def map_citations_to_references_numbered_unified(main_text, references):\n",
    "    \"\"\"\n",
    "    Maps numerical citations (with possible multiple citations within brackets) to their corresponding references.\n",
    "    Returns a dictionary mapping reference numbers to their contexts and counts.\n",
    "\n",
    "    Parameters:\n",
    "        main_text (str): The main text of the PDF excluding the references section.\n",
    "        references (list of dict): List of references with 'id', 'title', 'authors', and 'raw_reference'.\n",
    "\n",
    "    Returns:\n",
    "        dict: Mapping from reference number (str) to a dictionary containing 'contexts' (list) and 'count' (int).\n",
    "    \"\"\"\n",
    "    contexts = defaultdict(lambda: {'contexts': set(), 'count': 0})\n",
    "    sentences = sent_tokenize(main_text)\n",
    "    # Precompile the citation pattern regex\n",
    "    citation_pattern = re.compile(r'\\[(\\d+(?:\\s*[,-–—]\\s*\\d+)*)\\]')\n",
    "\n",
    "    for idx, sentence in enumerate(sentences):\n",
    "        matches = citation_pattern.findall(sentence)\n",
    "        if matches:\n",
    "            # Build context sentences: four before, current, four after\n",
    "            start_idx = max(0, idx - 5)\n",
    "            end_idx = min(len(sentences), idx + 6)\n",
    "            context_sentences = sentences[start_idx:end_idx]\n",
    "            context_text = ' '.join(context_sentences)\n",
    "            for match in matches:\n",
    "                # Expand the citation numbers\n",
    "                citation_nums = handle_multiple_citations([match])\n",
    "                for num in citation_nums:\n",
    "                    contexts[num]['contexts'].add(context_text.strip())\n",
    "                    contexts[num]['count'] += 1\n",
    "\n",
    "    # Convert sets to lists\n",
    "    return {num: {'contexts': list(data['contexts']), 'count': data['count']} for num, data in contexts.items()}\n",
    "\n",
    "\n",
    "def save_citation_contexts_to_csv(references, citation_map, output_csv_path):\n",
    "    \"\"\"\n",
    "    Saves the citation contexts mapped to references into a CSV file.\n",
    "\n",
    "    Parameters:\n",
    "        references (list of dict): List of references with 'id', 'title', 'authors', and 'raw_reference'.\n",
    "        citation_map (dict): Mapping from reference number to their contexts and counts.\n",
    "        output_csv_path (str): Path to save the output CSV file.\n",
    "    \"\"\"\n",
    "    data = []\n",
    "    for ref in references:\n",
    "        ref_id = ref['id']\n",
    "        title = ref['title'] if ref['title'] else ref['raw_reference']  # Use raw reference if title not found\n",
    "        authors = ref.get('authors', [])\n",
    "        authors_str = '; '.join(authors) if authors else 'No authors extracted.'\n",
    "\n",
    "        contexts_data = citation_map.get(ref_id, {'contexts': [], 'count': 0})\n",
    "        contexts = contexts_data.get('contexts', [])\n",
    "        count = contexts_data.get('count', 0)\n",
    "        # Remove duplicates by converting to set and back to list\n",
    "        unique_contexts = list(set(contexts))\n",
    "        if unique_contexts:\n",
    "            # Join contexts with a delimiter\n",
    "            joined_contexts = ' | '.join(unique_contexts)\n",
    "            # Create the custom reference sentence\n",
    "            custom_sentence = f\"Reference [{ref_id}]: {title}. This citation context talks about reference no [{ref_id}] titled '{title}' by authors {authors_str}. It was cited {count} times.\"\n",
    "            # Prepend the custom sentence to the citation contexts\n",
    "            combined_contexts = f\"{custom_sentence} {joined_contexts}\"\n",
    "        else:\n",
    "            combined_contexts = f\"Reference [{ref_id}]: {title}. This citation context talks about reference no [{ref_id}] titled '{title}' by authors {authors_str}. It was cited {count} times. No citation contexts found.\"\n",
    "\n",
    "        data.append({\n",
    "            'Reference ID': ref_id,\n",
    "            'Title': title,\n",
    "            'Authors': authors_str,\n",
    "            'Citation Count': count,\n",
    "            'Citation Contexts': combined_contexts\n",
    "        })\n",
    "\n",
    "    # Sort data by Reference ID numerically\n",
    "    try:\n",
    "        data_sorted = sorted(data, key=lambda x: int(x['Reference ID']))\n",
    "    except ValueError:\n",
    "        # If Reference ID is not purely numeric, sort as strings\n",
    "        data_sorted = sorted(data, key=lambda x: x['Reference ID'])\n",
    "\n",
    "    df = pd.DataFrame(data_sorted)\n",
    "    df.to_csv(output_csv_path, index=False)\n",
    "\n",
    "\n",
    "def parse_reference_string_with_llm(ref_text):\n",
    "    \"\"\"\n",
    "    Parses a reference string using LLM to extract the title and authors.\n",
    "\n",
    "    Parameters:\n",
    "        ref_text (str): The reference string.\n",
    "\n",
    "    Returns:\n",
    "        dict: Parsed reference data, including 'title' and 'authors'.\n",
    "    \"\"\"\n",
    "    # The genai API should already be configured in the main function\n",
    "\n",
    "    # Construct the prompt\n",
    "    prompt = f\"\"\"\n",
    "Extract the title and authors from the following reference:\n",
    "\n",
    "{ref_text}\n",
    "\n",
    "Return the result in JSON format, with keys 'title' and 'authors'. The 'authors' should be a list of author names.\n",
    "\n",
    "Example Output:\n",
    "{{\n",
    "    \"title\": \"Title of the paper\",\n",
    "    \"authors\": [\"Author One\", \"Author Two\", \"Author Three\"]\n",
    "}}\n",
    "\"\"\"\n",
    "\n",
    "    model = genai.GenerativeModel(model_name=\"gemini-1.5-flash\")\n",
    "    try:\n",
    "        response = model.generate_content(prompt)\n",
    "        # Sleep to respect rate limits\n",
    "        time.sleep(13)  # Adjust based on your rate limit requirements\n",
    "        # Attempt to find the JSON in the response\n",
    "        response_text = response.text.strip()\n",
    "        # Extract the JSON part from the response\n",
    "        match = re.search(r'\\{.*\\}', response_text, re.DOTALL)\n",
    "        if match:\n",
    "            json_text = match.group(0)\n",
    "            parsed_ref = json.loads(json_text)\n",
    "            return parsed_ref  # Should contain 'title' and 'authors'\n",
    "        else:\n",
    "            print(f\"No JSON found in LLM response: {response_text}\")\n",
    "            return {'title': '', 'authors': []}\n",
    "    except Exception as e:\n",
    "        print(f\"Error parsing reference with LLM: {e}\")\n",
    "        return {'title': '', 'authors': []}\n",
    "\n",
    "\n",
    "def map_references(actual_references):\n",
    "    \"\"\"\n",
    "    Parses each reference string using LLM to extract titles and authors.\n",
    "\n",
    "    Parameters:\n",
    "        actual_references (list of tuple): List of tuples (ref_num, ref_text) from actual references.\n",
    "\n",
    "    Returns:\n",
    "        list of dict: References with IDs, raw_reference, 'title', and 'authors'.\n",
    "    \"\"\"\n",
    "    merged_references = []\n",
    "    for ref_num, ref_text in actual_references:\n",
    "        parsed_ref = parse_reference_string_with_llm(ref_text)\n",
    "        title = parsed_ref.get('title', \"No title extracted.\")\n",
    "        authors = parsed_ref.get('authors', [])\n",
    "        if not title:\n",
    "            title = ref_text.strip()  # Use the entire reference string as the title\n",
    "        merged_references.append({\n",
    "            'id': ref_num,\n",
    "            'raw_reference': ref_text,\n",
    "            'title': title,\n",
    "            'authors': authors\n",
    "        })\n",
    "    return merged_references\n",
    "\n",
    "\n",
    "def main():\n",
    "    # Create output directory if it doesn't exist\n",
    "    if not os.path.exists(output_dir):\n",
    "        os.makedirs(output_dir)\n",
    "\n",
    "    if pdf_file:\n",
    "        # Process specific PDF file\n",
    "        if not os.path.exists(pdf_file):\n",
    "            print(f\"Specified PDF file {pdf_file} does not exist.\")\n",
    "            return\n",
    "        pdf_files = [pdf_file]\n",
    "    else:\n",
    "        # Get list of PDF files in the input directory\n",
    "        pdf_files = [os.path.join(input_dir, f) for f in os.listdir(input_dir) if f.lower().endswith('.pdf')]\n",
    "\n",
    "    if not pdf_files:\n",
    "        print(\"No PDF files found to process.\")\n",
    "        return\n",
    "\n",
    "    no_citation_context_pdfs = []\n",
    "\n",
    "    for pdf_path in pdf_files:\n",
    "        pdf_file_name = os.path.basename(pdf_path)\n",
    "        try:\n",
    "            # Step 1: Extract main text from PDF using PyMuPDF\n",
    "            raw_text = read_pdf_file(pdf_path)\n",
    "\n",
    "            # Step 2: Preprocess the extracted text\n",
    "            cleaned_text = pre_process_text(raw_text)\n",
    "\n",
    "            # Step 3: Find the reference section\n",
    "            reference_section, main_text = find_reference_section(cleaned_text)\n",
    "\n",
    "            # Step 4: Detect reference style\n",
    "            reference_style = detect_reference_style(reference_section)\n",
    "\n",
    "            # Step 5: Segment references\n",
    "            segmented_references = segment_references(reference_section, reference_style)\n",
    "            segmented_references = [ref for ref in segmented_references if ref[1].strip()]\n",
    "\n",
    "            # Proceed even if reference_style is 'unknown' and segmented_references is empty\n",
    "\n",
    "            if not segmented_references:\n",
    "                # Attempt alternative segmentation if possible\n",
    "                # Here, you can add more segmentation strategies if needed\n",
    "                # For now, if no references are found, consider citation contexts not extracted\n",
    "                no_citation_context_pdfs.append(pdf_file_name)\n",
    "                continue  # Skip to next PDF\n",
    "\n",
    "            # Print statement before using Gemini to get titles and authors\n",
    "            print(\"Using Gemini to get titles and authors\")\n",
    "\n",
    "            # Step 6: Map references (process each reference string with LLM to extract titles and authors)\n",
    "            merged_references = map_references(segmented_references)\n",
    "\n",
    "            # Step 7: Extract citations from main text\n",
    "            citation_mapping = map_citations_to_references_numbered_unified(main_text, merged_references)\n",
    "\n",
    "            # Print statement after citation context extraction\n",
    "            print(\"OK, citation context extraction is done\")\n",
    "\n",
    "            # Check if citation contexts are found\n",
    "            if not citation_mapping:\n",
    "                no_citation_context_pdfs.append(pdf_file_name)\n",
    "                continue  # Skip to next PDF\n",
    "\n",
    "            # Step 8: Save to CSV\n",
    "            # Name the CSV file the same as the PDF file but with .csv extension\n",
    "            csv_file_name = os.path.splitext(pdf_file_name)[0] + '.csv'\n",
    "            output_csv = os.path.join(output_dir, csv_file_name)\n",
    "            save_citation_contexts_to_csv(merged_references, citation_mapping, output_csv)\n",
    "\n",
    "            # Print statement after mapping and storing\n",
    "            print(\"Mapping and storing done\")\n",
    "\n",
    "        except Exception as e:\n",
    "            # If any error occurs during processing, consider citation contexts not extracted\n",
    "            print(f\"Error processing {pdf_file_name}: {e}\")\n",
    "            no_citation_context_pdfs.append(pdf_file_name)\n",
    "            continue  # Continue to next PDF\n",
    "\n",
    "    # After processing all PDFs, print the list of PDFs with missing citation contexts\n",
    "    if no_citation_context_pdfs:\n",
    "        print(\"PDFs for which citation contexts were not extracted:\")\n",
    "        for pdf in no_citation_context_pdfs:\n",
    "            print(f\"- {pdf}\")\n",
    "    else:\n",
    "        print(\"Citation contexts were successfully extracted for all processed PDFs.\")\n",
    "\n",
    "\n",
    "if __name__ == '__main__':\n",
    "    main()\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "language_info": {
   "name": "python"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
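The notebook's parse_reference_string_with_llm step asks a model for JSON and then pulls the first {...} span out of the reply. A slightly more defensive variant of that parsing step might look like the sketch below; the helper name extract_reference_fields is hypothetical, and the handling of a string-valued 'authors' field is an added assumption, not behaviour taken from the notebook.

```python
import json
import re


def extract_reference_fields(response_text):
    """Pull 'title' and 'authors' out of a model reply that may wrap JSON in prose."""
    empty = {"title": "", "authors": []}
    match = re.search(r"\{.*\}", response_text, re.DOTALL)
    if not match:
        return empty
    try:
        parsed = json.loads(match.group(0))
    except json.JSONDecodeError:
        return empty
    title = parsed.get("title") or ""
    authors = parsed.get("authors") or []
    if isinstance(authors, str):
        # Assumption: a single author may come back as a plain string.
        authors = [authors]
    return {"title": str(title), "authors": [str(a) for a in authors]}
```

Returning the same empty structure on every failure path keeps the caller's fallback (using the raw reference string as the title) in one place.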

GitHub Events

Total
  • Public event: 1
Last Year
  • Public event: 1