AnomCite

https://github.com/Manojh23/AnomCite

Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

✓ CITATION.cff file (found)
✓ codemeta.json file (found)
✓ .zenodo.json file (found)
○ DOI references
○ Academic publication links
○ Academic email domains
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: low similarity (10.3%) to scientific vocabulary

Last synced: 11 months ago

Repository

Basic Info

Host: GitHub
Owner: Manojh23
Language: Jupyter Notebook
Default Branch: main
Size: 50.8 KB

Statistics

Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Releases: 0

Created over 1 year ago · Last pushed 11 months ago

Metadata Files

Readme Citation

README.md

AnomCite

This repository contains two primary Jupyter Notebooks for analyzing citations in research papers:

citationcontext.ipynb
- Description: Extracts citation contexts from PDF files and stores the results in a CSV file. For each reference, the CSV records the number of times it is cited, all of its citation contexts, the title of the referenced paper, and the author names.
detection.ipynb
- Description: Classifies a given self-citation as essential or non-essential, based on its citation context and summaries of the paper. This notebook requires your own OpenAI API key.

Dataset Availability

The dataset required for this project has not been uploaded to the repository due to its large size. To obtain the dataset, please email me at hmanojnarayan@gmail.com.

How to Use

Environment Setup

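Ensure all required libraries are installed. citationcontext.ipynb imports PyMuPDF (imported as fitz), requests, pandas, nltk, and google.generativeai, and it downloads the NLTK punkt tokenizer on first run if it is missing.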
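
One way to check the environment before opening the notebooks is a short import test. The sketch below assumes the module names imported at the top of citationcontext.ipynb (fitz, requests, pandas, nltk, google.generativeai); the helper name find_missing_modules is hypothetical and not part of the repository.

```python
import importlib.util

# Top-level module names imported by citationcontext.ipynb.
# Assumption: the matching pip packages are PyMuPDF, requests, pandas,
# nltk, and google-generativeai.
REQUIRED_MODULES = ["fitz", "requests", "pandas", "nltk", "google.generativeai"]


def find_missing_modules(module_names):
    """Return the names in module_names that cannot be located for import."""
    missing = []
    for name in module_names:
        try:
            spec = importlib.util.find_spec(name)
        except (ImportError, ValueError):
            # Raised when a parent package (e.g. 'google') is absent or broken.
            spec = None
        if spec is None:
            missing.append(name)
    return missing


if __name__ == "__main__":
    missing = find_missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules are importable.")
```

Any name reported as missing would need to be installed before the notebook's first cell can run.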

Running the Notebooks

citationcontext.ipynb
- Purpose: Extracts citation contexts from PDF files.
- Instructions:
  - Open the citationcontext.ipynb notebook in Jupyter Notebook or JupyterLab.
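  - Set genai_api_key, input_dir, output_dir, and pdf_file at the top of the code cell. The notebook uses Gemini to extract titles and authors, and setting pdf_file to None processes every PDF in input_dir.
  - Execute the notebook cells sequentially to extract citation contexts and generate the citation_contexts.csv file (the notebook code names each output CSV after its source PDF and writes it to output_dir).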
detection.ipynb
- Purpose: Detects anomalous self-citations based on extracted contexts and paper summaries.
- Instructions:
  - Open the detection.ipynb notebook in Jupyter Notebook or JupyterLab.
  - Insert your OpenAI API key in the designated section of the notebook.
  - Execute the notebook cells sequentially to analyze and detect anomalous self-citations.
Note: Replace the placeholder for the OpenAI API key with your actual key to enable the anomaly detection functionality.
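
As an alternative to pasting the key into the notebook, it can be read from an environment variable so that it is not saved with the file. This is a minimal sketch, not what detection.ipynb does: the function name load_api_key is hypothetical, and the variable name OPENAI_API_KEY is an assumption about how the key is stored on your machine.

```python
import os


def load_api_key(env_var="OPENAI_API_KEY"):
    """Return the API key stored in env_var, or raise if it is unset or blank."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(
            f"Set the {env_var} environment variable before running the notebook."
        )
    return key
```

In the notebook, the value returned by load_api_key() would take the place of the hard-coded placeholder.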

Results

citationcontext.ipynb:
- Outputs a CSV file (referred to here as citation_contexts.csv; the notebook code names it after the source PDF) with one row per reference and the columns Reference ID, Title, Authors, Citation Count, and Citation Contexts.
detection.ipynb:
- Outputs an analysis indicating whether specific self-citations are essential or non-essential based on the provided contexts and summaries.
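
The CSV produced by citationcontext.ipynb can be inspected with the standard csv module. This is a minimal sketch, assuming the column names written by save_citation_contexts_to_csv in the notebook (Reference ID, Title, Authors, Citation Count, Citation Contexts); the helper most_cited and the sample rows are made up for illustration.

```python
import csv
import io


def most_cited(csv_file, top_n=3):
    """Return (reference_id, title, count) tuples sorted by descending citation count."""
    rows = list(csv.DictReader(csv_file))
    rows.sort(key=lambda row: int(row["Citation Count"]), reverse=True)
    return [
        (row["Reference ID"], row["Title"], int(row["Citation Count"]))
        for row in rows[:top_n]
    ]


# Made-up sample rows standing in for a real output file.
sample = io.StringIO(
    "Reference ID,Title,Authors,Citation Count,Citation Contexts\n"
    "1,Paper A,Author One,2,context text\n"
    "2,Paper B,Author Two; Author Three,5,context text\n"
)
print(most_cited(sample, top_n=1))
```

For a real file, pass a handle opened with open(path, newline="", encoding="utf-8") instead of the in-memory sample.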

Contact

For any queries or to request access to the dataset, please reach out to me at:

Email: hmanojnarayan@gmail.com

Owner

Name: Manoj h
Login: Manojh23
Kind: user
Location: BANGALORE

Repositories: 1
Profile: https://github.com/Manojh23

Citation (citationcontext.ipynb)

{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "e:\\os\\envs\\project\\lib\\site-packages\\tqdm\\auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n",
      "  from .autonotebook import tqdm as notebook_tqdm\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Using Gemini to get titles and authors\n",
      "OK, citation context extraction is done\n",
      "Mapping and storing done\n",
      "Citation contexts were successfully extracted for all processed PDFs.\n"
     ]
    }
   ],
   "source": [
    "import re\n",
    "import fitz  # PyMuPDF\n",
    "import requests\n",
    "import os\n",
    "import xml.etree.ElementTree as ET\n",
    "from collections import defaultdict\n",
    "import pandas as pd\n",
    "import nltk\n",
    "from nltk.tokenize import sent_tokenize\n",
    "import unicodedata\n",
    "import google.generativeai as genai\n",
    "import time\n",
    "import json\n",
    "\n",
    "# Ensure NLTK's punkt tokenizer is available\n",
    "try:\n",
    "    nltk.data.find('tokenizers/punkt')\n",
    "except LookupError:\n",
    "    nltk.download('punkt')\n",
    "\n",
    "\n",
    "# Set your GenAI API key\n",
    "genai_api_key = 'XYZ'  # Replace 'YOUR_API_KEY' with your actual API key\n",
    "genai.configure(api_key=genai_api_key)\n",
    "\n",
    "# Set the path to the input directory containing PDFs\n",
    "input_dir = r\"E:\\capstone\\ParsCit-master\\pp\\numbered\"\n",
    "\n",
    "# Set the path to the output directory for CSV files\n",
    "output_dir = r\"E:\\capstone\\ParsCit-master\\pp\\lebron20\"\n",
    "\n",
    "# Set the path to a specific PDF file to process, or set to None\n",
    "pdf_file = r\"E:\\capstone\\ParsCit-master\\pp\\pdf_numbered\\10.1002@jgt.20600.pdf\"  # Set to None to process all PDFs in the input_dir\n",
    "# pdf_file = None  # Uncomment and set to None to process all PDFs in the input_dir\n",
    "\n",
    "\n",
    "def read_pdf_file(file_path):\n",
    "    \"\"\"\n",
    "    Reads a PDF file and extracts its text content using PyMuPDF.\n",
    "\n",
    "    Parameters:\n",
    "        file_path (str): Path to the input PDF file.\n",
    "\n",
    "    Returns:\n",
    "        str: Extracted text from the PDF.\n",
    "\n",
    "    Raises:\n",
    "        Exception: If there is an error reading the PDF file.\n",
    "    \"\"\"\n",
    "    try:\n",
    "        with fitz.open(file_path) as doc:\n",
    "            text = \"\"\n",
    "            for page in doc:\n",
    "                page_text = page.get_text()\n",
    "                if page_text:\n",
    "                    text += page_text + \"\\n\"\n",
    "        return text\n",
    "    except Exception as e:\n",
    "        raise Exception(f\"Error reading PDF file: {e}\")\n",
    "\n",
    "\n",
    "def pre_process_text(text):\n",
    "    \"\"\"\n",
    "    Preprocesses the extracted text by removing hyphenations, extra spaces,\n",
    "    correcting common OCR errors, and trimming whitespace.\n",
    "    Also normalizes dash characters to standard hyphen '-'.\n",
    "\n",
    "    Parameters:\n",
    "        text (str): Raw text extracted from the PDF.\n",
    "\n",
    "    Returns:\n",
    "        str: Cleaned and preprocessed text.\n",
    "    \"\"\"\n",
    "    # Remove hyphenation at line breaks\n",
    "    text = re.sub(r'-\\s*\\n\\s*', '', text)\n",
    "    # Normalize various dash characters to standard hyphen\n",
    "    text = re.sub(r'[–—−‑‒―]', '-', text)\n",
    "    # Replace multiple spaces and tabs with a single space\n",
    "    text = re.sub(r'[ \\t]+', ' ', text)\n",
    "    # Replace multiple newlines with double newline to preserve paragraphs\n",
    "    text = re.sub(r'\\n{2,}', '\\n\\n', text)\n",
    "    # Correct common OCR errors (e.g., 'l' misread for '1')\n",
    "    text = re.sub(r'\\[l\\]', '[1]', text, flags=re.IGNORECASE)\n",
    "    text = re.sub(r'\\[I\\]', '[1]', text, flags=re.IGNORECASE)\n",
    "    # Normalize Unicode characters\n",
    "    text = unicodedata.normalize('NFKC', text)\n",
    "    # Strip leading and trailing whitespace\n",
    "    text = text.strip()\n",
    "    return text\n",
    "\n",
    "\n",
    "def find_reference_section(text):\n",
    "    \"\"\"\n",
    "    Identifies the references section in the text based on common headings.\n",
    "    If not found, assumes references start after 70% of the text.\n",
    "    Returns a tuple of (reference_section, main_text).\n",
    "\n",
    "    Parameters:\n",
    "        text (str): The preprocessed text extracted from the PDF.\n",
    "\n",
    "    Returns:\n",
    "        tuple: (reference_section (str), main_text (str))\n",
    "    \"\"\"\n",
    "    # Patterns to detect the \"References\" heading\n",
    "    reference_section_patterns = [\n",
    "        r'(?i)^\\s*References\\s*$',\n",
    "        r'(?i)^\\s*Bibliography\\s*$',\n",
    "        r'(?i)^\\s*Works Cited\\s*$',\n",
    "        r'(?i)^\\s*Literature Cited\\s*$',\n",
    "    ]\n",
    "    ref_start = None\n",
    "    for pattern in reference_section_patterns:\n",
    "        match = re.search(pattern, text, re.MULTILINE)\n",
    "        if match:\n",
    "            ref_start = match.end()\n",
    "            break\n",
    "    if ref_start is None:\n",
    "        # Assume references start after 70% of the text\n",
    "        ref_start = int(len(text) * 0.7)\n",
    "    # Reference section is from ref_start to the end\n",
    "    reference_section = text[ref_start:].strip()\n",
    "    main_text = text[:ref_start].strip()\n",
    "    return reference_section, main_text\n",
    "\n",
    "\n",
    "def detect_reference_style(reference_section):\n",
    "    \"\"\"\n",
    "    Detects the style of references used in the reference section.\n",
    "    Returns one of 'numbered_brackets', 'numbered', or 'unknown'.\n",
    "\n",
    "    Parameters:\n",
    "        reference_section (str): The extracted references section from the text.\n",
    "\n",
    "    Returns:\n",
    "        str: The detected reference style.\n",
    "    \"\"\"\n",
    "    # Check for numbered references with brackets [1]\n",
    "    if re.search(r'(?m)^\\s*\\[\\d+\\]', reference_section):\n",
    "        return 'numbered_brackets'\n",
    "    # Check for numbered references without brackets 1. or 1)\n",
    "    elif re.search(r'(?m)^\\s*\\d+[\\.\\)]\\s+', reference_section):\n",
    "        return 'numbered'\n",
    "    else:\n",
    "        return 'unknown'\n",
    "\n",
    "\n",
    "def segment_references_numbered_brackets(reference_section):\n",
    "    \"\"\"\n",
    "    Segments references that start with [number].\n",
    "    Returns a list of tuples (number, reference).\n",
    "\n",
    "    Parameters:\n",
    "        reference_section (str): The extracted references section from the text.\n",
    "\n",
    "    Returns:\n",
    "        list of tuple: List containing tuples of (reference_number, reference_text).\n",
    "    \"\"\"\n",
    "    references = []\n",
    "    # Correct common OCR errors in reference numbers\n",
    "    reference_section = re.sub(r'\\[l\\]', '[1]', reference_section, flags=re.IGNORECASE)\n",
    "    reference_section = re.sub(r'\\[I\\]', '[1]', reference_section, flags=re.IGNORECASE)\n",
    "    # Split references based on [number]\n",
    "    split_refs = re.split(r'\\[\\d+\\]', reference_section)\n",
    "    numbers = re.findall(r'\\[(\\d+)\\]', reference_section)\n",
    "    for num, ref in zip(numbers, split_refs[1:]):  # first split_refs[0] is before first [number]\n",
    "        ref = ref.replace('\\n', ' ').strip()\n",
    "        references.append((num, ref))\n",
    "    return references\n",
    "\n",
    "\n",
    "def segment_references_numbered(reference_section):\n",
    "    \"\"\"\n",
    "    Segments references that start with number. or number)\n",
    "    Returns a list of tuples (number, reference).\n",
    "\n",
    "    Parameters:\n",
    "        reference_section (str): The extracted references section from the text.\n",
    "\n",
    "    Returns:\n",
    "        list of tuple: List containing tuples of (reference_number, reference_text).\n",
    "    \"\"\"\n",
    "    references = []\n",
    "    # Split references based on numbers followed by dot or parenthesis\n",
    "    pattern = r'(?m)^\\s*(\\d+)[\\.\\)]\\s+(.+?)(?=^\\s*\\d+[\\.\\)]\\s+|\\Z)'\n",
    "    matches = re.findall(pattern, reference_section, re.DOTALL)\n",
    "    for num, ref in matches:\n",
    "        ref = ref.replace('\\n', ' ').strip()\n",
    "        references.append((num, ref))\n",
    "    return references\n",
    "\n",
    "\n",
    "def segment_references(reference_section, reference_style):\n",
    "    \"\"\"\n",
    "    Segments the references section into individual references based on the reference style.\n",
    "\n",
    "    Parameters:\n",
    "        reference_section (str): The extracted references section from the text.\n",
    "        reference_style (str): Detected reference style ('numbered_brackets', 'numbered', or 'unknown').\n",
    "\n",
    "    Returns:\n",
    "        list of tuple: List containing tuples of (reference_number, reference_text).\n",
    "    \"\"\"\n",
    "    if not reference_section:\n",
    "        return []\n",
    "    if reference_style == 'numbered_brackets':\n",
    "        return segment_references_numbered_brackets(reference_section)\n",
    "    elif reference_style == 'numbered':\n",
    "        return segment_references_numbered(reference_section)\n",
    "    else:\n",
    "        # Attempt both segmentation methods if reference style is unknown\n",
    "        segmented_refs = segment_references_numbered_brackets(reference_section)\n",
    "        if not segmented_refs:\n",
    "            segmented_refs = segment_references_numbered(reference_section)\n",
    "        return segmented_refs\n",
    "\n",
    "\n",
    "def handle_multiple_citations(citation_numbers):\n",
    "    \"\"\"\n",
    "    Expands citation ranges and lists like [2-4,6] or [8]-[12] into ['2', '3', '4', '6', '8', '9', '10', '11', '12']\n",
    "    Handles various dash types.\n",
    "\n",
    "    Parameters:\n",
    "        citation_numbers (list of str): List of citation strings to expand.\n",
    "\n",
    "    Returns:\n",
    "        list of str: Expanded list of individual citation numbers as strings.\n",
    "    \"\"\"\n",
    "    expanded = []\n",
    "    for citation in citation_numbers:\n",
    "        citation = citation.replace(' ', '')  # Remove any spaces\n",
    "        # Split by comma\n",
    "        parts = citation.split(',')\n",
    "        for part in parts:\n",
    "            # Check if the part is a range (e.g., '8-12')\n",
    "            range_match = re.match(r'^(\\d+)[\\-–—](\\d+)$', part)\n",
    "            if range_match:\n",
    "                start, end = int(range_match.group(1)), int(range_match.group(2))\n",
    "                if start <= end:\n",
    "                    expanded.extend([str(num) for num in range(start, end + 1)])\n",
    "            else:\n",
    "                # Single citation\n",
    "                if part.isdigit():\n",
    "                    expanded.append(part)\n",
    "    return expanded\n",
    "\n",
    "\n",
    "def map_citations_to_references_numbered_unified(main_text, references):\n",
    "    \"\"\"\n",
    "    Maps numerical citations (with possible multiple citations within brackets) to their corresponding references.\n",
    "    Returns a dictionary mapping reference numbers to their contexts and counts.\n",
    "\n",
    "    Parameters:\n",
    "        main_text (str): The main text of the PDF excluding the references section.\n",
    "        references (list of dict): List of references with 'id', 'title', 'authors', and 'raw_reference'.\n",
    "\n",
    "    Returns:\n",
    "        dict: Mapping from reference number (str) to a dictionary containing 'contexts' (list) and 'count' (int).\n",
    "    \"\"\"\n",
    "    contexts = defaultdict(lambda: {'contexts': set(), 'count': 0})\n",
    "    sentences = sent_tokenize(main_text)\n",
    "    # Precompile the citation pattern regex\n",
    "    citation_pattern = re.compile(r'\\[(\\d+(?:\\s*[,-–—]\\s*\\d+)*)\\]')\n",
    "\n",
    "    for idx, sentence in enumerate(sentences):\n",
    "        matches = citation_pattern.findall(sentence)\n",
    "        if matches:\n",
    "            # Build context sentences: four before, current, four after\n",
    "            start_idx = max(0, idx - 5)\n",
    "            end_idx = min(len(sentences), idx + 6)\n",
    "            context_sentences = sentences[start_idx:end_idx]\n",
    "            context_text = ' '.join(context_sentences)\n",
    "            for match in matches:\n",
    "                # Expand the citation numbers\n",
    "                citation_nums = handle_multiple_citations([match])\n",
    "                for num in citation_nums:\n",
    "                    contexts[num]['contexts'].add(context_text.strip())\n",
    "                    contexts[num]['count'] += 1\n",
    "\n",
    "    # Convert sets to lists\n",
    "    return {num: {'contexts': list(data['contexts']), 'count': data['count']} for num, data in contexts.items()}\n",
    "\n",
    "\n",
    "def save_citation_contexts_to_csv(references, citation_map, output_csv_path):\n",
    "    \"\"\"\n",
    "    Saves the citation contexts mapped to references into a CSV file.\n",
    "\n",
    "    Parameters:\n",
    "        references (list of dict): List of references with 'id', 'title', 'authors', and 'raw_reference'.\n",
    "        citation_map (dict): Mapping from reference number to their contexts and counts.\n",
    "        output_csv_path (str): Path to save the output CSV file.\n",
    "    \"\"\"\n",
    "    data = []\n",
    "    for ref in references:\n",
    "        ref_id = ref['id']\n",
    "        title = ref['title'] if ref['title'] else ref['raw_reference']  # Use raw reference if title not found\n",
    "        authors = ref.get('authors', [])\n",
    "        authors_str = '; '.join(authors) if authors else 'No authors extracted.'\n",
    "\n",
    "        contexts_data = citation_map.get(ref_id, {'contexts': [], 'count': 0})\n",
    "        contexts = contexts_data.get('contexts', [])\n",
    "        count = contexts_data.get('count', 0)\n",
    "        # Remove duplicates by converting to set and back to list\n",
    "        unique_contexts = list(set(contexts))\n",
    "        if unique_contexts:\n",
    "            # Join contexts with a delimiter\n",
    "            joined_contexts = ' | '.join(unique_contexts)\n",
    "            # Create the custom reference sentence\n",
    "            custom_sentence = f\"Reference [{ref_id}]: {title}. This citation context talks about reference no [{ref_id}] titled '{title}' by authors {authors_str}. It was cited {count} times.\"\n",
    "            # Prepend the custom sentence to the citation contexts\n",
    "            combined_contexts = f\"{custom_sentence} {joined_contexts}\"\n",
    "        else:\n",
    "            combined_contexts = f\"Reference [{ref_id}]: {title}. This citation context talks about reference no [{ref_id}] titled '{title}' by authors {authors_str}. It was cited {count} times. No citation contexts found.\"\n",
    "\n",
    "        data.append({\n",
    "            'Reference ID': ref_id,\n",
    "            'Title': title,\n",
    "            'Authors': authors_str,\n",
    "            'Citation Count': count,\n",
    "            'Citation Contexts': combined_contexts\n",
    "        })\n",
    "\n",
    "    # Sort data by Reference ID numerically\n",
    "    try:\n",
    "        data_sorted = sorted(data, key=lambda x: int(x['Reference ID']))\n",
    "    except ValueError:\n",
    "        # If Reference ID is not purely numeric, sort as strings\n",
    "        data_sorted = sorted(data, key=lambda x: x['Reference ID'])\n",
    "\n",
    "    df = pd.DataFrame(data_sorted)\n",
    "    df.to_csv(output_csv_path, index=False)\n",
    "\n",
    "\n",
    "def parse_reference_string_with_llm(ref_text):\n",
    "    \"\"\"\n",
    "    Parses a reference string using LLM to extract the title and authors.\n",
    "\n",
    "    Parameters:\n",
    "        ref_text (str): The reference string.\n",
    "\n",
    "    Returns:\n",
    "        dict: Parsed reference data, including 'title' and 'authors'.\n",
    "    \"\"\"\n",
    "    # The genai API should already be configured in the main function\n",
    "\n",
    "    # Construct the prompt\n",
    "    prompt = f\"\"\"\n",
    "Extract the title and authors from the following reference:\n",
    "\n",
    "{ref_text}\n",
    "\n",
    "Return the result in JSON format, with keys 'title' and 'authors'. The 'authors' should be a list of author names.\n",
    "\n",
    "Example Output:\n",
    "{{\n",
    "    \"title\": \"Title of the paper\",\n",
    "    \"authors\": [\"Author One\", \"Author Two\", \"Author Three\"]\n",
    "}}\n",
    "\"\"\"\n",
    "\n",
    "    model = genai.GenerativeModel(model_name=\"gemini-1.5-flash\")\n",
    "    try:\n",
    "        response = model.generate_content(prompt)\n",
    "        # Sleep to respect rate limits\n",
    "        time.sleep(13)  # Adjust based on your rate limit requirements\n",
    "        # Attempt to find the JSON in the response\n",
    "        response_text = response.text.strip()\n",
    "        # Extract the JSON part from the response\n",
    "        match = re.search(r'\\{.*\\}', response_text, re.DOTALL)\n",
    "        if match:\n",
    "            json_text = match.group(0)\n",
    "            parsed_ref = json.loads(json_text)\n",
    "            return parsed_ref  # Should contain 'title' and 'authors'\n",
    "        else:\n",
    "            print(f\"No JSON found in LLM response: {response_text}\")\n",
    "            return {'title': '', 'authors': []}\n",
    "    except Exception as e:\n",
    "        print(f\"Error parsing reference with LLM: {e}\")\n",
    "        return {'title': '', 'authors': []}\n",
    "\n",
    "\n",
    "def map_references(actual_references):\n",
    "    \"\"\"\n",
    "    Parses each reference string using LLM to extract titles and authors.\n",
    "\n",
    "    Parameters:\n",
    "        actual_references (list of tuple): List of tuples (ref_num, ref_text) from actual references.\n",
    "\n",
    "    Returns:\n",
    "        list of dict: References with IDs, raw_reference, 'title', and 'authors'.\n",
    "    \"\"\"\n",
    "    merged_references = []\n",
    "    for ref_num, ref_text in actual_references:\n",
    "        parsed_ref = parse_reference_string_with_llm(ref_text)\n",
    "        title = parsed_ref.get('title', \"No title extracted.\")\n",
    "        authors = parsed_ref.get('authors', [])\n",
    "        if not title:\n",
    "            title = ref_text.strip()  # Use the entire reference string as the title\n",
    "        merged_references.append({\n",
    "            'id': ref_num,\n",
    "            'raw_reference': ref_text,\n",
    "            'title': title,\n",
    "            'authors': authors\n",
    "        })\n",
    "    return merged_references\n",
    "\n",
    "\n",
    "def main():\n",
    "    # Create output directory if it doesn't exist\n",
    "    if not os.path.exists(output_dir):\n",
    "        os.makedirs(output_dir)\n",
    "\n",
    "    if pdf_file:\n",
    "        # Process specific PDF file\n",
    "        if not os.path.exists(pdf_file):\n",
    "            print(f\"Specified PDF file {pdf_file} does not exist.\")\n",
    "            return\n",
    "        pdf_files = [pdf_file]\n",
    "    else:\n",
    "        # Get list of PDF files in the input directory\n",
    "        pdf_files = [os.path.join(input_dir, f) for f in os.listdir(input_dir) if f.lower().endswith('.pdf')]\n",
    "\n",
    "    if not pdf_files:\n",
    "        print(\"No PDF files found to process.\")\n",
    "        return\n",
    "\n",
    "    no_citation_context_pdfs = []\n",
    "\n",
    "    for pdf_path in pdf_files:\n",
    "        pdf_file_name = os.path.basename(pdf_path)\n",
    "        try:\n",
    "            # Step 1: Extract main text from PDF using PyMuPDF\n",
    "            raw_text = read_pdf_file(pdf_path)\n",
    "\n",
    "            # Step 2: Preprocess the extracted text\n",
    "            cleaned_text = pre_process_text(raw_text)\n",
    "\n",
    "            # Step 3: Find the reference section\n",
    "            reference_section, main_text = find_reference_section(cleaned_text)\n",
    "\n",
    "            # Step 4: Detect reference style\n",
    "            reference_style = detect_reference_style(reference_section)\n",
    "\n",
    "            # Step 5: Segment references\n",
    "            segmented_references = segment_references(reference_section, reference_style)\n",
    "            segmented_references = [ref for ref in segmented_references if ref[1].strip()]\n",
    "\n",
    "            # Proceed even if reference_style is 'unknown' and segmented_references is empty\n",
    "\n",
    "            if not segmented_references:\n",
    "                # Attempt alternative segmentation if possible\n",
    "                # Here, you can add more segmentation strategies if needed\n",
    "                # For now, if no references are found, consider citation contexts not extracted\n",
    "                no_citation_context_pdfs.append(pdf_file_name)\n",
    "                continue  # Skip to next PDF\n",
    "\n",
    "            # Print statement before using Gemini to get titles and authors\n",
    "            print(\"Using Gemini to get titles and authors\")\n",
    "\n",
    "            # Step 6: Map references (process each reference string with LLM to extract titles and authors)\n",
    "            merged_references = map_references(segmented_references)\n",
    "\n",
    "            # Step 7: Extract citations from main text\n",
    "            citation_mapping = map_citations_to_references_numbered_unified(main_text, merged_references)\n",
    "\n",
    "            # Print statement after citation context extraction\n",
    "            print(\"OK, citation context extraction is done\")\n",
    "\n",
    "            # Check if citation contexts are found\n",
    "            if not citation_mapping:\n",
    "                no_citation_context_pdfs.append(pdf_file_name)\n",
    "                continue  # Skip to next PDF\n",
    "\n",
    "            # Step 8: Save to CSV\n",
    "            # Name the CSV file the same as the PDF file but with .csv extension\n",
    "            csv_file_name = os.path.splitext(pdf_file_name)[0] + '.csv'\n",
    "            output_csv = os.path.join(output_dir, csv_file_name)\n",
    "            save_citation_contexts_to_csv(merged_references, citation_mapping, output_csv)\n",
    "\n",
    "            # Print statement after mapping and storing\n",
    "            print(\"Mapping and storing done\")\n",
    "\n",
    "        except Exception as e:\n",
    "            # If any error occurs during processing, consider citation contexts not extracted\n",
    "            print(f\"Error processing {pdf_file_name}: {e}\")\n",
    "            no_citation_context_pdfs.append(pdf_file_name)\n",
    "            continue  # Continue to next PDF\n",
    "\n",
    "    # After processing all PDFs, print the list of PDFs with missing citation contexts\n",
    "    if no_citation_context_pdfs:\n",
    "        print(\"PDFs for which citation contexts were not extracted:\")\n",
    "        for pdf in no_citation_context_pdfs:\n",
    "            print(f\"- {pdf}\")\n",
    "    else:\n",
    "        print(\"Citation contexts were successfully extracted for all processed PDFs.\")\n",
    "\n",
    "\n",
    "if __name__ == '__main__':\n",
    "    main()\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "language_info": {
   "name": "python"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
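
The citation-mapping step in the notebook above (bracketed citations expanded into individual reference numbers, each paired with a window of surrounding sentences) can be sketched without NLTK or an API key. This is a simplified illustration, not the notebook's code: the names expand_citation and citation_contexts are hypothetical, sentences are passed in already split instead of being tokenized with sent_tokenize, contexts are kept in lists with no de-duplication, and dashes are assumed to be normalized to '-' beforehand, as pre_process_text does.

```python
import re

# Matches bracketed numeric citations such as [3], [2-4,6] or [1, 5].
# The hyphen is escaped so the character class does not form a range.
CITATION_PATTERN = re.compile(r"\[(\d+(?:\s*[,\-]\s*\d+)*)\]")


def expand_citation(group):
    """Expand the inside of one bracket, e.g. '2-4,6', into single numbers."""
    expanded = []
    for part in group.replace(" ", "").split(","):
        range_match = re.fullmatch(r"(\d+)-(\d+)", part)
        if range_match:
            start, end = int(range_match.group(1)), int(range_match.group(2))
            # A reversed range (start > end) yields nothing.
            expanded.extend(str(n) for n in range(start, end + 1))
        elif part.isdigit():
            expanded.append(part)
    return expanded


def citation_contexts(sentences, window=5):
    """Map each cited reference number to the sentence windows that cite it."""
    contexts = {}
    for idx, sentence in enumerate(sentences):
        for group in CITATION_PATTERN.findall(sentence):
            start = max(0, idx - window)
            end = min(len(sentences), idx + window + 1)
            text = " ".join(sentences[start:end])
            for num in expand_citation(group):
                contexts.setdefault(num, []).append(text)
    return contexts


sentences = ["Intro.", "Prior work [2-4,6] did X.", "We extend [3]."]
result = citation_contexts(sentences, window=1)
print(sorted(result))    # reference numbers that were cited
print(len(result["3"]))  # number of context windows collected for reference 3
```

With window=5 the slice covers five sentences before and five after the citing sentence, which matches the idx - 5 and idx + 6 bounds used in the notebook.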
