grant_ai

https://github.com/mitomac/grant_ai

Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

✓
CITATION.cff file
Found CITATION.cff file
✓
codemeta.json file
Found codemeta.json file
✓
.zenodo.json file
Found .zenodo.json file
○
DOI references
○
Academic publication links
○
Academic email domains
○
Institutional organization owner
○
JOSS paper metadata
○
Scientific vocabulary similarity
Low similarity (12.3%) to scientific vocabulary

Last synced: 6 months ago · JSON representation ·

Repository

Basic Info

Host: GitHub
Owner: mitomac
Language: Jupyter Notebook
Default Branch: main
Size: 2.49 MB

Statistics

Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Releases: 0

Created 10 months ago · Last pushed 10 months ago

Metadata Files

Readme Citation

Grant AI

Getting started

To make it easy for you to get started with GitLab, here's a list of recommended next steps.

Already a pro? Just edit this README.md and make it your own. Want to make it easy? Use the template at the bottom!

Add your files

[ ] Create or upload files
[ ] Add files using the command line or push an existing Git repository with the following command:

cd existing_repo git remote add origin https://gitlab.oit.duke.edu/dmm29/grant-ai.git git branch -M main git push -uf origin main

Integrate with your tools

[ ] Set up project integrations

Collaborate with your team

[ ] Invite team members and collaborators
[ ] Create a new merge request
[ ] Automatically close issues from merge requests
[ ] Enable merge request approvals
[ ] Set auto-merge

Test and Deploy

Use the built-in continuous integration in GitLab.

Editing this README

When you're ready to make this README your own, just edit this file and use the handy template below (or feel free to structure it however you want - this is just a starting point!). Thanks to makeareadme.com for this template.

Suggestions for a good README

Every project is different, so consider which of these sections apply to yours. The sections used in the template are suggestions for most open source projects. Also keep in mind that while a README can be too long and detailed, too long is better than too short. If you think your README is too long, consider utilizing another form of documentation rather than cutting out information.

Name

Choose a self-explaining name for your project.

Description

Let people know what your project can do specifically. Provide context and add a link to any reference visitors might be unfamiliar with. A list of Features or a Background subsection can also be added here. If there are alternatives to your project, this is a good place to list differentiating factors.

Badges

On some READMEs, you may see small images that convey metadata, such as whether or not all the tests are passing for the project. You can use Shields to add some to your README. Many services also have instructions for adding a badge.

Visuals

Depending on what you are making, it can be a good idea to include screenshots or even a video (you'll frequently see GIFs rather than actual videos). Tools like ttygif can help, but check out Asciinema for a more sophisticated method.

Installation

Within a particular ecosystem, there may be a common way of installing things, such as using Yarn, NuGet, or Homebrew. However, consider the possibility that whoever is reading your README is a novice and would like more guidance. Listing specific steps helps remove ambiguity and gets people to using your project as quickly as possible. If it only runs in a specific context like a particular programming language version or operating system or has dependencies that have to be installed manually, also add a Requirements subsection.

Usage

Use examples liberally, and show the expected output if you can. It's helpful to have inline the smallest example of usage that you can demonstrate, while providing links to more sophisticated examples if they are too long to reasonably include in the README.

Support

Tell people where they can go to for help. It can be any combination of an issue tracker, a chat room, an email address, etc.

Roadmap

If you have ideas for releases in the future, it is a good idea to list them in the README.

Contributing

State if you are open to contributions and what your requirements are for accepting them.

For people who want to make changes to your project, it's helpful to have some documentation on how to get started. Perhaps there is a script that they should run or some environment variables that they need to set. Make these steps explicit. These instructions could also be useful to your future self.

You can also document commands to lint the code or run tests. These steps help to ensure high code quality and reduce the likelihood that the changes inadvertently break something. Having instructions for running tests is especially helpful if it requires external setup, such as starting a Selenium server for testing in a browser.

Authors and acknowledgment

Show your appreciation to those who have contributed to the project.

License

For open source projects, say how it is licensed.

Project status

If you have run out of energy or time for your project, put a note at the top of the README saying that development has slowed down or stopped completely. Someone may choose to fork your project or volunteer to step in as a maintainer or owner, allowing your project to keep going. You can also make an explicit request for maintainers.

Owner

Name: David MacAlpine
Login: mitomac
Kind: user

Repositories: 1
Profile: https://github.com/mitomac

Citation (citation.ipynb)

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# LLM-Based Citation Extraction \n",
    "\n",
    "This notebook handles extraction of citations from scientific documents:\n",
    "1. Loading the processed document from JSON\n",
    "2. Using an LLM to extract citations section-by-section\n",
    "3. For each citation, identifying:\n",
    "   - The scientific claim being supported\n",
    "   - The citation text\n",
    "   - Expanded citation keys\n",
    "   - The paragraph context\n",
    "4. Saving the structured citation data for further analysis"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Setup\n",
    "\n",
    "First, let's import the necessary libraries and set up our environment."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "import re\n",
    "from pathlib import Path\n",
    "import json\n",
    "import pandas as pd\n",
    "from typing import List, Dict, Optional, Any, Union\n",
    "from pydantic import BaseModel, Field\n",
    "from datetime import datetime\n",
    "import matplotlib.pyplot as plt\n",
    "from openai import OpenAI"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Define Data Models\n",
    "\n",
    "We'll use Pydantic models to structure our citation data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "class ExtractedCitation(BaseModel):\n",
    "    \"\"\"A citation extracted by the LLM with expanded citation keys\"\"\"\n",
    "    claim: str = Field(..., description=\"The scientific claim being made and supported by the citation\")\n",
    "    citation_text: str = Field(..., description=\"The exact LaTeX citation text (e.g., '$^{1}$' or '$^{2-5}$')\")\n",
    "    citation_keys: List[int] = Field(..., description=\"Expanded list of citation keys as integers\")\n",
    "    paragraph: str = Field(..., description=\"The paragraph containing the citation for context\")\n",
    "    section_id: str = Field(..., description=\"ID of the section containing the citation\")\n",
    "\n",
    "class Reference(BaseModel):\n",
    "    \"\"\"A reference from the bibliography\"\"\"\n",
    "    reference_id: int = Field(..., description=\"The numeric ID of the reference (e.g., 1 from [1])\")\n",
    "    reference_text: str = Field(..., description=\"The full text of the reference\")\n",
    "    pmcid: Optional[str] = Field(None, description=\"PMCID if present in the reference\")\n",
    "\n",
    "class ReferenceList(BaseModel):\n",
    "    \"\"\"List of references extracted from the bibliography\"\"\"\n",
    "    references: List[Reference] = Field(..., description=\"List of references extracted from the bibliography\")\n",
    "    \n",
    "class CitationAnalysisResults(BaseModel):\n",
    "    \"\"\"Results of the citation extraction process\"\"\"\n",
    "    document_id: str = Field(..., description=\"ID of the processed document\")\n",
    "    citations: List[ExtractedCitation] = Field(..., description=\"Citations extracted from the document\")\n",
    "    total_citations: int = Field(..., description=\"Total number of citations extracted\")\n",
    "    citations_by_section: Dict[str, int] = Field(..., description=\"Count of citations by section ID\")\n",
    "    references: Optional[List[Reference]] = Field(None, description=\"References extracted from the bibliography\")\n",
    "    processed_date: str = Field(..., description=\"Date the analysis was performed\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "class LLMCitationExtractor:\n",
    "    \"\"\"\n",
    "    Uses LLM to extract citations from document sections with expanded citation keys\n",
    "    \"\"\"\n",
    "    \n",
    "    def __init__(self, api_key=None, model=\"o3-mini\"):\n",
    "        \"\"\"Initialize with OpenAI API key\"\"\"\n",
    "        self.api_key = api_key or os.environ.get(\"OPENAI_API_KEY\")\n",
    "        if not self.api_key:\n",
    "            raise ValueError(\"OpenAI API key is required\")\n",
    "        \n",
    "        self.client = OpenAI(api_key=self.api_key)\n",
    "        self.model = model\n",
    "        \n",
    "    def extract_citations_from_section(self, section_id, section_title, section_content):\n",
    "        \"\"\"\n",
    "        Extract citations from a single section using LLM with expanded citation keys\n",
    "        \n",
    "        Args:\n",
    "            section_id: ID of the section\n",
    "            section_title: Title of the section\n",
    "            section_content: Content of the section\n",
    "            \n",
    "        Returns:\n",
    "            List of ExtractedCitation objects\n",
    "        \"\"\"\n",
    "        # DEBUG: Print section info to verify data is being sent to the LLM\n",
    "        print(f\"DEBUG: Processing section_id={section_id}, title='{section_title}'\")\n",
    "        print(f\"DEBUG: Content sample: {section_content[:100]}...\")\n",
    "        \n",
    "        # Define a model for the response format\n",
    "        class CitationResponse(BaseModel):\n",
    "            citations: List[ExtractedCitation] = Field(\n",
    "                default_factory=list,\n",
    "                description=\"Extracted citations with their claims, citation text, expanded keys, and paragraph context\"\n",
    "            )\n",
    "        \n",
    "        # Define the prompt for citation extraction\n",
    "        prompt = f\"\"\"\n",
    "        You are an expert at identifying citations in scientific papers. Analyze the following section of text and extract all citations with their context.\n",
    "\n",
    "        SECTION: {section_title}\n",
    "\n",
    "        TEXT:\n",
    "        ```\n",
    "        {section_content}\n",
    "        ```\n",
    "\n",
    "        TASK:\n",
    "        1. Identify ALL citation callouts that match these LaTeX patterns:\n",
    "           - $^{{n}}$ (Single citation, e.g., $^{{1}}$)\n",
    "           - $^{{m-n}}$ (Range citation, e.g., $^{{2-5}}$)\n",
    "           - $^{{a,b,c}}$ (List citation, e.g., $^{{6,7,8}}$)\n",
    "           - $^{{a-b,c,d-e}}$ (Complex citation, e.g., $^{{2-4,7,9-10}}$)\n",
    "           - Any variations with spaces like ${{ }}^{{1}}$\n",
    "\n",
    "        2. For each citation, extract:\n",
    "           - The scientific claim being supported by the citation\n",
    "           - The exact citation text as it appears (e.g., \"$^{{2-5}}$\")\n",
    "           - The citation keys, expanded to individual numbers:\n",
    "              * For a range like \"2-5\", expand to [2,3,4,5]\n",
    "              * For a list like \"6,7,8\", expand to [6,7,8]\n",
    "              * For complex citations like \"2-4,7,9-10\", expand to [2,3,4,7,9,10]\n",
    "           - The full paragraph containing the citation for context\n",
    "           \n",
    "        IMPORTANT: Be sure to check figure captions for citations as well.\n",
    "        \"\"\"\n",
    "        \n",
    "        try:\n",
    "            # Use the parse method to get structured output directly into our Pydantic model\n",
    "            completion = self.client.beta.chat.completions.parse(\n",
    "                model=self.model,\n",
    "                messages=[\n",
    "                    {\"role\": \"system\", \"content\": \"You are a citation extraction specialist. Extract all citation callouts from scientific texts with high precision.\"},\n",
    "                    {\"role\": \"user\", \"content\": prompt}\n",
    "                ],\n",
    "                response_format=CitationResponse\n",
    "            )\n",
    "            \n",
    "            # Get the extracted citations directly from the parsed response\n",
    "            extracted_citations = completion.choices[0].message.parsed.citations\n",
    "            \n",
    "            # Set the section_id for all extracted citations\n",
    "            for citation in extracted_citations:\n",
    "                citation.section_id = section_id\n",
    "            \n",
    "            print(f\"Extracted {len(extracted_citations)} citations from section '{section_title}'\")\n",
    "            return extracted_citations\n",
    "            \n",
    "        except Exception as e:\n",
    "            print(f\"Error extracting citations from section '{section_title}': {e}\")\n",
    "            return []\n",
    "    \n",
    "    def extract_all_citations(self, document_data):\n",
    "        \"\"\"\n",
    "        Extract all citations from document sections\n",
    "        \n",
    "        Args:\n",
    "            document_data: Processed document data from JSON\n",
    "            \n",
    "        Returns:\n",
    "            List of ExtractedCitation objects\n",
    "        \"\"\"\n",
    "        all_citations = []\n",
    "        processed_section_ids = set()  # Track sections we've processed\n",
    "        \n",
    "        # Process all sections and their subsections\n",
    "        def process_section(section, parent_id=\"\"):\n",
    "            section_id = section['section_id']\n",
    "            processed_section_ids.add(section_id)\n",
    "            \n",
    "            # Skip the bibliography/references section\n",
    "            if any(ref_term in section['title'].lower() for ref_term in ['literature cited', 'references', 'bibliography']):\n",
    "                print(f\"Skipping bibliography section: {section['title']}\")\n",
    "                return\n",
    "            \n",
    "            # Extract citations from this section\n",
    "            section_citations = self.extract_citations_from_section(\n",
    "                section_id, \n",
    "                section['title'], \n",
    "                section['content']\n",
    "            )\n",
    "            all_citations.extend(section_citations)\n",
    "            \n",
    "            # Process subsections recursively\n",
    "            if 'subsections' in section and section['subsections']:\n",
    "                for subsection in section['subsections']:\n",
    "                    process_section(subsection, section_id)\n",
    "        \n",
    "        # Process all top-level sections\n",
    "        for section in document_data['sections']:\n",
    "            process_section(section)\n",
    "        \n",
    "        # Process figure captions\n",
    "        if 'figures' in document_data:\n",
    "            for figure in document_data['figures']:\n",
    "                figure_id = figure['figure_id']\n",
    "                caption = figure['caption']\n",
    "                \n",
    "                # Extract citations from the figure caption\n",
    "                caption_citations = self.extract_citations_from_section(\n",
    "                    figure_id,\n",
    "                    f\"Figure {figure.get('figure_number', '')}\",\n",
    "                    caption\n",
    "                )\n",
    "                all_citations.extend(caption_citations)\n",
    "                processed_section_ids.add(figure_id)\n",
    "        \n",
    "        print(f\"Extracted {len(all_citations)} total citations from document using LLM\")\n",
    "        return all_citations, processed_section_ids"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "def analyze_citations_with_llm(document_json_path, openai_api_key=None):\n",
    "    \"\"\"\n",
    "    Extract and analyze citations in a document using LLM.\n",
    "    \n",
    "    Args:\n",
    "        document_json_path: Path to the processed document JSON file\n",
    "        openai_api_key: OpenAI API key (optional, will use environment variable if not provided)\n",
    "        \n",
    "    Returns:\n",
    "        CitationAnalysisResults object\n",
    "    \"\"\"\n",
    "    # Load the document data\n",
    "    with open(document_json_path, 'r', encoding='utf-8') as f:\n",
    "        document_data = json.load(f)\n",
    "    \n",
    "    document_id = Path(document_json_path).stem\n",
    "    \n",
    "    print(f\"Processing document: {document_id}\")\n",
    "    \n",
    "    # Extract citations using LLM\n",
    "    print(\"\\n=== EXTRACTING CITATIONS USING LLM ===\")\n",
    "    extractor = LLMCitationExtractor(api_key=openai_api_key)\n",
    "    citations, processed_section_ids = extractor.extract_all_citations(document_data)\n",
    "    \n",
    "    # Count citations by section\n",
    "    citations_by_section = {}\n",
    "    \n",
    "    # Initialize counts for all processed sections (even those with zero citations)\n",
    "    for section_id in processed_section_ids:\n",
    "        citations_by_section[section_id] = 0\n",
    "    \n",
    "    # Count citations for each section\n",
    "    for citation in citations:\n",
    "        section_id = citation.section_id\n",
    "        citations_by_section[section_id] = citations_by_section.get(section_id, 0) + 1\n",
    "    \n",
    "    # Extract bibliography references\n",
    "    references = extract_bibliography(document_data, openai_api_key)\n",
    "    \n",
    "    # Create the results object\n",
    "    results = CitationAnalysisResults(\n",
    "        document_id=document_id,\n",
    "        citations=citations,\n",
    "        total_citations=len(citations),\n",
    "        citations_by_section=citations_by_section,\n",
    "        references=references,\n",
    "        processed_date=datetime.now().isoformat()\n",
    "    )\n",
    "    \n",
    "    # Print summary statistics\n",
    "    print(\"\\n===== CITATION ANALYSIS SUMMARY =====\")\n",
    "    print(f\"Total citations extracted: {results.total_citations}\")\n",
    "    print(f\"Total references extracted: {len(references)}\")\n",
    "    \n",
    "    print(\"\\nCitation distribution by section:\")\n",
    "    # Get section titles from document data\n",
    "    section_titles = {}\n",
    "    \n",
    "    def collect_section_titles(section, parent_title=\"\"):\n",
    "        section_id = section['section_id']\n",
    "        section_title = section['title']\n",
    "        section_titles[section_id] = section_title\n",
    "        \n",
    "        if 'subsections' in section and section['subsections']:\n",
    "            for subsection in section['subsections']:\n",
    "                collect_section_titles(subsection, section_title)\n",
    "    \n",
    "    # Collect section titles from document data\n",
    "    for section in document_data['sections']:\n",
    "        collect_section_titles(section)\n",
    "    \n",
    "    # Add figure captions to section titles\n",
    "    if 'figures' in document_data:\n",
    "        for figure in document_data['figures']:\n",
    "            figure_id = figure['figure_id']\n",
    "            section_titles[figure_id] = f\"Figure {figure.get('figure_number', '')}\"\n",
    "    \n",
    "    # Print citation counts by section with titles\n",
    "    for section_id, count in sorted(citations_by_section.items(), key=lambda x: x[1], reverse=True):\n",
    "        section_title = section_titles.get(section_id, \"Unknown\")\n",
    "        print(f\"  - {section_title}: {count} citations\")\n",
    "    \n",
    "    return results"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "def save_citation_results(results, output_dir='citation_analysis'):\n",
    "    \"\"\"\n",
    "    Save the citation analysis results to files.\n",
    "    \n",
    "    Args:\n",
    "        results: CitationAnalysisResults object\n",
    "        output_dir: Directory to save the files\n",
    "    \"\"\"\n",
    "    # Create directory if it doesn't exist\n",
    "    output_path = Path(output_dir)\n",
    "    output_path.mkdir(exist_ok=True, parents=True)\n",
    "    \n",
    "    # Save the complete results as JSON\n",
    "    with open(output_path / 'citations.json', 'w', encoding='utf-8') as f:\n",
    "        f.write(results.model_dump_json(indent=2))\n",
    "    \n",
    "    # Save citations as CSV\n",
    "    citations_df = pd.DataFrame([{\n",
    "        'claim': citation.claim,\n",
    "        'citation_text': citation.citation_text,\n",
    "        'citation_keys': ','.join(map(str, citation.citation_keys)),\n",
    "        'section_id': citation.section_id,\n",
    "        'paragraph_length': len(citation.paragraph) \n",
    "    } for citation in results.citations])\n",
    "    \n",
    "    citations_df.to_csv(output_path / 'citations.csv', index=False)\n",
    "    \n",
    "    # Save references as CSV if available\n",
    "    if results.references:\n",
    "        references_df = pd.DataFrame([{\n",
    "            'reference_id': ref.reference_id,\n",
    "            'reference_text': ref.reference_text,\n",
    "            'pmcid': ref.pmcid or ''\n",
    "        } for ref in results.references])\n",
    "        \n",
    "        references_df.to_csv(output_path / 'bibliography.csv', index=False)\n",
    "        print(f\"Saved {len(results.references)} references to {output_path}/bibliography.csv\")\n",
    "    \n",
    "    print(f\"\\nAnalysis results saved to {output_path}/\")\n",
    "    \n",
    "    return citations_df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "def extract_bibliography(document_data, openai_api_key=None):\n",
    "    \"\"\"\n",
    "    Extract structured bibliography from document.\n",
    "    \n",
    "    Args:\n",
    "        document_data: Processed document data from JSON\n",
    "        openai_api_key: OpenAI API key (optional, will use environment variable if not provided)\n",
    "        \n",
    "    Returns:\n",
    "        List of Reference objects\n",
    "    \"\"\"\n",
    "    # Find the bibliography section\n",
    "    bibliography_section = None\n",
    "    for section in document_data['sections']:\n",
    "        if any(ref_term in section['title'].lower() for ref_term in ['literature cited', 'references', 'bibliography']):\n",
    "            bibliography_section = section\n",
    "            break\n",
    "    \n",
    "    if not bibliography_section:\n",
    "        print(\"Bibliography section not found\")\n",
    "        return []\n",
    "    \n",
    "    # Get the bibliography text\n",
    "    bibliography_text = bibliography_section['content']\n",
    "    \n",
    "    print(f\"\\n=== EXTRACTING BIBLIOGRAPHY ===\")\n",
    "    print(f\"Processing bibliography section: {bibliography_section['title']}\")\n",
    "    \n",
    "    # Use the LLM to extract references\n",
    "    client = OpenAI(api_key=openai_api_key or os.environ.get(\"OPENAI_API_KEY\"))\n",
    "    \n",
    "    prompt = f\"\"\"\n",
    "    Parse this bibliography into a structured format with the following fields:\n",
    "    \n",
    "    1. reference_id: The numeric ID of the reference (e.g., 1 from [1])\n",
    "    2. reference_text: The full text of the reference\n",
    "    3. pmcid: Extract any PMCID if present (e.g., \"PMC1234567\"), otherwise leave empty\n",
    "    \n",
    "    Bibliography:\n",
    "    ```\n",
    "    {bibliography_text}\n",
    "    ```\n",
    "    \n",
    "    IMPORTANT:\n",
    "    - Make sure to extract ALL references\n",
    "    - Capture the entire reference text, including any URLs or DOIs\n",
    "    - Pay special attention to extracting PMCIDs correctly\n",
    "    - If there are no PMCIDs, it's fine to leave them empty\n",
    "    - Verify that the reference numbers form a complete sequence (1,2,3,...) with no gaps\n",
    "    \"\"\"\n",
    "    \n",
    "    try:\n",
    "        completion = client.beta.chat.completions.parse(\n",
    "            model=\"gpt-4.1\",\n",
    "            messages=[\n",
    "                {\"role\": \"system\", \"content\": \"You are an expert at parsing academic bibliographies into structured data.\"},\n",
    "                {\"role\": \"user\", \"content\": prompt}\n",
    "            ],\n",
    "            response_format=ReferenceList\n",
    "        )\n",
    "        \n",
    "        references = completion.choices[0].message.parsed.references\n",
    "        \n",
    "        # Verify we have all references in sequence\n",
    "        extracted_ids = sorted([ref.reference_id for ref in references])\n",
    "        expected_ids = list(range(1, max(extracted_ids) + 1))\n",
    "        \n",
    "        if extracted_ids != expected_ids:\n",
    "            missing_ids = set(expected_ids) - set(extracted_ids)\n",
    "            print(f\"WARNING: Missing references: {missing_ids}\")\n",
    "        \n",
    "        print(f\"Extracted {len(references)} references from bibliography\")\n",
    "        return references\n",
    "        \n",
    "    except Exception as e:\n",
    "        print(f\"Error extracting bibliography: {e}\")\n",
    "        return []"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Processing document: R35_MIRA_document\n",
      "\n",
      "=== EXTRACTING CITATIONS USING LLM ===\n",
      "DEBUG: Processing section_id=section_0, title='# A Background'\n",
      "DEBUG: Content sample: A Background \n",
      "\n",
      "Our research is focused on identifying and understanding the mechanisms that ensure t...\n",
      "Extracted 22 citations from section '# A Background'\n",
      "DEBUG: Processing section_id=section_1, title='# B Recent Research Progress'\n",
      "DEBUG: Content sample: B Recent Research Progress \n",
      "\n",
      "Our work over the last four years has described, at nucleotide resoluti...\n",
      "Extracted 19 citations from section '# B Recent Research Progress'\n",
      "DEBUG: Processing section_id=section_2, title='# C Overview of Future Research'\n",
      "DEBUG: Content sample: C Overview of Future Research \n",
      "\n",
      "We will continue to work on and address major questions in DNA repli...\n",
      "Extracted 0 citations from section '# C Overview of Future Research'\n",
      "DEBUG: Processing section_id=subsection_0, title='## C. 1 Chromatin assembly behind the fork'\n",
      "DEBUG: Content sample: C. 1 Chromatin assembly behind the fork\n",
      "\n",
      "Passage of the DNA replication fork through chromatin resul...\n",
      "Extracted 16 citations from section '## C. 1 Chromatin assembly behind the fork'\n",
      "DEBUG: Processing section_id=subsection_1, title='## C. 2 DNA replication and genome integrity'\n",
      "DEBUG: Content sample: C. 2 DNA replication and genome integrity \n",
      "\n",
      "Coupling of helicase activity with the replisome and act...\n",
      "Extracted 13 citations from section '## C. 2 DNA replication and genome integrity'\n",
      "DEBUG: Processing section_id=subsection_2, title='## C. 3 Gene regulation'\n",
      "DEBUG: Content sample: C. 3 Gene regulation \n",
      "\n",
      "A major challenge in genome biology is deciphering the complex regulatory cod...\n",
      "Extracted 3 citations from section '## C. 3 Gene regulation'\n",
      "DEBUG: Processing section_id=subsection_3, title='## C. 4 Technology development'\n",
      "DEBUG: Content sample: C. 4 Technology development\n",
      "\n",
      "We have continued to develop and extend our GCOP assay to describe the ...\n",
      "Extracted 8 citations from section '## C. 4 Technology development'\n",
      "Skipping bibliography section: # Literature Cited\n",
      "DEBUG: Processing section_id=figure_0, title='Figure 1'\n",
      "DEBUG: Content sample: Figure 1. GCOP of the GCV1 locus in Control and bas $1 \\Delta$ cells. Gene bodies are depicted in li...\n",
      "Extracted 0 citations from section 'Figure 1'\n",
      "DEBUG: Processing section_id=figure_1, title='Figure 2'\n",
      "DEBUG: Content sample: Figure 2. Model for helicase activation and stalling in the absence of active DNA replication. Follo...\n",
      "Extracted 0 citations from section 'Figure 2'\n",
      "DEBUG: Processing section_id=figure_2, title='Figure 3'\n",
      "DEBUG: Content sample: Figure 3. Nascent GCOPs reveal heterogeneous nucleosome deposition in cac1 $\\Delta$ cells. Nucleosom...\n",
      "Extracted 0 citations from section 'Figure 3'\n",
      "DEBUG: Processing section_id=figure_3, title='Figure 4'\n",
      "DEBUG: Content sample: Figure 4. Strand specific nascent GCOPs reveal preferential deposition of nucleosomes on the leading...\n",
      "Extracted 0 citations from section 'Figure 4'\n",
      "DEBUG: Processing section_id=figure_4, title='Figure 5'\n",
      "DEBUG: Content sample: Figure 5. BAS1 chromatin-based regulatory network. Nodes represent genes with altered chromatin in t...\n",
      "Extracted 0 citations from section 'Figure 5'\n",
      "Extracted 81 total citations from document using LLM\n",
      "\n",
      "=== EXTRACTING BIBLIOGRAPHY ===\n",
      "Processing bibliography section: # Literature Cited\n",
      "Extracted 81 references from bibliography\n",
      "\n",
      "===== CITATION ANALYSIS SUMMARY =====\n",
      "Total citations extracted: 81\n",
      "Total references extracted: 81\n",
      "\n",
      "Citation distribution by section:\n",
      "  - # A Background: 22 citations\n",
      "  - # B Recent Research Progress: 19 citations\n",
      "  - ## C. 1 Chromatin assembly behind the fork: 16 citations\n",
      "  - ## C. 2 DNA replication and genome integrity: 13 citations\n",
      "  - ## C. 4 Technology development: 8 citations\n",
      "  - ## C. 3 Gene regulation: 3 citations\n",
      "  - Figure 5: 0 citations\n",
      "  - # C Overview of Future Research: 0 citations\n",
      "  - # Literature Cited: 0 citations\n",
      "  - Figure 3: 0 citations\n",
      "  - Figure 4: 0 citations\n",
      "  - Figure 1: 0 citations\n",
      "  - Figure 2: 0 citations\n",
      "Saved 81 references to citation_analysis/bibliography.csv\n",
      "\n",
      "Analysis results saved to citation_analysis/\n",
      "\n",
      "=== SAMPLE OF EXTRACTED CITATIONS ===\n",
      "                                               claim citation_text  \\\n",
      "0  Duplication of the genome, once and only once ...        $^{1}$   \n",
      "1  The origin recognition complex (ORC) bound to ...        $^{2}$   \n",
      "2  Helicase loading is restricted to G1 and, in e...        $^{3}$   \n",
      "3  DDK and CDK promote the recruitment of additio...        $^{4}$   \n",
      "4  Helicase progression and coupling with the rep...        $^{5}$   \n",
      "\n",
      "  citation_keys section_id  paragraph_length  \n",
      "0             1  section_0              1117  \n",
      "1             2  section_0              1117  \n",
      "2             3  section_0              1117  \n",
      "3             4  section_0              1117  \n",
      "4             5  section_0              1018  \n",
      "\n",
      "=== EXAMPLE CLAIMS WITH EXPANDED CITATION KEYS ===\n",
      "\n",
      "1. Claim: Duplication of the genome, once and only once per cell cycle, is accomplished by licensing and activ...\n",
      "   Citation: $^{1}$\n",
      "   Expanded keys: [1]\n",
      "\n",
      "2. Claim: The origin recognition complex (ORC) bound to defined origin sequences in the budding yeast, S. cere...\n",
      "   Citation: $^{2}$\n",
      "   Expanded keys: [2]\n",
      "\n",
      "3. Claim: Helicase loading is restricted to G1 and, in effect, 'licenses' the origin for activation in the com...\n",
      "   Citation: $^{3}$\n",
      "   Expanded keys: [3]\n",
      "\n",
      "4. Claim: DDK and CDK promote the recruitment of additional proteins including Cdc45 and the GINS complex culm...\n",
      "   Citation: $^{4}$\n",
      "   Expanded keys: [4]\n",
      "\n",
      "5. Claim: Helicase progression and coupling with the replisome at each fork is a tightly regulated process con...\n",
      "   Citation: $^{5}$\n",
      "   Expanded keys: [5]\n",
      "\n",
      "=== SAMPLE OF EXTRACTED REFERENCES ===\n",
      "\n",
      "1. Reference ID: 1\n",
      "   Text: J. F. X. Diffley. Regulation of early events in chromosome replication. Curr. Biol., 14(18):R778-86,...\n",
      "\n",
      "2. Reference ID: 2\n",
      "   Text: S. P. Bell and J. M. Kaguni. Helicase loading at chromosomal origins of replication. Cold Spring Har...\n",
      "   PMCID: PMC3660832\n",
      "\n",
      "3. Reference ID: 3\n",
      "   Text: K. Siddiqui, K. F. On, and J. F. X. Diffley. Regulating DNA replication in eukarya. Cold Spring Harb...\n",
      "   PMCID: PMC3753713\n",
      "\n",
      "4. Reference ID: 4\n",
      "   Text: S. Tanaka and H. Araki. Helicase activation and establishment of replication forks at chromosomal or...\n",
      "   PMCID: PMC3839609\n",
      "\n",
      "5. Reference ID: 5\n",
      "   Text: A. W. McClure and J. F. Diffley. Rad53 checkpoint kinase regulation of DNA replication fork rate via...\n",
      "   PMCID: PMC8387023\n"
     ]
    }
   ],
   "source": [
    "# Enter the path to your processed document JSON\n",
    "document_json_path = \"R35_MIRA_document.json\"\n",
    "\n",
    "# Enter your OpenAI API key here or set it as an environment variable\n",
    "openai_api_key = os.environ.get(\"OPENAI_API_KEY\")\n",
    "\n",
    "try:\n",
    "    # Check if the document JSON exists\n",
    "    if not Path(document_json_path).exists():\n",
    "        print(f\"Document JSON not found at {document_json_path}\")\n",
    "        print(\"Run the ocr.ipynb notebook first to create the document JSON.\")\n",
    "    elif not openai_api_key:\n",
    "        print(\"OpenAI API key not provided. Set the OPENAI_API_KEY environment variable or provide it directly.\")\n",
    "    else:\n",
    "        # Analyze the document's citations using LLM\n",
    "        results = analyze_citations_with_llm(document_json_path, openai_api_key)\n",
    "        \n",
    "        # Save the results\n",
    "        citations_df = save_citation_results(results)\n",
    "        \n",
    "        # Display a sample of the results\n",
    "        print(\"\\n=== SAMPLE OF EXTRACTED CITATIONS ===\")\n",
    "        print(citations_df.head())\n",
    "        \n",
    "        # Print some example claims and their citation keys\n",
    "        print(\"\\n=== EXAMPLE CLAIMS WITH EXPANDED CITATION KEYS ===\")\n",
    "        for i, citation in enumerate(results.citations[:5]):  # Show first 5 citations\n",
    "            print(f\"\\n{i+1}. Claim: {citation.claim[:100]}...\" if len(citation.claim) > 100 else f\"\\n{i+1}. Claim: {citation.claim}\")\n",
    "            print(f\"   Citation: {citation.citation_text}\")\n",
    "            print(f\"   Expanded keys: {citation.citation_keys}\")\n",
    "        \n",
    "        # Print sample of extracted references if available\n",
    "        if results.references:\n",
    "            print(\"\\n=== SAMPLE OF EXTRACTED REFERENCES ===\")\n",
    "            for i, ref in enumerate(results.references[:5]):  # Show first 5 references\n",
    "                print(f\"\\n{i+1}. Reference ID: {ref.reference_id}\")\n",
    "                print(f\"   Text: {ref.reference_text[:100]}...\" if len(ref.reference_text) > 100 else f\"   Text: {ref.reference_text}\")\n",
    "                if ref.pmcid:\n",
    "                    print(f\"   PMCID: {ref.pmcid}\")\n",
    "        \n",
    "except Exception as e:\n",
    "    print(f\"Error analyzing document: {e}\")\n",
    "    import traceback\n",
    "    traceback.print_exc()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Conclusion\n",
    "\n",
    "In this notebook, we:\n",
    "1. Loaded the processed document from JSON\n",
    "2. Used an LLM to extract citations section-by-section\n",
    "3. For each citation, the LLM:\n",
    "   - Identified the scientific claim being supported\n",
    "   - Extracted the exact citation text\n",
    "   - Expanded citation keys (ranges and lists) into individual numbers\n",
    "   - Captured the surrounding paragraph for context\n",
    "4. Extracted bibliography references with PMCIDs when available\n",
    "5. Saved the structured citation data for further analysis\n",
    "\n",
    "The LLM-based approach provides a more intelligent extraction of citations by understanding the context and automatically handling the expansion of citation ranges and lists. This eliminates the need for complex regex patterns and separate post-processing steps, resulting in more accurate identification of claims and their supporting citations.\n",
    "\n",
    "### Future Improvements\n",
    "\n",
    "- **Document Structure Enhancement**: When processing markdown for the document structure JSON, add a specific flag for bibliography sections to make them easier to identify and skip during citation extraction.\n",
    "- **Subsection Processing Optimization**: Improve how subsections are processed to ensure more efficient citation extraction throughout the document hierarchy.\n",
    "- **Citation Pattern Recognition**: Further refine the LLM prompt to handle various citation formats including uncommon LaTeX variations.\n",
    "- **Bibliography Chunking**: If bibliography extraction completeness is an issue, implement a chunking strategy to process larger bibliographies.\n",
    "- **PMCID Validation**: Add additional validation for PMCIDs, including format checking and verification."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.11"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Open Source Science