assessment-tool

A tool that supports assessment, especially for checking references in the report

https://github.com/janetyc/assessment-tool

Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

✓ CITATION.cff file: Found CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
✓ .zenodo.json file: Found .zenodo.json file
○ DOI references
○ Academic publication links
○ Academic email domains
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (14.8%) to scientific vocabulary

Last synced: 7 months ago

Repository

Basic Info

Host: GitHub
Owner: janetyc
License: mit
Language: Python
Default Branch: main
Size: 38.1 KB

Statistics

Stars: 2
Watchers: 1
Forks: 1
Open Issues: 0
Releases: 0

Created 10 months ago · Last pushed 9 months ago

Metadata Files

Readme License Citation

Assessment Tool

OVERVIEW

The Assessment Tool includes a PDF Citation Analyzer, which is designed for academic researchers, students, and educators to analyze and validate citations in PDF documents. It provides automated citation style detection, format validation, and comprehensive reporting capabilities.

FEATURES

🔍 Advanced Citation Analysis:

Automatic detection of 5 major citation styles (APA, MLA, Chicago, IEEE, ACM)
Confidence scoring for style detection accuracy
Citation format validation with error reporting
Component extraction (authors, titles, years, journals, etc.)

📝 Enhanced Text Processing:

PDF text extraction with page-by-page breakdown
Advanced reference section detection with flexible header recognition
Support for multiple reference formats: [1], 1., plain numbers, Author et al.
Smart separation of body text and references section
In-text citation detection (numbered and author-year formats)
Context sentence extraction for citations
Interactive content editing with real-time re-analysis

📊 Quality Assessment:

Style consistency analysis across document
Format compliance checking
Missing component identification
Comprehensive validation reporting

🔗 Reference Validation:

Google Scholar integration for manual verification
Rate-limited external service calls
Professional error handling and fallback options

🎛️ Interactive Features:

Real-time content editing and re-analysis
Session state management with automatic updates
Force refresh functionality for widget issues
Enhanced debugging information for troubleshooting
Smart pattern filtering for non-reference content

📈 Reporting & Export:

Detailed citation analysis reports
Style distribution statistics
Downloadable analysis results
Professional formatting and presentation

SUPPORTED CITATION STYLES

APA (American Psychological Association)
- Author, A. A. (Year). Title. Journal, volume(issue), pages.
MLA (Modern Language Association)
- Author, First. "Title." Journal, vol. #, no. #, Year, pp. ##-##.
Chicago Manual of Style
- Author, First. "Title." Journal vol. #, no. # (Year): pages.
IEEE (Institute of Electrical and Electronics Engineers)
- [#] A. Author, "Title," Journal, vol. #, no. #, pp. ##-##, Year.
ACM (Association for Computing Machinery)
- [#] First Last and First Last. Year. Title. Journal vol, issue (Month Year), pages.
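
To make the templates above concrete, here is a minimal sketch of component extraction for the APA journal form only. The `APA_JOURNAL` pattern and the `parse_apa_journal` helper are hypothetical names written for this illustration; they are not taken from `citation_analyzer.py`, and they assume a title without internal periods.

```python
import re
from typing import Dict, Optional

# Hypothetical pattern for "Author, A. A. (Year). Title. Journal, volume(issue), pages."
# It is an illustration only and does not cover books, chapters, or missing issues.
APA_JOURNAL = re.compile(
    r'^(?P<authors>.+?)\s+\((?P<year>\d{4})\)\.\s+'   # authors, then (year).
    r'(?P<title>.+?)\.\s+'                             # title up to the first period
    r'(?P<journal>[^,]+),\s+'                          # journal name up to the comma
    r'(?P<volume>\d+)\((?P<issue>\d+)\),\s+'           # volume(issue),
    r'(?P<pages>\d+[-–]\d+)\.?$'                       # page range, optional final period
)

def parse_apa_journal(reference: str) -> Optional[Dict[str, str]]:
    """Return the named components, or None if the reference does not fit."""
    match = APA_JOURNAL.match(reference.strip())
    return match.groupdict() if match else None

if __name__ == "__main__":
    print(parse_apa_journal(
        "Smith, J. A. (2020). Deep learning for citations. Journal of Testing, 12(3), 45-67."
    ))
```

A reference in another style (for example the MLA form) simply returns `None`, which is one way a per-style parser can signal "not my format" to the caller.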

TECHNICAL REQUIREMENTS

Python 3.7 or higher
Streamlit web framework
PyPDF2 for PDF processing
requests for network calls
Standard Python libraries (re, urllib, datetime, etc.)

INSTALLATION & SETUP

Clone or download the project files
Activate virtual environment (if using): source assessment-tool/bin/activate
Install required dependencies: pip install streamlit PyPDF2 requests
Run the application: streamlit run citation_analyzer.py
Open web browser to: http://localhost:8501

USAGE INSTRUCTIONS

Upload PDF: Click "Choose a PDF file" and select your academic document
View Content: Check extracted text in the "Content" tab for accuracy
- Edit content directly in the text area if needed
- Fix OCR errors, formatting issues, or reference section problems
Interactive Re-analysis:
- Modify content in the editable text area
- Click "Re-analyze Citations" to update all analysis tabs
- Use "Force Refresh" if the text area seems unresponsive
- Click "Reset to Original" to restore the original PDF content
Analyze In-Text Citations: Review citations within context sentences
Examine References: View detected references and their formatting
- Check the "Raw References Text" expander for debugging
- References update automatically after re-analysis
Citation Analysis: Get detailed style analysis with:
- Individual reference style detection
- Format validation results
- Component extraction details
- Style consistency warnings
Download Reports: Export comprehensive analysis reports for documentation

APPLICATION STRUCTURE

The application is organized into 7 logical sections:

Configuration & Constants
- Citation style definitions
- Rate limiting configuration
- Application settings
Citation Style Definitions
- Comprehensive pattern matching for each style
- Validation rules and requirements
- Style-specific formatting guidelines
Utility Functions
- Domain extraction and rate limiting
- Helper functions for processing
Enhanced Text Processing & Extraction
- PDF text extraction with page separation
- Advanced reference section detection with flexible header recognition
- Support for multiple reference formats: [1], 1., plain numbers, Author et al.
- Smart separation handles page numbers, section headers, and headerless references
- In-text citation identification
- Interactive content modification and re-analysis
Citation Style Analysis
- Automated style detection algorithms
- Confidence scoring mechanisms
- Component extraction by style
URL & DOI Validation
- URL cleaning and validation
- DOI resolution checking
- Error handling for network requests
UI Display Functions
- Professional result presentation
- Report generation and formatting
- User interface components
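
For the URL & DOI Validation section, the offline part of the work (finding DOIs in a reference and building resolver URLs) might look like the sketch below. `DOI_PATTERN`, `extract_dois`, and `doi_url` are assumed names for illustration, not functions from the tool, and no network request is made here.

```python
import re
from typing import List

# Assumed DOI shape: "10.", a 4-9 digit registrant code, "/", then a suffix.
# Real DOIs allow more variety; this is a pragmatic approximation.
DOI_PATTERN = re.compile(r'\b10\.\d{4,9}/[^\s"<>]+')

def extract_dois(text: str) -> List[str]:
    """Find DOI-like strings and trim punctuation that usually belongs to the sentence."""
    return [found.rstrip('.,;)') for found in DOI_PATTERN.findall(text)]

def doi_url(doi: str) -> str:
    """Build the resolver URL for a DOI (no request is sent)."""
    return "https://doi.org/" + doi

if __name__ == "__main__":
    sample = "See doi:10.1145/3290605.3300233. Also https://doi.org/10.1000/xyz123, end"
    for doi in extract_dois(sample):
        print(doi_url(doi))
```

Trimming a trailing `)` can damage the rare DOI that legitimately ends with one, so a stricter implementation would check for balanced parentheses first.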

KEY ALGORITHMS

Advanced Reference Detection: Multi-pass algorithm that:
- Detects reference headers with various prefixes (page numbers, section numbers)
- Handles complex formats: "Page 15 References", "7. References", "Appendix A: References"
- Identifies references without explicit headers using content pattern analysis
- Supports references anywhere in document, not just at page boundaries
Enhanced Reference Format Support: Extracts references in multiple formats:
- Bracketed numbers: [1] Author, A. (2023). Title...
- Numbered with periods: 1. Author, A. (2023). Title...
- Plain numbers: 1 Author, A. (2023). Title...
- Author names: Author, A., Co-author, B. (2023). Title...
- Et al. format: Authorname et al. (2023). Title...
Style Detection Engine: Uses multi-pattern matching with confidence scoring to identify citation styles based on formatting patterns, punctuation, and structure
Component Extraction: Style-specific parsing algorithms extract bibliographic components (authors, titles, years, etc.) for detailed analysis
Format Validation: Rule-based validation checks citations against academic style guidelines and reports errors/warnings
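
One plausible reading of "multi-pattern matching with confidence scoring" is sketched below: each style has several cues, and the confidence is the fraction of cues that match. This scoring rule, the reduced `STYLE_CUES` table, and the `detect_style` name are assumptions for illustration; the analyzer's own scoring may differ.

```python
import re
from typing import Dict, List, Tuple

# Hypothetical, reduced cue sets; the real pattern lists are longer.
STYLE_CUES: Dict[str, List[str]] = {
    "APA": [r'^[A-Z][a-z]+,\s+[A-Z]\.', r'\(\d{4}\)\.', r'\d+\(\d+\),\s*\d+[-–]\d+'],
    "IEEE": [r'^\[\d+\]\s*[A-Z]\.\s*[A-Z][a-z]+', r'"[^"]+,"', r'vol\.\s*\d+', r',\s*\d{4}\.$'],
    "ACM": [r'^\[\d+\]\s*[A-Z][a-z]+\s+[A-Z][a-z]+', r'\.\s+\d{4}\.\s', r'\([A-Z][a-z]+\.?\s+\d{4}\)'],
}

def detect_style(reference: str) -> Tuple[str, float]:
    """Return (style, confidence), where confidence is matched cues / total cues."""
    ref = reference.strip()
    scores = {
        style: sum(bool(re.search(cue, ref)) for cue in cues) / len(cues)
        for style, cues in STYLE_CUES.items()
    }
    best = max(scores, key=scores.get)
    if scores[best] == 0:
        return "Unknown", 0.0
    return best, scores[best]

if __name__ == "__main__":
    print(detect_style("[3] Jane Doe and John Roe. 2021. A title."))
```

A fractional score makes it easy to flag partially matching references for manual review rather than forcing a yes/no decision.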

VALIDATION FEATURES

Missing author information detection
Publication year validation and formatting
Title presence and formatting checks
Style-specific punctuation validation
Reference completeness assessment
Style consistency across document
Google Scholar verification links
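
A few of the checks listed above can be sketched as simple rules. The rules, messages, and the `validate_reference` name below are hypothetical and deliberately crude (for instance, any capitalized first word counts as an author); they only illustrate the rule-based approach, not the tool's actual checks.

```python
import re
from datetime import date
from typing import List, Optional

# Strips a leading "[1]" or "1." marker before the checks run.
LEADING_MARKER = re.compile(r'^(?:\[\d+\]|\d+\.)\s*')

def validate_reference(reference: str, max_year: Optional[int] = None) -> List[str]:
    """Return a list of issue messages; an empty list means no rule fired."""
    if max_year is None:
        max_year = date.today().year
    issues: List[str] = []
    body = LEADING_MARKER.sub('', reference.strip())

    # Rule 1 (assumption): a reference should open with a capitalized name.
    if not re.match(r"[A-Z][A-Za-z'\-]+", body):
        issues.append("missing author")

    # Rule 2: at least one plausible four-digit year (1500-2099), not in the future.
    years = [int(y) for y in re.findall(r'\b(?:1[5-9]|20)\d{2}\b', body)]
    if not years:
        issues.append("missing year")
    elif min(years) > max_year:
        issues.append("year in the future")

    # Rule 3 (assumption): references end with a period.
    if not body.endswith('.'):
        issues.append("missing terminal period")
    return issues

if __name__ == "__main__":
    print(validate_reference("[4] (n.d.). untitled report"))
```

Four-digit page numbers can be mistaken for years by the second rule, which is one reason automated results still benefit from manual verification.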

OUTPUT & REPORTING

The tool generates comprehensive reports including:
- Total reference count and distribution
- Citation style breakdown with percentages
- Individual reference analysis with confidence scores
- Format validation results with specific errors
- Component extraction details
- Style consistency warnings
- Downloadable text reports for documentation
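
As a rough illustration of the style breakdown with percentages, the sketch below counts detected styles and renders a plain-text summary. `style_breakdown` and `format_report` are hypothetical helpers, and the report layout is an assumption rather than the tool's actual output format.

```python
from collections import Counter
from typing import Dict, List

def style_breakdown(styles: List[str]) -> Dict[str, float]:
    """Map each detected style to its share of all references, in percent."""
    counts = Counter(styles)
    total = sum(counts.values())
    if total == 0:
        return {}
    # most_common() orders by count, so the dominant style comes first.
    return {style: round(100 * n / total, 1) for style, n in counts.most_common()}

def format_report(styles: List[str]) -> str:
    """Render a small plain-text report, with a warning when styles are mixed."""
    lines = [f"Total references: {len(styles)}"]
    breakdown = style_breakdown(styles)
    for style, pct in breakdown.items():
        lines.append(f"{style}: {pct}%")
    if len(breakdown) > 1:
        lines.append("Warning: mixed citation styles detected")
    return "\n".join(lines)

if __name__ == "__main__":
    print(format_report(["APA", "APA", "IEEE", "ACM"]))
```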

PROFESSIONAL USE CASES

🎓 Academic Research: Validate citations in research papers and dissertations

📚 Educational Assessment: Check student paper citation quality and consistency

📝 Manuscript Review: Verify reference formatting before journal submission

🔍 Quality Assurance: Systematic citation analysis for institutional standards

📊 Style Compliance: Ensure adherence to specific academic style guidelines

LIMITATIONS & CONSIDERATIONS

PDF quality affects text extraction accuracy
Complex citation formats may require manual verification
Style detection confidence varies with citation complexity
Network-dependent features require internet connectivity
Rate limiting prevents excessive external service calls

TROUBLESHOOTING

Common Issues:

References not updating:
- Make sure to click "Re-analyze Citations" after editing content
- Try "Force Refresh" if the text area seems unresponsive
- Check the debug information for content length changes
No references found: Check PDF has extractable text and reference content (tool now handles headers with page numbers, section numbers, and headerless references)
Text area not responding: Use "Force Refresh" button to reset the widget state
Old content persisting: Click "Reset to Original" to restore the original PDF content
Low confidence scores: Complex or non-standard citation formats may need manual review
Style inconsistencies: Mixed citation styles require editorial attention
Network errors: Check internet connection for external validation features

Interactive Re-analysis Tips:

Edit content directly in the text area to fix extraction issues
Add reference headers manually if not detected: "References", "Bibliography"
Fix concatenated headers: change "19REFERENCES" to "19 REFERENCES"
Separate merged references into individual lines
Remove non-reference content like "(No references available)"
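
The concatenated-header tip can also be applied in bulk before pasting text back into the editor. The sketch below is an assumption-laden helper (`unglue_headers` and the two-word `HEADER_WORDS` subset are made up for illustration) that inserts a space between a leading number and a header word.

```python
import re

# Subset of header words, chosen for illustration only.
HEADER_WORDS = ("references", "bibliography")

# A line that starts with digits glued directly to a header word, e.g. "19REFERENCES".
_GLUED_HEADER = re.compile(
    r'^(\d+)(' + "|".join(HEADER_WORDS) + r')\b',
    re.IGNORECASE | re.MULTILINE,
)

def unglue_headers(text: str) -> str:
    """Turn '19REFERENCES' into '19 REFERENCES' at the start of any line."""
    return _GLUED_HEADER.sub(r'\1 \2', text)

if __name__ == "__main__":
    print(unglue_headers("intro\n19REFERENCES\n[1] A. Author"))
```

Lines that already contain the space are left unchanged, so the helper is safe to run more than once.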

For best results:
- Use PDFs with clear, machine-readable text
- Reference sections can now be detected even without clear headers
- Supports various reference formats: [1], 1., plain numbers, Author et al.
- Tool handles prefixed headers like "Page 15 References" or "7. References"
- Use consistent citation style throughout document
- Utilize the interactive editing feature for challenging formats

TECHNICAL SUPPORT

This tool is designed for academic and research purposes. For technical issues:

Verify PDF quality and text extractability
Check citation format against standard style guides
Review debug information in application tabs
Use manual verification options when automated detection fails

Created for academic excellence and citation quality assurance.

Owner

Name: janetyc
Login: janetyc
Kind: user

Repositories: 15
Profile: https://github.com/janetyc

Citation (citation_analyzer.py)

"""
Enhanced PDF Citation Analyzer
A comprehensive tool for extracting and analyzing citations from academic PDFs.
"""

import streamlit as st
import PyPDF2
import io
from pathlib import Path
import base64
import re
import requests
import urllib.parse
import time
from collections import defaultdict
from datetime import datetime, timedelta
from dataclasses import dataclass
from typing import List, Dict, Tuple, Optional, Any
import json

# ============================================================================
# CONFIGURATION AND CONSTANTS
# ============================================================================

# Rate limiting configuration
RATE_LIMITS = {
    'dl.acm.org': {'requests': 0, 'last_reset': datetime.now(), 'max_requests': 10, 'reset_interval': 60},
    'doi.org': {'requests': 0, 'last_reset': datetime.now(), 'max_requests': 30, 'reset_interval': 60},
    'default': {'requests': 0, 'last_reset': datetime.now(), 'max_requests': 50, 'reset_interval': 60}
}

# Set page config
st.set_page_config(
    page_title="PDF Citation Analyzer",
    page_icon="📄",
    layout="wide"
)

# ============================================================================
# CITATION STYLE DEFINITIONS
# ============================================================================

@dataclass
class CitationStyle:
    """Represents a citation style with its patterns and characteristics."""
    name: str
    patterns: List[str]
    year_pattern: str
    author_pattern: str
    title_indicators: List[str]
    common_punctuation: Dict[str, str]

# Define citation style patterns
CITATION_STYLES = {
    'APA': CitationStyle(
        name='APA',
        patterns=[
            r'^[A-Z][a-z]+,\s+[A-Z]\.\s*(?:[A-Z]\.\s*)?(?:,?\s*&\s*[A-Z][a-z]+,\s+[A-Z]\.\s*(?:[A-Z]\.\s*)?)?\s*\(\d{4}\)',
            r'^[A-Z][a-z]+,\s+[A-Z]\.\s*(?:[A-Z]\.\s*)?\s*\(\d{4}\).*?\.\s*[A-Z][a-z]+.*?,\s*\d+\s*\(\d+\),\s*\d+[-–]\d+',
            r'^[A-Z][a-z]+,\s+[A-Z]\.\s*(?:[A-Z]\.\s*)?,\s*et\s+al\.\s*\(\d{4}\)'
        ],
        year_pattern=r'\(\d{4}\)',
        author_pattern=r'^[A-Z][a-z]+,\s+[A-Z]\.',
        title_indicators=['Italics after year', 'Sentence case'],
        common_punctuation={'after_year': '.', 'after_title': '.', 'before_pages': ','}
    ),
    
    'MLA': CitationStyle(
        name='MLA',
        patterns=[
            r'^[A-Z][a-z]+,\s+[A-Z][a-z]+\.\s*"[^"]+\."\s*[A-Z][a-z]+.*?,\s*vol\.\s*\d+',
            r'^[A-Z][a-z]+,\s+[A-Z][a-z]+\.\s*[A-Z][a-z]+.*?\.\s*[A-Z][a-z]+.*?,\s*\d{4}',
            r'^[A-Z][a-z]+,\s+[A-Z][a-z]+,\s+and\s+[A-Z][a-z]+\s+[A-Z][a-z]+'
        ],
        year_pattern=r',\s*\d{4}(?:\.|,)',
        author_pattern=r'^[A-Z][a-z]+,\s+[A-Z][a-z]+',
        title_indicators=['Quotes for articles', 'Italics for books'],
        common_punctuation={'after_author': '.', 'after_title': '.', 'before_year': ','}
    ),
    
    'Chicago': CitationStyle(
        name='Chicago',
        patterns=[
            r'^[A-Z][a-z]+,\s+[A-Z][a-z]+\.\s*"[^"]+\."\s*[A-Z][a-z]+.*?\s+\d+,\s*no\.\s*\d+\s*\(\d{4}\):',
            r'^[A-Z][a-z]+,\s+[A-Z][a-z]+\.\s*[A-Z][a-z]+.*?\.\s*[A-Z][a-z]+:\s*[A-Z][a-z]+,\s*\d{4}',
            r'^\d+\.\s*[A-Z][a-z]+,\s*[A-Z][a-z]+.*?\s*\([A-Z][a-z]+:\s*[A-Z][a-z]+,\s*\d{4}\)'
        ],
        year_pattern=r'\(\d{4}\)|\,\s*\d{4}(?:\.|,)',
        author_pattern=r'^(?:\d+\.\s*)?[A-Z][a-z]+,\s+[A-Z][a-z]+',
        title_indicators=['Quotes for articles', 'Italics for books', 'Footnote numbers'],
        common_punctuation={'after_author': '.', 'after_title': '.', 'publisher_separator': ':'}
    ),
    
    'IEEE': CitationStyle(
        name='IEEE',
        patterns=[
            r'^\[\d+\]\s*[A-Z]\.\s*[A-Z][a-z]+(?:\s+and\s+[A-Z]\.\s*[A-Z][a-z]+)*,\s*"[^"]+,"',
            r'^\[\d+\]\s*[A-Z]\.\s*[A-Z][a-z]+,\s*[A-Z][a-z]+.*?\.\s*[A-Z][a-z]+:\s*[A-Z][a-z]+,\s*\d{4}',
            r'^\[\d+\]\s*[A-Z]\.\s*[A-Z][a-z]+.*?,\s*"[^"]+,"\s*in\s+Proc\.'
        ],
        year_pattern=r',\s*\d{4}(?:\.|,)',
        author_pattern=r'^\[\d+\]\s*[A-Z]\.\s*[A-Z][a-z]+',
        title_indicators=['Quotes for all titles', 'Numbered references', 'Abbreviated first names'],
        common_punctuation={'after_number': ' ', 'after_authors': ',', 'title_quotes': '"'}
    ),
    
    'ACM': CitationStyle(
        name='ACM',
        patterns=[
            r'^\[\d+\]\s*[A-Z][a-z]+\s+(?:[A-Z]\.\s+)?[A-Z][a-z]+(?:\s+and\s+[A-Z][a-z]+\s+(?:[A-Z]\.\s+)?[A-Z][a-z]+)*\.\s+\d{4}\.',
            r'^(?:\[\d+\]\s*)?[A-Z][a-z]+\s+[A-Z][a-z]+(?:\s+and\s+[A-Z][a-z]+\s+[A-Z][a-z]+)*\.\s+\d{4}\.',
            r'^(?:\[\d+\]\s*)?[A-Z][a-z]+\s+[A-Z][a-z]+,?\s+et\s+al\.\s+\d{4}\.',
            r'[A-Z][a-z]+\.?\s+[A-Z]+\s+\d+,\s*\d+\s*\([A-Z][a-z]+\.?\s+\d{4}\),\s*\d+[-–]\d+',
            r'In\s+Proceedings\s+of\s+(?:the\s+)?\d*(?:st|nd|rd|th)?\.?\s*[A-Z]',
            r'Article\s+\d+\s*\([A-Z][a-z]+\s+\d{4}\),\s*\d+\s+pages?'
        ],
        year_pattern=r'\b\d{4}\b(?:\.|,|\s|$)',
        author_pattern=r'^(?:\[\d+\]\s*)?[A-Z][a-z]+\s+(?:[A-Z]\.\s+)?[A-Z][a-z]+',
        title_indicators=['Full names preferred', 'Year follows authors with period', 'Square brackets for numbers', 'DOI/URL at end', 'Month in parentheses for journals'],
        common_punctuation={'after_year': '.', 'between_authors': ' and ', 'after_title': '.', 'ref_brackets': '[]', 'page_separator': ', '}
    )
}

# ============================================================================
# UTILITY FUNCTIONS
# ============================================================================

def get_domain(url):
    """Extract domain from URL."""
    try:
        return urllib.parse.urlparse(url).netloc.lower()
    except Exception:  # avoid a bare except, which would also swallow KeyboardInterrupt
        return None

def check_rate_limit(domain):
    """Check if we've hit rate limit for a domain."""
    now = datetime.now()
    rate_info = RATE_LIMITS.get(domain, RATE_LIMITS['default'])
    
    if (now - rate_info['last_reset']).total_seconds() > rate_info['reset_interval']:
        rate_info['requests'] = 0
        rate_info['last_reset'] = now
    
    if rate_info['requests'] >= rate_info['max_requests']:
        return False
    
    rate_info['requests'] += 1
    return True

# ============================================================================
# TEXT PROCESSING AND EXTRACTION FUNCTIONS
# ============================================================================

def split_body_and_references(text):
    """
    Enhanced function to split the extracted PDF text into body_text and references_text.
    Handles various scenarios:
    - Reference headers with page numbers: "Page 15 References"
    - Section numbers: "7. References" or "Section 7 References"
    - Headers anywhere in line: "15 REFERENCES AND BIBLIOGRAPHY"
    - References without explicit headers (detected by reference patterns)
    - Multiple possible reference section locations
    """
    # Handle None or empty text
    if not text:
        return "", ""
    
    ref_headers = [
        "references", "bibliography", "works cited", "literature cited", 
        "cited works", "sources", "reference list", "citations"
    ]
    
    lines = text.split('\n')
    page_delim_pattern = re.compile(r'^--- Page \d+ ---$', re.IGNORECASE)
    
    def contains_reference_header(line_text):
        """Check if line contains a reference header, handling various prefixes."""
        line_lower = line_text.lower().strip()
        
        # First, check if line starts with a reference header pattern
        # We'll be more lenient with length if it starts with a clear header
        header_at_start = False
        
        # Quick check for headers at start of line
        for header in ref_headers:
            if line_lower.startswith(header) or re.match(rf'^\d+\s*{header}', line_lower) or re.match(rf'^page\s*\d*\s*{header}', line_lower):
                header_at_start = True
                break
        
        # If header is NOT at start and line is too long, skip (likely body text)
        if not header_at_start and len(line_lower) > 100:
            return False
        
        # Skip if line contains common sentence indicators (unless header is clearly at start)
        sentence_indicators = ['the ', 'this ', 'these ', 'those ', 'for ', 'see ', 'in ', 'of ', 'to ', 'with ', 'from ', 'and ', 'or ', 'but ', 'however ', 'therefore ', 'according ', 'based on']
        if not header_at_start and any(indicator in line_lower for indicator in sentence_indicators):
            return False
        
        # Patterns to match reference headers with various prefixes (more restrictive)
        patterns = [
            # Direct header match (start of line or after number/section)
            rf'^({"|".join(ref_headers)})(?:\s|$)',
            # With page numbers: "Page 15 References"
            rf'^page\s+\d+\s+({"|".join(ref_headers)})(?:\s|$)',
            # With section numbers: "7. References" or "Section 7 References"  
            rf'^(?:section\s+)?\d+\.?\s+({"|".join(ref_headers)})(?:\s|$)',
            # Roman numerals: "VII. References"
            rf'^(?:section\s+)?[ivxlcdm]+\.?\s+({"|".join(ref_headers)})(?:\s|$)',
            # Appendix: "Appendix A: References"
            rf'^appendix\s+[a-z]\.?\s*:?\s*({"|".join(ref_headers)})(?:\s|$)',
            # Just numbers before: "15 REFERENCES" (but only if short line)
            rf'^\d+\s+({"|".join(ref_headers)})(?:\s|$)',
            # Concatenated formats: "19REFERENCES", "15BIBLIOGRAPHY" (numbers directly attached)
            rf'^\d+({"|".join(ref_headers)})(?:\s|$)',
            # Concatenated with page: "PageREFERENCES", "Page15REFERENCES"
            rf'^page\d*({"|".join(ref_headers)})(?:\s|$)',
            # Centered or standalone headers (short lines with only the header word)
            rf'^({"|".join(ref_headers)})(?:\s+(?:and\s+)?(?:bibliography|citations|list|sources))?$',
            # Combined headers like "References and Bibliography"
            rf'^({"|".join(ref_headers)})\s+and\s+({"|".join(ref_headers)})$'
        ]
        
        for pattern in patterns:
            if re.search(pattern, line_lower):
                return True
        return False
    
    def extract_content_after_header(line_text, header_found):
        """Extract any content that comes after the reference header in the same line."""
        line_lower = line_text.lower()
        
        # Find where the header ends and extract remainder
        for header in ref_headers:
            if header in line_lower:
                header_pos = line_lower.find(header)
                header_end = header_pos + len(header)
                remainder = line_text[header_end:].strip()
                
                # Remove common separators
                remainder = re.sub(r'^[\s\-.:]+', '', remainder)
                return remainder
        return ""
    
    def looks_like_references_section(lines_sample):
        """Check if a section looks like references based on content patterns."""
        if not lines_sample:
            return False
            
        # Join sample lines to analyze
        sample_text = '\n'.join(lines_sample[:10])  # Check first 10 lines
        
        # Count reference-like patterns
        ref_patterns = [
            r'^\s*\[\d+\]',  # [1] format
            r'^\s*\d+\.',    # 1. format  
            r'^\s*\d+\s+[A-Z]', # Plain number format
            r'[A-Z][a-z]+,\s+[A-Z]\..*?\(\d{4}\)', # Author, A. (year)
            r'et\s+al\.',    # et al.
            r'doi:|DOI:',    # DOI references
            r'https?://',    # URLs
            r'\(\d{4}\)',    # Years in parentheses
            r'vol\.\s*\d+|volume\s+\d+', # Volume numbers
            r'pp?\.\s*\d+',  # Page numbers
        ]
        
        pattern_count = 0
        for pattern in ref_patterns:
            matches = re.findall(pattern, sample_text, re.MULTILINE | re.IGNORECASE)
            pattern_count += len(matches)
        
        # If we find multiple reference indicators, likely a references section
        return pattern_count >= 3
    

    # First pass: Look for explicit reference headers
    for i, line in enumerate(lines):
        line_stripped = line.strip()
        
        # Check if this line contains a reference header
        if contains_reference_header(line_stripped):
            # Check if this header appears after a page delimiter (common case)
            is_after_page_break = (i > 0 and page_delim_pattern.match(lines[i - 1].strip()))
            
            # Extract any content after the header
            remainder = extract_content_after_header(line_stripped, True)
            
            # Find the end of this section (next page or end of document)
            next_page_idx = None
            for j in range(i + 1, len(lines)):
                if page_delim_pattern.match(lines[j].strip()):
                    next_page_idx = j
                    break
            
            # Collect references content, handling blank lines after headers
            if next_page_idx is not None:
                # Include content from the same line as header if present
                refs_lines = []
                if remainder:
                    refs_lines.append(remainder)
                
                # Add content after header, but clean up empty lines
                content_lines = lines[i+1:next_page_idx]
                # Keep all lines but clean up leading/trailing empty lines
                while content_lines and not content_lines[0].strip():
                    content_lines.pop(0)  # Remove leading empty lines
                while content_lines and not content_lines[-1].strip():
                    content_lines.pop()   # Remove trailing empty lines
                    
                refs_lines.extend(content_lines)
                references_text = '\n'.join(refs_lines)
                
                # Remove this section from body (also drop the page marker if the header follows a page break)
                if is_after_page_break:
                    body_before = lines[:i-1]  # Exclude page marker and header
                else:
                    body_before = lines[:i]  # Just exclude header
                body_after = lines[next_page_idx:]
                body_text = '\n'.join(body_before + body_after)
            else:
                # References go to end of document
                refs_lines = []
                if remainder:
                    refs_lines.append(remainder)
                
                # Add content after header, but clean up empty lines
                content_lines = lines[i+1:]
                while content_lines and not content_lines[0].strip():
                    content_lines.pop(0)  # Remove leading empty lines
                while content_lines and not content_lines[-1].strip():
                    content_lines.pop()   # Remove trailing empty lines
                    
                refs_lines.extend(content_lines)
                references_text = '\n'.join(refs_lines)
                
                # Remove this section from body (also drop the page marker if the header follows a page break)
                if is_after_page_break:
                    body_text = '\n'.join(lines[:i-1])  # Exclude page marker and header
                else:
                    body_text = '\n'.join(lines[:i])  # Just exclude header
            
            return body_text, references_text
    
    # Second pass: Look for reference sections without explicit headers
    # Check each page for reference-like content
    page_starts = []
    for i, line in enumerate(lines):
        if page_delim_pattern.match(line.strip()):
            page_starts.append(i)
    
    # Check pages starting from the end (references usually at end)
    for page_start in reversed(page_starts[-3:]):  # Check last 3 pages
        # Find next page boundary
        next_page_start = None
        for next_start in page_starts:
            if next_start > page_start:
                next_page_start = next_start
                break
        
        if next_page_start:
            page_lines = lines[page_start+1:next_page_start]
        else:
            page_lines = lines[page_start+1:]
        
        # Skip very short pages
        if len(page_lines) < 5:
            continue
            
        # Check if this page looks like references
        if looks_like_references_section(page_lines):
            references_text = '\n'.join(page_lines)
            
            # Remove this page from body
            body_before = lines[:page_start]
            if next_page_start:
                body_after = lines[next_page_start:]
                body_text = '\n'.join(body_before + body_after)
            else:
                body_text = '\n'.join(body_before)
            
            return body_text, references_text
    
    # Third pass: Look for reference patterns in the last portion of the document
    # Sometimes references appear without clear page boundaries
    if len(lines) > 50:  # Only for reasonably long documents
        last_quarter = lines[-len(lines)//4:]  # Last 25% of document
        
        if looks_like_references_section(last_quarter):
            split_point = len(lines) - len(last_quarter)
            body_text = '\n'.join(lines[:split_point])
            references_text = '\n'.join(last_quarter)
            return body_text, references_text
    
    # If no references section found, return original text
    return text, ""

def extract_in_text_citations(body_text):
    """Extract in-text citations from the body text."""
    if body_text is None:
        body_text = ""
    
    numbered = re.findall(r'\[(\d+)\]', body_text)
    author_year = re.findall(r'\(([A-Z][A-Za-z]+, \d{4}(; [A-Z][A-Za-z]+, \d{4})*)\)', body_text)
    return {
        "numbered": numbered,
        "author_year": author_year
    }

def extract_in_text_citation_sentences(body_text):
    """Extract sentences containing in-text citations."""
    if body_text is None:
        body_text = ""
    
    sentence_pattern = re.compile(r'(?<=[.!?])\s+')
    sentences = sentence_pattern.split(body_text)
    
    numbered_pattern = re.compile(r'\[\d+\]')
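    # Non-capturing group so findall() returns the full citation text, not just the last repeated part
    author_year_pattern = re.compile(r'\([A-Z][A-Za-z]+, \d{4}(?:; [A-Z][A-Za-z]+, \d{4})*\)')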
    
    citation_sentences = []
    for sent in sentences:
        found = []
        found += numbered_pattern.findall(sent)
        found += author_year_pattern.findall(sent)
        if found:
            citation_sentences.append({
                "sentence": sent.strip(),
                "citations": found
            })
    return citation_sentences

def extract_references_multiline(text):
    """
    Extract references from multiline text with support for multiple formats:
    - [1] Format: [1] Author, A. (2023). Title...
    - 1. Format: 1. Author, A. (2023). Title...
    - Plain number: 1 Author, A. (2023). Title...
    - Author format: Author, A. et al. (2023). Title...
    - Author year: Authorname et al. (2023)
    """
    if text is None:
        return []
    
    lines = text.split('\n')
    references = []
    current_ref = []

    # Enhanced patterns for different reference formats
    ref_start_patterns = [
        # [1] format - bracketed numbers
        re.compile(r'^\s*\[\d+\]'),
        # 1. format - numbered with period
        re.compile(r'^\s*\d+\.'),
        # Plain number format - just number followed by space and capital letter
        re.compile(r'^\s*\d+\s+[A-Z]'),
        # Author format - starts with author name (Last, First or Last, F.)
        re.compile(r'^\s*[A-Z][a-z]+,\s+[A-Z]\.?\s*(?:[A-Z]\.?\s*)?(?:,?\s*(?:and|&)\s+[A-Z][a-z]+,\s+[A-Z]\.?\s*(?:[A-Z]\.?\s*)?)*(?:,?\s*et\s+al\.)?'),
        # Author et al. format - starts with author followed by et al.
        re.compile(r'^\s*[A-Z][a-z]+(?:\s+[A-Z][a-z]+)?\s+et\s+al\.'),
        # Simple author year format
        re.compile(r'^\s*[A-Z][a-z]+(?:\s+[A-Z][a-z]+)*\s+\(\d{4}\)')
    ]

    def is_reference_start(line_text):
        """Check if line starts a new reference using any of the patterns."""
        line_text = line_text.strip()
        if not line_text:
            return False
        
        for pattern in ref_start_patterns:
            if pattern.match(line_text):
                return True
        return False

    def is_continuation_line(line_text):
        """Check if line is a continuation of a reference."""
        line_text = line_text.strip()
        if not line_text:
            return False
        
        # Common indicators of continuation lines
        continuation_indicators = [
            # Starts with lowercase (likely continuation)
            re.compile(r'^\s*[a-z]'),
            # Starts with common journal/publisher words
            re.compile(r'^\s*(?:In|Proceedings|Journal|IEEE|ACM|Springer|Elsevier)', re.IGNORECASE),
            # Starts with volume/page info
            re.compile(r'^\s*(?:vol\.|volume|pp?\.|pages?|doi:|https?://)', re.IGNORECASE),
            # Starts with punctuation (comma, period)
            re.compile(r'^\s*[,.]'),
            # URL or DOI on separate line
            re.compile(r'^\s*(?:https?://|doi:|www\.)', re.IGNORECASE),
            # Hard-coded, document-specific continuation words (not general patterns)
            re.compile(r'^\s*(?:regelzaken|nederland|Annaouderenzorg)', re.IGNORECASE)
        ]
        
        for pattern in continuation_indicators:
            if pattern.match(line_text):
                return True
        return False
    
    def looks_like_new_reference(line_text, previous_line=None):
        """Enhanced logic to determine if a line starts a new reference."""
        line_text = line_text.strip()
        if not line_text:
            return False
            
        # If it matches standard patterns, it's definitely a new reference
        if is_reference_start(line_text):
            return True
            
        # Additional heuristics for author-style references
        # Check if line starts with author name pattern (e.g., "Alzheimer Nederland.")
        # Note: no re.IGNORECASE here, otherwise the leading [A-Z] would also
        # accept lowercase letters and any "word:" line would count as a reference.
        author_patterns = [
            # Organization or author name followed by period and year
            re.compile(r'^[A-Z][a-zA-Z\s]+\.?\s*\([12]\d{3}'),
            # Organization name followed by colon (like "Alzheimer Nederland:")
            re.compile(r'^[A-Z][a-zA-Z\s]+:\s*'),
            # Simple organization or author name at start
            re.compile(r'^[A-Z][a-zA-Z\s]{10,}\.?\s*\(')
        ]
        
        for pattern in author_patterns:
            if pattern.match(line_text):
                return True
        
        # If previous line ended with URL or period and this starts with capital, likely new ref
        if previous_line and (previous_line.strip().endswith('/') or previous_line.strip().endswith('.')):
            if line_text[0].isupper() and len(line_text) > 10:
                return True
                
        return False

    previous_line = None
    for line in lines:
        line_stripped = line.strip()
        
        # Skip empty lines but track them for reference separation
        if not line_stripped:
            # Empty line might separate references, save current if exists
            if current_ref:
                references.append(' '.join(current_ref).strip())
                current_ref = []
            previous_line = line
            continue
            
        # Check if this line starts a new reference
        # (looks_like_new_reference already includes the is_reference_start check)
        if looks_like_new_reference(line_stripped, previous_line):
            # Save previous reference if exists
            if current_ref:
                references.append(' '.join(current_ref).strip())
                current_ref = []
            current_ref.append(line_stripped)
        elif current_ref and (is_continuation_line(line_stripped) or 
                             # If we're already in a reference and line doesn't clearly start a new one
                             not looks_like_new_reference(line_stripped, previous_line)):
            current_ref.append(line_stripped)
        else:
            # Line might start a new reference or be standalone
            if current_ref:
                references.append(' '.join(current_ref).strip())
                current_ref = []
            current_ref.append(line_stripped)
        
        previous_line = line
    
    # Add final reference if exists
    if current_ref:
        references.append(' '.join(current_ref).strip())

    # Filter out very short references and common non-reference patterns
    filtered_references = []
    for ref in references:
        ref_lower = ref.lower().strip()
        
        # Skip common non-reference patterns
        skip_patterns = [
            'no references available',
            'no citations found',
            'references not available',
            'none available',
            'not applicable',
            'n/a',
            'tbd',
            'to be determined',
            'coming soon',
            'under construction'
        ]
        
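        # Check if this is a non-reference pattern. Match whole phrases only, so that
        # short entries such as 'n/a' do not fire inside URLs like ".../design/article".
        is_non_reference = any(
            re.search(r'(?<![\w/])' + re.escape(pattern) + r'(?![\w/])', ref_lower)
            for pattern in skip_patterns
        )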
        
        # Only include references with reasonable length, content, and not matching skip patterns
        if not is_non_reference and len(ref) > 20 and (' ' in ref or ',' in ref):
            # Additional check: must contain some reference-like indicators
            ref_indicators = [
                r'\d{4}',  # Year
                r'[A-Z][a-z]+,',  # Author pattern
                r'\..*\.',  # Multiple periods (title/journal pattern)
                r'http',  # URL
                r'doi',  # DOI
                r'\bvol\.|\bvolume\b',  # Volume
                r'\bpp?\.\s*\d',  # Pages ("p. 12", "pp. 3-9"); a bare "p" would match almost any text
                r'journal|proceedings|conference',  # Publication types
            ]
            
            has_indicator = any(re.search(pattern, ref, re.IGNORECASE) for pattern in ref_indicators)
            if has_indicator:
                filtered_references.append(ref)
    
    return filtered_references

# ============================================================================
# CITATION STYLE ANALYSIS FUNCTIONS
# ============================================================================

def detect_citation_style(reference: str) -> Tuple[str, float]:
    """
    Detect the citation style of a reference.
    Returns tuple of (style_name, confidence_score)
    """
    reference = reference.strip()
    best_match = ('Unknown', 0.0)
    
    for style_name, style in CITATION_STYLES.items():
        score = 0.0
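        # At most one point each for: pattern, year, author, punctuation.
        # (The pattern loop below stops at the first match, so it adds at most 1.)
        max_score = 4.0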
        
        for pattern in style.patterns:
            if re.search(pattern, reference, re.IGNORECASE):
                score += 1
                break
        
        if re.search(style.year_pattern, reference):
            score += 1
        
        if re.search(style.author_pattern, reference):
            score += 1
        
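        # Fraction of the style's characteristic punctuation found in the reference
        # (skipped when the style defines none, to avoid dividing by zero)
        punct_values = list(style.common_punctuation.values())
        if punct_values:
            punct_matches = sum(1 for punct in punct_values if punct in reference)
            score += punct_matches / len(punct_values)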
        
        confidence = score / max_score
        if confidence > best_match[1]:
            best_match = (style_name, confidence)
    
    if best_match[1] < 0.3:
        return ('Unknown', 0.0)
    
    return best_match

def extract_reference_components(reference: str, style: str) -> Dict[str, str]:
    """Extract components like authors, title, year, etc. based on citation style."""
    components = {
        'authors': '',
        'year': '',
        'title': '',
        'source': '',
        'volume': '',
        'issue': '',
        'pages': '',
        'doi': '',
        'url': '',
        'publisher': '',
        'location': ''
    }
    
    # Extract DOI if present
    doi_match = re.search(r'(?:doi:?\s*|https?://doi\.org/)(10\.\d{4,}/[^\s,]+)', reference, re.IGNORECASE)
    if doi_match:
        # Drop sentence punctuation captured at the end of the DOI
        components['doi'] = doi_match.group(1).rstrip('.;')
    
    # Extract URL if present
    url_match = re.search(r'https?://[^\s<>"\']+', reference)
    if url_match:
        # Drop sentence punctuation captured at the end of the URL
        components['url'] = url_match.group(0).rstrip('.,;')
    
    # Style-specific extraction
    if style == 'APA':
        # Extract authors (before year)
        author_match = re.match(r'^([^(]+)\s*\(\d{4}\)', reference)
        if author_match:
            components['authors'] = author_match.group(1).strip()
        
        # Extract year
        year_match = re.search(r'\((\d{4})\)', reference)
        if year_match:
            components['year'] = year_match.group(1)
        
        # Extract title (after year, before next period)
        title_match = re.search(r'\(\d{4}\)\.\s*([^.]+)\.', reference)
        if title_match:
            components['title'] = title_match.group(1).strip()
    
    elif style == 'IEEE':
        # Extract reference number
        num_match = re.match(r'^\[(\d+)\]', reference)
        
        # Extract authors (after number, before comma and quotes)
        author_match = re.search(r'^\[\d+\]\s*([^,"]+),\s*"', reference)
        if author_match:
            components['authors'] = author_match.group(1).strip()
        
        # Extract title (in quotes)
        title_match = re.search(r'"([^"]+)"', reference)
        if title_match:
            components['title'] = title_match.group(1).strip()
        
        # Extract year (usually at the end)
        year_match = re.search(r',\s*(\d{4})(?:\.|,|$)', reference)
        if year_match:
            components['year'] = year_match.group(1)
    
    elif style == 'MLA':
        # Extract authors (before first period)
        author_match = re.match(r'^([^.]+)\.', reference)
        if author_match:
            components['authors'] = author_match.group(1).strip()
        
        # Extract title (in quotes or italics)
        title_match = re.search(r'[.]\s*"([^"]+)"', reference) or re.search(r'[.]\s*([^.]+)\.', reference)
        if title_match:
            components['title'] = title_match.group(1).strip()
        
        # Extract year
        year_match = re.search(r',\s*(\d{4})(?:\.|,|$)', reference)
        if year_match:
            components['year'] = year_match.group(1)
    
    elif style == 'ACM':
        # Extract reference number if present
        num_match = re.match(r'^\[(\d+)\]\s*', reference)
        ref_start = num_match.end() if num_match else 0
        
        # Extract authors and year (ACM format: Authors. Year.)
        # The lazy (.+?\.) lets author lists contain initials such as "A. B. Smith".
        author_year_match = re.search(r'^(?:\[\d+\]\s*)?(.+?\.)?\s*(\d{4})\.', reference)
        if author_year_match:
            components['authors'] = author_year_match.group(1).strip('.').strip() if author_year_match.group(1) else ''
            components['year'] = author_year_match.group(2)
            
            # Extract title (after year, before next major punctuation)
            title_start = author_year_match.end()
            # Look for title ending patterns
            title_patterns = [
                r'([^.]+)\.\s*(?:In\s+Proceedings|In\s+ACM|Commun\.|J\.|Trans\.)',  # Conference/Journal
                r'([^.]+)\.\s*\(',  # Before edition info
                r'([^.]+)\.\s*[A-Z][a-z]+(?:,\s*[A-Z][a-z]+)*(?:\.|,)',  # Before publisher
                r'([^.]+)\.'  # Default: next period
            ]
            
            for pattern in title_patterns:
                title_match = re.search(pattern, reference[title_start:])
                if title_match:
                    components['title'] = title_match.group(1).strip()
                    break
        
        # Extract journal/conference info
        if 'In Proceedings of' in reference:
            proc_match = re.search(r'In Proceedings of ([^(]+)\s*\(([^)]+)\)', reference)
            if proc_match:
                components['source'] = proc_match.group(1).strip()
        else:
            # Look for journal pattern: Journal Name Vol, Issue (Month Year), pages
            journal_match = re.search(r'([A-Z][^,]+)\s+(\d+),\s*(\d+)\s*\(([^)]+)\),\s*([\d\-–]+)', reference)
            if journal_match:
                components['source'] = journal_match.group(1).strip()
                components['volume'] = journal_match.group(2)
                components['issue'] = journal_match.group(3)
                components['pages'] = journal_match.group(5)
        
        # Extract publisher and location
        publisher_match = re.search(r'([A-Z][^,]+),\s+([A-Z][^,.]+(?:,\s*[A-Z]{2})?)(?:\.|$)', reference)
        if publisher_match and 'In Proceedings' not in reference[:publisher_match.start()]:
            components['publisher'] = publisher_match.group(1).strip()
            components['location'] = publisher_match.group(2).strip()
    
    # Extract volume, issue, pages (common patterns)
    if not components['volume']:
        volume_match = re.search(r'(?:vol\.|volume)\s*(\d+)', reference, re.IGNORECASE)
        if volume_match:
            components['volume'] = volume_match.group(1)
    
    if not components['issue']:
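        # The parenthesised form must follow a digit, as in "12(3)", so that an
        # APA-style year such as "(2020)" is not reported as the issue number.
        issue_match = re.search(r'\b(?:no\.|issue)\s*(\d+)|\d\s*\((\d{1,3})\)', reference, re.IGNORECASE)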
        if issue_match:
            components['issue'] = issue_match.group(1) or issue_match.group(2)
    
    if not components['pages']:
        pages_match = re.search(r'(?:pp?\.|pages?)\s*(\d+[-–]\d+)|(\d+[-–]\d+)(?:\.|,|$)', reference)
        if pages_match:
            components['pages'] = pages_match.group(1) or pages_match.group(2)
    
    return components

def validate_citation_format(reference: str, style: str) -> Dict[str, Any]:
    """Validate if a citation follows the rules of a specific style."""
    validation = {
        'is_valid': True,
        'errors': [],
        'warnings': [],
        'suggestions': []
    }
    
    components = extract_reference_components(reference, style)
    style_obj = CITATION_STYLES.get(style)
    
    if not style_obj:
        validation['is_valid'] = False
        validation['errors'].append(f"Unknown citation style: {style}")
        return validation
    
    # Basic validations
    if not components['authors']:
        validation['errors'].append("Missing author information")
        validation['is_valid'] = False
    
    if not components['year']:
        validation['errors'].append("Missing publication year")
        validation['is_valid'] = False
    
    if not components['title']:
        validation['errors'].append("Missing title")
        validation['is_valid'] = False
    
    return validation

# ============================================================================
# URL AND DOI VALIDATION FUNCTIONS
# ============================================================================

def extract_doi_from_url(url):
    """Extract DOI from a URL if present."""
    url = re.sub(r'\s+', '', url)
    
    doi_patterns = [
        r'10\.\d{4,}/[-._;()/:\w]+',
        r'doi\.org/10\.\d{4,}/[-._;()/:\w]+',
        r'dx\.doi\.org/10\.\d{4,}/[-._;()/:\w]+',
        r'doi/(?:abs|pdf|full|book|citation)?/?10\.\d{4,}/[-._;()/:\w]+',
    ]
    
    for pattern in doi_patterns:
        match = re.search(pattern, url, re.IGNORECASE)
        if match:
            doi = match.group(0)
            if '/' in doi and '10.' in doi:
                doi = doi[doi.index('10.'):]
            return doi.strip()
    return None

def clean_url(url):
    """Clean and encode URL properly, handling spaces and special characters."""
    if not url:
        return None

    try:
        url = url.strip()
        
        url_parts = url.split()
        if len(url_parts) > 1:
            cleaned_parts = []
            for part in url_parts:
                part = re.sub(r'["\'\[\]<>{}]', '', part)
                part = part.strip('.,;:()[]{}')
                if part:
                    cleaned_parts.append(part)
            url = ''.join(cleaned_parts)
        
        url = re.sub(r'\s+', '', url)
        url = url.replace('\\', '/')
        
        if 'doi' in url.lower():
            doi = extract_doi_from_url(url)
            if doi:
                clean_doi = doi.strip().replace(' ', '')
                if 'dl.acm.org' in url.lower():
                    return f"https://dl.acm.org/doi/{clean_doi}"
                return f"https://doi.org/{clean_doi}"
        
        url = re.sub(r'[.,;:)\]}]+$', '', url)
        
        if not url.lower().startswith(('http://', 'https://')):
            if url.lower().startswith('www.'):
                url = 'https://' + url
            else:
                return None
        
        parsed = urllib.parse.urlparse(url)
        
        path_parts = parsed.path.split('/')
        cleaned_path_parts = []
        for part in path_parts:
            part = re.sub(r'["\'\[\]<>{}]', '', part)
            if part:
                # Keep '%' safe so already-encoded sequences (e.g. %20) are not double-encoded
                cleaned_path_parts.append(urllib.parse.quote(part, safe='%'))
        clean_path = '/'.join(cleaned_path_parts)
        
        if parsed.query:
            query_parts = parsed.query.split('&')
            cleaned_query_parts = []
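            for part in query_parts:
                if '=' in part:
                    key, value = part.split('=', 1)
                    # Keep '%' safe so already-encoded values are not double-encoded
                    cleaned_query_parts.append(
                        f"{urllib.parse.quote_plus(key, safe='%')}={urllib.parse.quote_plus(value, safe='%')}"
                    )
                elif part:
                    # Keep value-less parameters (e.g. "?download") instead of dropping them
                    cleaned_query_parts.append(urllib.parse.quote_plus(part, safe='%'))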
            clean_query = '&'.join(cleaned_query_parts)
        else:
            clean_query = ''
        
        clean_fragment = urllib.parse.quote(parsed.fragment)
        
        cleaned = parsed._replace(
            path=clean_path,
            query=clean_query,
            fragment=clean_fragment
        )
        
        final_url = urllib.parse.urlunparse(cleaned)
        
        if not re.match(r'https?://.+\..+', final_url):
            return None
        
        return final_url
    except Exception as e:
        print(f"Error cleaning URL: {str(e)}")
        return None

def validate_doi(doi):
    """Validate a DOI by checking if it resolves."""
    if not doi:
        return False
    
    try:
        headers = {
            'Accept': 'application/json',
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
        }
        response = requests.get(
            f'https://doi.org/{doi}',
            headers=headers,
            allow_redirects=True,
            timeout=5
        )
        return response.status_code == 200
    except Exception as e:
        print(f"DOI validation error: {str(e)}")
        return False

def validate_reference(ref):
    """Validate a reference by checking various sources."""
    results = {
        'valid': False,
        'doi_found': False,
        'doi_valid': False,
        'url_found': False,
        'url_valid': False,
        'scholar_search': None,
        'message': '',
        'rate_limited': False,
        'acm_url': False,
        'doi_text': '',
        'url_text': ''
    }
    
    # Extract DOI from reference and validate it
    doi_match = re.search(r'(?:doi:?\s*|https?://doi\.org/)(10\.\d{4,}/[^\s,]+)', ref, re.IGNORECASE)
    if doi_match:
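        # Drop sentence punctuation captured at the end of the DOI, which would
        # otherwise make a valid DOI fail to resolve
        doi = doi_match.group(1).rstrip('.;')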
        results['doi_found'] = True
        results['doi_text'] = doi
        
        # Check rate limit for DOI validation
        if check_rate_limit('doi.org'):
            try:
                results['doi_valid'] = validate_doi(doi)
                if results['doi_valid']:
                    results['valid'] = True
                    results['message'] = f"✅ DOI validated successfully: {doi}"
                else:
                    results['message'] = f"❌ DOI does not resolve: {doi}"
            except Exception as e:
                results['message'] = f"⚠️ Error validating DOI {doi}: {str(e)}"
        else:
            results['rate_limited'] = True
            results['message'] = f"⏳ Rate limited - cannot validate DOI: {doi}"
    
    # Extract URL from reference (excluding DOI URLs)
    url_match = re.search(r'https?://(?!doi\.org)[^\s<>"\']+', ref)
    if url_match:
        url = url_match.group(0)
        results['url_found'] = True
        # clean_url returns None for URLs it cannot normalise; fall back to the raw match
        results['url_text'] = clean_url(url) or url
        
        # For now, we don't validate general URLs due to rate limiting concerns
        # But we mark that a URL was found
        url_msg = f"🔗 URL found: {results['url_text']}"
        results['message'] = f"{results['message']} | {url_msg}" if results['message'] else url_msg
    
    # Create Google Scholar search URL for manual verification
    search_query = urllib.parse.quote(ref[:200])
    results['scholar_search'] = f"https://scholar.google.com/scholar?q={search_query}"
    
    # If no DOI or URL validation occurred, provide default message
    if not results['doi_found'] and not results['url_found']:
        results['message'] = "ℹ️ No DOI or URL found for validation"
    
    return results

# ============================================================================
# UI DISPLAY FUNCTIONS
# ============================================================================

def format_style_confidence(style: str, confidence: float) -> str:
    """Format the style detection result with confidence level."""
    if style == 'Unknown':
        return '<span class="citation-style-badge style-unknown">Unknown Style</span>'
    
    confidence_text = ""
    if confidence >= 0.8:
        confidence_text = "High confidence"
    elif confidence >= 0.5:
        confidence_text = "Medium confidence"
    else:
        confidence_text = "Low confidence"
    
    style_class = f"style-{style.lower()}"
    return f'<span class="citation-style-badge {style_class}">{style} ({confidence:.0%} {confidence_text})</span>'

def display_reference_with_style(ref, validation_result, style_info, ref_type="standard"):
    """Display reference with both validation status and citation style analysis."""
    style, confidence = style_info
    
    st.markdown(f"**Reference:** {ref}")
    st.markdown(format_style_confidence(style, confidence), unsafe_allow_html=True)
    
    # Display DOI and URL validation status
    validation_msgs = []
    if validation_result['doi_found']:
        if validation_result['rate_limited']:
            validation_msgs.append(f"⏳ **DOI:** {validation_result['doi_text']} (Rate limited - cannot validate)")
        elif validation_result['doi_valid']:
            validation_msgs.append(f"✅ **DOI:** {validation_result['doi_text']} (Valid)")
        else:
            validation_msgs.append(f"❌ **DOI:** {validation_result['doi_text']} (Invalid or unreachable)")
    
    if validation_result['url_found']:
        validation_msgs.append(f"🔗 **URL:** {validation_result['url_text']} (Found)")
    
    if validation_msgs:
        st.markdown("**Validation Status:**")
        for msg in validation_msgs:
            st.markdown(f"- {msg}")
    
    # Display validation message if present
    if validation_result['message']:
        if validation_result['doi_valid']:
            st.success(validation_result['message'])
        elif validation_result['doi_found'] and not validation_result['doi_valid'] and not validation_result['rate_limited']:
            st.error(validation_result['message'])
        elif validation_result['rate_limited']:
            st.warning(validation_result['message'])
        else:
            st.info(validation_result['message'])
    
    if style != 'Unknown':
        format_validation = validate_citation_format(ref, style)
        
        if format_validation['errors']:
            st.error("**Format Errors:**")
            for error in format_validation['errors']:
                st.markdown(f"- ❌ {error}")
        
        if format_validation['warnings']:
            st.warning("**Format Warnings:**")
            for warning in format_validation['warnings']:
                st.markdown(f"- ⚠️ {warning}")
        
        components = extract_reference_components(ref, style)
        if any(components.values()):
            with st.expander("📋 Extracted Components"):
                cols = st.columns(2)
                with cols[0]:
                    if components['authors']:
                        st.markdown(f"**Authors:** {components['authors']}")
                    if components['year']:
                        st.markdown(f"**Year:** {components['year']}")
                    if components['title']:
                        st.markdown(f"**Title:** {components['title']}")
                    if components['doi']:
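                        # Distinguish a DOI that failed validation from one that was never checked
                        if validation_result.get('doi_valid'):
                            doi_status = "✅ Valid"
                        elif validation_result.get('doi_found') and not validation_result.get('rate_limited'):
                            doi_status = "❌ Invalid or unreachable"
                        else:
                            doi_status = "❓ Not validated"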
                        st.markdown(f"**DOI:** {components['doi']} ({doi_status})")
                with cols[1]:
                    if components['source']:
                        st.markdown(f"**Source:** {components['source']}")
                    if components['volume']:
                        st.markdown(f"**Volume:** {components['volume']}")
                    if components['pages']:
                        st.markdown(f"**Pages:** {components['pages']}")
                    if components['url']:
                        st.markdown(f"**URL:** {components['url']}")
    
    if validation_result['scholar_search']:
        st.info(f"[🔍 Verify on Google Scholar]({validation_result['scholar_search']})")

def create_citation_report(references, style_counts):
    """Generate a comprehensive citation analysis report."""
    report = "# Citation Analysis Report\n\n"
    report += f"Generated: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}\n"
    report += f"Total References: {len(references)}\n\n"
    
    report += "## Citation Style Summary\n"
    for style, count in sorted(style_counts.items(), key=lambda x: x[1], reverse=True):
        percentage = (count / len(references)) * 100
        report += f"- {style}: {count} ({percentage:.1f}%)\n"
    report += "\n"
    
    report += "## Detailed Reference Analysis\n\n"
    for i, ref in enumerate(references, 1):
        style_info = detect_citation_style(ref)
        validation_result = validate_reference(ref)
        
        report += f"### Reference {i}\n"
        report += f"**Text:** {ref}\n"
        report += f"**Style:** {style_info[0]} (Confidence: {style_info[1]:.1%})\n"
        
        # DOI validation status
        if validation_result['doi_found']:
            if validation_result['rate_limited']:
                report += f"**DOI:** {validation_result['doi_text']} (Rate limited)\n"
            elif validation_result['doi_valid']:
                report += f"**DOI:** {validation_result['doi_text']} (Valid)\n"
            else:
                report += f"**DOI:** {validation_result['doi_text']} (Invalid)\n"
        
        # URL status
        if validation_result['url_found']:
            report += f"**URL:** {validation_result['url_text']} (Found)\n"
        
        report += f"**Overall Valid:** {'Yes' if validation_result['valid'] else 'No'}\n"
        
        if style_info[0] != 'Unknown':
            format_validation = validate_citation_format(ref, style_info[0])
            report += f"**Format Valid:** {'Yes' if format_validation['is_valid'] else 'No'}\n"
            if format_validation['errors']:
                report += f"**Errors:** {', '.join(format_validation['errors'])}\n"
        
        report += "\n"
    
    return report

# ============================================================================
# INTERACTIVE RE-ANALYSIS FUNCTIONS
# ============================================================================

def perform_reanalysis(modified_content):
    """Perform re-analysis on modified content."""
    # Handle None or empty content
    if modified_content is None:
        modified_content = ""
    
    # Split body and references
    body_text, references_text = split_body_and_references(modified_content)
    references = extract_references_multiline(references_text)
    in_text_citations = extract_in_text_citations(body_text)
    citation_sentences = extract_in_text_citation_sentences(body_text)
    
    # Style analysis
    style_counts = defaultdict(int)
    for ref in references:
        style_info = detect_citation_style(ref)
        style_counts[style_info[0]] += 1
    
    return {
        'body_text': body_text,
        'references_text': references_text,
        'references': references,
        'in_text_citations': in_text_citations,
        'citation_sentences': citation_sentences,
        'style_counts': style_counts
    }

# ============================================================================
# MAIN APPLICATION
# ============================================================================

def main():
    """Main application function."""
    
    # Custom CSS
    st.markdown("""
        <style>
            .stTitle {
                color: #2c3e50;
                font-size: 3rem !important;
                padding-bottom: 2rem;
            }
            .stSubheader {
                color: #34495e;
                padding-top: 1rem;
            }
            .citation-style-badge {
                display: inline-block;
                padding: 0.25em 0.6em;
                font-size: 0.875em;
                font-weight: 600;
                line-height: 1;
                text-align: center;
                white-space: nowrap;
                vertical-align: baseline;
                border-radius: 0.25rem;
                margin-left: 0.5em;
            }
            .style-apa { background-color: #007bff; color: white; }
            .style-mla { background-color: #28a745; color: white; }
            .style-chicago { background-color: #dc3545; color: white; }
            .style-ieee { background-color: #ffc107; color: black; }
            .style-acm { background-color: #17a2b8; color: white; }
            .style-unknown { background-color: #6c757d; color: white; }
        </style>
    """, unsafe_allow_html=True)

    # Main title
    st.title("📄 PDF Citation Analyzer")
    
    # Description
    st.markdown("""
        Upload your PDF file and extract its content. The tool will:
        - Extract text from all pages
        - Detect and validate references with citation style analysis
        - Check if citations follow academic formats (APA, MLA, Chicago, IEEE, ACM)
        - Analyze in-text citations and reference consistency
        
        **🆕 Interactive Feature:** After extraction, you can edit the content and click **"Re-analyze Citations"** to handle custom reference formats or fix extraction issues.
    """)

    # Create layout
    col1, col2 = st.columns([1, 2])

    with col1:
        st.markdown("### Upload PDF")
        uploaded_file = st.file_uploader("Choose a PDF file", type="pdf")

    with col2:
        if uploaded_file is not None:
            try:
                # Reset session state when a new file is uploaded
                if 'uploaded_file_name' not in st.session_state or st.session_state.uploaded_file_name != uploaded_file.name:
                    st.session_state.uploaded_file_name = uploaded_file.name
                    st.session_state.current_content = None
                    st.session_state.analysis_results = None
                
                pdf_reader = PyPDF2.PdfReader(uploaded_file)
                
                text_content = ""
                with st.spinner("Extracting text from PDF..."):
                    for page_num, page in enumerate(pdf_reader.pages, 1):
                        try:
                            # Add page marker
                            text_content += f"\n--- Page {page_num} ---\n"
                            
                            # Extract text with error handling
                            try:
                                page_text = page.extract_text() or ""
                            except Exception as e:
                                st.warning(f"Warning: Error extracting text from page {page_num}: {str(e)}")
                                page_text = f"[Error extracting text from page {page_num}]"
                            
                            # Replace characters that cannot be encoded as UTF-8 (e.g. lone
                            # surrogates), then drop characters outside the Basic Multilingual Plane
                            page_text = page_text.encode('utf-8', errors='replace').decode('utf-8')
                            page_text = ''.join(char for char in page_text if ord(char) < 0x10000)
                            text_content += page_text
                            
                        except Exception as e:
                            st.warning(f"Warning: Error processing page {page_num}: {str(e)}")
                            text_content += f"\n[Error processing page {page_num}]\n"
                
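                # Page markers are always added, so ignore them when checking for extracted text
                if not re.sub(r'--- Page \d+ ---', '', text_content).strip():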
                    st.error("No text could be extracted from the PDF. The file might be scanned or contain only images.")
                    return
                
                st.success(f"Successfully extracted text from {len(pdf_reader.pages)} pages!")
                
                # Create tabs
                tab1, tab2, tab3, tab4 = st.tabs(["📝 Content", "🔎 In-Text Citations", "📚 References", "📊 Citation Analysis"])
                
                # Initialize session state for content and analysis results
                # (a missing or None current_content falls back to the freshly extracted text)
                if st.session_state.get('current_content') is None:
                    st.session_state.current_content = text_content
                if 'analysis_results' not in st.session_state:
                    st.session_state.analysis_results = None
                
                # Perform initial analysis or use cached results
                if st.session_state.analysis_results is None:
                    try:
                        st.session_state.analysis_results = perform_reanalysis(st.session_state.current_content)
                    except Exception as e:
                        st.error(f"Error during initial analysis: {str(e)}")
                        # Create empty results as fallback
                        st.session_state.analysis_results = {
                            'body_text': st.session_state.current_content or "",
                            'references_text': "",
                            'references': [],
                            'in_text_citations': {'numbered': [], 'author_year': []},
                            'citation_sentences': [],
                            'style_counts': defaultdict(int)
                        }
                
                with tab1:
                    st.subheader("Extracted Content")
                    
                    # Instructions for interactive feature
                    with st.expander("💡 How to use Interactive Re-analysis"):
                        st.markdown("""
                        **Step 1:** Review the extracted content below
                        
                        **Step 2:** Edit the content if needed:
                        - Fix OCR errors or formatting issues
                        - Manually separate references section if not detected
                        - Add missing reference headers (e.g., "References", "Bibliography")
                        - Fix malformed reference entries
                        
                        **Step 3:** Click **"Re-analyze Citations"** to update all analysis tabs
                        
                        **Common fixes:**
                        - Add `--- Page X ---` markers to separate sections
                        - Add `References` header before reference list
                        - Fix concatenated headers like `19REFERENCES` → `19 REFERENCES`
                        - Separate merged references into individual lines
                        """)
                    
                    # Editable text area for content modification
                    content_value = st.session_state.current_content if st.session_state.current_content is not None else ""
                    
                    # Use a dynamic key to force refresh when needed
                    if 'text_area_key' not in st.session_state:
                        st.session_state.text_area_key = 0
                    
                    modified_content = st.text_area(
                        "PDF Content (You can edit this content and re-analyze)",
                        value=content_value,
                        height=400,
                        key=f"content_editor_{st.session_state.text_area_key}",
                        help="Edit the extracted content to fix issues, then click 'Re-analyze Citations'"
                    )
                    
                    # Debug info
                    if modified_content != content_value:
                        st.info(f"🔄 Content changed! Modified length: {len(modified_content)}, Original length: {len(content_value)}")
                        st.info("Click 'Re-analyze Citations' to update the analysis.")
                    
                    # Additional debug information
                    st.caption(f"🔍 Debug: Text area content length: {len(modified_content)} | Session content length: {len(st.session_state.current_content) if st.session_state.current_content else 0}")
                    
                    # Re-analyze button
                    col_btn1, col_btn2, col_btn3 = st.columns([1, 1, 2])
                    
                    with col_btn1:
                        if st.button("🔄 Re-analyze Citations", type="primary"):
                            # Always use the current content from the text area, not session state comparison
                            st.session_state.current_content = modified_content
                            try:
                                with st.spinner("Re-analyzing citations..."):
                                    new_results = perform_reanalysis(modified_content)
                                    st.session_state.analysis_results = new_results
                                
                                st.success("✅ Re-analysis completed!")
                                st.info(f"Found {len(new_results['references'])} references, {len(new_results['citation_sentences'])} citation sentences")
                                st.rerun()
                            except Exception as e:
                                st.error(f"Error during re-analysis: {str(e)}")
                    
                    with col_btn2:
                        if st.button("↩️ Reset to Original"):
                            st.session_state.current_content = text_content
                            st.session_state.text_area_key += 1  # Force text area refresh
                            try:
                                st.session_state.analysis_results = perform_reanalysis(text_content)
                                st.success("✅ Reset to original content!")
                                st.rerun()
                            except Exception as e:
                                st.error(f"Error during reset: {str(e)}")
                    
                    with col_btn3:
                        if st.button("🔄 Force Refresh"):
                            st.session_state.text_area_key += 1  # Force text area refresh
                            st.success("✅ Text area refreshed!")
                            st.rerun()
                    
                    # Download button and statistics
                    if st.session_state.current_content:
                        st.download_button(
                            label="Download Current Text",
                            data=st.session_state.current_content,
                            file_name=f"{Path(uploaded_file.name).stem}_extracted.txt",
                            mime="text/plain"
                        )
                        
                        # Show statistics
                        original_length = len(text_content) if text_content else 0
                        current_content = st.session_state.current_content or ""
                        current_length = len(current_content)
                        length_diff = current_length - original_length
                        
                        st.info(f"""
                            📊 **Content Statistics**
                            - File name: {uploaded_file.name}
                            - Number of pages: {len(pdf_reader.pages)}
                            - Original text length: {original_length:,} characters
                            - Current text length: {current_length:,} characters
                            - Difference: {length_diff:+,} characters
                        """)
                
                # Get current analysis results
                results = st.session_state.analysis_results
                
                with tab2:
                    st.subheader("In-Text Citations (with Sentences)")
                    citation_sentences = results['citation_sentences']
                    
                    # Debug info
                    current_content_length = len(st.session_state.current_content) if st.session_state.current_content else 0
                    st.caption(f"🔍 Debug: Content length: {current_content_length} chars | Analysis ID: {id(results)}")
                    
                    if citation_sentences:
                        st.info(f"Found {len(citation_sentences)} sentences with citations")
                        for item in citation_sentences:
                            st.markdown(f"**Citations:** {', '.join(item['citations'])}")
                            st.write(item["sentence"])
                            st.markdown("---")
                    else:
                        st.info("No in-text citations found in the document.")
                
                with tab3:
                    st.subheader("References Section")
                    references_text = results['references_text']
                    references = results['references']
                    
                    # Enhanced debug info
                    current_content_length = len(st.session_state.current_content) if st.session_state.current_content else 0
                    st.caption(f"🔍 Debug: Content length: {current_content_length} chars | References text: {len(references_text)} chars | Found {len(references)} references | Analysis ID: {id(results)} | Text area key: {st.session_state.get('text_area_key', 'N/A')}")
                    
                    # Show first 200 chars of current content for verification
                    if st.session_state.current_content:
                        st.caption(f"📄 Current content preview: {repr(st.session_state.current_content[:200])}...")
                    
                    with st.expander("🔍 Raw References Text"):
                        st.text(references_text)
                    
                    st.write("**Extracted References:**")
                    if references:
                        st.write(f"Found {len(references)} references:")
                        for i, ref in enumerate(references, 1):
                            st.markdown(f"**[{i}]** {ref}")
                    else:
                        st.info("No references found in the document.")
                
                with tab4:
                    st.subheader("Citation Analysis")
                    references = results['references']
                    style_counts = results['style_counts']
                    
                    if references:
                        for i, ref in enumerate(references, 1):
                            st.markdown(f"### Reference {i}")
                            style_info = detect_citation_style(ref)
                            validation_result = validate_reference(ref)
                            display_reference_with_style(ref, validation_result, style_info)
                            st.markdown("---")
                        
                        with st.expander("📊 Citation Style Summary"):
                            st.markdown("### Detected Citation Styles")
                            for style, count in sorted(style_counts.items(), key=lambda x: x[1], reverse=True):
                                percentage = (count / len(references)) * 100
                                st.markdown(f"- **{style}**: {count} references ({percentage:.1f}%)")
                            # Use .get() so a missing 'Unknown' key neither raises nor gets added to the counts
                            if len(style_counts) > 1 and style_counts.get('Unknown', 0) < len(references) * 0.5:
                                st.warning("⚠️ Multiple citation styles detected. Consider using a consistent style throughout the document.")
                        
                        # Generate and offer download report
                        citation_report = create_citation_report(references, style_counts)
                        st.download_button(
                            label="Download Citation Analysis Report",
                            data=citation_report,
                            file_name=f"{Path(uploaded_file.name).stem}_citation_analysis.txt",
                            mime="text/plain"
                        )
                    else:
                        st.info("No references found for analysis.")
                
            except Exception as e:
                st.error(f"Error processing PDF: {str(e)}")
                st.warning("Please make sure you've uploaded a valid PDF file.")
        else:
            st.markdown("""
                ### Preview Area
                Upload a PDF file to see:
                - Extracted content with page numbers
                - In-text citations with context sentences
                - References with citation style detection
                - Comprehensive citation analysis
                - Downloadable analysis report
            """)

    # Footer
    st.markdown("---")
    st.markdown("""
        <div style='text-align: center; color: #666;'>
            Made with ❤️ using Streamlit, PyPDF2, and Citation Style Analysis
        </div>
    """, unsafe_allow_html=True)

if __name__ == "__main__":
    main()
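
The per-page extraction loop in `main()` can be hard to test because it is interleaved with Streamlit calls. As a minimal sketch (not part of the repository), the same steps could be factored into pure functions: add a `--- Page N ---` marker, extract, fall back to a placeholder on error, then clean the text. The names `join_pages` and `has_extracted_text` are hypothetical, and the sketch assumes each page is represented by a zero-argument callable that returns its text.

```python
from typing import Callable, Iterable, List, Optional, Tuple


def join_pages(extractors: Iterable[Callable[[], Optional[str]]]) -> Tuple[str, List[int]]:
    """Concatenate per-page text behind '--- Page N ---' markers.

    Returns the combined text and the page numbers whose extraction failed.
    """
    parts: List[str] = []
    failed: List[int] = []
    for page_num, extract in enumerate(extractors, 1):
        parts.append(f"\n--- Page {page_num} ---\n")
        try:
            page_text = extract() or ""
        except Exception:
            failed.append(page_num)
            page_text = f"[Error extracting text from page {page_num}]"
        # Replace characters that cannot be encoded as UTF-8 (such as lone
        # surrogates), then drop anything outside the Basic Multilingual Plane.
        page_text = page_text.encode("utf-8", errors="replace").decode("utf-8")
        parts.append("".join(ch for ch in page_text if ord(ch) < 0x10000))
    return "".join(parts), failed


def has_extracted_text(text: str) -> bool:
    """Return True if any line other than a page marker contains text."""
    return any(
        line.strip() and not line.startswith("--- Page ")
        for line in text.splitlines()
    )
```

Under these assumptions, the loop in `main()` could call `join_pages(page.extract_text for page in pdf_reader.pages)` and report the failed page numbers with `st.warning`. The marker-aware `has_extracted_text` check matters because the combined text always contains page markers, so a plain `strip()` test would never report an empty document.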
