documentassistant

Allows for easy searching of a document

https://github.com/ggrow3/documentassistant

Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (13.8%) to scientific vocabulary
Last synced: 6 months ago

Repository

Allows for easy searching of a document

Basic Info
  • Host: GitHub
  • Owner: ggrow3
  • Language: Python
  • Default Branch: main
  • Size: 0 Bytes
Statistics
  • Stars: 0
  • Watchers: 1
  • Forks: 0
  • Open Issues: 0
  • Releases: 0
Created 10 months ago · Last pushed 10 months ago
Metadata Files
  • Readme
  • Citation

README.md

Legal Document AI Assistant

A Streamlit application that helps law firms manage, process, and query their case documents using AI.

Features

  • Document Management: Upload, organize, and tag case documents
  • Document Processing: Extract text from PDFs, Word documents, and images
  • AI-Powered Search: Ask questions in natural language about your case documents
  • Citation Tracking: See the exact sources for information provided by the AI
  • Tag-Based Organization: Add and filter documents using tags
  • OCR Support: Extract text from images and scanned PDFs
  • Multiple Vector Database Options: Choose between Chroma (in-memory) or Pinecone (cloud-based)

Project Structure

  • app.py: Main application entry point
  • sidebar.py: Sidebar interface with document management and settings
  • chat_interface.py: Chat interface and processing
  • document_context.py: Document context panel display
  • about_page.py: About page content
  • system_check.py: System dependency checking
  • document_processing.py: Document extraction and processing
  • vector_store.py: Vector database and embedding functionality
  • citation_handler.py: Citation tracking functionality
  • ui_components.py: UI components and styling
  • utils.py: Utility functions
  • pinecone_setup.py: Helper script for Pinecone setup
  • requirements.txt: Dependencies

Installation

  1. Clone this repository
  2. Install the required dependencies:

```bash
pip install -r requirements.txt
```

  3. For OCR functionality (optional but recommended):
    • Install Tesseract OCR on your system (Windows; macOS: `brew install tesseract`; Linux: `apt-get install tesseract-ocr`)
    • For enhanced PDF OCR, install one of:

```bash
pip install PyMuPDF
# or
pip install pdf2image
# pdf2image also requires poppler:
#   Ubuntu/Debian: sudo apt-get install poppler-utils
#   macOS: brew install poppler
```

Usage

  1. Run the Streamlit app:

```bash
streamlit run app.py
```

  2. Open your browser and navigate to the provided URL (typically http://localhost:8501)

  3. Enter your OpenAI API key in the Settings tab (sidebar)

  4. Upload and process documents in the Document Management tab (sidebar)

  5. Use the chat interface to ask questions about your documents

Vector Database Options

Chroma (Default)

  • In-memory vector database that works well for local usage
  • Fast and easy to set up
  • Data is not persisted between sessions

Pinecone

  • Cloud-based vector database for persistent storage
  • Requires a Pinecone account and API key
  • Good for larger document collections and persistence between sessions

To set up Pinecone:

  1. Sign up for a Pinecone account
  2. Create an API key from the Pinecone console
  3. Run the setup script to create an index (optional):

```bash
python pinecone_setup.py --api-key YOUR_API_KEY --environment YOUR_ENVIRONMENT
```

  4. In the application settings, select "Pinecone" as the vector store type
  5. Enter your API key and index information
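The setup script itself isn't shown here, but given the flags above its command-line interface can be sketched with `argparse`; the `--index-name` flag and its default are assumptions, not confirmed parts of `pinecone_setup.py`:

```python
import argparse

def parse_args(argv=None):
    """Parse the flags pinecone_setup.py appears to accept (sketch)."""
    parser = argparse.ArgumentParser(description="Create a Pinecone index")
    parser.add_argument("--api-key", required=True, help="Pinecone API key")
    parser.add_argument("--environment", required=True, help="Pinecone environment")
    # Hypothetical extra flag; the real script may name this differently.
    parser.add_argument("--index-name", default="documents",
                        help="Name of the index to create (assumed default)")
    return parser.parse_args(argv)

args = parse_args(["--api-key", "KEY", "--environment", "us-east-1"])
print(args.index_name)  # → documents
```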

Document Types Supported

  • PDF files (.pdf)
  • Word documents (.docx, .doc)
  • Text files (.txt)
  • Images with text (.jpg, .jpeg, .png)
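Routing each upload by file extension is presumably how `document_processing.py` picks an extractor for the types above; a minimal sketch, where the extractor names in the registry are hypothetical:

```python
from pathlib import Path

# Hypothetical registry mapping supported extensions to extractor names.
EXTRACTORS = {
    ".pdf": "extract_pdf",
    ".docx": "extract_docx",
    ".doc": "extract_docx",
    ".txt": "extract_txt",
    ".jpg": "extract_image_ocr",
    ".jpeg": "extract_image_ocr",
    ".png": "extract_image_ocr",
}

def extractor_for(filename):
    """Return the extractor name for a file, or raise for unsupported types."""
    ext = Path(filename).suffix.lower()
    try:
        return EXTRACTORS[ext]
    except KeyError:
        raise ValueError(f"Unsupported document type: {ext}")

print(extractor_for("deposition.PDF"))  # → extract_pdf
```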

Advanced Features

Tag-Based Search

You can search for documents with specific tags by including either of the following in your queries:

  • tag:important
  • #important
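One way to pull such tags out of a query before it reaches the retriever is a small regular expression; this is a sketch of the idea, not necessarily how the app parses queries:

```python
import re

# Matches either "tag:name" or "#name" tokens in a query string (assumed syntax).
TAG_PATTERN = re.compile(r"(?:tag:|#)(\w+)")

def extract_tags(query):
    """Return (tags, remaining_query) extracted from a search query."""
    tags = TAG_PATTERN.findall(query)
    remaining = TAG_PATTERN.sub("", query)
    remaining = re.sub(r"\s{2,}", " ", remaining).strip()
    return tags, remaining

tags, rest = extract_tags("find tag:important notes about the #contract")
print(tags)  # → ['important', 'contract']
```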

OCR Settings

OCR settings can be configured in the sidebar for better text extraction from images and scanned PDFs.

Privacy & Security

  • All documents are processed locally
  • Your OpenAI API key is stored only in your current session
  • If using Pinecone, document embeddings are stored in your Pinecone account
  • No document data is sent to external servers beyond what's needed for AI processing with OpenAI

Requirements

  • Python 3.8+
  • OpenAI API key
  • Pinecone API key (optional)
  • Tesseract OCR (for image text extraction)

License

This project is open source and available under the MIT License.

Owner

  • Name: Kevin MacNeel
  • Login: ggrow3
  • Kind: user
  • Company: Health Plus

Citation (citation_handler.py)

from langchain_core.callbacks import BaseCallbackHandler

class CitationTrackingHandler(BaseCallbackHandler):
    """
    Callback handler for tracking citations in LangChain.

    This handler captures documents returned by the retriever and converts
    them to a format suitable for displaying citations in the UI. Duplicate
    sources are tracked in a set so each citation is recorded only once.
    """
    def __init__(self):
        """Initialize the citation handler"""
        self.citations = []
        self.citation_sources = set()  # Track unique citation keys to prevent duplicates

    def _debug(self, message, level="write"):
        """Emit a debug message via Streamlit when debug mode is enabled."""
        import streamlit as st
        if st.session_state.get("debug_mode", False):
            getattr(st, level)(message)

    def on_chain_start(self, serialized, inputs, **kwargs):
        """Called when the chain starts running."""
        self._debug("Chain started")

    def on_chain_end(self, outputs, **kwargs):
        """Called when the chain finishes running."""
        # Check if we have source documents in the output
        source_documents = outputs.get("source_documents")
        if source_documents:
            self._debug(f"Chain finished with {len(source_documents)} source documents")
            # Process the source documents
            self.on_retriever_end(source_documents)
        else:
            self._debug("Chain finished but no source documents were found in outputs")

    def on_retriever_start(self, query, **kwargs):
        """Called when the retriever starts retrieving documents."""
        self._debug(f"Retriever started with query: {query}")
        # Clear existing citations when starting a new retrieval
        self.citations = []
        self.citation_sources = set()

    def _add_citation(self, content, metadata):
        """Build a citation dict from content and metadata, skipping duplicates."""
        # Create a unique key for this citation to check for duplicates
        doc_id = metadata.get("doc_id", "Unknown")
        source = metadata.get("source", "Unknown")
        page = metadata.get("page", 0)
        chunk = metadata.get("chunk", 0)
        citation_key = f"{doc_id}_{page}_{chunk}_{source}"

        # Skip if we've already seen this exact citation
        if citation_key in self.citation_sources:
            self._debug(f"Skipping duplicate citation: {citation_key}")
            return
        self.citation_sources.add(citation_key)

        # Convert tags_str back to a list if it exists
        tags_str = metadata.get("tags_str", "")
        tags = tags_str.split(",") if tags_str and isinstance(tags_str, str) else []

        self.citations.append({
            "text": content,
            "source": source,
            "doc_id": doc_id,
            "doc_type": metadata.get("doc_type", "Unknown"),
            "case_id": metadata.get("case_id", "Unknown"),
            "tags": tags,
            "page": page,
            "chunk": chunk
        })

    @staticmethod
    def _unknown_citation(text):
        """Citation dict for documents with no usable metadata."""
        return {
            "text": text,
            "source": "Unknown",
            "doc_id": "Unknown",
            "doc_type": "Unknown",
            "case_id": "Unknown",
            "tags": [],
            "page": 0,
            "chunk": 0
        }

    def on_retriever_end(self, documents, **kwargs):
        """
        Callback that runs after the retriever finishes retrieving documents.

        Args:
            documents: The documents returned by the retriever
            **kwargs: Additional keyword arguments
        """
        self.citations = []
        self.citation_sources = set()

        if not documents:
            self._debug("No documents were retrieved", level="warning")
            return

        self._debug(f"Retrieved {len(documents)} documents")

        for doc in documents:
            try:
                if hasattr(doc, 'page_content') and hasattr(doc, 'metadata'):
                    # A LangChain Document object
                    self._debug(f"Document metadata: {doc.metadata}")
                    self._add_citation(doc.page_content, doc.metadata)
                    self._debug(f"Successfully processed citation from {doc.metadata.get('source', 'Unknown')}")
                elif isinstance(doc, dict):
                    # A dictionary in Document-like format
                    self._add_citation(doc.get("page_content", ""), doc.get("metadata", {}))
                elif isinstance(doc, str):
                    # A bare string: no metadata to deduplicate on,
                    # but we can still track it by content hash
                    citation_key = f"string_{hash(doc)}"
                    if citation_key in self.citation_sources:
                        continue
                    self.citation_sources.add(citation_key)
                    self.citations.append(self._unknown_citation(doc))
                else:
                    # Unknown format: fall back to the string representation
                    citation_key = f"unknown_{hash(str(doc))}"
                    if citation_key in self.citation_sources:
                        continue
                    self.citation_sources.add(citation_key)
                    self.citations.append(self._unknown_citation(str(doc)))
            except Exception as e:
                # If anything goes wrong, record each unique error as an error citation
                self._debug(f"Error processing citation: {str(e)}", level="error")
                citation_key = f"error_{str(e)}"
                if citation_key in self.citation_sources:
                    continue
                self.citation_sources.add(citation_key)
                self.citations.append({
                    "text": f"[Error processing document: {str(e)}]",
                    "source": "Error",
                    "doc_id": "Error",
                    "doc_type": "Error",
                    "case_id": "Error",
                    "tags": [],
                    "page": 0,
                    "chunk": 0
                })

        self._debug(f"Total citations processed: {len(self.citations)}")
GitHub Events

Total
  • Push event: 13
  • Create event: 2
Last Year
  • Push event: 13
  • Create event: 2

Dependencies

requirements.txt pypi
  • Pillow >=9.5.0
  • PyMuPDF >=1.22.5
  • PyPDF2 >=3.0.0
  • chromadb >=0.4.18
  • langchain >=0.1.0
  • langchain-community >=0.0.13
  • langchain-core >=0.1.10
  • langchain-openai >=0.0.2
  • langchain-text-splitters >=0.0.1
  • openai >=1.1.1
  • pdf2image >=1.16.3
  • pinecone-client >=2.2.1
  • pypdf *
  • pytesseract >=0.3.10
  • python-docx >=0.8.11
  • regex >=2023.6.3
  • streamlit >=1.22.0