documentassistant

Allows for easy searching of a document

https://github.com/ggrow3/documentassistant

Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (13.8%) to scientific vocabulary
Last synced: 6 months ago

Repository

Allows for easy searching of a document

Basic Info
  • Host: GitHub
  • Owner: ggrow3
  • Language: Python
  • Default Branch: main
  • Size: 0 Bytes
Statistics
  • Stars: 0
  • Watchers: 1
  • Forks: 0
  • Open Issues: 0
  • Releases: 0
Created 10 months ago · Last pushed 10 months ago
Metadata Files
  • Readme
  • Citation

README.md

Legal Document AI Assistant

A Streamlit application that helps law firms manage, process, and query their case documents using AI.

Features

  • Document Management: Upload, organize, and tag case documents
  • Document Processing: Extract text from PDFs, Word documents, and images
  • AI-Powered Search: Ask questions in natural language about your case documents
  • Citation Tracking: See the exact sources for information provided by the AI
  • Tag-Based Organization: Add and filter documents using tags
  • OCR Support: Extract text from images and scanned PDFs
  • Multiple Vector Database Options: Choose between Chroma (in-memory) or Pinecone (cloud-based)

Project Structure

  • app.py: Main application entry point
  • sidebar.py: Sidebar interface with document management and settings
  • chat_interface.py: Chat interface and processing
  • document_context.py: Document context panel display
  • about_page.py: About page content
  • system_check.py: System dependency checking
  • document_processing.py: Document extraction and processing
  • vector_store.py: Vector database and embedding functionality
  • citation_handler.py: Citation tracking functionality
  • ui_components.py: UI components and styling
  • utils.py: Utility functions
  • pinecone_setup.py: Helper script for Pinecone setup
  • requirements.txt: Dependencies

Installation

  1. Clone this repository
  2. Install the required dependencies:

```bash
pip install -r requirements.txt
```

  3. For OCR functionality (optional but recommended):
    • Install Tesseract OCR on your system (Windows; macOS: `brew install tesseract`; Linux: `apt-get install tesseract-ocr`)
    • For enhanced PDF OCR, install one of:

```bash
pip install PyMuPDF
# or
pip install pdf2image
# pdf2image also requires poppler:
#   Ubuntu/Debian: sudo apt-get install poppler-utils
#   macOS: brew install poppler
```

Usage

  1. Run the Streamlit app:

```bash
streamlit run app.py
```

  2. Open your browser and navigate to the provided URL (typically http://localhost:8501)

  3. Enter your OpenAI API key in the Settings tab (sidebar)

  4. Upload and process documents in the Document Management tab (sidebar)

  5. Use the chat interface to ask questions about your documents

Vector Database Options

Chroma (Default)

  • In-memory vector database that works well for local usage
  • Fast and easy to set up
  • Data is not persisted between sessions

Pinecone

  • Cloud-based vector database for persistent storage
  • Requires a Pinecone account and API key
  • Good for larger document collections and persistence between sessions

To set up Pinecone:

  1. Sign up for a Pinecone account
  2. Create an API key from the Pinecone console
  3. Run the setup script to create an index (optional):

```bash
python pinecone_setup.py --api-key YOUR_API_KEY --environment YOUR_ENVIRONMENT
```

  4. In the application settings, select "Pinecone" as the vector store type
  5. Enter your API key and index information
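The setup script itself isn't shown here, but given the flags above its command-line interface can be sketched with `argparse`; the `--index-name` flag and its default are assumptions, not confirmed parts of `pinecone_setup.py`:

```python
import argparse

def parse_args(argv=None):
    """Parse the flags pinecone_setup.py appears to accept (sketch)."""
    parser = argparse.ArgumentParser(description="Create a Pinecone index")
    parser.add_argument("--api-key", required=True, help="Pinecone API key")
    parser.add_argument("--environment", required=True, help="Pinecone environment")
    # Hypothetical extra flag; the real script may name this differently.
    parser.add_argument("--index-name", default="documents",
                        help="Name of the index to create (assumed default)")
    return parser.parse_args(argv)

args = parse_args(["--api-key", "KEY", "--environment", "us-east-1"])
print(args.index_name)  # → documents
```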

Document Types Supported

  • PDF files (.pdf)
  • Word documents (.docx, .doc)
  • Text files (.txt)
  • Images with text (.jpg, .jpeg, .png)
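Routing each upload by file extension is presumably how `document_processing.py` picks an extractor for the types above; a minimal sketch, where the extractor names in the registry are hypothetical:

```python
from pathlib import Path

# Hypothetical registry mapping supported extensions to extractor names.
EXTRACTORS = {
    ".pdf": "extract_pdf",
    ".docx": "extract_docx",
    ".doc": "extract_docx",
    ".txt": "extract_txt",
    ".jpg": "extract_image_ocr",
    ".jpeg": "extract_image_ocr",
    ".png": "extract_image_ocr",
}

def extractor_for(filename):
    """Return the extractor name for a file, or raise for unsupported types."""
    ext = Path(filename).suffix.lower()
    try:
        return EXTRACTORS[ext]
    except KeyError:
        raise ValueError(f"Unsupported document type: {ext}")

print(extractor_for("deposition.PDF"))  # → extract_pdf
```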

Advanced Features

Tag-Based Search

You can search for documents with specific tags by including either of the following in your queries:

  • tag:important
  • #important
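One way to pull such tags out of a query before it reaches the retriever is a small regular expression; this is a sketch of the idea, not necessarily how the app parses queries:

```python
import re

# Matches either "tag:name" or "#name" tokens in a query string (assumed syntax).
TAG_PATTERN = re.compile(r"(?:tag:|#)(\w+)")

def extract_tags(query):
    """Return (tags, remaining_query) extracted from a search query."""
    tags = TAG_PATTERN.findall(query)
    remaining = TAG_PATTERN.sub("", query)
    remaining = re.sub(r"\s{2,}", " ", remaining).strip()
    return tags, remaining

tags, rest = extract_tags("find tag:important notes about the #contract")
print(tags)  # → ['important', 'contract']
```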

OCR Settings

OCR settings can be configured in the sidebar for better text extraction from images and scanned PDFs.

Privacy & Security

  • All documents are processed locally
  • Your OpenAI API key is stored only in your current session
  • If using Pinecone, document embeddings are stored in your Pinecone account
  • No document data is sent to external servers beyond what's needed for AI processing with OpenAI

Requirements

  • Python 3.8+
  • OpenAI API key
  • Pinecone API key (optional)
  • Tesseract OCR (for image text extraction)

License

This project is open source and available under the MIT License.

Owner

  • Name: Kevin MacNeel
  • Login: ggrow3
  • Kind: user
  • Company: Health Plus

Citation (citation_handler.py)

from langchain_core.callbacks import BaseCallbackHandler

class CitationTrackingHandler(BaseCallbackHandler):
    """
    Callback handler for tracking citations in LangChain.

    This handler captures documents returned by the retriever and converts
    them to a format suitable for displaying citations in the UI. Duplicate
    sources are tracked in a set so each citation is recorded only once.
    """
    def __init__(self):
        """Initialize the citation handler"""
        self.citations = []
        self.citation_sources = set()  # Track unique citation keys to prevent duplicates

    def _debug(self, message, level="write"):
        """Emit a debug message via Streamlit when debug mode is enabled."""
        import streamlit as st
        if st.session_state.get("debug_mode", False):
            getattr(st, level)(message)

    def on_chain_start(self, serialized, inputs, **kwargs):
        """Called when the chain starts running."""
        self._debug("Chain started")

    def on_chain_end(self, outputs, **kwargs):
        """Called when the chain finishes running."""
        # Check if we have source documents in the output
        source_documents = outputs.get("source_documents")
        if source_documents:
            self._debug(f"Chain finished with {len(source_documents)} source documents")
            # Process the source documents
            self.on_retriever_end(source_documents)
        else:
            self._debug("Chain finished but no source documents were found in outputs")

    def on_retriever_start(self, query, **kwargs):
        """Called when the retriever starts retrieving documents."""
        self._debug(f"Retriever started with query: {query}")
        # Clear existing citations when starting a new retrieval
        self.citations = []
        self.citation_sources = set()

    def _add_citation(self, content, metadata):
        """Build a citation dict from content and metadata, skipping duplicates."""
        # Create a unique key for this citation to check for duplicates
        doc_id = metadata.get("doc_id", "Unknown")
        source = metadata.get("source", "Unknown")
        page = metadata.get("page", 0)
        chunk = metadata.get("chunk", 0)
        citation_key = f"{doc_id}_{page}_{chunk}_{source}"

        # Skip if we've already seen this exact citation
        if citation_key in self.citation_sources:
            self._debug(f"Skipping duplicate citation: {citation_key}")
            return
        self.citation_sources.add(citation_key)

        # Convert tags_str back to a list if it exists
        tags_str = metadata.get("tags_str", "")
        tags = tags_str.split(",") if tags_str and isinstance(tags_str, str) else []

        self.citations.append({
            "text": content,
            "source": source,
            "doc_id": doc_id,
            "doc_type": metadata.get("doc_type", "Unknown"),
            "case_id": metadata.get("case_id", "Unknown"),
            "tags": tags,
            "page": page,
            "chunk": chunk
        })

    @staticmethod
    def _unknown_citation(text):
        """Citation dict for documents with no usable metadata."""
        return {
            "text": text,
            "source": "Unknown",
            "doc_id": "Unknown",
            "doc_type": "Unknown",
            "case_id": "Unknown",
            "tags": [],
            "page": 0,
            "chunk": 0
        }

    def on_retriever_end(self, documents, **kwargs):
        """
        Callback that runs after the retriever finishes retrieving documents.

        Args:
            documents: The documents returned by the retriever
            **kwargs: Additional keyword arguments
        """
        self.citations = []
        self.citation_sources = set()

        if not documents:
            self._debug("No documents were retrieved", level="warning")
            return

        self._debug(f"Retrieved {len(documents)} documents")

        for doc in documents:
            try:
                if hasattr(doc, 'page_content') and hasattr(doc, 'metadata'):
                    # A LangChain Document object
                    self._debug(f"Document metadata: {doc.metadata}")
                    self._add_citation(doc.page_content, doc.metadata)
                    self._debug(f"Successfully processed citation from {doc.metadata.get('source', 'Unknown')}")
                elif isinstance(doc, dict):
                    # A dictionary in Document-like format
                    self._add_citation(doc.get("page_content", ""), doc.get("metadata", {}))
                elif isinstance(doc, str):
                    # A bare string: no metadata to deduplicate on,
                    # but we can still track it by content hash
                    citation_key = f"string_{hash(doc)}"
                    if citation_key in self.citation_sources:
                        continue
                    self.citation_sources.add(citation_key)
                    self.citations.append(self._unknown_citation(doc))
                else:
                    # Unknown format: fall back to the string representation
                    citation_key = f"unknown_{hash(str(doc))}"
                    if citation_key in self.citation_sources:
                        continue
                    self.citation_sources.add(citation_key)
                    self.citations.append(self._unknown_citation(str(doc)))
            except Exception as e:
                # If anything goes wrong, record each unique error as an error citation
                self._debug(f"Error processing citation: {str(e)}", level="error")
                citation_key = f"error_{str(e)}"
                if citation_key in self.citation_sources:
                    continue
                self.citation_sources.add(citation_key)
                self.citations.append({
                    "text": f"[Error processing document: {str(e)}]",
                    "source": "Error",
                    "doc_id": "Error",
                    "doc_type": "Error",
                    "case_id": "Error",
                    "tags": [],
                    "page": 0,
                    "chunk": 0
                })

        self._debug(f"Total citations processed: {len(self.citations)}")
GitHub Events

Total
  • Push event: 13
  • Create event: 2
Last Year
  • Push event: 13
  • Create event: 2

Dependencies

requirements.txt pypi
  • Pillow >=9.5.0
  • PyMuPDF >=1.22.5
  • PyPDF2 >=3.0.0
  • chromadb >=0.4.18
  • langchain >=0.1.0
  • langchain-community >=0.0.13
  • langchain-core >=0.1.10
  • langchain-openai >=0.0.2
  • langchain-text-splitters >=0.0.1
  • openai >=1.1.1
  • pdf2image >=1.16.3
  • pinecone-client >=2.2.1
  • pypdf *
  • pytesseract >=0.3.10
  • python-docx >=0.8.11
  • regex >=2023.6.3
  • streamlit >=1.22.0