documentassistant
Allows for easy searching of a document
Science Score: 44.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: found CITATION.cff file
- ✓ codemeta.json file: found codemeta.json file
- ✓ .zenodo.json file: found .zenodo.json file
- ○ DOI references
- ○ Academic publication links
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (13.8%) to scientific vocabulary
Repository
Allows for easy searching of a document
Basic Info
- Host: GitHub
- Owner: ggrow3
- Language: Python
- Default Branch: main
- Size: 0 Bytes
Statistics
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
Legal Document AI Assistant
A Streamlit application that helps law firms manage, process, and query their case documents using AI.
Features
- Document Management: Upload, organize, and tag case documents
- Document Processing: Extract text from PDFs, Word documents, and images
- AI-Powered Search: Ask questions in natural language about your case documents
- Citation Tracking: See the exact sources for information provided by the AI
- Tag-Based Organization: Add and filter documents using tags
- OCR Support: Extract text from images and scanned PDFs
- Multiple Vector Database Options: Choose between Chroma (in-memory) or Pinecone (cloud-based)
Project Structure
- app.py: Main application entry point
- sidebar.py: Sidebar interface with document management and settings
- chat_interface.py: Chat interface and processing
- document_context.py: Document context panel display
- about_page.py: About page content
- system_check.py: System dependency checking
- document_processing.py: Document extraction and processing
- vector_store.py: Vector database and embedding functionality
- citation_handler.py: Citation tracking functionality
- ui_components.py: UI components and styling
- utils.py: Utility functions
- pinecone_setup.py: Helper script for Pinecone setup
- requirements.txt: Dependencies
Installation
- Clone this repository
- Install the required dependencies:
```bash
pip install -r requirements.txt
```
- For OCR functionality (optional but recommended), install Tesseract OCR.
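Tesseract is a system package rather than a pip dependency, so it is installed outside of `requirements.txt`. On common platforms the install looks roughly like this (exact package names may vary by distribution; `poppler` is assumed here because `pdf2image` typically needs it for scanned PDFs):

```shell
# Debian/Ubuntu
sudo apt-get install tesseract-ocr poppler-utils

# macOS (Homebrew)
brew install tesseract poppler
```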
Usage
- Run the Streamlit app:
```bash
streamlit run app.py
```
- Open your browser and navigate to the provided URL (typically http://localhost:8501)
- Enter your OpenAI API key in the Settings tab (sidebar)
- Upload and process documents in the Document Management tab (sidebar)
- Use the chat interface to ask questions about your documents
Vector Database Options
Chroma (Default)
- In-memory vector database that works well for local usage
- Fast and easy to set up
- Data is not persisted between sessions
Pinecone
- Cloud-based vector database for persistent storage
- Requires a Pinecone account and API key
- Good for larger document collections and persistence between sessions
To set up Pinecone:
1. Sign up for a Pinecone account
2. Create an API key from the Pinecone console
3. Run the setup script to create an index (optional):
```bash
python pinecone_setup.py --api-key YOUR_API_KEY --environment YOUR_ENVIRONMENT
```
4. In the application settings, select "Pinecone" as the vector store type
5. Enter your API key and index information
Document Types Supported
- PDF files (.pdf)
- Word documents (.docx, .doc)
- Text files (.txt)
- Images with text (.jpg, .jpeg, .png)
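Support for these types usually comes down to routing each file to the right extractor by extension. The sketch below is purely illustrative; the handler names are hypothetical and the project's actual routing lives in document_processing.py, which may differ:

```python
from pathlib import Path

# Hypothetical extension-to-extractor map; handler names are placeholders,
# not functions defined in this repository.
HANDLERS = {
    ".pdf": "extract_pdf",
    ".docx": "extract_docx",
    ".doc": "extract_docx",
    ".txt": "extract_text",
    ".jpg": "extract_image_ocr",
    ".jpeg": "extract_image_ocr",
    ".png": "extract_image_ocr",
}


def handler_for(filename: str) -> str:
    """Return the name of the extraction routine for a file, chosen by extension."""
    suffix = Path(filename).suffix.lower()
    try:
        return HANDLERS[suffix]
    except KeyError:
        raise ValueError(f"Unsupported document type: {suffix or filename}")
```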
Advanced Features
Tag-Based Search
You can search for documents with specific tags using:
- `tag:important` in your queries
- `#important` in your queries
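Both token forms can be pulled out of a query with a single regular expression. This is a minimal sketch of that parsing step, not the app's actual parser, which may behave differently:

```python
import re

# Matches either "tag:word" or "#word" anywhere in the query.
TAG_PATTERN = re.compile(r"(?:tag:|#)(\w+)")


def extract_tags(query: str) -> tuple[list[str], str]:
    """Split a query into its tag filters and the remaining free-text question."""
    tags = TAG_PATTERN.findall(query)
    remainder = TAG_PATTERN.sub("", query).strip()
    remainder = re.sub(r"\s{2,}", " ", remainder)  # collapse leftover double spaces
    return tags, remainder
```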
OCR Settings
OCR settings can be configured in the sidebar for better text extraction from images and scanned PDFs.
Privacy & Security
- All documents are processed locally
- Your OpenAI API key is stored only in your current session
- If using Pinecone, document embeddings are stored in your Pinecone account
- No document data is sent to external servers beyond what's needed for AI processing with OpenAI
Requirements
- Python 3.8+
- OpenAI API key
- Pinecone API key (optional)
- Tesseract OCR (for image text extraction)
License
This project is open source and available under the MIT License.
Owner
- Name: Kevin MacNeel
- Login: ggrow3
- Kind: user
- Company: Health Plus
- Website: http://healthplus.azurewebsites.net/
- Repositories: 47
- Profile: https://github.com/ggrow3
Citation (citation_handler.py)
```python
from langchain_core.callbacks import BaseCallbackHandler


class CitationTrackingHandler(BaseCallbackHandler):
    """
    Callback handler for tracking citations in LangChain.

    This handler captures documents returned by the retriever and converts
    them to a format suitable for displaying citations in the UI.
    """

    def __init__(self):
        """Initialize the citation handler"""
        self.citations = []
        self.citation_sources = set()  # Track unique citation sources to prevent duplicates

    def on_chain_start(self, serialized, inputs, **kwargs):
        """Called when the chain starts running."""
        import streamlit as st
        if st.session_state.get("debug_mode", False):
            st.write("Chain started")

    def on_chain_end(self, outputs, **kwargs):
        """Called when the chain finishes running."""
        import streamlit as st
        # Check if we have source documents in the output
        if 'source_documents' in outputs and outputs['source_documents']:
            if st.session_state.get("debug_mode", False):
                st.write(f"Chain finished with {len(outputs['source_documents'])} source documents")
            # Process the source documents
            self.on_retriever_end(outputs['source_documents'])
        else:
            if st.session_state.get("debug_mode", False):
                st.write("Chain finished but no source documents were found in outputs")

    def on_retriever_start(self, query, **kwargs):
        """Called when the retriever starts retrieving documents."""
        import streamlit as st
        if st.session_state.get("debug_mode", False):
            st.write(f"Retriever started with query: {query}")
        # Clear existing citations when starting a new retrieval
        self.citations = []
        self.citation_sources = set()

    def on_retriever_end(self, documents, **kwargs):
        """
        Callback that runs after the retriever finishes retrieving documents.

        Args:
            documents: The documents returned by the retriever
            **kwargs: Additional keyword arguments
        """
        import streamlit as st
        self.citations = []
        self.citation_sources = set()

        if not documents:
            if st.session_state.get("debug_mode", False):
                st.warning("No documents were retrieved")
            return

        if st.session_state.get("debug_mode", False):
            st.write(f"Retrieved {len(documents)} documents")

        for doc in documents:
            try:
                # Make sure we're dealing with a Document object
                if hasattr(doc, 'page_content') and hasattr(doc, 'metadata'):
                    # Get content and metadata
                    content = doc.page_content
                    metadata = doc.metadata

                    # Debug the metadata
                    if st.session_state.get("debug_mode", False):
                        st.write(f"Document metadata: {metadata}")

                    # Create a unique key for this citation to check for duplicates
                    doc_id = metadata.get("doc_id", "Unknown")
                    source = metadata.get("source", "Unknown")
                    page = metadata.get("page", 0)
                    chunk = metadata.get("chunk", 0)
                    citation_key = f"{doc_id}_{page}_{chunk}_{source}"

                    # Skip if we've already seen this exact citation
                    if citation_key in self.citation_sources:
                        if st.session_state.get("debug_mode", False):
                            st.write(f"Skipping duplicate citation: {citation_key}")
                        continue

                    # Add this citation key to our set of seen citations
                    self.citation_sources.add(citation_key)

                    # Convert tags_str back to a list if it exists
                    tags = []
                    tags_str = metadata.get("tags_str", "")
                    if tags_str and isinstance(tags_str, str):
                        tags = tags_str.split(",")

                    self.citations.append({
                        "text": content,
                        "source": source,
                        "doc_id": doc_id,
                        "doc_type": metadata.get("doc_type", "Unknown"),
                        "case_id": metadata.get("case_id", "Unknown"),
                        "tags": tags,
                        "page": page,
                        "chunk": chunk
                    })

                    if st.session_state.get("debug_mode", False):
                        st.write(f"Successfully processed citation from {metadata.get('source', 'Unknown')}")
                elif isinstance(doc, dict):
                    # It's a dictionary format
                    metadata = doc.get("metadata", {})

                    # Create a unique key for this citation to check for duplicates
                    doc_id = metadata.get("doc_id", "Unknown")
                    source = metadata.get("source", "Unknown")
                    page = metadata.get("page", 0)
                    chunk = metadata.get("chunk", 0)
                    citation_key = f"{doc_id}_{page}_{chunk}_{source}"

                    # Skip if we've already seen this exact citation
                    if citation_key in self.citation_sources:
                        if st.session_state.get("debug_mode", False):
                            st.write(f"Skipping duplicate citation: {citation_key}")
                        continue

                    # Add this citation key to our set of seen citations
                    self.citation_sources.add(citation_key)

                    # Convert tags_str back to a list if it exists
                    tags = []
                    tags_str = metadata.get("tags_str", "")
                    if tags_str and isinstance(tags_str, str):
                        tags = tags_str.split(",")

                    self.citations.append({
                        "text": doc.get("page_content", ""),
                        "source": source,
                        "doc_id": doc_id,
                        "doc_type": metadata.get("doc_type", "Unknown"),
                        "case_id": metadata.get("case_id", "Unknown"),
                        "tags": tags,
                        "page": page,
                        "chunk": chunk
                    })
                elif isinstance(doc, str):
                    # It's just a string - not much we can do about deduplication here,
                    # but we can still track it for future duplicates
                    citation_key = f"string_{hash(doc)}"

                    # Skip if we've already seen this exact citation
                    if citation_key in self.citation_sources:
                        continue

                    # Add this citation key to our set of seen citations
                    self.citation_sources.add(citation_key)

                    self.citations.append({
                        "text": doc,
                        "source": "Unknown",
                        "doc_id": "Unknown",
                        "doc_type": "Unknown",
                        "case_id": "Unknown",
                        "tags": [],
                        "page": 0,
                        "chunk": 0
                    })
                else:
                    # Unknown format - generate a unique key based on the string representation
                    citation_key = f"unknown_{hash(str(doc))}"

                    # Skip if we've already seen this exact citation
                    if citation_key in self.citation_sources:
                        continue

                    # Add this citation key to our set of seen citations
                    self.citation_sources.add(citation_key)

                    self.citations.append({
                        "text": str(doc),
                        "source": "Unknown",
                        "doc_id": "Unknown",
                        "doc_type": "Unknown",
                        "case_id": "Unknown",
                        "tags": [],
                        "page": 0,
                        "chunk": 0
                    })
            except Exception as e:
                # If anything goes wrong, add an error citation
                if st.session_state.get("debug_mode", False):
                    st.error(f"Error processing citation: {str(e)}")

                # For errors, we still want to show unique errors
                citation_key = f"error_{str(e)}"

                # Skip if we've already seen this exact error
                if citation_key in self.citation_sources:
                    continue

                # Add this citation key to our set of seen citations
                self.citation_sources.add(citation_key)

                self.citations.append({
                    "text": f"[Error processing document: {str(e)}]",
                    "source": "Error",
                    "doc_id": "Error",
                    "doc_type": "Error",
                    "case_id": "Error",
                    "tags": [],
                    "page": 0,
                    "chunk": 0
                })

        if st.session_state.get("debug_mode", False):
            st.write(f"Total citations processed: {len(self.citations)}")
```
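The deduplication in this handler hinges on a composite key built from the document id, page, chunk, and source. Stripped of the Streamlit and LangChain plumbing, the core idea reduces to a few lines (an illustrative sketch, not code from the repository):

```python
def citation_key(metadata: dict) -> str:
    """Build the composite key used to spot duplicate citations."""
    return "{}_{}_{}_{}".format(
        metadata.get("doc_id", "Unknown"),
        metadata.get("page", 0),
        metadata.get("chunk", 0),
        metadata.get("source", "Unknown"),
    )


def dedupe(citations: list[dict]) -> list[dict]:
    """Keep only the first citation for each unique (doc_id, page, chunk, source)."""
    seen: set[str] = set()
    unique = []
    for c in citations:
        key = citation_key(c)
        if key not in seen:
            seen.add(key)
            unique.append(c)
    return unique
```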
GitHub Events
Total
- Push event: 13
- Create event: 2
Last Year
- Push event: 13
- Create event: 2
Dependencies
- Pillow >=9.5.0
- PyMuPDF >=1.22.5
- PyPDF2 >=3.0.0
- chromadb >=0.4.18
- langchain >=0.1.0
- langchain-community >=0.0.13
- langchain-core >=0.1.10
- langchain-openai >=0.0.2
- langchain-text-splitters >=0.0.1
- openai >=1.1.1
- pdf2image >=1.16.3
- pinecone-client >=2.2.1
- pypdf *
- pytesseract >=0.3.10
- python-docx >=0.8.11
- regex >=2023.6.3
- streamlit >=1.22.0