https://github.com/crvernon/knowai
AI pipeline for large PDF interrogation
Science Score: 49.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file: found codemeta.json file
- ✓ .zenodo.json file: found .zenodo.json file
- ✓ DOI references: found 2 DOI reference(s) in README
- ✓ Academic publication links: links to zenodo.org
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (11.1%) to scientific vocabulary
Repository
AI pipeline for large PDF interrogation
Basic Info
- Host: GitHub
- Owner: crvernon
- License: bsd-3-clause
- Language: Python
- Default Branch: main
- Size: 639 KB
Statistics
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
- Releases: 8
Metadata Files
README.md
knowai
An agentic AI pipeline for interrogating multiple large PDF reports
Set up
- Clone this repository into a local directory of your choosing
- Build a virtual environment
- Install `knowai` by running `pip install .` from the root directory of your clone, or install it from PyPI with `pip install knowai`
- Configure a `.env` file with the following (an example is sketched below):
  - `AZURE_OPENAI_API_KEY`: your API key
  - `AZURE_OPENAI_ENDPOINT`: your Azure endpoint
  - `AZURE_OPENAI_DEPLOYMENT`: your LLM deployment name (e.g., "gpt-4o")
  - `AZURE_EMBEDDINGS_DEPLOYMENT`: your embeddings model deployment name (e.g., "text-embedding-3-large"; defaults to "text-embedding-3-large")
  - `AZURE_OPENAI_API_VERSION`: your Azure LLM deployment API version (e.g., "2024-02-01")
  - `AZURE_OPENAI_EMBEDDINGS_API_VERSION`: your Azure embeddings API version (e.g., "2024-02-01"; defaults to "2024-02-01")
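For reference, a `.env` file following the list above might look like this (all values are placeholders, not working credentials):

```bash
AZURE_OPENAI_API_KEY=your-api-key
AZURE_OPENAI_ENDPOINT=https://your-resource.openai.azure.com/
AZURE_OPENAI_DEPLOYMENT=gpt-4o
AZURE_EMBEDDINGS_DEPLOYMENT=text-embedding-3-large
AZURE_OPENAI_API_VERSION=2024-02-01
AZURE_OPENAI_EMBEDDINGS_API_VERSION=2024-02-01
```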
Building the vectorstore
Using the CLI (Recommended)
From the root directory of this repository, run the following from a terminal (ensuring that your virtual environment is active) to build the vectorstore:
```bash
python -m knowai.cli_vectorstore build <directory_containing_your_input_pdf_files> --metadata_parquet_path <path_to_metadata.parquet> --vectorstore_path <directory_name_for_vectorstore>
```
Using the Python API
You can also build vectorstores programmatically:
```python
from knowai import get_retriever_from_directory

retriever = get_retriever_from_directory(
    directory_path="path/to/pdfs",
    persist_directory="my_vectorstore",
    metadata_parquet_path="metadata.parquet",
    k=10,
    chunk_size=1000,
    chunk_overlap=200,
)
```
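If the returned retriever follows the LangChain retriever interface (an assumption based on the package description, which mentions LangGraph), you can query it directly once built:

```python
# Hypothetical usage, assuming a LangChain-style retriever interface
docs = retriever.invoke("key findings on sea level rise")
for doc in docs:
    # Each result carries its source metadata and chunk text
    print(doc.metadata.get("source"), doc.page_content[:200])
```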
Inspecting Vectorstores
To inspect an existing vectorstore:
```bash
python -m knowai.cli_vectorstore inspect <vectorstore_path>
```
Or programmatically:
```python
from knowai import (
    load_vectorstore,
    show_vectorstore_schema,
    list_vectorstore_files,
    analyze_vectorstore_chunking,
)

# Load vectorstore
vectorstore = load_vectorstore("my_vectorstore")

# Show schema information
schema = show_vectorstore_schema(vectorstore)
print(f"Total vectors: {schema['total_vectors']}")
print(f"Metadata fields: {schema['metadata_fields']}")

# List files in vectorstore
files = list_vectorstore_files(vectorstore)
print(f"Files: {files}")

# Analyze chunking parameters used to build the vectorstore
analysis = analyze_vectorstore_chunking(vectorstore)
print(f"Estimated chunk size: {analysis['recommended_settings']['chunk_size']}")
print(f"Estimated overlap: {analysis['recommended_settings']['chunk_overlap']}")
```
By default, this will create a vectorstore using FAISS named "test_faiss_store" in the root directory of your repository.
Running knowai in a simple chatbot example via Streamlit
From the root directory, run the following in a terminal after you have your virtual environment active:
```bash
streamlit run app_chat_simple.py
```
This will open the app in your default browser.
Using knowai
Once your vector store is built, you can use knowai either programmatically or through the provided Streamlit interface.
Python quick‑start
The package ships with the KnowAIAgent class for fully programmatic access
inside notebooks or scripts:
```python
from knowai.core import KnowAIAgent

# Path that you supplied with --vectorstore_path when building
VSTORE_PATH = "test_faiss_store"

agent = KnowAIAgent(vectorstore_path=VSTORE_PATH)

# A single conversational turn
response = await agent.process_turn(
    user_question="Summarize the key findings in the 2025 maritime report",
    selected_files=["my_report.pdf"],
)

print(response["generation"])
```
Streaming Responses
For a more responsive user experience, you can enable streaming responses:
```python
def stream_callback(token: str):
    """Called for each token as it's generated."""
    print(token, end='', flush=True)

response = await agent.process_turn(
    user_question="Summarize the key findings in the 2025 maritime report",
    selected_files=["my_report.pdf"],
    streaming_callback=stream_callback,  # Enable streaming
)
```
The response will be streamed in real-time via the callback, while still being available in the returned dictionary.
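Note that `process_turn` is a coroutine, so the snippets above must run inside an event loop. Here is a minimal sketch, reusing the names above, of driving the streaming call from a plain script:

```python
import asyncio

from knowai.core import KnowAIAgent


def stream_callback(token: str) -> None:
    """Print each token as it arrives."""
    print(token, end="", flush=True)


async def main() -> None:
    agent = KnowAIAgent(vectorstore_path="test_faiss_store")
    response = await agent.process_turn(
        user_question="Summarize the key findings in the 2025 maritime report",
        selected_files=["my_report.pdf"],
        streaming_callback=stream_callback,
    )
    # The complete answer is still available after streaming finishes
    print("\n---\n" + response["generation"])


asyncio.run(main())
```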
The returned dictionary contains:
| Key | Description |
| --- | --- |
| `generation` | Final answer synthesised from the selected documents. |
| `individual_answers` | Per-file answers (when `bypass_individual_gen=False`). |
| `documents_by_file` | Retrieved document chunks keyed by filename. |
| `raw_documents_for_synthesis` | Raw text block used when bypassing individual generation. |
| `bypass_individual_generation` | Whether the bypass mode was used for this turn. |
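A quick way to make use of these keys, as a sketch; this assumes `individual_answers` is a mapping from filename to answer text, which is not confirmed by the README:

```python
# Print the synthesized answer
print(response["generation"])

# Hypothetical: iterate the per-file answers if individual generation ran
for filename, answer in (response.get("individual_answers") or {}).items():
    print(f"== {filename} ==")
    print(answer)
```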
Token Counting Configuration
KnowAI supports two methods for token counting to manage context window limits:
Accurate Token Counting (Default)
- Uses tiktoken library for precise token estimation
- More accurate batch sizing and context management
- Automatically falls back to heuristic method if tiktoken unavailable
```python
# Default behavior (accurate token counting)
agent = KnowAIAgent(vectorstore_path=VSTORE_PATH)

# Explicit accurate token counting
agent = KnowAIAgent(
    vectorstore_path=VSTORE_PATH,
)
```
Heuristic Token Counting
- Uses character-based estimation (4 characters ≈ 1 token)
- Faster performance, suitable when approximate estimation is sufficient
- Always available as fallback
```python
# Use heuristic token counting
agent = KnowAIAgent(
    vectorstore_path=VSTORE_PATH,
)
```
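To make the two strategies concrete, here is a minimal, hypothetical sketch of the fallback behavior described above; `count_tokens` is not part of the knowai API, and knowai's internal implementation may differ:

```python
def count_tokens(text: str, model: str = "gpt-4o") -> int:
    """Count tokens with tiktoken when available; otherwise fall back
    to the 4-characters-per-token heuristic."""
    try:
        import tiktoken
        encoding = tiktoken.encoding_for_model(model)
        return len(encoding.encode(text))
    except Exception:
        # Heuristic: roughly 4 characters per token
        return max(1, len(text) // 4)


print(count_tokens("Sea level rise is accelerating."))
```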
CLI Configuration

```bash
curl -X POST http://127.0.0.1:8000/initialize \
  -H "Content-Type: application/json" \
  -d '{
    "vectorstore_s3_uri": "/path/to/vectorstore"
  }'
```
Benefits of Accurate Token Counting:
- More precise token limits and batch sizing
- Reduced risk of context overflow
- Better resource utilization
- Improved reliability with large document sets
When to Use Heuristic Counting:
- tiktoken not available in environment
- Performance is critical
- Approximate estimation is sufficient
- Debugging token counting issues
Streamlit chat app
If you prefer a ready‑made UI, launch the demo:
```bash
streamlit run app_chat_simple.py
```
Upload or select PDF files, ask questions in the sidebar, and inspect per‑file answers or the combined response in the main panel.
For advanced configuration options (e.g., conversation history length,
retriever k values, or combine thresholds) see the docstrings in
knowai/core.py and knowai/agent.py.
Containerization
To build and run both the knowai service and the Svelte UI using Docker Compose:
- Ensure Docker and Docker Compose are installed on your machine.
- From the directory containing this README (the repo root), navigate to the Svelte example folder:

```bash
cd example_apps/svelte
```

- Compile the Svelte app and package the build as `svelte-example`:

```bash
npm install
npm run build
mv dist svelte-example
```

- Start the services and build images:

```bash
docker compose up --build
```

This will:
- Build the `knowai` service (listening on port 8000).
- Build the `ui` service (Svelte app, listening on port 5173).

- Open your browser and visit:
  - FastAPI docs: http://localhost:8000/docs
  - Svelte UI: http://localhost:5173
- To stop and remove containers, press `CTRL+C` and then run:

```bash
docker compose down
```
Running the knowai CLI Locally
You can start the FastAPI micro-service locally without Docker and point it to either a local vectorstore or one hosted on S3.
Using a Local Vectorstore
- Ensure you have a built FAISS vectorstore on disk (e.g., `test_faiss_store`).
- Start the service:

```bash
python -m knowai.cli
```

- In another terminal, initialize the session:

```bash
curl -X POST http://127.0.0.1:8000/initialize \
  -H "Content-Type: application/json" \
  -d '{"vectorstore_s3_uri":"/absolute/path/to/your/vectorstore"}'
```

- Ask a question:

```bash
curl -X POST http://127.0.0.1:8000/ask \
  -H "Content-Type: application/json" \
  -d '{
    "session_id":"<session_id>",
    "question":"Your question here",
    "selected_files":["file1.pdf","file2.pdf"]
  }'
```
Streaming API
For real-time streaming responses, use the /ask-stream endpoint:
```bash
curl -X POST http://127.0.0.1:8000/ask-stream \
  -H "Content-Type: application/json" \
  -d '{
    "session_id":"<session_id>",
    "question":"Your question here",
    "selected_files":["file1.pdf","file2.pdf"]
  }' \
  --no-buffer
```
This will stream the response in real-time using Server-Sent Events (SSE). Each token will be sent as it's generated by the LLM.
For more details on streaming functionality, see docs/STREAMING.md.
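To consume the stream from Python rather than curl, a sketch using the requests library follows; the exact event format is defined in docs/STREAMING.md, and this assumes plain `data: <token>` SSE lines:

```python
import requests

payload = {
    "session_id": "<session_id>",
    "question": "Your question here",
    "selected_files": ["file1.pdf", "file2.pdf"],
}

with requests.post(
    "http://127.0.0.1:8000/ask-stream",
    json=payload,
    stream=True,  # keep the connection open and read tokens as they arrive
) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines(decode_unicode=True):
        # SSE payload lines are prefixed with "data: " (assumed format)
        if line and line.startswith("data: "):
            print(line[len("data: "):], end="", flush=True)
```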
Using an S3-Hosted Vectorstore
- Start the service:

```bash
python -m knowai.cli
```

- Initialize the session against your S3 bucket:

```bash
curl -X POST http://127.0.0.1:8000/initialize \
  -H "Content-Type: application/json" \
  -d '{"vectorstore_s3_uri":"s3://your-bucket/path"}'
```

- Ask a question in a similar way:

```bash
curl -X POST http://127.0.0.1:8000/ask \
  -H "Content-Type: application/json" \
  -d '{
    "session_id":"<session_id>",
    "question":"Another question example",
    "selected_files":[]
  }'
```
Enhanced User Feedback
KnowAI provides comprehensive feedback when the search process doesn't find relevant information in your documents.
No-Chunks Feedback
When no text chunks are extracted for a query in a file, KnowAI ensures users are clearly informed:
- Individual File Level: Each file that has no matching content receives a specific message explaining that "The search did not retrieve any document chunks that match your query."
- Synthesis Level: The final response clearly states which files had no relevant content, helping users understand the scope of the search results.
- Progress Tracking: Files with no matching content are tracked separately from files with errors, providing clear distinction in the response.
Example Response
When asking about "climate change impacts" across multiple reports:
```
I found information about climate change impacts in the following reports:

From report1.pdf (Page 15): "Global temperatures have increased by 1.1°C since pre-industrial times..."

From report2.pdf (Page 8): "Sea level rise is accelerating at a rate of 3.3mm per year..."

No matching content found in: report3.pdf.
```
This helps users understand:
- Which files contained relevant information
- Which files were searched but had no matching content
- The specific nature of missing information
Error Handling
KnowAI distinguishes between different types of issues:
- No matching content: files that were searched but had no relevant chunks
- Content policy violations: issues with AI provider content filters
- Processing errors: technical issues during document processing
Each type is handled appropriately and communicated clearly to the user.
Testing
Run the test suite to verify functionality:
```bash
# Run all tests
python -m pytest

# Test specific functionality
python -m pytest tests/test_prompts.py -v
python -m pytest tests/test_agent.py -v

# Test no-chunks feedback improvements
python scripts/test_no_chunks_feedback.py
```
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
Individual File Processing
KnowAI supports two processing modes for handling multiple files:
Traditional Batch Processing (Default)
- All documents from all files are combined and processed together
- Faster processing, good for related content across files
Individual File Processing
- Each file is processed separately by the LLM (in parallel, max 10 concurrent), then responses are consolidated
- Ensures each file gets equal attention, better for distinct topics
- Significantly faster than sequential processing for multiple files
- Concurrency limit prevents overwhelming the LLM service (a sketch of this pattern follows the example below)
```python
# Enable individual file processing
agent = KnowAIAgent(
    vectorstore_path=VSTORE_PATH,
    process_files_individually=True,  # Enable individual processing
)

# Or enable per-request
response = await agent.process_turn(
    user_question="What are the main strategies?",
    selected_files=["file1.pdf", "file2.pdf"],
    process_files_individually=True,  # Override for this request
)
```
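The bounded parallelism described above can be pictured with a standard asyncio pattern; this is an illustrative sketch, not knowai's actual internals:

```python
import asyncio

MAX_CONCURRENT = 10  # matches the concurrency cap described above


async def process_one_file(filename: str) -> str:
    """Stand-in for a per-file LLM call."""
    await asyncio.sleep(0.1)
    return f"answer for {filename}"


async def process_files(filenames: list[str]) -> list[str]:
    # A semaphore limits how many per-file calls run at once
    semaphore = asyncio.Semaphore(MAX_CONCURRENT)

    async def bounded(name: str) -> str:
        async with semaphore:
            return await process_one_file(name)

    return await asyncio.gather(*(bounded(n) for n in filenames))


answers = asyncio.run(process_files([f"file{i}.pdf" for i in range(25)]))
print(len(answers))
```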
When to Use Individual File Processing:
- Files contain distinct topics that should be analyzed separately
- You want to ensure each file gets equal attention from the LLM
- You want to see how each file contributes to the final answer
- Dealing with large files that might benefit from focused analysis
For more details, see Individual File Processing Documentation.
Owner
- Name: Chris Vernon
- Login: crvernon
- Kind: user
- Location: Remote
- Company: @JGCRI @IMMM-SFA @PNNL @multisectordynamics @openjournals
- Website: https://energyenvironment.pnnl.gov/staff/staff_info.asp?staff_num=1834
- Twitter: chris_r_vernon
- Repositories: 38
- Profile: https://github.com/crvernon
Data scientist specializing in all things geospatial who spends a lot of time developing open-source software ecosystems and mentoring
GitHub Events
Total
- Create event: 11
- Release event: 6
- Watch event: 1
- Delete event: 3
- Issue comment event: 1
- Member event: 1
- Push event: 38
- Pull request review comment event: 3
- Pull request review event: 4
- Pull request event: 13
Last Year
- Create event: 11
- Release event: 6
- Watch event: 1
- Delete event: 3
- Issue comment event: 1
- Member event: 1
- Push event: 38
- Pull request review comment event: 3
- Pull request review event: 4
- Pull request event: 13
Packages
- Total packages: 1
- Total downloads:
  - pypi: 46 last-month
- Total dependent packages: 0
- Total dependent repositories: 0
- Total versions: 7
- Total maintainers: 1
pypi.org: knowai
A conversational RAG agent pipeline using LangGraph
- Documentation: https://knowai.readthedocs.io/
- License: BSD 3-Clause License, Copyright (c) 2025, Chris Vernon
- Latest release: 0.4.0 (published 6 months ago)