https://github.com/adricwht/responsesapi
Science Score: 26.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file: Found codemeta.json file
- ✓ .zenodo.json file: Found .zenodo.json file
- ○ DOI references
- ○ Academic publication links
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: Low similarity (13.1%) to scientific vocabulary
Repository
Basic Info
- Host: GitHub
- Owner: adricwht
- Language: Python
- Default Branch: main
- Size: 255 MB
Statistics
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
OpenAI File Search Implementation with Vector Store Visualization
This project implements the OpenAI Responses API file search tool for semantic search over PDF documents, along with a 3D visualization of vector store embeddings. It is based on sample code that demonstrates using vector stores to search PDF content and answer questions about it.
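As a rough illustration of what a file search request looks like, the sketch below builds the keyword arguments for a Responses API call. The helper name `build_file_search_request` is hypothetical and is not taken from this repository; the `file_search` tool fields follow the publicly documented request shape, and the defaults mirror the ones listed under Environment Variables.

```python
from typing import Any, Dict


def build_file_search_request(
    query: str,
    store_id: str,
    model: str = "gpt-4o-mini",
    max_results: int = 5,
) -> Dict[str, Any]:
    """Build keyword arguments for a Responses API call that uses file search.

    Hypothetical helper for illustration; the repository's own code may differ.
    """
    if not store_id:
        raise ValueError("a vector store ID is required")
    return {
        "model": model,
        "input": query,
        "tools": [
            {
                "type": "file_search",
                "vector_store_ids": [store_id],
                "max_num_results": max_results,
            }
        ],
    }


request = build_file_search_request("What is Deep Research?", "vs_123456789")

# With the official `openai` package installed and OPENAI_API_KEY set, the
# request could then be sent roughly like this (not executed here):
#   from openai import OpenAI
#   response = OpenAI().responses.create(**request)
#   print(response.output_text)
```

Keeping the request construction separate from the network call makes it easy to inspect or log what will be sent before spending API credits.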
Features
- Create vector stores on OpenAI
- Upload PDFs to the vector store
- Perform standalone vector searches
- Integrate search results with LLM responses
- Generate evaluation questions from PDFs
- Evaluate retrieval performance with metrics
- Visualize vector store embeddings in 3D space
- Explore semantic relationships between documents
- Cluster similar documents automatically
Prerequisites
- Python 3.8+
- OpenAI API key
Installation
- Clone this repository
- Install the required dependencies: pip install -r requirements.txt
- Set up your environment variables:
  - Copy .env.example to .env
  - Add your OpenAI API key to the .env file: OPENAI_API_KEY=your-api-key-here
  - Optionally, configure other variables in the .env file
Environment Variables
The following environment variables can be set in the .env file:
- OPENAI_API_KEY: Your OpenAI API key (required)
- VECTOR_STORE_ID: ID of an existing vector store (optional)
- OPENAI_MODEL: Model to use for LLM searches (default: gpt-4o-mini)
- MAX_RESULTS: Number of results to retrieve (default: 5)
These variables can also be overridden via command-line arguments.
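One way this precedence could work is sketched below: environment values supply the defaults and explicit command-line flags win. The function name `resolve_settings` is hypothetical, and pairing `--k` with `MAX_RESULTS` is an assumption; the flag names themselves are taken from the usage examples in this README.

```python
import argparse
from typing import Any, Dict, List, Mapping


def resolve_settings(argv: List[str], env: Mapping[str, str]) -> Dict[str, Any]:
    """Merge environment defaults with command-line overrides (illustrative only)."""
    parser = argparse.ArgumentParser()
    # Environment values act as defaults; a flag given on the command line wins.
    parser.add_argument("--store_id", default=env.get("VECTOR_STORE_ID"))
    parser.add_argument("--model", default=env.get("OPENAI_MODEL", "gpt-4o-mini"))
    # Assumption: --k is the command-line counterpart of MAX_RESULTS.
    parser.add_argument("--k", type=int, default=int(env.get("MAX_RESULTS", "5")))
    args = parser.parse_args(argv)

    api_key = env.get("OPENAI_API_KEY")
    if not api_key:
        raise RuntimeError("OPENAI_API_KEY is required")
    return {
        "api_key": api_key,
        "store_id": args.store_id,
        "model": args.model,
        "max_results": args.k,
    }


# A real script would pass sys.argv[1:] and os.environ instead of literals.
settings = resolve_settings(
    ["--model", "gpt-4o"],
    {"OPENAI_API_KEY": "your-api-key-here", "MAX_RESULTS": "3"},
)
```

Here the model comes from the command line, the result count from the environment, and the store ID stays unset because neither source provides it.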
Usage
The implementation is provided as a Python module with a command-line interface:
1. Create a Vector Store
python search_api_implementation.py --action create_store --store_name "my_pdf_store" --output store_details.json
2. Upload PDFs to the Vector Store
python search_api_implementation.py --action upload --store_id "vs_123456789" --pdf_dir "path/to/pdfs" --output upload_stats.json
3. Perform Vector Search
python search_api_implementation.py --action search --store_id "vs_123456789" --query "What is Deep Research?"
4. Integrated LLM Search
python search_api_implementation.py --action llm_search --store_id "vs_123456789" --query "What is Deep Research?" --model "gpt-4o-mini"
5. Generate Evaluation Questions
python search_api_implementation.py --action generate_questions --pdf_dir "path/to/pdfs" --output questions.json
6. Evaluate Retrieval Performance
python search_api_implementation.py --action evaluate --store_id "vs_123456789" --output questions.json --k 5 --model "gpt-4o-mini"
7. Visualize Vector Store Embeddings
Technical Architecture
```mermaid
graph TD
    A[OpenAI Vector Store] --> B[Data Retrieval Module]
    B --> C[Embedding Processing]
    C --> D[Dimensionality Reduction]
    D --> E[Clustering Algorithm]
    E --> F[Interactive Visualization]
    F --> G[User Interface]
    H[PDF Documents] --> A

    subgraph "Backend Processing"
        B
        C
        D
        E
    end

    subgraph "Frontend Visualization"
        F
        G
    end
```
Using the Interactive Scripts
For a user-friendly interface, run one of the interactive scripts:
```
# Windows Command Prompt
run_visualization.bat

# PowerShell
.\run_visualization.ps1
```
Using the Command Line
Create a vector store, upload PDFs, and visualize in one step:
python visualize_vector_store.py create-and-visualize --store_name "my_store" --pdf_dir "path/to/pdfs" --output "visualization.html"
Visualize an existing vector store:
python visualize_vector_store.py visualize --store_id "vs_your_vector_store_id" --output "visualization.html"
Run with interactive Dash web application:
python visualize_vector_store.py visualize --store_id "vs_your_vector_store_id" --run_dash
For more details on visualization features, see VISUALIZATION_README.md.
Implementation Workflow
```mermaid
sequenceDiagram
    participant User
    participant Script
    participant OpenAI
    participant Visualization

    User->>Script: Run with vector store ID
    Script->>OpenAI: Request embeddings
    OpenAI-->>Script: Return embeddings data
    Script->>Script: Process embeddings
    Script->>Script: Reduce dimensions (UMAP)
    Script->>Script: Perform clustering
    Script->>Visualization: Generate interactive plot
    Visualization-->>User: Display 3D visualization
    User->>Visualization: Interact (zoom, rotate, select)
    Visualization-->>User: Show document details
```
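The clustering step in this workflow can be pictured with a small example. This README does not name the clustering algorithm, so the sketch below assumes plain k-means (Lloyd's algorithm) applied to points that have already been reduced to 3D; the function name `kmeans` and the sample coordinates are made up for illustration, and the UMAP step is not reproduced here.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def kmeans(points: Sequence[Point], k: int, iterations: int = 20) -> List[int]:
    """Assign each point to one of k clusters using plain Lloyd iterations."""
    if not 0 < k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    # Deterministic seeding for the sketch: the first k points are the centroids.
    centroids: List[Point] = [tuple(points[i]) for i in range(k)]
    labels: List[int] = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the run has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return labels


# Made-up 3D coordinates standing in for reduced document embeddings.
reduced = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (5.0, 5.0, 5.0), (5.1, 5.0, 5.0)]
labels = kmeans(reduced, k=2)
```

The resulting labels can be used directly as a color dimension in the 3D plot, so that nearby documents in the same cluster share a color.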
Project Structure
- search_api_implementation.py: Main implementation file
- vector_store_visualizer.py: Vector store visualization module
- visualize_vector_store.py: Command-line interface for visualization
- run_visualization.bat: Interactive batch script for Windows
- run_visualization.ps1: Interactive PowerShell script for Windows
- example_visualization.py: Example script demonstrating programmatic usage
- VISUALIZATION_README.md: Detailed documentation for visualization features
- requirements.txt: Python dependencies
- README.md: This documentation file
- .env.example: Template for environment variables
- .env: Your local environment variables (not committed to the repository)
Example Workflow
Create a vector store:
python search_api_implementation.py --action create_store --store_name "openai_blog_store" --output store_details.json
Upload PDFs to the vector store:
python search_api_implementation.py --action upload --store_id "vs_67d06b9b9a9c8191bafd456cf2364ce3" --pdf_dir "C:\Users\admin\ResponsesAPI\SearchOnThis" --output upload_stats.json
Generate questions for evaluation:
python search_api_implementation.py --action generate_questions --pdf_dir "C:\Users\admin\ResponsesAPI\SearchOnThis" --output questions.json
Evaluate retrieval performance:
python search_api_implementation.py --action evaluate --store_id "vs_67d06b9b9a9c8191bafd456cf2364ce3" --output questions.json --k 5
Perform search with LLM integration:
python search_api_implementation.py --action llm_search --store_id "vs_67d06b9b9a9c8191bafd456cf2364ce3" --query "What is Deep Research?"
Visualize the vector store embeddings:
python visualize_vector_store.py visualize --store_id "vs_67d06b9b9a9c8191bafd456cf2364ce3" --output "visualization.html"
Run the interactive visualization dashboard:
python visualize_vector_store.py visualize --store_id "vs_67d06b9b9a9c8191bafd456cf2364ce3" --run_dash
Notes
- The API key is required for all operations
- Vector store IDs should be saved after creation for later use
- All PDF files in the specified directory will be processed
- Evaluation metrics include Recall, Precision, MRR, and MAP
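For readers unfamiliar with these metrics, here is a minimal sketch of how they are commonly computed for a single query. It is not the repository's evaluation code; the function names and file names are made up, and the average-precision denominator (the smaller of the number of relevant documents and k) is one of several conventions in use. MRR and MAP are the means of the per-query reciprocal rank and average precision across all evaluation questions.

```python
from typing import Sequence, Set


def precision_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k results that are relevant."""
    if k <= 0:
        return 0.0
    return sum(1 for doc in ranked[:k] if doc in relevant) / k


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top-k results."""
    if not relevant:
        return 0.0
    return sum(1 for doc in ranked[:k] if doc in relevant) / len(relevant)


def reciprocal_rank(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """1 / rank of the first relevant result in the top k, or 0 if there is none."""
    for rank, doc in enumerate(ranked[:k], start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def average_precision(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Mean of precision values taken at each rank where a relevant result occurs."""
    hits = 0
    total = 0.0
    for rank, doc in enumerate(ranked[:k], start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    denominator = min(len(relevant), k)  # assumed convention, see lead-in
    return total / denominator if denominator else 0.0


# Made-up retrieval result for one question, with two relevant files.
ranked = ["a.pdf", "b.pdf", "c.pdf", "d.pdf", "e.pdf"]
relevant = {"b.pdf", "d.pdf"}
scores = {
    "precision": precision_at_k(ranked, relevant, 5),
    "recall": recall_at_k(ranked, relevant, 5),
    "reciprocal_rank": reciprocal_rank(ranked, relevant, 5),
    "average_precision": average_precision(ranked, relevant, 5),
}
```

In this example the relevant files sit at ranks 2 and 4, so precision is 2/5, recall is 2/2, the reciprocal rank is 1/2, and average precision is (1/2 + 2/4) / 2 = 0.5.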
Future Development: Knowledge Graph Visualization
Building on the current vector store visualization capabilities, a natural extension would be to develop a comprehensive knowledge graph visualization system. This would transform the current document-level embeddings into a rich, interconnected graph of entities, concepts, and relationships.
Conceptual Architecture
```mermaid
graph TD
    A[Document Extraction] --> B[Entity/Concept Recognition]
    B --> C[Relationship Extraction]
    C --> D[Graph Construction]
    D --> E[3D Graph Layout]
    E --> F[Interactive Visualization]

    subgraph "Entity Recognition"
        B1[spaCy NER] --> B
        B2[OpenAI API] --> B
        B3[Hybrid Approach] --> B
    end

    subgraph "Relationship Extraction"
        C1[Co-occurrence Analysis] --> C
        C2[Semantic Similarity] --> C
        C3[LLM-Based Extraction] --> C
    end

    subgraph "Graph Visualization"
        F1[Node Filtering] --> F
        F2[Relationship Filtering] --> F
        F3[Search & Exploration] --> F
    end
```
Implementation Components
Entity and Concept Extraction
- Named Entity Recognition (NER) using spaCy
- Concept extraction using OpenAI API
- Hybrid approaches combining rule-based and ML techniques
Relationship Definition
- Co-occurrence analysis (entities appearing in the same context)
- Semantic similarity between entity embeddings
- LLM-based relationship extraction using OpenAI API
- Knowledge base integration (e.g., Wikidata)
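Co-occurrence analysis is the simplest of these options, and a minimal sketch might look like the following. It assumes entities have already been extracted per context (for example, per paragraph or chunk); the function name `cooccurrence_counts` and the sample entities are made up for illustration.

```python
from collections import Counter
from itertools import combinations
from typing import Iterable


def cooccurrence_counts(contexts: Iterable[Iterable[str]]) -> Counter:
    """Count how often each unordered pair of entities shares a context."""
    counts: Counter = Counter()
    for entities in contexts:
        # Deduplicate within a context and sort so each pair has one canonical order.
        unique = sorted(set(entities))
        counts.update(combinations(unique, 2))
    return counts


# Made-up entity lists, one per context.
contexts = [
    ["OpenAI", "Deep Research", "GPT-4o"],
    ["OpenAI", "Deep Research"],
    ["GPT-4o", "OpenAI"],
]
pair_counts = cooccurrence_counts(contexts)
```

Each pair count could then serve as an edge weight, with a minimum threshold filtering out pairs that co-occur only by chance.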
Graph Construction
- NetworkX for graph data structure
- Nodes representing documents, entities, and concepts
- Edges representing relationships with descriptive labels
- Metadata enrichment for nodes and edges
Graph Layout
- UMAP for dimensionality reduction of combined embeddings
- Force-directed layout algorithms (e.g., Fruchterman-Reingold)
- Hierarchical layouts for concept taxonomies
Interactive Visualization
- 3D graph rendering with Plotly
- Node coloring by type/category
- Edge styling by relationship type
- Interactive selection and exploration
Enhanced Dash Interface
- Filtering by node type and relationship type
- Search functionality for finding specific nodes
- Detailed node and relationship information panels
- Path finding between concepts
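Path finding between concepts could be prototyped with a breadth-first search before any interface work is done. The sketch below uses a plain adjacency dictionary so that it runs without dependencies; the graph contents and the function name `shortest_path` are made up. With NetworkX, which is listed above for the graph structure, `networkx.shortest_path` would likely fill the same role.

```python
from collections import deque
from typing import Dict, List, Optional, Set


def shortest_path(
    graph: Dict[str, Set[str]], start: str, goal: str
) -> Optional[List[str]]:
    """Return a shortest path by hop count between two nodes, or None if unreachable."""
    if start not in graph or goal not in graph:
        return None
    previous: Dict[str, Optional[str]] = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            # Walk the predecessor links back to the start, then reverse.
            path: List[str] = []
            current: Optional[str] = node
            while current is not None:
                path.append(current)
                current = previous[current]
            return path[::-1]
        for neighbor in sorted(graph.get(node, ())):
            if neighbor not in previous:
                previous[neighbor] = node
                queue.append(neighbor)
    return None


# Made-up undirected graph mixing document nodes and concept nodes.
graph = {
    "doc_a.pdf": {"OpenAI"},
    "OpenAI": {"doc_a.pdf", "doc_b.pdf"},
    "doc_b.pdf": {"OpenAI", "vector store"},
    "vector store": {"doc_b.pdf"},
}
path = shortest_path(graph, "doc_a.pdf", "vector store")
```

The returned node sequence alternates between documents and concepts here, which is the kind of chain an information panel could display to explain why two items are related.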
Benefits
- Deeper Insights: Understand not just document similarity but the specific entities and concepts that connect them
- Contextual Exploration: Navigate through related concepts and discover unexpected connections
- Improved Search: Find documents based on contained entities and concepts, not just overall similarity
- Knowledge Discovery: Identify patterns and relationships that might not be apparent from document-level analysis
Technical Considerations
- Scalability: Graph visualization can become complex with large numbers of nodes and edges
- Performance: Real-time interaction requires efficient graph algorithms and rendering
- Accuracy: Entity and relationship extraction quality directly impacts the usefulness of the visualization
- User Experience: Balancing complexity with usability for effective knowledge exploration
This future development would transform the current vector store visualization from a document-centric view to a rich knowledge graph that reveals the underlying semantic structure of the content.
License
This project is licensed under the MIT License - see the LICENSE file for details.
Owner
- Name: Adric White
- Login: adricwht
- Kind: user
- Location: Vallejo, CA
- Repositories: 1
- Profile: https://github.com/adricwht
Senior Software Engineer | React/Next.js, Angular, Vue | AI, Python, C#/.NET, Node.js, Java