cyberragllm
CyberRAG is a cybersecurity-focused local LLM application that uses RAG to improve the output of the base LLM.
Science Score: 44.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: found
- ✓ codemeta.json file: found
- ✓ .zenodo.json file: found
- ○ DOI references
- ○ Academic publication links
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (14.9%) to scientific vocabulary
Repository
CyberRAG is a cybersecurity-focused local LLM application that uses RAG to improve the output of the base LLM.
Basic Info
- Host: GitHub
- Owner: ZeTioZ
- License: MIT
- Language: Python
- Default Branch: master
- Size: 1.36 MB
Statistics
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
CyberRAGLLM
CyberRAGLLM is a cybersecurity-focused question-answering system that uses Retrieval-Augmented Generation (RAG) with local LLMs to provide accurate and contextually relevant responses to cybersecurity questions. It leverages a graph-based workflow to combine document retrieval, web search, and quality control mechanisms.
Features
- 🔒 Cybersecurity Focus: Specialized for cybersecurity domain with extensive document collection
- 🤖 Local LLM Integration: Works with Ollama-based local LLMs (e.g., Llama 3.2)
- 📚 Web Document Processing: Loads and processes documents from web URLs
- 📄 PDF Support: Processes PDF files from both local paths and internet URLs
- 🔍 Vector Search: Efficient retrieval using SKLearnVectorStore with HuggingFace embeddings
- 🌐 Web Search Integration: Uses Tavily Search API for supplementary information
- 🔘 Web Search Toggle: Ability to enable or disable web searches completely
- 🧠 Graph-based Workflow: Sophisticated control flow for question routing and answer generation
- 🧐 Quality Control: Grades document relevance and answer quality
- 📊 Hallucination Detection: Checks if generated answers are grounded in the retrieved documents
- 🔄 Adaptive Retrieval: Combines vector search with web search when needed
Installation
Clone the repository:

```bash
git clone https://github.com/zetioz/CyberRAGLLM.git
cd CyberRAGLLM
```

Create a virtual environment (optional but recommended):

```bash
python -m venv .venv
.venv\Scripts\activate     # On Windows
# OR
source .venv/bin/activate  # On Unix/macOS
```

Install the required dependencies:

```bash
pip install -r requirements.txt
```

Install Ollama and pull a compatible model:

```bash
# Install Ollama from https://ollama.ai/
# Then pull the model
ollama pull hf.co/safe049/mistral-v0.3-7b-cybersecurity:latest
```

Set up your API keys:

- Get your Tavily API key from https://tavily.com/
- Get your LangSmith API key from https://smith.langchain.com/
- You'll be prompted for these when running the application if they are not set as environment variables
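To avoid the interactive prompt, the keys can be exported ahead of time. The variable names below are assumptions based on the conventional names used by the Tavily and LangSmith clients; check the project's source for the exact names it reads:

```bash
# Assumed variable names -- verify against the project's code
export TAVILY_API_KEY="your-tavily-key"
export LANGSMITH_API_KEY="your-langsmith-key"
```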
Usage
Running the Application
The main entry point is src/main.py, which provides an interactive question-answering interface:
```bash
python src/main.py
```
This will:
1. Load the Llama 3.2 model via Ollama
2. Process documents from URLs listed in rag_urls.txt
3. Create a vector store for efficient retrieval
4. Set up the graph-based workflow
5. Start an interactive loop where you can ask cybersecurity questions
Example Questions
The system is particularly well-suited for questions about:
- Exploit techniques and vulnerabilities
- Memory protection mechanisms
- Side-channel attacks
- Kernel security
- Hardware vulnerabilities

Examples:
- "What is a buffer overflow attack and how does it work?"
- "Explain the Rowhammer attack technique"
- "How do ASLR bypass techniques work?"
- "What are the best practices for preventing SQL injection?"
Customizing Document Sources
The system uses URLs listed in rag_urls.txt as document sources. You can modify this file to include your own sources:
```
https://example.com/cybersecurity-doc1
https://example.com/cybersecurity-doc2
https://example.com/document.pdf
/path/to/local/document.pdf
```
Using PDF Files
The system can process PDF files from both local paths and internet URLs:
- Local PDF Files: add the full path to the PDF file in rag_urls.txt
- Internet PDF Files: add the URL to the PDF file in rag_urls.txt
The system automatically detects PDF files based on file extension (.pdf) or content type and processes them accordingly.
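The detection rule described above (file extension first, content type as a fallback) can be sketched in a few lines. `looks_like_pdf` is a hypothetical helper for illustration, not the project's actual code:

```python
from typing import Optional
from urllib.parse import urlparse


def looks_like_pdf(source: str, content_type: Optional[str] = None) -> bool:
    """Guess whether a source (URL or local path) refers to a PDF.

    Checks the .pdf extension first, then falls back to an optional
    HTTP Content-Type header value.
    """
    # Strip query strings and fragments from URLs before checking the extension
    path = urlparse(source).path if "://" in source else source
    if path.lower().endswith(".pdf"):
        return True
    return content_type is not None and "application/pdf" in content_type.lower()
```

Parsing the URL path first means a source like `https://example.com/doc.pdf?token=abc` is still recognized by its extension.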
Controlling Web Search
You can enable or disable web searches completely:
API Usage: When using the API, include the `web_search_enabled` parameter in your request:

```json
{
  "model": "your_model_name",
  "messages": [
    { "role": "system", "content": "system_prompt" },
    { "role": "user", "content": "user_prompt" }
  ],
  "web_search_enabled": false,
  "max_retries": 5,
  "temperature": 0.7,
  "max_tokens": -1,
  "stream": false
}
```

Benefits of Disabling Web Search:
- Privacy and security when dealing with sensitive information
- Offline mode for environments without internet access
- Controlled information to ensure answers are based only on vetted documents
- Testing and evaluation to compare answer quality with and without web search
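A request body with the `web_search_enabled` toggle can be assembled in Python. `build_chat_request` is a hypothetical helper (the endpoint itself is not documented here), shown only to illustrate constructing and serializing the payload:

```python
import json


def build_chat_request(system_prompt, user_prompt, *, web_search_enabled=False,
                       model="your_model_name", max_retries=5,
                       temperature=0.7, max_tokens=-1, stream=False):
    """Build the JSON body for a chat request, including the
    web_search_enabled toggle described above."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
        "web_search_enabled": web_search_enabled,
        "max_retries": max_retries,
        "temperature": temperature,
        "max_tokens": max_tokens,
        "stream": stream,
    }


body = json.dumps(build_chat_request("system_prompt", "user_prompt"))
```

Serializing with `json.dumps` ensures the toggle is emitted as JSON `false`, not Python's `False`.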
System Architecture
CyberRAGLLM uses a graph-based workflow with the following components:
- Question Routing: Determines whether to use vectorstore or web search
- Document Retrieval: Fetches relevant documents from the vectorstore
- Document Grading: Assesses the relevance of retrieved documents
- Web Search: Supplements with web search results when needed
- Answer Generation: Creates answers using RAG with the retrieved documents
- Quality Control: Checks for hallucinations and relevance to the question
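The control flow above can be illustrated with a toy function; the node callables here are hypothetical stand-ins, not the project's actual LangGraph implementation:

```python
def run_workflow(question, retrieve, grade, web_search, generate,
                 web_search_enabled=True):
    """Toy version of the graph-based flow:
    retrieve -> grade -> (optional web search fallback) -> generate."""
    docs = retrieve(question)
    # Document grading: keep only documents judged relevant
    relevant = [d for d in docs if grade(question, d)]
    # Adaptive retrieval: fall back to web search when the
    # vectorstore yields nothing useful (and search is enabled)
    if not relevant and web_search_enabled:
        relevant = web_search(question)
    # Answer generation from whatever context survived grading
    return generate(question, relevant)
```

In the real system, each step is a graph node and quality control can loop back to regeneration; this sketch only shows the happy path through routing, grading, and fallback.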
Requirements
- Python 3.8+
- Ollama with a compatible local LLM model (e.g., Llama 3.2)
- Dependencies:
- langchain, langchain_core, langchain_community
- langchain_ollama
- langgraph
- scikit-learn
- tiktoken
- tavily-python
- beautifulsoup4 (BeautifulSoup)
- pypdf (for PDF processing)
- Tavily API key for web search functionality
- LangSmith API key for tracing (optional)
- Sufficient RAM for embedding and running the LLM
License
This project is licensed under the terms included in the LICENSE file.
Owner
- Name: ZeTioZ
- Login: ZeTioZ
- Kind: user
- Location: Belgium
- Twitter: ZeTioZ
- Repositories: 2
- Profile: https://github.com/ZeTioZ
Little guy who likes to make some little projects 😄
Citation (CITATION.cff)
```yaml
cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
  - family-names: "Gentile"
    given-names: "Donato"
  - family-names: "Alarcon"
    given-names: "Diego"
title: "CyberRAGLLM"
version: 1.0.0
date-released: 2025-05-27
url: "https://github.com/ZeTioZ/CyberRAGLLM"
```
GitHub Events
Total
- Member event: 1
- Push event: 19
- Create event: 2
Last Year
- Member event: 1
- Push event: 19
- Create event: 2
Dependencies
- bs4 *
- langchain *
- langchain-nomic *
- langchain_community *
- langchain_core *
- langchain_ollama *
- langgraph *
- nomic *
- scikit-learn *
- tavily-python *
- tiktoken *