https://github.com/kunalshah03/rufus
Rufus - Intelligent Web Scraper for RAG Systems
Science Score: 13.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file (found)
- ○ .zenodo.json file
- ○ DOI references
- ○ Academic publication links
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (12.0%) to scientific vocabulary
Repository
Rufus - Intelligent Web Scraper for RAG Systems
Basic Info
- Host: GitHub
- Owner: kunalshah03
- Language: Python
- Default Branch: main
- Size: 270 KB
Statistics
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
Rufus - Intelligent Web Scraper for RAG Systems
Rufus is an intelligent web scraping tool designed to prepare data for Retrieval-Augmented Generation (RAG) systems. It combines advanced web crawling capabilities with AI-powered content processing to extract and structure relevant information from websites.
Features
- Intelligent web crawling with dynamic content support
- AI-driven content extraction and structuring
- RAG-optimized document output
- Asynchronous processing for improved performance
- Rich metadata and content classification
- JSONL export for seamless RAG integration
Prerequisites
- Python 3.8+
- OpenAI API key
Installation
Clone the repository:
```bash
git clone https://github.com/yourusername/rufus.git
```

Create and activate a virtual environment:

```bash
python -m venv venv
source venv/bin/activate
```

Install dependencies:

```bash
pip install -r requirements.txt
```

Install Playwright browsers:

```bash
playwright install chromium
```

Set up environment variables:

```bash
cp .env.example .env
```

Edit `.env` and add your OpenAI API key:

```
RUFUS_API_KEY=your-openai-api-key
```

Install the package in development mode from the project root:

```bash
pip install -e .
```

This makes the `rufus` package importable while allowing you to modify the code. Run this from the directory containing `setup.py`.
Usage
Basic Example
Note: this basic example follows the instructions provided in the task file.
```python
from rufus import RufusClient

client = RufusClient()
documents = client.scrape(
    "https://www.sfgov.com",
    instructions="We're making a chatbot for the HR in San Francisco."
)
```
Running the Demo
Run the following command:
```bash
python examples/basic_usage.py
```
Output will be saved in output/scraped_content.jsonl.
Example Output
```
Processing pages:   0%|          | 0/1 [00:00<?, ?it/s]
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
Processing pages: 100%|██████████| 1/1 [00:02<00:00,  2.75s/it]

Documents saved to: output/scraped_content.jsonl

Extracted Documents Preview:

Title: Chatbot for HR in San Francisco
Text: sfgov.com sfgov.com 2024 Copyright. All Rights Reserved. Privacy Policy...
Topics: privacy policy, copyright
Type: policy
Relevance: 0.8
```

`output/scraped_content.jsonl`:

```json
{
"id": "58f37f8c-9b44-42ef-bdd1-9c60720e5132",
"text": "sfgov.com sfgov.com 2024 Copyright. All Rights Reserved. Privacy Policy",
"metadata": {
"title": "Chatbot for HR in San Francisco",
"sourceurl": "https://www.sfgov.com",
"chunktype": "policy",
"timestamp": "2024-11-15T13:49:37.555420",
"topics": [
"privacy policy",
"copyright"
],
"context": "Creating a chatbot for HR in San Francisco",
"relevance_score": 0.8
}
}
```
Output Format
Documents are structured to optimize for RAG systems:

```json
{
  "id": "unique-id",
  "text": "Main content for embedding",
  "metadata": {
    "title": "Content title",
    "sourceurl": "Source URL",
    "chunktype": "Content type",
    "timestamp": "ISO timestamp",
    "topics": ["topic1", "topic2"],
    "context": "Additional context",
    "relevance_score": 0.95
  }
}
```
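On the consuming side, a RAG pipeline needs to read these records back. A minimal sketch of loading the JSONL export, assuming each line matches the schema above (the `load_documents` helper and the 0.5 threshold are illustrative, not part of Rufus):

```python
import json

def load_documents(path, min_relevance=0.5):
    """Load RAG-ready documents from a Rufus JSONL export,
    keeping only chunks at or above a relevance threshold.

    Hypothetical helper: assumes one JSON object per line with a
    metadata.relevance_score field as shown in the README schema.
    """
    docs = []
    with open(path, "r", encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines in the export
            doc = json.loads(line)
            score = doc.get("metadata", {}).get("relevance_score", 0.0)
            if score >= min_relevance:
                docs.append(doc)
    return docs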
Configuration
Rufus supports the following configurable parameters:
| Parameter | Description | Default |
|------------------|------------------------------|---------|
| max_depth | Maximum crawling depth | 3 |
| max_pages | Maximum pages to crawl | 100 |
| timeout | Request timeout in seconds | 30 |
| max_concurrent | Maximum concurrent requests | 5 |
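The README does not show how these parameters are passed to the client. One plausible shape is a small config object carrying the table's defaults with basic validation (the `CrawlConfig` class is a sketch, not the actual Rufus API):

```python
from dataclasses import dataclass

@dataclass
class CrawlConfig:
    """Crawl settings mirroring the parameter table above.

    Hypothetical sketch: defaults match the documented values.
    """
    max_depth: int = 3        # maximum crawling depth
    max_pages: int = 100      # maximum pages to crawl
    timeout: int = 30         # request timeout in seconds
    max_concurrent: int = 5   # maximum concurrent requests

    def __post_init__(self):
        # Reject zero or negative values, which would stall the crawler.
        for name in ("max_depth", "max_pages", "timeout", "max_concurrent"):
            if getattr(self, name) <= 0:
                raise ValueError(f"{name} must be positive")
```

A dataclass keeps defaults in one place and makes overrides explicit, e.g. `CrawlConfig(max_pages=20)` for a shallow test crawl.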
Error Handling
- Network errors: timeouts, unreachable servers, and connection failures are handled gracefully; the error is logged and the page is skipped.
- Invalid content: non-parsable or irrelevant content is automatically filtered out, so only relevant data is processed.
- Rate limiting: rate limits from web servers and APIs are managed automatically via retry logic with exponential backoff.
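Retry with exponential backoff can be sketched as follows (this is a generic illustration of the technique, not Rufus's internal implementation; the function name and parameters are assumptions):

```python
import random
import time

def retry_with_backoff(fn, max_retries=4, base_delay=1.0, max_delay=30.0):
    """Call fn(), retrying on failure with exponential backoff plus jitter.

    Hypothetical sketch of the retry strategy described above.
    """
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception:
            if attempt == max_retries - 1:
                raise  # give up after the last attempt
            # Delay doubles each attempt: base, 2*base, 4*base, ... capped.
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Small random jitter avoids synchronized retries across workers.
            time.sleep(delay + random.uniform(0, delay * 0.1))
```

The jitter matters when `max_concurrent` requests fail at once: without it, all workers would retry in lockstep and hit the rate limit again.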
Common Issues
- OpenAI API Error: check your API key in `.env`.
- No Content Extracted: verify the URL's accessibility.
- Timeout Errors: adjust the `timeout` setting.
- Memory Issues: reduce `max_pages` or `max_concurrent`.
Project Structure
```
rufus/
├── rufus/
│   ├── __init__.py
│   ├── client.py
│   ├── scraper.py
│   ├── processor.py
│   └── utils.py
├── examples/
│   └── basic_usage.py
├── tests/
│   └── test_rufus.py
├── .env.example
├── requirements.txt
└── README.md
```
Description of Folders and Files
- `rufus/`: main package containing the core logic of the project.
  - `__init__.py`: initializes the package.
  - `client.py`: client-related functionality.
  - `scraper.py`: web scraping methods and classes.
  - `processor.py`: data processing tasks.
  - `utils.py`: helper functions shared across modules.
- `examples/`: usage examples for understanding and testing the project.
- `tests/`: unit tests to ensure the code behaves as expected.
- `.env.example`: template for the environment variables required for configuration.
- `requirements.txt`: Python dependencies for the project.
- `README.md`: main documentation file for the project.
Notes
- Create a `.env` file based on `.env.example` for local configuration.
- Run the tests in the `tests/` directory before deploying or extending functionality.
To contribute to this repository:
- Fork the repository
- Create a feature branch
- Submit a pull request
Author
Owner
- Name: Kunal Shah
- Login: kunalshah03
- Kind: user
- Repositories: 4
- Profile: https://github.com/kunalshah03
Student at PICT, Pune, studying Computer Science.
GitHub Events
Total
- Push event: 17
- Create event: 1
Last Year
- Push event: 17
- Create event: 1