https://github.com/asanchezyali/conan-researcher

Science Score: 13.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (12.2%) to scientific vocabulary
Last synced: 10 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: asanchezyali
  • License: gpl-3.0
  • Language: TypeScript
  • Default Branch: main
  • Size: 1.19 MB
Statistics
  • Stars: 0
  • Watchers: 1
  • Forks: 0
  • Open Issues: 0
  • Releases: 0
Created almost 2 years ago · Last pushed over 1 year ago
Metadata Files
Readme License

README.md

Conan Researcher

A data extraction and analysis system with web scraping capabilities, built with a FastAPI backend and a Next.js frontend.

Architecture Overview

Conan Researcher uses a microservices architecture with containerized components:

    ┌────────────┐     ┌────────────┐     ┌────────────┐
    │  Frontend  │────▶│  Backend   │────▶│  Database  │
    │ (Next.js)  │     │ (FastAPI)  │     │ (Postgres) │
    └────────────┘     └────────────┘     └────────────┘
                             │
                             ▼
                       ┌────────────┐     ┌────────────┐
                       │ Vector DB  │     │    Ray     │
                       │  (Chroma)  │     │ (Parallel) │
                       └────────────┘     └────────────┘

Key Features

  • Web Scraping: Extract structured data from various websites
  • Real Estate Analysis: Specialized data extraction for real estate listings
  • Parallel Processing: Ray integration for scalable workloads
  • Vector Database: Chroma integration for similarity search
  • Admin Interface: PGAdmin for database management

Getting Started

Prerequisites

  • Docker and Docker Compose
  • Python 3.9+ (for local development)
  • Node.js (for local frontend development)

Quick Start with Docker

  1. Clone the repository:

         git clone https://github.com/asanchezyali/conan-researcher.git
         cd conan-researcher

  2. Set up environment variables:

     ```bash
     cp backend/.env.dist backend/.env
     # Edit backend/.env if needed
     ```

  3. Start the services:

         docker-compose up -d

  4. Access the services:

    • Frontend: http://localhost:3000
    • Backend API: http://localhost:8000/docs
    • PGAdmin: http://localhost:5050 (login with admin@admin.com / admin)
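After `docker-compose up -d`, the containers can take a moment to accept connections. The following is a minimal sketch of a reachability check for the three ports listed above. It assumes the default port mappings are unchanged, and the helper names `is_port_open` and `check_services` are illustrative, not part of the project.

```python
import socket

# Ports taken from the service list above; "localhost" assumes the default
# docker-compose port mappings have not been changed.
SERVICES = {
    "frontend": ("localhost", 3000),
    "backend": ("localhost", 8000),
    "pgadmin": ("localhost", 5050),
}


def is_port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_services(services: dict[str, tuple[str, int]]) -> dict[str, bool]:
    """Map each service name to whether its port accepts connections."""
    return {
        name: is_port_open(host, port)
        for name, (host, port) in services.items()
    }


if __name__ == "__main__":
    for name, up in check_services(SERVICES).items():
        print(f"{name}: {'reachable' if up else 'not reachable'}")
```

An open port only shows that something is listening. It does not confirm that the application inside the container started correctly, so the service logs remain the place to look when a page fails to load.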

Local Development

Backend

  1. Navigate to the backend directory:

         cd backend

  2. Create and activate a virtual environment:

         python -m venv venv
         source venv/bin/activate  # On Windows: venv\Scripts\activate

  3. Install dependencies:

         pip install -r requirements.txt

  4. Run the application:

         uvicorn app.main:app --reload

Frontend

  1. Navigate to the frontend directory:

         cd frontend

  2. Install dependencies:

         npm install

  3. Start the development server:

         npm run dev

API Endpoints

Scraping API

  • POST /api/scrape/: Start a new scraping job
    • Expects a ScraperRun object with URLs and parameters
    • Returns extracted data or error information
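As a rough illustration of calling this endpoint from Python, the sketch below builds the POST request with the standard library. The payload shape is an assumption: the README does not list the `ScraperRun` fields, so the `urls` key is hypothetical and should be checked against the schema published at http://localhost:8000/docs. The helper name `build_scrape_request` is also illustrative.

```python
import json
import urllib.request

# Base URL from the Quick Start section; assumes the backend runs locally.
BASE_URL = "http://localhost:8000"


def build_scrape_request(
    urls: list[str], base_url: str = BASE_URL
) -> urllib.request.Request:
    """Build (but do not send) a JSON POST request for /api/scrape/."""
    payload = {"urls": urls}  # hypothetical field name, not confirmed by the README
    return urllib.request.Request(
        url=f"{base_url}/api/scrape/",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    request = build_scrape_request(["https://example.com/listing/1"])
    print(request.get_method(), request.full_url)
    print(request.data.decode("utf-8"))
    # With the backend running, the request could be sent like this:
    #     with urllib.request.urlopen(request, timeout=30) as response:
    #         print(json.load(response))
```

Keeping request construction separate from sending makes the payload easy to inspect before any scraping job is started.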

Configuration

Environment Variables

Key environment variables defined in .env.dist:

| Variable | Description |
|----------|-------------|
| POSTGRES_USER | Database username |
| POSTGRES_PASSWORD | Database password |
| POSTGRES_DB | Database name |
| CHROMA_HOST_ADDR | Vector database host |
| CHROMA_HOST_PORT | Vector database port |
| TWITTER_TOKEN | API token for Twitter scraper |
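One way to consume these variables is to load them once at startup and fail early when any are missing. The sketch below is not the project's actual settings module. It assumes the conventional underscore-separated spellings (for example `POSTGRES_USER`), which should be confirmed against `backend/.env.dist`, and it treats the Twitter token as optional, which is also an assumption.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

# Assumed spellings; backend/.env.dist holds the authoritative names.
REQUIRED = (
    "POSTGRES_USER",
    "POSTGRES_PASSWORD",
    "POSTGRES_DB",
    "CHROMA_HOST_ADDR",
    "CHROMA_HOST_PORT",
)


@dataclass(frozen=True)
class Settings:
    postgres_user: str
    postgres_password: str
    postgres_db: str
    chroma_host: str
    chroma_port: int
    twitter_token: Optional[str] = None  # treated as optional here (assumption)

    @classmethod
    def from_env(cls, env: Mapping[str, str] = os.environ) -> "Settings":
        """Build settings from env, reporting every missing name at once."""
        missing = [name for name in REQUIRED if not env.get(name)]
        if missing:
            raise RuntimeError(
                "Missing environment variables: " + ", ".join(missing)
            )
        return cls(
            postgres_user=env["POSTGRES_USER"],
            postgres_password=env["POSTGRES_PASSWORD"],
            postgres_db=env["POSTGRES_DB"],
            chroma_host=env["CHROMA_HOST_ADDR"],
            chroma_port=int(env["CHROMA_HOST_PORT"]),
            twitter_token=env.get("TWITTER_TOKEN") or None,
        )


if __name__ == "__main__":
    try:
        settings = Settings.from_env()
        print(f"Loaded settings for database {settings.postgres_db!r}")
    except RuntimeError as error:
        print(error)
```

Passing the mapping as a parameter keeps the loader testable without touching the real process environment.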

Project Structure

    conan-researcher/
    ├── backend/                  # FastAPI backend
    │   ├── app/
    │   │   ├── api/              # API endpoints
    │   │   ├── agents/           # Extraction agents
    │   │   │   └── scrapegraph_agent/
    │   │   │       └── prompts.py
    │   │   ├── core/             # Core functionality
    │   │   ├── crud/             # Database operations
    │   │   ├── models/           # Data models
    │   │   └── services/         # Business logic
    │   ├── migrations/           # Database migrations
    │   └── .env.dist             # Environment variables template
    ├── frontend/                 # Next.js frontend
    ├── data/                     # Persistent data storage
    └── docker-compose.yaml       # Docker services configuration

Development

Adding New Scrapers

  1. Create a new scraper agent in backend/app/agents/
  2. Implement extraction logic in a service class
  3. Register the new agent in the ScrapeService class
  4. Add any new environment variables to .env.dist
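The registration step above can be pictured as a simple name-to-agent registry. The sketch below is a hypothetical stand-in, not the project's code: only the `ScrapeService` name comes from this README, while the `register` and `run` methods, the `Agent` type, and the `real_estate_agent` function are invented for illustration.

```python
from typing import Callable, Dict, List

# An agent is modeled here as a function from a URL to extracted records.
Agent = Callable[[str], List[dict]]


class ScrapeService:
    """Hypothetical stand-in for the project's ScrapeService."""

    def __init__(self) -> None:
        self._agents: Dict[str, Agent] = {}

    def register(self, name: str, agent: Agent) -> None:
        """Make an agent available under a unique name."""
        if name in self._agents:
            raise ValueError(f"agent {name!r} is already registered")
        self._agents[name] = agent

    def run(self, name: str, url: str) -> List[dict]:
        """Dispatch a URL to the agent registered under name."""
        try:
            agent = self._agents[name]
        except KeyError:
            raise ValueError(f"unknown agent {name!r}") from None
        return agent(url)


def real_estate_agent(url: str) -> List[dict]:
    """Placeholder logic; a real agent would fetch and parse the page."""
    return [{"source": url, "kind": "listing"}]


if __name__ == "__main__":
    service = ScrapeService()
    service.register("real_estate", real_estate_agent)
    print(service.run("real_estate", "https://example.com/listing/1"))
```

Rejecting duplicate names at registration time surfaces naming collisions when the service starts rather than when a job is dispatched.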

Testing

Run backend tests:

    cd backend
    pytest

Troubleshooting

Common Issues

  • Database connection errors: Verify PostgreSQL container is running and credentials are correct
  • Chroma connection issues: Check if the Chroma vector database is accessible
  • Scraper failures: Check logs for specific error messages

View logs for any service:

    docker-compose logs -f [service_name]

License

This project is licensed under the GNU General Public License v3.0 (GPL-3.0); see the LICENSE file for details.

Contributors

  • Lead Developer: Alejandro Sánchez Yalí
  • [Contributors welcome!]

Owner

  • Name: Alejandro Sánchez Yalí
  • Login: asanchezyali
  • Kind: user
  • Company: Monadical

Mathematician with experience in Software Development, Data Science and Blockchain

GitHub Events

Total
  • Push event: 3
Last Year
  • Push event: 3

Dependencies

backend/Dockerfile docker
  • ${BASE_IMAGE} latest build
frontend/Dockerfile docker
  • debian latest build
frontend/package.json npm
  • @commitlint/cli ^16.3.0 development
  • @commitlint/config-conventional ^16.2.4 development
  • @svgr/webpack ^8.1.0 development
  • @tailwindcss/forms ^0.5.7 development
  • @tailwindcss/typography ^0.5.13 development
  • @testing-library/jest-dom ^5.17.0 development
  • @testing-library/react ^13.4.0 development
  • @types/react ^18.2.45 development
  • @types/testing-library__jest-dom ^5.14.9 development
  • @typescript-eslint/eslint-plugin ^5.62.0 development
  • @typescript-eslint/parser ^5.62.0 development
  • autoprefixer ^10.4.16 development
  • daisyui ^4.12.10 development
  • eslint ^8.56.0 development
  • eslint-config-next ^14.0.4 development
  • eslint-config-prettier ^8.10.0 development
  • eslint-plugin-simple-import-sort ^7.0.0 development
  • eslint-plugin-unused-imports ^2.0.0 development
  • jest ^27.5.1 development
  • lint-staged ^12.5.0 development
  • next-router-mock ^0.9.0 development
  • next-sitemap ^2.5.28 development
  • postcss ^8.4.32 development
  • prettier ^3.3.3 development
  • prettier-plugin-tailwindcss ^0.6.5 development
  • tailwind-merge ^2.5.2 development
  • tailwindcss ^3.3.6 development
  • typescript ^4.9.5 development
  • @hookform/resolvers ^3.3.4
  • @tabler/icons-react ^3.12.0
  • @types/react-datepicker ^7.0.0
  • axios ^1.6.8
  • clsx ^2.1.1
  • framer-motion ^11.3.29
  • lucide-react ^0.365.0
  • next ^14.0.4
  • npm ^10.8.2
  • react ^18.2.0
  • react-datepicker ^7.3.0
  • react-dom ^18.2.0
  • react-hook-form ^7.51.3
  • react-icons ^4.12.0
  • react-markdown ^9.0.1
  • sass ^1.77.8
  • zod ^3.22.4
frontend/yarn.lock npm
  • 1186 dependencies
backend/Pipfile pypi
  • alembic *
  • arxiv *
  • asyncpg *
  • black *
  • chromadb *
  • fastapi *
  • feedparser *
  • flake8 *
  • google-api-python-client *
  • google-auth-httplib2 *
  • google-auth-oauthlib *
  • langchain *
  • langchain-google-community *
  • llama-index *
  • llama-index-readers-smart-pdf-loader *
  • llama-index-readers-snscrape-twitter *
  • llama-index-readers-web *
  • llama-index-readers-wikipedia *
  • llama-index-readers-youtube-transcript *
  • llmsherpa *
  • loguru *
  • pydantic *
  • pydantic-settings *
  • pymupdf *
  • pytz *
  • ray *
  • sentence-transformers *
  • sqlmodel *
  • tweepy *
  • unstructured *
  • wikipedia 1.4.0
backend/Pipfile.lock pypi
  • 258 dependencies
backend/poetry.lock pypi
  • 267 dependencies
backend/pyproject.toml pypi
  • black * develop
  • flake8 * develop
  • alembic *
  • alembic-postgresql-enum ^1.3.0
  • arxiv *
  • asyncpg *
  • chromadb *
  • fastapi *
  • feedparser *
  • google-api-python-client *
  • google-auth-httplib2 *
  • google-auth-oauthlib *
  • langchain *
  • langchain-google-community *
  • llama-index *
  • llama-index-readers-smart-pdf-loader *
  • llama-index-readers-snscrape-twitter *
  • llama-index-readers-web *
  • llama-index-readers-wikipedia *
  • llama-index-readers-youtube-transcript *
  • llmsherpa *
  • loguru *
  • nest-asyncio ^1.6.0
  • pydantic *
  • pydantic-settings *
  • pymupdf *
  • python ^3.9
  • pytz *
  • ray *
  • sentence-transformers *
  • sqlmodel *
  • tweepy *
  • unstructured *
  • wikipedia 1.4.0
backend/requirements.txt pypi
  • black ==22.10.0
  • chromadb *
  • feedparser *
  • flake8 ==5.0.4
  • httpie *
  • ipython *
  • langchain *
  • loguru *
  • psycopg2-binary *
  • pytz *
  • ray *
  • sentence-transformers *
  • sqlalchemy *