https://github.com/asanchezyali/conan-researcher

Science Score: 13.0%

This score indicates how likely this project is to be science-related based on various indicators:

○ CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
○ .zenodo.json file
○ DOI references
○ Academic publication links
○ Academic email domains
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (12.2%) to scientific vocabulary

Last synced: 10 months ago

Repository

Basic Info

Host: GitHub
Owner: asanchezyali
License: gpl-3.0
Language: TypeScript
Default Branch: main
Size: 1.19 MB

Statistics

Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Releases: 0

Created almost 2 years ago · Last pushed over 1 year ago

Metadata Files

Readme License

Conan Researcher

A data extraction and analysis system with web scraping capabilities, built with a FastAPI backend and a Next.js frontend.

Architecture Overview

Conan Researcher is designed as a microservices architecture with containerized components:

┌────────────┐     ┌────────────┐     ┌────────────┐
│  Frontend  │────▶│  Backend   │────▶│  Database  │
│ (Next.js)  │     │ (FastAPI)  │     │ (Postgres) │
└────────────┘     └────────────┘     └────────────┘
                         │
                         ▼
                   ┌────────────┐     ┌────────────┐
                   │ Vector DB  │     │    Ray     │
                   │  (Chroma)  │     │ (Parallel) │
                   └────────────┘     └────────────┘

Key Features

Web Scraping: Extract structured data from various websites
Real Estate Analysis: Specialized data extraction for real estate listings
Parallel Processing: Ray integration for scalable workloads
Vector Database: Chroma integration for similarity search
Admin Interface: PGAdmin for database management

Getting Started

Prerequisites

Docker and Docker Compose
Python 3.9+ (for local development)
Node.js (for local frontend development)

Quick Start with Docker

Clone the repository:
```bash
git clone https://github.com/asanchezyali/conan-researcher.git
cd conan-researcher
```
Set up environment variables:
```bash
cp backend/.env.dist backend/.env
# Edit backend/.env if needed
```
Start the services:
```bash
docker-compose up -d
```
Access the services:
- Frontend: http://localhost:3000
- Backend API: http://localhost:8000/docs
- PGAdmin: http://localhost:5050 (login with admin@admin.com / admin)

Local Development

Backend

Navigate to the backend directory:
```bash
cd backend
```
Create and activate a virtual environment:
```bash
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
```
Install dependencies:
```bash
pip install -r requirements.txt
```
Run the application:
```bash
uvicorn app.main:app --reload
```

Frontend

Navigate to the frontend directory:
```bash
cd frontend
```
Install dependencies:
```bash
npm install
```
Start the development server:
```bash
npm run dev
```

API Endpoints

Scraping API

POST /api/scrape/: Start a new scraping job
- Expects a ScraperRun object with URLs and parameters
- Returns extracted data or error information
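
The README does not list the fields of ScraperRun, so the following is only a minimal sketch of how a client might call this endpoint. The payload keys `urls` and `params` and the helper names `build_scrape_request` and `run_scrape` are assumptions made for illustration; the real schema should be checked in the interactive docs at http://localhost:8000/docs.

```python
import json
import urllib.request

# Backend address taken from the Quick Start section (port 8000).
BASE_URL = "http://localhost:8000"


def build_scrape_request(urls, params=None, base_url=BASE_URL):
    """Build a POST request for /api/scrape/.

    The payload keys "urls" and "params" are assumptions; check the
    ScraperRun schema at /docs for the real field names.
    """
    payload = {"urls": list(urls), "params": params or {}}
    return urllib.request.Request(
        url=f"{base_url}/api/scrape/",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_scrape(urls, params=None, timeout=60):
    """Send the request and return the decoded JSON response."""
    request = build_scrape_request(urls, params)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the backend running, `run_scrape(["https://example.com"])` would send the job and return whatever JSON the API responds with. Building the request separately from sending it makes the payload easy to inspect before any network call happens.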

Configuration

Environment Variables

Key environment variables defined in .env.dist:

| Variable | Description |
|----------|-------------|
| POSTGRESUSER | Database username |
| POSTGRESPASSWORD | Database password |
| POSTGRESDB | Database name |
| CHROMAHOSTADDR | Vector database host |
| CHROMAHOSTPORT | Vector database port |
| TWITTERTOKEN | API token for Twitter scraper |
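
A quick way to catch a misconfigured environment before starting the backend is to check that every variable from the table is set. The sketch below is not part of the project; `EXPECTED_VARS` and `missing_vars` are hypothetical names, and the variable names are copied verbatim from the table, so they may need adjusting to match the exact spelling in `.env.dist`.

```python
import os

# Variable names copied verbatim from the table above; the spelling in the
# real .env.dist may differ, so adjust this tuple to match that file.
EXPECTED_VARS = (
    "POSTGRESUSER",
    "POSTGRESPASSWORD",
    "POSTGRESDB",
    "CHROMAHOSTADDR",
    "CHROMAHOSTPORT",
    "TWITTERTOKEN",
)


def missing_vars(env=None, expected=EXPECTED_VARS):
    """Return the expected variables that are unset or empty in env."""
    env = os.environ if env is None else env
    return [name for name in expected if not env.get(name)]


if __name__ == "__main__":
    missing = missing_vars()
    if missing:
        print("Missing environment variables:", ", ".join(missing))
    else:
        print("All expected environment variables are set.")
```

Passing a plain dictionary instead of `os.environ` makes the check easy to exercise without touching the real environment.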

Project Structure

conan-researcher/
├── backend/                 # FastAPI backend
│   ├── app/
│   │   ├── api/             # API endpoints
│   │   ├── agents/          # Extraction agents
│   │   │   └── scrapegraph_agent/
│   │   │       └── prompts.py
│   │   ├── core/            # Core functionality
│   │   ├── crud/            # Database operations
│   │   ├── models/          # Data models
│   │   └── services/        # Business logic
│   ├── migrations/          # Database migrations
│   └── .env.dist            # Environment variables template
├── frontend/                # Next.js frontend
├── data/                    # Persistent data storage
└── docker-compose.yaml      # Docker services configuration

Development

Adding New Scrapers

Create a new scraper agent in backend/app/agents/
Implement extraction logic in a service class
Register the new agent in the ScrapeService class
Add any new environment variables to .env.dist
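
The steps above can be pictured as a small registry pattern. The sketch below does not reproduce the project's ScrapeService, whose interface is not documented here; `ScrapeServiceSketch`, `register`, `run`, and `example_agent` are all hypothetical names chosen to show how an agent could be registered under a name and dispatched to.

```python
from typing import Callable, Dict, List

# An agent is modelled here as a function from a URL to extracted records.
Agent = Callable[[str], List[dict]]


class ScrapeServiceSketch:
    """Hypothetical stand-in for ScrapeService: maps agent names to agents."""

    def __init__(self) -> None:
        self._agents: Dict[str, Agent] = {}

    def register(self, name: str, agent: Agent) -> None:
        """Register an agent, refusing to silently overwrite an existing one."""
        if name in self._agents:
            raise ValueError(f"agent {name!r} is already registered")
        self._agents[name] = agent

    def run(self, name: str, url: str) -> List[dict]:
        """Dispatch a URL to the named agent."""
        try:
            agent = self._agents[name]
        except KeyError:
            raise KeyError(f"no agent registered under {name!r}") from None
        return agent(url)


def example_agent(url: str) -> List[dict]:
    """Placeholder extraction logic; a real agent would fetch and parse the page."""
    return [{"source": url, "title": "placeholder"}]


service = ScrapeServiceSketch()
service.register("example", example_agent)
```

Keeping agents behind a name-based lookup means a new scraper only needs one registration call, and duplicate names fail loudly instead of replacing an existing agent.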

Testing

Run backend tests:
```bash
cd backend
pytest
```

Troubleshooting

Common Issues

Database connection errors: Verify PostgreSQL container is running and credentials are correct
Chroma connection issues: Check if the Chroma vector database is accessible
Scraper failures: Check logs for specific error messages

View logs for any service:
```bash
docker-compose logs -f [service_name]
```
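
Before digging through logs, it can help to confirm that each service is accepting connections at all. The following is a rough sketch using only the standard library; `port_is_open` is a hypothetical helper, ports 3000, 8000, and 5050 come from the Quick Start section, and 5432 is PostgreSQL's default port, which assumes docker-compose publishes it to the host.

```python
import socket


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # Ports 3000, 8000, and 5050 come from the Quick Start section; 5432 is
    # PostgreSQL's default port and assumes docker-compose publishes it.
    services = {
        "frontend": ("localhost", 3000),
        "backend": ("localhost", 8000),
        "pgadmin": ("localhost", 5050),
        "postgres": ("localhost", 5432),
    }
    for name, (host, port) in services.items():
        status = "reachable" if port_is_open(host, port) else "unreachable"
        print(f"{name:<10} {host}:{port} {status}")
```

A reachable port only shows that something is listening; credential or schema problems still need the service logs.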

License

This project is licensed under the GNU General Public License v3.0 (GPL-3.0); see the LICENSE file for details.

Contributors

Lead Developer: Alejandro Sánchez Yalí
[Contributors welcome!]

Owner

Name: Alejandro Sánchez Yalí
Login: asanchezyali
Kind: user
Company: Monadical

Website: www.asanchezyali.com
Twitter: asanchezyali
Repositories: 16
Profile: https://github.com/asanchezyali

Mathematician with experience in Software Development, Data Science and Blockchain

GitHub Events

Total

Push event: 3

Last Year

Push event: 3

Dependencies

backend/Dockerfile docker

${BASE_IMAGE} latest build

frontend/Dockerfile docker

debian latest build

frontend/package.json npm

@commitlint/cli ^16.3.0 development
@commitlint/config-conventional ^16.2.4 development
@svgr/webpack ^8.1.0 development
@tailwindcss/forms ^0.5.7 development
@tailwindcss/typography ^0.5.13 development
@testing-library/jest-dom ^5.17.0 development
@testing-library/react ^13.4.0 development
@types/react ^18.2.45 development
@types/testing-library__jest-dom ^5.14.9 development
@typescript-eslint/eslint-plugin ^5.62.0 development
@typescript-eslint/parser ^5.62.0 development
autoprefixer ^10.4.16 development
daisyui ^4.12.10 development
eslint ^8.56.0 development
eslint-config-next ^14.0.4 development
eslint-config-prettier ^8.10.0 development
eslint-plugin-simple-import-sort ^7.0.0 development
eslint-plugin-unused-imports ^2.0.0 development
jest ^27.5.1 development
lint-staged ^12.5.0 development
next-router-mock ^0.9.0 development
next-sitemap ^2.5.28 development
postcss ^8.4.32 development
prettier ^3.3.3 development
prettier-plugin-tailwindcss ^0.6.5 development
tailwind-merge ^2.5.2 development
tailwindcss ^3.3.6 development
typescript ^4.9.5 development
@hookform/resolvers ^3.3.4
@tabler/icons-react ^3.12.0
@types/react-datepicker ^7.0.0
axios ^1.6.8
clsx ^2.1.1
framer-motion ^11.3.29
lucide-react ^0.365.0
next ^14.0.4
npm ^10.8.2
react ^18.2.0
react-datepicker ^7.3.0
react-dom ^18.2.0
react-hook-form ^7.51.3
react-icons ^4.12.0
react-markdown ^9.0.1
sass ^1.77.8
zod ^3.22.4

frontend/yarn.lock npm

1186 dependencies

backend/Pipfile pypi

alembic *
arxiv *
asyncpg *
black *
chromadb *
fastapi *
feedparser *
flake8 *
google-api-python-client *
google-auth-httplib2 *
google-auth-oauthlib *
langchain *
langchain-google-community *
llama-index *
llama-index-readers-smart-pdf-loader *
llama-index-readers-snscrape-twitter *
llama-index-readers-web *
llama-index-readers-wikipedia *
llama-index-readers-youtube-transcript *
llmsherpa *
loguru *
pydantic *
pydantic-settings *
pymupdf *
pytz *
ray *
sentence-transformers *
sqlmodel *
tweepy *
unstructured *
wikipedia 1.4.0

backend/Pipfile.lock pypi

258 dependencies

backend/poetry.lock pypi

267 dependencies

backend/pyproject.toml pypi

black * develop
flake8 * develop
alembic *
alembic-postgresql-enum ^1.3.0
arxiv *
asyncpg *
chromadb *
fastapi *
feedparser *
google-api-python-client *
google-auth-httplib2 *
google-auth-oauthlib *
langchain *
langchain-google-community *
llama-index *
llama-index-readers-smart-pdf-loader *
llama-index-readers-snscrape-twitter *
llama-index-readers-web *
llama-index-readers-wikipedia *
llama-index-readers-youtube-transcript *
llmsherpa *
loguru *
nest-asyncio ^1.6.0
pydantic *
pydantic-settings *
pymupdf *
python ^3.9
pytz *
ray *
sentence-transformers *
sqlmodel *
tweepy *
unstructured *
wikipedia 1.4.0

backend/requirements.txt pypi

black ==22.10.0
chromadb *
feedparser *
flake8 ==5.0.4
httpie *
ipython *
langchain *
loguru *
psycopg2-binary *
pytz *
ray *
sentence-transformers *
sqlalchemy *