https://github.com/asanchezyali/conan-researcher
Science Score: 13.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file (found)
- ○ .zenodo.json file
- ○ DOI references
- ○ Academic publication links
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (12.2%) to scientific vocabulary
Repository
Basic Info
- Host: GitHub
- Owner: asanchezyali
- License: gpl-3.0
- Language: TypeScript
- Default Branch: main
- Size: 1.19 MB
Statistics
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
Conan Researcher
A data extraction and analysis system with web scraping capabilities, built with a FastAPI backend and a Next.js frontend.
Architecture Overview
Conan Researcher uses a microservices architecture in which each component runs in its own container:
┌────────────┐     ┌────────────┐     ┌────────────┐
│  Frontend  │────▶│  Backend   │────▶│  Database  │
│ (Next.js)  │     │ (FastAPI)  │     │ (Postgres) │
└────────────┘     └────────────┘     └────────────┘
                         │
                         ▼
                   ┌────────────┐     ┌────────────┐
                   │ Vector DB  │     │    Ray     │
                   │  (Chroma)  │     │ (Parallel) │
                   └────────────┘     └────────────┘
Key Features
- Web Scraping: Extract structured data from various websites
- Real Estate Analysis: Specialized data extraction for real estate listings
- Parallel Processing: Ray integration for scalable workloads
- Vector Database: Chroma integration for similarity search
- Admin Interface: pgAdmin for database management
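To make the parallel-processing feature concrete, here is a minimal sketch of the fan-out pattern it implies: the same extraction step applied to many pages at once. It uses the standard library's `concurrent.futures` as a stand-in for Ray, so it is not the project's Ray code. `TitleParser`, `extract_title`, and `extract_all` are hypothetical names introduced for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text found inside <title> elements."""

    def __init__(self) -> None:
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_title(html: str) -> str:
    """Hypothetical extraction step: return the page title, stripped."""
    parser = TitleParser()
    parser.feed(html)
    parser.close()  # flush any buffered text
    return parser.title.strip()


def extract_all(pages: list[str], max_workers: int = 4) -> list[str]:
    """Apply extract_title to every page in parallel, preserving input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(extract_title, pages))
```

With Ray, the same shape would typically be expressed by decorating the extraction function with `@ray.remote`, calling `.remote(page)` for each page, and collecting the results with `ray.get`. How the project actually wires this up is not shown in the README.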
Getting Started
Prerequisites
- Docker and Docker Compose
- Python 3.9+ (for local development)
- Node.js (for local frontend development)
Quick Start with Docker
Clone the repository:

```bash
git clone https://github.com/asanchezyali/conan-researcher.git
cd conan-researcher
```

Set up environment variables:

```bash
cp backend/.env.dist backend/.env
# Edit backend/.env if needed
```

Start the services:

```bash
docker-compose up -d
```

Access the services:
- Frontend: http://localhost:3000
- Backend API: http://localhost:8000/docs
- pgAdmin: http://localhost:5050 (login with admin@admin.com / admin)
Local Development
Backend
Navigate to the backend directory:

```bash
cd backend
```

Create and activate a virtual environment:

```bash
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
```

Install dependencies:

```bash
pip install -r requirements.txt
```

Run the application:

```bash
uvicorn app.main:app --reload
```
Frontend
Navigate to the frontend directory:

```bash
cd frontend
```

Install dependencies:

```bash
npm install
```

Start the development server:

```bash
npm run dev
```
API Endpoints
Scraping API
- `POST /api/scrape/`: Start a new scraping job
  - Expects a `ScraperRun` object with URLs and parameters
  - Returns extracted data or error information
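As an illustration of calling the scraping endpoint, the sketch below builds (but does not send) a JSON `POST` request using only the standard library. The README does not list the fields of `ScraperRun`, so the payload keys `urls` and `params` are assumptions, as is the helper name `build_scrape_request`. Check the backend model for the real schema.

```python
import json
from urllib.request import Request


def build_scrape_request(base_url: str, urls: list[str], **params) -> Request:
    """Build a POST request for /api/scrape/ without sending it.

    The payload keys "urls" and "params" are assumptions; the real
    ScraperRun field names may differ.
    """
    payload = {"urls": urls, "params": params}
    return Request(
        url=f"{base_url.rstrip('/')}/api/scrape/",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

With the backend running locally, the request could be sent with `urllib.request.urlopen(build_scrape_request("http://localhost:8000", ["https://example.com"]))`. The interactive docs at `/docs` show the schema the endpoint actually accepts.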
Configuration
Environment Variables
Key environment variables defined in .env.dist:
| Variable | Description |
|----------|-------------|
| POSTGRESUSER | Database username |
| POSTGRESPASSWORD | Database password |
| POSTGRESDB | Database name |
| CHROMAHOSTADDR | Vector database host |
| CHROMAHOSTPORT | Vector database port |
| TWITTERTOKEN | API token for Twitter scraper |
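One way to consume these variables is to read them once at startup and fail early if a required one is missing. The sketch below does that with a frozen dataclass from the standard library. The `Settings` class and `load_settings` function are hypothetical, the variable names are copied exactly as listed above (confirm the spelling against `.env.dist`), and treating `TWITTERTOKEN` as optional is an assumption.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    postgres_user: str
    postgres_password: str
    postgres_db: str
    chroma_host: str
    chroma_port: int
    twitter_token: Optional[str] = None  # assumed optional


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read the listed variables; a missing required one raises KeyError."""
    return Settings(
        postgres_user=env["POSTGRESUSER"],
        postgres_password=env["POSTGRESPASSWORD"],
        postgres_db=env["POSTGRESDB"],
        chroma_host=env["CHROMAHOSTADDR"],
        chroma_port=int(env["CHROMAHOSTPORT"]),  # ports arrive as strings
        twitter_token=env.get("TWITTERTOKEN"),
    )
```

Passing the environment in as a mapping keeps the loader easy to test without touching the real process environment.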
Project Structure
conan-researcher/
├── backend/                  # FastAPI backend
│   ├── app/
│   │   ├── api/              # API endpoints
│   │   ├── agents/           # Extraction agents
│   │   │   └── scrapegraph_agent/
│   │   │       └── prompts.py
│   │   ├── core/             # Core functionality
│   │   ├── crud/             # Database operations
│   │   ├── models/           # Data models
│   │   └── services/         # Business logic
│   ├── migrations/           # Database migrations
│   └── .env.dist             # Environment variables template
├── frontend/                 # Next.js frontend
├── data/                     # Persistent data storage
└── docker-compose.yaml       # Docker services configuration
Development
Adding New Scrapers
- Create a new scraper agent in `backend/app/agents/`
- Implement extraction logic in a service class
- Register the new agent in the `ScrapeService` class
- Add any new environment variables to `.env.dist`
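The steps above describe a registry pattern: agents share a common interface and the service looks them up by name. The sketch below shows that pattern in isolation. The README names `ScrapeService` but does not document its interface, so the class here is an illustrative stand-in, and `ScraperAgent`, `ExampleListingAgent`, `register`, and `run` are hypothetical names.

```python
from typing import Protocol


class ScraperAgent(Protocol):
    """Interface every agent is assumed to satisfy (hypothetical)."""

    def extract(self, url: str) -> dict: ...


class ScrapeService:
    """Illustrative stand-in, not the project's actual ScrapeService."""

    def __init__(self) -> None:
        self._agents: dict[str, ScraperAgent] = {}

    def register(self, name: str, agent: ScraperAgent) -> None:
        # Refuse silent overwrites so two agents cannot claim one name.
        if name in self._agents:
            raise ValueError(f"agent {name!r} is already registered")
        self._agents[name] = agent

    def run(self, name: str, url: str) -> dict:
        # Raises KeyError if no agent was registered under this name.
        return self._agents[name].extract(url)


class ExampleListingAgent:
    """Hypothetical agent; a real one would fetch and parse the page."""

    def extract(self, url: str) -> dict:
        return {"source": url, "kind": "listing"}
```

Usage would look like `service = ScrapeService(); service.register("listing", ExampleListingAgent()); service.run("listing", "https://example.com/1")`.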
Testing
Run backend tests:
```bash
cd backend
pytest
```
Troubleshooting
Common Issues
- Database connection errors: Verify PostgreSQL container is running and credentials are correct
- Chroma connection issues: Check if the Chroma vector database is accessible
- Scraper failures: Check logs for specific error messages
View logs for any service:
```bash
docker-compose logs -f [service_name]
```
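Before digging through logs, it can help to confirm that each service is accepting connections at all. The following is a minimal sketch of such a check using plain TCP connects. `is_port_open` and `check_services` are hypothetical helpers, and an open port only shows that something is listening, not that the service is healthy.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unresolvable hosts.
        return False


def check_services(services: dict[str, tuple[str, int]]) -> dict[str, bool]:
    """Map each service name to whether its host:port accepts connections."""
    return {
        name: is_port_open(host, port)
        for name, (host, port) in services.items()
    }
```

For the ports listed in the Quick Start section, a call might be `check_services({"frontend": ("localhost", 3000), "backend": ("localhost", 8000), "pgadmin": ("localhost", 5050)})`. The PostgreSQL and Chroma ports depend on the values in `backend/.env` and `docker-compose.yaml`.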
License
This project is licensed under the GNU General Public License v3.0 (GPL-3.0); see the LICENSE file for details.
Contributors
- Lead Developer: Alejandro Sánchez Yalí
- [Contributors welcome!]
Owner
- Name: Alejandro Sánchez Yalí
- Login: asanchezyali
- Kind: user
- Company: Monadical
- Website: www.asanchezyali.com
- Twitter: asanchezyali
- Repositories: 16
- Profile: https://github.com/asanchezyali
Mathematician with experience in Software Development, Data Science and Blockchain
GitHub Events
Total
- Push event: 3
Last Year
- Push event: 3
Dependencies
- ${BASE_IMAGE} latest build
- debian latest build
- @commitlint/cli ^16.3.0 development
- @commitlint/config-conventional ^16.2.4 development
- @svgr/webpack ^8.1.0 development
- @tailwindcss/forms ^0.5.7 development
- @tailwindcss/typography ^0.5.13 development
- @testing-library/jest-dom ^5.17.0 development
- @testing-library/react ^13.4.0 development
- @types/react ^18.2.45 development
- @types/testing-library__jest-dom ^5.14.9 development
- @typescript-eslint/eslint-plugin ^5.62.0 development
- @typescript-eslint/parser ^5.62.0 development
- autoprefixer ^10.4.16 development
- daisyui ^4.12.10 development
- eslint ^8.56.0 development
- eslint-config-next ^14.0.4 development
- eslint-config-prettier ^8.10.0 development
- eslint-plugin-simple-import-sort ^7.0.0 development
- eslint-plugin-unused-imports ^2.0.0 development
- jest ^27.5.1 development
- lint-staged ^12.5.0 development
- next-router-mock ^0.9.0 development
- next-sitemap ^2.5.28 development
- postcss ^8.4.32 development
- prettier ^3.3.3 development
- prettier-plugin-tailwindcss ^0.6.5 development
- tailwind-merge ^2.5.2 development
- tailwindcss ^3.3.6 development
- typescript ^4.9.5 development
- @hookform/resolvers ^3.3.4
- @tabler/icons-react ^3.12.0
- @types/react-datepicker ^7.0.0
- axios ^1.6.8
- clsx ^2.1.1
- framer-motion ^11.3.29
- lucide-react ^0.365.0
- next ^14.0.4
- npm ^10.8.2
- react ^18.2.0
- react-datepicker ^7.3.0
- react-dom ^18.2.0
- react-hook-form ^7.51.3
- react-icons ^4.12.0
- react-markdown ^9.0.1
- sass ^1.77.8
- zod ^3.22.4
- 1186 dependencies
- alembic *
- arxiv *
- asyncpg *
- black *
- chromadb *
- fastapi *
- feedparser *
- flake8 *
- google-api-python-client *
- google-auth-httplib2 *
- google-auth-oauthlib *
- langchain *
- langchain-google-community *
- llama-index *
- llama-index-readers-smart-pdf-loader *
- llama-index-readers-snscrape-twitter *
- llama-index-readers-web *
- llama-index-readers-wikipedia *
- llama-index-readers-youtube-transcript *
- llmsherpa *
- loguru *
- pydantic *
- pydantic-settings *
- pymupdf *
- pytz *
- ray *
- sentence-transformers *
- sqlmodel *
- tweepy *
- unstructured *
- wikipedia 1.4.0
- 258 dependencies
- 267 dependencies
- black * develop
- flake8 * develop
- alembic *
- alembic-postgresql-enum ^1.3.0
- arxiv *
- asyncpg *
- chromadb *
- fastapi *
- feedparser *
- google-api-python-client *
- google-auth-httplib2 *
- google-auth-oauthlib *
- langchain *
- langchain-google-community *
- llama-index *
- llama-index-readers-smart-pdf-loader *
- llama-index-readers-snscrape-twitter *
- llama-index-readers-web *
- llama-index-readers-wikipedia *
- llama-index-readers-youtube-transcript *
- llmsherpa *
- loguru *
- nest-asyncio ^1.6.0
- pydantic *
- pydantic-settings *
- pymupdf *
- python ^3.9
- pytz *
- ray *
- sentence-transformers *
- sqlmodel *
- tweepy *
- unstructured *
- wikipedia 1.4.0
- black ==22.10.0
- chromadb *
- feedparser *
- flake8 ==5.0.4
- httpie *
- ipython *
- langchain *
- loguru *
- psycopg2-binary *
- pytz *
- ray *
- sentence-transformers *
- sqlalchemy *