Science Score: 54.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
    Links to: scholar.google
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (12.8%) to scientific vocabulary
Last synced: 6 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: niyoseris
  • License: MIT
  • Language: HTML
  • Default Branch: main
  • Size: 25.3 MB
Statistics
  • Stars: 0
  • Watchers: 1
  • Forks: 0
  • Open Issues: 0
  • Releases: 0
Created 12 months ago · Last pushed 11 months ago
Metadata Files
Readme Changelog License Citation Security

README.md

LazyScholar - Academic Research Assistant

LazyScholar is an AI-powered research assistant that helps users conduct academic research by automating the process of literature review and academic paper writing.

Features

  • Topic Generation: Analyzes a problem statement to generate relevant topics and subtopics
  • Automated Research: Searches academic databases (like Google Scholar) for relevant papers
  • PDF Analysis: Downloads and extracts information from PDF files
  • Vision AI Integration: Uses Google's Gemini Flash Vision LLM to navigate search interfaces
  • Content Extraction: Extracts relevant information from research papers
  • Paper Generation: Compiles findings into a structured academic paper with proper citations
  • Organized Output: Saves research findings in a structured directory format
  • Specialized Templates: Includes pre-defined templates for various research domains
  • Dry Run Mode: Test initialization and analysis without performing full research

How It Works

  1. Problem Statement Analysis: User enters a research problem statement
  2. Topic Generation: LazyScholar uses Gemini Flash LLM to generate topics and subtopics
  3. Web Search: Searches academic databases for each topic
  4. PDF Processing: Downloads and analyzes up to 10 PDF files per topic
  5. Content Extraction: Extracts relevant information for each subtopic
  6. Paper Compilation: Combines all research into a final academic paper
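The six steps above can be sketched as a small pipeline. The function names below are hypothetical stand-ins (the README does not document the real module layout), and the helpers are stubbed purely for illustration:

```python
# Hypothetical names sketching the six steps above; the actual helpers
# would call Gemini, Selenium, and the PDF extractors.
def generate_topics(problem_statement):
    return {"Topic 1": ["Subtopic 1.1", "Subtopic 1.2"]}  # stub for step 2

def search_and_download(topic, limit=10):
    return [f"{topic}/{i}.pdf" for i in range(1, 3)]      # stub for steps 3-4

def extract_content(pdf_path):
    return f"notes from {pdf_path}"                       # stub for step 5

def run_pipeline(problem_statement, max_pdfs=10):
    findings = {}
    for topic, subtopics in generate_topics(problem_statement).items():
        pdfs = search_and_download(topic, limit=max_pdfs)
        findings[topic] = [extract_content(p) for p in pdfs]
    return findings  # step 6 compiles these into the final paper
```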

Installation

  1. Clone this repository
  2. Install dependencies: pip install -r requirements.txt
  3. Create a .env file with your Google API key: GOOGLE_API_KEY=your_api_key_here

You can obtain a Google API key from Google AI Studio.
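The `.env` file is read at startup so the key never has to be hard-coded. The project depends on python-dotenv for this; the snippet below is only a minimal stand-in showing what that loading amounts to:

```python
import os

def load_env(path=".env"):
    """Minimal stand-in for python-dotenv's load_dotenv(), shown only to
    illustrate what the .env file provides."""
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            os.environ.setdefault(key.strip(), value.strip())

if os.path.exists(".env"):
    load_env()  # afterwards os.environ["GOOGLE_API_KEY"] is available
```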

Usage

Run the LazyScholar application with a research problem statement:

```bash
python lazy_scholar.py "Your research problem statement here"
```

Options

  • --search-engine: Specify a different search engine URL (default: https://scholar.google.com)
  • --headless: Run browser in headless mode
  • --output-dir: Specify output directory (default: research_output)
  • --dry-run: Test initialization and analysis without performing full research
  • --timeout: Set browser operation timeout in seconds (default: 120)
  • --max-pdfs: Maximum number of PDFs to download per topic (default: 10)

Example:

```bash
python lazy_scholar.py "The impact of climate change on marine ecosystems" --output-dir climate_research --dry-run
```
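A hypothetical reconstruction of how the documented flags might be wired up with argparse; the names and defaults match the option list above, but the real script's parser may differ:

```python
import argparse

def build_parser():
    # Defaults are taken from the option list above.
    p = argparse.ArgumentParser(description="LazyScholar research assistant")
    p.add_argument("problem_statement", help="Research problem statement")
    p.add_argument("--search-engine", default="https://scholar.google.com")
    p.add_argument("--headless", action="store_true")
    p.add_argument("--output-dir", default="research_output")
    p.add_argument("--dry-run", action="store_true")
    p.add_argument("--timeout", type=int, default=120)
    p.add_argument("--max-pdfs", type=int, default=10)
    return p

# Parse the example invocation shown above.
args = build_parser().parse_args(
    ["The impact of climate change on marine ecosystems",
     "--output-dir", "climate_research", "--dry-run"]
)
```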

Specialized Research Templates

LazyScholar includes specialized templates for various research domains:

  1. Music and Politics: For research on music's relationship with political views and movements
  2. AI and Education: For research on artificial intelligence in educational contexts
  3. Technology and Society: For research on digital transformation and social impacts
  4. Health and Medicine: For research on medical innovations and healthcare systems
  5. Environment and Sustainability: For research on climate change and conservation
  6. Business and Economics: For research on economic theories and business strategies
  7. Psychology and Human Behavior: For research on cognitive processes and mental health

These templates are automatically selected based on keywords in your research problem statement.
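Keyword-based selection can be sketched as a simple lookup. The keyword map and default below are invented for illustration; the README does not document the actual matching rules:

```python
# Hypothetical keyword map; the real selection logic is not shown.
TEMPLATE_KEYWORDS = {
    "Music and Politics": ["music", "political"],
    "AI and Education": ["artificial intelligence", "education"],
    "Environment and Sustainability": ["climate", "conservation", "sustainab"],
    "Health and Medicine": ["health", "medical", "medicine"],
}

def select_template(problem_statement, default="Academic research"):
    text = problem_statement.lower()
    for template, keywords in TEMPLATE_KEYWORDS.items():
        if any(k in text for k in keywords):
            return template
    return default

select_template("The impact of climate change on marine ecosystems")
# → "Environment and Sustainability"
```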

Output Structure

LazyScholar organizes research output in the following structure:

```
research_output/
├── pdfs/                 # Downloaded PDF files
├── Topic_1/              # Directory for Topic 1
│   ├── Subtopic_1_1.md   # Research on Subtopic 1.1
│   ├── Subtopic_1_2.md   # Research on Subtopic 1.2
│   └── ...
├── Topic_2/              # Directory for Topic 2
│   ├── Subtopic_2_1.md   # Research on Subtopic 2.1
│   └── ...
└── final_paper.md        # Final compiled research paper
```

Requirements

  • Python 3.8+
  • Google API key for Gemini Flash 2.0
  • Internet connection
  • Chrome or Firefox browser

Limitations

  • Requires a valid Google API key with access to Gemini Flash 2.0
  • May trigger CAPTCHAs on academic search engines
  • PDF extraction quality depends on the PDF structure
  • Limited to 10 PDFs per topic to avoid excessive processing
  • Without a valid API key, only default templates will be used
  • Subject to API rate limits, which may cause pauses during the research process

Handling API Rate Limits

LazyScholar uses Google's Gemini API, which has usage quotas that may be exceeded during intensive research sessions. When this happens:

  1. The application will automatically pause and wait for the quota to reset
  2. Wait times follow an exponential backoff pattern (2, 4, 8, 16, 32 seconds)
  3. Progress is preserved during these pauses
  4. For uninterrupted usage, consider:
    • Breaking large research projects into smaller sessions
    • Upgrading to a higher API quota tier if using the free version
    • Running the application during off-peak hours
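The backoff pattern described above amounts to doubling the wait after each failed attempt. A minimal sketch, assuming a generic retry wrapper around the API call (the project's actual retry code is not shown):

```python
import time

def with_backoff(call, retries=5, base_delay=2.0, sleep=time.sleep):
    """Retry `call` with exponential backoff: 2, 4, 8, 16, 32 seconds,
    matching the wait pattern described above."""
    for attempt in range(retries):
        try:
            return call()
        except Exception:
            if attempt == retries - 1:
                raise  # quota still exhausted after the final wait
            sleep(base_delay * (2 ** attempt))
```

The `sleep` parameter is injected only so the behavior can be observed without real delays.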

Recent Updates

  1. Content Type Support: LazyScholar now supports different content types:

    • Academic research (default)
    • Practical how-to guides
    • Travel guides
  2. Search Purpose Parameter: Added a search_purpose parameter to specify the type of content to generate.

  3. Source Flexibility: Changed terminology from "PDFs" to "Sources" to make it clear that the application can work with various content types.

  4. Improved Search Phrases: Search phrases for subtopics are now more specific and always in English, including relevant keywords about the main topic.

  5. PDF Requirement Toggle: Added a require_pdfs parameter to specify whether PDFs are required for the search.

  6. Sequential PDF Naming: Implemented sequential numbering for downloaded PDF files (1.pdf, 2.pdf, etc.) instead of using hash-based filenames, making it easier to identify and manage downloaded files.

  7. Domain Filtering: Added site_tld parameter to filter search results based on domain patterns (e.g., 'edu', 'gov', 'org').

  8. Minimum Sources Requirement: Added minimum_pdfs parameter to ensure LazyScholar continues searching until it finds a minimum number of valuable sources for each subtopic.

  9. Crawl Depth Control: Added crawl_depth and max_crawl_pages parameters to control how deeply the application crawls websites for content.

  10. Real-time Progress Tracking: Implemented a progress tracking system in the Flask wrapper to provide real-time updates on the research process.

  11. UI Improvements: Updated the user interface to reflect the broader focus on various content types rather than just PDFs.
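The sequential naming from update 6 can be sketched as follows; this is an illustrative helper, not the project's actual code:

```python
from pathlib import Path

def next_pdf_name(pdf_dir):
    """Return the next sequential name (1.pdf, 2.pdf, ...) in pdf_dir."""
    pdf_dir = Path(pdf_dir)
    pdf_dir.mkdir(parents=True, exist_ok=True)
    numbers = [int(p.stem) for p in pdf_dir.glob("*.pdf") if p.stem.isdigit()]
    return pdf_dir / f"{max(numbers, default=0) + 1}.pdf"
```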

Flask Web Interface

LazyScholar now includes a Flask web application that provides a user-friendly interface to:

  • Configure research parameters
  • Track real-time progress of research
  • View and read generated content and downloaded sources
  • Save and load user research profiles

Using the Flask Interface

  1. Start the Flask Application: run python app.py. This launches the web server, typically at http://127.0.0.1:5000/

  2. User Account:

    • Register a new account or log in with existing credentials
    • Accounts allow you to save and manage multiple research profiles
  3. Dashboard:

    • View all your saved research profiles
    • Access research results from previous projects
    • Create new research profiles
  4. Create/Edit Research Profile:

    • Name your research project
    • Enter your research problem statement
    • Configure search settings:
      • Content type (academic, practical, travel)
      • Search engine
      • Language preferences
      • Maximum and minimum sources per topic
      • Domain filtering (edu, gov, org, etc.)
      • Crawl depth settings
  5. Start Research:

    • Select a profile from your dashboard
    • Click "Start Research" to begin the automated research process
    • Monitor real-time progress of topic generation, searches, and content extraction
  6. View Results:

    • Browse all generated files organized by topic
    • Read downloaded sources
    • Access the final compiled research paper
  7. Save & Load Templates:

    • Save successful research configurations as templates
    • Quickly start new research using proven settings

License

This project is licensed under the MIT License - see the LICENSE file for details.

Owner

  • Login: niyoseris
  • Kind: user

Citation (CITATION_PROCESSOR_README.md)

# Citation Processor and Research Workflow

This set of tools helps you process academic PDFs, extract citations and key points, and build structured research papers through a systematic workflow.

## Overview

The workflow consists of the following steps:

1. **Process PDF files**: Send PDFs to a language model to extract citation information and key points
2. **Create subtopic sketches**: Collect the extracted information into JSON sketch files
3. **Generate subtopic papers**: Send the sketch files to the language model to generate research papers for each subtopic
4. **Combine into topic papers**: Combine subtopic papers into comprehensive topic papers
5. **Create final paper**: Combine topic papers into a complete research paper

## Requirements

- Python 3.7+
- Google Gemini API key (set as `GOOGLE_API_KEY` in a `.env` file)
- Required Python packages:
  - google-generativeai
  - PyPDF2
  - pdfplumber
  - python-dotenv
  - requests

## Installation

1. Clone this repository or download the scripts
2. Install required packages:
   ```
   pip install google-generativeai PyPDF2 pdfplumber python-dotenv requests
   ```
3. Create a `.env` file in the same directory as the scripts with your Google API key:
   ```
   GOOGLE_API_KEY=your_api_key_here
   ```

## Usage

### Using the Research Workflow Script

The `research_workflow.py` script provides a user-friendly interface to execute the research workflow. It has three main modes of operation:

#### 1. Subtopic Mode

Process PDFs for a single subtopic:

```bash
python research_workflow.py --mode subtopic --subtopic "Your Subtopic Name" --pdf_dir path/to/pdfs --output_dir research_output
```

#### 2. Topic Mode

Process multiple subtopics for a topic:

```bash
python research_workflow.py --mode topic --topic "Your Topic Name" --subtopics "Subtopic 1" "Subtopic 2" "Subtopic 3" --pdfs_base_dir path/to/pdfs_base_dir --output_dir research_output
```

For this mode, the PDFs should be organized in subdirectories named after each subtopic under the `pdfs_base_dir`.

#### 3. Paper Mode

Combine multiple topic papers into a final research paper:

```bash
python research_workflow.py --mode paper --title "Your Research Paper Title" --topics "Topic 1" "Topic 2" "Topic 3" --output_dir research_output
```

### Using the Citation Processor Directly

You can also use the `citation_processor.py` script directly for more fine-grained control:

```bash
python citation_processor.py --pdf_dir path/to/pdfs --subtopic "Your Subtopic Name" --output_dir research_output
```

Or to combine subtopic papers into a topic paper:

```bash
python citation_processor.py --output_dir research_output --topic "Your Topic Name"
```

Or to combine topic papers into a final paper:

```bash
python citation_processor.py --output_dir research_output --title "Your Research Paper Title"
```

## Directory Structure

For best results, organize your PDFs in the following structure:

```
research_project/
├── pdfs/
│   ├── Topic 1/
│   │   ├── Subtopic 1/
│   │   │   ├── paper1.pdf
│   │   │   ├── paper2.pdf
│   │   │   └── ...
│   │   ├── Subtopic 2/
│   │   │   ├── paper1.pdf
│   │   │   ├── paper2.pdf
│   │   │   └── ...
│   │   └── ...
│   ├── Topic 2/
│   │   ├── Subtopic 1/
│   │   │   ├── paper1.pdf
│   │   │   ├── paper2.pdf
│   │   │   └── ...
│   │   └── ...
│   └── ...
└── research_output/
    └── (output files will be saved here)
```

## Output Files

The scripts generate the following types of output files:

- **Subtopic sketch files**: JSON files containing extracted information from PDFs (`subtopic_name_sketch.json`)
- **Subtopic paper files**: Markdown files containing generated research papers for each subtopic (`subtopic_name_paper.md`)
- **Topic paper files**: Markdown files containing combined research papers for each topic (`topic_name_paper.md`)
- **Final paper file**: Markdown file containing the complete research paper (`paper_title_final_paper.md`)
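Deriving these filenames from a subtopic or topic name presumably involves some sanitization. A hedged sketch, since the scripts' exact slug rules are not documented:

```python
import re

def sketch_filename(name, kind="sketch"):
    """Illustrative derivation of names like `subtopic_name_sketch.json`
    or `subtopic_name_paper.md`; the real sanitization may differ."""
    slug = re.sub(r"[^a-z0-9]+", "_", name.lower()).strip("_")
    ext = "json" if kind == "sketch" else "md"
    return f"{slug}_{kind}.{ext}"

sketch_filename("Sea Level Rise")           # → "sea_level_rise_sketch.json"
sketch_filename("Sea Level Rise", "paper")  # → "sea_level_rise_paper.md"
```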

## Example Workflow

Here's an example of a complete workflow for a research project on "Climate Change":

1. Organize PDFs in the appropriate directory structure
2. Process each subtopic:
   ```bash
   python research_workflow.py --mode subtopic --subtopic "Sea Level Rise" --pdf_dir pdfs/Climate_Change/Sea_Level_Rise --output_dir research_output
   python research_workflow.py --mode subtopic --subtopic "Extreme Weather" --pdf_dir pdfs/Climate_Change/Extreme_Weather --output_dir research_output
   python research_workflow.py --mode subtopic --subtopic "Carbon Emissions" --pdf_dir pdfs/Climate_Change/Carbon_Emissions --output_dir research_output
   ```
3. Combine subtopics into a topic paper:
   ```bash
   python research_workflow.py --mode topic --topic "Climate Change" --subtopics "Sea Level Rise" "Extreme Weather" "Carbon Emissions" --pdfs_base_dir pdfs/Climate_Change --output_dir research_output
   ```
4. If you have multiple topics, combine them into a final paper:
   ```bash
   python research_workflow.py --mode paper --title "Environmental Challenges in the 21st Century" --topics "Climate Change" "Biodiversity Loss" "Pollution" --output_dir research_output
   ```

## Troubleshooting

- **PDF extraction issues**: If text extraction from PDFs fails, try converting the PDFs to text using other tools before processing
- **API rate limits**: If you encounter API rate limits, the scripts include retry logic with exponential backoff
- **Memory issues**: For very large PDFs, the scripts limit the amount of text sent to the language model to avoid token limits
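The "Memory issues" mitigation amounts to capping the text sent to the model. A minimal sketch; the 30,000-character limit is an assumed value, not one taken from the scripts:

```python
def cap_text(text, max_chars=30000):
    """Crude guard against oversized prompts: truncate extracted PDF text
    before sending it to the language model."""
    if len(text) <= max_chars:
        return text
    return text[:max_chars] + "\n[truncated]"
```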

## License

This project is licensed under the MIT License - see the LICENSE file for details. 

GitHub Events

Total
  • Push event: 13
  • Create event: 4
Last Year
  • Push event: 13
  • Create event: 4

Dependencies

requirements.txt pypi
  • Pillow >=10.1.0
  • PyPDF2 >=3.0.0
  • beautifulsoup4 >=4.9.0
  • bs4 >=0.0.1
  • fake-useragent >=0.1.11
  • geckodriver-autoinstaller >=0.1.0
  • google-generativeai >=0.3.1
  • lxml >=4.9.0
  • pdfplumber >=0.7.0
  • pytest >=7.0.0
  • pytest-cov >=3.0.0
  • python-dotenv >=1.0.0
  • requests >=2.25.0
  • selenium >=4.0.0
  • tqdm >=4.60.0
  • webdriver-manager >=3.5.0
setup.py pypi
  • PyPDF2 >=2.0.0
  • beautifulsoup4 >=4.9.0
  • fake-useragent >=0.1.11
  • pdfplumber >=0.7.0
  • requests >=2.25.0
  • selenium >=4.0.0
  • tqdm >=4.60.0
  • webdriver-manager >=3.5.0