ai_enhanced_pdf_scholar

https://github.com/jackela/ai_enhanced_pdf_scholar

Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

✓ CITATION.cff file: Found CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
✓ .zenodo.json file: Found .zenodo.json file
○ DOI references
○ Academic publication links
○ Academic email domains
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (11.6%) to scientific vocabulary

Last synced: 10 months ago

Repository

Basic Info

Host: GitHub
Owner: Jackela
Language: Python
Default Branch: main
Size: 3.11 MB

Statistics

Stars: 0
Watchers: 0
Forks: 0
Open Issues: 0
Releases: 0

Created 12 months ago · Last pushed 10 months ago

Metadata Files

Readme Contributing Citation Security

🎓 AI Enhanced PDF Scholar

Stop Reading, Start Understanding.

AI Enhanced PDF Scholar is an intelligent platform that transforms your academic research workflow. Instead of drowning in a sea of PDFs, you can now have a conversation with your documents, uncover hidden connections, and focus on what truly matters: generating new insights.

The Problem: Information Overload is Slowing Down Research

University students, academics, and corporate researchers all face the same challenge: an ever-growing mountain of research papers. The traditional workflow is broken and inefficient:

- Fragmented Tools: Juggling PDF readers, note-takers, citation managers, and separate AI tools.
- Manual Searching: Endlessly scrolling and using Ctrl+F to find specific pieces of information.
- Lost Context: Losing track of where information came from, making citations a nightmare.

This manual, time-consuming process is a major bottleneck, leaving less time for critical thinking and analysis.

The Solution: Your Unified, Intelligent Research Hub

AI Enhanced PDF Scholar brings the power of cutting-edge AI directly to your research library. We provide a single, secure platform to:

Centralize Your Knowledge: Upload all your documents into one clean, searchable library.
Ask, Don't Search: Interact with your papers using natural language. Get direct answers, summaries, and insights in seconds.
Discover Connections: Automatically analyze citation networks to understand how research evolves.

Our mission is to help you move from tedious searching to accelerated understanding.

Designed For...

University & College Students: Perfect for writing literature reviews, essays, and dissertations.
Academic Researchers & Professors: Ideal for staying current, preparing lectures, and guiding student research.
Corporate R&D Professionals: A powerful tool for market research, competitive analysis, and internal knowledge management.

Key Features & Benefits

| Feature | Your Benefit |
| :--- | :--- |
| 💬 Chat with Your Documents | Instantly get answers and summaries from your PDFs. Stop skimming and start learning. |
| 🔗 Untangle Research Connections | Automatically extract citations and visualize the network to see how ideas connect and identify key papers. |
| 🔒 Secure & Private by Design | Your research is yours alone. All documents are stored securely and are never used to train public models. |
| ⚡️ Quick & Easy Setup | Get started in minutes. A clean, intuitive interface means you spend your time on research, not on learning a new tool. |

Product Demo

See AI Enhanced PDF Scholar in action!

A picture is worth a thousand words. Here we would include high-quality screenshots or GIFs showcasing the core user flow.

Caption: Your entire research library, organized and ready for analysis.

Caption: Ask a question and get a direct answer, complete with citations from your documents.

How to Get Started

Ready to revolutionize your research process?

🚀 View Live Demo | 📖 Read the Docs | 🛠️ Developer Quick Start

Future Roadmap

We are constantly improving the platform. See what's coming next in our public Product Roadmap (ROADMAP.md).

Technical Stack

For those interested, AI Enhanced PDF Scholar is built with a modern, robust technology stack:

- Backend: Python, FastAPI, LlamaIndex
- Frontend: React, TypeScript, Vite, TailwindCSS
- Database: PostgreSQL / SQLite
- DevOps: Docker, GitHub Actions

🛠️ Developer Quick Start

While this platform is designed for end-users, it's also a full-featured open-source project. Developers can get started by following these steps.

Prerequisites

Python 3.11+, Node.js 18+, Git
Google Gemini API Key

1. Installation & Setup

```bash
# Clone, install dependencies
git clone https://github.com/Jackela/ai_enhanced_pdf_scholar.git
cd ai_enhanced_pdf_scholar
pip install -r requirements.txt
cd frontend && npm install && cd ..

# Configure API Key
export GOOGLE_API_KEY="your_gemini_api_key_here"
```

2. Launch The App

```bash
# Run backend (Terminal 1)
uvicorn web_main:app --reload --port 8000

# Run frontend (Terminal 2)
cd frontend && npm run dev
```

Access the app at http://localhost:5173 and the API docs at http://localhost:8000/docs.

This project was created to showcase product management and software engineering skills.

Owner

Login: Jackela
Kind: user

Repositories: 1
Profile: https://github.com/Jackela

Citation (CITATION_LIBRARIES_GUIDE.md)

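# Integration Guide for Third-Party Citation Parsing Libraries

## Overview

AI Enhanced PDF Scholar can integrate several third-party libraries that significantly improve citation parsing accuracy. The system follows a **progressive enhancement** strategy: the basic functionality works without any third-party libraries, and installing them yields higher parsing accuracy.

## 🚀 Quick Start

### Basic installation (built-in parser only)
```bash
# Basic functionality, parsing with regular expressions
pip install -r requirements.txt
```

### Enhanced installation (recommended)
```bash
# Install the core enhancement libraries for noticeably higher accuracy
pip install -r requirements-citation.txt
```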

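## 📚 Supported Third-Party Libraries

### 1. **refextract** (developed at CERN) - 🌟 Strongly recommended

**Advantages**:
- Developed at CERN (the European Organization for Nuclear Research) specifically for extracting references from scholarly documents
- Extensively validated on high-energy physics and academic literature
- Supports many journal formats and citation styles
- High-accuracy recognition (typically >85% confidence)

**Installation**:
```bash
pip install refextract
```

**What it adds**:
- Intelligent author name parsing
- Journal title normalization
- Automatic DOI detection
- Precise extraction of years and page numbers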

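### 2. **AnyStyle.io** (API integration)

**Advantages**:
- Modern parsing engine based on machine learning
- Supports multiple citation formats (APA, MLA, Chicago, and others)
- Structured data output with high accuracy

**Configuration** (optional):
```python
# Add the API settings to config.py
ANYSTYLE_API_URL = "https://anystyle.io/api"
ANYSTYLE_API_KEY = "your-api-key"  # if required
```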

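### 3. **PDF processing libraries**

**pdfplumber** - better PDF text extraction:
```bash
pip install pdfplumber
```

**PyPDF2** - classic PDF processing:
```bash
# Quote the requirement so the shell does not treat ">" as a redirect
pip install "PyPDF2>=3.0.1"
```

### 4. **Text processing libraries**

**String similarity**:
```bash
pip install jellyfish python-Levenshtein
```

**Unicode normalization**:
```bash
pip install unidecode
```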

## 🔧 Usage

### Basic usage
```python
from src.services.citation_parsing_service import CitationParsingService

service = CitationParsingService()

# Automatically use every available third-party library
citations = service.parse_citations_from_text(text_content)

# Use only the built-in parser
citations = service.parse_citations_from_text(text_content, use_third_party=False)
```

### Advanced configuration
```python
# Check which third-party libraries are available
from src.services.citation_parsing_service import REFEXTRACT_AVAILABLE, REQUESTS_AVAILABLE

if REFEXTRACT_AVAILABLE:
    print("refextract is available; parsing accuracy will be higher")
else:
    print("Using the built-in parser; consider installing refextract for better accuracy")
```
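
Availability flags such as `REFEXTRACT_AVAILABLE` are typically computed once, when the module is loaded. The following is a minimal sketch of one way to do that with the standard library only. The helper name `module_available` is hypothetical, and how the project actually sets these flags is an assumption.

```python
import importlib.util


def module_available(name: str) -> bool:
    """Return True if a top-level module can be located without importing it."""
    return importlib.util.find_spec(name) is not None


# Flags computed once at import time. The flag names mirror the guide;
# the way the project really derives them is an assumption.
REFEXTRACT_AVAILABLE = module_available("refextract")
REQUESTS_AVAILABLE = module_available("requests")

if __name__ == "__main__":
    print(f"refextract available: {REFEXTRACT_AVAILABLE}")
    print(f"requests available: {REQUESTS_AVAILABLE}")
```

Checking with `find_spec` avoids paying the import cost of a heavy library just to learn whether it is installed; a `try`/`except ImportError` around the real import is an equally common alternative.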

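## 📊 Performance Comparison

| Parsing method | Accuracy | Speed | Dependencies |
|---------|------|------|------|
| Built-in regular expressions | ~35% | Fast | None |
| + refextract | ~60-85% | Medium | refextract |
| + AnyStyle API | ~75-90% | Slower | Network connection |
| Hybrid mode (recommended) | ~70-85% | Medium | refextract |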

## 🛠️ Troubleshooting

### refextract installation problems

**Windows users**:
```bash
# If you hit compilation errors, install Visual Studio Build Tools first
pip install --upgrade pip setuptools wheel
pip install refextract
```

**Linux users**:
```bash
# Install the required system dependencies
sudo apt-get install python3-dev libxml2-dev libxslt1-dev
pip install refextract
```

**macOS users**:
```bash
# Install the dependencies with Homebrew
brew install libxml2 libxslt
pip install refextract
```

### Common errors and solutions

**Error 1**: `ImportError: No module named 'refextract'`
```bash
# Solution: install refextract
pip install refextract
```

**Error 2**: Parsing returns no results
```python
# Check the text format and make sure it contains standard academic citations
sample_text = """
Smith, J. (2023). Title of Paper. Journal Name, 15(3), 123-145.
"""
```
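
When parsing comes back empty, a quick local check can tell whether the input contains anything that looks like a standard reference at all. The sketch below uses a single illustrative regular expression for an APA-like journal reference. The pattern and the helper `find_apa_like` are assumptions made for this example and are not the project's built-in parser.

```python
import re

# Illustrative pattern for one APA-like journal reference:
# "Authors (Year). Title. Journal, Volume(Issue), Pages."
APA_LIKE = re.compile(
    r"^(?P<authors>[^()]+?)\s*\((?P<year>\d{4})\)\.\s*"
    r"(?P<title>[^.]+)\.\s*"
    r"(?P<journal>[^,]+),\s*"
    r"(?P<volume>\d+)(?:\((?P<issue>\d+)\))?,\s*"
    r"(?P<pages>\d+\s*-\s*\d+)\.?$"
)


def find_apa_like(text: str) -> list[dict]:
    """Return the parsed fields of every line that matches APA_LIKE."""
    matches = []
    for line in text.splitlines():
        m = APA_LIKE.match(line.strip())
        if m:
            matches.append(m.groupdict())
    return matches


sample_text = """
Smith, J. (2023). Title of Paper. Journal Name, 15(3), 123-145.
"""
found = find_apa_like(sample_text)
print(found)
```

If this check finds nothing in your input, the text probably needs cleanup (for example, references broken across lines during PDF extraction) before any parser can succeed.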

**Error 3**: Version conflicts between third-party libraries
```bash
# Upgrade to the latest versions
pip install --upgrade refextract requests
```

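## 🎯 Best Practices

### 1. **Progressive deployment**
```python
import logging

from src.services.citation_parsing_service import CitationParsingService

logger = logging.getLogger(__name__)

# Recommended pattern for production environments
def parse_citations_robust(text_content):
    service = CitationParsingService()

    try:
        # Prefer enhanced mode
        return service.parse_citations_from_text(text_content, use_third_party=True)
    except Exception as e:
        logger.warning(f"Enhanced parsing failed: {e}")
        # Fall back to basic mode
        return service.parse_citations_from_text(text_content, use_third_party=False)
```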

### 2. **Performance optimization**
```python
# For large workloads, consider processing documents in batches
def batch_parse_citations(document_texts):
    service = CitationParsingService()  # create the service once and reuse it
    results = []

    for text in document_texts:
        citations = service.parse_citations_from_text(text)
        results.append(citations)

    return results
```
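
If parsing is dominated by network calls (for example, to a remote API), a thread pool may shorten batch runs. The following is a hedged sketch, not part of the project: `batch_parse_concurrently` is a hypothetical helper that takes any parsing callable, and whether `CitationParsingService` is safe to share across threads is an assumption you would need to verify first.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def batch_parse_concurrently(
    document_texts: list[str],
    parse_fn: Callable[[str], list],
    max_workers: int = 4,
) -> list[list]:
    """Apply parse_fn to every text using worker threads, preserving input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in the order of the inputs.
        return list(pool.map(parse_fn, document_texts))


if __name__ == "__main__":
    # Stand-in parser for demonstration: one "citation" per non-empty line.
    def fake_parse(text: str) -> list:
        return [line for line in text.splitlines() if line.strip()]

    print(batch_parse_concurrently(["a\nb", "", "c"], fake_parse))
```

Threads help mainly with I/O-bound work; for purely CPU-bound regex parsing, the sequential loop is usually just as fast because of the interpreter lock.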

### 3. **Quality validation**
```python
# Validate parsing quality
def validate_parsing_quality(citations):
    high_confidence = [c for c in citations if c['confidence_score'] >= 0.7]

    print(f"Total citations: {len(citations)}")
    print(f"High-confidence citations: {len(high_confidence)}")
    if citations:  # avoid division by zero when nothing was parsed
        print(f"Quality rate: {len(high_confidence)/len(citations)*100:.1f}%")
```

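## 🔮 Future Plans

### Libraries planned for support

1. **Grobid** - full-text PDF analysis and reference extraction
2. **spaCy + scispaCy** - NLP-based processing of scientific literature
3. **BERT-based models** - deep-learning citation parsing
4. **Crossref API** - validation and completion of citation metadata

### Custom parser interface
```python
# Custom parser plugins are planned for a future release
class CustomCitationParser:
    def parse(self, text: str) -> list[dict]:
        # Custom parsing logic
        pass

# Register the custom parser
service.register_parser(CustomCitationParser())
```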
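
Because the plugin interface is not available yet, the following is only a sketch of how such a registry could work. `CitationParser`, `ParserRegistry`, `parse_all`, and `YearOnlyParser` are hypothetical names introduced for illustration; only the `parse(text) -> list[dict]` shape and the `register_parser` name come from the guide.

```python
import re
from typing import Protocol


class CitationParser(Protocol):
    """Any object with a parse(text) -> list[dict] method qualifies."""

    def parse(self, text: str) -> list[dict]: ...


class ParserRegistry:
    """Hypothetical registry that runs every registered parser in order."""

    def __init__(self) -> None:
        self._parsers: list[CitationParser] = []

    def register_parser(self, parser: CitationParser) -> None:
        self._parsers.append(parser)

    def parse_all(self, text: str) -> list[dict]:
        results: list[dict] = []
        for parser in self._parsers:
            results.extend(parser.parse(text))
        return results


class YearOnlyParser:
    """Toy plugin: one record per line containing a 4-digit year in parentheses."""

    def parse(self, text: str) -> list[dict]:
        return [
            {"raw_text": line.strip(), "year": m.group(1)}
            for line in text.splitlines()
            if (m := re.search(r"\((\d{4})\)", line))
        ]


if __name__ == "__main__":
    registry = ParserRegistry()
    registry.register_parser(YearOnlyParser())
    print(registry.parse_all("Smith, J. (2023). Title.\nno year here"))
```

Using a `Protocol` means plugins do not need to inherit from a base class; any object with a matching `parse` method can be registered.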

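## 📈 Contributing

If you find another useful citation parsing library, pull requests are welcome:

1. Add the integration code to `citation_parsing_service.py`
2. Add the dependency to `requirements-citation.txt`
3. Write the corresponding test cases
4. Update this document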

## 💡 Technical Details

### Integration architecture

```python
# Core integration pattern
def parse_citations_from_text(self, text_content: str, use_third_party: bool = True):
    citations = []

    # 1. Third-party parsers (if available)
    if use_third_party:
        citations.extend(self._parse_with_refextract(text_content))
        citations.extend(self._parse_with_anystyle_api(text_content))

    # 2. Built-in parser (always runs)
    fallback_citations = self._parse_with_regex(text_content)

    # 3. Merge and deduplicate
    return self._merge_and_deduplicate_citations(citations, fallback_citations)
```

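### Deduplication algorithm

The system uses a multi-layer deduplication strategy:
1. **Text similarity**: character-level similarity between citation strings
2. **Semantic similarity**: comparison of structured fields
3. **Confidence priority**: the higher-confidence result is kept

This yields the highest-quality parsing results while avoiding duplicates.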
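
A minimal sketch of these three layers, using only the standard library, might look like the following. This is not the project's `_merge_and_deduplicate_citations`: the function names, the `raw_text` field, the compared fields (`year`, `title`), and the 0.85 threshold are all assumptions chosen for illustration; only `confidence_score` appears in the guide.

```python
from difflib import SequenceMatcher


def text_similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1], case-insensitive."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def same_fields(a: dict, b: dict, keys=("year", "title")) -> bool:
    """Structured comparison: every listed field is present in both and equal."""
    return all(a.get(k) is not None and a.get(k) == b.get(k) for k in keys)


def deduplicate(citations: list[dict], threshold: float = 0.85) -> list[dict]:
    """Keep one citation per duplicate group, preferring higher confidence."""
    kept: list[dict] = []
    # Visit high-confidence citations first so they win over their duplicates.
    ordered = sorted(
        citations, key=lambda c: c.get("confidence_score", 0.0), reverse=True
    )
    for cand in ordered:
        is_dup = any(
            text_similarity(cand["raw_text"], k["raw_text"]) >= threshold
            or same_fields(cand, k)
            for k in kept
        )
        if not is_dup:
            kept.append(cand)
    return kept
```

Sorting by confidence before comparing keeps the logic simple: the first member of each duplicate group to be seen is always the one worth keeping. The pairwise comparison is quadratic in the number of citations, which is usually acceptable for a single paper's reference list.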

---

**Last updated**: 2025-01-20
**Version**: 1.0.0
**Compatibility**: AI Enhanced PDF Scholar v2.1.0+

GitHub Events

Total

Delete event: 2
Push event: 74
Pull request event: 3
Create event: 3

Last Year

Delete event: 2
Push event: 74
Pull request event: 3
Create event: 3

Dependencies

requirements.txt pypi

PyMuPDF *
PyQt6 *
llama-index ==0.12.45
llama-index-embeddings-google-genai *
llama-index-llms-google-genai *
llama-index-readers-file *
markdown *
pytest *
pytest-mock *
pytest-qt *
python-dotenv *

.github/actions/parallel-runner/action.yml actions

.github/actions/smart-trigger/action.yml actions

actions/cache v4 composite

.github/actions/turbo-cache/action.yml actions

actions/cache v4 composite

.github/workflows/build-optimized.yml actions

actions/cache v4 composite
actions/checkout v4 composite
actions/setup-node v4 composite
actions/setup-python v5 composite

.github/workflows/deployment-advanced.yml actions

actions/cache v4 composite
actions/checkout v4 composite
actions/download-artifact v4 composite
actions/setup-node v4 composite
actions/setup-python v5 composite
actions/upload-artifact v4 composite

.github/workflows/e2e-advanced.yml actions

actions/cache v4 composite
actions/checkout v4 composite
actions/setup-node v4 composite
actions/setup-python v5 composite
actions/upload-artifact v4 composite

.github/workflows/integration-validation.yml actions

actions/cache v4 composite
actions/checkout v4 composite
actions/setup-node v4 composite
actions/setup-python v5 composite
actions/upload-artifact v4 composite

.github/workflows/main-pipeline.yml actions

actions/checkout v4 composite
dorny/paths-filter v3 composite

.github/workflows/performance-advanced.yml actions

actions/cache v4 composite
actions/checkout v4 composite
actions/setup-node v4 composite
actions/setup-python v5 composite
actions/upload-artifact v4 composite

.github/workflows/quality-enhanced.yml actions

actions/cache v4 composite
actions/checkout v4 composite
actions/setup-node v4 composite
actions/setup-python v5 composite
actions/upload-artifact v4 composite

.github/workflows/quality-lightning-simple.yml actions

actions/checkout v4 composite
actions/setup-python v5 composite

.github/workflows/security-advanced.yml actions

actions/cache v4 composite
actions/checkout v4 composite
actions/setup-node v4 composite
actions/setup-python v5 composite
actions/upload-artifact v4 composite

.github/workflows/test-optimized.yml actions

actions/cache v4 composite
actions/checkout v4 composite
actions/setup-node v4 composite
actions/setup-python v5 composite
actions/upload-artifact v4 composite

.github/workflows/test-simple.yml actions

actions/checkout v4 composite

Dockerfile docker

development latest build
node 18-alpine build
production latest build
python 3.11-slim build

docker-compose.yml docker

nginx alpine
prom/prometheus latest

frontend/package-lock.json npm

1087 dependencies

frontend/package.json npm

@size-limit/preset-app ^11.0.0 development
@tailwindcss/aspect-ratio ^0.4.2 development
@tailwindcss/forms ^0.5.9 development
@tailwindcss/typography ^0.5.15 development
@testing-library/jest-dom ^6.6.3 development
@testing-library/react ^16.3.0 development
@testing-library/user-event ^14.6.1 development
@types/react ^18.2.37 development
@types/react-dom ^18.2.15 development
@typescript-eslint/eslint-plugin ^6.10.0 development
@typescript-eslint/parser ^6.10.0 development
@vitejs/plugin-react ^4.1.1 development
@vitest/coverage-v8 ^3.2.4 development
@vitest/ui ^3.2.4 development
autoprefixer ^10.4.16 development
cross-env ^7.0.3 development
cssnano ^6.0.1 development
eslint ^8.57.0 development
eslint-plugin-react-hooks ^4.6.0 development
eslint-plugin-react-refresh ^0.4.4 development
jsdom ^26.1.0 development
lint-staged ^15.2.0 development
postcss ^8.4.31 development
prettier ^3.1.0 development
rimraf ^5.0.5 development
rollup-plugin-visualizer ^5.12.0 development
size-limit ^11.0.1 development
tailwindcss ^3.3.5 development
typescript ^5.2.2 development
vite ^6.0.7 development
vite-bundle-analyzer ^0.9.4 development
vite-plugin-pwa ^0.21.1 development
vitest ^3.2.4 development
@radix-ui/react-dropdown-menu ^2.1.15
@radix-ui/react-toast ^1.2.14
@tanstack/react-query ^5.8.4
@tanstack/react-query-devtools ^5.83.0
@types/node ^24.0.13
axios ^1.6.2
class-variance-authority ^0.7.1
clsx ^2.0.0
date-fns ^2.30.0
framer-motion ^10.16.5
lucide-react ^0.294.0
pdfjs-dist ^4.7.76
react ^18.2.0
react-dom ^18.2.0
react-dropzone ^14.2.3
react-hot-toast ^2.4.1
react-markdown ^9.0.1
react-pdf ^9.1.1
react-router-dom ^6.20.1
react-syntax-highlighter ^15.6.1
remark-gfm ^4.0.0
tailwind-merge ^2.0.0
zustand ^4.4.7

pyproject.toml pypi

PyMuPDF >=1.26.0,<1.30.0
cachetools >=6.1.0
fastapi >=0.116.0,<0.120.0
google-generativeai >=0.8.5
llama-index-core >=0.12.49,<0.13.0
llama-index-embeddings-google-genai >=0.2.1,<0.3.0
llama-index-llms-google-genai >=0.2.4,<0.3.0
llama-index-readers-file >=0.4.11,<0.5.0
markdown >=3.6.0
openai >=1.95.0
pydantic >=2.11.0,<2.15.0
python-dotenv >=1.0.0
python-multipart >=0.0.19,<0.1.0
requests >=2.32.4
tenacity >=9.1.0
typing-extensions >=4.14.0
urllib3 >=2.5.0
uvicorn [standard]>=0.35.0,<0.40.0

requirements-citation.txt pypi

PyPDF2 >=3.0.1
jellyfish >=0.11.2
pdfplumber >=0.9.0
python-Levenshtein >=0.21.1
refextract >=0.3.0
requests >=2.31.0
unidecode >=1.3.6

requirements-dev.txt pypi

bandit >=1.8.0 development
memory-profiler >=0.61.0 development
mkdocs >=1.4.0 development
mkdocs-material >=9.0.0 development
mypy >=1.11.0 development
pip-audit >=2.7.0 development
playwright >=1.40.0 development
pre-commit >=4.0.0 development
py-spy >=0.3.14 development
pytest >=8.0.0 development
pytest-asyncio >=0.24.0 development
pytest-benchmark >=4.0.0 development
pytest-cov >=5.0.0 development
pytest-mock >=3.12.0 development
pytest-xdist >=3.8.0 development
ruff >=0.8.0 development

requirements-prod.txt pypi

PyMuPDF >=1.26.0,<1.30.0
aiocache >=0.12.3
aiofiles >=24.0.0
cachetools >=6.1.0
fastapi >=0.116.0,<0.120.0
google-generativeai >=0.8.5
llama-index-core >=0.12.49,<0.13.0
llama-index-embeddings-google-genai >=0.2.1,<0.3.0
llama-index-llms-google-genai >=0.2.4,<0.3.0
llama-index-readers-file >=0.4.11,<0.5.0
markdown >=3.6.0
openai >=1.95.0
orjson >=3.10.0
pydantic >=2.11.0,<2.15.0
python-dotenv >=1.0.0
python-multipart >=0.0.19,<0.1.0
requests >=2.32.4
tenacity >=9.1.0
typing-extensions >=4.14.0
urllib3 >=2.5.0
uvicorn >=0.35.0,<0.40.0
uvloop >=0.21.0

requirements-test.txt pypi

fastapi >=0.116.0,<0.120.0 test
pydantic >=2.11.0,<2.15.0 test
pytest >=8.0.0 test
pytest-asyncio >=0.24.0 test
pytest-cov >=5.0.0 test
pytest-mock >=3.12.0 test
pytest-xdist >=3.8.0 test
python-dotenv >=1.0.0 test
python-multipart >=0.0.19,<0.1.0 test
sqlalchemy >=2.0.0,<2.1.0 test
typing-extensions >=4.14.0 test
