citation

A Python library to extract citation information for academic use and study

https://github.com/areopaguaworkshop/citation

Science Score: 36.0%

This score indicates how likely this project is to be science-related based on various indicators:

○ CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
✓ .zenodo.json file: Found .zenodo.json file
○ DOI references
✓ Academic publication links: Links to arxiv.org, nature.com
○ Academic email domains
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (13.6%) to scientific vocabulary

Last synced: 10 months ago

Repository

A Python library to extract citation information for academic use and study

Basic Info

Host: GitHub
Owner: Areopaguaworkshop
Language: Python
Default Branch: main
Size: 9.2 MB

Statistics

Stars: 1
Watchers: 1
Forks: 0
Open Issues: 0
Releases: 0

Created about 1 year ago · Last pushed 11 months ago

Metadata Files

Readme Citation

🔍 Citation Extractor

Bridging the Trust Gap in the AI Era
Because every claim deserves a source, and every source deserves proper citation.

🚨 Why This Matters

We're living in an era where AI can write beautifully, but can't cite properly.

Large Language Models (LLMs) like ChatGPT, Claude, and Gemini are incredible at generating human-like text, but they have a fundamental flaw: they lack reliable citation mechanisms. When an LLM tells you about a scientific study, historical event, or technical concept, you're left wondering:

📚 Where did this information come from?
🔍 How can I verify these claims?
📝 How do I properly cite this in my research?

This creates a trust gap that undermines the reliability of AI-generated content, especially in academic, professional, and research contexts.

Citation Extractor exists to fill this gap.

While LLMs struggle with proper citations, this tool extracts structured, verifiable citation data from PDFs, web pages, and media files. It's the missing piece that makes AI-generated content checkable and academically sound.

🌟 Features

🎯 Universal Source Support

📄 PDFs: Academic papers, books, theses, book chapters
🌐 Web URLs: Articles, blog posts, online publications
🎥 Media Files: Video lectures, podcasts, audio recordings

🧠 AI-Powered Intelligence

Smart Document Classification: Automatically detects if it's a journal article, book, thesis, or book chapter
Multilingual OCR: Handles English, Chinese (Simplified & Traditional), and more
Flexible LLM Backend: Works with Ollama (local) or cloud APIs (Gemini, OpenAI)

📚 Research-Grade Output

CSL-JSON Standard: Compatible with Zotero, Mendeley, EndNote, and all major reference managers
Multiple Citation Styles: Chicago, APA, MLA, and any CSL style you need
Structured Metadata: Author, title, publication date, DOI, ISBN, and more
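
For readers unfamiliar with CSL-JSON, the sketch below builds one record by hand and serializes it. The field names (`type`, `title`, `author`, `container-title`, `issued`, `DOI`) follow the CSL-JSON schema; every value is a placeholder invented for illustration, not output produced by Citation Extractor.

```python
import json

# A hand-written CSL-JSON item. All values are illustrative placeholders,
# not real output from Citation Extractor.
csl_item = {
    "id": "example-2023",
    "type": "article-journal",
    "title": "An Example Article Title",
    "author": [{"family": "Doe", "given": "Jane"}],
    "container-title": "Journal of Examples",
    "issued": {"date-parts": [[2023, 5, 17]]},
    "DOI": "10.0000/example.doi",
}


def to_csl_json(items):
    """Serialize a list of CSL items as a CSL-JSON array."""
    return json.dumps(items, ensure_ascii=False, indent=2)


csl_json_text = to_csl_json([csl_item])
print(csl_json_text)
```

A file containing such an array is the kind of input reference managers accept when importing CSL-JSON.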

⚡ Streamlined Performance

Smart Page Selection: Processes only the most relevant pages for speed
Iterative Extraction: Efficiently extracts citation data with early stopping when sufficient information is found
Offline Processing: Works entirely offline for PDF documents without requiring external API calls
Batch Processing: Handle multiple documents efficiently
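
The iterative extraction idea can be sketched roughly as follows. This is not the library's implementation: the set of required fields, the `extract_page` callable, and the "first non-empty value wins" merge rule are all assumptions made for illustration.

```python
from typing import Callable, Dict, Iterable

# Hypothetical set of fields considered "sufficient"; the real tool may differ.
REQUIRED_FIELDS = ("title", "author", "issued")


def extract_iteratively(
    pages: Iterable[str],
    extract_page: Callable[[str], Dict[str, object]],
    required=REQUIRED_FIELDS,
) -> Dict[str, object]:
    """Merge per-page metadata, stopping once all required fields are found."""
    found: Dict[str, object] = {}
    for text in pages:
        for key, value in extract_page(text).items():
            # Keep the first non-empty value seen for each field.
            if value and key not in found:
                found[key] = value
        if all(field in found for field in required):
            break  # early stop: remaining pages are never processed
    return found


def toy_extractor(text: str) -> Dict[str, object]:
    """Toy stand-in for an OCR + LLM step: parses "Key: value" lines."""
    fields: Dict[str, object] = {}
    for line in text.splitlines():
        if ":" in line:
            key, value = line.split(":", 1)
            fields[key.strip().lower()] = value.strip()
    return fields


pages = [
    "Title: An Example Title\nAuthor: Jane Doe",
    "Issued: 2023",
    "Title: A Misleading Running Header",
]
result = extract_iteratively(pages, toy_extractor)
print(result)
```

In this toy run the third page is never read, because the first two already supply every required field.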

🚀 Quick Start

Installation

```bash
pip install cite-extractor
```

System Dependencies

```bash
# Ubuntu/Debian
sudo apt-get install tesseract-ocr mediainfo

# macOS
brew install tesseract mediainfo

# For local LLM support (optional)
# Install Ollama: https://ollama.ai/
```

First Citation

```bash
# Extract from a PDF
citation "path/to/research-paper.pdf"

# Extract from a URL
citation "https://www.nature.com/articles/s41586-023-06627-7"

# Extract from a video
citation "path/to/conference-talk.mp4"
```

📖 Usage

Command Line Interface

```bash
# Basic usage
citation "document.pdf"

# Specify document type
citation "thesis.pdf" --type thesis

# Use different LLM
citation "paper.pdf" --llm gemini/gemini-1.5-flash

# Custom output directory
citation "book.pdf" --output-dir ./citations

# Specific page range for large documents
citation "book.pdf" --page-range "1-5, -3"

# Different citation style
citation "article.pdf" --citation-style apa
```
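
The `--page-range "1-5, -3"` value suggests a small range grammar. The README does not define it, so the sketch below assumes that `A-B` means pages A through B, a bare `N` means a single page, and `-K` means the last K pages. The function `parse_page_range` is hypothetical and is not part of the tool.

```python
def parse_page_range(spec: str, total_pages: int) -> list[int]:
    """Expand a spec such as "1-5, -3" into sorted 1-based page numbers.

    Assumed semantics (not confirmed by the README):
      "N"    -> page N
      "A-B"  -> pages A through B inclusive
      "-K"   -> the last K pages
    """
    pages: set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if part.startswith("-"):
            count = int(part[1:])
            start, end = total_pages - count + 1, total_pages
        elif "-" in part:
            first, last = part.split("-", 1)
            start, end = int(first), int(last)
        else:
            start = end = int(part)
        # Clamp to the document so out-of-range requests are ignored.
        start, end = max(start, 1), min(end, total_pages)
        pages.update(range(start, end + 1))
    return sorted(pages)


selected = parse_page_range("1-5, -3", total_pages=200)
print(selected)
```

Under these assumptions, a 200-page book would be reduced to its first five and last three pages, where title pages and colophons usually sit.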

Python API

```python
from citation.main import CitationExtractor
from citation.citation_style import format_bibliography

# Initialize with your preferred LLM
extractor = CitationExtractor(llm_model="ollama/qwen3")

# Extract citation data
csl_data = extractor.extract_citation("research-paper.pdf")

if csl_data:
    # Format as bibliography
    bibliography, in_text = format_bibliography([csl_data], "chicago-author-date")

    print("📚 Bibliography:")
    print(bibliography)

    print("\n📝 In-text citation:")
    print(in_text)
```

Advanced Configuration

```bash
# For non-English documents
citation "chinese-paper.pdf" --lang chi_sim+eng

# Verbose output for debugging
citation "document.pdf" --verbose

# Custom citation style (place .csl file in citation/styles/)
citation "paper.pdf" --citation-style nature
```
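
When driving the CLI from a script, it can help to assemble the argument list in one place. The helpers below are hypothetical. They use only flags shown in this README (`--lang`, `--citation-style`, `--verbose`), and nothing is executed unless `run_citation` is called.

```python
import subprocess
from typing import List, Optional


def build_citation_command(
    source: str,
    lang: Optional[str] = None,
    citation_style: Optional[str] = None,
    verbose: bool = False,
) -> List[str]:
    """Assemble an argv list for the `citation` CLI (hypothetical helper)."""
    command = ["citation", source]
    if lang:
        command += ["--lang", lang]
    if citation_style:
        command += ["--citation-style", citation_style]
    if verbose:
        command.append("--verbose")
    return command


def run_citation(command: List[str]) -> subprocess.CompletedProcess:
    """Run the assembled command; requires cite-extractor to be installed."""
    return subprocess.run(command, check=True)


command = build_citation_command(
    "chinese-paper.pdf", lang="chi_sim+eng", verbose=True
)
print(command)
```

Passing a list rather than a shell string avoids quoting problems with file names that contain spaces.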

🎯 Use Cases

📚 Academic Researchers

Automatically cite papers you're reading
Build bibliographies from PDF collections
Ensure proper attribution in literature reviews

🎓 Students

Generate citations for thesis references
Create bibliographies for term papers
Verify and format existing citations

📰 Content Creators

Add credible sources to blog posts
Cite academic backing for claims
Build trust with properly attributed content

🤖 AI Developers

Add citation capabilities to AI applications
Verify sources for AI-generated content
Build trustworthy AI systems

🛠️ Supported LLM Providers

| Provider | Models | Setup |
|----------|---------|-------|
| Ollama (Local) | qwen3, llama3, mistral | Install Ollama |
| Google Gemini | gemini-1.5-flash, gemini-1.5-pro | Set API key |
| OpenAI | gpt-4, gpt-3.5-turbo | Set API key |
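
One way to choose among these providers at runtime is to check which API key is available and fall back to a local model. In the sketch below, the environment variable names `GEMINI_API_KEY` and `OPENAI_API_KEY` and the model string `openai/gpt-4` are assumptions. Only `ollama/qwen3` and `gemini/gemini-1.5-flash` appear verbatim in this README.

```python
import os
from typing import Mapping


def choose_llm_model(env: Mapping[str, str] = os.environ) -> str:
    """Pick a model string based on which API key is present.

    The environment variable names and the "openai/gpt-4" string are
    assumptions; check your provider setup before relying on them.
    """
    if env.get("GEMINI_API_KEY"):
        return "gemini/gemini-1.5-flash"
    if env.get("OPENAI_API_KEY"):
        return "openai/gpt-4"
    return "ollama/qwen3"  # local fallback, no API key needed


print(choose_llm_model({}))
print(choose_llm_model({"GEMINI_API_KEY": "placeholder"}))
```

The returned string could then be passed as the `--llm` value or as the `llm_model` argument shown earlier.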

🌈 Examples

Extract from Academic Paper

```bash
citation "https://arxiv.org/pdf/2301.07041.pdf"
```

Extract from News Article

```bash
citation "https://www.bbc.com/news/science-environment-64234567"
```

Extract from Video Lecture

```bash
citation "MIT_6.034_Lecture_1.mp4"
```

🤝 Contributing

We're thrilled to have you join this mission! 🎉

This project addresses a fundamental need in our AI-driven world, and we believe it can make a real difference in how we handle information credibility. Whether you're a developer, researcher, or just someone who cares about proper attribution, there's a place for you here.

🚀 How to Contribute

🐛 Report Issues: Found a bug or have a feature request?
💡 Suggest Improvements: Ideas for better citation extraction?
🔧 Submit Code: Bug fixes, new features, or optimizations
📚 Improve Documentation: Help others understand and use the tool
🌍 Add Language Support: Extend OCR and extraction to new languages
🎨 Citation Styles: Add support for more academic citation styles

💻 Development Setup

```bash
git clone https://github.com/your-username/citation-extractor.git
cd citation-extractor

# Install development dependencies
pip install -e ".[dev]"

# Run tests
pytest

# Format code
black .
```

🎯 Priority Areas

🔍 Enhanced Source Detection: Better recognition of document types
🌐 Web Scraping: Improved extraction from various websites
🎥 Media Support: Better metadata extraction from videos/audio
📊 Batch Processing: GUI for handling multiple documents
🔗 Integration: Plugins for popular reference managers

🏆 Acknowledgments

This project stands on the shoulders of giants:

- DSPy: For flexible LLM integration
- Tesseract: For OCR capabilities
- citeproc-py: For citation formatting
- The Open Source Community: For making tools like this possible

📄 License

MIT License - feel free to use this in your projects, commercial or otherwise.

🔗 Links

📦 PyPI: https://pypi.org/project/cite-extractor/
🐛 Issues: Report bugs or request features
💬 Discussions: Join the conversation

Made with ❤️ for the research community
Because every claim deserves a source, and every source deserves respect.

⭐ Star this repo if you find it useful! ⭐

Owner

Name: Ajia
Login: Areopaguaworkshop
Kind: user
Location: London
Company: 光从东方来

Website: https://gcdfl.org/
Repositories: 1
Profile: https://github.com/Areopaguaworkshop

2021-2024 SOAS University of London, MPhil/PhD Candidate
2019-2021 Boston College School of Theology and Ministry, MT

GitHub Events

Total

Push event: 10
Create event: 3

Last Year

Push event: 10
Create event: 3

Dependencies

pyproject.toml pypi

PyMuPDF >=1.23.0
citeproc-py >=0.7.0
crawl4ai >=0.7.0
dspy-ai >=2.6.27
lxml >=4.9.0
ocrmypdf >=16.10.4
pymediainfo >=7.0.1
pypinyin >=0.51.0
python-dateutil >=2.8.0
requests >=2.31.0
trafilatura >=1.6.0
urllib3 >=2.0.0
