citation
an python library to extarct the citation informations for academic usage and study
Science Score: 36.0%
This score indicates how likely this project is to be science-related based on various indicators:
-
○CITATION.cff file
-
✓codemeta.json file
Found codemeta.json file -
✓.zenodo.json file
Found .zenodo.json file -
○DOI references
-
✓Academic publication links
Links to: arxiv.org, nature.com -
○Academic email domains
-
○Institutional organization owner
-
○JOSS paper metadata
-
○Scientific vocabulary similarity
Low similarity (13.6%) to scientific vocabulary
Repository
an python library to extarct the citation informations for academic usage and study
Basic Info
- Host: GitHub
- Owner: Areopaguaworkshop
- Language: Python
- Default Branch: main
- Size: 9.2 MB
Statistics
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
🔍 Citation Extractor
Bridging the Trust Gap in the AI Era
Because every claim deserves a source, and every source deserves proper citation.
Why This Matters • Features • Quick Start • Usage • Contributing
🚨 Why This Matters
We're living in an era where AI can write beautifully, but can't cite properly.
Large Language Models (LLMs) like ChatGPT, Claude, and Gemini are incredible at generating human-like text, but they have a fundamental flaw: they lack reliable citation mechanisms. When an LLM tells you about a scientific study, historical event, or technical concept, you're left wondering:
- 📚 Where did this information come from?
- 🔍 How can I verify these claims?
- 📝 How do I properly cite this in my research?
This creates a trust gap that undermines the reliability of AI-generated content, especially in academic, professional, and research contexts.
Citation Extractor exists to fill this gap.
While LLMs struggle with proper citations, this tool excels at extracting structured, verifiable citation data from any source. It's the missing piece that makes AI-generated content trustworthy and academically sound.
🌟 Features
🎯 Universal Source Support
- 📄 PDFs: Academic papers, books, theses, book chapters
- 🌐 Web URLs: Articles, blog posts, online publications
- 🎥 Media Files: Video lectures, podcasts, audio recordings
🧠 AI-Powered Intelligence
- Smart Document Classification: Automatically detects if it's a journal article, book, thesis, or book chapter
- Multilingual OCR: Handles English, Chinese (Simplified & Traditional), and more
- Flexible LLM Backend: Works with Ollama (local) or cloud APIs (Gemini, OpenAI)
📚 Research-Grade Output
- CSL-JSON Standard: Compatible with Zotero, Mendeley, EndNote, and all major reference managers
- Multiple Citation Styles: Chicago, APA, MLA, and any CSL style you need
- Structured Metadata: Author, title, publication date, DOI, ISBN, and more
⚡ Streamlined Performance
- Smart Page Selection: Processes only the most relevant pages for speed
- Iterative Extraction: Efficiently extracts citation data with early stopping when sufficient information is found
- Offline Processing: Works entirely offline for PDF documents without requiring external API calls
- Batch Processing: Handle multiple documents efficiently
🚀 Quick Start
Installation
bash
pip install cite-extractor
System Dependencies
```bash
Ubuntu/Debian
sudo apt-get install tesseract-ocr mediainfo
macOS
brew install tesseract mediainfo
For local LLM support (optional)
Install Ollama: https://ollama.ai/
```
First Citation
```bash
Extract from a PDF
citation "path/to/research-paper.pdf"
Extract from a URL
citation "https://www.nature.com/articles/s41586-023-06627-7"
Extract from a video
citation "path/to/conference-talk.mp4" ```
📖 Usage
Command Line Interface
```bash
Basic usage
citation "document.pdf"
Specify document type
citation "thesis.pdf" --type thesis
Use different LLM
citation "paper.pdf" --llm gemini/gemini-1.5-flash
Custom output directory
citation "book.pdf" --output-dir ./citations
Specific page range for large documents
citation "book.pdf" --page-range "1-5, -3"
Different citation style
citation "article.pdf" --citation-style apa ```
Python API
```python from citation.main import CitationExtractor from citation.citationstyle import formatbibliography
Initialize with your preferred LLM
extractor = CitationExtractor(llm_model="ollama/qwen3")
Extract citation data
csldata = extractor.extractcitation("research-paper.pdf")
if csldata: # Format as bibliography bibliography, intext = formatbibliography([csldata], "chicago-author-date")
print("📚 Bibliography:")
print(bibliography)
print("\n📝 In-text citation:")
print(in_text)
```
Advanced Configuration
```bash
For non-English documents
citation "chinese-paper.pdf" --lang chi_sim+eng
Verbose output for debugging
citation "document.pdf" --verbose
Custom citation style (place .csl file in citation/styles/)
citation "paper.pdf" --citation-style nature ```
🎯 Use Cases
📚 Academic Researchers
- Automatically cite papers you're reading
- Build bibliographies from PDF collections
- Ensure proper attribution in literature reviews
🎓 Students
- Generate citations for thesis references
- Create bibliographies for term papers
- Verify and format existing citations
📰 Content Creators
- Add credible sources to blog posts
- Cite academic backing for claims
- Build trust with properly attributed content
🤖 AI Developers
- Add citation capabilities to AI applications
- Verify sources for AI-generated content
- Build trustworthy AI systems
🛠️ Supported LLM Providers
| Provider | Models | Setup |
|----------|---------|-------|
| Ollama (Local) | qwen3, llama3, mistral | Install Ollama |
| Google Gemini | gemini-1.5-flash, gemini-1.5-pro | Set API key |
| OpenAI | gpt-4, gpt-3.5-turbo | Set API key |
🌈 Examples
Extract from Academic Paper
bash
citation "https://arxiv.org/pdf/2301.07041.pdf"
Extract from News Article
bash
citation "https://www.bbc.com/news/science-environment-64234567"
Extract from Video Lecture
bash
citation "MIT_6.034_Lecture_1.mp4"
🤝 Contributing
We're thrilled to have you join this mission! 🎉
This project addresses a fundamental need in our AI-driven world, and we believe it can make a real difference in how we handle information credibility. Whether you're a developer, researcher, or just someone who cares about proper attribution, there's a place for you here.
🚀 How to Contribute
- 🐛 Report Issues: Found a bug or have a feature request?
- 💡 Suggest Improvements: Ideas for better citation extraction?
- 🔧 Submit Code: Bug fixes, new features, or optimizations
- 📚 Improve Documentation: Help others understand and use the tool
- 🌍 Add Language Support: Extend OCR and extraction to new languages
- 🎨 Citation Styles: Add support for more academic citation styles
💻 Development Setup
```bash git clone https://github.com/your-username/citation-extractor.git cd citation-extractor
Install development dependencies
pip install -e ".[dev]"
Run tests
pytest
Format code
black . ```
🎯 Priority Areas
- 🔍 Enhanced Source Detection: Better recognition of document types
- 🌐 Web Scraping: Improved extraction from various websites
- 🎥 Media Support: Better metadata extraction from videos/audio
- 📊 Batch Processing: GUI for handling multiple documents
- 🔗 Integration: Plugins for popular reference managers
🏆 Acknowledgments
This project stands on the shoulders of giants: - DSPy: For flexible LLM integration - Tesseract: For OCR capabilities - citeproc-py: For citation formatting - The Open Source Community: For making tools like this possible
📄 License
MIT License - feel free to use this in your projects, commercial or otherwise.
🔗 Links
- 📦 PyPI: https://pypi.org/project/cite-extractor/
- 🐛 Issues: Report bugs or request features
- 💬 Discussions: Join the conversation
Made with ❤️ for the research community
Because every claim deserves a source, and every source deserves respect.
⭐ Star this repo if you find it useful! ⭐
Owner
- Name: Ajia
- Login: Areopaguaworkshop
- Kind: user
- Location: London
- Company: 光从东方来
- Website: https://gcdfl.org/
- Repositories: 1
- Profile: https://github.com/Areopaguaworkshop
2021-2024 SOAS University of London, MPhil/PhD Candidate 2019-2021 Boston College School of Theology and Ministry, MT
GitHub Events
Total
- Push event: 10
- Create event: 3
Last Year
- Push event: 10
- Create event: 3
Dependencies
- PyMuPDF >=1.23.0
- citeproc-py >=0.7.0
- crawl4ai >=0.7.0
- dspy-ai >=2.6.27
- lxml >=4.9.0
- ocrmypdf >=16.10.4
- pymediainfo >=7.0.1
- pypinyin >=0.51.0
- python-dateutil >=2.8.0
- requests >=2.31.0
- trafilatura >=1.6.0
- urllib3 >=2.0.0