https://github.com/gautierdag/bibextract
MCP tool to extract latex/bibtex from arXiv source
Science Score: 26.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file (found codemeta.json file)
- ✓ .zenodo.json file (found .zenodo.json file)
- ○ DOI references
- ○ Academic publication links
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (15.9%) to scientific vocabulary
Statistics
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
- Releases: 2
README.md
bibextract
A Python package (with Rust backend) for extracting survey content and bibliography from arXiv papers.
There are a lot of arXiv MCP tools already. This is another.
What it does differently is that it extracts content directly from the LaTeX source of the paper, rather than parsing the PDF.
It also focuses entirely on survey/background/related work sections. Right now this tool will ignore all the other sections.
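As a rough illustration of what filtering LaTeX source by section title can look like, here is a minimal Python sketch. It is not the tool's Rust implementation: the keyword list, the regular expression, and the function name `extract_survey_sections` are all assumptions made for this example, and it only handles top-level `\section` commands.

```python
import re

# Hypothetical keyword list; the tool's actual matching rules may differ.
SURVEY_KEYWORDS = ("related work", "background", "survey", "literature review")

# Matches \section{Title} and \section*{Title} (no nested braces in the title).
SECTION_RE = re.compile(r"\\section\*?\{([^}]*)\}")


def extract_survey_sections(latex: str) -> dict[str, str]:
    """Return {section title: body} for sections that look like background material."""
    matches = list(SECTION_RE.finditer(latex))
    sections = {}
    for i, match in enumerate(matches):
        title = match.group(1).strip()
        # A section body runs until the next \section or the end of the input.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(latex)
        if any(keyword in title.lower() for keyword in SURVEY_KEYWORDS):
            sections[title] = latex[match.end():end].strip()
    return sections


sample = r"""
\section{Introduction}
We study X.
\section{Related Work}
Prior work \cite{smith2020} studied Y.
\section{Method}
Our method.
"""
found = extract_survey_sections(sample)
```

Working on the source keeps `\cite{...}` keys intact, which is what makes it possible to link the extracted text back to bibliography entries.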
Once it extracts the content, it also looks at the BBL file and tries to reconstruct the BibTeX file and normalise the entries. Not all BBL files work (see the tests/fixtures for examples). Once it has a title/author/year, it will try to look up the arXiv ID or DOI of the paper, and use that in the BibTeX entry instead of the raw entry from the BBL file.
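To give a feel for the BBL-to-BibTeX step, the sketch below parses one common BBL layout (`\bibitem` followed by `\newblock`-separated author, title, and venue) into a minimal `@misc` entry. This is an assumption-laden Python illustration, not the tool's parser: the helper names `parse_bbl` and `to_bibtex` are hypothetical, many real BBL styles will not fit this pattern, and the arXiv ID/DOI lookup is left out.

```python
import re

# Matches \bibitem{key} and \bibitem[label]{key}; the key is captured.
BIBITEM_RE = re.compile(r"\\bibitem(?:\[[^\]]*\])?\{([^}]+)\}")


def parse_bbl(bbl: str) -> list[dict]:
    """Parse author/title/year from a simple \\newblock-style BBL file."""
    entries = []
    # With one capture group, split() yields [preamble, key1, body1, key2, body2, ...].
    parts = BIBITEM_RE.split(bbl)
    for key, body in zip(parts[1::2], parts[2::2]):
        body = body.split(r"\end{thebibliography}")[0]
        # Collapse whitespace and drop the trailing full stop of each block.
        blocks = [" ".join(b.split()).rstrip(".") for b in body.split(r"\newblock")]
        year = re.search(r"\b(?:19|20)\d{2}\b", body)
        entries.append({
            "key": key,
            "author": blocks[0].replace("~", " "),
            "title": blocks[1] if len(blocks) > 1 else "",
            "year": year.group(0) if year else "",
        })
    return entries


def to_bibtex(entry: dict) -> str:
    """Render a parsed entry as a minimal @misc BibTeX record."""
    fields = {k: v for k, v in entry.items() if k != "key" and v}
    lines = [f"  {name} = {{{value}}}," for name, value in fields.items()]
    return "@misc{" + entry["key"] + ",\n" + "\n".join(lines) + "\n}"


sample_bbl = r"""
\begin{thebibliography}{1}
\bibitem[Smith and Doe(2020)]{smith2020}
John Smith and Jane Doe.
\newblock A study of things.
\newblock \emph{Journal of Stuff}, 2020.
\end{thebibliography}
"""
entries = parse_bbl(sample_bbl)
rendered = to_bibtex(entries[0])
```

The sample entry is made up for the example. A real pipeline would need fallbacks for BBL files that do not use `\newblock`, which is presumably why some fixtures fail.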
This citation normalisation means that you can pass multiple papers to it and it will extract the related work content and bibliography from all of them, merging them into a single output, with limited overlap.
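The merging idea can be sketched as keying every entry on a stable identifier and keeping the first entry seen for each key. The Python below is only an illustration under assumptions: the field names `doi` and `arxiv_id`, the fallback to title plus year, and the functions `normalise_key` and `merge_bibliographies` are hypothetical and not taken from the package.

```python
import re


def normalise_key(entry: dict) -> str:
    """Prefer a stable identifier; fall back to a normalised title plus year."""
    if entry.get("doi"):
        return "doi:" + entry["doi"].lower()
    if entry.get("arxiv_id"):
        # Drop a trailing version suffix such as "v2" so versions collapse together.
        return "arxiv:" + re.sub(r"v\d+$", "", entry["arxiv_id"])
    title = re.sub(r"[^a-z0-9]+", " ", entry.get("title", "").lower()).strip()
    return f"title:{title}:{entry.get('year', '')}"


def merge_bibliographies(*bibliographies: list[dict]) -> list[dict]:
    """Merge several bibliographies, keeping the first entry seen for each key."""
    merged = {}
    for bibliography in bibliographies:
        for entry in bibliography:
            merged.setdefault(normalise_key(entry), entry)
    return list(merged.values())


# Made-up entries with placeholder arXiv IDs, for illustration only.
paper_a = [{"title": "An Example Paper", "year": "2021", "arxiv_id": "0000.00001v2"}]
paper_b = [
    {"title": "An example paper.", "year": "2021", "arxiv_id": "0000.00001"},
    {"title": "Some Other Paper", "year": "2019"},
]
merged = merge_bibliographies(paper_a, paper_b)
```

Entries that lack both identifiers can still collide or be duplicated when titles are formatted differently, which is consistent with the overlap being limited rather than eliminated.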
The goal of this tool is to make it easy to get LLM agents to read/cite/write background sections of papers. In a loop, an agent could read a paper, extract the related work section, and then use all the arXiv IDs in that section to extract the related work sections of those papers, and so on. This way, you can build a large corpus of related work content without having to manually search for papers.
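The loop described above can be sketched as a breadth-first crawl. This is a hedged Python sketch, not part of the package: `crawl_related_work`, the `max_papers` cap, and the arXiv ID regex are assumptions, and the extractor is passed in as a callable so the example runs without network access. It only assumes the extractor takes a list of IDs and returns a dict with a `bibtex` key, as in the Python API shown under Usage.

```python
import re
from collections import deque
from typing import Callable

# Matches new-style arXiv identifiers such as 2104.08653, ignoring a version suffix.
ARXIV_ID_RE = re.compile(r"\b(\d{4}\.\d{4,5})(?:v\d+)?\b")


def crawl_related_work(
    seed_ids: list[str],
    extract: Callable[[list[str]], dict],
    max_papers: int = 20,
) -> dict[str, dict]:
    """Breadth-first crawl: extract a paper, then queue the arXiv IDs it cites."""
    seen = set(seed_ids)
    queue = deque(seed_ids)
    results = {}
    while queue and len(results) < max_papers:
        paper_id = queue.popleft()
        result = extract([paper_id])
        results[paper_id] = result
        for cited_id in ARXIV_ID_RE.findall(result["bibtex"]):
            if cited_id not in seen:
                seen.add(cited_id)
                queue.append(cited_id)
    return results


# Stand-in extractor with made-up IDs, so the sketch runs offline.
FAKE_BIBTEX = {
    "0000.00001": "eprint = {0000.00002}",
    "0000.00002": "eprint = {0000.00001}, eprint = {0000.00003v2}",
    "0000.00003": "",
}


def fake_extract(paper_ids: list[str]) -> dict:
    return {"survey_text": "", "bibtex": FAKE_BIBTEX[paper_ids[0]]}


crawled = crawl_related_work(["0000.00001"], fake_extract)
```

With the package installed, `bibextract.extract_survey` could presumably be passed as `extract`. The `seen` set and the `max_papers` cap are there because citation graphs contain cycles and grow quickly.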
Some future todos
- [ ] improve test coverage
- [ ] add more `.bbl` files to tests
- [ ] improve the MCP docs for the tool
Installation
Installing via Smithery
To install bibextract for Claude Desktop automatically via Smithery:
```bash
npx -y @smithery/cli install @gautierdag/bibextract --client claude
```
fastMCP server implementation
```bash
uv run bibextract_mcp.py
```
fastMCP from URL
```bash
# obviously check the file before running it, don't trust random scripts from the internet
uv run --python 3.12 https://raw.githubusercontent.com/gautierdag/bibextract/refs/heads/main/bibextract_mcp.py
```
From PyPI
```bash
uv add bibextract
```
From Source
1. Install Rust (if not already installed):

```bash
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
source ~/.cargo/env
```

2. Install maturin:

```bash
pip install maturin
```

3. Clone and build:

```bash
git clone https://github.com/gautierdag/bibextract.git
cd bibextract
maturin develop
```
Usage
Python API
```python
import bibextract

# Process one or more arXiv papers
result = bibextract.extract_survey(['2104.08653', '1912.02292'])

# Access the extracted content
survey_text = result['survey_text']  # Raw LaTeX with sections
bibtex = result['bibtex']  # BibTeX bibliography

# Save to files
with open('survey.tex', 'w') as f:
    f.write(survey_text)

with open('bibliography.bib', 'w') as f:
    f.write(bibtex)
```
Command Line (original Rust binary)
```bash
# Build the CLI tool
cargo build --release

# Process papers
./target/release/bibextract --paper-ids 2104.08653 1912.02292 --output survey.tex
```
Development
Running Tests
```bash
cargo test
pytest tests
```
License
This project is licensed under the MIT License - see the LICENSE file for details.
Owner
- Name: Gautier Dagan
- Login: gautierdag
- Kind: user
- Location: Edinburgh
- Website: www.gautier.tech
- Twitter: __gautier__
- Repositories: 53
- Profile: https://github.com/gautierdag
PhD Student in Natural Language Processing at the University of Edinburgh
GitHub Events
Total
- Release event: 2
- Delete event: 3
- Push event: 17
- Pull request event: 1
- Create event: 5
Last Year
- Release event: 2
- Delete event: 3
- Push event: 17
- Pull request event: 1
- Create event: 5
Issues and Pull Requests
Last synced: 9 months ago
All Time
- Total issues: 0
- Total pull requests: 2
- Average time to close issues: N/A
- Average time to close pull requests: 15 minutes
- Total issue authors: 0
- Total pull request authors: 1
- Average comments per issue: 0
- Average comments per pull request: 1.0
- Merged pull requests: 2
- Bot issues: 0
- Bot pull requests: 2
Past Year
- Issues: 0
- Pull requests: 2
- Average time to close issues: N/A
- Average time to close pull requests: 15 minutes
- Issue authors: 0
- Pull request authors: 1
- Average comments per issue: 0
- Average comments per pull request: 1.0
- Merged pull requests: 2
- Bot issues: 0
- Bot pull requests: 2
Top Authors
Issue Authors
Pull Request Authors
- smithery-ai[bot] (2)
Packages
- Total packages: 1
- Total downloads:
  - pypi: 236 last month
- Total dependent packages: 0
- Total dependent repositories: 0
- Total versions: 2
- Total maintainers: 1
pypi.org: bibextract
Extract survey content and bibliography from arXiv papers
- Homepage: https://github.com/gautierdag/bibextract
- Documentation: https://bibextract.readthedocs.io/
- License: MIT
- Latest release: 0.1.1 (published 11 months ago)