https://github.com/gautierdag/bibextract

MCP tool to extract LaTeX/BibTeX from arXiv source

Science Score: 26.0%

This score indicates how likely this project is to be science-related based on various indicators:

○ CITATION.cff file
✓ codemeta.json file (found)
✓ .zenodo.json file (found)
○ DOI references
○ Academic publication links
○ Academic email domains
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: low similarity (15.9%) to scientific vocabulary

Keywords

arvix mcp mcp-server

Last synced: 10 months ago

Repository

MCP tool to extract LaTeX/BibTeX from arXiv source

Basic Info

Host: GitHub
Owner: gautierdag
License: mit
Language: Rust
Default Branch: main
Homepage:
Size: 226 KB

Statistics

Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Releases: 2

Topics

arvix mcp mcp-server

Created about 1 year ago · Last pushed about 1 year ago

Metadata Files

Readme License

bibextract

A Python package (with Rust backend) for extracting survey content and bibliography from arXiv papers.

There are a lot of ArXiv MCP tools already. This is another.

What it does differently is that it extracts content directly from the LaTeX source of the paper, rather than parsing the PDF.

It also focuses entirely on survey/background/related work sections. Right now this tool will ignore all the other sections.
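As a rough illustration of what selecting only survey-style sections from LaTeX source can look like, here is a minimal Python sketch. The keyword list, the regex, and the function name `extract_survey_sections` are assumptions made for this example; the Rust backend may match sections quite differently.

```python
import re

# Hypothetical heading keywords; the real tool's matching rules may differ.
SURVEY_KEYWORDS = ("related work", "background", "survey", "literature review")

# Matches top-level \section{...} and \section*{...} headings.
SECTION_RE = re.compile(r"\\section\*?\{([^}]*)\}")


def extract_survey_sections(latex: str) -> dict[str, str]:
    """Return {title: body} for top-level sections whose title contains a keyword."""
    matches = list(SECTION_RE.finditer(latex))
    sections = {}
    for i, m in enumerate(matches):
        title = m.group(1).strip()
        # A section body runs until the next \section or the end of the input.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(latex)
        if any(keyword in title.lower() for keyword in SURVEY_KEYWORDS):
            sections[title] = latex[m.end():end].strip()
    return sections


sample = r"""
\section{Introduction}
We study X.
\section{Related Work}
Prior work \cite{smith2020} did Y.
\section{Method}
Our method.
"""
found = extract_survey_sections(sample)
print(list(found))
```

Working on the source keeps `\cite{...}` keys intact, which is what makes it possible to link the extracted text back to bibliography entries.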

Once it extracts the content, it also looks at the BBL file and tries to reconstruct the BibTeX file and normalise the entries. Not all BBL files work (see tests/fixtures for examples). Once it has a title/author/year, it will try to look up the arXiv ID or DOI of the paper, and use that in the BibTeX entry instead of the raw entry from the BBL file.
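To make the BBL step more concrete, the following Python sketch splits a `.bbl` file on `\bibitem`, pulls out a year and an arXiv ID where they are easy to spot, and emits a simple `@misc` entry. The function names, the regexes, and the choice of `@misc` with a `note` field are assumptions for illustration only; real BBL files vary a lot and the tool's own parser is not shown here.

```python
import re

# \bibitem with an optional [label], capturing the citation key.
BIBITEM_RE = re.compile(r"\\bibitem(?:\[[^\]]*\])?\{([^}]+)\}")
ARXIV_RE = re.compile(r"arXiv:\s*(\d{4}\.\d{4,5})", re.IGNORECASE)
YEAR_RE = re.compile(r"\b(?:19|20)\d{2}\b")


def parse_bbl(bbl: str) -> list[dict]:
    """Split a .bbl string into entries with key, raw text, year and arXiv ID."""
    # With one capture group, split() yields [preamble, key1, body1, key2, body2, ...].
    parts = BIBITEM_RE.split(bbl)
    entries = []
    for key, body in zip(parts[1::2], parts[2::2]):
        body = body.replace("\\end{thebibliography}", "").replace("\\newblock", " ")
        raw = " ".join(body.split())
        arxiv = ARXIV_RE.search(raw)
        # Remove arXiv IDs first so an ID such as 2004.12345 is not read as a year.
        year = YEAR_RE.search(ARXIV_RE.sub("", raw))
        entries.append({
            "key": key.strip(),
            "raw": raw,
            "year": year.group(0) if year else None,
            "arxiv_id": arxiv.group(1) if arxiv else None,
        })
    return entries


def to_bibtex(entry: dict) -> str:
    """Render one parsed entry as a minimal @misc BibTeX record."""
    fields = {"note": entry["raw"]}
    if entry["year"]:
        fields["year"] = entry["year"]
    if entry["arxiv_id"]:
        fields["eprint"] = entry["arxiv_id"]
        fields["archivePrefix"] = "arXiv"
    body = ",\n".join(f"  {name} = {{{value}}}" for name, value in fields.items())
    return f"@misc{{{entry['key']},\n{body}\n}}"


sample_bbl = r"""
\begin{thebibliography}{1}
\bibitem[Smith et~al.(2020)]{smith2020}
J.~Smith and A.~Jones.
\newblock A survey of things.
\newblock \emph{arXiv preprint arXiv:2104.08653}, 2020.
\end{thebibliography}
"""
entries = parse_bbl(sample_bbl)
print(to_bibtex(entries[0]))
```

A fuller version would replace the `note` fallback with proper title and author fields once an arXiv or DOI lookup succeeds.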

This citation normalisation means that you can pass multiple papers to it and it will extract the related work content and bibliography from all of them, merging them into a single output, with limited overlap.
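One way to picture the merging step: if every entry is reduced to a canonical key, bibliographies from several papers can be combined with a plain dictionary. The sketch below is in Python and uses hypothetical field names (`arxiv_id`, `doi`, `title`, `year`); it is an assumption about how such deduplication could work, not a description of the tool's internals.

```python
def canonical_key(entry: dict) -> str:
    """Prefer a stable identifier (arXiv ID, then DOI); fall back to title + year."""
    if entry.get("arxiv_id"):
        return "arxiv:" + entry["arxiv_id"]
    if entry.get("doi"):
        return "doi:" + entry["doi"].lower()
    # Keep only letters and digits so case and punctuation differences do not matter.
    title = "".join(ch for ch in entry.get("title", "").lower() if ch.isalnum())
    return f"title:{title}:{entry.get('year', '')}"


def merge_bibliographies(*bibliographies: list[dict]) -> list[dict]:
    """Merge entry lists, keeping the first entry seen for each canonical key."""
    merged: dict[str, dict] = {}
    for bibliography in bibliographies:
        for entry in bibliography:
            merged.setdefault(canonical_key(entry), entry)
    return list(merged.values())


paper_a = [
    {"title": "Attention Is All You Need", "year": "2017", "arxiv_id": "1706.03762"},
    {"title": "A Survey of Things", "year": "2020"},
]
paper_b = [
    {"title": "Attention is all you need.", "year": "2017", "arxiv_id": "1706.03762"},
    {"title": "a survey of things", "year": "2020"},
    {"title": "Another Paper", "year": "2021", "doi": "10.1000/XYZ"},
]
merged = merge_bibliographies(paper_a, paper_b)
print(len(merged))
```

Entries without an arXiv ID or DOI can still collide or slip through on title matching alone, which is consistent with the "limited overlap" caveat above.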

The goal of this tool is to make it easy to get LLM agents to read/cite/write background sections of papers. In a loop, an agent could read a paper, extract the related work section, and then use all the ArXiv IDs in that section to extract the related work sections of those papers, and so on. This way, you can build a large corpus of related work content without having to manually search for papers.
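The loop described above can be sketched as a breadth-first crawl over arXiv IDs. In this Python sketch the extraction function is passed in as a parameter, so the example runs with a stub; `bibextract.extract_survey` could be passed instead, assuming it returns a dict with a `bibtex` string as in the usage example in this README. The function name `crawl_related_work`, the ID regex, and the `max_papers` cap are assumptions for illustration.

```python
import re
from collections import deque
from typing import Callable

# New-style arXiv IDs such as 2104.08653; old-style IDs are ignored in this sketch.
ARXIV_ID_RE = re.compile(r"\b(\d{4}\.\d{4,5})\b")


def crawl_related_work(
    seed_ids: list[str],
    extract: Callable[[list[str]], dict],
    max_papers: int = 20,
) -> dict[str, dict]:
    """Breadth-first crawl: extract a paper, then queue the arXiv IDs it cites."""
    seen: set[str] = set()
    queue = deque(seed_ids)
    corpus: dict[str, dict] = {}
    while queue and len(corpus) < max_papers:
        paper_id = queue.popleft()
        if paper_id in seen:
            continue
        seen.add(paper_id)
        result = extract([paper_id])
        corpus[paper_id] = result
        for cited in ARXIV_ID_RE.findall(result.get("bibtex", "")):
            if cited not in seen:
                queue.append(cited)
    return corpus


# Stub standing in for the real extractor, so the sketch runs offline.
FAKE_BIBTEX = {
    "2104.08653": "eprint = {1912.02292}",
    "1912.02292": "eprint = {2104.08653}\neprint = {1706.03762}",
    "1706.03762": "",
}


def fake_extract(paper_ids: list[str]) -> dict:
    return {"survey_text": "", "bibtex": FAKE_BIBTEX.get(paper_ids[0], "")}


corpus = crawl_related_work(["2104.08653"], fake_extract, max_papers=10)
print(list(corpus))
```

The `max_papers` cap matters in practice, since citation graphs grow quickly and each extraction downloads a paper's source.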

Some future todos

[ ] improve test coverage
[ ] add more .bbl files to tests
[ ] improve the MCP docs for the tool

Installation

Installing via Smithery

To install bibextract for Claude Desktop automatically via Smithery:

npx -y @smithery/cli install @gautierdag/bibextract --client claude

fastMCP server implementation

uv run bibextract_mcp.py

fastMCP from URL

```bash
# Check the file before running it; don't trust random scripts from the internet.
uv run --python 3.12 https://raw.githubusercontent.com/gautierdag/bibextract/refs/heads/main/bibextract_mcp.py
```

From PyPI

uv add bibextract

From Source

Install Rust (if not already installed):

curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
source ~/.cargo/env

Install maturin:

pip install maturin

Clone and build:

git clone https://github.com/gautierdag/bibextract.git
cd bibextract
maturin develop

Usage

Python API

```python
import bibextract

# Process one or more arXiv papers
result = bibextract.extract_survey(['2104.08653', '1912.02292'])

# Access the extracted content
survey_text = result['survey_text']  # Raw LaTeX with sections
bibtex = result['bibtex']  # BibTeX bibliography

# Save to files
with open('survey.tex', 'w') as f:
    f.write(survey_text)

with open('bibliography.bib', 'w') as f:
    f.write(bibtex)
```

Command Line (original Rust binary)

```bash
# Build the CLI tool
cargo build --release

# Process papers
./target/release/bibextract --paper-ids 2104.08653 1912.02292 --output survey.tex
```

Development

Running Tests

cargo test
pytest tests

License

This project is licensed under the MIT License - see the LICENSE file for details.

Owner

Name: Gautier Dagan
Login: gautierdag
Kind: user
Location: Edinburgh

Website: www.gautier.tech
Twitter: __gautier__
Repositories: 53
Profile: https://github.com/gautierdag

PhD Student in Natural Language Processing at the University of Edinburgh

GitHub Events

Total

Release event: 2
Delete event: 3
Push event: 17
Pull request event: 1
Create event: 5

Last Year

Release event: 2
Delete event: 3
Push event: 17
Pull request event: 1
Create event: 5

Issues and Pull Requests

Last synced: 10 months ago

All Time

Total issues: 0
Total pull requests: 2
Average time to close issues: N/A
Average time to close pull requests: 15 minutes
Total issue authors: 0
Total pull request authors: 1
Average comments per issue: 0
Average comments per pull request: 1.0
Merged pull requests: 2
Bot issues: 0
Bot pull requests: 2

Past Year

Issues: 0
Pull requests: 2
Average time to close issues: N/A
Average time to close pull requests: 15 minutes
Issue authors: 0
Pull request authors: 1
Average comments per issue: 0
Average comments per pull request: 1.0
Merged pull requests: 2
Bot issues: 0
Bot pull requests: 2

Top Authors

Issue Authors

Pull Request Authors

smithery-ai[bot] (2)

Top Labels

Issue Labels

Pull Request Labels

Packages

Total packages: 1
Total downloads: 236 last month (PyPI)

Total dependent packages: 0
Total dependent repositories: 0
Total versions: 2
Total maintainers: 1

pypi.org: bibextract

Extract survey content and bibliography from arXiv papers

Homepage: https://github.com/gautierdag/bibextract
Documentation: https://bibextract.readthedocs.io/
License: MIT
Latest release: 0.1.1
published about 1 year ago

Versions: 2
Dependent Packages: 0
Dependent Repositories: 0
Downloads: 236 Last month

Rankings

Dependent packages count: 8.9%

Average: 29.6%

Dependent repos count: 50.2%

Maintainers (1)

gautierdag

Last synced: 10 months ago
