geobenchx
A benchmark set of geospatial tasks for LLM agents requiring multi-step tool use, plus an LLM-as-Judge evaluation framework.
Science Score: 54.0%
This score indicates how likely this project is to be science-related, based on these indicators:
- ✓ CITATION.cff file: found
- ✓ codemeta.json file: found
- ✓ .zenodo.json file: found
- ○ DOI references
- ✓ Academic publication links: arxiv.org
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (9.1%) to scientific vocabulary
Repository
A benchmark set of geospatial tasks for LLM agents requiring multi-step tool use, plus an LLM-as-Judge evaluation framework.
Basic Info
- Host: GitHub
- Owner: Solirinai
- License: MIT
- Language: Jupyter Notebook
- Default Branch: main
- Size: 415 KB
Statistics
- Stars: 5
- Watchers: 1
- Forks: 0
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
GeoBenchX
A benchmark set of geospatial tasks for LLM agents requiring multi-step tool use, plus an LLM-as-Judge evaluation framework.
In this work, we establish a benchmark for evaluating large language models (LLMs) on multi-step geospatial tasks relevant to commercial GIS practitioners. We assess seven leading commercial LLMs (Sonnet 3.5 and 3.7, Haiku 3.5, Gemini 2.0, GPT-4o, GPT-4o mini, and o3-mini) using a simple tool-calling agent equipped with 23 geospatial functions. Our benchmark comprises tasks across four categories of increasing complexity, with both solvable and intentionally unsolvable tasks to test hallucination rejection. We develop an LLM-as-Judge evaluation framework to compare agent solutions against reference implementations. Results show Sonnet 3.5 and GPT-4o achieve the best overall performance, with Claude models excelling on solvable tasks while OpenAI models better identify unsolvable scenarios. We observe significant differences in token usage, with Anthropic models consuming substantially more tokens than competitors. Common errors include misunderstanding geometrical relationships, relying on outdated knowledge, and inefficient data manipulation. The resulting benchmark set, evaluation framework, and data generation pipeline will be released as open-source resources, providing one more standardized method for ongoing evaluation of LLMs for GeoAI.
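The abstract describes a simple tool-calling agent equipped with 23 geospatial functions. The core loop of such an agent can be sketched roughly as below; the tool name `buffer_layer`, the registry, and the dispatch shape are illustrative assumptions, not GeoBenchX's actual API.

```python
# Hedged sketch of a minimal tool-calling loop: the model proposes a tool
# name and arguments, and the agent dispatches to a registered function.
# All names here are illustrative, not the repository's real interface.

TOOLS = {}

def tool(fn):
    """Register a function as a callable tool by its name."""
    TOOLS[fn.__name__] = fn
    return fn

@tool
def buffer_layer(layer: str, meters: float) -> str:
    """Hypothetical geospatial tool: buffer a named layer by a distance."""
    return f"buffered {layer} by {meters} m"

def dispatch(call: dict) -> str:
    """Execute one model-proposed call, e.g. {'name': ..., 'args': {...}}."""
    fn = TOOLS.get(call["name"])
    if fn is None:
        return f"error: unknown tool {call['name']!r}"
    return fn(**call["args"])

print(dispatch({"name": "buffer_layer",
                "args": {"layer": "rivers", "meters": 1000}}))
# buffered rivers by 1000 m
```

In a real agent the `call` dict would come from the LLM's structured tool-call output (via langchain/langgraph, per the dependency list), and the returned string would be fed back to the model as the tool result.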
Paper on arXiv: GeoBenchX: Benchmarking LLMs for Multistep Geospatial Tasks
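The LLM-as-Judge evaluation the abstract mentions compares an agent's solution against a reference implementation. One way to sketch that step is below; the prompt wording, the `VERDICT:` reply convention, and all function names are assumptions for illustration, not the repository's actual framework.

```python
# Hedged sketch of an LLM-as-Judge comparison: the judge model is shown the
# reference tool-call sequence and the agent's sequence, then asked for a
# verdict. Names and conventions here are illustrative assumptions only.

def build_judge_prompt(task: str, reference: list, candidate: list) -> str:
    """Assemble a judging prompt comparing two tool-call sequences."""
    lines = [f"Task: {task}", "Reference tool calls:"]
    lines += [f"  {s}" for s in reference]
    lines.append("Agent tool calls:")
    lines += [f"  {s}" for s in candidate]
    lines.append("Reply 'VERDICT: match' if the agent's solution is "
                 "equivalent to the reference, else 'VERDICT: mismatch'.")
    return "\n".join(lines)

def parse_verdict(judge_reply: str) -> str:
    """Extract the verdict line from the judge model's free-form reply."""
    for line in judge_reply.splitlines():
        if line.strip().upper().startswith("VERDICT:"):
            return line.split(":", 1)[1].strip().lower()
    return "unknown"

prompt = build_judge_prompt(
    "Buffer rivers by 1 km and count intersecting cities",
    ["load_layer('rivers')", "buffer(1000)", "count_hits('cities')"],
    ["load_layer('rivers')", "buffer(1000)", "count_hits('cities')"],
)
print(parse_verdict("Both sequences agree.\nVERDICT: match"))  # match
```

The prompt string would be sent to the judge model (e.g. via the `anthropic` or `openai` clients in the dependency list), and `parse_verdict` applied to its reply; falling back to `"unknown"` keeps malformed judge replies from silently counting as matches.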
Citation
If you use this benchmark or code in your research, please cite:
BibTeX:
```bibtex
@misc{krechetova2025geobenchxbenchmarkingllmsmultistep,
  title={GeoBenchX: Benchmarking LLMs for Multistep Geospatial Tasks},
  author={Varvara Krechetova and Denis Kochedykov},
  year={2025},
  eprint={2503.18129},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2503.18129},
}
```
Owner
- Name: Varvara Krechetova
- Login: Solirinai
- Kind: user
- Repositories: 1
- Profile: https://github.com/Solirinai
Citation (CITATION.cff)
cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
  - family-names: "Krechetova"
    given-names: "Varvara"
title: "A Comparative Analysis of LLM Agents for Geospatial Tasks"
version: 1.0.0
date-released: 2025-03-22
url: "https://github.com/Solirinai/GeoBenchX"
preferred-citation:
  type: article
  authors:
    - family-names: "Krechetova"
      given-names: "Varvara"
  title: "A Comparative Analysis of LLM Agents for Geospatial Tasks"
  year: 2025
GitHub Events
Total
- Watch event: 6
- Push event: 12
- Create event: 2
Last Year
- Watch event: 6
- Push event: 12
- Create event: 2
Dependencies
- anthropic *
- basemap *
- contextily *
- folium *
- gdal *
- geopandas *
- google-generativeai *
- langchain_anthropic *
- langchain_core *
- langchain_google_genai *
- langchain_openai *
- langgraph *
- matplotlib *
- nbformat *
- networkx *
- numpy *
- openai *
- overpy *
- pandas *
- pathlib *
- plotly *
- pydantic *
- python-dotenv *
- rasterio *
- requests *
- scikit-learn *
- scipy *
- shapely *
- statsmodels *
- streamlit *
- tiktoken *
- typing_extensions *
- wbgapi *
- zipfile *