geobenchx
A benchmark set of geospatial tasks for LLM agents requiring multi-step tool use, plus an LLM-as-Judge evaluation framework.
Science Score: 54.0%
This score indicates how likely this project is to be science-related, based on these indicators:
- ✓ CITATION.cff file: found
- ✓ codemeta.json file: found
- ✓ .zenodo.json file: found
- ○ DOI references
- ✓ Academic publication links: arxiv.org
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (9.1%) to scientific vocabulary
Repository
A benchmark set of geospatial tasks for LLM agents requiring multi-step tool use, plus an LLM-as-Judge evaluation framework.
Basic Info
- Host: GitHub
- Owner: Solirinai
- License: MIT
- Language: Jupyter Notebook
- Default Branch: main
- Size: 415 KB
Statistics
- Stars: 5
- Watchers: 1
- Forks: 0
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
GeoBenchX
A benchmark set of geospatial tasks for LLM agents requiring multi-step tool use, plus an LLM-as-Judge evaluation framework.
In this work, we establish a benchmark for evaluating large language models (LLMs) on multi-step geospatial tasks relevant to commercial GIS practitioners. We assess seven leading commercial LLMs (Sonnet 3.5 and 3.7, Haiku 3.5, Gemini 2.0, GPT-4o, GPT-4o mini, and o3-mini) using a simple tool-calling agent equipped with 23 geospatial functions. Our benchmark comprises tasks across four categories of increasing complexity, with both solvable and intentionally unsolvable tasks to test hallucination rejection. We develop an LLM-as-Judge evaluation framework to compare agent solutions against reference implementations. Results show Sonnet 3.5 and GPT-4o achieve the best overall performance, with Claude models excelling on solvable tasks while OpenAI models better identify unsolvable scenarios. We observe significant differences in token usage, with Anthropic models consuming substantially more tokens than competitors. Common errors include misunderstanding geometrical relationships, relying on outdated knowledge, and inefficient data manipulation. The resulting benchmark set, evaluation framework, and data generation pipeline will be released as open-source resources, providing one more standardized method for ongoing evaluation of LLMs for GeoAI.
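The abstract describes a simple tool-calling agent equipped with 23 geospatial functions. The core loop of such an agent can be sketched roughly as below; the tool name `buffer_layer`, the registry, and the dispatch shape are illustrative assumptions, not GeoBenchX's actual API.

```python
# Hedged sketch of a minimal tool-calling loop: the model proposes a tool
# name and arguments, and the agent dispatches to a registered function.
# All names here are illustrative, not the repository's real interface.

TOOLS = {}

def tool(fn):
    """Register a function as a callable tool by its name."""
    TOOLS[fn.__name__] = fn
    return fn

@tool
def buffer_layer(layer: str, meters: float) -> str:
    """Hypothetical geospatial tool: buffer a named layer by a distance."""
    return f"buffered {layer} by {meters} m"

def dispatch(call: dict) -> str:
    """Execute one model-proposed call, e.g. {'name': ..., 'args': {...}}."""
    fn = TOOLS.get(call["name"])
    if fn is None:
        return f"error: unknown tool {call['name']!r}"
    return fn(**call["args"])

print(dispatch({"name": "buffer_layer",
                "args": {"layer": "rivers", "meters": 1000}}))
# buffered rivers by 1000 m
```

In a real agent the `call` dict would come from the LLM's structured tool-call output (via langchain/langgraph, per the dependency list), and the returned string would be fed back to the model as the tool result.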
Paper on arXiv: GeoBenchX: Benchmarking LLMs for Multistep Geospatial Tasks
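The LLM-as-Judge evaluation the abstract mentions compares an agent's solution against a reference implementation. One way to sketch that step is below; the prompt wording, the `VERDICT:` reply convention, and all function names are assumptions for illustration, not the repository's actual framework.

```python
# Hedged sketch of an LLM-as-Judge comparison: the judge model is shown the
# reference tool-call sequence and the agent's sequence, then asked for a
# verdict. Names and conventions here are illustrative assumptions only.

def build_judge_prompt(task: str, reference: list, candidate: list) -> str:
    """Assemble a judging prompt comparing two tool-call sequences."""
    lines = [f"Task: {task}", "Reference tool calls:"]
    lines += [f"  {s}" for s in reference]
    lines.append("Agent tool calls:")
    lines += [f"  {s}" for s in candidate]
    lines.append("Reply 'VERDICT: match' if the agent's solution is "
                 "equivalent to the reference, else 'VERDICT: mismatch'.")
    return "\n".join(lines)

def parse_verdict(judge_reply: str) -> str:
    """Extract the verdict line from the judge model's free-form reply."""
    for line in judge_reply.splitlines():
        if line.strip().upper().startswith("VERDICT:"):
            return line.split(":", 1)[1].strip().lower()
    return "unknown"

prompt = build_judge_prompt(
    "Buffer rivers by 1 km and count intersecting cities",
    ["load_layer('rivers')", "buffer(1000)", "count_hits('cities')"],
    ["load_layer('rivers')", "buffer(1000)", "count_hits('cities')"],
)
print(parse_verdict("Both sequences agree.\nVERDICT: match"))  # match
```

The prompt string would be sent to the judge model (e.g. via the `anthropic` or `openai` clients in the dependency list), and `parse_verdict` applied to its reply; falling back to `"unknown"` keeps malformed judge replies from silently counting as matches.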
Citation
If you use this benchmark or code in your research, please cite:
BibTeX:
```bibtex
@misc{krechetova2025geobenchxbenchmarkingllmsmultistep,
  title={GeoBenchX: Benchmarking LLMs for Multistep Geospatial Tasks},
  author={Varvara Krechetova and Denis Kochedykov},
  year={2025},
  eprint={2503.18129},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2503.18129},
}
```
Owner
- Name: Varvara Krechetova
- Login: Solirinai
- Kind: user
- Repositories: 1
- Profile: https://github.com/Solirinai
Citation (CITATION.cff)
cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
  - family-names: "Krechetova"
    given-names: "Varvara"
title: "A Comparative Analysis of LLM Agents for Geospatial Tasks"
version: 1.0.0
date-released: 2025-03-22
url: "https://github.com/Solirinai/GeoBenchX"
preferred-citation:
  type: article
  authors:
    - family-names: "Krechetova"
      given-names: "Varvara"
  title: "A Comparative Analysis of LLM Agents for Geospatial Tasks"
  year: 2025
GitHub Events
Total
- Watch event: 6
- Push event: 12
- Create event: 2
Last Year
- Watch event: 6
- Push event: 12
- Create event: 2
Dependencies
- anthropic *
- basemap *
- contextily *
- folium *
- gdal *
- geopandas *
- google-generativeai *
- langchain_anthropic *
- langchain_core *
- langchain_google_genai *
- langchain_openai *
- langgraph *
- matplotlib *
- nbformat *
- networkx *
- numpy *
- openai *
- overpy *
- pandas *
- pathlib *
- plotly *
- pydantic *
- python-dotenv *
- rasterio *
- requests *
- scikit-learn *
- scipy *
- shapely *
- statsmodels *
- streamlit *
- tiktoken *
- typing_extensions *
- wbgapi *
- zipfile *