automating-poi-categorization-agcg

A hybrid system combining semantic embeddings and rule-based logic to classify Points of Interest (POIs) using a hierarchical category tree.

https://github.com/project-terraforma/automating-poi-categorization-agcg

Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (14.3%) to scientific vocabulary
Last synced: 6 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: project-terraforma
  • License: apache-2.0
  • Language: Jupyter Notebook
  • Default Branch: main
  • Homepage:
  • Size: 283 KB
Statistics
  • Stars: 1
  • Watchers: 1
  • Forks: 0
  • Open Issues: 0
  • Releases: 0
Created 11 months ago · Last pushed 8 months ago
Metadata Files
Readme Citation

README.md

Automating POI Categorization

A hybrid system combining semantic embeddings and rule-based logic to classify Points of Interest (POIs) using a hierarchical category tree.

Description

This project automatically categorizes POIs (such as restaurants, gyms, or clinics) into a structured taxonomy. POI data is collected by scraping publicly available information from business websites. For each POI, the business name and the scraped website content are combined into a single descriptive input.
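The step of turning a business name and scraped page content into one descriptive input can be sketched as follows. This is a minimal illustration, not the project's actual scraper: `build_poi_input`, `_TextExtractor`, and the `max_chars` limit are hypothetical names, and the standard library's `html.parser` stands in for `beautifulsoup4` so the example runs with no installs. The HTML is passed in as a string; fetching it (for example with `requests`) is left out.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def build_poi_input(name, html, max_chars=500):
    """Combine a business name with truncated page text (hypothetical helper)."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace so the text reads as one sentence-like string.
    page_text = " ".join(" ".join(parser.chunks).split())
    return f"{name}. {page_text[:max_chars]}".strip()
```

Truncating the page text keeps the input short enough for a sentence-embedding model, whose useful input length is limited; the 500-character default here is an arbitrary assumption.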

The model then uses Sentence-BERT (SBERT) embeddings to match this input against a tree of categories and subcategories. A rule-based scoring system complements the embeddings by matching category-specific keywords to improve prediction accuracy. The combination is intended to keep classification scalable, interpretable, and flexible, especially when the input data is sparse or noisy.
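One way to picture the hybrid approach is a score that adds a keyword bonus to a semantic similarity for every subcategory in the tree. The sketch below is an assumption-laden illustration, not the project's implementation: the taxonomy, the `keyword_weight` value, and all function names are hypothetical, and a bag-of-words vector (`toy_embed`) stands in for a real SBERT embedding so the example runs without downloading a model.

```python
import math
import re
from collections import Counter

# Hypothetical two-level taxonomy: category -> subcategory -> keywords.
CATEGORY_TREE = {
    "food_and_drink": {
        "restaurant": ["restaurant", "menu", "dinner"],
        "cafe": ["coffee", "espresso", "cafe"],
    },
    "health": {
        "gym": ["gym", "fitness", "workout"],
        "clinic": ["clinic", "doctor", "patients"],
    },
}


def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())


def toy_embed(text):
    """Bag-of-words stand-in for an SBERT embedding (assumption, not SBERT)."""
    return Counter(tokenize(text))


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def classify(description, tree=CATEGORY_TREE, embed=toy_embed, keyword_weight=0.1):
    """Return (category, subcategory, score) with the highest hybrid score."""
    doc_vec = embed(description)
    doc_tokens = set(tokenize(description))
    best = None
    for category, subcategories in tree.items():
        for subcategory, keywords in subcategories.items():
            label_text = f"{category} {subcategory} {' '.join(keywords)}".replace("_", " ")
            semantic = cosine(doc_vec, embed(label_text))
            # Rule-based part: a fixed bonus per keyword found in the description.
            rule_bonus = keyword_weight * sum(1 for k in keywords if k in doc_tokens)
            score = semantic + rule_bonus
            if best is None or score > best[2]:
                best = (category, subcategory, score)
    return best
```

For example, `classify("Downtown Fitness. 24 hour gym with personal workout plans")` selects the `health` / `gym` branch because three of its keywords appear in the text. Using a real model would mean replacing both `toy_embed` and `cosine` with dense-vector equivalents; the keyword bonus could stay as it is, which is what keeps the result easy to explain.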

Getting Started

Dependencies

  • Operating System: Windows 10, macOS, or Linux (Python 3.8+ recommended)
  • Python Libraries:
    • transformers – for loading sentence embedding models
    • sentence-transformers – high-level wrapper for semantic embedding
    • torch – backend for running SBERT model inference
    • pandas – used for handling and filtering POI datasets
    • numpy – array computations for scoring and embedding math
    • beautifulsoup4 – used for HTML parsing in the web scraper
    • requests – makes HTTP calls to fetch POI websites
    • jupyter – local interactive development
    • google-colab – cloud-based alternative to Jupyter Notebooks

Setup

  1. Clone the repo:
       git clone https://github.com/project-terraforma/Automating-POI-Categorization-AGCG.git
  2. Navigate into the project directory:
       cd Automating-POI-Categorization-AGCG
  3. Install Python dependencies. Make sure you have a Python environment set up, then install the required packages:
       pip install -r requirements.txt
  4. Prevent accidental pushes to the base repository. Change the Git remote to your own fork or local version:
       git remote set-url origin https://github.com/<your-username>/<your-repo-name>.git
       git remote -v   # Confirm the remote URL was updated
  5. Start the project. Launch Jupyter Notebook and open the main notebook to begin:
       jupyter notebook

Navigate to the notebooks/ folder and open main.ipynb.

Authors

Adam Axtopani Gonzales – adamurlnum2@gmail.com

Carlos Garcia

Version History

  • 0.1
    • Initial Release

Acknowledgments

[ Project Sponsor ] Overture Maps Foundation

Sponsored this project and gave us the opportunity to approach this problem with their open source data

Overture Maps POCs from Microsoft Corporation

Krill Fedotov, Marko Radoicic, & Nikola Bozovic

A source of guidance and expertise when tackling this project together

Owner

  • Name: project-terraforma
  • Login: project-terraforma
  • Kind: organization

Citation (CITATION.cff)

@inproceedings{reimers-2019-sentence-bert,
  title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
  author = "Reimers, Nils and Gurevych, Iryna",
  booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
  month = "11",
  year = "2019",
  publisher = "Association for Computational Linguistics",
  url = "https://arxiv.org/abs/1908.10084",
}

GitHub Events

Total
  • Push event: 9
Last Year
  • Push event: 9

Dependencies

requirements.txt pypi
  • beautifulsoup4 *
  • numpy *
  • overturemaps *
  • pandas *
  • pyarrow *
  • requests *
  • sentence-transformers *