https://github.com/centre-for-humanities-computing/lex-db

A repository for interacting with the lex database for the Lex AI project.

Science Score: 26.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (12.7%) to scientific vocabulary
Last synced: 9 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: centre-for-humanities-computing
  • Language: Python
  • Default Branch: main
  • Size: 153 KB
Statistics
  • Stars: 0
  • Watchers: 0
  • Forks: 0
  • Open Issues: 4
  • Releases: 2
Created about 1 year ago · Last pushed 10 months ago
Metadata Files
Readme

README.md

lex-db

A repository for interacting with the lex database for the Lex AI project. This project provides a wrapper around a SQLite database to enable querying encyclopedia articles via API requests, supporting both vector (semantic) search and full-text/keyword search.

Features

  • SQLite database access with sqlite-vec for vector search
  • Full-text search capabilities using FTS5
  • FastAPI-based REST API with automatic OpenAPI documentation
  • Hybrid querying via metadata filtering and text search
  • Vector index management and semantic search

Requirements

  • Python 3.12+
  • Astral UV for package management
  • SQLite compiled with the sqlite-vec extension (required for vector search)

Installation

  1. Clone the repository and change into it: `git clone https://github.com/centre-for-humanities-computing/lex-db.git`, then `cd lex-db`.

  2. Install dependencies using Make: `make install`

  3. Create a .env file (or modify the existing one) to set the database path:

     ```
     # Database settings
     DATABASE_URL=PATH/TO/DB_FILE.db
     ```

Usage

Scripts

  • create_fts_index.py: Creates a full-text search index on a specified column in a table. Run it with `uv run src/scripts/create_fts_index.py <table_name> <column_name>`.

  • create_vector_index.py: Creates a new vector index for semantic search on a given column. Run it with `uv run src/scripts/create_vector_index.py <table_name> <column_name>`.

  • update_vector_indexes.py: Populates vector indexes with embeddings using OpenAI or another embedding provider. Run it with `uv run src/scripts/update_vector_indexes.py`.

⚠️ Note: While create_openai_embedding_batches.py and add_batch_embeddings_to_index.py exist for batch processing, they are not recommended due to reliability issues with the OpenAI Batch API. Use update_vector_indexes.py instead.
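
To make the idea behind create_fts_index.py concrete, here is a minimal sketch of building and querying an FTS5 index with Python's standard sqlite3 module. It is not the repository's script: the function names and the `fts_<table>_<column>` naming scheme are assumptions made for illustration, and it requires a SQLite build that includes FTS5.

```python
import sqlite3


def _check_identifier(name: str) -> str:
    # SQL identifiers cannot be bound as parameters, so validate them instead.
    if not name.isidentifier():
        raise ValueError(f"unsafe SQL identifier: {name!r}")
    return name


def create_fts_index(conn: sqlite3.Connection, table: str, column: str) -> str:
    """Create an FTS5 table over one column and copy the existing rows into it."""
    table, column = _check_identifier(table), _check_identifier(column)
    fts_table = f"fts_{table}_{column}"  # hypothetical naming scheme
    conn.execute(f"CREATE VIRTUAL TABLE {fts_table} USING fts5({column})")
    conn.execute(
        f"INSERT INTO {fts_table}(rowid, {column}) "
        f"SELECT rowid, {column} FROM {table}"
    )
    conn.commit()
    return fts_table


def search(conn: sqlite3.Connection, fts_table: str, query: str, limit: int = 50) -> list[int]:
    """Return the rowids of matching rows, best match first."""
    fts_table = _check_identifier(fts_table)
    rows = conn.execute(
        f"SELECT rowid FROM {fts_table} WHERE {fts_table} MATCH ? "
        f"ORDER BY rank LIMIT ?",
        (query, limit),
    )
    return [row[0] for row in rows]
```

Because the FTS5 rows reuse the source table's rowid, the returned ids can be joined back to the original table to fetch full articles.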

Running the API Server

Start the FastAPI server with `make run`. The server will be available at http://0.0.0.0:8000.

API Endpoints

List Database Tables

  • GET /api/tables
    • Returns a list of all tables in the database.
    • Example response: `{ "tables": ["articles", "vector_index", "metadata"] }`

Filter Articles by Metadata

  • GET /api/articles
    • Retrieve articles filtered by ID or full-text search query.
    • Supports optional query parameters:
      • query: Text-based search in article content.
      • ids: Filter by article IDs (supports comma-separated string, JSON array, or repeated ids parameter).
      • limit: Maximum number of results (1–100, default: 50).

Examples:

  • By IDs only (comma-separated): `GET /api/articles?ids=1,2,5`
  • By IDs (repeated parameters): `GET /api/articles?ids=1&ids=2&ids=5`
  • By IDs and text search: `GET /api/articles?query=Rundetårn&ids=1,2&limit=10`
  • Full-text search only: `GET /api/articles?query=Denmark`
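
The three accepted shapes of the ids parameter can be normalized into one list of integers. The sketch below shows one way this might be done. `parse_ids` is a hypothetical helper, not the server's actual code, and it assumes the raw values arrive as a list of strings (one entry per ids occurrence in the query string).

```python
import json


def parse_ids(values: list[str]) -> list[int]:
    """Normalize comma-separated, JSON-array, or repeated ids values to ints."""
    ids: list[int] = []
    for value in values:
        value = value.strip()
        if value.startswith("["):
            # JSON array form, e.g. "[1, 2, 5]"
            ids.extend(int(item) for item in json.loads(value))
        else:
            # Single id ("1") or comma-separated form ("1,2,5")
            ids.extend(int(part) for part in value.split(",") if part.strip())
    return ids
```

All three request styles then produce the same list, so the query layer only has to handle one representation.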

Response:
Returns structured search results including matched articles with metadata and scores.
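
For calling this endpoint from Python, a small URL builder keeps the parameter rules in one place. This is a sketch under assumptions: `articles_url` and `BASE_URL` are illustrative names, the server is assumed to run locally on port 8000, and only the comma-separated ids form is produced.

```python
from urllib.parse import urlencode

BASE_URL = "http://localhost:8000"  # assumption: server started locally


def articles_url(query=None, ids=None, limit=None, base_url=BASE_URL):
    """Build a GET /api/articles URL from the optional query parameters."""
    params = []
    if query is not None:
        params.append(("query", query))
    if ids:
        params.append(("ids", ",".join(str(i) for i in ids)))
    if limit is not None:
        if not 1 <= limit <= 100:
            raise ValueError("limit must be between 1 and 100")
        params.append(("limit", limit))
    url = f"{base_url}/api/articles"
    # Keep commas literal so the ids value stays readable.
    return f"{url}?{urlencode(params, safe=',')}" if params else url
```

The resulting URL can be passed to any HTTP client, for example `urllib.request.urlopen`. Non-ASCII search terms such as Rundetårn are percent-encoded as UTF-8.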

Vector Search

  • POST /api/vector-search/indexes/{index_name}/query
    • Perform semantic search on a specific vector index.
    • Path Parameter:
      • index_name: Name of the vector index to search.
    • Request Body (JSON): `{ "query_text": "What is the capital of Denmark?", "top_k": 5 }`
      • query_text: The search query (required).
      • top_k: Number of top results to return (optional, default: 5).

Example Request:

```http
POST /api/vector-search/indexes/article_embeddings/query
Content-Type: application/json

{ "query_text": "Scandinavian history", "top_k": 3 }
```

Response:
Returns a list of semantically similar documents with metadata and similarity scores.
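
A vector search request can be issued from Python using only the standard library. The sketch below assumes a server at http://localhost:8000. The helper names are illustrative, and the response is returned as parsed JSON without assuming its exact fields.

```python
import json
from urllib.parse import quote
from urllib.request import Request, urlopen


def build_vector_search_request(index_name, query_text, top_k=5,
                                base_url="http://localhost:8000"):
    """Prepare (but do not send) a POST request for a vector index query."""
    url = f"{base_url}/api/vector-search/indexes/{quote(index_name, safe='')}/query"
    body = json.dumps({"query_text": query_text, "top_k": top_k}).encode("utf-8")
    return Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def vector_search(index_name, query_text, top_k=5, timeout=30):
    """Send the request and return the decoded JSON response."""
    request = build_vector_search_request(index_name, query_text, top_k)
    with urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

Separating request construction from sending makes the payload easy to inspect or unit-test without a running server.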

List All Vector Indexes

  • GET /api/vector-search/indexes
    • Retrieve metadata for all available vector indexes.
    • Example response: `[ { "index_name": "article_embeddings", "embedding_model": "text-embedding-3-small", "dimension": 1536, "created_at": "2025-04-05T12:00:00Z" } ]`
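
On the client side, the index metadata can be decoded into typed records. The following sketch assumes the response has exactly the four fields shown in the example response. `VectorIndexInfo` and `parse_indexes` are hypothetical names, not part of lex-db.

```python
import json
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class VectorIndexInfo:
    index_name: str
    embedding_model: str
    dimension: int
    created_at: datetime


def parse_indexes(payload: str) -> list[VectorIndexInfo]:
    """Decode the JSON list of vector indexes into dataclass instances."""
    return [
        VectorIndexInfo(
            index_name=item["index_name"],
            embedding_model=item["embedding_model"],
            dimension=int(item["dimension"]),
            # Replace the trailing "Z" so older Python versions can parse it.
            created_at=datetime.fromisoformat(item["created_at"].replace("Z", "+00:00")),
        )
        for item in json.loads(payload)
    ]
```

Knowing the dimension up front is useful when checking that a locally computed embedding is compatible with a given index.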

Get Metadata for a Specific Vector Index

  • GET /api/vector-search/indexes/{index_name}
    • Retrieve metadata for a specific vector index.
    • Path Parameter:
      • index_name: The name of the vector index.
    • Returns details such as model used, dimension, and creation timestamp.

API Documentation

Once the server is running, access auto-generated API documentation at:

  • Swagger UI: http://localhost:8000/docs
  • ReDoc: http://localhost:8000/redoc

The OpenAPI 3.1 specification is available at: /openapi/openapi.yaml

You can generate clients in various languages using OpenAPI Generator: `openapi-generator-cli generate -i openapi/openapi.yaml -g <language> -o ./client`. Replace `<language>` with your target (e.g., python, typescript-fetch, java).

Development

This project uses a Makefile to streamline development tasks:

| Command | Description |
|---------|-------------|
| make install | Install dependencies |
| make run | Start the API server |
| make lint | Format code and fix lint issues (using Ruff) |
| make lint-check | Check formatting and linting without applying fixes |
| make static-type-check | Run static type checking with Mypy |
| make test | Run tests using Pytest |
| make pr | Run all pre-PR checks (linting, type checking, testing) |
| make help | Show all available commands |

License

N/A

Owner

  • Name: Center for Humanities Computing Aarhus
  • Login: centre-for-humanities-computing
  • Kind: organization
  • Email: chcaa@cas.au.dk
  • Location: Aarhus, Denmark

GitHub Events

Total
  • Create event: 1
  • Release event: 1
  • Issues event: 6
  • Push event: 3
  • Public event: 1
  • Pull request event: 2
Last Year
  • Create event: 1
  • Release event: 1
  • Issues event: 6
  • Push event: 3
  • Public event: 1
  • Pull request event: 2

Dependencies

pyproject.toml pypi
  • fastapi >=0.110.0
  • mypy >=1.15.0
  • openai >=1.79.0
  • pydantic >=2.6.0
  • pydantic-settings >=2.1.0
  • pytest >=8.3.5
  • pytest-cov >=6.1.1
  • python-dotenv >=1.0.0
  • ruff >=0.11.9
  • sentence-transformers >=4.1.0
  • setuptools >=80.7.1
  • sqlite-utils >=3.38
  • sqlite-vec >=0.1.6
  • tiktoken >=0.9.0
  • types-pyyaml >=6.0.12.20250516
  • types-setuptools >=80.8.0.20250521
  • uvicorn >=0.27.0
setup.py pypi
uv.lock pypi
  • annotated-types 0.7.0
  • anyio 4.9.0
  • certifi 2025.4.26
  • charset-normalizer 3.4.2
  • click 8.2.0
  • click-default-group 1.2.4
  • colorama 0.4.6
  • coverage 7.8.0
  • distro 1.9.0
  • fastapi 0.115.12
  • filelock 3.18.0
  • fsspec 2025.3.2
  • h11 0.16.0
  • httpcore 1.0.9
  • httpx 0.28.1
  • huggingface-hub 0.31.4
  • idna 3.10
  • iniconfig 2.1.0
  • jinja2 3.1.6
  • jiter 0.10.0
  • joblib 1.5.0
  • lex-db 0.1.0
  • markupsafe 3.0.2
  • mpmath 1.3.0
  • mypy 1.15.0
  • mypy-extensions 1.1.0
  • networkx 3.4.2
  • numpy 2.2.6
  • nvidia-cublas-cu12 12.6.4.1
  • nvidia-cuda-cupti-cu12 12.6.80
  • nvidia-cuda-nvrtc-cu12 12.6.77
  • nvidia-cuda-runtime-cu12 12.6.77
  • nvidia-cudnn-cu12 9.5.1.17
  • nvidia-cufft-cu12 11.3.0.4
  • nvidia-cufile-cu12 1.11.1.6
  • nvidia-curand-cu12 10.3.7.77
  • nvidia-cusolver-cu12 11.7.1.2
  • nvidia-cusparse-cu12 12.5.4.2
  • nvidia-cusparselt-cu12 0.6.3
  • nvidia-nccl-cu12 2.26.2
  • nvidia-nvjitlink-cu12 12.6.85
  • nvidia-nvtx-cu12 12.6.77
  • openai 1.79.0
  • packaging 25.0
  • pillow 11.2.1
  • pluggy 1.5.0
  • pydantic 2.11.4
  • pydantic-core 2.33.2
  • pydantic-settings 2.9.1
  • pytest 8.3.5
  • pytest-cov 6.1.1
  • python-dateutil 2.9.0.post0
  • python-dotenv 1.1.0
  • pyyaml 6.0.2
  • regex 2024.11.6
  • requests 2.32.3
  • ruff 0.11.9
  • safetensors 0.5.3
  • scikit-learn 1.6.1
  • scipy 1.15.3
  • sentence-transformers 4.1.0
  • setuptools 80.7.1
  • six 1.17.0
  • sniffio 1.3.1
  • sqlite-fts4 1.0.3
  • sqlite-utils 3.38
  • sqlite-vec 0.1.6
  • starlette 0.46.2
  • sympy 1.14.0
  • tabulate 0.9.0
  • threadpoolctl 3.6.0
  • tiktoken 0.9.0
  • tokenizers 0.21.1
  • torch 2.7.0
  • tqdm 4.67.1
  • transformers 4.51.3
  • triton 3.3.0
  • types-pyyaml 6.0.12.20250516
  • types-setuptools 80.8.0.20250521
  • typing-extensions 4.13.2
  • typing-inspection 0.4.0
  • urllib3 2.4.0
  • uvicorn 0.34.2