taulu

Taulu is a Python package designed to segment tabular data in scanned or photographed documents.

https://github.com/ghentcdh/taulu

Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (14.9%) to scientific vocabulary

Keywords

data-extraction historic-documents htr ocr segmentation tabular-data

Repository


Basic Info
  • Host: GitHub
  • Owner: GhentCDH
  • Language: Python
  • Default Branch: main
  • Homepage:
  • Size: 12.1 MB
Statistics
  • Stars: 4
  • Watchers: 3
  • Forks: 0
  • Open Issues: 3
  • Releases: 16
Created 11 months ago · Last pushed 6 months ago

README.md

Segmentation of tables from images


Data Requirements

This package assumes that you are working with images of tables that have clearly visible rules (the lines that divide the table into cells).

To fully utilize the automated workflow, your tables should include a recognizable header. This header will be used to identify the position of the first cell in the input image and determine the expected widths of the table's cells.

For optimal segmentation, ensure that the tables are rotated so the borders are approximately vertical and horizontal. Minor page warping is acceptable.

Installation

Using pip

    pip install taulu

Using uv

    uv add taulu

Example

    git clone https://github.com/GhentCDH/taulu.git
    cd taulu/examples
    bash run.bash

During this example, you will need to annotate the header image. To do so, click twice per line, once for each endpoint. The order in which you annotate the lines does not matter. Example:

[Image: table header annotation example]

Below is an example of table cell identification using the Taulu package:

[Image: table cell identification example]

Workflow

This package is structured in a modular way, with several components that work together.

The algorithm identifies the header's location in the input image, which provides a starting point. From there, it scans the image to find intersections of the rules (borders) and segments the image into cells accordingly.
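The scanning step can be pictured as a local search: jump from a known intersection by the expected column width, then pick the strongest corner candidate inside a square window around the predicted point (the Parameters section calls the window size region). The following is a minimal NumPy sketch of that idea, not Taulu's implementation; the function best_match_in_region and the response map (higher values meaning "more corner-like") are hypothetical.

```python
import numpy as np

def best_match_in_region(response, predicted, region):
    """Return the (x, y) of the strongest response inside a square window.

    response  : 2D array, higher values = more corner-like (hypothetical input)
    predicted : (x, y) point reached by jumping from the previous corner
    region    : side length of the square search window, in pixels
    """
    height, width = response.shape
    x, y = predicted
    half = region // 2
    # Clip the window to the image bounds.
    x0, x1 = max(0, x - half), min(width, x + half + 1)
    y0, y1 = max(0, y - half), min(height, y + half + 1)
    window = response[y0:y1, x0:x1]
    # Locate the maximum inside the window and convert back to image coordinates.
    dy, dx = np.unravel_index(np.argmax(window), window.shape)
    return int(x0 + dx), int(y0 + dy)

# Toy response map: a true corner near the prediction and a stronger outlier far away.
response = np.zeros((50, 80))
response[22, 43] = 1.0  # true corner at (x=43, y=22)
response[5, 70] = 2.0   # stronger false positive at (x=70, y=5)

previous_corner = (10, 20)
column_width = 30  # in practice this would be derived from the annotated header
predicted = (previous_corner[0] + column_width, previous_corner[1])

found_small = best_match_in_region(response, predicted, region=11)
found_large = best_match_in_region(response, predicted, region=81)
print(found_small, found_large)
```

In this toy case the small window recovers the nearby corner, while the large window reaches the distant outlier, which illustrates why a larger search window tolerates more warping but also admits more false positives.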

The output is a TableGrid object that contains the detected intersections, enabling you to segment the image into rows, columns, and cells.
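To illustrate how a grid of intersections can be turned into cells, here is a small NumPy sketch. It does not use the TableGrid API; it assumes the intersections are available as an array of shape (rows + 1, cols + 1, 2) holding (x, y) points, and the helpers cell_bbox and crop_cell are hypothetical names introduced for this example.

```python
import numpy as np

def cell_bbox(points, row, col):
    """Axis-aligned bounding box (x0, y0, x1, y1) around one cell.

    points: array of shape (rows + 1, cols + 1, 2) with (x, y) intersections.
    """
    # The four corners of cell (row, col) are a 2x2 block of the grid.
    corners = points[row:row + 2, col:col + 2].reshape(-1, 2)
    x0, y0 = corners.min(axis=0)
    x1, y1 = corners.max(axis=0)
    return int(x0), int(y0), int(x1), int(y1)

def crop_cell(image, points, row, col):
    """Crop the bounding box of one cell out of the image."""
    x0, y0, x1, y1 = cell_bbox(points, row, col)
    return image[y0:y1, x0:x1]

# Toy grid: 2 rows by 3 columns of cells.
xs = np.array([10, 40, 90, 120])
ys = np.array([5, 25, 45])
points = np.stack(np.meshgrid(xs, ys), axis=-1)  # shape (3, 4, 2)
points[1, 2] = (92, 27)  # simulate slight page warping at one intersection

image = np.zeros((60, 130), dtype=np.uint8)
bbox = cell_bbox(points, 0, 1)
cell = crop_cell(image, points, 0, 1)
print(bbox, cell.shape)
```

A bounding box is the simplest choice; for strongly warped pages a perspective warp of each cell's four corners would give cleaner crops.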

Here is a visualization of the workflow and the components:

flowchart LR
    h(header.png) --> A[HeaderAligner]
    t(table.png) --> C[PageCropper]
    j(header.json) --> T[HeaderTemplate]
    C --> F[GridDetector]
    A --> H((h))
    C --> H
    T --> S((s))
    H --> S
    F --> R
    S --> R(result)
    T --> R

The components are:

  • HeaderAligner: Uses template matching to identify the header's location in the input images.
  • PageCropper: An optional component that crops the image to a region containing a given color. This is useful if your image contains a lot of background, but it can be skipped if the table occupies most of the image. It only works if the table's color is distinct from the background.
  • HeaderTemplate: Stores table template information by reading an annotation JSON file. You can create this file by running HeaderTemplate.annotate_image on a cropped image of your table's header.
  • GridDetector: Processes the image to identify intersections of horizontal and vertical lines (borders).
  • h: A transformation matrix that maps points from the header template to the input image.
  • s: The starting point of the segmentation algorithm (typically the top-left intersection, just below the header).
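As an illustration of how h and s relate, the sketch below maps a template point into image coordinates. It assumes h is a 3x3 matrix acting on homogeneous coordinates (a common representation, but an assumption here), and transform_point is a hypothetical helper rather than part of Taulu.

```python
import numpy as np

def transform_point(h, point):
    """Map an (x, y) template point to image coordinates with a 3x3 matrix h."""
    x, y = point
    u, v, w = h @ np.array([x, y, 1.0])
    # Divide by the homogeneous coordinate to get back to pixel coordinates.
    return float(u / w), float(v / w)

# Toy matrix: the header sits 120 px right and 80 px down in the input image.
h = np.array([
    [1.0, 0.0, 120.0],
    [0.0, 1.0, 80.0],
    [0.0, 0.0, 1.0],
])

# A template intersection just below the header, mapped to the image,
# could then serve as the starting point s.
template_point = (35.0, 60.0)
s = transform_point(h, template_point)
print(s)
```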

Parameters

Taulu has a few parameters that you may need to tune to fit your data. The most important ones are summarized below, along with guidance on how to tune them.

GridDetector

  • kernel_size, cross_width, cross_height: The GridDetector uses a kernel to detect intersections of rules in the image. By default, cross_height follows the value of cross_width. The kernel looks like this:

    [Image: kernel diagram]

    The goal is to make this kernel look like the actual corners in your images after thresholding and dilation. The example script shows the dilated result, which you can use to estimate the cross_width and cross_height values that fit your image. Note that the optimal values also depend on the morph_size parameter.
  • morph_size: The GridDetector uses a dilation step to connect lines in the image that might be broken up after thresholding. A larger morph_size connects larger gaps in the lines, but it also produces much thicker lines. As such, this parameter affects the optimal cross_width and cross_height.
  • region: This parameter influences the search algorithm. The algorithm starts at an already-detected intersection and jumps right by a distance derived from the annotated header template. At the new location, it finds the best corner match within a square of size region around that point and selects that as the detected corner. Visualized:

    [Image: search algorithm region]

    A larger region is more forgiving of warping and other artefacts, but it can also lead to false positives.
  • k, w: These parameters affect the thresholding algorithm used in the GridDetector. k adjusts the threshold: larger values of k correspond to a larger threshold, meaning more pixels are mapped to zero. Increase this parameter until most of the noise in your image is gone, without removing too many pixels from the actual lines of the table. w is less important; it adjusts the window size of the Sauvola thresholding algorithm that is used under the hood.
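To get a feel for k and w, the sketch below implements the textbook Sauvola threshold, T = m * (1 + k * (s / R - 1)), in plain NumPy, where m and s are the local mean and standard deviation in a w by w window. It is an illustration under stated assumptions, not Taulu's code: the function names, the value R = 128, and the convention that dark ink becomes True are choices made for this example, and Taulu's internal conventions may differ.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def sauvola_threshold(image, w=15, k=0.2, r=128.0):
    """Per-pixel Sauvola threshold for a grayscale image with values in [0, 255]."""
    if w % 2 == 0:
        raise ValueError("w must be odd")
    pad = w // 2
    padded = np.pad(image.astype(np.float64), pad, mode="reflect")
    # One w x w window per pixel; the result has shape (H, W, w, w).
    windows = sliding_window_view(padded, (w, w))
    mean = windows.mean(axis=(-2, -1))
    std = windows.std(axis=(-2, -1))
    return mean * (1.0 + k * (std / r - 1.0))

def ink_mask(image, w, k):
    """True where a pixel is darker than its local threshold (assumed to be ink)."""
    return image < sauvola_threshold(image, w=w, k=k)

# Synthetic page: noisy light background with one horizontal and one vertical rule.
rng = np.random.default_rng(0)
image = rng.normal(200, 20, size=(64, 64)).clip(0, 255)
image[32, :] = 40
image[:, 20] = 40

low_k = ink_mask(image, w=15, k=0.05)
high_k = ink_mask(image, w=15, k=0.3)
print("ink pixels at k=0.05:", int(low_k.sum()))
print("ink pixels at k=0.30:", int(high_k.sum()))
```

With this convention, raising k can only remove pixels from the ink mask, so the practical tuning loop is to increase k until background noise disappears and stop before the rules start to break up.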

HeaderTemplate

  • intersection((row, height)): This method calculates the intersection of a horizontal and a vertical line in the annotated header template. For example, template.intersection((1, 1)) corresponds to this intersection:

    [Image: intersection diagram]

    This point can then be transformed to the image using the aligner, where it serves as the starting point of the search algorithm. Note that in this case the first column is skipped. This is often useful because the GridDetector kernel looks for crosses, and the left-most intersection often only has a T shape (the left leg of the cross may be missing). If that is the case with your data too, it is a good idea to set the starting point to the (1, 1) intersection and add the first column later using the add_left_col(width) function. When doing this, you also need to set the parameter of the cell_widths function to 1. See this example.
  • cell_height(fraction: float): This method defines a single cell height for all rows. The fraction is multiplied by the height of the annotated header template to get the cell height.
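The geometry behind these two methods can be sketched independently of the library. The example below assumes each annotated line is stored as a pair of endpoints, as produced by clicking twice per line; line_intersection and the toy coordinates are hypothetical and only illustrate the calculation.

```python
import numpy as np

def line_intersection(seg_a, seg_b):
    """Intersection of the infinite lines through two segments.

    Each segment is ((x1, y1), (x2, y2)); endpoint order does not matter.
    """
    (x1, y1), (x2, y2) = seg_a
    (x3, y3), (x4, y4) = seg_b
    denom = (x1 - x2) * (y3 - y4) - (y1 - y2) * (x3 - x4)
    if np.isclose(denom, 0.0):
        raise ValueError("lines are parallel")
    det_a = x1 * y2 - y1 * x2
    det_b = x3 * y4 - y3 * x4
    px = (det_a * (x3 - x4) - (x1 - x2) * det_b) / denom
    py = (det_a * (y3 - y4) - (y1 - y2) * det_b) / denom
    return px, py

# Toy annotation: the bottom rule of the header and the second vertical rule.
horizontal = ((0.0, 50.0), (300.0, 50.0))
vertical = ((100.0, 0.0), (100.0, 60.0))
corner = line_intersection(horizontal, vertical)

# A single row height expressed as a fraction of the header height.
header_height = 60.0
fraction = 0.5
row_height = fraction * header_height
print(corner, row_height)
```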

Owner

  • Name: Ghent Centre for Digital Humanities
  • Login: GhentCDH
  • Kind: organization
  • Location: Belgium

Citation (CITATION.cff)

cff-version: 1.2.0
title: taulu
message: "If you use this software, please cite it using the metadata from this file."
type: software
authors:
  - given-names: Miel
    family-names: Peeters
    affiliation: GhentCDH
  - name: GhentCDH
    city: Gent
    country: BE
    website: 'https://www.ghentcdh.ugent.be/'
repository-code: 'https://github.com/ghentcdh/taulu'
abstract: "Taulu is a Python package designed to segment tabular data in scanned or photographed documents."
keywords:
  - OCR
  - tabular data
  - data extraction
  - HTR
  - historic documents
version: v0.7.5
date-released: '2025-05-28'

GitHub Events

Total
  • Create event: 14
  • Release event: 11
  • Issues event: 9
  • Watch event: 3
  • Delete event: 2
  • Issue comment event: 1
  • Push event: 39
  • Pull request review event: 3
  • Pull request review comment event: 2
  • Pull request event: 10
Last Year
  • Create event: 14
  • Release event: 11
  • Issues event: 9
  • Watch event: 3
  • Delete event: 2
  • Issue comment event: 1
  • Push event: 39
  • Pull request review event: 3
  • Pull request review comment event: 2
  • Pull request event: 10

Committers

Last synced: 7 months ago

All Time
  • Total Commits: 111
  • Total Committers: 3
  • Avg Commits per committer: 37.0
  • Development Distribution Score (DDS): 0.018
Past Year
  • Commits: 111
  • Committers: 3
  • Avg Commits per committer: 37.0
  • Development Distribution Score (DDS): 0.018
Top Committers
Name Email Commits
Miel Peeters p****l@g****m 109
Vincent Ducatteeuw 8****w 1
Joren Six j****x@u****e 1

Packages

  • Total packages: 1
  • Total downloads:
    • pypi 208 last-month
  • Total dependent packages: 0
  • Total dependent repositories: 0
  • Total versions: 15
  • Total maintainers: 1
pypi.org: taulu

Segment a table from an image

  • Versions: 15
  • Dependent Packages: 0
  • Dependent Repositories: 0
  • Downloads: 208 Last month
Rankings
Dependent packages count: 9.2%
Average: 30.6%
Dependent repos count: 51.9%
Maintainers (1)
Last synced: 6 months ago

Dependencies

.github/workflows/main.yml actions
  • actions/checkout v4 composite
  • actions/setup-python v5 composite
  • astral-sh/setup-uv v5 composite
  • softprops/action-gh-release v2 composite
pyproject.toml pypi
  • numpy >=2.2.4
  • opencv-python >=4.11.0.86
  • pandas >=2.2.3
  • scikit-image >=0.25.2
uv.lock pypi
  • colorama 0.4.6
  • exceptiongroup 1.2.2
  • imageio 2.37.0
  • iniconfig 2.1.0
  • lazy-loader 0.4
  • networkx 3.4.2
  • numpy 2.2.4
  • opencv-python 4.11.0.86
  • packaging 24.2
  • pandas 2.2.3
  • pillow 11.1.0
  • pluggy 1.5.0
  • pytest 8.3.5
  • python-dateutil 2.9.0.post0
  • pytz 2025.2
  • scikit-image 0.25.2
  • scipy 1.15.2
  • six 1.17.0
  • taulu 0.6.0
  • tifffile 2025.3.30
  • tomli 2.2.1
  • tzdata 2025.2