Tabbed: A Python package for reading variably structured text files at scale

Tabbed: A Python package for reading variably structured text files at scale - Published in JOSS (2025)

https://github.com/mscaudill/tabbed

Science Score: 95.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 4 DOI reference(s) in README and JOSS metadata
  • Academic publication links
    Links to: zenodo.org
  • Committers with academic emails
    1 of 2 committers (50.0%) from academic institutions
  • Institutional organization owner
  • JOSS paper metadata
    Published in Journal of Open Source Software

Keywords

csv csv-files csv-parser delimited-data parsing reader text-reader txt
Last synced: 3 months ago

Repository

A Python package for reading variably structured text files at scale

Basic Info
Statistics
  • Stars: 1
  • Watchers: 1
  • Forks: 2
  • Open Issues: 0
  • Releases: 5
Topics
csv csv-files csv-parser delimited-data parsing reader text-reader txt
Created about 1 year ago · Last pushed 3 months ago
Metadata Files
Readme Contributing License Code of conduct

README.md

A Python package for reading variably structured text files at scale


Tabbed is a Python library for reading variably structured text files. It automatically deduces data start locations and data types, and it supports iterative and value-based conditional reading of data rows.



Key Features

  • Structural Inference:
    A common variant of the standard text file is one that contains metadata prior to a header or data section. Tabbed can locate the metadata, header and data sections in a file.

  • Type Inference:
    Tabbed can parse int, float, complex, time, date and datetime instances at high speed via a polling strategy.

  • Conditional Reading:
    Tabbed can filter rows during reading with equality, membership, rich comparison, regular expression matching and custom callables via simple keyword arguments.

  • Partial and Iterative Reading:
    Tabbed supports reading large text files while consuming only as much memory as you choose.
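
The polling idea behind type inference can be illustrated with a short standard-library sketch. This is not Tabbed's implementation: the names `convert`, `poll_types` and `DATETIME_FORMATS` are hypothetical, and the format list is an assumption chosen to match the sample data used later in this README. The sketch only shows how sampling a few rows and taking a majority vote per column can yield one type per column.

```python
from collections import Counter
from datetime import datetime

# Hypothetical formats to try; a real reader would support many more.
DATETIME_FORMATS = ("%m/%d/%y %H:%M:%S.%f", "%m/%d/%Y %H:%M:%S")


def convert(cell):
    """Convert a string cell to a numeric or datetime value, else return it unchanged."""
    for caster in (int, float, complex):
        try:
            return caster(cell)
        except ValueError:
            pass
    for fmt in DATETIME_FORMATS:
        try:
            return datetime.strptime(cell, fmt)
        except ValueError:
            pass
    return cell


def poll_types(rows, poll=10):
    """Return the most common converted type per column among the first `poll` rows."""
    sample = rows[:poll]
    columns = zip(*sample)  # transpose the sampled rows into columns
    return [
        Counter(type(convert(cell)) for cell in column).most_common(1)[0][0]
        for column in columns
    ]


rows = [
    ["0", "02/09/22 09:17:38.948", "0.0000", "Started Recording"],
    ["1", "02/09/22 09:37:00.000", "1161.0520", "start"],
    ["2", "02/09/22 09:37:00.000", "1161.0520", "exploring"],
]
types = poll_types(rows)
print(types)
```

Voting over a small sample keeps the cost independent of file size, at the price of possibly missing a rare value further down the file that does not fit the polled type.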

Usage

Below is a sample file with a Metadata section and Header using the tab character as the delimiter.

annotations.txt
```AsciiDoc
Experiment ID	Experiment
Animal ID	Animal
Researcher	Test
Directory path	

Number	Start Time	End Time	Time From Start	Channel	Annotation
0	02/09/22 09:17:38.948	02/09/22 09:17:38.948	0.0000	ALL	Started Recording
1	02/09/22 09:37:00.000	02/09/22 09:37:00.000	1161.0520	ALL	start
2	02/09/22 09:37:00.000	02/09/22 09:37:08.784	1161.0520	ALL	exploring
3	02/09/22 09:37:08.784	02/09/22 09:37:13.897	1169.8360	ALL	grooming
4	02/09/22 09:37:13.897	02/09/22 09:38:01.262	1174.9490	ALL	exploring
5	02/09/22 09:38:01.262	02/09/22 09:38:07.909	1222.3140	ALL	grooming
6	02/09/22 09:38:07.909	02/09/22 09:38:20.258	1228.9610	ALL	exploring
7	02/09/22 09:38:20.258	02/09/22 09:38:25.435	1241.3100	ALL	grooming
8	02/09/22 09:38:25.435	02/09/22 09:40:07.055	1246.4870	ALL	exploring
9	02/09/22 09:40:07.055	02/09/22 09:40:22.334	1348.1070	ALL	grooming
10	02/09/22 09:40:22.334	02/09/22 09:41:36.664	1363.3860	ALL	exploring
```

Dialect and Type Inference

Tabbed can detect the dialect via clevercsv and infer the data types.

```python
from tabbed.reading import Reader
from tabbed.samples import paths

infile = open(paths.annotations, 'r')
reader = Reader(infile)
dialect = reader.sniffer.dialect
types, _ = reader.sniffer.types(poll=10)

print(dialect)  # a clevercsv SimpleDialect
print('---')
print(types)
```

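Output
```
SimpleDialect('\t', '"', None)
---
[<class 'int'>, <class 'datetime.datetime'>, <class 'datetime.datetime'>, <class 'float'>, <class 'str'>, <class 'str'>]
```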

Metadata and Header detection

Tabbed can automatically locate the metadata, header and data rows.

```python
print(reader.header)
print('---')
print(reader.metadata())
```

Output
```
Header(line=6, names=['Number', 'Start_Time', 'End_Time', 'Time_From_Start', 'Channel', 'Annotation'],
string='Number\tStart Time\tEnd Time\tTime From Start\tChannel\tAnnotation')
---
MetaData(lines=(0, 6), string='Experiment ID\tExperiment\nAnimal ID\tAnimal\nResearcher\tTest\nDirectory path\t\n\n')
```
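
One simple way to think about header detection is sketched below with the standard library. It is an illustrative heuristic and an assumption on our part, not necessarily how Tabbed locates the header: it treats the most common field count as the width of the data section and takes the first line with that many fields as the header. The function `locate_header` and the abbreviated `sample` text are hypothetical; the sample is padded with two blank lines so that the header falls on line 6.

```python
from collections import Counter

# Abbreviated, hypothetical stand-in for a file with metadata above the header.
sample = (
    "Experiment ID\tExperiment\n"
    "Animal ID\tAnimal\n"
    "Researcher\tTest\n"
    "Directory path\t\n"
    "\n"
    "\n"
    "Number\tStart Time\tAnnotation\n"
    "0\t02/09/22 09:17:38.948\tStarted Recording\n"
    "1\t02/09/22 09:37:00.000\tstart\n"
    "2\t02/09/22 09:37:00.000\texploring\n"
    "3\t02/09/22 09:37:08.784\tgrooming\n"
    "4\t02/09/22 09:37:13.897\texploring\n"
)


def locate_header(text, delimiter="\t"):
    """Return (line index, names) of the first line having the modal field count."""
    lines = text.splitlines()
    counts = [len(line.split(delimiter)) for line in lines]
    width = Counter(counts).most_common(1)[0][0]  # most common field count
    header_line = counts.index(width)  # first line with that many fields
    # Replace spaces so names can be used as Python keyword arguments
    names = [name.replace(" ", "_") for name in lines[header_line].split(delimiter)]
    return header_line, names


header_line, names = locate_header(sample)
metadata = sample.splitlines()[:header_line]
print(header_line, names)
```

This heuristic fails when the metadata section has more lines than the data section or happens to share its field count, which is why a production reader would need additional evidence such as type consistency between rows.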

Filtered Reading with Tabs

Tabbed supports row and column filtering with equality, membership, rich comparison and regular expression matching. It is also fully iterative, allowing users to choose how much memory to consume during file reading.

```python
from itertools import chain

# tab rows whose Start_Time is between 9:38 and 9:40 and set reader to read
# only the Number and Start_Time columns
reader.tab(
    Start_Time='>= 2/09/2022 9:38:00 and <2/09/2022 9:40:00',
    columns=['Number', 'Start_Time']
)

# read the data to an iterator reading only 2 rows at a time
gen = reader.read(chunksize=2)

# convert to an in-memory list
data = list(chain.from_iterable(gen))
print(data)

# close the reader when done or open under context-management
reader.close()
```

Output
```
{'Number': 5, 'Start_Time': datetime.datetime(2022, 2, 9, 9, 38, 1, 262000)}
{'Number': 6, 'Start_Time': datetime.datetime(2022, 2, 9, 9, 38, 7, 909000)}
{'Number': 7, 'Start_Time': datetime.datetime(2022, 2, 9, 9, 38, 20, 258000)}
{'Number': 8, 'Start_Time': datetime.datetime(2022, 2, 9, 9, 38, 25, 435000)}
```
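
The memory behavior of chunked, filtered reading can be sketched with the standard library alone. The helper `read_chunks` and its `keep` parameter are hypothetical and are not part of Tabbed; the sketch only shows how a generator can filter rows lazily and yield at most `chunksize` rows at a time, so that a single chunk is held in memory until the caller chooses to collect them. Unlike Tabbed, this sketch performs no type conversion, so every value stays a string.

```python
import csv
import io
from itertools import chain, islice


def read_chunks(stream, keep, chunksize=2, delimiter="\t"):
    """Yield lists of at most `chunksize` row dicts for which keep(row) is true."""
    reader = csv.DictReader(stream, delimiter=delimiter)
    rows = (row for row in reader if keep(row))  # lazy filter, nothing read yet
    while True:
        chunk = list(islice(rows, chunksize))
        if not chunk:
            return
        yield chunk


# Hypothetical in-memory file standing in for a large file on disk.
text = (
    "Number\tAnnotation\n"
    "0\tStarted Recording\n"
    "1\tstart\n"
    "2\texploring\n"
    "3\tgrooming\n"
    "4\texploring\n"
    "5\tgrooming\n"
    "6\texploring\n"
)

gen = read_chunks(io.StringIO(text), keep=lambda row: row["Annotation"] == "exploring")
chunks = list(gen)
data = list(chain.from_iterable(chunks))
print(data)
```

Collecting the chunks with `chain.from_iterable` mirrors the example above; processing each chunk inside a `for` loop instead keeps peak memory bounded by `chunksize`.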

Documentation

The official documentation is hosted on github.io.

Dependencies

Tabbed depends on the excellent clevercsv package for dialect detection. The rest is pure Python.

Installation

Tabbed is hosted on PyPI and can be installed with pip into a virtual environment.

```bash
pip install tabbed
```

To get a development version of Tabbed from source, start by cloning the repository:

```bash
git clone git@github.com:mscaudill/tabbed.git
```

Go to the directory you just cloned and create an editable install with pip.

```bash
pip install -e .[dev]
```

Contributing

We're excited you want to contribute! Please check out our Contribution guide.

Acknowledgements


We are grateful to the Ting Tsung and Wei Fong Chao Foundation and the Jan and Dan Duncan Neurological Research Institute at Texas Children's, which generously support Tabbed.


Owner

  • Name: Matt Caudill
  • Login: mscaudill
  • Kind: user
  • Location: Houston, TX
  • Company: Baylor College of Medicine & Texas Children's NRI

JOSS Publication

Tabbed: A Python package for reading variably structured text files at scale
Published
November 11, 2025
Volume 10, Issue 115, Page 8964
Authors
Matthew S. Caudill ORCID
Department of Neuroscience, Baylor College of Medicine, Houston, TX, United States of America, Jan and Dan Duncan Neurological Research Institute at Texas Children's Hospital, Houston, TX, United States of America
Editor
Neea Rusch ORCID
Tags
Data Science File Parsing Text Processing

GitHub Events

Total
  • Create event: 4
  • Issues event: 20
  • Release event: 1
  • Issue comment event: 23
  • Push event: 127
  • Pull request event: 2
  • Fork event: 1
Last Year
  • Create event: 4
  • Issues event: 20
  • Release event: 1
  • Issue comment event: 23
  • Push event: 127
  • Pull request event: 2
  • Fork event: 1

Committers

Last synced: 5 months ago

All Time
  • Total Commits: 287
  • Total Committers: 2
  • Avg Commits per committer: 143.5
  • Development Distribution Score (DDS): 0.038
Past Year
  • Commits: 287
  • Committers: 2
  • Avg Commits per committer: 143.5
  • Development Distribution Score (DDS): 0.038
Top Committers
Name Email Commits
mscaudill m****l@g****m 276
Brad Sheppard b****d@b****u 11
Committer Domains (Top 20 + Academic)

Issues and Pull Requests

Last synced: 4 months ago

All Time
  • Total issues: 13
  • Total pull requests: 1
  • Average time to close issues: 5 days
  • Average time to close pull requests: N/A
  • Total issue authors: 2
  • Total pull request authors: 1
  • Average comments per issue: 0.92
  • Average comments per pull request: 0.0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 13
  • Pull requests: 1
  • Average time to close issues: 5 days
  • Average time to close pull requests: N/A
  • Issue authors: 2
  • Pull request authors: 1
  • Average comments per issue: 0.92
  • Average comments per pull request: 0.0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • jolars (10)
  • ymahlau (3)
Pull Request Authors
  • BradShepps (1)
Top Labels
Issue Labels
bug (5) enhancement (3) question (1) documentation (1)
Pull Request Labels

Dependencies

pyproject.toml pypi
  • ipython *
  • matplotlib *
  • notebook *
  • numpy *
  • psutil *
  • requests *
  • scikit-learn *
  • scipy *
  • wget *