Tabbed: A Python package for reading variably structured text files at scale - Published in JOSS (2025)
Science Score: 95.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file: Found codemeta.json file
- ✓ .zenodo.json file: Found .zenodo.json file
- ✓ DOI references: Found 4 DOI reference(s) in README and JOSS metadata
- ✓ Academic publication links: Links to: zenodo.org
- ✓ Committers with academic emails: 1 of 2 committers (50.0%) from academic institutions
- ○ Institutional organization owner
- ✓ JOSS paper metadata: Published in Journal of Open Source Software
Repository
A Python package for reading variably structured text files at scale
Basic Info
- Host: GitHub
- Owner: mscaudill
- License: bsd-3-clause
- Language: Python
- Default Branch: master
- Homepage: https://mscaudill.github.io/tabbed/
- Size: 2.47 MB
Statistics
- Stars: 1
- Watchers: 1
- Forks: 2
- Open Issues: 0
- Releases: 5
Metadata Files
README.md
A Python package for reading variably structured text files at scale
Tabbed is a Python library for reading variably structured text files. It automatically deduces data start locations, data types and performs iterative and value-based conditional reading of data rows.
Key Features | Usage | Documentation | Dependencies | Installation | Contributing | Acknowledgments
Key Features
- Structural Inference: A common variant of the standard text file is one that contains metadata prior to a header or data section. Tabbed can locate the metadata, header and data sections in a file.
- Type Inference: Tabbed can parse `int`, `float`, `complex`, `time`, `date` and `datetime` instances at high speed via a polling strategy.
- Conditional Reading: Tabbed can filter rows during reading with equality, membership, rich comparison, regular expression matching and custom callables via simple keyword arguments.
- Partial and Iterative Reading: Tabbed reads large text files while consuming only as much memory as you choose.
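To make the polling idea behind type inference concrete, here is a minimal standard-library sketch. It is not Tabbed's implementation: the helper names `parse_cell` and `poll_types` and the hard-coded datetime format are assumptions made for illustration. Each cell in a small sample of rows is converted to the first type that accepts it, and each column takes the type that wins the vote.

```python
from collections import Counter
from datetime import datetime


def parse_cell(text):
    """Convert a string to the first type that accepts it, else keep str."""
    for convert in (int, float, complex):
        try:
            return convert(text)
        except ValueError:
            pass
    # Assumed datetime format, chosen to match the sample file in Usage
    try:
        return datetime.strptime(text, '%m/%d/%y %H:%M:%S.%f')
    except ValueError:
        return text


def poll_types(rows, poll=10):
    """Vote on each column's type using at most `poll` rows."""
    sample = rows[:poll]
    votes = [Counter() for _ in sample[0]]
    for row in sample:
        for counter, cell in zip(votes, row):
            counter[type(parse_cell(cell))] += 1
    # the most common type in each column wins
    return [counter.most_common(1)[0][0] for counter in votes]


rows = [
    ['0', '02/09/22 09:17:38.948', '0.0000', 'ALL'],
    ['1', '02/09/22 09:37:00.000', '1161.0520', 'ALL'],
    ['2', '02/09/22 09:37:00.000', '1161.0520', 'ALL'],
]
types = poll_types(rows)
print(types)
```

Polling a handful of rows instead of the whole file keeps inference cheap on large inputs, at the cost of possibly missing a rare value that does not fit the winning type.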
Usage
Below is a sample file with a Metadata section and Header using the tab character as the delimiter.
annotations.txt
```AsciiDoc
Experiment ID	Experiment
Animal ID	Animal
Researcher	Test
Directory path	

Number	Start Time	End Time	Time From Start	Channel	Annotation
0	02/09/22 09:17:38.948	02/09/22 09:17:38.948	0.0000	ALL	Started Recording
1	02/09/22 09:37:00.000	02/09/22 09:37:00.000	1161.0520	ALL	start
2	02/09/22 09:37:00.000	02/09/22 09:37:08.784	1161.0520	ALL	exploring
3	02/09/22 09:37:08.784	02/09/22 09:37:13.897	1169.8360	ALL	grooming
4	02/09/22 09:37:13.897	02/09/22 09:38:01.262	1174.9490	ALL	exploring
5	02/09/22 09:38:01.262	02/09/22 09:38:07.909	1222.3140	ALL	grooming
6	02/09/22 09:38:07.909	02/09/22 09:38:20.258	1228.9610	ALL	exploring
7	02/09/22 09:38:20.258	02/09/22 09:38:25.435	1241.3100	ALL	grooming
8	02/09/22 09:38:25.435	02/09/22 09:40:07.055	1246.4870	ALL	exploring
9	02/09/22 09:40:07.055	02/09/22 09:40:22.334	1348.1070	ALL	grooming
10	02/09/22 09:40:22.334	02/09/22 09:41:36.664	1363.3860	ALL	exploring
```
Dialect and Type Inference
Tabbed can detect the dialect via clevercsv and infer the data types.
```python
from tabbed.reading import Reader
from tabbed.samples import paths

infile = open(paths.annotations, 'r')
reader = Reader(infile)
dialect = reader.sniffer.dialect
types, _ = reader.sniffer.types(poll=10)

print(dialect)  # a clevercsv SimpleDialect
print('---')
print(types)
```
Output
```
SimpleDialect('\t', '"', None)
---
[
```
Metadata and Header detection
Tabbed can automatically locate the metadata, header and data rows.
```python
print(reader.header)
print('---')
print(reader.metadata())
```
Output
```
Header(line=6, names=['Number', 'Start_Time', 'End_Time', 'Time_From_Start', 'Channel', 'Annotation'],
string='Number\tStart Time\tEnd Time\tTime From Start\tChannel\tAnnotation')
---
MetaData(lines=(0, 6), string='Experiment ID\tExperiment\nAnimal ID\tAnimal\nResearcher\tTest\nDirectory path\t\n\n')
```
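One way to picture how a header line can be located is a field-count heuristic. The sketch below is only an illustration and makes no claim about how Tabbed itself does it: the function name `locate_sections` is hypothetical, and it assumes the file ends with data rows that all have the same number of delimited fields as the header.

```python
def locate_sections(lines, delimiter='\t'):
    """Split lines into (metadata, header, data) using field counts."""
    counts = [len(line.split(delimiter)) for line in lines]
    width = counts[-1]  # assume the last line is a data row
    start = len(lines) - 1
    # walk upward while rows keep the same number of fields
    while start > 0 and counts[start - 1] == width:
        start -= 1
    # the first row of the uniform block is taken to be the header
    return lines[:start], lines[start], lines[start + 1:]


text = (
    'Experiment ID\tExperiment\n'
    'Animal ID\tAnimal\n'
    '\n'
    'Number\tStart Time\tChannel\n'
    '0\t02/09/22 09:17:38.948\tALL\n'
    '1\t02/09/22 09:37:00.000\tALL\n'
)
metadata, header, data = locate_sections(text.splitlines())
print(header.split('\t'))
print(len(metadata), len(data))
```

A count-based rule like this fails when metadata lines happen to have the same width as the data, which is one reason a real detector would also want to compare inferred types between rows.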
Filtered Reading with Tabs
Tabbed supports row and column filtering with equality, membership, rich comparison and regular expression matching. It's also fully iterative, allowing users to choose how much memory to consume during file reading.
```python
from itertools import chain

# tab rows whose Start_Time is between 9:38 and 9:40 and set reader to read
# only the Number and Start_Time columns
reader.tab(
    Start_Time='>= 2/09/2022 9:38:00 and <2/09/2022 9:40:00',
    columns=['Number', 'Start_Time']
)

# read the data to an iterator reading only 2 rows at a time
gen = reader.read(chunksize=2)

# convert to an in-memory list
data = list(chain.from_iterable(gen))
print(data)

# close the reader when done or open under context-management
reader.close()
```
Output
```
{'Number': 5, 'Start_Time': datetime.datetime(2022, 2, 9, 9, 38, 1, 262000)}
{'Number': 6, 'Start_Time': datetime.datetime(2022, 2, 9, 9, 38, 7, 909000)}
{'Number': 7, 'Start_Time': datetime.datetime(2022, 2, 9, 9, 38, 20, 258000)}
{'Number': 8, 'Start_Time': datetime.datetime(2022, 2, 9, 9, 38, 25, 435000)}
```
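The pattern of filtering rows while reading in fixed-size chunks can also be sketched with the standard library alone. This is not Tabbed's code: `read_chunks` and its `keep` callable are hypothetical names, and values stay as strings because no type inference is applied here. It shows why memory use is bounded by the chunk size rather than the file size.

```python
import csv
import io
from itertools import chain, islice


def read_chunks(stream, keep, columns, chunksize=2, delimiter='\t'):
    """Yield lists of at most `chunksize` filtered, column-reduced rows."""
    reader = csv.DictReader(stream, delimiter=delimiter)
    # lazy pipeline: rows are filtered and trimmed only as they are requested
    selected = ({c: row[c] for c in columns} for row in reader if keep(row))
    while chunk := list(islice(selected, chunksize)):
        yield chunk


text = (
    'Number\tChannel\tAnnotation\n'
    '0\tALL\tstart\n'
    '1\tALL\texploring\n'
    '2\tALL\tgrooming\n'
    '3\tALL\texploring\n'
    '4\tALL\tgrooming\n'
)
gen = read_chunks(
    io.StringIO(text),
    keep=lambda row: int(row['Number']) >= 2,
    columns=['Number', 'Annotation'],
)
chunks = list(gen)
data = list(chain.from_iterable(chunks))
print(data)
```

Because the generator holds only one chunk at a time, a caller that processes each chunk and discards it never needs the whole file in memory; flattening with `chain.from_iterable` is only appropriate when the filtered result is known to be small.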
Documentation
The official documentation is hosted on github.io at https://mscaudill.github.io/tabbed/.
Dependencies
Tabbed depends on the excellent clevercsv package for dialect detection. The rest is pure Python.
Installation
Tabbed is hosted on PyPI and can be installed with pip into a virtual environment.
```bash
pip install tabbed
```
To get a development version of Tabbed from source, start by cloning the repository.
```bash
git clone git@github.com:mscaudill/tabbed.git
```
Go to the directory you just cloned and create an editable install with pip.
```bash
pip install -e .[dev]
```
Contributing
We're excited you want to contribute! Please check out our Contribution guide.
Acknowledgements
We are grateful for the support of the Ting Tsung and Wei Fong Chao Foundation and the Jan and Dan Duncan Neurological Research Institute at Texas Children's, which generously support Tabbed.
Owner
- Name: Matt Caudill
- Login: mscaudill
- Kind: user
- Location: Houston, TX
- Company: Baylor College of Medicine & Texas Children's NRI
- Repositories: 3
- Profile: https://github.com/mscaudill
JOSS Publication
Tabbed: A Python package for reading variably structured text files at scale
Authors
Tags
Data Science, File Parsing, Text Processing
GitHub Events
Total
- Create event: 4
- Issues event: 20
- Release event: 1
- Issue comment event: 23
- Push event: 127
- Pull request event: 2
- Fork event: 1
Last Year
- Create event: 4
- Issues event: 20
- Release event: 1
- Issue comment event: 23
- Push event: 127
- Pull request event: 2
- Fork event: 1
Committers
Last synced: 5 months ago
Top Committers
| Name | Email | Commits |
|---|---|---|
| mscaudill | m****l@g****m | 276 |
| Brad Sheppard | b****d@b****u | 11 |
Committer Domains (Top 20 + Academic)
Issues and Pull Requests
Last synced: 4 months ago
All Time
- Total issues: 13
- Total pull requests: 1
- Average time to close issues: 5 days
- Average time to close pull requests: N/A
- Total issue authors: 2
- Total pull request authors: 1
- Average comments per issue: 0.92
- Average comments per pull request: 0.0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 13
- Pull requests: 1
- Average time to close issues: 5 days
- Average time to close pull requests: N/A
- Issue authors: 2
- Pull request authors: 1
- Average comments per issue: 0.92
- Average comments per pull request: 0.0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- jolars (10)
- ymahlau (3)
Pull Request Authors
- BradShepps (1)
Top Labels
Issue Labels
Pull Request Labels
Dependencies
- ipython *
- matplotlib *
- notebook *
- numpy *
- psutil *
- requests *
- scikit-learn *
- scipy *
- wget *
