lector
A fast reader for messy CSV files with optional type inference.
Science Score: 44.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: found
- ✓ codemeta.json file: found
- ✓ .zenodo.json file: found
- ○ DOI references: not found
- ○ Academic publication links: not found
- ○ Academic email domains: not found
- ○ Institutional organization owner: not found
- ○ JOSS paper metadata: not found
- ○ Scientific vocabulary similarity: low similarity (15.2%) to scientific vocabulary
Keywords
Repository
A fast reader for messy CSV files with optional type inference.
Basic Info
- Host: GitHub
- Owner: graphext
- License: apache-2.0
- Language: Python
- Default Branch: main
- Homepage: https://lector.readthedocs.io/en/latest/
- Size: 245 KB
Statistics
- Stars: 17
- Watchers: 5
- Forks: 0
- Open Issues: 0
- Releases: 0
Topics
Metadata Files
README.md
Lector
Lector aims to be a fast reader for potentially messy CSV files with configurable column type inference. It combines automatic detection of file encodings, CSV dialects (separator, escaping etc.) and preambles (initial lines containing metadata or junk unrelated to the actual tabular data). Its goal is to just-read-the-effing-CSV file without manual configuration in most cases. Each of the detection components is configurable and can be swapped out easily with custom implementations.
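As a simpler illustration of what dialect detection involves (this is not lector's implementation), Python's standard-library `csv.Sniffer` can guess a delimiter from a sample of the file:

```python
import csv

# A small sample using ';' as separator, as is common in European locales.
sample = "id;genre;metric\n1;a;0.1\n2;b;0.12\n"

# Sniffer inspects character frequencies across lines to guess the dialect.
dialect = csv.Sniffer().sniff(sample)
print(dialect.delimiter)  # ';'
```

Lector's detectors go further (encodings, preambles, quoting), but the principle of inferring the dialect from the data itself is the same.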
Also, since both pandas and Apache Arrow will destructively cast columns to the wrong type in some cases (e.g. large ID-like integer strings to floats), it provides an alternative and customisable inference and casting mechanism.
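The destructive-cast problem is easy to demonstrate with plain Python, independent of any CSV library: 64-bit floats cannot represent every integer above 2^53, so an ID-like value silently changes when a column is parsed as float.

```python
# 2**53 + 1 is the first integer a 64-bit float cannot represent exactly.
big_id = 9007199254740993

# Casting to float (as happens, e.g., when an integer column contains
# missing values) silently rounds to the nearest representable float.
assert int(float(big_id)) == 9007199254740992
assert int(float(big_id)) != big_id
```

This is exactly why ID columns read as floats can end up with corrupted values.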
Under the hood it uses pyarrow's CSV parser for reading, and its compute functions for optional type inference.
Lector is used at Graphext behind the scenes whenever a user uploads a new dataset, and so implicitly has been validated across 1000s of different CSV files from all kinds of sources. Note, however, that this is Graphext's first foray into open-sourcing our code and still work-in-progress. So at least initially we won't provide any guarantees as to support of this library.
For quick usage examples see the Usage section below or the notebook in this repo.
For detailed documentation visit https://lector.readthedocs.io/.
Installing
While this library is not yet available on PyPI, you can easily install it from GitHub with
pip install git+https://github.com/graphext/lector
Usage
Let's assume we receive a CSV file containing some initial metadata, using the semicolon as separator, having some missing fields, and being encoded in Latin-1 (you'd be surprised how common such files are in the real world).
Create example CSV file
```python
csv = """
Some preamble content here
This is still "part of the metadata preamble"
id;genre;metric;count;content;website;tags;vecs;date
1234982348728374;a;0.1;1;;http://www.graphext.com;"[a,b,c]";"[1.3, 1.4, 1.67]";11/10/2022
;b;0.12;;"Natural language text is different from categorical data.";https://www.twitter.com;[d];"[0, 1.9423]";01/10/2022
9007199254740993;a;3.14;3;"The Project · Gutenberg » EBook « of Die Fürstin.";http://www.google.com;"['e', 'f']";["84.234, 12509.99"];13/10/2021
""".encode("ISO-8859-1")

with open("example.csv", "wb") as fp:
    fp.write(csv)
```

To read this with lector into a pandas DataFrame, simply use
```python
import lector

df = lector.read_csv("example.csv", to_pandas=True)
```
Printing the DataFrame and its column types produces the following output:
```
                 id genre  metric  count  \
0  1234982348728374     a    0.10      1
1              <NA>     b    0.12   <NA>
2  9007199254740993     a    3.14      3

                                             content                  website  \
0                                               <NA>  http://www.graphext.com
1  Natural language text is different from categ...  https://www.twitter.com
2  The Project · Gutenberg » EBook « of Die Fürs...    http://www.google.com

        tags                vecs       date
0  [a, b, c]    [1.3, 1.4, 1.67] 2022-10-11
1        [d]       [0.0, 1.9423] 2022-10-01
2     [e, f]  [84.234, 12509.99] 2021-10-13

id                   Int64
genre             category
metric             float64
count                UInt8
content             string
website           category
tags                object
vecs                object
date        datetime64[ns]
dtype: object
```
This is pretty sweet, because
- we didn't have to tell lector how to read this file (text encoding, lines to skip, separator etc.)
- we didn't have to tell lector the data types of the columns; it inferred the correct and most efficient ones automatically, e.g.:
  - a nullable `Int64` extension type was necessary to correctly represent values in the `id` column
  - categorical columns were automatically converted to the efficient `dictionary` (category) type
  - the `count` column uses the smallest integer type necessary (`UInt8`)
  - the `content` column, containing natural language text, has not been converted to a categorical type but kept as string values (it is unlikely to benefit from dictionary-encoding)
  - the `date` column was converted to datetimes correctly, even though the original strings are not in an ISO format
  - the `tags` and `vecs` columns have been imported with `object` dtype (since pandas doesn't officially support iterables as elements in a column), but their values are in fact numpy arrays of the correct dtype!
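Two of these behaviours can be reproduced directly in pandas, independent of lector (a minimal sketch; the values come from the example file above, and pandas is assumed to be installed):

```python
import pandas as pd

# Nullable Int64 keeps large IDs exact even when some values are missing;
# plain int64 cannot hold missing values, and float64 would round the ID.
ids = pd.array([1234982348728374, None], dtype="Int64")
print(ids.dtype)  # Int64

# Day-first parsing is required for dates like 13/10/2021.
date = pd.to_datetime("13/10/2021", dayfirst=True)
print(date.date())  # 2021-10-13
```

Lector's contribution is figuring out automatically that these are the right choices for each column.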
Neither pandas nor Arrow will do this. In fact, they cannot even import this data correctly, before any smart type inference is attempted. Compare, e.g., with pandas' attempt to read the same CSV file:
Pandas and Arrow fail

Firstly, to get something close to the above, you'll have to spend a good amount of time manually inspecting the CSV file and come up with the following verbose pandas call:

```python
dtypes = {
    "id": "Int64",
    "genre": "category",
    "metric": "float",
    "count": "UInt8",
    "content": "string",
    "website": "category",
    "tags": "object",
    "vecs": "object",
}

df = pd.read_csv(
    "example.csv",
    encoding="ISO-8859-1",
    skiprows=3,
    sep=";",
    dtype=dtypes,
    parse_dates=["date"],
    infer_datetime_format=True,
)
```

While this _parses_ the CSV file alright, the result is, urm, lacking. Let's see:

```
                 id genre  metric  count  \
0  1234982348728374     a    0.10      1
```

Development
To install a local copy for development, including all dependencies for test, documentation and code quality, use the following commands:
```bash
git clone https://github.com/graphext/lector
cd lector
pip install -v -e ".[dev]"
pre-commit install
```
The pre-commit command ensures that whenever you commit changes to this repo, code-quality and formatting tools are executed. This enforces a common coding style, so that any changes to be committed are functional changes only, not changes due to personal coding-style preferences. This in turn makes it easier to collaborate via pull requests etc.
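For context, pre-commit hooks are declared in a `.pre-commit-config.yaml` file at the repo root. The snippet below is a hypothetical illustration of such a configuration, not lector's actual hook set:

```yaml
# Hypothetical example; lector's actual hooks may differ.
repos:
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v4.5.0
    hooks:
      - id: trailing-whitespace  # strip stray whitespace at line ends
      - id: end-of-file-fixer    # ensure files end with a single newline
```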
To test installation you may execute the pytest suite to make sure everything's setup correctly, e.g.:
```bash
pytest -v .
```
Documentation
The documentation is created using Sphinx and is available here: https://lector.readthedocs.io/.
You can build and view the static html locally like any other Sphinx project:
```bash
(cd docs && make clean html)
(cd docs/build/html && python -m http.server)
```
To Do
- Parallelize type inference? While type inference is already pretty fast, it could potentially be sped up further by processing columns in parallel.
- Testing. The current pytest setup is terrible. I've given `hypothesis_csv` a try here, but I'm probably making bad use of it. Tests are convoluted and probably not even good at catching corner cases.
License
This project is licensed under the terms of the Apache License 2.0.
Links
- Documentation: https://lector.readthedocs.io/
- Source: https://github.com/graphext/lector
- Graphext: https://www.graphext.com
- Graphext on Twitter: https://twitter.com/graphext
Owner
- Name: graphext
- Login: graphext
- Kind: organization
- Website: www.graphext.com
- Twitter: graphext
- Repositories: 9
- Profile: https://github.com/graphext
Data science for business
Citation (CITATION.cff)
```yaml
cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
  - family-names: "Buhrmann"
    given-names: "Thomas"
title: "Lector"
version: 0.3.3
date-released: 2023-12-07
url: "https://github.com/graphext/lector"
```
GitHub Events
Total
- Watch event: 1
- Push event: 2
- Create event: 1
Last Year
- Watch event: 1
- Push event: 2
- Create event: 1
Dependencies
- clevercsv *