https://github.com/alan-turing-institute/clevercsv

CleverCSV is a Python package for handling messy CSV files. It provides a drop-in replacement for the builtin CSV module with improved dialect detection, and comes with a handy command line application for working with CSV files.

Keywords

csv csv-converter csv-export csv-files csv-format csv-import csv-parser csv-parsing csv-reader csv-reading data-analysis data-mining data-science datascience machine-learning python python-library python3

Keywords from Contributors

pypi genomics archival packaging interactive projection game-development tokenizer profiling sequences

Last synced: 6 months ago

Repository

Basic Info

Host: GitHub
Owner: alan-turing-institute
License: mit
Language: Python
Default Branch: master
Homepage: https://clevercsv.readthedocs.io
Size: 3.48 MB

Statistics

Stars: 1,306
Watchers: 18
Forks: 78
Open Issues: 14
Releases: 4

Created about 7 years ago · Last pushed 6 months ago

Metadata Files

Readme Changelog License Code of conduct

README.md

CleverCSV provides a drop-in replacement for the Python csv package with improved dialect detection for messy CSV files. It also provides a handy command line tool that can standardize a messy file or generate Python code to import it.

Useful links:

CleverCSV on GitHub
CleverCSV on PyPI
Documentation on ReadTheDocs
Demo of CleverCSV on Binder (interactive!)
Research Paper on CSV dialect detection (PDF)
Reproducible Research Repo
Blog post on messy CSV files
Discussion forum: a place to ask questions and share ideas!

Quick Start

See the Introduction below for more details about CleverCSV. If you're in a hurry, below is a quick overview of how to get started with the CleverCSV Python package and the command line interface.

For the Python package:

```python
# Import the package
import clevercsv

# Load the file as a list of rows
# This uses the imdb.csv file in the examples directory
rows = clevercsv.read_table('./imdb.csv')

# Load the file as a Pandas Dataframe
# Note that df = pd.read_csv('./imdb.csv') would fail here
df = clevercsv.read_dataframe('./imdb.csv')

# Use CleverCSV as drop-in replacement for the Python CSV module
# This follows the Sniffer example: https://docs.python.org/3/library/csv.html#csv.Sniffer
# Note that csv.Sniffer would fail here
with open('./imdb.csv', newline='') as csvfile:
    dialect = clevercsv.Sniffer().sniff(csvfile.read())
    csvfile.seek(0)
    reader = clevercsv.reader(csvfile, dialect)
    rows = list(reader)
```

And for the command line interface:

```python
# Install the full version of CleverCSV (this includes the command line interface)
$ pip install clevercsv[full]

# Detect the dialect
$ clevercsv detect ./imdb.csv
Detected: SimpleDialect(',', '', '\\')

# Generate code to import the file
$ clevercsv code ./imdb.csv

import clevercsv

with open("./imdb.csv", "r", newline="", encoding="utf-8") as fp:
    reader = clevercsv.reader(fp, delimiter=",", quotechar="", escapechar="\\")
    rows = list(reader)

# Explore the CSV file as a Pandas dataframe
$ clevercsv explore -p imdb.csv
Dropping you into an interactive shell.
CleverCSV has loaded the data into the variable: df
>>> df
```

Introduction

CSV files are awesome! They are lightweight, easy to share, human-readable, version-controllable, and supported by many systems and tools!
CSV files are terrible! They can have many different formats, multiple tables, headers or no headers, escape characters, and there's no support for recording metadata!

CleverCSV is a Python package that aims to solve some of the pain points of CSV files, while maintaining many of the good things. The package automatically detects (with high accuracy) the format (dialect) of CSV files, thus making it easier to simply point to a CSV file and load it, without the need for human inspection. In the future, we hope to solve some of the other issues of CSV files too.

CleverCSV is based on science. We investigated thousands of real-world CSV files to find a robust way to automatically detect the dialect of a file. This may seem like an easy problem, but to a computer a CSV file is simply a long string, and every dialect will give you some table. In CleverCSV we use a technique based on the patterns of row lengths of the parsed file and the data type of the resulting cells. With our method we achieve 97% accuracy for dialect detection, with a 21% improvement on non-standard (messy) CSV files compared to the Python standard library.
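To make the idea concrete, here is a toy sketch of scoring candidate delimiters by row-length consistency and cell types. It is not CleverCSV's actual algorithm: the helper names (`pattern_score`, `type_score`, `guess_delimiter`), the crude type check, and the choice to multiply the two scores are assumptions made for illustration, and the sample data is invented. It uses only the standard library `csv` module.

```python
import csv
import io
from collections import Counter


def pattern_score(rows):
    """Fraction of rows that share the most common row length."""
    if not rows:
        return 0.0
    lengths = Counter(len(row) for row in rows)
    length, count = lengths.most_common(1)[0]
    if length < 2:
        return 0.0  # a single column means the delimiter split nothing
    return count / len(rows)


def is_typed(cell):
    """Very rough type check: empty, numeric, or plain alphanumeric text."""
    cell = cell.strip()
    if cell == "":
        return True
    try:
        float(cell)
        return True
    except ValueError:
        return cell.replace(" ", "").isalnum()


def type_score(rows):
    """Fraction of cells that look like a recognizable data type."""
    cells = [cell for row in rows for cell in row]
    if not cells:
        return 0.0
    return sum(is_typed(cell) for cell in cells) / len(cells)


def guess_delimiter(text, candidates=",;\t|"):
    """Parse the text once per candidate delimiter and keep the best score."""
    scores = {}
    for delim in candidates:
        rows = list(csv.reader(io.StringIO(text), delimiter=delim))
        scores[delim] = pattern_score(rows) * type_score(rows)
    return max(scores, key=scores.get), scores


# Invented sample: semicolon-delimited, with decimal commas in one column
sample = "name;height;city\nAda;1,70;London\nAlan;1,78;Wilmslow\n"
best, scores = guess_delimiter(sample)
print(best)  # ;
```

Parsing with a comma also yields fairly regular rows here, but the resulting cells (such as `Ada;1`) do not look like any data type, so the combined score favors the semicolon.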

We think this kind of work can be very valuable for working data scientists and programmers, and we hope that you find CleverCSV useful (if there's a problem, please open an issue!). Since the academic world counts citations, please cite CleverCSV if you use the package. Here's a BibTeX entry you can use:

```bib
@article{van2019wrangling,
        title = {Wrangling Messy {CSV} Files by Detecting Row and Type Patterns},
        author = {{van den Burg}, G. J. J. and Naz{\'a}bal, A. and Sutton, C.},
        journal = {Data Mining and Knowledge Discovery},
        year = {2019},
        volume = {33},
        number = {6},
        pages = {1799--1820},
        issn = {1573-756X},
        doi = {10.1007/s10618-019-00646-y},
}
```

And of course, if you like the package please spread the word! You can do this by Tweeting about it (#CleverCSV) or clicking the ⭐️ on GitHub!

Installation

CleverCSV is available on PyPI. You can install either the full version, which includes the command line interface and all optional dependencies, using

```bash
$ pip install clevercsv[full]
```

or you can install a lighter, core version of CleverCSV with

```bash
$ pip install clevercsv
```

Usage

CleverCSV consists of a Python library and a command line tool called clevercsv.

Python Library

We designed CleverCSV to provide a drop-in replacement for the built-in CSV module, with some useful functionality added to it. Therefore, if you simply want to replace the builtin CSV module with CleverCSV, you can import CleverCSV as follows, and use it as you would use the builtin csv module.

```python
import clevercsv
```

CleverCSV provides an improved version of the dialect sniffer in the CSV module, but it also adds some useful wrapper functions. These functions automatically detect the dialect and aim to make working with CSV files easier. We currently have the following helper functions:

detect_dialect: takes a path to a CSV file and returns the detected dialect
read_table: automatically detects the dialect and encoding of the file, and returns the data as a list of rows. A version that returns a generator is also available: stream_table
read_dataframe: detects the dialect and encoding of the file and then uses Pandas to read the CSV into a DataFrame. Note that this function requires Pandas to be installed.
read_dicts: detects the dialect and returns the rows of the file as dictionaries, assuming the first row contains the headers. A streaming version called stream_dicts is also available.
write_table: writes a table (a list of lists) to a file using the RFC-4180 dialect.
write_dicts: writes a list of dictionaries to a file using the RFC-4180 dialect.
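As a rough illustration of how these helpers fit together, the sketch below writes a small table and reads it back. It assumes CleverCSV is installed; the function name `roundtrip_demo`, the file name `people.csv`, and the sample data are hypothetical, and the import is placed inside the function only so that the snippet can be loaded without the package present.

```python
def roundtrip_demo(path="people.csv"):
    """Write a small table, then read it back with the helper functions."""
    import clevercsv  # third-party package: pip install clevercsv

    table = [["name", "score"], ["Ada", "36"], ["Grace", "45"]]
    clevercsv.write_table(table, path)        # write the list of rows to disk

    dialect = clevercsv.detect_dialect(path)  # the detected dialect
    rows = clevercsv.read_table(path)         # the data as a list of rows
    records = clevercsv.read_dicts(path)      # first row is used as the header
    return dialect, rows, records
```

Calling `roundtrip_demo()` creates the file in the current working directory, so point `path` at a temporary location if you only want to experiment.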

Of course, you can also use the traditional way of loading a CSV file, as in the Python CSV module:

```python
import clevercsv

with open("data.csv", "r", newline="") as fp:
    # you can use verbose=True to see what CleverCSV does
    dialect = clevercsv.Sniffer().sniff(fp.read(), verbose=False)
    fp.seek(0)
    reader = clevercsv.reader(fp, dialect)
    rows = list(reader)
```

Since CleverCSV v0.8.0, dialect detection is a lot faster than in previous versions. However, for large files, you can speed up detection even more by supplying a sample of the document to the sniffer instead of the whole file, for example:

```python
dialect = clevercsv.Sniffer().sniff(fp.read(10000))
```

You can also speed up encoding detection by installing cCharDet; it will automatically be used when it is available on the system.
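The sampling approach can be wrapped in a small helper so that only the start of a large file is ever read. This is a sketch under assumptions: `sniff_sample` is a hypothetical name, and the demo uses the standard library's `csv.Sniffer` on an invented file so that it runs without CleverCSV installed. Since CleverCSV is described as a drop-in replacement, passing `clevercsv.Sniffer()` instead should work the same way.

```python
import csv
import os
import tempfile


def sniff_sample(path, sniffer, num_chars=10000, encoding="utf-8"):
    """Detect the dialect from the first num_chars characters of a file."""
    with open(path, "r", newline="", encoding=encoding) as fp:
        sample = fp.read(num_chars)  # read only the sample, not the whole file
    return sniffer.sniff(sample)


# Build an invented semicolon-delimited file and sniff only its first 2000 characters
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "big.csv")
    with open(path, "w", newline="", encoding="utf-8") as fp:
        fp.write("a;b;c\n")
        for i in range(5000):
            fp.write(f"{i};{i * 2};{i * 3}\n")
    dialect = sniff_sample(path, csv.Sniffer(), num_chars=2000)

print(dialect.delimiter)
```

A fixed-size sample may end in the middle of a row, so a sample that is too small can mislead any sniffer; choose a size that covers a good number of complete rows.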

That's the basics! If you want more details, you can look at the code of the package, the test suite, or the API documentation. If you run into any issues or have comments or suggestions, please open an issue on GitHub.

Command-Line Tool

To use the command line tool, make sure that you install the full version of CleverCSV (see above).

The clevercsv command line application has a number of handy features to make working with CSV files easier. For instance, it can be used to view a CSV file on the command line while automatically detecting the dialect. It can also generate Python code for importing data from a file with the correct dialect. The full help text is as follows:

```text
usage: clevercsv [-h] [-V] [-v] command ...

Available commands:
  help         Display help information
  detect       Detect the dialect of a CSV file
  view         View the CSV file on the command line using TabView
  standardize  Convert a CSV file to one that conforms to RFC-4180
  code         Generate Python code to import a CSV file
  explore      Explore the CSV file in an interactive Python shell
```

Each of the commands has further options (for instance, the code and explore commands have support for importing the CSV file as a Pandas DataFrame). Use clevercsv help <command> or man clevercsv <command> for more information. Below are some examples for each command.

Note that each command accepts the -n or --num-chars flag to set the number of characters used to detect the dialect. This can be especially helpful to speed up dialect detection on large files.

Code

Code generation is useful when you don't want to detect the dialect of the same file over and over again. You simply run the following command and copy the generated code to a Python script!

```text
$ clevercsv code imdb.csv

# Code generated with CleverCSV

import clevercsv

with open("imdb.csv", "r", newline="", encoding="utf-8") as fp:
    reader = clevercsv.reader(fp, delimiter=",", quotechar="", escapechar="\\")
    rows = list(reader)
```

We also have a version that reads a Pandas dataframe:

```text
$ clevercsv code --pandas imdb.csv

# Code generated with CleverCSV

import clevercsv

df = clevercsv.read_dataframe("imdb.csv", delimiter=",", quotechar="", escapechar="\\")
```

Detect

Detection is useful when you only want to know the dialect.

```text
$ clevercsv detect imdb.csv
Detected: SimpleDialect(',', '', '\\')
```

The --plain flag gives the components of the dialect on separate lines, which makes combining it with grep easier.

```text
$ clevercsv detect --plain imdb.csv
delimiter = ,
quotechar =
escapechar = \
```
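Besides grep, the plain output is also easy to consume from a script. The sketch below parses text in the `name = value` layout shown above into a dictionary. The function name `parse_plain_dialect` is hypothetical, and the parsing rules (one component per line, a single space after the equals sign) are assumptions based only on the example output, not on a documented format.

```python
def parse_plain_dialect(output):
    """Turn 'name = value' lines into a dict; a missing value becomes ''."""
    dialect = {}
    for line in output.splitlines():
        key, sep, value = line.partition(" =")
        if not sep:
            continue  # not a 'name = value' line
        # Drop only the single space after '=', so a whitespace value survives
        if value.startswith(" "):
            value = value[1:]
        dialect[key.strip()] = value
    return dialect


# The example output from above, as it would be captured from the command
plain = "delimiter = ,\nquotechar =\nescapechar = \\\n"
print(parse_plain_dialect(plain))
# {'delimiter': ',', 'quotechar': '', 'escapechar': '\\'}
```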

Explore

The explore command is great for a command-line based workflow, or when you quickly want to start working with a CSV file in Python. This command detects the dialect of a CSV file and starts an interactive Python shell with the file already loaded! You can either have the file loaded as a list of lists:

```text
$ clevercsv explore milk.csv
Dropping you into an interactive shell.

CleverCSV has loaded the data into the variable: rows
>>> len(rows)
381
```

or you can load the file as a Pandas dataframe:

```text
$ clevercsv explore -p imdb.csv
Dropping you into an interactive shell.

CleverCSV has loaded the data into the variable: df
>>> df.head()
                   fn        tid  ...  War  Western
0  titles01/tt0012349  tt0012349  ...    0        0
1  titles01/tt0015864  tt0015864  ...    0        0
2  titles01/tt0017136  tt0017136  ...    0        0
3  titles01/tt0017925  tt0017925  ...    0        0
4  titles01/tt0021749  tt0021749  ...    0        0

[5 rows x 44 columns]
```

Standardize

Use the standardize command when you want to rewrite a file using the RFC-4180 standard:

```text
$ clevercsv standardize --output imdb_standard.csv imdb.csv
```

In this particular example the use of the escape character is replaced by using quotes.
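The effect of swapping escape characters for quotes can be reproduced on a single row with the standard library alone. This is only an illustration of the transformation, not how the `standardize` command is implemented: the sample row is invented, and the standard library's `excel` dialect (comma delimiter, double-quote quoting, CRLF line endings) is used here as a close stand-in for RFC-4180.

```python
import csv
import io

# Invented row in an escape-based dialect: the comma inside the second
# field is escaped with a backslash and no quote character is used.
messy = "id1,hello\\, world,3\n"

reader = csv.reader(
    io.StringIO(messy),
    delimiter=",",
    quoting=csv.QUOTE_NONE,  # this dialect has no quote character
    escapechar="\\",
)
rows = list(reader)

# Write the parsed rows back out with quote-based escaping
out = io.StringIO()
csv.writer(out, dialect="excel").writerows(rows)
standardized = out.getvalue()

print(rows)          # [['id1', 'hello, world', '3']]
print(standardized)  # id1,"hello, world",3
```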

View

This command allows you to view the file in the terminal. The dialect is of course detected using CleverCSV! Both this command and the standardize command support the --transpose flag, if you want to transpose the file before viewing or saving:

```text
$ clevercsv view --transpose imdb.csv
```
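For readers unfamiliar with the term, transposing swaps rows and columns. A minimal sketch of the operation on a list of rows follows; the function name `transpose` and the sample table are made up, and padding short rows with an empty string is an assumption of this sketch, since how CleverCSV treats ragged rows is not described here.

```python
from itertools import zip_longest


def transpose(rows, fill=""):
    """Swap rows and columns; short rows are padded with fill."""
    return [list(column) for column in zip_longest(*rows, fillvalue=fill)]


table = [["name", "year"], ["Ada", "1815"], ["Alan"]]
print(transpose(table))
# [['name', 'Ada', 'Alan'], ['year', '1815', '']]
```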

Version Control Integration

If you'd like to make sure that you never commit a messy (non-standard) CSV file to your repository, you can install a pre-commit hook. First, install pre-commit using the installation instructions. Next, add the following configuration to the .pre-commit-config.yaml file in your repository:

```yaml
repos:
  - repo: https://github.com/alan-turing-institute/CleverCSV-pre-commit
    rev: v0.6.6   # or any later version
    hooks:
      - id: clevercsv-standardize
```

Finally, run pre-commit install to set up the git hook. Pre-commit will now use CleverCSV to standardize your CSV files following RFC-4180 whenever you commit a CSV file to your repository.

Contributing

If you want to encourage development of CleverCSV, the best thing to do now is to spread the word!

If you encounter an issue in CleverCSV, please open an issue or submit a pull request. Don't hesitate, you're helping to make this project better for everyone! If GitHub's not your thing but you still want to contact us, you can send an email to gertjanvandenburg at gmail dot com instead. You can also ask questions on Gitter.

Note that all contributions to the project must adhere to the Code of Conduct.

The CleverCSV package was originally written by Gertjan van den Burg and came out of scientific research on wrangling messy CSV files by Gertjan van den Burg, Alfredo Nazabal, and Charles Sutton.

Notes

CleverCSV is licensed under the MIT license. Please cite our research if you use CleverCSV in your work.

Owner

Name: The Alan Turing Institute
Login: alan-turing-institute
Kind: organization
Email: info@turing.ac.uk

Website: https://turing.ac.uk
Repositories: 477
Profile: https://github.com/alan-turing-institute

The UK's national institute for data science and artificial intelligence.

GitHub Events

Total

Create event: 12
Release event: 1
Issues event: 7
Watch event: 48
Delete event: 9
Issue comment event: 23
Push event: 16
Pull request event: 22
Fork event: 6

Last Year

Create event: 12
Release event: 1
Issues event: 7
Watch event: 48
Delete event: 9
Issue comment event: 23
Push event: 16
Pull request event: 22
Fork event: 6

Committers

Last synced: 9 months ago

All Time

Total Commits: 767
Total Committers: 8
Avg Commits per committer: 95.875
Development Distribution Score (DDS): 0.039

Past Year

Commits: 15
Committers: 2
Avg Commits per committer: 7.5
Development Distribution Score (DDS): 0.267

Top Committers

Name	Email	Commits
Gertjan van den Burg	g**g@g**m	737
dependabot[bot]	4****]	22
Jakob Gerhard Martinussen	j**m@g**m	3
odidev	o**v@p**m	1
Stefano Rivera	s**o@r**t	1
Martin Weinelt	h**a@d**e	1
JB Desbas	j**s@g**m	1
Dan Homola	d**a@h**z	1

Committer Domains (Top 20 + Academic)

hotmail.cz: 1
darmstadt.ccc.de: 1
rivera.za.net: 1
puresoftware.com: 1

Issues and Pull Requests

Last synced: 6 months ago

All Time

Total issues: 48
Total pull requests: 97
Average time to close issues: 2 months
Average time to close pull requests: 10 days
Total issue authors: 34
Total pull request authors: 12
Average comments per issue: 3.38
Average comments per pull request: 0.57
Merged pull requests: 73
Bot issues: 0
Bot pull requests: 41

Past Year

Issues: 4
Pull requests: 14
Average time to close issues: about 2 months
Average time to close pull requests: 30 days
Issue authors: 4
Pull request authors: 5
Average comments per issue: 2.75
Average comments per pull request: 0.57
Merged pull requests: 5
Bot issues: 0
Bot pull requests: 10

Top Authors

Issue Authors

tooptoop4 (5)
hcheng2002cn (4)
baldurmen (4)
seperman (3)
kloczek (2)
DeflateAwning (2)
timeisflowing (1)
rkiddy (1)
rmitula (1)
jbdesbas (1)
RahulSinghYYC (1)
jlumbroso (1)
sanmai-NL (1)
ben-bitdotio (1)
BergLucas (1)

Pull Request Authors

dependabot[bot] (59)
GjjvdBurg (43)
baldurmen (3)
jacksonllee (2)
0xSheller (2)
JakobGM (2)
no23reason (2)
TomVS (1)
jbdesbas (1)
mweinelt (1)
ben-bitdotio (1)
stefanor (1)
odidev (1)

Top Labels

Issue Labels

enhancement (2)
wontfix (1)

Pull Request Labels

dependencies (59)
github_actions (11)

Packages

Total packages: 3
Total downloads:
- pypi 234,781 last-month
Total docker downloads: 12,309

Total dependent packages: 7
(may contain duplicates)
Total dependent repositories: 28
(may contain duplicates)
Total versions: 235
Total maintainers: 1

pypi.org: clevercsv

A Python package for handling messy CSV files

Homepage: https://github.com/alan-turing-institute/CleverCSV
Documentation: https://clevercsv.readthedocs.io/
License: MIT
Latest release: 0.8.3
published about 1 year ago

Versions: 49
Dependent Packages: 7
Dependent Repositories: 28
Downloads: 234,781 Last month
Docker Downloads: 12,309

Rankings

Dependent packages count: 1.6%

Downloads: 1.7%

Stargazers count: 1.9%

Docker downloads count: 2.2%

Average: 2.6%

Dependent repos count: 2.7%

Forks count: 5.4%

Maintainers (1)

GjjvdBurg

Last synced: 6 months ago

proxy.golang.org: github.com/alan-turing-institute/CleverCSV

Documentation: https://pkg.go.dev/github.com/alan-turing-institute/CleverCSV#section-documentation
License: mit
Latest release: v0.8.3
published about 1 year ago

Versions: 93
Dependent Packages: 0
Dependent Repositories: 0

Rankings

Stargazers count: 2.1%

Forks count: 3.2%

Average: 4.1%

Dependent packages count: 5.4%

Dependent repos count: 5.7%

Last synced: 6 months ago

proxy.golang.org: github.com/alan-turing-institute/clevercsv

Documentation: https://pkg.go.dev/github.com/alan-turing-institute/clevercsv#section-documentation
License: mit
Latest release: v0.8.3
published about 1 year ago

Versions: 93
Dependent Packages: 0
Dependent Repositories: 0

Rankings

Stargazers count: 2.1%

Forks count: 3.2%

Average: 4.1%

Dependent packages count: 5.4%

Dependent repos count: 5.7%

Last synced: 6 months ago

https://github.com/alan-turing-institute/clevercsv

Science Score: 57.0%
