https://github.com/vincentlaucsb/csv-parser

A high-performance, fully-featured CSV parser and serializer for modern C++.

Keywords

c-plus-plus c-plus-plus-11 c-plus-plus-14 c-plus-plus-17 csv csv-parser csv-reader json parser statistics tab-separated

Keywords from Contributors

annotation

Last synced: 5 months ago

Repository

Basic Info

Host: GitHub
Owner: vincentlaucsb
License: mit
Language: C++
Default Branch: master
Homepage:
Size: 10.3 MB

Statistics

Stars: 1,009
Watchers: 25
Forks: 174
Open Issues: 31
Releases: 32

Topics

c-plus-plus c-plus-plus-11 c-plus-plus-14 c-plus-plus-17 csv csv-parser csv-reader json parser statistics tab-separated

Created over 8 years ago · Last pushed 6 months ago

Metadata Files

Readme License

Vince's CSV Parser

Motivation
Documentation
Integration
Features & Examples
Contributing

Motivation

There are plenty of other CSV parsers in the wild, but I had a hard time finding what I wanted. Inspired by Python's csv module, I wanted a library with simple, intuitive syntax. Furthermore, I wanted support for special use cases such as calculating statistics on very large files. Thus, this library was created with the following goals in mind.

Performance and Memory Requirements

A high performance CSV parser allows you to take advantage of the deluge of large datasets available. By using overlapped threads, memory mapped IO, and minimal memory allocation, this parser can quickly tackle large CSV files--even if they are larger than RAM.

In fact, according to Visual Studio's profiler, this CSV parser spends almost 90% of its CPU cycles actually reading your data as opposed to getting hung up in hard disk I/O or pushing around memory.

Show me the numbers

On my computer (12th Gen Intel(R) Core(TM) i5-12400 @ 2.50 GHz/Western Digital Blue 5400RPM HDD), this parser can read

* the 69.9 MB 2015_StateDepartment.csv in 0.19 seconds (360 MBps)
* a 1.4 GB Craigslist Used Vehicles Dataset in 1.18 seconds (1.2 GBps)
* a 2.9 GB Car Accidents Dataset in 8.49 seconds (352 MBps)

Robust Yet Flexible

RFC 4180 and Beyond

This CSV parser is much more than a fancy string splitter, and parses all files following RFC 4180.

However, in reality we know that RFC 4180 is just a suggestion, and there are many "flavors" of CSV such as tab-delimited files. Thus, this library has:

* Automatic delimiter guessing
* Ability to ignore comments in leading rows and elsewhere
* Ability to handle rows of different lengths
* Ability to handle arbitrary line endings (as long as they are some combination of carriage return and newline)

By default, rows of variable length are silently ignored, although you may elect to keep them or throw an error.
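To illustrate why RFC 4180 quoting makes parsing more than string splitting, here is a minimal sketch that uses only the standard library. `split_record` is a hypothetical helper written for this example, not a function from this library, and it deliberately ignores multi-line fields, comments, and delimiter guessing.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Split one CSV record into fields, honoring RFC 4180 quoting:
// a field may be wrapped in double quotes, and a doubled quote ("")
// inside a quoted field stands for one literal quote character.
// Hypothetical helper for illustration; not part of the library's API.
std::vector<std::string> split_record(const std::string& line, char delim = ',') {
    std::vector<std::string> fields;
    std::string current;
    bool in_quotes = false;

    for (std::size_t i = 0; i < line.size(); ++i) {
        const char c = line[i];
        if (in_quotes) {
            if (c == '"') {
                if (i + 1 < line.size() && line[i + 1] == '"') {
                    current += '"';     // escaped quote
                    ++i;
                } else {
                    in_quotes = false;  // closing quote
                }
            } else {
                current += c;           // delimiters inside quotes are literal
            }
        } else if (c == '"') {
            in_quotes = true;           // opening quote
        } else if (c == delim) {
            fields.push_back(current);
            current.clear();
        } else {
            current += c;
        }
    }
    fields.push_back(current);          // last field (possibly empty)
    return fields;
}
```

A naive split on commas would break the quoted field `"b,c"` into two pieces; the state flag `in_quotes` is what prevents that.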

Encoding

This CSV parser is encoding-agnostic and will handle ANSI and UTF-8 encoded files. It does not try to decode UTF-8, except for detecting and stripping UTF-8 byte order marks.
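As a rough sketch of what stripping a UTF-8 byte order mark involves, the snippet below checks for the three bytes EF BB BF at the start of a buffer. `strip_utf8_bom` is a hypothetical name chosen for this example; the library's own implementation is not shown in this README and may differ.

```cpp
#include <string>

// Remove a leading UTF-8 byte order mark (bytes EF BB BF), if present.
// Returns true when a BOM was found and stripped.
// Hypothetical helper for illustration; not part of the library's API.
bool strip_utf8_bom(std::string& data) {
    static const char bom[] = "\xEF\xBB\xBF";
    if (data.compare(0, 3, bom) == 0) {
        data.erase(0, 3);
        return true;
    }
    return false;
}
```

The rest of the buffer is left untouched, which matches the encoding-agnostic approach described above: bytes are passed through without being decoded.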

Well Tested

This CSV parser has an extensive test suite and is checked for memory safety with Valgrind. If you still manage to find a bug, do not hesitate to report it.

Documentation

In addition to the Features & Examples below, the full online documentation contains more examples, details, interesting features, and instructions for less common use cases.

Integration

This library was developed with Microsoft Visual Studio and is compatible with >g++ 7.5 and clang. All of the code required to build this library, aside from the C++ standard library, is contained under include/.

C++ Version

While C++17 is recommended, C++11 is the minimum version required. This library makes extensive use of string views, and uses Martin Moene's string view library if std::string_view is not available.

Single Header

This library is available as a single .hpp file under single_include/csv.hpp.

CMake Instructions

If you're including this in another CMake project, you can simply clone this repo into your project directory, and add the following to your CMakeLists.txt:

```
# Optional: Defaults to C++ 17
# set(CSV_CXX_STANDARD 11)
add_subdirectory(csv-parser)

# ...

add_executable(<your program> ...)
target_link_libraries(<your program> csv)
```

Avoid cloning with FetchContent

Don't want to clone? No problem. There's also a simple example documenting how to use CMake's FetchContent module to integrate this library.

Features & Examples

Reading an Arbitrarily Large File (with Iterators)

With this library, you can easily stream over a large file without reading its entirety into memory.

C++ Style

```cpp
#include "csv.hpp"

using namespace csv;

...

CSVReader reader("verybigfile.csv");

for (CSVRow& row: reader) { // Input iterator
    for (CSVField& field: row) {
        // By default, get<>() produces a std::string.
        // A more efficient get<csv::string_view>() is also available, where the resulting
        // string_view is valid as long as the parent CSVRow is alive
        std::cout << field.get<>() << ...
    }
}

...
```

Old-Fashioned C Style Loop

```cpp
...

CSVReader reader("verybigfile.csv");
CSVRow row;

while (reader.read_row(row)) {
    // Do stuff with row here
}

...
```

Memory-Mapped Files vs. Streams

By default, passing in a file path string to the constructor of CSVReader causes memory-mapped IO to be used. In general, this option is the most performant.

However, std::ifstream may also be used as well as in-memory sources via std::stringstream.

Note: Currently CSV guessing only works for memory-mapped files. The CSV dialect must be manually defined for other sources.

```cpp
CSVFormat format;
// custom formatting options go here

CSVReader mmap("some_file.csv", format);

std::ifstream infile("some_file.csv", std::ios::binary);
CSVReader ifstream_reader(infile, format);

std::stringstream my_csv;
CSVReader sstream_reader(my_csv, format);
```

Indexing by Column Names

Retrieving values using a column name string is a cheap, constant time operation.

```cpp
#include "csv.hpp"

using namespace csv;

...

CSVReader reader("verybigfile.csv");
double sum = 0;

for (auto& row: reader) {
    // Note: Can also use index of column with [] operator
    sum += row["Total Salary"].get<double>();
}

...
```
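The README does not say how name lookups are implemented. One common way to get (average-case) constant-time lookups is to build a hash map from column name to position once, when the header is read. The sketch below shows that idea with the standard library only; `ColumnIndex` is a hypothetical class for this example, not one of the library's types.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <unordered_map>
#include <vector>

// Maps column names to their zero-based positions.
// Hypothetical class for illustration; not part of the library's API.
class ColumnIndex {
public:
    explicit ColumnIndex(const std::vector<std::string>& names) {
        for (std::size_t i = 0; i < names.size(); ++i) {
            // emplace() keeps an existing entry, so the first
            // occurrence wins when a header name is duplicated.
            positions_.emplace(names[i], i);
        }
    }

    // Average O(1) lookup; throws if the column does not exist.
    std::size_t at(const std::string& name) const {
        const auto it = positions_.find(name);
        if (it == positions_.end()) {
            throw std::out_of_range("no such column: " + name);
        }
        return it->second;
    }

private:
    std::unordered_map<std::string, std::size_t> positions_;
};
```

With such an index, fetching a field by name costs one hash lookup plus one array access, regardless of how many columns the file has.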

Numeric Conversions

If your CSV has lots of numeric values, you can also have this parser (lazily) convert them to the proper data type.

- Type checking is performed on conversions to prevent undefined behavior and integer overflow.
  - Negative numbers cannot be blindly converted to unsigned integer types.
- get<float>(), get<double>(), and get<long double>() are capable of parsing numbers written in scientific notation.
- Note: Conversions to floating point types are not currently checked for loss of precision.

```cpp
#include "csv.hpp"

using namespace csv;

...

CSVReader reader("verybigfile.csv");

for (auto& row: reader) {
    if (row["timestamp"].is_int()) {
        // Can use get<>() with any integer type, but negative
        // numbers cannot be converted to unsigned types
        row["timestamp"].get<int>();

        // You can also attempt to parse hex values
        int value;
        if (row["hexValue"].try_parse_hex(value)) {
            std::cout << "Hex value is " << value << std::endl;
        }

        // Non-imperial decimal numbers can be handled this way
        long double decimalValue;
        if (row["decimalNumber"].try_parse_decimal(decimalValue, ',')) {
            std::cout << "Decimal value is " << decimalValue << std::endl;
        }

        // ..
    }
}
```
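To make the type-checking notes above concrete, the following standard-library sketch shows one way to refuse negative values and overflow when converting text to an unsigned integer. `try_parse_unsigned` is a hypothetical helper for this example only; it is not how the library's get<>() is implemented.

```cpp
#include <limits>
#include <string>

// Parse a base-10 integer into an unsigned type, refusing anything that
// would not fit: empty input, a sign, non-digit characters, or overflow.
// On failure `out` is left unchanged.
// Hypothetical helper for illustration; not part of the library's API.
template <typename UInt>
bool try_parse_unsigned(const std::string& text, UInt& out) {
    static_assert(std::numeric_limits<UInt>::is_integer &&
                  !std::numeric_limits<UInt>::is_signed,
                  "UInt must be an unsigned integer type");
    if (text.empty()) return false;

    const UInt max = std::numeric_limits<UInt>::max();
    UInt value = 0;
    for (char c : text) {
        if (c < '0' || c > '9') return false;  // rejects '-', '+', spaces
        const UInt digit = static_cast<UInt>(c - '0');
        // value * 10 + digit <= max  <=>  value <= (max - digit) / 10
        if (value > (max - digit) / 10) return false;
        value = static_cast<UInt>(value * 10 + digit);
    }
    out = value;
    return true;
}
```

The overflow test runs before the multiplication, so the arithmetic never wraps around silently.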

Converting to JSON

You can serialize individual rows as JSON objects, where the keys are column names, or as JSON arrays (which don't contain column names). The outputted JSON contains properly escaped strings with minimal whitespace and no quoting for numeric values. How these JSON fragments are assembled into a larger JSON document is an exercise left for the user.

```cpp
#include <sstream>
#include "csv.hpp"

using namespace csv;

...

CSVReader reader("verybigfile.csv");
std::stringstream my_json;

for (auto& row: reader) {
    my_json << row.to_json() << std::endl;
    my_json << row.to_json_array() << std::endl;

    // You can pass in a vector of column names to
    // slice or rearrange the outputted JSON
    my_json << row.to_json({ "A", "B", "C" }) << std::endl;
    my_json << row.to_json_array({ "C", "B", "A" }) << std::endl;
}
```
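Since assembling the per-row fragments into a larger document is left to the user, here is one possible approach: collect the fragments and wrap them in a JSON array. `join_json_array` is a hypothetical helper for this example, and it assumes each fragment is already valid JSON, as the per-row output is described above.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Join already-serialized JSON values (one per row) into one JSON array.
// Each fragment is assumed to be valid JSON on its own.
// Hypothetical helper for illustration; not part of the library's API.
std::string join_json_array(const std::vector<std::string>& fragments) {
    std::string out = "[";
    for (std::size_t i = 0; i < fragments.size(); ++i) {
        if (i > 0) out += ',';  // separator between elements, none trailing
        out += fragments[i];
    }
    out += ']';
    return out;
}
```

For files too large to hold in memory, the same pattern can be streamed instead: write "[", then each fragment preceded by a comma (except the first), then "]".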

Specifying the CSV Format

Although the CSV parser has a decent guessing mechanism, in some cases it is preferable to specify the exact parameters of a file.

```cpp
#include "csv.hpp"
#include ...

using namespace csv;

CSVFormat format;
format.delimiter('\t')
      .quote('~')
      .header_row(2);   // Header is on 3rd row (zero-indexed)
      // .no_header();  // Parse CSVs without a header row
      // .quote(false); // Turn off quoting

// Alternatively, we can use format.delimiter({ '\t', ',', ... })
// to tell the CSV guesser which delimiters to try out

CSVReader reader("weird_csv_dialect.csv", format);

for (auto& row: reader) {
    // Do stuff with rows here
}
```

Trimming Whitespace

This parser can efficiently trim off leading and trailing whitespace. Of course, make sure you don't include your intended delimiter or newlines in the list of characters to trim.

    CSVFormat format;
    format.trim({ ' ', '\t' });
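For readers unfamiliar with what trimming does to a field, this standard-library sketch strips a chosen set of characters from both ends of a string. `trim_field` is a hypothetical helper for this example and does not reflect how the library trims internally.

```cpp
#include <cstddef>
#include <string>

// Return `field` without leading or trailing characters found in `trim_chars`.
// Characters in the middle of the field are never removed.
// Hypothetical helper for illustration; not part of the library's API.
std::string trim_field(const std::string& field, const std::string& trim_chars) {
    const std::size_t first = field.find_first_not_of(trim_chars);
    if (first == std::string::npos) {
        return "";  // field is empty or consists only of trimmable characters
    }
    const std::size_t last = field.find_last_not_of(trim_chars);
    return field.substr(first, last - first + 1);
}
```

This also shows why the delimiter must stay out of the trim list: if it were included, a trimmer running before field splitting could remove the separators between empty fields.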

Handling Variable Numbers of Columns

Sometimes, the rows in a CSV are not all of the same length. Whether this was intentional or not, this library is built to handle all use cases.

```cpp
CSVFormat format;

// Default: Silently ignoring rows with missing or extraneous columns
format.variable_columns(false); // Short-hand
format.variable_columns(VariableColumnPolicy::IGNORE_ROW);

// Case 2: Keeping variable-length rows
format.variable_columns(true); // Short-hand
format.variable_columns(VariableColumnPolicy::KEEP);

// Case 3: Throwing an error if variable-length rows are encountered
format.variable_columns(VariableColumnPolicy::THROW);
```
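The three policies can be pictured as a filter over parsed rows. The sketch below models that behavior with standard-library types only; `RowPolicy`, `Row`, and `apply_policy` are hypothetical stand-ins invented for this example and are not the library's VariableColumnPolicy or CSVRow.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-ins for illustration; these are not the library's types.
enum class RowPolicy { IgnoreRow, Keep, Throw };
using Row = std::vector<std::string>;

// Return the rows that survive `policy`, given the expected column count.
std::vector<Row> apply_policy(const std::vector<Row>& rows,
                              std::size_t expected,
                              RowPolicy policy) {
    std::vector<Row> kept;
    for (const Row& row : rows) {
        if (row.size() == expected || policy == RowPolicy::Keep) {
            kept.push_back(row);
        } else if (policy == RowPolicy::Throw) {
            throw std::runtime_error(
                "row has " + std::to_string(row.size()) +
                " fields, expected " + std::to_string(expected));
        }
        // RowPolicy::IgnoreRow: the mismatched row is dropped silently
    }
    return kept;
}
```

Silently dropping rows is convenient but can hide data problems, so the throwing policy may be the safer choice when row counts matter.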

Setting Column Names

If a CSV file does not have column names, you can specify your own:

    std::vector<std::string> col_names = { ... };
    CSVFormat format;
    format.column_names(col_names);

Parsing an In-Memory String

```cpp
#include "csv.hpp"

using namespace csv;

...

// Method 1: Using parse()
std::string csv_string = "Actor,Character\r\n"
    "Will Ferrell,Ricky Bobby\r\n"
    "John C. Reilly,Cal Naughton Jr.\r\n"
    "Sacha Baron Cohen,Jean Giard\r\n";

auto rows = parse(csv_string);
for (auto& r: rows) {
    // Do stuff with row here
}

// Method 2: Using _csv operator
auto rows = "Actor,Character\r\n"
    "Will Ferrell,Ricky Bobby\r\n"
    "John C. Reilly,Cal Naughton Jr.\r\n"
    "Sacha Baron Cohen,Jean Giard\r\n"_csv;

for (auto& r: rows) {
    // Do stuff with row here
}
```

Writing CSV Files

```cpp
#include "csv.hpp"
#include ...

using namespace csv;
using namespace std;

...

stringstream ss; // Can also use ofstream, etc.

auto writer = make_csv_writer(ss);
// auto writer = make_tsv_writer(ss);               // For tab-separated files
// DelimWriter<stringstream, '|', '"'> writer(ss);  // Your own custom format
// set_decimal_places(2);                           // How many places after the decimal will be written for floats

writer << vector<string>({ "A", "B", "C" })
    << deque<string>({ "I'm", "too", "tired" })
    << list<string>({ "to", "write", "documentation." });

writer << array<string, 3>({ "The quick brown", "fox", "jumps over the lazy dog" });
writer << make_tuple(1, 2.0, "Three");
...
```

You can pass in arbitrary types into DelimWriter by defining a conversion function for that type to std::string.
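The README does not show such a conversion function. Assuming it refers to a user-defined `operator std::string()`, the sketch below shows what one could look like. `Money` and `join_as_csv` are hypothetical names for this example; `join_as_csv` is a simplified stand-in for a writer and performs no quoting or escaping.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// A user-defined type with a conversion function to std::string.
// Hypothetical type for illustration; amounts are assumed non-negative.
struct Money {
    long cents;

    operator std::string() const {
        std::ostringstream out;
        out << cents / 100 << '.'
            << (cents % 100 < 10 ? "0" : "") << cents % 100;
        return out.str();
    }
};

// Simplified stand-in for a delimited writer: converts each value through
// its std::string conversion and joins the results with commas.
// Not the library's DelimWriter; no quoting or escaping is done here.
template <typename T>
std::string join_as_csv(const std::vector<T>& values) {
    std::string line;
    for (std::size_t i = 0; i < values.size(); ++i) {
        if (i > 0) line += ',';
        line += static_cast<std::string>(values[i]);
    }
    return line;
}
```

Keeping the formatting logic inside the type means the writer needs no knowledge of how a value should be rendered.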

Owner

Name: Vincent La
Login: vincentlaucsb
Kind: user
Location: Mesa, Arizona

Website: http://vincela.com/
Repositories: 7
Profile: https://github.com/vincentlaucsb

Proud Gaucho (UCSB '18). Occasionally codes.

GitHub Events

Total

Issues event: 19
Watch event: 99
Issue comment event: 24
Push event: 9
Pull request event: 20
Fork event: 23
Create event: 1

Last Year

Issues event: 19
Watch event: 99
Issue comment event: 24
Push event: 9
Pull request event: 20
Fork event: 23
Create event: 1

Committers

Last synced: 9 months ago

All Time

Total Commits: 306
Total Committers: 34
Avg Commits per committer: 9.0
Development Distribution Score (DDS): 0.141

Past Year

Commits: 8
Committers: 5
Avg Commits per committer: 1.6
Development Distribution Score (DDS): 0.5

Top Committers

Name	Email	Commits
vincentlaucsb	v**9@g**m	263
Tamas Kenez	t****z	5
Ryan A. Pavlik	r**k@g**m	3
Bryce Schober	b**r@d**m	2
Pavel Artemkin	a****l	2
Slobodan Kletnikov	s**v@h**m	2
xgdgsc	x****c	2
Colin	g**l@g**m	1
Claude Gex	m**l@c**h	1
Baptiste Lemarcis	B**s@g**m	1
Asvin Goel	g**b@t**u	1
Alexander Bigerl	a**r@b**u	1
wilfzim	4****z	1
Josh Bradley	j**1@u**u	1
vincentlaucsb	v**a@u**u	1
Yosef Lin	y**l@t**m	1
Josh Perry	j**y@a**o	1
genshen	g**u@g**m	1
Toby Ealden	t**n@g**m	1
Ryan Marcus	r**n@r**s	1
Ricardo Padrela	r**a@g**m	1
Onur Temizkan	o**n@g**m	1
OliverWangData	4****a	1
Niels Lohmann	n**n@g**m	1
NickAnderson019	n**k@t**a	1
Miguel Cunha	m**a@g**m	1
Matt Wilson	s****t	1
LukasKerk	7****k	1
Ludovic Delfau	5****u	1
Kim Walisch	k**h@g**m	1
and 4 more...

Committer Domains (Top 20 + Academic)

thalana.co.za: 1 ryanmarc.us: 1 aerobotics.co: 1 tom.com: 1 umail.ucsb.edu: 1 umd.edu: 1 bigerl.eu: 1 telematique.eu: 1 claudegex.ch: 1 dynonavionics.com: 1

Issues and Pull Requests

Last synced: 6 months ago

All Time

Total issues: 129
Total pull requests: 90
Average time to close issues: 9 months
Average time to close pull requests: 3 months
Total issue authors: 92
Total pull request authors: 33
Average comments per issue: 1.95
Average comments per pull request: 0.86
Merged pull requests: 65
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 14
Pull requests: 23
Average time to close issues: 22 days
Average time to close pull requests: 4 months
Issue authors: 12
Pull request authors: 13
Average comments per issue: 0.21
Average comments per pull request: 0.26
Merged pull requests: 4
Bot issues: 0
Bot pull requests: 0

Top Authors

Issue Authors

vincentlaucsb (12)
bangusi (5)
oschonrock (4)
codewithrajranjan (3)
elfring (3)
CleanHit (3)
viniciusjl (2)
hsdk123 (2)
chi-Dev04 (2)
edgimar (2)
sjoubert (2)
definable (2)
CrustyAuklet (2)
dangdkhanh (2)
freshduer (2)

Pull Request Authors

vincentlaucsb (44)
hirohira9119 (5)
tamaskenez (5)
DoozyX (4)
rpavlik (3)
tsengjun (3)
txemaotero (2)
rajgoel (2)
phaedon (2)
BaptisteLemarcis (2)
ludovicdelfau (2)
longqimin (2)
jschueller (2)
alongL (2)
Montage-eloise (2)

Top Labels

Issue Labels

bug (21) enhancement (18) help wanted (10) good first issue (6) wontfix (2)

Pull Request Labels

Packages

Total packages: 1
Total downloads: unknown

Total dependent packages: 0
Total dependent repositories: 0
Total versions: 2

proxy.golang.org: github.com/vincentlaucsb/csv-parser

Documentation: https://pkg.go.dev/github.com/vincentlaucsb/csv-parser#section-documentation
License: mit
Latest release: v1.1.0
published over 7 years ago

Versions: 2
Dependent Packages: 0
Dependent Repositories: 0

Rankings

Dependent packages count: 9.0%

Average: 9.6%

Dependent repos count: 10.2%

Last synced: 5 months ago

Science Score: 36.0%
