https://github.com/hosseinmoein/dataframe

C++ DataFrame for statistical, financial, and ML analysis in modern C++

Keywords

ai cpp data-analysis data-science dataframe financial-data-analysis financial-engineering heterogeneous-data large-data machine-learning multidimensional-data numerical-analysis pandas polars statistical statistical-analysis tensor tensorboard trading-algorithms trading-strategies

Keywords from Contributors

conan cplusplus multi-platform

Last synced: 10 months ago

Repository

Basic Info

Host: GitHub
Owner: hosseinmoein
License: bsd-3-clause
Language: C++
Default Branch: master
Homepage: https://hosseinmoein.github.io/DataFrame/
Size: 47.8 MB

Statistics

Stars: 2,793
Watchers: 74
Forks: 341
Open Issues: 0
Releases: 33

Created over 8 years ago · Last pushed 10 months ago

Metadata Files

Readme Contributing Funding License

README.md

DataFrame documentation with code samples

This is a C++ library for data analysis, similar to libraries in Python and R. For example, you would compare it to Pandas or R data.frame. The depth and breadth of functionality offered by C++ DataFrame alone are greater than what packages such as Pandas, data.frame, and Polars offer combined.
You can slice the data in many different ways. You can join, merge, and group the data. You can run various statistical, summarization, financial, and ML algorithms on the data. You can easily add your own custom algorithms. You can sort on multiple columns, and pick or delete data by custom criteria. And more …
DataFrame also includes a large collection of analytical algorithms in the form of visitors. These range from basic stats such as Mean, STDEV, Moving Averages, ... to more involved analyses such as PCA, Polynomial Fit, FFT, Eigens ..., including a good collection of trading indicators. You can also easily add your own algorithms.
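To make the visitor idea concrete, here is a minimal standard-library sketch, not the library's actual API. The names `MeanVisitor`, `MaxVisitor`, and `visit_column` are hypothetical and chosen only for illustration. It shows the general pattern: an algorithm is an object that is fed one value at a time and exposes a result, so adding your own algorithm means writing one more such object.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical visitor: accumulates a running mean over one column.
// It illustrates the visitor idea only; it is not a C++ DataFrame class.
class MeanVisitor {
public:
    void operator()(double value) {
        sum_ += value;
        ++count_;
    }
    double result() const {
        assert(count_ > 0);  // the mean of zero values is undefined
        return sum_ / static_cast<double>(count_);
    }

private:
    double sum_ = 0.0;
    std::size_t count_ = 0;
};

// A second, user-defined visitor with the same shape: tracks the maximum.
class MaxVisitor {
public:
    void operator()(double value) {
        if (value > max_) max_ = value;
    }
    double result() const { return max_; }

private:
    double max_ = -std::numeric_limits<double>::infinity();
};

// Hypothetical driver: feeds every value of a column to the visitor
// and returns the same visitor so the result can be read off it.
template <typename Visitor>
Visitor &visit_column(const std::vector<double> &column, Visitor &visitor) {
    for (double value : column) visitor(value);
    return visitor;
}
```

With a column holding `1, 2, 3, 4`, passing a `MeanVisitor` to `visit_column` and calling `result()` gives the mean, and passing a `MaxVisitor` gives the maximum, without changing the driver.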
DataFrame also employs extensive multithreading in almost all of its APIs, which makes it especially suitable for analyzing large datasets.
To get started with basic operations, see Hello World and/or the Cheat Sheet. For a complete list of features with code samples, see the documentation.

I have followed a few principles in this library:

Performance

You have probably heard of Polars DataFrame. It is implemented in Rust and ported to Python with zero overhead (as long as you don’t have a loop). I have been asked by many people to write a comparison of DataFrame vs. Polars. So, I finally found some time to learn a bit about Polars and write a very simple benchmark.
I wrote identical programs for Polars, C++ DataFrame, and Pandas. I used Polars version 0.19.14, Pandas version 1.5.3, and Numpy version 1.24.2. For C++, I used the GCC-14 compiler in C++23 mode with the -O3 option. I ran everything on my somewhat outdated MacBook Pro (Intel chip, 96GB RAM).
In each case, I created a dataframe with 3 random columns. C++ DataFrame also required an additional index column of the same size. Polars doesn’t believe in index columns (that has its own pros and cons, which I am not going through here). Each program has three identical parts. First, it generates and populates 3 columns with 300m random numbers each (in the case of C++ DataFrame, it must also generate a sequential index column of the same size). That is the part I am not interested in. In the second part, it calculates the mean of the first column, the variance of the second column, and the Pearson correlation of the second and third columns. In the third part, it does a select (or filter, as Polars calls it) on one of the columns.
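For readers who want to see what the second and third parts compute, the following is a plain standard-library sketch of the same quantities. It is not the benchmark program and does not use the C++ DataFrame API. The function names are hypothetical, the variance is assumed to be the sample variance (dividing by n - 1), and the selection is assumed to be a simple "greater than a threshold" filter, since the text does not state either detail.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <numeric>
#include <vector>

// Arithmetic mean of a column.
double mean(const std::vector<double> &x) {
    assert(!x.empty());
    return std::accumulate(x.begin(), x.end(), 0.0) /
           static_cast<double>(x.size());
}

// Sample variance (assumption: divides by n - 1).
double variance(const std::vector<double> &x) {
    assert(x.size() > 1);
    const double m = mean(x);
    double sum_sq = 0.0;
    for (double v : x) sum_sq += (v - m) * (v - m);
    return sum_sq / static_cast<double>(x.size() - 1);
}

// Pearson correlation coefficient of two equally sized columns.
double pearson(const std::vector<double> &x, const std::vector<double> &y) {
    assert(x.size() == y.size() && x.size() > 1);
    const double mx = mean(x);
    const double my = mean(y);
    double sxy = 0.0, sxx = 0.0, syy = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        const double dx = x[i] - mx;
        const double dy = y[i] - my;
        sxy += dx * dy;
        sxx += dx * dx;
        syy += dy * dy;
    }
    return sxy / std::sqrt(sxx * syy);
}

// Selection (assumed predicate): row indices whose value exceeds a threshold.
std::vector<std::size_t> select_greater(const std::vector<double> &x,
                                        double threshold) {
    std::vector<std::size_t> rows;
    for (std::size_t i = 0; i < x.size(); ++i)
        if (x[i] > threshold) rows.push_back(i);
    return rows;
}
```

This single-threaded version only pins down the arithmetic; it says nothing about how any of the three libraries implement or parallelize these steps.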

Results:
The maximum dataset I could load into Polars was 300m rows per column. Any bigger dataset blew up the memory and caused the OS to kill the process. I ran C++ DataFrame with 10b rows per column, and I am sure it would have run with bigger datasets too. So, I was forced to run both with 300m rows to compare. I ran each test 4 times and took the best time. Polars numbers varied a lot from one run to another, especially calculation and selection times. C++ DataFrame numbers were significantly more consistent.
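The "run several times and keep the best time" measurement can be sketched as follows. This is an illustrative helper, not the timing code used for the numbers in this README. The name `best_of` is hypothetical, and the use of `std::chrono::steady_clock` is an assumption.

```cpp
#include <cassert>
#include <chrono>
#include <limits>

// Runs `work` `runs` times and returns the fastest wall-clock time in seconds.
template <typename Work>
double best_of(int runs, Work &&work) {
    assert(runs > 0);
    double best = std::numeric_limits<double>::infinity();
    for (int i = 0; i < runs; ++i) {
        const auto start = std::chrono::steady_clock::now();
        work();
        const auto stop = std::chrono::steady_clock::now();
        const double secs =
            std::chrono::duration<double>(stop - start).count();
        if (secs < best) best = secs;
    }
    return best;
}
```

A call such as `best_of(4, [&] { /* calculation part */ })` would correspond to the four runs described above. Taking the minimum filters out one-off slowdowns but hides run-to-run variance, which is why that variance is reported separately in the text.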

| | C++ DataFrame | Polars | Pandas |
| :--------------------- | ------------: | -----------: | -----------: |
| Data Generation Time | 26.9459 secs | 28.4686 secs | 36.6799 secs |
| Calculation Time | 1.2602 secs | 4.8766 secs | 40.3264 secs |
| Selection Time | 0.4215 secs | 3.8766 secs | 8.3264 secs |
| Overall Time | 28.9486 secs | 36.8763 secs | 85.8451 secs |

Please consider sponsoring DataFrame, especially if you are using it in a production capacity. It is the strongest form of appreciation.

Owner

Name: Hossein Moein
Login: hosseinmoein
Kind: user
Location: New York

Repositories: 19
Profile: https://github.com/hosseinmoein

Software Engineer

GitHub Events

Total

Create event: 11
Release event: 2
Issues event: 52
Watch event: 318
Issue comment event: 94
Push event: 190
Pull request review comment event: 9
Pull request review event: 9
Pull request event: 27
Fork event: 30

Last Year

Create event: 11
Release event: 2
Issues event: 52
Watch event: 318
Issue comment event: 94
Push event: 190
Pull request review comment event: 9
Pull request review event: 9
Pull request event: 27
Fork event: 30

Committers

Last synced: over 2 years ago

All Time

Total Commits: 1,532
Total Committers: 23
Avg Commits per committer: 66.609
Development Distribution Score (DDS): 0.51

Past Year

Commits: 189
Committers: 4
Avg Commits per committer: 47.25
Development Distribution Score (DDS): 0.386

Top Committers

Name	Email	Commits
hossein.moein@kensho.com	h**n@k**m	751
Hossein Moein	3****n	324
Hossein Moein	h**n@g**m	181
Hossein Moein	m**e@a**m	126
Justin K -Linux	j**7@g**m	69
SpaceIm	3****m	36
bplaa-yai	j**n@3**g	13
JimmyG	j**g@g**m	8
Jernej Makovsek	j**k@g**m	4
Marcello Mansueto	m**m@m**m	4
stepan.potys	s**s@o**m	3
Julien Marrec	j**c@g**m	2
Alexandre	E**o@p**m	1
Stepan	s**s@g**m	1
Justin K -Work	J**a@i**m	1
NikBomb	n**e@g**m	1
wujinghe	w**e@q**n	1
Konstantin Sorokin	k**s@s**u	1
Moshe Schorr	m**l@g**m	1
Guillaume Jacquenot	G****t	1
Enrico	e**a@g**m	1
theirix	t**x@g**m	1
Edouard Berthe	e**e@p**m	1

Committer Domains (Top 20 + Academic)

sigterm.ru: 1
qtrade.com.cn: 1
ibm.com: 1
onelogin.com: 1
mydatamodels.com: 1
3dw.org: 1
aol.com: 1
gradientboostedinvestments.com: 1
kensho.com: 1

Issues and Pull Requests

Last synced: 10 months ago

All Time

Total issues: 146
Total pull requests: 93
Average time to close issues: 14 days
Average time to close pull requests: 5 days
Total issue authors: 91
Total pull request authors: 14
Average comments per issue: 4.82
Average comments per pull request: 0.51
Merged pull requests: 69
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 28
Pull requests: 29
Average time to close issues: 7 days
Average time to close pull requests: 11 days
Issue authors: 18
Pull request authors: 4
Average comments per issue: 3.43
Average comments per pull request: 0.45
Merged pull requests: 13
Bot issues: 0
Bot pull requests: 0

Top Authors

Issue Authors

xkungfu (10)
sierret (7)
yuxiaojian01 (6)
kburchfiel (5)
mandeepsingh-private (3)
yegorrr (3)
adrian17 (3)
shaojunjie0912 (3)
federicomartinlara1976 (3)
TheBlackPlague (3)
Aratiganesh123 (3)
YingHREN (2)
justinjk007 (2)
SpaceIm (2)
thekvs (2)

Pull Request Authors

hosseinmoein (56)
SpaceIm (11)
mo42 (6)
GerHobbelt (4)
wujinghe (3)
andersc (3)
OMGtechy (2)
jchen8tw (2)
Githubprivaxy (1)
jmarrec (1)
eltociear (1)
edouardberthe (1)
hehuaijin (1)
Gjacquenot (1)

Top Labels

Issue Labels

question (60)
Compiling (43)
help wanted (36)
enhancement (32)
bug (12)
Contribution (2)
Comment (1)

Pull Request Labels

invalid (1)

Packages

Total packages: 2
Total downloads: unknown

Total dependent packages: 0
(may contain duplicates)
Total dependent repositories: 0
(may contain duplicates)
Total versions: 2

proxy.golang.org: github.com/hosseinmoein/DataFrame

Documentation: https://pkg.go.dev/github.com/hosseinmoein/DataFrame#section-documentation
License: bsd-3-clause
Latest release: v2.3.0+incompatible
published over 2 years ago

Versions: 1
Dependent Packages: 0
Dependent Repositories: 0

Rankings

Dependent packages count: 5.4%

Average: 5.6%

Dependent repos count: 5.8%

Last synced: 11 months ago

proxy.golang.org: github.com/hosseinmoein/dataframe

Documentation: https://pkg.go.dev/github.com/hosseinmoein/dataframe#section-documentation
License: bsd-3-clause
Latest release: v2.3.0+incompatible
published over 2 years ago

Versions: 1
Dependent Packages: 0
Dependent Repositories: 0

Rankings

Dependent packages count: 5.4%

Average: 5.6%

Dependent repos count: 5.8%

Last synced: 11 months ago

Science Score: 26.0%
