dama

Look at data in different ways

https://github.com/philippeller/dama

Science Score: 23.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
  • DOI references
  • Academic publication links
  • Committers with academic emails
    1 of 2 committers (50.0%) from academic institutions
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (9.3%) to scientific vocabulary

Keywords

array axes axis data-science grid griddata histogram jupyter map numpy translation
Last synced: 6 months ago

Repository

Look at data in different ways

Basic Info
  • Host: GitHub
  • Owner: philippeller
  • License: apache-2.0
  • Language: Python
  • Default Branch: master
  • Homepage:
  • Size: 33.4 MB
Statistics
  • Stars: 9
  • Watchers: 0
  • Forks: 1
  • Open Issues: 2
  • Releases: 0
Topics
array axes axis data-science grid griddata histogram jupyter map numpy translation
Created almost 8 years ago · Last pushed over 1 year ago
Metadata Files
Readme License

README.md

dama - Data Manipulator

The dama Python library guides you through your data and translates between different representations. Its aim is to offer a consistent and pythonic way to handle different datasets and the translations between them. A dataset can, for instance, be simple column/row data, or it can be data on a grid.

One of the key features of dama is the seamless translation from one data representation into any other. Convenience pyplot plotting functions are also available to produce standard plots without any hassle.

Installation

  • pip install dama

Getting Started

```python
import numpy as np
import dama as dm
```

Grid Data

GridData is a collection of individual GridArrays. Both have a defined grid; here we initialize the grid in the constructor through simple keyword arguments, resulting in a 2d grid with axes x and y:

```python
g = dm.GridData(x = np.linspace(0, 3*np.pi, 30),
                y = np.linspace(0, 2*np.pi, 20),
                )
```

Fill one array with some sinusoidal functions, called a here:

```python
g['a'] = np.sin(g['x']) * np.cos(g['y'])
```

As a shorthand, we can also use attributes instead of items:

```python
g.a = np.sin(g.x) * np.cos(g.y)
```

In 1-d and 2-d they render as HTML in Jupyter notebooks.

In the case of 1-d and 2-d grids, the data can be plotted easily:

```python
g.plot(cbar=True);
```


Let's interpolate the values to 200 points along each axis and plot:

```python
g.interp(x=200, y=200).plot(cbar=True);
```


Execution of (most) translation methods is lazy, which means that the computation only happens when a specific variable is accessed. This can have side effects if you manipulate the original data before the translation is evaluated; just something to be aware of.
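
As a hedged illustration of that caveat (the exact evaluation rules are not spelled out here, so treat this as a sketch rather than documented behaviour):

```python
# Sketch only: assumes interp() is one of the lazily evaluated translations
# described above, and that results are computed on first access.
h = g.interp(x=200, y=200)   # translation is declared, nothing computed yet
g['a'] = -g['a']             # the source data is modified afterwards
h.plot(cbar=True);           # evaluation happens here, so the interpolated
                             # values reflect the modified 'a', not the values
                             # present when interp() was called
```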

Masking and item assignment are also supported:

```python
g.a[g.a > 0.3]
```

| y \ x | 0   | 0.325 | 0.65  | ... | 8.77  | 9.1   | 9.42 |
|-------|-----|-------|-------|-----|-------|-------|------|
| 0     | --  | 0.319 | 0.605 | ... | 0.605 | 0.319 | --   |
| 0.331 | --  | 0.302 | 0.572 | ... | 0.572 | 0.302 | --   |
| 0.661 | --  | --    | 0.478 | ... | 0.478 | --    | --   |
| ...   | ... | ...   | ...   | ... | ...   | ...   | ...  |
| 5.62  | --  | --    | 0.478 | ... | 0.478 | --    | --   |
| 5.95  | --  | 0.302 | 0.572 | ... | 0.572 | 0.302 | --   |
| 6.28  | --  | 0.319 | 0.605 | ... | 0.605 | 0.319 | --   |

The objects are also numpy compatible and indexable by index (integers) or value (floats). Numpy functions with axis keywords accept either the name(s) of the axis, e.g. x here, which makes them independent of axis ordering, or the usual integer indices.

```python
g[10::-1, :np.pi:2]
```

| y \ x | 3.25        | 2.92        | 2.6        | ... | 0.65       | 0.325      | 0      |
|-------|-------------|-------------|------------|-----|------------|------------|--------|
| 0     | a = -0.108  | a = 0.215   | a = 0.516  | ... | a = 0.605  | a = 0.319  | a = 0  |
| 0.661 | a = -0.0853 | a = 0.17    | a = 0.407  | ... | a = 0.478  | a = 0.252  | a = 0  |
| 1.32  | a = -0.0265 | a = 0.0528  | a = 0.127  | ... | a = 0.149  | a = 0.0784 | a = 0  |
| 1.98  | a = 0.0434  | a = -0.0864 | a = -0.207 | ... | a = -0.243 | a = -0.128 | a = -0 |
| 2.65  | a = 0.0951  | a = -0.189  | a = -0.453 | ... | a = -0.532 | a = -0.281 | a = -0 |

```python
np.sum(g[10::-1, :np.pi:2].T, axis='x')
```

| y | 0    | 0.661 | 1.32 | 1.98  | 2.65 |
|---|------|-------|------|-------|------|
| a | 6.03 | 4.76  | 1.48 | -2.42 | -5.3 |
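
To spell out the index-versus-value distinction from above, a minimal hedged sketch (same slicing semantics as in the example; results not shown here):

```python
# Hedged sketch: the same kind of selection expressed by position and by value,
# following the indexing rules described above.
by_index = g[:10, :]      # integer slice: the first 10 grid points along x
by_value = g[:np.pi, :]   # float slice: all grid points with x-values below pi
```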

Comparison

As a comparison, to point out the convenience, an alternative way to achieve the above without dama would look something like the following for creating and plotting the array:

```python
x = np.linspace(0, 3*np.pi, 30)
y = np.linspace(0, 2*np.pi, 20)

xx, yy = np.meshgrid(x, y)
a = np.sin(xx) * np.cos(yy)

import matplotlib.pyplot as plt

x_widths = np.diff(x)
x_pixel_boundaries = np.concatenate([[x[0] - 0.5*x_widths[0]], x[:-1] + 0.5*x_widths, [x[-1] + 0.5*x_widths[-1]]])
y_widths = np.diff(y)
y_pixel_boundaries = np.concatenate([[y[0] - 0.5*y_widths[0]], y[:-1] + 0.5*y_widths, [y[-1] + 0.5*y_widths[-1]]])

pc = plt.pcolormesh(x_pixel_boundaries, y_pixel_boundaries, a)
plt.gca().set_xlabel('x')
plt.gca().set_ylabel('y')
cb = plt.colorbar(pc)
cb.set_label('a')
```

and for doing the interpolation:

```python
from scipy.interpolate import griddata

interp_x = np.linspace(0, 3*np.pi, 200)
interp_y = np.linspace(0, 2*np.pi, 200)

grid_x, grid_y = np.meshgrid(interp_x, interp_y)

points = np.vstack([xx.flatten(), yy.flatten()]).T
values = a.flatten()

interp_a = griddata(points, values, (grid_x, grid_y), method='cubic')
```

PointData

Another representation of data is PointData, which is not much different from a dictionary holding same-length nd-arrays or a pandas DataFrame (and can actually be instantiated with those).
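
For instance, a minimal sketch of such a construction (assuming the constructor accepts a dict of equal-length arrays or a DataFrame, as stated above; the exact signature is not verified here):

```python
# Sketch only: per the paragraph above, PointData can reportedly be built
# directly from a dict of same-length arrays or a pandas DataFrame.
d = {'x': np.random.randn(1_000), 'a': np.random.rand(1_000)}
p_from_dict = dm.PointData(d)

import pandas as pd
p_from_df = dm.PointData(pd.DataFrame(d))
```

Below, we instead start from an empty PointData and fill it with random values: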

```python
p = dm.PointData()
p.x = np.random.randn(100_000)
p.a = np.random.rand(p.size) * p.x**2
```

```python
p
```

| x | 0.0341  | 0.212 | 0.517 | ... | 1.27 | 0.827 | 1.57  |
|---|---------|-------|-------|-----|------|-------|-------|
| a | 0.00106 | 0.035 | 0.18  | ... | 1.59 | 0.246 | 0.201 |

```python
p.plot()
```


Maybe a correlation plot would be more insightful:

```python
p.plot('x', 'a', '.');
```


This can now seamlessly be translated into GridData, for example by taking the data binwise in x in 20 bins and summing up the points in each bin:

```python
p.binwise(x=20).sum()
```

| x | [-4.392 -3.962] | [-3.962 -3.532] | [-3.532 -3.102] | ... | [2.916 3.346] | [3.346 3.776] | [3.776 4.206] |
|---|-----------------|-----------------|-----------------|-----|---------------|---------------|---------------|
| a | 29              | 131             | 456             | ... | 631           | 163           | 77.7          |

```python
p.binwise(x=20).sum().plot();
```


This is equivalent to making a weighted histogram, though the latter is faster.

```python
p.histogram(x=20).a
```

| x | [-4.392 -3.962] | [-3.962 -3.532] | [-3.532 -3.102] | ... | [2.916 3.346] | [3.346 3.776] | [3.776 4.206] |
|---|-----------------|-----------------|-----------------|-----|---------------|---------------|---------------|
|   | 29              | 131             | 456             | ... | 631           | 163           | 77.7          |

```python
np.allclose(p.histogram(x=10).a, p.binwise(x=10).sum().a)
```

True

KDE in n dimensions is also available, for example:

```python
p.kde(x=1000).a.plot();
```


GridArrays can also hold multi-dimensional values, like RGB images or, as here, 5 values from the quantile function. Let's plot those as bands:

```python
p.binwise(x=20).quantile(q=[0.1, 0.3, 0.5, 0.7, 0.9]).plot_bands()
```


When we specify x with an array, we give a list of points to binwise, so the resulting plot will consist of points, not bins.

```python
p.binwise(x=np.linspace(-3, 3, 10)).quantile(q=[0.1, 0.3, 0.5, 0.7, 0.9]).plot_bands(lines=True, filled=True, linestyles=[':', '--', '-'], lw=1)
```


This is not the same as using edges, as in the example below, which is why the plots also look different.

```python
p.binwise(x=dm.Edges(np.linspace(-3, 3, 10))).quantile(q=[0.1, 0.3, 0.5, 0.7, 0.9]).plot_bands(lines=True, filled=True, linestyles=[':', '--', '-'], lw=1)
```


Saving and loading

Dama supports the pickle protocol, and objects can be stored like:

```python
dm.save("filename.pkl", obj)
```

And read back like:

```python
obj = dm.read("filename.pkl")
```
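
A short round-trip sketch using the two calls above (the final check assumes the restored GridArray compares element-wise like a numpy array, per the numpy compatibility noted earlier):

```python
# Round trip via the pickle-based helpers shown above.
dm.save("grid.pkl", g)
g2 = dm.read("grid.pkl")
assert np.allclose(g2['a'], g['a'])   # stored values come back unchanged
```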

Example gallery

This is just to illustrate some different, seemingly random applications resulting in various plots, all starting from some random data points:

```python
from matplotlib import pyplot as plt
```

```python
p = dm.PointData()
p.x = np.random.rand(10_000)
p.y = np.random.randn(p.size) * np.sin(p.x*3*np.pi) * p.x
p.a = p.y/p.x
```

```python
fig, ax = plt.subplots(4, 4, figsize=(20, 20))
ax = ax.flatten()

# First row
p.y.plot(ax=ax[0])
p.plot('x', 'y', '.', ax=ax[1])
p.plot_scatter('x', 'y', c='a', s=1, cmap=dm.cm.spectrum, ax=ax[2])
p.interp(x=100, y=100, method="nearest").a.plot(ax=ax[3])

# Second row
np.log(1 + p.histogram(x=100, y=100).counts).plot(ax=ax[4])
p.kde(x=100, y=100, bw=(0.02, 0.05)).density.plot(cmap=dm.cm.afterburner_r, ax=ax[5])
p.histogram(x=10, y=10).interp(x=100, y=100).a.plot(cmap="RdBu", ax=ax[6])
p.histogram(x=100, y=100).counts.median_filter(10).plot(ax=ax[7])

# Third row
p.binwise(x=100).quantile(q=[0.1, 0.3, 0.5, 0.7, 0.9]).y.plot_bands(ax=ax[8])
p.binwise(x=100).quantile(q=[0.1, 0.3, 0.5, 0.7, 0.9]).y.gaussian_filter((2.5, 0)).interp(x=500).plot_bands(filled=False, lines=True, linestyles=[':', '--', '-'], ax=ax[9])
p.binwise(a=100).mean().y.plot(ax=ax[10])
p.binwise(a=100).std().y.plot(ax=ax[10])
p.histogram(x=100, y=100).counts.std(axis='x').plot(ax=ax[11])

# Fourth row
np.log(p.histogram(x=100, y=100).counts + 1).gaussian_filter(0.5).plot_contour(cmap=dm.cm.passion_r, ax=ax[12])
p.histogram(x=30, y=30).gaussian_filter(1).lookup(p).plot_scatter('x', 'y', 'a', 1, cmap='Spectral', ax=ax[13])
h = p.histogram(y=100, x=np.logspace(-1, 0, 100)).a.T
h[h > 0].plot(ax=ax[14])
h[1/3:2/3].plot(ax=ax[15])
```




Owner

  • Name: Philipp Eller
  • Login: philippeller
  • Kind: user
  • Location: Munich
  • Company: Origins Data Science Lab

GitHub Events

Total
  • Watch event: 1
Last Year
  • Watch event: 1

Committers

Last synced: 7 months ago

All Time
  • Total Commits: 289
  • Total Committers: 2
  • Avg Commits per committer: 144.5
  • Development Distribution Score (DDS): 0.007
Past Year
  • Commits: 3
  • Committers: 1
  • Avg Commits per committer: 3.0
  • Development Distribution Score (DDS): 0.0
Top Committers
Name Email Commits
Philipp Eller p****s@g****m 287
Aaron Fienberg a****g@p****u 2
Committer Domains (Top 20 + Academic)
psu.edu: 1

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 6
  • Total pull requests: 5
  • Average time to close issues: 6 months
  • Average time to close pull requests: about 3 hours
  • Total issue authors: 1
  • Total pull request authors: 2
  • Average comments per issue: 0.0
  • Average comments per pull request: 0.2
  • Merged pull requests: 4
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • philippeller (6)
Pull Request Authors
  • philippeller (3)
  • atfienberg (2)
Top Labels
Issue Labels
Pull Request Labels

Packages

  • Total packages: 1
  • Total downloads:
    • pypi 21 last-month
  • Total dependent packages: 0
  • Total dependent repositories: 1
  • Total versions: 6
  • Total maintainers: 1
pypi.org: dama

Look at data in different ways

  • Versions: 6
  • Dependent Packages: 0
  • Dependent Repositories: 1
  • Downloads: 21 Last month
Rankings
Dependent packages count: 10.0%
Stargazers count: 19.3%
Dependent repos count: 21.7%
Forks count: 22.6%
Average: 23.1%
Downloads: 41.9%
Maintainers (1)
Last synced: 6 months ago

Dependencies

setup.py pypi
  • KDEpy *
  • matplotlib >=2.0
  • numpy >=1.11
  • numpy_indexed *
  • scipy >=0.17
  • tabulate *
.github/workflows/pythonpackage.yml actions
  • actions/checkout v1 composite
  • actions/setup-python v1 composite
requirements.txt pypi