https://github.com/ml31415/numpy-groupies

Optimised tools for group-indexing operations: aggregated sum and more


Science Score: 36.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Committers with academic emails
    1 of 8 committers (12.5%) from academic institutions
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (14.0%) to scientific vocabulary

Keywords

groupby numba numpy python

Keywords from Contributors

closember
Last synced: 5 months ago

Repository

Optimised tools for group-indexing operations: aggregated sum and more

Basic Info
  • Host: GitHub
  • Owner: ml31415
  • License: bsd-2-clause
  • Language: Python
  • Default Branch: master
  • Homepage:
  • Size: 646 KB
Statistics
  • Stars: 210
  • Watchers: 9
  • Forks: 22
  • Open Issues: 9
  • Releases: 15
Topics
groupby numba numpy python
Created over 12 years ago · Last pushed 9 months ago
Metadata Files
Readme License

README.md


numpy-groupies

This package consists of a small library of optimised tools for doing things that can roughly be considered "group-indexing operations". The most prominent tool is aggregate, which is described in detail further down the page.

Installation

If you have pip, then simply: pip install numpy_groupies. Note that numpy_groupies has no compulsory dependencies (even numpy is optional), so it should be fairly easy to install even without a package manager. If you only want one particular implementation of aggregate (e.g. aggregate_numpy.py), you can download that single file and paste the contents of utils.py into the top of it (replacing the from .utils import (...) line).

aggregate

aggregate_diagram

```python
import numpy as np
import numpy_groupies as npg

group_idx = np.array([   3,   0,   0,   1,   0,   3,    5,    5,   0,    4])
a         = np.array([13.2, 3.5, 3.5,-8.2, 3.0,13.4, 99.2, -7.1, 0.0, 53.7])
npg.aggregate(group_idx, a, func='sum', fill_value=0)
# >>> array([10. , -8.2,  0. , 26.6, 53.7, 92.1])
```

aggregate takes an array of values, and an array giving the group number for each of those values. It then returns the sum (or mean, or std, or any, ...etc.) of the values in each group. You have probably come across this idea before - see [Matlab's accumarray function](http://uk.mathworks.com/help/matlab/ref/accumarray.html?refresh=true), [pandas' groupby concept](http://pandas.pydata.org/pandas-docs/dev/groupby.html), the MapReduce paradigm, or simply the basic histogram.

A couple of the implemented functions do not reduce the data; instead they calculate values cumulatively while iterating over the data, or permute them. The output size matches the input size.

```python
group_idx = np.array([4, 3, 3, 4, 4, 1, 1, 1, 7, 8, 7, 4, 3, 3, 1, 1])
a         = np.array([3, 4, 1, 3, 9, 9, 6, 7, 7, 0, 8, 2, 1, 8, 9, 8])
npg.aggregate(group_idx, a, func='cumsum')
# >>> array([ 3,  4,  5,  6, 15,  9, 15, 22,  7,  0, 15, 17,  6, 14, 31, 39])
```

Inputs

The function accepts various different combinations of inputs, producing various different shapes of output. We give a brief description of the general meaning of the inputs and then go over the different combinations in more detail:

  • group_idx - array of non-negative integers to be used as the "labels" with which to group the values in a.
  • a - array of values to be aggregated.
  • func='sum' - the function to use for aggregation. See the section below for more details.
  • size=None - the shape of the output array. If None, (the maximum value in group_idx) + 1 sets the size of the output.
  • fill_value=0 - value to use for output groups that do not appear anywhere in the group_idx input array.
  • order='C' - for multidimensional output, this controls the layout in memory; can be 'F' for Fortran-style.
  • dtype=None - the dtype of the output. None means choose a sensible type for the given a, func, and fill_value.
  • axis=None - explained below.
  • ddof=0 - passed through into calculations of variance and standard deviation (see section on functions).

aggregate_dims_diagram

  • Form 1 is the simplest, taking group_idx and a of matching 1D lengths, and producing a 1D output.
  • Form 2 is similar to Form 1, but takes a scalar a, which is broadcast out to the length of group_idx. Note that this is generally not that useful.
  • Form 3 is more complicated. group_idx has the same length as a.shape[axis]. The groups are broadcast out along the other axis/axes of a, thus the output is of shape n_groups x a.shape[0] x ... x a.shape[axis-1] x a.shape[axis+1] x ... x a.shape[-1], i.e. the output has two or more dimensions.
  • Form 4 also produces output with two or more dimensions, but for very different reasons than Form 3. Here a is 1D and group_idx is exactly 2D, whereas in Form 3 a is ND, group_idx is 1D, and we provide a value for axis. The length of a must match group_idx.shape[1], and the value of group_idx.shape[0] determines the number of dimensions in the output, i.e. group_idx[:, 99] gives the (x,y,z) group indices for a[99].
  • Form 5 is the same as Form 4 but with scalar a. As with Form 2, this is rarely that helpful.

Note on performance. The order of the output is unlikely to affect the performance of aggregate (although it may affect your downstream usage of that output). However, the order of a multidimensional a or group_idx can affect performance: in Form 4 it is best if columns are contiguous in memory within group_idx, i.e. group_idx[:, 99] corresponds to a contiguous chunk of memory; in Form 3 it is best if all the data in a for group_idx[i] is contiguous, e.g. if axis=1 then we want a[:, 55] to be contiguous.

Available functions

By default, aggregate assumes you want to sum the values within each group, but you can specify another function using the func kwarg. This func can be any custom callable, though you will likely want one of the following optimized functions. Note that not all functions are provided by every implementation.

  • 'sum' - sum of items within each group (see example above).
  • 'prod' - product of items within each group
  • 'mean' - mean of items within each group
  • 'var' - variance of items within each group. Use the ddof kwarg for degrees of freedom: the divisor used in calculations is N - ddof, where N is the number of elements. By default ddof is zero.
  • 'std' - standard deviation of items within each group. Use ddof kwarg for degrees of freedom (see var above).
  • 'min' - minimum value of items within each group.
  • 'max' - maximum value of items within each group.
  • 'first' - first item in a from each group.
  • 'last' - last item in a from each group.
  • 'argmax' - the index in a of the maximum value in each group.
  • 'argmin' - the index in a of the minimum value in each group.

The above functions also have a nan-form, which skips nan values instead of propagating them into the result of the calculation:

  • 'nansum', 'nanprod', 'nanmean', 'nanvar', 'nanstd', 'nanmin', 'nanmax', 'nanfirst', 'nanlast', 'nanargmax', 'nanargmin'

The following functions are slightly different in that they always return boolean values. Their treatment of nans also differs from the above:

  • 'all' - True if all items within a group are truthy. Note that np.all(np.nan) is True, i.e. nan is actually truthy.
  • 'any' - True if any items within a group are truthy.
  • 'allnan' - True if all items within a group are nan.
  • 'anynan' - True if any items within a group are nan.

The following functions don't reduce the data, but instead produce an output matching the size of the input:

  • 'cumsum' - cumulative sum of items within each group.
  • 'cumprod' - cumulative product of items within each group. (numba only)
  • 'cummin' - cumulative minimum of items within each group. (numba only)
  • 'cummax' - cumulative maximum of items within each group. (numba only)
  • 'sort' - sorts the items within each group in ascending order; use reverse=True to invert the order.

Finally, there are functions which don't reduce each group to a single value, but instead return the full set of items within the group:

  • 'array' - simply returns the grouped items, in the same order as they appeared in a. (numpy only)

Examples

Compute sums of consecutive integers, and then compute products of those consecutive integers.

```python
group_idx = np.arange(5).repeat(3)
# group_idx: array([0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4])
a = np.arange(group_idx.size)
# a: array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14])
x = npg.aggregate(group_idx, a)  # sum is default
# x: array([ 3, 12, 21, 30, 39])
x = npg.aggregate(group_idx, a, 'prod')
# x: array([   0,   60,  336,  990, 2184])
```

Get variance ignoring nans, setting all-nan groups to nan.

```python
x = npg.aggregate(group_idx, a, func='nanvar', fill_value=np.nan)
```

Count the number of elements in each group. Note that this is equivalent to doing np.bincount(group_idx); indeed that is how the numpy implementation does it.

```python
x = npg.aggregate(group_idx, 1)
```

Sum 1000 values into a three-dimensional cube of size 15x15x15. Note that in this example all three dimensions have the same size, but that doesn't have to be the case.

```python
group_idx = np.random.randint(0, 15, size=(3, 1000))
a = np.random.random(group_idx.shape[1])
x = npg.aggregate(group_idx, a, func="sum", size=(15, 15, 15), order="F")
# x.shape: (15, 15, 15)
# np.isfortran(x): True
```

Use a custom function to generate some strings.

```python
group_idx = np.array([1, 0, 1, 4, 1])
a = np.array([12.0, 3.2, -15, 88, 12.9])
x = npg.aggregate(group_idx, a,
                  func=lambda g: ' or maybe '.join(str(gg) for gg in g),
                  fill_value='')
# x: ['3.2', '12.0 or maybe -15.0 or maybe 12.9', '', '', '88.0']
```

Use the axis arg in order to do a sum-aggregation on three rows simultaneously.

```python
a = np.array([[99,  2, 11,  14, 20],
              [33, 76, 12, 100, 71],
              [67, 10, -8,   1,  9]])
group_idx = np.array([3, 3, 7, 0, 0])
x = npg.aggregate(group_idx, a, axis=1)
# x: array([[ 34,   0,   0, 101,   0,   0,   0,  11],
#           [171,   0,   0, 109,   0,   0,   0,  12],
#           [ 10,   0,   0,  77,   0,   0,   0,  -8]])
```

Multiple implementations

There are multiple implementations of aggregate provided. If you use from numpy_groupies import aggregate, the best available implementation will automatically be selected. Otherwise you can pick a specific version directly like from numpy_groupies import aggregate_nb as aggregate or by importing aggregate from the implementing module from numpy_groupies.aggregate_weave import aggregate.

Currently the following implementations exist:

  • numpy - This is the default implementation. It uses plain numpy, mainly relying on np.bincount and basic indexing magic. It comes without dependencies other than numpy and shows reasonable performance for occasional usage.
  • numba - This is the most performant implementation, based on JIT compilation provided by numba and LLVM.
  • pure python - This implementation has no dependencies and uses only the standard library. It's horribly slow and should only be used if numpy is not available.
  • numpy ufunc - Only for benchmarking. This implementation uses the .at method of numpy's ufuncs (e.g. add.at), which would appear to be designed for performing exactly the calculation that aggregate executes; however, the numpy implementation is rather incomplete.
  • pandas - Only for reference. pandas' groupby concept is the same as the task performed by aggregate. However, pandas is not actually faster than the default numpy implementation. Also note that there may be room for improvement in the way pandas is utilized here; most notably, when computing multiple aggregations of the same data (e.g. 'min' and 'max'), pandas could potentially be used more efficiently.

All implementations have the same calling syntax and produce the same outputs, to within some floating-point error. However some implementations only support a subset of the valid inputs and will sometimes throw NotImplementedError.

Benchmarks

Scripts for testing and benchmarking are included in this repository. For benchmarking, run python -m numpy_groupies.benchmarks.generic from the root of this repository.

Below we are using 500,000 indices uniformly picked from [0, 1000). The values of a are uniformly picked from the interval [0, 1), with anything less than 0.2 then set to 0 (in order to serve as falsy values in boolean operations). For the nan-operations another 20% of the values are set to nan, leaving the remainder on the interval [0.2, 0.8).
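The data setup described above can be reproduced in a few lines; this is a hypothetical sketch, not the repository's actual benchmark script:

```python
import numpy as np

rng = np.random.default_rng(0)
group_idx = rng.integers(0, 1000, size=500_000)  # indices uniform on [0, 1000)
a = rng.random(500_000)                          # values uniform on [0, 1)
a[a < 0.2] = 0                                   # falsy values for boolean ops

# For the nan-operations, turn roughly the top 20% of values into nan,
# leaving the remainder on [0.2, 0.8).
a_nan = a.copy()
a_nan[a_nan >= 0.8] = np.nan
```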

The benchmarking results are given in ms for an i7-7560U running at 2.40GHz:

| function  | ufunc   | numpy   | numba   | pandas  |
|-----------|---------|---------|---------|---------|
| sum       | 1.950   | 1.728   | 0.708   | 11.832  |
| prod      | 2.279   | 2.349   | 0.709   | 11.649  |
| min       | 2.472   | 2.489   | 0.716   | 11.686  |
| max       | 2.457   | 2.480   | 0.745   | 11.598  |
| len       | 1.481   | 1.270   | 0.635   | 10.932  |
| all       | 37.186  | 3.054   | 0.892   | 12.587  |
| any       | 35.278  | 5.157   | 0.890   | 12.845  |
| anynan    | 5.783   | 2.126   | 0.762   | 144.740 |
| allnan    | 7.971   | 4.367   | 0.774   | 144.507 |
| mean      | ----    | 2.500   | 0.825   | 13.284  |
| std       | ----    | 4.528   | 0.965   | 12.193  |
| var       | ----    | 4.269   | 0.969   | 12.657  |
| first     | ----    | 1.847   | 0.811   | 11.584  |
| last      | ----    | 1.309   | 0.581   | 11.842  |
| argmax    | ----    | 3.504   | 1.411   | 293.640 |
| argmin    | ----    | 6.996   | 1.347   | 290.977 |
| nansum    | ----    | 5.388   | 1.569   | 15.239  |
| nanprod   | ----    | 5.707   | 1.546   | 15.004  |
| nanmin    | ----    | 5.831   | 1.700   | 14.292  |
| nanmax    | ----    | 5.847   | 1.731   | 14.927  |
| nanlen    | ----    | 3.170   | 1.529   | 14.529  |
| nanall    | ----    | 6.499   | 1.640   | 15.931  |
| nanany    | ----    | 8.041   | 1.656   | 15.839  |
| nanmean   | ----    | 5.636   | 1.583   | 15.185  |
| nanvar    | ----    | 7.514   | 1.682   | 15.643  |
| nanstd    | ----    | 7.292   | 1.666   | 15.104  |
| nanfirst  | ----    | 5.318   | 2.096   | 14.432  |
| nanlast   | ----    | 4.943   | 1.473   | 14.637  |
| nanargmin | ----    | 7.977   | 1.779   | 298.911 |
| nanargmax | ----    | 5.869   | 1.802   | 301.022 |
| cumsum    | ----    | 71.713  | 1.119   | 8.864   |
| cumprod   | ----    | ----    | 1.123   | 12.100  |
| cummax    | ----    | ----    | 1.062   | 12.133  |
| cummin    | ----    | ----    | 0.973   | 11.908  |
| arbitrary | ----    | 147.853 | 46.690  | 129.779 |
| sort      | ----    | 167.699 | ----    | ----    |

Linux (x86_64), Python 3.10.12, Numpy 1.25.2, Numba 0.58.0, Pandas 2.0.2

Development

This project was started by @ml31415 and the numba and weave implementations are by him. The pure python and numpy implementations were written by @d1manson.

The authors hope that numpy's ufunc.at method, or some other implementation of aggregate within numpy or scipy, will eventually be fast enough to make this package redundant. Numpy 1.25 actually contained major improvements in ufunc speed, which considerably reduced the gap between the numpy and numba implementations.

Owner

  • Name: Michael
  • Login: ml31415
  • Kind: user
  • Location: Odessa, Ukraine
  • Company: OCCAM

GitHub Events

Total
  • Create event: 1
  • Release event: 1
  • Issues event: 6
  • Watch event: 13
  • Issue comment event: 11
  • Pull request event: 2
  • Fork event: 2
Last Year
  • Create event: 1
  • Release event: 1
  • Issues event: 6
  • Watch event: 13
  • Issue comment event: 11
  • Pull request event: 2
  • Fork event: 2

Committers

Last synced: 6 months ago

All Time
  • Total Commits: 240
  • Total Committers: 8
  • Avg Commits per committer: 30.0
  • Development Distribution Score (DDS): 0.575
Past Year
  • Commits: 1
  • Committers: 1
  • Avg Commits per committer: 1.0
  • Development Distribution Score (DDS): 0.0
Top Committers
Name Email Commits
Michael Löffler m****l@l****o 102
Daniel Manson d****1@u****k 59
Michael Löffler ml@r****e 50
Deepak Cherian d****n@u****m 14
dcherian d****k@c****t 7
Pieter Eendebak p****k@g****m 5
Bas Couwenberg s****c@d****g 2
Antonio Valentino a****o@t****t 1
Committer Domains (Top 20 + Academic)

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 61
  • Total pull requests: 32
  • Average time to close issues: 8 months
  • Average time to close pull requests: about 1 month
  • Total issue authors: 34
  • Total pull request authors: 8
  • Average comments per issue: 4.16
  • Average comments per pull request: 1.06
  • Merged pull requests: 27
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 2
  • Pull requests: 1
  • Average time to close issues: 1 day
  • Average time to close pull requests: 4 days
  • Issue authors: 1
  • Pull request authors: 1
  • Average comments per issue: 3.5
  • Average comments per pull request: 6.0
  • Merged pull requests: 1
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • dcherian (11)
  • ml31415 (8)
  • d1manson (7)
  • ancri (3)
  • Illviljan (2)
  • josephnowak (2)
  • firmai (1)
  • math-artist (1)
  • swiftcode1121 (1)
  • ivirshup (1)
  • makihor (1)
  • bmorris3 (1)
  • cardosan (1)
  • shoyer (1)
  • cmdupuis3 (1)
Pull Request Authors
  • dcherian (19)
  • ml31415 (11)
  • sebastic (2)
  • avalentino (2)
  • Illviljan (1)
  • bmorris3 (1)
  • yulkang (1)
  • eendebakpt (1)
Top Labels
Issue Labels
enhancement (9) bug (5) question (2)
Pull Request Labels

Packages

  • Total packages: 3
  • Total downloads:
    • pypi 143,794 last-month
  • Total docker downloads: 4,639
  • Total dependent packages: 23
    (may contain duplicates)
  • Total dependent repositories: 161
    (may contain duplicates)
  • Total versions: 37
  • Total maintainers: 4
pypi.org: numpy-groupies

Optimised tools for group-indexing operations: aggregated sum and more.

  • Documentation: https://numpy-groupies.readthedocs.io/
  • License: Copyright (c) 2016, numpy-groupies developers All rights reserved. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: 1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. 2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
  • Latest release: 0.11.3
    published 9 months ago
  • Versions: 23
  • Dependent Packages: 17
  • Dependent Repositories: 153
  • Downloads: 143,794 Last month
  • Docker Downloads: 4,639
Rankings
Dependent packages count: 0.7%
Downloads: 1.1%
Average: 1.1%
Dependent repos count: 1.2%
Docker downloads count: 1.5%
Maintainers (3)
Last synced: 6 months ago
spack.io: py-numpy-groupies

This package consists of a couple of optimised tools for doing things that can roughly be considered "group-indexing operations". The most prominent tool is `aggregate`. `aggregate` takes an array of values, and an array giving the group number for each of those values. It then returns the sum (or mean, or std, or any, ...etc.) of the values in each group. You have probably come across this idea before, using `matlab` accumarray, `pandas` groupby, or generally MapReduce algorithms and histograms. There are different implementations of `aggregate` provided, based on plain `numpy`, `numba` and `weave`. Performance is a main concern, and so far we comfortably beat similar implementations in other packages (check the benchmarks).

  • Versions: 2
  • Dependent Packages: 1
  • Dependent Repositories: 0
Rankings
Dependent repos count: 0.0%
Stargazers count: 15.9%
Average: 16.5%
Forks count: 22.0%
Dependent packages count: 28.1%
Maintainers (1)
Last synced: 6 months ago
conda-forge.org: numpy_groupies
  • Versions: 12
  • Dependent Packages: 5
  • Dependent Repositories: 8
Rankings
Dependent packages count: 10.4%
Dependent repos count: 12.2%
Average: 21.9%
Stargazers count: 28.1%
Forks count: 36.7%
Last synced: 6 months ago

Dependencies

.github/workflows/ci.yaml actions
  • actions/cache v3 composite
  • actions/checkout v3 composite
  • conda-incubator/setup-miniconda v2 composite
.github/workflows/pypi-release.yaml actions
  • actions/checkout v3 composite
  • actions/download-artifact v3 composite
  • actions/setup-python v4 composite
  • actions/upload-artifact v3 composite
  • pypa/gh-action-pypi-publish release/v1 composite
pyproject.toml pypi
  • numpy *
ci/environment.yml conda
  • numba
  • numpy
  • numpy_groupies
  • pandas
  • pytest