entropy-balance-weighting

High-speed, big data sample weighting using entropy balancing

https://github.com/uscensusbureau/entropy-balance-weighting

Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (14.1%) to scientific vocabulary
Last synced: 10 months ago · JSON representation ·

Repository

High-speed, big data sample weighting using entropy balancing

Basic Info
  • Host: GitHub
  • Owner: uscensusbureau
  • License: other
  • Language: Python
  • Default Branch: main
  • Homepage:
  • Size: 150 KB
Statistics
  • Stars: 2
  • Watchers: 3
  • Forks: 0
  • Open Issues: 0
  • Releases: 1
Created over 1 year ago · Last pushed about 1 year ago
Metadata Files
Readme Contributing License Citation

README.md

Accelerated Entropy Balancing Sample Weighting

This package implements an Entropy Balancing Weighting (EBW) routine for dense or sparse matrices. The goal of this package is to scale the EBW algorithm to allow it to work with large, complicated data sets. A particular focus is then on allowing sparse inputs, collinearity, and potential infeasibility without additional front-end data manipulation by the user.

Scaling up EBW problems is largely possible due to the sparsity of naturally-occurring data, e.g. models with group-specific moments. Utilizing sparsity of datasets, models with millions of observations and hundreds of thousands of constraints have successfully converged in minutes. In addition, extensions like bound constraints and handling collinear and inconsistent moments have been added to the standard approach.

The paper detailing the algorithm is available as a Census Working Paper.

Requirements:

  • Python 3.9+
  • Required packages (tested versions listed, should work on most others):
    • numpy (2.2, 1.26)
    • scipy (1.15.1, 1.13.1)
    • pypardiso (0.4.6)
    • sparsedotmkl (0.9.7)
    • numexpr (2.10.2)
  • Required packages for testing:
    • polars (1.0+)
    • pytest (8.3.0+)
  • Required packages for examples:
    • requests

Installation:

Once the release .whl file is downloaded and stored in <wheel location>, a simple pip install <wheel location> should work.

Problem and Usage:

The primary function in this package is entropy_balance_weighting.entropy_balance, which solves this problem for any consistent set of constraints that can be satisfied with positive weights.

The EBW problem solved here considers a set of $N$ sampled units, with initial unit weights $w0$, an $N\times K$ matrix of observable characteristics at the unit level $X$, and a $K\times 1$ vector of known external unit-level mean "targets" $m$ that the reweighted moments should match. Distance between the new weights and old weights is based on the Kullback-Leibler divergence and can be written in terms of the ratio $r \equiv w{1}/w_{0}$ between the new weights and the old weights:

$$\phi(r; w0)=\sum w{0,i} (\log(r{i}) - r{i} + 1).$$

The optimization problem can then be written

$$\min{r\in \mathbb{R}^N} \phi(r; w0)$$

$$\text{s.t. } (\text{Diag}({w{i,0})}X)^T r = m \cdot \sum{i=1}^N w_{i,0}, \hspace{3pt} r\geq 0.$$

The entropy_balance function takes three required keyword arguments: x_sample, the $N\times K$ array dense or sparse matrix (numpy.array or scipy.sparse.cscarray/csrarray) of unit-level observations; weights0, a numpy vector of the initial weights, best scaled to have mean 1; and mean_population_moments, a numpy vector of the externally-known unit-level mean of each column of $X$. As written above, the algorithm matches the aggregate moments, so keeping the relative scaling of $X$, $m$, and $w_{i,0}$ consistent with the formulation above makes the algorithm more stable, even if not theoretically necessary.

The function returns an EntropyBalanceResults object, which has a property EntropyBalanceResults.new_weights, the new weights that match the moments (if possible) while being minimum distance from the initial weights.

Passing in optional arguments to entropy_balance can allow for setting bounds in the resulting ratios and dealing with infeasibility of the linear system; see the section "Bounded and Infeasible" below.

Treatment/Control Weighting:

This package can also be used to reweight subsets of observations to all have the same weighted mean characteristics over multiple dimensions. See the examples/pums_example.py file for an example of how to write the moments as a special case of the more general problem above. In that example, data from the American Community Survey is reweighted at the state level so all states have equal post-weighting means of a number of household-level variables.

entropy_balance Example:

```

import numpy as np import entropybalanceweighting as ebw import scipy.sparse as sp

samplesize = 5 x = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]) meanpopulationmoments = np.array([.35, .30, .33]) weights0 = np.ones(samplesize) x.T.dot(weights0) - samplesize * meanpopulation_moments # Calculate miss in weighted aggregate moments. array([-0.75, 0.5 , 0.35])

resebw = ebw.entropybalance(xsample = x, weights0 = weights0, meanpopulationmoments = meanpopulationmoments) resebw.newweights array([1.75 , 0.75 , 0.75 , 0.825, 0.825]) x.T.dot(resebw.newweights) - samplesize * meanpopulationmoments # Aggregate weighted moments now matched exactly. array([6.72795153e-14, 1.28785871e-14, 4.66293670e-15])

xsparse = sp.csrarray(x) # Can use sparse matrices for memory/computation savings ressparse = ebw.entropybalance(xsample = xsparse, weights0 = weights0, meanpopulationmoments = meanpopulationmoments) assert np.allclose(resebw.newweights, ressparse.newweights) ```

More Examples:

More detailed examples can be found in the ./examples directory in the main repository. In particular, the pums_example.py gives an example of downloading the US Census 2023 1-Year American Community Survey and reweighting to exactly match the state-level number of housing units from the US Census Population Estimates program.

Bounded and Infeasible:

Passing the kwarg options={"bounds": (lbound, ubound)} for scalar lbound and ubound will solve the problem above with $\text{lbound} \leq r_i \leq \text{ubound }\forall i$. For no upper bound, pass None for ubound.

There is a tradeoff between bound constraints and feasibility of the linear system matching the moments. To make bounds generally useful, the bounded entropy_balance algorithm runs "elastic mode", a variant problem that has the following properties:

  1. If the bounded problem is feasible, entropy_balance(..., options={"bounds": ...}) returns the feasible solution.
  2. If the problem is infeasible, entropy_balance returns a point $r^\star$ that is feasible with respect to the bounds. This $r^\star$ minimizes the $L^1$ norm of the moment violations over the other feasible $r$ that satisfy $\phi(r) \leq \phi(r^\star)$; that is, the only way to make the moments match any better would be to increase the criterion. In practice, this matches as many moments exactly as possible, making it possible to identify which moments are impossible to simultaneously match.

The bounded problem is necessarily slower than the unbounded problem, but in theory should converge in every run. After running with bounds, the EntropyBalanceResult.constraint_violations property can be used to judge if the original problem was feasible or not. The feasibility of the unconstrained problem can be checked by passing {"bounds": (0, None)}.

Bounds and Elastic Mode Example:

```

resbounded= ebw.entropybalance(xsample = x, weights0 = weights0, meanpopulationmoments = meanpopulationmoments, options={"bounds": (0.5, 1.5)}) resbounded.newweights array([1.5 , 0.75 , 0.75 , 0.825, 0.825]) resbounded.constraint_violations # New bounds make problem infeasible, so return new weights that give "best fit" of moments. array([-2.50000000e-01, -8.58102478e-11, -5.19522203e-11])

```

Penalty Formulation:

A secondary, currently slightly less supported, function in this package is entropy_balance_weighting.entropy_balance_penalty, which solves the problem

$$\min{r\in \mathbb{R}^N} \phi(r; w0) + \frac{1}{2} (A^T r-m)^T \text{Diag}(pk)(A^T r - m)$$ $$\text{s.t. } A=\text{Diag}(\frac{w{i,0}}{\sumj w{j,0}})X, \hspace{3pt} r\geq 0.$$

where $p_{k}$ is a vector of positive penalty parameters chosen by the user. As above, this method also supports additional bound constraints on $r$ if provided.

This version of the problem moves the constraints into a smooth penalty function, which ensures feasibility.

The advantage of this method is allowing for infeasible and collinear moments without any complications, and the simple form of the problem makes optimization fast.

The downside compared to the above methods is that for any finite $p$ the solution will not match the solution of the original EBW problem exactly, and the choice of $p$ is arbitrary for the end user. Additionally, unlike the bounded elastic mode version above, it is not possible to algorithmically use this version of the problem to determine the feasibility of the original problem.

Issues:

We appreciate any feedback you would like to provide us; please post any questions that you may have in the GitHub issues section.

Citation Information:

Please cite this package in any work where it proves useful. @software{Sanders_Accelerated_Entropy_Balance_2025, author = {Sanders, Carl}, month = mar, title = {{Accelerated Entropy Balance Sample Weighting}}, url = {https://github.com/uscensusbureau/entropy-balance-weighting}, version = {0.5.0}, year = {2025} }

Disclaimers:

This work was performed as part of the National Experimental Wellbeing Statistics project at the U.S. Census Bureau. The views expressed are those of the authors and not those of the U.S. Census Bureau. The U.S. Census Bureau has reviewed this data product to ensure appropriate access, use, and disclosure avoidance protection of the confidential source data used to produce this product (Data Management System (DMS) number: P-6000725; Disclosure Review Board (DRB) approval number: CBDRB-FY25-SEHSD003-037).

Owner

  • Name: US Census Bureau
  • Login: uscensusbureau
  • Kind: organization
  • Email: cedsci.feedback@census.gov
  • Location: United States

Citation (CITATION.cff)

# This CITATION.cff file was generated with cffinit.
# Visit https://bit.ly/cffinit to generate yours today!

cff-version: 1.2.0
title: Accelerated Entropy Balance Survey Reweighting
message: >-
  If you use this software, please cite it using the
  metadata from this file.
type: software
authors:
  - given-names: Carl
    family-names: Sanders
    email: carl.e.sanders@census.gov
    affiliation: U.S. Census Bureau
repository-code: >-
  https://github.com/uscensusbureau/entropy-balance-weighting
abstract: >-
  This package implements an Entropy Balancing Weighting
  (EBW) routine for dense or sparse matrices.
keywords:
  - weighting
  - entropy balance
  - survey
  - sampling
version: 0.5.0
date-released: '2025-03-14'

GitHub Events

Total
  • Issues event: 1
  • Watch event: 2
  • Issue comment event: 1
  • Public event: 1
  • Push event: 14
Last Year
  • Issues event: 1
  • Watch event: 2
  • Issue comment event: 1
  • Public event: 1
  • Push event: 14

Dependencies

pyproject.toml pypi
  • numexpr *
  • numpy *
  • polars *
  • pyarrow *
  • pypardiso *
  • scipy *
  • sparse_dot_mkl *
requirements-dev.txt pypi
  • numexpr * development
  • numpy * development
  • polars * development
  • pypardiso * development
  • pytest * development
  • scipy * development
  • sparse_dot_mkl * development
requirements-lint.txt pypi
  • mypy *
  • ruff *