samplics

samplics: a Python Package for selecting, weighting and analyzing data from complex sampling designs. - Published in JOSS (2021)

https://github.com/samplics-org/samplics

Science Score: 49.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 3 DOI reference(s) in README
  • Academic publication links
    Links to: joss.theoj.org
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (12.1%) to scientific vocabulary

Keywords

estimation officialstatistics sample samplics sampling survey variance weighting

Scientific Fields

Economics Social Sciences - 63% confidence
Last synced: 6 months ago

Repository

Select, weight and analyze complex sample data

Basic Info
Statistics
  • Stars: 68
  • Watchers: 3
  • Forks: 12
  • Open Issues: 20
  • Releases: 12
Topics
estimation officialstatistics sample samplics sampling survey variance weighting
Created about 6 years ago · Last pushed 6 months ago
Metadata Files
Readme Contributing Funding License

README.md

Sample Analytics

DOI

Help Shape the Future of Samplics

We are driven by community feedback and would love to hear from you! Please take a few minutes to complete our user survey.

➡️ Share Your Feedback

In large-scale surveys, complex random mechanisms are often used to select samples. Estimates derived from such samples must reflect the random mechanism. Samplics is a Python package that implements a set of sampling techniques for complex survey designs. These techniques are organized into the following four sub-packages.

Sampling provides a set of random selection techniques used to draw a sample from a population. It also provides procedures for calculating sample sizes. The sampling subpackage contains:

  • Sample size calculation and allocation: Wald and Fleiss methods for proportions.
  • Equal probability of selection: simple random sampling (SRS) and systematic selection (SYS).
  • Probability proportional to size (PPS): Systematic, Brewer's method, Hanurav-Vijayan method, Murphy's method, and Rao-Sampford's method.

Weighting provides the procedures for adjusting sample weights. More specifically, the weighting subpackage allows the following:

  • Weight adjustment due to nonresponse
  • Weight poststratification, calibration and normalization
  • Weight replication, i.e., Bootstrap, BRR, and Jackknife

Estimation provides methods for estimating the parameters of interest with uncertainty measures that are consistent with the sampling design. The estimation subpackage implements the following types of estimation methods:

  • Taylor-based, also called linearization methods
  • Replication-based estimation, i.e., Bootstrap, BRR, and Jackknife
  • Regression-based, e.g., generalized regression (GREG)

Small Area Estimation (SAE). When the sample size is not large enough to produce reliable/stable domain-level estimates, SAE techniques can be used to model the output variable of interest to produce domain-level estimates. This subpackage provides area-level and unit-level SAE methods.

For more details, visit https://samplics-org.github.io/samplics/

Usage

Let's assume that we have a population and we would like to select a sample from it. The goal is to calculate the sample size for an expected proportion of 0.80 with a precision (half confidence interval) of 0.10.

```python
from samplics.utils.types import SizeMethod, PopParam
from samplics.sampling import SampleSize

sample_size = SampleSize(param=PopParam.prop, method=SizeMethod.wald)
sample_size.calculate(target=0.80, half_ci=0.10)
```
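Under the Wald method, the required sample size for a proportion follows the standard formula n = ⌈deff · z² · p(1−p) / e²⌉. As a quick plain-Python sanity check of the numbers above (not using samplics, and ignoring any finite-population correction):

```python
import math

def wald_sample_size(p: float, half_ci: float, z: float = 1.96, deff: float = 1.0) -> int:
    """Wald sample size for a proportion: n = ceil(deff * z^2 * p * (1 - p) / e^2)."""
    return math.ceil(deff * z**2 * p * (1 - p) / half_ci**2)

n = wald_sample_size(p=0.80, half_ci=0.10)
print(n)  # 62
```

So roughly 62 respondents are needed for an expected proportion of 0.80 with a ±0.10 half confidence interval at the 95% level.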

Furthermore, the population is located in four natural regions, i.e., North, South, East, and West. We may want to calculate sample sizes based on region-specific requirements, e.g., expected proportions, desired precisions, and associated design effects.

```python
from samplics.utils.types import SizeMethod, PopParam
from samplics.sampling import SampleSize

sample_size = SampleSize(param=PopParam.prop, method=SizeMethod.wald, strat=True)

expected_proportions = {"North": 0.95, "South": 0.70, "East": 0.30, "West": 0.50}
half_ci = {"North": 0.30, "South": 0.10, "East": 0.15, "West": 0.10}
deff = {"North": 1, "South": 1.5, "East": 2.5, "West": 2.0}

sample_size = SampleSize(param=PopParam.prop, method=SizeMethod.fleiss, strat=True)
sample_size.calculate(target=expected_proportions, half_ci=half_ci, deff=deff)
```
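For intuition, here is the simpler Wald analogue computed per stratum with the same inputs. This is a plain-Python sketch, not samplics code, and the Fleiss method used above adds a continuity correction, so samplics' results will differ somewhat:

```python
import math

z = 1.96  # 95% confidence level
expected_proportions = {"North": 0.95, "South": 0.70, "East": 0.30, "West": 0.50}
half_ci = {"North": 0.30, "South": 0.10, "East": 0.15, "West": 0.10}
deff = {"North": 1, "South": 1.5, "East": 2.5, "West": 2.0}

# Per-stratum Wald sample size, inflated by the stratum's design effect
sizes = {
    s: math.ceil(deff[s] * z**2 * p * (1 - p) / half_ci[s] ** 2)
    for s, p in expected_proportions.items()
}
print(sizes)
```

Note how the South and West strata, with tight precision targets and design effects above 1, dominate the total sample size.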

To select a sample of primary sampling units using a PPS method, we can use code similar to the snippet below. Note that we first use the datasets module to import the example dataset.

```python
# First we import the example dataset
from samplics.datasets import load_psu_frame

psu_frame_dict = load_psu_frame()
psu_frame = psu_frame_dict["data"]

# Code for the sample selection
from samplics.sampling import SampleSelection
from samplics.utils import SelectMethod

psu_sample_size = {"East": 3, "West": 2, "North": 2, "South": 3}
pps_design = SampleSelection(method=SelectMethod.pps_sys, strat=True, wr=False)

psu_frame["psu_prob"] = pps_design.inclusion_probs(
    psu_frame["cluster"],
    psu_sample_size,
    psu_frame["region"],
    psu_frame["number_households_census"],
)
```
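For PPS systematic selection, the inclusion probability of a cluster within a stratum is proportional to its measure of size: π_i = n · s_i / Σ_j s_j. A minimal sketch with hypothetical sizes (an illustration of the formula, not samplics' implementation):

```python
import numpy as np

# Hypothetical measures of size (e.g., census households) for four clusters in one stratum
sizes = np.array([100, 300, 200, 400])
n = 2  # number of clusters to select in the stratum

# PPS inclusion probabilities: pi_i = n * size_i / total_size
probs = n * sizes / sizes.sum()
print(probs)  # [0.2 0.6 0.4 0.8]
```

Larger clusters get proportionally larger selection probabilities, and the probabilities sum to n, the number of clusters drawn.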

The initial weighting step is to obtain the design sample weights. This example shows a simple two-stage sampling design.

```python
import pandas as pd

from samplics.datasets import load_psu_sample, load_ssu_sample
from samplics.weighting import SampleWeight

# Load PSU sample data
psu_sample_dict = load_psu_sample()
psu_sample = psu_sample_dict["data"]

# Load SSU sample data
ssu_sample_dict = load_ssu_sample()
ssu_sample = ssu_sample_dict["data"]

full_sample = pd.merge(
    psu_sample[["cluster", "region", "psu_prob"]],
    ssu_sample[["cluster", "household", "ssu_prob"]],
    on="cluster",
)

full_sample["inclusion_prob"] = full_sample["psu_prob"] * full_sample["ssu_prob"]
full_sample["design_weight"] = 1 / full_sample["inclusion_prob"]
```
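The arithmetic behind those last two steps: the overall inclusion probability of a household is the product of its stage-wise selection probabilities, and the design weight is the reciprocal. With hypothetical probabilities:

```python
# Hypothetical stage-wise selection probabilities for one household
psu_prob = 0.25  # first-stage (cluster) probability
ssu_prob = 0.50  # second-stage (household within cluster) probability

inclusion_prob = psu_prob * ssu_prob  # overall probability of ending up in the sample
design_weight = 1 / inclusion_prob    # number of population households this one represents
print(design_weight)  # 8.0
```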

To adjust the design sample weight for nonresponse, we can use code similar to:

```python
import numpy as np

from samplics.weighting import SampleWeight

# Simulate response status
np.random.seed(7)
full_sample["response_status"] = np.random.choice(
    ["ineligible", "respondent", "non-respondent", "unknown"],
    size=full_sample.shape[0],
    p=(0.10, 0.70, 0.15, 0.05),
)

# Map custom response statuses to the generic samplics statuses
status_mapping = {
    "in": "ineligible",
    "rr": "respondent",
    "nr": "non-respondent",
    "uk": "unknown",
}

# Adjust sample weights for nonresponse
full_sample["nr_weight"] = SampleWeight().adjust(
    samp_weight=full_sample["design_weight"],
    adj_class=full_sample["region"],
    resp_status=full_sample["response_status"],
    resp_dict=status_mapping,
)
```
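Conceptually, a nonresponse adjustment inflates respondents' weights within each adjustment class so that they also carry the weight of eligible nonrespondents. The sketch below illustrates the idea on toy data; it is a simplification of what `SampleWeight().adjust` does, which additionally redistributes unknown-eligibility cases:

```python
import pandas as pd

# Toy data: one adjustment class with equal design weights
df = pd.DataFrame({
    "design_weight": [10.0, 10.0, 10.0, 10.0],
    "status": ["respondent", "respondent", "non-respondent", "ineligible"],
})

eligible = df["status"].isin(["respondent", "non-respondent"])
respondent = df["status"] == "respondent"

# Respondents absorb the weight of eligible nonrespondents in their class
factor = df.loc[eligible, "design_weight"].sum() / df.loc[respondent, "design_weight"].sum()
df["nr_weight"] = 0.0
df.loc[respondent, "nr_weight"] = df.loc[respondent, "design_weight"] * factor
print(df["nr_weight"].tolist())  # [15.0, 15.0, 0.0, 0.0]
```

The total eligible weight (30) is preserved: it is now carried entirely by the two respondents.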

To estimate population parameters using Taylor-based and replication-based methods, we can use code similar to:

```python
# Taylor-based estimation
from samplics.utils.types import PopParam, RepMethod
from samplics.datasets import load_nhanes2

nhanes2_dict = load_nhanes2()
nhanes2 = nhanes2_dict["data"]

from samplics.estimation import TaylorEstimator

zinc_mean_str = TaylorEstimator(PopParam.mean)
zinc_mean_str.estimate(
    y=nhanes2["zinc"],
    samp_weight=nhanes2["finalwgt"],
    stratum=nhanes2["stratid"],
    psu=nhanes2["psuid"],
    remove_nan=True,
)

# Replicate-based estimation
from samplics.datasets import load_nhanes2brr

nhanes2brr_dict = load_nhanes2brr()
nhanes2brr = nhanes2brr_dict["data"]

from samplics.estimation import ReplicateEstimator

ratio_wgt_hgt = ReplicateEstimator(RepMethod.brr, PopParam.ratio).estimate(
    y=nhanes2brr["weight"],
    samp_weight=nhanes2brr["finalwgt"],
    x=nhanes2brr["height"],
    rep_weights=nhanes2brr.loc[:, "brr_1":"brr_32"],
    remove_nan=True,
)
```
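The replication idea itself is compact: re-estimate the statistic on each set of replicate weights, then take the (scaled) mean squared deviation of the replicate estimates around the full-sample estimate. A minimal BRR-style sketch with hypothetical replicate estimates (an illustration of the formula, not samplics' implementation):

```python
import numpy as np

def brr_variance(full_estimate, replicate_estimates, fay_coef=0.0):
    """BRR variance: mean squared deviation of the replicate estimates
    around the full-sample estimate, rescaled when Fay's adjustment is used."""
    reps = np.asarray(replicate_estimates, dtype=float)
    return np.mean((reps - full_estimate) ** 2) / (1.0 - fay_coef) ** 2

# Hypothetical full-sample estimate and four replicate estimates of a mean
var = brr_variance(10.0, [9.5, 10.5, 9.8, 10.2])
print(var)  # ≈ 0.145
```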

To predict small area parameters, we can use code similar to:

```python
import numpy as np
import pandas as pd

# Area-level basic method
from samplics.utils.types import FitMethod
from samplics.datasets import load_expenditure_milk

milk_exp_dict = load_expenditure_milk()
milk_exp = milk_exp_dict["data"]

from samplics.sae import EblupAreaModel

fh_model_reml = EblupAreaModel(method=FitMethod.reml)
fh_model_reml.fit(
    yhat=milk_exp["direct_est"],
    X=pd.get_dummies(milk_exp["major_area"], drop_first=True),
    area=milk_exp["small_area"],
    error_std=milk_exp["std_error"],
    intercept=True,
    tol=1e-8,
)
fh_model_reml.predict(
    X=pd.get_dummies(milk_exp["major_area"], drop_first=True),
    area=milk_exp["small_area"],
    intercept=True,
)

# Unit-level basic method
from samplics.datasets import load_county_crop, load_county_crop_means

# Load County Crop sample data
county_crop_dict = load_county_crop()
county_crop = county_crop_dict["data"]

# Load County Crop Area Means sample data
county_crop_means_dict = load_county_crop_means()
county_crop_means = county_crop_means_dict["data"]

from samplics.sae import EblupUnitModel

eblup_bhf_reml = EblupUnitModel()
eblup_bhf_reml.fit(
    county_crop["corn_area"],
    county_crop[["corn_pixel", "soybeans_pixel"]],
    county_crop["county_id"],
)
eblup_bhf_reml.predict(
    Xmean=county_crop_means[["ave_corn_pixel", "ave_soybeans_pixel"]],
    area=np.linspace(1, 12, 12),
)
```
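At its core, the area-level (Fay-Herriot) EBLUP shrinks each direct estimate toward a regression-synthetic estimate, with shrinkage factor γ_d = σ²_v / (σ²_v + ψ_d). A minimal numeric sketch with hypothetical values (samplics' fitting code additionally estimates σ²_v by REML/ML):

```python
import numpy as np

# Hypothetical quantities for three small areas
direct = np.array([12.0, 8.0, 15.0])     # direct survey estimates
synthetic = np.array([10.0, 9.0, 14.0])  # regression-synthetic estimates (x'beta)
psi = np.array([4.0, 1.0, 9.0])          # known sampling variances of the direct estimates
sigma2_v = 2.0                           # (assumed) area random-effect variance

# Shrinkage: noisier direct estimates are pulled harder toward the synthetic value
gamma = sigma2_v / (sigma2_v + psi)
eblup = gamma * direct + (1 - gamma) * synthetic
print(eblup)
```

Area 2, with the smallest sampling variance, keeps most of its direct estimate, while area 3 is shrunk strongly toward the model.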

Installation

pip install samplics

Python 3.10 or newer is required, and the main dependencies are numpy, polars, and statsmodels.

Contribution

If you would like to contribute to the project, please read contributing to samplics.

License

MIT

Contact

Created by Mamadou S. Diallo - feel free to contact me!

Owner

  • Name: samplics
  • Login: samplics-org
  • Kind: organization
  • Location: United States of America

Sample Analytics

GitHub Events

Total
  • Create event: 7
  • Release event: 7
  • Issues event: 12
  • Watch event: 11
  • Issue comment event: 24
  • Push event: 67
  • Fork event: 1
Last Year
  • Create event: 7
  • Release event: 7
  • Issues event: 12
  • Watch event: 11
  • Issue comment event: 24
  • Push event: 67
  • Fork event: 1

Committers

Last synced: 7 months ago

All Time
  • Total Commits: 1,341
  • Total Committers: 4
  • Avg Commits per committer: 335.25
  • Development Distribution Score (DDS): 0.475
Past Year
  • Commits: 78
  • Committers: 1
  • Avg Commits per committer: 78.0
  • Development Distribution Score (DDS): 0.0
Top Committers
Name Email Commits
Mamadou S Diallo m****o@q****g 704
Mamadou S Diallo me@m****o 633
Fernando Irarrázaval c****i 3
Jerin Varghese j****3@g****m 1
Committer Domains (Top 20 + Academic)

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 38
  • Total pull requests: 29
  • Average time to close issues: 5 months
  • Average time to close pull requests: about 14 hours
  • Total issue authors: 20
  • Total pull request authors: 4
  • Average comments per issue: 2.11
  • Average comments per pull request: 0.24
  • Merged pull requests: 25
  • Bot issues: 0
  • Bot pull requests: 1
Past Year
  • Issues: 7
  • Pull requests: 0
  • Average time to close issues: 8 days
  • Average time to close pull requests: N/A
  • Issue authors: 2
  • Pull request authors: 0
  • Average comments per issue: 1.14
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • kburchfiel (6)
  • kevintcaron (4)
  • soodoku (4)
  • MamadouSDiallo (4)
  • michaelwalshe (3)
  • rchew (2)
  • ivanjpg (1)
  • quillan86 (1)
  • jeanbaptisteb (1)
  • BArFinrod (1)
  • rifkigst (1)
  • jerinv (1)
  • msulwa (1)
  • mondjef (1)
  • JuanVeraF (1)
Pull Request Authors
  • MamadouSDiallo (26)
  • dependabot[bot] (2)
  • jerinv (1)
  • cuchoi (1)
Top Labels
Issue Labels
enhancement (3)
Pull Request Labels
dependencies (2) python (2)

Packages

  • Total packages: 1
  • Total downloads:
    • pypi 1,537 last-month
  • Total dependent packages: 1
  • Total dependent repositories: 1
  • Total versions: 120
  • Total maintainers: 1
pypi.org: samplics

Select, weight and analyze complex sample data

  • Versions: 120
  • Dependent Packages: 1
  • Dependent Repositories: 1
  • Downloads: 1,537 Last month
Rankings
Dependent packages count: 4.8%
Stargazers count: 9.7%
Forks count: 10.9%
Downloads: 10.9%
Average: 11.6%
Dependent repos count: 21.6%
Maintainers (1)
Last synced: 6 months ago

Dependencies

pyproject.toml pypi
  • black ^22.3 develop
  • certifi ^2022.6.15 develop
  • codecov ^2.1 develop
  • flake8 ^4.0 develop
  • ipykernel ^6.13 develop
  • ipython ^8.4 develop
  • isort ^5.10 develop
  • jupyterlab ^3.4 develop
  • mypy ^0.960 develop
  • nb-black-only ^1.0 develop
  • nbsphinx 0.7.1 develop
  • nox ^2022.1.7 develop
  • pylint ^2.9 develop
  • pytest ^7.1 develop
  • pytest-cov ^3.0 develop
  • recommonmark ^0.7 develop
  • sphinx ^4.5 develop
  • sphinx-autobuild ^2021.3 develop
  • sphinx_bootstrap_theme ^0.8 develop
  • matplotlib ^3.4
  • numpy ^1.21.4
  • pandas ^1.3
  • python >=3.8,<3.11
  • scipy ^1.7
  • statsmodels ^0.13.1
.github/workflows/pages.yml actions
  • actions/checkout v3 composite
  • actions/configure-pages v2 composite
  • actions/deploy-pages v1 composite
  • actions/upload-pages-artifact v1 composite
.github/workflows/tests.yml actions
  • abatilo/actions-poetry v2 composite
  • actions/checkout v2 composite
  • actions/setup-python v2 composite
.github/workflows/tests_coverage.yml actions
  • abatilo/actions-poetry v2 composite
  • actions/checkout v2 composite
  • actions/setup-python v2 composite
  • codecov/codecov-action v1 composite