samplics

samplics: a Python Package for selecting, weighting and analyzing data from complex sampling designs. - Published in JOSS (2021)

https://github.com/samplics-org/samplics

Science Score: 49.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 3 DOI reference(s) in README
  • Academic publication links
    Links to: joss.theoj.org
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (12.1%) to scientific vocabulary

Keywords

estimation officialstatistics sample samplics sampling survey variance weighting

Scientific Fields

Economics Social Sciences - 63% confidence
Last synced: 6 months ago

Repository

Select, weight and analyze complex sample data

Basic Info
Statistics
  • Stars: 68
  • Watchers: 3
  • Forks: 12
  • Open Issues: 20
  • Releases: 12
Topics
estimation officialstatistics sample samplics sampling survey variance weighting
Created about 6 years ago · Last pushed 6 months ago
Metadata Files
Readme Contributing Funding License

README.md

Sample Analytics

DOI

Help Shape the Future of Samplics

We are driven by community feedback and would love to hear from you! Please take a few minutes to complete our user survey.

➡️ Share Your Feedback

In large-scale surveys, complex random mechanisms are often used to select samples. Estimates derived from such samples must reflect the random mechanism. Samplics is a Python package that implements a set of sampling techniques for complex survey designs. These techniques are organized into the following four sub-packages.

Sampling provides a set of random selection techniques used to draw a sample from a population. It also provides procedures for calculating sample sizes. The sampling subpackage contains:

  • Sample size calculation and allocation: Wald and Fleiss methods for proportions.
  • Equal probability of selection: simple random sampling (SRS) and systematic selection (SYS).
  • Probability proportional to size (PPS): Systematic, Brewer's method, Hanurav-Vijayan method, Murphy's method, and Rao-Sampford's method.

Weighting provides the procedures for adjusting sample weights. More specifically, the weighting subpackage allows the following:

  • Weight adjustment due to nonresponse
  • Weight poststratification, calibration and normalization
  • Weight replication, i.e., Bootstrap, BRR, and Jackknife

Estimation provides methods for estimating the parameters of interest with uncertainty measures that are consistent with the sampling design. The estimation subpackage implements the following types of estimation methods:

  • Taylor-based, also called linearization methods
  • Replication-based estimation, i.e., Bootstrap, BRR, and Jackknife
  • Regression-based, e.g., generalized regression (GREG)

Small Area Estimation (SAE). When the sample size is not large enough to produce reliable/stable domain-level estimates, SAE techniques can be used to model the output variable of interest to produce domain-level estimates. This subpackage provides area-level and unit-level SAE methods.

For more details, visit https://samplics-org.github.io/samplics/

Usage

Let's assume that we have a population and we would like to select a sample from it. The goal is to calculate the sample size for an expected proportion of 0.80 with a precision (half confidence interval) of 0.10.

```python
from samplics.utils.types import SizeMethod, PopParam
from samplics.sampling import SampleSize

sample_size = SampleSize(param=PopParam.prop, method=SizeMethod.wald)
sample_size.calculate(target=0.80, half_ci=0.10)
```
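Under the Wald method, the required sample size for a proportion follows the standard formula n = ⌈deff · z² · p(1−p) / e²⌉. As a quick plain-Python sanity check of the numbers above (not using samplics, and ignoring any finite-population correction):

```python
import math

def wald_sample_size(p: float, half_ci: float, z: float = 1.96, deff: float = 1.0) -> int:
    """Wald sample size for a proportion: n = ceil(deff * z^2 * p * (1 - p) / e^2)."""
    return math.ceil(deff * z**2 * p * (1 - p) / half_ci**2)

n = wald_sample_size(p=0.80, half_ci=0.10)
print(n)  # 62
```

So roughly 62 respondents are needed for an expected proportion of 0.80 with a ±0.10 half confidence interval at the 95% level.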

Furthermore, the population is located in four natural regions, i.e., North, South, East, and West. We may want to calculate sample sizes based on region-specific requirements, e.g., expected proportions, desired precisions, and associated design effects.

```python
from samplics.utils.types import SizeMethod, PopParam
from samplics.sampling import SampleSize

sample_size = SampleSize(param=PopParam.prop, method=SizeMethod.wald, strat=True)

expected_proportions = {"North": 0.95, "South": 0.70, "East": 0.30, "West": 0.50}
half_ci = {"North": 0.30, "South": 0.10, "East": 0.15, "West": 0.10}
deff = {"North": 1, "South": 1.5, "East": 2.5, "West": 2.0}

sample_size = SampleSize(param=PopParam.prop, method=SizeMethod.fleiss, strat=True)
sample_size.calculate(target=expected_proportions, half_ci=half_ci, deff=deff)
```
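For intuition, here is the simpler Wald analogue computed per stratum with the same inputs. This is a plain-Python sketch, not samplics code, and the Fleiss method used above adds a continuity correction, so samplics' results will differ somewhat:

```python
import math

z = 1.96  # 95% confidence level
expected_proportions = {"North": 0.95, "South": 0.70, "East": 0.30, "West": 0.50}
half_ci = {"North": 0.30, "South": 0.10, "East": 0.15, "West": 0.10}
deff = {"North": 1, "South": 1.5, "East": 2.5, "West": 2.0}

# Per-stratum Wald sample size, inflated by the stratum's design effect
sizes = {
    s: math.ceil(deff[s] * z**2 * p * (1 - p) / half_ci[s] ** 2)
    for s, p in expected_proportions.items()
}
print(sizes)
```

Note how the South and West strata, with tight precision targets and design effects above 1, dominate the total sample size.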

To select a sample of primary sampling units using a PPS method, we can use code similar to the snippet below. Note that we first use the datasets module to import the example dataset.

```python
# First we import the example dataset
from samplics.datasets import load_psu_frame

psu_frame_dict = load_psu_frame()
psu_frame = psu_frame_dict["data"]

# Code for the sample selection
from samplics.sampling import SampleSelection
from samplics.utils import SelectMethod

psu_sample_size = {"East": 3, "West": 2, "North": 2, "South": 3}
pps_design = SampleSelection(method=SelectMethod.pps_sys, strat=True, wr=False)

psu_frame["psu_prob"] = pps_design.inclusion_probs(
    psu_frame["cluster"],
    psu_sample_size,
    psu_frame["region"],
    psu_frame["number_households_census"],
)
```
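For PPS systematic selection, the inclusion probability of a cluster within a stratum is proportional to its measure of size: π_i = n · s_i / Σ_j s_j. A minimal sketch with hypothetical sizes (an illustration of the formula, not samplics' implementation):

```python
import numpy as np

# Hypothetical measures of size (e.g., census households) for four clusters in one stratum
sizes = np.array([100, 300, 200, 400])
n = 2  # number of clusters to select in the stratum

# PPS inclusion probabilities: pi_i = n * size_i / total_size
probs = n * sizes / sizes.sum()
print(probs)  # [0.2 0.6 0.4 0.8]
```

Larger clusters get proportionally larger selection probabilities, and the probabilities sum to n, the number of clusters drawn.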

The initial weighting step is to obtain the design sample weights. This example shows a simple two-stage sampling design.

```python
import pandas as pd

from samplics.datasets import load_psu_sample, load_ssu_sample
from samplics.weighting import SampleWeight

# Load PSU sample data
psu_sample_dict = load_psu_sample()
psu_sample = psu_sample_dict["data"]

# Load SSU sample data
ssu_sample_dict = load_ssu_sample()
ssu_sample = ssu_sample_dict["data"]

full_sample = pd.merge(
    psu_sample[["cluster", "region", "psu_prob"]],
    ssu_sample[["cluster", "household", "ssu_prob"]],
    on="cluster",
)

full_sample["inclusion_prob"] = full_sample["psu_prob"] * full_sample["ssu_prob"]
full_sample["design_weight"] = 1 / full_sample["inclusion_prob"]
```
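The arithmetic behind those last two steps: the overall inclusion probability of a household is the product of its stage-wise selection probabilities, and the design weight is the reciprocal. With hypothetical probabilities:

```python
# Hypothetical stage-wise selection probabilities for one household
psu_prob = 0.25  # first-stage (cluster) probability
ssu_prob = 0.50  # second-stage (household within cluster) probability

inclusion_prob = psu_prob * ssu_prob  # overall probability of ending up in the sample
design_weight = 1 / inclusion_prob    # number of population households this one represents
print(design_weight)  # 8.0
```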

To adjust the design sample weight for nonresponse, we can use code similar to:

```python
import numpy as np

from samplics.weighting import SampleWeight

# Simulate response status
np.random.seed(7)
full_sample["response_status"] = np.random.choice(
    ["ineligible", "respondent", "non-respondent", "unknown"],
    size=full_sample.shape[0],
    p=(0.10, 0.70, 0.15, 0.05),
)

# Map custom response statuses to the generic samplics statuses
status_mapping = {
    "in": "ineligible",
    "rr": "respondent",
    "nr": "non-respondent",
    "uk": "unknown",
}

# Adjust sample weights for nonresponse
full_sample["nr_weight"] = SampleWeight().adjust(
    samp_weight=full_sample["design_weight"],
    adj_class=full_sample["region"],
    resp_status=full_sample["response_status"],
    resp_dict=status_mapping,
)
```
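Conceptually, a nonresponse adjustment inflates respondents' weights within each adjustment class so that they also carry the weight of eligible nonrespondents. The sketch below illustrates the idea on toy data; it is a simplification of what `SampleWeight().adjust` does, which additionally redistributes unknown-eligibility cases:

```python
import pandas as pd

# Toy data: one adjustment class with equal design weights
df = pd.DataFrame({
    "design_weight": [10.0, 10.0, 10.0, 10.0],
    "status": ["respondent", "respondent", "non-respondent", "ineligible"],
})

eligible = df["status"].isin(["respondent", "non-respondent"])
respondent = df["status"] == "respondent"

# Respondents absorb the weight of eligible nonrespondents in their class
factor = df.loc[eligible, "design_weight"].sum() / df.loc[respondent, "design_weight"].sum()
df["nr_weight"] = 0.0
df.loc[respondent, "nr_weight"] = df.loc[respondent, "design_weight"] * factor
print(df["nr_weight"].tolist())  # [15.0, 15.0, 0.0, 0.0]
```

The total eligible weight (30) is preserved: it is now carried entirely by the two respondents.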

To estimate population parameters using Taylor-based and replication-based methods, we can use code similar to:

```python
# Taylor-based estimation
from samplics.utils.types import PopParam, RepMethod
from samplics.datasets import load_nhanes2

nhanes2_dict = load_nhanes2()
nhanes2 = nhanes2_dict["data"]

from samplics.estimation import TaylorEstimator

zinc_mean_str = TaylorEstimator(PopParam.mean)
zinc_mean_str.estimate(
    y=nhanes2["zinc"],
    samp_weight=nhanes2["finalwgt"],
    stratum=nhanes2["stratid"],
    psu=nhanes2["psuid"],
    remove_nan=True,
)

# Replicate-based estimation
from samplics.datasets import load_nhanes2brr

nhanes2brr_dict = load_nhanes2brr()
nhanes2brr = nhanes2brr_dict["data"]

from samplics.estimation import ReplicateEstimator

ratio_wgt_hgt = ReplicateEstimator(RepMethod.brr, PopParam.ratio).estimate(
    y=nhanes2brr["weight"],
    samp_weight=nhanes2brr["finalwgt"],
    x=nhanes2brr["height"],
    rep_weights=nhanes2brr.loc[:, "brr_1":"brr_32"],
    remove_nan=True,
)
```
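The replication idea itself is compact: re-estimate the statistic on each set of replicate weights, then take the (scaled) mean squared deviation of the replicate estimates around the full-sample estimate. A minimal BRR-style sketch with hypothetical replicate estimates (an illustration of the formula, not samplics' implementation):

```python
import numpy as np

def brr_variance(full_estimate, replicate_estimates, fay_coef=0.0):
    """BRR variance: mean squared deviation of the replicate estimates
    around the full-sample estimate, rescaled when Fay's adjustment is used."""
    reps = np.asarray(replicate_estimates, dtype=float)
    return np.mean((reps - full_estimate) ** 2) / (1.0 - fay_coef) ** 2

# Hypothetical full-sample estimate and four replicate estimates of a mean
var = brr_variance(10.0, [9.5, 10.5, 9.8, 10.2])
print(var)  # ≈ 0.145
```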

To predict small area parameters, we can use code similar to:

```python
import numpy as np
import pandas as pd

# Area-level basic method
from samplics.utils.types import FitMethod
from samplics.datasets import load_expenditure_milk

milk_exp_dict = load_expenditure_milk()
milk_exp = milk_exp_dict["data"]

from samplics.sae import EblupAreaModel

fh_model_reml = EblupAreaModel(method=FitMethod.reml)
fh_model_reml.fit(
    yhat=milk_exp["direct_est"],
    X=pd.get_dummies(milk_exp["major_area"], drop_first=True),
    area=milk_exp["small_area"],
    error_std=milk_exp["std_error"],
    intercept=True,
    tol=1e-8,
)
fh_model_reml.predict(
    X=pd.get_dummies(milk_exp["major_area"], drop_first=True),
    area=milk_exp["small_area"],
    intercept=True,
)

# Unit-level basic method
from samplics.datasets import load_county_crop, load_county_crop_means

# Load County Crop sample data
county_crop_dict = load_county_crop()
county_crop = county_crop_dict["data"]

# Load County Crop Area Means sample data
county_crop_means_dict = load_county_crop_means()
county_crop_means = county_crop_means_dict["data"]

from samplics.sae import EblupUnitModel

eblup_bhf_reml = EblupUnitModel()
eblup_bhf_reml.fit(
    county_crop["corn_area"],
    county_crop[["corn_pixel", "soybeans_pixel"]],
    county_crop["county_id"],
)
eblup_bhf_reml.predict(
    Xmean=county_crop_means[["ave_corn_pixel", "ave_soybeans_pixel"]],
    area=np.linspace(1, 12, 12),
)
```
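At its core, the area-level (Fay-Herriot) EBLUP shrinks each direct estimate toward a regression-synthetic estimate, with shrinkage factor γ_d = σ²_v / (σ²_v + ψ_d). A minimal numeric sketch with hypothetical values (samplics' fitting code additionally estimates σ²_v by REML/ML):

```python
import numpy as np

# Hypothetical quantities for three small areas
direct = np.array([12.0, 8.0, 15.0])     # direct survey estimates
synthetic = np.array([10.0, 9.0, 14.0])  # regression-synthetic estimates (x'beta)
psi = np.array([4.0, 1.0, 9.0])          # known sampling variances of the direct estimates
sigma2_v = 2.0                           # (assumed) area random-effect variance

# Shrinkage: noisier direct estimates are pulled harder toward the synthetic value
gamma = sigma2_v / (sigma2_v + psi)
eblup = gamma * direct + (1 - gamma) * synthetic
print(eblup)
```

Area 2, with the smallest sampling variance, keeps most of its direct estimate, while area 3 is shrunk strongly toward the model.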

Installation

pip install samplics

Python 3.10 or newer is required, and the main dependencies are numpy, polars, and statsmodels.

Contribution

If you would like to contribute to the project, please read contributing to samplics.

License

MIT

Contact

Created by Mamadou S. Diallo - feel free to contact me!

Owner

  • Name: samplics
  • Login: samplics-org
  • Kind: organization
  • Location: United States of America

Sample Analytics

GitHub Events

Total
  • Create event: 7
  • Release event: 7
  • Issues event: 12
  • Watch event: 11
  • Issue comment event: 24
  • Push event: 67
  • Fork event: 1
Last Year
  • Create event: 7
  • Release event: 7
  • Issues event: 12
  • Watch event: 11
  • Issue comment event: 24
  • Push event: 67
  • Fork event: 1

Committers

Last synced: 7 months ago

All Time
  • Total Commits: 1,341
  • Total Committers: 4
  • Avg Commits per committer: 335.25
  • Development Distribution Score (DDS): 0.475
Past Year
  • Commits: 78
  • Committers: 1
  • Avg Commits per committer: 78.0
  • Development Distribution Score (DDS): 0.0
Top Committers
Name Email Commits
Mamadou S Diallo m****o@q****g 704
Mamadou S Diallo me@m****o 633
Fernando Irarrázaval c****i 3
Jerin Varghese j****3@g****m 1
Committer Domains (Top 20 + Academic)

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 38
  • Total pull requests: 29
  • Average time to close issues: 5 months
  • Average time to close pull requests: about 14 hours
  • Total issue authors: 20
  • Total pull request authors: 4
  • Average comments per issue: 2.11
  • Average comments per pull request: 0.24
  • Merged pull requests: 25
  • Bot issues: 0
  • Bot pull requests: 1
Past Year
  • Issues: 7
  • Pull requests: 0
  • Average time to close issues: 8 days
  • Average time to close pull requests: N/A
  • Issue authors: 2
  • Pull request authors: 0
  • Average comments per issue: 1.14
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • kburchfiel (6)
  • kevintcaron (4)
  • soodoku (4)
  • MamadouSDiallo (4)
  • michaelwalshe (3)
  • rchew (2)
  • ivanjpg (1)
  • quillan86 (1)
  • jeanbaptisteb (1)
  • BArFinrod (1)
  • rifkigst (1)
  • jerinv (1)
  • msulwa (1)
  • mondjef (1)
  • JuanVeraF (1)
Pull Request Authors
  • MamadouSDiallo (26)
  • dependabot[bot] (2)
  • jerinv (1)
  • cuchoi (1)
Top Labels
Issue Labels
enhancement (3)
Pull Request Labels
dependencies (2) python (2)

Packages

  • Total packages: 1
  • Total downloads:
    • pypi 1,537 last-month
  • Total dependent packages: 1
  • Total dependent repositories: 1
  • Total versions: 120
  • Total maintainers: 1
pypi.org: samplics

Select, weight and analyze complex sample data

  • Versions: 120
  • Dependent Packages: 1
  • Dependent Repositories: 1
  • Downloads: 1,537 Last month
Rankings
Dependent packages count: 4.8%
Stargazers count: 9.7%
Forks count: 10.9%
Downloads: 10.9%
Average: 11.6%
Dependent repos count: 21.6%
Maintainers (1)
Last synced: 6 months ago

Dependencies

pyproject.toml pypi
  • black ^22.3 develop
  • certifi ^2022.6.15 develop
  • codecov ^2.1 develop
  • flake8 ^4.0 develop
  • ipykernel ^6.13 develop
  • ipython ^8.4 develop
  • isort ^5.10 develop
  • jupyterlab ^3.4 develop
  • mypy ^0.960 develop
  • nb-black-only ^1.0 develop
  • nbsphinx 0.7.1 develop
  • nox ^2022.1.7 develop
  • pylint ^2.9 develop
  • pytest ^7.1 develop
  • pytest-cov ^3.0 develop
  • recommonmark ^0.7 develop
  • sphinx ^4.5 develop
  • sphinx-autobuild ^2021.3 develop
  • sphinx_bootstrap_theme ^0.8 develop
  • matplotlib ^3.4
  • numpy ^1.21.4
  • pandas ^1.3
  • python >=3.8,<3.11
  • scipy ^1.7
  • statsmodels ^0.13.1
.github/workflows/pages.yml actions
  • actions/checkout v3 composite
  • actions/configure-pages v2 composite
  • actions/deploy-pages v1 composite
  • actions/upload-pages-artifact v1 composite
.github/workflows/tests.yml actions
  • abatilo/actions-poetry v2 composite
  • actions/checkout v2 composite
  • actions/setup-python v2 composite
.github/workflows/tests_coverage.yml actions
  • abatilo/actions-poetry v2 composite
  • actions/checkout v2 composite
  • actions/setup-python v2 composite
  • codecov/codecov-action v1 composite