samplics
samplics: a Python Package for selecting, weighting and analyzing data from complex sampling designs. - Published in JOSS (2021)
Science Score: 49.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file (found)
- ✓ .zenodo.json file (found)
- ✓ DOI references (found 3 DOI reference(s) in README)
- ✓ Academic publication links (links to joss.theoj.org)
- ○ Committers with academic emails
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity (low similarity, 12.1%, to scientific vocabulary)
Keywords
Scientific Fields
Repository
Select, weight and analyze complex sample data
Basic Info
- Host: GitHub
- Owner: samplics-org
- License: mit
- Language: Python
- Default Branch: main
- Homepage: https://samplics-org.github.io/samplics/
- Size: 148 MB
Statistics
- Stars: 68
- Watchers: 3
- Forks: 12
- Open Issues: 20
- Releases: 12
Topics
Metadata Files
README.md

Sample Analytics
Help Shape the Future of Samplics
We are driven by community feedback and would love to hear from you! Please take a few minutes to complete our user survey.
In large-scale surveys, complex random mechanisms are often used to select samples. Estimates derived from such samples must reflect the random selection mechanism. samplics is a Python package that implements a set of sampling techniques for complex survey designs. These techniques are organized into the following four sub-packages.
Sampling provides a set of random selection techniques used to draw a sample from a population. It also provides procedures for calculating sample sizes. The sampling subpackage contains:
- Sample size calculation and allocation: Wald and Fleiss methods for proportions.
- Equal probability of selection: simple random sampling (SRS) and systematic selection (SYS).
- Probability proportional to size (PPS): Systematic, Brewer's method, Hanurav-Vijayan method, Murphy's method, and Rao-Sampford's method.
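To illustrate the mechanics behind systematic selection (SYS), here is a minimal NumPy sketch of the fixed-interval, random-start procedure. This is illustrative only and is not the samplics API:

```python
import numpy as np

def systematic_sample(N: int, n: int, rng=None):
    """Select n of N unit indices with a fixed step and a random start (SYS)."""
    rng = np.random.default_rng(rng)
    step = N / n                    # sampling interval
    start = rng.uniform(0, step)    # random start in [0, step)
    # every step-th position after the random start, floored to unit indices
    return np.floor(start + step * np.arange(n)).astype(int)

sample = systematic_sample(N=1000, n=50, rng=7)
```

Because the step is fixed, the selected indices are evenly spread across the frame, which is what gives SYS its implicit stratification property.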
Weighting provides the procedures for adjusting sample weights. More specifically, the weighting subpackage allows the following:
- Weight adjustment due to nonresponse
- Weight poststratification, calibration and normalization
- Weight replication, i.e. bootstrap, BRR, and jackknife
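As a concrete picture of what replicate weights look like, the delete-one jackknife (JK1) can be sketched as follows; this is a hypothetical helper for illustration, not the samplics implementation:

```python
import numpy as np

def jk1_replicate_weights(weights):
    """Delete-one jackknife (JK1): replicate r drops unit r and scales the
    remaining weights by n / (n - 1) so each replicate preserves the total."""
    w = np.asarray(weights, dtype=float)
    n = w.size
    reps = np.tile(w, (n, 1)) * n / (n - 1)  # one replicate per row
    np.fill_diagonal(reps, 0.0)              # the deleted unit gets weight 0
    return reps

rep = jk1_replicate_weights([10.0, 10.0, 10.0, 10.0])
```

With equal weights of 10 and n = 4, each replicate row sums to 40, matching the full-sample weight total; variance is then estimated from the spread of estimates across replicates.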
Estimation provides methods for estimating the parameters of interest with uncertainty measures that are consistent with the sampling design. The estimation subpackage implements the following types of estimation methods:
- Taylor-based, also called linearization methods
- Replication-based estimation, i.e. bootstrap, BRR, and jackknife
- Regression-based e.g. generalized regression (GREG)
Small Area Estimation (SAE). When the sample size is not large enough to produce reliable and stable domain-level estimates, SAE techniques can be used to model the variable of interest and produce domain-level estimates. This subpackage provides area-level and unit-level SAE methods.
For more details, visit https://samplics-org.github.io/samplics/
Usage
Let's assume that we have a population and we would like to select a sample from it. The goal is to calculate the sample size for an expected proportion of 0.80 with a precision (half confidence interval) of 0.10.
```python
from samplics.utils.types import SizeMethod, PopParam
from samplics.sampling import SampleSize

sample_size = SampleSize(param=PopParam.prop, method=SizeMethod.wald)
sample_size.calculate(target=0.80, half_ci=0.10)
```
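As a sanity check on the Wald method, the same size can be computed by hand from the textbook formula n = ceil(z² · p(1 − p) / e²) with a 95% normal quantile; this plain-Python arithmetic is illustrative and should agree with samplics' Wald result up to rounding conventions:

```python
import math

z = 1.959963984540054   # standard normal quantile for 95% confidence
p, e = 0.80, 0.10       # expected proportion and half-CI width (precision)

# Wald sample size for a proportion, rounded up to the next whole unit
n = math.ceil(z**2 * p * (1 - p) / e**2)  # → 62
```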
Furthermore, the population is located in four natural regions, i.e. North, South, East, and West. We could be interested in calculating sample sizes based on region-specific requirements, e.g. expected proportions, desired precisions, and associated design effects.
```python
from samplics.utils.types import SizeMethod, PopParam
from samplics.sampling import SampleSize

sample_size = SampleSize(param=PopParam.prop, method=SizeMethod.wald, strat=True)

expected_proportions = {"North": 0.95, "South": 0.70, "East": 0.30, "West": 0.50}
half_ci = {"North": 0.30, "South": 0.10, "East": 0.15, "West": 0.10}
deff = {"North": 1, "South": 1.5, "East": 2.5, "West": 2.0}

sample_size = SampleSize(param=PopParam.prop, method=SizeMethod.fleiss, strat=True)
sample_size.calculate(target=expected_proportions, half_ci=half_ci, deff=deff)
```
To select a sample of primary sampling units using the PPS method, we can use code similar to the snippet below. Note that we first use the datasets module to import the example dataset.
```python
# First we import the example dataset
from samplics.datasets import load_psu_frame

psu_frame_dict = load_psu_frame()
psu_frame = psu_frame_dict["data"]

# Code for the sample selection
from samplics.sampling import SampleSelection
from samplics.utils import SelectMethod

psu_sample_size = {"East": 3, "West": 2, "North": 2, "South": 3}
pps_design = SampleSelection(
    method=SelectMethod.pps_sys, strat=True, wr=False
)

psu_frame["psu_prob"] = pps_design.inclusion_probs(
    psu_frame["cluster"],
    psu_sample_size,
    psu_frame["region"],
    psu_frame["number_households_census"],
)
```
The initial weighting step is to obtain the design sample weights. Here we show a simple two-stage sampling design.
```python
import pandas as pd

from samplics.datasets import load_psu_sample, load_ssu_sample
from samplics.weighting import SampleWeight

# Load PSU sample data
psu_sample_dict = load_psu_sample()
psu_sample = psu_sample_dict["data"]

# Load SSU sample data
ssu_sample_dict = load_ssu_sample()
ssu_sample = ssu_sample_dict["data"]

full_sample = pd.merge(
    psu_sample[["cluster", "region", "psu_prob"]],
    ssu_sample[["cluster", "household", "ssu_prob"]],
    on="cluster",
)

full_sample["inclusion_prob"] = full_sample["psu_prob"] * full_sample["ssu_prob"]
full_sample["design_weight"] = 1 / full_sample["inclusion_prob"]
```
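The last two lines encode the Horvitz-Thompson logic: the overall inclusion probability of a household is the product of its stage-wise probabilities, and the design weight is the inverse of that product. A toy check with hypothetical probabilities:

```python
# Hypothetical stage-wise probabilities for one sampled household
psu_prob = 0.25                      # chance its cluster was selected
ssu_prob = 0.10                      # chance the household was selected within the cluster

inclusion_prob = psu_prob * ssu_prob     # overall probability: 0.025
design_weight = 1 / inclusion_prob       # the household represents 40 like it
```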
To adjust the design sample weight for nonresponse, we can use code similar to:
```python
import numpy as np

from samplics.weighting import SampleWeight

# Simulate response
np.random.seed(7)
full_sample["response_status"] = np.random.choice(
    ["ineligible", "respondent", "non-respondent", "unknown"],
    size=full_sample.shape[0],
    p=(0.10, 0.70, 0.15, 0.05),
)

# Map custom response statuses to the generic samplics statuses
status_mapping = {
    "in": "ineligible",
    "rr": "respondent",
    "nr": "non-respondent",
    "uk": "unknown",
}

# Adjust sample weights
full_sample["nr_weight"] = SampleWeight().adjust(
    samp_weight=full_sample["design_weight"],
    adj_class=full_sample["region"],
    resp_status=full_sample["response_status"],
    resp_dict=status_mapping,
)
```
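Conceptually, the nonresponse adjustment transfers the weight of eligible non-respondents to respondents within each adjustment class, so the eligible weight total is preserved. A simplified numeric sketch with hypothetical weights (it ignores the "unknown" category and ineligible-weight handling that the real adjustment also covers):

```python
import numpy as np

design_w = np.array([10.0, 10.0, 10.0, 10.0, 10.0])
status = np.array(["rr", "rr", "rr", "nr", "in"])  # respondent / non-respondent / ineligible

eligible = status != "in"
# Inflation factor: eligible weight total over respondent weight total (40 / 30)
factor = design_w[eligible].sum() / design_w[status == "rr"].sum()
# Respondents carry the non-respondents' share; others drop to zero
nr_w = np.where(status == "rr", design_w * factor, 0.0)
```

Each respondent's weight rises from 10 to 40/3 ≈ 13.33, so the adjusted weights still sum to the eligible total of 40.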
To estimate population parameters using Taylor-based and replication-based methods, we can use code similar to:
```python
# Taylor-based
from samplics.utils.types import PopParam, RepMethod
from samplics.datasets import load_nhanes2

nhanes2_dict = load_nhanes2()
nhanes2 = nhanes2_dict["data"]

from samplics.estimation import TaylorEstimator

zinc_mean_str = TaylorEstimator(PopParam.mean)
zinc_mean_str.estimate(
    y=nhanes2["zinc"],
    samp_weight=nhanes2["finalwgt"],
    stratum=nhanes2["stratid"],
    psu=nhanes2["psuid"],
    remove_nan=True,
)

# Replicate-based
from samplics.datasets import load_nhanes2brr

nhanes2brr_dict = load_nhanes2brr()
nhanes2brr = nhanes2brr_dict["data"]

from samplics.estimation import ReplicateEstimator

ratio_wgt_hgt = ReplicateEstimator(RepMethod.brr, PopParam.ratio).estimate(
    y=nhanes2brr["weight"],
    samp_weight=nhanes2brr["finalwgt"],
    x=nhanes2brr["height"],
    rep_weights=nhanes2brr.loc[:, "brr_1":"brr_32"],
    remove_nan=True,
)
```
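The idea behind replication-based variance estimation can be sketched with a naive bootstrap of a weighted mean under simple random sampling. The survey versions (as in samplics) resample whole PSUs within strata rather than individual units, but the principle of re-estimating on perturbed samples is the same:

```python
import numpy as np

rng = np.random.default_rng(42)
y = rng.normal(50.0, 10.0, size=200)   # simulated outcome
w = np.full(200, 5.0)                  # equal weights, for simplicity

def weighted_mean(y, w):
    return np.sum(w * y) / np.sum(w)

# Re-estimate the weighted mean on B resamples drawn with replacement
B = 500
reps = np.empty(B)
for b in range(B):
    idx = rng.integers(0, y.size, y.size)
    reps[b] = weighted_mean(y[idx], w[idx])

# The spread of the replicate estimates approximates the standard error
boot_se = reps.std(ddof=1)
```

For this setup the bootstrap standard error should land near the analytic value 10/√200 ≈ 0.71.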
To predict small area parameters, we can use code similar to:
```python
import numpy as np
import pandas as pd

# Area-level basic method
from samplics.utils.types import FitMethod
from samplics.datasets import load_expenditure_milk

milk_exp_dict = load_expenditure_milk()
milk_exp = milk_exp_dict["data"]

from samplics.sae import EblupAreaModel

fh_model_reml = EblupAreaModel(method=FitMethod.reml)
fh_model_reml.fit(
    yhat=milk_exp["direct_est"],
    X=pd.get_dummies(milk_exp["major_area"], drop_first=True),
    area=milk_exp["small_area"],
    error_std=milk_exp["std_error"],
    intercept=True,
    tol=1e-8,
)
fh_model_reml.predict(
    X=pd.get_dummies(milk_exp["major_area"], drop_first=True),
    area=milk_exp["small_area"],
    intercept=True,
)

# Unit-level basic method
from samplics.datasets import load_county_crop, load_county_crop_means

# Load County Crop sample data
county_crop_dict = load_county_crop()
county_crop = county_crop_dict["data"]

# Load County Crop Area Means sample data
county_crop_means_dict = load_county_crop_means()
county_crop_means = county_crop_means_dict["data"]

from samplics.sae import EblupUnitModel

eblup_bhf_reml = EblupUnitModel()
eblup_bhf_reml.fit(
    county_crop["corn_area"],
    county_crop[["corn_pixel", "soybeans_pixel"]],
    county_crop["county_id"],
)
eblup_bhf_reml.predict(
    Xmean=county_crop_means[["ave_corn_pixel", "ave_soybeans_pixel"]],
    area=np.linspace(1, 12, 12),
)
```
Installation
```shell
pip install samplics
```
Python 3.10 or newer is required, and the main dependencies are numpy, polars, and statsmodels.
Contribution
If you would like to contribute to the project, please read the contributing to samplics guide.
License
Contact
Created by Mamadou S. Diallo. Feel free to contact me!
Owner
- Name: samplics
- Login: samplics-org
- Kind: organization
- Location: United States of America
- Twitter: samplics
- Repositories: 2
- Profile: https://github.com/samplics-org
GitHub Events
Total
- Create event: 7
- Release event: 7
- Issues event: 12
- Watch event: 11
- Issue comment event: 24
- Push event: 67
- Fork event: 1
Last Year
- Create event: 7
- Release event: 7
- Issues event: 12
- Watch event: 11
- Issue comment event: 24
- Push event: 67
- Fork event: 1
Committers
Last synced: 7 months ago
Top Committers
| Name | Email | Commits |
|---|---|---|
| Mamadou S Diallo | m****o@q****g | 704 |
| Mamadou S Diallo | me@m****o | 633 |
| Fernando Irarrázaval | c****i | 3 |
| Jerin Varghese | j****3@g****m | 1 |
Committer Domains (Top 20 + Academic)
Issues and Pull Requests
Last synced: 6 months ago
All Time
- Total issues: 38
- Total pull requests: 29
- Average time to close issues: 5 months
- Average time to close pull requests: about 14 hours
- Total issue authors: 20
- Total pull request authors: 4
- Average comments per issue: 2.11
- Average comments per pull request: 0.24
- Merged pull requests: 25
- Bot issues: 0
- Bot pull requests: 1
Past Year
- Issues: 7
- Pull requests: 0
- Average time to close issues: 8 days
- Average time to close pull requests: N/A
- Issue authors: 2
- Pull request authors: 0
- Average comments per issue: 1.14
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- kburchfiel (6)
- kevintcaron (4)
- soodoku (4)
- MamadouSDiallo (4)
- michaelwalshe (3)
- rchew (2)
- ivanjpg (1)
- quillan86 (1)
- jeanbaptisteb (1)
- BArFinrod (1)
- rifkigst (1)
- jerinv (1)
- msulwa (1)
- mondjef (1)
- JuanVeraF (1)
Pull Request Authors
- MamadouSDiallo (26)
- dependabot[bot] (2)
- jerinv (1)
- cuchoi (1)
Top Labels
Issue Labels
Pull Request Labels
Packages
- Total packages: 1
- Total downloads: 1,537 last month (pypi)
- Total dependent packages: 1
- Total dependent repositories: 1
- Total versions: 120
- Total maintainers: 1
pypi.org: samplics
Select, weight and analyze complex sample data
- Documentation: https://samplics.readthedocs.io/
- License: mit
- Latest release: 0.4.55 (published 6 months ago)
Rankings
Maintainers (1)
Dependencies
- black ^22.3 develop
- certifi ^2022.6.15 develop
- codecov ^2.1 develop
- flake8 ^4.0 develop
- ipykernel ^6.13 develop
- ipython ^8.4 develop
- isort ^5.10 develop
- jupyterlab ^3.4 develop
- mypy ^0.960 develop
- nb-black-only ^1.0 develop
- nbsphinx 0.7.1 develop
- nox ^2022.1.7 develop
- pylint ^2.9 develop
- pytest ^7.1 develop
- pytest-cov ^3.0 develop
- recommonmark ^0.7 develop
- sphinx ^4.5 develop
- sphinx-autobuild ^2021.3 develop
- sphinx_bootstrap_theme ^0.8 develop
- matplotlib ^3.4
- numpy ^1.21.4
- pandas ^1.3
- python >=3.8,<3.11
- scipy ^1.7
- statsmodels ^0.13.1
- actions/checkout v3 composite
- actions/configure-pages v2 composite
- actions/deploy-pages v1 composite
- actions/upload-pages-artifact v1 composite
- abatilo/actions-poetry v2 composite
- actions/checkout v2 composite
- actions/setup-python v2 composite
- abatilo/actions-poetry v2 composite
- actions/checkout v2 composite
- actions/setup-python v2 composite
- codecov/codecov-action v1 composite