https://github.com/amazon-science/ssepy

Python package for stratifying, sampling, and estimating model performance with fewer annotations.

Science Score: 36.0%

This score indicates how likely this project is to be science-related based on various indicators:

○ CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
✓ .zenodo.json file: Found .zenodo.json file
○ DOI references
✓ Academic publication links: Links to arxiv.org
○ Academic email domains
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (14.5%) to scientific vocabulary

Keywords

estimation sampling statistical-inference statistics stratified-sampling

Last synced: 5 months ago

Repository

Basic Info

Host: GitHub
Owner: amazon-science
License: apache-2.0
Language: Python
Default Branch: main
Homepage:
Size: 381 KB

Statistics

Stars: 2
Watchers: 2
Forks: 0
Open Issues: 1
Releases: 0

Created over 1 year ago · Last pushed 12 months ago

Metadata Files

Readme Contributing License Code of conduct

`ssepy`: A Library for Efficient Model Evaluation through Stratification, Sampling, and Estimation in Python

Apache-2.0

Given an unlabeled dataset and model predictions, how can we select which instances to annotate in one go to maximize the precision of our estimates of model performance on the entire dataset?

ssepy helps you estimate the mean of any random variable across a large dataset. When the focus is on a model’s performance, it treats each sample’s performance as a random variable and aims to estimate the average (i.e., mean) performance over the entire dataset.

The main idea:

1. Predict: Obtain a proxy or predicted value for each sample (e.g., a model’s predicted performance on that sample).
2. Stratify: Use these proxies to group the samples into strata.
3. Sample: From each stratum, draw a subset of samples according to the chosen allocation method (proportional, Neyman, or others).
4. Annotate: Acquire ground-truth labels or real outcomes for the sampled subset.
5. Estimate: Compute the overall mean (e.g., the mean model performance) using an estimator such as Horvitz-Thompson or a difference estimator.

See our paper here for a technical overview of the framework.
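The five steps can be sketched without the library. The following is a minimal NumPy illustration, not ssepy's implementation: the function name `stratified_estimate`, the quantile-based stratification (a stand-in for clustering), and the `annotate` callback are all assumptions made for this example. It also assumes every stratum contains at least two samples.

```python
import numpy as np

def stratified_estimate(proxy, annotate, budget, n_strata=5, seed=0):
    """Stratify on a proxy, sample proportionally, and estimate the population mean.

    Hypothetical helper for illustration; not part of ssepy.
    """
    rng = np.random.default_rng(seed)
    N = proxy.shape[0]

    # Stratify: cut the proxy at equally spaced quantiles (stand-in for clustering).
    edges = np.quantile(proxy, np.linspace(0, 1, n_strata + 1)[1:-1])
    strata = np.searchsorted(edges, proxy, side="right")  # labels 0..n_strata-1

    estimate, variance = 0.0, 0.0
    for h in range(n_strata):
        idx = np.flatnonzero(strata == h)
        N_h = idx.size
        # Sample: proportional allocation, with at least two draws per stratum
        # so that a within-stratum variance can be estimated.
        n_h = min(N_h, max(2, round(budget * N_h / N)))
        sampled = rng.choice(idx, size=n_h, replace=False)
        # Annotate: ground truth is requested only for the sampled indices.
        y = annotate(sampled)
        # Estimate: weight each stratum mean by its share of the population.
        W_h = N_h / N
        estimate += W_h * y.mean()
        variance += W_h**2 * (1 - n_h / N_h) * y.var(ddof=1) / n_h
    return estimate, variance

# Predict: a noisy proxy Yh of the (normally unobserved) ground truth Y.
rng = np.random.default_rng(0)
Y = rng.normal(0, 1, 100_000)
Yh = Y + rng.normal(0, 0.1, Y.size)

est, var = stratified_estimate(Yh, lambda idx: Y[idx], budget=100)
print(est, var, Y.mean())
```

Because the proxy is strongly correlated with the ground truth, most of the variance in `Y` lies between strata rather than within them, which is why the stratified estimate is more precise than a simple random sample of the same size.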

Getting started

To install the package, run `pip install ssepy`.

Alternatively, clone the repo, `cd` into it, and run `pip install .`.

You may want to initialize a conda environment before running this operation.

Test your setup with the following example, which demonstrates data stratification, allocation of the annotation budget n via proportional allocation, sampling via stratified simple random sampling, and estimation using the Horvitz-Thompson estimator:

```python
import numpy as np
from sklearn.cluster import KMeans
from ssepy import ModelPerformanceEvaluator

np.random.seed(0)

# Generate data
N = 100000
Y = np.random.normal(0, 1, N)  # Ground truth

# Unobserved target
print(np.mean(Y))

n = 100  # Annotation budget

# 1. Proxy for ground truth
Yh = Y + np.random.normal(0, 0.1, N)
evaluator = ModelPerformanceEvaluator(Yh=Yh, budget=n)  # Initialize evaluator

# 2. Stratify on Yh
evaluator.stratify_data(
    clustering_algo=KMeans(n_clusters=5, random_state=0, n_init="auto"), X=Yh
)  # 5 strata

# 3. Allocate n with proportional allocation and sample
evaluator.allocate_budget(allocation_type="proportional")
sampled_idx = evaluator.sample()

# 4. Annotate
Yl = Y[sampled_idx]

# 5. Estimate target and variance of estimate
estimate, variance_estimate = evaluator.compute_estimate(Yl, estimator="ht")
print(estimate, variance_estimate)
```

For the difference estimator under simple random sampling, run

```python
evaluator = ModelPerformanceEvaluator(Yh=Yh, budget=n)  # initialize sampler
sampled_idx = evaluator.sample(sampling_method="srs")  # 3. sample
Yl = Y[sampled_idx]  # 4. annotate
estimate, variance_estimate = evaluator.compute_estimate(Yl, estimator="df")  # 5. estimate
print(estimate, variance_estimate)
```
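To make the difference estimator concrete, here is a minimal sketch of its textbook form under simple random sampling without replacement: the mean of the proxy over the whole dataset, corrected by the mean residual on the annotated sample. The function `difference_estimate` is a hypothetical helper written for this illustration, and the details of ssepy's `estimator="df"` may differ.

```python
import numpy as np

def difference_estimate(proxy, sampled_idx, labels):
    """Textbook difference estimator under SRS without replacement.

    Hypothetical helper for illustration; not part of ssepy.
    """
    N, n = proxy.shape[0], sampled_idx.shape[0]
    # Residuals between ground truth and proxy on the annotated sample.
    residuals = labels - proxy[sampled_idx]
    # Population mean of the proxy plus the estimated mean residual.
    estimate = proxy.mean() + residuals.mean()
    # Variance estimate with the finite-population correction (1 - n/N).
    variance = (1 - n / N) * residuals.var(ddof=1) / n
    return estimate, variance

rng = np.random.default_rng(0)
N, n = 100_000, 100
Y = rng.normal(0, 1, N)            # ground truth (normally unobserved)
Yh = Y + rng.normal(0, 0.1, N)     # proxy available for every sample
sampled_idx = rng.choice(N, size=n, replace=False)  # simple random sample

estimate, variance = difference_estimate(Yh, sampled_idx, Y[sampled_idx])
print(estimate, variance, Y.mean())
```

The variance depends on the spread of the residuals rather than on the spread of `Y` itself, so an accurate proxy yields a precise estimate even without stratification.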

See also some examples in the associated folder.

Features

The supported sampling designs are simple random sampling without replacement (SRS), stratified simple random sampling without replacement (SSRS) with proportional or optimal (Neyman) allocation, and Poisson sampling. Every sampling design has associated Horvitz-Thompson (HT) and difference (DF) estimators.
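The two allocation rules differ only in how the budget is split across strata. The sketch below is an illustration under assumed inputs, not ssepy's code: the helper names, stratum sizes, and standard deviations are made up for the example. Proportional allocation follows stratum sizes, while Neyman allocation follows size times within-stratum standard deviation. In practice the standard deviations are unknown and must be estimated, for example from the proxy, and the fractional allocations still need rounding to integers.

```python
import numpy as np

def proportional_allocation(sizes, budget):
    """Split the budget in proportion to stratum sizes N_h."""
    sizes = np.asarray(sizes, dtype=float)
    return budget * sizes / sizes.sum()

def neyman_allocation(sizes, stds, budget):
    """Split the budget in proportion to N_h * S_h (size times std)."""
    weights = np.asarray(sizes, dtype=float) * np.asarray(stds, dtype=float)
    return budget * weights / weights.sum()

# Hypothetical strata: the smallest stratum is also the most variable.
sizes = [5000, 3000, 2000]
stds = [0.1, 0.5, 1.0]

print(proportional_allocation(sizes, 100))  # [50. 30. 20.]
print(neyman_allocation(sizes, stds, 100))  # [12.5 37.5 50. ]
```

Neyman allocation moves annotations toward the high-variance stratum, which is where additional labels reduce the variance of the overall estimate the most.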

Bugs and contributions

Feel free to reach out if you find a bug or would like other features to be implemented in the package.

Owner

Name: Amazon Science
Login: amazon-science
Kind: organization

Website: https://amazon.science
Twitter: AmazonScience
Repositories: 80
Profile: https://github.com/amazon-science

GitHub Events

Total

Watch event: 2
Delete event: 1
Issue comment event: 1
Push event: 5
Pull request event: 2
Create event: 2

Last Year

Watch event: 2
Delete event: 1
Issue comment event: 1
Push event: 5
Pull request event: 2
Create event: 2

Issues and Pull Requests

Last synced: 6 months ago

All Time

Total issues: 0
Total pull requests: 1
Average time to close issues: N/A
Average time to close pull requests: about 2 months
Total issue authors: 0
Total pull request authors: 1
Average comments per issue: 0
Average comments per pull request: 1.0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 1

Past Year

Issues: 0
Pull requests: 1
Average time to close issues: N/A
Average time to close pull requests: about 2 months
Issue authors: 0
Pull request authors: 1
Average comments per issue: 0
Average comments per pull request: 1.0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 1

Dependencies

.github/workflows/documentation.yml actions

actions/checkout v2.3.1 composite
actions/setup-python v2 composite
snok/install-poetry v1 composite

.github/workflows/lint.yml actions

actions/checkout v2 composite
actions/setup-python v2 composite

poetry.lock pypi

alabaster 0.7.16
babel 2.15.0
certifi 2024.7.4
charset-normalizer 3.3.2
colorama 0.4.6
docutils 0.21.2
idna 3.7
imagesize 1.4.1
jinja2 3.1.4
markupsafe 2.1.5
numpy 1.26.4
packaging 24.1
pygments 2.18.0
requests 2.32.3
snowballstemmer 2.2.0
sphinx 7.3.7
sphinxcontrib-applehelp 1.0.8
sphinxcontrib-devhelp 1.0.6
sphinxcontrib-htmlhelp 2.0.5
sphinxcontrib-jsmath 1.0.1
sphinxcontrib-qthelp 1.0.7
sphinxcontrib-serializinghtml 1.1.10
urllib3 2.2.2

pyproject.toml pypi

numpy ^1.25.2
python ^3.11
sphinx ^7.3.7
