hapi

https://github.com/lchen001/hapi

Science Score: 54.0%

This score indicates how likely this project is to be science-related based on various indicators:

✓ CITATION.cff file: Found CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
✓ .zenodo.json file: Found .zenodo.json file
○ DOI references
✓ Academic publication links: Links to arxiv.org
○ Academic email domains
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (15.4%) to scientific vocabulary

Last synced: 10 months ago

Repository

Basic Info

Host: GitHub
Owner: lchen001
License: apache-2.0
Language: Python
Default Branch: main
Size: 1010 KB

Statistics

Stars: 17
Watchers: 3
Forks: 2
Open Issues: 2
Releases: 0

Created about 4 years ago · Last pushed over 3 years ago

Metadata Files

Readme Contributing License Citation

-----

[![pre-commit](https://img.shields.io/badge/pre--commit-enabled-brightgreen?logo=pre-commit&logoColor=white)](https://github.com/pre-commit/pre-commit)
[![standard-readme compliant](https://img.shields.io/badge/readme%20style-standard-brightgreen.svg?style=flat-square)](https://github.com/RichardLitt/standard-readme)
[![Paper](http://img.shields.io/badge/paper-arxiv.2209.08443-B31B1B.svg)](https://arxiv.org/abs/2209.08443)
[![License](https://img.shields.io/badge/license-Apache%202-blue.svg)](LICENSE)
[![Conference](http://img.shields.io/badge/NeurIPS-2022-4b44ce.svg)]()

A longitudinal database of ML API predictions.

[**Getting Started**](#%EF%B8%8F-quickstart) | [**Website**](http://hapi.stanford.edu/) | [**Contributing**](CONTRIBUTING.md) | [**About**](#%EF%B8%8F-about)

💡 What is HAPI?

History of APIs (HAPI) is a large-scale, longitudinal database of commercial ML API predictions. It contains 1.7 million predictions collected from 2020 to 2022 and spanning APIs from Amazon, Google, IBM, and Microsoft. The database covers diverse machine learning tasks, including image tagging, speech recognition, and text mining.

⚡️ Quickstart

We provide a lightweight Python package for getting started with HAPI.

Read the guide below or follow along in Google Colab:

```bash
pip install "hapi @ git+https://github.com/lchen001/hapi@main"
```

Import the library and download the data, optionally specifying the directory for the download. If the directory is not specified, the data will be downloaded to ~/.hapi.

```python
import hapi

hapi.config.data_dir = "/path/to/data/dir"

hapi.download()
```

You can permanently set the data directory by adding the variable HAPI_DATA_DIR to your environment.
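The lookup order described above can be sketched as follows. This is only an illustration of the documented behavior, not the package's actual implementation: `resolve_data_dir` is a hypothetical helper, and the precedence (explicit argument, then `HAPI_DATA_DIR`, then `~/.hapi`) is an assumption.

```python
import os
from pathlib import Path

# Documented default location when no directory is configured.
DEFAULT_DATA_DIR = Path.home() / ".hapi"


def resolve_data_dir(explicit=None):
    """Pick a data directory (hypothetical helper, not part of hapi).

    Assumed precedence: explicit argument, then the HAPI_DATA_DIR
    environment variable, then the default ~/.hapi.
    """
    if explicit is not None:
        return Path(explicit).expanduser()
    env_value = os.environ.get("HAPI_DATA_DIR")
    if env_value:
        return Path(env_value).expanduser()
    return DEFAULT_DATA_DIR
```

Setting the variable in a shell profile (for example `export HAPI_DATA_DIR=/path/to/data/dir`) is what makes the choice persist across sessions.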

Once we've downloaded the database, we can list the available APIs, datasets, and tasks with hapi.summary(). This returns a Pandas DataFrame with columns task, dataset, api, date, path, cost_per_10k.

```python
df = hapi.summary()
```

To load the predictions into memory we use hapi.get_predictions(). The keyword arguments allow us to load predictions for a subset of tasks, datasets, apis and/or dates.

```python
predictions = hapi.get_predictions(task="mic", dataset="pascal", api=["google_mic", "ibm_mic"])
```

The predictions are returned as a dictionary mapping from "{task}/{dataset}/{api}/{date}" to lists of dictionaries, each with keys "example_id", "predicted_label", and "confidence". For example:

```python
{
    "mic/pascal/google_mic/20-10-28": [
        {
            'confidence': 0.9798267782,
            'example_id': '2011_000494',
            'predicted_label': ['bird', 'bird']
        },
        ...
    ],
    "mic/pascal/microsoft_mic/20-10-28": [...],
    ...
}
```
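Because the keys encode task, dataset, API, and date, it can be handy to split them apart and order the snapshots of each API chronologically. The sketch below works on any dictionary with keys in the format shown above. `parse_key` and `snapshots_by_api` are hypothetical helpers, and the date format is assumed to be YY-MM-DD.

```python
from collections import defaultdict
from datetime import datetime


def parse_key(key):
    """Split a "{task}/{dataset}/{api}/{date}" key into its four parts."""
    task, dataset, api, date_str = key.split("/")
    # Assumption: dates are encoded as YY-MM-DD, e.g. "20-10-28".
    snapshot = datetime.strptime(date_str, "%y-%m-%d").date()
    return task, dataset, api, snapshot


def snapshots_by_api(predictions):
    """Map (task, dataset, api) to its prediction keys, oldest first."""
    grouped = defaultdict(list)
    for key in predictions:
        task, dataset, api, snapshot = parse_key(key)
        grouped[(task, dataset, api)].append((snapshot, key))
    return {
        group: [key for _, key in sorted(entries)]
        for group, entries in grouped.items()
    }
```

With the keys grouped this way, the first and last entries for an API give the two snapshots to compare when looking for changes over time.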

To load the labels into memory we use hapi.get_labels(). The keyword arguments allow us to load labels for a subset of tasks and datasets.

```python
labels = hapi.get_labels(task="mic", dataset="pascal")
```

The labels are returned as a dictionary mapping from "{task}/{dataset}" to lists of dictionaries, each with keys "example_id" and "true_label".
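Since predictions and labels share the "example_id" key, the two lists can be joined to score an API. The following is a minimal sketch using made-up toy records rather than real HAPI data. The `accuracy` helper is hypothetical, and it assumes exact equality between "predicted_label" and "true_label" is the right notion of a match, which may not hold for multi-label tasks.

```python
def accuracy(predictions, labels):
    """Fraction of predictions whose label equals the true label.

    Hypothetical helper: only examples present in both lists are scored.
    """
    true_by_id = {row["example_id"]: row["true_label"] for row in labels}
    scored = [p for p in predictions if p["example_id"] in true_by_id]
    if not scored:
        raise ValueError("no overlapping example_id values")
    correct = sum(
        p["predicted_label"] == true_by_id[p["example_id"]] for p in scored
    )
    return correct / len(scored)


# Toy records, invented for illustration only.
toy_predictions = [
    {"example_id": "a", "predicted_label": "bird", "confidence": 0.9},
    {"example_id": "b", "predicted_label": "cat", "confidence": 0.6},
    {"example_id": "c", "predicted_label": "dog", "confidence": 0.8},
]
toy_labels = [
    {"example_id": "a", "true_label": "bird"},
    {"example_id": "b", "true_label": "dog"},
    {"example_id": "c", "true_label": "dog"},
]

toy_accuracy = accuracy(toy_predictions, toy_labels)
```

In practice the inputs would be one value from the predictions dictionary and the matching "{task}/{dataset}" value from the labels dictionary.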

💾 Manual Downloading

In this section, we discuss how to download the database without the HAPI Python API.

The database is stored in a GCP bucket named hapi-data. All model predictions are stored in hapi.tar.gz (Compressed size: 205.3MB, Full size: 1.2GB).

From the command line, you can download and extract the predictions with:

```bash
wget https://storage.googleapis.com/hapi-data/hapi.tar.gz && tar -xzvf hapi.tar.gz
```

However, we recommend downloading using the Python API as described above.
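For environments without wget or tar, the same two steps can be approximated with the Python standard library. This is a sketch rather than part of the hapi package: `download_and_extract` is a hypothetical helper, and the local archive name `hapi.tar.gz` is chosen to mirror the command above.

```python
import tarfile
import urllib.request
from pathlib import Path

# Archive URL taken from the wget command above.
HAPI_ARCHIVE_URL = "https://storage.googleapis.com/hapi-data/hapi.tar.gz"


def download_and_extract(url, dest_dir):
    """Download a .tar.gz archive into dest_dir and unpack it there."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    archive_path = dest / "hapi.tar.gz"
    # Fetch the archive (about 205 MB for the HAPI predictions).
    urllib.request.urlretrieve(url, archive_path)
    # Only extract archives from a source you trust.
    with tarfile.open(archive_path, "r:gz") as archive:
        archive.extractall(dest)
    return dest
```

Calling `download_and_extract(HAPI_ARCHIVE_URL, "/path/to/data/dir")` would then fetch and unpack the predictions into that directory.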

🌍 Datasets

In this section, we discuss how to download the benchmark datasets used in HAPI.

The predictions in HAPI are made on benchmark datasets from across the machine learning community. For example, HAPI includes predictions on PASCAL, a popular object detection dataset. Unfortunately, we lack the permissions required to redistribute these datasets, so we do not include the raw data in the download described above.

We provide instructions on how to download each of the datasets and, for a growing number of them, we provide automated scripts that can download the dataset. These scripts are implemented in the Meerkat Dataset Registry – a registry of machine learning datasets (similar to Torchvision Datasets).

To download a dataset and load it into memory, use hapi.get_dataset():

```python
import hapi

dp = hapi.get_dataset("pascal")
```

This returns a Meerkat DataPanel – a DataFrame-like object that houses the dataset. See the Meerkat User Guide for more information. The DataPanel will have an "example_id" column that corresponds to the "example_id" key in the outputs of `hapi.get_predictions()` and `hapi.get_labels()`.

If the dataset is not yet available through the Meerkat Dataset Registry, a ValueError will be raised containing instructions for manually downloading the dataset. For example:

```python
dp = hapi.get_dataset("cmd")

ValueError: Data download for 'cmd' not yet available for download through the HAPI Python API. Please download manually following the instructions below:

CMD is a spoken command recognition dataset.

It can be downloaded here: https://pyroomacoustics.readthedocs.io/en/pypi-release/pyroomacoustics.datasets.google_speech_commands.html.
```

✉️ About

HAPI was developed at Stanford in the Zou Group. Reach out to Lingjiao Chen (lingjiao [at] stanford [dot] edu) and Sabri Eyuboglu (eyuboglu [at] stanford [dot] edu) if you would like to get involved!

Owner

Login: lchen001
Kind: user

Repositories: 2
Profile: https://github.com/lchen001

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this benchmark, please cite it as below."
authors:
- family-names: Chen
  given-names: Lingjiao
- family-names: Eyuboglu
  given-names: Sabri
  orcid: "https://orcid.org/0000-0002-8412-0266"
- family-names: Jin
  given-names: Zhihua
- family-names: Ré
  given-names: Christopher
- family-names: Zaharia
  given-names: Matei
- family-names: Zou
  given-names: James
title: "hapi"
version: 1.0.0
doi: 10.5281/zenodo.1234
date-released: 2021-11-29
url: "https://github.com/lchen001/HAPI"

GitHub Events

Total

Watch event: 1

Last Year

Watch event: 1

Issues and Pull Requests

Last synced: about 1 year ago

All Time

Total issues: 1
Total pull requests: 1
Average time to close issues: N/A
Average time to close pull requests: N/A
Total issue authors: 1
Total pull request authors: 1
Average comments per issue: 0.0
Average comments per pull request: 0.0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 0
Pull requests: 0
Average time to close issues: N/A
Average time to close pull requests: N/A
Issue authors: 0
Pull request authors: 0
Average comments per issue: 0
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0


Top Authors

Issue Authors

yueyu1030 (1)

Pull Request Authors

TrellixVulnTeam (1)

Top Labels

Issue Labels

Pull Request Labels

Dependencies

Pipfile pypi

dcbench * develop
ipython * develop
twine * develop
dcbench *

Pipfile.lock pypi

144 dependencies
