shift15m

SHIFT15M: Fashion-specific dataset for set-to-set matching with several distribution shifts

https://github.com/st-tech/zozo-shift15m

Science Score: 67.0%

This score indicates how likely this project is to be science-related based on various indicators:

✓ CITATION.cff file: Found CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
✓ .zenodo.json file: Found .zenodo.json file
✓ DOI references: Found 1 DOI reference(s) in README
✓ Academic publication links: Links to: arxiv.org
○ Academic email domains
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (13.5%) to scientific vocabulary

Keywords

covariate-shift cvpr cvpr2023 dataset dataset-shifts datasets deep-learning distributional-shift fashion fill-in-the-blank fill-in-the-n-blank machine-learning research set-matching target-shift

Last synced: 6 months ago

Repository

Basic Info

Host: GitHub
Owner: st-tech
License: other
Language: Python
Default Branch: main
Homepage:
Size: 10.8 MB

Statistics

Stars: 172
Watchers: 64
Forks: 16
Open Issues: 9
Releases: 3

Created over 4 years ago · Last pushed over 2 years ago

Metadata Files

Readme License Citation

README.md

SHIFT15M: Fashion-specific dataset for set-to-set matching with several distribution shifts

[arXiv]
[CVPRW2023]
Accepted at the CVPR2023 workshop on CVFAD as an oral paper (acceptance rate = 18.5%).

Set-to-set matching is the problem of matching two different sets of items based on some criteria. When each item in a set is high-dimensional, such as an image, set-to-set matching is typically tackled with neural networks. Most machine learning-based set-to-set matching methods assume that the training and test data follow the same distribution. However, this assumption is often violated in real-world machine learning problems. In this paper, we propose SHIFT15M, a dataset that can be used to properly evaluate set-to-set matching models in situations where the distribution of data changes between training and testing. Benchmark experiments show that the performance of naive methods drops under distribution shift. In addition, we provide software to handle the SHIFT15M dataset in a very simple way. The URL for the software will appear after this manuscript is published.

We provide the Datasheet for SHIFT15M. This datasheet is based on the Datasheets for Datasets [1] template.

SHIFT15M is a large-scale dataset based on approximately 15 million items accumulated by the fashion search service IQON.

Installation

From PyPI

```bash
$ pip install shift15m
```

From source

```bash
$ git clone https://github.com/st-tech/zozo-shift15m.git
$ cd zozo-shift15m
$ poetry build
$ pip install dist/shift15m-xxxx-py3-none-any.whl
```

Download SHIFT15M dataset

Use Dataset class

You can download the SHIFT15M dataset as follows:

```python
from shift15m.datasets import NumLikesRegression

# Download the data into ./data on first use, then load the target-shift split.
dataset = NumLikesRegression(root="./data", download=True)
(x_train, y_train), (x_test, y_test) = dataset.load_dataset(target_shift=True)
```
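Once train and test targets are loaded, a quick way to see whether the target distribution has moved is to compare summary statistics of the two splits. The following is a minimal sketch only: `summarize`, `y_train_toy`, and `y_test_toy` are hypothetical names, the values are made up, and the targets are assumed to have been flattened into a plain one-dimensional sequence of numbers.

```python
from statistics import mean, pstdev


def summarize(targets):
    """Return (mean, population standard deviation) of numeric targets."""
    values = [float(t) for t in targets]
    return mean(values), pstdev(values)


# Hypothetical stand-ins for y_train and y_test; real values come from the dataset.
y_train_toy = [1.0, 2.0, 3.0, 4.0]
y_test_toy = [11.0, 12.0, 13.0, 14.0]

train_mean, train_std = summarize(y_train_toy)
test_mean, test_std = summarize(y_test_toy)

# A gap between the means that is large relative to the spread hints at target shift.
mean_gap = abs(test_mean - train_mean)
print(f"train mean={train_mean:.2f}, test mean={test_mean:.2f}, gap={mean_gap:.2f}")
```

This check only looks at the first two moments, so it can miss shifts that change the shape of the distribution while leaving the mean and spread alone.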

Download directly by using download scripts

Please download the dataset as follows:

```bash
$ bash scripts/download_all.sh
```

Tasks

The following tasks are now available:

| Tasks | Task type | Shift type | # of input dim | # of output dim |
| --- | --- | --- | --- | --- |
| NumLikesRegression | regression | target shift | (N, 25) | (N, 1) |
| SumPricesRegression | regression | covariate shift, target shift | (N, 1) | (N, 1) |
| ItemPriceRegression | regression | target shift | (N, 4096) | (N, 1) |
| ItemCategoryClassification | classification | target shift | (N, 4096) | (N, 7) |
| Set2SetMatching | set-to-set matching | covariate shift | (N, 4096)x(M, 4096) | (1) |
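To make the Set2SetMatching shapes concrete, the sketch below scores a pair of sets, one with N feature vectors and one with M, by averaging all pairwise dot products, which yields a single number. This is only an illustrative baseline under stated assumptions, not one of the benchmark models: `set_matching_score` and `dot` are hypothetical helpers, and toy 3-dimensional vectors stand in for the 4096-dimensional features.

```python
from itertools import product


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def set_matching_score(set_x, set_y):
    """Average pairwise dot product between two sets of feature vectors."""
    if not set_x or not set_y:
        raise ValueError("both sets must be non-empty")
    total = sum(dot(x, y) for x, y in product(set_x, set_y))
    return total / (len(set_x) * len(set_y))


# Toy 3-dimensional features stand in for the 4096-dimensional ones.
set_a = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]                    # N = 2
set_b = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # M = 3

score = set_matching_score(set_a, set_b)
print(score)
```

Because the score sums over all pairs, it does not depend on the order of items within either set, which is the basic property a set-to-set matching function needs.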

Benchmarks

As templates for numerical experiments on the SHIFT15M dataset, we have published experimental results for each task with several models.

Original Dataset Structure

The original dataset is maintained in json format, and a row consists of the following:

```
{
  "user":{"user_id":"xxxx", "fav_brand_ids":"xxxx,xx,..."},
  "like_num":"xx",
  "set_id":"xxx",
  "items":[
    {"price":"xxxx","item_id":"xxxxxx","category_id1":"xx","category_id2":"xxxxx"},
    ...
  ],
  "publish_date":"yyyy-mm-dd",
  "tags": "tag_a, tag_b, tag_c, ..."
}
```
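Since every field in a row is stored as a string, a consumer usually has to convert types and split the comma-separated fields before use. The sketch below shows one way to do that with the standard library. The sample row and the `parse_row` helper are hypothetical, all values are made up, and it assumes that `like_num` and `price` always hold integer strings, which may not be true for every row of the real data.

```python
import json

# Hypothetical row that follows the structure above; all values are made up.
raw_row = """
{
  "user": {"user_id": "1", "fav_brand_ids": "10,20,30"},
  "like_num": "5",
  "set_id": "100",
  "items": [
    {"price": "1200", "item_id": "7", "category_id1": "11", "category_id2": "1101"},
    {"price": "800", "item_id": "8", "category_id1": "12", "category_id2": "1201"}
  ],
  "publish_date": "2015-04-01",
  "tags": "tag_a, tag_b"
}
"""


def parse_row(text):
    """Convert one JSON row into typed Python values."""
    row = json.loads(text)
    return {
        "set_id": row["set_id"],
        "like_num": int(row["like_num"]),
        # Comma-separated strings become lists; empty entries are dropped.
        "fav_brand_ids": [b for b in row["user"]["fav_brand_ids"].split(",") if b],
        "prices": [int(item["price"]) for item in row["items"]],
        "tags": [t.strip() for t in row["tags"].split(",") if t.strip()],
        # The year is the first four characters of the yyyy-mm-dd date.
        "publish_year": int(row["publish_date"][:4]),
    }


parsed = parse_row(raw_row)
print(parsed["like_num"], sum(parsed["prices"]), parsed["tags"])
```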

Contributing

To learn more about making a contribution to SHIFT15M, please see the following materials:

License

The dataset itself is provided under a CC BY-NC 4.0 license, while the software in this repository is provided under the MIT license.

Dataset metadata

The following table is necessary for this dataset to be indexed by search engines such as Google Dataset Search.

| property | value |
| --- | --- |
| name | SHIFT15M Dataset |
| alternateName | SHIFT15M |
| alternateName | shift15m-dataset |
| url | https://github.com/st-tech/zozo-shift15m |
| sameAs | https://github.com/st-tech/zozo-shift15m |
| description | SHIFT15M is a multi-objective, multi-domain dataset which includes multiple dataset shifts. |
| provider | name: `ZOZO Research`, sameAs: `https://ja.wikipedia.org/wiki/ZOZO` |
| license | name: `CC BY-NC 4.0`, url: `https://github.com/st-tech/zozo-shift15m/blob/main/LICENSE.CC` |
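Dataset search engines typically read this kind of metadata as schema.org `Dataset` markup. As a rough sketch, the properties above could be serialized to JSON-LD as follows. The property values are taken from the table, but the `@context` value and the `@type` choices for `provider` and `license` are assumptions made for illustration, not something the repository states.

```python
import json

# Property values come from the metadata table; "@context" and the "@type"
# entries for provider and license are assumptions made for this sketch.
dataset_jsonld = {
    "@context": "https://schema.org/",
    "@type": "Dataset",
    "name": "SHIFT15M Dataset",
    "alternateName": ["SHIFT15M", "shift15m-dataset"],
    "url": "https://github.com/st-tech/zozo-shift15m",
    "sameAs": "https://github.com/st-tech/zozo-shift15m",
    "description": (
        "SHIFT15M is a multi-objective, multi-domain dataset "
        "which includes multiple dataset shifts."
    ),
    "provider": {
        "@type": "Organization",
        "name": "ZOZO Research",
        "sameAs": "https://ja.wikipedia.org/wiki/ZOZO",
    },
    "license": {
        "@type": "CreativeWork",
        "name": "CC BY-NC 4.0",
        "url": "https://github.com/st-tech/zozo-shift15m/blob/main/LICENSE.CC",
    },
}

jsonld_text = json.dumps(dataset_jsonld, indent=2)
print(jsonld_text)
```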

Errata

01/08/2022, added tags info (#187)

Papers using this dataset

Papadopoulos, Stefanos I., et al. "Multimodal Quasi-AutoRegression: Forecasting the visual popularity of new fashion products." arXiv preprint arXiv:2204.04014 (2022).
Papadopoulos, Stefanos, et al. Fashion Trend Analysis and Prediction Model. 1, Zenodo, 2021, doi:10.5281/zenodo.5795089.

References

[1] Gebru, Timnit, et al. "Datasheets for datasets." arXiv preprint arXiv:1803.09010 (2018).

Owner

Name: ZOZO, Inc.
Login: st-tech
Kind: organization

Website: https://corp.zozo.com/
Repositories: 30
Profile: https://github.com/st-tech

Citation (CITATION.cff)

cff-version: 1.0.0
message: "If you use SHIFT15M in your research, please cite it using these metadata."
abstract: The main motivation of the SHIFT15M project is to provide a dataset that contains natural dataset shifts collected from a web service that was actually in operation for several years. In addition, the SHIFT15M dataset has several types of dataset shifts, allowing us to evaluate the robustness of the model to different types of shifts (e.g., covariate shift and target shift).
authors:
  - family-names: Kimura
    given-names: Masanari
    orcid: https://orcid.org/0000-0002-9953-3469
    email: masanari.kimura@zozo.com
  - family-names: Nakamura
    given-names: Takuma
    orcid: https://orcid.org/0000-0001-7904-4724
  - family-names: Saito
    given-names: Yuki
    orcid: https://orcid.org/0000-0003-0492-414X
title: "SHIFT15M: Multiobjective Large-Scale Dataset with Distributional Shifts"
version: 1.0.0
date-released: 2021-08-20
license: Apache-2.0

GitHub Events

Total

Watch event: 8

Last Year

Watch event: 8

Packages

Total packages: 1
Total downloads:
- pypi 10 last-month

Total dependent packages: 0
Total dependent repositories: 1
Total versions: 5
Total maintainers: 1

pypi.org: shift15m

Large-scale multiobjective dataset with dataset shift.

Documentation: https://shift15m.readthedocs.io/
License: other
Latest release: 0.2.0
published over 3 years ago

Versions: 5
Dependent Packages: 0
Dependent Repositories: 1
Downloads: 10 Last month

Rankings

Dependent packages count: 10.1%

Dependent repos count: 21.6%

Average: 31.2%

Downloads: 61.8%

Maintainers (1)

shift15m

Last synced: 7 months ago

Dependencies

poetry.lock pypi

autoflake 1.4 develop
black 21.9b0 develop
click 8.0.3 develop
flake8 3.9.2 develop
importlib-metadata 4.8.1 develop
isort 5.9.3 develop
mccabe 0.6.1 develop
mypy-extensions 0.4.3 develop
pathspec 0.9.0 develop
platformdirs 2.4.0 develop
pycodestyle 2.7.0 develop
pyflakes 2.3.1 develop
regex 2021.10.8 develop
tomli 1.2.1 develop
typed-ast 1.4.3 develop
typing-extensions 3.10.0.2 develop
zipp 3.6.0 develop
alabaster 0.7.12
babel 2.9.1
beautifulsoup4 4.10.0
certifi 2021.10.8
charset-normalizer 2.0.7
colorama 0.4.4
cycler 0.10.0
docutils 0.17.1
furo 2021.10.9
idna 3.3
imagesize 1.2.0
jinja2 3.0.2
joblib 1.1.0
kiwisolver 1.3.2
markupsafe 2.0.1
matplotlib 3.4.3
numpy 1.21.1
packaging 21.0
pandas 1.3.4
pillow 9.0.1
pygments 2.10.0
pyparsing 2.4.7
python-dateutil 2.8.2
pytz 2021.3
requests 2.26.0
scikit-learn 0.24.2
scipy 1.6.1
seaborn 0.11.2
six 1.16.0
sklearn 0.0
snowballstemmer 2.1.0
soupsieve 2.2.1
sphinx 4.2.0
sphinxcontrib-applehelp 1.0.2
sphinxcontrib-devhelp 1.0.2
sphinxcontrib-htmlhelp 2.0.0
sphinxcontrib-jsmath 1.0.1
sphinxcontrib-qthelp 1.0.3
sphinxcontrib-serializinghtml 1.1.5
threadpoolctl 3.0.0
tqdm 4.62.3
urllib3 1.26.7

pyproject.toml pypi

autoflake ^1.4 develop
black ^21.7b0 develop
flake8 ^3.9.2 develop
isort ^5.9.3 develop
Sphinx ^4.0.2
furo ^2021.6.24-beta.37
matplotlib ^3.4.2
pandas ^1.2.4
python >=3.7.1,<4.0
requests ^2.26.0
seaborn ^0.11.1
sklearn ^0.0
tqdm ^4.62.0

.github/workflows/python-tests.yml actions

actions/checkout v2 composite
actions/setup-python v2 composite