https://github.com/birder-project/vision-data-curation

A mirror of https://gitlab.com/birder/vision-data-curation

Science Score: 26.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (13.0%) to scientific vocabulary

Keywords

computer-vision data-curation
Last synced: 5 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: birder-project
  • License: apache-2.0
  • Language: Python
  • Default Branch: main
  • Homepage:
  • Size: 183 KB
Statistics
  • Stars: 1
  • Watchers: 1
  • Forks: 0
  • Open Issues: 0
  • Releases: 0
Topics
computer-vision data-curation
Created 6 months ago · Last pushed 6 months ago
Metadata Files
Readme License

README.md

Vision Data Curation (VDC)

A lightweight framework for cleaning, filtering, and sampling large-scale image datasets. Built for computer vision researchers and practitioners who want higher-quality data with less manual effort.

Status

This project is in early development. Most features are functional, but APIs may still change.

  • Implemented Features: Input validation, Example-based filtering, Aesthetic filtering, NSFW filtering, Hierarchical sampling
  • Features in Progress: Duplicate removal, Rotation correction

Feedback and contributions are welcome.

Features

VDC provides modular tools for dataset cleanup:

  • Input validation - detect corrupt files, invalid formats, low resolution, or extreme aspect ratios
  • Example-based filtering - remove images similar to a set of unwanted examples
  • Image quality filtering - remove images based on aesthetic score or NSFW classification
  • Duplicate removal - identify and remove near-duplicate images from your dataset
  • Hierarchical K-Means sampling - select diverse, representative subsets from large datasets
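As an illustration of the input validation step, the resolution and aspect-ratio checks can be sketched in a few lines of Python. This is a minimal sketch, not VDC's implementation: the `SizeLimits` class, the `size_problems` function, and both threshold values are hypothetical, and detecting corrupt files would additionally require decoding the image.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SizeLimits:
    """Hypothetical thresholds; VDC's real defaults are not given in this README."""

    min_side: int = 64             # reject images whose shorter side is below this
    max_aspect_ratio: float = 4.0  # reject images more elongated than this


def size_problems(width: int, height: int, limits: SizeLimits = SizeLimits()) -> list[str]:
    """Return the reasons to reject an image; an empty list means it passes."""
    if width <= 0 or height <= 0:
        # Non-positive dimensions usually point to a corrupt or unreadable header.
        return ["invalid dimensions"]
    problems = []
    if min(width, height) < limits.min_side:
        problems.append("low resolution")
    if max(width, height) / min(width, height) > limits.max_aspect_ratio:
        problems.append("extreme aspect ratio")
    return problems


print(size_problems(640, 480))  # passes both checks
print(size_problems(2000, 40))  # too small on one side and too elongated
```

Returning a list of reasons rather than a boolean makes it easy to report why each file was rejected.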

Coming soon:

  • Rotation correction (correct 90°/180°/270° orientation errors)

The Curation Pipeline

```mermaid
flowchart LR
    A[Raw<br/>Dataset] --> V[Validation]
    V --> R[Rotation*]
    R --> D[Dedup]
    D --> E[Example<br/>Filter]
    E --> Q[Quality Filter<br/>Aesthetic/NSFW]
    Q --> S[Cluster-based<br/>Sampling]
    S --> F[Curated<br/>Dataset]

    U[Unwanted<br/>Examples] --> E
```

Note: stages marked with * are work in progress.
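Conceptually, the pipeline is a chain of stages, each of which receives the samples that survived the previous one. The sketch below shows that idea in plain Python. The `run_pipeline` helper and the two toy stages are hypothetical and are not part of VDC, where each step is run as a separate script. In particular, the toy dedup stage removes only exact repeats, not near-duplicates.

```python
from typing import Callable, Iterable

# A stage takes a list of sample IDs and returns the ones it keeps.
Stage = Callable[[list[str]], list[str]]


def run_pipeline(samples: Iterable[str], stages: list[tuple[str, Stage]]) -> list[str]:
    """Apply each stage in order and report how many samples survive it."""
    current = list(samples)
    for name, stage in stages:
        kept = stage(current)
        print(f"{name}: {len(current)} -> {len(kept)}")
        current = kept
    return current


# Toy stand-ins for real stages (hypothetical logic, for illustration only).
def drop_corrupt(items: list[str]) -> list[str]:
    return [s for s in items if not s.endswith(".bad")]


def drop_exact_duplicates(items: list[str]) -> list[str]:
    # dict.fromkeys keeps the first occurrence and preserves order.
    return list(dict.fromkeys(items))


curated = run_pipeline(
    ["a.jpg", "b.bad", "a.jpg", "c.jpg"],
    [("validation", drop_corrupt), ("dedup", drop_exact_duplicates)],
)
print(curated)
```

Running the cheaper stages first, as in the diagram, means the more expensive model-based filters see fewer images.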

Installation

From PyPI

```sh
pip install vision-data-curation
```

From Source

```sh
git clone https://gitlab.com/birder/vision-data-curation.git
cd vision-data-curation
pip install -e .
```

When working from the project root, scripts and configuration files can be run as if the package were fully installed.

Usage

Each step is a script under vdc.scripts.

Examples:

```sh
# Remove corrupt/invalid images
python -m vdc.scripts.sanitize_images data/raw_images/

# Filter based on "Unwanted examples"
python -m vdc.scripts.filter_by_examples data/embeddings.csv --examples bad_examples.csv
```

  • Run python -m vdc.scripts to see available scripts
  • Run python -m vdc.scripts.<script> --help for options
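To give a sense of what example-based filtering does with embeddings, here is a minimal sketch. It assumes cosine similarity and a fixed threshold, neither of which is stated in this README. The function names, the threshold value, and the in-memory data layout are hypothetical and do not reflect the script's actual interface or file format.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def keep_dissimilar(
    embeddings: dict[str, list[float]],
    unwanted: list[list[float]],
    threshold: float = 0.9,  # hypothetical cut-off
) -> list[str]:
    """Keep sample IDs whose closest unwanted example is below the threshold."""
    kept = []
    for sample_id, vector in embeddings.items():
        closest = max((cosine_similarity(vector, u) for u in unwanted), default=0.0)
        if closest < threshold:
            kept.append(sample_id)
    return kept


embeddings = {
    "cat.jpg": [1.0, 0.0],
    "logo.jpg": [0.0, 1.0],
    "logo_variant.jpg": [0.1, 0.99],
}
unwanted = [[0.0, 1.0]]
print(keep_dissimilar(embeddings, unwanted))  # only "cat.jpg" is far from the unwanted example
```

A lower threshold removes more aggressively, at the cost of discarding images that only loosely resemble the unwanted examples.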

Configuration:

  • Default settings live in vdc/conf/config.json
  • A config.json in your project root will take precedence (or pass --config to any script)
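The precedence described above can be pictured as follows. This is a sketch under stated assumptions, not VDC's loader: `load_config` is a hypothetical function, the keys in the demo are invented, an explicit path is assumed to win over the project-root file, and overrides are assumed to be shallow-merged over the defaults (VDC may merge nested keys or replace the file wholesale).

```python
import json
import tempfile
from pathlib import Path
from typing import Optional


def load_config(default_path: Path, project_root: Path, explicit: Optional[Path] = None) -> dict:
    """Load the defaults, then overlay an override file if one is found."""
    config = json.loads(default_path.read_text())
    if explicit is not None:
        override = explicit  # assumed: an explicit --config path wins
    else:
        override = project_root / "config.json"
    if override.is_file():
        # Assumed shallow merge: top-level keys in the override replace defaults.
        config.update(json.loads(override.read_text()))
    return config


# Demo with temporary files and made-up keys.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    defaults = root / "defaults.json"
    defaults.write_text(json.dumps({"min_side": 64, "threshold": 0.9}))
    (root / "config.json").write_text(json.dumps({"threshold": 0.8}))
    print(load_config(defaults, root))
```

In the demo, `threshold` comes from the project-root file while `min_side` falls back to the default.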

License

This project is licensed under the Apache-2.0 License - see the LICENSE file for details.

Owner

  • Name: Birder Project
  • Login: birder-project
  • Kind: organization

GitHub Events

Total
  • Push event: 5
  • Create event: 5
Last Year
  • Push event: 5
  • Create event: 5

Dependencies

pyproject.toml pypi
requirements/_requirements-dev.txt pypi
  • altair * development
  • bandit * development
  • black * development
  • build * development
  • bumpver * development
  • coverage * development
  • debugpy * development
  • flake8 * development
  • flake8-pep585 * development
  • invoke * development
  • ipython * development
  • isort * development
  • mypy * development
  • parameterized * development
  • pyarrow * development
  • pylint * development
  • safetensors * development
  • setuptools * development
  • twine * development
  • urllib3 * development
  • wheel * development
requirements/requirements-dev.txt pypi
requirements/requirements-notebooks.txt pypi
  • ipykernel *
  • ipywidgets *
  • jupyterlab *
  • jupyterlab-code-formatter *
  • jupytext *
  • nbconvert *
  • pickleshare *
requirements/requirements-pytorch-cpu.txt pypi
  • torch ==2.8.0
  • torchaudio ==2.8.0
  • torchvision ==0.23.0
requirements/requirements-pytorch-gpu.txt pypi
  • torch ==2.8.0
  • torchaudio ==2.8.0
  • torchvision ==0.23.0
requirements/requirements.txt pypi
  • Pillow >=11.0.0
  • birder *
  • matplotlib >=3.9.0
  • numpy >=2.2.0
  • polars >=1.31.0
  • pt-kmeans >=0.3.1
  • pyarrow >=20.0.0
  • tqdm >=4.67.0
  • webdataset >=0.2.111