https://github.com/birder-project/vision-data-curation
A mirror of https://gitlab.com/birder/vision-data-curation
Statistics
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
- Releases: 0
Vision Data Curation (VDC)
A lightweight framework for cleaning, filtering, and sampling large-scale image datasets. Built for computer vision researchers and practitioners who want higher-quality data with less manual effort.
Status
This project is in early development. Most features are functional, but APIs may still change.
- Implemented Features: Input validation, Example-based filtering, Aesthetic filtering, NSFW filtering, Hierarchical sampling
- Features in Progress: Duplicate removal, Rotation correction
Feedback and contributions are welcome.
Features
VDC provides modular tools for dataset cleanup:
- Input validation - detect corrupt files, invalid formats, low resolution, or extreme aspect ratios
- Example-based filtering - remove images similar to a set of unwanted examples
- Image Quality Filtering - remove images based on aesthetic score or NSFW classification
- Duplicate removal (in progress) - identify and remove near-duplicate images from your dataset
- Hierarchical K-Means sampling - select diverse, representative subsets from large datasets
Coming soon:
- Rotation correction (correct 90°/180°/270° orientation errors)
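To make the input-validation criteria concrete, here is a minimal sketch of the resolution and aspect-ratio checks. It is not VDC's implementation: the `SizeLimits` class, the `size_problems` function, and both threshold values are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SizeLimits:
    # Hypothetical thresholds for illustration; not VDC's defaults.
    min_side: int = 64
    max_aspect_ratio: float = 4.0


def size_problems(width: int, height: int, limits: SizeLimits = SizeLimits()) -> list[str]:
    """Return the reasons an image size would be rejected (empty list = acceptable)."""
    if width <= 0 or height <= 0:
        return ["invalid dimensions"]

    problems = []
    short_side, long_side = min(width, height), max(width, height)
    if short_side < limits.min_side:
        problems.append("low resolution")
    if long_side / short_side > limits.max_aspect_ratio:
        problems.append("extreme aspect ratio")
    return problems


print(size_problems(640, 480))   # acceptable size
print(size_problems(2000, 100))  # very wide banner
```

In practice the width and height would come from an image library such as Pillow (`Image.open(path).size`), and a file that fails to open at all would be treated as corrupt.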
The Curation Pipeline
```mermaid
flowchart LR
A[Raw<br/>Dataset] --> V[Validation]
V --> R[Rotation*]
R --> D[Dedup]
D --> E[Example<br/>Filter]
E --> Q[Quality Filter<br/>Aesthetic/NSFW]
Q --> S[Cluster-based<br/>Sampling]
S --> F[Curated<br/>Dataset]
U[Unwanted<br/>Examples] --> E
```
Note: * = WIP
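As a rough illustration of the final cluster-based sampling stage, the sketch below draws a bounded number of images from each cluster so that large clusters cannot dominate the curated set. It assumes cluster assignments have already been computed and covers a single clustering level only; it is not the hierarchical K-Means procedure VDC uses, and the `sample_per_cluster` function is hypothetical.

```python
import random
from collections import defaultdict


def sample_per_cluster(assignments: dict[str, int], per_cluster: int, seed: int = 0) -> list[str]:
    """Pick up to `per_cluster` items from every cluster, reproducibly."""
    rng = random.Random(seed)

    # Group item names by their cluster id.
    clusters: dict[int, list[str]] = defaultdict(list)
    for name, cluster_id in assignments.items():
        clusters[cluster_id].append(name)

    selected: list[str] = []
    for cluster_id in sorted(clusters):
        members = sorted(clusters[cluster_id])  # sort for deterministic sampling
        k = min(per_cluster, len(members))
        selected.extend(rng.sample(members, k))
    return selected


# Hypothetical assignments: three images in cluster 0, one in cluster 1.
assignments = {"a.jpg": 0, "b.jpg": 0, "c.jpg": 0, "d.jpg": 1}
subset = sample_per_cluster(assignments, per_cluster=2)
print(len(subset))
```

Capping the draw per cluster is what makes the subset diverse: a dense cluster of near-identical images contributes no more than a sparse one.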
Installation
From PyPI
```sh
pip install vision-data-curation
```
From Source
```sh
git clone https://gitlab.com/birder/vision-data-curation.git
cd vision-data-curation
pip install -e .
```
When you work directly from the project root, scripts and configuration files behave as if the package were fully installed.
Usage
Each step is a script under `vdc.scripts`.
Examples:
```sh
# Remove corrupt/invalid images
python -m vdc.scripts.sanitize_images data/raw_images/

# Filter based on "Unwanted examples"
python -m vdc.scripts.filter_by_examples data/embeddings.csv --examples bad_examples.csv
```
- Run `python -m vdc.scripts` to see available scripts
- Run `python -m vdc.scripts.<script> --help` for options
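The idea behind example-based filtering can be sketched in a few lines: compare each image embedding against the embeddings of the unwanted examples and drop anything that is too similar. This is only an illustrative sketch, not the script's actual logic; the `keep_dissimilar` function, the use of cosine similarity, and the `0.9` threshold are all assumptions.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def keep_dissimilar(
    embeddings: dict[str, list[float]],
    unwanted: list[list[float]],
    threshold: float = 0.9,  # assumed cutoff, for illustration only
) -> list[str]:
    """Return names whose best match among the unwanted examples is below the threshold."""
    kept = []
    for name, vector in embeddings.items():
        best = max((cosine(vector, u) for u in unwanted), default=0.0)
        if best < threshold:
            kept.append(name)
    return kept


# Toy 2-D embeddings; real embeddings would come from a vision model.
embeddings = {"a.jpg": [1.0, 0.0], "b.jpg": [0.0, 1.0], "c.jpg": [0.99, 0.1]}
print(keep_dissimilar(embeddings, unwanted=[[1.0, 0.0]]))
```

Lowering the threshold removes more images at the cost of more false positives, so it is worth inspecting a sample of the rejected images before committing to a value.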
Configuration:
- Default settings live in `vdc/conf/config.json`
- A `config.json` in your project root will take precedence (or pass `--config` to any script)
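The lookup order described above can be pictured with the following sketch. It is an assumption about the behavior, not VDC's code: the `resolve_config` and `load_config` helpers are hypothetical, an explicitly passed path is assumed to win over a project-root file, and the chosen file is assumed to replace the defaults rather than being merged with them.

```python
import json
from pathlib import Path
from typing import Optional


def resolve_config(default_path: Path, project_root: Path, explicit: Optional[Path] = None) -> Path:
    """Choose a config file: explicit path, then <project_root>/config.json, then the default."""
    if explicit is not None:
        return explicit
    local = project_root / "config.json"
    if local.is_file():
        return local
    return default_path


def load_config(path: Path) -> dict:
    """Read a JSON config file into a dictionary."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


# Example: prefer ./config.json when it exists, else fall back to the packaged default.
chosen = resolve_config(Path("vdc/conf/config.json"), Path.cwd())
print(chosen)
```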
License
This project is licensed under the Apache-2.0 License - see the LICENSE file for details.
Owner
- Name: Birder Project
- Login: birder-project
- Kind: organization
- Repositories: 1
- Profile: https://github.com/birder-project
GitHub Events
Total
- Push event: 5
- Create event: 5
Last Year
- Push event: 5
- Create event: 5
Dependencies
- altair * development
- bandit * development
- black * development
- build * development
- bumpver * development
- coverage * development
- debugpy * development
- flake8 * development
- flake8-pep585 * development
- invoke * development
- ipython * development
- isort * development
- mypy * development
- parameterized * development
- pyarrow * development
- pylint * development
- safetensors * development
- setuptools * development
- twine * development
- urllib3 * development
- wheel * development
- ipykernel *
- ipywidgets *
- jupyterlab *
- jupyterlab-code-formatter *
- jupytext *
- nbconvert *
- pickleshare *
- torch ==2.8.0
- torchaudio ==2.8.0
- torchvision ==0.23.0
- Pillow >=11.0.0
- birder *
- matplotlib >=3.9.0
- numpy >=2.2.0
- polars >=1.31.0
- pt-kmeans >=0.3.1
- pyarrow >=20.0.0
- tqdm >=4.67.0
- webdataset >=0.2.111