Retriever
Retriever: Data Retrieval Tool - Published in JOSS (2017)
Soundata
Soundata: Reproducible use of audio datasets - Published in JOSS (2024)
open-mastr
open-mastr: A Python Package to Download and Process the German Energy Registry Marktstammdatenregister - Published in JOSS (2024)
Crowsetta
Crowsetta: A Python tool to work with any format for annotating animal vocalizations and bioacoustics data. - Published in JOSS (2023)
WGS2NCBI - Toolkit for preparing genomes for submission to NCBI
WGS2NCBI - Toolkit for preparing genomes for submission to NCBI - Published in JOSS (2019)
tsp ("Teaspoon")
tsp ("Teaspoon"): A library for ground temperature data - Published in JOSS (2022)
faker
Faker is a Python package that generates fake data for you.
torchxrayvision
TorchXRayVision: A library of chest X-ray datasets and models. Classifiers, segmentation, and autoencoders.
ekpmeasure
Repository of analysis and computer control code for various experiments. The analysis module is designed to help the researcher wrangle large amounts of metadata.
mtg-jamendo-dataset
Metadata, scripts and baselines for the MTG-Jamendo dataset
tiny_qa_benchmark_pp
Tiny QA Benchmark++: a micro-benchmark suite (52-item gold + on-demand multilingual synthetic packs), generator CLI, and CI-ready eval harness for ultra-fast LLM smoke-testing and regression-catching.
the-plunging-flow-by-3d-les
The Plunging of Hyperpycnal Plumes on Tilted Bed by Three-Dimensional Large-Eddy Simulations
proteinworkshop
Benchmarking framework for protein representation learning. Includes a large number of pre-training and downstream task datasets, models and training/task utilities. (ICLR 2024)
dataset-phenotypes
Preparatory scripts for BIDS tabular phenotypic data in large neuroimaging datasets.
py-torchtext
Models, data loaders and abstractions for language processing, powered by PyTorch
vulntrain
A tool to generate datasets and models based on vulnerabilities descriptions from @Vulnerability-Lookup.
knowprompt
[WWW 2022] KnowPrompt: Knowledge-aware Prompt-tuning with Synergistic Optimization for Relation Extraction
interactive_data_editor
Software for interactively editing data in a graphical manner
lrebench
[EMNLP 2022 Findings] Towards Realistic Low-resource Relation Extraction: A Benchmark with Empirical Baseline Study
globalmatch
GlobalMatch: Registration of forest terrestrial point clouds by global matching of relative stem positions [ISPRS 2023]
awesome-remote-sensing-change-detection
List of datasets, codes, and contests related to remote sensing change detection
eval-suite
[ACL 2024] User-friendly evaluation framework: Eval Suite & Benchmarks: UHGEval, HaluEval, HalluQA, etc.
noisy-sentences-dataset
550K sentences in 5 European languages augmented with noise for training and evaluating spell correction tools or machine learning models.
yegor256/cam
Classes and Metrics (CaM): a dataset of Java classes from public open-source GitHub repositories
transformer-srl
Reimplementation of a BERT-based model (Shi et al., 2019), currently the state of the art for English SRL. This model also implements predicate disambiguation.
phishing-dataset
Phishing dataset with more than 88,000 instances and 111 features. Web application available at https://gregavrbancic.github.io/Phishing-Dataset/
@stdlib/datasets-spache-revised
A list of simple American-English words (revised Spache).
cvat
Annotate better with CVAT, the industry-leading data engine for machine learning. Used and trusted by teams at any scale, for data of any scale.
monitors4codegen
Code and data artifact for the NeurIPS 2023 paper "Monitor-Guided Decoding of Code LMs with Static Analysis of Repository Context". `multispy` is an LSP client library in Python intended for building applications around language servers.
@stdlib/datasets-harrison-boston-house-prices
A dataset derived from information collected by the US Census Service concerning housing in Boston, Massachusetts (1978).
datasets-herndon-venus-semidiameters
Fifteen observations of the vertical semidiameter of Venus, made by Lieutenant Herndon, with the meridian circle at Washington, in the year 1846.
@stdlib/datasets-harrison-boston-house-prices-corrected
A (corrected) dataset derived from information collected by the US Census Service concerning housing in Boston, Massachusetts (1978).
@stdlib/datasets-pace-boston-house-prices
A (corrected) dataset derived from information collected by the US Census Service concerning housing in Boston, Massachusetts (1978).
datasets-minard-napoleons-march
Data for Charles Joseph Minard's cartographic depiction of Napoleon's Russian campaign of 1812.
yeast-in-microstructures-dataset
Official and maintained implementation of the dataset paper "An Instance Segmentation Dataset of Yeast Cells in Microstructures" [EMBC 2023].
xfinder
[ICLR 2025] xFinder: Large Language Models as Automated Evaluators for Reliable Evaluation
aptv2
The official repo for the extension of [NeurIPS'22] "APT-36K: A Large-scale Benchmark for Animal Pose Estimation and Tracking": https://github.com/pandorgan/APT-36K
eurocropsml
EuroCropsML is a ready-to-use benchmark dataset for few-shot crop type classification using Sentinel-2 imagery.
pyslice
Data set templating library for model dataset creation and model running.
tyc-dataset
Official and maintained implementation of the dataset paper "The TYC Dataset for Understanding Instance-Level Semantics and Motions of Cells in Microstructures" [ICCVW 2023].
fitz-collection-raw-data
Raw data from the collections database in JSON and CSV format
https://github.com/atomashevic/pymadoc
Python package to download and combine parts of the MADOC dataset
thetis
Service to examine data processing pipelines (e.g., machine learning or deep learning pipelines) for uncertainty consistency (calibration), fairness, and other safety-relevant aspects.
@stdlib/datasets-cmudict
The Carnegie Mellon Pronouncing Dictionary (CMUdict).
@stdlib/datasets-us-states-abbr
A list of US state two-letter abbreviations in alphabetical order according to state name.
cppe5
Code for our paper CPPE-5 (Medical Personal Protective Equipment), a new challenging object detection dataset
netcdf-fortran
Official GitHub repository for netCDF-Fortran libraries, which depend on the netCDF C library. Install the netCDF C library first.
speech-to-intent-dataset
Dataset Release for Intent Classification from Speech
@stdlib/datasets-cdc-nchs-us-infant-mortality-bw-1915-2013
US infant mortality data, by race, from 1915 to 2013, as provided by the Centers for Disease Control and Prevention's National Center for Health Statistics.
@stdlib/datasets-cdc-nchs-us-births-1994-2003
US birth data from 1994 to 2003, as provided by the Centers for Disease Control and Prevention's National Center for Health Statistics.
@stdlib/datasets-cdc-nchs-us-births-1969-1988
US birth data from 1969 to 1988, as provided by the Centers for Disease Control and Prevention's National Center for Health Statistics.
@stdlib/datasets-nightingales-rose
Dataset for Nightingale's famous polar area diagram.