arkhe
Tools for cleaning rectangular data. This is a read-only mirror of https://codeberg.org/tesselle/arkhe
pytrack
A map-matching-based Python toolbox for vehicle trajectory reconstruction
https://github.com/johnkerl/miller
Miller is like awk, sed, cut, join, and sort for name-indexed data such as CSV, TSV, and tabular JSON
https://github.com/buchananja/dpyp
A convenience tool for small-scale data pipelines in Python
synr
An R package for handling synesthesia consistency test data. Explore, validate and summarize data.
rotating-photo-tree
An example lesson repository for use in lesson template screencasts
https://github.com/baimamboukar/python_data_cleaning
Data cleaning automation for emails in csv and excel files
https://github.com/erictleung/2017-new-coder-survey
Code to help clean and format the 2017 New Coder Survey by freeCodeCamp
https://github.com/erictleung/2018-new-coder-survey
Code to wrangle data from the 2018 New Coder Survey by freeCodeCamp
pydvl
pyDVL is a library of stable implementations of algorithms for data valuation and influence function computation
https://github.com/cdcgov/clean-genes
A Rust crate that automatically cleans up a gene alignment by trimming to the ORF and identifying and/or removing problematic sequences.
data2neo
Data2Neo is a library that simplifies the conversion of data in relational format to a graph knowledge database.
https://github.com/csu-agricultural-water-quality-program/als-data-cleaning-tool
A tool written in R that processes water analysis results exported from the ALS WEBTRIEVE™ data portal. Exported data are cleaned, merged, and written to archival (e.g., CSV) or visual (e.g., HTML) formats.
mierda
The Multidimensional Insufficient Effort Responding Detection Approach (mIERda) for Psychometric and Survey Data
equitystack
A structured repository of Python scripts and Jupyter notebooks for development sector data workflows, including public health, gender equity, women's economic empowerment (WEE), education, and MEL (Monitoring, Evaluation, and Learning). Includes plug-and-play templates, sample data, test coverage, and Colab-ready execution.
cleansumstats
Convert GWAS sumstat files into a common format with a common reference for positions, rsids and effect alleles.
fastqrepair
A pipeline that can be used to recover corrupted FASTQ.gz files, drop or fix noncompliant reads, remove unpaired reads, and reorder reads that became disordered
datalark
Like the mudlark finding treasures on the foreshore, the datalark seeks treasures hidden within messy data!
tutorials-early
Tutorials on reading, cleaning, and validating case data, and on converting line list data to incidence for visualizing epidemic curves.
https://github.com/OpenDCAI/DataFlow
Easy data preparation with the latest LLM-based operators and pipelines.
https://github.com/desbordante/desbordante-core
Desbordante is a high-performance data profiler that can discover many different patterns in data using various algorithms. It also supports running data cleaning scenarios with these algorithms. Desbordante has a console version and an easy-to-use web application.