llm_data_cleaner

https://github.com/codedthinking/llm_data_cleaner

Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

✓ CITATION.cff file: found
✓ codemeta.json file: found
✓ .zenodo.json file: found
○ DOI references
○ Academic publication links
○ Academic email domains
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: low similarity (11.4%) to scientific vocabulary

Last synced: 10 months ago

Repository

Basic Info

Host: GitHub
Owner: codedthinking
License: MIT
Language: Python
Default Branch: main
Size: 80.1 KB

Statistics

Stars: 0
Watchers: 2
Forks: 0
Open Issues: 4
Releases: 0

Created about 1 year ago · Last pushed 12 months ago

Metadata Files

Readme, License, Citation

LLM Data Cleaner

LLM Data Cleaner uses OpenAI models to turn messy text columns into well-structured data. It removes the need for complex regular expressions or manual parsing, and it checks that the output conforms to a schema you define.

Why use it?

Less manual work – delegate repetitive cleaning tasks to a language model.
Consistent results – validate responses with Pydantic models.
Batch processing – send rows in chunks to respect API rate limits.
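
The batching idea can be sketched in a few lines. The helper below, `iter_batches`, is hypothetical and not part of the package; it only illustrates splitting a column into fixed-size chunks while keeping each value's row position, which is presumably why the example schemas in this README carry an `index` field. The third sample value is made up for illustration.

```python
from typing import Iterator, List, Sequence, Tuple


def iter_batches(values: Sequence[str], batch_size: int) -> Iterator[List[Tuple[int, str]]]:
    """Yield lists of (row_index, value) pairs, at most batch_size per list."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(values), batch_size):
        stop = min(start + batch_size, len(values))
        yield [(i, values[i]) for i in range(start, stop)]


rows = ["Budapest Váci út 1", "1200 Vienna Mariahilfer Straße 10", "Paris"]
batches = list(iter_batches(rows, batch_size=2))
for batch in batches:
    # Each batch would become one API request in a real pipeline.
    print(batch)
```

Keeping the row index next to each value means responses can be matched back to their rows even if a batch comes back reordered or incomplete.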

Installation

Requires Python 3.9+.

```bash
pip install git+https://github.com/codedthinking/llm_data_cleaner.git
```

Or with Poetry:

```bash
poetry add git+https://github.com/codedthinking/llm_data_cleaner.git
```

Step by step

1. Create Pydantic models describing the cleaned values.
2. Define a dictionary of instructions mapping each column name to a prompt and a schema.
3. Instantiate DataCleaner with your OpenAI API key.
4. Load your raw CSV file with pandas.
5. Call clean_dataframe(df, instructions).
6. Inspect the returned DataFrame, which contains new cleaned_* columns.
7. Save or further process the cleaned data.
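
The last two steps need only pandas. The sketch below uses a stand-in DataFrame instead of a real clean_dataframe result, so the column names after the cleaned_ prefix and the cell values are assumptions made for illustration, not the package's documented output.

```python
import io

import pandas as pd

# Stand-in for the DataFrame returned by clean_dataframe; the names after
# the cleaned_ prefix are hypothetical.
cleaned = pd.DataFrame(
    {
        "address": ["Budapest Váci út 1"],
        "cleaned_city": ["Budapest"],
        "cleaned_country": [None],
    }
)

# Inspect only the columns added by the cleaner.
cleaned_cols = [c for c in cleaned.columns if c.startswith("cleaned_")]
print(cleaned[cleaned_cols])

# Save the result; passing a file path works the same way as this buffer.
buffer = io.StringIO()
cleaned.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
```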

Example: inline models

```python
from typing import Optional

import pandas as pd
from pydantic import BaseModel

from llm_data_cleaner import DataCleaner


# Optional[...] rather than "str | None" keeps the models valid on Python 3.9.
class AddressItem(BaseModel):
    index: int
    city: Optional[str]
    country: Optional[str]
    postal_code: Optional[str]


class TitleItem(BaseModel):
    index: int
    profession: Optional[str]


# Map each column to the prompt sent to the model and the schema it must satisfy.
instructions = {
    "address": {
        "prompt": "Extract city, country and postal code if present.",
        "schema": AddressItem,
    },
    "profession": {
        "prompt": "Normalize the profession to a standard job title.",
        "schema": TitleItem,
    },
}

cleaner = DataCleaner(api_key="YOUR_OPENAI_API_KEY")
raw_df = pd.DataFrame(
    {
        "address": ["Budapest Váci út 1", "1200 Vienna Mariahilfer Straße 10"],
        "profession": ["dev", "data eng"],
    }
)
cleaned = cleaner.clean_dataframe(raw_df, instructions)
print(cleaned)
```
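
To show what schema validation buys you, here is a minimal sketch that uses only Pydantic v2, without calling the package or the OpenAI API. The two payloads are hypothetical stand-ins for parsed model responses; how DataCleaner applies validation internally is not shown in this README.

```python
from typing import Optional

from pydantic import BaseModel, ValidationError


class AddressItem(BaseModel):
    index: int
    city: Optional[str]
    country: Optional[str]
    postal_code: Optional[str]


# Hypothetical payloads standing in for parsed model responses.
good = {"index": 0, "city": "Budapest", "country": None, "postal_code": None}
bad = {"index": "first", "city": "Vienna"}  # wrong type, missing fields

item = AddressItem.model_validate(good)
print(item.city)

try:
    AddressItem.model_validate(bad)
    rejected = False
except ValidationError as exc:
    # A malformed response is caught here instead of reaching the DataFrame.
    rejected = True
    print(f"rejected with {exc.error_count()} validation errors")
```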

Example: loading YAML instructions

```python
import pandas as pd

from llm_data_cleaner import DataCleaner, load_yaml_instructions

# Read the column-to-prompt/schema mapping from a file instead of defining it inline.
instructions = load_yaml_instructions("instructions.yaml")
cleaner = DataCleaner(api_key="YOUR_OPENAI_API_KEY", system_prompt="{column_prompt}")
raw_df = pd.read_csv("data.csv")
result = cleaner.clean_dataframe(raw_df, instructions)
```

load_yaml_instructions reads the same structure shown above from a YAML file so cleaning rules can be shared without modifying code.
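
Whether the instructions are written inline or loaded from a file, each column maps to a prompt and a schema. A quick sanity check of that layout might look like the sketch below; `check_instructions` is a hypothetical helper written for this illustration, not a function provided by the package.

```python
from typing import Any, Dict, List


def check_instructions(instructions: Dict[str, Dict[str, Any]]) -> List[str]:
    """Return a list of problems; an empty list means the layout looks right."""
    problems = []
    for column, spec in instructions.items():
        if not isinstance(spec, dict):
            problems.append(f"{column}: expected a mapping, got {type(spec).__name__}")
            continue
        for key in ("prompt", "schema"):
            if key not in spec:
                problems.append(f"{column}: missing '{key}'")
    return problems


example = {
    # object is only a placeholder for a real schema here.
    "address": {"prompt": "Extract city, country and postal code if present.", "schema": object},
    "profession": {"prompt": "Normalize the profession to a standard job title."},
}
print(check_instructions(example))  # ["profession: missing 'schema'"]
```

Running a check like this before sending any rows avoids spending API calls on a misconfigured column.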

Authors

Miklós Koren
Gergely Attila Kiss

Preferred citation

If you use LLM Data Cleaner in your research, please cite the project as specified in CITATION.cff.

License

MIT

Owner

Name: codedthinking
Login: codedthinking
Kind: organization

Website: http://learn.codedthinking.com
Twitter: codedthinking
Repositories: 1
Profile: https://github.com/codedthinking

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this package, please cite it."
title: "LLM Data Cleaner"
version: "0.4.3"
authors:
  - family-names: Koren
    given-names: Miklós
  - family-names: Kiss
    given-names: Gergely Attila
preferred-citation:
  type: software
  title: "LLM Data Cleaner"
  version: "0.4.3"
  url: "https://github.com/codedthinking/llm_data_cleaner"
  authors:
    - family-names: Koren
      given-names: Miklós
    - family-names: Kiss
      given-names: Gergely Attila

GitHub Events

Total

Issues event: 25
Issue comment event: 9
Push event: 23
Pull request review event: 12
Pull request review comment event: 5
Pull request event: 8
Create event: 5

Last Year

Issues event: 25
Issue comment event: 9
Push event: 23
Pull request review event: 12
Pull request review comment event: 5
Pull request event: 8
Create event: 5

Dependencies

pyproject.toml pypi

black ^23.0.0 develop
flake8 ^6.0.0 develop
isort ^5.0.0 develop
pytest ^7.0.0 develop
jsonschema ^4.0.0
openai ^1.0.0
pandas ^2.2.3
pydantic ^2.0.0
python >=3.9,<4.0
tqdm ^4.65.0
