etl-forge

https://github.com/kkartas/etlforge

Science Score: 54.0%

This score indicates how likely this project is to be science-related based on various indicators:

✓ CITATION.cff file: Found CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
✓ .zenodo.json file: Found .zenodo.json file
○ DOI references
✓ Academic publication links: Links to joss.theoj.org
○ Academic email domains
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (17.7%) to scientific vocabulary

Last synced: 6 months ago

Repository

Basic Info

Host: GitHub
Owner: kkartas
License: mit
Language: Python
Default Branch: main
Size: 572 KB

Statistics

Stars: 0
Watchers: 0
Forks: 0
Open Issues: 0
Releases: 0

Created 8 months ago · Last pushed 8 months ago

Metadata Files

Readme Changelog Contributing License Code of conduct Citation

ETLForge

A Python library for generating synthetic test data and validating ETL (Extract, Transform, Load) outputs. ETL processes are fundamental data workflows that extract data from various sources, transform it according to business rules, and load it into target systems like data warehouses or databases. ETLForge provides both command-line tools and library functions to help you create realistic test datasets and validate data quality throughout your ETL pipelines.

Features

Test Data Generator

Generate synthetic data based on YAML/JSON schema definitions
Support for multiple data types: int, float, string, date, category
Advanced constraints: ranges, uniqueness, nullable fields, categorical values
Integration with Faker for realistic string generation
Export to CSV or Excel formats

Data Validator

Validate CSV/Excel files against schema definitions
Comprehensive validation checks:
- Column existence
- Data type matching
- Value constraints (ranges, categories)
- Uniqueness validation
- Null value validation
- Date format validation
Generate detailed reports of invalid rows

Dual Interface

Command-line interface for quick operations
Python library for integration into existing workflows

Installation

Prerequisites

Python 3.9 or higher
pip package manager

Install from PyPI (Recommended)

```bash
pip install etl-forge
```

Install from Source

For development or latest features:

```bash
git clone https://github.com/kkartas/etl-forge.git
cd etl-forge
pip install -e ".[dev]"
```

Dependencies

Core dependencies (6 total, automatically installed):

- pandas>=1.3.0 - Data manipulation and analysis
- pyyaml>=5.4.0 - YAML parsing for schema files
- click>=8.0.0 - Command-line interface framework
- openpyxl>=3.0.0 - Excel file support
- numpy>=1.21.0 - Numerical computing
- psutil>=5.9.0 - System monitoring for benchmarks

Optional dependencies for enhanced features:

```bash
# For realistic data generation using Faker templates
# (quotes keep shells such as zsh from expanding the brackets)
pip install "etl-forge[faker]"

# For development (testing, linting, documentation)
pip install "etl-forge[dev]"
```

Verify Installation

```bash
# CLI verification (may require adding Scripts directory to PATH on Windows)
etl-forge --version

# Alternative CLI access (works on all platforms)
python -m etl_forge.cli --version

# Library verification
python -c "from etl_forge import DataGenerator, DataValidator; print('Installation verified')"
```

CLI Access Note

On some systems (especially Windows), the etl-forge command may not be directly accessible. In such cases, use:

```bash
python -m etl_forge.cli [command] [options]
```

Complete Example

For a comprehensive demonstration of ETLForge's capabilities, see the included example.py file:

```bash
# Run the complete example
python example.py
```

This example demonstrates:

- Schema-driven data generation with realistic data (using Faker)
- Data validation with the same schema
- Error detection and reporting
- Complete ETL testing workflow

Key snippet from example.py:

```python
from etl_forge import DataGenerator, DataValidator

# Single schema drives both generation and validation
schema = {
    "fields": [
        {"name": "customer_id", "type": "int", "unique": True, "range": {"min": 1, "max": 10000}},
        {"name": "name", "type": "string", "faker_template": "name"},
        {"name": "email", "type": "string", "unique": True, "faker_template": "email"},
        {"name": "purchase_amount", "type": "float", "range": {"min": 10.0, "max": 5000.0}, "nullable": True},
        {"name": "customer_tier", "type": "category", "values": ["Bronze", "Silver", "Gold", "Platinum"]},
    ]
}

# Generate test data
generator = DataGenerator(schema)
df = generator.generate_data(1000)
generator.save_data(df, 'customer_test_data.csv')

# Validate with the same schema
validator = DataValidator(schema)
result = validator.validate('customer_test_data.csv')
print(f"Validation passed: {result.is_valid}")
```

This demonstrates ETLForge's key advantage, a single schema with a dual purpose: the same schema definition drives both data generation and validation, which keeps test data and validation rules in sync.
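To make the "single schema, dual purpose" idea concrete, here is a minimal standard-library sketch. It is not ETLForge's implementation: `generate_rows` and `validate_rows` are hypothetical helpers written for this illustration, and they handle only the `int` and `category` field types.

```python
import random

# Illustrative schema in the same shape as the README examples.
SCHEMA = {
    "fields": [
        {"name": "customer_id", "type": "int", "unique": True,
         "range": {"min": 1, "max": 10000}},
        {"name": "customer_tier", "type": "category",
         "values": ["Bronze", "Silver", "Gold", "Platinum"]},
    ]
}


def generate_rows(schema, n, seed=0):
    """Hypothetical generator: build n rows that satisfy the schema."""
    rng = random.Random(seed)
    columns = {}
    for field in schema["fields"]:
        if field["type"] == "int":
            lo, hi = field["range"]["min"], field["range"]["max"]
            if field.get("unique"):
                # Sampling without replacement guarantees uniqueness.
                columns[field["name"]] = rng.sample(range(lo, hi + 1), n)
            else:
                columns[field["name"]] = [rng.randint(lo, hi) for _ in range(n)]
        elif field["type"] == "category":
            columns[field["name"]] = [rng.choice(field["values"]) for _ in range(n)]
    names = [f["name"] for f in schema["fields"]]
    return [{name: columns[name][i] for name in names} for i in range(n)]


def validate_rows(schema, rows):
    """Hypothetical validator: return a list of (row_index, field, problem)."""
    errors = []
    for field in schema["fields"]:
        name = field["name"]
        seen = set()
        for i, row in enumerate(rows):
            value = row.get(name)
            if field["type"] == "int":
                lo, hi = field["range"]["min"], field["range"]["max"]
                if not isinstance(value, int) or not lo <= value <= hi:
                    errors.append((i, name, "out of range or wrong type"))
            elif field["type"] == "category" and value not in field["values"]:
                errors.append((i, name, "unexpected category"))
            if field.get("unique"):
                if value in seen:
                    errors.append((i, name, "duplicate value"))
                seen.add(value)
    return errors


rows = generate_rows(SCHEMA, 50)
clean_errors = validate_rows(SCHEMA, rows)

# Corrupt one row to mimic a bug introduced by an ETL transform.
rows[3]["customer_tier"] = "Diamond"
corrupted_errors = validate_rows(SCHEMA, rows)
```

Because both functions read the same `SCHEMA`, freshly generated rows validate cleanly, and any deviation introduced afterwards is reported against the rule that produced the data.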

Quick Start

1. Create a Schema

Create a schema.yaml file defining your data structure:

```yaml
fields:
  - name: id
    type: int
    unique: true
    nullable: false
    range:
      min: 1
      max: 10000
  - name: name
    type: string
    nullable: false
    faker_template: name
  - name: department
    type: category
    nullable: false
    values:
      - Engineering
      - Marketing
      - Sales
```
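Schemas can also be written as JSON. The sketch below writes a JSON version of the schema above using only the standard library. It assumes the JSON layout mirrors the YAML key for key, which the README does not spell out, and it writes to a temporary directory so no existing file is overwritten.

```python
import json
import tempfile
from pathlib import Path

# Same three fields as the YAML schema, expressed as a Python dict.
schema = {
    "fields": [
        {"name": "id", "type": "int", "unique": True, "nullable": False,
         "range": {"min": 1, "max": 10000}},
        {"name": "name", "type": "string", "nullable": False,
         "faker_template": "name"},
        {"name": "department", "type": "category", "nullable": False,
         "values": ["Engineering", "Marketing", "Sales"]},
    ]
}

# Write to a temporary directory to avoid clobbering an existing schema file.
schema_path = Path(tempfile.mkdtemp()) / "schema.json"
schema_path.write_text(json.dumps(schema, indent=2), encoding="utf-8")

# Round-trip check: the file parses back to the same structure.
loaded = json.loads(schema_path.read_text(encoding="utf-8"))
field_names = [field["name"] for field in loaded["fields"]]
```

The resulting file path can then be passed wherever a schema path is expected, for example `--schema schema.json` on the command line.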

2. Generate Test Data

Command Line:

```bash
# Direct CLI command (if available)
etl-forge generate --schema schema.yaml --rows 500 --output sample.csv

# Alternative CLI access (works on all platforms)
python -m etl_forge.cli generate --schema schema.yaml --rows 500 --output sample.csv
```

Python Library:

```python
from etl_forge import DataGenerator

generator = DataGenerator('schema.yaml')
df = generator.generate_data(500)
generator.save_data(df, 'sample.csv')
```

3. Validate Data

Command Line:

```bash
# Direct CLI command (if available)
etl-forge check --input sample.csv --schema schema.yaml --report invalid_rows.csv

# Alternative CLI access (works on all platforms)
python -m etl_forge.cli check --input sample.csv --schema schema.yaml --report invalid_rows.csv
```

Python Library:

```python
from etl_forge import DataValidator

validator = DataValidator('schema.yaml')
result = validator.validate('sample.csv')
print(f"Validation passed: {result.is_valid}")
```

Schema Definition

Supported Field Types

Integer (`int`)

```yaml
- name: age
  type: int
  nullable: false
  range:
    min: 18
    max: 65
  unique: false
```

Float (`float`)

```yaml
- name: salary
  type: float
  nullable: true
  range:
    min: 30000.0
    max: 150000.0
  precision: 2
  null_rate: 0.1
```

String (`string`)

```yaml
- name: email
  type: string
  nullable: false
  unique: true
  length:
    min: 10
    max: 50
  faker_template: email  # Optional: uses Faker library
```

Date (`date`)

```yaml
- name: hire_date
  type: date
  nullable: false
  range:
    start: '2020-01-01'
    end: '2024-12-31'
  format: '%Y-%m-%d'
```

Category (`category`)

```yaml
- name: status
  type: category
  nullable: false
  values:
    - Active
    - Inactive
    - Pending
```

Schema Constraints

nullable: Allow null values (default: false)
unique: Ensure all values are unique (default: false)
range: Define min/max values for numeric types or start/end dates
values: List of allowed values for categorical fields
length: Min/max length for string fields
precision: Decimal places for float fields
format: Date format string (default: '%Y-%m-%d')
faker_template: Faker method name for realistic string generation
null_rate: Probability of null values when nullable: true (default: 0.1)
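As an illustration of how the `format` and `range` constraints interact for a date field, the following sketch checks string values with `datetime.strptime`. It is not ETLForge's validator: `check_date` is a hypothetical helper, and it assumes the range bounds are written in the same format as the field values, as in the `hire_date` example.

```python
from datetime import datetime


def check_date(value, fmt="%Y-%m-%d", start=None, end=None):
    """Return None if value is valid, else a short problem description."""
    try:
        parsed = datetime.strptime(value, fmt)
    except (TypeError, ValueError):
        # Non-string input or a string that does not match the format.
        return "does not match format"
    if start is not None and parsed < datetime.strptime(start, fmt):
        return "before range start"
    if end is not None and parsed > datetime.strptime(end, fmt):
        return "after range end"
    return None


# Field definition copied from the Date example in this README.
field = {
    "name": "hire_date",
    "type": "date",
    "range": {"start": "2020-01-01", "end": "2024-12-31"},
    "format": "%Y-%m-%d",
}

samples = ["2021-05-03", "03/05/2021", "2019-12-31", "2025-01-01"]
results = {
    s: check_date(s, field["format"], field["range"]["start"], field["range"]["end"])
    for s in samples
}
```

Here only `2021-05-03` passes. The second sample fails the format check, and the last two parse correctly but fall outside the range.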

Command Line Interface

Generate Data

```bash
# Direct CLI command (if available)
etl-forge generate [OPTIONS]

# Alternative CLI access (works on all platforms)
python -m etl_forge.cli generate [OPTIONS]

Options:
  -s, --schema PATH         Path to schema file (YAML or JSON) [required]
  -r, --rows INTEGER        Number of rows to generate (default: 100)
  -o, --output PATH         Output file path (CSV or Excel) [required]
  -f, --format [csv|excel]  Output format (auto-detected if not specified)
```

Validate Data

```bash
# Direct CLI command (if available)
etl-forge check [OPTIONS]

# Alternative CLI access (works on all platforms)
python -m etl_forge.cli check [OPTIONS]

Options:
  -i, --input PATH   Path to input data file [required]
  -s, --schema PATH  Path to schema file [required]
  -r, --report PATH  Path to save invalid rows report (optional)
  -v, --verbose      Show detailed validation errors
```
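When the check runs from a Python script (for example in CI), it can help to build the argument list programmatically. The helper below is a hypothetical convenience, not part of ETLForge. It only assembles the flags documented above and uses the `python -m etl_forge.cli` form so it does not depend on the `etl-forge` script being on PATH.

```python
import sys


def build_check_command(input_path, schema_path, report_path=None, verbose=False):
    """Assemble argv for `python -m etl_forge.cli check` (hypothetical helper)."""
    cmd = [
        sys.executable, "-m", "etl_forge.cli", "check",
        "--input", str(input_path),
        "--schema", str(schema_path),
    ]
    if report_path is not None:
        # --report is optional; include it only when a path is given.
        cmd += ["--report", str(report_path)]
    if verbose:
        cmd.append("--verbose")
    return cmd


cmd = build_check_command(
    "sample.csv", "schema.yaml", report_path="invalid_rows.csv", verbose=True
)
```

The list can be passed to `subprocess.run(cmd)`. This README does not document the command's exit codes, so inspect the output or the report file rather than relying on the return code.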

Create Example Schema

```bash
# Direct CLI command (if available)
etl-forge create-schema example_schema.yaml

# Alternative CLI access (works on all platforms)
python -m etl_forge.cli create-schema example_schema.yaml
```

Library Usage

Data Generation

```python
from etl_forge import DataGenerator

# Initialize with schema
generator = DataGenerator('schema.yaml')

# Generate data
df = generator.generate_data(1000)

# Save to file
generator.save_data(df, 'output.csv')

# Or do both in one step
df = generator.generate_and_save(1000, 'output.xlsx', 'excel')
```

Data Validation

```python
from etl_forge import DataValidator

# Initialize validator
validator = DataValidator('schema.yaml')

# Validate data
result = validator.validate('data.csv')

# Check results
if result.is_valid:
    print("Data is valid!")
else:
    print(f"Found {len(result.errors)} validation errors")
    print(f"Invalid rows: {len(result.invalid_rows)}")

# Generate report
result = validator.validate_and_report('data.csv', 'errors.csv')

# Print summary
validator.print_validation_summary(result)
```

Advanced Usage

```python
import pandas as pd

from etl_forge import DataGenerator, DataValidator

# Use schema as dictionary
schema_dict = {
    'fields': [
        {'name': 'id', 'type': 'int', 'unique': True},
        {'name': 'name', 'type': 'string', 'faker_template': 'name'},
    ]
}

generator = DataGenerator(schema_dict)
validator = DataValidator(schema_dict)

# Validate DataFrame directly
df = pd.read_csv('data.csv')
result = validator.validate(df)
```

Faker Integration

When the faker library is installed, you can use realistic data generation:

```yaml
- name: first_name
  type: string
  faker_template: first_name
- name: address
  type: string
  faker_template: address
- name: phone
  type: string
  faker_template: phone_number
```

Common Faker templates:

- name, first_name, last_name
- email, phone_number
- address, city, country
- company, job
- date, time
- And many more! See Faker documentation

Testing

Run the test suite:

```bash
pytest tests/
```

Run with coverage:

```bash
pytest tests/ --cov=etl_forge --cov-report=html
```

Performance

Performance benchmarks are available in BENCHMARKS.md. To reproduce them, run:

```bash
python benchmark.py
```

Then, to visualize the results:

```bash
python plot_benchmark.py
```

Citation

If you use ETLForge in your research or work, please cite it using the information in CITATION.cff.

Owner

Name: Kyriakos Kartas
Login: kkartas
Kind: user
Location: Herakleion, Greece

Website: http://www.kkartas.gr
Repositories: 1
Profile: https://github.com/kkartas

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
type: software
title: "ETLForge: A Python Framework for Synthetic Test Data Generation and ETL Pipeline Validation"
version: 1.0.3
date-released: 2025-06-15
url: "https://github.com/kkartas/ETLForge"
repository-code: "https://github.com/kkartas/ETLForge"
license: MIT
authors:
  - family-names: "Kartas"
    given-names: "Kyriakos"
    email: "mail@kkartas.gr"
    orcid: "https://orcid.org/0009-0001-6477-4676"
keywords:
  - Python
  - ETL
  - data validation
  - synthetic data
  - data quality
  - testing
  - data engineering
abstract: >-
  ETLForge is a comprehensive Python framework for synthetic test data 
  generation and automated ETL output validation. The framework enables 
  data engineers and scientists to create realistic test datasets based 
  on declarative schema definitions and automatically validate data 
  quality against those schemas, significantly improving the reliability 
  and maintainability of ETL pipelines.
preferred-citation:
  type: article
  title: "ETLForge: A Python Framework for Synthetic Test Data Generation and ETL Pipeline Validation"
  authors:
    - family-names: "Kartas"
      given-names: "Kyriakos"
      email: "mail@kkartas.gr"
      orcid: "https://orcid.org/0009-0001-6477-4676"
  journal: "Journal of Open Source Software"
  year: 2025
  volume: TBD
  issue: TBD
  start: TBD
  end: TBD
  doi: "TBD"
  url: "https://github.com/kkartas/ETLForge"
references:
  - type: software
    title: "pandas: powerful Python data analysis toolkit"
    authors:
      - family-names: "McKinney"
        given-names: "Wes"
    url: "https://pandas.pydata.org/"
  - type: software
    title: "NumPy: The fundamental package for scientific computing with Python"
    authors:
      - family-names: "Harris"
        given-names: "Charles R."
    url: "https://numpy.org/"
  - type: software
    title: "PyYAML: YAML parser and emitter for Python"
    authors:
      - family-names: "PyYAML Contributors"
    url: "https://pyyaml.org/"

GitHub Events

Total

Release event: 1
Push event: 10
Create event: 1

Last Year

Release event: 1
Push event: 10
Create event: 1

Packages

Total packages: 1
Total downloads:
- pypi 21 last-month

Total dependent packages: 0
Total dependent repositories: 0
Total versions: 3
Total maintainers: 1

pypi.org: etl-forge

A Python library for generating synthetic test data and validating ETL outputs.

Homepage: https://github.com/kkartas/ETLForge
Documentation: https://etl-forge.readthedocs.io/
License: MIT License
Latest release: 1.0.3
published 8 months ago

Versions: 3
Dependent Packages: 0
Dependent Repositories: 0
Downloads: 21 Last month

Rankings

Dependent packages count: 9.0%

Average: 29.8%

Dependent repos count: 50.6%

Maintainers (1)

kkartas

Last synced: 6 months ago

Dependencies

requirements.txt pypi

click >=8.0.0
faker >=15.0.0
numpy >=1.21.0
openpyxl >=3.0.0
pandas >=1.3.0
pyyaml >=5.4.0

etl_forge.egg-info/requires.txt pypi

black >=21.0.0
click >=8.0.0
coverage >=6.0
faker >=15.0.0
flake8 >=3.8.0
matplotlib >=3.5.0
mypy >=0.900
numpy >=1.21.0
openpyxl >=3.0.0
pandas >=1.3.0
psutil >=5.9.0
pytest >=6.0.0
pytest-cov >=2.0.0
pyyaml >=5.4.0
sphinx >=4.0.0
sphinx-rtd-theme >=1.0.0

pyproject.toml pypi

click >=8.0.0
numpy >=1.21.0
openpyxl >=3.0.0
pandas >=1.3.0
psutil >=5.9.0
pyyaml >=5.4.0
