Science Score: 54.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
    Links to: joss.theoj.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (17.7%) to scientific vocabulary
Last synced: 6 months ago · JSON representation ·

Repository

Basic Info
  • Host: GitHub
  • Owner: kkartas
  • License: mit
  • Language: Python
  • Default Branch: main
  • Size: 572 KB
Statistics
  • Stars: 0
  • Watchers: 0
  • Forks: 0
  • Open Issues: 0
  • Releases: 0
Created 8 months ago · Last pushed 8 months ago
Metadata Files
Readme Changelog Contributing License Code of conduct Citation

README.md

ETLForge

PyPI version docs build PyPI - Python Version License: MIT status PyPI Downloads

A Python library for generating synthetic test data and validating ETL (Extract, Transform, Load) outputs. ETL processes are fundamental data workflows that extract data from various sources, transform it according to business rules, and load it into target systems like data warehouses or databases. ETLForge provides both command-line tools and library functions to help you create realistic test datasets and validate data quality throughout your ETL pipelines.

Features

Test Data Generator

  • Generate synthetic data based on YAML/JSON schema definitions
  • Support for multiple data types: int, float, string, date, category
  • Advanced constraints: ranges, uniqueness, nullable fields, categorical values
  • Integration with Faker for realistic string generation
  • Export to CSV or Excel formats

Data Validator

  • Validate CSV/Excel files against schema definitions
  • Comprehensive validation checks:
    • Column existence
    • Data type matching
    • Value constraints (ranges, categories)
    • Uniqueness validation
    • Null value validation
    • Date format validation
  • Generate detailed reports of invalid rows

Dual Interface

  • Command-line interface for quick operations
  • Python library for integration into existing workflows

Installation

Prerequisites

  • Python 3.9 or higher
  • pip package manager

Install from PyPI (Recommended)

bash pip install etl-forge

Install from Source

For development or latest features: bash git clone https://github.com/kkartas/etl-forge.git cd etl-forge pip install -e ".[dev]"

Dependencies

Core dependencies (6 total, automatically installed): - pandas>=1.3.0 - Data manipulation and analysis - pyyaml>=5.4.0 - YAML parsing for schema files - click>=8.0.0 - Command-line interface framework - openpyxl>=3.0.0 - Excel file support - numpy>=1.21.0 - Numerical computing - psutil>=5.9.0 - System monitoring for benchmarks

Optional dependencies for enhanced features: ```bash

For realistic data generation using Faker templates

pip install etl-forge[faker]

For development (testing, linting, documentation)

pip install etl-forge[dev] ```

Verify Installation

```bash

CLI verification (may require adding Scripts directory to PATH on Windows)

etl-forge --version

Alternative CLI access (works on all platforms)

python -m etl_forge.cli --version

Library verification

python -c "from etl_forge import DataGenerator, DataValidator; print('Installation verified')" ```

CLI Access Note

On some systems (especially Windows), the etl-forge command may not be directly accessible. In such cases, use: bash python -m etl_forge.cli [command] [options]

Complete Example

For a comprehensive demonstration of ETLForge's capabilities, see the included example.py file:

```bash

Run the complete example

python example.py ```

This example demonstrates: - Schema-driven data generation with realistic data (using Faker) - Data validation with the same schema - Error detection and reporting - Complete ETL testing workflow

Key snippet from example.py:

```python from etl_forge import DataGenerator, DataValidator

Single schema drives both generation and validation

schema = { "fields": [ {"name": "customerid", "type": "int", "unique": True, "range": {"min": 1, "max": 10000}}, {"name": "name", "type": "string", "fakertemplate": "name"}, {"name": "email", "type": "string", "unique": True, "fakertemplate": "email"}, {"name": "purchaseamount", "type": "float", "range": {"min": 10.0, "max": 5000.0}, "nullable": True}, {"name": "customer_tier", "type": "category", "values": ["Bronze", "Silver", "Gold", "Platinum"]} ] }

Generate test data

generator = DataGenerator(schema) df = generator.generatedata(1000) generator.savedata(df, 'customertestdata.csv')

Validate with the same schema

validator = DataValidator(schema) result = validator.validate('customertestdata.csv') print(f"Validation passed: {result.is_valid}") ```

This demonstrates ETLForge's key advantage: single schema, dual purpose - the same schema definition drives both data generation and validation, ensuring perfect synchronization between test data and validation rules.

Quick Start

1. Create a Schema

Create a schema.yaml file defining your data structure:

```yaml fields: - name: id type: int unique: true nullable: false range: min: 1 max: 10000

  • name: name type: string nullable: false faker_template: name

  • name: department type: category nullable: false values:

    • Engineering
    • Marketing
    • Sales ```

2. Generate Test Data

Command Line: ```bash

Direct CLI command (if available)

etl-forge generate --schema schema.yaml --rows 500 --output sample.csv

Alternative CLI access (works on all platforms)

python -m etl_forge.cli generate --schema schema.yaml --rows 500 --output sample.csv ```

Python Library: ```python from etl_forge import DataGenerator

generator = DataGenerator('schema.yaml') df = generator.generatedata(500) generator.savedata(df, 'sample.csv') ```

3. Validate Data

Command Line: ```bash

Direct CLI command (if available)

etl-forge check --input sample.csv --schema schema.yaml --report invalid_rows.csv

Alternative CLI access (works on all platforms)

python -m etlforge.cli check --input sample.csv --schema schema.yaml --report invalidrows.csv ```

Python Library: ```python from etl_forge import DataValidator

validator = DataValidator('schema.yaml') result = validator.validate('sample.csv') print(f"Validation passed: {result.is_valid}") ```

Schema Definition

Supported Field Types

Integer (int)

yaml - name: age type: int nullable: false range: min: 18 max: 65 unique: false

Float (float)

yaml - name: salary type: float nullable: true range: min: 30000.0 max: 150000.0 precision: 2 null_rate: 0.1

String (string)

yaml - name: email type: string nullable: false unique: true length: min: 10 max: 50 faker_template: email # Optional: uses Faker library

Date (date)

yaml - name: hire_date type: date nullable: false range: start: '2020-01-01' end: '2024-12-31' format: '%Y-%m-%d'

Category (category)

yaml - name: status type: category nullable: false values: - Active - Inactive - Pending

Schema Constraints

  • nullable: Allow null values (default: false)
  • unique: Ensure all values are unique (default: false)
  • range: Define min/max values for numeric types or start/end dates
  • values: List of allowed values for categorical fields
  • length: Min/max length for string fields
  • precision: Decimal places for float fields
  • format: Date format string (default: '%Y-%m-%d')
  • faker_template: Faker method name for realistic string generation
  • null_rate: Probability of null values when nullable: true (default: 0.1)

Command Line Interface

Generate Data

```bash

Direct CLI command (if available)

etl-forge generate [OPTIONS]

Alternative CLI access (works on all platforms)

python -m etl_forge.cli generate [OPTIONS]

Options: -s, --schema PATH Path to schema file (YAML or JSON) [required] -r, --rows INTEGER Number of rows to generate (default: 100) -o, --output PATH Output file path (CSV or Excel) [required] -f, --format [csv|excel] Output format (auto-detected if not specified) ```

Validate Data

```bash

Direct CLI command (if available)

etl-forge check [OPTIONS]

Alternative CLI access (works on all platforms)

python -m etl_forge.cli check [OPTIONS]

Options: -i, --input PATH Path to input data file [required] -s, --schema PATH Path to schema file [required] -r, --report PATH Path to save invalid rows report (optional) -v, --verbose Show detailed validation errors ```

Create Example Schema

```bash

Direct CLI command (if available)

etl-forge create-schema example_schema.yaml

Alternative CLI access (works on all platforms)

python -m etlforge.cli create-schema exampleschema.yaml ```

Library Usage

Data Generation

```python from etl_forge import DataGenerator

Initialize with schema

generator = DataGenerator('schema.yaml')

Generate data

df = generator.generate_data(1000)

Save to file

generator.save_data(df, 'output.csv')

Or do both in one step

df = generator.generateandsave(1000, 'output.xlsx', 'excel') ```

Data Validation

```python from etl_forge import DataValidator

Initialize validator

validator = DataValidator('schema.yaml')

Validate data

result = validator.validate('data.csv')

Check results

if result.isvalid: print("Data is valid!") else: print(f"Found {len(result.errors)} validation errors") print(f"Invalid rows: {len(result.invalidrows)}")

Generate report

result = validator.validateandreport('data.csv', 'errors.csv')

Print summary

validator.printvalidationsummary(result) ```

Advanced Usage

```python

Use schema as dictionary

schemadict = { 'fields': [ {'name': 'id', 'type': 'int', 'unique': True}, {'name': 'name', 'type': 'string', 'fakertemplate': 'name'} ] }

generator = DataGenerator(schemadict) validator = DataValidator(schemadict)

Validate DataFrame directly

import pandas as pd df = pd.read_csv('data.csv') result = validator.validate(df) ```

Faker Integration

When the faker library is installed, you can use realistic data generation:

```yaml - name: firstname type: string fakertemplate: first_name

  • name: address type: string faker_template: address

  • name: phone type: string fakertemplate: phonenumber ```

Common Faker templates: - name, first_name, last_name - email, phone_number - address, city, country - company, job - date, time - And many more! See Faker documentation

Testing

Run the test suite:

bash pytest tests/

Run with coverage:

bash pytest tests/ --cov=etl_forge --cov-report=html

Performance

Performance benchmarks are available in BENCHMARKS.md. To reproduce them, run:

bash python benchmark.py

Then, to visualize the results:

bash python plot_benchmark.py

Citation

If you use ETLForge in your research or work, please cite it using the information in CITATION.cff.

Owner

  • Name: Kyriakos Kartas
  • Login: kkartas
  • Kind: user
  • Location: Herakleion, Greece

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
type: software
title: "ETLForge: A Python Framework for Synthetic Test Data Generation and ETL Pipeline Validation"
version: 1.0.3
date-released: 2025-06-15
url: "https://github.com/kkartas/ETLForge"
repository-code: "https://github.com/kkartas/ETLForge"
license: MIT
authors:
  - family-names: "Kartas"
    given-names: "Kyriakos"
    email: "mail@kkartas.gr"
    orcid: "https://orcid.org/0009-0001-6477-4676"
keywords:
  - Python
  - ETL
  - data validation
  - synthetic data
  - data quality
  - testing
  - data engineering
abstract: >-
  ETLForge is a comprehensive Python framework for synthetic test data 
  generation and automated ETL output validation. The framework enables 
  data engineers and scientists to create realistic test datasets based 
  on declarative schema definitions and automatically validate data 
  quality against those schemas, significantly improving the reliability 
  and maintainability of ETL pipelines.
preferred-citation:
  type: article
  title: "ETLForge: A Python Framework for Synthetic Test Data Generation and ETL Pipeline Validation"
  authors:
    - family-names: "Kartas"
      given-names: "Kyriakos"
      email: "mail@kkartas.gr"
      orcid: "https://orcid.org/0009-0001-6477-4676"
  journal: "Journal of Open Source Software"
  year: 2025
  volume: TBD
  issue: TBD
  start: TBD
  end: TBD
  doi: "TBD"
  url: "https://github.com/kkartas/ETLForge"
references:
  - type: software
    title: "pandas: powerful Python data analysis toolkit"
    authors:
      - family-names: "McKinney"
        given-names: "Wes"
    url: "https://pandas.pydata.org/"
  - type: software
    title: "NumPy: The fundamental package for scientific computing with Python"
    authors:
      - family-names: "Harris"
        given-names: "Charles R."
    url: "https://numpy.org/"
  - type: software
    title: "PyYAML: YAML parser and emitter for Python"
    authors:
      - family-names: "PyYAML Contributors"
    url: "https://pyyaml.org/" 

GitHub Events

Total
  • Release event: 1
  • Push event: 10
  • Create event: 1
Last Year
  • Release event: 1
  • Push event: 10
  • Create event: 1

Packages

  • Total packages: 1
  • Total downloads:
    • pypi 21 last-month
  • Total dependent packages: 0
  • Total dependent repositories: 0
  • Total versions: 3
  • Total maintainers: 1
pypi.org: etl-forge

A Python library for generating synthetic test data and validating ETL outputs.

  • Versions: 3
  • Dependent Packages: 0
  • Dependent Repositories: 0
  • Downloads: 21 Last month
Rankings
Dependent packages count: 9.0%
Average: 29.8%
Dependent repos count: 50.6%
Maintainers (1)
Last synced: 6 months ago

Dependencies

requirements.txt pypi
  • click >=8.0.0
  • faker >=15.0.0
  • numpy >=1.21.0
  • openpyxl >=3.0.0
  • pandas >=1.3.0
  • pyyaml >=5.4.0
etl_forge.egg-info/requires.txt pypi
  • black >=21.0.0
  • click >=8.0.0
  • coverage >=6.0
  • faker >=15.0.0
  • flake8 >=3.8.0
  • matplotlib >=3.5.0
  • mypy >=0.900
  • numpy >=1.21.0
  • openpyxl >=3.0.0
  • pandas >=1.3.0
  • psutil >=5.9.0
  • pytest >=6.0.0
  • pytest-cov >=2.0.0
  • pyyaml >=5.4.0
  • sphinx >=4.0.0
  • sphinx-rtd-theme >=1.0.0
pyproject.toml pypi
  • click >=8.0.0
  • numpy >=1.21.0
  • openpyxl >=3.0.0
  • pandas >=1.3.0
  • psutil >=5.9.0
  • pyyaml >=5.4.0