etl-forge
Science Score: 54.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: Found CITATION.cff file
- ✓ codemeta.json file: Found codemeta.json file
- ✓ .zenodo.json file: Found .zenodo.json file
- ○ DOI references
- ✓ Academic publication links: Links to joss.theoj.org
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: Low similarity (17.7%) to scientific vocabulary
Repository
Basic Info
- Host: GitHub
- Owner: kkartas
- License: MIT
- Language: Python
- Default Branch: main
- Size: 572 KB
Statistics
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
ETLForge
A Python library for generating synthetic test data and validating ETL (Extract, Transform, Load) outputs. ETL processes are fundamental data workflows that extract data from various sources, transform it according to business rules, and load it into target systems like data warehouses or databases. ETLForge provides both command-line tools and library functions to help you create realistic test datasets and validate data quality throughout your ETL pipelines.
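To picture the three ETL stages the library is built to test, here is a toy, stdlib-only sketch (illustrative only, not ETLForge code):

```python
# Toy illustration of the three ETL stages (not ETLForge code).

raw_rows = ["1,alice,30", "2,bob,-5"]  # extract: rows read from a source

def transform(row: str) -> dict:
    """Apply simple business rules: title-case names, clamp negative ages."""
    id_, name, age = row.split(",")
    return {"id": int(id_), "name": name.title(), "age": max(0, int(age))}

warehouse = [transform(r) for r in raw_rows]  # load: write to the target
print(warehouse[1])  # {'id': 2, 'name': 'Bob', 'age': 0}
```

ETLForge's job is to generate inputs like `raw_rows` from a schema and to validate outputs like `warehouse` against the same schema.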
Features
Test Data Generator
- Generate synthetic data based on YAML/JSON schema definitions
- Support for multiple data types: `int`, `float`, `string`, `date`, `category`
- Advanced constraints: ranges, uniqueness, nullable fields, categorical values
- Integration with Faker for realistic string generation
- Export to CSV or Excel formats
Data Validator
- Validate CSV/Excel files against schema definitions
- Comprehensive validation checks:
- Column existence
- Data type matching
- Value constraints (ranges, categories)
- Uniqueness validation
- Null value validation
- Date format validation
- Generate detailed reports of invalid rows
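The checks listed above can be sketched in plain Python. This is a simplified illustration of the concepts, not ETLForge's implementation (its schemas use type names like `"int"` rather than Python types):

```python
# Simplified illustration of schema-driven validation checks
# (not ETLForge's implementation).
def check_rows(rows, schema):
    """Return (row_index, field_name, error) tuples for constraint violations."""
    errors = []
    for i, row in enumerate(rows):
        for field in schema["fields"]:
            name = field["name"]
            if name not in row:                       # column existence
                errors.append((i, name, "missing column"))
                continue
            value = row[name]
            if value is None:                         # null validation
                if not field.get("nullable", False):
                    errors.append((i, name, "unexpected null"))
                continue
            if not isinstance(value, field["type"]):  # type matching
                errors.append((i, name, "wrong type"))
            elif "range" in field and not (
                field["range"]["min"] <= value <= field["range"]["max"]
            ):                                        # value constraints
                errors.append((i, name, "out of range"))
    return errors

schema = {"fields": [{"name": "age", "type": int, "nullable": False,
                      "range": {"min": 0, "max": 120}}]}
rows = [{"age": 34}, {"age": 200}, {"age": None}]
print(check_rows(rows, schema))
# [(1, 'age', 'out of range'), (2, 'age', 'unexpected null')]
```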
Dual Interface
- Command-line interface for quick operations
- Python library for integration into existing workflows
Installation
Prerequisites
- Python 3.9 or higher
- pip package manager
Install from PyPI (Recommended)
```bash
pip install etl-forge
```
Install from Source
For development or latest features:
```bash
git clone https://github.com/kkartas/etl-forge.git
cd etl-forge
pip install -e ".[dev]"
```
Dependencies
Core dependencies (6 total, automatically installed):
- pandas>=1.3.0 - Data manipulation and analysis
- pyyaml>=5.4.0 - YAML parsing for schema files
- click>=8.0.0 - Command-line interface framework
- openpyxl>=3.0.0 - Excel file support
- numpy>=1.21.0 - Numerical computing
- psutil>=5.9.0 - System monitoring for benchmarks
Optional dependencies for enhanced features:

```bash
# For realistic data generation using Faker templates
pip install etl-forge[faker]

# For development (testing, linting, documentation)
pip install etl-forge[dev]
```
Verify Installation
```bash
# CLI verification (may require adding Scripts directory to PATH on Windows)
etl-forge --version

# Alternative CLI access (works on all platforms)
python -m etl_forge.cli --version

# Library verification
python -c "from etl_forge import DataGenerator, DataValidator; print('Installation verified')"
```
CLI Access Note
On some systems (especially Windows), the `etl-forge` command may not be directly accessible. In such cases, use:

```bash
python -m etl_forge.cli [command] [options]
```
Complete Example
For a comprehensive demonstration of ETLForge's capabilities, see the included `example.py` file:

```bash
# Run the complete example
python example.py
```
This example demonstrates:

- Schema-driven data generation with realistic data (using Faker)
- Data validation with the same schema
- Error detection and reporting
- A complete ETL testing workflow
Key snippet from `example.py`:

```python
from etl_forge import DataGenerator, DataValidator

# Single schema drives both generation and validation
schema = {
    "fields": [
        {"name": "customer_id", "type": "int", "unique": True, "range": {"min": 1, "max": 10000}},
        {"name": "name", "type": "string", "faker_template": "name"},
        {"name": "email", "type": "string", "unique": True, "faker_template": "email"},
        {"name": "purchase_amount", "type": "float", "range": {"min": 10.0, "max": 5000.0}, "nullable": True},
        {"name": "customer_tier", "type": "category", "values": ["Bronze", "Silver", "Gold", "Platinum"]},
    ]
}

# Generate test data
generator = DataGenerator(schema)
df = generator.generate_data(1000)
generator.save_data(df, 'customer_test_data.csv')

# Validate with the same schema
validator = DataValidator(schema)
result = validator.validate('customer_test_data.csv')
print(f"Validation passed: {result.is_valid}")
```
This demonstrates ETLForge's key advantage: single schema, dual purpose - the same schema definition drives both data generation and validation, ensuring perfect synchronization between test data and validation rules.
Quick Start
1. Create a Schema
Create a `schema.yaml` file defining your data structure:

```yaml
fields:
  - name: id
    type: int
    unique: true
    nullable: false
    range:
      min: 1
      max: 10000

  - name: name
    type: string
    nullable: false
    faker_template: name

  - name: department
    type: category
    nullable: false
    values:
      - Engineering
      - Marketing
      - Sales
```
2. Generate Test Data
Command Line:

```bash
# Direct CLI command (if available)
etl-forge generate --schema schema.yaml --rows 500 --output sample.csv

# Alternative CLI access (works on all platforms)
python -m etl_forge.cli generate --schema schema.yaml --rows 500 --output sample.csv
```
Python Library:

```python
from etl_forge import DataGenerator

generator = DataGenerator('schema.yaml')
df = generator.generate_data(500)
generator.save_data(df, 'sample.csv')
```
3. Validate Data
Command Line:

```bash
# Direct CLI command (if available)
etl-forge check --input sample.csv --schema schema.yaml --report invalid_rows.csv

# Alternative CLI access (works on all platforms)
python -m etl_forge.cli check --input sample.csv --schema schema.yaml --report invalid_rows.csv
```
Python Library:

```python
from etl_forge import DataValidator

validator = DataValidator('schema.yaml')
result = validator.validate('sample.csv')
print(f"Validation passed: {result.is_valid}")
```
Schema Definition
Supported Field Types
Integer (`int`)

```yaml
- name: age
  type: int
  nullable: false
  range:
    min: 18
    max: 65
  unique: false
```
Float (`float`)

```yaml
- name: salary
  type: float
  nullable: true
  range:
    min: 30000.0
    max: 150000.0
  precision: 2
  null_rate: 0.1
```
String (`string`)

```yaml
- name: email
  type: string
  nullable: false
  unique: true
  length:
    min: 10
    max: 50
  faker_template: email  # Optional: uses Faker library
```
Date (`date`)

```yaml
- name: hire_date
  type: date
  nullable: false
  range:
    start: '2020-01-01'
    end: '2024-12-31'
  format: '%Y-%m-%d'
```
Category (`category`)

```yaml
- name: status
  type: category
  nullable: false
  values:
    - Active
    - Inactive
    - Pending
```
Schema Constraints
- `nullable`: Allow null values (default: `false`)
- `unique`: Ensure all values are unique (default: `false`)
- `range`: Define min/max values for numeric types, or start/end dates
- `values`: List of allowed values for categorical fields
- `length`: Min/max length for string fields
- `precision`: Decimal places for float fields
- `format`: Date format string (default: `'%Y-%m-%d'`)
- `faker_template`: Faker method name for realistic string generation
- `null_rate`: Probability of null values when `nullable: true` (default: 0.1)
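To make the interaction between these constraints concrete, here is a minimal pure-Python sketch (not ETLForge's actual implementation) of how `range`, `unique`, `nullable`, and `null_rate` might combine when generating a single column:

```python
import random

# Minimal sketch (not ETLForge's implementation) of how schema
# constraints interact when generating one column of values.
def generate_column(field, n, rng):
    """Generate n values for one field dict, honoring a few constraints."""
    values = []
    seen = set()
    # null_rate only applies when the field is nullable
    null_rate = field.get("null_rate", 0.1) if field.get("nullable") else 0.0
    for _ in range(n):
        if rng.random() < null_rate:
            values.append(None)
            continue
        while True:
            if field["type"] == "int":
                v = rng.randint(field["range"]["min"], field["range"]["max"])
            elif field["type"] == "category":
                v = rng.choice(field["values"])
            else:
                raise ValueError("sketch only handles int and category")
            # re-draw until the value is fresh when unique is set
            if not field.get("unique") or v not in seen:
                seen.add(v)
                break
        values.append(v)
    return values

rng = random.Random(42)
ids = generate_column({"name": "id", "type": "int", "unique": True,
                       "range": {"min": 1, "max": 10000}}, 100, rng)
print(len(ids), len(set(ids)))  # 100 100
```

Note that rejection sampling for `unique` only terminates when the range is larger than the requested row count; a real implementation has to detect infeasible schemas.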
Command Line Interface
Generate Data
```bash
# Direct CLI command (if available)
etl-forge generate [OPTIONS]

# Alternative CLI access (works on all platforms)
python -m etl_forge.cli generate [OPTIONS]

# Options:
#   -s, --schema PATH         Path to schema file (YAML or JSON) [required]
#   -r, --rows INTEGER        Number of rows to generate (default: 100)
#   -o, --output PATH         Output file path (CSV or Excel) [required]
#   -f, --format [csv|excel]  Output format (auto-detected if not specified)
```
Validate Data
```bash
# Direct CLI command (if available)
etl-forge check [OPTIONS]

# Alternative CLI access (works on all platforms)
python -m etl_forge.cli check [OPTIONS]

# Options:
#   -i, --input PATH    Path to input data file [required]
#   -s, --schema PATH   Path to schema file [required]
#   -r, --report PATH   Path to save invalid rows report (optional)
#   -v, --verbose       Show detailed validation errors
```
Create Example Schema
```bash
# Direct CLI command (if available)
etl-forge create-schema example_schema.yaml

# Alternative CLI access (works on all platforms)
python -m etl_forge.cli create-schema example_schema.yaml
```
Library Usage
Data Generation
```python
from etl_forge import DataGenerator

# Initialize with schema
generator = DataGenerator('schema.yaml')

# Generate data
df = generator.generate_data(1000)

# Save to file
generator.save_data(df, 'output.csv')

# Or do both in one step
df = generator.generate_and_save(1000, 'output.xlsx', 'excel')
```
Data Validation
```python
from etl_forge import DataValidator

# Initialize validator
validator = DataValidator('schema.yaml')

# Validate data
result = validator.validate('data.csv')

# Check results
if result.is_valid:
    print("Data is valid!")
else:
    print(f"Found {len(result.errors)} validation errors")
    print(f"Invalid rows: {len(result.invalid_rows)}")

# Generate report
result = validator.validate_and_report('data.csv', 'errors.csv')

# Print summary
validator.print_validation_summary(result)
```
Advanced Usage
```python
from etl_forge import DataGenerator, DataValidator

# Use schema as a dictionary
schema_dict = {
    'fields': [
        {'name': 'id', 'type': 'int', 'unique': True},
        {'name': 'name', 'type': 'string', 'faker_template': 'name'},
    ]
}

generator = DataGenerator(schema_dict)
validator = DataValidator(schema_dict)

# Validate a DataFrame directly
import pandas as pd

df = pd.read_csv('data.csv')
result = validator.validate(df)
```
Faker Integration
When the `faker` library is installed, you can use realistic data generation:

```yaml
- name: first_name
  type: string
  faker_template: first_name

- name: address
  type: string
  faker_template: address

- name: phone
  type: string
  faker_template: phone_number
```

Common Faker templates:

- `name`, `first_name`, `last_name`
- `email`, `phone_number`
- `address`, `city`, `country`
- `company`, `job`
- `date`, `time`
- And many more! See the Faker documentation.
Testing
Run the test suite:
```bash
pytest tests/
```
Run with coverage:
```bash
pytest tests/ --cov=etl_forge --cov-report=html
```
Performance
Performance benchmarks are available in `BENCHMARKS.md`. To reproduce them, run:

```bash
python benchmark.py
```

Then, to visualize the results:

```bash
python plot_benchmark.py
```
Citation
If you use ETLForge in your research or work, please cite it using the information in `CITATION.cff`.
Owner
- Name: Kyriakos Kartas
- Login: kkartas
- Kind: user
- Location: Herakleion, Greece
- Website: http://www.kkartas.gr
- Repositories: 1
- Profile: https://github.com/kkartas
Citation (CITATION.cff)
cff-version: 1.2.0
message: "If you use this software, please cite it as below."
type: software
title: "ETLForge: A Python Framework for Synthetic Test Data Generation and ETL Pipeline Validation"
version: 1.0.3
date-released: 2025-06-15
url: "https://github.com/kkartas/ETLForge"
repository-code: "https://github.com/kkartas/ETLForge"
license: MIT
authors:
- family-names: "Kartas"
given-names: "Kyriakos"
email: "mail@kkartas.gr"
orcid: "https://orcid.org/0009-0001-6477-4676"
keywords:
- Python
- ETL
- data validation
- synthetic data
- data quality
- testing
- data engineering
abstract: >-
ETLForge is a comprehensive Python framework for synthetic test data
generation and automated ETL output validation. The framework enables
data engineers and scientists to create realistic test datasets based
on declarative schema definitions and automatically validate data
quality against those schemas, significantly improving the reliability
and maintainability of ETL pipelines.
preferred-citation:
type: article
title: "ETLForge: A Python Framework for Synthetic Test Data Generation and ETL Pipeline Validation"
authors:
- family-names: "Kartas"
given-names: "Kyriakos"
email: "mail@kkartas.gr"
orcid: "https://orcid.org/0009-0001-6477-4676"
journal: "Journal of Open Source Software"
year: 2025
volume: TBD
issue: TBD
start: TBD
end: TBD
doi: "TBD"
url: "https://github.com/kkartas/ETLForge"
references:
- type: software
title: "pandas: powerful Python data analysis toolkit"
authors:
- family-names: "McKinney"
given-names: "Wes"
url: "https://pandas.pydata.org/"
- type: software
title: "NumPy: The fundamental package for scientific computing with Python"
authors:
- family-names: "Harris"
given-names: "Charles R."
url: "https://numpy.org/"
- type: software
title: "PyYAML: YAML parser and emitter for Python"
authors:
- family-names: "PyYAML Contributors"
url: "https://pyyaml.org/"
GitHub Events
Total
- Release event: 1
- Push event: 10
- Create event: 1
Last Year
- Release event: 1
- Push event: 10
- Create event: 1
Packages
- Total packages: 1
- Total downloads: 21 last-month (pypi)
- Total dependent packages: 0
- Total dependent repositories: 0
- Total versions: 3
- Total maintainers: 1
pypi.org: etl-forge
A Python library for generating synthetic test data and validating ETL outputs.
- Homepage: https://github.com/kkartas/ETLForge
- Documentation: https://etl-forge.readthedocs.io/
- License: MIT License
- Latest release: 1.0.3 (published 8 months ago)
Maintainers (1)
Dependencies
- click >=8.0.0
- faker >=15.0.0
- numpy >=1.21.0
- openpyxl >=3.0.0
- pandas >=1.3.0
- pyyaml >=5.4.0
- black >=21.0.0
- click >=8.0.0
- coverage >=6.0
- faker >=15.0.0
- flake8 >=3.8.0
- matplotlib >=3.5.0
- mypy >=0.900
- numpy >=1.21.0
- openpyxl >=3.0.0
- pandas >=1.3.0
- psutil >=5.9.0
- pytest >=6.0.0
- pytest-cov >=2.0.0
- pyyaml >=5.4.0
- sphinx >=4.0.0
- sphinx-rtd-theme >=1.0.0
- click >=8.0.0
- numpy >=1.21.0
- openpyxl >=3.0.0
- pandas >=1.3.0
- psutil >=5.9.0
- pyyaml >=5.4.0