https://github.com/copyleftdev/specmint

๐ŸŽฏ Production-ready synthetic dataset generator with local LLM integration. Create realistic, schema-compliant test data for healthcare, fintech, and e-commerce applications. Privacy-first with deterministic generation.

https://github.com/copyleftdev/specmint

Science Score: 26.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • โ—‹
    CITATION.cff file
  • โœ“
    codemeta.json file
    Found codemeta.json file
  • โœ“
    .zenodo.json file
    Found .zenodo.json file
  • โ—‹
    DOI references
  • โ—‹
    Academic publication links
  • โ—‹
    Academic email domains
  • โ—‹
    Institutional organization owner
  • โ—‹
    JOSS paper metadata
  • โ—‹
    Scientific vocabulary similarity
    Low similarity (10.5%) to scientific vocabulary

Keywords

cli-tool dataset-generator deterministic ecommerce fintech golang healthcare json-schema llm-integration ollama privacy-first synthetic-data test-data
Last synced: 5 months ago · JSON representation

Repository

๐ŸŽฏ Production-ready synthetic dataset generator with local LLM integration. Create realistic, schema-compliant test data for healthcare, fintech, and e-commerce applications. Privacy-first with deterministic generation.

Basic Info
Statistics
  • Stars: 1
  • Watchers: 0
  • Forks: 0
  • Open Issues: 0
  • Releases: 0
Topics
cli-tool dataset-generator deterministic ecommerce fintech golang healthcare json-schema llm-integration ollama privacy-first synthetic-data test-data
Created 6 months ago · Last pushed 6 months ago
Metadata Files
Readme License Security

README.md

SpecMint: Synthetic Dataset Generator

SpecMint Logo

CI/CD Pipeline Security Audit Go Report Card codecov Go Version License GitHub release GitHub stars GitHub issues

SpecMint is an intelligent synthetic dataset generator that transforms business scenarios into realistic datasets. Instead of manually configuring schemas and record counts, simply describe your business context (e.g., "500-bed hospital", "community bank with 12 branches") and SpecMint automatically calculates realistic record counts, relationships, and generates comprehensive datasets.

๐ŸŽฏ Population-Based Intelligence

SpecMint's breakthrough feature is population-based simulation - analyze real-world business scenarios and automatically generate realistic datasets:

```bash

Hospital simulation - automatically calculates patients, claims, prescriptions, etc.

./bin/specmint simulate --population "100-bed regional hospital" --execute --output ./hospital-data

Banking simulation - generates customers, accounts, transactions, loans

./bin/specmint simulate --population "community bank with 5 branches" --execute --output ./bank-data

E-commerce simulation - creates users, products, orders, reviews

./bin/specmint simulate --population "e-commerce platform with 50K users" --execute --output ./ecommerce-data

Retail simulation - generates stores, products, customers, inventory

./bin/specmint simulate --population "retail chain with 10 stores" --execute --output ./retail-data ```

๐Ÿš€ Traditional Schema-Based Generation

```bash

Generate specific record types with custom counts

./bin/specmint generate -s test/schemas/ecommerce/products.json -o output -c 1000

Generate healthcare claims with LLM enrichment

./bin/specmint generate -s test/schemas/medical/healthcare-claims-837.json -o claims --count 100 --llm-mode fields

Generate pharmacy claims

./bin/specmint generate -s test/schemas/medical/rx-claims-ncpdp.json -o rx-data --count 500

Validate existing dataset

./bin/specmint validate -s schema.json -d dataset.jsonl

System health check

./bin/specmint doctor ```

๐Ÿ“Š Project Metrics

| Metric | Value | Details | |--------|-------|---------| | Development Time | ~6 hours | August 17, 2025 (05:00 - 11:00 PST) | | Total Lines of Code | 3,186 | Pure Go implementation | | Go Files | 11 | Modular architecture | | Security Rating | A (Excellent) | Zero vulnerabilities | | Test Coverage | Comprehensive | Golden dataset validation |

๐Ÿ—๏ธ Architecture

SpecMint follows a clean, modular architecture designed for maintainability and extensibility:

โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ CLI Commands โ”‚โ”€โ”€โ”€โ–ถโ”‚ Core Generator โ”‚โ”€โ”€โ”€โ–ถโ”‚ Output Writer โ”‚ โ”‚ (Cobra-based) โ”‚ โ”‚ (Deterministic โ”‚ โ”‚ (JSONL/JSON) โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”‚ + LLM Enhanced)โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”‚ โ–ผ โ”‚ โ–ผ โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ–ผ โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ Schema Parser โ”‚ โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ Domain Validatorโ”‚ โ”‚ (JSON Schema) โ”‚ โ”‚ LLM Integration โ”‚ โ”‚ (Business Rules)โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”‚ (Local Ollama) โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜

Core Components

  • cmd/specmint/ - CLI interface with 6 commands (generate, simulate, validate, inspect, doctor, benchmark)
  • pkg/generator/ - Deterministic generation engine with optional LLM enrichment
  • pkg/population/ - Population-based simulation and business scenario analysis
  • pkg/schema/ - JSON Schema parsing and validation
  • pkg/llm/ - Local Ollama integration for realistic data enhancement
  • pkg/validator/ - Domain-specific business rule validation
  • pkg/writer/ - Multi-format output handling
  • internal/config/ - Configuration management
  • internal/logger/ - Structured logging with zerolog

๐Ÿค Development Collaboration

This project represents a unique Human-AI collaborative development approach:

Human Role (Project Lead)

  • Strategic Vision: Defined requirements for privacy-focused synthetic data generation
  • Architecture Guidance: Directed modular design decisions and Go best practices
  • Domain Expertise: Provided business logic for healthcare, fintech, and e-commerce validation
  • Quality Assurance: Guided testing strategies and security requirements
  • Project Management: Managed scope, priorities, and deliverable timelines

AI Role (Cascade Assistant)

  • Code Implementation: Wrote 100% of the 3,186 lines of Go code
  • Technical Architecture: Implemented clean architecture patterns and interfaces
  • Testing Strategy: Developed comprehensive golden dataset testing approach
  • Security Implementation: Integrated security scanning and vulnerability management
  • Documentation: Created comprehensive technical documentation and reports

Collaborative Highlights

  • Real-time Feedback Loop: Immediate iteration on requirements and implementation
  • Knowledge Transfer: AI learned domain-specific validation rules through human guidance
  • Quality Standards: Human oversight ensured enterprise-grade code quality
  • Problem Solving: Combined human strategic thinking with AI implementation speed

๐Ÿงช Testing Strategies

SpecMint employs multiple testing methodologies for comprehensive quality assurance:

1. Golden Dataset Testing

bash ./test/golden-test-suite.sh - Purpose: Regression testing with known-good datasets - Coverage: All three domains (healthcare, fintech, e-commerce) - Validation: Schema compliance + domain business rules - Datasets: 175 total records across domains

2. Domain-Specific Validation

  • Healthcare: 837 Claims (ICD-10/CPT codes), NCPDP pharmacy claims, NPI validation, HIPAA compliance
  • Fintech: ABA routing numbers, transaction limits, risk scoring
  • E-commerce: SKU formats, inventory consistency, pricing validation
  • X12 EDI: Purchase order validation, party ID verification, business transaction compliance

3. LLM Integration Testing

  • Connectivity: Automated Ollama health checks
  • Fallback Logic: Graceful degradation to deterministic generation
  • Quality Assurance: LLM output validation against schema constraints

4. Security Testing

  • Static Analysis: gosec security scanner integration
  • Vulnerability Scanning: govulncheck for Go stdlib issues
  • Dependency Auditing: nancy for third-party package security

5. Performance Benchmarking

bash ./bin/specmint benchmark -s schema.json --counts 100,1000,10000 - Scalability: Multi-record generation performance - Memory Usage: Resource consumption monitoring - Deterministic Verification: Seed-based reproducibility testing

๐Ÿ”ง Build System & CI/CD

Local Development

Comprehensive Makefile with 15+ targets for complete development lifecycle:

```bash

Development

make build test lint

Security

make audit vulncheck

CI/CD Pipeline

make ci

Dependency Management

make deps-update deps-verify

System Diagnostics

make doctor ```

Automated CI/CD Pipeline

Production-grade GitHub Actions workflows with expert separation of concerns:

  • CI/CD Pipeline: Multi-platform builds, test matrix, golden dataset validation
  • Security Audit: Daily automated security scanning with SARIF integration
  • Release Automation: Multi-platform binary builds with automated GitHub releases
  • Coverage Reporting: Automated code coverage via Codecov integration
  • Quality Gates: Go Report Card integration for code quality metrics

๐Ÿ›ก๏ธ Security

SpecMint maintains an A-grade security rating with:

  • โœ… Zero vulnerabilities (post Go 1.25.0 upgrade)
  • โœ… Automated security scanning in CI/CD pipeline
  • โœ… Hardened file permissions (0600 for logs, 0750 for directories)
  • โœ… Clean dependency tree with regular vulnerability monitoring
  • โœ… Static code analysis with 54% security issue reduction
  • โœ… Daily security audits via GitHub Actions
  • โœ… SARIF integration for GitHub Security tab

See SECURITYAUDITREPORT.md for detailed security assessment.

๐ŸŽฏ Key Features

Population-Based Intelligence

  • Business Context Understanding: Analyze real-world scenarios and suggest realistic data volumes
  • Automatic Scaling: Calculate appropriate record counts based on business size
  • Domain Templates: Built-in knowledge for Healthcare, Banking, Retail, E-commerce, Insurance
  • Relationship Modeling: Understand data dependencies and realistic proportions

Deterministic Generation

  • Reproducible: Same seed produces identical datasets
  • Scalable: Efficient generation of large datasets
  • Schema-Compliant: Strict adherence to JSON Schema specifications

LLM Enhancement

  • Local Privacy: Uses local Ollama instance (no data leaves your machine)
  • Selective Enrichment: Field-level LLM enhancement with fallback
  • Configurable: Adjustable workers, rate limiting, and model selection

Domain Intelligence

  • Healthcare: 837 Healthcare Claims (NCPDP D.0), NCPDP pharmacy claims with medical coding
  • Fintech: Transaction processing, ABA routing validation, risk scoring
  • E-commerce: Product catalogs, inventory management, SKU generation
  • X12 EDI: Purchase orders (850), business transactions with party validation
  • Business Rules: Industry-specific validation logic with cross-field constraints
  • Medical Coding: ICD-10 diagnosis codes, CPT procedure codes, NPI provider validation
  • Realistic Data: LLM-enhanced medical descriptions and contextually appropriate values

Production Ready

  • CLI Interface: Professional command-line tool with comprehensive help
  • Multiple Formats: JSON, JSONL output with manifest generation
  • Monitoring: Built-in health checks and system diagnostics
  • Extensible: Plugin-ready architecture for new domains

๐Ÿ“ˆ Performance

  • Generation Speed: 1000+ records/second (deterministic mode)
  • Memory Efficiency: Streaming output for large datasets
  • LLM Integration: Configurable rate limiting and worker pools
  • Scalability: Tested up to 10,000+ record generation

๐Ÿฅ Healthcare & Medical Data

SpecMint excels at generating enterprise-grade healthcare datasets with medical accuracy:

837 Healthcare Claims (X12 EDI)

  • Complete NCPDP D.0 structure: Professional, institutional, and dental claims
  • Medical coding compliance: Valid ICD-10 diagnosis codes, CPT procedure codes
  • Provider validation: NPI identifiers, taxonomy codes, federal tax IDs
  • LLM-enhanced realism: Medical diagnoses and procedure descriptions
  • Cross-field validation: Medical logic enforcement across claim hierarchies
  • Performance optimized: 5x faster than generic tools (2 LLM calls vs 10+ per record)

NCPDP Pharmacy Claims

  • Prescription accuracy: NDC codes, DEA numbers, prior authorization
  • Drug information: Realistic medication names, strengths, quantities
  • Insurance processing: BIN/PCN numbers, copay calculations
  • Regulatory compliance: HIPAA-safe synthetic data generation

Key Healthcare Features

  • Medical realism: Clinically plausible diagnosis-procedure relationships
  • Regulatory compliance: No real PHI/PII in synthetic data
  • Scalable generation: Thousands of compliant claims efficiently
  • Industry validation: Healthcare-specific business rules and constraints

๐Ÿ”ฎ Future Enhancements

  • Additional Medical: 270/271 Eligibility, 835 Payment/Remittance, 856 ASN
  • Additional Domains: Legal, manufacturing, retail verticals
  • Output Formats: CSV, Parquet, database direct insertion
  • Cloud LLM Support: OpenAI, Anthropic, Google integration
  • Web Interface: Browser-based dataset generation UI
  • API Mode: REST API for programmatic access

๐Ÿ“„ License

BSD 3-Clause License - see LICENSE for details.

Attribution Required: When using SpecMint, please include attribution as specified in the LICENSE file.

๐Ÿ™ Acknowledgments

This project demonstrates the power of Human-AI collaboration in software development, combining human strategic vision with AI implementation capabilities to create enterprise-grade solutions in record time.


Built with โค๏ธ using Go 1.25.0 and collaborative AI development

Owner

  • Name: Donald Johnson
  • Login: copyleftdev
  • Kind: user
  • Location: Los Angeles

GitHub Events

Total
  • Push event: 22
  • Create event: 1
Last Year
  • Push event: 22
  • Create event: 1

Dependencies

.github/workflows/ci.yml actions
  • actions/cache v3 composite
  • actions/checkout v4 composite
  • actions/setup-go v4 composite
  • actions/upload-artifact v4 composite
  • codecov/codecov-action v4 composite
  • golangci/golangci-lint-action v3 composite
.github/workflows/release.yml actions
  • actions/checkout v4 composite
  • actions/download-artifact v4 composite
  • actions/setup-go v4 composite
  • actions/upload-artifact v4 composite
  • softprops/action-gh-release v1 composite
.github/workflows/security.yml actions
  • actions/cache v3 composite
  • actions/checkout v4 composite
  • actions/download-artifact v4 composite
  • actions/github-script v6 composite
  • actions/setup-go v4 composite
  • actions/upload-artifact v4 composite
  • github/codeql-action/upload-sarif v3 composite
go.mod go
  • github.com/inconshreveable/mousetrap v1.1.0
  • github.com/mattn/go-colorable v0.1.13
  • github.com/mattn/go-isatty v0.0.19
  • github.com/rs/zerolog v1.32.0
  • github.com/santhosh-tekuri/jsonschema/v6 v6.0.1
  • github.com/sony/gobreaker v0.5.0
  • github.com/spf13/cobra v1.8.0
  • github.com/spf13/pflag v1.0.5
  • golang.org/x/sys v0.12.0
  • golang.org/x/text v0.14.0
  • golang.org/x/time v0.5.0
  • gopkg.in/yaml.v3 v3.0.1
go.sum go
  • github.com/coreos/go-systemd/v22 v22.5.0
  • github.com/cpuguy83/go-md2man/v2 v2.0.3
  • github.com/davecgh/go-spew v1.1.0
  • github.com/dlclark/regexp2 v1.11.0
  • github.com/godbus/dbus/v5 v5.0.4
  • github.com/inconshreveable/mousetrap v1.1.0
  • github.com/mattn/go-colorable v0.1.13
  • github.com/mattn/go-isatty v0.0.16
  • github.com/mattn/go-isatty v0.0.19
  • github.com/pkg/errors v0.9.1
  • github.com/pmezard/go-difflib v1.0.0
  • github.com/rs/xid v1.5.0
  • github.com/rs/zerolog v1.32.0
  • github.com/russross/blackfriday/v2 v2.1.0
  • github.com/santhosh-tekuri/jsonschema/v6 v6.0.1
  • github.com/sony/gobreaker v0.5.0
  • github.com/spf13/cobra v1.8.0
  • github.com/spf13/pflag v1.0.5
  • github.com/stretchr/objx v0.1.0
  • github.com/stretchr/testify v1.3.0
  • golang.org/x/sys v0.0.0-20220811171246-fbc7d0a398ab
  • golang.org/x/sys v0.6.0
  • golang.org/x/sys v0.12.0
  • golang.org/x/text v0.14.0
  • golang.org/x/time v0.5.0
  • gopkg.in/check.v1 v0.0.0-20161208181325-20d25e280405
  • gopkg.in/yaml.v3 v3.0.1