https://github.com/copyleftdev/specmint
Production-ready synthetic dataset generator with local LLM integration. Create realistic, schema-compliant test data for healthcare, fintech, and e-commerce applications. Privacy-first with deterministic generation.
Repository
Basic Info
- Host: GitHub
- Owner: copyleftdev
- License: other
- Language: Go
- Default Branch: main
- Homepage: https://specmint-n7.vercel.app/
- Size: 8.54 MB
Statistics
- Stars: 1
- Watchers: 0
- Forks: 0
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
SpecMint: Synthetic Dataset Generator
SpecMint is a synthetic dataset generator that turns business scenarios into realistic datasets. Instead of manually configuring schemas and record counts, you describe your business context (e.g., "500-bed hospital", "community bank with 12 branches"). SpecMint then calculates realistic record counts and relationships and generates the corresponding datasets.
Population-Based Intelligence
SpecMint's headline feature is population-based simulation: it analyzes a real-world business scenario and generates a dataset sized to match it:
```bash
# Hospital simulation - automatically calculates patients, claims, prescriptions, etc.
./bin/specmint simulate --population "100-bed regional hospital" --execute --output ./hospital-data

# Banking simulation - generates customers, accounts, transactions, loans
./bin/specmint simulate --population "community bank with 5 branches" --execute --output ./bank-data

# E-commerce simulation - creates users, products, orders, reviews
./bin/specmint simulate --population "e-commerce platform with 50K users" --execute --output ./ecommerce-data

# Retail simulation - generates stores, products, customers, inventory
./bin/specmint simulate --population "retail chain with 10 stores" --execute --output ./retail-data
```
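One way to picture how a population description could become record counts is a parse-then-scale step. The sketch below is illustrative only: `PlanFor`, `parseSize`, `planHospital`, and every multiplier are hypothetical names and placeholder values, not SpecMint's actual implementation or ratios.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
	"strings"
)

// Plan maps a record type to the number of records to generate.
type Plan map[string]int

// sizePattern finds the first number in a description, with an optional
// "K" suffix meaning thousands (for example "100" or "50K").
var sizePattern = regexp.MustCompile(`(\d+)([kK]\b)?`)

// parseSize extracts the headline size from a free-text description.
func parseSize(description string) (int, error) {
	m := sizePattern.FindStringSubmatch(description)
	if m == nil {
		return 0, fmt.Errorf("no size found in %q", description)
	}
	n, err := strconv.Atoi(m[1])
	if err != nil {
		return 0, err
	}
	if m[2] != "" {
		n *= 1000
	}
	return n, nil
}

// planHospital scales record counts from a bed count.
// The multipliers are placeholders chosen for illustration (assumption).
func planHospital(beds int) Plan {
	return Plan{
		"patients":      beds * 40,
		"claims":        beds * 120,
		"prescriptions": beds * 200,
	}
}

// PlanFor picks a template by keyword and scales it by the parsed size.
func PlanFor(description string) (Plan, error) {
	size, err := parseSize(description)
	if err != nil {
		return nil, err
	}
	if strings.Contains(strings.ToLower(description), "hospital") {
		return planHospital(size), nil
	}
	return nil, fmt.Errorf("no template matches %q", description)
}

func main() {
	plan, err := PlanFor("100-bed regional hospital")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(plan["patients"], plan["claims"], plan["prescriptions"])
}
```

Keeping the size parser separate from the per-domain templates means a new domain only needs a new scaling function.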
Traditional Schema-Based Generation
```bash
# Generate specific record types with custom counts
./bin/specmint generate -s test/schemas/ecommerce/products.json -o output -c 1000

# Generate healthcare claims with LLM enrichment
./bin/specmint generate -s test/schemas/medical/healthcare-claims-837.json -o claims --count 100 --llm-mode fields

# Generate pharmacy claims
./bin/specmint generate -s test/schemas/medical/rx-claims-ncpdp.json -o rx-data --count 500

# Validate existing dataset
./bin/specmint validate -s schema.json -d dataset.jsonl

# System health check
./bin/specmint doctor
```
Project Metrics
| Metric | Value | Details |
|--------|-------|---------|
| Development Time | ~6 hours | August 17, 2025 (05:00 - 11:00 PST) |
| Total Lines of Code | 3,186 | Pure Go implementation |
| Go Files | 11 | Modular architecture |
| Security Rating | A (Excellent) | Zero vulnerabilities |
| Test Coverage | Comprehensive | Golden dataset validation |
Architecture
SpecMint follows a clean, modular architecture designed for maintainability and extensibility:
┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
│  CLI Commands   │───▶│  Core Generator  │───▶│  Output Writer  │
│  (Cobra-based)  │    │  (Deterministic  │    │  (JSONL/JSON)   │
└─────────────────┘    │  + LLM Enhanced) │    └─────────────────┘
         │             └──────────────────┘             │
         ▼                      │                       ▼
┌─────────────────┐             ▼              ┌─────────────────┐
│  Schema Parser  │    ┌──────────────────┐    │ Domain Validator│
│  (JSON Schema)  │    │  LLM Integration │    │ (Business Rules)│
└─────────────────┘    │  (Local Ollama)  │    └─────────────────┘
                       └──────────────────┘
Core Components
- `cmd/specmint/` - CLI interface with 6 commands (generate, simulate, validate, inspect, doctor, benchmark)
- `pkg/generator/` - Deterministic generation engine with optional LLM enrichment
- `pkg/population/` - Population-based simulation and business scenario analysis
- `pkg/schema/` - JSON Schema parsing and validation
- `pkg/llm/` - Local Ollama integration for realistic data enhancement
- `pkg/validator/` - Domain-specific business rule validation
- `pkg/writer/` - Multi-format output handling
- `internal/config/` - Configuration management
- `internal/logger/` - Structured logging with zerolog
Development Collaboration
This project represents a unique Human-AI collaborative development approach:
Human Role (Project Lead)
- Strategic Vision: Defined requirements for privacy-focused synthetic data generation
- Architecture Guidance: Directed modular design decisions and Go best practices
- Domain Expertise: Provided business logic for healthcare, fintech, and e-commerce validation
- Quality Assurance: Guided testing strategies and security requirements
- Project Management: Managed scope, priorities, and deliverable timelines
AI Role (Cascade Assistant)
- Code Implementation: Wrote 100% of the 3,186 lines of Go code
- Technical Architecture: Implemented clean architecture patterns and interfaces
- Testing Strategy: Developed comprehensive golden dataset testing approach
- Security Implementation: Integrated security scanning and vulnerability management
- Documentation: Created comprehensive technical documentation and reports
Collaborative Highlights
- Real-time Feedback Loop: Immediate iteration on requirements and implementation
- Knowledge Transfer: AI learned domain-specific validation rules through human guidance
- Quality Standards: Human oversight ensured enterprise-grade code quality
- Problem Solving: Combined human strategic thinking with AI implementation speed
Testing Strategies
SpecMint employs multiple testing methodologies for comprehensive quality assurance:
1. Golden Dataset Testing
Run the suite with `./test/golden-test-suite.sh`.
- Purpose: Regression testing with known-good datasets
- Coverage: All three domains (healthcare, fintech, e-commerce)
- Validation: Schema compliance + domain business rules
- Datasets: 175 total records across domains
2. Domain-Specific Validation
- Healthcare: 837 Claims (ICD-10/CPT codes), NCPDP pharmacy claims, NPI validation, HIPAA compliance
- Fintech: ABA routing numbers, transaction limits, risk scoring
- E-commerce: SKU formats, inventory consistency, pricing validation
- X12 EDI: Purchase order validation, party ID verification, business transaction compliance
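As an illustration of the kind of rule listed above, ABA routing numbers carry a standard checksum: the nine digits are weighted 3, 7, 1 (repeating) and the weighted sum must be divisible by 10. The sketch below shows only that checksum; the function name `ValidABA` is hypothetical and this is not taken from SpecMint's validator.

```go
package main

import "fmt"

// ValidABA reports whether s is a nine-digit string whose weighted digit
// sum (weights 3, 7, 1 repeating) is divisible by 10.
// It checks the checksum only, not whether the number is actually assigned.
func ValidABA(s string) bool {
	if len(s) != 9 {
		return false
	}
	weights := [3]int{3, 7, 1}
	sum := 0
	for i := 0; i < len(s); i++ {
		c := s[i]
		if c < '0' || c > '9' {
			return false
		}
		sum += int(c-'0') * weights[i%3]
	}
	return sum%10 == 0
}

func main() {
	fmt.Println(ValidABA("021000021")) // weighted sum is 30
	fmt.Println(ValidABA("021000022")) // weighted sum is 31
}
```

A generator can use the same arithmetic in reverse: pick eight digits, then choose the ninth so the sum lands on a multiple of 10.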
3. LLM Integration Testing
- Connectivity: Automated Ollama health checks
- Fallback Logic: Graceful degradation to deterministic generation
- Quality Assurance: LLM output validation against schema constraints
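The fallback behavior described above can be sketched as a small wrapper: ask the LLM for an enriched value and keep the deterministic value if the call fails. The `Enricher` interface, `EnrichOrFallback`, and both stub types are hypothetical names introduced for this example; they are not SpecMint's API.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// Enricher is a hypothetical interface for anything that can rewrite a
// field value, such as a client for a local LLM.
type Enricher interface {
	Enrich(ctx context.Context, field, value string) (string, error)
}

// EnrichOrFallback returns the enriched value, or the deterministic value
// unchanged if the enricher errors or returns an empty string. The timeout
// only takes effect if the enricher honors the context.
func EnrichOrFallback(e Enricher, field, deterministic string, timeout time.Duration) (string, bool) {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	enriched, err := e.Enrich(ctx, field, deterministic)
	if err != nil || enriched == "" {
		return deterministic, false
	}
	return enriched, true
}

// downEnricher simulates an unreachable backend.
type downEnricher struct{}

func (downEnricher) Enrich(ctx context.Context, field, value string) (string, error) {
	return "", errors.New("connection refused")
}

// stubEnricher simulates a healthy backend.
type stubEnricher struct{}

func (stubEnricher) Enrich(ctx context.Context, field, value string) (string, error) {
	return value + " (enriched)", nil
}

func main() {
	v, used := EnrichOrFallback(downEnricher{}, "diagnosis", "Fever", time.Second)
	fmt.Println(v, used)
	v, used = EnrichOrFallback(stubEnricher{}, "diagnosis", "Fever", time.Second)
	fmt.Println(v, used)
}
```

Returning a flag alongside the value lets the caller record how many fields fell back, which is useful when checking LLM health after a run.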
4. Security Testing
- Static Analysis: gosec security scanner integration
- Vulnerability Scanning: govulncheck for Go stdlib issues
- Dependency Auditing: nancy for third-party package security
5. Performance Benchmarking
Run a benchmark with `./bin/specmint benchmark -s schema.json --counts 100,1000,10000`.
- Scalability: Multi-record generation performance
- Memory Usage: Resource consumption monitoring
- Deterministic Verification: Seed-based reproducibility testing
Build System & CI/CD
Local Development
A Makefile with 15+ targets covers the full development lifecycle:
```bash
# Development
make build test lint

# Security
make audit vulncheck

# CI/CD pipeline
make ci

# Dependency management
make deps-update deps-verify

# System diagnostics
make doctor
```
Automated CI/CD Pipeline
Production-grade GitHub Actions workflows with expert separation of concerns:
- CI/CD Pipeline: Multi-platform builds, test matrix, golden dataset validation
- Security Audit: Daily automated security scanning with SARIF integration
- Release Automation: Multi-platform binary builds with automated GitHub releases
- Coverage Reporting: Automated code coverage via Codecov integration
- Quality Gates: Go Report Card integration for code quality metrics
Security
SpecMint maintains an A-grade security rating with:
- Zero vulnerabilities (after the Go 1.25.0 upgrade)
- Automated security scanning in CI/CD pipeline
- Hardened file permissions (0600 for logs, 0750 for directories)
- Clean dependency tree with regular vulnerability monitoring
- Static code analysis with 54% security issue reduction
- Daily security audits via GitHub Actions
- SARIF integration for GitHub Security tab
See SECURITYAUDITREPORT.md for detailed security assessment.
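The permission modes mentioned in this section (0600 for logs, 0750 for directories) map directly onto Go's standard library calls. The following is a minimal sketch under that assumption; `OpenLog` and `checkPerms` are hypothetical helpers, and the modes shown apply on Unix-like systems, where the process umask can only remove bits.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// OpenLog creates dir with mode 0750 if needed, then opens name inside it
// for appending with mode 0600 (owner read/write only).
func OpenLog(dir, name string) (*os.File, error) {
	if err := os.MkdirAll(dir, 0o750); err != nil {
		return nil, fmt.Errorf("create log dir: %w", err)
	}
	path := filepath.Join(dir, name)
	return os.OpenFile(path, os.O_CREATE|os.O_WRONLY|os.O_APPEND, 0o600)
}

// checkPerms creates a log under a fresh temporary directory and reports
// the permission bits that resulted for the file and its directory.
func checkPerms() (file, dir os.FileMode, err error) {
	root, err := os.MkdirTemp("", "specmint-demo-")
	if err != nil {
		return 0, 0, err
	}
	defer os.RemoveAll(root)

	logDir := filepath.Join(root, "logs")
	f, err := OpenLog(logDir, "run.log")
	if err != nil {
		return 0, 0, err
	}
	defer f.Close()

	fi, err := f.Stat()
	if err != nil {
		return 0, 0, err
	}
	di, err := os.Stat(logDir)
	if err != nil {
		return 0, 0, err
	}
	return fi.Mode().Perm(), di.Mode().Perm(), nil
}

func main() {
	file, dir, err := checkPerms()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(file, dir)
}
```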
Key Features
Population-Based Intelligence
- Business Context Understanding: Analyze real-world scenarios and suggest realistic data volumes
- Automatic Scaling: Calculate appropriate record counts based on business size
- Domain Templates: Built-in knowledge for Healthcare, Banking, Retail, E-commerce, Insurance
- Relationship Modeling: Understand data dependencies and realistic proportions
Deterministic Generation
- Reproducible: Same seed produces identical datasets
- Scalable: Efficient generation of large datasets
- Schema-Compliant: Strict adherence to JSON Schema specifications
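Seed-based reproducibility generally comes down to routing every random choice through one seeded source. The sketch below illustrates the idea with Go's `math/rand`; the `Record` type and `Generate` function are hypothetical and much simpler than a schema-driven generator.

```go
package main

import (
	"fmt"
	"math/rand"
)

// Record is a hypothetical product row used only for this illustration.
type Record struct {
	ID     int
	SKU    string
	Amount int // price in cents
}

// Generate derives count records from seed alone, so the same seed and
// count always yield the same slice.
func Generate(seed int64, count int) []Record {
	rng := rand.New(rand.NewSource(seed)) // private source, no shared global state
	out := make([]Record, count)
	for i := range out {
		out[i] = Record{
			ID:     i + 1,
			SKU:    fmt.Sprintf("SKU-%06d", rng.Intn(1_000_000)),
			Amount: 100 + rng.Intn(99_900),
		}
	}
	return out
}

// Equal reports whether two record slices match element by element.
func Equal(a, b []Record) bool {
	if len(a) != len(b) {
		return false
	}
	for i := range a {
		if a[i] != b[i] {
			return false
		}
	}
	return true
}

func main() {
	first := Generate(42, 1000)
	second := Generate(42, 1000)
	fmt.Println("same seed reproduces dataset:", Equal(first, second))
}
```

Anything that depends on wall-clock time, map iteration order, or goroutine scheduling would break this property, so such inputs need to stay out of the generation path.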
LLM Enhancement
- Local Privacy: Uses local Ollama instance (no data leaves your machine)
- Selective Enrichment: Field-level LLM enhancement with fallback
- Configurable: Adjustable workers, rate limiting, and model selection
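A configurable worker count combined with rate limiting can be sketched with channels and a ticker. This example is stdlib-only and uses `time.Ticker` as a crude limiter; the repository's dependency list includes `golang.org/x/time`, whose token-bucket limiter may be what the project uses, but that is an assumption. `EnrichAll` is a hypothetical name.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
	"time"
)

// EnrichAll applies enrich to every input using a pool of workers and
// dispatches at most one job per interval across the whole pool.
// Results are returned in the same order as inputs.
func EnrichAll(inputs []string, workers int, interval time.Duration, enrich func(string) string) []string {
	if workers < 1 {
		workers = 1
	}
	if interval <= 0 {
		interval = time.Nanosecond // NewTicker panics on non-positive durations
	}
	results := make([]string, len(inputs))
	jobs := make(chan int)
	ticker := time.NewTicker(interval)
	defer ticker.Stop()

	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := range jobs {
				results[i] = enrich(inputs[i]) // each index is written by exactly one worker
			}
		}()
	}
	for i := range inputs {
		<-ticker.C // wait for the next rate-limit slot before dispatching
		jobs <- i
	}
	close(jobs)
	wg.Wait()
	return results
}

func main() {
	out := EnrichAll([]string{"fever", "cough"}, 2, 10*time.Millisecond, strings.ToUpper)
	fmt.Println(out)
}
```

Writing results by index keeps the output order independent of which worker finishes first, which matters if enrichment must not disturb deterministic record ordering.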
Domain Intelligence
- Healthcare: X12 837 healthcare claims, NCPDP pharmacy claims with medical coding
- Fintech: Transaction processing, ABA routing validation, risk scoring
- E-commerce: Product catalogs, inventory management, SKU generation
- X12 EDI: Purchase orders (850), business transactions with party validation
- Business Rules: Industry-specific validation logic with cross-field constraints
- Medical Coding: ICD-10 diagnosis codes, CPT procedure codes, NPI provider validation
- Realistic Data: LLM-enhanced medical descriptions and contextually appropriate values
Production Ready
- CLI Interface: Professional command-line tool with comprehensive help
- Multiple Formats: JSON, JSONL output with manifest generation
- Monitoring: Built-in health checks and system diagnostics
- Extensible: Plugin-ready architecture for new domains
Performance
- Generation Speed: 1000+ records/second (deterministic mode)
- Memory Efficiency: Streaming output for large datasets
- LLM Integration: Configurable rate limiting and worker pools
- Scalability: Tested up to 10,000+ record generation
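Streaming output for JSONL usually means encoding one record at a time into a buffered writer instead of building the whole dataset in memory. The sketch below shows that pattern with `encoding/json`; `WriteJSONL` and `Render` are hypothetical helpers, not SpecMint's writer package.

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"io"
	"os"
	"strings"
)

// WriteJSONL encodes count records produced by next, one JSON value per
// line, holding only the current record in memory.
func WriteJSONL(w io.Writer, count int, next func(i int) any) error {
	bw := bufio.NewWriter(w)
	enc := json.NewEncoder(bw) // Encode appends a newline after each value
	for i := 0; i < count; i++ {
		if err := enc.Encode(next(i)); err != nil {
			return err
		}
	}
	return bw.Flush()
}

// Render writes count small records into a string, for demonstration.
func Render(count int) (string, error) {
	var sb strings.Builder
	err := WriteJSONL(&sb, count, func(i int) any {
		return map[string]int{"id": i + 1}
	})
	return sb.String(), err
}

func main() {
	err := WriteJSONL(os.Stdout, 3, func(i int) any {
		return map[string]int{"id": i + 1}
	})
	if err != nil {
		fmt.Fprintln(os.Stderr, "error:", err)
	}
}
```

Because the generator callback is invoked per record, memory use stays flat whether the run produces a hundred records or ten thousand.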
Healthcare & Medical Data
SpecMint excels at generating enterprise-grade healthcare datasets with medical accuracy:
837 Healthcare Claims (X12 EDI)
- Complete X12 837 structure: Professional, institutional, and dental claims
- Medical coding compliance: Valid ICD-10 diagnosis codes, CPT procedure codes
- Provider validation: NPI identifiers, taxonomy codes, federal tax IDs
- LLM-enhanced realism: Medical diagnoses and procedure descriptions
- Cross-field validation: Medical logic enforcement across claim hierarchies
- Performance optimized: 5x faster than generic tools (2 LLM calls vs 10+ per record)
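For the NPI identifiers mentioned above, the check digit follows the Luhn algorithm computed over the prefix `80840` followed by the ten-digit NPI. The sketch below implements only that check; `ValidNPI` is a hypothetical name, and the code is not taken from SpecMint.

```go
package main

import "fmt"

// ValidNPI reports whether npi is ten digits and passes the Luhn check
// computed over the prefix "80840" followed by the NPI.
func ValidNPI(npi string) bool {
	if len(npi) != 10 {
		return false
	}
	full := "80840" + npi
	sum := 0
	double := false // the rightmost digit (the check digit) is not doubled
	for i := len(full) - 1; i >= 0; i-- {
		c := full[i]
		if c < '0' || c > '9' {
			return false
		}
		d := int(c - '0')
		if double {
			d *= 2
			if d > 9 {
				d -= 9 // same as adding the two digits of the product
			}
		}
		sum += d
		double = !double
	}
	return sum%10 == 0
}

func main() {
	fmt.Println(ValidNPI("1234567893"))
	fmt.Println(ValidNPI("1234567890"))
}
```

A synthetic generator can produce nine digits deterministically and then pick the tenth so this check passes, which keeps identifiers well-formed without copying real provider numbers.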
NCPDP Pharmacy Claims
- Prescription accuracy: NDC codes, DEA numbers, prior authorization
- Drug information: Realistic medication names, strengths, quantities
- Insurance processing: BIN/PCN numbers, copay calculations
- Regulatory compliance: HIPAA-safe synthetic data generation
Key Healthcare Features
- Medical realism: Clinically plausible diagnosis-procedure relationships
- Regulatory compliance: No real PHI/PII in synthetic data
- Scalable generation: Thousands of compliant claims efficiently
- Industry validation: Healthcare-specific business rules and constraints
Future Enhancements
- Additional X12 Transactions: 270/271 Eligibility, 835 Payment/Remittance, 856 ASN
- Additional Domains: Legal, manufacturing, retail verticals
- Output Formats: CSV, Parquet, database direct insertion
- Cloud LLM Support: OpenAI, Anthropic, Google integration
- Web Interface: Browser-based dataset generation UI
- API Mode: REST API for programmatic access
License
BSD 3-Clause License - see LICENSE for details.
Attribution Required: When using SpecMint, please include attribution as specified in the LICENSE file.
Acknowledgments
This project demonstrates the power of Human-AI collaboration in software development, combining human strategic vision with AI implementation capabilities to create enterprise-grade solutions in record time.
Built with ❤️ using Go 1.25.0 and collaborative AI development
Owner
- Name: Donald Johnson
- Login: copyleftdev
- Kind: user
- Location: Los Angeles
- Repositories: 39
- Profile: https://github.com/copyleftdev
GitHub Events
Total
- Push event: 22
- Create event: 1
Last Year
- Push event: 22
- Create event: 1
Dependencies
- actions/cache v3 composite
- actions/checkout v4 composite
- actions/setup-go v4 composite
- actions/upload-artifact v4 composite
- codecov/codecov-action v4 composite
- golangci/golangci-lint-action v3 composite
- actions/checkout v4 composite
- actions/download-artifact v4 composite
- actions/setup-go v4 composite
- actions/upload-artifact v4 composite
- softprops/action-gh-release v1 composite
- actions/cache v3 composite
- actions/checkout v4 composite
- actions/download-artifact v4 composite
- actions/github-script v6 composite
- actions/setup-go v4 composite
- actions/upload-artifact v4 composite
- github/codeql-action/upload-sarif v3 composite
- github.com/inconshreveable/mousetrap v1.1.0
- github.com/mattn/go-colorable v0.1.13
- github.com/mattn/go-isatty v0.0.19
- github.com/rs/zerolog v1.32.0
- github.com/santhosh-tekuri/jsonschema/v6 v6.0.1
- github.com/sony/gobreaker v0.5.0
- github.com/spf13/cobra v1.8.0
- github.com/spf13/pflag v1.0.5
- golang.org/x/sys v0.12.0
- golang.org/x/text v0.14.0
- golang.org/x/time v0.5.0
- gopkg.in/yaml.v3 v3.0.1
- github.com/coreos/go-systemd/v22 v22.5.0
- github.com/cpuguy83/go-md2man/v2 v2.0.3
- github.com/davecgh/go-spew v1.1.0
- github.com/dlclark/regexp2 v1.11.0
- github.com/godbus/dbus/v5 v5.0.4
- github.com/inconshreveable/mousetrap v1.1.0
- github.com/mattn/go-colorable v0.1.13
- github.com/mattn/go-isatty v0.0.16
- github.com/mattn/go-isatty v0.0.19
- github.com/pkg/errors v0.9.1
- github.com/pmezard/go-difflib v1.0.0
- github.com/rs/xid v1.5.0
- github.com/rs/zerolog v1.32.0
- github.com/russross/blackfriday/v2 v2.1.0
- github.com/santhosh-tekuri/jsonschema/v6 v6.0.1
- github.com/sony/gobreaker v0.5.0
- github.com/spf13/cobra v1.8.0
- github.com/spf13/pflag v1.0.5
- github.com/stretchr/objx v0.1.0
- github.com/stretchr/testify v1.3.0
- golang.org/x/sys v0.0.0-20220811171246-fbc7d0a398ab
- golang.org/x/sys v0.6.0
- golang.org/x/sys v0.12.0
- golang.org/x/text v0.14.0
- golang.org/x/time v0.5.0
- gopkg.in/check.v1 v0.0.0-20161208181325-20d25e280405
- gopkg.in/yaml.v3 v3.0.1