https://github.com/timmikeladze/rehiver

🐝 Super-charge your Hive-partitioned S3 file operations with intelligent pattern matching, change detection, optimized data fetching, and out-of-the-box time series support.

Science Score: 13.0%

This score indicates how likely this project is to be science-related based on various indicators:

○ CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
○ .zenodo.json file
○ DOI references
○ Academic publication links
○ Academic email domains
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (10.7%) to scientific vocabulary

Keywords

glob hive-partioning hive-s3 hive-timeseries s3 s3-hive s3-timeseries time-series timeseries

Last synced: 9 months ago

Repository

Basic Info

Host: GitHub
Owner: TimMikeladze
License: mit
Language: TypeScript
Default Branch: main
Homepage:
Size: 381 KB

Statistics

Stars: 1
Watchers: 1
Forks: 0
Open Issues: 4
Releases: 2

Created about 1 year ago · Last pushed 10 months ago

Metadata Files

Readme License

🐝 rehiver

Super-charge your Hive-partitioned S3 file operations with intelligent pattern matching, change detection, optimized data fetching, and out-of-the-box time series support.

```bash
pnpm install rehiver
```

📋 Overview

rehiver is your TypeScript powerhouse for S3 operations that makes working with partitioned data and cloud storage effortless. It combines intelligent glob pattern matching with flexible Hive partitioning, local data caching, and efficient change detection to simplify complex data operations - all with type safety built in.

Key Features

🍯 Hive Partitioning - Parse and generate Hive-style partitions with type safety and custom partition layouts.
🔍 Pattern Matching - Target exactly the files you need with expressive glob patterns.
⏱️ Time Partitioning - Built-in support for time-based partitioning (hourly, daily, monthly, yearly).
🔄 Change Detection - Track additions, modifications, and deletions efficiently with disk-based state.
💾 Local Data Management - Smart local caching with disk-based storage and efficient change detection.
⚡ Concurrency Controls - Process multiple objects in parallel with fine-tuned settings.
📊 Progress Tracking - Monitor long-running operations with built-in hooks.
🚀 Optimized Data Fetching - Smart caching, batch processing, and efficient pattern matching for large-scale operations.

🚀 Quick Start

Here's a quick example demonstrating the core functionality of rehiver for a simple time series data pipeline:

```typescript
import Rehiver from 'rehiver';

// Initialize with your configuration
const rehiver = new Rehiver({
  s3Options: { region: 'us-east-1' }
});

// Set up time partitioning for hourly data
const timeGen = rehiver.timePartitioner({
  granularity: 'hourly',
  format: 'hive'
});

// Get the last 24 hours of metrics
const now = new Date();
const yesterday = new Date(now);
yesterday.setDate(now.getDate() - 1);

// Find and process time series data
await rehiver.streamMatchingObjects(
  'metrics-bucket',
  timeGen.generatePathsForRange(yesterday, now).map(p => `${p}/metrics.parquet`),
  async (key) => {
    const timestamp = timeGen.parsePath(key).toDate();
    await processTimeSeriesData(key, timestamp);
  }
);
```

🔥 Features in Action

Powerful Pattern Matching

Target exactly what you need with glob patterns:

```typescript
// Multiple patterns with negation
const dataFiles = await rehiver.findMatchingObjects({
  bucket: 'analytics-bucket',
  patterns: [
    '**/*.json',          // All JSON files
    '!**/temp/**/*.json'  // Exclude temp files
  ],
  maxConcurrentRequests: 20
});
```

Under the hood, rehiver optimizes pattern matching by:

- Compiling patterns to regular expressions for performance
- Caching compiled patterns to avoid redundant processing
- Supporting advanced glob syntax including negation and alternation
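
To make the compile-and-cache idea concrete, here is a minimal sketch of a glob matcher. It is not rehiver's internal code: the names `globToRegExp`, `matchKeys`, and `compiledCache` are hypothetical, and only the `**`, `*`, `?`, and leading `!` forms are handled.

```typescript
// Hypothetical helper names; a sketch of the technique, not rehiver's implementation.
const compiledCache = new Map<string, RegExp>();

// Translate a small glob subset (**, *, ?) into an anchored regular expression.
function globToRegExp(glob: string): RegExp {
  const cached = compiledCache.get(glob);
  if (cached) return cached; // reuse the previously compiled pattern

  let source = "";
  for (let i = 0; i < glob.length; i++) {
    const ch = glob.charAt(i);
    if (ch === "*") {
      if (glob.charAt(i + 1) === "*") {
        if (glob.charAt(i + 2) === "/") {
          source += "(?:.*/)?"; // "**/" matches zero or more directories
          i += 2;
        } else {
          source += ".*"; // a bare "**" matches anything
          i += 1;
        }
      } else {
        source += "[^/]*"; // "*" stays within one path segment
      }
    } else if (ch === "?") {
      source += "[^/]"; // "?" matches one character within a segment
    } else {
      source += ch.replace(/[.+^${}()|[\]\\]/g, "\\$&"); // escape regex metacharacters
    }
  }

  const compiled = new RegExp(`^${source}$`);
  compiledCache.set(glob, compiled);
  return compiled;
}

// Keep keys that match at least one positive pattern and no "!" negation pattern.
function matchKeys(keys: string[], patterns: string[]): string[] {
  const include = patterns.filter((p) => !p.startsWith("!")).map(globToRegExp);
  const exclude = patterns
    .filter((p) => p.startsWith("!"))
    .map((p) => globToRegExp(p.slice(1)));
  return keys.filter(
    (key) => include.some((re) => re.test(key)) && !exclude.some((re) => re.test(key)),
  );
}
```

Calling `matchKeys(keys, ['**/*.json', '!**/temp/**/*.json'])` mirrors the pattern list in the example above: the second call with the same patterns skips compilation because both expressions are already in the cache.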

Time-Series Made Simple

Effortlessly work with time-partitioned data:

```typescript
// Create a time partitioner
const timeGen = rehiver.timePartitioner({
  granularity: 'daily',
  format: 'hive' // Creates "year=2023/month=07/day=15" style paths
});

// Generate paths for the last 7 days
const today = new Date();
const weekAgo = new Date(today);
weekAgo.setDate(today.getDate() - 7);

const paths = timeGen.generatePathsForRange(weekAgo, today);

// Use generated paths to find matching objects
const weeklyData = await rehiver.findMatchingObjects({
  bucket: 'timeseries-bucket',
  patterns: paths.map(p => `${p}/**/*.parquet`)
});
```

Time series features include:

- Multiple granularity levels (hourly, daily, monthly, yearly)
- Range generation for time windows
- Integration with Hive partitioning and pattern matching
- Efficient querying of historical data with smart path generation
- Support for custom time formats and timezone handling
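
Range generation can be pictured as stepping a cursor through the time window and formatting each step as a partition path. The sketch below assumes UTC, covers only hourly and daily granularity, and uses hypothetical names (`hivePath`, `pathsForRange`); rehiver's own `generatePathsForRange` may behave differently.

```typescript
// Hypothetical helpers; a sketch of range generation, not rehiver's implementation.
type Granularity = "hourly" | "daily";

const pad = (n: number): string => String(n).padStart(2, "0");

// Format one UTC timestamp as a Hive-style partition path.
function hivePath(date: Date, granularity: Granularity): string {
  const base =
    `year=${date.getUTCFullYear()}` +
    `/month=${pad(date.getUTCMonth() + 1)}` +
    `/day=${pad(date.getUTCDate())}`;
  return granularity === "hourly" ? `${base}/hour=${pad(date.getUTCHours())}` : base;
}

// List every partition path from start to end (inclusive), one per hour or day.
function pathsForRange(start: Date, end: Date, granularity: Granularity): string[] {
  const stepMs = granularity === "hourly" ? 60 * 60 * 1000 : 24 * 60 * 60 * 1000;
  // Truncate the start to the beginning of its UTC hour or day so steps stay aligned.
  const cursor = new Date(Math.floor(start.getTime() / stepMs) * stepMs);
  const paths: string[] = [];
  while (cursor.getTime() <= end.getTime()) {
    paths.push(hivePath(cursor, granularity));
    cursor.setTime(cursor.getTime() + stepMs);
  }
  return paths;
}
```

Each returned path can then be turned into a glob, for example by appending `/**/*.parquet`, which keeps the S3 listing limited to the partitions inside the window.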

Type-Safe Hive Partitioning

Handle partitioned data with confidence:

```typescript
import { z } from 'zod';

// Define your partition schema with Zod
const partitionSchema = z.object({
  year: z.coerce.number().int().min(2000).max(2100),
  month: z.coerce.number().int().min(1).max(12),
  day: z.coerce.number().int().min(1).max(31),
  region: z.enum(['us', 'eu', 'asia'])
});

// Create a partition parser
const parser = rehiver.partitionParser(partitionSchema);

// Parse with type inference
const partitionData = parser.parse('year=2023/month=07/day=15/region=us');
// => { year: 2023, month: 7, day: 15, region: 'us' }

// Generate a glob pattern for partial specifications
const pattern = parser.createGlobPattern({ year: 2023, region: 'us' });
// => "year=2023/month=*/day=*/region=us"
```

The Hive partitioning system provides:

- Runtime validation through Zod schemas
- Type-safe access to partition components
- Bidirectional conversion between paths and structured data
- Seamless integration with Apache Hive, Presto, and other query engines
- Support for nested partitioning (e.g., year/month/day/hour)
- Automatic partition pruning for efficient querying
- Built-in support for common partition types (date, region, customer, etc.)
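
The bidirectional conversion between paths and structured data boils down to splitting `key=value` segments and joining them back. The sketch below shows that core step without Zod, so values stay strings; `parsePartition` and `toGlobPattern` are hypothetical names, not rehiver's API.

```typescript
// Hypothetical helpers; a dependency-free sketch, not rehiver's implementation.
type Partition = Record<string, string>;

// Extract key=value segments from a path; other segments (e.g. file names) are ignored.
function parsePartition(path: string): Partition {
  const partition: Partition = {};
  for (const segment of path.split("/")) {
    const eq = segment.indexOf("=");
    if (eq > 0) {
      partition[segment.slice(0, eq)] = segment.slice(eq + 1);
    }
  }
  return partition;
}

// Build a path in the given key order, using "*" for every key without a known value.
function toGlobPattern(keys: string[], known: Partial<Partition>): string {
  return keys.map((key) => `${key}=${known[key] ?? "*"}`).join("/");
}
```

For instance, `toGlobPattern(['year', 'month', 'day', 'region'], { year: '2023', region: 'us' })` yields `year=2023/month=*/day=*/region=us`. A schema layer such as the Zod example above would sit on top of `parsePartition` to coerce and validate the string values.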

Time Series Database Integration

rehiver excels at working with time series data:

```typescript
// Set up hourly partitioning for high-frequency data
const hourlyGen = rehiver.timePartitioner({
  granularity: 'hourly',
  format: 'hive'
});

// Generate paths for the last 24 hours
const now = new Date();
const yesterday = new Date(now);
yesterday.setDate(now.getDate() - 1);
const paths = hourlyGen.generatePathsForRange(yesterday, now);

// Find and process time series data
const timeSeriesData = await rehiver.findMatchingObjects({
  bucket: 'metrics-bucket',
  patterns: paths.map(p => `${p}/metrics.parquet`),
  // Optional: Add metadata for time series specific operations
  metadata: {
    retentionPeriod: '30d',
    compression: 'snappy'
  }
});

// Process with time-aware operations
// `parser` is a partition parser (see "Type-Safe Hive Partitioning") whose schema also includes `hour`
for (const data of timeSeriesData) {
  const partition = parser.parse(data.key);
  const timestamp = new Date(partition.year, partition.month - 1, partition.day, partition.hour);
  await processTimeSeriesData(data.key, timestamp);
}
```

Time series database features:

- Optimized for high-frequency data ingestion
- Efficient querying of time ranges
- Automatic data lifecycle management
- Support for data retention policies
- Integration with popular time series databases
- Built-in support for data downsampling and aggregation
- Smart caching for frequently accessed time ranges

Efficient Change Detection

Track what's changed between runs:

```typescript
// Create a change detector
const detector = rehiver.changeDetector();

// Load previous state
await detector.loadPreviousState('state.json');

// Add current objects
const currentObjects = await rehiver.findMatchingObjects('data-lake', '**/*.parquet');
detector.addObjects(currentObjects.map(key => ({
  key,
  size: 0,
  etag: '',
  lastModified: new Date()
})));

// Get only what changed
const changes = detector.detectChanges();

// Process each change type
for (const change of changes) {
  if (change.changeType === 'added') {
    await processNewFile(change.object.key);
  } else if (change.changeType === 'modified') {
    await reprocessFile(change.object.key);
  }
}

// Save current state for next run
await detector.saveCurrentState('state.json');
```

Change detection capabilities:

- Track additions, modifications, and deletions
- Configurable comparison modes (quick or full)
- Persistent state between application runs
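
Conceptually, detecting changes is a diff between the object listing saved by the previous run and the current one. The following sketch compares by `etag` and `size`; the `ObjectState` shape and the `diffStates` function are assumptions made for illustration and are not taken from rehiver's source.

```typescript
// Hypothetical types and function; a sketch of state diffing, not rehiver's implementation.
interface ObjectState {
  key: string;
  size: number;
  etag: string;
}

interface Change {
  changeType: "added" | "modified" | "deleted";
  object: ObjectState;
}

function diffStates(previous: ObjectState[], current: ObjectState[]): Change[] {
  // Index the previous run by key for constant-time lookups.
  const before = new Map<string, ObjectState>();
  for (const obj of previous) before.set(obj.key, obj);

  const changes: Change[] = [];
  const seen = new Set<string>();

  for (const obj of current) {
    seen.add(obj.key);
    const old = before.get(obj.key);
    if (!old) {
      changes.push({ changeType: "added", object: obj });
    } else if (old.etag !== obj.etag || old.size !== obj.size) {
      changes.push({ changeType: "modified", object: obj });
    }
  }

  // Anything in the previous run that no longer appears has been deleted.
  for (const obj of previous) {
    if (!seen.has(obj.key)) changes.push({ changeType: "deleted", object: obj });
  }
  return changes;
}
```

With this kind of comparison, placeholder values such as `size: 0` and `etag: ''` would never register as modified, so the real metadata from the S3 listing has to be passed in for modifications to be detected.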

🌍 Real-World Examples

Data Lake ETL Pipeline

Build a robust ETL pipeline with change detection:

```typescript
// 1. Set up time partitioning and change detection
const timeGen = rehiver.timePartitioner({ granularity: 'daily' });
const todayPath = timeGen.generateCurrentPath();
const detector = rehiver.changeDetector();

// 2. Load previous state
await detector.loadPreviousState();

// 3. Get current raw files
const rawFiles = await rehiver.findMatchingObjects(
  'data-lake',
  `${todayPath}/raw/**/*.json`
);

// 4. Track the objects for change detection
detector.addObjects(rawFiles.map(key => ({
  key,
  size: 0,
  etag: '',
  lastModified: new Date()
})));

// 5. Process only new or modified files
const changes = detector.detectChanges();
for (const { changeType, object } of changes) {
  if (changeType === 'added' || changeType === 'modified') {
    await transformAndLoad(object.key);
  }
}

// 6. Save state for next run
await detector.saveCurrentState();
```

Event Log Processing

Stream and process logs with concurrency control:

```typescript
// Process logs with controlled concurrency
const { processed, matched } = await rehiver.streamMatchingObjects({
  bucket: 'logs-bucket',
  patterns: '**/*.log',
  processor: async (key) => {
    const logContent = await downloadLogFile(key);
    await processLogEvents(logContent);
  },
  maxConcurrentProcessing: 5,
  onProgress: ({ processed, total, matched }) => {
    console.log(`Processed ${processed}/${total} objects, matched ${matched}`);
  }
});

console.log(`Completed processing ${processed} out of ${matched} logs`);
```

Multi-Region Data Processing

```typescript
import { z } from 'zod';

// Define your partition schema
const schema = z.object({
  year: z.coerce.number(),
  month: z.coerce.number(),
  day: z.coerce.number(),
  region: z.enum(['us', 'eu', 'asia'])
});

// Create a partition parser
const parser = rehiver.partitionParser(schema);

// Find data for US region from last month
const lastMonth = new Date();
lastMonth.setMonth(lastMonth.getMonth() - 1);
const year = lastMonth.getFullYear();
const month = lastMonth.getMonth() + 1;

// Create a pattern for the specific month and region
const pattern = parser.createGlobPattern({ year, month, region: 'us' });

// Find and process matching objects
const usData = await rehiver.findMatchingObjects(
  'analytics-bucket',
  `${pattern}/**/*.parquet`
);

console.log(`Processing ${usData.length} US region files from ${year}-${month}`);
```

Real-Time Data Monitoring

```typescript
// Create hourly partitioner
const hourlyGen = rehiver.timePartitioner({
  granularity: 'hourly',
  format: 'hive'
});

// Generate paths for the last 24 hours
const now = new Date();
const yesterday = new Date(now);
yesterday.setDate(now.getDate() - 1);
const paths = hourlyGen.generatePathsForRange(yesterday, now);

// Find the latest data files
const latestData = await rehiver.findMatchingObjects(
  'metrics-bucket',
  paths.map(p => `${p}/metrics.json`)
);

console.log(`Found ${latestData.length} hourly metric files for the dashboard`);
```

💻 API Overview

rehiver provides a clean, unified API for all functionality:

```typescript
// Create a single rehiver instance for all operations
const rehiver = new Rehiver({
  s3Options: {
    region: 'us-east-1',
    // Optional AWS credentials
    credentials: {
      accessKeyId: process.env.AWS_ACCESS_KEY_ID!,
      secretAccessKey: process.env.AWS_SECRET_ACCESS_KEY!
    },
    // Optional S3 endpoint for custom S3-compatible storage
    endpoint: 'http://minio.example.com',
    forcePathStyle: true
  },
  // Optional caching configuration
  cacheOptions: {
    enabled: true,
    maxSize: 1000,
    ttl: 5 * 60 * 1000 // 5 minutes
  }
});

// All functionality through the same interface
// 1. Pattern matching
const matched = rehiver.match(paths, '**/*.json');

// 2. S3 operations
const objects = await rehiver.findMatchingObjects('bucket', '**/*.parquet');

// 3. Hive partitioning
const parser = rehiver.partitionParser(schema);

// 4. Time partitioning
const timeGen = rehiver.timePartitioner({ granularity: 'daily' });

// 5. Change detection
const detector = rehiver.changeDetector();
```

🏗️ Technical Implementation

Architecture Overview

rehiver is built on a modular architecture with specialized components:

PathMatcher: Core pattern matching capabilities
S3PathMatcher: S3-specific operations and optimizations
HivePartitionParser: Partition path parsing and validation
TimePartitionGenerator: Time-based path generation
ChangeDetectionEngine: File change tracking
rehiver: Main class that orchestrates all components

S3 Integration

rehiver's S3 integration is designed for reliability and performance:

Automatic retry with exponential backoff
Concurrency controls to prevent API throttling
Metadata caching for improved performance
Support for custom S3-compatible storage endpoints
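
As an illustration of retry with exponential backoff, the sketch below doubles the wait after each failed attempt up to a cap. The function names, the default delays, and the absence of jitter are all assumptions; the README does not describe rehiver's actual retry policy in this detail.

```typescript
// Hypothetical helpers; a sketch of exponential backoff, not rehiver's implementation.

// Delay before retry number `attempt` (0-based): baseMs * 2^attempt, capped at maxMs.
function backoffDelay(attempt: number, baseMs = 100, maxMs = 5000): number {
  return Math.min(maxMs, baseMs * 2 ** attempt);
}

// Run an async operation, retrying up to maxRetries times with growing delays.
async function withRetry<T>(
  operation: () => Promise<T>,
  maxRetries = 3,
  baseMs = 100,
): Promise<T> {
  let attempt = 0;
  while (true) {
    try {
      return await operation();
    } catch (error) {
      if (attempt >= maxRetries) throw error; // out of retries: surface the last error
      const delay = backoffDelay(attempt, baseMs);
      await new Promise<void>((resolve) => setTimeout(resolve, delay));
      attempt++;
    }
  }
}
```

A production version would usually retry only on throttling or transient network errors and add random jitter so that parallel workers do not retry in lockstep.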

Data Fetching Optimizations

rehiver includes several powerful optimizations for efficient data fetching at scale:

Smart Caching System
- LRU-based metadata caching with configurable TTL
- Background cache refresh to prevent stale data
- Automatic cache invalidation on object updates
- Configurable cache size and refresh thresholds
Concurrency Controls
- Fine-grained control over request and processing concurrency
- Batch processing with configurable batch sizes
- Automatic throttling to prevent API rate limits
- Progress tracking for long-running operations
Pattern Matching Optimizations
- Compiled regex caching for faster pattern matching
- Fast path matching with precompiled patterns
- Support for negation patterns to exclude files
- Efficient handling of special characters in paths
Local Caching
- Optional local file caching to reduce S3 requests
- Skip existing files to avoid redundant downloads
- Configurable cache base paths and policies
- Automatic cache cleanup and management
Performance Monitoring
- Built-in progress tracking hooks
- Detailed statistics for processed objects
- Support for abort signals to cancel long operations
- Comprehensive error handling and retry logic
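
The metadata cache described above pairs least-recently-used eviction with a time-to-live. The package lists `lru-cache` among its dependencies; the sketch below uses a plain `Map` instead so that it stays self-contained. The class name `TtlLruCache` is hypothetical, and it assumes a `maxSize` of at least 1.

```typescript
// Hypothetical class; a sketch of LRU eviction with TTL, not rehiver's implementation.
class TtlLruCache<V> {
  private readonly entries = new Map<string, { value: V; expiresAt: number }>();
  private readonly maxSize: number;
  private readonly ttlMs: number;

  constructor(maxSize: number, ttlMs: number) {
    this.maxSize = maxSize;
    this.ttlMs = ttlMs;
  }

  get(key: string, now: number = Date.now()): V | undefined {
    const entry = this.entries.get(key);
    if (!entry) return undefined;
    if (entry.expiresAt <= now) {
      this.entries.delete(key); // expired: drop it and report a miss
      return undefined;
    }
    // Re-insert so the key becomes the most recently used (Map keeps insertion order).
    this.entries.delete(key);
    this.entries.set(key, entry);
    return entry.value;
  }

  set(key: string, value: V, now: number = Date.now()): void {
    this.entries.delete(key);
    if (this.entries.size >= this.maxSize) {
      // The first key in iteration order is the least recently used.
      const oldest = this.entries.keys().next().value;
      if (oldest !== undefined) this.entries.delete(oldest);
    }
    this.entries.set(key, { value, expiresAt: now + this.ttlMs });
  }
}
```

The optional `now` parameter exists only to make expiry easy to exercise in tests; callers would normally omit it.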

Example of optimized data fetching:

```typescript
// Configure optimized data fetching
const rehiver = new Rehiver({
  s3Options: {
    region: 'us-east-1',
    maxRetries: 3
  },
  cacheOptions: {
    enabled: true,
    maxSize: 2000,        // Store up to 2000 items
    ttl: 10 * 60 * 1000,  // Cache for 10 minutes
    refreshThreshold: 70  // Refresh at 70% of TTL
  }
});

// Process files with optimized settings
await rehiver.streamMatchingObjects({
  bucket: 'data-bucket',
  patterns: '**/*.parquet',
  processor: async (key) => processFile(key),
  // Concurrency controls
  concurrency: {
    requestLimit: 5,     // Max concurrent S3 requests
    processingLimit: 10  // Max concurrent file processing
  },
  // Batch processing
  batchSize: 100,
  // Local caching
  localCache: {
    enabled: true,
    basePath: './cache',
    skipExisting: true
  },
  // Progress tracking
  onProgress: ({ processed, total, matched }) => {
    console.log(`Processed: ${processed}/${total} (${matched} matched)`);
  }
});
```
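
A processing limit like the one configured above can be pictured as a fixed pool of workers pulling items from a shared queue. The package depends on `p-limit`, but the sketch below is a dependency-free stand-in with a hypothetical name (`mapWithConcurrency`), shown only to illustrate the idea.

```typescript
// Hypothetical helper; a sketch of bounded concurrency, not rehiver's implementation.
async function mapWithConcurrency<T, R>(
  items: T[],
  limit: number,
  worker: (item: T) => Promise<R>,
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0; // index of the next unclaimed item, shared by all runners

  // Each runner claims items one at a time until none are left.
  async function runner(): Promise<void> {
    while (next < items.length) {
      const index = next++;
      results[index] = await worker(items[index] as T);
    }
  }

  // Start at most `limit` runners; together they never exceed `limit` in-flight calls.
  const runners = Array.from({ length: Math.min(limit, items.length) }, () => runner());
  await Promise.all(runners);
  return results;
}
```

Results are written by index, so the output order matches the input order even when workers finish out of order.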

🤝 Contributing

Contributions are welcome! Here's how to get started:

1. Fork the repository
2. Clone your fork and create a new branch
3. Install dependencies: pnpm install
4. Start docker containers for testing: docker compose up -d
5. Run tests during development: pnpm dev
6. Make your changes and add tests
7. Ensure all tests pass: pnpm test
8. Commit your changes with conventional commits
9. Push your branch and open a pull request

Owner

Name: Tim Mikeladze
Login: TimMikeladze
Kind: user
Location: Seattle, WA

Website: linesofcode.dev
Twitter: linesofcode
Repositories: 138
Profile: https://github.com/TimMikeladze

GitHub Events

Total

Create event: 7
Issues event: 2
Release event: 3
Watch event: 1
Delete event: 2
Issue comment event: 1
Public event: 1
Push event: 4
Pull request event: 6

Last Year

Create event: 7
Issues event: 2
Release event: 3
Watch event: 1
Delete event: 2
Issue comment event: 1
Public event: 1
Push event: 4
Pull request event: 6

Packages

Total packages: 1
Total downloads:
- npm 16 last-month

Total dependent packages: 0
Total dependent repositories: 0
Total versions: 3
Total maintainers: 1

npmjs.org: rehiver

Homepage: https://github.com/TimMikeladze/rehiver#readme
License: MIT
Latest release: 1.1.0
published about 1 year ago

Versions: 3
Dependent Packages: 0
Dependent Repositories: 0
Downloads: 16 Last month

Rankings

Downloads: 7.8%

Average: 22.9%

Dependent repos count: 25.0%

Dependent packages count: 36.0%

Maintainers (1)

tmikeladze

Last synced: 10 months ago

Dependencies

.github/workflows/main.yml actions

actions/cache v4 composite
actions/checkout v4 composite
actions/setup-node v4 composite
pnpm/action-setup v4 composite

.github/workflows/publish.yml actions

actions/cache v4 composite
actions/checkout v4 composite
actions/setup-node v4 composite
bitovi/github-actions-storybook-to-github-pages v1.0.2 composite
pnpm/action-setup v4 composite

docker-compose.yml docker

minio/minio latest

package.json npm

@biomejs/biome 1.9.4 development
@ryansonshine/commitizen 4.2.8 development
@ryansonshine/cz-conventional-changelog 3.3.4 development
@storybook/addon-essentials 8.6.8 development
@storybook/addon-interactions 8.6.8 development
@storybook/addon-links 8.6.8 development
@storybook/addon-webpack5-compiler-swc 3.0.0 development
@storybook/blocks 8.6.8 development
@storybook/react 8.6.8 development
@storybook/react-webpack5 8.6.8 development
@storybook/test 8.6.8 development
@testing-library/jest-dom 6.6.3 development
@testing-library/react 16.2.0 development
@types/micromatch ^4.0.9 development
@types/mime-types ^2.1.4 development
@types/node 22.13.11 development
@types/react 19.0.12 development
@types/react-dom 19.0.4 development
@types/react-test-renderer 19.0.0 development
@vitest/coverage-v8 3.0.9 development
concurrently 9.1.2 development
dotenv ^16.4.7 development
jsdom 26.0.0 development
lefthook 1.11.3 development
prop-types 15.8.1 development
react 19.0.0 development
react-dom 19.0.0 development
react-test-renderer 19.0.0 development
release-it 18.1.2 development
storybook 8.6.8 development
ts-node 10.9.2 development
tsconfig-paths 4.2.0 development
tsup 8.4.0 development
tsx 4.19.3 development
typescript 5.8.2 development
vitest 3.0.9 development
@aws-sdk/client-s3 ^3.772.0
lru-cache ^11.0.2
micromatch ^4.0.8
mime-types ^2.1.35
p-limit ^6.2.0
zod ^3.24.2

pnpm-lock.yaml npm

379 dependencies
