spring-batch-db-cluster-partitioning: Database-driven clustering with heartbeats and failover for Spring Batch
spring-batch-db-cluster-partitioning: Database-driven clustering with heartbeats and failover for Spring Batch - Published in JOSS (2026)
https://github.com/jchejarla/spring-batch-db-cluster-partitioning
Science Score: 87.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ○ codemeta.json file
- ○ .zenodo.json file
- ✓ DOI references: found 1 DOI reference(s) in JOSS metadata
- ○ Academic publication links
- ○ Academic email domains
- ○ Institutional organization owner
- ✓ JOSS paper metadata: published in Journal of Open Source Software
Repository
Spring Batch distributed partitioning using a shared database for coordination and failover
Statistics
- Stars: 9
- Watchers: 3
- Forks: 1
- Open Issues: 4
- Releases: 4
Metadata Files
README.md
Spring Batch Database Cluster Partitioning
🧭 Table of Contents
- Overview
- Key Features
- How It Works
- Architecture
- Sequence diagram
- Featured In
- Actuator Endpoints
- Getting Started
- Performance and Scalability
- Fault Tolerance
- Contributing
- License
- Contact
🚀 Overview
This project provides a novel solution for distributed partitioning in Spring Batch, enabling scalable and fault-tolerant execution of batch jobs across multiple JVM instances (nodes). Unlike traditional Spring Batch remote partitioning methods that rely on external messaging systems (e.g., Kafka, RabbitMQ) or complex orchestrators, this solution leverages a shared relational database for all cluster coordination, dynamic task assignment, explicit state tracking, and reliable fault detection and recovery.
This approach simplifies the architecture, provides real-time visibility into job progress, and ensures robust task re-assignment upon node failures, all while minimizing changes to existing Spring Batch application logic.
The core principle of this project is to know the number of available nodes up front, so that an efficient task partitioning and distribution strategy can be determined at runtime, and to make failover straightforward when a cluster node stops responding.
✨ Key Features
- Decentralized Master Election: No central master node required — the node that initiates the job automatically becomes the master for that execution, enabling fully autonomous job launches across the cluster.
- Proactive Node Awareness: Before partitioning, the master node dynamically queries the cluster state to identify all currently active nodes. This enables smarter distribution strategies (e.g., round-robin, fixed-node allocation) based on real-time availability, avoiding delays or imbalance caused by late-arriving workers. Importantly, the number of available nodes is also passed to the task builder logic, empowering end users to construct task partitions that are tailored to the current cluster size. This allows for more efficient execution planning and better resource utilization.
- Database-Driven Coordination: Utilizes a common relational database (e.g., PostgreSQL, Oracle, MySQL) as the central hub for cluster state management.
- Dynamic Node Awareness: Master nodes discover and assign partitions to active worker nodes in real-time by querying the database.
- Flexible Partitioning Strategies:
  - Round-Robin: Evenly distributes partitions across available nodes.
  - Fixed Node Count: Assigns partitions to a specified number of nodes.
- Explicit Task State Tracking: Every partition's lifecycle (PENDING, CLAIMED, COMPLETED, FAILED) is transactionally recorded, so the current state of each partition can be inspected in the database at any time.
- Robust Fault Tolerance:
  - Node Heartbeats: Worker nodes periodically update their liveness, enabling master nodes to detect unresponsive instances.
  - Configurable Task Re-assignment: Uncompleted tasks from failed nodes can be automatically re-assigned to healthy nodes, ensuring job completion.
- Simplified Architecture: Reduces operational overhead by eliminating the need for complex message queues or dedicated cluster management tools.
- Customizable Callbacks: Provides interfaces for custom logic upon overall job success or failure.
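To make the Round-Robin strategy listed above concrete, here is a minimal, self-contained sketch of assigning partitions to nodes in rotation. It is an illustration only and not the library's implementation: the class name RoundRobinSketch, the method assign, and the use of plain strings for partition keys and node ids are all assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative only: a hypothetical helper, not the library's strategy class. */
final class RoundRobinSketch {

    /**
     * Assigns partition keys to node ids in rotation, preserving input order.
     * Node ids are assumed to be unique.
     */
    static Map<String, List<String>> assign(List<String> partitionKeys,
                                            List<String> activeNodeIds) {
        if (activeNodeIds.isEmpty()) {
            throw new IllegalArgumentException("at least one active node is required");
        }
        Map<String, List<String>> byNode = new LinkedHashMap<>();
        for (String nodeId : activeNodeIds) {
            byNode.put(nodeId, new ArrayList<>());
        }
        for (int i = 0; i < partitionKeys.size(); i++) {
            // Partition i goes to node (i mod nodeCount).
            String nodeId = activeNodeIds.get(i % activeNodeIds.size());
            byNode.get(nodeId).add(partitionKeys.get(i));
        }
        return byNode;
    }
}
```

With five partitions and two nodes, the first node receives partitions 0, 2, and 4 and the second receives partitions 1 and 3, so node loads differ by at most one partition.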
🛠 How it Works
- Node Registration: Each Spring Batch instance (master and worker) registers itself in the BATCH_NODES table upon startup and sends periodic heartbeats.
- Partitioning (Master Node):
  - A custom ClusterAwarePartitioner queries the BATCH_NODES table to identify active workers.
  - It then splits the job's workload into ExecutionContext partitions.
  - Based on a chosen PartitionStrategy (e.g., Round-Robin), it assigns these partitions to active worker nodes and records them in the BATCH_PARTITIONS table with a 'PENDING' status.
- Task Execution (Worker Nodes):
  - Each PartitionWorkerTasksRunner on a worker node continuously polls the BATCH_PARTITIONS table for tasks assigned to it.
  - Upon picking up a task, it immediately updates its status to 'CLAIMED' transactionally.
  - The worker then executes the assigned Spring Batch Step.
  - Throughout execution, and upon completion/failure, the worker updates the task's status in BATCH_PARTITIONS and the Spring Batch JobRepository.
- Fault Tolerance:
  - If a worker node fails, its heartbeats stop. After a configurable timeout, the master node (or another designated process) marks the node as UNREACHABLE in BATCH_NODES and eventually removes it.
  - Any 'CLAIMED' or 'PENDING' partitions associated with the failed node that are marked is_transferable are identified and made available for re-assignment to other active nodes, ensuring the job completes.
- Aggregation: The ClusterAwareAggregator collects results from all partitions, updating the master job's status and invoking success/failure callbacks.
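The partition lifecycle described above can be pictured as a small state machine. The sketch below is an assumption-laden illustration, not code from the library: the enum name PartitionStatus and the methods allowedNext and canMoveTo are hypothetical, and the transition set (in particular CLAIMED back to PENDING for re-assignment) is inferred from the description rather than confirmed by the source.

```java
import java.util.EnumSet;
import java.util.Set;

/** Hypothetical model of the four partition states named in the README. */
enum PartitionStatus {
    PENDING, CLAIMED, COMPLETED, FAILED;

    /** Assumed transitions; the library itself may permit a different set. */
    Set<PartitionStatus> allowedNext() {
        switch (this) {
            case PENDING:
                // A worker picks up the task.
                return EnumSet.of(CLAIMED);
            case CLAIMED:
                // Finished, failed, or released for re-assignment after a node failure.
                return EnumSet.of(COMPLETED, FAILED, PENDING);
            default:
                // COMPLETED and FAILED are treated as terminal in this sketch.
                return EnumSet.noneOf(PartitionStatus.class);
        }
    }

    boolean canMoveTo(PartitionStatus next) {
        return allowedNext().contains(next);
    }
}
```

Guarding updates with a check like canMoveTo is one way to keep a worker from overwriting a partition that has already reached a terminal state.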
Architecture
```mermaid
graph TD
    A[Job Launcher Node] --> B[Job Execution]
    B --> C{Step Type}
    C -->|Single Step| D[Execute Locally]
    C -->|Partitioned Step| E[Distribute Partitions]
    E --> F[Worker Node 1]
    E --> G[Worker Node 2]
    E --> H[Worker Node N]
    F --> I[Process Partition]
    G --> J[Process Partition]
    H --> K[Process Partition]
    I --> L[Update Status]
    J --> L
    K --> L
    L --> M[Master Monitors & Reassigns if Needed]
```
Sequence diagram
```mermaid
sequenceDiagram
    participant Master as Job Launcher Node
    participant Worker1 as Worker Node 1
    participant Worker2 as Worker Node 2
    participant DB as Shared Database
Master->>DB: Store Job & Step Metadata
Master->>Worker1: Assign Partition 1
Master->>Worker2: Assign Partition 2
Worker1->>DB: Update Partition 1 Status
Worker2->>DB: Update Partition 2 Status
Master->>DB: Monitor Partition Status
Master->>Worker1: Reassign Partition if Needed
```
📚 Featured In
- 📝 TechRxiv Preprint Spring Batch Database-Backed Clustered Partitioning — 98+ views, 27+ downloads
- ✍️ Dev.to Overview Article Introduction to the coordination framework — 171+ views
- 📘 Article Series: Distributed Spring Batch Coordination
- Distributed Spring Batch Coordination: Lightweight, Database-Driven, and Cloud-Native Series' Articles
- 📚 Differ.blog Feature
- Distributed Spring Batch Clustering: A Lightweight Alternative to Heavy Orchestration
- ✨ Coming Soon Medium/In Plain English article on production usage patterns and lessons learned.
🔍 Actuator Endpoints
| Endpoint | Description |
|-------------------------------------|----------------------------------------------|
| /actuator/health | Shows cluster-aware health status |
| /actuator/batch-cluster | Cluster overview of node executions |
| /actuator/batch-cluster/{nodeId}| Details of a specific node |
📦 Getting Started
Prerequisites
- Java 17+
- Maven 3.x or Gradle 7.x+
- A relational database (e.g., PostgreSQL, MySQL, Oracle, H2 for development)
Installation (as a Maven/Gradle dependency)
Add the following to your pom.xml (for Maven):
```xml
<dependency>
    <groupId>io.github.jchejarla</groupId>
    <artifactId>spring-batch-db-cluster-core</artifactId>
    <version>2.0.0</version>
</dependency>
```
If you are using a SNAPSHOT version of the jar (snapshots are not official releases and are intended for testing only), also add the following snapshot repository to your pom.xml:
```xml
<repositories>
    <repository>
        <id>central-snapshots</id>
        <url>https://central.sonatype.com/repository/maven-snapshots/</url>
        <snapshots><enabled>true</enabled></snapshots>
        <releases><enabled>false</enabled></releases>
    </repository>
</repositories>
```
Or for Gradle:
```gradle
implementation 'io.github.jchejarla:spring-batch-db-cluster-core:2.0.0' // Use the latest version
```
NOTE: The artifact has been deployed to Maven Central for direct consumption. For local development, you might need to build and install it to your local Maven repository (mvn clean install).
Database Setup
This library extends Spring Batch, so two layers of schema need to exist in your database:
1. Spring Batch core schema (required prerequisite)
The standard Spring Batch metadata tables (BATCH_JOB_INSTANCE, BATCH_JOB_EXECUTION, BATCH_STEP_EXECUTION, etc.) must exist. You have two options:
- Recommended for development: let Spring Boot create them on startup by setting:
  ```properties
  spring.batch.jdbc.initialize-schema=always
  ```
- For production (or if you prefer to manage schema manually): apply the official Spring Batch DDL script for your database from the Spring Batch repository (the schema-*.sql files in the spring-batch-core module).

Use the Spring Batch version that matches the one this library depends on (currently 5.2.x).
2. Cluster partitioning tables (provided by this library)
This library adds three additional tables (BATCH_NODES, BATCH_JOB_COORDINATION, BATCH_PARTITIONS) for cluster state, partition assignment, and heartbeat tracking. SQL scripts for PostgreSQL, Oracle, MySQL, and H2 are bundled in spring-batch-db-cluster-core/src/main/resources/schema/. Apply the one matching your database, or use the inline PostgreSQL example below.
PostgreSQL Example Schema:
```sql
-- Table: BATCH_NODES - maintains cluster nodes heartbeat
CREATE TABLE BATCH_NODES (
    NODE_ID           VARCHAR(200) NOT NULL PRIMARY KEY,
    CREATED_TIME      TIMESTAMP    NOT NULL,
    LAST_UPDATED_TIME TIMESTAMP    NOT NULL,
    STATUS            VARCHAR(20)  NOT NULL, -- Indicates node status (e.g., ACTIVE, UNREACHABLE)
    HOST_IDENTIFIER   VARCHAR(200),
    CURRENT_LOAD      BIGINT       NOT NULL DEFAULT 0
);

-- Table: BATCH_JOB_COORDINATION - Job coordination
CREATE TABLE BATCH_JOB_COORDINATION (
    JOB_EXECUTION_ID         BIGINT       PRIMARY KEY,
    MASTER_NODE_ID           VARCHAR(200) NOT NULL,
    MASTER_STEP_EXECUTION_ID BIGINT       NOT NULL,
    MASTER_STEP_NAME         VARCHAR(100) NOT NULL,
    STATUS                   VARCHAR(20)  NOT NULL,
    CREATED_TIME             TIMESTAMP    NOT NULL,
    LAST_UPDATED             TIMESTAMP    NOT NULL,
    CONSTRAINT JOB_COORD_FK FOREIGN KEY (JOB_EXECUTION_ID)
        REFERENCES BATCH_JOB_EXECUTION(JOB_EXECUTION_ID)
);

-- Table: BATCH_PARTITIONS - Partition tracking
CREATE TABLE BATCH_PARTITIONS (
    step_execution_id        BIGINT PRIMARY KEY,
    job_execution_id         BIGINT NOT NULL,
    partition_key            VARCHAR(100) NOT NULL,
    assigned_node            VARCHAR(100),
    status                   VARCHAR(20) NOT NULL
        CHECK (status IN ('PENDING', 'CLAIMED', 'COMPLETED', 'FAILED')),
    last_updated             TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    master_step_execution_id BIGINT NOT NULL,
    is_transferable          SMALLINT DEFAULT 0,
    CHECK (is_transferable IN (0, 1)),
    CONSTRAINT BAT_PART_FK FOREIGN KEY (step_execution_id)
        REFERENCES BATCH_STEP_EXECUTION(step_execution_id) ON DELETE CASCADE
);
```
Configuration
Enable the cluster partitioning by adding the following to your application.properties (or application.yml):
```properties
# Enable the cluster partitioning feature
spring.batch.cluster.enabled=true
# Unique identifier for this node instance
spring.batch.cluster.node-id=${HOSTNAME:my-batch-node-01}
# How often this node sends a heartbeat to the database (in milliseconds); 3000 = 3 seconds
spring.batch.cluster.heartbeat-interval=3000
# How often worker nodes poll for new tasks (in milliseconds); 1000 = 1 second
spring.batch.cluster.task-polling-interval=1000
# Time in milliseconds after which a node is considered unreachable if no heartbeat is received; 15000 = 15 seconds
spring.batch.cluster.unreachable-node-threshold=15000
# Time in milliseconds after which an unreachable node's entry is removed from BATCH_NODES;
# 60000 = 60 seconds (after becoming unreachable)
spring.batch.cluster.node-cleanup-threshold=60000
```
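To show how the two thresholds above relate to each other, the following sketch classifies a node from the age of its last heartbeat. It is a hedged illustration rather than the library's logic: NodeLivenessSketch, Liveness, and classify are hypothetical names, REMOVABLE is an invented label for "eligible for cleanup", and the sketch assumes the cleanup interval starts counting once the unreachable threshold has passed, as the "after becoming unreachable" comment suggests.

```java
import java.time.Duration;
import java.time.Instant;

/** Illustrative only: hypothetical names, not part of the library. */
final class NodeLivenessSketch {

    enum Liveness { ACTIVE, UNREACHABLE, REMOVABLE }

    static Liveness classify(Instant lastHeartbeat, Instant now,
                             Duration unreachableThreshold, Duration cleanupThreshold) {
        // How long the node has been silent.
        Duration silence = Duration.between(lastHeartbeat, now);
        if (silence.compareTo(unreachableThreshold) <= 0) {
            return Liveness.ACTIVE;
        }
        // Assumption: cleanup time is measured from the moment the node became unreachable.
        Duration removalPoint = unreachableThreshold.plus(cleanupThreshold);
        if (silence.compareTo(removalPoint) <= 0) {
            return Liveness.UNREACHABLE;
        }
        return Liveness.REMOVABLE;
    }
}
```

Under this reading, with the sample values (15000 ms and 60000 ms) a node that has been silent for 20 seconds is UNREACHABLE, and its row becomes eligible for removal after 75 seconds of silence.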
Usage
1. Define your ClusterAwarePartitioner: Extend the ClusterAwarePartitioner abstract class.
```java
import java.util.ArrayList;
import java.util.List;

import org.springframework.batch.item.ExecutionContext;
import org.springframework.context.annotation.Configuration;
// Also import ClusterAwarePartitioner, PartitionBuilder, PartitioningMode, and
// PartitionTransferableProp from this library.

@Configuration
public class MyJobPartitioner extends ClusterAwarePartitioner {

    @Override
    public List<ExecutionContext> splitIntoChunksForDistribution(int availableNodeCount) {
        // Example: Create 10 partitions, distributing across available nodes
        List<ExecutionContext> contexts = new ArrayList<>();
        for (int i = 0; i < 10; i++) {
            ExecutionContext context = new ExecutionContext();
            // Each partition covers a contiguous block of 100,000 ids (inclusive bounds)
            context.putLong("startRange", i * 100000L);
            context.putLong("endRange", (i + 1) * 100000L - 1);
            contexts.add(context);
        }
        return contexts;
    }

    @Override
    public PartitionTransferableProp arePartitionsTransferableWhenNodeFailed() {
        // Set to YES if partitions can be re-assigned to other nodes on failure
        return PartitionTransferableProp.YES;
    }

    @Override
    public PartitionBuilder buildPartitionStrategy() {
        // Example: Use Round-Robin strategy
        return PartitionBuilder.builder().partitioningMode(PartitioningMode.ROUND_ROBIN).build();
        // Or, Fixed Node Count:
        // return PartitionBuilder.builder().partitioningMode(PartitioningMode.FIXED_NODE_COUNT).fixedNodeCount(3).build();
        // Or, Scale-Up (similar to Round-Robin but explicitly named for dynamic scaling intent):
        // return PartitionBuilder.builder().partitioningMode(PartitioningMode.SCALE_UP).build();
    }
}
```
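The partitioner above always creates 10 partitions. Because availableNodeCount is passed in, the partition count can instead be derived from the cluster size. The sketch below shows one possible way to do that in plain Java; RangeSplitSketch, Range, split, and the partitionsPerNode parameter are hypothetical names introduced for illustration and are not part of the library.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: sizes contiguous ranges from the number of available nodes. */
final class RangeSplitSketch {

    /** An inclusive [start, endInclusive] range of item ids. */
    record Range(long start, long endInclusive) {}

    static List<Range> split(long totalItems, int availableNodeCount, int partitionsPerNode) {
        if (totalItems <= 0 || availableNodeCount <= 0 || partitionsPerNode <= 0) {
            throw new IllegalArgumentException("all arguments must be positive");
        }
        // Never create more partitions than there are items.
        int partitions = (int) Math.min(totalItems, (long) availableNodeCount * partitionsPerNode);
        long base = totalItems / partitions;
        long remainder = totalItems % partitions;

        List<Range> ranges = new ArrayList<>(partitions);
        long start = 0;
        for (int i = 0; i < partitions; i++) {
            // The first 'remainder' partitions take one extra item each.
            long size = base + (i < remainder ? 1 : 0);
            ranges.add(new Range(start, start + size - 1));
            start += size;
        }
        return ranges;
    }
}
```

Inside splitIntoChunksForDistribution, each Range could then be written into an ExecutionContext with putLong("startRange", ...) and putLong("endRange", ...), exactly as the fixed-size example does.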
2. Configure your partitioned step: Use your custom ClusterAwarePartitioner and a Step (e.g., a TaskletStep or ItemReader/Writer based step) that will be executed by each partition.
```java
import java.util.Collection;

import org.springframework.batch.core.Job;
import org.springframework.batch.core.Step;
import org.springframework.batch.core.StepExecution;
import org.springframework.batch.core.job.builder.JobBuilder;
import org.springframework.batch.core.repository.JobRepository;
import org.springframework.batch.core.step.builder.StepBuilder;
import org.springframework.batch.item.ExecutionContext;
import org.springframework.batch.repeat.RepeatStatus;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.core.task.TaskExecutor;
import org.springframework.transaction.PlatformTransactionManager;
// Also import ClusterAwareAggregator, ClusterAwareAggregatorCallback, and
// ClusterPartitioningConstants from this library.

@Configuration
public class MyJobConfig {

    @Autowired
    private JobRepository jobRepository;

    @Autowired
    private PlatformTransactionManager transactionManager;

    @Autowired
    private TaskExecutor taskExecutor; // For worker threads

    @Autowired
    private MyJobPartitioner myJobPartitioner; // Your custom partitioner

    @Autowired
    private ClusterAwareAggregator clusterAwareAggregator; // Provided by the library

    @Bean
    public Step partitionedWorkerStep() {
        return new StepBuilder("partitionedWorkerStep", jobRepository)
                .tasklet((contribution, chunkContext) -> {
                    // This is the logic executed by each partition
                    ExecutionContext stepContext = chunkContext.getStepContext().getStepExecution().getExecutionContext();
                    long startRange = stepContext.getLong("startRange");
                    long endRange = stepContext.getLong("endRange");
                    String nodeId = stepContext.getString(ClusterPartitioningConstants.CLUSTER_NODE_IDENTIFIER);
                    Boolean isTransferable = stepContext.getBoolean(ClusterPartitioningConstants.IS_TRANSFERABLE_IDENTIFIER);

                    System.out.println(String.format("Node %s processing range %d to %d. IsTransferable: %s",
                            nodeId, startRange, endRange, isTransferable));

                    // Simulate work
                    long sum = 0;
                    for (long i = startRange; i <= endRange; i++) {
                        sum += i;
                    }
                    System.out.println(String.format("Node %s finished range. Sum: %d", nodeId, sum));
                    return RepeatStatus.FINISHED;
                }, transactionManager)
                .build();
    }

    @Bean
    public Step masterStep() {
        return new StepBuilder("masterStep", jobRepository)
                .partitioner("partitionedWorkerStep", myJobPartitioner)
                .step(partitionedWorkerStep()) // The step to be executed by workers
                .aggregator(clusterAwareAggregator) // Optional: For custom success/failure callbacks
                .taskExecutor(taskExecutor) // Local task executor for master coordination, or simple direct call for small workloads
                .build();
    }

    @Bean
    public Job myPartitionedJob(Step masterStep) {
        return new JobBuilder("myPartitionedJob", jobRepository)
                .start(masterStep)
                .build();
    }

    // Example of a custom callback for aggregation
    @Bean
    public ClusterAwareAggregatorCallback myJobCompletionCallback() {
        return new ClusterAwareAggregatorCallback() {
            @Override
            public void onSuccess(Collection<StepExecution> executions) {
                System.out.println("Partitioned Job Completed Successfully!");
                executions.forEach(se -> System.out.println("  Partition " + se.getStepName() + " on node " + se.getExecutionContext().getString(ClusterPartitioningConstants.CLUSTER_NODE_IDENTIFIER) + ": " + se.getStatus()));
            }

            @Override
            public void onFailure(Collection<StepExecution> executions) {
                System.err.println("Partitioned Job FAILED!");
                executions.forEach(se -> System.err.println("  Partition " + se.getStepName() + " on node " + se.getExecutionContext().getString(ClusterPartitioningConstants.CLUSTER_NODE_IDENTIFIER) + ": " + se.getStatus() + " (" + se.getExitStatus().getExitDescription() + ")"));
            }
        };
    }
}
```
3. Run Multiple Instances: Start multiple instances of your Spring Boot application, each configured with a unique spring.batch.cluster.node-id. One instance will act as the master (initiating the job), and others will automatically register as workers.
Running the Bundled Examples
The examples/ module contains a runnable Spring Boot application that demonstrates simple, advanced (ETL), and task-executor jobs against a real cluster of two or more JVMs.
See examples/README.md for the complete step-by-step quick start — provisioning PostgreSQL, applying both schemas, building, starting multiple workers, triggering jobs, observing partition state, and testing failover. The example module also documents how to switch to MySQL or Oracle, and includes a zero-setup H2 profile for users who want to verify functionality without provisioning an external database.
📈 Performance and Scalability
This solution enables true horizontal scalability by distributing batch workloads across distinct physical or virtual machines. Performance benchmarks demonstrate significant reductions in job execution time as more worker nodes are added, effectively leveraging distributed computing resources.
🛡 Fault Tolerance
The database-centric approach provides robust fault tolerance. In the event of a worker node failure, its assigned tasks (if marked as transferable) are identified via the database and re-assigned to other active nodes, ensuring job completion without manual intervention or data loss.
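The re-assignment rule described above can be sketched in memory as follows. This is a simplified illustration under stated assumptions, not the library's recovery code: FailoverSketch, PartitionRow, and reassign are hypothetical names, the rows stand in for BATCH_PARTITIONS records, and resetting a moved partition to PENDING with rotation across healthy nodes is one plausible policy rather than the documented one.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: an in-memory model of re-assigning a failed node's partitions. */
final class FailoverSketch {

    /** Hypothetical stand-in for one BATCH_PARTITIONS row. */
    static final class PartitionRow {
        final String partitionKey;
        final boolean transferable;
        String assignedNode;
        String status;

        PartitionRow(String partitionKey, String assignedNode, String status, boolean transferable) {
            this.partitionKey = partitionKey;
            this.assignedNode = assignedNode;
            this.status = status;
            this.transferable = transferable;
        }
    }

    /**
     * Moves unfinished, transferable partitions of the failed node to healthy
     * nodes in rotation and resets them to PENDING. Returns the moved keys.
     */
    static List<String> reassign(List<PartitionRow> rows, String failedNode, List<String> healthyNodes) {
        List<String> moved = new ArrayList<>();
        if (healthyNodes.isEmpty()) {
            return moved; // Nothing can be moved until a healthy node appears.
        }
        int next = 0;
        for (PartitionRow row : rows) {
            boolean unfinished = row.status.equals("PENDING") || row.status.equals("CLAIMED");
            if (failedNode.equals(row.assignedNode) && unfinished && row.transferable) {
                row.assignedNode = healthyNodes.get(next % healthyNodes.size());
                row.status = "PENDING";
                moved.add(row.partitionKey);
                next++;
            }
        }
        return moved;
    }
}
```

Partitions that are already COMPLETED or FAILED, or that are not marked transferable, stay where they are, which matches the "if marked as transferable" condition in the text.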
🤝 Contributing
Contributions are welcome! Please feel free to open issues, submit pull requests, or suggest improvements.
📄 License
This project is licensed under the Apache 2.0 License - see the LICENSE file for details.
📧 Contact
Janardhan Chejarla - janardhan.chejarla@googlemail.com
Owner
- Name: Janardhan Chejarla
- Login: jchejarla
- Kind: user
- Location: San Francisco
- Repositories: 1
- Profile: https://github.com/jchejarla
JOSS Publication
spring-batch-db-cluster-partitioning: Database-driven clustering with heartbeats and failover for Spring Batch
Tags
Spring Batch, Distributed Systems, Job Scheduling, Database Coordination
GitHub Events
Total
- Release event: 4
- Pull request event: 1
- Issues event: 5
- Watch event: 7
- Issue comment event: 5
- Push event: 48
- Create event: 5
Last Year
- Release event: 3
- Pull request event: 1
- Issues event: 5
- Watch event: 6
- Issue comment event: 5
- Push event: 31
- Create event: 2
Issues and Pull Requests
Last synced: 3 months ago
All Time
- Total issues: 1
- Total pull requests: 1
- Average time to close issues: 1 day
- Average time to close pull requests: about 13 hours
- Total issue authors: 1
- Total pull request authors: 1
- Average comments per issue: 1.0
- Average comments per pull request: 0.0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 1
- Pull requests: 1
- Average time to close issues: 1 day
- Average time to close pull requests: about 13 hours
- Issue authors: 1
- Pull request authors: 1
- Average comments per issue: 1.0
- Average comments per pull request: 0.0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- arvind-tech-ai (1)
Pull Request Authors
- jchejarla (1)
Packages
- Total packages: 4
- Total downloads: unknown
- Total dependent packages: 0 (may contain duplicates)
- Total dependent repositories: 0 (may contain duplicates)
- Total versions: 7
repo1.maven.org: io.github.jchejarla:spring-batch-db-cluster-core
Core module for Spring Batch DB clustering framework
- Homepage: https://github.com/jchejarla/spring-batch-db-cluster-partitioning
- Documentation: https://appdoc.app/artifact/io.github.jchejarla/spring-batch-db-cluster-core/
- License: Apache License, Version 2.0
- Latest release: 2.0.0 (published 11 months ago)
repo1.maven.org: io.github.jchejarla:examples
Examples using Spring Batch DB clustering framework
- Homepage: https://github.com/jchejarla/spring-batch-db-cluster-partitioning
- Documentation: https://appdoc.app/artifact/io.github.jchejarla/examples/
- License: Apache License, Version 2.0
- Latest release: 2.0.0 (published 11 months ago)
repo1.maven.org: io.github.jchejarla:spring-batch-db-cluster-partitioning
Spring Batch DB-based cluster partitioning with dynamic node coordination
- Homepage: https://github.com/jchejarla/spring-batch-db-cluster-partitioning
- Documentation: https://appdoc.app/artifact/io.github.jchejarla/spring-batch-db-cluster-partitioning/
- License: Apache License, Version 2.0
- Latest release: 2.0.0 (published 11 months ago)
repo1.maven.org: io.github.jchejarla:clustering-core
Core module for Spring Batch DB clustering framework
- Homepage: https://github.com/jchejarla/spring-batch-db-cluster-partitioning
- Documentation: https://appdoc.app/artifact/io.github.jchejarla/clustering-core/
- License: Apache License, Version 2.0
- Latest release: 1.0.1 (published 11 months ago)
