clustermq

R package to send function calls as jobs on LSF, SGE, Slurm, PBS/Torque, or any of these via SSH

https://github.com/mschubert/clustermq

Science Score: 59.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 6 DOI reference(s) in README
  • Academic publication links
    Links to: zenodo.org
  • Committers with academic emails
    1 of 16 committers (6.3%) from academic institutions
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (11.9%) to scientific vocabulary

Keywords

cluster high-performance-computing lsf r-package sge slurm ssh

Keywords from Contributors

reproducibility drake makefile ropensci geo genomics documentation-tool make r-targetopia targets
Last synced: 6 months ago

Repository

R package to send function calls as jobs on LSF, SGE, Slurm, PBS/Torque, or any of these via SSH

Basic Info
Statistics
  • Stars: 152
  • Watchers: 7
  • Forks: 28
  • Open Issues: 21
  • Releases: 23
Topics
cluster high-performance-computing lsf r-package sge slurm ssh
Created over 9 years ago · Last pushed 10 months ago
Metadata Files
Readme Changelog License

README.md

ClusterMQ: send R function calls as cluster jobs

CRAN version Build Status CRAN downloads DOI

This package allows you to send function calls as jobs to a computing cluster with a minimal interface provided by the Q function:

```r
# install the package if you haven't done so yet
install.packages('clustermq')

# load the library and create a simple function
library(clustermq)
fx = function(x) x * 2

# queue the function call on your scheduler
Q(fx, x=1:3, n_jobs=1)
# list(2, 4, 6)
```

Computations are done entirely on the network and without any temporary files on network-mounted storage, so there is no strain on the file system apart from starting up R once per job. All calculations are load-balanced, i.e. workers that get their jobs done faster will also receive more function calls to work on. This is especially useful if not all calls return after the same time, or one worker has a high load.
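Load balancing can be observed directly by giving workers calls of uneven duration. The sketch below (assuming clustermq is installed; the "multiprocess" scheduler runs everything locally, so no HPC access is needed) uses Sys.sleep to simulate one slow call:

```r
library(clustermq)
options(clustermq.scheduler = "multiprocess")  # run workers locally

# one slow call and three fast ones; uneven runtimes trigger load balancing
fx = function(x) { Sys.sleep(x / 10); x * 2 }
res = Q(fx, x = c(5, 1, 1, 1), n_jobs = 2)

# results come back in input order regardless of which worker ran them
unlist(res)  # 10 2 2 2
```

While the slow call occupies one worker, the remaining calls are dispatched to the other, which is why faster workers end up processing more calls.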

Browse the vignettes, in particular the User Guide and the FAQ, for more details.

Schedulers

An HPC cluster's scheduler ensures that computing jobs are distributed to available worker nodes. Hence, this is what clustermq interfaces with in order to do computations.

We currently support the following schedulers (either locally or via SSH):

  • Multiprocess - test your calls and parallelize on cores using options(clustermq.scheduler="multiprocess")
  • SLURM - should work without setup
  • LSF - should work without setup
  • SGE - may require configuration
  • PBS/Torque - needs options(clustermq.scheduler="PBS"/"Torque")
  • via SSH - needs options(clustermq.scheduler="ssh", clustermq.ssh.host=<yourhost>)
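Scheduler selection happens via R options, typically set in your ~/.Rprofile so they apply to every session. A minimal sketch for the SSH connector (the host name below is a placeholder, not a real machine):

```r
# ~/.Rprofile: select the SSH connector
# (the host name is an example placeholder -- use your own login node)
options(
    clustermq.scheduler = "ssh",
    clustermq.ssh.host = "user@login.example.org"
)
```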

[!TIP] Follow the links above to configure your scheduler in case it is not working out of the box, and check the FAQ if your job submission errors out or gets stuck.

Usage

The most common arguments for Q are:

  • fun - The function to call. This needs to be self-sufficient (because it will not have access to the master environment)
  • ... - All iterated arguments passed to the function. If there is more than one, all of them need to be named
  • const - A named list of non-iterated arguments passed to fun
  • export - A named list of objects to export to the worker environment

The documentation for other arguments can be accessed by typing ?Q. Examples of using const and export would be:

```r
# adding a constant argument
fx = function(x, y) x * 2 + y
Q(fx, x=1:3, const=list(y=10), n_jobs=1)

# exporting an object to workers
fx = function(x) x * 2 + y
Q(fx, x=1:3, export=list(y=10), n_jobs=1)
```

We can also use clustermq as a parallel backend in foreach or BiocParallel:

```r
# using foreach
library(foreach)
register_dopar_cmq(n_jobs=2, memory=1024) # see ?workers for arguments
foreach(i=1:3) %dopar% sqrt(i) # this will be executed as jobs

# using BiocParallel
library(BiocParallel)
register(DoparParam()) # after register_dopar_cmq(...)
bplapply(1:3, sqrt)
```

More examples are available in the User Guide.

Comparison to other packages

Several packages provide high-level parallelization of R function calls on a computing cluster. We compared clustermq to BatchJobs and batchtools for processing many short-running jobs, and found it to have approximately 1,000-fold lower overhead.

Overhead comparison

In short, use clustermq if you want:

  • a one-line solution to run cluster jobs with minimal setup
  • access to cluster functions from your local RStudio via SSH
  • fast processing of many function calls without network storage I/O

Use batchtools if you:

  • want to use a mature and well-tested package
  • don't mind that arguments to every call are written to/read from disk
  • don't mind that there is no load balancing at run time

Use Snakemake or targets if:

  • you want to design and run a workflow on HPC

Don't use batch (last updated 2013) or BatchJobs (issues with SQLite on network-mounted storage).

Contributing

Contributions are welcome and they come in many different forms, shapes, and sizes. These include, but are not limited to:

  • Questions: Ask on the GitHub Discussions board. If you are an advanced user, please also consider answering questions there.
  • Bug reports: File an issue if something does not work as expected. Be sure to include a self-contained Minimal Reproducible Example and set log_worker=TRUE.
  • Code contributions: Have a look at the good first issue tag. Please discuss anything more complicated before putting a lot of work in, I'm happy to help you get started.
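For bug reports, worker-side logs can be enabled via the log_worker argument of Q. A minimal sketch (assuming clustermq is installed; the multiprocess scheduler reproduces the issue locally where possible):

```r
library(clustermq)
options(clustermq.scheduler = "multiprocess")  # reproduce locally if possible

fx = function(x) x * 2
# log_worker=TRUE makes each worker write a log file -- attach it to your report
res = Q(fx, x = 1:3, n_jobs = 1, log_worker = TRUE)
unlist(res)  # 2 4 6
```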

[!TIP] Check the User Guide and the FAQ first, maybe your query is already answered there

Citation

This project is part of my academic work, for which I will be evaluated on citations. If you would like me to be able to continue working on research support tools like clustermq, please cite the article when using it for publications:

M Schubert. clustermq enables efficient parallelisation of genomic analyses. Bioinformatics (2019). doi:10.1093/bioinformatics/btz284

Owner

  • Name: Michael Schubert
  • Login: mschubert
  • Kind: user
  • Location: Amsterdam, NL

Postdoctoral scientist at the Netherlands Cancer Institute (NKI). Previously EMBL-EBI/Cambridge Uni.

GitHub Events

Total
  • Create event: 6
  • Release event: 4
  • Issues event: 8
  • Watch event: 4
  • Delete event: 2
  • Issue comment event: 4
  • Push event: 51
  • Fork event: 1
Last Year
  • Create event: 6
  • Release event: 4
  • Issues event: 8
  • Watch event: 4
  • Delete event: 2
  • Issue comment event: 4
  • Push event: 51
  • Fork event: 1

Committers

Last synced: 9 months ago

All Time
  • Total Commits: 1,206
  • Total Committers: 16
  • Avg Commits per committer: 75.375
  • Development Distribution Score (DDS): 0.023
Past Year
  • Commits: 50
  • Committers: 1
  • Avg Commits per committer: 50.0
  • Development Distribution Score (DDS): 0.0
Top Committers
Name Email Commits
mschubert m****v@g****m 1,178
Konrad Rudolph k****h@g****m 4
Jeroen Ooms j****s@g****m 4
Will Landau w****u@g****m 3
Matthew Strasiotto 3****6 3
M.P. Barzine b****e@g****m 3
brendanf b****x@g****m 2
nickholway n****y@g****m 1
Unknown s****o@g****m 1
Phil Dyer p****w@g****m 1
Michael Mayer m****r@r****m 1
Michael Kane k****s@g****m 1
Mervin Fansler m****r@g****m 1
Attila Gabor g****7@g****m 1
Alexey Shiklomanov a****v@g****m 1
Matthew T. Warkentin m****n@m****a 1
Committer Domains (Top 20 + Academic)

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 126
  • Total pull requests: 14
  • Average time to close issues: 8 months
  • Average time to close pull requests: 19 days
  • Total issue authors: 46
  • Total pull request authors: 11
  • Average comments per issue: 4.0
  • Average comments per pull request: 2.21
  • Merged pull requests: 11
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 7
  • Pull requests: 0
  • Average time to close issues: 14 days
  • Average time to close pull requests: N/A
  • Issue authors: 2
  • Pull request authors: 0
  • Average comments per issue: 0.14
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • mschubert (34)
  • wlandau (15)
  • nick-youngblut (7)
  • mattwarkentin (5)
  • rimorob (4)
  • liutiming (4)
  • luwidmer (4)
  • mhesselbarth (4)
  • nickholway (3)
  • HenrikBengtsson (3)
  • statquant (2)
  • strazto (2)
  • bhayete-empress (2)
  • Zhuk66 (2)
  • quirinmanz (2)
Pull Request Authors
  • jeroen (3)
  • michaelmayer2 (2)
  • wlandau (1)
  • mfansler (1)
  • strazto (1)
  • nickholway (1)
  • mschubert (1)
  • sam217pa (1)
  • mattwarkentin (1)
  • statquant (1)
  • klmr (1)
Top Labels
Issue Labels
enhancement (31) bug (30) ux (6) needs mwe (6) duplicate (5) can not reproduce (4) needs info (4) community request (3) priority (3) breaking (3) next release (3) documentation (2) invalid (2) upstream (2) next version (1) wontfix (1)
Pull Request Labels

Packages

  • Total packages: 3
  • Total downloads:
    • cran: 1,418 last month
  • Total docker downloads: 52,439
  • Total dependent packages: 4
    (may contain duplicates)
  • Total dependent repositories: 15
    (may contain duplicates)
  • Total versions: 64
  • Total maintainers: 1
proxy.golang.org: github.com/mschubert/clustermq
  • Versions: 23
  • Dependent Packages: 0
  • Dependent Repositories: 0
Rankings
Dependent packages count: 5.5%
Average: 5.6%
Dependent repos count: 5.8%
Last synced: 6 months ago
cran.r-project.org: clustermq

Evaluate Function Calls on HPC Schedulers (LSF, SGE, SLURM, PBS/Torque)

  • Versions: 30
  • Dependent Packages: 4
  • Dependent Repositories: 15
  • Downloads: 1,418 Last month
  • Docker Downloads: 52,439
Rankings
Stargazers count: 2.9%
Forks count: 3.2%
Dependent repos count: 7.4%
Dependent packages count: 9.4%
Average: 9.5%
Downloads: 11.4%
Docker downloads count: 22.4%
Maintainers (1)
Last synced: 6 months ago
conda-forge.org: r-clustermq
  • Versions: 11
  • Dependent Packages: 0
  • Dependent Repositories: 0
Rankings
Stargazers count: 27.9%
Forks count: 32.6%
Dependent repos count: 34.0%
Average: 36.4%
Dependent packages count: 51.2%
Last synced: 6 months ago

Dependencies

DESCRIPTION cran
  • R >= 3.6.0 depends
  • R6 * imports
  • Rcpp * imports
  • methods * imports
  • narray * imports
  • progress * imports
  • purrr * imports
  • utils * imports
  • callr * suggests
  • devtools * suggests
  • dplyr * suggests
  • foreach * suggests
  • iterators * suggests
  • knitr * suggests
  • parallel * suggests
  • rmarkdown * suggests
  • roxygen2 >= 5.0.0 suggests
  • testthat * suggests
  • tools * suggests
.github/workflows/check-standard.yaml actions
  • actions/checkout v3 composite
  • actions/upload-artifact main composite
  • r-lib/actions/check-r-package v2 composite
  • r-lib/actions/setup-pandoc v2 composite
  • r-lib/actions/setup-r v2 composite
  • r-lib/actions/setup-r-dependencies v2 composite