hpc-workflows

HPC Workflow Management with Snakemake

https://github.com/carpentries-incubator/hpc-workflows

Science Score: 54.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Committers with academic emails
    2 of 7 committers (28.6%) from academic institutions
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (15.4%) to scientific vocabulary

Keywords

carpentries-incubator english hpc-carpentry lesson pre-alpha snakemake workflows

Keywords from Contributors

alpha carpentry-lesson helpwanted-list auroc bootstrapping data-leakage prediction beta github-pages jekyll
Last synced: 4 months ago

Repository

HPC Workflow Management with Snakemake

Basic Info
Statistics
  • Stars: 3
  • Watchers: 10
  • Forks: 4
  • Open Issues: 9
  • Releases: 0
Topics
carpentries-incubator english hpc-carpentry lesson pre-alpha snakemake workflows
Created over 2 years ago · Last pushed 4 months ago
Metadata Files
Readme Contributing License Code of conduct Citation

README.md

Tame Your Workflow with Snakemake

In HPC Intro, learners explored the scheduler on their cluster by launching a program called amdahl. The objective of this lesson is to adapt the manual job submission process into a repeatable, reusable workflow with minimal human intervention. This is accomplished using Snakemake, a modern workflow engine.
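The "ideal case" such a scaling experiment is compared against is given by Amdahl's law: a program whose parallelizable fraction is p can speed up by at most S(n) = 1 / ((1 - p) + p / n) on n cores. A quick plain-Python illustration (the function name and the p = 0.8 value are illustrative, not taken from the amdahl program itself):

```python
# Ideal speedup under Amdahl's law: S(n) = 1 / ((1 - p) + p / n),
# where p is the parallelizable fraction and n the number of cores.
def amdahl_speedup(p: float, n: int) -> float:
    return 1.0 / ((1.0 - p) + p / n)

# With 80% parallelizable work, speedup saturates well below n
# (e.g. 2.5x on 4 cores, not 4x):
for n in (1, 2, 4, 8):
    print(n, round(amdahl_speedup(0.8, n), 2))
```

Plotting measured run times against this curve is exactly the comparison the lesson's workflow automates.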

If you are interested in learning more about workflow tools, please visit The Workflows Community.
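To make the idea concrete: Snakemake describes each step of a workflow as a rule with declared inputs and outputs, and works out the dependency graph itself. A minimal sketch along those lines (the rule names, file paths, and amdahl invocation are hypothetical, not the lesson's actual Snakefile):

```snakemake
# Hypothetical sketch: run amdahl at several task counts, one job per count.
# Rule names, paths, and the amdahl invocation are illustrative only.
NTASKS = [1, 2, 4, 8]

rule all:
    input:
        expand("runs/amdahl_{n}.json", n=NTASKS)

rule run_amdahl:
    output:
        "runs/amdahl_{n}.json"
    shell:
        "mpirun -n {wildcards.n} amdahl --terse > {output}"
```

Asking for the targets in `rule all` causes Snakemake to schedule one `run_amdahl` job per task count, which is the repeatable version of the manual submissions from HPC Intro.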

Snakemake is best for single-node jobs

NERSC's Snakemake docs list Snakemake's "cluster mode" as a disadvantage: it submits each "rule" as a separate job, spamming the scheduler with dependent tasks. The main Snakemake process also resides on the login node until all jobs have finished, occupying some of its resources.

If you wish to adapt your Python-based program for multi-node cluster execution, consider applying the workflow principles learned from this lesson to the Parsl framework. Again, NERSC's Parsl docs provide helpful tips.

Contributing

This is a translation of the old HPC Workflows lesson into The Carpentries Workbench, using R Markdown (Rmd). You are cordially invited to contribute! If you're unsure where to start, please check the list of open issues.

Building Locally

If you edit the lesson, it is important to verify that the changes are rendered properly in the online version. The best way to do this is to build the lesson locally. You will need an R environment to do this: as described in the {sandpaper} docs, the environment can be either your terminal or RStudio.

Setup

The environment.yml file describes a Conda virtual environment that includes R, Snakemake, amdahl, pandoc, and termplotlib: the tools you'll need to develop and run this lesson, as well as some dependencies. To prepare the environment, install Miniconda following the official instructions. Then open a shell application and create a new environment:

```shell
you@yours:~$ cd path/to/local/hpc-workflows
you@yours:hpc-workflows$ conda env create -f environment.yml
```

N.B.: the environment will be named "workflows" by default. If you prefer another name, add -n «alternate_name» to the command.
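For orientation, a Conda environment file of the kind described above generally looks like the following (a hypothetical sketch: the channel choices and package spellings are assumptions, not the repository's actual environment.yml):

```yaml
# Hypothetical sketch of an environment file like the one described above.
name: workflows
channels:
  - conda-forge
  - bioconda
dependencies:
  - r-base
  - snakemake
  - pandoc
  - pip
  - pip:
      - amdahl        # assumption: installed from PyPI
      - termplotlib   # assumption: installed from PyPI
```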

{sandpaper}

{sandpaper} is the engine behind The Carpentries Workbench: it provides the lesson layout and generates the static website. It is an R package, and has not yet been installed. Paraphrasing the installation instructions, start R or radian, then install:

```shell
you@yours:hpc-workflows$ R --no-restore --no-save
```

```r
install.packages(
  c("sandpaper", "varnish", "pegboard", "tinkr"),
  repos = c("https://carpentries.r-universe.dev/", getOption("repos"))
)
```

Now you can render the site! From your R session,

```r
library("sandpaper")
sandpaper::serve()
```

This should output something like the following:

```plain
Output created: hpc-workflows/site/docs/index.html
To stop the server, run servr::daemon_stop(1) or restart your R session
Serving the directory hpc-workflows/site/docs at http://127.0.0.1:4321
```

Click on the link to http://127.0.0.1:4321 or copy and paste it into your browser. You should see any changes you've made to the lesson on the corresponding page(s). If it looks right, you're set to proceed!

Owner

  • Name: carpentries-incubator
  • Login: carpentries-incubator
  • Kind: organization

Citation (CITATION.cff)

# This CITATION.cff file was generated with cffinit.
# Visit https://bit.ly/cffinit to generate yours today!
---
cff-version: 1.2.0
title: "HPC Workflow Management with Snakemake"
message: >-
  If you use this software, please cite it using the
  metadata from this file.
type: software
authors:
  - given-names: Alan
    family-names: O'Cais
    email: "alan.ocais@cecam.org"
    affiliation: "University of Barcelona"
    orcid: "https://orcid.org/0000-0002-8254-8752"
    alias: ocaisa
  - given-names: Andrew
    family-names: Reid
    email: "andrew.reid@nist.gov"
    affiliation: "National Institute of Standards and Technology"
    orcid: "https://orcid.org/0000-0002-1564-5640"
    alias: reid-a
  - given-names: Annajiat
    family-names: Alim Rasel
    email: "annajiat@bracu.ac.bd"
    affiliation: "Brac University"
    orcid: "https://orcid.org/0000-0003-0198-3734"
    alias: annajiat
  - given-names: Benson
    family-names: Muite
    email: "benson_muite@emailplus.org"
    affiliation: "Kichakato Kizito"
    alias: bkmgit
  - given-names: Trevor
    family-names: Keller
    email: "trevor.keller@nist.gov"
    affiliation: "National Institute of Standards and Technology"
    orcid: "https://orcid.org/0000-0002-2920-8302"
    alias: tkphd
  - given-names: Wirawan
    family-names: Purwanto
    email: "wpurwant@odu.edu"
    affiliation: "Old Dominion University"
    orcid: "https://orcid.org/0000-0002-2124-4552"
    alias: wirawan0

repository-code: "https://github.com/carpentries-incubator/hpc-workflows"
url: "https://carpentries-incubator.github.io/hpc-workflows/"
abstract: >-
  When using HPC resources, it's very common to need to
  carry out the same set of tasks over a set of data
  (commonly called a workflow or pipeline). In this lesson
  we will make an experiment that takes an application which
  runs in parallel and investigate its scalability. To do
  that we will need to gather data, in this case that means
  running the application multiple times with different
  numbers of CPU cores and recording the execution time.
  Once we've done that we need to create a visualisation of
  the data to see how it compares against the ideal case.

  We could do all of this manually, but there are useful
  tools to help us manage data analysis pipelines like we
  have in our experiment. In the context of this lesson,
  we'll learn about one of those: Snakemake.
keywords:
  - HPC
  - Carpentries
  - Lesson
  - Workflow
  - Pipeline
license: "CC-BY-4.0"
references:
  - title: "Getting Started with Snakemake"
    authors:
      - family-names: Collins
        given-names: Daniel
        alias: DC23
    type: software
    repository-code: "https://github.com/carpentries-incubator/workflows-snakemake"
    url: "https://carpentries-incubator.github.io/workflows-snakemake/"
  - title: "Snakemake for Bioinformatics"
    authors:
      - family-names: Booth
        given-names: Tim
        alias: tbooth
        orcid: "https://orcid.org/0000-0003-2470-9519"
    type: software
    repository-code: "https://github.com/carpentries-incubator/snakemake-novice-bioinformatics/"
    url: "https://carpentries-incubator.github.io/snakemake-novice-bioinformatics"

GitHub Events

Total
  • Issues event: 1
  • Delete event: 3
  • Issue comment event: 3
  • Push event: 39
  • Pull request review event: 2
  • Pull request event: 6
  • Create event: 4
Last Year
  • Issues event: 1
  • Delete event: 3
  • Issue comment event: 3
  • Push event: 39
  • Pull request review event: 2
  • Pull request event: 6
  • Create event: 4

Committers

Last synced: over 1 year ago

All Time
  • Total Commits: 31
  • Total Committers: 7
  • Avg Commits per committer: 4.429
  • Development Distribution Score (DDS): 0.71
Past Year
  • Commits: 20
  • Committers: 6
  • Avg Commits per committer: 3.333
  • Development Distribution Score (DDS): 0.55
Top Committers
Name Email Commits
Trevor Keller t****r@n****v 9
Alan O'Cais a****s@c****g 9
Andrew Reid a****d@n****v 6
Toby Hodges t****s@g****m 3
Trevor Keller t****r@g****m 2
ocaisa o****a 1
tkphd t****d 1
Committer Domains (Top 20 + Academic)

Issues and Pull Requests

Last synced: about 1 year ago

All Time
  • Total issues: 11
  • Total pull requests: 20
  • Average time to close issues: 7 months
  • Average time to close pull requests: 8 days
  • Total issue authors: 6
  • Total pull request authors: 4
  • Average comments per issue: 1.27
  • Average comments per pull request: 1.75
  • Merged pull requests: 18
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 4
  • Pull requests: 13
  • Average time to close issues: about 1 hour
  • Average time to close pull requests: 11 days
  • Issue authors: 3
  • Pull request authors: 2
  • Average comments per issue: 1.5
  • Average comments per pull request: 1.92
  • Merged pull requests: 11
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • ocaisa (4)
  • tkphd (3)
  • cgross95 (1)
  • guyer (1)
  • reid-a (1)
  • tobyhodges (1)
Pull Request Authors
  • tkphd (18)
  • ocaisa (10)
  • reid-a (3)
Top Labels
Issue Labels
Pull Request Labels
type: template and tools (9)