beethoven

BEETHOVEN is: Building an Extensible, rEproducible, Test-driven, Harmonized, Open-source, Versioned, ENsemble model for air quality

https://github.com/niehs/beethoven

Science Score: 36.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Committers with academic emails
    12 of 21 committers (57.1%) from academic institutions
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (12.7%) to scientific vocabulary
Last synced: 6 months ago

Repository

Basic Info
Statistics
  • Stars: 6
  • Watchers: 6
  • Forks: 2
  • Open Issues: 8
  • Releases: 0
Created over 2 years ago · Last pushed 6 months ago
Metadata Files
Readme Contributing License

README.md

Building an Extensible, rEproducible, Test-driven, Harmonized, Open-source, Versioned, ENsemble model for air quality

[![R-CMD-check](https://github.com/NIEHS/beethoven/actions/workflows/check-standard.yaml/badge.svg)](https://github.com/NIEHS/beethoven/actions/workflows/check-standard.yaml) [![cov](https://NIEHS.github.io/beethoven/badges/coverage.svg)](https://github.com/NIEHS/beethoven/actions/workflows/test-coverage.yaml) [![lint](https://github.com/NIEHS/beethoven/actions/workflows/lint.yaml/badge.svg)](https://github.com/NIEHS/beethoven/actions/workflows/lint.yaml) [![Lifecycle: experimental](https://img.shields.io/badge/lifecycle-experimental-orange.svg)](https://lifecycle.r-lib.org/articles/stages.html#experimental)

Group Project for the Spatiotemporal Exposures and Toxicology group with help from friends :smiley: :cowboy_hat_face: :earth_americas:

Installation

    remotes::install_github("NIEHS/beethoven")

Workflow

beethoven is a targets reproducible analysis pipeline with the following workflow.

[Figure: `beethoven` workflow]

Version 0.4.4 of beethoven has stable targets for downloading data files, calculating features at AQS sites, and merging them into a base learner-ready data.table (dt_feat_calc_xyt). Ongoing changes relate to calculating features for the prediction grid, computationally managing the prediction grid, base learner hyperparameter tuning, and meta learner function development.

    targets::tar_visnetwork()

[Figure: `beethoven` targets::tar_visnetwork() output]

Organization

Here, we describe the structure of the repository, important files, and the targets object naming conventions.

Folder Structure

  • R/ is where the beethoven functions are stored. Only ".R" files should be in this folder (e.g. targets helpers, post-processing, and model fitting functions).
  • inst/ is a directory for arbitrary files outside of the main R/ directory
    • targets/ is a sub-directory within inst/ which contains the pipeline files (e.g. "targets_aqs.R"). These files declare the `targets::tar_target` objects which constitute the `beethoven` pipeline.
  • tests/ stores unit and integration tests (testthat/) and test data (testdata/) according to the testthat package's standard structure.
    • testthat.R is created and maintained by testthat, and is not to be edited manually.
  • container/ stores definition files and build scripts to build covariate- and model-specific Apptainer container images (container_covariates.def and container_models.def).
  • man/ contains function documentation files (".Rd") which are generated by the roxygen2 package. These files are not to be edited manually.
  • vignettes/ contains ".Rmd" narrative text and code files. These are rendered by pkgdown into the Articles section of the beethoven webpage.
  • .github/workflows/ is a hidden directory which stores the GitHub CI/CD "yaml" files.
  • tools/ is dedicated to educational or demonstration material (e.g. Rshiny), but is not excluded from the package build.

Important Files

  • _targets.R configures targets settings, creates computational resource controllers, and structures the beethoven pipeline.
    • To run beethoven, users must review and update the following parameters for their user profile and computing system:
      • controller_*: ensure the local controllers do not request more CPUs than are available on your machine or high performance system.
      • #SBATCH --partition: the partition used for jobs that utilize NVIDIA GPUs (within the glue::glue command).
      • --bind /USER_PATH_TO_INPUT/input:/input (within the glue::glue command).
  • _targets.yaml is created and updated by running targets::tar_make and is not to be edited manually.
  • run.sh submits separate SBATCH jobs for the covariate, CPU- and GPU-enabled base learner, and meta learner targets (see /inst/scripts/). This setup ensures that each stage uses the proper container image and computational resources. To run beethoven, users must review and update the following parameters for their user profile and computing system in each of the inst/scripts/run_* files:
    • #SBATCH --mail-user
    • #SBATCH --partition
    • #SBATCH --mem
    • #SBATCH --cpus-per-task
    • --bind /USER_PATH_TO_INPUT/input:/input
    • --bind /USER_PATH_TO_SLURM/slurm:/USER_PATH_TO_SLURM/slurm
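
The controller_* guidance above (do not request more CPUs than are available) can be checked before a worker count is handed to a controller. The following is a minimal base R sketch, not part of beethoven: `cap_workers` is a hypothetical helper, and `parallel::detectCores()` reports the CPUs of the machine, which on a shared HPC node may be more than the job was actually granted.

```r
# Hypothetical helper (not part of beethoven): cap a requested worker count
# at the number of CPUs that are available.
cap_workers <- function(requested, available = parallel::detectCores()) {
  # detectCores() can return NA on some platforms; fall back to one worker.
  if (is.na(available) || available < 1L) {
    available <- 1L
  }
  capped <- min(as.integer(requested), as.integer(available))
  if (capped < requested) {
    message(sprintf(
      "Requested %d workers but only %d CPUs are available; using %d.",
      as.integer(requested), as.integer(available), capped
    ))
  }
  capped
}

# Example: asking for 4 workers on a 2-CPU machine is reduced to 2.
n_workers <- cap_workers(requested = 4L, available = 2L)
```

On a SLURM allocation, the value given to `--cpus-per-task` is usually the safer upper bound to pass as `available`.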

Running beethoven Pipeline

User settings

The beethoven pipeline is configured for SLURM, with defaults for the NIEHS HPC. To adapt the settings to your own environment, consult your platform's documentation and edit the requested resources in the stage-specific run files (/inst/scripts/) (lines 3-11) and in _targets.R (lines 41-45; individual crew and crew.cluster controller workers).

Critical targets

There are 5 "critical" targets that users may want to change to run beethoven.

  • chr_daterange
    • Controls all time-related targets for the entire pipeline. This is the only target that needs to be changed to update the pipeline with a new temporal range. Month- and year-specific arguments are derived from the time range defined by chr_daterange.
  • chr_nasa_token
    • Sets the file path to the user's NASA Earthdata account credentials. These credentials expire at ~90 day intervals and therefore must be updated regularly.
  • chr_mod06_links
    • The file path to the MOD06 links file. These links must be manually downloaded per the amadeus::download_modis function. The links are then stored in a CSV file that is read by the function. The links file must be updated whenever the date range changes.
  • chr_input_dir
    • The file path to the input directory. This target controls where the raw data files are downloaded to and imported from. This file path must be mounted to the container at run time in the run.sh script.
  • num_dates_split
    • Controls the size of temporal splits. Splitting the temporal range into smaller chunks allows for parallel processing across multiple workers. It also allows for dispatching new dynamic branches when the temporal range is updated.
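
As a rough illustration of how chr_daterange and num_dates_split could interact, the base R sketch below splits a date range into consecutive chunks and derives year and month values from the same range. It is not beethoven's implementation: `split_dates_sketch` is a hypothetical helper, the dates are made-up example values, and treating num_dates_split as "days per chunk" is an assumption.

```r
# Assumed example inputs: a start/end date pair and a chunk size in days.
chr_daterange <- c("2018-01-01", "2018-01-31")
num_dates_split <- 10L

# Hypothetical helper (not beethoven's implementation): split the range into
# consecutive chunks of at most `n` days and return each chunk's start/end.
split_dates_sketch <- function(daterange, n) {
  days <- seq(as.Date(daterange[1]), as.Date(daterange[2]), by = "day")
  # Chunk index for each day: the first n days get 1, the next n get 2, ...
  chunk_id <- ceiling(seq_along(days) / n)
  lapply(split(days, chunk_id), function(d) format(range(d)))
}

# Each list element is a c(start, end) pair that one worker could process.
chunks <- split_dates_sketch(chr_daterange, num_dates_split)

# Year- and month-level values derived from the same range.
all_days <- seq(as.Date(chr_daterange[1]), as.Date(chr_daterange[2]), by = "day")
years <- unique(format(all_days, "%Y"))
months <- unique(format(all_days, "%Y-%m"))
```

Because every chunk is computed from the single range, extending chr_daterange only adds new chunks at the end, which is the property that lets a pipeline dispatch new branches without recomputing earlier ones.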

Apptainer

The current implementation of beethoven uses Apptainer images to run the pipeline with consistent package versions and custom installations. Users must build these images before running beethoven.

    cd container/                       # must be working in the `container/` directory
    sh build_container_covariates.sh    # build "covariates" stage image
    sh build_container_models.sh        # build "models" image
    mv *sif ../                         # move images to `beethoven/` root directory

> [!NOTE]
> .sif files are omitted from GitHub due to size (>5 Gb each)

Run

After switching back to the project root directory, users can run the pipeline with the run.sh shell script. The following lines of /inst/scripts/run_*.sh must be updated with user-specific settings before running the pipeline.

```sh
#SBATCH --mail-user=[USER_EMAIL]        # email address for job notifications
#SBATCH --partition=[PARTITION_NAME]    # HPC partition to run on
#SBATCH --mem=[###G]                    # Total memory for the job
#SBATCH --cpus-per-task=[###]           # Total CPUs for the job
...
  --bind [USERINPUTDIRECTORY]/input:/input \
...
  --bind [USERSYSTEMPATH/munge]:/run/munge \
  --bind [USERSYSTEMPATH/slurm]:[USERSYSTEMPATH/slurm] \
```

Once configured, the pipeline can be run with a SLURM batch job.

    cd ../          # assuming still in the `container/` directory
    sbatch run.sh

The SLURM batch job can also be submitted from an R session with the batch helper function.

    source("R/helpers.R")
    batch()
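
The body of batch() is not shown here, so the following is only a sketch of what submitting a SLURM job from R can look like in base R. `submit_sketch` is a hypothetical stand-in, not beethoven's helper; it defaults to a dry run so the command can be inspected before anything is submitted.

```r
# Hypothetical stand-in for a batch-style helper; the real batch() in
# R/helpers.R may behave differently.
submit_sketch <- function(script = "run.sh", dry_run = TRUE) {
  cmd <- paste("sbatch", shQuote(script))
  if (dry_run) {
    # Report the command without running it.
    message("Would run: ", cmd)
    return(invisible(cmd))
  }
  if (!file.exists(script)) {
    stop("Script not found: ", script)
  }
  if (!nzchar(Sys.which("sbatch"))) {
    stop("sbatch was not found on PATH; is SLURM available on this system?")
  }
  # system2() returns the exit status of the sbatch call.
  system2("sbatch", args = shQuote(script))
}

# Inspect the command first, from the project root directory.
cmd <- submit_sketch("run.sh", dry_run = TRUE)
```

Calling `submit_sketch("run.sh", dry_run = FALSE)` would perform the actual submission on a system where SLURM is installed.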

Contribution

The Developer's Guide provides detailed instructions for how to develop or update beethoven settings or individual targets objects.

To contribute developments or modifications, open a Pull request into the dev branch with a detailed description of the proposed changes. Pull requests must pass all status checks and will then be approved or rejected by beethoven's authors.

Utilize Issues to notify the authors of bugs, questions, or recommendations. Identify each issue with the appropriate label to help ensure a timely response.

Owner

  • Name: National Institute of Environmental Health Sciences
  • Login: NIEHS
  • Kind: organization
  • Location: Durham, NC

The mission of the National Institute of Environmental Health Sciences is to discover how the environment affects people in order to promote healthier lives.

GitHub Events

Total
  • Create event: 22
  • Issues event: 71
  • Watch event: 6
  • Delete event: 20
  • Issue comment event: 102
  • Push event: 246
  • Pull request review comment event: 6
  • Pull request review event: 15
  • Gollum event: 1
  • Pull request event: 40
  • Fork event: 2
Last Year
  • Create event: 22
  • Issues event: 71
  • Watch event: 6
  • Delete event: 20
  • Issue comment event: 102
  • Push event: 246
  • Pull request review comment event: 6
  • Pull request review event: 15
  • Gollum event: 1
  • Pull request event: 40
  • Fork event: 2

Committers

Last synced: 10 months ago

All Time
  • Total Commits: 1,021
  • Total Committers: 21
  • Avg Commits per committer: 48.619
  • Development Distribution Score (DDS): 0.676
Past Year
  • Commits: 329
  • Committers: 4
  • Avg Commits per committer: 82.25
  • Development Distribution Score (DDS): 0.523
Top Committers
Name Email Commits
Insang Song i****g@n****v 331
mitchellmanware m****e@g****m 243
{SET}group 1****y 194
Kyle Messier m****p@e****v 83
Eva Marques m****l@e****v 40
Insang Song s****x@h****m 36
kyle-messier k****r@n****v 28
Spatiotemporal-Exposures-and-Toxicology m****p@a****v 15
Mitchell Manware m****e@M****l 13
Spatiotemporal-Exposures-and-Toxicology m****p@a****v 8
Eva Marques m****l@c****v 7
Eva Marques e****s@g****m 4
Messier m****p@a****v 4
Mariana Kassien k****a@e****v 4
Ranadeep Daw 3****p 3
Eva Marques m****l@g****v 2
dzilber d****r@g****m 2
Daniel Zilber d****r@n****v 1
Mitchell Manware m****e@e****v 1
Spatiotemporal-Exposures-and-Toxicology m****p@a****l 1
Spatiotemporal-Exposures-and-Toxicology m****p@a****n 1

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 99
  • Total pull requests: 117
  • Average time to close issues: 4 months
  • Average time to close pull requests: 3 days
  • Total issue authors: 7
  • Total pull request authors: 6
  • Average comments per issue: 2.44
  • Average comments per pull request: 0.97
  • Merged pull requests: 90
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 39
  • Pull requests: 64
  • Average time to close issues: about 1 month
  • Average time to close pull requests: 2 days
  • Issue authors: 3
  • Pull request authors: 3
  • Average comments per issue: 1.9
  • Average comments per pull request: 0.52
  • Merged pull requests: 47
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • kyle-messier (40)
  • sigmafelix (26)
  • mitchellmanware (21)
  • eva0marques (7)
  • MAKassien (3)
  • Sanisha003 (1)
  • dawranadeep (1)
Pull Request Authors
  • mitchellmanware (46)
  • kyle-messier (32)
  • sigmafelix (30)
  • eva0marques (7)
  • dawranadeep (1)
  • MAKassien (1)
Top Labels
Issue Labels
Covariate development (10) models (7) documentation (7) development (7) enhancement (5) Production (3) Test-Driven-Development (3) bug (3) covariates (3) test-driven development (2) Refactor (2) AQS data (1) refactor (1) help wanted (1) question (1) Exploratory (1)
Pull Request Labels
documentation (2) test-driven development (2) development (2) enhancement (1)

Dependencies

.github/workflows/codecov.yaml actions
  • actions/cache v2 composite
  • actions/checkout v2 composite
  • codecov/codecov-action v3 composite
  • r-lib/actions/setup-r v2 composite
.github/workflows/learn-github-actions.yml actions
  • actions/checkout v3 composite
  • actions/setup-node v3 composite
.github/workflows/test-coverage.yaml actions
  • actions/checkout v3 composite
  • actions/upload-artifact v3 composite
  • r-lib/actions/setup-r v2 composite
  • r-lib/actions/setup-r-dependencies v2 composite
DESCRIPTION cran
  • covr * suggests
  • knitr * suggests
  • rmarkdown * suggests
  • sf * suggests
  • sftime * suggests
  • terra * suggests
  • testthat >= 3.0.0 suggests