https://github.com/ansleybrown1337/runoff-mcmc
My first use of Markov chain Monte Carlo techniques in an attempt to model runoff water quality.
Science Score: 23.0%
This score indicates how likely this project is to be science-related based on various indicators:
-
○CITATION.cff file
-
○codemeta.json file
-
○.zenodo.json file
-
✓DOI references
Found 3 DOI reference(s) in README -
✓Academic publication links
Links to: ncbi.nlm.nih.gov -
○Academic email domains
-
○Institutional organization owner
-
○JOSS paper metadata
-
○Scientific vocabulary similarity
Low similarity (15.6%) to scientific vocabulary
Repository
My first use of Markov chain Monte Carlo techniques in an attempt to model runoff water quality.
Basic Info
Statistics
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
- Releases: 0
Metadata Files
README.md

Modeling Runoff Water Quality using Markov-Chain Monte Carlo Techniques in R
By A.J. Brown, Agricultural Data Scientist
[!WARNING]
This project is not yet complete, so this code is not yet fully functional. This notice will be removed when a first version is successful and online.
Here, we explore the application of Markov chain Monte Carlo (MCMC) techniques to model runoff water quality using a Bayesian approach. We also compare results from a Frequentist linear mixed model (LMM) This project signifies my inaugural venture into leveraging MCMC methodologies, aiming to provide a deeper understanding and predictive capabilities for runoff water quality.
For a nice discussion on Frequentist v. Bayesian theory, check out this article from the National Institue of Health (NIH).
Key Features:
- Programming Language: R
- Key Package: NIMBLE
- Technique: Markov-Chain Monte Carlo (MCMC)
The repository is structured to guide you through the entire process, from data preprocessing to model evaluation. Whether you're an environmental scientist, a data analyst, or someone with a keen interest in water quality modeling, this repository offers insights into the potential of MCMC techniques in the field.
Feel free to explore the code, datasets, and documentation. Your contributions, feedback, and suggestions are always appreciated!
Table of Contents
- Directory Structure
- Getting Started
- Methodology
- Results and Discussion
- Contribute
- License
- References
Directory Structure
Code: This directory contains the R code for analysis.Example Data: Sample dataset that I'm using to test my analysis.images: Contains banner image.Output: Where output images and results will be saved.
Getting Started
Clone the repository:
bash git clone https://github.com/your-username/runoff-mcmc.git cd runoff-mcmcOpen the R code in the repository using an IDE like Rstudio or VS code.
Install NIMBLE.
[!IMPORTANT]
Before installing NIMBLE, you need a compiler and related tools such as make that R can use. For Windows users, this is Rtools, and for Mac users it is Xcode.Execute the R script. This will initiate the MCMC process and might take some time, depending on the dataset's size and the number of iterations.
Once the script completes, you'll be able to see the model's results, including any plots or graphs generated.
Methodology
Data Structure and Experimental Design
Data Structure
The example data provided is runoff water sediment concentration and load measured from a surface-irrigated, agricultural research field with various tillage treatments from 2011 - 2016.
The columns present in the CSV are as follows:
- date (Date): Date of sample collection
- year (Factor): Year of sample collection
- yi (factor): Combination of year and irrigation events as a single factor
- id (Factor): The sample number from that specific irrigation event
- block (Factor): Whether the sample was from tillage replication 1 or 2
- trt (Factor): Designates tillage treatment, CT, MT, or ST
- irr (Factor): Irrigation event number
- tss (Numeric): Total suspended solids (TSS) concentration in water sample (grams/liter, g/L)
- out (Numeric): Outflow runoff water volume (liters, L) for that irrigation event
- tssl (Numeric): TSS load in that irrigation event (kilograms, kg)
Note: - Date numbers represent dates. - Numeric numbers written as either integers or decimals - Factor data represent distinct categories or groups.
Experimental Design
A plot map of the agricultural field is shown below for context:

Data Analysis
For this project, I will be using the work flow outlined by Dr. Richard McElreath in his book, Statistical Rethinking. This flow is as follows, stated in Chapter 4, Geocentric Models: 1. State a clear question 2. Sketch your causal assumptions 3. Use the sketch to build a generative model 4. Use the model to build estimator 5. Profit
1. State a clear question
What is the effect of tillage treatment on total suspended solids (TSS) load in runoff water, considering the effects of all other relevant variables?
2. Sketch your causal assumptions
The second step in our analysis is to define the causal model of our system, which can be represented as a directed acyclic graph (DAG) that represents the causal relationships between the variables in our model.

The OWL is defined using the DAGitty package in R.
This DAG shows us that the variables trt is our exposure variable, or the variable where causal effects are the primary interest. Variables irr, year, and block are variables also associated with the outcome, tssl or tss load. The model implies the following conditional independences:
trt ⊥ irr
trt ⊥ year
trt ⊥ block
irr ⊥ year
irr ⊥ block
year ⊥ block
The conditional independences can be listed as:
trtis independent ofirr,year, andblock.irris independent oftrt,year, andblock.yearis independent oftrt,irr, andblock.blockis independent oftrt,irr, andyear
3. Use the sketch to build a generative model
The linear mixed model with these independences is formulated as:
$$ TSSLoad = \beta0 + \beta1 \times trt + \beta2 \times irr + \beta3 \times year + u_{\text{block}} + \epsilon $$
In this model:
- $TSSLoad$ is the dependent variable.
- $trt$, $irr$, and $year$ are the fixed effects, with $\beta1$, $\beta2$, and $\beta_3$ as their respective coefficients.
- $\beta_0$ is the intercept.
- $u{\text{block}}$ represents the random effect associated with
block. This term captures the variability in $TSSLoad$ attributable to different levels ofblock. It's assumed that $u{\text{block}}$ is normally distributed with a mean of 0 and some variance $\sigma^2_{\text{block}}$. - $\epsilon$ is the residual error term, capturing the variability in $TSSLoad$ not explained by the model. It's typically assumed to be normally distributed with a mean of 0 and variance $\sigma^2$.
Selecting Starting Values and Priors
In Bayesian modeling, the selection of prior distributions is a critical step that reflects our prior beliefs about the parameters before any data is considered.
Fixed Effects
For the fixed effects in our model, we use weakly informative priors:
- Prior:
- Rationale:
Random Effects
Random effects account for variability in block in our model. The priors for these terms are chosen to reflect the expected variability between groups:
General Approach:
- tau parameters (e.g., tau_block): Uniform prior with bounds set based on theoretical considerations and understanding of the measurement scale, not on the observed standard deviations.
For
tssandtssl:- Priors for the
tauparameters are set to uniform distributions with reasonable bounds that allow for flexibility in the variance components.
- Priors for the
For each of the u terms, representing the random effects:
- Priors: Terms (u_yi, u_block, u_id) follow normal distributions with a mean of zero and variances informed by the tau parameters.
- Rationale: This reflects the expected variability and is not directly informed by the data itself.
Sigma - Prior: - Rationale:
By selecting priors in this manner, we aim to ensure that they are not informed by the dataset at hand, but rather by prior knowledge and theoretical considerations. This approach upholds the principles of Bayesian analysis, allowing the data to update our prior beliefs reflected in the posterior distributions.
Results and Discussion
Bayesian LMM Results
Coming soon!
Model Convergence
Trace and marginal density plots
We can check the convergence of our model by looking at the trace plots of the MCMC chains for each parameter. If the chains are well-mixed and stationary, it suggests that the model has converged and the MCMC algorithm has sampled the posterior distribution well.
You can see in the trace plots that the chains are well-mixed and stationary, and the posterior distributions are well-sampled overall.
Correlation plots
Additionaly, we may want to investigate correlation between variables. The correlationPlot() function in the BayesianTools package can be used to visualize the correlation between variables in the MCMC chains:

Gelman chain convergence
Thirdly, we will also check how two chains from the same NIMBLE model converged using the gelman.diag() and gelman.plot() functions as found in the BayesianTools package:
r
The gelman.diag() function gives you the scale reduction factors for each parameter. A factor of 1 means that between variance and within chain variance are equal, larger values mean that there is still a notable difference between chains. Often, it is said that everything below 1.1 or so is OK, but note that this is more a rule of thumb.
The gelman,plot() function shows you the development of the scale-reduction over time (chain steps), which is useful to see whether a low chain reduction is also stable (sometimes, the factors go down and then up again, as you will see). The gelman plot is also a nice tool to see roughly where this point is, that is, from which point on the chains seem roughly converged. You can see that both chains converge to eachother around the value of zero after the 10,000 iterations mark. This is a good sign that our model has converged.
Posterior predictive checks
Finally, we would like to do the posterior predictive checks for the NIMBLE model to see how it captures TSS error overall. Using a NIMBLE function to generate simulated TSS error, we will compare resulting simulated mean , median, min and max value distributions to the observed TSS error summary statistic:

Summary of convergence
Considering all of the above, that is, the trace plots, correlation plots, and gelman plots, we can conclude that our model has converged and the MCMC algorithm has sampled the posterior distribution well.
Posterior Comparison Results
Coming soon!
Frequentist LMM Results
Coming soon!
Contribute
Contributions are always welcome! Please read the CONTRIBUTING.md file for details on how to contribute.
License
This project is licensed under the GNU GPL 2.0 License. See the LICENSE.md file for details.
References
- NIMBLE Development Team. 2023. NIMBLE: MCMC, Particle Filtering, and Programmable Hierarchical Modeling. doi: 10.5281/zenodo.1211190. R package version 1.0.1, https://cran.r-project.org/package=nimble.
Owner
- Name: AJ Brown
- Login: ansleybrown1337
- Kind: user
- Company: Colorado State University
- Website: sites.google.com/view/ansleyjbrown
- Repositories: 4
- Profile: https://github.com/ansleybrown1337
Data Specialist & Agronomist | Ag Water Quality Program