mixvlmc

Variable Length Markov Chains with Covariates

https://github.com/fabrice-rossi/mixvlmc

Science Score: 39.0%

This score indicates how likely this project is to be science-related based on various indicators:

○
CITATION.cff file
✓
codemeta.json file
Found codemeta.json file
✓
.zenodo.json file
Found .zenodo.json file
✓
DOI references
Found 5 DOI reference(s) in README
○
Academic publication links
○
Academic email domains
○
Institutional organization owner
○
JOSS paper metadata
○
Scientific vocabulary similarity
Low similarity (17.2%) to scientific vocabulary

Keywords

machine-learning markov-chain markov-model r r-package r-stats statistics time-series

Last synced: 9 months ago · JSON representation

Repository

Variable Length Markov Chains with Covariates

Basic Info

Host: GitHub
Owner: fabrice-rossi
License: gpl-3.0
Language: R
Default Branch: main
Homepage: https://fabrice-rossi.github.io/mixvlmc/
Size: 19.4 MB

Statistics

Stars: 4
Watchers: 1
Forks: 4
Open Issues: 15
Releases: 3

Topics

machine-learning markov-chain markov-model r r-package r-stats statistics time-series

Created about 4 years ago · Last pushed about 1 year ago

Metadata Files

Readme Changelog License

README.Rmd

---
output: github_document
---



```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  out.width = "100%",
  cache.path = ".README_cache/"
)
```

# Variable Length Markov Chains with Covariates



[![R-CMD-check](https://github.com/fabrice-rossi/mixvlmc/actions/workflows/R-CMD-check.yaml/badge.svg)](https://github.com/fabrice-rossi/mixvlmc/actions/workflows/R-CMD-check.yaml) 
[![Codecov test coverage](https://codecov.io/gh/fabrice-rossi/mixvlmc/branch/main/graph/badge.svg)](https://app.codecov.io/gh/fabrice-rossi/mixvlmc?branch=main)


`mixvlmc` implements variable length Markov chains (VLMC) and variable length Markov chains with covariates (COVLMC), as described in:

-   [Bühlmann, P. and Wyner, A. J. (1999), Variable length Markov chains. Ann. Statist. 27 (2) 480-513](https://dx.doi.org/10.1214/aos/1018031204)
-   [Zanin Zambom, A., Kim, S. and Lopes Garcia, N. (2022), Variable length Markov chain with exogenous covariates. J. Time Ser. Anal., 43 (2) 312-328](https://doi.org/10.1111/jtsa.12615)

`mixvlmc` includes functionalities similar to the ones available in 
[VLMC](https://cran.r-project.org/package=VLMC) and [PST](https://cran.r-project.org/package=PST). The main advantages of `mixvlmc` are
the support of time varying covariates with COVLMC and the introduction of 
post-pruning of the models that enables fast model selection via information 
criteria.

## Installation

The package can be installed from CRAN with:
```{r, eval = FALSE}
install.packages("mixvlmc")
```

The development version is available from GitHub:

```{r, eval = FALSE}
# install.packages("devtools")
devtools::install_github("fabrice-rossi/mixvlmc")
```

## Usage

### Variable length Markov chains

Variable length Markov chains (VLMC) are sparse high order Markov chains. They can be used to model time series (sequences) with discrete values (states) with a mix of small order dependencies for certain states and higher order dependencies for other states. For instance, with a binary time series, the probability of observing 1 at time $t$ could be constant whatever the older past states if the last one (at time $t-1$) was 1, but could depend on states at time $t-3$ and $t-2$ if the state was 0 at time $t-1$. A collection of past states that determines completely the transition probabilities is a *context* of the VLMC. Read `vignette("context-trees")` for details about contexts and context tree, and see `vignette("variable-length-markov-chains")` for a more detailed introduction to VLMC.

VLMC with covariates (COVLMC) are extension of VLMC in which transition probabilities (probabilities of the next state given the past) can be influenced by the past values of some covariates (in addition to the past values of the time series itself). Each context is associated to a logistic model that maps the (past values of the) covariates to transition probabilities.

### Fitting a VLMC

The package is loaded in a standard way.

```{r}
library(mixvlmc)
library(ggplot2) ## we load ggplot2 for the autoplot examples
```

The main function of VLMC is `vlmc()` which can be called on a time series represented by a numerical vector or a factor, for instance.

```{r}
set.seed(0)
x <- sample(c(0L, 1L), 200, replace = TRUE)
model <- vlmc(x)
model
```

The default parameters of `vlmc()` will tend to produce overly complex VLMC in order to avoid missing potential structure in the time series. In the example above, we expect the optimal VLMC to be a constant distribution as the sample is independent and uniformly distributed (it has no temporal structure). The default parameters give here an overly complex model, as illustrated by its text based representation

```{r}
draw(model)
```

The representation uses simple ASCII art to display the contexts of the VLMC organized into a tree (see `vignette("context-trees")` for a more detailed introduction):

-   the root `r draw_control()$root` corresponds to an empty context;
-   one can read contexts by following branches (represented by 
   `r draw_control()$hbranch`) down to their ends (the leaves): for instance 
   $(0, 0, 0, 1, 0, 1)$ is one of the contexts of the tree.

Here the context $(0, 0, 0, 1, 0, 1)$ is associated to the transition
probabilities $(1, 0)$. This means that when one observes this context in
the time series, it is *always* followed by a 0. Notice that contexts are
extends to the left when we go down in the tree as deep nodes corresponds to
older values. Some papers prefer to write contexts from the most recent
value to the oldest one. With this convention, the "reverse" context 
$(1, 0, 1, 0, 0, 0)$ corresponds to the sub time series $(0, 0, 0, 1, 0, 1)$.
Unless otherwise specified, we write contexts in the temporal order. 

#### BIC based model selection

The VLMC above is obviously overfitting to the time series, as illustrated by the 0/1 transition probabilities. A classical way to select a good model is to minimize the [BIC](https://en.wikipedia.org/wiki/Bayesian_information_criterion). In `mixvlmc` this can be done easily using using 'tune_vlmc()' which fits first a complex VLMC and then *prunes* it (using a combination of `cutoff()` and `prune()`), as follows (see `vignette("variable-length-markov-chains")` for details):

```{r}
best_model_tune <- tune_vlmc(x)
best_model <- as_vlmc(best_model_tune)
draw(best_model)
```

As expected, we end up with a constant model.

In time series with actual temporal patterns, the optimal model will be more complex. As a very basic illustrative example, let us consider the `sunspot.year` time series and turn it into a binary one, with high activity associated to a number of sun spots larger than the median number.

```{r}
sun_activity <- as.factor(ifelse(sunspot.year >= median(sunspot.year), "high", "low"))
```

We adjust automatically an optimal VLMC as follows:

```{r}
sun_model_tune <- tune_vlmc(sun_activity)
sun_model_tune
```

The results of the pruning process can be represented graphically:

```{r sunspots_bic}
print(autoplot(sun_model_tune) + geom_point())
```

The plot shows that simpler models are too simple as the BIC increases when pruning becomes strong enough. The best model remains rather complex (as expected based on the periodicity of the Solar cycle):

```{r}
best_sun_model <- as_vlmc(sun_model_tune)
draw(best_sun_model)
```

### Fitting a VLMC with covariates

To illustrate the use of covariates, we use the power consumption data set included in the package (see `vignette("covlmc")` for details). We consider a week of electricity usage as follows:

```{r active_power_week_5}
pc_week_5 <- powerconsumption[powerconsumption$week == 5, ]
elec <- pc_week_5$active_power
ggplot(pc_week_5, aes(x = date_time, y = active_power)) +
  geom_line() +
  xlab("Date") +
  ylab("Activer power (kW)")
```

The time series displays some typical patterns of electricity usage:

-   low active power at night (typically below 0.4 kW);
-   standard use between 0.4 and 2 kW;
-   peak use above 2 kW.

We build a discrete time series from those (somewhat arbitrary) thresholds:

```{r}
elec_dts <- cut(elec, breaks = c(0, 0.4, 2, 8), labels = c("low", "typical", "high"))
```

The best VLMC model is quite simple. It is almost a standard order one Markov chain, up to the order 2 context used when the active power is *typical*.

```{r}
elec_vlmc_tune <- tune_vlmc(elec_dts)
best_elec_vlmc <- as_vlmc(elec_vlmc_tune)
draw(best_elec_vlmc)
```

As pointed about above, *low* active power tend to correspond to night phase. We can include this information by introducing a day covariate as follows:

```{r}
elec_cov <- data.frame(day = (pc_week_5$hour >= 7 & pc_week_5$hour <= 17))
```

A COVLMC is estimated using the `covlmc` function:

```{r}
elec_covlmc <- covlmc(elec_dts, elec_cov, min_size = 2, alpha = 0.5)
draw(elec_covlmc, model = "full")
```

The model appears a bit complex. To get a more adapted model, we use a BIC based model selection as follows:

```{r elec_tune_process}
elec_covlmc_tune <- tune_covlmc(elec_dts, elec_cov)
print(autoplot(elec_covlmc_tune))
```

```{r}
best_elec_covlmc <- as_covlmc(elec_covlmc_tune)
draw(best_elec_covlmc, model = "full")
```


As in the VLMC case, the optimal model remains rather simple:

-   the *high* context do not use the covariate and is equivalent to the vlmc context;
-   the *low* context is more interesting: it does not switch to a *high* context (hence the single row of parameters) but uses the covariate. As expected, the probability of switching from *low* to *typical* is larger during the day;
-   the *typical* context is described in a more complex way that in the case of the vlmc as the transition probabilities depend on the previous state.

### Sampling

VLMC models can also be used to sample new time series as in the VMLC bootstrap proposed by Bühlmann and Wyner. For instance, we can estimate the longest time period spent in the *high* active power regime. In this "predictive" setting, the [AIC](https://en.wikipedia.org/wiki/Akaike_information_criterion) may be more adapted to select the best model. Notice that some quantities can be computed directly from the model in the VLMC case, using classical results on Markov Chains. See `vignette("sampling")` for details on sampling. 

We first select two models based on the AIC.

```{r}
best_elec_vlmc_aic <- as_vlmc(tune_vlmc(elec_dts, criterion = "AIC"))
best_elec_covlmc_aic <- as_covlmc(tune_covlmc(elec_dts, elec_cov, criterion = "AIC"))
```

The we sample 100 new time series for each model, using the `simulate()` function as follows:

```{r cache=TRUE}
set.seed(0)
vlmc_simul <- vector(mode = "list", 100)
for (k in seq_along(vlmc_simul)) {
  vlmc_simul[[k]] <- simulate(best_elec_vlmc_aic, nsim = length(elec_dts), init = elec_dts[1:2])
}
```

```{r cache=TRUE}
set.seed(0)
covlmc_simul <- vector(mode = "list", 100)
for (k in seq_along(covlmc_simul)) {
  covlmc_simul[[k]] <- simulate(best_elec_covlmc_aic, nsim = length(elec_dts), covariate = elec_cov, init = elec_dts[1:2])
}
```

Then statistics can be computed on those time series. For instance, we look for the longest time period spent in the *high* active power regime.

```{r}
longuest_high <- function(x) {
  high_length <- rle(x == "high")
  10 * max(high_length$lengths[high_length$values])
}
lh_vlmc <- sapply(vlmc_simul, longuest_high)
lh_covlmc <- sapply(covlmc_simul, longuest_high)
```

The average longest time spent in *high* consecutively is

-   for the VLMC: `r mean(lh_vlmc)` minutes with a standard error of `r sd(lh_vlmc)/sqrt(length(vlmc_simul))`;
-   for the VLMC with covariate: `r mean(lh_covlmc)` minutes with a standard error of `r sd(lh_covlmc)/sqrt(length(covlmc_simul))`;
-   `r longuest_high(elec_dts)` minutes for the observed time series.

The following figure shows the distributions of the times obtained by both models as well as the observed value. The VLMC model with covariate is able to generate longer sequences in the *high* active power state than the bare VLMC model as the consequence of the sensitivity to the day/night schedule.

```{r longest_time_in_high}
lh <- data.frame(
  time = c(lh_vlmc, lh_covlmc),
  model = c(rep("VLMC", length(lh_vlmc)), rep("COVLMC", length(lh_covlmc)))
)
ggplot(lh, aes(x = time, color = model)) +
  geom_density() +
  geom_rug(alpha = 0.5) +
  geom_vline(xintercept = longuest_high(elec_dts), color = 3)
```

The VLMC with covariate can be used to investigate the effects of changes in those covariates. For instance, if the day time is longer, we expect *high* power usage to be less frequent. For instance, we simulate one week with a day time from 6:00 to 20:00 as follows.

```{r cache = TRUE}
elec_cov_long_days <- data.frame(day = (pc_week_5$hour >= 6 & pc_week_5$hour <= 20))
set.seed(0)
covlmc_simul_ld <- vector(mode = "list", 100)
for (k in seq_along(covlmc_simul_ld)) {
  covlmc_simul_ld[[k]] <- simulate(best_elec_covlmc_aic, nsim = length(elec_dts), covariate = elec_cov_long_days, init = elec_dts[1:2])
}
```

As expected the distribution of the longest time spend consecutively in *high* power usage is shifted to lower values when the day length is increased.

```{r longest_time_in_high_ld}
lh_covlmc_ld <- sapply(covlmc_simul_ld, longuest_high)
day_time_effect <- data.frame(
  time = c(lh_covlmc, lh_covlmc_ld),
  `day length` = c(rep("Short days", length(lh_covlmc)), rep("Long days", length(lh_covlmc_ld))),
  check.names = FALSE
)
ggplot(day_time_effect, aes(x = time, color = `day length`)) +
  geom_density() +
  geom_rug(alpha = 0.5)
```

Owner

Name: Fabrice Rossi
Login: fabrice-rossi
Kind: user

Website: http://apiacoa.org/
Twitter: fabrice_rossi
Repositories: 6
Profile: https://github.com/fabrice-rossi

GitHub Events

Total

Create event: 2
Release event: 1
Issues event: 4
Watch event: 1
Delete event: 1
Issue comment event: 6
Push event: 3
Pull request event: 3

Last Year

Create event: 2
Release event: 1
Issues event: 4
Watch event: 1
Delete event: 1
Issue comment event: 6
Push event: 3
Pull request event: 3

Issues and Pull Requests

Last synced: 9 months ago

All Time

Total issues: 58
Total pull requests: 19
Average time to close issues: about 2 months
Average time to close pull requests: about 4 hours
Total issue authors: 2
Total pull request authors: 3
Average comments per issue: 0.78
Average comments per pull request: 0.42
Merged pull requests: 17
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 2
Pull requests: 2
Average time to close issues: 5 days
Average time to close pull requests: 3 minutes
Issue authors: 2
Pull request authors: 1
Average comments per issue: 3.5
Average comments per pull request: 0.0
Merged pull requests: 2
Bot issues: 0
Bot pull requests: 0

View more stats

Top Authors

Issue Authors

fabrice-rossi (57)
m-schillinger (1)

Pull Request Authors

fabrice-rossi (10)
hugotmdd (5)
Guenoj (4)

Top Labels

Issue Labels

feature (20) improvement (13) bug (8) Performance (7) documentation (4) test (1) invalid (1)

Pull Request Labels

Packages

Total packages: 1
Total downloads:
- cran 556 last-month

Total dependent packages: 0
Total dependent repositories: 0
Total versions: 4
Total maintainers: 1

cran.r-project.org: mixvlmc

Variable Length Markov Chains with Covariates

Homepage: https://github.com/fabrice-rossi/mixvlmc
Documentation: http://cran.r-project.org/web/packages/mixvlmc/mixvlmc.pdf
License: GPL (≥ 3)
Latest release: 0.2.2
published about 1 year ago

Versions: 4
Dependent Packages: 0
Dependent Repositories: 0
Downloads: 556 Last month

Rankings

Forks count: 17.5%

Dependent packages count: 29.0%

Dependent repos count: 34.7%

Stargazers count: 35.2%

Average: 41.2%

Downloads: 89.5%

Maintainers (1)

Fabrice.Rossi@apiacoa.org

Last synced: 9 months ago

Dependencies

.github/workflows/R-CMD-check.yaml actions

actions/checkout v3 composite
r-lib/actions/check-r-package v2 composite
r-lib/actions/setup-r v2 composite
r-lib/actions/setup-r-dependencies v2 composite

.github/workflows/check-standard.yaml actions

actions/checkout v3 composite
r-lib/actions/check-r-package v2 composite
r-lib/actions/setup-pandoc v2 composite
r-lib/actions/setup-r v2 composite
r-lib/actions/setup-r-dependencies v2 composite

.github/workflows/pkgdown.yaml actions

JamesIves/github-pages-deploy-action v4.4.1 composite
actions/checkout v3 composite
r-lib/actions/setup-pandoc v2 composite
r-lib/actions/setup-r v2 composite
r-lib/actions/setup-r-dependencies v2 composite

DESCRIPTION cran

R >= 2.10 depends
Rcpp >= 1.0.8.3 imports
VGAM * imports
assertthat * imports
butcher * imports
nnet * imports
pROC * imports
rlang * imports
stats * imports
stringr * imports
withr * imports
data.table * suggests
geodist * suggests
ggplot2 * suggests
knitr * suggests
rmarkdown * suggests
testthat >= 3.0.0 suggests
tibble * suggests

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Open Source Science

mixvlmc

Science Score: 39.0%

Keywords

Repository

Basic Info

Statistics

Topics

Metadata Files

README.Rmd

Owner

GitHub Events

Total

Last Year

Issues and Pull Requests

All Time

Past Year

Top Authors

Issue Authors

Pull Request Authors

Top Labels

Issue Labels

Pull Request Labels

Packages

cran.r-project.org: mixvlmc

Rankings

Maintainers (1)

Dependencies