unsupervised-multimodal-trajectory-modeling

We use EM for a mixture of state space models to perform unsupervised clustering of short trajectories.

https://github.com/burkh4rt/unsupervised-multimodal-trajectory-modeling

Keywords

clustering expectation-maximization state-space-model

Repository

Basic Info

Host: GitHub
Owner: burkh4rt
License: mit
Language: Python
Default Branch: master
Homepage: https://pypi.org/project/unsupervised-multimodal-trajectory-modeling/
Size: 668 KB

Statistics

Stars: 2
Watchers: 1
Forks: 1
Open Issues: 0
Releases: 3

Created over 2 years ago · Last pushed almost 2 years ago

Metadata Files

Readme License Citation Codemeta

Unsupervised Multimodal Trajectory Modeling

We propose and validate a mixture of state space models to perform unsupervised clustering of short trajectories[^1]. Within the state space framework, we let expensive-to-gather biomarkers correspond to hidden states and readily obtainable cognitive metrics correspond to measurements. Upon training with expectation maximization, we find that our clusters stratify persons according to clinical outcome. Furthermore, we can effectively predict on held-out trajectories using cognitive metrics alone. Our approach accommodates missing data through model marginalization and generalizes across research and clinical cohorts.

Data format

We consider a training dataset

$$ \mathcal{D} = \{(x_{1:T}^{i}, z_{1:T}^{i}) \}_{1 \leq i \leq n_d} $$

consisting of $n_d$ sequences of states and observations paired in time. We denote the states $z_{1:T}^{i} = (z_1^i, z_2^i, \dotsc, z_T^i)$ where $z_t^i \in \mathbb{R}^d$ corresponds to the state at time $t$ for the $i$ th instance and measurements $x_{1:T}^{i} = (x_1^i, x_2^i, \dotsc, x_T^i)$ where $x_t^i \in \mathbb{R}^\ell$ corresponds to the observation at time $t$ for the $i$ th instance. For the purposes of this code, we adopt the convention that collections of time-delineated sequences of vectors will be stored as 3-tensors, where the first dimension spans time $1\leq t \leq T$, the second dimension spans instances $1\leq i \leq n_d$ (these will almost always correspond to an individual or participant), and the third dimension spans the components of each state or observation vector (and so will have dimension either $d$ or $\ell$). We accommodate trajectories of differing lengths by standardising to the longest available trajectory in a dataset and appending np.nan's to shorter trajectories.
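As a minimal sketch of this storage convention, the snippet below pads variable-length trajectories into a single (time, instance, component) array. The helper name `stack_trajectories` and the toy values are illustrative assumptions, not part of the package's API.

```python
import numpy as np


def stack_trajectories(trajectories):
    """Stack variable-length trajectories into a (T, n, dim) array padded with np.nan.

    `trajectories` is a list of array-likes, each of shape (T_i, dim).
    Hypothetical helper for illustration only.
    """
    t_max = max(len(traj) for traj in trajectories)
    dim = np.asarray(trajectories[0]).shape[1]
    out = np.full((t_max, len(trajectories), dim), np.nan)
    for i, traj in enumerate(trajectories):
        traj = np.asarray(traj, dtype=float)
        out[: len(traj), i, :] = traj  # shorter trajectories keep trailing NaNs
    return out


# two made-up participants with d = 2 state components, of lengths 3 and 2
states = stack_trajectories(
    [
        [[0.1, 1.0], [0.2, 1.1], [0.3, 1.2]],
        [[0.5, 2.0], [0.6, 2.1]],
    ]
)
print(states.shape)  # (3, 2, 2)
```

The first axis indexes time, the second indexes participants, and the third indexes vector components, so `states[:, i, :]` recovers the (padded) trajectory of participant `i`.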

Model specification

We adopt a mixture of state space models for the data:

$$ p(z^i_{1:T}, x^i_{1:T}) = \sum_{c=1}^{n_c} \pi_{c} \delta_{ \{c=c^i \} } \bigg( p(z_1^i| c) \prod_{t=2}^T p(z_t^i | z_{t-1}^i, c) \prod_{t=1}^T p(x_t^i | z_t^i, c) \bigg). $$

Each individual $i$ is independently assigned to some cluster $c^i$ with probability $\pi_{c}$, and then conditional on this cluster assignment, their initial state $z_1^i$ is drawn according to $p(z_1^i| c)$, with each subsequent state $z_t^i, 2\leq t \leq T$ being drawn in turn using the cluster-specific _state model_ $p(z_t^i | z_{t-1}^i, c)$, depending on the previous state. At each point in time, we obtain an observation $x_t^i$ from the cluster-specific _measurement model_ $p(x_t^i | z_t^i, c)$, depending on the current state. In what follows, we assume both the state and measurement models are stationary for each cluster, i.e. they are independent of $t$. In particular, for a given individual, the relationship between the state and measurement should not change over time.

In our main framework, inspired by the work of Chiappa and Barber[^2], we additionally assume that the cluster-specific state initialisation is Gaussian, i.e. $p(z_1^i| c) = \eta_d(z_1^i; m_c, S_c)$, and the cluster-specific state and measurement models are linear Gaussian, i.e. $p(z_t^i | z_{t-1}^i, c) = \eta_d(z_t^i; z_{t-1}^iA_c, \Gamma_c)$ and $p(x_t^i | z_t^i, c) = \eta_\ell(x_t^i; z_t^iH_c, \Lambda_c)$, where $\eta_d(\cdot; \mu, \Sigma)$ denotes the multivariate $d$-dimensional Gaussian density with mean $\mu$ and covariance $\Sigma$, yielding:

$$ p(z^i_{1:T}, x^i_{1:T}) = \sum_{c=1}^{n_c} \pi_{c} \delta_{ \{c=c^i \} } \bigg( \eta_d(z_1^i; m_c, S_c) \prod_{t=2}^T \eta_d(z_t^i; z_{t-1}^iA_c, \Gamma_c) \prod_{t=1}^T \eta_\ell(x_t^i; z_t^iH_c, \Lambda_c) \bigg). $$
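To make the generative reading of this density concrete, here is a small sampling sketch that draws a cluster for each instance and then rolls the linear Gaussian state and measurement models forward. It uses the row-vector convention of the formula (means $z_{t-1}^i A_c$ and $z_t^i H_c$). The function name `sample_mixture`, the stacked parameter layout, and the toy parameter values are assumptions made for illustration, not the package's interface.

```python
import numpy as np


def sample_mixture(n, T, pi, m, S, A, Gamma, H, Lambda, rng):
    """Draw n trajectories of length T from a mixture of linear Gaussian state space models.

    Assumed layout: pi (n_c,), m (n_c, d), S, A, Gamma (n_c, d, d),
    H (n_c, d, ell), Lambda (n_c, ell, ell).
    Returns states (T, n, d), observations (T, n, ell), and cluster labels (n,).
    """
    n_c, d = m.shape
    ell = H.shape[2]
    c = rng.choice(n_c, size=n, p=pi)  # cluster assignment c^i ~ pi
    z = np.empty((T, n, d))
    x = np.empty((T, n, ell))
    for i in range(n):
        k = c[i]
        z[0, i] = rng.multivariate_normal(m[k], S[k])  # initial state
        for t in range(1, T):
            z[t, i] = rng.multivariate_normal(z[t - 1, i] @ A[k], Gamma[k])  # state model
        for t in range(T):
            x[t, i] = rng.multivariate_normal(z[t, i] @ H[k], Lambda[k])  # measurement model
    return z, x, c


# toy parameters: two clusters, d = 2 hidden components, ell = 1 measurement
rng = np.random.default_rng(0)
pi = np.array([0.5, 0.5])
m = np.array([[0.0, 0.0], [5.0, 5.0]])
S = np.stack([np.eye(2), np.eye(2)])
A = np.stack([0.9 * np.eye(2), np.eye(2)])
Gamma = np.stack([0.1 * np.eye(2), 0.1 * np.eye(2)])
H = np.ones((2, 2, 1))
Lambda = np.stack([0.1 * np.eye(1), 0.1 * np.eye(1)])
z, x, c = sample_mixture(6, 5, pi, m, S, A, Gamma, H, Lambda, rng)
```

The returned arrays follow the 3-tensor convention from the data format section: time first, then instance, then component.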

In particular, we assume that the variables we are modeling are continuous and changing over time. When we train a model like the above, we take a dataset $\mathcal{D}$ and an arbitrary set of cluster assignments $c^i$ (as these are also latent, i.e. hidden from us) and iteratively perform M and E steps (from which EM[^3] gets its name):

[E] Expectation step: given the current model, we assign each data instance $(z^i_{1:T}, x^i_{1:T})$ to the cluster to which it is most likely to belong under the current model
[M] Maximization step: given the current cluster assignments, we compute the sample-level cluster assignment probabilities (the $\pi_c$) and optimal cluster-specific parameters

Optimization completes after a fixed (large) number of steps or when no data instances change their cluster assignment at a given iteration.
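The loop described above can be sketched as follows for the simplified case of complete trajectories (no `np.nan` padding), so the model marginalization used for missing data is not shown. This is an illustrative hard-assignment EM under stated assumptions, not the package's implementation: all function names, the `RIDGE` regulariser, the choice to weight assignment scores by $\log \pi_c$, and raising an error when a cluster has fewer than 3 members are choices made for this sketch.

```python
import numpy as np

RIDGE = 1e-6  # assumed small regulariser to keep estimated covariances invertible


def gauss_logpdf(resid, cov):
    """Log density of zero-mean Gaussian residuals (last axis) with covariance cov."""
    k = resid.shape[-1]
    _, logdet = np.linalg.slogdet(cov)
    maha = np.einsum("...i,ij,...j->...", resid, np.linalg.inv(cov), resid)
    return -0.5 * (k * np.log(2 * np.pi) + logdet + maha)


def scatter(resid):
    """Maximum-likelihood covariance of zero-mean residuals, lightly regularised."""
    return resid.T @ resid / len(resid) + RIDGE * np.eye(resid.shape[1])


def m_step(z, x):
    """Fit one cluster from its members; z is (T, n, d) and x is (T, n, ell)."""
    d, ell = z.shape[2], x.shape[2]
    m = z[0].mean(axis=0)
    S = scatter(z[0] - m)
    z_prev, z_next = z[:-1].reshape(-1, d), z[1:].reshape(-1, d)
    A = np.linalg.lstsq(z_prev, z_next, rcond=None)[0]  # z_next ~ z_prev @ A
    Gamma = scatter(z_next - z_prev @ A)
    z_all, x_all = z.reshape(-1, d), x.reshape(-1, ell)
    H = np.linalg.lstsq(z_all, x_all, rcond=None)[0]  # x ~ z @ H
    Lambda = scatter(x_all - z_all @ H)
    return m, S, A, Gamma, H, Lambda


def log_lik(z, x, params):
    """Joint log-likelihood of each trajectory under one cluster; returns shape (n,)."""
    m, S, A, Gamma, H, Lambda = params
    ll = gauss_logpdf(z[0] - m, S)
    ll = ll + gauss_logpdf(z[1:] - z[:-1] @ A, Gamma).sum(axis=0)
    return ll + gauss_logpdf(x - z @ H, Lambda).sum(axis=0)


def hard_em(z, x, n_clusters, n_iter=100, init_labels=None, seed=0):
    """Alternate M and E steps until no instance changes cluster (or n_iter is reached)."""
    rng = np.random.default_rng(seed)
    n = z.shape[1]
    labels = rng.integers(n_clusters, size=n) if init_labels is None else np.array(init_labels)
    for _ in range(n_iter):
        counts = np.bincount(labels, minlength=n_clusters)
        if counts.min() < 3:
            raise RuntimeError("a cluster has fewer than 3 members; stopping")
        # M step: cluster proportions and cluster-specific parameters
        log_pi = np.log(counts / n)
        params = [m_step(z[:, labels == c], x[:, labels == c]) for c in range(n_clusters)]
        # E step: assign each instance to its most likely cluster
        scores = np.stack(
            [log_pi[c] + log_lik(z, x, params[c]) for c in range(n_clusters)], axis=1
        )
        new_labels = scores.argmax(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels, params


# toy data: two well-separated clusters of 20 trajectories each, T = 4, d = 2, ell = 1
rng = np.random.default_rng(0)
T, n_half = 4, 20
truth = np.repeat([0, 1], n_half)
z = np.empty((T, 2 * n_half, 2))
z[0] = rng.normal(size=(2 * n_half, 2)) + 10.0 * truth[:, None]
a = np.where(truth == 0, 0.9, 1.0)[:, None]  # cluster-specific dynamics A_c = a * I
for t in range(1, T):
    z[t] = a * z[t - 1] + rng.normal(scale=0.3, size=z[0].shape)
x = z.sum(axis=2, keepdims=True) + rng.normal(scale=0.3, size=(T, 2 * n_half, 1))
init = truth.copy()
init[[0, -1]] = [1, 0]  # start from the true labels with two deliberately flipped
labels, params = hard_em(z, x, n_clusters=2, init_labels=init)
```

Because the state trajectories are treated as observed during training, each M step reduces to sample means, least-squares fits, and residual covariances within a cluster, which is what keeps this sketch short.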

Adapting the code for your own use

A typical workflow is described at: https://github.com/burkh4rt/Unsupervised-Trajectory-Clustering-Starter

Caveats & Troubleshooting

Some efforts have been made to automatically handle edge cases. For a given training run, if any cluster becomes too small (fewer than 3 members), training terminates. In order to learn a model, we make assumptions about our training data as described above. While our approach seems to be robust to some types of model misspecification, we have encountered training issues with the following problems:

Extreme outliers. An extreme outlier tends to want to form its own cluster (and that's problematic). In many cases this may be due to a typo or failed data-cleaning (i.e. an upstream problem). Generating histograms of each feature is one way to recognise this problem.
Discrete / static features. Including discrete data violates our Gaussian assumptions. If we learn a cluster where each trajectory has the same value for one of the states or observations at a given time step, then we are prone to estimating a singular covariance structure for this cluster which yields numerical instabilities. Adding a small bit of noise to discrete features may remediate numerical instability to some extent.
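As a hedged illustration of the noise-based remedy for discrete or static features, the sketch below adds small Gaussian jitter to an array while leaving `np.nan` padding in place. The helper name `jitter` and the default `scale` are assumptions for this example; an appropriate noise level depends on the units of each feature.

```python
import numpy as np


def jitter(values, scale=1e-2, seed=0):
    """Add small Gaussian noise to break exact ties; NaN padding stays NaN."""
    rng = np.random.default_rng(seed)
    return values + rng.normal(scale=scale, size=np.shape(values))


# a made-up discrete feature (second column) that is constant across instances
values = np.array([[0.0, 1.0], [1.0, 1.0], [np.nan, 1.0]])
noisy = jitter(values)
print(np.std(values[:, 1]), np.std(noisy[:, 1]) > 0)
```

A constant column has zero sample variance, which is what leads to a singular covariance estimate; after jittering, the variance is small but strictly positive.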

Another assumption that is easy to violate is our stationarity assumption for the measurement model.

[^1]: M. Burkhart, L. Lee, D. Vaghari, A. Toh, E. Chong, C. Chen, P. Tiňo, and Z. Kourtzi, Unsupervised multimodal modeling of cognitive and brain health trajectories for early dementia prediction, Sci. Rep. 14 (2024)

[^2]: S. Chiappa and D. Barber, Dirichlet Mixtures of Bayesian Linear Gaussian State-Space Models: a Variational Approach, Tech. rep. 161, Max Planck Institute for Biological Cybernetics, 2007.

[^3]: A. Dempster, N. Laird, and D. Rubin, Maximum Likelihood from Incomplete Data via the EM Algorithm, J. Roy. Stat. Soc. Ser. B (Stat. Methodol.) 39 (1977).

Owner

Name: Michael Burkhart
Login: burkh4rt
Kind: user
Company: University of Cambridge

Website: https://burkh4rt.github.io
Twitter: burkh4rt
Repositories: 4
Profile: https://github.com/burkh4rt

research associate—machine learning for neuroscience

Citation (CITATION.cff)

cff-version: 1.2.0
message: "Please cite the following work when using this software."
license:
  - MIT
preferred-citation:
  type: article
  authors:
    - family-names: "Burkhart"
      given-names: "Michael C."
      orcid: "https://orcid.org/0000-0002-2772-5840"
    - family-names: "Lee"
      given-names: "Liz Y."
    - family-names: "Vaghari"
      given-names: "Delshad"
    - family-names: "Toh"
      given-names: "An Qi"
    - family-names: "Chong"
      given-names: "Eddie"
    - family-names: "Chen"
      given-names: "Christopher"
    - family-names: "Tiňo"
      given-names: "Peter"
    - family-names: "Kourtzi"
      given-names: "Zoe"
  doi: "10.1038/s41598-024-60914-w"
  journal: "Scientific Reports"
  title:
    "Unsupervised multimodal modeling of cognitive and brain health
    trajectories for early dementia prediction"
  volume: 14
  year: 2024

CodeMeta (codemeta.json)

{
  "@context": "https://doi.org/10.5063/schema/codemeta-2.0",
  "@type": "SoftwareSourceCode",
  "license": "https://spdx.org/licenses/MIT",
  "codeRepository": "https://github.com/burkh4rt/Unsupervised-Multimodal-Trajectory-Modeling.git",
  "dateCreated": "2023-09-21",
  "datePublished": "2023-09-21",
  "dateModified": "2024-05-21",
  "name": "Unsupervised Multimodal Trajectory Modeling",
  "version": "2024.2.1",
  "description": "Trains mixtures of state space models with expectation maximization",
  "developmentStatus": "concept",
  "funder": {
    "@type": "Organization",
    "name": "University of Cambridge"
  },
  "keywords": [
    "unsupervised clustering",
    "trajectory clustering",
    "mixture models",
    "state space models",
    "machine learning"
  ],
  "programmingLanguage": [
    "Python 3"
  ],
  "author": [
    {
      "@type": "Person",
      "@id": "https://orcid.org/0000-0002-2772-5840",
      "givenName": "Michael C.",
      "familyName": "Burkhart",
      "email": "mcb93@cam.ac.uk",
      "affiliation": {
        "@type": "Organization",
        "name": "University of Cambridge"
      }
    }
  ]
}

Packages

Total packages: 1
Total downloads:
- pypi 22 last-month

Total dependent packages: 0
Total dependent repositories: 0
Total versions: 6
Total maintainers: 1

pypi.org: unsupervised-multimodal-trajectory-modeling

trains mixtures of state space models with expectation maximization

Homepage: https://pypi.org/project/unsupervised-multimodal-trajectory-modeling/
Documentation: https://unsupervised-multimodal-trajectory-modeling.readthedocs.io/
License: MIT License Copyright (c) 2024 Michael C. Burkhart, et al. Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
Latest release: 2024.2.2
published almost 2 years ago

Rankings

Dependent packages count: 7.4%

Downloads: 21.3%

Forks count: 22.8%

Average: 30.5%

Stargazers count: 32.0%

Dependent repos count: 68.9%

Maintainers (1)

burkh4rt
