pyPLNmodels
pyPLNmodels: A Python package to analyze multivariate high-dimensional count data - Published in JOSS (2024)
Science Score: 49.0%
This score indicates how likely this project is to be science-related based on various indicators:
-
○CITATION.cff file
-
✓codemeta.json file
Found codemeta.json file -
✓.zenodo.json file
Found .zenodo.json file -
✓DOI references
Found 5 DOI reference(s) in README -
✓Academic publication links
Links to: arxiv.org, frontiersin.org, joss.theoj.org -
○Committers with academic emails
-
○Institutional organization owner
-
○JOSS paper metadata
-
○Scientific vocabulary similarity
Low similarity (15.5%) to scientific vocabulary
Scientific Fields
Repository
Blazing fast inference of PLN models built on top of Python and PyTorch
Basic Info
- Host: GitHub
- Owner: PLN-team
- License: mit
- Language: Python
- Default Branch: main
- Homepage: https://pln-team.github.io/pyPLNmodels/
- Size: 97.4 MB
Statistics
- Stars: 9
- Watchers: 3
- Forks: 1
- Open Issues: 2
- Releases: 7
Metadata Files
README.md
PLNmodels: Poisson lognormal models
The Poisson lognormal model and its variants are used for the analysis of multivariate count data. This package implements efficient algorithms to extract meaningful insights from complex and difficult-to-interpret multivariate count data. It is designed to scale on large datasets, although it has memory limitations. Possible fields of application include:
- Genomics/transcriptomics (e.g., the number of times a gene is expressed in a cell)
- Ecology (e.g., species abundances)
- Metagenomics (e.g., the number of Amplicon Sequence Variants detected in a sample)
- Genetics (e.g., the number of crossovers (DNA exchanges) along a chromosome)
One of the main functionalities of this package is to normalize count data to obtain more valuable insights. It also analyzes the significance of each variable, their correlations, and the weight of covariates (if available).
Documentation and tutorials
The package documentation can be found here. In-depth tutorials are available here.
Getting started
A notebook to get started can be found here. A more in-depth tutorial is available here.
If you need just a quick view of the package, see the quickstart next. Note
that an R version of the package is available here.
🛠 Installation
pyPLNmodels is available on pypi.
Package installation
sh
pip install pyPLNmodels
The package depends on resource-intensive libraries like torch, so it may
require significant storage space.
Statistical description
For those unfamiliar with Poisson or Gaussian random variables, it's not necessary to delve into these statistical concepts. The key takeaway is that this package analyzes multi-dimensional count data, extracting significant information such as the mean, relationships with covariates, and correlations between count variables, in a manner appropriate for count data.
Consider $\mathbf Y$ a count matrix (denoted as endog in the package) consisting of $n$ rows and $p$ columns.
It is assumed that each individual $\mathbf Y_i$, that is the $i^{\text{th}}$
row of $\mathbf Y$, is independent from the others and follows a Poisson
lognormal distribution:
$$\mathbf Y{i}\sim \mathcal P(\exp(\mathbf Z{i})), \quad \mathbf Zi \sim \mathcal N(\mathbf oi + \mathbf B ^{\top} \mathbf x_i, \mathbf \Sigma), (\text{PLN-equation})$$
where $\mathbf xi \in \mathbb R^d$ (exog) and $\mathbf oi \in \mathbb R^p$ (offsets) are
user-specified covariates and offsets. The matrix $\mathbf B$ is a $d\times p$
matrix of regression coefficients and $\mathbf \Sigma$ is a $p\times p$
covariance matrix. The goal is to estimate the parameters $\mathbf B$ and
$\mathbf \Sigma$, denoted as coef and covariance in the package,
respectively.
The PLN model described in the PLN-equation is the building block of many different statistical tasks adequate for count data, by modifying the $Z_i$ latent variables. The package implements:
- Covariance analysis (
Pln) - Dimension reduction (
PlnPCAandPlnPCACollection) - Zero-inflation (
ZIPln) - Autoregressive models (
PlnAR) - Supervised clustering (
PlnLDA) - Unsupervised clustering (
PlnMixture) - Network inference (
PlnNetwork) - Zero-inflation and dimension reduction (
ZIPlnPCA) - Variance estimation (
PlnDiag)
A normalization procedure adequate to count data can be applied
by extracting the latent_variables $\mathbf Z_i$ once the parameters are learned.
⚡️ Quickstart
The package comes with a single-cell RNA sequencing dataset to present the functionalities:
python
from pyPLNmodels import load_scrna
data = load_scrna()
This dataset contains the number of occurrences of each gene in each cell in
data["endog"]. Each cell is labelled by its cell-type in data["labels"].
How to specify a model
Each model can be specified in two distinct manners:
- by formula (similar to R), where a data frame is passed and the formula is specified using the
from_formulainitialization:python from pyPLNmodels import Pln pln = Pln.from_formula("endog ~ 1 + labels ", data = data)
We rely on the patsy package for the formula parsing.
- by specifying the
endog,exog, andoffsetsmatrices directly:python import numpy as np endog = data["endog"] exog = data["labels"] offsets = np.zeros((endog.shape)) pln = Pln(endog=endog, exog=exog, offsets=offsets)
The parameters exog and offsets are optional. By default,
exog is set to represent an intercept, which is a vector of ones. Similarly,
offsets defaults to a matrix of zeros. The offsets should be on the scale of the log of the counts.
Motivation
The count data is often very noisy, and inferring the latent variables $Zi$ may reduce noise and increase signal. Suppose we try to infer the cell type of each cell, using Linear Discriminant Analysis (LDA): ```python from sklearn.modelselection import traintestsplit from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
def getclassiferror(data, y): datatrain, datatest, ytrain, ytest = traintestsplit(data, y, testsize=0.33, randomstate=42) lda = LDA() lda.fit(datatrain, ytrain) ypred = lda.predict(datatest) return np.mean(ypred != ytest) ```
Here is the classification error of the raw counts:
python
data = load_scrna(n_samples=1000)
get_classif_error(data["endog"], data["labels"])
Output:
0.31
And here is the classification error of the latent variables $Zi$:
```python
getclassiferror(Pln(data["endog"]).fit().latentvariables, data["labels"])
Output:
0.17
```
Covariance analysis with the Poisson lognormal model (aka Pln)
This is the building-block of the models implemented in this package. It fits a Poisson lognormal model to the data:
python
pln = Pln.from_formula("endog ~ 1 + labels ", data = data)
pln.fit()
print(pln)
transformed_data = pln.transform()
pln.show()
Dimension reduction with the PLN Principal Component Analysis (aka PlnPCA and PlnPCACollection)
This model excels in dimension reduction and is capable of scaling to
high-dimensional count data ($p >> 1$), by constraining the covariance matrix
$\Sigma$ to be of low rank (the larger the rank, the slower the model but the
better the approximation). The user may specify the rank when creating the
PlnPCA object:
python
from pyPLNmodels import PlnPCA
pca = PlnPCA.from_formula("endog ~ 1 + labels ", data = data, rank = 3).fit()
Multiple ranks can be simultaneously tested
within a single object (PlnPCACollection), and select the optimal model.
python
from pyPLNmodels import PlnPCACollection
pca_col = PlnPCACollection.from_formula("endog ~ 1 + labels ", data = data, ranks = [3,4,5])
pca_col.fit()
print(pca_col)
pca_col.show()
best_pca = pca_col.best_model()
print(best_pca)
Zero inflation with the Zero-Inflated PLN Model (aka ZIPln and ZIPlnPCA)
The ZIPln model, a variant of the PLN model, is designed to handle zero
inflation in the data. It is defined as follows:
$$Y{ij}\sim \mathcal W{ij} \times P(\exp(Z{ij})), \quad \mathbf Zi \sim \mathcal N(\mathbf oi + \mathbf B ^{\top} \mathbf xi, \mathbf \Sigma), \quad W{ij} \sim \mathcal B(\sigma( \mathbf xi^{0^{\top}}\mathbf B^0_j))$$
This model is particularly beneficial when the data contains a significant
number of zeros. It incorporates additional covariates for the zero inflation
coefficient, which are specified following the pipe | symbol in the formula
or via the exog_inflation keyword. If not specified, it is set to the
covariates for the Poisson part.
python
from pyPLNmodels import ZIPln
zi = ZIPln.from_formula("endog ~ 1 | 1 + labels", data = data).fit()
print(zi)
print("Transformed data shape: ", zi.transform().shape)
z_latent_variables = zi.transform()
w_latent_variables = zi.latent_prob
print(r'$Z$ latent variables shape', z_latent_variables.shape)
print(r'$W$ latent variables shape', w_latent_variables.shape)
Similar to the PlnPCA model, the ZIPlnPCA model is capable of dimension reduction.
Network inference with the PlnNetwork model
The PlnNetwork model is designed to infer the network structure of the data.
It creates a network where the nodes are the count variables and the edges
represent the correlation between them. The sparsity of the network is ensured
via the penalty keyword. The larger the penalty, the sparser the network.
python
from pyPLNmodels import PlnNetwork
net = PlnNetwork.from_formula("endog ~ 1 + labels ", data = data, penalty = 200).fit()
net.viz_network()
print(net.network)
Supervised clustering with the PlnLDA model
One can do supervised clustering using Linear Discriminant Analysis
designed for count data.
python
from pyPLNmodels import PlnLDA, plot_confusion_matrix
endog_train, endog_test = data["endog"][:500], data["endog"][500:]
labels_train, labels_test = data["labels"][:500], data["labels"][500:]
lda = PlnLDA(endog_train, clusters=labels_train).fit()
pred_test = lda.predict_clusters(endog_test)
plot_confusion_matrix(pred_test, labels_test)
Unsupervised clustering with the PlnMixture model
python
from pyPLNmodels import PlnMixture
mixture = PlnMixture.from_formula("endog ~ 0 ", data = data, n_cluster=3).fit()
mixture.show()
clusters = mixture.clusters
plot_confusion_matrix(clusters, data["labels"])
Autoregressive models with the PlnAR model
The PlnAR model is designed to handle time series data. It is a simple (one step) autoregressive model that can be used to predict the next time point.
(This assumes the endog variable is a time series, which is not the case in the example below)
python
from pyPLNmodels import PlnAR
ar = PlnAR.from_formula("endog ~ 1 + labels ", data = data).fit()
ar.show()
Visualization
The package is equipped with a set of visualization functions designed to help
the user interpret the data. The viz function conducts PCA
on the latent variables. The remove_exog_effect keyword
removes the covariates' effect specified in the model when set to True.
Much more functionalities, depending on the model, are available. One can see the full list of available functions in the documentation and by printing the model:
python
print(pln)
print(pca)
print(pca_col)
print(zi)
print(net)
print(lda)
print(mixture)
print(ar)
👐 Contributing
Feel free to contribute, but read the CONTRIBUTING.md first. A public roadmap will be available soon.
⚡️ Citations
Please cite our work using the following references:
B. Batardiere, J.Kwon, J.Chiquet: pyPLNmodels: A Python package to analyze multivariate high-dimensional count data. pdf
J. Chiquet, M. Mariadassou and S. Robin: Variational inference for probabilistic Poisson PCA, the Annals of Applied Statistics, 12: 2674–2698, 2018. pdf
B. Batardiere, J.Chiquet, M.Mariadassou: Zero-inflation in the Multivariate Poisson Lognormal Family. pdf
B. Batardiere, J.Chiquet, M.Mariadassou: Evaluating Parameter Uncertainty in the Poisson Lognormal Model with Corrected Variational Estimators. pdf
J. Chiquet, M. Mariadassou, S. Robin: The Poisson-Lognormal Model as a Versatile Framework for the Joint Analysis of Species Abundances. pdf
J. Chiquet, S. Robin, M. Mariadassou: Variational Inference for sparse network reconstruction from count data pdf
Owner
- Name: PLN team
- Login: PLN-team
- Kind: organization
- Location: Paris area, France
- Repositories: 2
- Profile: https://github.com/PLN-team
Regroup and store repositories around our work on the Poisson lognormal model (Code, packages, new development, material for formation)
GitHub Events
Total
- Create event: 15
- Issues event: 4
- Release event: 6
- Watch event: 4
- Delete event: 16
- Issue comment event: 7
- Push event: 348
- Pull request event: 7
Last Year
- Create event: 15
- Issues event: 4
- Release event: 6
- Watch event: 4
- Delete event: 16
- Issue comment event: 7
- Push event: 348
- Pull request event: 7
Committers
Last synced: 5 months ago
Top Committers
| Name | Commits | |
|---|---|---|
| bastien-mva | b****e@g****m | 1,585 |
| Julien Chiquet | j****t@i****r | 11 |
| Jean-Benoist Leger | j****r@h****r | 9 |
| Joon Kwon | j****n@i****r | 1 |
Committer Domains (Top 20 + Academic)
Issues and Pull Requests
Last synced: 4 months ago
All Time
- Total issues: 6
- Total pull requests: 6
- Average time to close issues: 6 months
- Average time to close pull requests: 3 months
- Total issue authors: 5
- Total pull request authors: 2
- Average comments per issue: 1.5
- Average comments per pull request: 0.5
- Merged pull requests: 5
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 3
- Pull requests: 5
- Average time to close issues: 16 days
- Average time to close pull requests: about 11 hours
- Issue authors: 2
- Pull request authors: 1
- Average comments per issue: 1.0
- Average comments per pull request: 0.6
- Merged pull requests: 5
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- Bastien-mva (2)
- thouis (1)
- oeco28 (1)
- canhochoi (1)
- mr2raccoon (1)
Pull Request Authors
- Bastien-mva (10)
- FlyingFish800 (2)
Top Labels
Issue Labels
Pull Request Labels
Packages
- Total packages: 1
-
Total downloads:
- pypi 83 last-month
- Total dependent packages: 0
- Total dependent repositories: 1
- Total versions: 71
- Total maintainers: 1
pypi.org: pyplnmodels
Package implementing PLN models
- Documentation: https://pln-team.github.io/pyPLNmodels/
- License: MIT License
-
Latest release: 1.0.8
published 8 months ago
Rankings
Maintainers (1)
Dependencies
- matplotlib *
- mlxtend *
- numpy *
- pandas *
- patsy *
- plotly *
- scikit-learn *
- scipy *
- seaborn *
- torch *