https://github.com/google-research/large_scale_mmdma

Last synced: 6 months ago

Repository

Basic Info

Host: GitHub
Owner: google-research
License: apache-2.0
Language: Python
Default Branch: master
Size: 2.18 MB

Statistics

Stars: 10
Watchers: 4
Forks: 2
Open Issues: 1
Releases: 19

Created over 4 years ago · Last pushed 11 months ago

Metadata Files

Readme Contributing License

Large-Scale MMD-MA

The objective of MMD-MA is to match points coming from two different spaces in a lower dimensional space. To this end, two sets of points are projected, from two different spaces endowed with a positive definite kernel, to a shared Euclidean space of lower dimension low_dim. The mappings from high to low dimensional space are obtained using functions belonging to the respective RKHS. To obtain the mappings, we minimise a loss function that is composed of three terms:

- an MMD term between the low dimensional representations of the two views, which encourages them to have the same distribution. The RBF kernel is used to compute MMD.
- two non-collapsing penalty terms (corresponding to the pen_dual or pen_primal functions), one for each view. These terms ensure that the low dimensional representations are mutually orthogonal, preventing collapsing.
- two distortion penalties (corresponding to the dis_dual or dis_primal functions), one for each view. These terms encourage the low dimensional representation to obtain the same pairwise structure as the original views.
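
To make the MMD term concrete, the sketch below computes the squared MMD between two small sets of low dimensional points with an RBF kernel, in both its biased and unbiased forms. It is plain Python written for readability, not the package's implementation. The function names and the kernel parametrisation exp(-||x - y||^2 / (2 * scale^2)) are assumptions made for illustration.

```python
import math


def rbf(x, y, scale=1.0):
    """RBF kernel. The 1 / (2 * scale**2) convention is an assumption."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * scale ** 2))


def mmd2(xs, ys, scale=1.0, unbiased=True):
    """Squared MMD between two lists of points of equal dimension."""
    n, m = len(xs), len(ys)
    if unbiased and (n < 2 or m < 2):
        raise ValueError("the unbiased estimate needs at least 2 points per view")
    # Cross term: average kernel value over all pairs (x, y).
    k_xy = sum(rbf(x, y, scale) for x in xs for y in ys) / (n * m)
    if unbiased:
        # Within-view terms leave out the diagonal pairs (i == j).
        k_xx = sum(rbf(xs[i], xs[j], scale)
                   for i in range(n) for j in range(n) if i != j) / (n * (n - 1))
        k_yy = sum(rbf(ys[i], ys[j], scale)
                   for i in range(m) for j in range(m) if i != j) / (m * (m - 1))
    else:
        # Biased estimate: plain averages over all pairs, diagonal included.
        k_xx = sum(rbf(a, b, scale) for a in xs for b in xs) / (n * n)
        k_yy = sum(rbf(a, b, scale) for a in ys for b in ys) / (m * m)
    return k_xx + k_yy - 2.0 * k_xy
```

With the biased estimate, two identical point sets give a value of zero, while two well-separated sets give a clearly positive value.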

MMD-MA can be formulated using either the primal problem (when we use the linear kernel in the input spaces) or the dual problem. Each has advantages and disadvantages depending on the input data. For each view, when the number of features p is much larger than the number of samples n (p >> n), the dual formulation is beneficial in terms of runtime and memory, while if n >> p, the primal formulation is favorable.
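
One way to read this trade-off is through the number of parameters learned per view. The sketch below assumes, as is usual for kernel methods, that the primal formulation learns a p x low_dim projection matrix and the dual formulation learns n x low_dim coefficients. Both helper names are hypothetical, and the simple p > n rule is only a rough reading of the guidance above, which covers the extreme cases.

```python
def parameter_count(n, p, low_dim, formulation):
    """Number of learned parameters for one view (assumed shapes)."""
    if formulation == "primal":
        return p * low_dim  # one weight per feature and latent dimension
    if formulation == "dual":
        return n * low_dim  # one coefficient per sample and latent dimension
    raise ValueError("formulation must be 'primal' or 'dual'")


def suggest_formulation(n, p):
    """Rough heuristic: prefer the formulation with fewer parameters."""
    return "dual" if p > n else "primal"


# Example with the default simulation sizes (n=300, p=1000).
choice = suggest_formulation(300, 1000)
size = parameter_count(300, 1000, 5, choice)
```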

Additionally, in order to scale the computation of MMD to a large number of samples, we propose to use the KeOps library, which lets you compute large kernel operations on GPUs without memory overflow.
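
The memory issue comes from the n x n kernel matrix, whose size grows quadratically with the number of samples. The sketch below estimates that size and shows, in plain Python, how a kernel sum can be accumulated pair by pair without ever storing the matrix. This only illustrates the idea of a reduction that avoids materialising the matrix. It does not call KeOps, and the function names and kernel parametrisation are assumptions.

```python
import math


def dense_kernel_bytes(n, bytes_per_value=4):
    """Memory needed to store a dense n x n kernel matrix (float32 by default)."""
    return n * n * bytes_per_value


def streamed_kernel_sum(xs, ys, scale=1.0):
    """Sum of RBF kernel values over all pairs, without building the matrix."""
    total = 0.0
    for x in xs:
        for y in ys:
            sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
            total += math.exp(-sq_dist / (2.0 * scale ** 2))
    return total


# 100,000 samples would need 40 GB for a single dense float32 kernel matrix.
needed = dense_kernel_bytes(100_000)
```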

Installation

To install the latest release of lsmmdma, use the following command:

    $ pip install lsmmdma

To install the development version, use the following command instead:

    $ pip install git+https://github.com/google-research/large_scale_mmdma

Alternatively, it can be installed from sources with the following command:

    $ python setup.py install

In Google Colab, use the following command:

    !pip install lsmmdma

The KeOps library may need to be installed separately in advance, following its own installation instructions. Typically, in Google Colab one would run this command before installing lsmmdma:

    !pip install pykeops
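
Before running with KeOps enabled, it can help to check that both packages are importable in the current environment. The sketch below uses only the standard library. The helper name is hypothetical and the check is not part of lsmmdma.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be found in this environment."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    for name in ("lsmmdma", "pykeops"):
        print(name, "installed:", is_installed(name))
```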

Command line instructions

The algorithm can be run with a command line using:

    python3 -m lsmmdma.main [flags]

A set of flags is available to determine the IO, the model, the hyperparameters and the seed.

Input/Output

The input can either be a path and filenames pointing to user data, or data generated with the data_pipeline.generate_data function. To generate simulation data, the input flags are:

- --data: str, either 'branch', 'triangle' or '' (default). The simulated data is described in the pydoc of data_pipeline.generate_data. '' means that simulated data is not used.
- --n: int (default 300), number of samples for both views.
- --p: int (default 1000), number of features for both views.

For data given by the user, the input flags are:

- --input_dir: str (default None), input directory.
- --input_fv: str (default None), filename of the array (n1 x p1 or n1 x n1) that serves as first set of points.
- --input_sv: str (default None), filename of the array (n2 x p2 or n2 x n2) that serves as second set of points.
- --rd_vec: str (default None), filename of the vector that contains the indices of the samples from the first view that match (ground truth) the samples from the second view. This is only used at evaluation time. If --rd_vec is not used, we assume that the samples of both views follow the same ordering.
- --labels_fv: str (default None), filename of the vector that contains the labels of the samples from the first view. Must match the order of the samples in input_fv.
- --labels_sv: str (default None), filename of the vector that contains the labels of the samples from the second view, following the same ordering.

In both cases, two other flags are also available:

- --kernel: bool (default False), whether the input data given by the user is a kernel (n x n instead of n x p). This parameter can only be set to True when --m is 'dual'.
- --output_dir: str (default None), output directory.
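
As an illustration of the ground-truth vector passed with --rd_vec, the sketch below assumes one possible reading: entry j holds the index of the first-view sample that matches sample j of the second view. This interpretation and the helper name are assumptions, so the pydoc of the package should be checked before relying on it.

```python
def align_first_view(first_view, rd_vec):
    """Reorder first-view samples so that row j matches second-view sample j.

    Assumes rd_vec[j] is the first-view index matching second-view sample j.
    """
    for index in rd_vec:
        if not 0 <= index < len(first_view):
            raise ValueError(f"index {index} is out of range for the first view")
    return [first_view[index] for index in rd_vec]


# Toy example: three first-view samples, shuffled relative to the second view.
aligned = align_first_view(["a", "b", "c"], [2, 0, 1])
```

If --rd_vec is omitted, the identity ordering [0, 1, 2, ...] plays the same role, since both views are then assumed to follow the same ordering.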

Model

The flags allow you to choose between four algorithms, using either the 'primal' or 'dual' formulation, with or without KeOps.

- --m: str, either 'primal' or 'dual' (default).
- --keops: integer, either 1 (use keops), 0 (do not use keops) or -1 (automatic, uses keops from 4000 samples onwards) (default).
- --use_unbiased_mmd: bool (default True), determines whether or not to use the unbiased version of MMD (see Gretton et al. 2012, Lemma 6).
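
The three values of --keops can be summarised as a small decision rule. The sketch below is an illustration of the behaviour described above, not the package's code. The function name is hypothetical, and how the sample count is derived when the two views differ in size is not specified here.

```python
def use_keops(flag, n_samples, auto_threshold=4000):
    """Resolve the --keops flag (1, 0 or -1) into a yes/no decision."""
    if flag == 1:
        return True
    if flag == 0:
        return False
    if flag == -1:
        # Automatic mode: switch KeOps on from 4000 samples onwards.
        return n_samples >= auto_threshold
    raise ValueError("--keops must be 1, 0 or -1")
```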

Seeds

The seed for the training phase, and for generating the data when --data is not '', is fixed with the flag --seed (int, default value is 0). To use multiple starts when training (several seeds, with the best one selected based on the value of the loss), the number of seeds to try can be set with --ns (int, default value is 1).
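
The multiple-start strategy can be sketched as follows: train once per seed and keep the run with the lowest final loss. The training function is a hypothetical stand-in, and the use of consecutive seeds starting at --seed is an assumption, since the way the additional seeds are derived is not stated here.

```python
import random


def toy_train(seed):
    """Hypothetical stand-in for one training run; returns a final loss."""
    return random.Random(seed).random()


def best_of_seeds(train_fn, seed=0, ns=1):
    """Run train_fn for ns consecutive seeds and keep the lowest loss."""
    results = [(train_fn(s), s) for s in range(seed, seed + ns)]
    best_loss, best_seed = min(results)
    return best_seed, best_loss


# Mirrors '--seed 4 --ns 5': seeds 4 to 8 are tried under this assumption.
chosen_seed, chosen_loss = best_of_seeds(toy_train, seed=4, ns=5)
```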

Model hyperparameters

Several hyperparameters need to be fixed in advance:

- --d: int (default 5), dimension of the latent space.
- --init: str (default 'uniform,0.,0.1'), initialisation for the learned parameters. They can be sampled from a 'uniform', 'normal', 'xavier_uniform' or 'xavier_normal' distribution. The parameters of the initialisation function are passed to the same flag, separated by commas. See initializers.py and train.py.
- --l1: float (default 1e-4), hyperparameter in front of both penalty terms. Note that the penalty terms are automatically scaled by 1/sqrt(p).
- --l2: float (default 1e-4), hyperparameter in front of both distortion terms. Note that the distortion terms are automatically scaled by 1/(n*sqrt(p)).
- --lr: float (default 1e-5), learning rate.
- --s: float (default 1.0), scale parameter of the RBF kernel in MMD.
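
Two of these flags involve a little arithmetic or parsing, sketched below. The first helper splits an --init value into a distribution name and its numeric parameters. The second shows the effective weights obtained if the automatic scaling simply multiplies --l1 and --l2. Both helper names are hypothetical and the multiplication is an assumption about how the scaling is applied.

```python
import math


def parse_init(value):
    """Split e.g. 'uniform,0.,0.1' into ('uniform', [0.0, 0.1])."""
    name, *params = value.split(",")
    return name.strip(), [float(param) for param in params]


def scaled_weights(l1, l2, n, p):
    """Effective weights after the automatic scaling described above."""
    penalty_weight = l1 / math.sqrt(p)            # scaled by 1/sqrt(p)
    distortion_weight = l2 / (n * math.sqrt(p))   # scaled by 1/(n*sqrt(p))
    return penalty_weight, distortion_weight


name, params = parse_init("uniform,0.,0.1")
weights = scaled_weights(1e-4, 1e-4, n=300, p=400)
```

For n = 300 and p = 400, sqrt(p) is 20, so the penalty weight becomes 1e-4 / 20 and the distortion weight becomes 1e-4 / 6000.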

Training and evaluation

Several flags structure the training loop:

- --e: int (default 5001), number of epochs for the training process.
- --ne: int (default 100), regular interval at which the evaluation (call to metrics.Evaluation) is done, every 'ne' epochs. 0 means that the results are never evaluated.
- --nr: int (default 100), regular interval at which the components of the loss are recorded, every 'nr' epochs. 0 means that they are never recorded. The loss is recorded every epoch nonetheless.
- --pca: int (default 100), regular interval at which PCA is performed on the embeddings, every 'pca' epochs. 0 means that PCA is not used on the output.
- --short_eval: bool (default True), whether to compute all the metrics (False) or only a subset of them (True) (see metrics.py).
- --nn: int (default 5), number of neighbours taken into account in the computation of the Label Transfer Accuracy metrics.
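
The interval flags can be pictured with a small scheduling helper. The sketch below assumes that epochs are counted from 0 and that an action is triggered whenever the epoch is a multiple of the interval. Neither detail is stated above and the helper names are hypothetical.

```python
def is_due(epoch, interval):
    """True if a periodic action should run at this epoch (0 disables it)."""
    return interval > 0 and epoch % interval == 0


def due_epochs(n_epochs, interval):
    """All epochs in [0, n_epochs) at which the action would run."""
    return [epoch for epoch in range(n_epochs) if is_due(epoch, interval)]


# With the defaults --e 5001 and --ne 100.
evaluation_epochs = due_epochs(5001, 100)
```

Under these assumptions, the defaults yield 51 evaluation points, the last one at epoch 5000, which is the final epoch of a 5001-epoch run.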

Stopping criterion

Two flags control the stopping criterion:

- --ws: int (default 0), window size on which the loss is averaged for the stopping criterion. If set to 0, the algorithm stops at the last epoch without a loss-based stopping criterion.
- --threshold: float (default 1e-3), threshold for the stopping criterion.
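
The exact rule is not spelled out here, so the sketch below shows only one plausible window-based criterion: stop when the average loss over the last ws epochs differs from the average over the previous ws epochs by less than the threshold. The comparison of two consecutive windows and the use of an absolute difference are assumptions, and the function name is hypothetical.

```python
def should_stop(losses, window_size, threshold):
    """One possible windowed stopping rule (assumed, not the package's)."""
    if window_size <= 0 or len(losses) < 2 * window_size:
        # No loss-based stopping, or not enough history yet.
        return False
    recent = losses[-window_size:]
    previous = losses[-2 * window_size:-window_size]
    mean_recent = sum(recent) / window_size
    mean_previous = sum(previous) / window_size
    return abs(mean_previous - mean_recent) < threshold
```

A flat loss history triggers the stop, while a steadily decreasing one does not.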

Timing

Timing the method is possible when the flag --time is set to True (default False).

Examples

We now show a few examples of how to run the algorithm from the command line. We also introduce two notebooks that display the usage of the algorithm and its runtime.

To run the algorithm on simulated data from data_pipeline.py, a minimal command is:

    python3 -m lsmmdma.main --output_dir outdir \
      --data branch --n 300 --p 400 \
      --e 1001 --d 5 --keops False --m dual

To run the algorithm on simulated data from data_pipeline.py, one can also choose when to record the intermediate results, set the hyperparameters and use 5 starts:

    python3 -m lsmmdma.main --output_dir outdir \
      --data branch --n 300 --p 400 \
      --seed 4 --ns 5 \
      --keops False --m dual \
      --e 1001 --nr 100 --ne 100 --pca 100 \
      --d 5 --lr 1e-5 --l1 1e-4 --l2 1e-4 --s 1.0 --init 'uniform,0,0.1'

To run the algorithm on user input data in the form n_sample x p_feature, --data should be '' (default) and --kernel should be False (default). The argument --keops can be True or False, and --m can be 'dual' or 'primal'.

    python3 -m lsmmdma.main --input_dir datadir --output_dir outdir \
      --input_fv my_data_1 --input_sv my_data_2 --kernel False \
      --seed 4 --ns 5 \
      --keops True --m dual \
      --e 1001 --nr 100 --ne 100 --pca 100 \
      --d 5 --lr 1e-5 --l1 1e-4 --l2 1e-4 --s 1.0 --init 'uniform,0,0.1'

To run the algorithm on user kernel data in the form n_sample x n_sample, --data should be '' (default) and --kernel should be True. The argument --keops can be True or False, but --m can only be 'dual'.

    python3 -m lsmmdma.main --input_dir datadir --output_dir outdir \
      --input_fv my_data_1 --input_sv my_data_2 --kernel True \
      --seed 4 --ns 5 \
      --keops True --m dual \
      --e 1001 --nr 100 --ne 100 --pca 100 \
      --d 5 --lr 1e-5 --l1 1e-4 --l2 1e-4 --s 1.0 --init 'uniform,0,0.1'
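
When many runs are launched from a script, the command line can be assembled programmatically. The sketch below builds the argument list from a dictionary using only the standard library. The helper name is hypothetical, only flags documented above are used, and the --name=value form is a choice made for this sketch rather than something the package requires.

```python
import shlex
import sys


def build_command(flags, python=sys.executable):
    """Build the argument list for 'python -m lsmmdma.main' from a dict."""
    command = [python, "-m", "lsmmdma.main"]
    for name, value in flags.items():
        command.append(f"--{name}={value}")
    return command


command = build_command(
    {"output_dir": "outdir", "data": "branch", "n": 300, "p": 400,
     "e": 1001, "d": 5, "m": "dual"},
    python="python3",
)
# shlex.join gives a copy-pasteable shell string for logging.
printable = shlex.join(command)
```

The resulting list can be passed to subprocess.run(command, check=True) once the package is installed.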

A tutorial notebook and a benchmark notebook are also available in examples/.

Input

When the user provides the input data, the supported formats are AnnData objects (.h5ad), numpy arrays (.npy), tab-separated arrays (.tsv), comma-separated arrays (.csv) and white-space separated arrays.
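
For the three text formats, a reader can pick the delimiter from the file extension, as in the standard-library sketch below. This is not the package's loader. The function name is hypothetical, files are assumed to contain numbers only (no header row), and the .h5ad and .npy formats are left out because they need the anndata and numpy packages.

```python
import csv
import os


def read_text_matrix(path):
    """Read a .tsv, .csv or white-space separated file into a list of rows."""
    extension = os.path.splitext(path)[1].lower()
    with open(path, newline="") as handle:
        if extension == ".tsv":
            rows = csv.reader(handle, delimiter="\t")
        elif extension == ".csv":
            rows = csv.reader(handle, delimiter=",")
        else:
            # Fallback: split each line on any run of white space.
            rows = (line.split() for line in handle)
        # Skip empty lines and convert every entry to float.
        return [[float(value) for value in row] for row in rows if row]
```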

Output

When one uses main.py, several files are saved in the output directory FLAGS.output_dir:

- [key:val].tsv: results of the model at the last epoch.
- [key:val]_tracking.json: loss and its components during training, evaluation metrics during training, seed, number of epochs.
- [key:val]_model.json: model and optimiser state dictionaries, loss, number of epochs, seed.
- [key:val]_pca_X.npy: 2D representation obtained with PCA on the embeddings during training (for the first and second views).
- [key:val]_embeddings_X.npy: embeddings during training (for the first and second views).
- generated_data_X: first view, second view and rd_vec generated by data_pipeline.generate_data.

[key:val] represents a list of key:value as determined by the flags and written in main.py.

Citation

If you have found our work useful, please consider citing us:

@article{meng2022lsmmd,
  title={LSMMD-MA: Scaling multimodal data integration for single-cell genomics data analysis},
  author={Meng-Papaxanthos, Laetitia and Zhang, Ran and Li, Gang and Cuturi, Marco and Noble, William Stafford and Vert, Jean-Philippe},
  journal={bioRxiv},
  year={2022},
  publisher={Cold Spring Harbor Laboratory}
}

Contact

In case you have questions, reach out to lpapaxanthos@google.com.

Disclaimer

This is not an officially supported Google product.

Owner

Name: Google Research
Login: google-research
Kind: organization
Location: Earth

Website: https://research.google
Repositories: 226
Profile: https://github.com/google-research

GitHub Events

Total

Delete event: 1
Issue comment event: 1
Pull request event: 2
Create event: 1

Last Year

Delete event: 1
Issue comment event: 1
Pull request event: 2
Create event: 1

Committers

Last synced: 7 months ago

All Time

Total Commits: 69
Total Committers: 1
Avg Commits per committer: 69.0
Development Distribution Score (DDS): 0.0

Past Year

Commits: 0
Committers: 0
Avg Commits per committer: 0.0
Development Distribution Score (DDS): 0.0

Top Committers

Name	Email	Commits
Laetitia Meng-Papaxanthos	l**s@g**m	69

Committer Domains (Top 20 + Academic)

google.com: 1

Issues and Pull Requests

Last synced: 10 months ago

All Time

Total issues: 3
Total pull requests: 2
Average time to close issues: 6 months
Average time to close pull requests: 10 months
Total issue authors: 1
Total pull request authors: 1
Average comments per issue: 1.33
Average comments per pull request: 0.5
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 2

Past Year

Issues: 0
Pull requests: 2
Average time to close issues: N/A
Average time to close pull requests: 10 months
Issue authors: 0
Pull request authors: 1
Average comments per issue: 0
Average comments per pull request: 0.5
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 2

Top Authors

Issue Authors

GangLiTarheel (3)
dependabot[bot] (1)

Pull Request Authors

dependabot[bot] (3)

Top Labels

Issue Labels

dependencies (1)

Pull Request Labels

dependencies (3) python (2)

Packages

Total packages: 2
Total downloads:
- pypi 44 last-month

Total dependent packages: 0
(may contain duplicates)
Total dependent repositories: 2
(may contain duplicates)
Total versions: 20
Total maintainers: 1

pypi.org: lsmmdma

Scaling MMD-MA.

Homepage: https://github.com/google-research/large_scale_mmdma
Documentation: https://lsmmdma.readthedocs.io/
License: Apache 2.0
Latest release: 0.1.11
published almost 3 years ago

Versions: 19
Dependent Packages: 0
Dependent Repositories: 1
Downloads: 34 Last month

Rankings

Dependent packages count: 10.1%

Average: 19.0%

Forks count: 19.1%

Stargazers count: 19.4%

Dependent repos count: 21.6%

Downloads: 24.9%

Maintainers (1)

lpapaxanthos

Last synced: 7 months ago

pypi.org: lsmtest10

Scaling MMD-MA.

Homepage: https://github.com/google-research/large_scale_mmdma
Documentation: https://lsmtest10.readthedocs.io/
License: Apache 2.0
Latest release: 0.0.1
published about 4 years ago

Versions: 1
Dependent Packages: 0
Dependent Repositories: 1
Downloads: 10 Last month

Rankings

Dependent packages count: 10.1%

Forks count: 19.1%

Stargazers count: 19.4%

Dependent repos count: 21.6%

Average: 23.7%

Downloads: 48.5%

Maintainers (1)

lpapaxanthos

Last synced: 7 months ago

Science Score: 10.0%
