contrastive

Contrastive PCA

https://github.com/abidlabs/contrastive

Science Score: 46.0%

This score indicates how likely this project is to be science-related based on various indicators:

○
CITATION.cff file
✓
codemeta.json file
Found codemeta.json file
✓
.zenodo.json file
Found .zenodo.json file
○
DOI references
✓
Academic publication links
Links to: nature.com
✓
Committers with academic emails
2 of 7 committers (28.6%) from academic institutions
○
Institutional organization owner
○
JOSS paper metadata
○
Scientific vocabulary similarity
Low similarity (14.4%) to scientific vocabulary

Last synced: 10 months ago

Repository

Contrastive PCA

Basic Info

Host: GitHub
Owner: abidlabs
License: mit
Language: Jupyter Notebook
Default Branch: master
Homepage:
Size: 29.8 MB

Statistics

Stars: 220
Watchers: 9
Forks: 52
Open Issues: 14
Releases: 0

Created almost 9 years ago · Last pushed over 1 year ago

Metadata Files

Readme License

README.rst

contrastive
===================
A Python library for performing unsupervised learning (e.g. PCA) in contrastive settings, where one is interested in patterns (e.g. clusters or clines) that exist in one dataset but not the other.

Applications include discovering subgroups in biological and medical data. Here are basic installation and usage instructions, written for Python 3 (in which the library has been developed and tested, although it should work in Python 2 as well).

For more details, see the accompanying paper: "Exploring Patterns Enriched in a Dataset with Contrastive Principal Component Analysis", *Nature Communications* (2018), and please use the citation below.

.. code-block:: 

	@article{abid2018exploring,
	  title={Exploring patterns enriched in a dataset with contrastive principal component analysis},
	  author={Abid, Abubakar and Zhang, Martin J and Bagaria, Vivek K and Zou, James},
	  journal={Nature communications},
	  volume={9},
	  number={1},
	  pages={2134},
	  year={2018},
	}


This repository also includes experiments to reproduce most of the figures in the paper. Please see the python notebooks in the :code:`experiments` folder.

Installation
--------------------

.. code-block:: 

	$ pip3 install contrastive

Basic Usage
-------------------------------

The basic functions enabled by this library are shown below. Generally speaking, we have two datasets: :code:`foreground_data`, the dataset in which we want to discover patterns and directions, and :code:`background_data`, a dataset that does not contain the patterns or directions of interest. In some cases, both datasets may contain the signal of interest, but the pattern is enriched in the foreground relative to the background. These analyses involve a contrast parameter, alpha, which can be treated as a hyperparameter.

.. code-block:: python

	from contrastive import CPCA

	mdl = CPCA()
	projected_data = mdl.fit_transform(foreground_data, background_data)
	
	# Returns a list, 'projected_data', of 2-dimensional projections of the foreground data,
	# one for each automatically chosen value of 'alpha' (4 values by default).

Note that :code:`foreground_data` and :code:`background_data` should be 2D NumPy arrays with the same second dimension (which represents the number of features). In other words, :code:`foreground_data.shape[1]==background_data.shape[1]` should return :code:`True`.
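
For intuition about what a single value of alpha does, here is a minimal NumPy sketch of contrastive PCA as described in the accompanying paper: take the top eigenvectors of the foreground covariance minus alpha times the background covariance, then project the foreground data onto them. The function name :code:`cpca_sketch` and its arguments are hypothetical, and this is not the library's own implementation (for example, it omits standardization and automatic alpha selection).

.. code-block:: python

	import numpy as np

	def cpca_sketch(foreground, background, alpha=1.0, n_components=2):
	    # hypothetical helper for illustration; not part of the contrastive API
	    foreground = np.asarray(foreground, dtype=float)
	    background = np.asarray(background, dtype=float)
	    if foreground.ndim != 2 or background.ndim != 2:
	        raise ValueError("both datasets must be 2D arrays")
	    if foreground.shape[1] != background.shape[1]:
	        raise ValueError("both datasets must have the same number of features")

	    # center each dataset on its own column means
	    fg = foreground - foreground.mean(axis=0)
	    bg = background - background.mean(axis=0)

	    # covariance matrices, each of shape (n_features, n_features)
	    fg_cov = fg.T @ fg / (fg.shape[0] - 1)
	    bg_cov = bg.T @ bg / (bg.shape[0] - 1)

	    # the contrast matrix is symmetric, so eigh applies (eigenvalues come back ascending)
	    eigvals, eigvecs = np.linalg.eigh(fg_cov - alpha * bg_cov)
	    top = np.argsort(eigvals)[::-1][:n_components]

	    # project the centered foreground data onto the top contrastive directions
	    return fg @ eigvecs[:, top]

	rng = np.random.default_rng(0)
	demo_projection = cpca_sketch(rng.normal(size=(50, 5)), rng.normal(size=(80, 5)), alpha=2.0)
	print(demo_projection.shape)  # (50, 2)

Setting :code:`alpha=0` in this sketch reduces it to ordinary PCA on the foreground data, while larger values increasingly penalize directions that have high variance in the background.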

**Built-in plotting**: to quickly see the results of contrastive PCA, simply set the :code:`plot` parameter to :code:`True`:

.. code-block:: python

	from contrastive import CPCA

	mdl = CPCA()
	projected_data = mdl.fit_transform(foreground_data, background_data, plot=True)
	

.. image:: images/plot_true.png

**Interactive GUI**: if you are running these analyses inside a jupyter notebook, you can easily launch an interactive GUI as shown here:

.. code-block:: python

	from contrastive import CPCA

	mdl = CPCA()
	projected_data = mdl.fit_transform(foreground_data, background_data, gui=True)
	

.. image:: images/gui_true.png

Using the slider, you can see how your data points move as you change the value of the contrast parameter. These animations can reveal groups in the data and other insights:

.. image:: images/animation.gif

Quick Test
-------------------------------
To ensure that the library is working, here is a quick script that will allow you to test the code on synthetic data. Simply run the following commands:

.. code-block:: python

	import numpy as np
	from contrastive import CPCA

	N = 400; D = 30; gap=3
	# In B, all the data pts are from the same distribution, which has different variances in three subspaces.
	B = np.zeros((N, D))
	B[:,0:10] = np.random.normal(0,10,(N,10))  
	B[:,10:20] = np.random.normal(0,3,(N,10))
	B[:,20:30] = np.random.normal(0,1,(N,10))


	# In A there are four clusters.
	A = np.zeros((N, D))
	A[:,0:10] = np.random.normal(0,10,(N,10))
	# group 1
	A[0:100, 10:20] = np.random.normal(0,1,(100,10))
	A[0:100, 20:30] = np.random.normal(0,1,(100,10))
	# group 2
	A[100:200, 10:20] = np.random.normal(0,1,(100,10))
	A[100:200, 20:30] = np.random.normal(gap,1,(100,10))
	# group 3
	A[200:300, 10:20] = np.random.normal(2*gap,1,(100,10))
	A[200:300, 20:30] = np.random.normal(0,1,(100,10))
	# group 4
	A[300:400, 10:20] = np.random.normal(2*gap,1,(100,10))
	A[300:400, 20:30] = np.random.normal(gap,1,(100,10))
	A_labels = [0]*100+[1]*100+[2]*100+[3]*100

	cpca = CPCA(standardize=False)
	cpca.fit_transform(A, B, plot=True, active_labels=A_labels)

You should see a series of plots that looks something like this:

.. image:: images/plot_example.png

Optional Parameters
-------------------------------
**Labels for foreground data (plot/gui mode)**: In the examples above, the data points are colored according to labels known ahead of time. You can supply these labels using the :code:`active_labels` parameter, as shown here:

.. code-block:: python

	from contrastive import CPCA

	mdl = CPCA()
	#labels = [0, 1, 0, 1, 1 ... 1, 0] 
	projected_data = mdl.fit_transform(foreground_data, background_data, plot=True, active_labels=labels)

**Additional # of components**: Sometimes, you'd like to project your data on more than the top 2 contrastive principal components (cPCs). Specify the number of cPCs when you instantiate your model using the :code:`n_components` parameter:

.. code-block:: python

	from contrastive import CPCA

	mdl = CPCA(n_components=3) #the top 3 components will be returned
	projected_data = mdl.fit_transform(foreground_data, background_data)

However, note that only when :code:`n_components=2` can the data be plotted or visualized through the GUI.

**How values of alpha are chosen**: So far, we've always plotted the data with values of alpha chosen automatically using the default parameters. However, the values of alpha can be customized. For example, if you'd still like to choose the values of alpha automatically, but change the range or number of alphas considered, you can use the :code:`n_alphas` and :code:`max_log_alpha` parameters. The former sets the number of alphas that are analyzed, and the latter sets the base-10 logarithm of the largest alpha considered. (The minimum value of alpha, besides alpha = 0, is always alpha = 0.1.) Finally, you can change the number of values of alpha that are returned using the :code:`n_alphas_to_return` parameter.

.. code-block:: python

	from contrastive import CPCA

	mdl = CPCA()
	projected_data = mdl.fit_transform(foreground_data, background_data, n_alphas=10,  max_log_alpha=2, n_alphas_to_return=1) #search through 10 logarithmically spaced values of alpha from 0.1 to 100 and return the PCs for only 1 of them.
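
The grid of candidate alphas described above can be sketched with NumPy. This is only an illustration, under the assumption that the grid consists of alpha = 0 followed by :code:`n_alphas` logarithmically spaced values between 0.1 and :code:`10**max_log_alpha`. The helper name :code:`candidate_alphas` is hypothetical and not part of the library.

.. code-block:: python

	import numpy as np

	def candidate_alphas(n_alphas=10, max_log_alpha=2):
	    # hypothetical helper; the grid layout is an assumption based on the text above
	    log_spaced = np.logspace(-1, max_log_alpha, n_alphas)
	    # alpha = 0 (no contrast with the background) is placed at the front of the grid
	    return np.concatenate(([0.0], log_spaced))

	alphas = candidate_alphas(n_alphas=10, max_log_alpha=2)
	print(alphas)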

You can also set alpha to a particular value manually by using the :code:`alpha_selection` and :code:`alpha_value` parameters as follows:

.. code-block:: python

	from contrastive import CPCA

	mdl = CPCA()
	projected_data = mdl.fit_transform(foreground_data, background_data, alpha_selection='manual', alpha_value=2.0)

Or you can decide to plot or return the data for *all* values of alpha in the given range. In this case, you can still choose to set the :code:`n_alphas` and :code:`max_log_alpha` parameters:

.. code-block:: python

	from contrastive import CPCA

	mdl = CPCA()
	projected_data = mdl.fit_transform(foreground_data, background_data, n_alphas=10,  max_log_alpha=2, alpha_selection='all') #search through 10 logarithmically spaced values of alpha from 0.1 to 100 and return the PCs for all of them!

**Whether to standardize your data**: By default, before performing contrastive PCA, the data are standardized so that each column or dimension has unit variance. You can turn this off by doing the following:

.. code-block:: python

	from contrastive import CPCA

	mdl = CPCA(standardize=False)
	projected_data = mdl.fit_transform(foreground_data, background_data)
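
As a rough picture of what standardization means here, the sketch below rescales each column to unit variance (and, in this sketch, zero mean) with NumPy. The helper name :code:`standardize_columns` is hypothetical, and details such as how constant columns are handled are assumptions rather than a description of the library's internals.

.. code-block:: python

	import numpy as np

	def standardize_columns(data):
	    # hypothetical helper, shown only to illustrate column-wise standardization
	    data = np.asarray(data, dtype=float)
	    centered = data - data.mean(axis=0)
	    std = data.std(axis=0)
	    # assumption: leave constant columns at zero instead of dividing by zero
	    std[std == 0] = 1.0
	    return centered / std

	example = np.array([[1.0, 2.0, 5.0], [3.0, 2.0, 7.0], [5.0, 2.0, 9.0]])
	standardized = standardize_columns(example)
	print(standardized.std(axis=0))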

**Custom colors (plot/gui mode)**: As a stylistic touch, you can also customize which colors are used to label the points when the data is plotted by using the :code:`colors` argument. Here's an example:

.. code-block:: python

	from contrastive import CPCA

	mdl = CPCA(standardize=False)
	projected_data = mdl.fit_transform(foreground_data, background_data, gui=True, colors=['r','b','k','c'])

This will produce something along the lines of:

.. image:: images/gui_colors.png

Owner

Name: Abubakar Abid
Login: abidlabs
Kind: user

Twitter: abidlabs
Repositories: 9
Profile: https://github.com/abidlabs

Working on Gradio (www.gradio.dev) at @huggingface!

GitHub Events

Total

Watch event: 22
Pull request event: 1
Fork event: 4
Create event: 1

Last Year

Watch event: 22
Pull request event: 1
Fork event: 4
Create event: 1

Committers

Last synced: about 1 year ago

All Time

Total Commits: 52
Total Committers: 7
Avg Commits per committer: 7.429
Development Distribution Score (DDS): 0.577

Past Year

Commits: 0
Committers: 0
Avg Commits per committer: 0.0
Development Distribution Score (DDS): 0.0

Top Committers

Name	Email	Commits
Abubakar Abid	a**d@D**u	22
Abubakar Abid	A****d	12
Abubakar Abid	a**d@s**u	9
martinjzhang	m**g@g**m	6
drakeeee	a**7@g**m	1
Abubakar Abid	a**r@h**o	1
jzazo	S**7@m**m	1

Committer Domains (Top 20 + Academic)

microsoft.com: 1
huggingface.co: 1
stanford.edu: 1
desktop-role13u.stanford.edu: 1

Issues and Pull Requests

Last synced: 11 months ago

All Time

Total issues: 21
Total pull requests: 7
Average time to close issues: 29 days
Average time to close pull requests: 10 days
Total issue authors: 18
Total pull request authors: 6
Average comments per issue: 0.9
Average comments per pull request: 0.14
Merged pull requests: 2
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 2
Pull requests: 3
Average time to close issues: N/A
Average time to close pull requests: about 3 hours
Issue authors: 2
Pull request authors: 2
Average comments per issue: 0.0
Average comments per pull request: 0.33
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0


Top Authors

Issue Authors

krassowski (3)
kieran-mace (2)
ndaniel (1)
klai001 (1)
ideasbyjin (1)
tamkaho (1)
PietroD (1)
toosi (1)
gtiao (1)
edifice1989 (1)
Periodinan (1)
vyraun (1)
hliu56 (1)
eliauk07yz (1)
AvantiShri (1)

Pull Request Authors

patjiang (3)
abidlabs (2)
drakeeee (1)
gabriel-santanna (1)
jzazo (1)
jkobject (1)

Top Labels

Issue Labels

Pull Request Labels

Packages

Total packages: 1
Total downloads:
- pypi 162 last-month

Total dependent packages: 0
Total dependent repositories: 2
Total versions: 10
Total maintainers: 1

pypi.org: contrastive

Python library for performing unsupervised learning (e.g. PCA) in contrastive settings, where one is interested in finding directions and patterns that exist in one dataset but not the other

Homepage: https://github.com/abidlabs/contrastive
Documentation: https://contrastive.readthedocs.io/
License: mit
Latest release: 1.2.0
published over 3 years ago

Versions: 10
Dependent Packages: 0
Dependent Repositories: 2
Downloads: 162 Last month

Rankings

Stargazers count: 5.3%

Forks count: 6.1%

Average: 9.9%

Dependent packages count: 10.0%

Dependent repos count: 11.6%

Downloads: 16.2%

Maintainers (1)

aabid93

Last synced: 10 months ago

Dependencies

contrastive.egg-info/requires.txt pypi

matplotlib *
numpy *
sklearn *

setup.py pypi

matplotlib *
numpy *
sklearn *
