trace

Interactive quality analysis for two-dimensional embeddings

https://github.com/aida-ugent/trace

Science Score: 67.0%

This score indicates how likely this project is to be science-related based on various indicators:

✓
CITATION.cff file
Found CITATION.cff file
✓
codemeta.json file
Found codemeta.json file
✓
.zenodo.json file
Found .zenodo.json file
✓
DOI references
Found 1 DOI reference(s) in README
✓
Academic publication links
Links to: pubmed.ncbi, ncbi.nlm.nih.gov, springer.com
○
Academic email domains
○
Institutional organization owner
○
JOSS paper metadata
○
Scientific vocabulary similarity
Low similarity (14.7%) to scientific vocabulary

Keywords

dimensionality-reduction embedding-evaluation scatterplot visual-analytics visualization


Repository


Basic Info

Host: GitHub
Owner: aida-ugent
License: mit
Language: JavaScript
Default Branch: main
Homepage:
Size: 2.68 MB

Statistics

Stars: 13
Watchers: 3
Forks: 0
Open Issues: 0
Releases: 0


Created over 2 years ago · Last pushed about 1 year ago

Metadata Files

Readme Changelog License Citation

Pattern or Artifact? Interactively Exploring Embedding Quality with TRACE

TRACE¹ is built on Regl-scatterplot² and supports you in analyzing the global and local quality 🕵🏽‍♀️ of two-dimensional embeddings.


Installation

OPTION 1: Using Docker 🐋

Make sure you have [Docker Compose](https://docs.docker.com/compose/install/) installed. Then build and start the containers for the backend and frontend:

```bash
docker-compose build
docker-compose up
```

This will mount the /frontend, /backend, and /data directories into the respective containers. Open [http://localhost:3000](http://localhost:3000) with your browser to see the result.

OPTION 2: Without Docker

#### Required packages

**Backend**: Install the required Python packages for the backend (tested with Python 3.11) from `backend/pip_requirements.txt` or `backend/conda_requirements.yaml`.

**Frontend**: Install the packages in `frontend/package.json` using e.g. `npm install`.

First, start the backend within the right Python environment:

```bash
conda activate backend_env/
python main.py
# or
python -m uvicorn main:app --reload
```

Then start the frontend development server:

```bash
npm run dev
```

Open [http://localhost:3000](http://localhost:3000) with your browser to see the result.

Data Preparation

The easiest way to load your data into TRACE is using the Dataset class to add embeddings and compute quality measures. This will create the necessary Anndata structure under the hood. Examples can be found in the notebooks of each dataset folder.

```python
trace_data = Dataset(
    hd_data=data,
    name="Gaussian Line",
    verbose=True,
    hd_metric="euclidean",
)
```

How is the Anndata object structured?

The TRACE backend can load data structured in the [Anndata](https://anndata.readthedocs.io/en/latest/tutorials/notebooks/getting-started.html) format. It includes the following fields:

* `adata.X`: high-dimensional data
* [optional] `adata.obs`: dataframe with metadata, e.g. cluster labels
* `adata.obsm`: low-dimensional embeddings, one entry for each embedding, e.g. `adata.obsm["t-SNE (exag. 5)"]` for a t-SNE embedding
* `adata.uns`: unstructured data:
  * `adata.uns["methods"]`: a dictionary that structures all available embeddings into groups (exactly one level, with group names as keys and lists as values, as in the example). This defines which embeddings can be selected in the interface. For example, one could group by DR method and list all corresponding **two-dimensional** embedding keys in `adata.obsm`:
    ```json
    {
      "t-SNE": ["t-SNE (exag. 5)", "t-SNE (exag. 1)"],
      "UMAP": ["UMAP 20", "UMAP 100"]
    }
    ```
  * [optional] `adata.uns["neighbors"]`: an _n x k_ array of the k nearest high-dimensional neighbors of each point
  * [optional] `adata.uns["t-SNE (exag. 5)"]`: dictionary with additional data for each embedding, such as **quality** scores or the **parameters** used to obtain the embedding. For example:
    ```json
    {
      "quality": {"qnx@50": [...], "qnx@200": [...]},
      "parameters": {"perplexity": 100, "exaggeration": 5, "epochs": 750}
    }
    ```
  * [optional] 🌈 You can add custom colors for metadata features by adding a list of HEX values to `trace_data.adata.uns["featureName_colors"]`. For categorical features, the number of colors should match the number of categories. The colors for continuous features will be mapped to the [min, max] range of the feature values.
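The constraints on `adata.uns["methods"]` (one level of grouping, every listed key present in `adata.obsm`, every embedding two-dimensional) can be checked before loading a file into the dashboard. The following is a minimal sketch, not part of TRACE: `check_methods` is a hypothetical helper, and plain dictionaries and tuples stand in for the Anndata fields so that the example runs without extra packages.

```python
from collections.abc import Mapping


def check_methods(methods: Mapping, obsm: Mapping) -> list[str]:
    """Return a list of problems found in a `methods` grouping (empty if none)."""
    problems = []
    for group, keys in methods.items():
        # Exactly one level: every value must be a flat list of embedding names.
        if not isinstance(keys, list) or not all(isinstance(k, str) for k in keys):
            problems.append(f"group {group!r} must map to a list of strings")
            continue
        for key in keys:
            if key not in obsm:
                problems.append(f"embedding {key!r} is missing from obsm")
            elif any(len(point) != 2 for point in obsm[key]):
                problems.append(f"embedding {key!r} is not two-dimensional")
    return problems


# Toy stand-in for adata.obsm: embedding name -> list of points.
obsm = {
    "t-SNE (exag. 5)": [(0.0, 1.0), (2.0, 3.0)],
    "UMAP 20": [(0.5, 0.5), (1.5, 2.5)],
    "PCA 3D": [(0.0, 0.0, 0.0), (1.0, 1.0, 1.0)],
}
methods = {
    "t-SNE": ["t-SNE (exag. 5)"],
    "UMAP": ["UMAP 20", "UMAP 100"],  # "UMAP 100" was never added
    "PCA": ["PCA 3D"],                # three columns instead of two
}

for problem in check_methods(methods, obsm):
    print(problem)
```

With the toy data, the check reports the missing `"UMAP 100"` entry and the three-dimensional `"PCA 3D"` embedding.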

1. Adding 2-dimensional embeddings

After preprocessing your data and computing a range of 2-dimensional embeddings using your favorite DR method, add the data and the embeddings to the data object:

```python
# Repeat for each embedding
trace_data.add_embedding(
    name="tSNE (perplexity 30)",
    embedding=tsne_emb,
    category="tSNE",
)
```

2. Computing High-Dimensional Neighbors and Quality Measures

To provide snappy interactions in TRACE, the HD neighbors and a range of quality measures need to be precomputed. We use ANNOY to obtain the approximate neighbors and provide implementations of the following quality measures to be visualized via point colors in TRACE:

* neighborhood preservation: measures the fraction of the k high-dimensional neighbors of each point that are preserved in the low-dimensional embedding.
* landmark distance correlation: samples landmark points from the high-dimensional data, either randomly or with k-means++ (which supports only Euclidean distance). We then compute the pairwise distances between all landmarks in the high-dimensional space and in each embedding, and the rank correlation of their distance vectors. Points that are not landmark points are colored according to their nearest landmark point in the embedding.
* random triplet accuracy: quantifies the ratio of random triplets (i, j, k) for which the relative order of j and k with respect to i in the high-dimensional space is preserved in the embedding.
* point stability: measures how much the distances between each point and a random sample of other points vary across all embeddings. If a point has a very different global or local position in the embeddings, the stability will be low.
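To make two of these definitions concrete, here is a brute-force sketch of neighborhood preservation and random triplet accuracy. It is an illustration under stated assumptions, not TRACE's implementation: it uses exact Euclidean neighbors instead of ANNOY, and the function names `knn`, `neighborhood_preservation`, and `random_triplet_accuracy` are chosen for this example only.

```python
import math
import random


def knn(points, i, k):
    """Indices of the k nearest neighbors of point i (brute force, Euclidean)."""
    others = [j for j in range(len(points)) if j != i]
    others.sort(key=lambda j: math.dist(points[i], points[j]))
    return others[:k]


def neighborhood_preservation(hd, ld, k):
    """Per-point fraction of the k HD neighbors that are also LD neighbors."""
    scores = []
    for i in range(len(hd)):
        shared = set(knn(hd, i, k)) & set(knn(ld, i, k))
        scores.append(len(shared) / k)
    return scores


def random_triplet_accuracy(hd, ld, num_triplets, seed=0):
    """Fraction of random triplets (i, j, k) for which the order of j and k
    relative to i is the same in HD and LD."""
    rng = random.Random(seed)
    preserved = 0
    for _ in range(num_triplets):
        i, j, k = rng.sample(range(len(hd)), 3)
        hd_closer = math.dist(hd[i], hd[j]) < math.dist(hd[i], hd[k])
        ld_closer = math.dist(ld[i], ld[j]) < math.dist(ld[i], ld[k])
        preserved += hd_closer == ld_closer
    return preserved / num_triplets


# Toy data: five 3-D points on a line and their exact 2-D projection.
hd = [(0, 0, 0), (1, 0, 0), (3, 0, 0), (7, 0, 0), (15, 0, 0)]
ld = [(x, y) for x, y, _ in hd]

print(neighborhood_preservation(hd, ld, k=2))            # [1.0, 1.0, 1.0, 1.0, 1.0]
print(random_triplet_accuracy(hd, ld, num_triplets=20))  # 1.0
```

Because the toy embedding keeps all pairwise distances unchanged, both measures reach their maximum of 1. An embedding that reorders the points would lower the per-point scores, which is what TRACE visualizes via point colors.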

To compute all available quality measures:

```python
trace_data.compute_quality(filename="./gauss_line.h5ad", hd_metric="euclidean")
trace_data.print_quality()
```

How can I choose the parameters of the quality measures?

Instead of calling the `compute_quality` function, you can also call each function separately.

```python
trace_data.precompute_HD_neighbors(maxK=200)
trace_data.compute_neighborhood_preservation(
    neighborhood_sizes=[200, 100, 50]
)
trace_data.compute_global_distance_correlation(
    max_landmarks=1000,
    LD_landmark_neighbors=True,
    hd_metric="euclidean",
    sampling_method="random",
)
trace_data.compute_random_triplet_accuracy(
    num_triplets=10
)
trace_data.compute_point_stability(num_samples=50)

# align the embeddings such that point movement is minimized
trace_data.align_embeddings(reference_embedding="PCA")
trace_data.save_adata(filename="./gauss_line.h5ad")
```

3. Add Dataset Configuration

To include a dataset in the dashboard, you need to create a file `/backend/data_configs.yaml` following the examples in `data_configs.yaml.template`. For the Gaussian Line dataset, adding the following lines would be sufficient:

```json
"GaussLine": {
    "filepath": "../data/gauss_line/gauss_line.h5ad",
    "description": "Gaussian clusters shifted along a line from Böhm et al. (2022)"
}
```

Example Datasets

Gaussian Line 🟢 🟠 🟣

A small example dataset that is included in the repository.

Mammoth 🦣

This dataset from Wang et al. can be downloaded from their PaCMAP repository. It then needs to be processed using the mammoth.ipynb notebook.

Single-Cell Mouse Data 🐁

The processed dataset of gene expressions from Guilliams et al. is not available online; please reach out if you are interested. A raw version is available under GSE192742.

Citation

TRACE was presented as a demo paper at ECML-PKDD 2024. If you find the tool useful and use it in your research, we'd appreciate it if you cited our paper:

```bibtex
@inproceedings{heiter2024pattern,
  title={Pattern or Artifact? Interactively Exploring Embedding Quality with TRACE},
  author={Heiter, Edith and Martens, Liesbet and Seurinck, Ruth and Guilliams, Martin and De Bie, Tijl and Saeys, Yvan and Lijffijt, Jefrey},
  booktitle={Joint European Conference on Machine Learning and Knowledge Discovery in Databases},
  pages={379--382},
  year={2024},
  organization={Springer}
}
```

[1] TRACE stands for Two-dimensional representation Analysis and Comparison Engine
[2] Lekschas, Fritz. "Regl-Scatterplot: A Scalable Interactive JavaScript-based Scatter Plot Library." Journal of Open Source Software (2023)


Owner

Name: Ghent University Artificial Intelligence & Data Analytics Group
Login: aida-ugent
Kind: organization
Email: tijl.debie@ugent.be
Location: Ghent

Website: aida.ugent.be
Repositories: 36
Profile: https://github.com/aida-ugent

Citation (CITATION.cff)

cff-version: 1.2.0
title: >-
  Pattern or Artifact? Interactively Exploring Embedding
  Quality with TRACE
message: >-
  If you are using TRACE, please cite our paper which was 
  presented at the Demo track of ECML-PKDD 2024:
type: software
authors:
  - family-names: Heiter
    given-names: Edith
  - family-names: Martens
    given-names: Liesbet
  - family-names: Seurinck
    given-names: Ruth
  - family-names: Guilliams
    given-names: Martin
  - family-names: De Bie
    given-names: Tijl
  - family-names: Saeys
    given-names: Yvan
  - family-names: Lijffijt
    given-names: Jefrey
identifiers:
  - type: doi
    value: 10.1007/978-3-031-70371-3_24
repository-code: 'https://github.com/aida-ugent/TRACE'
license: MIT
preferred-citation:
  type: conference-paper
  authors:
    - family-names: Heiter
      given-names: Edith
    - family-names: Martens
      given-names: Liesbet
    - family-names: Seurinck
      given-names: Ruth
    - family-names: Guilliams
      given-names: Martin
    - family-names: De Bie
      given-names: Tijl
    - family-names: Saeys
      given-names: Yvan
    - family-names: Lijffijt
      given-names: Jefrey
  title: Pattern or Artifact? Interactively Exploring Embedding Quality with TRACE
  year: 2024
  start: 379
  end: 382

GitHub Events

Total

Watch event: 6
Push event: 17
Create event: 1

Last Year

Watch event: 6
Push event: 17
Create event: 1

Issues and Pull Requests

Last synced: over 1 year ago

All Time

Total issues: 0
Total pull requests: 1
Average time to close issues: N/A
Average time to close pull requests: 13 minutes
Total issue authors: 0
Total pull request authors: 1
Average comments per issue: 0
Average comments per pull request: 1.0
Merged pull requests: 1
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 0
Pull requests: 1
Average time to close issues: N/A
Average time to close pull requests: 13 minutes
Issue authors: 0
Pull request authors: 1
Average comments per issue: 0
Average comments per pull request: 1.0
Merged pull requests: 1
Bot issues: 0
Bot pull requests: 0


Top Authors

Issue Authors

Pull Request Authors

Laughing1999 (2)

Top Labels

Issue Labels

Pull Request Labels

Dependencies

frontend/package-lock.json npm

391 dependencies

frontend/package.json npm

autoprefixer ^10.4.15 development
postcss ^8.4.28 development
tailwindcss ^3.3.3 development
@emotion/react ^11.11.1
@emotion/styled ^11.11.0
@headlessui/react ^1.7.17
eslint 8.47.0
eslint-config-next 13.4.16
next latest
react 18.2.0
react-dom 18.2.0
react-select ^5.7.5
regl-scatterplot ^1.8.3

backend/pip_requirements.txt pypi

annoy *
fastapi *
glasbey *
matplotlib *
numpy *
opentsne *
pandas *
pyyaml *
scanpy *
uvicorn *

frontend/requirements.txt pypi

annoy *
fastapi ==0.103.2
numpy ==1.24.3
pandas *
scanpy *
uvicorn ==0.20.0