trace
Interactive quality analysis for two-dimensional embeddings
Science Score: 67.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: Found CITATION.cff file
- ✓ codemeta.json file: Found codemeta.json file
- ✓ .zenodo.json file: Found .zenodo.json file
- ✓ DOI references: Found 1 DOI reference(s) in README
- ✓ Academic publication links: Links to: pubmed.ncbi, ncbi.nlm.nih.gov, springer.com
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: Low similarity (14.7%) to scientific vocabulary
Keywords
Repository
Interactive quality analysis for two-dimensional embeddings
Basic Info
Statistics
- Stars: 13
- Watchers: 3
- Forks: 0
- Open Issues: 0
- Releases: 0
Topics
Metadata Files
README.md
Pattern or Artifact? Interactively Exploring Embedding Quality with TRACE
TRACE [1] supports you in analyzing global and local quality 🕵🏽♀️ of two-dimensional embeddings, based on Regl-Scatterplot [2].

Installation
OPTION 1: Using Docker 🐋
Make sure you have [Docker Compose](https://docs.docker.com/compose/install/) installed. Then build the container that includes the backend and frontend:

```bash
docker-compose build
docker-compose up
```

This will mount the /frontend, /backend, and /data directories into the respective containers. Open [http://localhost:3000](http://localhost:3000) with your browser to see the result.

OPTION 2: Without Docker
#### Required packages

**Backend**: Install the required python packages for the backend, tested with Python 3.11, from `backend/pip_requirements.txt` or `backend/conda_requirements.yaml`.

**Frontend**: Install the packages in `frontend/package.json` using e.g. `npm install`.

First, start the backend within the right python environment:

```bash
conda activate backend_env/
python main.py
# or: python -m uvicorn main:app --reload
```

Then start the frontend development server:

```bash
npm run dev
```

Open [http://localhost:3000](http://localhost:3000) with your browser to see the result.

Data Preparation
The easiest way to load your data into TRACE is using the Dataset class to add embeddings and compute quality measures. This will create the necessary Anndata structure under the hood. Examples can be found in the notebooks of each dataset folder.
```python
trace_data = Dataset(
    hd_data=data,
    name="Gaussian Line",
    verbose=True,
    hd_metric="euclidean",
)
```
How is the AnnData object structured?
The TRACE backend can load data structured in the [AnnData](https://anndata.readthedocs.io/en/latest/tutorials/notebooks/getting-started.html) format. It includes the following fields (a minimal sketch of assembling such an object is given after this list):

* `adata.X`: high-dimensional data
* [optional] `adata.obs`: dataframe with metadata, e.g. cluster labels
* `adata.obsm`: low-dimensional embeddings, one entry for each embedding, e.g. `adata.obsm["t-SNE (exag. 5)"]` for a t-SNE embedding
* `adata.uns`: unstructured data:
  * `adata.uns["methods"]`: a dictionary that structures all available embeddings into groups (exactly one level with keys and a list as values, as in the example below). This defines which embeddings can be selected in the interface. For example, one could group according to DR methods and list all corresponding **two-dimensional** embedding keys in `adata.obsm`:
    ```json
    {
      "t-SNE": ["t-SNE (exag. 5)", "t-SNE (exag. 1)"],
      "UMAP": ["UMAP 20", "UMAP 100"]
    }
    ```
  * [optional] `adata.uns["neighbors"]`: an _n x k_ array of the k-nearest high-dimensional neighbors of each point
  * [optional] `adata.uns["t-SNE (exag. 5)"]`: dictionary with additional data for each embedding, such as **quality** scores or **parameters** used to obtain the embedding. For example:
    ```json
    {
      "quality": {"qnx@50": [...], "qnx@200": [...]},
      "parameters": {"perplexity": 100, "exaggeration": 5, "epochs": 750}
    }
    ```
* [optional] 🌈 You can add custom colors for metadata features by adding a list of HEX values to `trace_data.adata.uns["featureName_colors"]`. For categorical features, the number of colors should match the number of categories. The colors for continuous features will be mapped to the [min, max] range of the feature values.
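If you prefer to assemble the AnnData object yourself instead of going through the `Dataset` class, a minimal sketch could look as follows. The random placeholder data and the names used here (`"t-SNE (exag. 5)"`, `"UMAP 20"`, `toy_dataset.h5ad`) are illustrative only, not part of TRACE's API.

```python
# Hand-built AnnData object with the fields the TRACE backend expects
# (normally the Dataset class creates this structure for you).
import numpy as np
import anndata as ad

n, d, k = 1000, 50, 200
rng = np.random.default_rng(0)

adata = ad.AnnData(X=rng.normal(size=(n, d)))              # high-dimensional data (placeholder)
adata.obsm["t-SNE (exag. 5)"] = rng.normal(size=(n, 2))    # one 2D embedding per obsm key
adata.obsm["UMAP 20"] = rng.normal(size=(n, 2))

# group the obsm keys so they can be selected in the interface
adata.uns["methods"] = {"t-SNE": ["t-SNE (exag. 5)"], "UMAP": ["UMAP 20"]}

# optional: n x k array of HD nearest neighbors (placeholder indices here)
adata.uns["neighbors"] = rng.integers(0, n, size=(n, k))

# optional: per-embedding quality scores and parameters
adata.uns["t-SNE (exag. 5)"] = {
    "quality": {"qnx@50": np.zeros(n)},
    "parameters": {"perplexity": 100, "exaggeration": 5},
}

adata.write_h5ad("./toy_dataset.h5ad")
```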
1. Adding 2-dimensional embeddings
After preprocessing your data and computing a range of 2-dimensional embeddings using your favorite DR method, add the data and the embeddings to the data object:
```python
# Repeat for each embedding
trace_data.add_embedding(
    name="tSNE (perplexity 30)",
    embedding=tsne_emb,
    category="tSNE",
)
```
2. Computing High-Dimensional Neighbors and Quality Measures
To provide snappy interactions in TRACE, the HD neighbors and a range of quality measures need to be precomputed. We use ANNOY to obtain approximate nearest neighbors and provide implementations of the following quality measures, which are visualized via point colors in TRACE (a from-scratch sketch of two of these measures follows the list):
- neighborhood preservation: measures the fraction of the k high-dimensional nearest neighbors of each point that are preserved in the low-dimensional embedding.
- landmark distance correlation: landmark points are sampled from the high-dimensional data using random or k-means++ sampling (the latter supports only Euclidean distance). We then compute the pairwise distances between all landmarks in high-dimensional space and in each embedding, and the rank correlation of their distance vectors. Points that are not landmark points are colored according to their nearest landmark point in the embedding.
- random triplet accuracy: quantifies the fraction of random triplets (i, j, k) for which the relative order of j and k with respect to i in high-dimensional space is preserved in the embedding.
- point stability: measures how much the distances between each point and a random sample of other points vary across all embeddings. If a point has a very different global or local position in some embeddings, its stability will be low.
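To make the first and third definitions concrete, here is a small from-scratch sketch of neighborhood preservation and random triplet accuracy for a single embedding. It is not TRACE's implementation: it uses exact brute-force distances rather than ANNOY, and the function names and defaults are illustrative only.

```python
import numpy as np
from scipy.spatial.distance import cdist


def neighborhood_preservation(hd, ld, k=50):
    """Per-point fraction of the k HD nearest neighbors kept in the 2D embedding."""
    hd_nn = np.argsort(cdist(hd, hd), axis=1)[:, 1:k + 1]  # skip the point itself
    ld_nn = np.argsort(cdist(ld, ld), axis=1)[:, 1:k + 1]
    return np.array([
        len(np.intersect1d(hd_nn[i], ld_nn[i])) / k for i in range(len(hd))
    ])


def random_triplet_accuracy(hd, ld, num_triplets=10, seed=0):
    """Per-point fraction of random triplets (i, j, k) whose distance order
    d(i, j) < d(i, k) agrees between the HD data and the embedding."""
    rng = np.random.default_rng(seed)
    n = len(hd)
    correct = np.zeros(n)
    for _ in range(num_triplets):
        j, k = rng.integers(0, n, size=(2, n))  # one (j, k) pair per point i
        hd_order = np.linalg.norm(hd - hd[j], axis=1) < np.linalg.norm(hd - hd[k], axis=1)
        ld_order = np.linalg.norm(ld - ld[j], axis=1) < np.linalg.norm(ld - ld[k], axis=1)
        correct += hd_order == ld_order
    return correct / num_triplets
```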
To compute all available quality measures:
```python
trace_data.compute_quality(filename="./gauss_line.h5ad", hd_metric="euclidean")
trace_data.print_quality()
```
How can I choose the parameters of the quality measures?
Instead of calling the `compute_quality` function, you can also call each function separately:

```python
trace_data.precompute_HD_neighbors(maxK=200)
trace_data.compute_neighborhood_preservation(
    neighborhood_sizes=[200, 100, 50]
)
trace_data.compute_global_distance_correlation(
    max_landmarks=1000,
    LD_landmark_neighbors=True,
    hd_metric="euclidean",
    sampling_method="random",
)
trace_data.compute_random_triplet_accuracy(
    num_triplets=10
)
trace_data.compute_point_stability(num_samples=50)

# align the embeddings such that point movement is minimized
trace_data.align_embeddings(reference_embedding="PCA")

trace_data.save_adata(filename="./gauss_line.h5ad")
```

3. Add Dataset Configuration
To include a dataset in the dashboard, you need to create a file `/backend/data_configs.yaml` following the examples in `data_configs.yaml.template`. For the Gaussian Line dataset, adding the following lines would be sufficient:
```json
"GaussLine": {
    "filepath": "../data/gauss_line/gauss_line.h5ad",
    "description": "Gaussian clusters shifted along a line from Böhm et al. (2022)",
}
```
Example Datasets
Gaussian Line 🟢 🟠 🟣
A small example dataset that is included in the repository.
Mammoth 🦣
This dataset from Wang et al. can be downloaded from their PaCMAP repository. It then needs to be processed using the mammoth.ipynb notebook.
Single-Cell Mouse Data 🐁
The processed gene expression dataset from Guilliams et al. is not available online; please reach out if you are interested. A raw version is available under GSE192742.
Citation
TRACE was presented as a demo paper at ECML-PKDD 2024. If you find the tool useful and are using it in your research, we'd appreciate it if you could cite our paper:
```bibtex
@inproceedings{heiter2024pattern,
  title={Pattern or Artifact? Interactively Exploring Embedding Quality with TRACE},
  author={Heiter, Edith and Martens, Liesbet and Seurinck, Ruth and Guilliams, Martin and De Bie, Tijl and Saeys, Yvan and Lijffijt, Jefrey},
  booktitle={Joint European Conference on Machine Learning and Knowledge Discovery in Databases},
  pages={379--382},
  year={2024},
  organization={Springer}
}
```
[1] TRACE stands for Two-dimensional representation Analysis and Comparison Engine
[2] Lekschas, Fritz. "Regl-Scatterplot: A Scalable Interactive JavaScript-based Scatter Plot Library." Journal of Open Source Software (2023)
Owner
- Name: Ghent University Artificial Intelligence & Data Analytics Group
- Login: aida-ugent
- Kind: organization
- Email: tijl.debie@ugent.be
- Location: Ghent
- Website: aida.ugent.be
- Repositories: 36
- Profile: https://github.com/aida-ugent
Citation (CITATION.cff)
cff-version: 1.2.0
title: >-
Pattern or Artifact? Interactively Exploring Embedding
Quality with TRACE
message: >-
If you are using TRACE, please cite our paper which was
presented at the Demo track of ECML-PKDD 2024:
type: software
authors:
- family-names: Heiter
given-names: Edith
- family-names: Martens
given-names: Liesbet
- family-names: Seurinck
given-names: Ruth
- family-names: Guilliams
given-names: Martin
- family-names: De Bie
given-names: Tijl
- family-names: Saeys
given-names: Yvan
- family-names: Lijffijt
given-names: Jefrey
identifiers:
- type: doi
value: 10.1007/978-3-031-70371-3_24
repository-code: 'https://github.com/aida-ugent/TRACE'
license: MIT
preferred-citation:
type: conference-paper
authors:
- family-names: Heiter
given-names: Edith
- family-names: Martens
given-names: Liesbet
- family-names: Seurinck
given-names: Ruth
- family-names: Guilliams
given-names: Martin
- family-names: De Bie
given-names: Tijl
- family-names: Saeys
given-names: Yvan
- family-names: Lijffijt
given-names: Jefrey
title: Pattern or Artifact? Interactively Exploring Embedding Quality with TRACE
year: 2024
start: 379
end: 382
GitHub Events
Total
- Watch event: 6
- Push event: 17
- Create event: 1
Last Year
- Watch event: 6
- Push event: 17
- Create event: 1
Issues and Pull Requests
Last synced: about 1 year ago
All Time
- Total issues: 0
- Total pull requests: 1
- Average time to close issues: N/A
- Average time to close pull requests: 13 minutes
- Total issue authors: 0
- Total pull request authors: 1
- Average comments per issue: 0
- Average comments per pull request: 1.0
- Merged pull requests: 1
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 0
- Pull requests: 1
- Average time to close issues: N/A
- Average time to close pull requests: 13 minutes
- Issue authors: 0
- Pull request authors: 1
- Average comments per issue: 0
- Average comments per pull request: 1.0
- Merged pull requests: 1
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
Pull Request Authors
- Laughing1999 (2)
Top Labels
Issue Labels
Pull Request Labels
Dependencies
- 391 dependencies
- autoprefixer ^10.4.15 development
- postcss ^8.4.28 development
- tailwindcss ^3.3.3 development
- @emotion/react ^11.11.1
- @emotion/styled ^11.11.0
- @headlessui/react ^1.7.17
- eslint 8.47.0
- eslint-config-next 13.4.16
- next latest
- react 18.2.0
- react-dom 18.2.0
- react-select ^5.7.5
- regl-scatterplot ^1.8.3
- annoy *
- fastapi *
- glasbey *
- matplotlib *
- numpy *
- opentsne *
- pandas *
- pyyaml *
- scanpy *
- uvicorn *
- annoy *
- fastapi ==0.103.2
- numpy ==1.24.3
- pandas *
- scanpy *
- uvicorn ==0.20.0