https://github.com/batmen-lab/sonata
SONATA: Disambiguated manifold alignment of single-cell data
Repository
Basic Info
- Host: GitHub
- Owner: batmen-lab
- License: apache-2.0
- Language: Python
- Default Branch: main
- Size: 14.9 MB
Statistics
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
SONATA
Source code for "Securing diagonal integration of multimodal single-cell data against ambiguous mapping".

Requirements
Dependencies for SONATA are recorded in requirements.txt.
Data
The datasets used in this project are available for download at the following link: data.
Then organize the project as follows:
project_root/
├── src/
│ ├── examples/
│ │ ├── baselines/
│ │ ├── cfgs/
│ │ ├── noise_scale.ipynb
│ │ ├── simulation_t_branch.ipynb
│ │ └── ...
│ ├── run_baselines/
│ ├── utils/
│ └── sonata.py
├── examples/
│ ├── cfgs/
│ ├── simulation_t_branch.ipynb
│ └── ...
├── data/
│ ├── t_branch/
│ └── ...
├── results/
│ ├── sonata_pipeline/
│ │ ├── t_branch/
│ │ └── ...
├── README.md
└── requirements.txt
Baseline Performance
We demonstrate that artificial integrations resulting from ambiguous mapping in diagonal data integration are widespread yet surprisingly overlooked, occurring across all mainstream diagonal integration methods. The following notebooks show how the baseline methods perform on several ambiguous datasets:
- t_branch: t_branch.ipynb
- scGEM: scGEM.ipynb
- SNARE: SNARE.ipynb
- scNMT: scNMT.ipynb
To quantify the ambiguity in these cases, we report label transfer accuracy and average FOSCTTM metrics in our manuscript. All baseline method tests are implemented in the folder src/run_baselines. To run a test, use the following commands:
```bash
cd src
python run_baselines/run_unioncom.py --dataset t_branch
```
We argue that artificial integrations are more harmful than failed integrations because, while failed integrations can be qualitatively recognized, artificial integrations are difficult to detect and can mislead users into pursuing hypotheses based on erroneous results.
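For reference, the two metrics mentioned above can be sketched in plain Python. This is an illustration under stated assumptions, not the code in src/run_baselines: it assumes both modalities have already been projected into a shared space, that row i of one modality is the true match of row i of the other, and that distances are Euclidean. The function names and the default k=5 are hypothetical.

```python
import math
from collections import Counter


def euclidean(p, q):
    """Euclidean distance between two equal-length coordinate sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def average_foscttm(x, y):
    """Average Fraction Of Samples Closer Than the True Match.

    x[i] and y[i] are assumed to be the same cell measured in two
    modalities, already aligned into a common space. 0.0 means every
    true match is also the nearest neighbor.
    """
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y need the same number (>= 2) of samples")
    n = len(x)
    fractions = []
    for i in range(n):
        true_dist = euclidean(x[i], y[i])
        # Samples of the other modality that sit closer than the true match.
        closer_in_y = sum(euclidean(x[i], y[j]) < true_dist for j in range(n))
        closer_in_x = sum(euclidean(x[j], y[i]) < true_dist for j in range(n))
        fractions.append(closer_in_y / (n - 1))
        fractions.append(closer_in_x / (n - 1))
    return sum(fractions) / len(fractions)


def label_transfer_accuracy(ref_points, ref_labels, query_points, query_labels, k=5):
    """Accuracy of a k-nearest-neighbor vote that transfers labels
    from one aligned modality (ref) to the other (query)."""
    correct = 0
    for point, truth in zip(query_points, query_labels):
        order = sorted(range(len(ref_points)),
                       key=lambda j: euclidean(point, ref_points[j]))
        votes = Counter(ref_labels[j] for j in order[:k])
        predicted = votes.most_common(1)[0][0]
        correct += predicted == truth
    return correct / len(query_labels)
```

Lower FOSCTTM and higher label transfer accuracy both indicate a better alignment. For real datasets a vectorized NumPy version would be preferable, since the loops above scale quadratically with the number of cells.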
SONATA Examples
Jupyter notebooks that replicate the SONATA results from the manuscript are available in the examples folder:
- Simulation datasets
  - partially ambiguous: simulation_t_branch.ipynb, simulation_y_branch.ipynb, simulation_x_branch.ipynb
  - not ambiguous: simulation_decay_path.ipynb
- Real biology datasets
  - scGEM: scGEM.ipynb
  - SNARE: SNARE.ipynb
  - scNMT: scNMT.ipynb
Basic Use
```python
import sonata

# data: NumPy array with samples as rows and features as columns
sn = sonata.sonata(noise_scale=0.2)
DiagnoseResult = sn.diagnose(data)

# Get the indices of cells identified as ambiguous
ambiguous_idx = DiagnoseResult.ambiguous_idx

# Get the corresponding ambiguous group labels for those cells
ambiguous_labels = DiagnoseResult.ambiguous_labels
```
Input for SONATA:
- parameters:
  - noise_scale: The scale of Gaussian noise added to generate variational versions of the manifold. Default: 0.2.
  - n_neighbor: Number of neighbors used when constructing the noise manifold. Default: 10.
  - mode: Mode for constructing the graph. Options: "connectivity" or "distance". Default: "connectivity".
  - metric: Metric to use for distance computation. Default: "correlation".
  - e: Coefficient of the entropic regularization term in the objective function of the OT formulation. Default: 1e-3.
  - repeat: Number of iterations for alignment. Default: 10.
  - n_cluster: Number of cell groups used in hierarchical clustering to achieve a smooth and efficient spline fit. Recommended: n_cluster <= $\sqrt{n_samples}$. Default: 20.
  - pval_thres: P-value threshold for ambiguous group pair detection. Default: 1e-2.
  - scalableOT: If True, uses the scalable version of OT. Default: False.
  - scale_sample_rate: The sample rate for the scalable version of OT. Default: 0.1.
  - verbose: If True, prints the progress of the algorithm. Default: True.
- data: A NumPy array or matrix where rows correspond to samples and columns correspond to features.
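The n_cluster recommendation above can be applied before constructing the model. The following is a minimal sketch: `recommended_n_cluster` is a hypothetical helper that is not part of SONATA, and it simply caps the documented default of 20 at the integer square root of the sample count.

```python
import math


def recommended_n_cluster(n_samples, default=20):
    """Return the documented default, capped at floor(sqrt(n_samples)).

    Hypothetical helper for illustration; SONATA itself does not
    provide this function.
    """
    if n_samples < 1:
        raise ValueError("n_samples must be a positive integer")
    return max(1, min(default, math.isqrt(n_samples)))


print(recommended_n_cluster(100))    # 10
print(recommended_n_cluster(10000))  # 20
```

Assuming n_cluster is accepted as a constructor keyword in the same way as noise_scale, the result could be passed as `sonata.sonata(noise_scale=0.2, n_cluster=recommended_n_cluster(data.shape[0]))`.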
Output for SONATA:
- A SimpleNamespace object with the following attributes:
  - ambiguous_labels: A NumPy array of ambiguous group labels for the ambiguous samples.
  - ambiguous_idx: A NumPy array of indices of the ambiguous samples.
  - cannot_links: A list of ambiguous sample pairs.
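Once the diagnosis has run, the ambiguous cells can be grouped by their group label. The sketch below assumes the attributes are named `ambiguous_idx` and `ambiguous_labels` (the underscored forms of the names listed above) and are equal-length sequences. `group_ambiguous_cells` and the mock result are hypothetical and stand in for a real SONATA run.

```python
from collections import defaultdict
from types import SimpleNamespace


def group_ambiguous_cells(result):
    """Map each ambiguous group label to the sample indices it contains."""
    groups = defaultdict(list)
    for idx, label in zip(result.ambiguous_idx, result.ambiguous_labels):
        groups[label].append(idx)
    return dict(groups)


# Mock object with the documented attributes; values are made up
# purely for illustration and are not SONATA output.
mock_result = SimpleNamespace(
    ambiguous_idx=[3, 7, 8, 12],
    ambiguous_labels=[0, 0, 1, 1],
    cannot_links=[(3, 8), (7, 12)],
)

groups = group_ambiguous_cells(mock_result)
print(groups)  # {0: [3, 7], 1: [8, 12]}
```

Grouping by label makes it easy to inspect each ambiguous group separately, for example to check which groups the pairs in cannot_links connect.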
Guidance on choosing the "noise_scale" parameter
Please refer to notebook: noise_scale.ipynb.
Scalable SONATA
To support large-scale datasets, we offer a more efficient yet equally effective optimal transport algorithm that significantly improves the scalability of SONATA. You can enable this scalable mode by simply setting scalableOT=True:
```python
import sonata
sn = sonata.sonata(noise_scale=0.2, scalableOT=True)
DiagnoseResult = sn.diagnose(data)
```
Major Updates
- Jun. 11, 2025: Added Quantized Gromov–Wasserstein to enhance the scalability of SONATA for large datasets.
- Nov. 2, 2024: We have released the source code for the new version of SONATA.
- Nov. 1, 2024: We have added more comprehensive tests for 5 baseline methods, which can be found in the src/run_baselines folder. We're also working on the new version of SONATA—coming soon!
Owner
- Name: BATMEN Lab @ UWaterloo
- Login: batmen-lab
- Kind: user
- Company: UWaterloo CS
- Repositories: 7
- Profile: https://github.com/batmen-lab
GitHub Events
Total
- Push event: 11
Last Year
- Push event: 11