chemcpa
Code for "Predicting Cellular Responses to Novel Drug Perturbations at a Single-Cell Resolution", NeurIPS 2022.
Repository
Basic Info
- Host: GitHub
- Owner: theislab
- License: mit
- Language: Jupyter Notebook
- Default Branch: main
- Homepage: https://arxiv.org/abs/2204.13545
- Size: 234 MB
Statistics
- Stars: 123
- Watchers: 3
- Forks: 30
- Open Issues: 7
- Releases: 0
Metadata Files
README.md
Predicting Cellular Responses to Novel Drug Perturbations at a Single-Cell Resolution
Code accompanying the NeurIPS 2022 paper (PDF).

Our talk on chemCPA at the M2D2 reading club is available here.
A previous version of this work was a spotlight paper at ICLR MLDD 2022.
Code for this previous version can be found under the v1.0 git tag.
Codebase overview
- chemCPA/: contains the code for the model, the data, and the training loop.
- embeddings: There is one folder for each molecular embedding model we benchmarked. Each contains an environment.yml with dependencies. We generated the embeddings using the provided notebooks and saved them to disk, to load them during the main training loop.
- experiments: Each folder contains a README.md with the experiment description, a .yaml file with the seml configuration, and a notebook to analyze the results.
- notebooks: Example analysis notebooks.
- preprocessing: Notebooks for processing the data. For each dataset there is one notebook that loads the raw data.
- tests: A few very basic tests.
All experiments were run through seml.
The entry function is ExperimentWrapper.__init__ in chemCPA/seml_sweep_icb.py.
For convenience, we provide a script to run experiments manually for debugging purposes at chemCPA/manual_seml_sweep.py.
The script expects a manual_run.yaml file containing the experiment configuration.
All notebooks also exist as Python scripts (converted through jupytext) to make them easier to review.
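Because every notebook is stated to have a jupytext-converted script, a quick consistency check can flag notebooks whose script is missing. The following is a minimal sketch and not part of the repository: the helper name `notebooks_without_script` is hypothetical, and it assumes the paired script sits next to the notebook with the same stem and a `.py` suffix.

```python
from pathlib import Path


def notebooks_without_script(root):
    """Return notebooks under `root` that have no sibling .py script.

    Assumption: the jupytext script lives next to the notebook and
    shares its file stem (e.g. analysis.ipynb <-> analysis.py).
    """
    missing = []
    for notebook in sorted(Path(root).rglob("*.ipynb")):
        # Skip Jupyter's autosave copies.
        if ".ipynb_checkpoints" in notebook.parts:
            continue
        if not notebook.with_suffix(".py").exists():
            missing.append(notebook)
    return missing
```

Calling `notebooks_without_script("notebooks")` from the repository root would return the notebooks that still lack a paired script, if the pairing follows the layout assumed here.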
Getting started
Environment
The easiest way to get started is to use the Docker image we provide:
docker run -it -p 8888:8888 --platform=linux/amd64 registry.hf.space/b1ro-chemcpa:latest
This image contains the source code and all dependencies needed to run the experiments.
By default, it runs a Jupyter server on port 8888.
Alternatively, you may clone this repository and set up your own environment by running:
conda env create -f environment.yml
pip install -e .
Datasets
The datasets are not included in the Docker image, but they are downloaded automatically when you run the notebooks that require them. Alternatively, the datasets may be downloaded manually using the Python tool at raw_data/dataset.py. Usage is:
python raw_data/dataset.py --list
python raw_data/dataset.py --dataset <dataset_name>
or you may use the following links:
- weight checkpoints
- hyperparameter configuration
- raw datasets
- processed datasets
- embeddings
Some of the notebooks use a drugbank_all.csv file, which can be downloaded from here (registration needed).
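The manual download commands shown above can also be scripted. The sketch below only wraps the two documented invocations (`--list` and `--dataset <dataset_name>`); the helper names `dataset_command` and `fetch_datasets` are hypothetical, and it assumes the script is called from the repository root.

```python
import subprocess
import sys


def dataset_command(name=None, script="raw_data/dataset.py"):
    """Build the argv for the dataset tool: list datasets, or fetch one by name."""
    command = [sys.executable, script]
    if name is None:
        command.append("--list")
    else:
        command += ["--dataset", name]
    return command


def fetch_datasets(names):
    """Download each named dataset in turn, stopping at the first failure."""
    for name in names:
        subprocess.run(dataset_command(name), check=True)
```

Running `subprocess.run(dataset_command(), check=True)` first prints the available dataset names, which can then be passed to `fetch_datasets`.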
Data preparation
To train the models, the raw data must first be processed.
This can be done by running the notebooks inside the preprocessing/ folder in sequential order.
Alternatively, you may run
python preprocessing/run_notebooks.py
A description of the preprocessing steps is given in the preprocessing/README.md file and in the headers
of individual notebooks. Section 4 of the paper is also highly relevant.
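For readers who want to see what running the notebooks in sequence amounts to, here is a minimal sketch. It is not the contents of preprocessing/run_notebooks.py, which may work differently; it assumes that sorted filename order matches the intended order and that `jupyter nbconvert` is installed. The function names are hypothetical.

```python
import subprocess
from pathlib import Path


def notebook_commands(folder="preprocessing"):
    """Return one `jupyter nbconvert` command per notebook, in sorted filename order.

    Assumption: sorted filename order is the intended sequential order.
    """
    notebooks = sorted(Path(folder).glob("*.ipynb"))
    return [
        ["jupyter", "nbconvert", "--to", "notebook", "--execute", "--inplace", str(nb)]
        for nb in notebooks
    ]


def run_all(folder="preprocessing"):
    """Execute the notebooks one after another, stopping at the first failure."""
    for command in notebook_commands(folder):
        subprocess.run(command, check=True)
```

Stopping at the first failure (`check=True`) matters here because later notebooks presumably depend on files written by earlier ones.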
Training the models
Run
python chemCPA/train_hydra.py
Citation
You can cite our work as:
@inproceedings{hetzel2022predicting,
title={Predicting Cellular Responses to Novel Drug Perturbations at a Single-Cell Resolution},
author={Hetzel, Leon and Böhm, Simon and Kilbertus, Niki and Günnemann, Stephan and Lotfollahi, Mohammad and Theis, Fabian J},
booktitle={NeurIPS 2022},
year={2022}
}
Owner
- Name: Theis Lab
- Login: theislab
- Kind: organization
- Email: icb.office@helmholtz-muenchen.de
- Location: Munich
- Website: https://www.helmholtz-muenchen.de/icb/
- Repositories: 213
- Profile: https://github.com/theislab
Institute of Computational Biology
Citation (CITATION.cff)
# This CITATION.cff file was generated with cffinit.
# Visit https://bit.ly/cffinit to generate yours today!
cff-version: 1.2.0
title: >-
  Predicting Cellular Responses to Novel Drug
  Perturbations at a Single-Cell Resolution
message: >-
  If you use this software, please cite it using the
  metadata from this file.
type: software
authors:
  - given-names: Leon
    family-names: Hetzel
  - given-names: Simon
    family-names: Boehm
  - given-names: Niki
    family-names: Kilbertus
  - given-names: Stephan
    family-names: Günnemann
  - given-names: Mohammad
    family-names: Lotfollahi
  - given-names: Fabian
    name-particle: J
    family-names: Theis
identifiers:
  - type: url
    value: 'https://neurips.cc/virtual/2022/poster/53227'
repository-code: 'https://github.com/theislab/chemCPA'
abstract: >+
  Single-cell transcriptomics enabled the study of
  cellular heterogeneity in response to perturbations
  at the resolution of individual cells. However,
  scaling high-throughput screens (HTSs) to measure
  cellular responses for many drugs remains a
  challenge due to technical limitations and, more
  importantly, the cost of such multiplexed
  experiments. Thus, transferring information from
  routinely performed bulk RNA HTS is required to
  enrich single-cell data meaningfully. We introduce
  chemCPA, a new encoder-decoder architecture to
  study the perturbational effects of unseen drugs.
  We combine the model with an architecture surgery
  for transfer learning and demonstrate how training
  on existing bulk RNA HTS datasets can improve
  generalisation performance. Better generalisation
  reduces the need for extensive and costly screens
  at single-cell resolution. We envision that our
  proposed method will facilitate more efficient
  experiment designs through its ability to generate
  in-silico hypotheses, ultimately accelerating drug
  discovery.
keywords:
  - transfer learning
  - disentanglement
  - perturbation
  - single cell
  - genomics
  - Drug Discovery
  - unsupervised
GitHub Events
Total
- Issues event: 18
- Watch event: 19
- Member event: 1
- Issue comment event: 9
- Push event: 8
- Pull request event: 2
- Fork event: 6
- Create event: 2
Last Year
- Issues event: 18
- Watch event: 19
- Member event: 1
- Issue comment event: 9
- Push event: 8
- Pull request event: 2
- Fork event: 6
- Create event: 2
Issues and Pull Requests
Last synced: 4 months ago
All Time
- Total issues: 14
- Total pull requests: 1
- Average time to close issues: 12 months
- Average time to close pull requests: 1 minute
- Total issue authors: 9
- Total pull request authors: 1
- Average comments per issue: 1.79
- Average comments per pull request: 1.0
- Merged pull requests: 1
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 5
- Pull requests: 1
- Average time to close issues: about 24 hours
- Average time to close pull requests: 1 minute
- Issue authors: 2
- Pull request authors: 1
- Average comments per issue: 0.0
- Average comments per pull request: 1.0
- Merged pull requests: 1
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- bhomass (9)
- hraeder41 (3)
- sepidism (2)
- ChangxiChi (2)
- xianglin226 (1)
- xDogBaby (1)
- ArturDev42 (1)
- wangyucheng1234 (1)
- Tigerrr07 (1)
- ceesu (1)
- rvinas (1)
Pull Request Authors
- MxMstrmn (1)
Dependencies
- boost 1.68.0
- boost-cpp 1.68.0
- descriptastorus 2.2.0
- jupyter
- numpy 1.16.4
- numpy-base 1.16.4
- pandas 0.25.0
- pyarrow
- python 3.6.8
- pytorch 1.1.0
- rdkit 2019.03.4.0
- readline 7.0
- scanpy
- scikit-learn 0.21.2
- scipy 1.3.0
- tensorboard 1.13.1
- torchvision 0.3.0
- tqdm 4.32.1
- typing 3.6.4
- cudatoolkit 10.2.*
- dgl-cuda10.2
- jupyter
- pip
- pyarrow
- python
- pytorch
- rdkit 2018.09.3.*
- seml
- tqdm
- boost =1.68.0=py36h8619c78_1001
- boost-cpp =1.68.0=h11c811c_1000
- descriptastorus =2.2.0=py_0
- numpy =1.16.4=py36h7e9f1db_0
- numpy-base =1.16.4=py36hde5b4d6_0
- pandas =0.25.0=py36hb3f55d8_0
- python =3.6.8=h0371630_0
- pytorch =1.1.0=py3.6_cuda9.0.176_cudnn7.5.1_0
- rdkit =2019.03.4.0=py36hc20afe1_1
- readline =7.0=h7b6447c_5
- scikit-learn =0.21.2=py36hcdab131_1
- scipy =1.3.0=py36h921218d_1
- tensorboard =1.13.1=py36_0
- torchvision =0.3.0=py36_cu9.0.176_1
- tqdm =4.32.1=py_0
- typing =3.6.4=py36_0
- deepchem 2.5.0
- jupyter
- pandas
- pip
- pyarrow
- rdkit
- adjusttext
- bokeh
- colorcet
- cudatoolkit 11.3.*
- datashader
- deepchem
- dgl-cuda11.3
- h5py <3.2
- holoviews
- jupyter
- jupytext
- matplotlib
- numpy
- pandas
- pip
- pre-commit
- py-spy
- pyarrow
- pytest
- python 3.7.*
- pytorch
- rdkit 2021.09.2.*
- scanpy
- scikit-image
- scikit-learn
- scipy
- seaborn
- seml
- submitit
- torchmetrics
- umap-learn