https://github.com/google-deepmind/nuclease_design

ML-guided enzyme engineering


Science Score: 39.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 5 DOI reference(s) in README
  • Academic publication links
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (12.0%) to scientific vocabulary

Keywords

engineering learning machine protein

Keywords from Contributors

archival projection profiles interactive sequences generic autograding hacking shellcodes modular
Last synced: 4 months ago

Repository

ML-guided enzyme engineering

Basic Info
  • Host: GitHub
  • Owner: google-deepmind
  • License: apache-2.0
  • Language: Jupyter Notebook
  • Default Branch: main
  • Homepage:
  • Size: 38.1 MB
Statistics
  • Stars: 64
  • Watchers: 10
  • Forks: 18
  • Open Issues: 3
  • Releases: 1
Topics
engineering learning machine protein
Created almost 2 years ago · Last pushed 9 months ago
Metadata Files
Readme · Contributing · License

README.md

ML-Guided Directed Evolution for Engineering a Better Nuclease Enzyme

This repository accompanies the paper: Engineering highly active and diverse nuclease enzymes by combining machine learning and ultra-high-throughput screening

Analyzing our enzyme activity dataset

You can use our dataset of estimated enzyme activity for 55,760 NucB variants to develop new machine learning models or to generate new insights about NucB.

A simple notebook is provided for loading and analyzing the data; a rough standalone sketch is also shown below.
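
If you prefer to work outside the notebook, the merged landscape file described under "Data" below can be read directly with pandas. This is a minimal sketch only: the bucket URL is a placeholder (the real bucket name is defined in the repository's data-loading helpers), and the column names are not guaranteed.

  import pandas as pd

  # Placeholder path: substitute the project's actual GCS bucket name, which is
  # defined in the repository's data-loading helper code.
  LANDSCAPE_URL = "https://storage.googleapis.com/YOUR-BUCKET-NAME/processed_data/landscape.csv"

  # Each row is a distinct amino acid sequence with a multi-class activity label.
  landscape_df = pd.read_csv(LANDSCAPE_URL)
  print(landscape_df.shape)
  print(landscape_df.columns.tolist())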

Reproducing the paper's analysis

All figures and tables in the paper can be reproduced by notebooks in notebooks/.

Each notebook can be run as-is, since it loads pre-computed enrichment factor data from GCS (see below). To regenerate the analysis from the raw NGS count data, run get_enrichment_factor_data.ipynb with LOCAL_OUTPUT_DATA_DIR set to a local path.

These notebooks, and the library code they call, can be used to dig deeper into our results or to provide a jumping-off point for creating your own genotype-phenotype dataset based on count data from high-throughput sorting.

Analyzing our libraries and models

Some useful starting points:

  • Analyze the hit rates of various library design methods (a rough sketch of this kind of comparison follows this list)
  • Analyze the diversity of hits from these libraries
  • Play with the CNN model used for the final round of sequence design
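
As an illustration of the first bullet above, the sketch below computes a per-method hit rate by joining the landscape table with the library_designs mapping. The column names ("sequence", "sublibrary_name", "activity_label") and the label value "improved" are hypothetical stand-ins; the repository's notebooks use the actual schema.

  import pandas as pd

  # Toy stand-ins for processed_data/landscape.csv and the library_designs
  # mapping; real column names and label values may differ.
  landscape_df = pd.DataFrame({
      "sequence": ["AAA", "AAC", "AAG"],
      "activity_label": ["non_functional", "improved", "improved"],
  })
  designs_df = pd.DataFrame({
      "sequence": ["AAA", "AAC", "AAG", "AAC"],
      "sublibrary_name": ["ml_design", "ml_design", "hit_recombination", "hit_recombination"],
  })

  # A sequence proposed by several methods counts toward each of them.
  merged = designs_df.merge(landscape_df, on="sequence", how="inner")
  hit_rates = (
      merged.assign(is_hit=merged["activity_label"] == "improved")
      .groupby("sublibrary_name")["is_hit"]
      .mean()
  )
  print(hit_rates)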

Data

All data is available in a Google Cloud Storage (GCS) bucket. We don't recommend downloading it directly; the notebooks above use helper functions that load it from the bucket.

The bucket contains the following sub-directories:

  • raw_count_data: raw NGS count data for pre-sort and post-sort populations.

  • processed_fiducial_data: enrichment factors for synonyms of various 'fiducial' sequences. Each row represents a distinct DNA sequence that translates to the same amino acid sequence as its fiducial.

  • processed_data: enrichment factors computed from the raw count data and the processed fiducial data. Each row represents a unique amino acid sequence. For each row and each fiducial, a p-value is assigned for observing the row's enrichment factor under the null distribution of enrichment factors from that fiducial's synonyms (a conceptual sketch of this calculation follows the list).

  • processed_data/landscape.csv: A single file that merges data from all 4 rounds of experiments and provides multi-class catalytic activity labels for 56K distinct amino acid sequences.

  • plate_data: Data from the low-throughput purified protein experiments used to confirm hits.

  • library_designs: A mapping from each amino acid sequence to the names of the sub-libraries (corresponding to different sequence design methods) that proposed it. Note that some sequences were proposed by multiple methods.

  • analysis: Data used for creating certain tables and results in the paper that require expensive computations, such as clustering hits in order to quantify diversity.

  • alignments: A multiple sequence alignment used to fit our VAE model.
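
For intuition about the enrichment factors and p-values described above, here is a conceptual sketch, not the repository's actual pipeline: an enrichment factor compares a sequence's post-sort frequency to its pre-sort frequency, and an empirical p-value asks how often a fiducial's synonymous DNA sequences reach at least that enrichment. The pseudocount and normalization choices are illustrative.

  import numpy as np

  def enrichment_factor(pre_count, post_count, pre_total, post_total, pseudocount=1.0):
      # Ratio of post-sort to pre-sort frequency for one sequence (illustrative).
      pre_freq = (pre_count + pseudocount) / pre_total
      post_freq = (post_count + pseudocount) / post_total
      return post_freq / pre_freq

  # Null distribution: enrichment factors observed for DNA synonyms of a fiducial.
  fiducial_efs = np.array([0.8, 0.9, 1.0, 1.05, 1.1, 1.2])

  # Empirical p-value: fraction of fiducial synonyms at least as enriched as the variant.
  variant_ef = enrichment_factor(pre_count=40, post_count=400, pre_total=1e6, post_total=1e6)
  p_value = float(np.mean(fiducial_efs >= variant_ef))
  print(variant_ef, p_value)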

Running unit tests

The notebooks install this package directly from GitHub, so no local installation is necessary to run them. However, you can install the package locally in order to run the tests, using the following commands:

Note that our package requires Python >= 3.10.

venv=/tmp/nuclease_design_venv
python3 -m venv $venv
source $venv/bin/activate
pip install -e .
python -m pytest nuclease_design/*test.py

Citing this work

Please cite the accompanying paper:

@article{thomasbelanger2024,
  author  = {Neil Thomas and David Belanger and Chenling Xu and Hanson Lee and Kat Hirano and Kosuke Iwai and Vanja Polic and Kendra D Nyberg and Kevin Hoff and Lucas Frenz and Charlie A Emrich and Jun W Kim and Mariya Chavarha and Abi Ramanan and Jeremy J Agresti and Lucy J Colwell},
  title   = {Engineering of highly active and diverse nuclease enzymes by combining machine learning and ultra-high-throughput screening},
  year    = {2024},
  doi     = {10.1101/2024.03.21.585615},
  journal = {bioRxiv}
}

License and disclaimer

Copyright 2023 DeepMind Technologies Limited

All software is licensed under the Apache License, Version 2.0 (Apache 2.0); you may not use this file except in compliance with the Apache 2.0 license. You may obtain a copy of the Apache 2.0 license at: https://www.apache.org/licenses/LICENSE-2.0

All other materials are licensed under the Creative Commons Attribution 4.0 International License (CC-BY). You may obtain a copy of the CC-BY license at: https://creativecommons.org/licenses/by/4.0/legalcode

Unless required by applicable law or agreed to in writing, all software and materials distributed here under the Apache 2.0 or CC-BY licenses are distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the licenses for the specific language governing permissions and limitations under those licenses.

This is not an official Google product.

Owner

  • Name: Google DeepMind
  • Login: google-deepmind
  • Kind: organization

GitHub Events

Total
  • Release event: 1
  • Watch event: 31
  • Issue comment event: 2
  • Push event: 5
  • Pull request review event: 2
  • Pull request review comment event: 3
  • Pull request event: 5
  • Fork event: 7
  • Create event: 5
Last Year
  • Release event: 1
  • Watch event: 31
  • Issue comment event: 2
  • Push event: 5
  • Pull request review event: 2
  • Pull request review comment event: 3
  • Pull request event: 5
  • Fork event: 7
  • Create event: 5

Committers

Last synced: 8 months ago

All Time
  • Total Commits: 65
  • Total Committers: 6
  • Avg Commits per committer: 10.833
  • Development Distribution Score (DDS): 0.446
Past Year
  • Commits: 6
  • Committers: 2
  • Avg Commits per committer: 3.0
  • Development Distribution Score (DDS): 0.167
Top Committers
  • Neil Thomas (t****l@g****m): 36 commits
  • David Belanger (d****r@g****m): 21 commits
  • Neil Thomas (n****s@g****m): 5 commits
  • dependabot[bot] (4****]): 1 commit
  • DeepMind (n****y@g****m): 1 commit
  • DeepMind (n****y@d****m): 1 commit

Issues and Pull Requests

Last synced: 8 months ago

All Time
  • Total issues: 2
  • Total pull requests: 32
  • Average time to close issues: 6 days
  • Average time to close pull requests: 1 day
  • Total issue authors: 2
  • Total pull request authors: 3
  • Average comments per issue: 0.5
  • Average comments per pull request: 0.34
  • Merged pull requests: 29
  • Bot issues: 0
  • Bot pull requests: 2
Past Year
  • Issues: 1
  • Pull requests: 7
  • Average time to close issues: N/A
  • Average time to close pull requests: 4 days
  • Issue authors: 1
  • Pull request authors: 3
  • Average comments per issue: 0.0
  • Average comments per pull request: 0.0
  • Merged pull requests: 5
  • Bot issues: 0
  • Bot pull requests: 1
Top Authors
Issue Authors
  • thomas-a-neil (1)
  • chaofan520 (1)
Pull Request Authors
  • thomas-a-neil (25)
  • davidBelanger (14)
  • dependabot[bot] (3)
Top Labels
Issue Labels
Pull Request Labels
  • dependencies (3)
  • python (1)

Dependencies

requirements.txt pypi
  • absl-py ==2.1.0
  • biopython ==1.83
  • ipykernel ==6.29.3
  • matplotlib ==3.8.3
  • numpy ==1.26.4
  • pandas ==2.2.1
  • pytest ==8.1.1
  • requests ==2.31.0
  • scipy ==1.12.0
  • seaborn ==0.13.2
  • statsmodels ==0.14.1
  • tensorflow ==2.15.0
setup.py pypi