Science Score: 67.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 1 DOI reference(s) in README
  • Academic publication links
    Links to: arxiv.org, ieee.org, zenodo.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (15.2%) to scientific vocabulary
Last synced: 6 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: schwallergroup
  • License: mit
  • Language: Jupyter Notebook
  • Default Branch: main
  • Size: 12.2 MB
Statistics
  • Stars: 4
  • Watchers: 0
  • Forks: 1
  • Open Issues: 0
  • Releases: 0
Created over 2 years ago · Last pushed almost 2 years ago
Metadata Files
Readme Contributing License Code of conduct Citation

README.md


Choriso logo



CHORISO (CHemical Organic ReactIon Smiles Omnibus) is a benchmarking suite for reaction prediction machine learning models.

We release:

It is derived from the CJHIF dataset. This repo provides all the code used for the dataset curation, splitting, and analysis reported in the paper, as well as the metrics used to evaluate models.
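Once downloaded, the processed dataset is a TSV file of reaction SMILES. A minimal sketch of inspecting it with pandas follows; the in-memory toy table and the column names (`canonical_rxn`, `split`) are illustrative assumptions, not the repo's documented schema:

```python
import io

import pandas as pd

# Toy stand-in for the processed TSV (e.g. data/processed/choriso.tsv);
# the real file's columns may differ.
tsv = io.StringIO(
    "canonical_rxn\tsplit\n"
    "CCO.CC(=O)O>>CCOC(C)=O\ttrain\n"
    "Brc1ccccc1.CO>>COc1ccccc1\ttest\n"
)
df = pd.read_csv(tsv, sep="\t")

# Reactions are stored as reaction SMILES; a split column marks the partition.
counts = df["split"].value_counts().to_dict()
print(counts)
```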


🚀 Installation

First clone this repo:

```bash
git clone https://github.com/schwallergroup/choriso.git
cd choriso
```

Set up and activate the environment:

```bash
conda env create -f environment.yml
conda activate choriso
pip install rxnmapper --no-deps
```

🔥 Quick start

To download the preprocessed dataset and split it into the corresponding train, validation, and test sets, run the following command:

```bash
choriso --download_processed \
    --run split
```

After executing a command from choriso-models, run the analysis of your model's results using:

```bash
analyse --results_folders='path/to/results/folder'
```

Results will be stored in the same directory as `benchmarking-results`.


🧠 Advanced usage

This repo lets you reproduce the results in the paper using different flags and modes.

📥 Download preprocessed dataset

```bash
choriso --download_processed \
    --out-dir data/processed/
```

⚙️ Preprocessing

Get the raw datasets (CJHIF, USPTO) and preprocess them. The `--uspto` flag runs the same processing pipeline on the raw USPTO data.

**NOTE: To run the `clean` step you need to have NameRXN (v3.4.0) installed.**

```bash
choriso --download_raw \
    --uspto \
    --data-dir=data/raw/ \
    --out-dir data/processed/ \
    --run clean \
    --run atom_map
```

🔍 Stereo check

For this step you need to have either downloaded the preprocessed dataset or run the preprocessing pipeline. This step checks reactions with stereochemistry issues and corrects the dataset.

```bash
choriso --run analysis
```

➗ Splitting

In the paper, we describe a splitting scheme that produces test splits by product, by product molecular weight, and at random. During splitting, all testing reactions go into a single test-set file, with the `split` column indicating which split each reaction belongs to. To run the splitting:

```bash
choriso --run split
```

By default, reactions with products below 150 a.m.u. go to the low-MW set and reactions with products above 700 a.m.u. go to the high-MW set. These values can be adapted to your preferences. For example, to create a split testing on low MW with a threshold of 100 a.m.u. and another on high MW with a threshold of 750 a.m.u., run:

```bash
choriso --run split \
    --low_mw=100 --high_mw=750
```

You can optionally augment the SMILES to double the size of the training set:

```bash
choriso --run split \
    --augment
```

By default, the splitting is done on the choriso dataset, `choriso.tsv`. If you want to split a different dataset, specify its path using the `--split_file_name` option. For example, to split the USPTO dataset, run:

```bash
choriso --run split \
    --split_file_name=uspto.tsv
```

📊 Logging

By default, every step stores all results locally. Optionally, you can log all results to W&B by passing the `--wandb_log` flag at any step. For example,

```bash
choriso --run clean \
    --wandb_log
```

will execute the cleaning step and upload all results (plots, metrics) to W&B.

📈 Metrics

You can also use the metrics implemented for the paper to evaluate your own results. We have adapted the evaluation pipeline to the files produced by the [benchmarking repo](https://github.com/schwallergroup/choriso-models). For example:

```bash
analyse --results_folders='OpenNMT_Transformer'
```

This launches the analysis on all files in the `OpenNMT_Transformer` folder. The output files should have the same structure as the example included in the benchmarking repo. By default, the program computes the chemistry metrics, which require a template with radius 0 and a template with radius 1 (these columns must be present in the test-set file).

Flagging individual reactions

You can use the metrics functions to check whether a specific reaction is regio- or stereoselective. For example:

```python
from choriso.metrics.selectivity import flag_regio_problem, flag_stereo_problem

regio_rxn = 'BrCc1ccccc1.C1CCOC1.C=CC(O)CO.[H-].[Na+]>>C=CC(O)COCc1ccccc1'
stereo_rxn = 'C=C(NC(C)=O)c1ccc(OC)cc1.ClCCl.[H][H].[Rh+]>>COc1ccc([C@@H](C)NC(C)=O)cc1'

print(flag_regio_problem(regio_rxn))
print(flag_stereo_problem(stereo_rxn))
```

The output displays the flagging labels:

```python
True
True
```
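The molecular-weight splitting scheme described above can be sketched as a simple threshold rule. This is a hedged illustration, not the repo's implementation: the function name `assign_mw_split` is hypothetical, and it assumes product molecular weights have already been computed (e.g. with RDKit).

```python
def assign_mw_split(product_mw, low_mw=150.0, high_mw=700.0):
    """Assign a reaction to a test split by its product's molecular weight.

    Mirrors the scheme described in the README: products under `low_mw` a.m.u.
    go to the low-MW set, products over `high_mw` a.m.u. go to the high-MW
    set, and everything in between stays in the regular pool.
    """
    if product_mw < low_mw:
        return "low_mw"
    if product_mw > high_mw:
        return "high_mw"
    return "regular"


print(assign_mw_split(120.4))  # small product: low-MW test set
print(assign_mw_split(845.0))  # large product: high-MW test set
print(assign_mw_split(310.2))  # mid-range product: regular pool
```

Passing `--low_mw`/`--high_mw` on the command line corresponds to changing the two keyword arguments here.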

👐 Contributing

Contributions, whether filing an issue, making a pull request, or forking, are appreciated. See CONTRIBUTING.md for more information on getting involved.

👋 Attribution

⚖️ License

The code in this package is licensed under the MIT License.

🍪 Cookiecutter

This package was created with @audreyfeldroy's cookiecutter package using @cthoyt's cookiecutter-snekpack template.

🛠️ For Developers

The final section of the README is for those who want to get involved by making a code contribution.

Development Installation

To install in development mode, use the following:

```bash
$ git clone https://github.com/schwallergroup/choriso.git
$ cd choriso
$ pip install -e .
```

🥼 Testing

After cloning the repository and installing `tox` with `pip install tox`, the unit tests in the `tests/` folder can be run reproducibly with:

```shell
$ tox
```

Additionally, these tests are automatically re-run with each commit in a [GitHub Action](https://github.com/schwallergroup/choriso/actions?query=workflow%3ATests).

📖 Building the Documentation

The documentation can be built locally using the following:

```shell
$ git clone https://github.com/schwallergroup/choriso.git
$ cd choriso
$ tox -e docs
$ open docs/build/html/index.html
```

The documentation automatically installs the package as well as the `docs` extra specified in [`setup.cfg`](setup.cfg). `sphinx` plugins like `texext` can be added there. Additionally, they need to be added to the `extensions` list in [`docs/source/conf.py`](docs/source/conf.py).

📦 Making a Release

After installing the package in development mode and installing `tox` with `pip install tox`, the commands for making a new release are contained within the `finish` environment in `tox.ini`. Run the following from the shell:

```shell
$ tox -e finish
```

This script does the following:

1. Uses [Bump2Version](https://github.com/c4urself/bump2version) to switch the version number in `setup.cfg`, `src/choriso/version.py`, and [`docs/source/conf.py`](docs/source/conf.py) so it no longer has the `-dev` suffix
2. Packages the code in both a tar archive and a wheel using [`build`](https://github.com/pypa/build)
3. Uploads to PyPI using [`twine`](https://github.com/pypa/twine). Be sure to have a `.pypirc` file configured to avoid manual input at this step
4. Pushes to GitHub. You'll need to make a release based on the commit where the version was bumped
5. Bumps the version to the next patch. If you made big changes and want to bump the minor version instead, run `tox -e bumpversion -- minor` afterwards

Citation (CITATION.cff)

cff-version: 1.0.2
message: "If you use this software, please cite it as below."
title: "ChORISO"
authors:
  - name: "Andres M Bran"
version: 0.0.1-dev
doi:
url: "https://github.com/schwallergroup/choriso"

GitHub Events

Total
  • Fork event: 1
Last Year
  • Fork event: 1

Dependencies

.github/workflows/tests.yml actions
  • actions/checkout v2 composite
  • actions/setup-python v2 composite
  • codecov/codecov-action v1 composite
environment.yml pypi
  • aiohttp ==3.8.6
  • aiosignal ==1.3.1
  • alabaster ==0.7.13
  • annotated-types ==0.5.0
  • anyio ==3.7.1
  • argon2-cffi ==23.1.0
  • argon2-cffi-bindings ==21.2.0
  • arrow ==1.3.0
  • asttokens ==2.2.1
  • async-lru ==2.0.4
  • async-timeout ==4.0.3
  • attrs ==23.1.0
  • babel ==2.12.1
  • backcall ==0.2.0
  • beautifulsoup4 ==4.12.2
  • bleach ==6.0.0
  • boto3 ==1.28.14
  • botocore ==1.31.14
  • brotli ==1.0.9
  • chempy ==0.8.3
  • cli-exit-tools ==1.2.6
  • cloudpickle ==2.2.1
  • cmake ==3.27.0
  • comm ==0.1.3
  • dask ==2023.5.0
  • dataclasses-json ==0.6.1
  • debugpy ==1.6.7
  • decorator ==5.1.1
  • defusedxml ==0.7.1
  • deprecated ==1.2.14
  • dill ==0.3.7
  • diskcache ==5.6.1
  • dnspython ==2.4.1
  • docker-pycreds ==0.4.0
  • docutils ==0.19
  • dot2tex ==2.11.3
  • einops ==0.6.1
  • executing ==1.2.0
  • fastjsonschema ==2.18.1
  • fqdn ==1.5.1
  • frozenlist ==1.4.0
  • fsspec ==2023.6.0
  • fuzzywuzzy ==0.18.0
  • gitdb ==4.0.10
  • gitpython ==3.1.32
  • greenlet ==3.0.0
  • huggingface-hub ==0.16.4
  • imagesize ==1.4.1
  • ipykernel ==6.25.0
  • ipython ==8.12.2
  • ipywidgets ==8.0.7
  • isoduration ==20.11.0
  • jedi ==0.18.2
  • jinja2 ==3.1.2
  • jmespath ==1.0.1
  • jpype1 ==1.4.1
  • json5 ==0.9.14
  • jsonpatch ==1.33
  • jsonpointer ==2.4
  • jsonschema ==4.19.2
  • jsonschema-specifications ==2023.7.1
  • jupyter-client ==8.3.0
  • jupyter-core ==5.3.1
  • jupyter-events ==0.8.0
  • jupyter-lsp ==2.2.0
  • jupyter-server ==2.9.1
  • jupyter-server-terminals ==0.4.4
  • jupyterlab ==4.0.8
  • jupyterlab-pygments ==0.2.2
  • jupyterlab-server ==2.25.0
  • jupyterlab-widgets ==3.0.8
  • langchain ==0.0.319
  • langsmith ==0.0.47
  • lib-detect-testenv ==2.0.8
  • lit ==16.0.6
  • locket ==1.0.0
  • markupsafe ==2.1.3
  • marshmallow ==3.20.1
  • matplotlib-inline ==0.1.6
  • metaflow ==2.9.11
  • mistune ==3.0.2
  • more-click ==0.1.2
  • more-itertools ==10.0.0
  • mpmath ==1.3.0
  • multidict ==6.0.4
  • multiprocess ==0.70.15
  • multivolumefile ==0.2.3
  • mypy-extensions ==1.0.0
  • nbclient ==0.8.0
  • nbconvert ==7.10.0
  • nbformat ==5.9.2
  • nest-asyncio ==1.5.7
  • networkx ==3.1
  • notebook ==7.0.6
  • notebook-shim ==0.2.3
  • nvidia-cublas-cu11 ==11.10.3.66
  • nvidia-cuda-cupti-cu11 ==11.7.101
  • nvidia-cuda-nvrtc-cu11 ==11.7.99
  • nvidia-cuda-runtime-cu11 ==11.7.99
  • nvidia-cudnn-cu11 ==8.5.0.96
  • nvidia-cufft-cu11 ==10.9.0.58
  • nvidia-curand-cu11 ==10.2.10.91
  • nvidia-cusolver-cu11 ==11.4.0.1
  • nvidia-cusparse-cu11 ==11.7.4.91
  • nvidia-nccl-cu11 ==2.14.3
  • nvidia-nvtx-cu11 ==11.7.91
  • overrides ==7.4.0
  • pandarallel ==1.6.5
  • pandocfilters ==1.5.0
  • parso ==0.8.3
  • partd ==1.4.0
  • pathtools ==0.1.2
  • pexpect ==4.8.0
  • pickleshare ==0.7.5
  • pkgutil-resolve-name ==1.3.10
  • prometheus-client ==0.18.0
  • prompt-toolkit ==3.0.39
  • protobuf ==4.23.4
  • psutil ==5.9.5
  • ptyprocess ==0.7.0
  • pubchempy ==1.0.4
  • pulp ==2.7.0
  • pure-eval ==0.2.2
  • py2opsin ==1.0.5
  • py7zr ==0.18.12
  • pybcj ==1.0.1
  • pycryptodomex ==3.18.0
  • pydantic ==2.1.1
  • pydantic-core ==2.4.0
  • pygments ==2.15.1
  • pymongo ==4.4.1
  • pyneqsys ==0.5.7
  • pyodesys ==0.14.2
  • pyppmd ==0.18.3
  • python-json-logger ==2.0.7
  • pyyaml ==5.4.1
  • pyzmq ==25.1.0
  • pyzstd ==0.15.9
  • quantities ==0.14.1
  • rdchiral ==1.1.0
  • rdkit ==2022.9.5
  • rdkit-pypi ==2022.9.1
  • reaction-utils ==1.2.0
  • referencing ==0.30.2
  • regex ==2023.6.3
  • rfc3339-validator ==0.1.4
  • rfc3986-validator ==0.1.1
  • rpds-py ==0.10.6
  • rxn-chem-utils ==1.1.5
  • rxn-utils ==1.1.11
  • rxnmapper ==0.3.0
  • s3transfer ==0.6.1
  • send2trash ==1.8.2
  • sentry-sdk ==1.28.1
  • setproctitle ==1.3.2
  • smmap ==5.0.0
  • sniffio ==1.3.0
  • snowballstemmer ==2.2.0
  • soupsieve ==2.5
  • sphinx ==5.3.0
  • sphinxcontrib-applehelp ==1.0.4
  • sphinxcontrib-devhelp ==1.0.2
  • sphinxcontrib-htmlhelp ==2.0.1
  • sphinxcontrib-jsmath ==1.0.1
  • sphinxcontrib-qthelp ==1.0.3
  • sphinxcontrib-serializinghtml ==1.1.5
  • sqlalchemy ==2.0.22
  • stack-data ==0.6.2
  • swifter ==1.3.5
  • sym ==0.3.5
  • sympy ==1.12
  • tenacity ==8.2.3
  • terminado ==0.17.1
  • texttable ==1.6.7
  • tinycss2 ==1.2.1
  • tokenizers ==0.12.1
  • toolz ==0.12.0
  • torch ==2.0.1
  • torchaudio ==2.0.2
  • torchvision ==0.15.2
  • traitlets ==5.9.0
  • transformers ==4.21.0
  • triton ==2.0.0
  • types-python-dateutil ==2.8.19.14
  • typing-inspect ==0.9.0
  • uri-template ==1.3.0
  • wandb ==0.15.7
  • wcwidth ==0.2.6
  • webcolors ==1.13
  • webencodings ==0.5.1
  • websocket-client ==1.6.4
  • widgetsnbextension ==4.0.8
  • wrapt ==1.15.0
  • wrapt-timeout-decorator ==1.4.0
  • xxhash ==2.0.2
  • yarl ==1.9.2
  • zipfile-deflate64 ==0.2.0
pyproject.toml pypi