bluephos

automated computational tool streamlining the development and analysis of blue phosphorescent materials

https://github.com/ssec-jhu/bluephos

Science Score: 62.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
    Links to: zenodo.org
  • Academic email domains
  • Institutional organization owner
    Organization ssec-jhu has institutional domain (ai.jhu.edu)
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (13.5%) to scientific vocabulary
Last synced: 6 months ago

Repository


Basic Info
  • Host: GitHub
  • Owner: ssec-jhu
  • License: bsd-3-clause
  • Language: Python
  • Default Branch: main
  • Homepage:
  • Size: 1.01 MB
Statistics
  • Stars: 2
  • Watchers: 5
  • Forks: 1
  • Open Issues: 4
  • Releases: 0
Created over 2 years ago · Last pushed 8 months ago
Metadata Files
Readme Contributing License Code of conduct Citation Codeowners Zenodo

README.md

SSEC-JHU bluephos



BluePhos: An automated pipeline optimizing the synthesis and analysis of blue phosphorescent materials.

BluePhos Pipeline Introduction

Overview

The BluePhos pipeline is an automated computational tool that streamlines the development and analysis of blue phosphorescent materials. It combines computational chemistry with machine learning to predict and optimize the properties of candidate compounds for light-emitting technologies.

Workflow Evolution

The BluePhos pipeline functions like an automated assembly line: a structured yet adaptable workflow distributes tasks across computing resources, optimizing batch processing and resource allocation while processing molecules individually.

The current version of the pipeline comprises the following sequential tasks:

  • Ligand Generation Task: Ingests aromatic boronic acids and aromatic halides and generates candidate ligand molecules via Suzuki coupling reactions.
  • SMILES to SDF Conversion Task: Converts molecular structures encoded as SMILES strings into SDF files, enabling downstream manipulation of the chemical data.
  • Neural Network (NN) Task: Extracts and engineers features from each ligand, then passes them through a trained graph neural network to predict the ligand's z-score, an indicator of synthetic potential.
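The Ligand Generation Task above can be sketched as a simple combinatorial screen: every halide is paired with every boronic acid, mimicking a Suzuki coupling library. This is an illustrative sketch only, not the actual BluePhos implementation; the SMILES strings below are arbitrary examples.

```python
# Hypothetical sketch of ligand-pair enumeration: pair each aromatic
# halide with each aromatic boronic acid (a Suzuki coupling screen).
from itertools import product

halides = ["c1ccc(Br)cc1", "Brc1ccncc1"]        # aromatic halides (SMILES)
acids = ["OB(O)c1ccccc1", "OB(O)c1ccc(F)cc1"]   # aromatic boronic acids

# Cartesian product: every halide x acid combination is a candidate pair.
pairs = [(h, a) for h, a in product(halides, acids)]
print(len(pairs))  # 4 candidate ligand pairs
```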

Planned enhancements include:

  • Geometry Optimization Task: Will optimize molecular geometries, ensuring that the ligands adopt energetically favorable conformations.
  • Density Functional Theory (DFT) Calculation Task: Set to apply DFT calculations to optimized geometries for in-depth quantum mechanical insights into the ligands' electronic properties.

Setup and Installation

Step 1: Clone the GitHub Repository

  • git clone https://github.com/ssec-jhu/bluephos.git

Step 2: Set Up the Runtime Environment

Navigate to the bluephos directory, then create and activate the Conda environment:

```shell
cd bluephos
conda env create -f blueenv.yml
conda activate blue_env
```

Running the Pipeline

Usage

  • python bluephos_pipeline.py [options]

Command-Line Arguments

| Argument         | Required | Type   | Default | Description                                                                                                                  |
|------------------|----------|--------|---------|------------------------------------------------------------------------------------------------------------------------------|
| --halides        | No       | String | None    | Path to the CSV file containing halides data. Required when no input directory or ligand SMILES CSV file is specified.         |
| --acids          | No       | String | None    | Path to the CSV file containing boronic acids data. Required when no input directory or ligand SMILES CSV file is specified.   |
| --features       | Yes      | String | None    | Path to the element feature file used for neural network predictions.                                                          |
| --train          | Yes      | String | None    | Path to the train stats file used to normalize input data.                                                                     |
| --weights        | Yes      | String | None    | Path to the full energy model weights file for the neural network.                                                             |
| --input_dir      | No       | String | None    | Directory containing input parquet files for rerun mode. Used when mode 3 is not specified.                                    |
| --out-dir        | No       | String | None    | Directory where the pipeline's output files will be saved. If not specified, defaults to the current directory.                |
| --t_nn           | No       | Float  | 1.5     | Threshold for the neural network 'z' score. Candidates with an absolute 'z' score below this threshold will be considered.     |
| --t_ste          | No       | Float  | 1.9     | Threshold for 'ste' (singlet-triplet energy gap). Candidates with an absolute 'ste' value below this threshold will be considered. |
| --t_dft          | No       | Float  | 2.0     | Threshold for 'dft' (DFT energy difference). Candidates with an absolute 'dft' value below this threshold will be considered.  |
| --ligand_smiles  | No       | String | None    | Path to the CSV file containing ligand SMILES data. If provided, mode 3 is used.                                               |
| --no_xtb         | No       | Bool   | True    | Flag to disable xTB optimization. If the flag is omitted, xTB optimization is enabled.                                         |
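The three thresholds act as successive filters: a candidate survives a stage only if the absolute value of that stage's score is below the corresponding threshold. A minimal sketch of that logic, assuming the defaults from the table above (the function name and signature are illustrative, not the BluePhos API):

```python
# Hypothetical illustration of the threshold filters; defaults follow
# the README table. A value of None means that stage has not run yet.
def keep_candidate(z=None, ste=None, dft=None,
                   t_nn=1.5, t_ste=1.9, t_dft=2.0):
    """Return True if every available score is below its threshold."""
    checks = [(z, t_nn), (ste, t_ste), (dft, t_dft)]
    return all(abs(value) < threshold
               for value, threshold in checks if value is not None)

print(keep_candidate(z=0.8, ste=1.2))  # True: both scores pass
print(keep_candidate(z=2.1))           # False: |z| exceeds t_nn
```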

The BluePhos Discovery Pipeline now supports three modes of input data:

  1. Generate data from Halides and Acids CSV files: This mode is used when no input directory or ligand SMILES CSV file is specified. It generates ligand pairs from the provided halides and acids CSV files.
  2. Rerun data from parquet files: This mode is used when an input directory is specified. It reruns the pipeline using existing parquet files for ligand data.
  3. Input data from a ligand SMILES CSV file: This mode is prioritized if a ligand SMILES CSV file is provided. It directly processes ligands from the SMILES data.

  The priority order for these modes is 3 > 2 > 1, meaning:

  • If a ligand SMILES CSV file (--ligand_smiles) is provided, the pipeline operates in mode 3.
  • If an input directory (--input_dir) is specified and no ligand SMILES CSV file is provided, the pipeline operates in mode 2.
  • If neither a ligand SMILES CSV file nor an input directory is provided, the pipeline defaults to mode 1.
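The mode-selection rule above can be sketched in a few lines. This is an illustrative reimplementation of the documented priority order, not the actual BluePhos source; parameter names mirror the CLI flags.

```python
# Hypothetical sketch of input-mode selection with priority 3 > 2 > 1.
def select_mode(ligand_smiles=None, input_dir=None, halides=None, acids=None):
    """Return the pipeline input mode implied by the provided arguments."""
    if ligand_smiles is not None:
        return 3  # direct ligand SMILES CSV input
    if input_dir is not None:
        return 2  # rerun from existing parquet files
    if halides is not None and acids is not None:
        return 1  # generate ligand pairs from halides and acids
    raise ValueError(
        "Provide --ligand_smiles, --input_dir, or both --halides and --acids"
    )

print(select_mode(ligand_smiles="ligands.csv"))  # 3, even if others are set
```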

Example Commands

  1. Generating Ligand Pairs and Running the Full Pipeline (Mode 1)
     To generate ligand pairs from halides and acids files and run the full pipeline, specify the paths to the halides and acids files:
     python bluephos_pipeline.py --halides path/to/halides.csv --acids path/to/acids.csv --features path/to/features.csv --train path/to/train_stats.csv --weights path/to/model_weights.h5
  2. Rerunning the Pipeline with Existing Parquet Files (Mode 2)
     If you have already run the pipeline and want to rerun it to refilter or recalculate ligands based on previous results:
     python bluephos_pipeline.py --input_dir path/to/parquet_directory --features path/to/features.csv --train path/to/train_stats.csv --weights path/to/model_weights.h5
  3. Using a Ligand SMILES CSV File (Mode 3)
     python bluephos_pipeline.py --ligand_smiles path/to/ligand_smiles.csv --features path/to/features.csv --train path/to/train_stats.csv --weights path/to/model_weights.h5
  4. Specifying Different Thresholds for NN and STE
     Adjust the thresholds for the neural network 'z' score and the singlet-triplet energy gap ('ste') as needed:
     python bluephos_pipeline.py --halides path/to/halides.csv --acids path/to/acids.csv --features path/to/features.csv --train path/to/train_stats.csv --weights path/to/model_weights.h5 --t_nn 2.0 --t_ste 2.5
  5. Using a Different DFT Package
     By default, the pipeline uses the ORCA DFT package, but you can switch to ASE (to be implemented later) if preferred:
     python bluephos_pipeline.py --halides path/to/halides.csv --acids path/to/acids.csv --features path/to/features.csv --train path/to/train_stats.csv --weights path/to/model_weights.h5 --dft_package ase
  6. Disabling xTB Optimization
     By default, the geometry optimization task uses the xTB package. You can disable it by running:
     python bluephos_pipeline.py --halides path/to/halides.csv --acids path/to/acids.csv --features path/to/features.csv --train path/to/train_stats.csv --weights path/to/model_weights.h5 --no_xtb

Execute the BluePhos pipeline within a tox environment for a consistent and reproducible setup:

  • tox -e run-pipeline -- --halide /path/to/aromatic_halides.csv --acid /path/to/aromatic_boronic_acids.csv --feature /path/to/element_features.csv --train /path/to/train_stats.csv --weight /path/to/model_weights.pt -o /path/to/output_dir/

Replace /path/to/... with the actual paths to your datasets and parameter files.

Example Usage with Test Data

To run the pipeline using example data provided in the repository:

  • tox -e run-pipeline -- --halide ./tests/input/aromatic_halides_with_id.csv --acid ./tests/input/aromatic_boronic_acids_with_id.csv --feature ./bluephos/parameters/element_features.csv --train ./bluephos/parameters/train_stats.csv --weight ./bluephos/parameters/full_energy_model_weights.pt -o .

This command uses test data to demonstrate the pipeline's functionality, ideal for initial testing and familiarization.

Result

Note:

  • The default output (-o or --output) dataframe is stored in Parquet format for its efficient storage, faster data access, and support for complex data structures.
  • The pipeline's results are organized by task, with filtered-out data stored in task-specific subdirectories within the /output directory. For example:
    • The filtered-out data from the NN task is stored in /NNfilterout.
    • For the xTB task, the filtered-out data is saved in /XTBfilterout.
    • For the final DFT task, the results are split into two directories: /DFTfilterin for filtered-in data and /DFTfilterout for filtered-out data.
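Since each task writes its results as one or more parquet files, a whole subdirectory can be loaded into a single dataframe. A minimal sketch with pandas, assuming the /output/NNfilterout layout described above (adjust the path to your actual output directory):

```python
# Hypothetical example: load every parquet file from one output
# subdirectory into a single dataframe. The directory name follows
# the README; an empty or missing directory yields an empty frame.
from pathlib import Path
import pandas as pd

paths = sorted(Path("output/NNfilterout").glob("*.parquet"))
frames = [pd.read_parquet(p) for p in paths]
df = pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()
print(f"Loaded {len(df)} filtered-out rows from {len(paths)} files")
```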

The Parquet file can be accessed in several ways:

Using Pandas

Pandas can be used to read and analyze Parquet files:

```python
import pandas as pd

df = pd.read_parquet('08ca147e-f618-11ee-b38f-eab1f408aca3-8.parquet')
print(df.describe())
```

Using DuckDB

DuckDB provides an efficient way to query Parquet files directly using SQL syntax:

```python
import duckdb as ddb

query_result = ddb.query(
    """SELECT * FROM '08ca147e-f618-11ee-b38f-eab1f408aca3-8.parquet' LIMIT 10"""
)
print(query_result.to_df())
```

Contributing

We welcome contributions! Please see our CONTRIBUTING.md for guidelines on how to contribute to this project.

Owner

  • Name: Scientific Software Engineering Center at JHU
  • Login: ssec-jhu
  • Kind: organization
  • Email: ssec@jhu.edu
  • Location: United States of America

Accelerating Software Development for Science Research

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software in your work, please cite it using the following metadata."
authors:
- family-names: "Hunter"
  given-names: "Edward"
  orcid: "https://orcid.org/0000-0002-6876-001X"
- family-names: "Noss"
  given-names: "James"
  orcid: "https://orcid.org/0000-0002-0922-5770"
- family-names: "Kluzner"
  given-names: "Vladimir"
  orcid: "https://orcid.org/0009-0000-5844-661X"
- family-names: "Lemson"
  given-names: "Gerard"
  orcid: "https://orcid.org/0000-0001-5041-2458"
- family-names: "Mitschang"
  given-names: "Arik"
  orcid: "https://orcid.org/0000-0001-9239-012X"
- family-names: "Chen"
  given-names: "Xiang"
  orcid: "https://orcid.org/0009-0003-6402-9822"
- family-names: "Abbasinejad"
  given-names: "Fatemeh"
  orcid: "https://orcid.org/0009-0006-3239-7112"
title: "bluephos"
version: 0.0.1
doi: <insert zenodo DOI>
date-released: 2023-01-01
url: "https://github.com/ssec-jhu/bluephos"

GitHub Events

Total
  • Issues event: 9
  • Watch event: 1
  • Delete event: 13
  • Issue comment event: 24
  • Push event: 28
  • Pull request review comment event: 3
  • Pull request review event: 28
  • Pull request event: 47
  • Create event: 24
Last Year
  • Issues event: 9
  • Watch event: 1
  • Delete event: 13
  • Issue comment event: 24
  • Push event: 28
  • Pull request review comment event: 3
  • Pull request review event: 28
  • Pull request event: 47
  • Create event: 24

Dependencies

.github/workflows/ci.yml actions
  • actions/checkout v3 composite
  • actions/setup-python v4 composite
  • docker/build-push-action f2a1d5e99d037542a71f64918e516c093c6f3fc4 composite
  • docker/login-action 65b78e6e13532edd9afa3aa52ac7964289d1a9c1 composite
  • docker/metadata-action 9ec57ed1fcdbf14dcef7dfbe97b2010124a938b7 composite
.github/workflows/dist.yml actions
  • actions/checkout v4 composite
  • actions/download-artifact v3 composite
  • actions/setup-python v4 composite
  • actions/upload-artifact v3 composite
.github/workflows/security.yml actions
  • actions/checkout v3 composite
  • actions/setup-python v4 composite
Dockerfile docker
  • continuumio/miniconda3 latest build
pipeline-example/Dockerfile docker
  • continuumio/miniconda3 latest build
pyproject.toml pypi
  • fastapi [all]
requirements/build.txt pypi
  • bandit ==1.7.8
  • build ==1.0.3
  • pytest ==8.0.2
  • pytest-cov ==4.1.0
  • ruff ==0.1.9
requirements/dev.txt pypi
  • tox ==4.12.1 development
requirements/docs.txt pypi
  • nbsphinx ==0.9.3
  • sphinx ==7.2.6
  • sphinx-automodapi ==0.17.0
  • sphinx-issues ==4.0.0
  • sphinx_book_theme ==1.1.2
  • sphinx_rtd_theme ==2.0.0
requirements/prd.txt pypi
  • ase ==3.22.1
  • dplutils ==0.5.2
  • fastapi ==0.110.0
  • pandas <2.2
  • ray ==2.9.3
  • rdkit_pypi ==2022.9.5
  • torch ==2.2.0
  • torch_geometric ==2.5.2
  • uvicorn ==0.29.0
requirements/test.txt pypi
  • bandit ==1.7.8 test
  • pytest ==8.0.2 test
  • pytest-cov ==4.1.0 test
  • ruff ==0.3.4 test