bluephos
automated computational tool streamlining the development and analysis of blue phosphorescent materials
Science Score: 62.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: Found CITATION.cff file
- ✓ codemeta.json file: Found codemeta.json file
- ✓ .zenodo.json file: Found .zenodo.json file
- ○ DOI references
- ✓ Academic publication links: Links to zenodo.org
- ○ Academic email domains
- ✓ Institutional organization owner: Organization ssec-jhu has institutional domain (ai.jhu.edu)
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: Low similarity (13.5%) to scientific vocabulary
Repository
automated computational tool streamlining the development and analysis of blue phosphorescent materials
Basic Info
Statistics
- Stars: 2
- Watchers: 5
- Forks: 1
- Open Issues: 4
- Releases: 0
Metadata Files
README.md
SSEC-JHU bluephos

BluePhos: An automated pipeline optimizing the synthesis and analysis of blue phosphorescent materials.
BluePhos Pipeline Introduction
Overview
The BluePhos pipeline is an automated computational tool that streamlines the development and analysis of blue phosphorescent materials. It combines computational chemistry with machine learning to predict and refine the properties of key compounds used in light-emitting technologies.
Workflow Evolution
The BluePhos pipeline functions like an automated assembly line, with a structured yet adaptable workflow that distributes tasks efficiently across computing resources. It optimizes batch processing and resource allocation, processing molecules individually for streamlined operation.
The current version of the pipeline comprises the following sequential tasks:
- Ligand Generation Task: It commences by ingesting aromatic boronic acids and aromatic halides, generating ligand molecules via Suzuki coupling reactions.
- SMILES to SDF Conversion Task: Molecular structures encoded in SMILES strings are converted into SDF files, facilitating in-depth chemical data manipulation.
- Neural Network (NN) Task: This phase involves the extraction and engineering of features from each ligand. These features are processed through a trained graph neural network to predict the ligand's z-score, indicative of synthetic potential.
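As a rough illustration of the first stage, the sketch below enumerates every halide/acid combination that a ligand generation step would have to consider. It only builds the pairs; the Suzuki coupling chemistry itself is not reproduced here. The function name `enumerate_ligand_pairs`, the record keys, and the example IDs and SMILES strings are assumptions made for this example and are not taken from the BluePhos code.

```python
from itertools import product


def enumerate_ligand_pairs(halides, acids):
    """Return one record per (halide, acid) combination.

    `halides` and `acids` are lists of (id, smiles) tuples. The record keys
    are hypothetical and chosen only for this illustration.
    """
    pairs = []
    for (halide_id, halide_smiles), (acid_id, acid_smiles) in product(halides, acids):
        pairs.append(
            {
                "halide_id": halide_id,
                "halide_smiles": halide_smiles,
                "acid_id": acid_id,
                "acid_smiles": acid_smiles,
            }
        )
    return pairs


# Illustrative inputs (not from the repository's test data).
example_halides = [("H1", "Brc1ccccc1"), ("H2", "Brc1ccccn1")]
example_acids = [("A1", "OB(O)c1ccccc1"), ("A2", "OB(O)c1ccc(F)cc1")]

example_pairs = enumerate_ligand_pairs(example_halides, example_acids)
for pair in example_pairs:
    print(pair["halide_id"], pair["acid_id"])
```

Two halides and two acids give four candidate pairs, so the number of ligands grows with the product of the two input lists.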
Planned enhancements include:
- Geometry Optimization Task: Optimizes molecular geometries so that the ligands adopt energetically favorable conformations.
- Density Functional Theory (DFT) Calculation Task: Applies DFT calculations to the optimized geometries for in-depth quantum mechanical insights into the ligands' electronic properties.
Setup and Installation
Step 1: Clone the GitHub Repository
```
git clone https://github.com/ssec-jhu/bluephos.git
```
Step 2: Set Up the Runtime Environment
Navigate to the bluephos directory, create the Conda environment from blueenv.yml, and activate it:
```
cd bluephos
conda env create -f blueenv.yml
conda activate blue_env
```
Running the Pipeline
Usage
```
python bluephos_pipeline.py [options]
```
Command-Line Arguments
| Argument | Required | Type | Default | Description |
|---|---|---|---|---|
| --halides | No | String | None | Path to the CSV file containing halides data. Required when no input directory or ligand SMILES CSV file is specified. |
| --acids | No | String | None | Path to the CSV file containing boronic acids data. Required when no input directory or ligand SMILES CSV file is specified. |
| --features | Yes | String | None | Path to the element feature file used for neural network predictions. |
| --train | Yes | String | None | Path to the train stats file used to normalize input data. |
| --weights | Yes | String | None | Path to the full energy model weights file for the neural network. |
| --input_dir | No | String | None | Directory containing input parquet files for rerun mode. Used only when mode 3 is not selected. |
| --out-dir | No | String | None | Directory where the pipeline's output files will be saved. If not specified, defaults to the current directory. |
| --t_nn | No | Float | 1.5 | Threshold for the neural network 'z' score. Candidates with an absolute 'z' score below this threshold will be considered. |
| --t_ste | No | Float | 1.9 | Threshold for 'ste' (singlet-triplet energy gap). Candidates with an absolute 'ste' value below this threshold will be considered. |
| --tdft | No | Float | 2.0 | Threshold for 'dft' (dftenergydiff). Candidates with an absolute 'dft' value below this threshold will be considered. |
| --ligand_smiles | No | String | None | Path to the CSV file containing ligand SMILES data. If provided, mode 3 is used. |
| --no_xtb | No | Bool | True | Disables xTB optimization. xTB optimization is enabled when this flag is omitted; pass --no_xtb to disable it. |
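To make the table more concrete, here is a minimal `argparse` sketch covering a subset of the documented arguments. It is not the parser from `bluephos_pipeline.py`: the function name `build_parser` is hypothetical, the flag spellings follow the example commands in this README (`--input_dir`, `--ligand_smiles`, `--t_nn`, `--t_ste`, `--no_xtb`), and mapping `--no_xtb` onto a boolean `xtb` setting is an assumption.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring a subset of the documented arguments."""
    parser = argparse.ArgumentParser(prog="bluephos_pipeline.py")
    # Input sources: mode 1 needs both CSV files; modes 2 and 3 use the others.
    parser.add_argument("--halides", default=None, help="CSV file with halides data")
    parser.add_argument("--acids", default=None, help="CSV file with boronic acids data")
    parser.add_argument("--input_dir", default=None, help="Directory of parquet files (rerun mode)")
    parser.add_argument("--ligand_smiles", default=None, help="CSV file with ligand SMILES data")
    # Model files are documented as required in every mode.
    parser.add_argument("--features", required=True, help="Element feature file")
    parser.add_argument("--train", required=True, help="Train stats file")
    parser.add_argument("--weights", required=True, help="Model weights file")
    # Filtering thresholds with the documented defaults.
    parser.add_argument("--t_nn", type=float, default=1.5, help="Threshold for the absolute 'z' score")
    parser.add_argument("--t_ste", type=float, default=1.9, help="Threshold for the absolute 'ste' value")
    # Assumption: the flag switches a boolean `xtb` setting off (it is on by default).
    parser.add_argument("--no_xtb", dest="xtb", action="store_false", help="Disable xTB optimization")
    return parser


example_args = build_parser().parse_args(
    [
        "--halides", "halides.csv",
        "--acids", "acids.csv",
        "--features", "features.csv",
        "--train", "train_stats.csv",
        "--weights", "model_weights.h5",
        "--t_nn", "2.0",
    ]
)
print(example_args.t_nn, example_args.t_ste, example_args.xtb)
```

Omitted thresholds fall back to the defaults listed in the table, and `xtb` stays enabled unless `--no_xtb` is passed.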
The BluePhos Discovery Pipeline now supports three modes of input data:
- Mode 1: Generate data from halides and acids CSV files. This mode is used when no input directory or ligand SMILES CSV file is specified. It generates ligand pairs from the provided halides and acids CSV files.
- Mode 2: Rerun data from parquet files. This mode is used when an input directory is specified. It reruns the pipeline using existing parquet files for ligand data.
- Mode 3: Input data from a ligand SMILES CSV file. This mode is prioritized if a ligand SMILES CSV file is provided. It directly processes ligands from the SMILES data.
The priority order for these modes is 3 > 2 > 1, meaning:
- If a ligand SMILES CSV file (--ligand_smiles) is provided, the pipeline operates in mode 3.
- If an input directory (--input_dir) is specified and no ligand SMILES CSV file is provided, the pipeline operates in mode 2.
- If neither a ligand SMILES CSV file nor an input directory is provided, the pipeline defaults to mode 1.
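The priority rule can be expressed in a few lines. The sketch below is an illustration of the 3 > 2 > 1 ordering only; the function name `select_mode` and its signature are hypothetical and do not come from the BluePhos source. The check that mode 1 needs both CSV files follows the argument descriptions in this README.

```python
from typing import Optional


def select_mode(
    ligand_smiles: Optional[str] = None,
    input_dir: Optional[str] = None,
    halides: Optional[str] = None,
    acids: Optional[str] = None,
) -> int:
    """Return the input mode (1, 2, or 3) using the priority 3 > 2 > 1."""
    if ligand_smiles:
        return 3  # a ligand SMILES CSV file always wins
    if input_dir:
        return 2  # rerun from existing parquet files
    # Mode 1 is the fallback and needs both CSV inputs.
    if not (halides and acids):
        raise ValueError("mode 1 requires both the halides and the acids CSV files")
    return 1


print(select_mode(ligand_smiles="ligands.csv", input_dir="parquet_dir"))
print(select_mode(input_dir="parquet_dir"))
print(select_mode(halides="halides.csv", acids="acids.csv"))
```

Because the checks run in priority order, supplying both a ligand SMILES file and an input directory still selects mode 3.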
Example Commands
- Generating Ligand Pairs and Running the Full Pipeline (Mode 1)
To generate ligand pairs from halides and acids files and run the full pipeline, specify the paths to both files:
```
python bluephos_pipeline.py --halides path/to/halides.csv --acids path/to/acids.csv --features path/to/features.csv --train path/to/train_stats.csv --weights path/to/model_weights.h5
```
- Rerunning the Pipeline with Existing Parquet Files (Mode 2)
If you have already run the pipeline and want to refilter or recalculate the ligands based on previous results:
```
python bluephos_pipeline.py --input_dir path/to/parquet_directory --features path/to/features.csv --train path/to/train_stats.csv --weights path/to/model_weights.h5
```
- Using a Ligand SMILES CSV File (Mode 3)
```
python bluephos_pipeline.py --ligand_smiles path/to/ligand_smiles.csv --features path/to/features.csv --train path/to/train_stats.csv --weights path/to/model_weights.h5
```
- Specifying Different Thresholds for NN and STE
You can adjust the thresholds for the neural network 'z' score and the 'ste' value (singlet-triplet energy gap) as needed:
```
python bluephos_pipeline.py --halides path/to/halides.csv --acids path/to/acids.csv --features path/to/features.csv --train path/to/train_stats.csv --weights path/to/model_weights.h5 --t_nn 2.0 --t_ste 2.5
```
- Using a Different DFT Package
By default, the pipeline uses the ORCA DFT package, but you can switch to ASE (to be implemented later) if preferred:
```
python bluephos_pipeline.py --halides path/to/halides.csv --acids path/to/acids.csv --features path/to/features.csv --train path/to/train_stats.csv --weights path/to/model_weights.h5 --dft_package ase
```
- Disabling xTB Optimization
By default, the geometry optimization task uses the xTB package. You can disable it by running:
```
python bluephos_pipeline.py --halides path/to/halides.csv --acids path/to/acids.csv --features path/to/features.csv --train path/to/train_stats.csv --weights path/to/model_weights.h5 --no_xtb
```
Execute the BluePhos pipeline within a tox environment for a consistent and reproducible setup:
```
tox -e run-pipeline -- --halide /path/to/aromatic_halides.csv --acid /path/to/aromatic_boronic_acids.csv --feature /path/to/element_features.csv --train /path/to/train_stats.csv --weight /path/to/model_weights.pt -o /path/to/output_dir/
```
Replace /path/to/... with the actual paths to your datasets and parameter files.
Example Usage with Test Data
To run the pipeline using example data provided in the repository:
```
tox -e run-pipeline -- --halide ./tests/input/aromatic_halides_with_id.csv --acid ./tests/input/aromatic_boronic_acids_with_id.csv --feature ./bluephos/parameters/element_features.csv --train ./bluephos/parameters/train_stats.csv --weight ./bluephos/parameters/full_energy_model_weights.pt -o .
```
This command uses test data to demonstrate the pipeline's functionality, ideal for initial testing and familiarization.
Result
Note:
- The default output (-o or --output) dataframe is stored in Parquet format due to its efficient storage, faster data access, and enhanced support for complex data structures.
- The pipeline's results are organized by task, with filtered-out data stored in specific subdirectories within the /output directory. For example:
  - The filtered-out data from the NN task is stored in /NNfilterout.
  - For the XTB task, the filtered-out data is saved in /XTBfilterout.
  - For the final DFT task, the results are divided into two directories: /DFTfilterin for filtered-in data and /DFTfilterout for filtered-out data.
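The filter-in / filter-out split can be sketched as below, using the rule from the Command-Line Arguments table that a candidate is kept when the absolute value of its score is below the threshold. This is a simplified illustration on plain dictionaries, not the pipeline's own filtering code; the function name `split_by_threshold`, the record layout, and the example values are assumptions.

```python
def split_by_threshold(candidates, key, threshold):
    """Split records into (filtered_in, filtered_out) by |record[key]| < threshold."""
    filtered_in, filtered_out = [], []
    for record in candidates:
        if abs(record[key]) < threshold:
            filtered_in.append(record)
        else:
            filtered_out.append(record)
    return filtered_in, filtered_out


# Illustrative records with a neural network 'z' score (values are made up).
example_candidates = [
    {"ligand": "L1", "z": 0.4},
    {"ligand": "L2", "z": -1.2},
    {"ligand": "L3", "z": 2.3},
]

# 1.5 is the documented default threshold for the 'z' score.
kept, dropped = split_by_threshold(example_candidates, key="z", threshold=1.5)
print([record["ligand"] for record in kept])
print([record["ligand"] for record in dropped])
```

The same helper could be reused for the 'ste' and 'dft' values by changing `key` and `threshold`.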
The Parquet file can be accessed in several ways:
Using Pandas
Pandas can be used to read and analyze Parquet files.
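```py
import pandas as pd

# Load one of the pipeline's Parquet output files into a DataFrame
df = pd.read_parquet('08ca147e-f618-11ee-b38f-eab1f408aca3-8.parquet')
# Print summary statistics for the numeric columns
print(df.describe())
```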
Using DuckDB
DuckDB provides an efficient way to query Parquet files directly using SQL syntax.
```py
import duckdb as ddb

# Query the Parquet file directly with SQL and fetch the first 10 rows
query_result = ddb.query('''SELECT * FROM '08ca147e-f618-11ee-b38f-eab1f408aca3-8.parquet' LIMIT 10''')
print(query_result.to_df())
```
Contributing
We welcome contributions! Please see our CONTRIBUTING.md for guidelines on how to contribute to this project.
Owner
- Name: Scientific Software Engineering Center at JHU
- Login: ssec-jhu
- Kind: organization
- Email: ssec@jhu.edu
- Location: United States of America
- Website: https://ai.jhu.edu/ssec/
- Repositories: 1
- Profile: https://github.com/ssec-jhu
Accelerating Software Development for Science Research
Citation (CITATION.cff)
```yaml
cff-version: 1.2.0
message: "If you use this software in your work, please cite it using the following metadata."
authors:
  - family-names: "Hunter"
    given-names: "Edward"
    orcid: "https://orcid.org/0000-0002-6876-001X"
  - family-names: "Noss"
    given-names: "James"
    orcid: "https://orcid.org/0000-0002-0922-5770"
  - family-names: "Kluzner"
    given-names: "Vladimir"
    orcid: "https://orcid.org/0009-0000-5844-661X"
  - family-names: "Lemson"
    given-names: "Gerard"
    orcid: "https://orcid.org/0000-0001-5041-2458"
  - family-names: "Mitschang"
    given-names: "Arik"
    orcid: "https://orcid.org/0000-0001-9239-012X"
  - family-names: "Chen"
    given-names: "Xiang"
    orcid: "https://orcid.org/0009-0003-6402-9822"
  - family-names: "Abbasinejad"
    given-names: "Fatemeh"
    orcid: "https://orcid.org/0009-0006-3239-7112"
title: "bluephos"
version: 0.0.1
doi: <insert zenodo DOI>
date-released: 2023-01-01
url: "https://github.com/ssec-jhu/bluephos"
```
GitHub Events
Total
- Issues event: 9
- Watch event: 1
- Delete event: 13
- Issue comment event: 24
- Push event: 28
- Pull request review comment event: 3
- Pull request review event: 28
- Pull request event: 47
- Create event: 24
Last Year
- Issues event: 9
- Watch event: 1
- Delete event: 13
- Issue comment event: 24
- Push event: 28
- Pull request review comment event: 3
- Pull request review event: 28
- Pull request event: 47
- Create event: 24
Dependencies
- actions/checkout v3 composite
- actions/setup-python v4 composite
- docker/build-push-action f2a1d5e99d037542a71f64918e516c093c6f3fc4 composite
- docker/login-action 65b78e6e13532edd9afa3aa52ac7964289d1a9c1 composite
- docker/metadata-action 9ec57ed1fcdbf14dcef7dfbe97b2010124a938b7 composite
- actions/checkout v4 composite
- actions/download-artifact v3 composite
- actions/setup-python v4 composite
- actions/upload-artifact v3 composite
- actions/checkout v3 composite
- actions/setup-python v4 composite
- continuumio/miniconda3 latest build
- continuumio/miniconda3 latest build
- fastapi [all]
- bandit ==1.7.8
- build ==1.0.3
- pytest ==8.0.2
- pytest-cov ==4.1.0
- ruff ==0.1.9
- tox ==4.12.1 development
- nbsphinx ==0.9.3
- sphinx ==7.2.6
- sphinx-automodapi ==0.17.0
- sphinx-issues ==4.0.0
- sphinx_book_theme ==1.1.2
- sphinx_rtd_theme ==2.0.0
- ase ==3.22.1
- dplutils ==0.5.2
- fastapi ==0.110.0
- pandas <2.2
- ray ==2.9.3
- rdkit_pypi ==2022.9.5
- torch ==2.2.0
- torch_geometric ==2.5.2
- uvicorn ==0.29.0
- bandit ==1.7.8 test
- pytest ==8.0.2 test
- pytest-cov ==4.1.0 test
- ruff ==0.3.4 test