qm7_database_process

QM7 Dataset Processing and Curation

https://github.com/shahram-boshra/qm7_database_process

Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (14.4%) to scientific vocabulary

Keywords

chemical-informatics gnn graph-neural-networks machine-deep-learning python pytorch pytorch-geometric qm7-database rdkit-chem
Last synced: 6 months ago

Repository

QM7 Dataset Processing and Curation

Basic Info
Statistics
  • Stars: 0
  • Watchers: 0
  • Forks: 0
  • Open Issues: 0
  • Releases: 0
Topics
chemical-informatics gnn graph-neural-networks machine-deep-learning python pytorch pytorch-geometric qm7-database rdkit-chem
Created 9 months ago · Last pushed 9 months ago
Metadata Files
Readme Citation

README.md

QM7 Dataset Processing and Curation to PyTorch Geometric Molecular Graphs


A robust Python pipeline for processing the QM7 quantum chemistry dataset into graph-based format optimized for PyTorch Geometric.

🧪 Overview

This repository provides a comprehensive solution for transforming the QM7 quantum chemistry dataset into PyTorch Geometric (PyG) graph format. The QM7 dataset contains 7,165 molecules with up to 23 atoms (C, O, N, S, H) and their corresponding atomization energies, making it an ideal benchmark for Graph Neural Networks (GNNs) in molecular property prediction tasks.

Key Capabilities

  • Multi-format Data Loading: Seamlessly processes SDF, CSV, and MAT files
  • Rich Graph Construction: Creates detailed molecular graphs with comprehensive node and edge features
  • Data Quality Assurance: Implements thorough consistency checks and alignment validation
  • Memory-Efficient Processing: Handles large datasets through intelligent chunking
  • Feature Normalization: Applies global standardization using scikit-learn's StandardScaler
  • Flexible Filtering: Supports custom pre-filtering based on molecular properties
  • Extensible Transforms: Integrates with PyTorch Geometric's transform ecosystem
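The global standardization mentioned above can be sketched with scikit-learn's `StandardScaler`, whose `partial_fit` method accumulates mean and variance one chunk at a time, so no chunk ever has to hold the full dataset in memory. This is a minimal illustration of the technique only; `fit_global_scaler` and the toy arrays are hypothetical, not the repository's API:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

def fit_global_scaler(chunks):
    """Accumulate global mean/variance across chunks, return the fitted scaler."""
    scaler = StandardScaler()
    for chunk in chunks:          # each chunk: array of shape (n_atoms, n_features)
        scaler.partial_fit(chunk)
    return scaler

# Two toy "chunks" of node features
chunks = [np.array([[0.0, 1.0], [2.0, 3.0]]),
          np.array([[4.0, 5.0], [6.0, 7.0]])]

scaler = fit_global_scaler(chunks)
# Each chunk is then transformed with the *global* statistics, not its own
normalized = scaler.transform(chunks[0])
```

The key point is that `partial_fit` makes the normalization statistics independent of how the data happens to be chunked.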

🚀 Quick Start

Prerequisites

  • Python ≥ 3.8
  • CUDA-capable GPU (recommended for large-scale processing)

Installation

  1. Clone the repository

     ```bash
     git clone https://github.com/shahram-boshra/qm7_database_process.git
     cd qm7_database_process
     ```

  2. Create and activate a virtual environment

     ```bash
     python -m venv venv
     source venv/bin/activate  # On Windows: venv\Scripts\activate
     ```

  3. Install dependencies

     ```bash
     pip install -r requirements.txt
     ```

Dataset Setup

Download the QM7 dataset files and organize them as follows:

```
data/
└── qm7/
    └── raw_data/
        ├── gdb7.sdf                  # Molecular structures
        ├── atomization_energies.csv  # Target energies
        └── qm7.mat                   # Coulomb matrices and charges
```

Dataset Sources:
  • QM7 on Figshare
  • MoleculeNet QM7

Basic Usage

```python
from pathlib import Path

from pyg_qm7_processing import process_qm7_data
from qm7_curation import curate_qm7_data

# Define paths
base_dir = Path("data/qm7")
raw_dir = base_dir / "raw_data"
processed_dir = base_dir / "processed"

# Step 1: Process raw data into chunks
process_qm7_data(
    sdf_file=raw_dir / "gdb7.sdf",
    energies_file=raw_dir / "atomization_energies.csv",
    mat_file=raw_dir / "qm7.mat",
    intermediate_chunk_output_dir=base_dir / "chunks",
    chunk_size=1000,
)

# Step 2: Curate and normalize features
curate_qm7_data(
    chunk_dir=base_dir / "chunks",
    output_path=processed_dir / "qm7_processed.pt",
    feature_keys_for_norm=["x", "edge_attr"],
)
```

📊 Dataset Features

Node Features (per atom)

  • Atom Type: One-hot encoded atomic species
  • Atomic Number: Raw atomic number values
  • Chemical Properties: Aromaticity, hybridization state (SP/SP2/SP3)
  • Hydrogen Count: Total number of bonded hydrogens
  • Quantum Properties: Atomic charges and Coulomb matrix diagonal elements

Edge Features (per bond)

  • Bond Type: One-hot encoded (Single, Double, Triple, Aromatic)
  • Coulomb Interactions: Off-diagonal Coulomb matrix elements

Graph Properties

  • 3D Coordinates: Atomic positions from conformers
  • Target Values: Atomization energies (eV)
  • Metadata: Original dataset indices for traceability
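As an illustration of how the per-atom features listed above could be assembled, the sketch below concatenates one-hot atom types with scalar properties such as the Coulomb matrix diagonal. `ATOM_TYPES` and `node_features` are hypothetical names for illustration, not the repository's code; the `0.5 * Z^2.4` diagonal is the standard Coulomb-matrix convention for QM7:

```python
import numpy as np

ATOM_TYPES = [1, 6, 7, 8, 16]  # H, C, N, O, S -- the QM7 element set

def node_features(atomic_numbers, coulomb_diag):
    """One-hot atom type + raw atomic number + Coulomb diagonal, per atom."""
    one_hot = np.zeros((len(atomic_numbers), len(ATOM_TYPES)))
    for i, z in enumerate(atomic_numbers):
        one_hot[i, ATOM_TYPES.index(z)] = 1.0
    extra = np.stack([atomic_numbers, coulomb_diag], axis=1)
    return np.concatenate([one_hot, extra], axis=1)

# Methane-like toy example: one carbon, four hydrogens
z = np.array([6, 1, 1, 1, 1], dtype=float)
diag = 0.5 * z ** 2.4           # Coulomb-matrix diagonal: 0.5 * Z^2.4
x = node_features(z, diag)      # shape: (5 atoms, 5 one-hot + 2 scalar features)
```

The additional RDKit-derived features (aromaticity, hybridization, hydrogen count) would extend `extra` with further columns in the same way.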

🏗️ Architecture

```
scripts/
├── pyg_qm7_processing.py  # Core data processing and graph construction
├── qm7_curation.py        # Feature normalization and final transforms
├── exceptions.py          # Custom exception handling
└── main_process.py        # Complete pipeline orchestration
```

Processing Pipeline

  1. Data Loading & Validation

    • Load molecular structures from SDF files
    • Parse energy targets from CSV
    • Extract quantum properties from MAT files
    • Validate data consistency across sources
  2. Graph Construction

    • Build molecular graphs using RDKit
    • Extract comprehensive node and edge features
    • Apply optional pre-filtering criteria
    • Save intermediate results in memory-efficient chunks
  3. Feature Curation

    • Calculate global feature statistics
    • Apply standardization transforms
    • Integrate custom preprocessing steps
    • Consolidate final dataset
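The cross-source validation in step 1 boils down to checking that the SDF, CSV, and MAT inputs describe the same set of molecules before any graphs are built. A minimal sketch (`check_alignment` is a hypothetical helper, not the repository's implementation):

```python
def check_alignment(n_sdf_mols: int, n_energy_rows: int, n_mat_entries: int) -> int:
    """Raise if the three data sources disagree on the molecule count."""
    counts = {"sdf": n_sdf_mols, "csv": n_energy_rows, "mat": n_mat_entries}
    if len(set(counts.values())) != 1:
        raise ValueError(f"Source count mismatch: {counts}")
    return counts["sdf"]

# QM7 ships 7,165 molecules across all three files
n = check_alignment(7165, 7165, 7165)
```

Failing fast here prevents a subtle misalignment (e.g., an off-by-one between structures and energies) from silently corrupting every downstream target.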

🔧 Advanced Configuration

Custom Filtering

```python
# Define molecule filters
def filter_by_complexity(data, min_atoms=5, max_atoms=20):
    return min_atoms <= data.num_nodes <= max_atoms

def filter_by_carbon_content(data, min_carbons=1):
    carbon_count = (data.z == 6).sum().item()
    return carbon_count >= min_carbons

# Combine filters
combined_filter = lambda data: (
    filter_by_complexity(data, 5, 20)
    and filter_by_carbon_content(data, 1)
)
```

Custom Transforms

```python
from torch_geometric.transforms import Compose

from qm7_curation import CustomEdgeFeatureCombiner

transforms = Compose([
    CustomEdgeFeatureCombiner(param1="value1"),
    # Add your custom transforms here
])
```

📈 Performance & Scalability

  • Memory Efficiency: Chunked processing handles datasets of arbitrary size
  • Processing Speed: Optimized for large-scale molecular datasets
  • GPU Compatibility: Full CUDA support for accelerated computation
  • Robust Error Handling: Comprehensive exception management for production use

🤝 Contributing

We welcome contributions! Please see our contribution guidelines for details.

Development Setup

```bash
# Install development dependencies
pip install -r requirements-dev.txt

# Run tests
python -m pytest tests/

# Code formatting
black scripts/
isort scripts/

# Linting
pylint scripts/
```

📋 Requirements

```
torch>=1.9.0
torch_geometric>=2.0.0
rdkit-pypi>=2022.3.5
numpy>=1.21.0
scipy>=1.7.0
pandas>=1.3.0
tqdm>=4.62.0
scikit-learn>=1.0.0
PyYAML>=6.0
```

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

  • QM7 Dataset Creators - For providing this valuable quantum chemistry benchmark
  • RDKit Team - Essential cheminformatics toolkit for molecular manipulation
  • PyTorch Geometric - Powerful graph neural network library
  • PyTorch Team - Foundational deep learning framework
  • Scientific Python Community - NumPy, Pandas, and scikit-learn developers

📚 Citation

If you use this code in your research, please cite:

```bibtex
@software{qm7_processing_2024,
  title  = {QM7 Dataset Processing and Curation to PyTorch Geometric Molecular Graphs},
  author = {[Your Name]},
  year   = {2024},
  url    = {https://github.com/shahram-boshra/qm7_database_process}
}
```

🐛 Issues & Support


⭐ If this project helped your research, please consider giving it a star!

Made with ❤️ for the molecular machine learning community

Owner

  • Login: shahram-boshra
  • Kind: user

Citation (citations.py)

# citations.py
import logging
from typing import Dict, Any, List

try:
    from exceptions import (
        InvalidCitationDataError,
        MalformedCitationFieldWarning,
        CitationProcessingError
    )
except ImportError:
    # Minimal fallbacks matching the keyword signatures used at the raise sites below.
    class _CitationError(Exception):
        def __init__(self, *args: Any, message: str = "", original_exception: Any = None) -> None:
            super().__init__(message or ", ".join(map(str, args)))
            self.original_exception = original_exception

    class InvalidCitationDataError(_CitationError, TypeError): pass
    class MalformedCitationFieldWarning(_CitationError, UserWarning): pass
    class CitationProcessingError(_CitationError, ValueError): pass


logger = logging.getLogger(__name__)


def format_citation_for_log(citation_data: Dict[str, Any]) -> str:
    # Validate the type before calling .get(), which would otherwise raise
    # AttributeError on non-dict input before this check ever ran.
    if not isinstance(citation_data, dict):
        raise InvalidCitationDataError(type(citation_data), message=f"citation_data must be a dictionary. Received type: {type(citation_data)}.")

    citation_key: str = citation_data.get('key', 'N/A')

    if "full_citation" in citation_data:
        full_citation_content: Any = citation_data["full_citation"]
        if not isinstance(full_citation_content, str):
            logger.warning(f"Warning: Attempting to convert 'full_citation' to string for key '{citation_key}' as it's not a string.")
            try:
                full_citation_content = str(full_citation_content)
                logger.warning(f"Successfully converted 'full_citation' for key '{citation_key}' to string. Original type: {type(citation_data['full_citation'])}.")
            except Exception as e:
                raise MalformedCitationFieldWarning(
                    citation_key, 'full_citation', full_citation_content,
                    message=f"Failed to convert 'full_citation' to string for key '{citation_key}': {e}",
                    original_exception=e
                ) from e
            
        try:
            indented_lines: List[str] = [f"    {line.strip()}" for line in full_citation_content.strip().split('\n')]
            return "\n".join(indented_lines)
        except AttributeError as e:
            raise MalformedCitationFieldWarning(
                citation_key, 'full_citation', full_citation_content,
                message=f"Error processing 'full_citation' for citation key '{citation_key}'. Value: {full_citation_content}",
                original_exception=e
            ) from e
        except Exception as e:
            raise CitationProcessingError(
                citation_key,
                message=f"An unexpected error occurred while processing 'full_citation' for key '{citation_key}': {e}",
                original_exception=e
            ) from e

    parts: List[str] = []
    
    def get_safe_field(data_dict: Dict[str, Any], field_name: str, prefix: str = "", suffix: str = "") -> str:
        value: Any = data_dict.get(field_name)
        if value is None:
            return ""
        if not isinstance(value, (str, int, float)):
            logger.warning(f"Warning: Citation key '{citation_key}' has '{field_name}' field of unexpected type '{type(value)}'. Attempting to convert to string.")
            try:
                return f"    {prefix}{str(value)}{suffix}"
            except Exception as e:
                raise MalformedCitationFieldWarning(
                    citation_key, field_name, value,
                    message=f"Failed to convert '{field_name}' to string for key '{citation_key}': {e}",
                    original_exception=e
                ) from e
        return f"    {prefix}{value}{suffix}"
    
    try:
        parts.append(get_safe_field(citation_data, 'authors', suffix="."))
        parts.append(get_safe_field(citation_data, 'title', prefix="\"", suffix="\""))
        parts.append(get_safe_field(citation_data, 'journal', suffix="."))
        parts.append(get_safe_field(citation_data, 'conference', suffix="."))
        
        mid_info: List[str] = []
        for field in ['volume', 'issue', 'pages']:
            value: Any = citation_data.get(field)
            if value is not None:
                if not isinstance(value, (str, int)):
                    logger.warning(f"Warning: Citation key '{citation_key}' has '{field}' field of unexpected type '{type(value)}'. Attempting to convert to string.")
                    try:
                        value = str(value)
                    except Exception as e:
                        raise MalformedCitationFieldWarning(
                            citation_key, field, value,
                            message=f"Failed to convert '{field}' to string for key '{citation_key}': {e}",
                            original_exception=e
                        ) from e

                if value is not None:
                    if field == 'issue':
                        mid_info.append(f"({value})")
                    elif field == 'pages':
                        mid_info.append(f"pages {value}")
                    else:
                        mid_info.append(str(value))

        if mid_info:
            parts.append(f"    {', '.join(mid_info)}.")
            
        parts.append(get_safe_field(citation_data, 'year', suffix="."))
        parts.append(get_safe_field(citation_data, 'doi', prefix="DOI: ", suffix="."))
        parts.append(get_safe_field(citation_data, 'note', suffix="."))
        
        return "\n".join([part for part in parts if part.strip()])
    except MalformedCitationFieldWarning:
        raise
    except Exception as e:
        raise CitationProcessingError(
            citation_key,
            message=f"An unexpected error occurred during fallback citation formatting for key '{citation_key}': {e}",
            original_exception=e
        ) from e


def log_qm7_citations(citations_data: List[Dict[str, str]]) -> None:
    logger.info("\n--- Citations for QM7 Dataset ---")
    if not citations_data:
        logger.warning("No citation data provided. No citations to log.")
        logger.info("----------------------------------\n")
        return

    for i, citation_data in enumerate(citations_data):
        citation_key: str = citation_data.get('key', 'N/A')
        try:
            prefix: str = f"[{i+1}] " if not str(citation_data.get("full_citation", "")).strip().startswith(f"[{i+1}]") else ""
            formatted_string: str = format_citation_for_log(citation_data)
            
            lines: List[str] = formatted_string.split('\n')
            if lines:
                lines[0] = f"    {prefix}{lines[0].lstrip()}"
            else:
                logger.warning(f"Formatting for citation at index {i} (key: {citation_key}) resulted in empty string. Logging placeholder.")
                lines = [f"    [CITATION {i+1}]: No content generated."]

            logger.info("\n".join(lines))
        except (InvalidCitationDataError, MalformedCitationFieldWarning, CitationProcessingError) as e:
            logger.error(f"Error processing citation entry (index {i}, key: {citation_key}): {e}")
            logger.info(f"    [UNLOGGED ENTRY {i+1}]: Skipping due to formatting error.")
            if hasattr(e, 'original_exception') and e.original_exception:
                logger.debug(f"Original exception for {citation_key}: {e.original_exception}", exc_info=True)
            continue
        except Exception as e:
            logger.critical(f"An unhandled critical error occurred while processing citation at index {i} (key: {citation_key}): {e}", exc_info=True)
            logger.info(f"    [FATAL ERROR] Could not process citation {i+1}. Skipping.")
    logger.info("----------------------------------\n")
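Judging from the fields `format_citation_for_log` reads ('key', 'full_citation', 'authors', 'title', 'journal', 'conference', 'volume', 'issue', 'pages', 'year', 'doi', 'note'), the expected input is a list of flat string dictionaries. A hypothetical example entry, using the original QM7 machine-learning reference:

```python
# Hypothetical entry illustrating the dictionary shape the formatter expects;
# either 'full_citation' or the individual fields below may be supplied.
citations_data = [
    {
        "key": "rupp2012",
        "authors": "Rupp, M., Tkatchenko, A., Muller, K.-R., and von Lilienfeld, O. A.",
        "title": "Fast and Accurate Modeling of Molecular Atomization Energies with Machine Learning",
        "journal": "Physical Review Letters",
        "volume": "108",
        "pages": "058301",
        "year": "2012",
        "doi": "10.1103/PhysRevLett.108.058301",
    },
]
```

A list of such dictionaries would be passed directly to `log_qm7_citations(citations_data)`.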

GitHub Events

Total
  • Push event: 1
Last Year
  • Push event: 1

Committers

Last synced: 9 months ago

All Time
  • Total Commits: 2
  • Total Committers: 1
  • Avg Commits per committer: 2.0
  • Development Distribution Score (DDS): 0.0
Past Year
  • Commits: 2
  • Committers: 1
  • Avg Commits per committer: 2.0
  • Development Distribution Score (DDS): 0.0
Top Committers
Name Email Commits
Shahram a****a@g****m 2

Issues and Pull Requests

Last synced: 9 months ago

All Time
  • Total issues: 0
  • Total pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Total issue authors: 0
  • Total pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
Pull Request Authors
Top Labels
Issue Labels
Pull Request Labels

Dependencies

Dockerfile docker
  • python 3.10-slim-buster build
requirements.txt pypi
  • PyYAML ==6.0.2
  • docutils ==0.21.2
  • numpy ==2.2.3
  • requests ==2.32.3
  • sphinx ==8.2.3
  • torch-cluster ==1.6.3
  • torch-geometric ==2.6.1
  • torch-scatter ==6.0.2
  • torch-sparse ==0.6.18
  • torch_geometric ==2.6.0
  • tqdm ==4.67.1