qm7_database_process
QM7 Dataset Processing and Curation
Science Score: 44.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: Found CITATION.cff file
- ✓ codemeta.json file: Found codemeta.json file
- ✓ .zenodo.json file: Found .zenodo.json file
- ○ DOI references
- ○ Academic publication links
- ○ Committers with academic emails
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: Low similarity (14.4%) to scientific vocabulary
Repository
QM7 Dataset Processing and Curation
Basic Info
- Host: GitHub
- Owner: shahram-boshra
- Language: Python
- Default Branch: master
- Homepage: https://github.com/shahram-boshra/qm7_database_process
- Size: 42 KB
Statistics
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
QM7 Dataset Processing and Curation to PyTorch Geometric Molecular Graphs
A robust Python pipeline for processing the QM7 quantum chemistry dataset into a graph-based format optimized for PyTorch Geometric.
🧪 Overview
This repository transforms the QM7 quantum chemistry dataset into PyTorch Geometric (PyG) graph format. QM7 contains 7,165 molecules with up to 23 atoms each (elements C, O, N, S, H; at most seven heavy atoms) and their atomization energies, which makes it a common small-scale benchmark for Graph Neural Networks (GNNs) in molecular property prediction tasks.
Key Capabilities
- Multi-format Data Loading: Seamlessly processes SDF, CSV, and MAT files
- Rich Graph Construction: Creates detailed molecular graphs with comprehensive node and edge features
- Data Quality Assurance: Implements thorough consistency checks and alignment validation
- Memory-Efficient Processing: Handles large datasets through intelligent chunking
- Feature Normalization: Applies global standardization using scikit-learn's StandardScaler
- Flexible Filtering: Supports custom pre-filtering based on molecular properties
- Extensible Transforms: Integrates with PyTorch Geometric's transform ecosystem
🚀 Quick Start
Prerequisites
- Python ≥ 3.8
- CUDA-capable GPU (recommended for large-scale processing)
Installation
Clone the repository
```bash
git clone https://github.com/shahram-boshra/qm7_database_process.git
cd qm7_database_process
```
Create and activate a virtual environment
```bash
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
```
Install dependencies
```bash
pip install -r requirements.txt
```
Dataset Setup
Download the QM7 dataset files and organize them as follows:
data/
└── qm7/
    └── raw_data/
        ├── gdb7.sdf                  # Molecular structures
        ├── atomization_energies.csv  # Target energies
        └── qm7.mat                   # Coulomb matrices and charges
Dataset Sources:
- QM7 on Figshare
- MoleculeNet QM7
Basic Usage
```python
from pathlib import Path

from pyg_qm7_processing import process_qm7_data
from qm7_curation import curate_qm7_data

# Define paths
base_dir = Path("data/qm7")
raw_dir = base_dir / "raw_data"
processed_dir = base_dir / "processed"

# Step 1: Process raw data into chunks
process_qm7_data(
    sdf_file=raw_dir / "gdb7.sdf",
    energies_file=raw_dir / "atomization_energies.csv",
    mat_file=raw_dir / "qm7.mat",
    intermediate_chunk_output_dir=base_dir / "chunks",
    chunk_size=1000,
)

# Step 2: Curate and normalize features
curate_qm7_data(
    chunk_dir=base_dir / "chunks",
    output_path=processed_dir / "qm7_processed.pt",
    feature_keys_for_norm=['x', 'edge_attr'],
)
```
📊 Dataset Features
Node Features (per atom)
- Atom Type: One-hot encoded atomic species
- Atomic Number: Raw atomic number values
- Chemical Properties: Aromaticity, hybridization state (SP/SP2/SP3)
- Hydrogen Count: Total number of bonded hydrogens
- Quantum Properties: Atomic charges and Coulomb matrix diagonal elements
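As a rough illustration of how such a per-atom vector could be assembled, the sketch below concatenates the features listed above in plain Python. The helper name `atom_feature_vector`, the element ordering, and the feature layout are assumptions made for this example and are not taken from `pyg_qm7_processing.py`; in the pipeline, values such as aromaticity and hybridization would presumably come from RDKit atom objects.

```python
from typing import List, Sequence

# Assumed orderings, chosen for this sketch only.
ELEMENTS = [1, 6, 7, 8, 16]            # H, C, N, O, S (the elements present in QM7)
HYBRIDIZATIONS = ["SP", "SP2", "SP3"]


def one_hot(value, choices: Sequence) -> List[float]:
    """Return a one-hot list; all zeros if value is not among choices."""
    return [1.0 if value == choice else 0.0 for choice in choices]


def atom_feature_vector(atomic_num: int, is_aromatic: bool, hybridization: str,
                        num_hs: int, charge: float, coulomb_diag: float) -> List[float]:
    """Concatenate the node features described above into one flat vector."""
    return (
        one_hot(atomic_num, ELEMENTS)
        + [float(atomic_num)]
        + [float(is_aromatic)]
        + one_hot(hybridization, HYBRIDIZATIONS)
        + [float(num_hs), charge, coulomb_diag]
    )


# Example: an sp3 carbon carrying three hydrogens (charge value is illustrative).
features = atom_feature_vector(6, False, "SP3", 3, -0.1, 0.5 * 6 ** 2.4)
print(len(features))  # 13
```

Keeping the raw atomic number next to the one-hot encoding is redundant for a five-element dataset, but it mirrors the feature list above.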
Edge Features (per bond)
- Bond Type: One-hot encoded (Single, Double, Triple, Aromatic)
- Coulomb Interactions: Off-diagonal Coulomb matrix elements
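For reference, the Coulomb matrix is commonly defined with diagonal entries 0.5 * Z_i^2.4 and off-diagonal entries Z_i * Z_j / |R_i - R_j|, with positions in atomic units. The pure-Python sketch below computes it for a toy molecule to make the split between diagonal (node) and off-diagonal (edge) terms concrete. The pipeline itself reads precomputed matrices from `qm7.mat`, so `coulomb_matrix` is a hypothetical helper and not part of the repository.

```python
import math
from typing import List, Sequence


def coulomb_matrix(charges: Sequence[int],
                   coords: Sequence[Sequence[float]]) -> List[List[float]]:
    """Coulomb matrix: 0.5 * Z_i**2.4 on the diagonal, Z_i * Z_j / r_ij elsewhere."""
    n = len(charges)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                matrix[i][j] = 0.5 * charges[i] ** 2.4
            else:
                distance = math.dist(coords[i], coords[j])
                matrix[i][j] = charges[i] * charges[j] / distance
    return matrix


# Toy example: two hydrogen atoms 1.4 Bohr apart (illustrative geometry).
cm = coulomb_matrix([1, 1], [(0.0, 0.0, 0.0), (0.0, 0.0, 1.4)])
node_terms = [cm[i][i] for i in range(2)]   # diagonal entries -> node features
edge_term = cm[0][1]                        # off-diagonal entry -> edge feature
```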
Graph Properties
- 3D Coordinates: Atomic positions from conformers
- Target Values: Atomization energies (eV)
- Metadata: Original dataset indices for traceability
🏗️ Architecture
scripts/
├── pyg_qm7_processing.py # Core data processing and graph construction
├── qm7_curation.py # Feature normalization and final transforms
├── exceptions.py # Custom exception handling
└── main_process.py # Complete pipeline orchestration
Processing Pipeline
Data Loading & Validation
- Load molecular structures from SDF files
- Parse energy targets from CSV
- Extract quantum properties from MAT files
- Validate data consistency across sources
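A consistency check of this kind might look like the following sketch, which only compares record counts across the three sources and reports any mismatch. The function name `check_alignment` and the exception `DataAlignmentError` are assumptions for illustration; the repository's own checks and the classes in `exceptions.py` may differ.

```python
from typing import Sequence


class DataAlignmentError(ValueError):
    """Hypothetical exception raised when the sources disagree in length."""


def check_alignment(molecules: Sequence, energies: Sequence, coulomb: Sequence) -> int:
    """Return the shared record count, or raise if the sources are misaligned."""
    counts = {"sdf": len(molecules), "csv": len(energies), "mat": len(coulomb)}
    if len(set(counts.values())) != 1:
        raise DataAlignmentError(f"Record counts differ across sources: {counts}")
    return counts["sdf"]


# Placeholder records: aligned inputs pass, a missing energy is reported.
n_records = check_alignment(["mol0", "mol1"], [-1.0, -2.0], [[[0.5]], [[0.5]]])
try:
    check_alignment(["mol0", "mol1"], [-1.0], [[[0.5]], [[0.5]]])
except DataAlignmentError as err:
    print(err)
```

A count comparison catches dropped or duplicated records but not reordering, so a real check would likely also compare something per molecule, such as atom counts.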
Graph Construction
- Build molecular graphs using RDKit
- Extract comprehensive node and edge features
- Apply optional pre-filtering criteria
- Save intermediate results in memory-efficient chunks
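The chunking idea can be sketched with a small generator that groups any iterable into fixed-size lists, so only one chunk has to be held in memory at a time. `iter_chunks` is a hypothetical helper written for this example; how the pipeline actually serializes each chunk is not shown here.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def iter_chunks(items: Iterable[T], chunk_size: int) -> Iterator[List[T]]:
    """Yield successive lists of at most chunk_size items from items."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be a positive integer")
    iterator = iter(items)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk


# 7,165 molecules in chunks of 1,000 give seven full chunks and one of 165.
sizes = [len(chunk) for chunk in iter_chunks(range(7165), 1000)]
print(sizes)
```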
Feature Curation
- Calculate global feature statistics
- Apply standardization transforms
- Integrate custom preprocessing steps
- Consolidate final dataset
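Global standardization means computing one mean and one standard deviation per feature column over the whole dataset, rather than per molecule, and then applying z = (x - mean) / std. The README states that scikit-learn's `StandardScaler` is used for this. The dependency-free sketch below only mirrors the arithmetic: population standard deviation, with constant columns given a scale of 1.0 as `StandardScaler` does. The function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def fit_global_stats(rows: Sequence[Sequence[float]]) -> Tuple[List[float], List[float]]:
    """Per-column mean and population standard deviation over all rows."""
    n = len(rows)
    columns = list(zip(*rows))
    means = [sum(col) / n for col in columns]
    stds = [
        math.sqrt(sum((v - m) ** 2 for v in col) / n)
        for col, m in zip(columns, means)
    ]
    # Constant columns get a scale of 1.0 to avoid dividing by zero.
    stds = [s if s > 0 else 1.0 for s in stds]
    return means, stds


def standardize(rows: Sequence[Sequence[float]], means: Sequence[float],
                stds: Sequence[float]) -> List[List[float]]:
    """Apply (x - mean) / std column-wise."""
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


# Node features pooled from every graph in the dataset (toy values).
all_node_features = [[1.0, 6.0], [3.0, 6.0], [5.0, 6.0]]
means, stds = fit_global_stats(all_node_features)
scaled = standardize(all_node_features, means, stds)
```

Fitting the statistics once over all chunks, then applying them chunk by chunk, keeps every graph on the same scale.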
🔧 Advanced Configuration
Custom Filtering
```python
from functools import partial

# Define molecule filters
def filter_by_complexity(data, min_atoms=5, max_atoms=20):
    return min_atoms <= data.num_nodes <= max_atoms

def filter_by_carbon_content(data, min_carbons=1):
    carbon_count = (data.z == 6).sum().item()
    return carbon_count >= min_carbons

# Combine filters
combined_filter = lambda data: (
    filter_by_complexity(data, 5, 20)
    and filter_by_carbon_content(data, 1)
)
```
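When several thresholds need to be bound ahead of time, `functools.partial` is an alternative to a lambda. The sketch below combines predicates with a small hypothetical `all_of` helper and uses `SimpleNamespace` objects as stand-ins for PyG `Data` objects (only `num_nodes` is modeled), so it runs without PyTorch installed.

```python
from functools import partial
from types import SimpleNamespace
from typing import Callable


def filter_by_complexity(data, min_atoms: int = 5, max_atoms: int = 20) -> bool:
    """Keep molecules whose atom count lies in [min_atoms, max_atoms]."""
    return min_atoms <= data.num_nodes <= max_atoms


def all_of(*predicates: Callable) -> Callable:
    """Hypothetical helper: a filter that passes only if every predicate passes."""
    return lambda data: all(predicate(data) for predicate in predicates)


# Bind thresholds up front, then combine the resulting one-argument filters.
small_only = partial(filter_by_complexity, min_atoms=1, max_atoms=10)
not_tiny = partial(filter_by_complexity, min_atoms=5, max_atoms=23)
combined = all_of(small_only, not_tiny)

# Stand-ins for graphs with 3, 8, and 15 atoms.
graphs = [SimpleNamespace(num_nodes=n) for n in (3, 8, 15)]
kept = [g.num_nodes for g in graphs if combined(g)]
print(kept)  # [8]
```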
Custom Transforms
```python
from torch_geometric.transforms import Compose
from qm7_curation import CustomEdgeFeatureCombiner

transforms = Compose([
    CustomEdgeFeatureCombiner(param1='value1'),
    # Add your custom transforms here
])
```
📈 Performance & Scalability
- Memory Efficiency: Chunked processing keeps peak memory during graph construction tied to the chunk size rather than to the full dataset
- Processing Speed: Optimized for large-scale molecular datasets
- GPU Compatibility: Full CUDA support for accelerated computation
- Robust Error Handling: Comprehensive exception management for production use
🤝 Contributing
We welcome contributions! Please see our contribution guidelines for details.
Development Setup
```bash
# Install development dependencies
pip install -r requirements-dev.txt

# Run tests
python -m pytest tests/

# Code formatting
black scripts/
isort scripts/

# Linting
pylint scripts/
```
📋 Requirements
torch>=1.9.0
torch_geometric>=2.0.0
rdkit-pypi>=2022.3.5
numpy>=1.21.0
scipy>=1.7.0
pandas>=1.3.0
tqdm>=4.62.0
scikit-learn>=1.0.0
PyYAML>=6.0
📄 License
This project is licensed under the MIT License - see the LICENSE file for details.
🙏 Acknowledgments
- QM7 Dataset Creators - For providing this valuable quantum chemistry benchmark
- RDKit Team - Essential cheminformatics toolkit for molecular manipulation
- PyTorch Geometric - Powerful graph neural network library
- PyTorch Team - Foundational deep learning framework
- Scientific Python Community - NumPy, Pandas, and scikit-learn developers
📚 Citation
If you use this code in your research, please cite:
```bibtex
@software{qm7_processing_2024,
  title={QM7 Dataset Processing and Curation to PyTorch Geometric Molecular Graphs},
  author={[Your Name]},
  year={2024},
  url={https://github.com/shahram-boshra/qm7_database_process}
}
```
🐛 Issues & Support
- Bug Reports: Open an issue
- Feature Requests: Request a feature
- Questions: Start a discussion
⭐ If this project helped your research, please consider giving it a star!
Made with ❤️ for the molecular machine learning community
Owner
- Login: shahram-boshra
- Kind: user
- Repositories: 1
- Profile: https://github.com/shahram-boshra
Citation (citations.py)
# citations.py
"""Format and log literature citations for the QM7 dataset."""
import logging
from typing import Any, Dict, List, Optional

try:
    from exceptions import (
        InvalidCitationDataError,
        MalformedCitationFieldWarning,
        CitationProcessingError,
    )
except ImportError:
    # Fallback definitions. They accept positional context values plus the
    # `message` and `original_exception` keywords used at every raise site
    # below; bare subclasses of the built-ins would reject those keywords.
    class _CitationErrorMixin:
        def __init__(self, *context: Any, message: str = "",
                     original_exception: Optional[BaseException] = None) -> None:
            super().__init__(message or ", ".join(str(c) for c in context))
            self.original_exception = original_exception

    class InvalidCitationDataError(_CitationErrorMixin, TypeError):
        pass

    class MalformedCitationFieldWarning(_CitationErrorMixin, UserWarning):
        pass

    class CitationProcessingError(_CitationErrorMixin, ValueError):
        pass

logger = logging.getLogger(__name__)


def format_citation_for_log(citation_data: Dict[str, Any]) -> str:
    """Return one citation as an indented, log-ready string."""
    # Validate the type before calling .get(), which a non-dict may not have.
    if not isinstance(citation_data, dict):
        raise InvalidCitationDataError(
            type(citation_data),
            message=f"citation_data must be a dictionary. Received type: {type(citation_data)}.",
        )
    citation_key: str = citation_data.get('key', 'N/A')

    # Preferred path: a pre-formatted citation string.
    if "full_citation" in citation_data:
        full_citation_content: Any = citation_data["full_citation"]
        if not isinstance(full_citation_content, str):
            logger.warning(f"Warning: Attempting to convert 'full_citation' to string for key '{citation_key}' as it's not a string.")
            try:
                full_citation_content = str(full_citation_content)
                logger.warning(f"Successfully converted 'full_citation' for key '{citation_key}' to string. Original type: {type(citation_data['full_citation'])}.")
            except Exception as e:
                raise MalformedCitationFieldWarning(
                    citation_key, 'full_citation', full_citation_content,
                    message=f"Failed to convert 'full_citation' to string for key '{citation_key}': {e}",
                    original_exception=e
                ) from e
        try:
            indented_lines: List[str] = [f" {line.strip()}" for line in full_citation_content.strip().split('\n')]
            return "\n".join(indented_lines)
        except AttributeError as e:
            raise MalformedCitationFieldWarning(
                citation_key, 'full_citation', full_citation_content,
                message=f"Error processing 'full_citation' for citation key '{citation_key}'. Value: {full_citation_content}",
                original_exception=e
            ) from e
        except Exception as e:
            raise CitationProcessingError(
                citation_key,
                message=f"An unexpected error occurred while processing 'full_citation' for key '{citation_key}': {e}",
                original_exception=e
            ) from e

    # Fallback path: assemble the citation from individual fields.
    parts: List[str] = []

    def get_safe_field(data_dict: Dict[str, Any], field_name: str, prefix: str = "", suffix: str = "") -> str:
        value: Any = data_dict.get(field_name)
        if value is None:
            return ""
        if not isinstance(value, (str, int, float)):
            logger.warning(f"Warning: Citation key '{citation_key}' has '{field_name}' field of unexpected type '{type(value)}'. Attempting to convert to string.")
            try:
                return f" {prefix}{str(value)}{suffix}"
            except Exception as e:
                raise MalformedCitationFieldWarning(
                    citation_key, field_name, value,
                    message=f"Failed to convert '{field_name}' to string for key '{citation_key}': {e}",
                    original_exception=e
                ) from e
        # Apply the prefix here as well, so titles are quoted and DOIs labelled.
        return f" {prefix}{value}{suffix}"

    try:
        parts.append(get_safe_field(citation_data, 'authors', suffix="."))
        parts.append(get_safe_field(citation_data, 'title', prefix="\"", suffix="\""))
        parts.append(get_safe_field(citation_data, 'journal', suffix="."))
        parts.append(get_safe_field(citation_data, 'conference', suffix="."))

        mid_info: List[str] = []
        for field in ['volume', 'issue', 'pages']:
            value: Any = citation_data.get(field)
            if value is None:
                continue
            if not isinstance(value, (str, int)):
                logger.warning(f"Warning: Citation key '{citation_key}' has '{field}' field of unexpected type '{type(value)}'. Attempting to convert to string.")
                try:
                    value = str(value)
                except Exception as e:
                    raise MalformedCitationFieldWarning(
                        citation_key, field, value,
                        message=f"Failed to convert '{field}' to string for key '{citation_key}': {e}",
                        original_exception=e
                    ) from e
            if field == 'issue':
                mid_info.append(f"({value})")
            elif field == 'pages':
                mid_info.append(f"pages {value}")
            else:
                mid_info.append(str(value))
        if mid_info:
            parts.append(f" {', '.join(mid_info)}.")

        parts.append(get_safe_field(citation_data, 'year', suffix="."))
        parts.append(get_safe_field(citation_data, 'doi', prefix="DOI: ", suffix="."))
        parts.append(get_safe_field(citation_data, 'note', suffix="."))
        return "\n".join([part for part in parts if part.strip()])
    except MalformedCitationFieldWarning:
        raise
    except Exception as e:
        raise CitationProcessingError(
            citation_key,
            message=f"An unexpected error occurred during fallback citation formatting for key '{citation_key}': {e}",
            original_exception=e
        ) from e


def log_qm7_citations(citations_data: List[Dict[str, Any]]) -> None:
    """Log every citation, numbering entries and skipping malformed ones."""
    logger.info("\n--- Citations for QM7 Dataset ---")
    if not citations_data:
        logger.warning("No citation data provided. No citations to log.")
        logger.info("----------------------------------\n")
        return

    for i, citation_data in enumerate(citations_data):
        # Non-dict entries have no key; format_citation_for_log rejects them below.
        citation_key: str = citation_data.get('key', 'N/A') if isinstance(citation_data, dict) else 'N/A'
        try:
            formatted_string: str = format_citation_for_log(citation_data)
            if formatted_string.strip():
                # Number the entry unless the text already starts with its number.
                prefix: str = "" if formatted_string.lstrip().startswith(f"[{i+1}]") else f"[{i+1}] "
                lines: List[str] = formatted_string.split('\n')
                lines[0] = f" {prefix}{lines[0].lstrip()}"
            else:
                logger.warning(f"Formatting for citation at index {i} (key: {citation_key}) resulted in empty string. Logging placeholder.")
                lines = [f" [CITATION {i+1}]: No content generated."]
            logger.info("\n".join(lines))
        except (InvalidCitationDataError, MalformedCitationFieldWarning, CitationProcessingError) as e:
            logger.error(f"Error processing citation entry (index {i}, key: {citation_key}): {e}")
            logger.info(f" [UNLOGGED ENTRY {i+1}]: Skipping due to formatting error.")
            if getattr(e, 'original_exception', None):
                logger.debug(f"Original exception for {citation_key}: {e.original_exception}", exc_info=True)
            continue
        except Exception as e:
            logger.critical(f"An unhandled critical error occurred while processing citation at index {i} (key: {citation_key}): {e}", exc_info=True)
            logger.info(f" [FATAL ERROR] Could not process citation {i+1}. Skipping.")
    logger.info("----------------------------------\n")
GitHub Events
Total
- Push event: 1
Last Year
- Push event: 1
Issues and Pull Requests
Last synced: 9 months ago
All Time
- Total issues: 0
- Total pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Total issue authors: 0
- Total pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 0
- Pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Issue authors: 0
- Pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Dependencies
- python 3.10-slim-buster build
- PyYAML ==6.0.2
- docutils ==0.21.2
- numpy ==2.2.3
- requests ==2.32.3
- sphinx ==8.2.3
- torch-cluster ==1.6.3
- torch-geometric ==2.6.1
- torch-scatter ==6.0.2
- torch-sparse ==0.6.18
- torch_geometric ==2.6.0
- tqdm ==4.67.1