https://github.com/adafede/wd-labels-to-iupac
Science Score: 39.0%
This score indicates how likely this project is to be science-related based on various indicators:
-
○CITATION.cff file
-
✓codemeta.json file
Found codemeta.json file -
✓.zenodo.json file
Found .zenodo.json file -
✓DOI references
Found 2 DOI reference(s) in README -
○Academic publication links
-
○Academic email domains
-
○Institutional organization owner
-
○JOSS paper metadata
-
○Scientific vocabulary similarity
Low similarity (11.6%) to scientific vocabulary
Repository
Basic Info
- Host: GitHub
- Owner: Adafede
- License: other
- Language: Python
- Default Branch: main
- Size: 20.5 KB
Statistics
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 1
- Releases: 0
Metadata Files
README.md
Wikidata-Labels-to-IUPAC Conversion Dataset
Overview
This dataset contains the results of converting labels from Wikidata to molecular structures using OPSIN (Open Parser for Systematic IUPAC Nomenclature). The conversion process validates results by comparing generated InChIKeys with values from Wikidata.
Dataset Statistics
- Total input compounds: 1,341,287
- Successfully processed: 1,341,287
- Successful matches: 838,452
- Success rate: 62.51%
- Processing date: 2025-06-10
Files Description
Primary Data Files
chemical_matches.csv: Successful matches in CSV format- Contains: chemical names, InChIKeys, SMILES structures, processing metadata
- Suitable for: Spreadsheet analysis, statistical processing
chemical_matches.json: Successful matches in JSON format- Contains: Same data as CSV with additional metadata and schema
- Suitable for: Programmatic access, web applications
chemical_matches.sdf(optional): Molecular structures in SDF format- Contains: 3D molecular structures with properties
- Suitable for: Chemical visualization, molecular modeling
Reference and Metadata Files
wikidata_reference.csv: Original Wikidata compounds- Contains: All chemical names and InChIKeys from Wikidata query
- Purpose: Complete reference dataset for reproducibility
processing_statistics.json: Detailed processing statistics- Contains: Success rates, error counts, quality metrics
- Purpose: Quality assessment and method validation
metadata.json: Dataset metadata- Contains: Complete dataset description, methodology, provenance
- Purpose: Zenodo compliance and data citation
Methodology
Data Source
Chemical compound data was retrieved from Wikidata using SPARQL queries targeting:
- Compounds with English language labels (rdfs:label)
- Compounds with InChIKey identifiers (wdt:P235)
Processing Pipeline
- Data Retrieval: SPARQL query to Wikidata endpoint
- Name Processing: Batch conversion using OPSIN v2.8.0
- Structure Validation: SMILES validation using RDKit
- Quality Control: InChIKey comparison for accuracy verification
- Result Export: Multi-format output generation
Quality Assurance
- InChIKey format validation
- SMILES structure validation using RDKit
- Exact InChIKey matching for success determination
- Comprehensive error logging and statistics
Software Requirements
Runtime Dependencies
- Python 3.10+
- Java Runtime Environment (JRE) 8+
- OPSIN v2.8.0 (automatically downloaded)
Python Packages
- RDKit (chemistry toolkit)
- SPARQLWrapper (SPARQL queries)
- requests (HTTP requests)
- tqdm (progress bars)
Limitations and Considerations
- Coverage: Results represent only compounds with both Wikidata names and InChIKeys
- Accuracy: Success determined by exact InChIKey matching
- Scope: Limited to systematic chemical nomenclature parseable by OPSIN
- Language: Only English language chemical names processed
Use
```bash docker build -t wd-labels-to-iupac .
Run (assuming script outputs to current directory)
docker run -v $(pwd):/app/output wd-labels-to-iupac ```
Reproducibility
To reproduce this dataset: 1. Install required dependencies 2. Run the conversion script with default settings 3. Compare results with provided reference data
Citation
If you use this dataset in your research, please cite:
Adriano Rutz (2025). Wikidata-Labels-to-IUPAC Conversion Dataset.
Version 0.0.1. [Dataset]. Zenodo. https://doi.org/[DOI]
License
This dataset is released under the MIT (code) and CC0 (data) License. The original chemical data is from Wikidata (CC0 1.0 Universal).
Contact
For questions or issues regarding this dataset: - Email: adafede@gmail.com - GitHub: https://github.com/Adafede/wd-labels-to-iupac
Acknowledgments
- Egon Willighagen (0000-0001-7542-0286) for the original idea (see https://doi.org/10.59350/dycsw-qeq51)
- Wikidata contributors for chemical compound data
- OPSIN developers for the nomenclature parsing tool
- RDKit developers for chemical informatics capabilities
Owner
- Name: Adriano Rutz
- Login: Adafede
- Kind: user
- Location: Zürich, Switzerland
- Website: https://adafede.github.io/
- Repositories: 10
- Profile: https://github.com/Adafede
Pharmacist | Computational Metabolomics
GitHub Events
Total
- Issues event: 1
- Delete event: 1
- Push event: 2
- Pull request event: 1
- Create event: 3
Last Year
- Issues event: 1
- Delete event: 1
- Push event: 2
- Pull request event: 1
- Create event: 3
Dependencies
- python 3.13-slim build
- SPARQLWrapper <3.0.0,>=2.0.0
- dataclasses python_version<'3.10'
- rdkit <2026.0.0,>=2025.3.1
- requests <3.0.0,>=2.32.4
- tqdm <5.0.0,>=4.67.1
- typing-extensions <5.0.0,>=4.14.0