https://github.com/cedergrouphub/materialparser
Utility to compile string of chemical terms into data structure with chemical formula and composition
Keywords
chemical-compounds
chemical-terms
composition
material-parser
materials-science
natural-language-processing
python
text-mining
Statistics
- Stars: 7
- Watchers: 4
- Forks: 6
- Open Issues: 5
- Releases: 0
Created almost 8 years ago
· Last pushed over 4 years ago
https://github.com/CederGroupHub/MaterialParser/blob/master/
# MaterialParser
This is an old repo for the text2chem parser. The new one is available at https://github.com/CederGroupHub/text2chem.
Material Parser is a package that parses material entities and converts them into a
chemical structure.
It addresses the problem of unifying the material entities widely used in scientific literature.
The tool is built to facilitate and promote text-mining research in materials science.
**Material Parser** functionality includes:
* converting standard chemical terms into a chemical formula,
* parsing a chemical formula into a chemical composition,
* constructing a dictionary of material abbreviations from text snippets,
* finding the values of stoichiometric and element variables in text snippets,
* splitting mixtures/composites/alloys/solid solutions into compounds.
Note: Material Parser is not intended for finding material entities in text or for performing
chemical named entity recognition (NER).
Rather, it processes and standardizes chemical (material) entities that have already been extracted
and extracts the relevant chemical information from them.
The parser was tested to work well for inorganic material terms; its performance on organic terms has not
been evaluated thoroughly.
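To make "parsing a chemical formula into a chemical composition" concrete, here is a minimal standalone sketch. It is not Material Parser's implementation: the function name `simple_composition` is hypothetical, and it only handles flat formulas with numeric amounts (no brackets, variables, fractions, or hydrates), all of which the real parser is designed to cover.

```python
import re
from collections import OrderedDict

# One element symbol followed by an optional integer or decimal amount.
TOKEN = re.compile(r"([A-Z][a-z]?)(\d+(?:\.\d+)?)?")


def simple_composition(formula):
    """Map each element of a flat formula to its amount (hypothetical helper)."""
    composition = OrderedDict()
    position = 0
    for match in TOKEN.finditer(formula):
        if match.start() != position:
            # Something between two tokens was not recognized (bracket, variable, ...).
            raise ValueError(f"unsupported formula: {formula!r}")
        element, amount = match.groups()
        # A missing amount means 1; repeated elements are summed.
        composition[element] = composition.get(element, 0.0) + float(amount or 1)
        position = match.end()
    if position != len(formula):
        raise ValueError(f"unsupported formula: {formula!r}")
    return composition


print(dict(simple_composition("LiFePO4")))
# {'Li': 1.0, 'Fe': 1.0, 'P': 1.0, 'O': 4.0}
```

A formula such as "Li5+xLa3Ta2-xGexO12" raises `ValueError` here, which is exactly the kind of input the package exists to handle.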
### Pipeline overview
Material Parser runs as a pipeline of default, preprocessing and postprocessing modules.
The default module of the pipeline processes a chemical formula and converts it into a chemical structure.
Preprocessing modules work on the input material string:
they separate the relevant information from the string and store it in the output chemical structure.
Postprocessing modules augment the chemical structure with any other information that can be found in the surrounding text.
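The data flow described above can be sketched in a few lines of plain Python. Everything in this block is illustrative: `run_pipeline`, `split_name`, `store_formula` and `attach_values` are hypothetical stand-ins, a dictionary plays the role of the chemical structure, and the values attached in the last step are made up.

```python
import re

NAME_AND_FORMULA = re.compile(r"^(?P<name>[^()]+?)\s*\((?P<formula>.+)\)$")


def split_name(text, structure):
    """Toy preprocessing step: separate 'name (formula)' into its two parts."""
    match = NAME_AND_FORMULA.match(text)
    if match is None:
        return text, structure
    structure["material_name"] = match.group("name")
    return match.group("formula"), structure


def store_formula(text, structure):
    """Toy default step: whatever is left of the string is treated as the formula."""
    structure["material_formula"] = text
    return structure


def attach_values(structure):
    """Toy postprocessing step: add values that would come from surrounding text."""
    structure["amounts_x"] = {"x": [0.1, 0.2]}  # made-up values for illustration
    return structure


def run_pipeline(material_string, preprocessors, default_step, postprocessors):
    structure = {"material_string": material_string}
    remainder = material_string
    for step in preprocessors:   # each step strips what it recognizes
        remainder, structure = step(remainder, structure)
    structure = default_step(remainder, structure)
    for step in postprocessors:  # each step augments the finished structure
        structure = step(structure)
    return structure


result = run_pipeline(
    "Nasicon (Na1+x+yZr2-yYySixP3-xO12)",
    [split_name], store_formula, [attach_values],
)
print(result["material_name"], result["material_formula"])
```

Because each preprocessing step hands a shortened string to the next one, the order of the steps matters in this sketch.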
The current version of the Material Parser package includes the following preprocessing modules:
* ``PhaseProcessing`` - gets information about the material phase if it is present in the string,
e.g. for "P2-Na2/3(CoxNi1/3-xMn2/3)O2", "P2" will be separated and stored in the structure
while the rest of the string will be sent down the pipeline.
* ``AdditivesProcessing`` - gets information about any additives, such as dopants, stabilizers or mixed compounds,
e.g. i) for "(Na0.5K0.5)NbO3 + 1.5 mol% CuF2", "CuF2" will be separated as an additive
and "(Na0.5K0.5)NbO3" will be sent down the pipeline; ii) for "Ca2BO3Cl:Sm3+, Eu3+", "Sm3+" and "Eu3+" will be separated
and "Ca2BO3Cl" will be sent down the pipeline.
* ``ChemicalNameProcessing`` - attempts to convert a sequence of chemical terms into a formula,
e.g. "zinc (II) acetate dihydrate" will be converted into "Zn(CH3COO)2·2H2O", which will be sent down the pipeline,
and the chemical terms will be stored in the output structure as the material name.
This tool can also separate a chemical formula from the rest of the terms if they were combined by the tokenizer,
e.g. "ammonium molybdate (NH4)6Mo7O24·4H2O" will be split into "ammonium molybdate" (the material name)
and "(NH4)6Mo7O24·4H2O" (the chemical formula that will be sent down the pipeline).
* ``MixtureProcessing`` - splits a mixture/solid solution/alloy/composite/hydrate into its constituent compounds,
e.g. "(1-x-y)BaTiO3-xBaBiO3-y(Bi0.5Na0.5)TiO3" will be split into "BaTiO3", "BaBiO3", and "(Bi0.5Na0.5)TiO3"
with the corresponding amounts "(1-x-y)", "x" and "y", respectively.
* ``PubchemPreprocessing`` - looks up chemical terms in the PubChem database if a chemical formula is not found.
This step usually slows down the overall pipeline performance.
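The first two examples in the list above (a phase prefix and colon-separated dopants) can be mimicked with simple string handling. This is only a sketch of the idea under narrow assumptions: `split_phase` and `split_dopants` are hypothetical helpers, the phase pattern covers only a capital letter plus one digit, and the real modules recognize many more notations.

```python
import re

# Assumed phase notation: one capital letter and one digit before a hyphen, e.g. "P2-".
PHASE_PREFIX = re.compile(r"^([A-Z]\d)-(.+)$")


def split_phase(material_string):
    """Return (phase, remainder); phase is '' when no prefix is recognized."""
    match = PHASE_PREFIX.match(material_string)
    if match is None:
        return "", material_string
    return match.group(1), match.group(2)


def split_dopants(material_string):
    """Return (host, dopants) for strings written as 'host:dopant1, dopant2'."""
    host, _, tail = material_string.partition(":")
    dopants = [item.strip() for item in tail.split(",") if item.strip()]
    return host, dopants


print(split_phase("P2-Na2/3(CoxNi1/3-xMn2/3)O2"))
# ('P2', 'Na2/3(CoxNi1/3-xMn2/3)O2')
print(split_dopants("Ca2BO3Cl:Sm3+, Eu3+"))
# ('Ca2BO3Cl', ['Sm3+', 'Eu3+'])
```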
The postprocessing modules include:
* ``SubstituteAdditives`` - appends dopant elements to the formula so that the total stoichiometry becomes an integer,
or adds a mixture compound to the composition,
e.g. i) "Zn1.92-2xYxLixSiO4:0.08Mn2+" becomes "Mn0.08Zn1.92-2xYxLixSiO4" and
ii) "(Na0.5K0.5)NbO3 + 1.5 mol% CuF2" becomes "(1-x)(Na0.5K0.5)NbO3-(x)CuF2".
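The two rewrites in the example above can be reproduced at the string level as follows. This is a hedged sketch, not the module's logic: `prepend_dopant` and `as_mixture` are hypothetical helpers, they assume the host and additive have already been separated, and they do not check whether the resulting stoichiometry is actually an integer.

```python
import re

# Assumed dopant notation: amount, element symbol, optional charge, e.g. "0.08Mn2+".
DOPANT = re.compile(r"^(?P<amount>\d*\.?\d+)(?P<element>[A-Z][a-z]?)(?P<charge>\d*[+-])?$")


def prepend_dopant(host_formula, dopant):
    """Put the dopant element and its amount in front of the host formula."""
    match = DOPANT.match(dopant)
    if match is None:
        return host_formula  # leave the formula untouched if the notation is unknown
    return f"{match['element']}{match['amount']}{host_formula}"


def as_mixture(host_formula, additive_formula):
    """Write host and additive as a two-component mixture with a variable fraction x."""
    return f"(1-x){host_formula}-(x){additive_formula}"


print(prepend_dopant("Zn1.92-2xYxLixSiO4", "0.08Mn2+"))
# Mn0.08Zn1.92-2xYxLixSiO4
print(as_mixture("(Na0.5K0.5)NbO3", "CuF2"))
# (1-x)(Na0.5K0.5)NbO3-(x)CuF2
```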
### Installation:
```
git clone https://github.com/CederGroupHub/MaterialParser.git
cd MaterialParser
pip install -r requirements.txt -e .
```
### Running default setup:
By default, Material Parser gets chemical composition from a proper material formula
(i.e. no chemical terms, dopants or mixtures):
```
from material_parser.core.material_parser import MaterialParserBuilder
mp = MaterialParserBuilder().build()
output = mp.parse("Li5+xLa3Ta2-xGexO12")
print(output.to_dict())
```
Output:
```
{'additives': [],
 'amounts_x': {'x': []},
 'composition': [{'amount': '1',
                  'elements': OrderedDict([('Li', 'x+5'),
                                           ('La', '3'),
                                           ('Ta', '2-x'),
                                           ('Ge', 'x'),
                                           ('O', '12')]),
                  'formula': 'Li5+xLa3Ta2-xGexO12',
                  'species': OrderedDict([('Li', 'x+5'),
                                          ('La', '3'),
                                          ('Ta', '2-x'),
                                          ('Ge', 'x'),
                                          ('O', '12')])}],
 'elements_x': {},
 'material_formula': 'Li5+xLa3Ta2-xGexO12',
 'material_name': '',
 'material_string': 'Li5+xLa3Ta2-xGexO12',
 'oxygen_deficiency': '',
 'phase': ''}
```
Other attributes can be filled in by adding further pipeline blocks if required.
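The element amounts in the output are strings that may contain variables, such as 'x+5'. One way to turn them into numbers for a chosen value of x is sketched below using only the standard library. The helper `evaluate_amount` is hypothetical and not part of Material Parser, and the `elements` dictionary is copied by hand from the output shown above.

```python
import ast
import operator

_OPERATORS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def evaluate_amount(expression, values):
    """Evaluate an arithmetic expression such as '2-x' for the given variable values."""

    def visit(node):
        if isinstance(node, ast.Expression):
            return visit(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return float(node.value)
        if isinstance(node, ast.Name):
            return float(values[node.id])  # KeyError if a variable has no value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPERATORS:
            return _OPERATORS[type(node.op)](visit(node.left), visit(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -visit(node.operand)
        raise ValueError(f"unsupported expression: {expression!r}")

    return visit(ast.parse(expression, mode="eval"))


# Copied from the 'elements' entry of the output above.
elements = {"Li": "x+5", "La": "3", "Ta": "2-x", "Ge": "x", "O": "12"}
numeric = {el: evaluate_amount(amount, {"x": 0.5}) for el, amount in elements.items()}
print(numeric)
# {'Li': 5.5, 'La': 3.0, 'Ta': 1.5, 'Ge': 0.5, 'O': 12.0}
```

Walking the syntax tree instead of calling `eval` keeps the sketch restricted to plain arithmetic on known variables.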
### Adding modules to the pipeline:
```
from material_parser.core.material_parser import MaterialParserBuilder
from material_parser.core.preprocessing_tools.additives_processing import AdditivesProcessing
from material_parser.core.preprocessing_tools.chemical_name_processing import ChemicalNameProcessing
from material_parser.core.preprocessing_tools.mixture_processing import MixtureProcessing
from material_parser.core.preprocessing_tools.phase_processing import PhaseProcessing
from material_parser.core.postprocessing_tools.substitute_additives import SubstituteAdditives
mp = MaterialParserBuilder() \
    .addPreprocessing(AdditivesProcessing()) \
    .addPreprocessing(ChemicalNameProcessing()) \
    .addPreprocessing(PhaseProcessing()) \
    .addPreprocessing(MixtureProcessing()) \
    .addPostprocessing(SubstituteAdditives()) \
    .build()
output = mp.parse("Nasicon (Na1+x+yZr2-yYySixP3-xO12)")
print(output.to_dict())
```
Output:
```
{'additives': [],
 'amounts_x': {'x': [], 'y': []},
 'composition': [{'amount': '1',
                  'elements': OrderedDict([('Na', 'x+y+1'),
                                           ('Zr', '2-y'),
                                           ('Y', 'y'),
                                           ('Si', 'x'),
                                           ('P', '3-x'),
                                           ('O', '12')]),
                  'formula': 'Na1+x+yZr2-yYySixP3-xO12',
                  'species': OrderedDict([('Na', 'x+y+1'),
                                          ('Zr', '2-y'),
                                          ('Y', 'y'),
                                          ('Si', 'x'),
                                          ('P', '3-x'),
                                          ('O', '12')])}],
 'elements_x': {},
 'material_formula': 'Na1+x+yZr2-yYySixP3-xO12',
 'material_name': 'Nasicon',
 'material_string': 'Nasicon (Na1+x+yZr2-yYySixP3-xO12)',
 'oxygen_deficiency': '',
 'phase': ''}
```
Note: the order of the modules may affect the resulting output.
### Running customized pipeline:
Material Parser supports custom preprocessing and postprocessing modules.
These are defined by the corresponding interfaces:
``core/preprocessing_tools/preprocessing_abc.py`` and ``core/postprocessing_tools/postprocessing_abc.py``.
Add a class implementing the interface to the corresponding directory and import it as a regular module.
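As a rough illustration of what such a custom module might look like, the sketch below defines its own stand-in base class. The abstract methods actually required are declared in ``preprocessing_abc.py`` and are not reproduced in this README, so `PreprocessingSketch`, `process_string`, `PurityProcessing` and the `purity` key are all assumptions made for the example; check the interface file before adapting it.

```python
import re
from abc import ABC, abstractmethod


class PreprocessingSketch(ABC):
    """Stand-in for the real interface; the method name here is an assumption."""

    @abstractmethod
    def process_string(self, material_string, structure):
        """Return (remaining string, updated structure)."""


class PurityProcessing(PreprocessingSketch):
    """Hypothetical module: moves a trailing purity such as '(99.9%)' into the structure."""

    PURITY = re.compile(r"\s*\(?(\d+(?:\.\d+)?)\s*%\)?\s*$")

    def process_string(self, material_string, structure):
        match = self.PURITY.search(material_string)
        if match is None:
            return material_string, structure
        structure["purity"] = match.group(1)  # 'purity' is not a documented attribute
        return material_string[:match.start()], structure


remainder, structure = PurityProcessing().process_string("TiO2 (99.9%)", {})
print(remainder, structure)
# TiO2 {'purity': '99.9'}
```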
### Output
``mp.parse()`` returns a ``ChemicalStructure`` with the following attributes:
* material_string: ``string``
* material_name: ``string``
* material_formula: ``string``
* additives: ``list`` or ``string``
* phase: ``string``
* oxygen_deficiency: ``char``
* amounts_x: ``Variables``
* elements_x: ``Variables``
* composition: ``list`` of ``Compound``
### Citing
If you use Material Parser in your work, please cite the following paper:
* Kononova et al., "Text-mined dataset of inorganic materials synthesis recipes", Scientific Data 6 (1), 1-11 (2019)
[10.1038/s41597-019-0224-1](https://www.nature.com/articles/s41597-019-0224-1)
Owner
- Name: Ceder Group
- Login: CederGroupHub
- Kind: organization
- Website: http://ceder.berkeley.edu/
- Repositories: 19
- Profile: https://github.com/CederGroupHub
Dependencies
requirements.txt
pypi
- pubchempy *
- regex *
- sympy *
setup.py
pypi
- pubchempy *
- regex *
- sympy *