https://github.com/berquist/reptar
A tool for computing, storing, and analyzing manuscript-scale computational chemistry and biology data
Science Score: 23.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ○ codemeta.json file
- ○ .zenodo.json file
- ✓ DOI references: Found 5 DOI reference(s) in README
- ✓ Academic publication links: Links to zenodo.org
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: Low similarity (14.8%) to scientific vocabulary
Last synced: 7 months ago
Repository
Basic Info
- Host: GitHub
- Owner: berquist
- License: mit
- Default Branch: main
- Homepage: https://www.aalexmmaldonado.com/reptar/
- Size: 23.9 MB
Statistics
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
- Releases: 0
Fork of oasci/reptar
Created over 2 years ago · Last pushed over 2 years ago
https://github.com/berquist/reptar/blob/main/
reptar
A tool for computing, storing, and analyzing manuscript-scale computational chemistry and biology data
Documentation
Motivation Installation File Types Key-value pairs Workflow License
# Motivation

The computational chemistry and biology communities often fail to openly provide the raw and/or processed data used to draw their scientific conclusions. For large projects, frameworks such as [QCArchive](https://qcarchive.molssi.org/), [Materials Project](https://materialsproject.org/), [Pitt Quantum Repository](https://pqr.pitt.edu/), [ioChem-BD](https://www.iochem-bd.org/), and many others provide great storage solutions. This approach would not be practical for fluid data pipelines and small-scale projects such as a single manuscript.

Alternatively, you could use individual files in formats such as JSON, XML, YAML, npz, etc. These are great options for customizable data storage, each with its own advantages and disadvantages. However, you often must choose between (1) a standardized parser that might not support your workflow or (2) writing your own.

Reptar is designed for easy data storage and analysis for individual projects. Customizable parsers provide a simple way to extract new data without submitting issues and pull requests (although contributing them is highly encouraged). While files are the heart of reptar, it strives to be file-type agnostic by providing the same interface for all supported file types. The result is a user-specified file streamlined for analysis in Python and archival on places such as [GitHub](https://github.com/) and [Zenodo](https://zenodo.org/).

# Installation

You can install reptar from [PyPI](https://pypi.org/project/reptar/) by using `pip install reptar`. Alternatively, the latest development version can be installed directly from the [GitHub repository](https://github.com/aalexmmaldonado/reptar) or from [TestPyPI](https://test.pypi.org/project/reptar/).

```bash
git clone https://github.com/aalexmmaldonado/reptar
cd reptar
pip install .
```

# File types

Reptar supports four file types with a single interface: exdir, zarr, JSON, and npz. JSON is a text file for storing key-value pairs with few dimensions (i.e., no large arrays).
NumPy's npz format is useful for arrays; however, no nesting is possible and loading data often requires postprocessing for 0D arrays (e.g., ``np.array('data')``). Exdir is a simple, yet powerful open file format that mimics the [HDF5](https://www.hdfgroup.org/solutions/hdf5/) format with metadata and data stored in directories with YAML and npy files instead of a single binary file. For more detailed information, please read this [*Front. Neuroinform.* article about exdir](https://doi.org/10.3389/fninf.2018.00016). [Zarr](https://zarr.dev/) is a similar hierarchical data format for chunked and compressed NumPy-like arrays and JSON attributes. Both exdir and zarr provide several advantages, such as mixing human-readable and binary files, being easier to version control, and only loading requested portions of arrays into memory.

# Key-value pairs

All data is stored under a ``key``-``value`` pair within the reptar framework. The ``key`` tells reptar where the data is stored and is conceptually related to standard file paths (without file extensions). Nested data is specified by separating the nested keys with a ``/``. For example, ``energy_pot``, ``md_run/geometry``, and ``entity_ids`` are all valid keys. Note that ``gradients`` and ``/gradients`` would translate to the same value (``/`` specifies the "root" of the file).

# Workflow

## Storing data

We refer to a "reptar file" as any file that can be used with the ``reptar.File`` class. Creating a reptar file starts with a set of data files generated from some calculation. Paths to these data files are passed into ``reptar.Creator.from_calc``, which extracts information using a ``reptar.parser`` class. Information parsed from these files, ``parsed_info``, is then used to populate a ``reptar.File`` object. Data can also be manually added by using ``File.put(key, data)``, where ``key`` is a string specifying where to store the data.
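As a rough illustration of the key convention described above, the sketch below stores values in a plain nested dictionary addressed by ``/``-separated keys. It is not reptar's implementation: ``split_key``, ``put``, and ``get`` are hypothetical helpers written for this example, a ``dict`` stands in for a reptar file, and the stored numbers are arbitrary placeholders.

```python
def split_key(key):
    """Split a '/'-separated key into parts, ignoring a leading root '/'."""
    return [part for part in key.split("/") if part]


def put(store, key, data):
    """Store data under a nested key, creating intermediate groups as needed."""
    *groups, name = split_key(key)
    node = store
    for group in groups:
        node = node.setdefault(group, {})
    node[name] = data


def get(store, key):
    """Retrieve the data stored under a nested key."""
    node = store
    for part in split_key(key):
        node = node[part]
    return node


store = {}  # stand-in for a reptar file (assumption for this sketch)

# Arbitrary placeholder values, not results from any calculation.
put(store, "energy_pot", -1.0)
put(store, "md_run/geometry", [[0.0, 0.0, 0.0], [0.0, 0.0, 1.0]])

# A leading '/' denotes the root, so both spellings address the same value.
assert get(store, "energy_pot") == get(store, "/energy_pot")
```

Treating keys like file paths keeps the lookup logic independent of the storage backend, which is the same idea that lets one interface sit on top of several file types.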
## Accessing data

Data can be added or retrieved using the same interface regardless of the underlying file format (e.g., exdir, JSON, and npz). The only thing required is the respective ``key`` specifying where it is stored. Then, ``File.get(key)`` can retrieve the data. When working with JSON and npz files, ``File.save()`` must be explicitly called after any modification.

## Writing to other formats

Other packages often require data to be formatted in their own specific way. Reptar provides ways to extract data from reptar files using ``File.get(key)`` and passing it into the desired ``reptar.writer`` function. Reptar currently automates the creation of:

- [Atomic simulation environment (ASE) databases](https://wiki.fysik.dtu.dk/ase/tutorials/tut06_database/database.html),
- [Gaussian approximate potentials (GAP) extended XYZ files](https://libatoms.github.io/GAP/gap_fit.html#data),
- [Protein data bank (PDB) files](https://www.wwpdb.org/documentation/file-format),
- [Schnetpack databases](https://schnetpack.readthedocs.io/en/stable/tutorials/tutorial_01_preparing_data.html),
- [XYZ files](https://en.wikipedia.org/wiki/XYZ_file_format).

# License

Distributed under the MIT License. See `LICENSE` for more information.
Owner
- Name: Eric Berquist
- Login: berquist
- Kind: user
- Location: Boston, MA
- Company: Sandia National Laboratories
- Repositories: 403
- Profile: https://github.com/berquist
full-stack quantum chemist