https://github.com/catalyst-cooperative/mozilla-sec-eia

Exploratory development for SEC to EIA linkage

Repository

Basic Info
  • Host: GitHub
  • Owner: catalyst-cooperative
  • License: mit
  • Language: Jupyter Notebook
  • Default Branch: main
  • Size: 8.82 MB
Statistics
  • Stars: 0
  • Watchers: 5
  • Forks: 0
  • Open Issues: 14
  • Releases: 0
Created over 2 years ago · Last pushed 10 months ago
Metadata Files
Readme Funding License Code of conduct

README.rst

pudl-models: ML models developed for PUDL
=======================================================================================

.. readme-intro

.. image:: https://github.com/catalyst-cooperative/mozilla-sec-eia/workflows/tox-pytest/badge.svg
   :target: https://github.com/catalyst-cooperative/mozilla-sec-eia/actions?query=workflow%3Atox-pytest
   :alt: Tox-PyTest Status

.. image:: https://img.shields.io/codecov/c/github/catalyst-cooperative/mozilla-sec-eia?style=flat&logo=codecov
   :target: https://codecov.io/gh/catalyst-cooperative/mozilla-sec-eia
   :alt: Codecov Test Coverage

.. image:: https://img.shields.io/readthedocs/catalystcoop-mozilla-sec-eia?style=flat&logo=readthedocs
   :target: https://catalystcoop-mozilla-sec-eia.readthedocs.io/en/latest/
   :alt: Read the Docs Build Status

.. image:: https://img.shields.io/pypi/v/catalystcoop.mozilla-sec-eia?style=flat&logo=python
   :target: https://pypi.org/project/catalystcoop.mozilla-sec-eia/
   :alt: PyPI Latest Version

.. image:: https://img.shields.io/conda/vn/conda-forge/catalystcoop.mozilla-sec-eia?style=flat&logo=condaforge
   :target: https://anaconda.org/conda-forge/catalystcoop.mozilla-sec-eia
   :alt: conda-forge Version

.. image:: https://img.shields.io/badge/code%20style-black-000000.svg
   :target: https://github.com/psf/black
   :alt: Any color you want, so long as it's black.

About
-----
The PUDL project makes US energy data free and open
for all. For more information, see the PUDL repo and website.

This repo implements machine learning models which support PUDL. The types of
modelling performed here include record linkage between datasets, and extracting
structured data from unstructured documents. The outputs of these models then feed
into PUDL tables, and are distributed in the PUDL data warehouse.

Project Structure
-----------------
This repo is split into two main sections, with shared tooling being implemented in
``src/mozilla_sec_eia/library`` and actual models implemented in
``src/mozilla_sec_eia/models``.

Models
^^^^^^
Each model is contained in its own Dagster code location. This keeps models
isolated from each other, allows fine-grained dependency management, and provides
useful organization in the Dagster UI. To add a new model, create a new Python module
in the ``src/mozilla_sec_eia/models/`` directory. This module should define a single
Dagster ``Definitions`` object that can be imported from the top level of the module.
For an example of how to structure a code location, see
``src/mozilla_sec_eia/models/sec10k/``. After creating a new model, add it to
``workspace.yaml``.
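
As a rough illustration, a new code location module might look like the minimal
sketch below. It uses only core Dagster APIs. The module path, asset name, and job
name are hypothetical and are not taken from this repo.

.. code-block:: python

   # Hypothetical file: src/mozilla_sec_eia/models/example_model/__init__.py
   from dagster import Definitions, asset, define_asset_job


   @asset
   def example_model_output():
       """Stand-in for a real model output (hypothetical asset)."""
       return [1, 2, 3]


   # A job that materializes the asset above. The name is illustrative only.
   example_production_job = define_asset_job(
       name="example_production_job",
       selection="example_model_output",
   )

   # The single top-level Definitions object that Dagster loads for this
   # code location.
   defs = Definitions(
       assets=[example_model_output],
       jobs=[example_production_job],
   )

Assuming the module were named ``example_model``, it could then be loaded on its own
with ``dagster dev -m mozilla_sec_eia.models.example_model``.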

There are three types of Dagster jobs expected in a model code location:

* **Production Jobs**: Production jobs define a pipeline to execute a model and produce
  outputs which typically feed into PUDL.
* **Validation Jobs**: Validation jobs are used to test/validate models. They are
  run in a single process, backed by an mlflow run so that results can be logged
  to a tracking server.
* **Training Jobs**: Training jobs are meant to train models and log results with
  mlflow for use in production jobs.

There are helper functions in ``src/mozilla_sec_eia/library/model_jobs.py`` for
constructing each of these jobs. These functions help to ensure each job will
use the appropriate executor and supply the job with necessary resources.
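
The helper functions themselves are not reproduced here. As a hedged sketch of what
"appropriate executor" can mean, the example below uses core Dagster's
``in_process_executor`` to pin a validation-style job to a single process. Whether the
helpers in ``model_jobs.py`` do exactly this is an assumption, and the asset and job
names are hypothetical.

.. code-block:: python

   from dagster import Definitions, asset, define_asset_job, in_process_executor


   @asset
   def example_validation_metrics():
       """Stand-in for a validation result (hypothetical asset)."""
       return {"precision": 1.0}


   # Run every step of this job in one process rather than in
   # separate subprocesses.
   example_validation_job = define_asset_job(
       name="example_validation_job",
       selection="example_validation_metrics",
       executor_def=in_process_executor,
   )

   defs = Definitions(
       assets=[example_validation_metrics],
       jobs=[example_validation_job],
   )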

Library
^^^^^^^
Generic shared tooling for ``pudl-models`` is defined in
``src/mozilla_sec_eia/library/``. This includes the helper functions for
constructing Dagster jobs discussed above, as well as useful methods for computing
validation metrics, and an interface to our mlflow tracking server.
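
To give a sense of what a validation metric for record linkage can look like, here is
a minimal standard-library sketch that computes precision and recall over predicted
versus known links. The function name and the record IDs are hypothetical. This is
not the library's actual implementation.

.. code-block:: python

   def linkage_precision_recall(predicted, truth):
       """Return (precision, recall) for two collections of (left_id, right_id) links."""
       predicted, truth = set(predicted), set(truth)
       true_positives = len(predicted & truth)
       # Guard against empty inputs to avoid dividing by zero.
       precision = true_positives / len(predicted) if predicted else 0.0
       recall = true_positives / len(truth) if truth else 0.0
       return precision, recall


   # Made-up IDs for illustration only.
   predicted_links = {("sec_1", "eia_10"), ("sec_2", "eia_20"), ("sec_3", "eia_99")}
   true_links = {("sec_1", "eia_10"), ("sec_2", "eia_20"), ("sec_4", "eia_40")}

   precision, recall = linkage_precision_recall(predicted_links, true_links)

Here two of the three predicted links are correct and two of the three true links are
recovered, so both precision and recall come out to two thirds.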

MlFlow
""""""
We use a remote mlflow tracking server to aid in the
development and management of ``pudl-models``. In the ``mlflow`` module, there are
several Dagster resources and IO managers that can be used in any model to provide a
simple, seamless interface to the server.

.. TODO: Add mlflow resource/io-manager examples

Development
-----------
To launch the Dagster UI with all ``pudl-models`` loaded, run the command ``dagster dev``
from the top level of this repo. This loads the file ``workspace.yaml``, which points
to each model. You can also work on a single model in isolation by running the command:
``dagster dev -m mozilla_sec_eia.models.{your_cool_model}``.

About Catalyst Cooperative
---------------------------------------------------------------------------------------
Catalyst Cooperative is a small group of data
wranglers and policy wonks organized as a worker-owned cooperative consultancy.
Our goal is a more just, livable, and sustainable world. We integrate public
data and perform custom analyses to inform public policy (Hire us!). Our focus
is primarily on mitigating climate change and improving electric utility
regulation in the United States.

Contact Us
^^^^^^^^^^
* For general support, questions, or other conversations around the project
  that might be of interest to others, check out the GitHub Discussions.
* If you'd like to get occasional updates about our projects,
  sign up for our email list.
* Want to schedule a time to chat with us one-on-one? Join us for Office Hours.
* Follow us on Twitter: @CatalystCoop
* More info on our website: https://catalyst.coop
* For private communication about the project or to hire us to provide customized data
  extraction and analysis, you can email the maintainers: pudl@catalyst.coop

Owner

  • Name: Catalyst Cooperative
  • Login: catalyst-cooperative
  • Kind: organization
  • Email: hello@catalyst.coop
  • Location: United States of America

Catalyst is a small data engineering cooperative working on electricity regulation and climate change.

GitHub Events

Total
  • Issues event: 13
  • Delete event: 25
  • Issue comment event: 22
  • Push event: 108
  • Pull request review event: 47
  • Pull request review comment event: 50
  • Pull request event: 84
  • Create event: 34
Last Year
  • Issues event: 13
  • Delete event: 25
  • Issue comment event: 22
  • Push event: 108
  • Pull request review event: 47
  • Pull request review comment event: 50
  • Pull request event: 84
  • Create event: 34

Issues and Pull Requests

Last synced: 10 months ago

All Time
  • Total issues: 1
  • Total pull requests: 20
  • Average time to close issues: N/A
  • Average time to close pull requests: 7 days
  • Total issue authors: 1
  • Total pull request authors: 3
  • Average comments per issue: 0.0
  • Average comments per pull request: 0.6
  • Merged pull requests: 1
  • Bot issues: 0
  • Bot pull requests: 18
Past Year
  • Issues: 1
  • Pull requests: 20
  • Average time to close issues: N/A
  • Average time to close pull requests: 7 days
  • Issue authors: 1
  • Pull request authors: 3
  • Average comments per issue: 0.0
  • Average comments per pull request: 0.6
  • Merged pull requests: 1
  • Bot issues: 0
  • Bot pull requests: 18
Top Authors
Issue Authors
  • katie-lamb (22)
  • zschira (6)
  • dependabot[bot] (2)
  • zaneselvans (1)
  • pre-commit-ci[bot] (1)
Pull Request Authors
  • dependabot[bot] (57)
  • pre-commit-ci[bot] (27)
  • zschira (11)
  • katie-lamb (10)
  • jdangerx (2)
Top Labels
Issue Labels
dependencies (2)
Pull Request Labels
dependencies (57) github_actions (3)

Dependencies

.github/workflows/bot-auto-merge.yml actions
  • ridedott/merge-me-action v2 composite
  • tibdex/github-app-token v2 composite
.github/workflows/release.yml actions
  • 8398a7/action-slack v3 composite
  • actions/checkout v4 composite
  • actions/download-artifact v4 composite
  • actions/setup-python v5 composite
  • actions/upload-artifact v4 composite
  • pypa/gh-action-pypi-publish release/v1 composite
  • sigstore/gh-action-sigstore-python v2.1.1 composite
.github/workflows/tox-pytest.yml actions
  • 8398a7/action-slack v3 composite
  • actions/checkout v4 composite
  • codecov/codecov-action v4 composite
  • google-github-actions/auth v2 composite
  • mamba-org/setup-micromamba v1 composite
pyproject.toml pypi
  • pandas [parquet,excel,fss,gcp,compression]>=2,<3
  • pydantic [email]>=2,<3
  • sqlalchemy >=2,<3
environment.yml conda
  • jupyterlab >=3.2,<4
  • nbconvert >=6,<7
  • nodejs
  • pip >=21,<24
  • python >=3.10,<3.12
  • setuptools >=66,<69