datazimmer

Data artifact orchestrator

https://github.com/sscu-budapest/datazimmer

Science Score: 75.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 3 DOI reference(s) in README
  • Academic publication links
    Links to: zenodo.org
  • Committers with academic emails
  • Institutional organization owner
    Organization sscu-budapest has institutional domain (sscu-budapest.github.io)
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (13.9%) to scientific vocabulary

Keywords

data-engineering pipelines-as-code research-software
Last synced: 6 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: sscu-budapest
  • License: mit
  • Language: Python
  • Default Branch: main
  • Size: 765 KB
Statistics
  • Stars: 3
  • Watchers: 0
  • Forks: 2
  • Open Issues: 3
  • Releases: 38
Created almost 4 years ago · Last pushed about 2 years ago
Metadata Files
Readme License Citation

README.md

datazimmer

To create a new project

  • make sure that python points to Python 3.8 or later and that pip and git are available, then run pip install datazimmer
  • run dz init project-name
  • add a remote
    • both to git and to dvc (run dz build-meta to see the available dvc remotes)
    • the git remote can also be given to dz init
  • create, register and document the steps of a pipeline that you will run in different environments
  • build the metadata into an exportable, serialized format with dz build-meta
    • if you defined importable data from other artifacts in the config, you can import it with load-external-data
    • make sure that you only import envs served from sources you have access to
  • build and run the pipeline steps with dz run
  • validate that the data matches the datascript description with dz validate
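
The command-line steps above can be collected into a small driver script. This is only a sketch: the helper names `project_setup_commands` and `run_all` are hypothetical, only commands that the list spells out in full are included, and the remote setup and `load-external-data` steps are left out because they depend on the project. Whether `dz init` creates a directory named after the project is also an assumption, so the working directory for the later commands is left to the caller.

```python
import shlex
import subprocess
from typing import List, Optional


def project_setup_commands(project_name: str) -> List[List[str]]:
    """Return the CLI steps named in the README, in order, as argument lists."""
    return [
        ["pip", "install", "datazimmer"],
        ["dz", "init", project_name],
        ["dz", "build-meta"],
        ["dz", "run"],
        ["dz", "validate"],
    ]


def run_all(commands: List[List[str]], cwd: Optional[str] = None, dry_run: bool = True) -> None:
    """Print each command; execute it only when dry_run is False."""
    for cmd in commands:
        print("$", shlex.join(cmd))
        if not dry_run:
            # check=True stops the sequence at the first failing step
            subprocess.run(cmd, check=True, cwd=cwd)


if __name__ == "__main__":
    # Dry run: only prints the commands, nothing is installed or executed.
    run_all(project_setup_commands("project-name"))
```

Keeping the default at `dry_run=True` makes it safe to inspect the sequence before anything touches the environment.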

Scheduling

  • the project as a whole has a cron expression in zimmer.yaml that determines the schedule of reruns
  • additionally, aswan projects within the dz project can have their own cron expressions for scheduling new runs
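
To illustrate what such a cron expression encodes, here is a minimal sketch of a matcher for the standard five-field syntax (minute, hour, day of month, month, day of week). It is not datazimmer's scheduler, and the function names are hypothetical. It supports only `*`, numbers, lists, ranges and steps, treats 0 as Sunday, and requires every field to match, which is a simplification of how real cron combines the two day fields.

```python
from datetime import datetime

# Inclusive (low, high) bounds for: minute, hour, day of month, month, day of week
FIELD_BOUNDS = [(0, 59), (0, 23), (1, 31), (1, 12), (0, 6)]


def expand_field(field: str, low: int, high: int) -> set:
    """Expand one cron field such as '*/15', '1-5' or '0,30' into a set of ints."""
    values = set()
    for part in field.split(","):
        rng, _, step = part.partition("/")
        step_n = int(step) if step else 1
        if rng == "*":
            start, end = low, high
        elif "-" in rng:
            a, b = rng.split("-")
            start, end = int(a), int(b)
        else:
            start = int(rng)
            # 'n/step' means "from n to the upper bound"; a bare 'n' is just n
            end = high if step else start
        values.update(range(start, end + 1, step_n))
    return values


def cron_matches(expr: str, when: datetime) -> bool:
    """Return True if `when` falls on a minute selected by the expression."""
    fields = expr.split()
    if len(fields) != 5:
        raise ValueError("expected five whitespace-separated cron fields")
    # Python's weekday() has Monday=0; cron has Sunday=0
    cron_weekday = (when.weekday() + 1) % 7
    actual = [when.minute, when.hour, when.day, when.month, cron_weekday]
    return all(
        value in expand_field(field, low, high)
        for field, value, (low, high) in zip(fields, actual, FIELD_BOUNDS)
    )
```

For example, `cron_matches("0 3 * * 1", datetime(2023, 7, 24, 3, 0))` is true because that date is a Monday at 03:00, while the same time one day later does not match.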

Test projects

TODO: document dogshow and everything else much better here

Lookahead

  • overlapping names convention
  • resolve naming confusion with colassigner, colaccessor and table feature / composite type / index base classes
  • abstract composite type + subclass of entity class
    • import ACT, inherit from it and specify
    • importing composite type is impossible now if it contains foreign key :(
  • add option to infer data type of assigned feature
    • can be problematic because of the pandas int/float/nan issue
  • create similar sets of features in a dry way
  • overlapping in entities
    • detect / signal the same type of entity
  • exports: postgres, postgis, superset

W3C compliance plan

  • test suite for compliance: https://w3c.github.io/csvw/publishing-snapshots/PR-earl/earl.html
  • https://github.com/w3c/csvw
    • https://www.w3.org/TR/2015/REC-tabular-data-model-20151217/
    • https://www.w3.org/TR/tabular-metadata/

@article{tennison2015model,
  title={Model for tabular data and metadata on the web},
  author={Tennison, Jeni and Kellogg, Gregg and Herman, Ivan},
  year={2015}
}

@article{pollock2015metadata,
  title={Metadata vocabulary for tabular data},
  author={Pollock, Rufus and Tennison, Jeni and Kellogg, Gregg and Herman, Ivan},
  journal={W3C Recommendation},
  volume={17},
  year={2015}
}
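
For orientation, the metadata vocabulary referenced above describes a CSV file with a JSON document. The sketch below builds such a description in Python. It is not datazimmer's export format. The helper name, file name and column names are hypothetical; only the `@context`, `url`, `tableSchema`, `columns`, `name`, `datatype` and `primaryKey` keys come from the W3C vocabulary.

```python
import json


def csvw_table_metadata(csv_url, columns, primary_key=None):
    """Build a minimal CSVW table description for one CSV file.

    `columns` is a list of (name, datatype) pairs, e.g. ("weight_kg", "decimal").
    """
    schema = {"columns": [{"name": name, "datatype": dtype} for name, dtype in columns]}
    if primary_key:
        schema["primaryKey"] = primary_key
    return {
        "@context": "http://www.w3.org/ns/csvw",
        "url": csv_url,
        "tableSchema": schema,
    }


# Hypothetical table, used only for illustration
meta = csvw_table_metadata(
    "dogs.csv",
    [("dog_id", "string"), ("name", "string"), ("weight_kg", "decimal")],
    primary_key="dog_id",
)
print(json.dumps(meta, indent=2))
```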

Owner

  • Name: Social Science Computing Unit Budapest
  • Login: sscu-budapest
  • Kind: organization
  • Email: borza.endre@krtk.mta.hu

Citation (CITATION.cff)

cff-version: 1.2.0
message: If you use this software, please cite it as below.
url: https://github.com/sscu-budapest/datazimmer
authors:
- family-names: Borza
  given-names: Endre Márk
  orcid: https://orcid.org/0000-0002-8804-4520
- family-names: Kovács
  given-names: Bence
  orcid: https://orcid.org/0000-0002-2225-9895
- family-names: Pap
  given-names: Sebestyén
  orcid: https://orcid.org/0000-0002-1987-845X
title: sscu-budapest/datazimmer
version: 0.5.4
date-released: 2023-07-25

Committers

Last synced: almost 3 years ago

All Time
  • Total Commits: 724
  • Total Committers: 3
  • Avg Commits per committer: 241.333
  • Development Distribution Score (DDS): 0.012
Top Committers
Name | Email | Commits
Endre Márk Borza | e****a@g****m | 715
papsebestyen | p****n@g****m | 7
kbenya | k****5@g****m | 2
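
The two derived figures above can be reproduced from the per-committer counts. This sketch assumes the common definition of the Development Distribution Score as one minus the top committer's share of all commits; the page itself does not state the formula, and the function name is hypothetical.

```python
def development_distribution_score(commits_per_author):
    """DDS under the assumed definition: 1 - (top committer's commits / total commits)."""
    total = sum(commits_per_author)
    return 1 - max(commits_per_author) / total


commits = [715, 7, 2]  # counts from the Top Committers table
dds = development_distribution_score(commits)
avg = sum(commits) / len(commits)

print(round(dds, 3))  # 0.012
print(round(avg, 3))  # 241.333
```

A score this close to zero reflects that almost all commits come from a single author.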

Issues and Pull Requests

Last synced: over 1 year ago

All Time
  • Total issues: 2
  • Total pull requests: 8
  • Average time to close issues: N/A
  • Average time to close pull requests: 3 days
  • Total issue authors: 1
  • Total pull request authors: 3
  • Average comments per issue: 0.0
  • Average comments per pull request: 0.75
  • Merged pull requests: 7
  • Bot issues: 0
  • Bot pull requests: 1
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • endremborza (2)
Pull Request Authors
  • endremborza (4)
  • papsebestyen (3)
  • renovate[bot] (1)

Packages

  • Total packages: 1
  • Total downloads:
    • pypi 1,529 last-month
  • Total dependent packages: 0
  • Total dependent repositories: 4
  • Total versions: 38
  • Total maintainers: 1
pypi.org: datazimmer

sscu-budapest utilities for scientific data engineering

  • Versions: 38
  • Dependent Packages: 0
  • Dependent Repositories: 4
  • Downloads: 1,529 Last month
Rankings
Downloads: 5.6%
Dependent repos count: 7.5%
Dependent packages count: 10.0%
Average: 13.5%
Forks count: 19.1%
Stargazers count: 25.0%
Maintainers (1)
Last synced: 6 months ago

Dependencies

.github/workflows/compatibility_test.yml actions
  • actions/checkout v3 composite
  • actions/setup-python v4 composite
.github/workflows/test.yml actions
  • actions/checkout v3 composite
  • actions/setup-python v4 composite
  • codecov/codecov-action v3 composite
  • postgres * docker
.github/workflows/twine_release.yml actions
  • actions/checkout v3 composite
  • actions/setup-python v4 composite
pyproject.toml pypi
  • colassigner >=0.2.2
  • cookiecutter *
  • flit *
  • metazimmer *
  • pandas >=2.0.1
  • parquetranger >=0.2.3
  • pip >=22.0.0
  • pyyaml *
  • setuptools >=60.0.0
  • sqlalchemy >=2.0.0
  • sqlmermaid *
  • structlog *
  • toml *
  • typer *
  • wheel >=0.37.0
  • zimmauth >=0.1.0