open-data-pipeline

A pipeline for processing, enhancing, and sharing open datasets.

https://github.com/uk-ipop/open-data-pipeline

Science Score: 52.0%

This score indicates how likely this project is to be science-related based on various indicators:

✓ CITATION.cff file: Found CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
✓ .zenodo.json file: Found .zenodo.json file
○ DOI references
○ Academic publication links
○ Committers with academic emails
✓ Institutional organization owner: Organization uk-ipop has institutional domain (pharmacy.uky.edu)
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (12.3%) to scientific vocabulary

Keywords

actions automation data python

Keywords from Contributors

interpretability standardization hack

Last synced: 6 months ago

Repository

A pipeline for processing, enhancing, and sharing open datasets.

Basic Info

Host: GitHub
Owner: UK-IPOP
License: gpl-3.0
Language: Python
Default Branch: main
Homepage: https://uk-ipop.github.io/open-data-pipeline/
Size: 45.6 MB

Statistics

Stars: 2
Watchers: 0
Forks: 0
Open Issues: 4
Releases: 0

Topics

actions automation data python

Created over 3 years ago · Last pushed 6 months ago

Metadata Files

Readme License Citation

Medical Examiner Open Data Pipeline

This repository contains the code for the Medical Examiner Open Data Pipeline.

We currently fetch data from the following sources:

Cook County Medical Examiner's Archives
San Diego Medical Examiner's Office
Milwaukee County Medical Examiner's Office
Connecticut (State) Accidental Drug Deaths
Santa Clara County Medical Examiner's Office
Sacramento County Medical Examiner's Office
Pima County Medical Examiner's Office
- This source is a manual data dump in collaboration with the Pima County ME/C Office. Data is refreshed monthly.

The results of this pipeline are used in other analyses here on GitHub:

Cook County
- Where we add geospatial data to the Cook County data
- This was excluded from this automated pipeline due to specific requirements for the data for only Cook County

Getting Started

This repo exists mainly to take advantage of GitHub Actions for automation.

The actions workflow is located in .github/workflows/pipeline.yml and is triggered weekly or manually.

This workflow fetches data from the sources configured in config.json, geocodes addresses (when available) using ArcGIS, extracts drugs using the drug extraction toolbox, and then compiles the results into a zip archive published on the GitHub Releases page.
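
As a rough illustration of that flow, here is a minimal sketch that reads a list of sources from a JSON config, writes one result file per source, and zips the results. The `sources` and `name` keys, the function names, and the output layout are assumptions made for this example; they are not taken from the real config.json or the opendata-pipeline CLI, and the geocoding and drug extraction stages are left as a comment.

```python
import json
import zipfile
from pathlib import Path


def load_sources(config_path: Path) -> list[dict]:
    """Read the list of data sources from a JSON config file."""
    # Assumption: a top-level "sources" list; the real schema may differ.
    return json.loads(config_path.read_text())["sources"]


def run_pipeline(config_path: Path, out_dir: Path) -> Path:
    """Run placeholder stages for every source and zip the results."""
    out_dir.mkdir(parents=True, exist_ok=True)
    for source in load_sources(config_path):
        # Stand-in for the fetch step; a real run would download records.
        records = [{"source": source["name"]}]
        # Geocoding and drug extraction would transform `records` here.
        (out_dir / f"{source['name']}.json").write_text(json.dumps(records))

    # Bundle every per-source file into a single archive.
    archive = out_dir / "results.zip"
    with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(out_dir.glob("*.json")):
            zf.write(path, arcname=path.name)
    return archive
```

Calling `run_pipeline(Path("config.json"), Path("out"))` would leave a `results.zip` next to the per-source files under these assumptions.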

The data is then available for download from the releases page.

Further, the entire workflow effectively runs a series of commands using the CLI application opendata-pipeline which is located in the src directory.

This is also available via a docker image hosted on ghcr.io. The benefit of using the CLI via a docker image is that you don't need Python or the drug toolbox on your local machine 🙂.

We utilize async methods to speed up the large number of web requests we make to the data sources.
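
To show the general idea, here is a small sketch of concurrent fetching with a cap on the number of requests in flight. It is not the pipeline's actual code: `fetch_page` is a stand-in that only sleeps, where a real version would make an HTTP call (the project lists aiohttp as a dependency), and the names and the limit of 5 are illustrative assumptions.

```python
import asyncio


async def fetch_page(source: str, page: int) -> dict:
    """Stand-in for one HTTP request; a real version would call an HTTP client."""
    await asyncio.sleep(0.01)  # simulates network latency
    return {"source": source, "page": page}


async def fetch_all(source: str, pages: int, limit: int = 5) -> list[dict]:
    """Fetch all pages concurrently, with at most `limit` requests in flight."""
    semaphore = asyncio.Semaphore(limit)

    async def bounded(page: int) -> dict:
        async with semaphore:
            return await fetch_page(source, page)

    # gather returns results in the same order as the awaitables passed in
    return await asyncio.gather(*(bounded(p) for p in range(pages)))


results = asyncio.run(fetch_all("example-source", pages=20))
```

The semaphore keeps the client from opening every connection at once, which is usually kinder to the data source than unbounded concurrency.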

It is important to regularly fetch/pull from this repo to maintain an updated config.json.

We currently do not guarantee Windows support, unfortunately. If you want to help make that a reality, please submit a new Pull Request.

There is further API documentation available on the GitHub Pages website for this repo if you want to interact with the CLI. I recommend using the docker image, as it is easier to use, and always referring to the CLI's --help output for more information.

NOTE: The Census Bureau has recently made changes that make it harder to download files from servers, so if you add a location to the configuration, make sure its corresponding census tract file is downloaded and placed into the data/spatial folder. You can do this by running the following command:

wget -P data/spatial <URL> where the URL is that of the TIGER TRACT zip file, for example: https://www2.census.gov/geo/tiger/TIGER2024/TRACT/tl_2024_09_tract.zip

Or, an example of the url: https://www2.census.gov/geo/tiger/TIGER2024/TRACT/tl2024_tract.zip
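
If you would rather script the download, the following sketch builds the URL and saves the file into data/spatial. It assumes the standard TIGER/Line naming pattern `tl_<year>_<state FIPS>_tract.zip`; the function names are made up for this example, and the plain `urlretrieve` call may be rejected by the Census server in some environments, as the note above warns.

```python
import urllib.request
from pathlib import Path

TIGER_BASE = "https://www2.census.gov/geo/tiger"


def tract_url(year: int, state_fips: str) -> str:
    """Build the TIGER/Line census tract URL for one state."""
    fips = state_fips.zfill(2)  # state FIPS codes are two digits, e.g. "09"
    return f"{TIGER_BASE}/TIGER{year}/TRACT/tl_{year}_{fips}_tract.zip"


def download_tract(year: int, state_fips: str, dest: Path = Path("data/spatial")) -> Path:
    """Download one state's tract zip file into `dest` and return its path."""
    dest.mkdir(parents=True, exist_ok=True)
    url = tract_url(year, state_fips)
    target = dest / url.rsplit("/", 1)[-1]  # keep the server's file name
    urllib.request.urlretrieve(url, target)
    return target
```

For example, `download_tract(2024, "09")` would request the Connecticut tract file for 2024.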

Workflow

The workflow can best be described by looking at the pipeline.yml file.

[Image: screenshot of the pipeline workflow, 2023-01-18]

Data Enhancements

The following table shows the fields that we add to the original datafiles:

Drug Columns

In addition to providing the extracted drugs as a separate file in each release, we also convert this data to wide-form for each dataset. This adds the following columns in the subsequent pattern:
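
To make the long-to-wide idea concrete, here is a minimal sketch that pivots (record, drug) rows into one row per record with a 0/1 flag per drug. The `record_id` and `drug` keys and the `<drug>_found` column naming are placeholders chosen for this example; they are not the pipeline's actual schema or column pattern.

```python
from collections import defaultdict


def to_wide(rows: list[dict], id_key: str = "record_id", drug_key: str = "drug") -> list[dict]:
    """Pivot long-form (record, drug) rows into one row per record with 0/1 flags."""
    drugs = sorted({row[drug_key] for row in rows})
    found: dict[str, set[str]] = defaultdict(set)
    for row in rows:
        found[row[id_key]].add(row[drug_key])
    # One output row per record, one hypothetical "<drug>_found" column per drug.
    return [
        {id_key: rid, **{f"{d}_found": int(d in hits) for d in drugs}}
        for rid, hits in sorted(found.items())
    ]


long_rows = [
    {"record_id": "A1", "drug": "fentanyl"},
    {"record_id": "A1", "drug": "heroin"},
    {"record_id": "B2", "drug": "fentanyl"},
]
wide_rows = to_wide(long_rows)
```

Here record A1 ends up with both flags set, while B2 has only the fentanyl flag set.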

Requirements

uv

Installation

To install the Python CLI, I recommend using uv.

uvx opendata-pipeline

To install the docker image, you can use the following command:

docker pull ghcr.io/uk-ipop/opendata-pipeline:latest

Usage

Usage is very similar to any other command line application. The most important thing is to follow the workflow defined above.

Contributing

Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.

Help me write some tests!

License

MIT

BibTeX Citation

If you use this software or the enhanced data, please cite this repository:

@software{Anthony_Medical_Examiner_OpenData_2022,
  author = {Anthony, Nicholas},
  month = {9},
  title = {{Medical Examiner OpenData Pipeline}},
  url = {https://github.com/UK-IPOP/open-data-pipeline},
  version = {0.2.1},
  year = {2022}
}

Thank you.

Owner

Name: UK IPOP
Login: UK-IPOP
Kind: organization
Location: Lexington, KY

Website: https://pharmacy.uky.edu/ipop-cloned
Repositories: 11
Profile: https://github.com/UK-IPOP

University of Kentucky IPOP

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software or the data, please cite it as below."
authors:
  - family-names: "Anthony"
    given-names: "Nicholas"
    orcid: "https://orcid.org/my-orcid?orcid=0000-0002-6692-3401"
title: "Medical Examiner OpenData Pipeline"
version: 0.2.1
date-released: 2022-09-13
url: "https://github.com/UK-IPOP/open-data-pipeline"

GitHub Events

Total

Create event: 35
Issues event: 21
Release event: 35
Delete event: 3
Issue comment event: 32
Push event: 75
Pull request event: 4

Last Year

Create event: 35
Issues event: 21
Release event: 35
Delete event: 3
Issue comment event: 32
Push event: 75
Pull request event: 4

Committers

Last synced: 9 months ago

All Time

Total Commits: 385
Total Committers: 2
Avg Commits per committer: 192.5
Development Distribution Score (DDS): 0.442
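
The page does not define the score. As a sketch, assuming DDS is one minus the top committer's share of all commits, the all-time figure can be reproduced from the commit counts listed under Top Committers below (215 and 170); the function name is illustrative.

```python
def development_distribution_score(commit_counts: list[int]) -> float:
    """Return 1 minus the top committer's share of all commits (assumed definition)."""
    total = sum(commit_counts)
    if total == 0:
        return 0.0
    return 1 - max(commit_counts) / total


# All-time commit counts per committer, as listed on this page.
all_time = round(development_distribution_score([215, 170]), 3)
```

Under this assumption, a score near 0 means one committer wrote almost everything, and higher scores mean the work is spread more evenly.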

Past Year

Commits: 93
Committers: 2
Avg Commits per committer: 46.5
Development Distribution Score (DDS): 0.43

Top Committers

Name	Email	Commits
Nick Anthony	n**7@g**m	215
github-actions[bot]	4****]	170

Issues and Pull Requests

Last synced: 6 months ago

All Time

Total issues: 41
Total pull requests: 12
Average time to close issues: 3 months
Average time to close pull requests: 20 days
Total issue authors: 4
Total pull request authors: 1
Average comments per issue: 1.24
Average comments per pull request: 0.42
Merged pull requests: 11
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 12
Pull requests: 2
Average time to close issues: 12 days
Average time to close pull requests: 3 days
Issue authors: 4
Pull request authors: 1
Average comments per issue: 2.25
Average comments per pull request: 0.0
Merged pull requests: 2
Bot issues: 0
Bot pull requests: 0

Top Authors

Issue Authors

nanthony007 (26)
cdelcher (12)
yalbal4 (1)
nabarunDG (1)

Pull Request Authors

nanthony007 (18)

Top Labels

Issue Labels

enhancement (12) bug (2) question (2) documentation (1)

Pull Request Labels

enhancement (2)

Dependencies

poetry.lock pypi

appnope 0.1.3 develop
asttokens 2.0.8 develop
backcall 0.2.0 develop
black 22.8.0 develop
cffi 1.15.1 develop
debugpy 1.6.3 develop
decorator 5.1.1 develop
entrypoints 0.4 develop
executing 1.0.0 develop
flake8 5.0.4 develop
iniconfig 1.1.1 develop
ipykernel 6.15.2 develop
ipython 8.5.0 develop
isort 5.10.1 develop
jedi 0.18.1 develop
jupyter-client 7.3.5 develop
jupyter-core 4.11.1 develop
matplotlib-inline 0.1.6 develop
mccabe 0.7.0 develop
mkdocs-material 8.4.4 develop
mkdocs-material-extensions 1.0.3 develop
mypy-extensions 0.4.3 develop
nest-asyncio 1.5.5 develop
parso 0.8.3 develop
pathspec 0.10.1 develop
pexpect 4.8.0 develop
pickleshare 0.7.5 develop
platformdirs 2.5.2 develop
pluggy 1.0.0 develop
prompt-toolkit 3.0.31 develop
psutil 5.9.2 develop
ptyprocess 0.7.0 develop
pure-eval 0.2.2 develop
py 1.11.0 develop
pycodestyle 2.9.1 develop
pycparser 2.21 develop
pyflakes 2.5.0 develop
pytest 7.1.3 develop
pywin32 304 develop
pyzmq 23.2.1 develop
stack-data 0.5.0 develop
tomli 2.0.1 develop
tornado 6.2 develop
traitlets 5.3.0 develop
wcwidth 0.2.5 develop
aiohttp 3.8.1
aiosignal 1.2.0
async-timeout 4.0.2
attrs 22.1.0
certifi 2022.6.15
charset-normalizer 2.1.1
click 8.1.3
colorama 0.4.5
commonmark 0.9.1
frozenlist 1.3.1
ghp-import 2.1.0
griffe 0.22.1
idna 3.3
importlib-metadata 4.12.0
jinja2 3.1.2
markdown 3.3.7
markupsafe 2.1.1
mergedeep 1.3.4
mkdocs 1.3.1
mkdocs-autorefs 0.4.1
mkdocstrings 0.19.0
mkdocstrings-python 0.7.1
multidict 6.0.2
numpy 1.23.3
orjson 3.8.0
packaging 21.3
pandas 1.4.4
pydantic 1.10.2
pygments 2.13.0
pymdown-extensions 9.5
pyparsing 3.0.9
python-dateutil 2.8.2
python-dotenv 0.21.0
pytz 2022.2.1
pyyaml 6.0
pyyaml-env-tag 0.1
requests 2.28.1
rich 12.5.1
six 1.16.0
typer 0.6.1
typing-extensions 4.3.0
urllib3 1.26.12
watchdog 2.1.9
yarl 1.8.1
zipp 3.8.1

pyproject.toml pypi

black ^22.8.0 develop
flake8 ^5.0.4 develop
ipykernel ^6.15.2 develop
isort ^5.10.1 develop
mkdocs ^1.3.1 develop
mkdocs-material ^8.4.4 develop
mkdocstrings ^0.19.0 develop
pytest ^7.1.3 develop
aiohttp ^3.8.1
mkdocstrings ^0.19.0
orjson ^3.8.0
pandas ^1.4.4
pydantic ^1.10.2
python ^3.10
requests ^2.28.1
rich ^12.5.1
typer ^0.6.1

.github/workflows/pipeline.yml actions

actions-rs/toolchain v1 composite
actions/cache v3 composite
actions/checkout v3 composite
actions/download-artifact v3 composite
actions/setup-python v4 composite
actions/upload-artifact v3 composite
marvinpinto/action-automatic-releases latest composite

.devcontainer/Dockerfile docker

mcr.microsoft.com/vscode/devcontainers/python 0-${VARIANT} build
