open-data-pipeline
A pipeline for processing, enhancing, and sharing open datasets.
Science Score: 52.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: Found CITATION.cff file
- ✓ codemeta.json file: Found codemeta.json file
- ✓ .zenodo.json file: Found .zenodo.json file
- ○ DOI references
- ○ Academic publication links
- ○ Committers with academic emails
- ✓ Institutional organization owner: Organization uk-ipop has institutional domain (pharmacy.uky.edu)
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: Low similarity (12.3%) to scientific vocabulary
Repository
A pipeline for processing, enhancing, and sharing open datasets.
Basic Info
- Host: GitHub
- Owner: UK-IPOP
- License: gpl-3.0
- Language: Python
- Default Branch: main
- Homepage: https://uk-ipop.github.io/open-data-pipeline/
- Size: 45.6 MB
Statistics
- Stars: 2
- Watchers: 0
- Forks: 0
- Open Issues: 4
- Releases: 0
Topics
Metadata Files
README.md
Medical Examiner Open Data Pipeline
This repository contains the code for the Medical Examiner Open Data Pipeline.
We currently fetch data from the following sources:
- Cook County Medical Examiner's Archives
- San Diego Medical Examiner's Office
- Milwaukee County Medical Examiner's Office
- Connecticut (State) Accidental Drug Deaths
- Santa Clara County Medical Examiner's Office
- Sacramento County Medical Examiner's Office
- Pima County Medical Examiner's Office
- This source is a manual data dump in collaboration with the Pima County ME/C Office. Data is refreshed monthly.
The resulting data is used in various other analyses here on GitHub:
- Cook County
- Where we add geospatial data to the Cook County data
- This was excluded from the automated pipeline because of requirements specific to the Cook County data
Getting Started
This repo exists mainly to take advantage of GitHub Actions for automation.
The Actions workflow is located in .github/workflows/pipeline.yml and is triggered weekly or manually.
The workflow fetches data from the data sources configured in config.json,
geocodes addresses (when available) using ArcGIS, extracts drugs using the drug extraction toolbox,
and then compiles and zips the results and publishes them to the GitHub Releases page.
The data is then available for download from the Releases page.
The workflow itself runs a series of commands using the CLI application opendata-pipeline, which is located in the src directory.
The CLI is also available as a Docker image hosted on ghcr.io. The benefit of using the CLI via the Docker image is that you don't need Python or the drug toolbox on your local machine 🙂.
We utilize async methods to speed up the large number of web requests we make to the data sources.
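As a rough illustration of the idea, here is a minimal sketch of bounded concurrent fetching with asyncio. It is not the pipeline's actual code: `fetch_page`, `fetch_all`, and `max_concurrency` are hypothetical names, and the network call is replaced by a short sleep so the example runs anywhere. A real implementation would presumably await an HTTP client (aiohttp is listed among the dependencies) where the sleep is.

```python
import asyncio


async def fetch_page(source: str, page: int) -> dict:
    """Hypothetical stand-in for one HTTP request to a data source."""
    await asyncio.sleep(0.01)  # simulate network latency instead of a real request
    return {"source": source, "page": page}


async def fetch_all(source: str, pages: int, max_concurrency: int = 5) -> list[dict]:
    """Fetch every page concurrently, with at most `max_concurrency` in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(page: int) -> dict:
        # The semaphore keeps us from flooding the source with requests
        async with semaphore:
            return await fetch_page(source, page)

    # gather returns results in the same order as the awaitables it was given
    return await asyncio.gather(*(bounded(p) for p in range(pages)))


results = asyncio.run(fetch_all("example-source", pages=20))
```

Because the requests overlap instead of running one after another, the total time is closer to the slowest batch than to the sum of all requests; the semaphore is one common way to stay polite to the upstream API.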
It is important to regularly fetch/pull from this repo to maintain an up-to-date config.json.
We currently do not guarantee Windows support, unfortunately. If you want to help make that a reality, please submit a Pull Request.
Further API documentation is available on the GitHub Pages website for this repo if you want to interact with the CLI.
I recommend using the Docker image, as it is easier to use, and always referring to the CLI --help for more information.
NOTE: The Census Bureau recently made changes that make it harder to download files from servers, so if you add
a location to the configuration, make sure its corresponding CensusTract file is downloaded and placed in the data/spatial folder. You can do this by running the following command:
wget -P data/spatial <URL> where <URL> is the URL of the TIGER TRACT zip file, for example: https://www2.census.gov/geo/tiger/TIGER2024/TRACT/tl_2024_09_tract.zip
Or, an example of the url: https://www2.census.gov/geo/tiger/TIGER2024/TRACT/tl2024
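For anyone who prefers Python to wget, the same step could look roughly like the sketch below. The helper names `tract_url` and `download_tract` are hypothetical, and the file naming `tl_<year>_<state FIPS>_tract.zip` is an assumption based on the usual TIGER/Line naming convention, so check the URL in a browser before relying on it.

```python
from pathlib import Path
from urllib.request import urlretrieve

TIGER_BASE = "https://www2.census.gov/geo/tiger"


def tract_url(year: int, state_fips: str) -> str:
    """Build the TIGER tract URL for a state (assumed naming convention)."""
    if len(state_fips) != 2 or not state_fips.isdigit():
        raise ValueError("state_fips must be a two-digit string such as '09'")
    return f"{TIGER_BASE}/TIGER{year}/TRACT/tl_{year}_{state_fips}_tract.zip"


def download_tract(year: int, state_fips: str, dest_dir: Path = Path("data/spatial")) -> Path:
    """Download the tract zip into dest_dir unless it is already there."""
    dest_dir.mkdir(parents=True, exist_ok=True)
    url = tract_url(year, state_fips)
    target = dest_dir / url.rsplit("/", 1)[-1]
    if not target.exists():
        urlretrieve(url, target)  # may fail on servers, per the note above
    return target


example_url = tract_url(2024, "09")
```

Running `download_tract(2024, "09")` from the repository root on a local machine would place the file in data/spatial, which is where the note above says it belongs.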
Workflow
The workflow can best be described by looking at the pipeline.yml file.

Data Enhancements
The following table shows the fields that we add to the original datafiles:
| Column Name | Description |
| :------ | :------ |
| CaseIdentifier | A unique identifier across all the datasets. |
| death_day | Day of the Month death occurred |
| death_month | Month Name death occurred |
| death_month_num | Month Number death occurred |
| death_year | Year death occurred |
| death_day_of_week | Day of week death occurred. Starting with 0 on Monday. Weekends are 5 (Saturday) & 6 (Sunday). |
| death_day_is_weekend | Death occurred on weekend day |
| death_day_week_of_year | Week of the year (of 52) that death occurred |
| geocoded_latitude | Geocoded latitude. |
| geocoded_longitude | Geocoded longitude. |
| geocoded_score | Geocoding confidence score, from 70 to 100. |
| geocoded_address | The address that the geocoded results correspond to. Not the address provided to the geocoder. |
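To make the date-derived columns concrete, here is a minimal sketch of how they could be computed from a single death date using only the standard library. The function name `death_date_fields` is hypothetical, and using the ISO week number for death_day_week_of_year is an assumption (ISO weeks occasionally reach 53); the pipeline's own implementation may differ.

```python
import calendar
from datetime import date


def death_date_fields(d: date) -> dict:
    """Derive the death_* columns described in the table from one date."""
    weekday = d.weekday()  # Monday is 0, Saturday is 5, Sunday is 6
    return {
        "death_day": d.day,
        "death_month": calendar.month_name[d.month],
        "death_month_num": d.month,
        "death_year": d.year,
        "death_day_of_week": weekday,
        "death_day_is_weekend": weekday >= 5,
        # Assumption: ISO week number; the pipeline may count weeks differently
        "death_day_week_of_year": d.isocalendar()[1],
    }


fields = death_date_fields(date(2022, 9, 13))  # a Tuesday
```

Python's `date.weekday()` already uses the Monday-is-0 convention described in the table, so no remapping is needed for death_day_of_week.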
Drug Columns
In addition to providing the extracted drugs as a separate file in each release, we also convert this data to wide form for each dataset. This adds columns that follow the patterns below:
| Column Name/Pattern | Description |
| :--- | :--- |
| *_1 | * drug found in first search column provided in drug configuration |
| *_2 | * drug found in second search column provided in drug configuration |
| *_meta | Drug of * category/class found in this record across any search column. |
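The long-to-wide conversion can be sketched as follows. This is an illustration only: the input record layout (`case_id`, `search_index`, `drug`, `category`), the function name `to_wide`, and the use of 1/0 flags as cell values are all assumptions, not the format of the actual drug extraction output.

```python
from collections import defaultdict


def to_wide(records: list[dict]) -> dict[str, dict[str, int]]:
    """Pivot long-form drug hits into one row of 0/1 flags per case."""
    wide: dict[str, dict[str, int]] = defaultdict(dict)
    for rec in records:
        row = wide[rec["case_id"]]
        # <drug>_<n>: drug found in the n-th configured search column
        row[f'{rec["drug"]}_{rec["search_index"]}'] = 1
        # <category>_meta: category found in any search column for this case
        row[f'{rec["category"]}_meta'] = 1
    # Give every case the same set of columns, filling missing flags with 0
    columns = sorted({col for row in wide.values() for col in row})
    return {case: {col: row.get(col, 0) for col in columns} for case, row in wide.items()}


records = [
    {"case_id": "A", "search_index": 1, "drug": "fentanyl", "category": "opioid"},
    {"case_id": "A", "search_index": 2, "drug": "cocaine", "category": "stimulant"},
    {"case_id": "B", "search_index": 1, "drug": "heroin", "category": "opioid"},
]
wide = to_wide(records)
```

In this sketch case B gets `opioid_meta` set through heroin even though it has no fentanyl hit, which mirrors the "across any search column" wording for the *_meta columns.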
Requirements
uv
Installation
To install the Python CLI, I recommend using uv:
bash
uvx opendata-pipeline
To install the docker image, you can use the following command:
bash
docker pull ghcr.io/uk-ipop/opendata-pipeline:latest
Usage
Usage is very similar to any other command line application. The most important thing is to follow the workflow defined above.
Contributing
Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.
Help me write some tests!
License
BibTeX Citation
If you use this software or the enhanced data, please cite this repository:
@software{Anthony_Medical_Examiner_OpenData_2022,
author = {Anthony, Nicholas},
month = {9},
title = {{Medical Examiner OpenData Pipeline}},
url = {https://github.com/UK-IPOP/open-data-pipeline},
version = {0.2.1},
year = {2022}
}
Thank you.
Owner
- Name: UK IPOP
- Login: UK-IPOP
- Kind: organization
- Location: Lexington, KY
- Website: https://pharmacy.uky.edu/ipop-cloned
- Repositories: 11
- Profile: https://github.com/UK-IPOP
University of Kentucky IPOP
Citation (CITATION.cff)
cff-version: 1.2.0
message: "If you use this software or the data, please cite it as below."
authors:
- family-names: "Anthony"
given-names: "Nicholas"
orcid: "https://orcid.org/my-orcid?orcid=0000-0002-6692-3401"
title: "Medical Examiner OpenData Pipeline"
version: 0.2.1
date-released: 2022-09-13
url: "https://github.com/UK-IPOP/open-data-pipeline"
GitHub Events
Total
- Create event: 35
- Issues event: 21
- Release event: 35
- Delete event: 3
- Issue comment event: 32
- Push event: 75
- Pull request event: 4
Last Year
- Create event: 35
- Issues event: 21
- Release event: 35
- Delete event: 3
- Issue comment event: 32
- Push event: 75
- Pull request event: 4
Committers
Last synced: 9 months ago
Top Committers
| Name | Email | Commits |
|---|---|---|
| Nick Anthony | n****7@g****m | 215 |
| github-actions[bot] | 4****] | 170 |
Issues and Pull Requests
Last synced: 6 months ago
All Time
- Total issues: 41
- Total pull requests: 12
- Average time to close issues: 3 months
- Average time to close pull requests: 20 days
- Total issue authors: 4
- Total pull request authors: 1
- Average comments per issue: 1.24
- Average comments per pull request: 0.42
- Merged pull requests: 11
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 12
- Pull requests: 2
- Average time to close issues: 12 days
- Average time to close pull requests: 3 days
- Issue authors: 4
- Pull request authors: 1
- Average comments per issue: 2.25
- Average comments per pull request: 0.0
- Merged pull requests: 2
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- nanthony007 (26)
- cdelcher (12)
- yalbal4 (1)
- nabarunDG (1)
Pull Request Authors
- nanthony007 (18)
Top Labels
Issue Labels
Pull Request Labels
Dependencies
- appnope 0.1.3 develop
- asttokens 2.0.8 develop
- backcall 0.2.0 develop
- black 22.8.0 develop
- cffi 1.15.1 develop
- debugpy 1.6.3 develop
- decorator 5.1.1 develop
- entrypoints 0.4 develop
- executing 1.0.0 develop
- flake8 5.0.4 develop
- iniconfig 1.1.1 develop
- ipykernel 6.15.2 develop
- ipython 8.5.0 develop
- isort 5.10.1 develop
- jedi 0.18.1 develop
- jupyter-client 7.3.5 develop
- jupyter-core 4.11.1 develop
- matplotlib-inline 0.1.6 develop
- mccabe 0.7.0 develop
- mkdocs-material 8.4.4 develop
- mkdocs-material-extensions 1.0.3 develop
- mypy-extensions 0.4.3 develop
- nest-asyncio 1.5.5 develop
- parso 0.8.3 develop
- pathspec 0.10.1 develop
- pexpect 4.8.0 develop
- pickleshare 0.7.5 develop
- platformdirs 2.5.2 develop
- pluggy 1.0.0 develop
- prompt-toolkit 3.0.31 develop
- psutil 5.9.2 develop
- ptyprocess 0.7.0 develop
- pure-eval 0.2.2 develop
- py 1.11.0 develop
- pycodestyle 2.9.1 develop
- pycparser 2.21 develop
- pyflakes 2.5.0 develop
- pytest 7.1.3 develop
- pywin32 304 develop
- pyzmq 23.2.1 develop
- stack-data 0.5.0 develop
- tomli 2.0.1 develop
- tornado 6.2 develop
- traitlets 5.3.0 develop
- wcwidth 0.2.5 develop
- aiohttp 3.8.1
- aiosignal 1.2.0
- async-timeout 4.0.2
- attrs 22.1.0
- certifi 2022.6.15
- charset-normalizer 2.1.1
- click 8.1.3
- colorama 0.4.5
- commonmark 0.9.1
- frozenlist 1.3.1
- ghp-import 2.1.0
- griffe 0.22.1
- idna 3.3
- importlib-metadata 4.12.0
- jinja2 3.1.2
- markdown 3.3.7
- markupsafe 2.1.1
- mergedeep 1.3.4
- mkdocs 1.3.1
- mkdocs-autorefs 0.4.1
- mkdocstrings 0.19.0
- mkdocstrings-python 0.7.1
- multidict 6.0.2
- numpy 1.23.3
- orjson 3.8.0
- packaging 21.3
- pandas 1.4.4
- pydantic 1.10.2
- pygments 2.13.0
- pymdown-extensions 9.5
- pyparsing 3.0.9
- python-dateutil 2.8.2
- python-dotenv 0.21.0
- pytz 2022.2.1
- pyyaml 6.0
- pyyaml-env-tag 0.1
- requests 2.28.1
- rich 12.5.1
- six 1.16.0
- typer 0.6.1
- typing-extensions 4.3.0
- urllib3 1.26.12
- watchdog 2.1.9
- yarl 1.8.1
- zipp 3.8.1
- black ^22.8.0 develop
- flake8 ^5.0.4 develop
- ipykernel ^6.15.2 develop
- isort ^5.10.1 develop
- mkdocs ^1.3.1 develop
- mkdocs-material ^8.4.4 develop
- mkdocstrings ^0.19.0 develop
- pytest ^7.1.3 develop
- aiohttp ^3.8.1
- mkdocstrings ^0.19.0
- orjson ^3.8.0
- pandas ^1.4.4
- pydantic ^1.10.2
- python ^3.10
- requests ^2.28.1
- rich ^12.5.1
- typer ^0.6.1
- actions-rs/toolchain v1 composite
- actions/cache v3 composite
- actions/checkout v3 composite
- actions/download-artifact v3 composite
- actions/setup-python v4 composite
- actions/upload-artifact v3 composite
- marvinpinto/action-automatic-releases latest composite
- mcr.microsoft.com/vscode/devcontainers/python 0-${VARIANT} build