harmonize-wq

harmonize-wq: Standardize, clean and wrangle Water Quality Portal data into more analytic-ready formats - Published in JOSS (2024)

https://github.com/usepa/harmonize-wq

Science Score: 98.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 4 DOI reference(s) in README and JOSS metadata
  • Academic publication links
    Links to: joss.theoj.org
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
    Published in Journal of Open Source Software

Keywords

ord

Scientific Fields

Political Science Social Sciences - 42% confidence
Last synced: 4 months ago

Repository

Standardize, clean, and wrangle Water Quality Portal data into more analytic-ready formats

Basic Info
Statistics
  • Stars: 18
  • Watchers: 5
  • Forks: 8
  • Open Issues: 27
  • Releases: 4
Topics
ord
Created over 3 years ago · Last pushed 11 months ago
Metadata Files
Readme Contributing License Citation

README.md

Project Status: Active – The project has reached a stable, usable state and is being actively developed. pyOpenSci Peer-Reviewed.

harmonize-wq

Standardize, clean, and wrangle Water Quality Portal data into more analytic-ready formats

US EPA’s Water Quality Portal (WQP) aggregates water quality, biological, and physical data provided by many organizations and has become an essential resource with tools to query and retrieve data using Python or R. Given the variety of data and of data originators, using the data in analysis often requires data cleaning to ensure it meets the required quality standards and data wrangling to get it into a more analytic-ready format. Recognizing that the definition of analysis-ready varies depending on the analysis, the harmonize_wq package is intended to be a flexible, water quality-specific framework to help:

  • Identify differences in data units (including speciation and basis)
  • Identify differences in sampling or analytic methods
  • Resolve data errors using transparent assumptions
  • Transform data from long to wide format

Domain experts must decide what data meets their quality standards for data comparability and any thresholds for acceptance or rejection.

For complete documentation, see the docs (https://usepa.github.io/harmonize-wq/). For more complete tutorial information, see the demos.

Quick Start

harmonize_wq can be installed using pip:

    python3 -m pip install harmonize-wq

To install the latest development version of harmonize_wq using pip:

    pip install git+https://github.com/USEPA/harmonize-wq.git

Example Workflow

dataretrieval Query for a geojson

```python
import dataretrieval.wqp as wqp
from harmonize_wq import wrangle

# File for area of interest
aoi_url = r'https://raw.githubusercontent.com/USEPA/harmonize-wq/main/harmonize_wq/tests/data/PPBays_NCCA.geojson'

# Build query
query = {'characteristicName': ['Temperature, water',
                                'Depth, Secchi disk depth',
                                ]}
query['bBox'] = wrangle.get_bounding_box(aoi_url)
query['dataProfile'] = 'narrowResult'

# Run query
res_narrow, md_narrow = wqp.get_results(**query)

# dataframe of downloaded results
res_narrow
```
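The query above passes the area of interest to `wrangle.get_bounding_box` to fill the WQP `bBox` parameter, which expects comma-delimited west, south, east, north coordinates. As a rough illustration of that idea only, and not the package's implementation, the sketch below derives such a string from GeoJSON text using the standard library. The helper names `iter_positions` and `bbox_from_geojson` and the small inline polygon are hypothetical, and coordinates are assumed to be longitude/latitude.

```python
import json


def iter_positions(coords):
    """Yield (lon, lat) pairs from arbitrarily nested GeoJSON coordinates."""
    if coords and isinstance(coords[0], (int, float)):
        yield coords[0], coords[1]
    else:
        for part in coords:
            yield from iter_positions(part)


def bbox_from_geojson(geojson_text):
    """Return 'west,south,east,north' for a GeoJSON FeatureCollection string."""
    collection = json.loads(geojson_text)
    lons, lats = [], []
    for feature in collection["features"]:
        for lon, lat in iter_positions(feature["geometry"]["coordinates"]):
            lons.append(lon)
            lats.append(lat)
    return ",".join(str(v) for v in (min(lons), min(lats), max(lons), max(lats)))


# Hypothetical area of interest: one small polygon (longitude, latitude)
example = json.dumps({
    "type": "FeatureCollection",
    "features": [{
        "type": "Feature",
        "properties": {},
        "geometry": {
            "type": "Polygon",
            "coordinates": [[[-87.5, 30.25], [-87.0, 30.25], [-87.0, 30.75],
                             [-87.5, 30.75], [-87.5, 30.25]]],
        },
    }],
})

print(bbox_from_geojson(example))  # -87.5,30.25,-87.0,30.75
```

A bounding box is a coarse filter: results inside the box but outside the polygon would still be returned and may need to be clipped afterwards.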

Harmonize results

```python
from harmonize_wq import harmonize

# Harmonize all results
df_harmonized = harmonize.harmonize_all(res_narrow, errors='raise')
df_harmonized
```
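To give a feel for what harmonizing units with transparent assumptions can involve, here is a minimal standard-library sketch. It is not how harmonize_wq works internally (the package lists pint among its dependencies). The `TO_METERS` table, the `to_meters` helper, and its `errors='flag'` option are all hypothetical names chosen for this illustration.

```python
# Conversion factors to meters; an illustrative subset, not the package's unit registry
TO_METERS = {"m": 1.0, "cm": 0.01, "ft": 0.3048, "in": 0.0254}


def to_meters(value, unit, errors="raise"):
    """Convert a length to meters and return (converted_value, qa_flag).

    errors='raise' stops on an unrecognized unit; errors='flag' keeps going and
    returns None with a QA note instead (a hypothetical option for this sketch).
    """
    factor = TO_METERS.get(unit)
    if factor is None:
        if errors == "raise":
            raise ValueError(f"unrecognized unit: {unit!r}")
        return None, f"unit {unit!r} not recognized"
    return value * factor, None


# Hypothetical Secchi depth results reported in mixed units
results = [(1.5, "m"), (150.0, "cm"), (5.0, "ft"), (2.0, "fathoms")]
harmonized = [to_meters(value, unit, errors="flag") for value, unit in results]

for (value, unit), (meters, flag) in zip(results, harmonized):
    print(f"{value} {unit} -> {meters} m, QA: {flag}")
```

Keeping the QA note next to each converted value leaves the decision about accepting or rejecting a result with the analyst rather than hiding it inside the conversion.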

Clean results

```python
from harmonize_wq import clean

# Clean up other columns of data
df_cleaned = clean.datetime(df_harmonized)  # datetime
df_cleaned = clean.harmonize_depth(df_cleaned)  # Sample depth
df_cleaned
```
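The datetime cleaning step combines the WQP date, time, and time zone code fields into a single UTC timestamp. The sketch below shows one way such a combination could be done with the standard library. It is an assumption-laden illustration rather than the package's code: the `UTC_OFFSET_HOURS` table covers only a few codes, and `to_utc` is a hypothetical helper.

```python
from datetime import datetime, timedelta, timezone

# Offsets for a few time zone codes; an illustrative subset chosen for this sketch
UTC_OFFSET_HOURS = {"UTC": 0, "EST": -5, "EDT": -4, "CST": -6, "CDT": -5}


def to_utc(start_date, start_time, tz_code):
    """Combine date, time, and time zone code strings into a UTC datetime.

    Returns None when the time or time zone is missing or unrecognized, so the
    caller can fall back on the date alone.
    """
    if not start_time or tz_code not in UTC_OFFSET_HOURS:
        return None
    local = datetime.strptime(f"{start_date} {start_time}", "%Y-%m-%d %H:%M:%S")
    local = local.replace(tzinfo=timezone(timedelta(hours=UTC_OFFSET_HOURS[tz_code])))
    return local.astimezone(timezone.utc)


print(to_utc("2019-07-15", "09:30:00", "CDT"))  # 2019-07-15 14:30:00+00:00
print(to_utc("2019-07-15", None, None))         # None
```

Returning None for a missing time, instead of guessing midnight, is why it helps to keep the original date in a separate column.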

Transform results from long to wide format

Many columns in the dataframe are characteristic-specific; that is, they have different values for the same sample depending on the characteristic. To ensure one result for each sample after the transformation, these columns must either be split, generating a new column for each characteristic with values, or moved out of the table if they are not being used.

```python
from harmonize_wq import wrangle

# Split QA column into multiple characteristic specific QA columns
df_full = wrangle.split_col(df_cleaned)

# Divide table into columns of interest (main_df) and characteristic specific metadata (chars_df)
main_df, chars_df = wrangle.split_table(df_full)

# Combine rows with the same sample organization, activity, location, and datetime
df_wide = wrangle.collapse_results(main_df)
```

The number of columns in the resulting table is greatly reduced:

Output Column | Type | Source | Changes
--- | --- | --- | ---
MonitoringLocationIdentifier | Defines row | MonitoringLocationIdentifier | NA
Activity_datetime | Defines row | ActivityStartDate, ActivityStartTime/Time, ActivityStartTime/TimeZoneCode | Combined and UTC
ActivityIdentifier | Defines row | ActivityIdentifier | NA
OrganizationIdentifier | Defines row | OrganizationIdentifier | NA
OrganizationFormalName | Metadata | OrganizationFormalName | NA
ProviderName | Metadata | ProviderName | NA
StartDate | Metadata | ActivityStartDate | Preserves date where time is NaT
Depth | Metadata | ResultDepthHeightMeasure/MeasureValue, ResultDepthHeightMeasure/MeasureUnitCode | standardized to meters
Secchi | Result | ResultMeasureValue, ResultMeasure/MeasureUnitCode | standardized to meters
QA_Secchi | QA | NA | harmonization processing quality issues
Temperature | Result | ResultMeasureValue, ResultMeasure/MeasureUnitCode | standardized to degrees Celsius
QA_Temperature | QA | NA | harmonization processing quality issues
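To make the long-to-wide idea concrete, the following sketch collapses a few made-up long-format rows into one row per sample, with a result column and a matching QA column per characteristic. It uses plain Python dictionaries for illustration and is not the package's implementation. The row values, the `ROW_KEYS` subset, and the field names are hypothetical.

```python
from collections import defaultdict

# Hypothetical long-format rows: one row per sample and characteristic
long_rows = [
    {"location": "SITE-1", "activity": "A1", "characteristic": "Secchi",
     "value": 1.2, "qa": None},
    {"location": "SITE-1", "activity": "A1", "characteristic": "Temperature",
     "value": 28.5, "qa": None},
    {"location": "SITE-2", "activity": "A2", "characteristic": "Temperature",
     "value": 27.0, "qa": "units assumed"},
]

# Keys that define one output row (a simplified subset of the "Defines row" columns)
ROW_KEYS = ("location", "activity")

wide = defaultdict(dict)
for row in long_rows:
    key = tuple(row[k] for k in ROW_KEYS)
    name = row["characteristic"]
    wide[key][name] = row["value"]        # one result column per characteristic
    wide[key][f"QA_{name}"] = row["qa"]   # matching characteristic specific QA column

for key, columns in wide.items():
    print(dict(zip(ROW_KEYS, key)), columns)
```

In this sketch, a sample with no Secchi result simply has no `Secchi` entry; in a tabular version that gap would show up as a missing value.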

Issue Tracker

harmonize_wq is under development. Please report any bugs and enhancement ideas using the repository's issues.

Disclaimer

The United States Environmental Protection Agency (EPA) GitHub project code is provided on an "as is" basis and the user assumes responsibility for its use. EPA has relinquished control of the information and no longer has responsibility to protect the integrity, confidentiality, or availability of the information. Any reference to specific commercial products, processes, or services by service mark, trademark, manufacturer, or otherwise, does not constitute or imply their endorsement, recommendation or favoring by EPA. The EPA seal and logo shall not be used in any manner to imply endorsement of any commercial product or activity by EPA or the United States Government.

Owner

  • Name: U.S. Environmental Protection Agency
  • Login: USEPA
  • Kind: organization
  • Location: United States of America

JOSS Publication

harmonize-wq: Standardize, clean and wrangle Water Quality Portal data into more analytic-ready formats
Published
October 22, 2024
Volume 9, Issue 102, Page 7305
Authors
Justin Bousquin ORCID
U.S. Environmental Protection Agency, Gulf Ecosystem Measurement and Modeling Division, Gulf Breeze, FL 32561
Cristina A. Mullin ORCID
U.S. Environmental Protection Agency, Watershed Restoration, Assessment and Protection Division, Washington, D.C. 20460
Editor
Kristen Thyng ORCID
Tags
water quality data set analysis

Citation (CITATION.cff)

CITATION.cff

cff-version: "1.2.0"
authors:
- family-names: Bousquin
  given-names: Justin
  orcid: "https://orcid.org/0000-0001-5797-4322"
- family-names: Mullin
  given-names: Cristina A.
  orcid: "https://orcid.org/0000-0002-0615-6087"
doi: 10.5281/zenodo.13356847
message: If you use this package, please cite our article in the
  Journal of Open Source Software.
preferred-citation:
  authors:
  - family-names: Bousquin
    given-names: Justin
    orcid: "https://orcid.org/0000-0001-5797-4322"
  - family-names: Mullin
    given-names: Cristina A.
    orcid: "https://orcid.org/0000-0002-0615-6087"
  date-published: 2024-10-22
  doi: 10.21105/joss.07305
  issn: 2475-9066
  issue: 102
  journal: Journal of Open Source Software
  publisher:
    name: Open Journals
  start: 7305
  title: "harmonize-wq: Standardize, clean and wrangle Water Quality
    Portal data into more analytic-ready formats"
  type: article
  url: "https://joss.theoj.org/papers/10.21105/joss.07305"
  volume: 9
title: "harmonize-wq: Standardize, clean and wrangle Water Quality
  Portal data into more analytic-ready formats"

GitHub Events

Total
  • Issues event: 7
  • Watch event: 3
  • Delete event: 4
  • Issue comment event: 2
  • Push event: 17
  • Pull request event: 15
  • Fork event: 4
  • Create event: 8
Last Year
  • Issues event: 7
  • Watch event: 3
  • Delete event: 4
  • Issue comment event: 2
  • Push event: 17
  • Pull request event: 15
  • Fork event: 4
  • Create event: 8

Committers

Last synced: 5 months ago

All Time
  • Total Commits: 509
  • Total Committers: 5
  • Avg Commits per committer: 101.8
  • Development Distribution Score (DDS): 0.043
Past Year
  • Commits: 29
  • Committers: 3
  • Avg Commits per committer: 9.667
  • Development Distribution Score (DDS): 0.172
Top Committers
Name Email Commits
Bousquin B****n@e****v 487
Cristina Mullin m****a@e****v 14
github-actions[bot] g****] 4
Romain Caneill r****l@e****g 3
Timothy Hodson 3****s 1
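The summary figures above can be reproduced from the commit counts. The Development Distribution Score appears to be one minus the top committer's share of all commits. That definition is an assumption inferred from the listed numbers, not something the page states. The function name below is hypothetical.

```python
def development_distribution_score(top_committer_commits, total_commits):
    """Return 1 minus the share of commits made by the most active committer."""
    return 1 - top_committer_commits / total_commits


# All-time figures from the committers listing
total_commits = 509
committers = 5
top_committer_commits = 487

print(round(total_commits / committers, 1))  # 101.8
print(round(development_distribution_score(top_committer_commits, total_commits), 3))  # 0.043
```

Under this reading, a score near 0 means one person made almost all of the commits, which matches the listing.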
Committer Domains (Top 20 + Academic)

Issues and Pull Requests

Last synced: 4 months ago

All Time
  • Total issues: 45
  • Total pull requests: 63
  • Average time to close issues: 3 months
  • Average time to close pull requests: 8 days
  • Total issue authors: 2
  • Total pull request authors: 6
  • Average comments per issue: 0.64
  • Average comments per pull request: 0.27
  • Merged pull requests: 51
  • Bot issues: 0
  • Bot pull requests: 5
Past Year
  • Issues: 6
  • Pull requests: 10
  • Average time to close issues: 18 days
  • Average time to close pull requests: 2 days
  • Issue authors: 1
  • Pull request authors: 2
  • Average comments per issue: 0.17
  • Average comments per pull request: 0.1
  • Merged pull requests: 5
  • Bot issues: 0
  • Bot pull requests: 5
Top Authors
Issue Authors
  • jbousquin (38)
  • rcaneill (6)
Pull Request Authors
  • jbousquin (79)
  • rcaneill (8)
  • dependabot[bot] (5)
  • cristinamullin (2)
  • thodson-usgs (2)
  • Batalex (2)
Top Labels
Issue Labels
Pull Request Labels
dependencies (5) github_actions (1)

Packages

  • Total packages: 1
  • Total downloads:
    • pypi 23 last-month
  • Total dependent packages: 0
  • Total dependent repositories: 0
  • Total versions: 3
  • Total maintainers: 1
pypi.org: harmonize-wq

Package to standardize, clean, and wrangle Water Quality Portal data into more analytic-ready formats

  • Homepage: https://github.com/USEPA/harmonize-wq
  • Documentation: https://usepa.github.io/harmonize-wq/
  • License: MIT License Copyright (c) 2023 U.S. Federal Government (in countries where recognized) Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
  • Latest release: 0.5.0
    published over 1 year ago
  • Versions: 3
  • Dependent Packages: 0
  • Dependent Repositories: 0
  • Downloads: 23 Last month
Rankings
Dependent packages count: 7.4%
Forks count: 19.4%
Stargazers count: 19.5%
Average: 22.8%
Dependent repos count: 44.8%
Maintainers (1)
Last synced: 4 months ago

Dependencies

requirements.txt pypi
  • dataretrieval >=0.7
  • descartes >=1.1.0
  • geopandas >=0.10.2
  • pint >=0.18