harmonize-wq
harmonize-wq: Standardize, clean and wrangle Water Quality Portal data into more analytic-ready formats - Published in JOSS (2024)
Science Score: 98.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: Found CITATION.cff file
- ✓ codemeta.json file: Found codemeta.json file
- ✓ .zenodo.json file: Found .zenodo.json file
- ✓ DOI references: Found 4 DOI reference(s) in README and JOSS metadata
- ✓ Academic publication links: Links to joss.theoj.org
- ○ Committers with academic emails
- ○ Institutional organization owner
- ✓ JOSS paper metadata: Published in Journal of Open Source Software
Repository
Standardize, clean, and wrangle Water Quality Portal data into more analytic-ready formats
Basic Info
- Host: GitHub
- Owner: USEPA
- License: mit
- Language: Python
- Default Branch: main
- Homepage: https://usepa.github.io/harmonize-wq/
- Size: 87.8 MB
Statistics
- Stars: 18
- Watchers: 5
- Forks: 8
- Open Issues: 27
- Releases: 4
Metadata Files
README.md
harmonize-wq
Standardize, clean, and wrangle Water Quality Portal data into more analytic-ready formats
US EPA’s Water Quality Portal (WQP) aggregates water quality, biological, and physical data provided by many organizations and has become an essential resource, with tools to query and retrieve data using Python or R. Given the variety of data and of data originators, using the data in analysis often requires data cleaning to ensure it meets the required quality standards and data wrangling to get it into a more analytic-ready format. Recognizing that the definition of analysis-ready varies depending on the analysis, the harmonize_wq package is intended to be a flexible, water-quality-specific framework to help:
- Identify differences in data units (including speciation and basis)
- Identify differences in sampling or analytic methods
- Resolve data errors using transparent assumptions
- Transform data from long to wide format
Domain experts must decide what data meets their quality standards for data comparability and any thresholds for acceptance or rejection.
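As a rough illustration of the first goal in the list above, the sketch below groups the units reported for each characteristic. It is not part of harmonize_wq: the rows are made-up values, the helper `units_by_characteristic` is hypothetical, and the two column names are assumed to follow the WQP result schema.

```python
from collections import defaultdict

# Hypothetical rows shaped like WQP results (values invented for illustration)
rows = [
    {"CharacteristicName": "Temperature, water", "ResultMeasure/MeasureUnitCode": "deg C"},
    {"CharacteristicName": "Temperature, water", "ResultMeasure/MeasureUnitCode": "deg F"},
    {"CharacteristicName": "Depth, Secchi disk depth", "ResultMeasure/MeasureUnitCode": "m"},
    {"CharacteristicName": "Depth, Secchi disk depth", "ResultMeasure/MeasureUnitCode": "ft"},
    {"CharacteristicName": "Depth, Secchi disk depth", "ResultMeasure/MeasureUnitCode": "m"},
]


def units_by_characteristic(records):
    """Map each characteristic name to the sorted list of units reported for it."""
    units = defaultdict(set)
    for record in records:
        units[record["CharacteristicName"]].add(record["ResultMeasure/MeasureUnitCode"])
    return {name: sorted(found) for name, found in units.items()}


unit_summary = units_by_characteristic(rows)
for name, found in unit_summary.items():
    print(name, found)
```

A characteristic that maps to more than one unit is a candidate for harmonization before values are compared.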
For complete documentation see docs. For more complete tutorial information see: demos
Quick Start
harmonize_wq can be installed using pip:
```bash
python3 -m pip install harmonize-wq
```
To install the latest development version of harmonize_wq using pip:
```bash
pip install git+https://github.com/USEPA/harmonize-wq.git
```
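One possible way to confirm which version ended up installed is to query the package metadata from Python. This is a generic standard-library check, not something the README prescribes, and `installed_version` is a hypothetical helper name.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(distribution):
    """Return the installed version string, or None if the distribution is absent."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        return None


# Prints a version string when the package is installed, otherwise None
print(installed_version("harmonize-wq"))
```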
Example Workflow
dataretrieval Query for a geojson
```python
import dataretrieval.wqp as wqp
from harmonize_wq import wrangle

# File for area of interest
aoi_url = r'https://raw.githubusercontent.com/USEPA/harmonize-wq/main/harmonize_wq/tests/data/PPBays_NCCA.geojson'

# Build query
query = {'characteristicName': ['Temperature, water',
                                'Depth, Secchi disk depth',
                                ]}
query['bBox'] = wrangle.get_bounding_box(aoi_url)
query['dataProfile'] = 'narrowResult'

# Run query
res_narrow, md_narrow = wqp.get_results(**query)

# dataframe of downloaded results
res_narrow
```
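To make the `bBox` step more concrete, here is a minimal standard-library sketch of deriving a bounding box string from GeoJSON text. It is not the package's implementation: `bbox_from_geojson` is a hypothetical helper, the polygon coordinates are invented, and the comma-separated west, south, east, north order is an assumption about what the WQP query expects.

```python
import json


def bbox_from_geojson(geojson_text):
    """Return 'west,south,east,north' covering all coordinates in a GeoJSON string."""

    def walk(node):
        if isinstance(node, (list, tuple)):
            # A position is a list that starts with two numbers: [longitude, latitude]
            if len(node) >= 2 and all(isinstance(v, (int, float)) for v in node[:2]):
                yield node[0], node[1]
            else:
                for child in node:
                    yield from walk(child)
        elif isinstance(node, dict):
            # Only descend into keys that can hold geometry
            for key in ("features", "geometry", "geometries", "coordinates"):
                if key in node:
                    yield from walk(node[key])

    points = list(walk(json.loads(geojson_text)))
    if not points:
        raise ValueError("no coordinates found")
    lons, lats = zip(*points)
    return ",".join(str(v) for v in (min(lons), min(lats), max(lons), max(lats)))


# Hypothetical area of interest (coordinates invented for illustration)
aoi_geojson = json.dumps({
    "type": "FeatureCollection",
    "features": [{
        "type": "Feature",
        "properties": {},
        "geometry": {
            "type": "Polygon",
            "coordinates": [[[-87.5, 30.2], [-87.0, 30.2], [-87.0, 30.6],
                             [-87.5, 30.6], [-87.5, 30.2]]],
        },
    }],
})

bbox = bbox_from_geojson(aoi_geojson)
print(bbox)
```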
Harmonize results
```python
from harmonize_wq import harmonize

# Harmonize all results
df_harmonized = harmonize.harmonize_all(res_narrow, errors='raise')
df_harmonized
```
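The idea behind harmonization can be sketched without the package: convert each value to a standard unit and record a quality note when that is not possible. The code below is only an illustration under stated assumptions. The conversion table, unit spellings, and the `harmonize_value` helper are hypothetical and do not reflect how harmonize_wq implements unit handling.

```python
import math

# Hypothetical conversion factors to meters; a real workflow would use a units library
TO_METERS = {"m": 1.0, "cm": 0.01, "ft": 0.3048, "in": 0.0254}


def to_celsius(value, unit):
    """Convert a temperature to degrees Celsius, raising KeyError for unknown units."""
    if unit == "deg C":
        return value
    if unit == "deg F":
        return (value - 32.0) * 5.0 / 9.0
    raise KeyError(unit)


def harmonize_value(value, unit, kind):
    """Return (standardized value, QA note); the note is None when nothing went wrong."""
    if unit is None:
        return math.nan, "missing unit"
    try:
        if kind == "temperature":
            return to_celsius(value, unit), None
        return value * TO_METERS[unit], None
    except KeyError:
        return math.nan, f"unrecognized unit: {unit}"


print(harmonize_value(77.0, "deg F", "temperature"))
print(harmonize_value(10.0, "ft", "depth"))
print(harmonize_value(2.0, None, "depth"))
```

Keeping the note next to the value, rather than silently dropping the row, leaves the accept-or-reject decision to the analyst.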
Clean results
```python
from harmonize_wq import clean

# Clean up other columns of data
df_cleaned = clean.datetime(df_harmonized)  # datetime
df_cleaned = clean.harmonize_depth(df_cleaned)  # Sample depth
df_cleaned
```
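The datetime step combines separate date, time, and time zone fields into a single UTC timestamp. A minimal standard-library sketch of that idea follows. The `UTC_OFFSETS` table is a hypothetical subset of zone codes, `to_utc` is an illustrative helper, and the package's own handling may differ.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical subset of time zone codes mapped to UTC offsets in hours
UTC_OFFSETS = {"UTC": 0, "EST": -5, "EDT": -4, "CST": -6, "CDT": -5}


def to_utc(start_date, start_time=None, tz_code=None):
    """Combine date, time and zone code into a UTC datetime.

    Returns None when the time is missing or the zone code is unknown,
    so the caller can keep the date on its own.
    """
    if not start_time or tz_code not in UTC_OFFSETS:
        return None
    local = datetime.strptime(f"{start_date} {start_time}", "%Y-%m-%d %H:%M:%S")
    zone = timezone(timedelta(hours=UTC_OFFSETS[tz_code]))
    return local.replace(tzinfo=zone).astimezone(timezone.utc)


print(to_utc("2020-07-01", "10:30:00", "CDT"))
print(to_utc("2020-07-01"))
```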
Transform results from long to wide format
Many columns in the dataframe are characteristic-specific; that is, they have different values for the same sample depending on the characteristic. To ensure one result for each sample after the transformation, these columns must either be split, generating a new column for each characteristic with values, or moved out of the table if they are not being used.
```python
from harmonize_wq import wrangle

# Split QA column into multiple characteristic specific QA columns
df_full = wrangle.split_col(df_cleaned)

# Divide table into columns of interest (main_df) and characteristic specific metadata (chars_df)
main_df, chars_df = wrangle.split_table(df_full)

# Combine rows with the same sample organization, activity, location, and datetime
df_wide = wrangle.collapse_results(main_df)
```
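The collapse step can be pictured as merging rows that share the same identifying columns. The sketch below shows that idea with plain dictionaries. It assumes the four key columns named in the comment above, uses invented record values, and `collapse_rows` is a hypothetical helper rather than the package's implementation.

```python
# Columns assumed to identify one sample (organization, activity, location, datetime)
KEY_COLUMNS = ("OrganizationIdentifier", "ActivityIdentifier",
               "MonitoringLocationIdentifier", "Activity_datetime")


def collapse_rows(records):
    """Merge records sharing the key columns, keeping the first non-missing value per column."""
    merged = {}
    for record in records:
        key = tuple(record[col] for col in KEY_COLUMNS)
        row = merged.setdefault(key, {})
        for column, value in record.items():
            if value is not None and row.get(column) is None:
                row[column] = value
    return list(merged.values())


# Hypothetical long-format records: one row per characteristic result
long_rows = [
    {"OrganizationIdentifier": "ORG1", "ActivityIdentifier": "A1",
     "MonitoringLocationIdentifier": "ORG1-SITE1",
     "Activity_datetime": "2020-07-01T15:30:00Z",
     "Temperature": 25.0, "Secchi": None},
    {"OrganizationIdentifier": "ORG1", "ActivityIdentifier": "A1",
     "MonitoringLocationIdentifier": "ORG1-SITE1",
     "Activity_datetime": "2020-07-01T15:30:00Z",
     "Temperature": None, "Secchi": 1.5},
]

wide_rows = collapse_rows(long_rows)
print(wide_rows)
```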
The number of columns in the resulting table is greatly reduced:

Output Column | Type | Source | Changes
--- | --- | --- | ---
MonitoringLocationIdentifier | Defines row | MonitoringLocationIdentifier | NA
Activity_datetime | Defines row | ActivityStartDate, ActivityStartTime/Time, ActivityStartTime/TimeZoneCode | Combined and UTC
ActivityIdentifier | Defines row | ActivityIdentifier | NA
OrganizationIdentifier | Defines row | OrganizationIdentifier | NA
OrganizationFormalName | Metadata | OrganizationFormalName | NA
ProviderName | Metadata | ProviderName | NA
StartDate | Metadata | ActivityStartDate | Preserves date where time is NaT
Depth | Metadata | ResultDepthHeightMeasure/MeasureValue, ResultDepthHeightMeasure/MeasureUnitCode | standardized to meters
Secchi | Result | ResultMeasureValue, ResultMeasure/MeasureUnitCode | standardized to meters
QA_Secchi | QA | NA | harmonization processing quality issues
Temperature | Result | ResultMeasureValue, ResultMeasure/MeasureUnitCode | standardized to degrees Celsius
QA_Temperature | QA | NA | harmonization processing quality issues
Issue Tracker
harmonize_wq is under development. Please report any bugs and enhancement ideas using issues
Disclaimer
The United States Environmental Protection Agency (EPA) GitHub project code is provided on an "as is" basis and the user assumes responsibility for its use. EPA has relinquished control of the information and no longer has responsibility to protect the integrity, confidentiality, or availability of the information. Any reference to specific commercial products, processes, or services by service mark, trademark, manufacturer, or otherwise, does not constitute or imply their endorsement, recommendation or favoring by EPA. The EPA seal and logo shall not be used in any manner to imply endorsement of any commercial product or activity by EPA or the United States Government.
Owner
- Name: U.S. Environmental Protection Agency
- Login: USEPA
- Kind: organization
- Location: United States of America
- Website: https://www.epa.gov
- Twitter: EPA
- Repositories: 449
- Profile: https://github.com/USEPA
JOSS Publication
harmonize-wq: Standardize, clean and wrangle Water Quality Portal data into more analytic-ready formats
Authors
Tags
water quality data set analysis
Citation (CITATION.cff)
CITATION.cff
cff-version: "1.2.0"
authors:
- family-names: Bousquin
  given-names: Justin
  orcid: "https://orcid.org/0000-0001-5797-4322"
- family-names: Mullin
  given-names: Cristina A.
  orcid: "https://orcid.org/0000-0002-0615-6087"
doi: 10.5281/zenodo.13356847
message: If you use this package, please cite our article in the
  Journal of Open Source Software.
preferred-citation:
  authors:
  - family-names: Bousquin
    given-names: Justin
    orcid: "https://orcid.org/0000-0001-5797-4322"
  - family-names: Mullin
    given-names: Cristina A.
    orcid: "https://orcid.org/0000-0002-0615-6087"
  date-published: 2024-10-22
  doi: 10.21105/joss.07305
  issn: 2475-9066
  issue: 102
  journal: Journal of Open Source Software
  publisher:
    name: Open Journals
  start: 7305
  title: "harmonize-wq: Standardize, clean and wrangle Water Quality
    Portal data into more analytic-ready formats"
  type: article
  url: "https://joss.theoj.org/papers/10.21105/joss.07305"
  volume: 9
title: "harmonize-wq: Standardize, clean and wrangle Water Quality
  Portal data into more analytic-ready formats"
GitHub Events
Total
- Issues event: 7
- Watch event: 3
- Delete event: 4
- Issue comment event: 2
- Push event: 17
- Pull request event: 15
- Fork event: 4
- Create event: 8
Last Year
- Issues event: 7
- Watch event: 3
- Delete event: 4
- Issue comment event: 2
- Push event: 17
- Pull request event: 15
- Fork event: 4
- Create event: 8
Committers
Last synced: 5 months ago
Top Committers
| Name | Email | Commits |
|---|---|---|
| Bousquin | B****n@e****v | 487 |
| Cristina Mullin | m****a@e****v | 14 |
| github-actions[bot] | g****] | 4 |
| Romain Caneill | r****l@e****g | 3 |
| Timothy Hodson | 3****s | 1 |
Issues and Pull Requests
Last synced: 4 months ago
All Time
- Total issues: 45
- Total pull requests: 63
- Average time to close issues: 3 months
- Average time to close pull requests: 8 days
- Total issue authors: 2
- Total pull request authors: 6
- Average comments per issue: 0.64
- Average comments per pull request: 0.27
- Merged pull requests: 51
- Bot issues: 0
- Bot pull requests: 5
Past Year
- Issues: 6
- Pull requests: 10
- Average time to close issues: 18 days
- Average time to close pull requests: 2 days
- Issue authors: 1
- Pull request authors: 2
- Average comments per issue: 0.17
- Average comments per pull request: 0.1
- Merged pull requests: 5
- Bot issues: 0
- Bot pull requests: 5
Top Authors
Issue Authors
- jbousquin (38)
- rcaneill (6)
Pull Request Authors
- jbousquin (79)
- rcaneill (8)
- dependabot[bot] (5)
- cristinamullin (2)
- thodson-usgs (2)
- Batalex (2)
Packages
- Total packages: 1
- Total downloads:
  - pypi: 23 last month
- Total dependent packages: 0
- Total dependent repositories: 0
- Total versions: 3
- Total maintainers: 1
pypi.org: harmonize-wq
Package to standardize, clean, and wrangle Water Quality Portal data into more analytic-ready formats
- Homepage: https://github.com/USEPA/harmonize-wq
- Documentation: https://usepa.github.io/harmonize-wq/
- License: MIT License Copyright (c) 2023 U.S. Federal Government (in countries where recognized) Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
- Latest release: 0.5.0 (published over 1 year ago)
Dependencies
- dataretrieval >=0.7
- descartes >=1.1.0
- geopandas >=0.10.2
- pint >=0.18
