earthdatalogin
Making public access of public data a bit easier
Science Score: 26.0%
This score indicates how likely this project is to be science-related based on various indicators:
-
○CITATION.cff file
-
✓codemeta.json file
Found codemeta.json file -
✓.zenodo.json file
Found .zenodo.json file -
○DOI references
-
○Academic publication links
-
○Academic email domains
-
○Institutional organization owner
-
○JOSS paper metadata
-
○Scientific vocabulary similarity
Low similarity (18.2%) to scientific vocabulary
Last synced: 10 months ago
·
JSON representation
Repository
Making public access of public data a bit easier
Basic Info
- Host: GitHub
- Owner: boettiger-lab
- License: other
- Language: R
- Default Branch: main
- Homepage: https://boettiger-lab.github.io/earthdatalogin/
- Size: 10.9 MB
Statistics
- Stars: 26
- Watchers: 2
- Forks: 4
- Open Issues: 5
- Releases: 0
Created almost 3 years ago
· Last pushed 12 months ago
Metadata Files
Readme
Changelog
License
README.Rmd
---
output: github_document
---
```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "man/figures/README-",
out.width = "100%"
)
```
# earthdatalogin :artificial_satellite: :earth_asia:
[](https://github.com/boettiger-lab/earthdatalogin/actions/workflows/R-CMD-check.yaml)
`earthdatalogin` seeks to streamline the process of accessing NASA data from the Earth Data cloud program _from anywhere_. Because Amazon Web Services (AWS) typically charges egress fees whenever network traffic leaves the data center which hosts it, NASA has restricted access to its data hosted by Amazon to only be accessible from AWS servers running in the same data center (`us-west-2`) when using the S3 access protocol. However, NASA also makes this cloud data available publicly to any machine using a standard HTTPS connection. Both cases require requesting and managing credentials tied to a registered user name. This package merely makes that process easier.
## Installation
`earthdatalogin` is now on CRAN, and can simply be installed with
```r
install.packages("earthdatalogin")
```
Or you can install the development version of `earthdatalogin` from [GitHub](https://github.com/):
``` r
# install.packages("devtools")
devtools::install_github("boettiger-lab/earthdatalogin")
```
## Getting started
Access to EarthData is free, but users are required to [register](https://urs.earthdata.nasa.gov/home).
Currently, `earthdatalogin` provides it's own default credentials for a quick start.
Users are still encouraged to **[register](https://urs.earthdata.nasa.gov/home)**
their own credentials!
```{r}
library(earthdatalogin)
```
The easiest and most portable method of access is using the netrc basic authentication protocol for HTTP. Call `edl_netrc()` to set this up given your username and password (passed as optional arguments or read from the environmental variables. If neither provides credentials, `earthdatalogin` will provide it's own credentials, but you may experience rate limits more readily):
```{r}
edl_netrc()
```
If `edl_netrc()` has been called, then most existing spatial data packages in R can then seamlessly access NASA Earthdata over HTTP links.
```{r}
url <- "https://data.lpdaac.earthdatacloud.nasa.gov/lp-prod-protected/HLSL30.020/HLS.L30.T56JKT.2023246T235950.v2.0/HLS.L30.T56JKT.2023246T235950.v2.0.SAA.tif"
terra::rast(url, vsi=TRUE)
```
Note that no special `earthdatalogin` functions are needed in the rest of our code. This is important as it lets the user take advantage of any existing R packages or tutorials without modification, as if there was no authentication barrier to NASA EarthData in the first place. If we had not called `edl_netrc()` for authentication, this would throw an error that the file does not exist. This call needs be made only once per session (i.e. at the start of a script.)
## How does it work?
Most R packages (`terra`, `sf`, `stars`, and others) access spatial data by using an underlying C++ library called GDAL.[^1] GDAL is also used under the hood of many other spatial tools, from Python (`geopandas`, `rasterio`, others) to QGIS and Google Earth Engine. `earthdatalogin` sets a collection of config files and environmental variables used by GDAL to allow it to access authentication credentials. Crucially, the use of `netrc`-based authentication works just as well if you are running from a laptop or if you are running from inside AWS compute in `us-west-2` -- such as using the popular Openscapes 2i2c hub. This portability does not hold for other mechanisms, such as S3-based login, which in the case of NASA EarthData only works from inside AWS-based compute, and not true of the bearer token mechanism, which only works from _outside_ AWS-based compute. The `earthdatalogin` package does provide functions for using these other authentication mechanisms (see `edl_s3_token()` and `edl_set_token()`), but discourages their use as they are less portable while offering no performance advantage.[^2]
[^1]: Some R users have heard that the `rgdal` package is being deprecated. Don't confuse this with the GDAL C++ library being deprecated -- `rgdal` was only one of many R packages that used the GDAL C++ libraries, and was deprecated in favor of the same bindings to GDAL being available in `sf`.
[^2]: NASA's own documentation often points users to the S3-based access protocol when working on AWS compute. Note that the S3 tokens NASA's AWS setup provides expire every hour, and are specific to each DAAC, making it very difficult for users to work across data products from multiple DAACs in a single workflow. Note also that this authentication mechanism provides essentially no advantage in terms of speed or cost to the user or to NASA, especially for large data.
This function takes care of managing tokens for you. If you don't have any tokens, it will request one be minted. If your user name has tokens already, it will look them up and re-use them. (EDL will issue at most two tokens per user, and tokens expire after 6 month, but users shouldn't need to worry about this since `edl_set_token()` handles it). This function will also set the token in as a GDAL environmental variable. This means that popular R packages such as `terra`, `sf` or `stars` that all involve bindings to GDAL will automatically be able to utilize this token for any operations reading from HTTPS (using the `vsicurl` prefix).
## Cloud native workflows
Because NASA EarthData is also the first introduction to cloud-hosted data for many researchers, the fact that NASA tries to minimize egress charges by restricting S3 access to requests coming from AWS `us-west-2` compute center this sometimes gives the false impression that accessing data "from the cloud" _requires_ also paying Amazon Web Services for compute time. This is entirely spurious. For instance, [NOAA also provides](https://registry.opendata.aws/collab/noaa/) an extensive set of regularly updated data products on AWS without this restriction, which can be accessed over a standard HTTPS connection or using the S3 protocol as an anonymous user (with no keys or tokens). To maximize performance, heavy users of NOAA data will frequently choose to access that data from AWS compute in the same region (mostly `us-east-1` for NOAA), but this is not required. Technically speaking, we frequently use the vernacular phrase of "accessing cloud data" to refer to network based access of data using **HTTP range requests** -- the ability to ask a web server to transfer some range of bytes from an individual file rather than transferring the entire file across the network.
Note that we can now successfully access this file over https from any machine with an internet connection, and with no further authentication steps.
That URL could have been obtained in a variety of ways, including , searching individual DAACs, or programmatically using the EarthData STAC catalog.
The point here is that despite the barrier of `earthdatalogin`, the R code required for cloud-native access is now matches the standard strategies we would use for cloud-native access of any other data source.
A key feature that makes 'cloud native' access fast is that this access is _lazy_. All though these individual files could be quite large, our request has not downloaded the entire file -- it has instead used its knowledge of spatial data formats to read just those bytes of the file that provide critical metadata such as extent, projection, bands and coordinate ranges. Using that information, we can extract just the bits of data (locations, variables) we care about without having to download everything else as well. This saves the RAM on our computer, and drive space on our disks, as well as speeding up download. Without these techniques, processing the massive amounts of data coming from modern earth observation methods would be impractical.
*However*, not all data formats are equally amenable to this approach. Requesting a few bytes from a file across hundreds of miles of network connection is not the same thing as requesting a few bytes across the six inches of PCI connection between your processor and your hard-drive. More recent formats like "Cloud Optimized Geotiff" (COG) files are, as the name suggests, optimized for this. Complex older formats like HDF5 or GRIB are much less efficient. Network based range requests are not possible on some older (but still widely used) formats, like HDF4. In this last case, we will need to download the file to a local disk (a POSIX filesystem, not a hyperscale Object Store) to read it. Use `edl_download()` to handle authenticated downloads in this case.
## Vignettes
To facilitate cloud-native access of NASA EarthData from R, this package also includes a series of vignettes illustrating the use of some popular R packages in there (often less well-known) cloud-native modes. In each of these vignettes, we will take care to leverage "lazy evaluation" to avoid downloading more than we have to. With the exception of the vignette on S3 access from within `us-west-2`, these vignettes should run most anywhere, but will be most effective on machines with fast network access. Many university networks, and any cloud-hosted platform, such as GitHub Codespaces, offer excellent network performance for this purpose.
Owner
- Name: Boettiger Lab, UC Berkeley
- Login: boettiger-lab
- Kind: organization
- Location: Berkeley, CA
- Website: http://carlboettiger.info
- Repositories: 17
- Profile: https://github.com/boettiger-lab
Department of Environmental Science, Policy and Management
GitHub Events
Total
- Issues event: 6
- Watch event: 2
- Delete event: 1
- Member event: 1
- Issue comment event: 17
- Push event: 22
- Pull request event: 7
- Create event: 3
Last Year
- Issues event: 6
- Watch event: 2
- Delete event: 1
- Member event: 1
- Issue comment event: 17
- Push event: 22
- Pull request event: 7
- Create event: 3
Issues and Pull Requests
Last synced: 10 months ago
All Time
- Total issues: 5
- Total pull requests: 4
- Average time to close issues: 5 months
- Average time to close pull requests: 1 day
- Total issue authors: 2
- Total pull request authors: 2
- Average comments per issue: 0.8
- Average comments per pull request: 2.5
- Merged pull requests: 3
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 4
- Pull requests: 4
- Average time to close issues: 1 day
- Average time to close pull requests: 1 day
- Issue authors: 2
- Pull request authors: 2
- Average comments per issue: 1.0
- Average comments per pull request: 2.5
- Merged pull requests: 3
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- ateucher (5)
- wildintellect (2)
- eeholmes (1)
Pull Request Authors
- ateucher (4)
- cboettig (2)
- allanjust (1)
- eeholmes (1)
Top Labels
Issue Labels
Pull Request Labels
Packages
- Total packages: 1
-
Total downloads:
- cran 254 last-month
- Total dependent packages: 0
- Total dependent repositories: 0
- Total versions: 3
- Total maintainers: 1
cran.r-project.org: earthdatalogin
NASA 'EarthData' Access Utilities
- Homepage: https://boettiger-lab.github.io/earthdatalogin/
- Documentation: http://cran.r-project.org/web/packages/earthdatalogin/earthdatalogin.pdf
- License: MIT + file LICENSE
-
Latest release: 0.0.3
published 12 months ago
Rankings
Stargazers count: 18.3%
Forks count: 27.9%
Dependent packages count: 28.8%
Dependent repos count: 36.8%
Average: 39.7%
Downloads: 86.5%
Maintainers (1)
Last synced:
10 months ago
Dependencies
DESCRIPTION
cran
- httr * imports
- openssl * imports
- purrr * imports
- utils * imports
.github/workflows/R-CMD-check.yaml
actions
- actions/checkout v3 composite
- r-lib/actions/check-r-package v2 composite
- r-lib/actions/setup-pandoc v2 composite
- r-lib/actions/setup-r v2 composite
- r-lib/actions/setup-r-dependencies v2 composite
.github/workflows/pkgdown.yaml
actions
- JamesIves/github-pages-deploy-action v4.4.1 composite
- actions/checkout v3 composite
- r-lib/actions/setup-pandoc v2 composite
- r-lib/actions/setup-r v2 composite
- r-lib/actions/setup-r-dependencies v2 composite