noctua

Connect R to Athena using paws SDK (DBI Interface)

https://github.com/dyfanjones/noctua

Science Score: 13.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
  • DOI references
  • Academic publication links
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (16.1%) to scientific vocabulary

Keywords

athena aws database r
Last synced: 6 months ago · JSON representation

Repository

Connect R to Athena using paws SDK (DBI Interface)

Basic Info
Statistics
  • Stars: 45
  • Watchers: 3
  • Forks: 6
  • Open Issues: 18
  • Releases: 14
Topics
athena aws database r
Created over 6 years ago · Last pushed about 1 year ago
Metadata Files
Readme Changelog License

README.md

noctua

Project Status: Active – The project has reached a stable, usable
state and is being actively
developed. CRAN_Status_Badge downloads Codecov test coverage R build status noctua status badge

The goal of the noctua package is to provide a DBI-compliant interface to Amazon’s Athena (https://aws.amazon.com/athena/) using paws SDK. This allows for an efficient, easy setup connection to Athena using the paws SDK as a driver.

NOTE: Before using noctua you must have an aws account or have access to aws account with permissions allowing you to use Athena.

Why is the package called noctua

Athena/Minerva is the Greek/Roman god of wisdom, handicraft, and warfare. One of the main symbols for Athena is the Owl. Noctua is the latin word for Owl.

Installation:

To install noctua you can get it from CRAN with: r install.packages("noctua")

Or to get the development version from Github with: r remotes::install_github("dyfanjones/noctua")

Connection Methods

Hard Coding

The most basic way to connect to AWS Athena is to hard-code your access key and secret access key. However this method is not recommended as your credentials are hard-coded. ```r library(DBI)

con <- dbConnect(noctua::athena(), awsaccesskeyid='YOURACCESSKEYID', awssecretaccesskey='YOURSECRETACCESSKEY', s3stagingdir='s3://path/to/query/bucket/', region_name='eu-west-1') ```

AWS Profile Name

The next method is to use profile names set up by AWS CLI or created manually in the ~/.aws directory. To create the profile names manually please refer to: https://boto3.amazonaws.com/v1/documentation/api/latest/guide/configuration.html.

Setting up AWS CLI

noctua is compatible with AWS CLI. This allows your aws credentials to be stored and not be hard coded in your connection.

To install AWS CLI please refer to: https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-install.html, to configure AWS CLI please refer to: https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-configure.html

Once AWS CLI has been set up you will be able to connect to Athena by only putting the s3_staging_dir.

Using default profile name: r library(DBI) con <- dbConnect(noctua::athena(), s3_staging_dir = 's3://path/to/query/bucket/') Connecting to Athena using profile name other than default. r library(DBI) con <- dbConnect(noctua::athena(), profile_name = "your_profile", s3_staging_dir = 's3://path/to/query/bucket/')

Assuming ARN Role for connection

Another method in connecting to Athena is to use Amazon Resource Name (ARN) role.

Setting credentials in environmental variables: ```r library(noctua) assumerole(profilename = "YOURPROFILENAME", rolearn = "arn:aws:sts::123456789012:assumed-role/rolename/rolesessionname", set_env = TRUE)

Connect to Athena using temporary credentials

con <- dbConnect(athena(), s3stagingdir = 's3://path/to/query/bucket/') ``` Connecting to Athena directly using ARN role:

r library(DBI) con <- dbConnect(athena(), profile_name = "YOUR_PROFILE_NAME", role_arn = "arn:aws:sts::123456789012:assumed-role/role_name/role_session_name", s3_staging_dir = 's3://path/to/query/bucket/') To change the duration of ARN role session please change the parameter duration_seconds. By default duration_seconds is set to 3600 seconds (1 hour).

Usage

Basic Usage

Connect to athena, and send a query and return results back to R.

``` r library(DBI)

using default profile to connect

con <- dbConnect(noctua::athena(), s3stagingdir = 's3://path/to/query/bucket/')

res <- dbExecute(con, "SELECT * FROM one_row") dbFetch(res) dbClearResult(res) ```

To retrieve query in 1 step.

r dbGetQuery(con, "SELECT * FROM one_row")

Intermediate Usage

To create a tables in athena, dbExecute will send the query to athena and wait until query has been executed. This makes it and idea method to create tables within athena.

``` r query <- "CREATE EXTERNAL TABLE impressions ( requestBeginTime string, adId string, impressionId string, referrer string, userAgent string, userCookie string, ip string, number string, processId string, browserCookie string, requestEndTime string, timers struct, threadId string, hostname string, sessionId string) PARTITIONED BY (dt string) ROW FORMAT serde 'org.apache.hive.hcatalog.data.JsonSerDe' with serdeproperties ( 'paths'='requestBeginTime, adId, impressionId, referrer, userAgent, userCookie, ip' ) LOCATION 's3://elasticmapreduce/samples/hive-ads/tables/impressions/' ;"

dbExecute(con, query) ```

noctua has 2 extra function to return extra information around Athena tables: dbGetParitiions and dbShow

dbGetPartitions will return all the partitions (returns data.frame):

r noctua::dbGetPartition(con, "impressions")

dbShow will return the table’s ddl, so you will able to see how the table was constructed in Athena (returns SQL character):

r noctua::dbShow(con, "impressions")

Advanced Usage

r library(DBI) con <- dbConnect(noctua::athena(), s3_staging_dir = 's3://path/to/query/bucket/')

Sending data to Athena

noctua has created a method to send data.frame from R to Athena.

``` r

Check existing tables

dbListTables(con)

Upload iris to Athena

dbWriteTable(con, "iris", iris, partition=c("TIMESTAMP" = format(Sys.Date(), "%Y%m%d")))

Read in iris from Athena

dbReadTable(con, "iris")

Check new existing tables in Athena

dbListTables(con)

Check if iris exists in Athena

dbExistsTable(con, "iris") ```

Please check out noctua method for dbWriteTable for more information in how to upload data to AWS Athena and AWS S3.

For more information around how to get the most out of AWS Athena when uploading data please check out: Top 10 Performance Tuning Tips for Amazon Athena

Tidyverse Usage

Creating a connection to Athena and query and already existing table iris that was created in previous example.

``` r library(DBI) library(dplyr)

con <- dbConnect(noctua::athena(), awsaccesskeyid='YOURACCESSKEYID', awssecretaccesskey='YOURSECRETACCESSKEY', s3stagingdir='s3://path/to/query/bucket/', region_name='eu-west-1') tbl(con, sql("SELECT * FROM iris")) ```

# Source:   SQL [?? x 5]
# Database: Athena 0.1.4 [eu-west-1/default]
   sepal_length sepal_width petal_length petal_width species
          <dbl>       <dbl>        <dbl>       <dbl> <chr>  
 1          5.1         3.5          1.4         0.2 setosa 
 2          4.9         3            1.4         0.2 setosa 
 3          4.7         3.2          1.3         0.2 setosa 
 4          4.6         3.1          1.5         0.2 setosa 
 5          5           3.6          1.4         0.2 setosa 
 6          5.4         3.9          1.7         0.4 setosa 
 7          4.6         3.4          1.4         0.3 setosa 
 8          5           3.4          1.5         0.2 setosa 
 9          4.4         2.9          1.4         0.2 setosa 
10          4.9         3.1          1.5         0.1 setosa 
# … with more rows

dplyr provides lazy querying with allows to short hand tbl(con, sql("SELECT * FROM iris")) to tbl(con, "iris"). For more information please look at https://solutions.posit.co/connections/db/r-packages/dplyr/

r tbl(con, "iris")

# Source:   table<iris> [?? x 5]
# Database: Athena 0.1.4 [eu-west-1/default]
   sepal_length sepal_width petal_length petal_width species
          <dbl>       <dbl>        <dbl>       <dbl> <chr>  
 1          5.1         3.5          1.4         0.2 setosa 
 2          4.9         3            1.4         0.2 setosa 
 3          4.7         3.2          1.3         0.2 setosa 
 4          4.6         3.1          1.5         0.2 setosa 
 5          5           3.6          1.4         0.2 setosa 
 6          5.4         3.9          1.7         0.4 setosa 
 7          4.6         3.4          1.4         0.3 setosa 
 8          5           3.4          1.5         0.2 setosa 
 9          4.4         2.9          1.4         0.2 setosa 
10          4.9         3.1          1.5         0.1 setosa 
# … with more rows

Querying Athena with profile_name instead of hard coding aws_access_key_id and aws_secret_access_key. By using profile_name extra Meta Data is returned in the query to give users extra information.

r con <- dbConnect(noctua::athena(), profile_name = "your_profile", s3_staging_dir='s3://path/to/query/bucket/') tbl(con, "iris")) %>% filter(petal_length < 1.3)

# Source:   lazy query [?? x 5]
# Database: Athena 0.1.4 [your_profile@eu-west-1/default]
   sepal_length sepal_width petal_length petal_width species
          <dbl>       <dbl>        <dbl>       <dbl> <chr>  
 1          4.7         3.2          1.3         0.2 setosa 
 2          4.3         3            1.1         0.1 setosa 
 3          5.8         4            1.2         0.2 setosa 
 4          5.4         3.9          1.3         0.4 setosa 
 5          4.6         3.6          1           0.2 setosa 
 6          5           3.2          1.2         0.2 setosa 
 7          5.5         3.5          1.3         0.2 setosa 
 8          4.4         3            1.3         0.2 setosa 
 9          5           3.5          1.3         0.3 setosa 
10          4.5         2.3          1.3         0.3 setosa 
# … with more rows

r tbl(con, "iris") %>% select(contains("sepal"), contains("petal"))

# Source:   lazy query [?? x 4]
# Database: Athena 0.1.4 [your_profile@eu-west-1/default]
   sepal_length sepal_width petal_length petal_width
          <dbl>       <dbl>        <dbl>       <dbl>
 1          5.1         3.5          1.4         0.2
 2          4.9         3            1.4         0.2
 3          4.7         3.2          1.3         0.2
 4          4.6         3.1          1.5         0.2
 5          5           3.6          1.4         0.2
 6          5.4         3.9          1.7         0.4
 7          4.6         3.4          1.4         0.3
 8          5           3.4          1.5         0.2
 9          4.4         2.9          1.4         0.2
10          4.9         3.1          1.5         0.1
# … with more rows

Upload data using dplyr function copy_to and compute.

``` r library(DBI) library(dplyr)

con <- dbConnect(noctua::athena(), profilename = "yourprofile", s3stagingdir='s3://path/to/query/bucket/') ```

Write data.frame to Athena table r copy_to(con, mtcars, s3_location = "s3://mybucket/data/")

Write Athena table from tblsql ```r athenamtcars <- tbl(con, "mtcars") mtcarsfilter <- athenamtcars %>% filter(gear >=4) ```

Create athena with unique table name r mtcars_filer %>% compute()

Create athena with specified name and s3 location ```r mtcarsfiler %>% compute("mtcarsfiler", s3location = "s3://mybucket/mtcarsfiler/")

Disconnect from Athena

dbDisconnect(con) ```

Work Groups

Creating work group:

``` r library(noctua) library(DBI)

con <- dbConnect(noctua::athena(), profilename = "yourprofile", encryptionoption = "SSES3", s3stagingdir='s3://path/to/query/bucket/')

createworkgroup(con, "demoworkgroup", description = "This is a demo work group", tags = tagoptions(key= "demoworkgroup", value = "demo01")) ```

List work groups:

r list_work_groups(con)

[[1]]
[[1]]$Name
[1] "demo_work_group"

[[1]]$State
[1] "ENABLED"

[[1]]$Description
[1] "This is a demo work group"

[[1]]$CreationTime
2019-09-06 18:51:28.902000+01:00


[[2]]
[[2]]$Name
[1] "primary"

[[2]]$State
[1] "ENABLED"

[[2]]$Description
[1] ""

[[2]]$CreationTime
2019-08-22 16:14:47.902000+01:00

Update work group:

r update_work_group(con, "demo_work_group", description = "This is a demo work group update")

Return work group meta data:

r get_work_group(con, "demo_work_group")

$Name
[1] "demo_work_group"

$State
[1] "ENABLED"

$Configuration
$Configuration$ResultConfiguration
$Configuration$ResultConfiguration$OutputLocation
[1] "s3://path/to/query/bucket/"

$Configuration$ResultConfiguration$EncryptionConfiguration
$Configuration$ResultConfiguration$EncryptionConfiguration$EncryptionOption
[1] "SSE_S3"



$Configuration$EnforceWorkGroupConfiguration
[1] FALSE

$Configuration$PublishCloudWatchMetricsEnabled
[1] FALSE

$Configuration$BytesScannedCutoffPerQuery
[1] 10000000

$Configuration$RequesterPaysEnabled
[1] FALSE


$Description
[1] "This is a demo work group update"

$CreationTime
2019-09-06 18:51:28.902000+01:00

Connect to Athena using work group:

r con <- dbConnect(noctua::athena(), work_group = "demo_work_group")

Delete work group:

r delete_work_group(con, "demo_work_group")

Similar Projects

Python:

R:

Comparison:

noctua is basically the same as RAthena however it utilises the R AWS SDK paws to achieve the same goal.

Owner

  • Name: Larefly
  • Login: DyfanJones
  • Kind: user
  • Location: United Kingdom

GitHub Events

Total
  • Issues event: 8
  • Watch event: 1
  • Issue comment event: 17
  • Push event: 3
  • Pull request review event: 1
  • Pull request review comment event: 1
  • Pull request event: 1
  • Fork event: 1
Last Year
  • Issues event: 8
  • Watch event: 1
  • Issue comment event: 17
  • Push event: 3
  • Pull request review event: 1
  • Pull request review comment event: 1
  • Pull request event: 1
  • Fork event: 1

Committers

Last synced: 9 months ago

All Time
  • Total Commits: 995
  • Total Committers: 6
  • Avg Commits per committer: 165.833
  • Development Distribution Score (DDS): 0.211
Past Year
  • Commits: 0
  • Committers: 0
  • Avg Commits per committer: 0.0
  • Development Distribution Score (DDS): 0.0
Top Committers
Name Email Commits
DyfanJones d****s@s****m 785
Dyfan Jones d****s@t****m 201
Ossi Lehtinen o****n@g****m 5
Dan Snow d****w@c****v 2
olivroy 5****y 1
Maximilian Girlich m****h@m****m 1
Committer Domains (Top 20 + Academic)

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 59
  • Total pull requests: 72
  • Average time to close issues: 10 days
  • Average time to close pull requests: 1 day
  • Total issue authors: 20
  • Total pull request authors: 6
  • Average comments per issue: 4.59
  • Average comments per pull request: 0.4
  • Merged pull requests: 69
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 8
  • Pull requests: 2
  • Average time to close issues: 7 minutes
  • Average time to close pull requests: about 4 hours
  • Issue authors: 4
  • Pull request authors: 1
  • Average comments per issue: 0.13
  • Average comments per pull request: 0.5
  • Merged pull requests: 1
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • DyfanJones (20)
  • OssiLehtinen (7)
  • tyner (7)
  • mgirlich (2)
  • chrissuh17 (2)
  • davidski (2)
  • jeancochrane (2)
  • ellmanj (2)
  • fabhans (2)
  • tamuanand (2)
  • emmansh (2)
  • dfsnow (1)
  • joeramirez (1)
  • vhpietil (1)
  • nicholsn (1)
Pull Request Authors
  • DyfanJones (65)
  • pegeler (2)
  • dfsnow (2)
  • OssiLehtinen (2)
  • olivroy (2)
  • mgirlich (1)
Top Labels
Issue Labels
bug (13) enhancement (11) Release issue (3) question (3) documentation (2) Performance (2) sdk issue (1) help wanted (1)
Pull Request Labels
Performance (1)

Packages

  • Total packages: 1
  • Total downloads:
    • cran 748 last-month
  • Total dependent packages: 0
  • Total dependent repositories: 1
  • Total versions: 27
  • Total maintainers: 1
cran.r-project.org: noctua

Connect to 'AWS Athena' using R 'AWS SDK' 'paws' ('DBI' Interface)

  • Versions: 27
  • Dependent Packages: 0
  • Dependent Repositories: 1
  • Downloads: 748 Last month
Rankings
Stargazers count: 7.5%
Forks count: 10.8%
Downloads: 14.9%
Average: 17.1%
Dependent repos count: 23.9%
Dependent packages count: 28.7%
Maintainers (1)
Last synced: 6 months ago

Dependencies

DESCRIPTION cran
  • R >= 3.2.0 depends
  • DBI >= 0.7 imports
  • data.table >= 1.12.4 imports
  • methods * imports
  • paws >= 0.1.5 imports
  • stats * imports
  • utils * imports
  • uuid >= 0.1 imports
  • arrow * suggests
  • bit64 * suggests
  • covr * suggests
  • dbplyr >= 1.4.3 suggests
  • dplyr >= 0.8.0 suggests
  • jsonify * suggests
  • jsonlite * suggests
  • knitr * suggests
  • readr * suggests
  • rmarkdown * suggests
  • testthat * suggests
  • tibble * suggests
  • vroom >= 1.2.0 suggests
.github/workflows/R-CMD-check.yaml actions
  • actions/checkout v3 composite
  • actions/upload-artifact main composite
  • r-lib/actions/check-r-package v2 composite
  • r-lib/actions/setup-r v2 composite
  • r-lib/actions/setup-r-dependencies v2 composite
docker/Dockerfile docker
  • rstudio/r-base 4.0.2-centos7 build
.github/workflows/pkgdown.yaml actions
  • JamesIves/github-pages-deploy-action v4.4.1 composite
  • actions/checkout v3 composite
  • r-lib/actions/setup-pandoc v2 composite
  • r-lib/actions/setup-r v2 composite
  • r-lib/actions/setup-r-dependencies v2 composite
.github/workflows/pr-commands.yaml actions
  • actions/checkout v3 composite
  • r-lib/actions/pr-fetch v2 composite
  • r-lib/actions/pr-push v2 composite
  • r-lib/actions/setup-r v2 composite
  • r-lib/actions/setup-r-dependencies v2 composite