https://github.com/awslabs/emr-dynamodb-connector

Implementations of open source Apache Hadoop/Hive interfaces which allow for ingesting data from Amazon DynamoDB

Science Score: 36.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Committers with academic emails
    1 of 42 committers (2.4%) from academic institutions
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (11.8%) to scientific vocabulary

Keywords from Contributors

data-pipeline projection generic sequences scheduling interactive jdbc optim orchestration data-engineering
Last synced: 10 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: awslabs
  • License: apache-2.0
  • Language: Java
  • Default Branch: master
  • Homepage:
  • Size: 469 KB
Statistics
  • Stars: 227
  • Watchers: 52
  • Forks: 139
  • Open Issues: 59
  • Releases: 3
Created almost 10 years ago · Last pushed about 1 year ago
Metadata Files
Readme License

README.md

emr-dynamodb-connector

Access data stored in Amazon DynamoDB with Apache Hadoop, Apache Hive, and Apache Spark

Introduction

You can use this connector to access data in Amazon DynamoDB using Apache Hadoop, Apache Hive, and Apache Spark in Amazon EMR. You can process data directly in DynamoDB using these frameworks, or join data in DynamoDB with data in Amazon S3, Amazon RDS, or other storage layers that can be accessed by Amazon EMR.

Currently, the connector supports the following data types:

| Hive type | Default DynamoDB type | Alternate DynamoDB type(s) |
| --- | --- | --- |
| string | string (S) | |
| bigint or double | number (N) | |
| binary | binary (B) | |
| boolean | boolean (BOOL) | |
| array | list (L) | number set (NS), string set (SS), binary set (BS) |
| map | item (ITEM) | map (M) |
| map | map (M) | |
| struct | map (M) | |

The connector can serialize null values as DynamoDB null type (NULL).

Hive StorageHandler Implementation

For more information, see Hive Commands Examples for Exporting, Importing, and Querying Data in DynamoDB in the Amazon DynamoDB Developer Guide.

Hadoop InputFormat and OutputFormat Implementation

Implementations of the Apache Hadoop InputFormat and OutputFormat interfaces are included, which allow DynamoDB AttributeValues to be ingested directly by MapReduce jobs. For an example of how to use these classes, see Set Up a Hive Table to Run Hive Commands in the Amazon EMR Release Guide, as well as their usage in the Import/Export tool classes DynamoDBExport.java and DynamoDBImport.java.

Import/Export Tool

This simple tool uses the InputFormat and OutputFormat implementations to provide an easy way to import data to and export data from DynamoDB.

Supported Versions

Currently the project builds against Hive 1.0.0, 1.2.1, and 2.3.0. Set these versions with the hive1.version, hive1.2.version, and hive2.version properties, respectively, in the root Maven pom.xml.

How to Build

After cloning, run mvn clean install.

Example: Hive StorageHandler

Syntax to create a table using the DynamoDBStorageHandler class:

    CREATE EXTERNAL TABLE hive_tablename (
        hive_column1_name column1_datatype,
        hive_column2_name column2_datatype
    )
    STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
    TBLPROPERTIES (
        "dynamodb.table.name" = "dynamodb_tablename",
        "dynamodb.column.mapping" = "hive_column1_name:dynamodb_attribute1_name,hive_column2_name:dynamodb_attribute2_name",
        "dynamodb.type.mapping" = "hive_column1_name:dynamodb_attribute1_type_abbreviation",
        "dynamodb.null.serialization" = "true"
    );

dynamodb.type.mapping and dynamodb.null.serialization are optional parameters.
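Both mapping properties take a comma-separated list of name:value pairs. The following is a minimal sketch of how such a value could be assembled, using only the Java standard library. The class MappingProperty, its render method, and the column and attribute names are hypothetical illustrations and are not part of the connector.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

// Illustrative helper, not part of the connector: renders the comma-separated
// "name:value" pairs used by dynamodb.column.mapping and dynamodb.type.mapping.
class MappingProperty {

    // Joins the entries as name:value pairs separated by commas, in insertion order.
    static String render(Map<String, String> pairs) {
        return pairs.entrySet().stream()
                .map(e -> e.getKey() + ":" + e.getValue())
                .collect(Collectors.joining(","));
    }

    public static void main(String[] args) {
        // Hypothetical Hive column names mapped to DynamoDB attribute names.
        Map<String, String> columns = new LinkedHashMap<>();
        columns.put("customer_id", "CustomerId");
        columns.put("tags", "Tags");

        // Hypothetical type override: store the Hive array column as a string set (SS).
        Map<String, String> types = new LinkedHashMap<>();
        types.put("tags", "SS");

        System.out.println("\"dynamodb.column.mapping\" = \"" + render(columns) + "\"");
        System.out.println("\"dynamodb.type.mapping\" = \"" + render(types) + "\"");
    }
}
```

The printed lines can be pasted into the TBLPROPERTIES clause. A LinkedHashMap is used so the pairs appear in the order they were added.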

A Hive query automatically chooses the most suitable secondary index, if one exists, based on the search condition. To be eligible, an index must have the following properties:

1. All of its index keys appear in the Hive query search condition.
2. It contains all the DynamoDB attributes mentioned in dynamodb.column.mapping.

If your Hive table maps more columns than the index contains, but you still want to use the index for queries that select only attributes within it, consider creating another Hive table whose mappings include only the index attributes. Use that table to read the index attributes and reduce table scans.
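The two index properties can be expressed as a pair of subset checks. The sketch below models only those two conditions. It is not the connector's actual selection code, and the Index class, the isEligible method, and the attribute names are hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative model of the two eligibility conditions; the connector's real
// selection logic may differ.
class IndexEligibility {

    // Hypothetical description of a secondary index: its key attributes and
    // every attribute it contains.
    static final class Index {
        final String name;
        final List<String> keyAttributes;
        final Set<String> projectedAttributes;

        Index(String name, List<String> keyAttributes, Set<String> projectedAttributes) {
            this.name = name;
            this.keyAttributes = keyAttributes;
            this.projectedAttributes = projectedAttributes;
        }
    }

    // Condition 1: every index key appears in the query's search condition.
    // Condition 2: the index contains every attribute named in dynamodb.column.mapping.
    static boolean isEligible(Index index, Set<String> conditionAttributes, Set<String> mappedAttributes) {
        boolean keysInCondition = conditionAttributes.containsAll(index.keyAttributes);
        boolean coversMapping = index.projectedAttributes.containsAll(mappedAttributes);
        return keysInCondition && coversMapping;
    }

    static Set<String> setOf(String... values) {
        return new HashSet<>(Arrays.asList(values));
    }

    public static void main(String[] args) {
        Index byStatus = new Index("by_status", Arrays.asList("status"),
                setOf("status", "order_id", "total"));
        Set<String> condition = setOf("status");

        // Narrow table: every mapped attribute is in the index.
        System.out.println(isEligible(byStatus, condition, setOf("order_id", "status", "total"))); // true
        // Wide table: "notes" is mapped but is not in the index.
        System.out.println(isEligible(byStatus, condition, setOf("order_id", "status", "total", "notes"))); // false
    }
}
```

The second call shows why a narrower Hive table helps: the wider mapping includes an attribute the index does not contain, so the index fails the second condition.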

Example: Input/Output Formats with Spark

Using the DynamoDBInputFormat and DynamoDBOutputFormat classes with spark-shell:

```
$ spark-shell --jars /usr/share/aws/emr/ddb/lib/emr-ddb-hadoop.jar
...
import org.apache.hadoop.io.Text
import org.apache.hadoop.dynamodb.DynamoDBItemWritable
import org.apache.hadoop.dynamodb.read.DynamoDBInputFormat
import org.apache.hadoop.dynamodb.write.DynamoDBOutputFormat
import org.apache.hadoop.mapred.JobConf
import org.apache.hadoop.io.LongWritable

var jobConf = new JobConf(sc.hadoopConfiguration)
jobConf.set("dynamodb.input.tableName", "myDynamoDBTable")

jobConf.set("mapred.output.format.class", "org.apache.hadoop.dynamodb.write.DynamoDBOutputFormat")
jobConf.set("mapred.input.format.class", "org.apache.hadoop.dynamodb.read.DynamoDBInputFormat")

var orders = sc.hadoopRDD(jobConf, classOf[DynamoDBInputFormat], classOf[Text], classOf[DynamoDBItemWritable])

orders.count()
```

Example: Import/Export Tool

Export usage

java -cp target/emr-dynamodb-tools-4.2.0-SNAPSHOT.jar org.apache.hadoop.dynamodb.tools.DynamoDBExport /where/output/should/go my-dynamo-table-name

Import usage

java -cp target/emr-dynamodb-tools-4.2.0-SNAPSHOT.jar org.apache.hadoop.dynamodb.tools.DynamoDBImport /where/input/data/is my-dynamo-table-name

Additional options

```
export <path> <table-name> [<read-ratio>] [<total-segments>]

read-ratio: maximum percent of the specified DynamoDB table's read capacity to use for export

total-segments: number of desired MapReduce splits to use for the export
```

```
import <path> <table-name> [<write-ratio>]

write-ratio: maximum percent of the specified DynamoDB table's write capacity to use for import
```
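As a rough illustration of what the ratio options mean, the sketch below computes the capacity a job may consume from a table's provisioned capacity. It assumes the ratio is given as a fraction such as 0.5 for 50%, which the text above does not state, so confirm the expected form against the tool before relying on it. The CapacityBudget class and the capacity figures are hypothetical.

```java
// Illustrative arithmetic only; the tool's actual throttling behavior is not
// described here and may differ.
class CapacityBudget {

    // Capacity units the job may consume: provisioned units scaled by the ratio.
    // Assumption: the ratio is a fraction (0.5 means 50%), not a whole-number percent.
    static double budget(long provisionedUnits, double ratio) {
        if (ratio <= 0) {
            throw new IllegalArgumentException("ratio must be positive");
        }
        return provisionedUnits * ratio;
    }

    public static void main(String[] args) {
        // Hypothetical table provisioned with 1000 read capacity units, exported at ratio 0.5.
        System.out.println(budget(1000, 0.5)); // prints 500.0
    }
}
```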

Maven Dependency

To depend on the specific components in your projects, add one (or both) of the following to your pom.xml.

Hadoop InputFormat/OutputFormats & DynamoDBItemWritable

    <dependency>
        <groupId>com.amazon.emr</groupId>
        <artifactId>emr-dynamodb-hadoop</artifactId>
        <version>4.2.0</version>
    </dependency>

Hive SerDes & StorageHandler

    <dependency>
        <groupId>com.amazon.emr</groupId>
        <artifactId>emr-dynamodb-hive</artifactId>
        <version>4.2.0</version>
    </dependency>

Contributing

  • If you find a bug or would like to see an improvement, open an issue.

    Check first to make sure there isn't one already open. We'll do our best to respond to issues and review pull requests.

  • Want to fix it yourself? Open a pull request!

    If adding new functionality, include new, passing unit tests, as well as documentation. Also include a snippet in your pull request showing that all current unit tests pass. Tests are run by default when invoking any Maven goal that results in the package goal being executed (mvn clean install will run them and produce output showing the results).

  • Follow the Google Java Style Guide

    Style is enforced at build time using the Apache Maven Checkstyle Plugin.

Owner

  • Name: Amazon Web Services - Labs
  • Login: awslabs
  • Kind: organization
  • Location: Seattle, WA

AWS Labs

GitHub Events

Total
  • Create event: 4
  • Issues event: 1
  • Release event: 1
  • Watch event: 9
  • Delete event: 1
  • Member event: 1
  • Issue comment event: 12
  • Push event: 10
  • Pull request review event: 4
  • Pull request event: 23
  • Fork event: 2
Last Year
  • Create event: 4
  • Issues event: 1
  • Release event: 1
  • Watch event: 9
  • Delete event: 1
  • Member event: 1
  • Issue comment event: 12
  • Push event: 10
  • Pull request review event: 4
  • Pull request event: 23
  • Fork event: 2

Committers

Last synced: over 2 years ago

All Time
  • Total Commits: 87
  • Total Committers: 42
  • Avg Commits per committer: 2.071
  • Development Distribution Score (DDS): 0.897
Past Year
  • Commits: 19
  • Committers: 7
  • Avg Commits per committer: 2.714
  • Development Distribution Score (DDS): 0.684
Top Committers
Name Email Commits
Michele Miao m****o@a****m 9
Kevin Zhao k****o@a****m 6
Mike Grimes g****i@a****m 6
Kevin Zhao 1****o 5
Mike Grimes m****s@m****u 5
norvellj n****j 4
dependabot[bot] 4****] 4
foscraig f****g@a****m 4
Yuanhao y****u@a****m 3
Aki Tanaka t****h@a****m 3
Sam Garrett s****1 2
mimaomao 1****o 2
Junyang Li j****l@a****m 2
James Norvell n****j@a****m 2
TAK LON WU w****n@a****m 2
z-york z****0@g****m 2
Hernan Vivani v****h@a****m 1
Hernan Vivani v****h@d****m 1
Derick Anderson e****s@g****m 1
Illya Yalovyy y****i@a****m 1
Daniel Haviv d****v@f****m 1
Martin Dam m****m@g****m 1
Ajay Jadhav j****b@a****m 1
Robin Tang r****s@g****m 1
Seth Fitzsimmons s****h@m****t 1
Rajasekaran r****y@f****m 1
dauphinwill s****g@g****m 1
Rahil Chertara r****r@a****m 1
ruankd r****d@g****m 1
Ian Carpenter s****n@x****g 1
and 12 more...

Issues and Pull Requests

Last synced: 11 months ago

All Time
  • Total issues: 66
  • Total pull requests: 74
  • Average time to close issues: 7 months
  • Average time to close pull requests: 5 months
  • Total issue authors: 57
  • Total pull request authors: 28
  • Average comments per issue: 1.86
  • Average comments per pull request: 0.66
  • Merged pull requests: 52
  • Bot issues: 0
  • Bot pull requests: 8
Past Year
  • Issues: 2
  • Pull requests: 22
  • Average time to close issues: N/A
  • Average time to close pull requests: 2 days
  • Issue authors: 2
  • Pull request authors: 6
  • Average comments per issue: 0.0
  • Average comments per pull request: 0.32
  • Merged pull requests: 18
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • sivankumar86 (5)
  • anmsf (2)
  • anandhs (2)
  • billonahill (2)
  • fkhantsi (2)
  • railsmith (2)
  • 0xCafeDude (1)
  • dlylitsuka (1)
  • derekmceachern (1)
  • huanchh (1)
  • kaichen2000 (1)
  • gdoron (1)
  • syudb (1)
  • aajisaka (1)
  • ganeshashree (1)
Pull Request Authors
  • nickherzig (19)
  • michqm (8)
  • dependabot[bot] (8)
  • kevnzhao (7)
  • srbgupta86 (6)
  • Edmynx (5)
  • julienrf (4)
  • gguifelixamz (4)
  • foscraig (4)
  • ejeffrli (4)
  • smadurawe-oss (4)
  • chloeyamzn (3)
  • mimaomao (2)
  • phoenixphoebus (2)
  • luyuanhao (2)
Top Labels
Issue Labels
  • enhancement (3)
  • needs investigation (1)
Pull Request Labels
  • dependencies (8)

Packages

  • Total packages: 11
  • Total downloads: unknown
  • Total docker downloads: 1,314,385
  • Total dependent packages: 17
    (may contain duplicates)
  • Total dependent repositories: 24
    (may contain duplicates)
  • Total versions: 103
repo1.maven.org: com.amazon.emr:emr-dynamodb-hadoop

EMR DynamoDB Hadoop Connector

  • Versions: 11
  • Dependent Packages: 4
  • Dependent Repositories: 8
  • Docker Downloads: 1,312,843
Rankings
Docker downloads count: 2.0%
Dependent repos count: 8.7%
Average: 12.7%
Dependent packages count: 13.9%
Forks count: 16.9%
Stargazers count: 22.2%
Last synced: 10 months ago
repo1.maven.org: com.amazon.emr:shims-common

Common shim interface

  • Versions: 11
  • Dependent Packages: 6
  • Dependent Repositories: 3
  • Docker Downloads: 257
Rankings
Docker downloads count: 5.6%
Dependent packages count: 11.5%
Dependent repos count: 13.8%
Average: 14.0%
Forks count: 16.9%
Stargazers count: 22.2%
Last synced: 11 months ago
repo1.maven.org: com.amazon.emr:hive1-shims

Shims for Hive-1.x compatibility

  • Versions: 4
  • Dependent Packages: 2
  • Dependent Repositories: 3
  • Docker Downloads: 257
Rankings
Docker downloads count: 5.6%
Dependent repos count: 13.8%
Average: 16.3%
Forks count: 16.9%
Stargazers count: 22.2%
Dependent packages count: 23.1%
Last synced: 11 months ago
repo1.maven.org: com.amazon.emr:hive2-shims

Shims for Hive-2.x compatibility

  • Versions: 11
  • Dependent Packages: 2
  • Dependent Repositories: 3
  • Docker Downloads: 257
Rankings
Docker downloads count: 5.6%
Dependent repos count: 13.8%
Average: 16.3%
Forks count: 16.9%
Stargazers count: 22.2%
Dependent packages count: 23.1%
Last synced: 10 months ago
repo1.maven.org: com.amazon.emr:shims-loader

Loader for the EMRDynamoDBShims classes

  • Versions: 11
  • Dependent Packages: 1
  • Dependent Repositories: 3
  • Docker Downloads: 257
Rankings
Docker downloads count: 5.6%
Dependent repos count: 13.8%
Forks count: 16.9%
Average: 18.3%
Stargazers count: 22.2%
Dependent packages count: 33.0%
Last synced: 10 months ago
repo1.maven.org: com.amazon.emr:hive1.2-shims

Shims for Hive-1.2.x compatibility

  • Versions: 4
  • Dependent Packages: 1
  • Dependent Repositories: 3
  • Docker Downloads: 257
Rankings
Docker downloads count: 5.6%
Dependent repos count: 13.8%
Forks count: 16.9%
Average: 18.3%
Stargazers count: 22.2%
Dependent packages count: 33.0%
Last synced: 11 months ago
repo1.maven.org: com.amazon.emr:emr-dynamodb-tools

EMR DynamoDB Import/Export Tools

  • Versions: 11
  • Dependent Packages: 0
  • Dependent Repositories: 0
Rankings
Forks count: 11.7%
Stargazers count: 16.2%
Average: 27.2%
Dependent repos count: 32.0%
Dependent packages count: 48.9%
Last synced: 10 months ago
repo1.maven.org: com.amazon.emr:emr-dynamodb-connector

EMR DynamoDB Connector

  • Versions: 11
  • Dependent Packages: 0
  • Dependent Repositories: 0
Rankings
Forks count: 11.7%
Stargazers count: 16.2%
Average: 27.2%
Dependent repos count: 32.0%
Dependent packages count: 48.9%
Last synced: 11 months ago
repo1.maven.org: com.amazon.emr:shims

Shims for Hive 1.0/1.2/2.0 compatibility

  • Versions: 11
  • Dependent Packages: 0
  • Dependent Repositories: 0
Rankings
Forks count: 11.7%
Stargazers count: 16.2%
Average: 27.2%
Dependent repos count: 32.0%
Dependent packages count: 48.9%
Last synced: 10 months ago
repo1.maven.org: com.amazon.emr:emr-dynamodb-hive

EMR DynamoDB Hive Connector

  • Versions: 11
  • Dependent Packages: 0
  • Dependent Repositories: 1
Rankings
Forks count: 16.9%
Dependent repos count: 20.7%
Stargazers count: 22.2%
Average: 27.5%
Dependent packages count: 50.1%
Last synced: 10 months ago
repo1.maven.org: com.amazon.emr:hive3-shims

Shims for Hive-3.x compatibility

  • Versions: 7
  • Dependent Packages: 1
  • Dependent Repositories: 0
  • Docker Downloads: 257
Rankings
Dependent packages count: 33.0%
Average: 34.1%
Dependent repos count: 35.3%
Last synced: 11 months ago

Dependencies

emr-dynamodb-hadoop/pom.xml maven
  • com.amazonaws:aws-java-sdk-dynamodb
  • com.fasterxml.jackson.core:jackson-databind ${jackson-databind.version}
  • com.google.code.gson:gson
  • joda-time:joda-time ${joda-time.version}
  • junit:junit
  • org.apache.hadoop:hadoop-common
  • org.apache.hadoop:hadoop-mapreduce-client-app
  • org.mockito:mockito-core
  • org.powermock:powermock-api-mockito
  • org.powermock:powermock-module-junit4
emr-dynamodb-hive/pom.xml maven
  • com.amazon.emr:emr-dynamodb-hadoop 4.17.0-SNAPSHOT provided
  • com.amazon.emr:shims-common 4.17.0-SNAPSHOT
  • com.amazon.emr:shims-loader 4.17.0-SNAPSHOT
  • org.apache.hadoop:hadoop-common ${hadoop.version}
  • org.apache.hadoop:hadoop-hdfs
  • org.apache.hadoop:hadoop-mapreduce-client-jobclient
  • org.apache.hive:hive-exec
  • org.apache.hive:hive-metastore
  • org.apache.hive:hive-service
  • com.amazon.emr:emr-dynamodb-hadoop 4.17.0-SNAPSHOT test
  • org.apache.hadoop:hadoop-mapreduce-client-hs ${hadoop.version} test
  • org.apache.hadoop:hadoop-yarn-server-tests ${hadoop.version} test
emr-dynamodb-tools/pom.xml maven
  • com.amazon.emr:emr-dynamodb-hadoop 4.17.0-SNAPSHOT
  • com.google.guava:guava
  • junit:junit
  • org.apache.hadoop:hadoop-common
  • org.apache.hadoop:hadoop-mapreduce-client-core
  • org.hamcrest:hamcrest-all
  • org.powermock:powermock-api-mockito
pom.xml maven
  • org.apache.hadoop:hadoop-common 2.7.3 provided
  • org.apache.hadoop:hadoop-mapreduce-client-app 2.7.3 provided
  • org.apache.hadoop:hadoop-mapreduce-client-core 2.7.3 provided
  • org.apache.hive:hive-exec 2.3.0 provided
  • org.apache.hive:hive-metastore 2.3.0 provided
  • org.apache.hive:hive-service 2.3.0 provided
  • com.amazonaws:aws-java-sdk-dynamodb 1.11.475
  • com.google.code.gson:gson 2.1
  • com.google.guava:guava 24.1.1-jre
  • pl.project13.maven:git-commit-id-plugin 2.2.4
  • junit:junit 4.13.1 test
  • org.apache.hadoop:hadoop-hdfs 2.7.3 test
  • org.apache.hadoop:hadoop-mapreduce-client-jobclient 2.7.3 test
  • org.apache.hive:hive-service 2.3.0 test
  • org.hamcrest:hamcrest-all 1.3 test
  • org.mockito:mockito-core 1.10.19 test
  • org.powermock:powermock-api-mockito 1.6.4 test
  • org.powermock:powermock-module-junit4 1.6.4 test
shims/common/pom.xml maven
  • org.apache.hadoop:hadoop-common
  • org.apache.hive:hive-exec
shims/hive2-shims/pom.xml maven
  • com.amazon.emr:shims-common 4.17.0-SNAPSHOT
  • org.apache.hive:hive-exec ${hive2.version}
shims/loader/pom.xml maven
  • com.amazon.emr:hive1-shims 4.17.0-SNAPSHOT
  • com.amazon.emr:hive1.2-shims 4.17.0-SNAPSHOT
  • com.amazon.emr:hive2-shims 4.17.0-SNAPSHOT
  • com.amazon.emr:shims-common 4.17.0-SNAPSHOT
  • org.apache.hive:hive-exec ${hive.version}
shims/hive3-shims/pom.xml maven
  • com.amazon.emr:shims-common 4.17.0-SNAPSHOT
  • org.apache.hive:hive-exec ${hive3.version}
shims/pom.xml maven