https://github.com/awslabs/emr-dynamodb-connector
Implementations of open source Apache Hadoop/Hive interfaces which allow for ingesting data from Amazon DynamoDB
Science Score: 36.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file: Found codemeta.json file
- ✓ .zenodo.json file: Found .zenodo.json file
- ○ DOI references
- ○ Academic publication links
- ✓ Committers with academic emails: 1 of 42 committers (2.4%) from academic institutions
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: Low similarity (11.8%) to scientific vocabulary
Repository
Statistics
- Stars: 227
- Watchers: 52
- Forks: 139
- Open Issues: 59
- Releases: 3
Metadata Files
README.md
emr-dynamodb-connector
Access data stored in Amazon DynamoDB with Apache Hadoop, Apache Hive, and Apache Spark
Introduction
You can use this connector to access data in Amazon DynamoDB using Apache Hadoop, Apache Hive, and Apache Spark in Amazon EMR. You can process data directly in DynamoDB using these frameworks, or join data in DynamoDB with data in Amazon S3, Amazon RDS, or other storage layers that can be accessed by Amazon EMR.
- Using Apache Hive in Amazon EMR with Amazon DynamoDB
- Accessing data in Amazon DynamoDB with Apache Spark
- Connecting to DynamoDB with Amazon EMR Serverless
Currently, the connector supports the following data types:
| Hive type | Default DynamoDB type | Alternate DynamoDB type(s) |
| --- | --- | --- |
| string | string (S) | |
| bigint or double | number (N) | |
| binary | binary (B) | |
| boolean | boolean (BOOL) | |
| array | list (L) | number set (NS), string set (SS), binary set (BS) |
| map
The connector can serialize null values as DynamoDB null type (NULL).
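As a rough illustration, the fully visible rows of the table above can be written as a small lookup from a Hive base type name to its default DynamoDB type abbreviation. This is only a sketch: the class `DefaultTypeMapping` and its method are hypothetical names, not part of the connector, and rows that are not fully shown above are left out.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

/** Illustrative lookup built from the table rows above; not part of the connector. */
final class DefaultTypeMapping {

    // Hive base type name -> default DynamoDB type abbreviation, as listed in the table.
    private static final Map<String, String> DEFAULTS = new LinkedHashMap<>();

    static {
        DEFAULTS.put("string", "S");
        DEFAULTS.put("bigint", "N");
        DEFAULTS.put("double", "N");
        DEFAULTS.put("binary", "B");
        DEFAULTS.put("boolean", "BOOL");
        DEFAULTS.put("array", "L");
    }

    /** Returns the default DynamoDB type for a Hive base type, or throws if it is not listed. */
    static String defaultDynamoDbType(String hiveType) {
        String ddbType = DEFAULTS.get(hiveType.toLowerCase(Locale.ROOT));
        if (ddbType == null) {
            throw new IllegalArgumentException("No default mapping listed for Hive type: " + hiveType);
        }
        return ddbType;
    }

    public static void main(String[] args) {
        // Print every listed mapping, one per line.
        for (String hiveType : DEFAULTS.keySet()) {
            System.out.println(hiveType + " -> " + defaultDynamoDbType(hiveType));
        }
    }
}
```

The lookup expects a base type name such as `array`, not a parameterized Hive type such as `array<string>`.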
Hive StorageHandler Implementation
For more information, see Hive Commands Examples for Exporting, Importing, and Querying Data in DynamoDB in the Amazon DynamoDB Developer Guide.
Hadoop InputFormat and OutputFormat Implementation
Implementations of the Apache Hadoop InputFormat and OutputFormat interfaces are included, which allow DynamoDB AttributeValues to be directly ingested by MapReduce jobs. For an example of how to use these classes, see Set Up a Hive Table to Run Hive Commands in the Amazon EMR Release Guide, as well as their usage in the Import/Export tool classes in DynamoDBExport.java and DynamoDBImport.java.
Import/Export Tool
This simple tool, which makes use of the InputFormat and OutputFormat implementations, provides an easy way to import data to and export data from DynamoDB.
Supported Versions
Currently the project builds against Hive 1.0.0, 1.2.1, and 2.3.0. Set these by using the hive1.version,
hive1.2.version, and hive2.version properties in the root Maven pom.xml, respectively.
How to Build
After cloning, run mvn clean install.
Example: Hive StorageHandler
Syntax to create a table using the DynamoDBStorageHandler class:
CREATE EXTERNAL TABLE hive_tablename (
hive_column1_name column1_datatype,
hive_column2_name column2_datatype
)
STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
TBLPROPERTIES (
"dynamodb.table.name" = "dynamodb_tablename",
"dynamodb.column.mapping" =
"hive_column1_name:dynamodb_attribute1_name,hive_column2_name:dynamodb_attribute2_name",
"dynamodb.type.mapping" =
"hive_column1_name:dynamodb_attribute1_type_abbreviation",
"dynamodb.null.serialization" = "true"
);
dynamodb.type.mapping and dynamodb.null.serialization are optional parameters.
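Both mapping properties take a comma-separated list of `name:value` pairs, as in the syntax above. The following is a minimal sketch of assembling those strings programmatically. The class `MappingStrings` and the column and attribute names are hypothetical, chosen only for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

/** Illustrative helper (not part of the connector) for building mapping property values. */
final class MappingStrings {

    /** Joins entries as "key:value" pairs separated by commas, preserving insertion order. */
    static String join(Map<String, String> mapping) {
        return mapping.entrySet().stream()
                .map(e -> e.getKey() + ":" + e.getValue())
                .collect(Collectors.joining(","));
    }

    public static void main(String[] args) {
        // Hypothetical Hive column -> DynamoDB attribute names.
        Map<String, String> columns = new LinkedHashMap<>();
        columns.put("customer_id", "CustomerId");
        columns.put("order_total", "OrderTotal");

        // Hypothetical Hive column -> DynamoDB type abbreviation (optional property).
        Map<String, String> types = new LinkedHashMap<>();
        types.put("order_total", "N");

        // Print the two TBLPROPERTIES entries in the form used by the CREATE TABLE syntax.
        System.out.println("\"dynamodb.column.mapping\" = \"" + join(columns) + "\"");
        System.out.println("\"dynamodb.type.mapping\" = \"" + join(types) + "\"");
    }
}
```

A `LinkedHashMap` is used so that the generated string lists the columns in the same order they were added, which keeps the output easy to compare against the Hive column list.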
A Hive query automatically chooses the most suitable secondary index, if one exists, based on the
search condition. To be eligible, an index must have the following properties:
1. All of its index keys appear in the Hive query's search condition.
2. It contains all the DynamoDB attributes mentioned in dynamodb.column.mapping. (If your Hive table
maps more columns than the index has attributes, but you still want to use an index when
running queries that select only the attributes within that index, consider creating another
Hive table whose mappings include only the index attributes. Use that table for
reading the index attributes to reduce table scans.)
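The two conditions above can be modeled as a pair of set-containment checks. This sketch only illustrates the stated rules and is not the connector's actual selection logic. The class `IndexEligibility`, its parameters, and the attribute names are all hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Illustrative model of the two eligibility conditions; not the connector's implementation. */
final class IndexEligibility {

    /**
     * An index is treated as eligible when (1) every index key appears in the query's
     * search condition and (2) the index contains every attribute in the column mapping.
     */
    static boolean isEligible(Set<String> indexKeys,
                              Set<String> indexAttributes,
                              Set<String> searchConditionAttributes,
                              Set<String> mappedAttributes) {
        boolean keysCovered = searchConditionAttributes.containsAll(indexKeys);
        boolean mappingCovered = indexAttributes.containsAll(mappedAttributes);
        return keysCovered && mappingCovered;
    }

    /** Small convenience for building sets in the examples. */
    static Set<String> setOf(String... values) {
        return new HashSet<>(Arrays.asList(values));
    }

    public static void main(String[] args) {
        // Hypothetical index keyed on "Status" that projects "Status" and "OrderId".
        Set<String> indexKeys = setOf("Status");
        Set<String> indexAttributes = setOf("Status", "OrderId");

        // A narrow table mapping only the index attributes satisfies both conditions.
        System.out.println(isEligible(indexKeys, indexAttributes,
                setOf("Status"), setOf("Status", "OrderId")));

        // A wider mapping that also includes "Total" fails the second condition.
        System.out.println(isEligible(indexKeys, indexAttributes,
                setOf("Status"), setOf("Status", "OrderId", "Total")));
    }
}
```

The second call shows why a narrower Hive table can help: once the mapping includes an attribute the index does not contain, the index no longer qualifies under condition 2.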
Example: Input/Output Formats with Spark
Using the DynamoDBInputFormat and DynamoDBOutputFormat classes with spark-shell:
```
$ spark-shell --jars /usr/share/aws/emr/ddb/lib/emr-ddb-hadoop.jar
...
import org.apache.hadoop.io.Text;
import org.apache.hadoop.dynamodb.DynamoDBItemWritable
import org.apache.hadoop.dynamodb.read.DynamoDBInputFormat
import org.apache.hadoop.dynamodb.write.DynamoDBOutputFormat
import org.apache.hadoop.mapred.JobConf
import org.apache.hadoop.io.LongWritable

var jobConf = new JobConf(sc.hadoopConfiguration)
jobConf.set("dynamodb.input.tableName", "myDynamoDBTable")

jobConf.set("mapred.output.format.class", "org.apache.hadoop.dynamodb.write.DynamoDBOutputFormat")
jobConf.set("mapred.input.format.class", "org.apache.hadoop.dynamodb.read.DynamoDBInputFormat")

var orders = sc.hadoopRDD(jobConf, classOf[DynamoDBInputFormat], classOf[Text], classOf[DynamoDBItemWritable])

orders.count()
```
Example: Import/Export Tool
Export usage
java -cp target/emr-dynamodb-tools-4.2.0-SNAPSHOT.jar org.apache.hadoop.dynamodb.tools.DynamoDBExport /where/output/should/go my-dynamo-table-name
Import usage
java -cp target/emr-dynamodb-tools-4.2.0-SNAPSHOT.jar org.apache.hadoop.dynamodb.tools.DynamoDBImport /where/input/data/is my-dynamo-table-name
Additional options
```
export
read-ratio: maximum percent of the specified DynamoDB table's read capacity to use for export
total-segments: number of desired MapReduce splits to use for the export
```
```
import
write-ratio: maximum percent of the specified DynamoDB table's write capacity to use for import
```
Maven Dependency
To depend on the specific components in your projects, add one (or both) of the following to your
pom.xml.
Hadoop InputFormat/OutputFormats & DynamoDBItemWritable
<dependency>
<groupId>com.amazon.emr</groupId>
<artifactId>emr-dynamodb-hadoop</artifactId>
<version>4.2.0</version>
</dependency>
Hive SerDes & StorageHandler
<dependency>
<groupId>com.amazon.emr</groupId>
<artifactId>emr-dynamodb-hive</artifactId>
<version>4.2.0</version>
</dependency>
Contributing
If you find a bug or would like to see an improvement, open an issue.
Check first to make sure there isn't one already open. We'll do our best to respond to issues and review pull requests.
Want to fix it yourself? Open a pull request!
If adding new functionality, include new, passing unit tests, as well as documentation. Also include a snippet in your pull request showing that all current unit tests pass. Tests are run by default when invoking any Maven goal that results in the package goal being executed (mvn clean install will run them and produce output showing such).
Follow the Google Java Style Guide
Style is enforced at build time using the Apache Maven Checkstyle Plugin.
Owner
- Name: Amazon Web Services - Labs
- Login: awslabs
- Kind: organization
- Location: Seattle, WA
- Website: http://amazon.com/aws/
- Repositories: 914
- Profile: https://github.com/awslabs
AWS Labs
GitHub Events
Total
- Create event: 4
- Issues event: 1
- Release event: 1
- Watch event: 9
- Delete event: 1
- Member event: 1
- Issue comment event: 12
- Push event: 10
- Pull request review event: 4
- Pull request event: 23
- Fork event: 2
Last Year
- Create event: 4
- Issues event: 1
- Release event: 1
- Watch event: 9
- Delete event: 1
- Member event: 1
- Issue comment event: 12
- Push event: 10
- Pull request review event: 4
- Pull request event: 23
- Fork event: 2
Committers
Last synced: over 2 years ago
Top Committers
| Name | Email | Commits |
|---|---|---|
| Michele Miao | m****o@a****m | 9 |
| Kevin Zhao | k****o@a****m | 6 |
| Mike Grimes | g****i@a****m | 6 |
| Kevin Zhao | 1****o | 5 |
| Mike Grimes | m****s@m****u | 5 |
| norvellj | n****j | 4 |
| dependabot[bot] | 4****] | 4 |
| foscraig | f****g@a****m | 4 |
| Yuanhao | y****u@a****m | 3 |
| Aki Tanaka | t****h@a****m | 3 |
| Sam Garrett | s****1 | 2 |
| mimaomao | 1****o | 2 |
| Junyang Li | j****l@a****m | 2 |
| James Norvell | n****j@a****m | 2 |
| TAK LON WU | w****n@a****m | 2 |
| z-york | z****0@g****m | 2 |
| Hernan Vivani | v****h@a****m | 1 |
| Hernan Vivani | v****h@d****m | 1 |
| Derick Anderson | e****s@g****m | 1 |
| Illya Yalovyy | y****i@a****m | 1 |
| Daniel Haviv | d****v@f****m | 1 |
| Martin Dam | m****m@g****m | 1 |
| Ajay Jadhav | j****b@a****m | 1 |
| Robin Tang | r****s@g****m | 1 |
| Seth Fitzsimmons | s****h@m****t | 1 |
| Rajasekaran | r****y@f****m | 1 |
| dauphinwill | s****g@g****m | 1 |
| Rahil Chertara | r****r@a****m | 1 |
| ruankd | r****d@g****m | 1 |
| Ian Carpenter | s****n@x****g | 1 |
| and 12 more... | | |
Issues and Pull Requests
Last synced: 11 months ago
All Time
- Total issues: 66
- Total pull requests: 74
- Average time to close issues: 7 months
- Average time to close pull requests: 5 months
- Total issue authors: 57
- Total pull request authors: 28
- Average comments per issue: 1.86
- Average comments per pull request: 0.66
- Merged pull requests: 52
- Bot issues: 0
- Bot pull requests: 8
Past Year
- Issues: 2
- Pull requests: 22
- Average time to close issues: N/A
- Average time to close pull requests: 2 days
- Issue authors: 2
- Pull request authors: 6
- Average comments per issue: 0.0
- Average comments per pull request: 0.32
- Merged pull requests: 18
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- sivankumar86 (5)
- anmsf (2)
- anandhs (2)
- billonahill (2)
- fkhantsi (2)
- railsmith (2)
- 0xCafeDude (1)
- dlylitsuka (1)
- derekmceachern (1)
- huanchh (1)
- kaichen2000 (1)
- gdoron (1)
- syudb (1)
- aajisaka (1)
- ganeshashree (1)
Pull Request Authors
- nickherzig (19)
- michqm (8)
- dependabot[bot] (8)
- kevnzhao (7)
- srbgupta86 (6)
- Edmynx (5)
- julienrf (4)
- gguifelixamz (4)
- foscraig (4)
- ejeffrli (4)
- smadurawe-oss (4)
- chloeyamzn (3)
- mimaomao (2)
- phoenixphoebus (2)
- luyuanhao (2)
Packages
- Total packages: 11
- Total downloads: unknown
- Total docker downloads: 1,314,385
- Total dependent packages: 17 (may contain duplicates)
- Total dependent repositories: 24 (may contain duplicates)
- Total versions: 103
repo1.maven.org: com.amazon.emr:emr-dynamodb-hadoop
EMR DynamoDB Hadoop Connector
- Homepage: https://github.com/awslabs/emr-dynamodb-connector
- Documentation: https://appdoc.app/artifact/com.amazon.emr/emr-dynamodb-hadoop/
- License: Apache License, Version 2.0
- Latest release: 5.6.0 (published over 1 year ago)
repo1.maven.org: com.amazon.emr:shims-common
Common shim interface
- Homepage: https://github.com/awslabs/emr-dynamodb-connector
- Documentation: https://appdoc.app/artifact/com.amazon.emr/shims-common/
- License: Apache License, Version 2.0
- Latest release: 5.6.0 (published over 1 year ago)
repo1.maven.org: com.amazon.emr:hive1-shims
Shims for Hive-1.x compatibility
- Homepage: https://github.com/awslabs/emr-dynamodb-connector
- Documentation: https://appdoc.app/artifact/com.amazon.emr/hive1-shims/
- License: Apache License, Version 2.0
- Latest release: 4.16.0 (published almost 5 years ago)
repo1.maven.org: com.amazon.emr:hive2-shims
Shims for Hive-2.x compatibility
- Homepage: https://github.com/awslabs/emr-dynamodb-connector
- Documentation: https://appdoc.app/artifact/com.amazon.emr/hive2-shims/
- License: Apache License, Version 2.0
- Latest release: 5.6.0 (published over 1 year ago)
repo1.maven.org: com.amazon.emr:shims-loader
Loader for the EMRDynamoDBShims classes
- Homepage: https://github.com/awslabs/emr-dynamodb-connector
- Documentation: https://appdoc.app/artifact/com.amazon.emr/shims-loader/
- License: Apache License, Version 2.0
- Latest release: 5.6.0 (published over 1 year ago)
repo1.maven.org: com.amazon.emr:hive1.2-shims
Shims for Hive-1.2.x compatibility
- Homepage: https://github.com/awslabs/emr-dynamodb-connector
- Documentation: https://appdoc.app/artifact/com.amazon.emr/hive1.2-shims/
- License: Apache License, Version 2.0
- Latest release: 4.16.0 (published almost 5 years ago)
repo1.maven.org: com.amazon.emr:emr-dynamodb-tools
EMR DynamoDB Import/Export Tools
- Homepage: https://github.com/awslabs/emr-dynamodb-connector
- Documentation: https://appdoc.app/artifact/com.amazon.emr/emr-dynamodb-tools/
- License: Apache License, Version 2.0
- Latest release: 5.6.0 (published over 1 year ago)
repo1.maven.org: com.amazon.emr:emr-dynamodb-connector
EMR DynamoDB Connector
- Homepage: https://github.com/awslabs/emr-dynamodb-connector
- Documentation: https://appdoc.app/artifact/com.amazon.emr/emr-dynamodb-connector/
- License: Apache License, Version 2.0
- Latest release: 5.6.0 (published over 1 year ago)
repo1.maven.org: com.amazon.emr:shims
Shims for Hive 1.0/1.2/2.0 compatibility
- Homepage: https://github.com/awslabs/emr-dynamodb-connector
- Documentation: https://appdoc.app/artifact/com.amazon.emr/shims/
- License: Apache License, Version 2.0
- Latest release: 5.6.0 (published over 1 year ago)
repo1.maven.org: com.amazon.emr:emr-dynamodb-hive
EMR DynamoDB Hive Connector
- Homepage: https://github.com/awslabs/emr-dynamodb-connector
- Documentation: https://appdoc.app/artifact/com.amazon.emr/emr-dynamodb-hive/
- License: Apache License, Version 2.0
- Latest release: 5.6.0 (published over 1 year ago)
repo1.maven.org: com.amazon.emr:hive3-shims
Shims for Hive-3.x compatibility
- Homepage: https://github.com/awslabs/emr-dynamodb-connector
- Documentation: https://appdoc.app/artifact/com.amazon.emr/hive3-shims/
- License: Apache License, Version 2.0
- Latest release: 5.6.0 (published over 1 year ago)
Dependencies
- com.amazonaws:aws-java-sdk-dynamodb
- com.fasterxml.jackson.core:jackson-databind ${jackson-databind.version}
- com.google.code.gson:gson
- joda-time:joda-time ${joda-time.version}
- junit:junit
- org.apache.hadoop:hadoop-common
- org.apache.hadoop:hadoop-mapreduce-client-app
- org.mockito:mockito-core
- org.powermock:powermock-api-mockito
- org.powermock:powermock-module-junit4
- com.amazon.emr:emr-dynamodb-hadoop 4.17.0-SNAPSHOT provided
- com.amazon.emr:shims-common 4.17.0-SNAPSHOT
- com.amazon.emr:shims-loader 4.17.0-SNAPSHOT
- org.apache.hadoop:hadoop-common ${hadoop.version}
- org.apache.hadoop:hadoop-hdfs
- org.apache.hadoop:hadoop-mapreduce-client-jobclient
- org.apache.hive:hive-exec
- org.apache.hive:hive-metastore
- org.apache.hive:hive-service
- com.amazon.emr:emr-dynamodb-hadoop 4.17.0-SNAPSHOT test
- org.apache.hadoop:hadoop-mapreduce-client-hs ${hadoop.version} test
- org.apache.hadoop:hadoop-yarn-server-tests ${hadoop.version} test
- com.amazon.emr:emr-dynamodb-hadoop 4.17.0-SNAPSHOT
- com.google.guava:guava
- junit:junit
- org.apache.hadoop:hadoop-common
- org.apache.hadoop:hadoop-mapreduce-client-core
- org.hamcrest:hamcrest-all
- org.powermock:powermock-api-mockito
- org.apache.hadoop:hadoop-common 2.7.3 provided
- org.apache.hadoop:hadoop-mapreduce-client-app 2.7.3 provided
- org.apache.hadoop:hadoop-mapreduce-client-core 2.7.3 provided
- org.apache.hive:hive-exec 2.3.0 provided
- org.apache.hive:hive-metastore 2.3.0 provided
- org.apache.hive:hive-service 2.3.0 provided
- com.amazonaws:aws-java-sdk-dynamodb 1.11.475
- com.google.code.gson:gson 2.1
- com.google.guava:guava 24.1.1-jre
- pl.project13.maven:git-commit-id-plugin 2.2.4
- junit:junit 4.13.1 test
- org.apache.hadoop:hadoop-hdfs 2.7.3 test
- org.apache.hadoop:hadoop-mapreduce-client-jobclient 2.7.3 test
- org.apache.hive:hive-service 2.3.0 test
- org.hamcrest:hamcrest-all 1.3 test
- org.mockito:mockito-core 1.10.19 test
- org.powermock:powermock-api-mockito 1.6.4 test
- org.powermock:powermock-module-junit4 1.6.4 test
- org.apache.hadoop:hadoop-common
- org.apache.hive:hive-exec
- com.amazon.emr:shims-common 4.17.0-SNAPSHOT
- org.apache.hive:hive-exec ${hive2.version}
- com.amazon.emr:hive1-shims 4.17.0-SNAPSHOT
- com.amazon.emr:hive1.2-shims 4.17.0-SNAPSHOT
- com.amazon.emr:hive2-shims 4.17.0-SNAPSHOT
- com.amazon.emr:shims-common 4.17.0-SNAPSHOT
- org.apache.hive:hive-exec ${hive.version}
- com.amazon.emr:shims-common 4.17.0-SNAPSHOT
- org.apache.hive:hive-exec ${hive3.version}