https://github.com/awslabs/analytics-accelerator-s3

Analytics Accelerator Library for Amazon S3 is an open source library that accelerates data access from client applications to Amazon S3.

Science Score: 26.0%

This score indicates how likely this project is to be science-related based on various indicators:

○ CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
✓ .zenodo.json file: Found .zenodo.json file
○ DOI references
○ Academic publication links
○ Academic email domains
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (10.4%) to scientific vocabulary

Last synced: 9 months ago

Repository

Basic Info

Host: GitHub
Owner: awslabs
License: apache-2.0
Language: Java
Default Branch: main
Homepage:
Size: 8.93 MB

Statistics

Stars: 52
Watchers: 5
Forks: 14
Open Issues: 30
Releases: 8

Created about 2 years ago · Last pushed 10 months ago

Metadata Files

Readme Changelog License

Analytics Accelerator Library for Amazon S3

The Analytics Accelerator Library for Amazon S3 helps you accelerate access to Amazon S3 data from your applications. This open-source solution reduces processing times and compute costs for your data analytics workloads.

With this library, you can:

* Lower processing times and compute costs for data analytics workloads.
* Implement S3 best practices for performance.
* Utilize optimizations specific to Apache Parquet files, such as pre-fetching metadata located in the footer of the object and predictive column pre-fetching.
* Improve the price performance for your data analytics applications, such as workloads based on Apache Spark.

Current Status

The Analytics Accelerator Library for Amazon S3 has been tested and integrated with the Apache Hadoop S3A client; that integration will be released in Hadoop version 3.4.2. The integration with the Apache Iceberg S3FileIO client is released in Iceberg version 1.9.0. The library has also been tested with datasets stored in S3 Table Buckets.

We're constantly working on improving Analytics Accelerator Library for Amazon S3 and are interested in hearing your feedback on features, performance, and compatibility. Please send feedback by opening a GitHub issue.

Getting Started

You can use Analytics Accelerator Library for Amazon S3 either as a standalone library or with the Hadoop library and the Spark engine. With the Spark engine, it also supports Iceberg's open table format as well as S3 Table Buckets. Integrations with Hadoop and Iceberg are turned off by default. We describe how to enable Analytics Accelerator Library for Amazon S3 below.

Standalone Usage

To get started, import the library dependency from Maven into your project:

    <dependency>
      <groupId>software.amazon.s3.analyticsaccelerator</groupId>
      <artifactId>analyticsaccelerator-s3</artifactId>
      <version>1.3.0</version>
      <scope>compile</scope>
    </dependency>

Then, initialize the library's S3SeekableInputStreamFactory:

    S3AsyncClient crtClient = S3CrtAsyncClient.builder().maxConcurrency(600).build();
    S3SeekableInputStreamFactory s3SeekableInputStreamFactory =
        new S3SeekableInputStreamFactory(
            new S3SdkObjectClient(crtClient), S3SeekableInputStreamConfiguration.DEFAULT);

Note: The S3SeekableInputStreamFactory can be initialized with either the S3AsyncClient or the S3 CRT client. We recommend that you use the S3 CRT client due to its enhanced connection pool management and higher throughput on downloads. For either client, we recommend you initialize with a higher concurrency value to fully benefit from the library's optimizations. This is because the library makes multiple parallel requests to S3 to prefetch data asynchronously. For the Java S3AsyncClient, you can increase the maximum connections by doing the following:

```
NettyNioAsyncHttpClient.Builder httpClientBuilder =
    NettyNioAsyncHttpClient.builder().maxConcurrency(600);

S3AsyncClient s3AsyncClient =
    S3AsyncClient.builder().httpClientBuilder(httpClientBuilder).build();
```
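To illustrate why a higher concurrency limit helps, the following sketch splits one read into closed byte ranges and fetches them in parallel. It uses only the JDK: an in-memory byte array stands in for an S3 object, and the names `ParallelRangeSketch`, `splitRanges`, and `fetchAll` are hypothetical and not part of the library or the AWS SDK.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical illustration only: a byte[] stands in for an S3 object.
class ParallelRangeSketch {

  // Split [0, length) into closed ranges {start, endInclusive} of at most partSize bytes.
  static List<long[]> splitRanges(long length, long partSize) {
    List<long[]> ranges = new ArrayList<>();
    for (long start = 0; start < length; start += partSize) {
      long endInclusive = Math.min(start + partSize, length) - 1;
      ranges.add(new long[] {start, endInclusive});
    }
    return ranges;
  }

  // Fetch every range as its own task; maxConcurrency caps how many run at once.
  static byte[] fetchAll(byte[] object, long partSize, int maxConcurrency) {
    ExecutorService pool = Executors.newFixedThreadPool(maxConcurrency);
    try {
      byte[] result = new byte[object.length];
      List<CompletableFuture<Void>> futures = new ArrayList<>();
      for (long[] range : splitRanges(object.length, partSize)) {
        futures.add(CompletableFuture.runAsync(() -> {
          int start = (int) range[0];
          int length = (int) (range[1] - range[0] + 1);
          // Stand-in for one ranged GET request.
          System.arraycopy(object, start, result, start, length);
        }, pool));
      }
      // Wait for all simulated requests to finish.
      CompletableFuture.allOf(futures.toArray(new CompletableFuture[0])).join();
      return result;
    } finally {
      pool.shutdown();
    }
  }

  public static void main(String[] args) {
    byte[] object = new byte[10_000];
    for (int i = 0; i < object.length; i++) {
      object[i] = (byte) i;
    }
    byte[] copy = fetchAll(object, 1_024, 8);
    System.out.println("ranges fetched: " + splitRanges(object.length, 1_024).size());
    System.out.println("content matches: " + Arrays.equals(object, copy));
  }
}
```

With a small concurrency limit the ranges queue behind one another; a larger limit lets more of them be in flight at the same time, which is the effect the recommendation above is aiming for.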

To open a stream:

S3SeekableInputStream s3SeekableInputStream = s3SeekableInputStreamFactory.createStream(S3URI.of(bucket, key));

For more details on the usage of this stream, refer to the SeekableInputStream interface.

When the S3SeekableInputStreamFactory is no longer required to create new streams, close it to free the resources (e.g., caches for prefetched data) held by the factory.

s3SeekableInputStreamFactory.close();

Accessing SSE-C encrypted objects

To access objects encrypted with SSE-C (server-side encryption with customer-provided keys), set the customer key that was used to encrypt the object in an OpenStreamInformation object, and pass that object when creating the stream. The customer key must be base64 encoded.

```
OpenStreamInformation openStreamInformation =
    OpenStreamInformation.builder()
        .encryptionSecrets(
            EncryptionSecrets.builder()
                .sseCustomerKey(Optional.of(base64EncodedCustomerKey))
                .build())
        .build();

S3SeekableInputStream s3SeekableInputStream =
    s3SeekableInputStreamFactory.createStream(S3URI.of(bucket, key), openStreamInformation);
```
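As a minimal sketch of the base64 requirement, the following JDK-only example encodes a raw 256-bit key (the key size SSE-C uses with AES-256) into the string form that would be supplied as `base64EncodedCustomerKey`. The class and method names are hypothetical. The randomly generated key exists only so the example runs; in practice you would encode the key that was used to encrypt the object.

```java
import java.security.SecureRandom;
import java.util.Base64;

// Hypothetical helper: not part of the library.
class SseCustomerKeySketch {
  // SSE-C with AES-256 uses a 256-bit (32-byte) key.
  static final int KEY_BYTES = 32;

  // Base64-encode a raw 32-byte customer key.
  static String encode(byte[] rawKey) {
    if (rawKey.length != KEY_BYTES) {
      throw new IllegalArgumentException("expected a " + KEY_BYTES + "-byte key");
    }
    return Base64.getEncoder().encodeToString(rawKey);
  }

  // Generate a random key, for demonstration purposes only.
  static byte[] newRandomKey() {
    byte[] key = new byte[KEY_BYTES];
    new SecureRandom().nextBytes(key);
    return key;
  }

  public static void main(String[] args) {
    String base64EncodedCustomerKey = encode(newRandomKey());
    System.out.println("encoded key length: " + base64EncodedCustomerKey.length());
  }
}
```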

Using with Hadoop

If you are using Analytics Accelerator Library for Amazon S3 with Hadoop, you need to set the stream type to analytics in the Hadoop configuration. An example configuration is as follows:

    <property>
      <name>fs.s3a.input.stream.type</name>
      <value>analytics</value>
    </property>

For more information, see the Hadoop documentation on the Analytics stream type.

Using with Spark (without Iceberg)

Using the Analytics Accelerator Library for Amazon S3 with Spark is similar to the Hadoop usage. The only difference is that you prepend spark.hadoop. to the property names. For example, to enable Analytics Accelerator Library for Amazon S3 on the Spark engine, set the following property in the Spark configuration:

    <property>
      <name>spark.hadoop.fs.s3a.input.stream.type</name>
      <value>analytics</value>
    </property>

Using with Spark (with Iceberg)

When your data in S3 is organised in Iceberg open table format, data will be retrieved using Iceberg S3FileIO client instead of Hadoop S3A client. Analytics Accelerator Library for Amazon S3 is currently being integrated into Iceberg S3FileIO client. To enable it in the Spark engine, set the following Spark property.

    <property>
      <name>spark.sql.catalog.<CATALOG_NAME>.s3.analytics-accelerator.enabled</name>
      <value>true</value>
    </property>

For S3 General Purpose Buckets and S3 Directory Buckets, set <CATALOG_NAME> to spark_catalog, the default catalog.

S3 Table Buckets require you to set a custom catalog name, as outlined here. Once you set the catalog, you can replace the <CATALOG_NAME> parameter with your chosen name.
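The three property names used in the Hadoop, Spark, and Iceberg sections differ only in their prefixes. The sketch below shows how they relate; the property keys are the ones given above, while the class `AcceleratorConfigKeys`, its methods, and the catalog name `my_catalog` are hypothetical and used only for illustration.

```java
// Hypothetical helper that derives the property names described above.
class AcceleratorConfigKeys {
  // Hadoop S3A key; its value is set to "analytics" to enable the library.
  static final String HADOOP_STREAM_TYPE_KEY = "fs.s3a.input.stream.type";

  // Spark passes Hadoop properties through when they carry the spark.hadoop. prefix.
  static String sparkHadoopKey(String hadoopKey) {
    return "spark.hadoop." + hadoopKey;
  }

  // Iceberg S3FileIO key for a given catalog; its value is set to "true".
  static String icebergCatalogKey(String catalogName) {
    return "spark.sql.catalog." + catalogName + ".s3.analytics-accelerator.enabled";
  }

  public static void main(String[] args) {
    System.out.println(sparkHadoopKey(HADOOP_STREAM_TYPE_KEY) + "=analytics");
    // Default catalog, for General Purpose and Directory Buckets.
    System.out.println(icebergCatalogKey("spark_catalog") + "=true");
    // "my_catalog" is a placeholder for a custom catalog name (S3 Table Buckets).
    System.out.println(icebergCatalogKey("my_catalog") + "=true");
  }
}
```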

To learn more about how to set the rest of the configurations, read our configuration documents.

Summary of Optimizations

Analytics Accelerator Library for Amazon S3 accelerates read performance of objects stored in Amazon S3 by integrating the AWS Common Runtime (CRT) libraries and implementing optimizations specific to Apache Parquet files. The AWS CRT is a software library built for interacting with AWS services that implements best-practice performance design patterns, including timeouts, retries, and automatic request parallelization for high throughput. You can use S3SeekableInputStreamFactory to initialize streams for all file types and benefit from read optimizations on top of the benefits that come from the CRT.

These optimizations are:

* Sequential prefetching - The library detects sequential read patterns to prefetch data and reduce latency, and reads the full object when the object is small to minimize the number of read operations.
* Small object prefetching - The library will prefetch the object if the object size is less than 8MB.
* Closed Range requests - The library exclusively uses closed range requests when accessing S3, which is the recommended best practice for making requests to S3.
* Read Vectored support - The library provides built-in implementation of Read Vectored functionality, enabling efficient reading of multiple non-contiguous ranges of data in a single operation.
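The small-object and closed-range rules above can be sketched as follows. This is an illustration under stated assumptions, not the library's code: `ReadPlanSketch` and its methods are hypothetical, and "8MB" is read as 8 * 1024 * 1024 bytes. A closed range sets both ends of the HTTP Range header (`bytes=first-last`, both inclusive) rather than leaving the end open.

```java
// Hypothetical sketch of choosing the byte range for one read.
class ReadPlanSketch {
  // Assumption: the 8MB threshold is interpreted as 8 * 1024 * 1024 bytes.
  static final long SMALL_OBJECT_THRESHOLD = 8L * 1024 * 1024;

  // Build an HTTP Range header value with both ends set (a closed range).
  static String closedRange(long start, long endInclusive) {
    if (start < 0 || endInclusive < start) {
      throw new IllegalArgumentException("invalid range " + start + "-" + endInclusive);
    }
    return "bytes=" + start + "-" + endInclusive;
  }

  // Small objects are fetched whole; larger ones get exactly the requested range,
  // clamped to the end of the object.
  static String planRange(long objectSize, long position, long length) {
    if (objectSize <= 0 || position < 0 || position >= objectSize || length <= 0) {
      throw new IllegalArgumentException("read is outside the object");
    }
    if (objectSize < SMALL_OBJECT_THRESHOLD) {
      return closedRange(0, objectSize - 1);
    }
    long endInclusive = Math.min(position + length, objectSize) - 1;
    return closedRange(position, endInclusive);
  }

  public static void main(String[] args) {
    System.out.println(planRange(1_000, 500, 10));
    System.out.println(planRange(16L * 1024 * 1024, 100, 50));
  }
}
```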

When the object key ends with the file extension .parquet or .par, we use the following Apache Parquet specific optimizations:

* Parquet footer caching - The library reads the tail of the object with configurable size (1MB by default) as soon as a stream to a Parquet object is opened and caches it in memory. This is done to prevent multiple small GET requests that occur at the tail of the file for the Parquet metadata, pageIndex, and bloom filter structures.
* Predictive column prefetching - The library tracks recent columns being read using parquet metadata. When subsequent Parquet files which have these columns are opened, the library will prefetch these columns. For example, if columns x and y are read from A.parquet, and then B.parquet is opened, and it also contains columns named x and y, the library will prefetch them asynchronously.
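The two Parquet optimizations above can be sketched in a few lines. The example computes the tail range that would be read as the footer and keeps a set of recently read column names to decide what to prefetch in the next file. All names (`ParquetPrefetchSketch`, `footerRange`, `recordRead`, `columnsToPrefetch`) are hypothetical, and the real library derives column information from Parquet metadata rather than from plain strings.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of footer caching and predictive column prefetching.
class ParquetPrefetchSketch {
  // Assumption: the 1MB default is interpreted as 1024 * 1024 bytes.
  static final long DEFAULT_FOOTER_BYTES = 1024L * 1024;

  // Closed range {start, endInclusive} covering the last footerBytes of the object.
  static long[] footerRange(long objectSize, long footerBytes) {
    if (objectSize <= 0 || footerBytes <= 0) {
      throw new IllegalArgumentException("sizes must be positive");
    }
    long start = Math.max(0, objectSize - footerBytes);
    return new long[] {start, objectSize - 1};
  }

  // Names of columns read from previously opened files.
  private final Set<String> recentColumns = new LinkedHashSet<>();

  // Remember that a column was read.
  void recordRead(String column) {
    recentColumns.add(column);
  }

  // Columns of a newly opened file that were read recently are prefetch candidates.
  List<String> columnsToPrefetch(List<String> columnsInNewFile) {
    List<String> candidates = new ArrayList<>();
    for (String column : columnsInNewFile) {
      if (recentColumns.contains(column)) {
        candidates.add(column);
      }
    }
    return candidates;
  }

  public static void main(String[] args) {
    long[] tail = footerRange(10L * 1024 * 1024, DEFAULT_FOOTER_BYTES);
    System.out.println("footer range: " + tail[0] + "-" + tail[1]);

    ParquetPrefetchSketch tracker = new ParquetPrefetchSketch();
    tracker.recordRead("x"); // read from A.parquet
    tracker.recordRead("y"); // read from A.parquet
    // B.parquet is opened and contains x, y, and z.
    System.out.println("prefetch: " + tracker.columnsToPrefetch(List.of("x", "y", "z")));
  }
}
```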

When the object key ends with the file extension .csv, .json, or .txt, we use the following sequential format optimizations:

* Partition-aligned prefetching - The library implements proactive prefetching up to the configured partition size. The default partition size is 128MB, which can be modified by setting the partition.size configuration parameter. This optimization reduces the number of GET requests by fetching larger chunks of data in advance, resulting in improved read throughput for sequential access patterns. To disable prefetching, set use.format.specific.io to false.
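One possible reading of partition-aligned prefetching is sketched below: starting from the current read position, prefetch up to the end of the partition that contains it, clamped to the object size. This interpretation, the class `PartitionPrefetchSketch`, and the reading of "128MB" as 128 * 1024 * 1024 bytes are assumptions; the library's actual alignment logic may differ.

```java
// Hypothetical sketch of a partition-aligned prefetch range.
class PartitionPrefetchSketch {
  // Assumption: the 128MB default is interpreted as 128 * 1024 * 1024 bytes.
  static final long DEFAULT_PARTITION_SIZE = 128L * 1024 * 1024;

  // Closed range {start, endInclusive} from position to the end of its partition.
  static long[] prefetchRange(long position, long objectSize, long partitionSize) {
    if (partitionSize <= 0 || position < 0 || position >= objectSize) {
      throw new IllegalArgumentException("invalid position or partition size");
    }
    // Exclusive end of the partition that contains position.
    long partitionEnd = (position / partitionSize + 1) * partitionSize;
    long endInclusive = Math.min(partitionEnd, objectSize) - 1;
    return new long[] {position, endInclusive};
  }

  public static void main(String[] args) {
    long objectSize = 300L * 1024 * 1024;
    long[] first = prefetchRange(0, objectSize, DEFAULT_PARTITION_SIZE);
    System.out.println("prefetch " + first[0] + "-" + first[1]);
  }
}
```

Under this reading, a sequential scan of a 300MB object would need three large requests (128MB, 128MB, and the remaining 44MB) instead of many small ones.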

Memory Used by Library

Analytics Accelerator Library for Amazon S3 implements a best-effort memory limiting mechanism. The library fetches data from S3 in blocks of bytes and keeps them in memory. Memory management combines two strategies: a time-to-live (TTL) and a maximum memory threshold. When a block outlives the TTL, it is selected for removal by time-based eviction; when memory usage exceeds the configured threshold, the blocks to remove are selected by the Window TinyLFU algorithm. Both are implemented by the Caffeine library. Removal is done by an asynchronous process that runs at configured intervals, so memory usage might temporarily exceed the threshold. This overflow period can be minimized by increasing the cleanup frequency, though at the cost of higher CPU utilization.

You can change the TTL, memory usage threshold, and cleanup frequency as follows. Note: only positive values are allowed for these configurations.

* The memory limit is set with the key max.memory.limit, which defaults to 2GB. Take your workload and system resources into consideration when configuring this value. For example, for Parquet workloads, consider factors such as row group size and the number of vCPUs on executors.
* The cache data timeout is set with the key cache.timeout, which defaults to 1s.
* The cleanup frequency is set with the key memory.cleanup.frequency, which defaults to 5s.

To learn more about how to set the configurations, read our configuration documents.
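The dual strategy can be illustrated with a simplified, JDK-only sketch. It is not the library's implementation: the library relies on Caffeine and Window TinyLFU, whereas this example substitutes plain least-recently-used eviction, assumes the TTL is measured from the time a block is written, and uses hypothetical names throughout. It does show the behaviour described above: memory usage can exceed the limit until the next cleanup pass runs.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.LongSupplier;

// Hypothetical sketch: TTL plus memory limit, enforced by a periodic cleanup pass.
// LRU is used here as a simple stand-in for Window TinyLFU.
class BlockCacheSketch {
  private static final class Block {
    final byte[] data;
    final long writtenAt;

    Block(byte[] data, long writtenAt) {
      this.data = data;
      this.writtenAt = writtenAt;
    }
  }

  // Access-ordered map: iteration starts at the least recently used block.
  private final LinkedHashMap<String, Block> blocks = new LinkedHashMap<>(16, 0.75f, true);
  private final long maxBytes;
  private final long ttl;
  private final LongSupplier clock; // injectable so the example is testable
  private long usedBytes;

  BlockCacheSketch(long maxBytes, long ttl, LongSupplier clock) {
    this.maxBytes = maxBytes;
    this.ttl = ttl;
    this.clock = clock;
  }

  // Inserts never evict; the limit is only enforced by cleanup().
  synchronized void put(String key, byte[] data) {
    Block old = blocks.put(key, new Block(data, clock.getAsLong()));
    if (old != null) {
      usedBytes -= old.data.length;
    }
    usedBytes += data.length;
  }

  synchronized byte[] get(String key) {
    Block block = blocks.get(key);
    return block == null ? null : block.data;
  }

  synchronized long usedBytes() {
    return usedBytes;
  }

  // Intended to run at a fixed interval, e.g. from a ScheduledExecutorService.
  synchronized void cleanup() {
    long now = clock.getAsLong();
    // Pass 1: drop blocks that have outlived the TTL.
    Iterator<Map.Entry<String, Block>> it = blocks.entrySet().iterator();
    while (it.hasNext()) {
      Block block = it.next().getValue();
      if (now - block.writtenAt > ttl) {
        usedBytes -= block.data.length;
        it.remove();
      }
    }
    // Pass 2: evict least recently used blocks until usage is within the limit.
    it = blocks.entrySet().iterator();
    while (usedBytes > maxBytes && it.hasNext()) {
      usedBytes -= it.next().getValue().data.length;
      it.remove();
    }
  }

  public static void main(String[] args) {
    BlockCacheSketch cache = new BlockCacheSketch(100, 1_000, System::currentTimeMillis);
    cache.put("a", new byte[60]);
    cache.put("b", new byte[60]);
    System.out.println("before cleanup: " + cache.usedBytes());
    cache.cleanup();
    System.out.println("after cleanup: " + cache.usedBytes());
  }
}
```

Running cleanup more often shortens the window in which usage exceeds the limit, at the price of more frequent passes over the cache, which mirrors the trade-off between cleanup frequency and CPU utilization noted above.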

User Agent

We prepend the user agent prefixes from both USER_AGENT_PREFIX_KEY (set in ObjectClientConfiguration) and USER_AGENT_PREFIX (set in the S3AsyncClient configuration) to the s3analyticsaccelerator user agent. CRT clients currently set no value for USER_AGENT_PREFIX, so if you need a custom user agent, pass it in the ObjectClientConfiguration.

Benchmark Results

Benchmarking Results -- November 25, 2024

The current benchmarking results are provided for reference only. It is important to note that the performance of these queries can be affected by a variety of factors, including compute and storage variability, cluster configuration, and compute choice. All of the results presented have a margin of error of up to 3%.

To establish the performance impact of changes, we rely on a benchmark derived from the industry-standard TPC-DS benchmark at a 3 TB scale. It is important to note that our TPC-DS derived benchmark results are not directly comparable with official TPC-DS benchmark results. We also found that the sizing of Apache Parquet files and the partitioning of the dataset have a substantive impact on workload performance. As a result, we have created several versions of the test dataset, with a focus on different object sizes, ranging from single-digit MiBs to tens of GiBs, as well as various partitioning approaches.

On S3A, we have observed a total suite execution acceleration between 10% and 27%, with some queries showing a speed-up of up to 40%.

Contributions

We welcome contributions to Analytics Accelerator Library for Amazon S3! See the contributing guidelines for more information on how to report bugs, build from source code, or submit pull requests.

Security

If you discover a potential security issue in this project we ask that you notify Amazon Web Services (AWS) Security via our vulnerability reporting page. Do not create a public GitHub issue.

License

Analytics Accelerator Library for Amazon S3 is licensed under the Apache-2.0 license. The pull request template will ask you to confirm the licensing of your contribution and to agree to the Developer Certificate of Origin (DCO).

Owner

Name: Amazon Web Services - Labs
Login: awslabs
Kind: organization
Location: Seattle, WA

Website: http://amazon.com/aws/
Repositories: 914
Profile: https://github.com/awslabs

AWS Labs

GitHub Events

Total

Create event: 38
Commit comment event: 3
Release event: 6
Delete event: 18
Member event: 6
Pull request event: 227
Fork event: 14
Issues event: 18
Watch event: 47
Issue comment event: 51
Push event: 141
Public event: 1
Pull request review comment event: 646
Pull request review event: 641

Last Year

Create event: 38
Commit comment event: 3
Release event: 6
Delete event: 18
Member event: 6
Pull request event: 227
Fork event: 14
Issues event: 18
Watch event: 47
Issue comment event: 51
Push event: 141
Public event: 1
Pull request review comment event: 646
Pull request review event: 641

Issues and Pull Requests

Last synced: 10 months ago

All Time

Total issues: 9
Total pull requests: 122
Average time to close issues: 2 months
Average time to close pull requests: 22 days
Total issue authors: 4
Total pull request authors: 11
Average comments per issue: 0.56
Average comments per pull request: 0.26
Merged pull requests: 69
Bot issues: 0
Bot pull requests: 11

Past Year

Issues: 9
Pull requests: 121
Average time to close issues: 2 months
Average time to close pull requests: 20 days
Issue authors: 4
Pull request authors: 11
Average comments per issue: 0.56
Average comments per pull request: 0.26
Merged pull requests: 69
Bot issues: 0
Bot pull requests: 11

Top Authors

Issue Authors

oleg-lvovitch-aws (7)
stubz151 (4)
ahmarsuhail (2)
fuatbasik (2)
dependabot[bot] (1)
Neuw84 (1)

Pull Request Authors

ahmarsuhail (29)
ozkoca (19)
dependabot[bot] (16)
fuatbasik (16)
rajdchak (13)
SanjayMarreddi (12)
vaibhav5140 (9)
stubz151 (6)
CsengerG (4)
sullis (2)
petergalati (2)
oleg-lvovitch-aws (1)

Top Labels

Issue Labels

bug (8) enhancement (7) parquet (4) build (3) CICD (3) compatibility (2) performance (2) github_actions (1) minor (1) microbenchmarking (1) dependencies (1) java (1) good first issue (1) physical IO (1) logical IO (1)

Pull Request Labels

dependencies (16) java (11) github_actions (5) minor (2)

Packages

Total packages: 1
Total downloads: unknown

Total dependent packages: 0
Total dependent repositories: 0
Total versions: 8

repo1.maven.org: software.amazon.s3.analyticsaccelerator:analyticsaccelerator-s3

S3 Analytics Accelerator Library for Amazon S3

Homepage: https://github.com/awslabs/analytics-accelerator-s3
Documentation: https://appdoc.app/artifact/software.amazon.s3.analyticsaccelerator/analyticsaccelerator-s3/
License: The Apache License, Version 2.0
Latest release: 1.2.1
published 12 months ago

Versions: 8
Dependent Packages: 0
Dependent Repositories: 0

Rankings

Dependent repos count: 34.0%

Average: 41.3%

Dependent packages count: 48.6%

Last synced: 10 months ago

Dependencies

.github/workflows/build-upload.yml actions

actions/checkout v4 composite
actions/download-artifact v4 composite
actions/setup-java v4 composite
actions/upload-artifact v4 composite
aws-actions/configure-aws-credentials v4.0.2 composite
gradle/actions/setup-gradle 417ae3ccd767c252f5661f1ace9f835f9654f2b5 composite
webfactory/ssh-agent v0.9.0 composite

.github/workflows/gradle-integration-test.yml actions

actions/checkout v4 composite
actions/setup-java v4 composite
aws-actions/configure-aws-credentials v4.0.2 composite
gradle/actions/setup-gradle v3 composite

.github/workflows/gradle-reference-test.yml actions

actions/checkout v4 composite
actions/setup-java v4 composite
gradle/actions/setup-gradle v3 composite

.github/workflows/gradle.yml actions

actions/checkout v4 composite
actions/setup-java v4 composite
gradle/actions/setup-gradle v3 composite

buildSrc/build.gradle.kts maven

common/build.gradle.kts maven

input-stream/build.gradle.kts maven

object-client/build.gradle.kts maven
