https://github.com/awslabs/analytics-accelerator-s3
Analytics Accelerator Library for Amazon S3 is an open source library that accelerates data access from client applications to Amazon S3.
Repository
Statistics
- Stars: 52
- Watchers: 5
- Forks: 14
- Open Issues: 30
- Releases: 8
Metadata Files
README.md
Analytics Accelerator Library for Amazon S3
The Analytics Accelerator Library for Amazon S3 helps you accelerate access to Amazon S3 data from your applications. This open-source solution reduces processing times and compute costs for your data analytics workloads.
With this library, you can:
* Lower processing times and compute costs for data analytics workloads.
* Implement S3 best practices for performance.
* Utilize optimizations specific to Apache Parquet files, such as pre-fetching metadata located in the footer of the object and predictive column pre-fetching.
* Improve the price performance for your data analytics applications, such as workloads based on Apache Spark.
Current Status
The Analytics Accelerator Library for Amazon S3 has been tested and integrated with the Apache Hadoop S3A client; this integration will be released in Hadoop version 3.4.2. The integration with the Apache Iceberg S3FileIO client is released in Iceberg version 1.9.0. The library has also been tested with datasets stored in S3 Table Buckets.
We're constantly working on improving Analytics Accelerator Library for Amazon S3 and are interested in hearing your feedback on features, performance, and compatibility. Please send feedback by opening a GitHub issue.
Getting Started
You can use Analytics Accelerator Library for Amazon S3 either as a standalone library or with the Hadoop library and the Spark engine. With Spark, it also supports the Iceberg open table format as well as S3 Table Buckets. Integrations with Hadoop and Iceberg are turned off by default. The sections below describe how to enable Analytics Accelerator Library for Amazon S3.
Standalone Usage
To get started, import the library dependency from Maven into your project:
<dependency>
<groupId>software.amazon.s3.analyticsaccelerator</groupId>
<artifactId>analyticsaccelerator-s3</artifactId>
<version>1.3.0</version>
<scope>compile</scope>
</dependency>
Then, initialize the library's S3SeekableInputStreamFactory:
S3AsyncClient crtClient = S3CrtAsyncClient.builder().maxConcurrency(600).build();
S3SeekableInputStreamFactory s3SeekableInputStreamFactory = new S3SeekableInputStreamFactory(
new S3SdkObjectClient(crtClient), S3SeekableInputStreamConfiguration.DEFAULT);
Note: The S3SeekableInputStreamFactory can be initialized with either the S3AsyncClient or the S3 CRT client.
We recommend that you use the S3 CRT client due to its enhanced connection pool management and higher throughput on downloads.
For either client, we recommend you initialize with a higher concurrency value to fully benefit from the library's optimizations.
This is because the library makes multiple parallel requests to S3 to prefetch data asynchronously. For the Java S3AsyncClient, you can increase the maximum connections by doing the following:
```
NettyNioAsyncHttpClient.Builder httpClientBuilder = NettyNioAsyncHttpClient.builder()
    .maxConcurrency(600);

S3AsyncClient s3AsyncClient = S3AsyncClient.builder().httpClientBuilder(httpClientBuilder).build();
```
To open a stream:
S3SeekableInputStream s3SeekableInputStream = s3SeekableInputStreamFactory.createStream(S3URI.of(bucket, key));
For more details on the usage of this stream, refer to the SeekableInputStream interface.
When the S3SeekableInputStreamFactory is no longer required to create new streams, close it to free the resources (e.g., caches for prefetched data) held by the factory.
s3SeekableInputStreamFactory.close();
Accessing SSE-C encrypted objects
To access SSE-C encrypted objects using AAL, set the customer key that was used to encrypt the object in an OpenStreamInformation object, and pass that object when creating the stream. The customer key must be base64 encoded.
```
OpenStreamInformation openStreamInformation = OpenStreamInformation.builder()
    .encryptionSecrets(
        EncryptionSecrets.builder().sseCustomerKey(Optional.of(base64EncodedCustomerKey)).build())
    .build();

S3SeekableInputStream s3SeekableInputStream = s3SeekableInputStreamFactory.createStream(S3URI.of(bucket, key), openStreamInformation);
```
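The base64 requirement can be met with the JDK's java.util.Base64. The following is a minimal sketch, assuming a raw 32-byte (256-bit) key, since SSE-C uses AES-256. The class and method names are illustrative and are not part of the library; in practice you would load the same key that was used to encrypt the object rather than generate a new one.

```java
import java.security.SecureRandom;
import java.util.Base64;

// Illustrative helper (hypothetical name, not part of the library).
final class SseCKeyEncoding {

  // Encodes a raw 256-bit key as the base64 string expected for SSE-C.
  static String toBase64(byte[] rawKey) {
    if (rawKey.length != 32) {
      throw new IllegalArgumentException("SSE-C keys must be 32 bytes (AES-256)");
    }
    return Base64.getEncoder().encodeToString(rawKey);
  }

  public static void main(String[] args) {
    // Demo only: a random key. Real usage must reuse the key the object was encrypted with.
    byte[] rawKey = new byte[32];
    new SecureRandom().nextBytes(rawKey);
    String base64EncodedCustomerKey = toBase64(rawKey);
    System.out.println(base64EncodedCustomerKey.length()); // 44 characters for 32 bytes
  }
}
```

The resulting string is what would be passed as base64EncodedCustomerKey in the snippet above.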
Using with Hadoop
If you are using Analytics Accelerator Library for Amazon S3 with Hadoop, you need to set the stream type to analytics in the Hadoop configuration. An example configuration is as follows:
<property>
<name>fs.s3a.input.stream.type</name>
<value>analytics</value>
</property>
For more information, see the Hadoop documentation on the Analytics stream type.
Using with Spark (without Iceberg)
Using the Analytics Accelerator Library for Amazon S3 with Spark is similar to the Hadoop usage. The only difference is to prepend the property names with spark.hadoop.
For example, to enable Analytics Accelerator Library for Amazon S3 on the Spark engine, set the following property in the Spark configuration.
<property>
<name>spark.hadoop.fs.s3a.input.stream.type</name>
<value>analytics</value>
</property>
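The prefixing rule can be sketched in plain Java. This is only an illustration of how a Hadoop property name maps to its Spark equivalent; the class and method names are hypothetical, and no Spark or Hadoop API is involved.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative helper (hypothetical name): maps Hadoop property names to Spark property names.
final class SparkHadoopProperties {
  static final String SPARK_HADOOP_PREFIX = "spark.hadoop.";

  // Returns a copy of the given Hadoop properties with every key prefixed by "spark.hadoop.".
  static Map<String, String> toSparkProperties(Map<String, String> hadoopProperties) {
    Map<String, String> sparkProperties = new LinkedHashMap<>();
    hadoopProperties.forEach((name, value) -> sparkProperties.put(SPARK_HADOOP_PREFIX + name, value));
    return sparkProperties;
  }

  public static void main(String[] args) {
    Map<String, String> hadoop = new LinkedHashMap<>();
    hadoop.put("fs.s3a.input.stream.type", "analytics");
    // Prints {spark.hadoop.fs.s3a.input.stream.type=analytics}
    System.out.println(toSparkProperties(hadoop));
  }
}
```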
Using with Spark (with Iceberg)
When your data in S3 is organised in the Iceberg open table format, data is retrieved using the Iceberg S3FileIO client instead of the Hadoop S3A client. Analytics Accelerator Library for Amazon S3 is currently being integrated into the Iceberg S3FileIO client. To enable it in the Spark engine, set the following Spark property.
<property>
<name>spark.sql.catalog.<CATALOG_NAME>.s3.analytics-accelerator.enabled</name>
<value>true</value>
</property>
For S3 General Purpose Buckets and S3 Directory Buckets, set <CATALOG_NAME> to spark_catalog, the default catalog.
S3 Table Buckets require you to set a custom catalog name, as outlined here.
Once you set the catalog, you can replace the <CATALOG_NAME> parameter with your chosen name.
To learn more about how to set the rest of the configurations, read our configuration documents.
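As a small illustration of the <CATALOG_NAME> substitution, the sketch below builds the property key for a given catalog. The helper class and the catalog name my_table_bucket_catalog are hypothetical; only the key pattern and the spark_catalog default come from the text above.

```java
// Illustrative helper (hypothetical name): builds the Spark property key for a catalog.
final class IcebergAcceleratorProperty {

  // Substitutes the catalog name into the key pattern described above.
  static String enabledKey(String catalogName) {
    if (catalogName == null || catalogName.isEmpty()) {
      throw new IllegalArgumentException("catalogName must not be empty");
    }
    return "spark.sql.catalog." + catalogName + ".s3.analytics-accelerator.enabled";
  }

  public static void main(String[] args) {
    // Default catalog for S3 General Purpose Buckets and S3 Directory Buckets.
    System.out.println(enabledKey("spark_catalog") + "=true");
    // A custom catalog name, as required for S3 Table Buckets (the name here is made up).
    System.out.println(enabledKey("my_table_bucket_catalog") + "=true");
  }
}
```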
Summary of Optimizations
Analytics Accelerator Library for Amazon S3 accelerates read performance of objects stored in Amazon S3 by integrating the AWS Common Runtime (CRT) libraries and implementing optimizations specific to Apache Parquet files. The AWS CRT is a software library built for interacting with AWS services that implements best practice performance design patterns, including timeouts, retries, and automatic request parallelization for high throughput.
You can use S3SeekableInputStreamFactory to initialize streams for all file types to benefit from read optimizations on top of benefits coming from CRT.
These optimizations are:
- Sequential prefetching - The library detects sequential read patterns to prefetch data and reduce latency, and reads the full object when the object is small to minimize the number of read operations.
- Small object prefetching - The library will prefetch the object if the object size is less than 8MB.
- Closed Range requests - The library exclusively uses closed range requests when accessing S3, which is the recommended best practice for making requests to S3.
- Read Vectored support - The library provides built-in implementation of Read Vectored functionality, enabling efficient reading of multiple non-contiguous ranges of data in a single operation.
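To make the first three optimizations more concrete, here is a minimal sketch of the decisions involved. It is not the library's implementation: the class name, the method names, and the "next read starts where the previous one ended" rule for detecting sequential reads are all assumptions, and 8MB is taken as 8 * 1024 * 1024 bytes.

```java
// Illustrative sketch (hypothetical names), not the library's implementation.
final class ReadPlanner {
  // Threshold from the text above; the exact byte value is an assumption.
  static final long SMALL_OBJECT_THRESHOLD = 8L * 1024 * 1024;

  private long nextExpectedPosition = -1;

  // Formats a closed HTTP Range header value: both ends are explicit and inclusive.
  static String closedRange(long start, long endInclusive) {
    if (start < 0 || endInclusive < start) {
      throw new IllegalArgumentException("invalid range " + start + "-" + endInclusive);
    }
    return "bytes=" + start + "-" + endInclusive;
  }

  // Small objects are fetched in full rather than piece by piece.
  static boolean shouldPrefetchWholeObject(long objectSize) {
    return objectSize < SMALL_OBJECT_THRESHOLD;
  }

  // Records a read and reports whether it continues exactly where the previous read ended.
  boolean recordRead(long position, int length) {
    boolean sequential = position == nextExpectedPosition;
    nextExpectedPosition = position + length;
    return sequential;
  }

  public static void main(String[] args) {
    long objectSize = 5L * 1024 * 1024;
    if (shouldPrefetchWholeObject(objectSize)) {
      System.out.println(closedRange(0, objectSize - 1)); // bytes=0-5242879
    }
    ReadPlanner planner = new ReadPlanner();
    planner.recordRead(0, 4096);
    System.out.println(planner.recordRead(4096, 4096)); // true: a sequential pattern
  }
}
```

A closed range such as bytes=0-5242879 names both ends of the request, in contrast to an open-ended range such as bytes=0-.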
When the object key ends with the file extension .parquet or .par, we use the following Apache Parquet specific optimizations:
- Parquet footer caching - The library reads the tail of the object with configurable size (1MB by default) as soon as a stream to a Parquet object is opened and caches it in memory. This is done to prevent multiple small GET requests that occur at the tail of the file for the Parquet metadata, pageIndex, and bloom filter structures.
- Predictive column prefetching - The library tracks recent columns being read using parquet metadata. When subsequent Parquet files which have these columns are opened, the library will prefetch these columns. For example, if columns x and y are read from A.parquet, and then B.parquet is opened, and it also contains columns named x and y, the library will prefetch them asynchronously.
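The two Parquet optimizations can be sketched as follows. This is a simplified illustration under stated assumptions, not the library's code: the class and method names are hypothetical, 1MB is taken as 1024 * 1024 bytes, and column tracking is reduced to a set of recently read column names with no expiry.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Illustrative sketch (hypothetical names), not the library's implementation.
final class ParquetPrefetchSketch {
  // Default tail size from the text above; the exact byte value is an assumption.
  static final long DEFAULT_FOOTER_SIZE = 1024L * 1024;

  private final Set<String> recentColumns = new LinkedHashSet<>();

  // Returns {start, endInclusive} for the tail of the object, clamped to the object start.
  static long[] footerRange(long objectSize, long footerSize) {
    if (objectSize <= 0 || footerSize <= 0) {
      throw new IllegalArgumentException("sizes must be positive");
    }
    long start = Math.max(0, objectSize - footerSize);
    return new long[] {start, objectSize - 1};
  }

  // Remembers that a column was read from some Parquet file.
  void recordColumnRead(String columnName) {
    recentColumns.add(columnName);
  }

  // Columns of a newly opened file that were read recently and are worth prefetching.
  List<String> columnsToPrefetch(Collection<String> columnsInNewFile) {
    List<String> result = new ArrayList<>();
    for (String column : columnsInNewFile) {
      if (recentColumns.contains(column)) {
        result.add(column);
      }
    }
    return result;
  }

  public static void main(String[] args) {
    long[] tail = footerRange(10L * 1024 * 1024, DEFAULT_FOOTER_SIZE);
    System.out.println("bytes=" + tail[0] + "-" + tail[1]); // bytes=9437184-10485759

    ParquetPrefetchSketch tracker = new ParquetPrefetchSketch();
    tracker.recordColumnRead("x"); // read from A.parquet
    tracker.recordColumnRead("y"); // read from A.parquet
    // B.parquet contains x, y, and z; only x and y were read recently.
    System.out.println(tracker.columnsToPrefetch(List.of("x", "y", "z"))); // [x, y]
  }
}
```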
When the object key ends with the file extension .csv, .json, or .txt, we use the following sequential format optimizations:
- Partition-aligned prefetching - The library implements proactive prefetching up to the configured partition size. The default partition size is 128MB, which can be modified by setting the partition.size configuration parameter. This optimization reduces the number of GET requests by fetching larger chunks of data in advance, resulting in improved read throughput for sequential access patterns. To disable prefetching, set use.format.specific.io to false.
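One way to read "partition-aligned" is that a read at a given position triggers a prefetch up to the end of the partition that contains it, capped at the end of the object. The sketch below implements that interpretation only; it is an assumption about the behavior, not a description of the library's code, and the names are hypothetical.

```java
// Illustrative sketch (hypothetical names) of one possible partition-aligned range calculation.
final class PartitionAlignedRange {
  // Default partition size from the text above; the exact byte value is an assumption.
  static final long DEFAULT_PARTITION_SIZE = 128L * 1024 * 1024;

  // Returns {start, endInclusive}: from the read position to the end of its partition,
  // capped at the last byte of the object.
  static long[] prefetchRange(long position, long objectSize, long partitionSize) {
    if (partitionSize <= 0 || position < 0 || position >= objectSize) {
      throw new IllegalArgumentException("invalid position or sizes");
    }
    long partitionEndExclusive = (position / partitionSize + 1) * partitionSize;
    long endExclusive = Math.min(partitionEndExclusive, objectSize);
    return new long[] {position, endExclusive - 1};
  }

  public static void main(String[] args) {
    long mb = 1024L * 1024;
    // A read at the start of a 300MB object covers the first 128MB partition.
    long[] first = prefetchRange(0, 300 * mb, DEFAULT_PARTITION_SIZE);
    System.out.println(first[0] + "-" + first[1]); // 0-134217727
    // A read in the last partition is capped at the object's final byte.
    long[] last = prefetchRange(260 * mb, 300 * mb, DEFAULT_PARTITION_SIZE);
    System.out.println(last[0] + "-" + last[1]);
  }
}
```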
Memory Used by Library
Analytics Accelerator Library for Amazon S3 implements a best-effort memory limiting mechanism. The library fetches data from S3 in blocks of bytes and keeps them in memory. Memory management is achieved through a dual strategy combining Time-to-Live (TTL) and maximum memory threshold.
When a block's time to live expires or memory usage exceeds the configured threshold, the blocks to remove are identified using time-based eviction and the Window TinyLFU algorithm, respectively, both implemented by the Caffeine library. Removal is done by an asynchronous process that runs at configured intervals, so memory usage might temporarily exceed the threshold. This overflow period can be minimized by increasing the cleanup frequency, at the cost of higher CPU utilization.
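The dual strategy can be illustrated with a small, self-contained cache. This sketch is deliberately simplified and is not how the library or Caffeine works internally: it swaps Window TinyLFU for plain least-recently-used eviction, measures TTL from the last access (an assumption), and uses hypothetical names throughout. The clock is injected so the behavior can be exercised without waiting.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.LongSupplier;

// Illustrative sketch (hypothetical names): TTL plus a best-effort memory limit.
final class BlockCacheSketch {
  private static final class Entry {
    final byte[] data;
    long lastAccessMillis;

    Entry(byte[] data, long now) {
      this.data = data;
      this.lastAccessMillis = now;
    }
  }

  private final long maxBytes;
  private final long ttlMillis;
  private final LongSupplier clock;
  // Access-ordered map: iteration starts with the least recently used block.
  private final LinkedHashMap<String, Entry> blocks = new LinkedHashMap<>(16, 0.75f, true);
  private long usedBytes;

  BlockCacheSketch(long maxBytes, long ttlMillis, LongSupplier clock) {
    this.maxBytes = maxBytes;
    this.ttlMillis = ttlMillis;
    this.clock = clock;
  }

  // Stores a block; usage may exceed maxBytes until the next cleanup pass.
  synchronized void put(String blockId, byte[] data) {
    Entry previous = blocks.put(blockId, new Entry(data, clock.getAsLong()));
    if (previous != null) {
      usedBytes -= previous.data.length;
    }
    usedBytes += data.length;
  }

  synchronized byte[] get(String blockId) {
    Entry entry = blocks.get(blockId);
    if (entry == null) {
      return null;
    }
    entry.lastAccessMillis = clock.getAsLong();
    return entry.data;
  }

  // One cleanup pass: drop expired blocks, then evict LRU blocks until under the limit.
  synchronized void cleanup() {
    long now = clock.getAsLong();
    Iterator<Map.Entry<String, Entry>> it = blocks.entrySet().iterator();
    while (it.hasNext()) {
      Entry entry = it.next().getValue();
      if (now - entry.lastAccessMillis >= ttlMillis) {
        usedBytes -= entry.data.length;
        it.remove();
      }
    }
    it = blocks.entrySet().iterator();
    while (usedBytes > maxBytes && it.hasNext()) {
      usedBytes -= it.next().getValue().data.length;
      it.remove();
    }
  }

  synchronized long usedBytes() {
    return usedBytes;
  }

  public static void main(String[] args) {
    long[] now = {0};
    BlockCacheSketch cache = new BlockCacheSketch(10, 1000, () -> now[0]);
    cache.put("a", new byte[6]);
    cache.put("b", new byte[6]);
    System.out.println(cache.usedBytes()); // 12: temporarily above the 10-byte limit
    cache.cleanup();
    System.out.println(cache.usedBytes()); // 6: the least recently used block was evicted
  }
}
```

In a real deployment the cleanup pass would typically be scheduled at a fixed interval, for example with a ScheduledExecutorService, which is why usage can briefly overshoot the limit between passes.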
You can change TTL, memory usage threshold and cleanup frequency as follows:
Note: Only positive values are allowed for the configurations below.
* The memory limit can be set using the key max.memory.limit; the default is 2GB. Take your workload and system resources into consideration when configuring this value. For example, for Parquet workloads, consider factors such as row group size and the number of vCPUs on executors.
* The cache data timeout can be set using the key cache.timeout; the default is 1s.
* The cleanup frequency can be set using the key memory.cleanup.frequency; the default is 5s.
To learn more about how to set the configurations, read our configuration documents.
User Agent
We prepend the user agent prefixes from both USER_AGENT_PREFIX_KEY, set in ObjectClientConfiguration, and USER_AGENT_PREFIX, set in the S3AsyncClient configuration, to the s3analyticsaccelerator user agent. For CRT clients, no value is currently set in USER_AGENT_PREFIX, so if you need a custom user agent, pass it in the ObjectClientConfiguration.
Benchmark Results
Benchmarking Results -- November 25, 2024
The current benchmarking results are provided for reference only. It is important to note that the performance of these queries can be affected by a variety of factors, including compute and storage variability, cluster configuration, and compute choice. All of the results presented have a margin of error of up to 3%.
To establish the performance impact of changes, we rely on a benchmark derived from the industry-standard TPC-DS benchmark at a 3 TB scale. It is important to note that our TPC-DS derived benchmark results are not directly comparable with official TPC-DS benchmark results. We also found that the sizing of Apache Parquet files and partitioning of the dataset have a substantive impact on the workload performance. As a result, we have created several versions of the test dataset, with a focus on different object sizes, ranging from single-digit MiBs to tens of GiBs, as well as various partitioning approaches.
On S3A, we have observed a total suite execution acceleration between 10% and 27%, with some queries showing a speed-up of up to 40%.
Contributions
We welcome contributions to Analytics Accelerator Library for Amazon S3! See the contributing guidelines for more information on how to report bugs, build from source code, or submit pull requests.
Security
If you discover a potential security issue in this project we ask that you notify Amazon Web Services (AWS) Security via our vulnerability reporting page. Do not create a public GitHub issue.
License
Analytics Accelerator Library for Amazon S3 is licensed under the Apache-2.0 license. The pull request template will ask you to confirm the licensing of your contribution and to agree to the Developer Certificate of Origin (DCO).
Owner
- Name: Amazon Web Services - Labs
- Login: awslabs
- Kind: organization
- Location: Seattle, WA
- Website: http://amazon.com/aws/
- Repositories: 914
- Profile: https://github.com/awslabs
AWS Labs
GitHub Events
Total
- Create event: 38
- Commit comment event: 3
- Release event: 6
- Delete event: 18
- Member event: 6
- Pull request event: 227
- Fork event: 14
- Issues event: 18
- Watch event: 47
- Issue comment event: 51
- Push event: 141
- Public event: 1
- Pull request review comment event: 646
- Pull request review event: 641
Last Year
- Create event: 38
- Commit comment event: 3
- Release event: 6
- Delete event: 18
- Member event: 6
- Pull request event: 227
- Fork event: 14
- Issues event: 18
- Watch event: 47
- Issue comment event: 51
- Push event: 141
- Public event: 1
- Pull request review comment event: 646
- Pull request review event: 641
Issues and Pull Requests
Last synced: 10 months ago
All Time
- Total issues: 9
- Total pull requests: 122
- Average time to close issues: 2 months
- Average time to close pull requests: 22 days
- Total issue authors: 4
- Total pull request authors: 11
- Average comments per issue: 0.56
- Average comments per pull request: 0.26
- Merged pull requests: 69
- Bot issues: 0
- Bot pull requests: 11
Past Year
- Issues: 9
- Pull requests: 121
- Average time to close issues: 2 months
- Average time to close pull requests: 20 days
- Issue authors: 4
- Pull request authors: 11
- Average comments per issue: 0.56
- Average comments per pull request: 0.26
- Merged pull requests: 69
- Bot issues: 0
- Bot pull requests: 11
Top Authors
Issue Authors
- oleg-lvovitch-aws (7)
- stubz151 (4)
- ahmarsuhail (2)
- fuatbasik (2)
- dependabot[bot] (1)
- Neuw84 (1)
Pull Request Authors
- ahmarsuhail (29)
- ozkoca (19)
- dependabot[bot] (16)
- fuatbasik (16)
- rajdchak (13)
- SanjayMarreddi (12)
- vaibhav5140 (9)
- stubz151 (6)
- CsengerG (4)
- sullis (2)
- petergalati (2)
- oleg-lvovitch-aws (1)
Packages
- Total packages: 1
- Total downloads: unknown
- Total dependent packages: 0
- Total dependent repositories: 0
- Total versions: 8
repo1.maven.org: software.amazon.s3.analyticsaccelerator:analyticsaccelerator-s3
S3 Analytics Accelerator Library for Amazon S3
- Homepage: https://github.com/awslabs/analytics-accelerator-s3
- Documentation: https://appdoc.app/artifact/software.amazon.s3.analyticsaccelerator/analyticsaccelerator-s3/
- License: The Apache License, Version 2.0
- Latest release: 1.2.1 (published 12 months ago)
Dependencies
- actions/checkout v4 composite
- actions/download-artifact v4 composite
- actions/setup-java v4 composite
- actions/upload-artifact v4 composite
- aws-actions/configure-aws-credentials v4.0.2 composite
- gradle/actions/setup-gradle 417ae3ccd767c252f5661f1ace9f835f9654f2b5 composite
- webfactory/ssh-agent v0.9.0 composite
- actions/checkout v4 composite
- actions/setup-java v4 composite
- aws-actions/configure-aws-credentials v4.0.2 composite
- gradle/actions/setup-gradle v3 composite
- actions/checkout v4 composite
- actions/setup-java v4 composite
- gradle/actions/setup-gradle v3 composite
- actions/checkout v4 composite
- actions/setup-java v4 composite
- gradle/actions/setup-gradle v3 composite