https://github.com/awslabs/simple-database-archival-solution

Simple Database Archival Solution (SDAS) is an open source solution which you can deploy in your AWS account to archive data to AWS. SDAS provides organizations with an efficient, easy and cost-effective solution for archiving Oracle, Microsoft SQL, PostgreSQL, and MySQL databases.

Science Score: 13.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (17.3%) to scientific vocabulary

Keywords

archive archive-data aws cdk cloudscape database database-schema mysql-database oracle-database postgresql-database sql-database
Last synced: 5 months ago

Repository

Simple Database Archival Solution (SDAS) is an open source solution which you can deploy in your AWS account to archive data to AWS. SDAS provides organizations with an efficient, easy and cost-effective solution for archiving Oracle, Microsoft SQL, PostgreSQL, and MySQL databases.

Basic Info
  • Host: GitHub
  • Owner: awslabs
  • License: apache-2.0
  • Language: TypeScript
  • Default Branch: main
  • Homepage:
  • Size: 17 MB
Statistics
  • Stars: 40
  • Watchers: 6
  • Forks: 8
  • Open Issues: 16
  • Releases: 11
Topics
archive archive-data aws cdk cloudscape database database-schema mysql-database oracle-database postgresql-database sql-database
Created almost 3 years ago · Last pushed about 1 year ago
Metadata Files
Readme Changelog Contributing License Code of conduct

README.md

Simple Database Archival Solution

Build

Introduction

Simple Database Archival Solution (SDAS) is an open source solution which you can deploy in your AWS account to archive data to AWS. SDAS will connect to your database, which can be on premises or in the cloud, map the schema, perform validation, and finally transfer data to AWS for storage in Amazon S3. This is accomplished primarily by leveraging AWS Step Functions and AWS Glue. The main purpose of SDAS is to provide an out-of-the-box solution that easily allows customers to archive data in the cloud coming from on-premises or cloud-hosted databases.

⚠️ This solution includes a public-facing web application for user access. User authentication is managed through Amazon Cognito using username and password credentials. You should conduct your own independent assessment before using the content in this sample for production purposes. This may include, but is not limited to, testing, securing, and optimizing the content based on your specific quality control practices and standards.

What is Simple Database Archival Solution (SDAS)?

As businesses accumulate more and more data over time, the need for effective database archiving solutions has become increasingly important. By moving older, rarely used data to an archive, businesses can reduce the size of their active databases, which improves performance and reduces storage costs. Archiving also helps organizations meet legal and regulatory requirements for data retention, and ensures that important data remains available for future use and discovery, if necessary. Out of the box, SDAS provides the following key features:

  • Support for Oracle, Microsoft SQL Server, MySQL or PostgreSQL
  • Identify the data type and table schema
  • Validate the data on the target after the archiving process has completed
  • Ability to configure WORM (“Write Once Read Many”)
  • Ability to define a data retention period for the data
  • Detailed information about the status of the data
  • Perform various data validation and integrity checks
  • Make it simple for operations to ingest and archive databases
  • Ability to preview data archived in Amazon S3

Give SDAS a try!

  1. Install the Simple Database Archival Solution in your AWS Account.
  2. Send any issues, improvements, or suggestions to us at our GitHub page.
  3. To help you get started, we have also published a self-guided workshop that walks through the installation and core features of SDAS.

Architecture

AWS SDAS components

Tool Versions

To build and deploy SDAS the following tools are required.

  1. NodeJs >= 18
  2. Python3 >= 3.9
  3. Docker

1. Build

The top-level package.json exists only to provide convenient top-level commands and doesn't contain any packages, so there is no need to install it. After pulling the latest changes, it's always best to run a full build to ensure you have the latest code.

To build the entire project run this command from the root of the cloned repository:

```bash
npm run build
```

Then during development, individual parts of the project can be built separately using the scoped commands:

```bash
# Execute these commands from the root of the cloned repository.
npm run build.cdk
npm run build.web
```

2. Deploy

If you are deploying to a new account or region, you will need to bootstrap the CDK. By default, the CDK bootstraps with the AdministratorAccess policy, which is restricted in certain environments. If you need greater access than PowerUserAccess and IAMFullAccess, add the role ARNs to the list.

If you are installing the application into a region other than us-east-1, you must bootstrap both regions. You can do this by setting the environment variable CDK_DEPLOY_REGION to us-east-1, running the command below, and then clearing the environment variable to pick up the default. Alternatively, you can manually run the command with both regions provided. See the statements below.

Execute this command from the root of the cloned repository:

```bash
npm run deploy.bootstrap
```

Deployment

To deploy, run the following command, substituting ${YOUR_ADMINISTRATOR_EMAIL} with the email address of your administrator. An email containing your password for signing in will be sent to that address. Execute this command from the root of the cloned repository. It will take approximately 10 minutes for the stack to be deployed.

```bash
npm run deploy -- -c admin_email="${YOUR_ADMINISTRATOR_EMAIL}"
```

Access the Front-end

  1. Check your email for your temporary password
  2. Check the CloudFormation sdas stack outputs for the value of WebAppCloudFrontDistributionDomainNameXYXY
  3. Open your web browser, enter the web address obtained in step 2, and log in as admin with the temporary password provided via email

3. Developer Setup

  1. Add example.com to your hosts file. This is needed to support CORS restrictions for authenticated requests during development.

On macOS/Linux, edit /etc/hosts and add the following line:

```
127.0.0.1 example.com
```

  2. Set the API Gateway URL for the shared development environment

After deployment of the shared development environment, create a .env.development file in the web-app folder. Replace {REACT_APP_API_URL} with your ApiGatewayUrl, including the trailing slash. This only needs to be done once.

```bash
HOST=example.com
REACT_APP_API_URL=https://{REACT_APP_API_URL}/
```

4. Using SDAS

4.1. Start and Discover

To start the archiving process, gather essential connection information, including the database name, database URI, and credentials. With this information, SDAS attempts to connect to the database, and if successful, proceeds to the next step. In the next step, SDAS collects the tables and associated schema from the target database to be archived.
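The connection details gathered in this step can be represented as a small, validated structure before any connection attempt is made. The sketch below is illustrative only; the field names and function are assumptions, not SDAS's actual API:

```python
# Illustrative sketch: validate the connection details gathered from
# the user before attempting to connect. Field names are assumptions.
SUPPORTED_ENGINES = {"oracle", "mssql", "mysql", "postgresql"}

def validate_connection_info(info: dict) -> list:
    """Return a list of validation errors (an empty list means OK)."""
    errors = []
    required = ("database", "hostname", "port", "username", "password", "engine")
    for field in required:
        if not info.get(field):
            errors.append(f"missing required field: {field}")
    engine = info.get("engine")
    if engine and engine not in SUPPORTED_ENGINES:
        errors.append(f"unsupported engine: {engine}")
    return errors

# Example: a complete PostgreSQL configuration passes validation.
config = {
    "database": "hr_archive",
    "hostname": "db.example.com",
    "port": 5432,
    "username": "archiver",
    "password": "secret",
    "engine": "postgresql",
}
```

Validating up front lets the UI surface a clear error before the slower connect-and-scan step runs.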

SDAS Add Archive

To identify the data that needs to be archived, SDAS uses a technique to scan the metadata associated with the table. This process is designed to accurately identify the data type and schema of the table and ensure that the data is properly formatted and validated before being transferred to AWS. The process involves running multiple SQL queries to extract the database schema definition to allow AWS Glue to read and finally write the data to Amazon S3.
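SDAS runs engine-specific SQL against each database's metadata catalog (e.g. information_schema.columns) to recover column names and types. The same idea can be demonstrated with Python's built-in sqlite3 module, which exposes table metadata through PRAGMA table_info; the table below is hypothetical and the code is a demonstration of the technique, not SDAS's implementation:

```python
import sqlite3

# Demonstration: each engine exposes a metadata catalog that can be
# queried for column names and declared types. SQLite provides this
# through PRAGMA table_info; SDAS uses the analogous queries for
# Oracle, SQL Server, MySQL, and PostgreSQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (id INTEGER, name TEXT, hired DATE)")

def discover_schema(conn, table):
    """Return [(column_name, declared_type), ...] for a table."""
    rows = conn.execute(f"PRAGMA table_info({table})").fetchall()
    # Each row is (cid, name, type, notnull, dflt_value, pk).
    return [(row[1], row[2]) for row in rows]

schema = discover_schema(conn, "employees")
# schema == [('id', 'INTEGER'), ('name', 'TEXT'), ('hired', 'DATE')]
```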

Once the data type and schema of the table have been identified, SDAS can begin the process of transferring the data to AWS.

SDAS Source Database Scanning

4.2. Archive

The archive phase of SDAS is a critical step in the process of archiving data to Amazon S3. SDAS is designed to automatically archive data from Oracle, Microsoft SQL Server, PostgreSQL, and MySQL databases, providing flexibility and versatility for customers. The archiving process can be triggered either manually or automatically based on a defined schedule, enabling customers to customize the solution to their specific needs.

Archive Status Archive Database

At the core of the archive phase is AWS Glue, a fully managed Extract, Transform, and Load (ETL) service that provides a flexible and scalable solution for copying the database from the source to the target. SDAS leverages the power of AWS Glue to perform necessary transformations on the data, including data cleaning and schema transformations, ensuring that the data is properly formatted and validated before being transferred to Amazon S3.

Once the data is transferred to Amazon S3, it is stored as Parquet files, a columnar storage format that is optimized for query performance and storage efficiency. This makes the archived data easy to query, for instance using Amazon Athena, a serverless query service that allows customers to query data stored in S3 using SQL. By leveraging the power of Amazon Athena, customers can easily perform ad-hoc analysis on their archived data without the need for complex setup or maintenance.

4.3. Data Validation

The data validation phase of SDAS is a critical step that ensures the accuracy and completeness of the archived data. After the archival process is complete, SDAS automatically triggers a validation process to ensure that the data has been properly transferred and stored in Amazon S3.

The validation process begins by comparing the source data to the archived data stored in Amazon S3, using a variety of techniques such as checksums and data sampling. This process ensures that the data has been accurately transferred and stored, with no data loss or corruption. SDAS does not perform validation on the source data, only on the data stored in Amazon S3.
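One simple form of such a check can be sketched in Python: fingerprint each side with a row count plus a combined checksum over a canonical serialization of the rows, then compare. This illustrates the technique only and is not SDAS's actual implementation:

```python
import hashlib

def table_fingerprint(rows):
    """Order-insensitive fingerprint: (row_count, combined SHA-256)."""
    # Hash each row's canonical repr, sort so row order doesn't matter,
    # then hash the concatenation into a single digest.
    digests = sorted(
        hashlib.sha256(repr(row).encode("utf-8")).hexdigest() for row in rows
    )
    combined = hashlib.sha256("".join(digests).encode("utf-8")).hexdigest()
    return len(rows), combined

source_rows = [(1, "alice"), (2, "bob")]
archived_rows = [(2, "bob"), (1, "alice")]  # same data, different order

# Matching fingerprints indicate the archive holds the same rows.
assert table_fingerprint(source_rows) == table_fingerprint(archived_rows)
```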

If any discrepancies are detected, SDAS provides you with the ability to identify the affected table. In addition to ensuring the accuracy of the archived data, SDAS also provides security features to protect against unauthorized access or modification of the data. Passwords are stored in AWS Secrets Manager, which provides a highly secure mechanism for storing and managing secrets, such as database passwords.

SDAS Validation

4.4. Access to Archived Databases

Access to the archived databases in SDAS is limited to authorized users who can access them through the Amazon Athena Console. To explore and visualize the data using Business Intelligence tools, users can download, install, and configure either an ODBC (Open Database Connectivity) or JDBC (Java Database Connectivity) driver to connect to Amazon Athena.

SDAS also includes a preview mode through the console, which allows users to quickly view the database that has been archived without the need for additional drivers or tools. This preview mode provides users with a quick and easy way to assess the quality and completeness of the archived data before proceeding with further analysis or querying.

SDAS Data Access

4.5. Object Lock

SDAS includes a powerful feature that allows users to enable Amazon S3 Object Lock, which stores objects using a WORM (Write Once, Read Many) model. This feature is designed for scenarios where it is critical that data not be modified or deleted after it has been written.

By enabling Amazon S3 Object Lock, users can ensure that their archived data is fully protected from accidental or malicious deletion or modification. This feature provides a powerful layer of security that helps to prevent data loss or corruption, ensuring that the archived data remains complete and accurate for future analysis and querying.
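With Object Lock enabled on the bucket, each archived object carries a retention mode and a retain-until date. The sketch below only builds the parameters for S3's put_object call; ObjectLockMode and ObjectLockRetainUntilDate are real S3 API parameters, while the bucket and key names are placeholders:

```python
from datetime import datetime, timedelta, timezone

def object_lock_params(bucket, key, retention_days, mode="COMPLIANCE"):
    """Build the WORM-related keyword arguments for an S3 put_object call."""
    retain_until = datetime.now(timezone.utc) + timedelta(days=retention_days)
    return {
        "Bucket": bucket,
        "Key": key,
        "ObjectLockMode": mode,  # "COMPLIANCE" or "GOVERNANCE"
        "ObjectLockRetainUntilDate": retain_until,
    }

# Placeholder bucket/key; one year of retention in compliance mode.
params = object_lock_params("sdas-archive-bucket", "archive/table.parquet", 365)
# e.g. s3_client.put_object(Body=data, **params)
```

In COMPLIANCE mode no user, including the root account, can delete or overwrite the object until the retain-until date passes.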

SDAS Object Lock

5. Destroy

To destroy the dev environment, run:

```bash
# Execute this command from the root of the cloned repository.
npm run destroy
```

6. Development

The top level project structure follows a responsibility structure:

  • /api - contains the Lambda functions for the API
  • /deploy - contains the Cloud Development Kit (CDK) code to deploy the solution
  • /web-app - contains the SPA web client for the application
  • /functions - contains the Lambda functions not associated with APIs
  • /step-functions - contains the Lambda functions for AWS Step Functions

Data Type Conversions

PostgreSQL Data Type Conversions

This section covers data type conversions that SDAS processes between PostgreSQL and Athena databases.

| Source Data Type | Supported? | SDAS Data Type |
| --------------------------- | ---------- | -------------- |
| ARRAY | Y | array |
| bigserial | Y | int |
| bigint | Y | bigint |
| bit [ (n) ] | Y | string |
| bit varying [ (n) ] | Y | string |
| boolean | Y | boolean |
| box | Y | string |
| bytea | Y | binary |
| character varying [ (n) ] | Y | string |
| cidr | Y | string |
| circle | Y | string |
| date | Y | date |
| double precision | Y | decimal(38,6) |
| inet | Y | string |
| integer | Y | int |
| interval [ fields ] [ (p) ] | Y | string |
| json | Y | string |
| jsonb | Y | string |
| lseg | Y | string |
| macaddr | Y | string |
| macaddr8 | Y | string |
| money | Y | decimal(19,4) |
| numeric | Y | decimal(38,18) |
| path | Y | string |
| pglsn | Y | string |
| pgsnapshot | Y | string |
| point | Y | string |
| polygon | Y | string |
| real | Y | decimal(19,4) |
| serial | Y | int |
| smallint | Y | smallint |
| smallserial | Y | int |
| text | Y | string |
| time | Y | string |
| time with time zone | Y | string |
| timestamp | Y | string |
| timestamp with time zone | Y | string |
| timestamp without time zone | Y | string |
| tsquery | Y | string |
| tsvector | Y | string |
| txid_snapshot | Y | string |
| USER-DEFINED | Y | string |
| uuid | Y | string |
| xml | Y | string |
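A mapping like the one above is naturally expressed as a lookup with a string fallback, since most unrecognized PostgreSQL types map to string. The sketch below covers only a subset of the table and is illustrative, not SDAS's actual code:

```python
# Subset of the PostgreSQL -> SDAS (Athena) type mapping from the
# table above; types not listed fall back to "string", matching the
# table's predominant mapping.
PG_TO_SDAS = {
    "bigint": "bigint",
    "bigserial": "int",
    "boolean": "boolean",
    "bytea": "binary",
    "date": "date",
    "double precision": "decimal(38,6)",
    "integer": "int",
    "money": "decimal(19,4)",
    "numeric": "decimal(38,18)",
    "real": "decimal(19,4)",
    "serial": "int",
    "smallint": "smallint",
}

def to_sdas_type(pg_type: str) -> str:
    """Map a PostgreSQL type name to its SDAS/Athena type."""
    return PG_TO_SDAS.get(pg_type.lower(), "string")
```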

Troubleshooting

Docker issues

The build fails during a Docker step with OSError: [Errno 28] No space left on device: or something similar. Open Docker Desktop, click Images, click Clean up, check Unused and Dangling, then click Remove.
Alternatively, run from the command line: docker image prune -a

Disclaimer

This solution provides a public-facing web application that users can access. User authentication is managed through Amazon Cognito using username and password credentials. You should consider doing your own independent assessment before using the content in this sample for production purposes. This may include (amongst other things) testing, securing, and optimizing the content provided in this sample, based on your specific quality control practices and standards. This solution is provided as-is, without production-grade guarantees. You are responsible for ensuring the solution is suitable for your specific use case, including managing security, compliance, and cost considerations. Deploying this solution may incur AWS charges associated with the use of services such as Amazon S3, AWS Glue, and others.

License

See the LICENSE file for our project's licensing.

Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.

Owner

  • Name: Amazon Web Services - Labs
  • Login: awslabs
  • Kind: organization
  • Location: Seattle, WA

AWS Labs

GitHub Events

Total
  • Release event: 1
  • Watch event: 5
  • Delete event: 18
  • Issue comment event: 16
  • Push event: 7
  • Pull request review event: 2
  • Pull request event: 24
  • Create event: 5
Last Year
  • Release event: 1
  • Watch event: 5
  • Delete event: 18
  • Issue comment event: 16
  • Push event: 7
  • Pull request review event: 2
  • Pull request event: 24
  • Create event: 5

Issues and Pull Requests

Last synced: almost 2 years ago

All Time
  • Total issues: 9
  • Total pull requests: 32
  • Average time to close issues: about 2 months
  • Average time to close pull requests: 10 days
  • Total issue authors: 7
  • Total pull request authors: 4
  • Average comments per issue: 1.89
  • Average comments per pull request: 0.22
  • Merged pull requests: 23
  • Bot issues: 0
  • Bot pull requests: 18
Past Year
  • Issues: 8
  • Pull requests: 28
  • Average time to close issues: about 2 months
  • Average time to close pull requests: 10 days
  • Issue authors: 6
  • Pull request authors: 4
  • Average comments per issue: 2.0
  • Average comments per pull request: 0.25
  • Merged pull requests: 19
  • Bot issues: 0
  • Bot pull requests: 16
Top Authors
Issue Authors
  • mmuller88 (3)
  • mhmt1420 (1)
  • TinCongHuynh (1)
  • ruiscosta (1)
  • avinash1sriram (1)
  • jrgwv (1)
  • marsoy (1)
  • sgopalakrishnan1980 (1)
Pull Request Authors
  • dependabot[bot] (38)
  • ruiscosta (14)
  • ztanruan (5)
  • mmuller88 (2)
Top Labels
Issue Labels
Pull Request Labels
dependencies (38)

Dependencies

.github/workflows/ci.yml actions
  • Borales/actions-yarn v4.2.0 composite
  • actions/checkout v2 composite
  • actions/setup-node v3 composite
  • actions/setup-python v2 composite
  • snok/install-poetry v1 composite
deploy/package-lock.json npm
  • 548 dependencies
deploy/package.json npm
  • @aws-cdk/aws-apigatewayv2-alpha 2.70.0-alpha.0 development
  • @aws-cdk/aws-apigatewayv2-authorizers-alpha 2.70.0-alpha.0 development
  • @aws-cdk/aws-apigatewayv2-integrations-alpha 2.70.0-alpha.0 development
  • @aws-cdk/aws-athena ^1.197.0 development
  • @aws-cdk/aws-lambda-python-alpha ^2.70.0-alpha.0 development
  • @types/git-branch ^2.0.2 development
  • @types/git-repo-name ^1.0.1 development
  • @types/jest ^29.2.4 development
  • @types/node 18.11.15 development
  • @typescript-eslint/eslint-plugin 5.46.1 development
  • @typescript-eslint/parser 5.46.1 development
  • aws-cdk 2.70.0 development
  • aws-cdk-lib 2.80.0 development
  • aws-sdk ^2.1354.0 development
  • cdk-nag ^2.21.0 development
  • constructs ^10.0.0 development
  • eslint ^8.9.0 development
  • eslint-config-prettier 8.5.0 development
  • eslint-plugin-prettier 4.2.1 development
  • git-branch ^2.0.1 development
  • git-repo-name ^1.0.1 development
  • jest ^29.3.1 development
  • source-map-support ^0.5.16 development
  • ts-node ^10.9.1 development
  • typescript ~4.9.4 development
package-lock.json npm
package.json npm
web-app/package-lock.json npm
  • 1951 dependencies
web-app/package.json npm
  • @testing-library/jest-dom ^5.16.2 development
  • @testing-library/react ^12.1.3 development
  • @testing-library/user-event ^13.5.0 development
  • @types/jest ^27.4.0 development
  • @types/node ^16.11.25 development
  • @types/react ^17.0.39 development
  • @types/react-dom ^17.0.11 development
  • @types/react-router-dom ^5.1.7 development
  • @typescript-eslint/eslint-plugin ^5.57.0 development
  • @typescript-eslint/parser ^5.57.0 development
  • eslint ^8.14.0 development
  • eslint-config-prettier ^8.5.0 development
  • eslint-config-react-app ^7.0.1 development
  • eslint-plugin-prettier ^4.0.0 development
  • eslint-plugin-react ^7.32.2 development
  • prettier ^2.6.2 development
  • react-scripts 5.0.0 development
  • typescript ^4.5.5 development
  • @aws-amplify/ui-react ^4.3.6
  • @awsui/components-react ^3.0.736
  • @awsui/design-tokens ^3.0.35
  • @awsui/global-styles ^1.0.19
  • @cloudscape-design/components ^3.0.96
  • @cloudscape-design/design-tokens ^3.0.5
  • @cloudscape-design/global-styles ^1.0.1
  • @material-ui/core ^4.11.3
  • @material-ui/icons ^4.11.2
  • ace-builds ^1.13.0
  • aws-amplify ^5.0.12
  • axios ^0.27.2
  • moment ^2.29.4
  • moment-timezone ^0.5.42
  • react ^17.0.2
  • react-dom ^17.0.2
  • react-moment ^1.1.2
  • react-player ^2.10.1
  • react-router-dom ^5.2.0
  • sass ^1.54.4
  • use-awsui ^1.0.8
  • web-vitals ^3.0.4
api/archive/create/requirements.txt pypi
  • pytz *
api/archive/source/test-connection/requirements.txt pypi
  • oracledb * test
  • psycopg2-binary ==2.9.6 test
  • pymssql * test
  • pymysql * test
functions/async-get-schema/requirements.txt pypi
  • oracledb *
  • psycopg2-binary ==2.9.6
  • pymssql *
  • pymysql *