https://github.com/big-data-lab-umbc/reproducible_and_portable_app_in_cloud

A toolkit to deploy, execute, analyze, and reproduce big data analytics automatically in the cloud.

Science Score: 36.0%

This score indicates how likely this project is to be science-related based on various indicators:

- ○ CITATION.cff file
- ✓ codemeta.json file: Found codemeta.json file
- ○ .zenodo.json file
- ✓ DOI references: Found 2 DOI reference(s) in README
- ✓ Academic publication links: Links to: zenodo.org
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: Low similarity (12.6%) to scientific vocabulary

Keywords

aws azure cloud reproducibility

Last synced: 5 months ago

Repository

Basic Info

Host: GitHub
Owner: big-data-lab-umbc
Language: Python
Default Branch: main
Homepage:
Size: 31.3 MB

Statistics

Stars: 6
Watchers: 2
Forks: 6
Open Issues: 6
Releases: 1

Topics

aws azure cloud reproducibility

Created over 4 years ago · Last pushed over 1 year ago

Metadata Files

Readme

Reproducible and Portable Big Data Analytics in Cloud

Introduction

We implement the Reproducible and Portable big data Analytics in the Cloud (RPAC) Toolkit, which helps users deploy, execute, analyze, and reproduce big data analytics automatically in the cloud.

Abstract

Cloud computing has become a major approach to enable reproducible computational experiments because of its support of on-demand hardware and software resource provisioning. Yet there are still two main difficulties in reproducing big data applications in the cloud. The first is how to automate end-to-end execution of big data analytics in the cloud, including virtual distributed environment provisioning, network and security group setup, and big data analytics pipeline description and execution. The second is that an application developed for one cloud, such as AWS or Azure, is difficult to reproduce in another cloud, a.k.a. the vendor lock-in problem. To tackle these problems, we leverage serverless computing and containerization techniques for automatic, scalable big data application execution and reproducibility, and utilize the adapter design pattern to enable application portability and reproducibility across different clouds. Based on this approach, we propose and develop an open-source toolkit that supports 1) on-demand distributed hardware and software environment provisioning, 2) automatic data and configuration storage for each execution, 3) flexible client modes based on user preferences, 4) execution history query, and 5) simple reproducibility of existing executions in the same environment or a different environment. We conducted extensive experiments on both AWS and Azure using three big data analytics applications that run on a virtual CPU/GPU cluster. Three main behaviors of our toolkit were benchmarked: i) execution overhead ratio for reproducibility support, ii) differences in reproducing the same application on AWS and Azure in terms of execution time, budgetary cost, and cost-performance ratio, and iii) differences between the scale-out and scale-up approaches for the same application on AWS and Azure.
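The adapter design pattern mentioned in the abstract can be illustrated with a minimal sketch. This is not the toolkit's actual code: the class names, method names, and stubbed return values below are hypothetical, and the adapters here only build strings where real ones would call the cloud provider's APIs.

```python
from abc import ABC, abstractmethod


class CloudAdapter(ABC):
    """Common interface the cloud-independent code calls (hypothetical)."""

    @abstractmethod
    def provision(self, instance_count: int) -> str:
        """Create a virtual cluster and return an identifier for it."""

    @abstractmethod
    def terminate(self, cluster_id: str) -> None:
        """Release every resource that belongs to the cluster."""


class AwsAdapter(CloudAdapter):
    def provision(self, instance_count: int) -> str:
        # A real adapter would call AWS APIs here; this stub only builds an ID.
        return "aws-cluster-{}".format(instance_count)

    def terminate(self, cluster_id: str) -> None:
        print("terminating {} on AWS".format(cluster_id))


class AzureAdapter(CloudAdapter):
    def provision(self, instance_count: int) -> str:
        # A real adapter would call Azure APIs here; this stub only builds an ID.
        return "azure-cluster-{}".format(instance_count)

    def terminate(self, cluster_id: str) -> None:
        print("terminating {} on Azure".format(cluster_id))


# The only place that knows which concrete clouds exist.
ADAPTERS = {"aws": AwsAdapter, "azure": AzureAdapter}


def run_experiment(cloud: str, instance_count: int) -> str:
    """Run the same provisioning steps on whichever cloud is selected."""
    adapter = ADAPTERS[cloud]()
    cluster_id = adapter.provision(instance_count)
    adapter.terminate(cluster_id)
    return cluster_id
```

Because `run_experiment` depends only on the `CloudAdapter` interface, reproducing an execution on a different cloud in this sketch amounts to passing a different `cloud` key.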

Prerequisite

RPAC currently supports Python>=3.6.
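A quick way to confirm the interpreter meets this requirement is a version check like the one below. The names `MIN_VERSION` and `python_is_supported` are illustrative and are not part of RPAC.

```python
import sys

# Minimum version stated in the prerequisite above.
MIN_VERSION = (3, 6)


def python_is_supported(version_info=sys.version_info) -> bool:
    """Return True when the (major, minor) version is at least MIN_VERSION."""
    return tuple(version_info[:2]) >= MIN_VERSION


if __name__ == "__main__":
    if not python_is_supported():
        raise SystemExit("Python 3.6 or newer is required")
```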

Install Dependencies

```bash
pip3 install configparser
```

Optional dependencies, only for AWS-based execution. Commands for other OSs are in the AWS User Guide:

```bash
pip3 install aws-sam-cli
curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip" && unzip awscliv2.zip && sudo ./aws/install
```

Optional dependencies, only for Azure-based execution:

```bash
curl -sL https://aka.ms/InstallAzureCLIDeb | sudo bash
```

Set your credentials and other configurations for cloud platforms:

For AWS: aws configure set aws_access_key_id <aws_access_key> && aws configure set aws_secret_access_key <aws_secret_key> && aws configure set default.region us-west-2

For Azure: az login

Usage

main.py: runs an RPAC experiment on a cloud.

```
usage: python3 main.py [-h] [--executionhistory EXECUTIONHISTORY] [--one_click] [--terminate]

RPAC Toolkit.

optional arguments:
  -h, --help            show this help message and exit
  --executionhistory EXECUTIONHISTORY
                        Folder name of execution history to reproduce, or URI of execution history.
  --oneclick            Allow oneclick execution to be used by RPAC, implies '--one_click'. Note this argument will terminate all cloud resources after execution finished.
  --terminate           Terminate all cloud resources, implies '--terminate'.
```
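For readers who want to see how a command line like this could be declared, here is a minimal `argparse` sketch. It is an illustration, not the contents of main.py: the flag spellings follow the example commands in this README (`--execution_history`, `--one_click`, `--terminate`), and treating the last two as boolean switches is an assumption.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser resembling the RPAC help text (illustrative only)."""
    parser = argparse.ArgumentParser(prog="main.py", description="RPAC Toolkit.")
    parser.add_argument(
        "--execution_history",
        default=None,
        help="Folder name of execution history to reproduce, "
             "or URI of execution history.",
    )
    parser.add_argument(
        "--one_click",
        action="store_true",  # assumed to be a switch that takes no value
        help="Run end to end and terminate all cloud resources afterwards.",
    )
    parser.add_argument(
        "--terminate",
        action="store_true",  # assumed to be a switch that takes no value
        help="Terminate all cloud resources.",
    )
    return parser


if __name__ == "__main__":
    print(build_parser().parse_args())
```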

To use the RPAC toolkit, make the following changes to your configuration:

Update the configuration files resource.ini, application.ini, and personal.ini in the ConfigTemplate folder.
- For resource.ini: reproduce_storage is the S3 bucket name, which will store all historical reproduction files. You need to create your bucket before running RPAC. We recommend a name containing only lowercase letters, numbers, and hyphens (-). The detailed bucket naming rules can be found here.
- For personal.ini: cloud_credentials is the key:value pair of your cloud credentials (Access key ID:Secret key ID). To find your credentials, see here.
Run python3 main.py to execute the big data analytics.
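Before running, it can help to sanity-check the two values described above. The sketch below parses sample INI text with `configparser` and validates it. The section names `[RESOURCE]` and `[PERSONAL]`, the sample values, and the function name are assumptions for illustration; check the files in ConfigTemplate for the real layout. The bucket pattern encodes the recommendation above (lowercase letters, numbers, hyphens) together with S3's 3 to 63 character length limit.

```python
import configparser
import re

# Sample file contents; section names and values are assumed, not RPAC's own.
RESOURCE_INI = """
[RESOURCE]
reproduce_storage = my-rpac-history
"""

PERSONAL_INI = """
[PERSONAL]
cloud_credentials = EXAMPLEKEYID:examplesecret
"""

# 3-63 characters, lowercase letters, digits, and hyphens,
# starting and ending with a letter or digit.
BUCKET_NAME = re.compile(r"[a-z0-9][a-z0-9-]{1,61}[a-z0-9]")


def _parse(text):
    # Only "=" separates keys from values, so the ":" inside the credentials
    # stays in the value; interpolation is off so "%" in secrets is literal.
    parser = configparser.ConfigParser(delimiters=("=",), interpolation=None)
    parser.read_string(text)
    return parser


def load_settings(resource_text, personal_text):
    """Return the bucket name and split credentials, or raise ValueError."""
    bucket = _parse(resource_text)["RESOURCE"]["reproduce_storage"]
    if not BUCKET_NAME.fullmatch(bucket):
        raise ValueError("unsuitable bucket name: {!r}".format(bucket))

    credentials = _parse(personal_text)["PERSONAL"]["cloud_credentials"]
    key_id, _, secret = credentials.partition(":")
    if not key_id or not secret:
        raise ValueError("cloud_credentials must look like KEY_ID:SECRET")

    return {"bucket": bucket, "access_key_id": key_id, "secret_key": secret}
```

Calling `load_settings(RESOURCE_INI, PERSONAL_INI)` returns a dictionary with the bucket name and the two halves of the credential pair; a bucket name with uppercase letters or underscores raises `ValueError`.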

Example usage:
- python3 main.py --one_click
- python3 main.py --execution_history 0ec2088f-a3b8-4730-8e76-cac2015c74df --one_click
- python3 main.py --execution_history s3://aws-sam-cli-managed-default-samclisourcebucket-xscicpwnc0z3/a57f212d-c7c3-46eb-ace4-d62bb6b294f6 --one_click
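The examples pass an execution history either as a local folder name or as an S3 URI. The sketch below shows one way such a reference could be told apart and split into its parts. The function name and the returned dictionary are hypothetical, and the assumption that every execution ID is a UUID is drawn only from the two examples above.

```python
import uuid
from urllib.parse import urlparse


def parse_history_reference(reference):
    """Classify an execution history reference as local or S3 (illustrative)."""
    parsed = urlparse(reference)
    if parsed.scheme == "s3":
        kind = "s3"
        bucket = parsed.netloc
        # The last path component is taken to be the execution ID.
        execution_id = parsed.path.strip("/").rsplit("/", 1)[-1]
    else:
        kind = "local"
        bucket = None
        execution_id = reference

    # Assumption: execution IDs are UUIDs; uuid.UUID raises ValueError otherwise.
    uuid.UUID(execution_id)
    return {"kind": kind, "bucket": bucket, "execution_id": execution_id}
```

A local folder name yields `kind == "local"` with no bucket, while an `s3://` URI yields the bucket name and the trailing execution ID.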

For a closer look, please refer to our demo.

Getting Started

Pointers for experimental execution to get you started:
- First execution: run your first execution while learning how to use the RPAC toolkit
- Examples: easy-to-follow RPAC walkthroughs across three applications
- Reproduce: reproduce an existing execution with RPAC
- New application: create your own applications with RPAC

End-to-end execution is also provided in RPAC:
- End-to-end: one-click execution with RPAC

Citation

If you use this code for your research, please cite our paper:

@article{wang2023reproducible,
  title={Reproducible and Portable Big Data Analytics in the Cloud},
  author={Wang, Xin and Guo, Pei and Li, Xingyan and Gangopadhyay, Aryya and Busart, Carl and Freeman, Jade and Wang, Jianwu},
  journal={IEEE Transactions on Cloud Computing},
  year={2023},
  publisher={IEEE}
}

Owner

Name: Big Data Analytics Lab @ UMBC
Login: big-data-lab-umbc
Kind: organization
Location: University of Maryland, Baltimore County

Website: https://bdal.umbc.edu/
Twitter: jianwuwang
Repositories: 5
Profile: https://github.com/big-data-lab-umbc

GitHub Events

Total

Fork event: 1

Last Year

Fork event: 1

Dependencies

AzureServerlessTemplate/CausalityAnalyticsViaSpark/lambda/requirements.txt pypi

azure-functions *

AzureServerlessTemplate/CloudRetrievalViaDask/lambda/requirements.txt pypi

azure-cli-core *
azure-functions *
azure-identity *
azure-mgmt-compute *
azure-mgmt-resource *

AzureServerlessTemplate/DomainAdaptationViaHovorod/lambda/requirements.txt pypi

azure-functions *

AzureServerlessTemplate/requirements.txt pypi

AzureCLI ==v2.26.0
AzureFunctionsCoreTools ==v3.x.
dotnet-runtimes ==Microsoft.NETCore.App==v5.0.7
dotnet-runtimes ==Microsoft.AspNetCore.App==v5.0.7

docker/CausalityAnalyticsViaSpark/Dockerfile docker

amazoncorretto 8 build

docker/CloudRetrievalViaDask/Dockerfile docker

ubuntu 20.04 build

docker/CloudRetrievalViaHovorod/Dockerfile docker

nvidia/cuda 10.1-cudnn7-devel-ubuntu16.04 build

docker/DomainAdaptationViaHovorod/Dockerfile docker

nvidia/cuda 10.1-cudnn7-devel-ubuntu16.04 build

docker/SatelliteCollocationLocally/Dockerfile docker

ubuntu 20.04 build

docker/COT_retrievals_from_LES/Dockerfile docker

nvidia/cuda 10.1-cudnn7-devel-ubuntu18.04 build

docker/CloudPhasePredictionDAMA-WL/Dockerfile docker

ubuntu latest build

docker/GPU-CloudPhasePredictionDAMA-WL/Dockerfile docker

nvidia/cuda 10.1-cudnn7-devel-ubuntu18.04 build

AwsServerlessTemplate/CausalityAnalyticsViaSpark/lambda/requirements.txt pypi

AwsServerlessTemplate/CloudPhasePredictionDAMA-WL/lambda/requirements.txt pypi

AwsServerlessTemplate/CloudRetrievalViaDask/lambda/requirements.txt pypi

AwsServerlessTemplate/DomainAdaptationViaHovorod/lambda/requirements.txt pypi

AwsServerlessTemplate/GPU-CloudPhasePredictionDAMA-WL/lambda/requirements.txt pypi

AwsServerlessTemplate/NewAppTemplate/lambda/requirements.txt pypi

AwsServerlessTemplate/SatelliteCollocationViaDask/lambda/requirements.txt pypi

ExecutionHistory/0ec2088f-a3b8-4730-8e76-cac2015c74df/067f4d746ef9e15d72a489723fee57ff_FILES/requirements.txt pypi
