https://github.com/big-data-lab-umbc/reproducible_and_portable_app_in_cloud
A toolkit to deploy, execute, analyze, and reproduce big data analytics automatically in the cloud.
Science Score: 36.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file: found codemeta.json file
- ○ .zenodo.json file
- ✓ DOI references: found 2 DOI reference(s) in README
- ✓ Academic publication links: links to zenodo.org
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (12.6%) to scientific vocabulary
Repository
Statistics
- Stars: 6
- Watchers: 2
- Forks: 6
- Open Issues: 6
- Releases: 1
Metadata Files
README.md
Reproducible and Portable Big Data Analytics in Cloud
Introduction
We implement the Reproducible and Portable big data Analytics in the Cloud (RPAC) Toolkit, which helps deploy, execute, analyze, and reproduce big data analytics automatically in the cloud.
Abstract
Cloud computing has become a major approach to enable reproducible computational experiments because of its support of on-demand hardware and software resource provisioning. Yet there are still two main difficulties in reproducing big data applications in the cloud. The first is how to automate end-to-end execution of big data analytics in the cloud, including virtual distributed environment provisioning, network and security group setup, and big data analytics pipeline description and execution. The second is that an application developed for one cloud, such as AWS or Azure, is difficult to reproduce in another cloud, a.k.a. the vendor lock-in problem. To tackle these problems, we leverage serverless computing and containerization techniques for automatic scalable big data application execution and reproducibility, and utilize the adapter design pattern to enable application portability and reproducibility across different clouds. Based on this approach, we propose and develop an open-source toolkit that supports 1) on-demand distributed hardware and software environment provisioning, 2) automatic data and configuration storage for each execution, 3) flexible client modes based on user preferences, 4) execution history query, and 5) simple reproducibility of existing executions in the same environment or a different environment. We conducted extensive experiments on both AWS and Azure using three big data analytics applications that run on a virtual CPU/GPU cluster. Three main behaviors of our toolkit were benchmarked: i) execution overhead ratio for reproducibility support, ii) differences of reproducing the same application on AWS and Azure in terms of execution time, budgetary cost, and cost-performance ratio, and iii) differences between scale-out and scale-up approaches for the same application on AWS and Azure.
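To illustrate the adapter design pattern mentioned in the abstract, here is a minimal sketch. It is not RPAC's actual code: the names `CloudAdapter`, `AwsAdapter`, `AzureAdapter`, `provision_cluster`, `store_history`, and `run_pipeline` are hypothetical, and the method bodies return placeholder strings instead of calling any cloud SDK.

```python
from abc import ABC, abstractmethod


class CloudAdapter(ABC):
    """Hypothetical common interface that the pipeline code depends on."""

    @abstractmethod
    def provision_cluster(self, nodes):
        """Provision a virtual cluster and return an identifier for it."""

    @abstractmethod
    def store_history(self, execution_id):
        """Return the location where the execution history would be stored."""


class AwsAdapter(CloudAdapter):
    def provision_cluster(self, nodes):
        # A real adapter would call AWS services here; this is a placeholder.
        return "aws-cluster-{}".format(nodes)

    def store_history(self, execution_id):
        # "example-bucket" is a made-up bucket name.
        return "s3://example-bucket/{}".format(execution_id)


class AzureAdapter(CloudAdapter):
    def provision_cluster(self, nodes):
        # A real adapter would call Azure services here; this is a placeholder.
        return "azure-cluster-{}".format(nodes)

    def store_history(self, execution_id):
        # Placeholder location string, not a real Azure URI format.
        return "azure-example-container/{}".format(execution_id)


def run_pipeline(cloud, execution_id, nodes=2):
    """Cloud-agnostic driver: it only talks to the CloudAdapter interface."""
    cluster = cloud.provision_cluster(nodes)
    history = cloud.store_history(execution_id)
    return cluster, history
```

Because `run_pipeline` depends only on the shared interface, switching clouds in this sketch means passing a different adapter object, which is the portability idea the abstract describes.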
Prerequisite
RPAC currently supports Python>=3.6.
Install Dependencies
```bash
pip3 install configparser
```
Optional dependencies, only for AWS-based execution. Commands for other operating systems are in the AWS User Guide:
```bash
pip3 install aws-sam-cli
curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip" && unzip awscliv2.zip && sudo ./aws/install
```
Optional dependencies, only for Azure-based execution:
```bash
curl -sL https://aka.ms/InstallAzureCLIDeb | sudo bash
```
Set your credentials and other configurations for cloud platforms:
- For AWS: `aws configure set aws_access_key_id <aws_access_key> && aws configure set aws_secret_access_key <aws_secret_key> && aws configure set default.region us-west-2`
- For Azure: `az login`
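Before running RPAC, it can help to confirm that the prerequisites above are in place. The sketch below is not part of the toolkit; the helper names `missing_tools` and `python_is_supported` are hypothetical. It assumes the installs above put the usual executables on your `PATH`: `aws` (AWS CLI), `sam` (AWS SAM CLI), and `az` (Azure CLI).

```python
import shutil
import sys

# Executables expected on PATH for each cloud (assumption based on the
# install commands above).
REQUIRED_TOOLS = {"aws": ["aws", "sam"], "azure": ["az"]}


def missing_tools(cloud, which=shutil.which):
    """Return the required executables for `cloud` that are not on PATH."""
    return [tool for tool in REQUIRED_TOOLS[cloud] if which(tool) is None]


def python_is_supported(version_info=sys.version_info):
    """RPAC states that it supports Python >= 3.6."""
    return tuple(version_info[:2]) >= (3, 6)


if __name__ == "__main__":
    print("Python supported:", python_is_supported())
    for cloud in sorted(REQUIRED_TOOLS):
        print(cloud, "missing tools:", missing_tools(cloud))
```

An empty list for a cloud means all of its command-line tools were found. It does not verify that credentials have been configured.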
Usage
`main.py`: To perform RPAC of an experiment on a cloud.
```
usage: python3 main.py [-h] [--execution_history EXECUTION_HISTORY] [--one_click] [--terminate]

RPAC Toolkit.

optional arguments:
  -h, --help            show this help message and exit
  --execution_history EXECUTION_HISTORY
                        Folder name of execution history to reproduce, or URI
                        of execution history.
  --one_click           Allow one_click execution to be used by RPAC, implies
                        '--one_click'. Note this argument will terminate all
                        cloud resources after execution finished.
  --terminate           Terminate all cloud resources, implies '--terminate'.
```
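For readers who want to see how such a command line maps to code, the help text above could be produced by an `argparse` parser along these lines. This is a reconstruction from the help output only, not the source of `main.py`; the function name `build_parser` is hypothetical.

```python
import argparse


def build_parser():
    """Build a parser that mirrors the documented RPAC options (sketch)."""
    parser = argparse.ArgumentParser(
        prog="python3 main.py", description="RPAC Toolkit."
    )
    parser.add_argument(
        "--execution_history",
        help="Folder name of execution history to reproduce, "
             "or URI of execution history.",
    )
    parser.add_argument(
        "--one_click",
        action="store_true",
        help="Allow one_click execution to be used by RPAC. Note this "
             "argument will terminate all cloud resources after "
             "execution finished.",
    )
    parser.add_argument(
        "--terminate",
        action="store_true",
        help="Terminate all cloud resources.",
    )
    return parser


if __name__ == "__main__":
    # Parse the real command line and show the resulting namespace.
    print(build_parser().parse_args())
```

In this sketch, both `--one_click` and `--terminate` are plain boolean switches, and `--execution_history` takes a single string value.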
To use the RPAC toolkit, make the following changes to your configuration:
- Update the configuration files `resource.ini`, `application.ini`, and `personal.ini` in the `ConfigTemplate` folder.
  - For `resource.ini`: `reproduce_storage` is the S3 bucket name, which will store all reproduction history files. You need to create your bucket before running RPAC. We recommend a name containing only lowercase letters, numbers, and hyphens (-). The detailed bucket naming rules can be found here.
  - For `personal.ini`: `cloud_credentials` is the key:value pair of your cloud credentials (Access key ID:Secret key ID). To find your credentials, see here.
- Run `python3 main.py` to execute the big data analytics.
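The two settings described above can be sanity-checked before a run. The following sketch uses `configparser` (the dependency installed earlier) and is not part of RPAC. The section names in the example (`[storage]`, `[cloud]`) and the helper names `find_option` and `check_settings` are assumptions, since the layout of the template files is not shown here. For that reason the helper searches every section for the option name.

```python
import configparser
import re

# Recommended bucket characters: lowercase letters, numbers, and hyphens.
BUCKET_RE = re.compile(r"^[a-z0-9-]+$")


def find_option(parser, option):
    """Return the first value of `option` found in any section, else None."""
    for section in parser.sections():
        if parser.has_option(section, option):
            return parser.get(section, option)
    return None


def check_settings(resource_text, personal_text):
    """Return a list of problems found in the two INI texts (empty if none)."""
    # interpolation=None keeps '%' characters in secrets from being expanded.
    resource = configparser.ConfigParser(interpolation=None)
    resource.read_string(resource_text)
    personal = configparser.ConfigParser(interpolation=None)
    personal.read_string(personal_text)

    problems = []
    bucket = find_option(resource, "reproduce_storage")
    if not bucket or not BUCKET_RE.match(bucket):
        problems.append(
            "reproduce_storage should use only lowercase letters, "
            "numbers, and hyphens"
        )
    credentials = find_option(personal, "cloud_credentials")
    if not credentials or ":" not in credentials:
        problems.append("cloud_credentials should look like <key ID>:<secret>")
    return problems


if __name__ == "__main__":
    # Hypothetical file contents; real templates live in ConfigTemplate.
    resource_ini = "[storage]\nreproduce_storage = my-rpac-history\n"
    personal_ini = "[cloud]\ncloud_credentials = EXAMPLEKEYID:exampleSecret\n"
    print(check_settings(resource_ini, personal_ini))
```

The check covers only the recommended character set, not the full S3 bucket naming rules, and it does not verify that the bucket exists.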
Example usage:
- `python3 main.py --one_click`
- `python3 main.py --execution_history 0ec2088f-a3b8-4730-8e76-cac2015c74df --one_click`
- `python3 main.py --execution_history s3://aws-sam-cli-managed-default-samclisourcebucket-xscicpwnc0z3/a57f212d-c7c3-46eb-ace4-d62bb6b294f6 --one_click`
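The examples show that `--execution_history` accepts either a folder name or an S3 URI. How RPAC tells the two apart internally is not documented here. The sketch below is one plausible way to do it with the standard library, and the function name `describe_history` is hypothetical.

```python
from urllib.parse import urlparse


def describe_history(value):
    """Classify an --execution_history value as an S3 URI or a folder name."""
    parsed = urlparse(value)
    if parsed.scheme == "s3":
        # s3://<bucket>/<execution id>
        return {
            "kind": "uri",
            "bucket": parsed.netloc,
            "execution_id": parsed.path.strip("/"),
        }
    # Anything else is treated as a local folder name (assumption).
    return {"kind": "folder", "execution_id": value}


if __name__ == "__main__":
    print(describe_history("0ec2088f-a3b8-4730-8e76-cac2015c74df"))
    print(describe_history(
        "s3://aws-sam-cli-managed-default-samclisourcebucket-xscicpwnc0z3/"
        "a57f212d-c7c3-46eb-ace4-d62bb6b294f6"
    ))
```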
For a closer look, please refer to our demo.
Getting Started
Four pointers for experimental execution to get you started:
- First execution: run your first execution to understand and use the RPAC toolkit
- Examples: easy-to-understand RPAC examples across three applications
- Reproduce: reproduce an existing execution with RPAC
- New application: create your own applications with RPAC
End-to-end execution is also provided in RPAC:
- End-to-end: one-click execution with RPAC
Citation
If you use this code for your research, please cite our paper:
@article{wang2023reproducible,
title={Reproducible and Portable Big Data Analytics in the Cloud},
author={Wang, Xin and Guo, Pei and Li, Xingyan and Gangopadhyay, Aryya and Busart, Carl and Freeman, Jade and Wang, Jianwu},
journal={IEEE Transactions on Cloud Computing},
year={2023},
publisher={IEEE}
}
Owner
- Name: Big Data Analytics Lab @ UMBC
- Login: big-data-lab-umbc
- Kind: organization
- Location: University of Maryland, Baltimore County
- Website: https://bdal.umbc.edu/
- Twitter: jianwuwang
- Repositories: 5
- Profile: https://github.com/big-data-lab-umbc
GitHub Events
Total
- Fork event: 1
Last Year
- Fork event: 1
Dependencies
- azure-functions *
- azure-cli-core *
- azure-functions *
- azure-identity *
- azure-mgmt-compute *
- azure-mgmt-resource *
- azure-functions *
- AzureCLI ==v2.26.0
- AzureFunctionsCoreTools ==v3.x.
- dotnet-runtimes ==Microsoft.NETCore.App==v5.0.7
- dotnet-runtimes ==Microsoft.AspNetCore.App==v5.0.7
- amazoncorretto 8 build
- ubuntu 20.04 build
- nvidia/cuda 10.1-cudnn7-devel-ubuntu16.04 build
- nvidia/cuda 10.1-cudnn7-devel-ubuntu16.04 build
- ubuntu 20.04 build
- nvidia/cuda 10.1-cudnn7-devel-ubuntu18.04 build
- ubuntu latest build
- nvidia/cuda 10.1-cudnn7-devel-ubuntu18.04 build