https://github.com/dadananjesha/redshift-etl-project

The project covers the complete data pipeline: importing data from an RDS source to HDFS using Sqoop, processing the data with Spark, and executing analytical queries on an AWS Redshift cluster.

Science Score: 13.0%

This score indicates how likely this project is to be science-related based on various indicators:

○ CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
○ .zenodo.json file
○ DOI references
○ Academic publication links
○ Academic email domains
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (8.1%) to scientific vocabulary

Keywords

apache-spark aws data-engineering-etl-assignment data-ingestion data-pipeline etl-processes hdfs rds redshift spark sqoop

Last synced: 10 months ago

Repository

Basic Info

Host: GitHub
Owner: DadaNanjesha
License: mit
Language: Jupyter Notebook
Default Branch: main
Homepage:
Size: 833 KB

Statistics

Stars: 1
Watchers: 1
Forks: 0
Open Issues: 0
Releases: 0

Topics

apache-spark aws data-engineering-etl-assignment data-ingestion data-pipeline etl-processes hdfs rds redshift spark sqoop

Created over 1 year ago · Last pushed over 1 year ago

Metadata Files

Readme License

Redshift-ETL-Project 🚀🔧

Data Engineering ETL Project demonstrates data ingestion, ETL processing, and analytical querying using AWS Redshift, Apache Spark, and Sqoop. It covers the complete data pipeline: importing data from an RDS source to HDFS using Sqoop, processing the data with Spark, and executing analytical queries on an AWS Redshift cluster.

📖 Overview

This project is designed to showcase a real-world ETL workflow for a data engineering assignment:
- Data Ingestion: Import data from an RDS (MySQL) database to HDFS using Sqoop.
- ETL Processing: Use Apache Spark for data transformation and loading.
- Analytical Queries: Execute complex analytical queries on an AWS Redshift cluster to derive insights.

The provided documents include detailed Redshift queries, cluster setup screenshots, Spark ETL code, and Sqoop data ingestion commands.
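The actual Sqoop commands live in SqoopDataIngestion.pdf and are not reproduced here. As a rough illustration of the ingestion step described above, the sketch below assembles a `sqoop import` command line in Python. The host, database, user, table, and HDFS paths are all hypothetical placeholders, and the helper name `build_sqoop_import` is introduced only for this example.

```python
import shlex


def build_sqoop_import(jdbc_url, username, password_file, table, target_dir, num_mappers=1):
    """Return the argument list for a `sqoop import` of one table into HDFS."""
    return [
        "sqoop", "import",
        "--connect", jdbc_url,
        "--username", username,
        # A password file keeps the password off the command line.
        "--password-file", password_file,
        "--table", table,
        "--target-dir", target_dir,
        "--num-mappers", str(num_mappers),
    ]


# Every value below is a placeholder, not taken from the repository.
cmd = build_sqoop_import(
    jdbc_url="jdbc:mysql://example-rds-host:3306/exampledb",
    username="example_user",
    password_file="/user/hadoop/.sqoop_password",
    table="example_table",
    target_dir="/user/hadoop/example_table",
)

# Print a shell-safe version of the command for review before running it.
print(shlex.join(cmd))
```

Building the command as a list rather than a single string means it can be passed to `subprocess.run(cmd, check=True)` on a machine where Sqoop is installed, without shell quoting issues.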

🛠️ Technologies & Tools

🔄 Data Flow Diagram

flowchart TD
    A[🗄️ RDS - MySQL] --> B[📥 Sqoop Import]
    B --> C[📁 HDFS]
    C --> D[🔄 Spark ETL Processing]
    D --> E[📤 Data Load]
    E --> F[AWS Redshift]
    F --> G[🔍 Analytical Queries]

🗂️ Project Structure

DataEngineeringETL/
├── RedshiftQueries.pdf      # PDF containing analytical queries for the Redshift cluster
├── RedshiftSetup.pdf        # PDF with screenshots and details on setting up the Redshift cluster
├── SparkETLCode.ipynb       # Jupyter Notebook with Spark ETL code and transformation logic
├── SqoopDataIngestion.pdf   # PDF outlining the Sqoop import commands and HDFS data inspection
└── README.md                # Project documentation (this file)

💻 Setup & Deployment

Prerequisites

AWS Account: For setting up Redshift and S3.
RDS MySQL Instance: Source of data.
Hadoop Cluster: For HDFS (local or cloud-based).
Apache Sqoop & Spark: Installed on your data processing cluster.
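Before starting, it can help to confirm that the command-line tools listed above are reachable from the machine you plan to work on. The following is a minimal sketch using only the Python standard library. The executable names (`hadoop`, `sqoop`, `spark-submit`) are assumptions based on typical installations and may differ on your cluster.

```python
import shutil


def find_missing_tools(tools):
    """Return the names from `tools` that are not found on the current PATH."""
    return [name for name in tools if shutil.which(name) is None]


# Executable names are assumptions; adjust them to match your environment.
REQUIRED_TOOLS = ["hadoop", "sqoop", "spark-submit"]

missing = find_missing_tools(REQUIRED_TOOLS)
if missing:
    print("Not found on PATH:", ", ".join(missing))
else:
    print("All required command-line tools were found.")
```

This only checks that the executables exist on the PATH. It does not verify versions, cluster connectivity, or AWS credentials.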

Setup Steps

Data Ingestion with Sqoop:
- Use the Sqoop commands detailed in SqoopDataIngestion.pdf to import tables from RDS into HDFS.
- Verify the data import using HDFS shell commands (for example, hadoop fs -ls on the target directory).
ETL Processing with Spark:
- Open SparkETLCode.ipynb in Jupyter Notebook.
- Follow the ETL workflow to clean, transform, and load data.
Redshift Cluster Setup:
- Follow the guidelines in RedshiftSetup.pdf to create a Redshift cluster and configure databases/tables.
- Execute the SQL queries from RedshiftQueries.pdf on the AWS Redshift Query Editor.
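The notebook SparkETLCode.ipynb contains the project's real transformation logic. For readers who want a feel for the "ETL Processing with Spark" step before opening it, here is a minimal PySpark sketch under stated assumptions: the input is comma-delimited text without a header row (Sqoop's default text output), the output is Parquet, and the cleaning rules (dropping duplicate rows and fully empty rows) are illustrative only. The function names, paths, and column names are hypothetical.

```python
def normalize_column_names(names):
    """Lower-case names and replace spaces or hyphens with underscores."""
    return [n.strip().lower().replace(" ", "_").replace("-", "_") for n in names]


def run_etl(input_path, output_path, column_names):
    """Read Sqoop-imported text files, clean them, and write Parquet output."""
    # Imported here so this module can be loaded on machines without PySpark.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("example-etl").getOrCreate()

    # The files are assumed to have no header row, so column names are
    # applied explicitly; their count must match the number of columns.
    raw = spark.read.csv(input_path, header=False, inferSchema=True)
    named = raw.toDF(*normalize_column_names(column_names))

    # Illustrative cleaning: remove exact duplicates and rows that are all null.
    cleaned = named.dropDuplicates().na.drop(how="all")

    cleaned.write.mode("overwrite").parquet(output_path)
    spark.stop()


# Example call (requires PySpark and a reachable HDFS; values are placeholders):
# run_etl("hdfs:///user/hadoop/example_table",
#         "hdfs:///user/hadoop/example_table_clean",
#         ["Order ID", "order-date", "amount"])
```

Writing Parquet is one possible choice here, not necessarily what the notebook does. It keeps column types intact and is a format Redshift can load directly.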

🚀 Usage

Run ETL:
Execute the Spark ETL Notebook (SparkETLCode.ipynb) to process and prepare data.
Load & Query Data:
Load the transformed data into Redshift and run analytical queries to generate insights.
Review Documentation:
Refer to the PDF files for detailed instructions on Redshift setup, query execution, and Sqoop data ingestion.
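The README does not spell out how the transformed data reaches Redshift. One common route, assumed here because S3 appears in the prerequisites, is to stage the Spark output in S3 and load it with Redshift's `COPY` command. The sketch below only builds the SQL text. The table name, bucket path, and IAM role ARN are hypothetical, and `build_copy_statement` is a helper invented for this example.

```python
def build_copy_statement(table, s3_path, iam_role_arn):
    """Return a Redshift COPY statement that loads Parquet files from S3.

    `table` is assumed to be a trusted identifier chosen by the operator.
    """
    for value in (s3_path, iam_role_arn):
        if "'" in value:
            raise ValueError("single quotes are not allowed in COPY parameters")
    return (
        f"COPY {table}\n"
        f"FROM '{s3_path}'\n"
        f"IAM_ROLE '{iam_role_arn}'\n"
        "FORMAT AS PARQUET;"
    )


# Placeholder values only; substitute your own table, bucket, and role.
statement = build_copy_statement(
    table="example_schema.example_table",
    s3_path="s3://example-bucket/etl-output/example_table/",
    iam_role_arn="arn:aws:iam::123456789012:role/ExampleRedshiftRole",
)
print(statement)
```

The printed statement can be pasted into the Redshift Query Editor. The target table must already exist with columns matching the Parquet files.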

⭐️ Call-to-Action

If you find this project useful, please consider:
- Starring the repository ⭐
- Forking to contribute improvements or customizations
- Following for updates on similar data engineering projects

Your engagement is greatly appreciated and helps boost visibility!

📜 License

This project is licensed under the MIT License.

🙏 Acknowledgements

AWS & Azure: For providing robust cloud infrastructure.
Data Engineering Community: For continuous inspiration and support.

Happy Data Engineering! 🚀🔧

Owner

Name: DADA NANJESHA
Login: DadaNanjesha
Kind: user
Location: BERLIN

Repositories: 1
Profile: https://github.com/DadaNanjesha

GitHub Events

Total

Watch event: 1
Push event: 4
Pull request event: 2
Create event: 3

Last Year

Watch event: 1
Push event: 4
Pull request event: 2
Create event: 3

Issues and Pull Requests

Last synced: over 1 year ago

All Time

Total issues: 0
Total pull requests: 1
Average time to close issues: N/A
Average time to close pull requests: less than a minute
Total issue authors: 0
Total pull request authors: 1
Average comments per issue: 0
Average comments per pull request: 0.0
Merged pull requests: 1
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 0
Pull requests: 1
Average time to close issues: N/A
Average time to close pull requests: less than a minute
Issue authors: 0
Pull request authors: 1
Average comments per issue: 0
Average comments per pull request: 0.0
Merged pull requests: 1
Bot issues: 0
Bot pull requests: 0
