https://github.com/awslabs/amazon-emr-vscode-toolkit

A VS Code Extension to make it easier to manage and develop Spark jobs on EMR

Science Score: 13.0%

This score indicates how likely this project is to be science-related based on various indicators:

○ CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
○ .zenodo.json file
○ DOI references
○ Academic publication links
○ Academic email domains
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (11.0%) to scientific vocabulary

Keywords

amazon-emr apache-spark pyspark python

Last synced: 9 months ago

Repository

A VS Code Extension to make it easier to manage and develop Spark jobs on EMR

Basic Info

Host: GitHub
Owner: awslabs
License: apache-2.0
Language: TypeScript
Default Branch: main
Homepage: https://marketplace.visualstudio.com/items?itemName=AmazonEMR.emr-tools
Size: 907 KB

Statistics

Stars: 37
Watchers: 6
Forks: 5
Open Issues: 17
Releases: 2

Topics

amazon-emr apache-spark pyspark python

Created over 3 years ago · Last pushed over 1 year ago

Metadata Files

Readme Changelog Contributing License Code of conduct

Amazon EMR Toolkit for VS Code (Developer Preview)

EMR Toolkit is a VS Code Extension to make it easier to develop Spark jobs on EMR.

Requirements

A local AWS profile
Access to the AWS API to list EMR and Glue resources
Docker (if you want to use the devcontainer)

Features

Amazon EMR Explorer
Glue Data Catalog Explorer
EMR Development Container
- Spark shell support
- Jupyter Notebook support
EMR Serverless Deployment

Amazon EMR Explorer

The Amazon EMR Explorer allows you to browse job runs and steps across EMR on EC2, EMR on EKS, and EMR Serverless. To see the Explorer, choose the EMR icon in the Activity bar.

Note: If you do not have default AWS credentials or the AWS_PROFILE environment variable set, use the EMR: Select AWS Profile command to select your profile.
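To see which profiles are available before choosing one, you can read the AWS shared config file yourself. The sketch below is an illustration only, not the extension's own logic (the extension is written in TypeScript). It assumes the standard config layout, where the default profile is declared as `[default]` and named profiles as `[profile name]`. The function name `list_aws_profiles` is hypothetical.

```python
import configparser
from pathlib import Path


def list_aws_profiles(config_path: Path) -> list[str]:
    """Return the profile names declared in an AWS shared config file."""
    parser = configparser.ConfigParser()
    parser.read(config_path)
    profiles = []
    for section in parser.sections():
        # Named profiles appear as "[profile name]"; the default profile
        # appears as "[default]". Other sections (for example
        # "[sso-session ...]") are not profiles and are skipped.
        if section == "default":
            profiles.append("default")
        elif section.startswith("profile "):
            profiles.append(section[len("profile "):].strip())
    return profiles


# Typical call, assuming the config file is in its usual location:
# list_aws_profiles(Path.home() / ".aws" / "config")
```

Any name returned here should also be usable as the value of AWS_PROFILE.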

Glue Catalog Explorer

The Glue Catalog Explorer displays databases and tables in the Glue Data Catalog. Right-click a table and select View Glue Table to show the table's columns.
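The same column information is available outside VS Code through the Glue `GetTable` API. The following is a minimal sketch of pulling column names and types out of a `GetTable` response. It is not necessarily how the extension does it. The helper name `table_columns` is hypothetical, and appending partition keys after the regular columns is an assumption about what is useful to display.

```python
def table_columns(get_table_response: dict) -> list[tuple[str, str]]:
    """Extract (name, type) pairs from a Glue GetTable response."""
    table = get_table_response["Table"]
    columns = table.get("StorageDescriptor", {}).get("Columns", [])
    # Glue reports partition keys separately from regular columns.
    partition_keys = table.get("PartitionKeys", [])
    return [(col["Name"], col.get("Type", "")) for col in columns + partition_keys]


# With boto3 installed and credentials configured, the response could
# come from a call like this (database and table names are placeholders):
#   import boto3
#   response = boto3.client("glue").get_table(DatabaseName="my_db", Name="my_table")
#   for name, col_type in table_columns(response):
#       print(name, col_type)
```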

PySpark EMR Development Container

The toolkit provides an EMR: Create local Spark environment command that creates a development container based on an EMR on EKS image for the EMR version you choose. You can use this container to develop Spark and PySpark code locally that is fully compatible with your remote EMR environment.

You choose a region and EMR version you want to use, and the extension creates the relevant Dockerfile and devcontainer.json.

Once the container is created, follow the instructions in the emr-local.md file to authenticate to ECR, then use the Dev Containers: Reopen in Container command to build and open your local Spark environment.

You can configure AWS authentication in the container in one of three ways:

Use existing ~/.aws config - This mounts your ~/.aws directory to the container.
Environment variables - If you already have AWS environment variables configured in your shell, the container will reference those variables.
.env file - Creates a .devcontainer/aws.env file that you can populate with AWS credentials.
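For the .env file option, the file has to be filled in by hand. As a rough sketch, the helper below renders `KEY=value` lines from AWS variables already set in your shell. The variable names are the standard AWS SDK ones, but which of them the generated aws.env expects is an assumption, and `render_aws_env` is a hypothetical name.

```python
import os

# Standard AWS SDK credential variables. Whether the generated aws.env
# uses exactly this set is an assumption.
AWS_ENV_VARS = (
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
    "AWS_SESSION_TOKEN",
    "AWS_REGION",
)


def render_aws_env(environ=os.environ) -> str:
    """Build KEY=value lines for the AWS variables that are set and non-empty."""
    lines = [f"{name}={environ[name]}" for name in AWS_ENV_VARS if environ.get(name)]
    return "\n".join(lines) + ("\n" if lines else "")


# Example: write the result to the file the extension creates.
#   with open(".devcontainer/aws.env", "w") as f:
#       f.write(render_aws_env())
```

Because this file holds credentials, keep it out of version control.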

Spark Shell Support

The EMR Development Container is configured to run Spark in local mode. You can use it like any Spark-enabled environment. Inside the VS Code Terminal, you can use the pyspark or spark-shell commands to start a local Spark session.

Jupyter Notebook Support

By default, the EMR Development Container also supports Jupyter. Use the Create: New Jupyter Notebook command to create a new Jupyter notebook. The following code snippet shows how to initialize a Spark Session inside the notebook. By default, the Container environment is also configured to use the Glue Data Catalog so you can use spark.sql commands against Glue tables.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("EMRLocal")
    .getOrCreate()
)
```
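With a session in place, Glue tables can be queried through `spark.sql`. The helper below is a small sketch of previewing a table. `preview_glue_table` is a hypothetical name, and the database and table names in the usage line are placeholders. It takes the session as an argument, so it works with the `spark` object created in the notebook.

```python
def preview_glue_table(spark, database: str, table: str, limit: int = 10):
    """Return a DataFrame with the first rows of a Glue Data Catalog table."""
    if limit < 1:
        raise ValueError("limit must be a positive integer")
    # Backticks quote identifiers in Spark SQL; an embedded backtick is
    # escaped by doubling it.
    db = database.replace("`", "``")
    tbl = table.replace("`", "``")
    return spark.sql(f"SELECT * FROM `{db}`.`{tbl}` LIMIT {int(limit)}")


# In a notebook cell, after creating `spark`:
#   preview_glue_table(spark, "my_database", "my_table").show()
```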

EMR Serverless Deployment

You can deploy and run a single PySpark file on EMR Serverless with the EMR Serverless: Deploy and run PySpark job command. You'll be prompted for the following information:

S3 URI - Your PySpark file will be copied here
IAM Role - A job runtime role that can be used to run your EMR Serverless job
EMR Serverless Application ID - The ID of an existing EMR Serverless Spark application
Filename - The name of the local PySpark file you want to run on EMR Serverless
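To show how these four inputs relate to the EMR Serverless `StartJobRun` API, here is a sketch that assembles the request parameters in the shape boto3's `emr-serverless` client accepts. It is an illustration, not the extension's implementation. In particular, it assumes the file is uploaded directly under the S3 URI with its local base name. `build_start_job_run_request` is a hypothetical helper, and all values in the usage comment are placeholders.

```python
import os


def build_start_job_run_request(
    s3_uri: str, role_arn: str, application_id: str, filename: str
) -> dict:
    """Assemble StartJobRun parameters for a single PySpark file."""
    # Assumption: the local file is copied to <s3_uri>/<basename>.
    entry_point = f"{s3_uri.rstrip('/')}/{os.path.basename(filename)}"
    return {
        "applicationId": application_id,
        "executionRoleArn": role_arn,
        "jobDriver": {"sparkSubmit": {"entryPoint": entry_point}},
    }


# With boto3 installed and the file already uploaded to S3:
#   import boto3
#   request = build_start_job_run_request(
#       "s3://my-bucket/code/", "arn:aws:iam::123456789012:role/my-job-role",
#       "my-application-id", "jobs/etl.py",
#   )
#   boto3.client("emr-serverless").start_job_run(**request)
```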

https://user-images.githubusercontent.com/1512/195953681-4e7e7102-4974-45b1-a695-195e91d45124.mp4

Future Considerations

Allow for the ability to select different profiles
Persist state (region selection)
Create a Java environment
Automate deployments to EMR
- Create virtualenv and upload to S3
- Pack pom into jar file
Link to open logs in S3 or CloudWatch
Testing :) https://vscode.rocks/testing/

Feedback Notes

I'm looking for feedback in a few different areas:

How do you use Spark on EMR today?
- EMR on EC2, EMR on EKS, or EMR Serverless
- PySpark, Scala Spark, or SparkSQL
Does the tool work as expected for browsing your EMR resources?
Do you find the devcontainer useful for local development?
What functionality is missing that you would like to see?

Security

See CONTRIBUTING for more information.

License

This project is licensed under the Apache-2.0 License.

Owner

Name: Amazon Web Services - Labs
Login: awslabs
Kind: organization
Location: Seattle, WA

Website: http://amazon.com/aws/
Repositories: 914
Profile: https://github.com/awslabs

AWS Labs

GitHub Events

Total

Create event: 4
Issues event: 4
Watch event: 8
Delete event: 3
Issue comment event: 5
Member event: 1
Push event: 7
Pull request review event: 3
Pull request review comment event: 1
Pull request event: 5
Fork event: 1

Last Year

Create event: 4
Issues event: 4
Watch event: 8
Delete event: 3
Issue comment event: 5
Member event: 1
Push event: 7
Pull request review event: 3
Pull request review comment event: 1
Pull request event: 5
Fork event: 1

Issues and Pull Requests

Last synced: 9 months ago

All Time

Total issues: 25
Total pull requests: 18
Average time to close issues: about 2 months
Average time to close pull requests: 6 days
Total issue authors: 11
Total pull request authors: 5
Average comments per issue: 1.2
Average comments per pull request: 0.33
Merged pull requests: 16
Bot issues: 0
Bot pull requests: 2

Past Year

Issues: 2
Pull requests: 2
Average time to close issues: N/A
Average time to close pull requests: about 7 hours
Issue authors: 2
Pull request authors: 1
Average comments per issue: 0.0
Average comments per pull request: 0.0
Merged pull requests: 2
Bot issues: 0
Bot pull requests: 0


Top Authors

Issue Authors

dacort (12)
vishalovercome (2)
ramondiez (2)
lmouhib (1)
btlem (1)
monometa (1)
jlafaye (1)
dgilmanAIDENTIFIED (1)
corvuslee (1)
nihakue (1)
mohanmane-a (1)
Jeppefs (1)

Pull Request Authors

dacort (9)
lmouhib (7)
dependabot[bot] (2)
arunsathiya (2)
dabrun (1)
