spark-submit

Python manager for spark-submit jobs

https://github.com/papostol/spark-submit

Science Score: 26.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (11.6%) to scientific vocabulary

Keywords

apache spark submit
Last synced: 6 months ago

Repository

Python manager for spark-submit jobs

Basic Info
  • Host: GitHub
  • Owner: PApostol
  • License: MIT
  • Language: Python
  • Default Branch: master
  • Size: 53.7 KB
Statistics
  • Stars: 10
  • Watchers: 1
  • Forks: 3
  • Open Issues: 0
  • Releases: 7
Topics
apache spark submit
Created over 4 years ago · Last pushed about 2 years ago
Metadata Files
  • Readme
  • Changelog
  • License

README.md

Spark-submit


TL;DR: Python manager for spark-submit jobs

Description

This package allows for submission and management of Spark jobs in Python scripts via Apache Spark's spark-submit functionality.

Installation

The easiest way to install is using pip:

```
pip install spark-submit
```

To install from source:

```
git clone https://github.com/PApostol/spark-submit.git
cd spark-submit
python setup.py install
```

For usage details check help(spark_submit).
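For example, from an interactive session:

```
import spark_submit

help(spark_submit)  # shows the package docstring and the names it exports
```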

Usage Examples

Spark arguments can either be provided as keyword arguments or as an unpacked dictionary.
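For instance, these two calls define the same job (a minimal sketch using the SparkJob constructor from the examples below):

```
from spark_submit import SparkJob

# keyword arguments...
app = SparkJob('/path/some_file.py', master='local', name='simple-test')

# ...or the same settings as an unpacked dictionary
spark_args = {'master': 'local', 'name': 'simple-test'}
app = SparkJob('/path/some_file.py', **spark_args)
```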

Simple example:

```
from spark_submit import SparkJob

app = SparkJob('/path/some_file.py', master='local', name='simple-test')
app.submit()

print(app.get_state())
```

Another example:

```
from spark_submit import SparkJob

spark_args = {
    'master': 'spark://some.spark.master:6066',
    'deploy_mode': 'cluster',
    'name': 'spark-submit-app',
    'class': 'main.Class',
    'executor_memory': '2G',
    'executor_cores': '1',
    'total_executor_cores': '2',
    'verbose': True,
    'conf': ["spark.foo.bar='baz'", "spark.x.y='z'"],
    'main_file_args': '--foo arg1 --bar arg2'
    }

app = SparkJob('s3a://bucket/path/some_file.jar', **spark_args)
print(app.get_submit_cmd(multiline=True))

# poll state in the background every x seconds with poll_time=x
app.submit(use_env_vars=True,
           extra_env_vars={'PYTHONPATH': '/some/path/'},
           poll_time=10
           )

print(app.get_state())  # 'SUBMITTED'

while not app.concluded:
    # do other stuff...
    print(app.get_state())  # 'RUNNING'

print(app.get_state())  # 'FINISHED'
```

Examples of translating a spark-submit command into a spark_args dictionary:

A client example:

```
~/spark_home/bin/spark-submit \
--master spark://some.spark.master:7077 \
--name spark-submit-job \
--total-executor-cores 8 \
--executor-cores 4 \
--executor-memory 4G \
--driver-memory 2G \
--py-files /some/utils.zip \
--files /some/file.json \
/path/to/pyspark/file.py --data /path/to/data.csv
```

becomes

```
spark_args = {
    'master': 'spark://some.spark.master:7077',
    'name': 'spark-submit-job',
    'total_executor_cores': '8',
    'executor_cores': '4',
    'executor_memory': '4G',
    'driver_memory': '2G',
    'py_files': '/some/utils.zip',
    'files': '/some/file.json',
    'main_file_args': '--data /path/to/data.csv'
    }

main_file = '/path/to/pyspark/file.py'
app = SparkJob(main_file, **spark_args)
```

A cluster example:

```
~/spark_home/bin/spark-submit \
--master spark://some.spark.master:6066 \
--deploy-mode cluster \
--name spark_job_cluster \
--jars "s3a://mybucket/some/file.jar" \
--conf "spark.some.conf=foo" \
--conf "spark.some.other.conf=bar" \
--total-executor-cores 16 \
--executor-cores 4 \
--executor-memory 4G \
--driver-memory 2G \
--class my.main.Class \
--verbose \
s3a://mybucket/file.jar "positional_arg1" "positional_arg2"
```

becomes

```
spark_args = {
    'master': 'spark://some.spark.master:6066',
    'deploy_mode': 'cluster',
    'name': 'spark_job_cluster',
    'jars': 's3a://mybucket/some/file.jar',
    'conf': ["spark.some.conf='foo'", "spark.some.other.conf='bar'"],  # note the use of quotes
    'total_executor_cores': '16',
    'executor_cores': '4',
    'executor_memory': '4G',
    'driver_memory': '2G',
    'class': 'my.main.Class',
    'verbose': True,
    'main_file_args': '"positional_arg1" "positional_arg2"'
    }

main_file = 's3a://mybucket/file.jar'
app = SparkJob(main_file, **spark_args)
```
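As the examples show, the translation is mechanical: drop the leading `--`, replace dashes with underscores, and pass values as strings (boolean flags such as `--verbose` become `True`). A hypothetical helper, not part of the package, that captures the key convention:

```
# Hypothetical helper (not part of spark-submit) showing the
# flag-to-key convention used in the examples above.
def to_spark_args_key(flag: str) -> str:
    return flag.lstrip('-').replace('-', '_')

assert to_spark_args_key('--deploy-mode') == 'deploy_mode'
assert to_spark_args_key('--total-executor-cores') == 'total_executor_cores'
```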

Testing

You can do some simple testing with local-mode Spark after cloning the repo.

First, install the additional requirements needed to run the tests:

```
pip install -r tests/requirements.txt
```

Then run the test suite with pytest:

```
pytest tests/
```

Finally, run the integration test:

```
python tests/run_integration_test.py
```
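For illustration, a local-mode smoke test might look like the sketch below; this is hypothetical, not part of the repo's test suite, and assumes spark-submit is available on the PATH:

```
# Hypothetical pytest smoke test (not from the repo's tests/ directory)
from spark_submit import SparkJob


def test_local_submit(tmp_path):
    script = tmp_path / 'job.py'
    script.write_text("print('hello from spark')\n")

    app = SparkJob(str(script), master='local', name='smoke-test')
    app.submit()

    assert app.get_code() == 0  # spark-submit should exit cleanly in local mode
```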

Additional methods

spark_submit.system_info(): Collects Spark-related system information, such as the versions of spark-submit, Scala, Java, PySpark, Python and the OS

spark_submit.SparkJob.kill(): Kills the running Spark job (cluster mode only)

spark_submit.SparkJob.get_code(): Gets the spark-submit return code

spark_submit.SparkJob.get_output(): Gets the spark-submit stdout

spark_submit.SparkJob.get_id(): Gets the spark-submit submission ID
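As a combined sketch of how these methods fit together after a submission (a hypothetical flow; method names are taken from the list above):

```
import spark_submit
from spark_submit import SparkJob

print(spark_submit.system_info())  # spark-submit, Scala, Java, PySpark, Python and OS versions

app = SparkJob('/path/some_file.py', master='local', name='inspect-test')
app.submit()

print(app.get_id())      # submission ID reported by spark-submit
print(app.get_code())    # spark-submit return code
print(app.get_output())  # captured spark-submit stdout
```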

License

Released under MIT by @PApostol.

  • You can freely modify and reuse.
  • The original license must be included with copies of this software.
  • Please link back to this repo if you use a significant portion of the source code.

Owner

  • Login: PApostol
  • Kind: user
  • Location: London, UK

GitHub Events

Total
  • Watch event: 3
Last Year
  • Watch event: 3

Committers

Last synced: 8 months ago

All Time
  • Total Commits: 55
  • Total Committers: 1
  • Avg Commits per committer: 55.0
  • Development Distribution Score (DDS): 0.0
Past Year
  • Commits: 0
  • Committers: 0
  • Avg Commits per committer: 0.0
  • Development Distribution Score (DDS): 0.0
Top Committers
Name Email Commits
PApostol p****s@g****m 55

Issues and Pull Requests

Last synced: 8 months ago

All Time
  • Total issues: 0
  • Total pull requests: 8
  • Average time to close issues: N/A
  • Average time to close pull requests: 2 minutes
  • Total issue authors: 0
  • Total pull request authors: 1
  • Average comments per issue: 0
  • Average comments per pull request: 0.0
  • Merged pull requests: 8
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
Pull Request Authors
  • PApostol (8)

Packages

  • Total packages: 1
  • Total downloads:
    • pypi: 387 last month
  • Total docker downloads: 116
  • Total dependent packages: 0
  • Total dependent repositories: 1
  • Total versions: 7
  • Total maintainers: 1
pypi.org: spark-submit

Python manager for spark-submit jobs

  • Versions: 7
  • Dependent Packages: 0
  • Dependent Repositories: 1
  • Downloads: 387 last month
  • Docker Downloads: 116
Rankings
Docker downloads count: 4.5%
Downloads: 7.8%
Dependent packages count: 10.0%
Average: 13.4%
Forks count: 16.8%
Stargazers count: 19.3%
Dependent repos count: 21.7%
Maintainers (1)
Last synced: 7 months ago

Dependencies

setup.py pypi
  • requests *
requirements-dev.txt pypi
  • pyspark >=3.2.0 development
  • pytest >=7.0.0 development
  • requests >=2.26.0 development