Science Score: 26.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file: Found codemeta.json file
- ✓ .zenodo.json file: Found .zenodo.json file
- ○ DOI references
- ○ Academic publication links
- ○ Committers with academic emails
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: Low similarity (11.6%) to scientific vocabulary
Keywords
Repository
Python manager for spark-submit jobs
Basic Info
Statistics
- Stars: 10
- Watchers: 1
- Forks: 3
- Open Issues: 0
- Releases: 7
Topics
Metadata Files
README.md
Spark-submit
TL;DR: Python manager for spark-submit jobs
Description
This package allows for submission and management of Spark jobs in Python scripts via Apache Spark's spark-submit functionality.
Installation
The easiest way to install is using pip:
pip install spark-submit
To install from source:
git clone https://github.com/PApostol/spark-submit.git
cd spark-submit
python setup.py install
For usage details, check help(spark_submit).
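For example, the built-in docstrings can be browsed from an interactive session (a minimal sketch):
```
import spark_submit

help(spark_submit)           # package-level overview
help(spark_submit.SparkJob)  # SparkJob constructor and methods
```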
Usage Examples
Spark arguments can either be provided as keyword arguments or as an unpacked dictionary.
Simple example:
```
from spark_submit import SparkJob

app = SparkJob('/path/some_file.py', master='local', name='simple-test')
app.submit()

print(app.get_state())
```
Another example:
```
from spark_submit import SparkJob

spark_args = {
    'master': 'spark://some.spark.master:6066',
    'deploy_mode': 'cluster',
    'name': 'spark-submit-app',
    'class': 'main.Class',
    'executor_memory': '2G',
    'executor_cores': '1',
    'total_executor_cores': '2',
    'verbose': True,
    'conf': ["spark.foo.bar='baz'", "spark.x.y='z'"],
    'main_file_args': '--foo arg1 --bar arg2'
}

app = SparkJob('s3a://bucket/path/some_file.jar', **spark_args)
print(app.get_submit_cmd(multiline=True))

# poll state in the background every x seconds with poll_time=x
app.submit(use_env_vars=True,
           extra_env_vars={'PYTHONPATH': '/some/path/'},
           poll_time=10
           )

print(app.get_state())  # 'SUBMITTED'

while not app.concluded:
    # do other stuff...
    print(app.get_state())  # 'RUNNING'

print(app.get_state())  # 'FINISHED'
```
Examples of translating a spark-submit command into a spark_args dictionary:
A client example:
~/spark_home/bin/spark-submit \
--master spark://some.spark.master:7077 \
--name spark_job_client \
--total-executor-cores 8 \
--executor-cores 4 \
--executor-memory 4G \
--driver-memory 2G \
--py-files /some/utils.zip \
--files /some/file.json \
/path/to/pyspark/file.py --data /path/to/data.csv
becomes
spark_args = {
'master': 'spark://some.spark.master:7077',
'name': 'spark_job_client',
'total_executor_cores': '8',
'executor_cores': '4',
'executor_memory': '4G',
'driver_memory': '2G',
'py_files': '/some/utils.zip',
'files': '/some/file.json',
'main_file_args': '--data /path/to/data.csv'
}
main_file = '/path/to/pyspark/file.py'
app = SparkJob(main_file, **spark_args)
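From here the client-mode job can be submitted and inspected in the same way as the earlier examples (a minimal sketch; the state and output values shown are illustrative):
```
# Submit the client-mode job and inspect the result
app.submit()

print(app.get_state())   # e.g. 'FINISHED'
print(app.get_output())  # spark-submit stdout
```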
A cluster example:
~/spark_home/bin/spark-submit \
--master spark://some.spark.master:6066 \
--deploy-mode cluster \
--name spark_job_cluster \
--jars "s3a://mybucket/some/file.jar" \
--conf "spark.some.conf=foo" \
--conf "spark.some.other.conf=bar" \
--total-executor-cores 16 \
--executor-cores 4 \
--executor-memory 4G \
--driver-memory 2G \
--class my.main.Class \
--verbose \
s3a://mybucket/file.jar "positional_arg1" "positional_arg2"
becomes
spark_args = {
'master': 'spark://some.spark.master:6066',
'deploy_mode': 'cluster',
'name': 'spark_job_cluster',
'jars': 's3a://mybucket/some/file.jar',
'conf': ["spark.some.conf='foo'", "spark.some.other.conf='bar'"], # note the use of quotes
'total_executor_cores': '16',
'executor_cores': '4',
'executor_memory': '4G',
'driver_memory': '2G',
'class': 'my.main.Class',
'verbose': True,
'main_file_args': '"positional_arg1" "positional_arg2"'
}
main_file = 's3a://mybucket/file.jar'
app = SparkJob(main_file, **spark_args)
Testing
You can do some simple testing with local-mode Spark after cloning the repo.
Note that the tests have additional requirements: pip install -r tests/requirements.txt
pytest tests/
python tests/run_integration_test.py
Additional methods
- spark_submit.system_info(): Collects Spark-related system information, such as the versions of spark-submit, Scala, Java, PySpark, Python and the OS
- spark_submit.SparkJob.kill(): Kills the running Spark job (cluster mode only)
- spark_submit.SparkJob.get_code(): Gets the spark-submit return code
- spark_submit.SparkJob.get_output(): Gets the spark-submit stdout
- spark_submit.SparkJob.get_id(): Gets the spark-submit submission ID
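A minimal sketch combining these helpers, assuming a cluster-mode job so that kill() applies; the master URL and file path below are placeholders:
```
import spark_submit

# Spark-related system information (spark-submit, Scala, Java, PySpark, Python, OS)
print(spark_submit.system_info())

app = spark_submit.SparkJob('s3a://bucket/path/some_file.jar',
                            master='spark://some.spark.master:6066',
                            deploy_mode='cluster')
app.submit(poll_time=10)

print(app.get_id())      # spark-submit submission ID
print(app.get_code())    # spark-submit return code
print(app.get_output())  # spark-submit stdout

# kill() only applies to jobs running in cluster mode
if not app.concluded:
    app.kill()
```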
License
Released under MIT by @PApostol.
- You can freely modify and reuse.
- The original license must be included with copies of this software.
- Please link back to this repo if you use a significant portion of the source code.
Owner
- Login: PApostol
- Kind: user
- Location: London, UK
- Repositories: 4
- Profile: https://github.com/PApostol
GitHub Events
Total
- Watch event: 3
Last Year
- Watch event: 3
Issues and Pull Requests
Last synced: 8 months ago
All Time
- Total issues: 0
- Total pull requests: 8
- Average time to close issues: N/A
- Average time to close pull requests: 2 minutes
- Total issue authors: 0
- Total pull request authors: 1
- Average comments per issue: 0
- Average comments per pull request: 0.0
- Merged pull requests: 8
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 0
- Pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Issue authors: 0
- Pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
Pull Request Authors
- PApostol (8)
Top Labels
Issue Labels
Pull Request Labels
Packages
- Total packages: 1
- Total downloads: 387 last-month (pypi)
- Total docker downloads: 116
- Total dependent packages: 0
- Total dependent repositories: 1
- Total versions: 7
- Total maintainers: 1
pypi.org: spark-submit
Python manager for spark-submit jobs
- Homepage: https://github.com/PApostol/spark-submit
- Documentation: https://spark-submit.readthedocs.io/
- License: MIT
- Latest release: 1.4.0 (published almost 3 years ago)
Rankings
Maintainers (1)
Dependencies
- requests *
- pyspark >=3.2.0 development
- pytest >=7.0.0 development
- requests >=2.26.0 development