https://github.com/aspirina765/awesome-spark
A curated list of awesome Apache Spark packages and resources.
Statistics
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
- Releases: 0
Fork of awesome-spark/awesome-spark
Created over 3 years ago
· Last pushed over 3 years ago
https://github.com/aspirina765/awesome-spark/blob/main/
# Awesome Spark

A curated list of awesome [Apache Spark](https://spark.apache.org/) packages and resources.

_Apache Spark is an open-source cluster-computing framework. Originally developed at the [University of California](https://www.universityofcalifornia.edu/), [Berkeley's AMPLab](https://amplab.cs.berkeley.edu/), the Spark codebase was later donated to the [Apache Software Foundation](https://www.apache.org/), which has maintained it since. Spark provides an interface for programming entire clusters with implicit data parallelism and fault-tolerance_ ([Wikipedia 2017](#wikipedia-2017)).

Users of Apache Spark may choose between the Python, R, Scala and Java programming languages to interface with the Apache Spark APIs.

## Contents

- [Packages](#packages)
  - [Language Bindings](#language-bindings)
  - [Notebooks and IDEs](#notebooks-and-ides)
  - [General Purpose Libraries](#general-purpose-libraries)
  - [SQL Data Sources](#sql-data-sources)
  - [Storage](#storage)
  - [Bioinformatics](#bioinformatics)
  - [GIS](#gis)
  - [Time Series Analytics](#time-series-analytics)
  - [Graph Processing](#graph-processing)
  - [Machine Learning Extension](#machine-learning-extension)
  - [Middleware](#middleware)
  - [Monitoring](#monitoring)
  - [Utilities](#utilities)
  - [Natural Language Processing](#natural-language-processing)
  - [Streaming](#streaming)
  - [Interfaces](#interfaces)
  - [Testing](#testing)
  - [Web Archives](#web-archives)
  - [Workflow Management](#workflow-management)
- [Resources](#resources)
  - [Books](#books)
  - [Papers](#papers)
  - [MOOCS](#moocs)
  - [Workshops](#workshops)
  - [Projects Using Spark](#projects-using-spark)
  - [Docker Images](#docker-images)
  - [Miscellaneous](#miscellaneous)

## Packages

### Language Bindings

* [Flambo](https://github.com/yieldbot/flambo) - Clojure DSL.
* [Mobius](https://github.com/Microsoft/Mobius) - C# bindings (deprecated in favor of .NET for Apache Spark).
* [.NET for Apache Spark](https://github.com/dotnet/spark) - .NET bindings.
* [sparklyr](https://github.com/rstudio/sparklyr) - An alternative R backend, using [`dplyr`](https://github.com/hadley/dplyr).
* [sparkle](https://github.com/tweag/sparkle) - Haskell on Apache Spark.

### Notebooks and IDEs

* [almond](https://almond.sh/) - A Scala kernel for [Jupyter](https://jupyter.org/).
* [Apache Zeppelin](https://zeppelin.incubator.apache.org/) - Web-based notebook that enables interactive data analytics with pluggable backends, integrated plotting, and extensive Spark support out-of-the-box.
* [Polynote](https://polynote.org/) - Polynote: an IDE-inspired polyglot notebook. It supports mixing multiple languages in one notebook, and sharing data between them seamlessly. It encourages reproducible notebooks with its immutable data model. Originating from [Netflix](https://medium.com/netflix-techblog/open-sourcing-polynote-an-ide-inspired-polyglot-notebook-7f929d3f447).
* [Spark Notebook](https://github.com/andypetrella/spark-notebook) - Scalable and stable Scala and Spark focused notebook bridging the gap between JVM and Data Scientists (incl. extendable, typesafe and reactive charts).
* [sparkmagic](https://github.com/jupyter-incubator/sparkmagic) - [Jupyter](https://jupyter.org/) magics and kernels for working interactively with remote Spark clusters through [Livy](https://github.com/cloudera/livy).

### General Purpose Libraries

* [Succinct](http://succinct.cs.berkeley.edu/) - Support for efficient queries on compressed data.
* [itachi](https://github.com/yaooqinn/itachi) - A library that brings useful functions from modern database management systems to Apache Spark.
* [spark-daria](https://github.com/mrpowers/spark-daria) - A Scala library with essential Spark functions and extensions to make you more productive.
* [quinn](https://github.com/mrpowers/quinn) - A native PySpark implementation of spark-daria.
* [Apache DataFu](https://github.com/apache/datafu/tree/master/datafu-spark) - A library of general purpose functions and UDFs.
* [Joblib Apache Spark Backend](https://github.com/joblib/joblib-spark) - [`joblib`](https://github.com/joblib/joblib) backend for running tasks on Spark clusters.

### SQL Data Sources

SparkSQL has [several built-in Data Sources](https://spark.apache.org/docs/latest/sql-data-sources-load-save-functions.html#manually-specifying-options) for files. These include `csv`, `json`, `parquet`, `orc`, and `avro`. It also supports JDBC databases as well as Apache Hive. Additional data sources can be added by including the packages listed below, or writing your own.

* [Spark CSV](https://github.com/databricks/spark-csv) - CSV reader and writer (obsolete since Spark 2.0 [[SPARK-12833]](https://issues.apache.org/jira/browse/SPARK-12833)).
* [Spark Avro](https://github.com/databricks/spark-avro) - [Apache Avro](https://avro.apache.org/) reader and writer (obsolete since Spark 2.4 [[SPARK-24768]](https://issues.apache.org/jira/browse/SPARK-24768)).
* [Spark XML](https://github.com/databricks/spark-xml) - XML parser and writer.
* [Spark Cassandra Connector](https://github.com/datastax/spark-cassandra-connector) - Cassandra support including data source and API and support for arbitrary queries.
* [Spark Riak Connector](https://github.com/basho/spark-riak-connector) - Riak TS & Riak KV connector.
* [Mongo-Spark](https://github.com/mongodb/mongo-spark) - Official MongoDB connector.
* [OrientDB-Spark](https://github.com/orientechnologies/spark-orientdb) - Official OrientDB connector.

### Storage

* [Delta Lake](https://github.com/delta-io/delta) - Storage layer with ACID transactions.
* [lakeFS](https://docs.lakefs.io/integrations/spark.html) - Integration with the lakeFS atomic versioned storage layer.

### Bioinformatics

* [ADAM](https://github.com/bigdatagenomics/adam) - Set of tools designed to analyse genomics data.
* [Hail](https://github.com/hail-is/hail) - Genetic analysis framework.

### GIS

* [Magellan](https://github.com/harsha2010/magellan) - Geospatial analytics using Spark.
* [Apache Sedona](https://github.com/apache/incubator-sedona) - Cluster computing system for processing large-scale spatial data.

### Time Series Analytics

* [Spark-Timeseries](https://github.com/cloudera/spark-timeseries) - Scala / Java / Python library for interacting with time series data on Apache Spark.
* [flint](https://github.com/twosigma/flint) - A time series library for Apache Spark.

### Graph Processing

* [Mazerunner](https://github.com/neo4j-contrib/neo4j-mazerunner) - Graph analytics platform on top of Neo4j and GraphX.
* [GraphFrames](https://github.com/graphframes/graphframes) - DataFrame-based graph API.
* [neo4j-spark-connector](https://github.com/neo4j-contrib/neo4j-spark-connector) - Bolt protocol based Neo4j connector with RDD, DataFrame and GraphX / GraphFrames support.
* [SparklingGraph](http://sparkling.ml) - Library extending GraphX features with multiple functionalities useful in graph analytics (measures, generators, link prediction etc.).

### Machine Learning Extension

* [Clustering4Ever](https://github.com/Clustering4Ever/Clustering4Ever) - Scala and Spark API to benchmark and analyse clustering algorithms on any vectorization you can generate.
* [dbscan-on-spark](https://github.com/irvingc/dbscan-on-spark) - An implementation of the DBSCAN clustering algorithm on top of Apache Spark by [irvingc](https://github.com/irvingc), based on the paper from He, Yaobin, et al. [MR-DBSCAN: a scalable MapReduce-based DBSCAN algorithm for heavily skewed data](https://www.researchgate.net/profile/Yaobin_He/publication/260523383_MR-DBSCAN_a_scalable_MapReduce-based_DBSCAN_algorithm_for_heavily_skewed_data/links/0046353a1763ee2bdf000000.pdf).
* [Apache SystemML](https://systemml.apache.org/) - Declarative machine learning framework on top of Spark.
* [Mahout Spark Bindings](https://mahout.apache.org/users/sparkbindings/home.html) \[status unknown\] - Linear algebra DSL and optimizer with R-like syntax.
* [spark-sklearn](https://github.com/databricks/spark-sklearn) - Scikit-learn integration with distributed model training.
* [KeystoneML](http://keystone-ml.org/) - Type safe machine learning pipelines with RDDs.
* [JPMML-Spark](https://github.com/jpmml/jpmml-spark) - PMML transformer library for Spark ML.
* [Distributed Keras](https://github.com/cerndb/dist-keras) - Distributed deep learning framework with PySpark and Keras.
* [ModelDB](https://mitdbg.github.io/modeldb) - A system to manage machine learning models for `spark.ml` and [`scikit-learn`](https://github.com/scikit-learn/scikit-learn).
* [Sparkling Water](https://github.com/h2oai/sparkling-water) - [H2O](http://www.h2o.ai/) interoperability layer.
* [BigDL](https://github.com/intel-analytics/BigDL) - Distributed Deep Learning library.
* [MLeap](https://github.com/combust/mleap) - Execution engine and serialization format which supports deployment of `o.a.s.ml` models without dependency on `SparkSession`.
* [Microsoft ML for Apache Spark](https://github.com/Azure/mmlspark) - A distributed ML library with support for LightGBM, Vowpal Wabbit, OpenCV, Deep Learning, Cognitive Services, and Model Deployment.
* [MLflow](https://mlflow.org/docs/latest/python_api/mlflow.spark.html#module-mlflow.spark) - Machine learning orchestration platform.

### Middleware

* [Livy](https://github.com/apache/incubator-livy) - REST server with extensive language support (Python, R, Scala), ability to maintain interactive sessions and object sharing.
* [spark-jobserver](https://github.com/spark-jobserver/spark-jobserver) - Simple Spark as a Service which supports object sharing using so-called named objects. JVM only.
* [Mist](https://github.com/Hydrospheredata/mist) - Service for exposing Spark analytical jobs and machine learning models as realtime, batch or reactive web services.
* [Apache Toree](https://github.com/apache/incubator-toree) - IPython protocol based middleware for interactive applications.
* [Apache Kyuubi](https://github.com/apache/incubator-kyuubi) - A distributed multi-tenant JDBC server for large-scale data processing and analytics, built on top of Apache Spark.

### Monitoring

* [Data Mechanics Delight](https://github.com/datamechanics/delight) - Cross-platform monitoring tool (Spark UI / Spark History Server replacement).

### Utilities

* [silex](https://github.com/willb/silex) - Collection of tools varying from ML extensions to additional RDD methods.
* [sparkly](https://github.com/Tubular/sparkly) - Helpers & syntactic sugar for PySpark.
* [pyspark-stubs](https://github.com/zero323/pyspark-stubs) - Static type annotations for PySpark (obsolete since Spark 3.1; see [SPARK-32681](https://issues.apache.org/jira/browse/SPARK-32681)).
* [Flintrock](https://github.com/nchammas/flintrock) - A command-line tool for launching Spark clusters on EC2.
* [Optimus](https://github.com/ironmussa/Optimus/) - Data cleansing and exploration utilities with the goal of simplifying data cleaning.

### Natural Language Processing

* [spark-corenlp](https://github.com/databricks/spark-corenlp) - DataFrame wrapper for [Stanford CoreNLP](https://stanfordnlp.github.io/CoreNLP/).
* [spark-nlp](https://github.com/JohnSnowLabs/spark-nlp) - Natural language processing library built on top of Apache Spark ML.

### Streaming

* [Apache Bahir](https://bahir.apache.org/) - Collection of the streaming connectors excluded from Spark 2.0 (Akka, MQTT, Twitter, ZeroMQ).

### Interfaces

* [Apache Beam](https://beam.apache.org/) - Unified data processing engine supporting both batch and streaming applications. Apache Spark is one of the supported execution environments.
* [Blaze](https://github.com/blaze/blaze) - Interface for querying larger-than-memory datasets using Pandas-like syntax. It supports both Spark `DataFrames` and `RDDs`.
* [Koalas](https://github.com/databricks/koalas) - Pandas DataFrame API on top of Apache Spark.

### Testing

* [deequ](https://github.com/awslabs/deequ) - Deequ is a library built on top of Apache Spark for defining "unit tests for data", which measure data quality in large datasets.
* [spark-testing-base](https://github.com/holdenk/spark-testing-base) - Collection of base test classes.
* [spark-fast-tests](https://github.com/MrPowers/spark-fast-tests) - A lightweight and fast testing framework.

### Web Archives

* [Archives Unleashed Toolkit](https://github.com/archivesunleashed/aut) - Open-source toolkit for analyzing web archives.

### Workflow Management

* [Cromwell](https://github.com/broadinstitute/cromwell#spark-backend) - Workflow management system with [Spark backend](https://github.com/broadinstitute/cromwell#spark-backend).

## Resources

### Books

* [Learning Spark, 2nd Edition](https://www.oreilly.com/library/view/learning-spark-2nd/9781492050032/) - Introduction to Spark API with Spark 3.0 covered. Good source of knowledge about basic concepts.
* [Advanced Analytics with Spark](http://shop.oreilly.com/product/0636920035091.do) - Useful collection of Spark processing patterns. Accompanying GitHub repository: [sryza/aas](https://github.com/sryza/aas).
* [Mastering Apache Spark](https://jaceklaskowski.gitbooks.io/mastering-apache-spark/) - Interesting compilation of notes by [Jacek Laskowski](https://github.com/jaceklaskowski). Focused on different aspects of Spark internals.
* [Spark Gotchas](https://github.com/awesome-spark/spark-gotchas) - Subjective compilation of tips, tricks and common programming mistakes.
* [Spark in Action](https://www.manning.com/books/spark-in-action) - A book in Manning's "in Action" family with 400+ pages. Starts gently, proceeds step by step, and covers a large number of topics. Free excerpt on how to [set up Eclipse for Spark application development](http://freecontent.manning.com/how-to-start-developing-spark-applications-in-eclipse/) and how to bootstrap a new application using the provided Maven Archetype. You can find the accompanying GitHub repo [here](https://github.com/spark-in-action/first-edition).

### Papers

* [Large-Scale Intelligent Microservices](https://arxiv.org/pdf/2009.08044.pdf) - Microsoft paper that presents an Apache Spark-based micro-service orchestration framework that extends database operations to include web service primitives.
* [Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing](https://people.csail.mit.edu/matei/papers/2012/nsdi_spark.pdf) - Paper introducing a core distributed memory abstraction.
* [Spark SQL: Relational Data Processing in Spark](https://amplab.cs.berkeley.edu/wp-content/uploads/2015/03/SparkSQLSigmod2015.pdf) - Paper introducing relational underpinnings, code generation and Catalyst optimizer.
* [Structured Streaming: A Declarative API for Real-Time Applications in Apache Spark](https://cs.stanford.edu/~matei/papers/2018/sigmod_structured_streaming.pdf) - Paper introducing Structured Streaming, a high-level declarative streaming API based on automatically incrementalizing a static relational query.

### MOOCS

* [Data Science and Engineering with Apache Spark (edX XSeries)](https://www.edx.org/xseries/data-science-engineering-apache-spark) - Series of five courses ([Introduction to Apache Spark](https://www.edx.org/course/introduction-apache-spark-uc-berkeleyx-cs105x), [Distributed Machine Learning with Apache Spark](https://www.edx.org/course/distributed-machine-learning-apache-uc-berkeleyx-cs120x), [Big Data Analysis with Apache Spark](https://www.edx.org/course/big-data-analysis-apache-spark-uc-berkeleyx-cs110x), [Advanced Apache Spark for Data Science and Data Engineering](https://www.edx.org/course/advanced-apache-spark-data-science-data-uc-berkeleyx-cs115x), [Advanced Distributed Machine Learning with Apache Spark](https://www.edx.org/course/advanced-distributed-machine-learning-uc-berkeleyx-cs125x)) covering different aspects of software engineering and data science. Python oriented.
* [Big Data Analysis with Scala and Spark (Coursera)](https://www.coursera.org/learn/big-data-analysys) - Scala oriented introductory course. Part of [Functional Programming in Scala Specialization](https://www.coursera.org/specializations/scala).

### Workshops

* [AMP Camp](http://ampcamp.berkeley.edu) - Periodic training event organized by the [UC Berkeley AMPLab](https://amplab.cs.berkeley.edu/). A source of useful exercises and recorded workshops covering different tools from the [Berkeley Data Analytics Stack](https://amplab.cs.berkeley.edu/software/).

### Projects Using Spark

* [Oryx 2](https://github.com/OryxProject/oryx) - [Lambda architecture](http://lambda-architecture.net/) platform built on Apache Spark and [Apache Kafka](http://kafka.apache.org/) with specialization for real-time, large-scale machine learning.
* [Photon ML](https://github.com/linkedin/photon-ml) - A machine learning library supporting classical Generalized Mixed Model and Generalized Additive Mixed Effect Model.
* [PredictionIO](https://prediction.io/) - Machine Learning server for developers and data scientists to build and deploy predictive applications in a fraction of the time.
* [Crossdata](https://github.com/Stratio/Crossdata) - Data integration platform with extended DataSource API and multi-user environment.

### Docker Images

- [jupyter/docker-stacks/pyspark-notebook](https://github.com/jupyter/docker-stacks/tree/master/pyspark-notebook) - PySpark with Jupyter Notebook and Mesos client.
- [sequenceiq/docker-spark](https://github.com/sequenceiq/docker-spark) - YARN images from [SequenceIQ](http://www.sequenceiq.com/).
- [datamechanics/spark](https://hub.docker.com/r/datamechanics/spark) - An easy-to-set-up Docker image for Apache Spark from [Data Mechanics](https://www.datamechanics.co/).

### Miscellaneous

- [Spark with Scala Gitter channel](https://gitter.im/spark-scala/Lobby) - "_A place to discuss and ask questions about using Scala for Spark programming_" started by [@deanwampler](https://github.com/deanwampler).
- [Apache Spark User List](http://apache-spark-user-list.1001560.n3.nabble.com/) and [Apache Spark Developers List](http://apache-spark-developers-list.1001551.n3.nabble.com/) - Mailing lists dedicated to usage questions and development topics respectively.

## References
Wikipedia. 2017. "Apache Spark." Wikipedia, the Free Encyclopedia. https://en.wikipedia.org/w/index.php?title=Apache_Spark&oldid=781182753.
## License

Apache Spark, Spark, Apache, and the Spark logo are trademarks of The Apache Software Foundation. This compilation is not endorsed by The Apache Software Foundation.

Inspired by [sindresorhus/awesome](https://github.com/sindresorhus/awesome).

This work (Awesome Spark, by https://github.com/awesome-spark/awesome-spark), identified by Maciej Szymkiewicz, is free of known copyright restrictions.
Owner
- Login: aspirina765
- Kind: user
- Repositories: 423
- Profile: https://github.com/aspirina765