fugue

A unified interface for distributed computing. Fugue executes SQL, Python, Pandas, and Polars code on Spark, Dask and Ray without any rewrites.

https://github.com/fugue-project/fugue

Science Score: 26.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (12.0%) to scientific vocabulary

Keywords

dask data-practitioners distributed distributed-computing distributed-systems machine-learning pandas spark sql
Last synced: 6 months ago

Repository

A unified interface for distributed computing. Fugue executes SQL, Python, Pandas, and Polars code on Spark, Dask and Ray without any rewrites.

Basic Info
Statistics
  • Stars: 2,105
  • Watchers: 23
  • Forks: 94
  • Open Issues: 39
  • Releases: 136
Topics
dask data-practitioners distributed distributed-computing distributed-systems machine-learning pandas spark sql
Created almost 6 years ago · Last pushed 11 months ago
Metadata Files
Readme Contributing License

README.md


Tutorials · API Documentation · Chat with us on Slack!

Fugue is a unified interface for distributed computing that lets users execute Python, Pandas, and SQL code on Spark, Dask, and Ray with minimal rewrites.

Fugue is most commonly used for:

  • Parallelizing or scaling existing Python and Pandas code by bringing it to Spark, Dask, or Ray with minimal rewrites.
  • Using FugueSQL to define end-to-end workflows on top of Pandas, Spark, and Dask DataFrames. FugueSQL is an enhanced SQL interface that can invoke Python code.

To see how Fugue compares to other frameworks such as dbt, Arrow, Ibis, and PySpark Pandas, see the comparisons.

Fugue API

The Fugue API is a collection of functions that are capable of running on Pandas, Spark, Dask, and Ray. The simplest way to use Fugue is the transform() function. This lets users parallelize the execution of a single function by bringing it to Spark, Dask, or Ray. In the example below, the map_letter_to_food() function takes a mapping and applies it to a column. This is just Pandas and Python so far (without Fugue).

```python
import pandas as pd
from typing import Dict

input_df = pd.DataFrame({"id": [0, 1, 2], "value": ["A", "B", "C"]})
map_dict = {"A": "Apple", "B": "Banana", "C": "Carrot"}

def map_letter_to_food(df: pd.DataFrame, mapping: Dict[str, str]) -> pd.DataFrame:
    df["value"] = df["value"].map(mapping)
    return df
```

Now, the map_letter_to_food() function is brought to the Spark execution engine by invoking the transform() function of Fugue. The output schema and params are passed to the transform() call. The schema is needed because it's a requirement for distributed frameworks. A schema of "*" below means all input columns are in the output.

```python
from pyspark.sql import SparkSession
from fugue import transform

spark = SparkSession.builder.getOrCreate()
sdf = spark.createDataFrame(input_df)

out = transform(
    sdf,
    map_letter_to_food,
    schema="*",
    params=dict(mapping=map_dict),
)
# out is a Spark DataFrame
out.show()
```

```
+---+------+
| id| value|
+---+------+
|  0| Apple|
|  1|Banana|
|  2|Carrot|
+---+------+
```

PySpark equivalent of Fugue transform():

```python
from typing import Iterator, Union

import pandas as pd
from pyspark.sql import DataFrame, SparkSession
from pyspark.sql.types import StructType

spark_session = SparkSession.builder.getOrCreate()

def mapping_wrapper(dfs: Iterator[pd.DataFrame], mapping):
    for df in dfs:
        yield map_letter_to_food(df, mapping)

def run_map_letter_to_food(input_df: Union[DataFrame, pd.DataFrame], mapping):
    # convert a Pandas DataFrame to a Spark DataFrame if needed
    if isinstance(input_df, pd.DataFrame):
        sdf = spark_session.createDataFrame(input_df.copy())
    else:
        sdf = input_df
    schema = StructType(list(sdf.schema.fields))
    return sdf.mapInPandas(
        lambda dfs: mapping_wrapper(dfs, mapping), schema=schema
    )

result = run_map_letter_to_food(input_df, map_dict)
result.show()
```

This syntax is simpler, cleaner, and more maintainable than the PySpark equivalent. At the same time, no edits were made to the original Pandas-based function to bring it to Spark. It is still usable on Pandas DataFrames. Fugue transform() also supports Dask and Ray as execution engines alongside the default Pandas-based engine.
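Because no edits were made to bring it to Spark, the same function can still be called directly on a Pandas DataFrame. A minimal sketch (re-declaring the function and data from the example above so it is self-contained):

```python
import pandas as pd
from typing import Dict

# same data and function as in the example above
input_df = pd.DataFrame({"id": [0, 1, 2], "value": ["A", "B", "C"]})
map_dict = {"A": "Apple", "B": "Banana", "C": "Carrot"}

def map_letter_to_food(df: pd.DataFrame, mapping: Dict[str, str]) -> pd.DataFrame:
    df["value"] = df["value"].map(mapping)
    return df

# a plain Pandas call, with no Fugue or Spark involved
local_out = map_letter_to_food(input_df.copy(), map_dict)
print(local_out["value"].tolist())  # → ['Apple', 'Banana', 'Carrot']
```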

The Fugue API has a broader collection of functions that are also compatible with Spark, Dask, and Ray. For example, load() and save() can be combined with transform() to build an end-to-end workflow that runs on any of these backends. For the full list of functions, see the Top Level API.

```python
import fugue.api as fa

def run(engine=None):
    with fa.engine_context(engine):
        df = fa.load("/path/to/file.parquet")
        out = fa.transform(df, map_letter_to_food, schema="*")
        fa.save(out, "/path/to/output_file.parquet")

run()                # runs on Pandas
run(engine="spark")  # runs on Spark
run(engine="dask")   # runs on Dask
```

All functions underneath the context will run on the specified backend, making it easy to toggle between local and distributed execution.

FugueSQL

FugueSQL is a SQL-based language capable of expressing end-to-end data workflows on top of Pandas, Spark, and Dask. The map_letter_to_food() function above is used in the SQL expression below. This is how to use a Python-defined function along with the standard SQL SELECT statement.

```python
import json

from fugue.api import fugue_sql

query = """
SELECT id, value
  FROM input_df
TRANSFORM USING map_letter_to_food(mapping={{mapping}}) SCHEMA *
"""
map_dict_str = json.dumps(map_dict)

# returns a Pandas DataFrame
fugue_sql(query, mapping=map_dict_str)

# returns a Spark DataFrame
fugue_sql(query, mapping=map_dict_str, engine="spark")
```

Installation

Fugue can be installed through pip or conda. For example:

```bash
pip install fugue
```

In order to use Fugue SQL, it is strongly recommended to install the sql extra:

```bash
pip install fugue[sql]
```

It also has the following installation extras:

  • sql: to support Fugue SQL. Without this extra, the non-SQL functionality still works. Before Fugue 0.9.0, this extra was included in Fugue's core dependencies, so it did not need to be installed explicitly. For 0.9.0+, it is required in order to use Fugue SQL.
  • spark: to support Spark as the ExecutionEngine.
  • dask: to support Dask as the ExecutionEngine.
  • ray: to support Ray as the ExecutionEngine.
  • duckdb: to support DuckDB as the ExecutionEngine, read details.
  • polars: to support Polars DataFrames and extensions using Polars.
  • ibis: to enable Ibis for Fugue workflows, read details.
  • cppsqlparser: to enable the C++ antlr parser for Fugue SQL. It can be 50+ times faster than the pure Python parser. Pre-built binaries are available for the main Python versions and platforms; for the rest, a C++ compiler is needed to build it on the fly.

For example, a common use case is:

```bash
pip install "fugue[duckdb,spark]"
```

Note that if Spark or DuckDB is already installed independently, Fugue can use it automatically without the corresponding extra.

Getting Started

The best way to get started with Fugue is to work through the 10 minute tutorials:

For the top level API, see:

The tutorials can also be run in an interactive notebook environment through binder or Docker:

Using binder

Binder

Note that it runs slowly on Binder because the Binder machine isn't powerful enough for a distributed framework such as Spark. Parallel execution can become sequential, so some of the performance comparison examples will not give you the correct numbers.

Using Docker

Alternatively, you should get decent performance by running this Docker image on your own machine:

```bash
docker run -p 8888:8888 fugueproject/tutorials:latest
```

Jupyter Notebook Extension

There is an accompanying notebook extension for FugueSQL that lets users invoke the %%fsql cell magic. The extension also provides syntax highlighting for FugueSQL cells. It works for both the classic notebook and JupyterLab. More details can be found in the installation instructions.
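As an illustrative sketch (the cell body is hypothetical and assumes a DataFrame named df already exists in the notebook session), a FugueSQL cell using the extension might look like:

```
%%fsql
SELECT id, value
  FROM df
 WHERE value = 'A'
 PRINT
```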


Ecosystem

By being an abstraction layer, Fugue can be used seamlessly with many other open-source projects.

Python backends:

FugueSQL backends:

  • Pandas - FugueSQL can run on Pandas
  • DuckDB - an in-process SQL OLAP database management system
  • dask-sql - SQL interface for Dask
  • SparkSQL
  • BigQuery
  • Trino

Fugue is available as a backend or can integrate with the following projects:

Registered 3rd party extensions (mainly for Fugue SQL) include:

  • Pandas plot - visualize data using matplotlib or plotly
  • Seaborn - visualize data using seaborn
  • WhyLogs - visualize data profiling
  • Vizzu - visualize data using ipyvizzu

Community and Contributing

Feel free to message us on Slack. We also have contributing instructions.

Case Studies

Mentioned Uses

Further Resources

View some of our latest conference presentations and content. For a more complete list, check the Content page in the tutorials.

Blogs

Conferences

Owner

  • Name: The Fugue Project
  • Login: fugue-project
  • Kind: organization
  • Email: fugue-project@googlegroups.com

Democratizing distributed computing and machine learning

GitHub Events

Total
  • Create event: 2
  • Release event: 1
  • Issues event: 4
  • Watch event: 123
  • Delete event: 1
  • Issue comment event: 15
  • Push event: 8
  • Pull request review event: 1
  • Pull request event: 3
  • Fork event: 2
Last Year
  • Create event: 2
  • Release event: 1
  • Issues event: 4
  • Watch event: 123
  • Delete event: 1
  • Issue comment event: 15
  • Push event: 8
  • Pull request review event: 1
  • Pull request event: 3
  • Fork event: 2

Committers

Last synced: 9 months ago

All Time
  • Total Commits: 374
  • Total Committers: 24
  • Avg Commits per committer: 15.583
  • Development Distribution Score (DDS): 0.556
Past Year
  • Commits: 6
  • Committers: 2
  • Avg Commits per committer: 3.0
  • Development Distribution Score (DDS): 0.167
Top Committers
Name Email Commits
Han Wang g****n@g****m 166
Kevin Kho k****o@g****m 106
gityow m****3@g****m 35
Han Wang h****g@l****m 33
Megan Yow M****w@o****m 8
Rowan Molony r****y@c****e 6
WangCHX w****z@g****m 3
Marek Kondziołka m****a@d****m 1
Alexander Beedie a****e 1
Andrew 1****1 1
CRSantiago 4****o 1
David Vegh 6****v 1
Dustin Ngo d****o@g****m 1
Fahad Akbar f****r@g****m 1
Gábor Lipták g****k@g****m 1
Jamie Broomall 8****6 1
Joshua Adelman s****s 1
Nils Braun n****n 1
Richard Pelgrim 6****m 1
Ron Bergeron R****n 1
bitsofinfo b****g@g****m 1
Anthony Holten a****1@g****m 1
John-Jay Manalastas j****s@o****m 1
LaurentErreca e****t@g****m 1
Committer Domains (Top 20 + Academic)

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 100
  • Total pull requests: 89
  • Average time to close issues: about 2 months
  • Average time to close pull requests: 5 days
  • Total issue authors: 22
  • Total pull request authors: 18
  • Average comments per issue: 1.25
  • Average comments per pull request: 0.83
  • Merged pull requests: 78
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 3
  • Pull requests: 5
  • Average time to close issues: 17 days
  • Average time to close pull requests: 5 days
  • Issue authors: 3
  • Pull request authors: 3
  • Average comments per issue: 5.67
  • Average comments per pull request: 0.8
  • Merged pull requests: 3
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • goodwanghan (62)
  • juanitorduz (3)
  • rdmolony (3)
  • jstammers (3)
  • bitsofinfo (2)
  • lukeb88 (2)
  • alxmrs (1)
  • rcshetty3 (1)
  • amotl (1)
  • abishekchiffon (1)
  • guilhermedelyra (1)
  • alunap (1)
  • kondziolka9ld (1)
  • naive-forecaster (1)
  • andreall (1)
Pull Request Authors
  • goodwanghan (71)
  • kvnkho (9)
  • iamhatesz (2)
  • kondziolka9ld (2)
  • bitsofinfo (1)
  • guilhermedelyra (1)
  • jamie256 (1)
  • jaymanalastas (1)
  • gliptak (1)
  • mfahadakbar (1)
  • step-security-bot (1)
  • amotl (1)
  • alexander-beedie (1)
  • veghdev (1)
  • anticorrelator (1)
Top Labels
Issue Labels
enhancement (34) core feature (24) compatibility (11) behavior change (10) high priority (10) refactoring (8) feature removal (6) Fugue SQL (6) programming interface (4) spark (4) bug (4) polars (3) dask (3) Fugue Backend (2) python deprecation (2) unit test (2) fugueless (2) low priority (2) ray (2) visualization (2) documentation (2) api (2) arrow (2) bag (1) pandas (1) good first issue (1) duckdb (1) IO (1) ibis (1) deprecation (1)
Pull Request Labels
enhancement (10) compatibility (7) core feature (5) bug (4) spark (4) behavior change (3) polars (2) refactoring (2) dask (2) high priority (1) api (1) documentation (1) programming interface (1) arrow (1) ray (1)

Packages

  • Total packages: 5
  • Total downloads:
    • pypi 1,502,882 last-month
  • Total docker downloads: 3,131
  • Total dependent packages: 21
    (may contain duplicates)
  • Total dependent repositories: 216
    (may contain duplicates)
  • Total versions: 161
  • Total maintainers: 2
pypi.org: fugue

An abstraction layer for distributed computation

  • Versions: 117
  • Dependent Packages: 19
  • Dependent Repositories: 97
  • Downloads: 1,241,913 Last month
  • Docker Downloads: 1,635
Rankings
Downloads: 0.5%
Dependent packages count: 0.6%
Dependent repos count: 1.5%
Stargazers count: 1.7%
Average: 1.9%
Docker downloads count: 2.2%
Forks count: 4.7%
Maintainers (2)
Last synced: 6 months ago
pypi.org: fugue-sql-antlr

Fugue SQL Antlr Parser

  • Versions: 18
  • Dependent Packages: 2
  • Dependent Repositories: 110
  • Downloads: 259,124 Last month
  • Docker Downloads: 1,496
Rankings
Downloads: 0.5%
Dependent repos count: 1.4%
Stargazers count: 1.7%
Average: 2.3%
Docker downloads count: 2.6%
Dependent packages count: 3.2%
Forks count: 4.7%
Maintainers (2)
Last synced: 6 months ago
pypi.org: fugue-sql-antlr-cpp

Fugue SQL Antlr C++ Parser

  • Versions: 16
  • Dependent Packages: 0
  • Dependent Repositories: 1
  • Downloads: 1,845 Last month
Rankings
Stargazers count: 1.7%
Forks count: 4.7%
Downloads: 5.6%
Average: 8.7%
Dependent packages count: 10.1%
Dependent repos count: 21.5%
Maintainers (2)
Last synced: 6 months ago
conda-forge.org: fugue

Fugue is a unified interface for distributed computing that lets users execute Python, pandas, and SQL code on Spark and Dask without rewrites. It is meant for data scientists/analysts who want to focus on defining logic rather than worrying about execution. It is also suitable for SQL users wanting to use SQL to define end-to-end workflows in pandas, Spark, and Dask. Data scientists using pandas wanting to take advantage of Spark or Dask with minimal effort, as well as big data practitioners finding testing code to be costly and slow would also find Fugue useful.

  • Versions: 8
  • Dependent Packages: 0
  • Dependent Repositories: 4
Rankings
Stargazers count: 11.1%
Dependent repos count: 16.1%
Forks count: 21.4%
Average: 25.1%
Dependent packages count: 51.6%
Last synced: 6 months ago
anaconda.org: fugue

Fugue is a unified interface for distributed computing that lets users execute Python, pandas, and SQL code on Spark and Dask without rewrites. It is meant for data scientists/analysts who want to focus on defining logic rather than worrying about execution. It is also suitable for SQL users wanting to use SQL to define end-to-end workflows in pandas, Spark, and Dask. Data scientists using pandas wanting to take advantage of Spark or Dask with minimal effort, as well as big data practitioners finding testing code to be costly and slow would also find Fugue useful.

  • Versions: 2
  • Dependent Packages: 0
  • Dependent Repositories: 4
Rankings
Stargazers count: 19.6%
Forks count: 33.1%
Average: 37.1%
Dependent repos count: 44.5%
Dependent packages count: 51.0%
Last synced: 6 months ago

Dependencies

requirements.txt pypi
  • autopep8 *
  • black >=22.3.0
  • fastavro *
  • flake8 *
  • flask *
  • mypy *
  • pandavro *
  • pre-commit *
  • psutil *
  • pylint *
  • pytest *
  • pytest-cov *
  • pytest-mock *
  • pytest-rerunfailures *
  • pytest-spark *
  • qpd ==0.3.0.dev2
  • sphinx >=2.4.0
  • sphinx-autodoc-typehints *
  • sphinx-rtd-theme *
  • twine *
  • wheel *
setup.py pypi
  • adagio >=0.2.4
  • fugue-sql-antlr >=0.1.0
  • importlib-metadata *
  • jinja2 *
  • pandas >=1.0.2
  • pyarrow >=0.15.1
  • qpd ==0.3.0.dev2
  • sqlalchemy *
  • triad >=0.6.4
.github/workflows/publish.yml actions
  • actions/checkout v2 composite
  • actions/setup-python v1 composite
.github/workflows/test_all.yml actions
  • actions/checkout v2 composite
  • actions/setup-python v1 composite
  • codecov/codecov-action v3 composite
.github/workflows/test_core.yml actions
  • actions/checkout v2 composite
  • actions/setup-python v1 composite
.github/workflows/test_notebook.yml actions
  • actions/checkout v2 composite
  • actions/setup-python v1 composite
.github/workflows/test_win.yml actions
  • actions/checkout v2 composite
  • actions/setup-python v1 composite
.github/workflows/test_dask.yml actions
  • actions/checkout v2 composite
  • actions/setup-python v1 composite
.github/workflows/test_no_sql.yml actions
  • actions/checkout v2 composite
  • actions/setup-python v1 composite
.github/workflows/test_ray.yml actions
  • actions/checkout v2 composite
  • actions/setup-python v1 composite
.github/workflows/test_spark.yml actions
  • actions/checkout v2 composite
  • actions/setup-python v1 composite