https://github.com/databricks/koalas

Koalas: pandas API on Apache Spark

Science Score: 23.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
  • DOI references
  • Academic publication links
  • Committers with academic emails
    2 of 52 committers (3.8%) from academic institutions
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (13.0%) to scientific vocabulary

Keywords

big-data data-science dataframe mlflow pandas pydata spark

Keywords from Contributors

jdbc mlops alignment flexible llmops prompt-engineering observability model-management llm-evaluation langchain
Last synced: 5 months ago

Repository

Koalas: pandas API on Apache Spark

Basic Info
  • Host: GitHub
  • Owner: databricks
  • License: apache-2.0
  • Language: Python
  • Default Branch: master
  • Homepage:
  • Size: 11.7 MB
Statistics
  • Stars: 3,366
  • Watchers: 316
  • Forks: 366
  • Open Issues: 108
  • Releases: 47
Topics
big-data data-science dataframe mlflow pandas pydata spark
Created about 7 years ago · Last pushed almost 2 years ago
Metadata Files
Readme Contributing License

README.md

DEPRECATED: Koalas supports Apache Spark 3.1 and below; as of Apache Spark 3.2 it is officially included in PySpark. This repository is now in maintenance mode. For Apache Spark 3.2 and above, please use PySpark directly.
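
As a rough illustration of the version split described in the notice above, the sketch below maps a Spark version string to the package that provides the pandas API. The helper `pandas_api_module_name` is hypothetical and is not part of Koalas or PySpark; `pyspark.pandas` is the module shipped with Spark 3.2 and above, and `databricks.koalas` is the import path this README uses.

```python
# Hypothetical helper (not part of Koalas or PySpark): choose which package
# provides the pandas API for a given Spark version string.


def pandas_api_module_name(spark_version: str) -> str:
    """Return the import path to use for the pandas API on Spark."""
    # Compare only the major and minor components, e.g. "3.1.2" -> (3, 1).
    major, minor = (int(part) for part in spark_version.split(".")[:2])
    if (major, minor) >= (3, 2):
        # Spark 3.2 and above ship the pandas API inside PySpark.
        return "pyspark.pandas"
    # Spark 3.1 and below rely on the separate Koalas package.
    return "databricks.koalas"


print(pandas_api_module_name("3.1.2"))  # databricks.koalas
print(pandas_api_module_name("3.2.0"))  # pyspark.pandas
```

The returned name could then be handed to `importlib.import_module`, assuming the corresponding package is installed.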

pandas API on Apache Spark

The Koalas project makes data scientists more productive when interacting with big data, by implementing the pandas DataFrame API on top of Apache Spark.

pandas is the de facto standard (single-node) DataFrame implementation in Python, while Spark is the de facto standard for big data processing. With this package, you can:

- Be immediately productive with Spark, with no learning curve, if you are already familiar with pandas.
- Have a single codebase that works both with pandas (tests, smaller datasets) and with Spark (distributed datasets).

We would love to have you try it and give us feedback, through our mailing lists or GitHub issues.

Try the Koalas 10-minute tutorial on a live Jupyter notebook. The initial launch can take several minutes.

Getting Started

Koalas can be installed in several ways, for example with Conda or pip.

```bash
# Conda
conda install koalas -c conda-forge
```

```bash
# pip
pip install koalas
```

See Installation for more details.

Koalas is pre-installed in Databricks Runtime 7.1 and above. Try Databricks Community Edition for free. You can also follow these steps to manually install a library on Databricks.

Lastly, if your PyArrow version is 0.15 or later and your PySpark version is lower than 3.0, set the ARROW_PRE_0_15_IPC_FORMAT environment variable to 1 manually. Koalas tries to set it for you, but it cannot do so if a Spark context has already been launched.
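
One way to apply the rule above from Python, before any Spark context is created, is sketched here. The helper `needs_legacy_arrow_ipc` and the example version strings are assumptions made for illustration; only the environment variable name and the version thresholds come from the paragraph above.

```python
import os


def _major_minor(version: str) -> tuple:
    """Parse "0.15.1" into (0, 15); only the first two components are used."""
    return tuple(int(part) for part in version.split(".")[:2])


def needs_legacy_arrow_ipc(pyarrow_version: str, pyspark_version: str) -> bool:
    """Hypothetical check: PyArrow 0.15+ combined with PySpark below 3.0."""
    return (
        _major_minor(pyarrow_version) >= (0, 15)
        and _major_minor(pyspark_version) < (3, 0)
    )


# Example version strings (assumed); in practice they would come from
# pyarrow.__version__ and pyspark.__version__.
if needs_legacy_arrow_ipc("0.15.1", "2.4.5"):
    # This must run before a Spark context is launched to take effect.
    os.environ["ARROW_PRE_0_15_IPC_FORMAT"] = "1"
```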

Now you can turn a pandas DataFrame into a Koalas DataFrame that is API-compliant with the former:

```python
import databricks.koalas as ks
import pandas as pd

pdf = pd.DataFrame({'x':range(3), 'y':['a','b','b'], 'z':['a','b','b']})

# Create a Koalas DataFrame from pandas DataFrame
df = ks.from_pandas(pdf)

# Rename the columns
df.columns = ['x', 'y', 'z1']

# Do some operations in place:
df['x2'] = df.x * df.x
```
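
To illustrate the single-codebase idea, here is a minimal sketch of a function written only against the shared DataFrame API and exercised with plain pandas. The function name `add_square_column` is hypothetical. Per this README's claim of API compliance, a Koalas DataFrame is expected to accept the same calls, but that path is not run here.

```python
import pandas as pd


def add_square_column(df):
    """Add an 'x2' column using only pandas-style DataFrame calls."""
    out = df.copy()  # leave the caller's DataFrame untouched
    out["x2"] = out.x * out.x
    return out


pdf = pd.DataFrame({"x": range(3), "y": ["a", "b", "b"]})
result = add_square_column(pdf)
print(result["x2"].tolist())  # [0, 1, 4]
```

With Koalas installed, passing `ks.from_pandas(pdf)` instead of `pdf` would be the way to try the same function on Spark.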

For more details, see Getting Started and Dependencies in the official documentation.

Contributing Guide

See Contributing Guide and Design Principles in the official documentation.

FAQ

See FAQ in the official documentation.

Best Practices

See Best Practices in the official documentation.

Koalas Talks and Blogs

See Koalas Talks and Blogs in the official documentation.

Owner

  • Name: Databricks
  • Login: databricks
  • Kind: organization
  • Location: United States of America

Helping data teams solve the world’s toughest problems using data and AI

GitHub Events

Total
  • Issues event: 4
  • Watch event: 51
  • Pull request event: 1
  • Fork event: 10
Last Year
  • Issues event: 4
  • Watch event: 51
  • Pull request event: 1
  • Fork event: 10

Committers

Last synced: 9 months ago

All Time
  • Total Commits: 1,554
  • Total Committers: 52
  • Avg Commits per committer: 29.885
  • Development Distribution Score (DDS): 0.644
Past Year
  • Commits: 0
  • Committers: 0
  • Avg Commits per committer: 0.0
  • Development Distribution Score (DDS): 0.0
Top Committers
Name Email Commits
Takuya UESHIN u****n@d****m 553
Hyukjin Kwon g****3@a****g 396
Haejoon Lee 4****c 258
Reynold Xin r****n@d****m 63
xinrong-databricks 4****s 63
rain 6****7@g****m 31
Kaiqi Dong k****i@k****e 25
Harutaka Kawamura h****0@g****m 22
Florian Schäfer 3****a 18
박현우 c****y@n****m 15
Timothy Hunter t****m@d****m 14
Walid Gara 2****d 12
LucasG0 4****0 9
Shril Kumar s****n@g****m 7
Deepyaman Datta d****a@u****u 5
Joy j****y@d****m 4
FWANI s****i@g****m 4
Thein Oo t****o 3
AbdealiJK a****i@g****m 3
Xinrong Meng x****g@d****m 3
90jam 9****z@g****m 3
Li Jin i****s@g****m 2
Gábor Lipták g****k@g****m 2
Ratin Kumar r****k@g****m 2
Abishek Ganesh a****2@g****m 2
Stephanie Bodoff s****f@d****m 2
Thomas Spura t****a@g****m 2
Daniel Voigt Godoy d****y@g****m 2
nitlev n****v 2
Xiao Li g****e@g****m 2
and 22 more...

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 87
  • Total pull requests: 23
  • Average time to close issues: over 1 year
  • Average time to close pull requests: 8 months
  • Total issue authors: 69
  • Total pull request authors: 14
  • Average comments per issue: 3.25
  • Average comments per pull request: 4.43
  • Merged pull requests: 3
  • Bot issues: 0
  • Bot pull requests: 1
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • HyukjinKwon (6)
  • itholic (5)
  • RainFung (3)
  • thehomebrewnerd (3)
  • haelipark (2)
  • kylegilde (2)
  • ChuckConnell (2)
  • nitinmnsn (2)
  • crucis (2)
  • Akashdesarda (1)
  • ericbugin (1)
  • CodyGreen-Datavant (1)
  • hrxx (1)
  • devarshml (1)
  • ParthRMehta (1)
Pull Request Authors
  • itholic (7)
  • ueshin (3)
  • xinrong-meng (2)
  • dependabot[bot] (2)
  • Cedric-Magnan (1)
  • lopez- (1)
  • tnixon (1)
  • beobest2 (1)
  • chi2liu (1)
  • awdavidson (1)
  • shril (1)
  • LSturtew (1)
  • AishwaryaKalloli (1)
  • eavilaes (1)
Top Labels
Issue Labels
enhancement (25) bug (10) question (8) discussions (8) not a koalas issue (2) help wanted (2) docs (1)
Pull Request Labels
dependencies (2)

Packages

  • Total packages: 3
  • Total downloads:
    • pypi 1,549,469 last-month
  • Total docker downloads: 3,943
  • Total dependent packages: 12
    (may contain duplicates)
  • Total dependent repositories: 444
    (may contain duplicates)
  • Total versions: 137
  • Total maintainers: 7
pypi.org: koalas

Koalas: pandas API on Apache Spark

  • Versions: 47
  • Dependent Packages: 11
  • Dependent Repositories: 444
  • Downloads: 1,549,469 Last month
  • Docker Downloads: 3,943
Rankings
Downloads: 0.3%
Dependent packages count: 0.6%
Dependent repos count: 0.7%
Average: 1.2%
Stargazers count: 1.3%
Docker downloads count: 1.6%
Forks count: 2.8%
Last synced: 6 months ago
proxy.golang.org: github.com/databricks/koalas
  • Versions: 48
  • Dependent Packages: 0
  • Dependent Repositories: 0
Rankings
Dependent packages count: 6.5%
Average: 6.7%
Dependent repos count: 6.9%
Last synced: 6 months ago
conda-forge.org: koalas
  • Versions: 42
  • Dependent Packages: 1
  • Dependent Repositories: 0
Rankings
Stargazers count: 6.4%
Forks count: 8.3%
Average: 19.4%
Dependent packages count: 28.8%
Dependent repos count: 34.0%
Last synced: 6 months ago

Dependencies

.github/workflows/master.yml actions
  • actions/cache v1 composite
  • actions/checkout v2 composite
  • actions/setup-java v1 composite
  • actions/setup-python v2 composite
  • codecov/codecov-action v1 composite
requirements-dev.txt pypi
  • black ==19.10b0 development
  • docutils ==0.16 development
  • flake8 * development
  • ipython * development
  • matplotlib >=3.0.0,<3.3.0 development
  • mlflow >=1.0 development
  • mypy * development
  • nbconvert * development
  • nbformat <5.1 development
  • nbsphinx * development
  • numpy >=1.14,<1.20.0 development
  • numpydoc >=1.1.0 development
  • openpyxl * development
  • pandas >=0.23.2 development
  • plotly >=4.8 development
  • pyarrow >=0.10 development
  • pydata-sphinx-theme * development
  • pypandoc * development
  • pytest * development
  • pytest-cov * development
  • scikit-learn * development
  • sphinx >=2.0.0,<3.1.0 development
  • sphinx-plotly-directive * development
  • xlrd <2.0.0 development
setup.py pypi
  • numpy >=1.14
  • pandas >=0.23.2
  • pyarrow >=0.10