https://github.com/databricks/koalas
Koalas: pandas API on Apache Spark
Science Score: 23.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file: found codemeta.json file
- ○ .zenodo.json file
- ○ DOI references
- ○ Academic publication links
- ✓ Committers with academic emails: 2 of 52 committers (3.8%) from academic institutions
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (13.0%) to scientific vocabulary
Repository
Statistics
- Stars: 3,366
- Watchers: 316
- Forks: 366
- Open Issues: 108
- Releases: 47
Metadata Files
README.md
DEPRECATED: Koalas supports Apache Spark 3.1 and below; as of Apache Spark 3.2 it is officially included in PySpark. This repository is now in maintenance mode. For Apache Spark 3.2 and above, please use PySpark directly.
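For code that has to run on both sides of that boundary, one possible approach is a guarded import that prefers the PySpark module and falls back to Koalas. This is only a sketch: `load_pandas_on_spark` is a hypothetical helper name, not part of either library. `pyspark.pandas` is the module that ships the pandas API in Spark 3.2 and above.

```python
import importlib


def load_pandas_on_spark():
    """Return the first available pandas-on-Spark module, or None.

    Hypothetical helper: prefers `pyspark.pandas` (Spark 3.2+) and falls
    back to `databricks.koalas` (Spark 3.1 and below).
    """
    for name in ("pyspark.pandas", "databricks.koalas"):
        try:
            return importlib.import_module(name)
        except ImportError:
            # Module (or one of its dependencies) is not installed; try the next one.
            continue
    return None


ps = load_pandas_on_spark()
if ps is None:
    print("Neither pyspark.pandas nor databricks.koalas is installed")
else:
    print("Using", ps.__name__)
```

The rest of the program can then refer to `ps` without caring which of the two packages provided it.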
pandas API on Apache Spark
Help Thirsty Koalas Devastated by Recent Fires
The Koalas project makes data scientists more productive when interacting with big data, by implementing the pandas DataFrame API on top of Apache Spark.
pandas is the de facto standard (single-node) DataFrame implementation in Python, while Spark is the de facto standard for big data processing. With this package, you can:
- Be immediately productive with Spark, with no learning curve, if you are already familiar with pandas.
- Have a single codebase that works both with pandas (tests, smaller datasets) and with Spark (distributed datasets).
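As a rough illustration of the single-codebase idea, the function below uses only column selection, arithmetic, and `copy()`, all of which exist in the pandas DataFrame API that Koalas implements. The helper name `add_squares` is made up for this sketch, and it is exercised here with plain pandas only. Passing a Koalas DataFrame instead is expected to work, but that is an assumption not verified here.

```python
import pandas as pd


def add_squares(df):
    """Return a copy of `df` with an extra column x2 = x * x.

    Written against the shared DataFrame API, so `df` may be a pandas
    DataFrame (as below) or, by assumption, a Koalas DataFrame.
    """
    out = df.copy()
    out["x2"] = out["x"] * out["x"]
    return out


# Small, single-node input: plain pandas is enough for tests.
pdf = pd.DataFrame({"x": range(3), "y": ["a", "b", "b"]})
result = add_squares(pdf)
print(result)
```

Keeping the logic in a function that takes the DataFrame as an argument is what lets tests run on pandas while production data goes through Spark.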
We would love to have you try it and give us feedback, through our mailing lists or GitHub issues.
Try the Koalas 10-minute tutorial on a live Jupyter notebook. The initial launch can take several minutes.
Getting Started
Koalas can be installed in several ways, including Conda and pip.
```bash
# Conda
conda install koalas -c conda-forge
```

```bash
# pip
pip install koalas
```
See Installation for more details.
For Databricks Runtime, Koalas is pre-installed in Databricks Runtime 7.1 and above. Try Databricks Community Edition for free. You can also follow these steps to manually install a library on Databricks.
Lastly, if your PyArrow version is 0.15 or later and your PySpark version is lower than 3.0, you should set the `ARROW_PRE_0_15_IPC_FORMAT` environment variable to `1` manually.
Koalas tries to set it for you, but it cannot do so if a Spark context has already been launched.
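A minimal sketch of setting the variable from Python, assuming it runs at the very top of the driver program, before any Spark context or session exists in the process. On a cluster the executors may need the same variable as well, for example through Spark's `spark.executorEnv.*` configuration; that part is not shown here.

```python
import os

# Set the flag before anything creates a SparkContext / SparkSession.
# Once a context is running, changing the variable here is too late.
os.environ["ARROW_PRE_0_15_IPC_FORMAT"] = "1"

# Only after this point import and use Koalas / PySpark, e.g.:
#   import databricks.koalas as ks
print(os.environ["ARROW_PRE_0_15_IPC_FORMAT"])
```

The same effect can be had by exporting the variable in the shell that launches the Python process.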
Now you can turn a pandas DataFrame into a Koalas DataFrame that is API-compliant with the former:
```python
import databricks.koalas as ks
import pandas as pd

pdf = pd.DataFrame({'x': range(3), 'y': ['a', 'b', 'b'], 'z': ['a', 'b', 'b']})

# Create a Koalas DataFrame from pandas DataFrame
df = ks.from_pandas(pdf)

# Rename the columns
df.columns = ['x', 'y', 'z1']

# Do some operations in place:
df['x2'] = df.x * df.x
```
For more details, see Getting Started and Dependencies in the official documentation.
Contributing Guide
See Contributing Guide and Design Principles in the official documentation.
FAQ
See FAQ in the official documentation.
Best Practices
See Best Practices in the official documentation.
Koalas Talks and Blogs
See Koalas Talks and Blogs in the official documentation.
Owner
- Name: Databricks
- Login: databricks
- Kind: organization
- Location: United States of America
- Website: https://databricks.com
- Repositories: 246
- Profile: https://github.com/databricks
Helping data teams solve the world’s toughest problems using data and AI
GitHub Events
Total
- Issues event: 4
- Watch event: 51
- Pull request event: 1
- Fork event: 10
Last Year
- Issues event: 4
- Watch event: 51
- Pull request event: 1
- Fork event: 10
Committers
Last synced: 9 months ago
Top Committers
| Name | Email | Commits |
|---|---|---|
| Takuya UESHIN | u****n@d****m | 553 |
| Hyukjin Kwon | g****3@a****g | 396 |
| Haejoon Lee | 4****c | 258 |
| Reynold Xin | r****n@d****m | 63 |
| xinrong-databricks | 4****s | 63 |
| rain | 6****7@g****m | 31 |
| Kaiqi Dong | k****i@k****e | 25 |
| Harutaka Kawamura | h****0@g****m | 22 |
| Florian Schäfer | 3****a | 18 |
| 박현우 | c****y@n****m | 15 |
| Timothy Hunter | t****m@d****m | 14 |
| Walid Gara | 2****d | 12 |
| LucasG0 | 4****0 | 9 |
| Shril Kumar | s****n@g****m | 7 |
| Deepyaman Datta | d****a@u****u | 5 |
| Joy | j****y@d****m | 4 |
| FWANI | s****i@g****m | 4 |
| Thein Oo | t****o | 3 |
| AbdealiJK | a****i@g****m | 3 |
| Xinrong Meng | x****g@d****m | 3 |
| 90jam | 9****z@g****m | 3 |
| Li Jin | i****s@g****m | 2 |
| Gábor Lipták | g****k@g****m | 2 |
| Ratin Kumar | r****k@g****m | 2 |
| Abishek Ganesh | a****2@g****m | 2 |
| Stephanie Bodoff | s****f@d****m | 2 |
| Thomas Spura | t****a@g****m | 2 |
| Daniel Voigt Godoy | d****y@g****m | 2 |
| nitlev | n****v | 2 |
| Xiao Li | g****e@g****m | 2 |
| and 22 more... | | |
Issues and Pull Requests
Last synced: 6 months ago
All Time
- Total issues: 87
- Total pull requests: 23
- Average time to close issues: over 1 year
- Average time to close pull requests: 8 months
- Total issue authors: 69
- Total pull request authors: 14
- Average comments per issue: 3.25
- Average comments per pull request: 4.43
- Merged pull requests: 3
- Bot issues: 0
- Bot pull requests: 1
Past Year
- Issues: 0
- Pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Issue authors: 0
- Pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- HyukjinKwon (6)
- itholic (5)
- RainFung (3)
- thehomebrewnerd (3)
- haelipark (2)
- kylegilde (2)
- ChuckConnell (2)
- nitinmnsn (2)
- crucis (2)
- Akashdesarda (1)
- ericbugin (1)
- CodyGreen-Datavant (1)
- hrxx (1)
- devarshml (1)
- ParthRMehta (1)
Pull Request Authors
- itholic (7)
- ueshin (3)
- xinrong-meng (2)
- dependabot[bot] (2)
- Cedric-Magnan (1)
- lopez- (1)
- tnixon (1)
- beobest2 (1)
- chi2liu (1)
- awdavidson (1)
- shril (1)
- LSturtew (1)
- AishwaryaKalloli (1)
- eavilaes (1)
Packages
- Total packages: 3
- Total downloads: pypi 1,549,469 last-month
- Total docker downloads: 3,943
- Total dependent packages: 12 (may contain duplicates)
- Total dependent repositories: 444 (may contain duplicates)
- Total versions: 137
- Total maintainers: 7
pypi.org: koalas
Koalas: pandas API on Apache Spark
- Homepage: https://github.com/databricks/koalas
- Documentation: https://koalas.readthedocs.io/
- License: http://www.apache.org/licenses/LICENSE-2.0
- Latest release: 1.8.2 (published over 4 years ago)
proxy.golang.org: github.com/databricks/koalas
- Documentation: https://pkg.go.dev/github.com/databricks/koalas#section-documentation
- License: apache-2.0
- Latest release: v1.8.2 (published over 4 years ago)
conda-forge.org: koalas
- Homepage: https://github.com/databricks/koalas
- License: Apache-2.0
- Latest release: 1.8.2 (published over 4 years ago)
Dependencies
- actions/cache v1 composite
- actions/checkout v2 composite
- actions/setup-java v1 composite
- actions/setup-python v2 composite
- codecov/codecov-action v1 composite
- black ==19.10b0 development
- docutils ==0.16 development
- flake8 * development
- ipython * development
- matplotlib >=3.0.0,<3.3.0 development
- mlflow >=1.0 development
- mypy * development
- nbconvert * development
- nbformat <5.1 development
- nbsphinx * development
- numpy >=1.14,<1.20.0 development
- numpydoc >=1.1.0 development
- openpyxl * development
- pandas >=0.23.2 development
- plotly >=4.8 development
- pyarrow >=0.10 development
- pydata-sphinx-theme * development
- pypandoc * development
- pytest * development
- pytest-cov * development
- scikit-learn * development
- sphinx >=2.0.0,<3.1.0 development
- sphinx-plotly-directive * development
- xlrd <2.0.0 development
- numpy >=1.14
- pandas >=0.23.2
- pyarrow >=0.10