Recent Releases of https://github.com/databricks/koalas

https://github.com/databricks/koalas - Version 1.8.2

Koalas 1.8.2 is a maintenance release. Koalas is officially included in PySpark as pandas API on Spark in Apache Spark 3.2. In Apache Spark 3.2+, please use Apache Spark directly.

Although moving to pandas API on Spark is recommended, Koalas 1.8.2 still works with Spark 3.2 (#2203).

Improvements and bug fixes

  • Fix builtin_table import in groupby apply (changed in pandas>=1.3.0). (#2184)

- Python
Published by ueshin over 4 years ago

https://github.com/databricks/koalas - Version 1.8.1

Koalas 1.8.1 is a maintenance release. Koalas will be officially included in PySpark in the upcoming Apache Spark 3.2. In Apache Spark 3.2+, please use Apache Spark directly.

Improvements and bug fixes

  • Remove the upper bound for numpy. (#2166)
  • Allow Python 3.9 when the underlying PySpark is 3.1 and above. (#2167)

Along with the following fixes:

  • Support x and y properly in plots (both matplotlib and plotly). (#2172)
  • Fix Index.difference to work properly. (#2173)
  • Fix backward compatibility for Python 3.5.*. (#2174)

- Python
Published by xinrong-meng over 4 years ago

https://github.com/databricks/koalas - Version 1.8.0

Koalas 1.8.0 is the last minor release because Koalas will be officially included in PySpark in the upcoming Apache Spark 3.2. In Apache Spark 3.2+, please use Apache Spark directly.

Categorical type and ExtensionDtype

We added support for pandas' categorical type (#2064, #2106).

```python
>>> s = ks.Series(list("abbccc"), dtype="category")
>>> s
0    a
1    b
2    b
3    c
4    c
5    c
dtype: category
Categories (3, object): ['a', 'b', 'c']

>>> s.cat.categories
Index(['a', 'b', 'c'], dtype='object')

>>> s.cat.codes
0    0
1    1
2    1
3    2
4    2
5    2
dtype: int8

>>> idx = ks.CategoricalIndex(list("abbccc"))
>>> idx
CategoricalIndex(['a', 'b', 'b', 'c', 'c', 'c'],
                 categories=['a', 'b', 'c'], ordered=False, dtype='category')

>>> idx.codes
Int64Index([0, 1, 1, 2, 2, 2], dtype='int64')

>>> idx.categories
Index(['a', 'b', 'c'], dtype='object')
```

We also added support for ExtensionDtype as a type argument to annotate return types (#2120, #2123, #2132, #2127, #2126, #2125, #2124):

```python
def func() -> ks.Series[pd.Int32Dtype()]:
    ...
```

Other new features, improvements and bug fixes

We added the following new features:

DataFrame:

  • first (#2128)
  • at_time (#2116)

Series:

  • at_time (#2130)
  • first (#2128)
  • between_time (#2129)

DatetimeIndex:

  • indexer_between_time (#2104)
  • indexer_at_time (#2109)
  • between_time (#2111)
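
These APIs mirror their pandas counterparts. A minimal pandas-only sketch of the at_time/between_time semantics (Koalas follows the same behavior on Spark-backed data):

```python
import pandas as pd

# pandas equivalents of the new at_time/between_time APIs.
idx = pd.date_range("2021-01-01", periods=4, freq="6h")  # 00:00, 06:00, 12:00, 18:00
s = pd.Series([1, 2, 3, 4], index=idx)

day = s.between_time("05:00", "13:00")  # rows at 06:00 and 12:00
noon = s.at_time("12:00")               # row at 12:00
```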

Along with the following fixes:

  • Support tuple to (DataFrame|Series).replace() (#2095)
  • Check index_dtype and data_dtypes more strictly. (#2100)
  • Return actual values via toPandas. (#2077)
  • Add lines and orient to read_json and to_json to improve error messages (#2110)
  • Fix isin to accept numpy array (#2103)
  • Allow multi-index column names for inferring return type schema with names. (#2117)
  • Add a short JDBC user guide (#2148)
  • Remove upper bound pandas 1.2 (#2141)
  • Standardize exceptions of arithmetic operations on Datetime-like data (#2101)

- Python
Published by HyukjinKwon almost 5 years ago

https://github.com/databricks/koalas - Version 1.7.0

Switch the default plotting backend to Plotly

We switched the default plotting backend from Matplotlib to Plotly (#2029, #2033). In addition, we added more Plotly methods such as DataFrame.plot.kde and Series.plot.kde (#2028).

```python
import databricks.koalas as ks

kdf = ks.DataFrame({
    'a': [1, 2, 2.5, 3, 3.5, 4, 5],
    'b': [1, 2, 3, 4, 5, 6, 7],
    'c': [0.5, 1, 1.5, 2, 2.5, 3, 3.5]})
kdf.plot.hist()
```


Plotting backend can be switched to matplotlib by setting ks.options.plotting.backend to matplotlib.

```python
ks.options.plotting.backend = "matplotlib"
```

Add Int64Index, Float64Index, DatetimeIndex

We added more types of Index such as Int64Index, Float64Index and DatetimeIndex (#2025, #2066).

Previously, a plain Index instance was always returned regardless of the data type when creating an index.

Now Int64Index, Float64Index or DatetimeIndex is returned depending on the data type of the index.

```python
>>> type(ks.Index([1, 2, 3]))                          # Int64Index
>>> type(ks.Index([1.1, 2.5, 3.0]))                    # Float64Index
>>> type(ks.Index([datetime.datetime(2021, 3, 9)]))    # DatetimeIndex
```

In addition, we added many properties for DatetimeIndex such as year, month, day, hour, minute, second, etc. (#2074) and added APIs for DatetimeIndex such as round(), floor(), ceil(), normalize(), strftime(), month_name() and day_name() (#2082, #2086, #2089).
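
For reference, these mirror pandas' DatetimeIndex. A quick pandas-only sketch of the properties and methods listed above:

```python
import pandas as pd

# pandas DatetimeIndex equivalents of the new Koalas properties/APIs.
idx = pd.DatetimeIndex(["2021-03-09 14:30:15", "2021-07-01 08:05:00"])

years = idx.year           # 2021, 2021
months = idx.month_name()  # 'March', 'July'
hours = idx.floor("h")     # truncated to the hour
days = idx.normalize()     # midnight of each date
```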

Create Index from Series or Index objects

Index can be created by taking Series or Index objects (#2071).

```python
>>> kser = ks.Series([1, 2, 3], name="a", index=[10, 20, 30])
>>> ks.Index(kser)
Int64Index([1, 2, 3], dtype='int64', name='a')
>>> ks.Int64Index(kser)
Int64Index([1, 2, 3], dtype='int64', name='a')
>>> ks.Float64Index(kser)
Float64Index([1.0, 2.0, 3.0], dtype='float64', name='a')
```

```python
>>> kser = ks.Series([datetime(2021, 3, 1), datetime(2021, 3, 2)], index=[10, 20])
>>> ks.Index(kser)
DatetimeIndex(['2021-03-01', '2021-03-02'], dtype='datetime64[ns]', freq=None)
>>> ks.DatetimeIndex(kser)
DatetimeIndex(['2021-03-01', '2021-03-02'], dtype='datetime64[ns]', freq=None)
```

Extension dtypes support

We added basic extension dtypes support (#2039).

```python
>>> kdf = ks.DataFrame(
...     {
...         "a": [1, 2, None, 3],
...         "b": [4.5, 5.2, 6.1, None],
...         "c": ["A", "B", "C", None],
...         "d": [False, None, True, False],
...     }
... ).astype({"a": "Int32", "b": "Float64", "c": "string", "d": "boolean"})
>>> kdf
      a     b     c      d
0     1   4.5     A  False
1     2   5.2     B   <NA>
2  <NA>   6.1     C   True
3     3  <NA>  <NA>  False
>>> kdf.dtypes
a      Int32
b    Float64
c     string
d    boolean
dtype: object
```

The following types are supported per the installed pandas:

  • pandas >= 0.24
    • Int8Dtype
    • Int16Dtype
    • Int32Dtype
    • Int64Dtype
  • pandas >= 1.0
    • BooleanDtype
    • StringDtype
  • pandas >= 1.2
    • Float32Dtype
    • Float64Dtype
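
These dtypes behave in pandas as below, and Koalas follows the same semantics (a pandas-only sketch):

```python
import pandas as pd

# Nullable extension dtypes: missing values become pd.NA and the dtype is
# preserved, instead of upcasting to float64 with NaN.
s = pd.Series([1, 2, None], dtype="Int32")
flags = s.isna()    # [False, False, True]
bumped = s + 1      # still a nullable integer dtype, <NA> propagated
```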

Binary operations and type casting are supported:

```python
>>> kdf.a + kdf.b
0       5
1       7
2    <NA>
3    <NA>
dtype: Int64

>>> kdf + kdf
      a     b
0     2     8
1     4    10
2  <NA>    12
3     6  <NA>

>>> kdf.a.astype('Float64')
0     1.0
1     2.0
2    <NA>
3     3.0
Name: a, dtype: Float64
```

Other new features, improvements and bug fixes

We added the following new features:

koalas:

  • date_range (#2081)
  • read_orc (#2017)

Series:

  • align (#2019)

DataFrame:

  • align (#2019)
  • to_orc (#2024)

Along with the following fixes:

  • PySpark 3.1.1 Support
  • Preserve index for statistical functions with axis==1 (#2036)
  • Use iloc to make sure it retrieves the first element (#2037)
  • Fix numeric_only to follow pandas (#2035)
  • Fix DataFrame.merge to work properly (#2060)
  • Fix astype(str) for some data types (#2040)
  • Fix binary operations Index by Series (#2046)
  • Fix bug on pow and rpow (#2047)
  • Support bool list-like column selection for loc indexer (#2057)
  • Fix window functions to resolve (#2090)
  • Refresh GitHub workflow matrix (#2083)
  • Restructure the hierarchy of Index unit tests (#2080)
  • Fix to delegate dtypes (#2061)

- Python
Published by itholic almost 5 years ago

https://github.com/databricks/koalas - Version 1.6.0

Improved Plotly backend support

We improved plotting support by implementing pie, histogram and box plots with Plotly plot backend. Koalas now can plot data with Plotly via:

  • DataFrame.plot.pie and Series.plot.pie (#1971)

  • DataFrame.plot.hist and Series.plot.hist (#1999)

  • Series.plot.box (#2007)

In addition, we optimized the histogram calculation to run as a single pass over the DataFrame (#1997), instead of launching a separate job to calculate each Series in the DataFrame.
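
The idea behind the single-pass optimization can be sketched with NumPy (a hypothetical illustration, not Koalas' actual implementation): bin edges are fixed once, and every column's bin counts are then derived from one traversal of the data rather than one Spark job per column.

```python
import numpy as np

# Hypothetical sketch: shared bin edges, then per-column counts computed
# from a single pass over the collected data.
data = np.array([[1.0, 0.5], [2.0, 1.5], [2.5, 1.0], [3.0, 2.0]])
bins = np.linspace(data.min(), data.max(), 4)  # shared edges for all columns
counts = [np.histogram(data[:, j], bins=bins)[0] for j in range(data.shape[1])]
```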

Operations between Series and Index

The operations between Series and Index are now supported as below (#1996):

```python
>>> kser = ks.Series([1, 2, 3, 4, 5, 6, 7])
>>> kidx = ks.Index([0, 1, 2, 3, 4, 5, 6])

>>> (kser + 1 + 10 * kidx).sort_index()
0     2
1    13
2    24
3    35
4    46
5    57
6    68
dtype: int64

>>> (kidx + 1 + 10 * kser).sort_index()
0    11
1    22
2    33
3    44
4    55
5    66
6    77
dtype: int64
```

Support setting to a Series via attribute access

We have added support for setting a column via attribute assignment in DataFrame (#1989).

```python
>>> kdf = ks.DataFrame({'A': [1, 2, 3, None]})
>>> kdf.A = kdf.A.fillna(kdf.A.median())
>>> kdf
     A
0  1.0
1  2.0
2  3.0
3  2.0
```

Other new features, improvements and bug fixes

We added the following new features:

Series:

  • factorize (#1972)
  • sem (#1993)

DataFrame:

  • insert (#1983)
  • sem (#1993)

In addition, we implemented new parameters:

  • Add min_count parameter for Frame.sum. (#1978)
  • Added ddof parameter for GroupBy.std() and GroupBy.var() (#1994)
  • Support ddof parameter for std and var. (#1986)
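
These parameters follow pandas semantics; a small pandas-only sketch:

```python
import math
import pandas as pd

# min_count: the sum is NA unless at least that many valid values exist.
s = pd.Series([1.0, None])
partial = s.sum(min_count=2)   # NaN: only one valid value
total = s.sum(min_count=1)     # 1.0

# ddof: delta degrees of freedom for std/var (0 = population, 1 = sample).
pop_std = pd.Series([1.0, 3.0]).std(ddof=0)   # population std = 1.0
```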

Along with the following fixes:

  • Fix stat functions with no numeric columns. (#1967)
  • Fix DataFrame.replace with NaN/None values (#1962)
  • Fix cumsum and cumprod. (#1982)
  • Use Python type name instead of Spark's in error messages. (#1985)
  • Use object.__setattr__ in Series. (#1991)
  • Adjust Series.mode to match pandas Series.mode (#1995)
  • Adjust data when all the values in a column are nulls. (#2004)
  • Fix as_spark_type to not support "bigint". (#2011)

- Python
Published by HyukjinKwon about 5 years ago

https://github.com/databricks/koalas - Version 1.5.0

Index operations support

We improved Index operations support (#1944, #1955).

Here are some examples:

  • Before

    ```py
    >>> kidx = ks.Index([1, 2, 3, 4, 5])
    >>> kidx + kidx
    Int64Index([2, 4, 6, 8, 10], dtype='int64')
    >>> kidx + kidx + kidx
    Traceback (most recent call last):
    ...
    AssertionError: args should be single DataFrame or single/multiple Series
    ```

    ```py
    >>> ks.Index([1, 2, 3, 4, 5]) + ks.Index([6, 7, 8, 9, 10])
    Traceback (most recent call last):
    ...
    AssertionError: args should be single DataFrame or single/multiple Series
    ```

  • After

    ```python
    >>> kidx = ks.Index([1, 2, 3, 4, 5])
    >>> kidx + kidx + kidx
    Int64Index([3, 6, 9, 12, 15], dtype='int64')
    ```

    ```python
    >>> ks.options.compute.ops_on_diff_frames = True
    >>> ks.Index([1, 2, 3, 4, 5]) + ks.Index([6, 7, 8, 9, 10])
    Int64Index([7, 9, 11, 13, 15], dtype='int64')
    ```

Other new features and improvements

We added the following new features:

DataFrame:

  • swaplevel (#1928)
  • swapaxes (#1946)
  • dot (#1945)
  • itertuples (#1960)

Series:

  • swaplevel (#1919)
  • swapaxes (#1954)

Index:

  • to_list (#1948)

MultiIndex:

  • to_list (#1948)

GroupBy:

  • tail (#1949)
  • median (#1957)

Other improvements and bug fixes

  • Support DataFrame parameter in Series.dot (#1931)
  • Add a best practice for checkpointing. (#1930)
  • Remove implicit switch-ons of "compute.ops_on_diff_frames" (#1953)
  • Fix Series._to_internal_pandas and introduce Index._to_internal_pandas. (#1952)
  • Fix first_valid_index/last_valid_index to support empty column DataFrame. (#1923)
  • Use pandas' transpose when the data is expected to be small. (#1932)
  • Fix tail to use the resolved copy (#1942)
  • Avoid unneeded reset_index in DataFrameGroupBy.describe. (#1951)
  • TypeError when Index.name / Series.name is not a hashable type (#1883)
  • Adjust data column names before attaching default index. (#1947)
  • Add plotly into the optional dependency in Koalas (#1939)
  • Add plotly backend test cases (#1938)
  • Don't pass stacked in plotly area chart (#1934)
  • Set upper bound of matplotlib to avoid failure on Ubuntu (#1959)
  • Fix GroupBy.describe for multi-index columns. (#1922)
  • Upgrade pandas version in CI (#1961)
  • Compare Series from the same anchor (#1956)
  • Add videos from Data+AI Summit 2020 EUROPE. (#1963)
  • Set PYARROW_IGNORE_TIMEZONE for binder. (#1965)

- Python
Published by xinrong-meng about 5 years ago

https://github.com/databricks/koalas - Version 1.4.0

Better type support

We improved the type mapping between pandas and Koalas (#1870, #1903). We added more types or string expressions to specify the data type, and fixed mismatches between pandas and Koalas.

Here are some examples:

  • Added np.float32 and "float32" (matched to FloatType)

    ```python
    >>> ks.Series([10]).astype(np.float32)
    0    10.0
    dtype: float32

    >>> ks.Series([10]).astype("float32")
    0    10.0
    dtype: float32
    ```

  • Added np.datetime64 and "datetime64[ns]" (matched to TimestampType)

    ```python
    >>> ks.Series(["2020-10-26"]).astype(np.datetime64)
    0   2020-10-26
    dtype: datetime64[ns]

    >>> ks.Series(["2020-10-26"]).astype("datetime64[ns]")
    0   2020-10-26
    dtype: datetime64[ns]
    ```

  • Fixed np.int to match LongType, not IntegerType.

    ```python
    >>> pd.Series([100]).astype(np.int)
    0    100
    dtype: int64

    >>> ks.Series([100]).astype(np.int)
    0    100
    dtype: int32  # This is fixed to int64 now.
    ```

  • Fixed np.float to match DoubleType, not FloatType.

    ```python
    >>> pd.Series([100]).astype(np.float)
    0    100.0
    dtype: float64

    >>> ks.Series([100]).astype(np.float)
    0    100.0
    dtype: float32  # This is fixed to float64 now.
    ```

We also added a document which describes the supported/unsupported pandas data types and the data type mapping between pandas and PySpark. See: Type Support In Koalas.
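
The pandas side of this mapping can be spot-checked directly. Note that np.int64/np.float64 are used below since the bare np.int/np.float aliases are removed in recent NumPy:

```python
import numpy as np
import pandas as pd

# pandas dtypes on the left map to the Spark types noted in the comments.
f32 = pd.Series([10]).astype("float32")                   # FloatType
i64 = pd.Series([100]).astype(np.int64)                   # LongType
ts = pd.Series(["2020-10-26"]).astype("datetime64[ns]")   # TimestampType
```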

Return type annotations for major Koalas objects

To improve Koalas' auto-completion in various editors and avoid misuse of APIs, we added return type annotations to major Koalas objects. These objects include DataFrame, Series, Index, GroupBy, Window objects, etc. (#1852, #1857, #1859, #1863, #1871, #1882, #1884, #1889, #1892, #1894, #1898, #1899, #1900, #1902).

The return type annotations help auto-completion libraries, such as Jedi, to infer the actual data type and provide proper suggestions:

  • Before (screenshot)

  • After (screenshot)

It also helps mypy enable static analysis over the method body.

pandas 1.1.4 support

We verified the behaviors of pandas 1.1.4 in Koalas.

As pandas 1.1.4 introduced a behavior change related to MultiIndex.is_monotonic (MultiIndex.is_monotonic_increasing) and MultiIndex.is_monotonic_decreasing (pandas-dev/pandas#37220), Koalas also changes the behavior (#1881).

Other new features and improvements

We added the following new features:

DataFrame:

  • __neg__ (#1847)
  • rename_axis (#1843)
  • spark.repartition (#1864)
  • spark.coalesce (#1873)
  • spark.checkpoint (#1877)
  • spark.local_checkpoint (#1878)
  • reindex_like (#1880)

Series:

  • rename_axis (#1843)
  • compare (#1802)
  • reindex_like (#1880)

Index:

  • intersection (#1747)

MultiIndex:

  • intersection (#1747)

Other improvements and bug fixes

  • Use SF.repeat in series.str.repeat (#1844)
  • Remove warning when use cache in the context manager (#1848)
  • Support a non-string name in Series' boxplot (#1849)
  • Calculate fliers correctly in Series.plot.box (#1846)
  • Show type name rather than type class in error messages (#1851)
  • Fix DataFrame.spark.hint to reflect internal changes. (#1865)
  • DataFrame.reindex supports named columns index (#1876)
  • Separate InternalFrame.index_map into index_spark_column_names and index_names. (#1879)
  • Fix DataFrame.xs to handle internal changes properly. (#1896)
  • Explicitly disallow empty list as index_spark_column_names and index_names. (#1895)
  • Use nullable inferred schema in function apply (#1897)
  • Introduce InternalFrame.index_level. (#1890)
  • Remove InternalFrame.index_map. (#1901)
  • Force to use the Spark's system default precision and scale when inferred data type contains DecimalType. (#1904)
  • Upgrade PyArrow from 1.0.1 to 2.0.0 in CI (#1860)
  • Fix read_excel to support squeeze argument. (#1905)
  • Fix to_csv to avoid duplicated option 'path' for DataFrameWriter. (#1912)

- Python
Published by ueshin over 5 years ago

https://github.com/databricks/koalas - Version 1.3.0

pandas 1.1 support

We verified the behaviors of pandas 1.1 in Koalas. Koalas now supports pandas 1.1 officially (#1688, #1822, #1829).

Support for non-string names

Now we support non-string names (#1784). Previously, names in Koalas, e.g., df.columns, df.columns.names, df.index.names, needed to be a string or a tuple of strings, but now other data types supported by Spark are allowed.

Before:

```py
>>> kdf = ks.DataFrame([[1, 'x'], [2, 'y'], [3, 'z']])
>>> kdf.columns
Index(['0', '1'], dtype='object')
```

After:

```py
>>> kdf = ks.DataFrame([[1, 'x'], [2, 'y'], [3, 'z']])
>>> kdf.columns
Int64Index([0, 1], dtype='int64')
```

Improve distributed-sequence default index

Performance when creating a distributed-sequence default index is improved by avoiding interaction between Python and the JVM (#1699).

Standardize binary operations between int and str columns

Make behaviors of binary operations (+, -, *, /, //, %) between int and str columns consistent with respective pandas behaviors (#1828).

It standardizes binary operations as follows:

  • +: raises TypeError between an int column and a str column (or string literal)
  • *: acts as Spark SQL's repeat between an int column (or int literal) and a str column; raises TypeError if a string literal is involved
  • -, /, //, % (modulo): raise TypeError if a str column (or string literal) is involved
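
The corresponding pandas behavior, which these rules follow, looks like this (pandas-only sketch):

```python
import pandas as pd

s_int = pd.Series([1, 2])
s_str = pd.Series(["a", "b"])

# * between an int column and a str column repeats the strings,
# like Spark SQL's repeat.
repeated = s_str * s_int   # ['a', 'bb']

# -, /, //, % (and + with an int column) raise TypeError when a str
# column is involved.
try:
    s_int - s_str
    raises = False
except TypeError:
    raises = True
```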

Other new features and improvements

We added the following new features:

DataFrame:

  • product (#1739)
  • from_dict (#1778)
  • pad (#1786)
  • backfill (#1798)

Series:

  • reindex (#1737)
  • explode (#1777)
  • pad (#1786)
  • argmin (#1790)
  • argmax (#1790)
  • argsort (#1793)
  • backfill (#1798)

Index:

  • inferred_type (#1745)
  • item (#1744)
  • is_unique (#1766)
  • asi8 (#1764)
  • is_type_compatible (#1765)
  • view (#1788)
  • insert (#1804)

MultiIndex:

  • inferred_type (#1745)
  • item (#1744)
  • is_unique (#1766)
  • asi8 (#1764)
  • is_type_compatible (#1765)
  • from_frame (#1762)
  • view (#1788)
  • insert (#1804)

GroupBy:

  • get_group (#1783)

Other improvements

  • Fix DataFrame.mad to work properly (#1749)
  • Fix Series name after binary operations. (#1753)
  • Fix GroupBy.cum* functions to match pandas' behavior (#1708)
  • Fix cumprod to work properly with Integer columns. (#1750)
  • Fix DataFrame.join for MultiIndex (#1771)
  • Exception handling for from_frame properly (#1791)
  • Fix iloc for slice(None, 0) (#1767)
  • Fix Series.__repr__ when Series.name is None. (#1796)
  • DataFrame.reindex supports koalas Index parameter (#1741)
  • Fix Series.fillna with inplace=True on non-nullable column. (#1809)
  • Input check in various APIs (#1808, #1810, #1811, #1812, #1813, #1814, #1816, #1824)
  • Fix to_list work properly in pandas==0.23 (#1823)
  • Fix Series.astype to work properly (#1818)
  • Frame.groupby supports dropna (#1815)

- Python
Published by itholic over 5 years ago

https://github.com/databricks/koalas - Version 1.2.0

Non-named Series support

Now we added support for non-named Series (#1712). Previously Koalas automatically named a Series "0" if no name was specified or the name was set to None, whereas pandas allows a Series without a name.

For example:

```py
>>> ks.__version__
'1.1.0'
>>> kser = ks.Series([1, 2, 3])
>>> kser
0    1
1    2
2    3
Name: 0, dtype: int64
>>> kser.name = None
>>> kser
0    1
1    2
2    3
Name: 0, dtype: int64
```

Now the Series will be non-named.

```py
>>> ks.__version__
'1.2.0'
>>> ks.Series([1, 2, 3])
0    1
1    2
2    3
dtype: int64
>>> kser = ks.Series([1, 2, 3], name="a")
>>> kser.name = None
>>> kser
0    1
1    2
2    3
dtype: int64
```

More stable "distributed-sequence" default index

Previously, the "distributed-sequence" default index sometimes produced wrong values or even raised an exception. For example, the code below:

```python
from databricks import koalas as ks

ks.options.compute.default_index_type = 'distributed-sequence'
ks.range(10).reset_index()
```

failed as below:

```
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  ...
pyspark.sql.utils.PythonException: An exception was thrown from the Python worker. Please see the stack trace below.
Traceback (most recent call last):
  ...
  File "/.../koalas/databricks/koalas/internal.py", line 620, in offset
    current_partition_offset = sums[id.iloc[0]]
KeyError: 103
```

We investigated and made the default index type more stable (#1701). It is now unlikely to cause such situations and is stable enough for general use.
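
Conceptually, a distributed-sequence index assigns each partition an offset equal to the row count of all partitions before it; the KeyError above came from that offset lookup. A tiny sketch of the offset arithmetic (illustrative only, not Koalas' actual code):

```python
from itertools import accumulate

# Each partition's starting index = total rows in earlier partitions.
partition_sizes = [3, 4, 2]
offsets = [0] + list(accumulate(partition_sizes))[:-1]  # [0, 3, 7]
# Global row ids for partition i run from offsets[i] to offsets[i] + size - 1.
```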

Improve testing infrastructure

We changed the testing infrastructure to use pandas' testing utils for exact checks (#1722). It now compares even index/column types and names, so that we can follow pandas more strictly.
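
For reference, the pandas testing utilities referred to here check values, dtype, index, and names exactly:

```python
import pandas as pd
from pandas.testing import assert_series_equal

left = pd.Series([1, 2], name="a", index=[10, 20])
right = pd.Series([1, 2], name="a", index=[10, 20])
assert_series_equal(left, right)   # passes: values, index, dtype, name all match

# A name mismatch alone now fails the comparison.
try:
    assert_series_equal(left, right.rename("b"))
    name_checked = False
except AssertionError:
    name_checked = True
```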

Other new features and improvements

We added the following new features:

DataFrame:

  • last_valid_index (#1705)

Series:

  • product (#1677)
  • last_valid_index (#1705)

GroupBy:

  • cumcount (#1702)

Other improvements

  • Refine Spark I/O. (#1667)
    • Set partitionBy explicitly in to_parquet.
    • Add mode and partition_cols to to_csv and to_json.
    • Fix type hints to use Optional.
  • Make read_excel read from DFS if the underlying Spark is 3.0.0 or above. (#1678, #1693, #1694, #1692)
  • Support callable instances to apply as a function, and fix groupby.apply to keep the index when possible (#1686)
  • Bug fixing for hasnans when non-DoubleType. (#1681)
  • Support axis=1 for DataFrame.dropna(). (#1689)
  • Allow assigning an index as a column (#1696)
  • Try to read pandas metadata in read_parquet if index_col is None. (#1695)
  • Include pandas Index object in dataframe indexing options (#1698)
  • Unified PlotAccessor for DataFrame and Series (#1662)
  • Fix SeriesGroupBy.nsmallest/nlargest. (#1713)
  • Fix DataFrame.size to consider its number of columns. (#1715)
  • Fix first_valid_index() for an empty object (#1704)
  • Fix index name when groupby.apply returns a single row. (#1719)
  • Support subtraction of date/timestamp with literals. (#1721)
  • DataFrame.reindex(fill_value) does not fill existing NaN values (#1723)

- Python
Published by ueshin over 5 years ago

https://github.com/databricks/koalas - Version 1.1.0

API extensions

We added support for API extensions (#1617).

You can register your custom accessors to DataFrame, Series, and Index.

For example, in your library code:

```py
from databricks.koalas.extensions import register_dataframe_accessor

@register_dataframe_accessor("geo")
class GeoAccessor:

    def __init__(self, koalas_obj):
        self._obj = koalas_obj
        # other constructor logic

    @property
    def center(self):
        # return the geographic center point of this DataFrame
        lat = self._obj.latitude
        lon = self._obj.longitude
        return (float(lon.mean()), float(lat.mean()))

    def plot(self):
        # plot this array's data on a map
        pass

    ...
```

Then, in a session:

```py
>>> from myextlib import GeoAccessor
>>> kdf = ks.DataFrame({"longitude": np.linspace(0, 10),
...                     "latitude": np.linspace(0, 20)})
>>> kdf.geo.center
(5.0, 10.0)

>>> kdf.geo.plot()
...
```

See also: https://koalas.readthedocs.io/en/latest/reference/extensions.html

Plotting backend

We introduced plotting.backend configuration (#1639).

Plotly (>=4.8) or other libraries that pandas supports can be used as a plotting backend if they are installed in the environment.

```py
kdf = ks.DataFrame([[1, 2, 3, 4], [5, 6, 7, 8]], columns=["A", "B", "C", "D"])
kdf.plot(title="Example Figure")  # defaults to backend="matplotlib"
```


```python
fig = kdf.plot(backend="plotly", title="Example Figure", height=500, width=500)

# same as:
#
# ks.options.plotting.backend = "plotly"
# fig = kdf.plot(title="Example Figure", height=500, width=500)

fig.show()
```


Each backend returns the figure in its own format, allowing for further editing or customization if required.

```python
fig.update_layout(template="plotly_dark")
fig.show()
```


Koalas accessor

We introduced koalas accessor and some methods specific to Koalas (#1613, #1628).

DataFrame.apply_batch, DataFrame.transform_batch, and Series.transform_batch are deprecated and moved to koalas accessor.

```py
>>> kdf = ks.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
>>> def pandas_plus(pdf):
...     return pdf + 1  # should always return the same length as input.
...
>>> kdf.koalas.transform_batch(pandas_plus)
   a  b
0  2  5
1  3  6
2  4  7
```

```py
>>> kdf = ks.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
>>> def pandas_filter(pdf):
...     return pdf[pdf.a > 1]  # allow arbitrary length
...
>>> kdf.koalas.apply_batch(pandas_filter)
   a  b
1  2  5
2  3  6
```

or

```py
>>> kdf = ks.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
>>> def pandas_plus(pser):
...     return pser + 1  # should always return the same length as input.
...
>>> kdf.a.koalas.transform_batch(pandas_plus)
0    2
1    3
2    4
Name: a, dtype: int64
```

See also: https://koalas.readthedocs.io/en/latest/user_guide/transform_apply.html

Other new features and improvements

We added the following new features:

DataFrame:

  • tail (#1632)
  • droplevel (#1622)

Series:

  • iteritems (#1603)
  • items (#1603)
  • tail (#1632)
  • droplevel (#1630)

Other improvements

  • Simplify Series.to_frame. (#1624)
  • Make Window functions create a new DataFrame. (#1623)
  • Fix Series._with_new_scol to use alias. (#1634)
  • Refine concat to handle the same anchor DataFrames properly. (#1627)
  • Add sort parameter to concat. (#1636)
  • Enable to assign list. (#1644)
  • Use SPARK_INDEX_NAME_FORMAT in combine_frames to avoid ambiguity. (#1650)
  • Rename spark columns only when index=False. (#1649)
  • read_csv: Implement reading of number of rows (#1656)
  • Fixed ks.Index.to_series() to work properly with the name parameter (#1643)
  • Fix fillna to handle "ffill" and "bfill" properly. (#1654)

- Python
Published by ueshin over 5 years ago

https://github.com/databricks/koalas - Version 1.0.1

Critical bug fix

We fixed a critical bug introduced in Koalas 1.0.0 (#1609).

If we call DataFrame.rename with the columns parameter after some operations on the DataFrame, those operations are lost:

```py
>>> kdf = ks.DataFrame([[1, 2, 3, 4], [5, 6, 7, 8]], columns=["A", "B", "C", "D"])
>>> kdf1 = kdf + 1
>>> kdf1
   A  B  C  D
0  2  3  4  5
1  6  7  8  9
>>> kdf1.rename(columns={"A": "aa", "B": "bb"})
   aa  bb  C  D
0   1   2  3  4
1   5   6  7  8
```

This should be:

```py
>>> pdf1.rename(columns={"A": "aa", "B": "bb"})
   aa  bb  C  D
0   2   3  4  5
1   6   7  8  9
```

Other improvements

  • Clean up InternalFrame and around anchor. (#1601)
  • Fixing DataFrame.iteritems to return generator (#1602)
  • Clean up groupby to use the anchor. (#1610)

- Python
Published by ueshin over 5 years ago

https://github.com/databricks/koalas - Version 1.0.0

Better pandas API coverage

We implemented many APIs and features equivalent to pandas, such as plotting, grouping, windowing, I/O, and transformation; Koalas 1.0.0 now reaches close to 80% pandas API coverage.

Apache Spark 3.0

Apache Spark 3.0 is now supported in Koalas 1.0 (#1586, #1558). Koalas does not require any changes to use Spark 3.0. Apache Spark has more than 3400 fixes landed in Spark 3.0, and Koalas shares most of those fixes across many components.

It also brings performance improvements in Koalas APIs that execute Python native functions internally via pandas UDFs, for example DataFrame.apply and DataFrame.apply_batch (#1508).

Python 3.8

With Apache Spark 3.0, Koalas supports the latest Python 3.8, which has many significant improvements (#1587); see also the Python 3.8.0 release notes.

Spark accessor

The spark accessor was introduced in Koalas 1.0.0 so that Koalas users can leverage existing PySpark APIs more easily (#1530). For example, you can apply PySpark functions as below:

```python
import databricks.koalas as ks
import pyspark.sql.functions as F

kss = ks.Series([1, 2, 3, 4])
kss.spark.apply(lambda s: F.collect_list(s))
```

Better type hint support

In the early versions, it was required to use Koalas instances as the return type hints for functions that return a pandas instance, which looks slightly awkward.

```python
def pandas_div(pdf) -> koalas.DataFrame[float, float]:
    # pdf is a pandas DataFrame.
    return pdf[['B', 'C']] / pdf[['B', 'C']]

df = ks.DataFrame({'A': ['a', 'a', 'b'], 'B': [1, 2, 3], 'C': [4, 6, 5]})
df.groupby('A').apply(pandas_div)
```

In Koalas 1.0.0 with Python 3.7+, you can also use pandas instances in the return type as below:

```python
def pandas_div(pdf) -> pandas.DataFrame[float, float]:
    return pdf[['B', 'C']] / pdf[['B', 'C']]
```

In addition, the new type hinting is experimentally introduced in order to allow users to specify column names in the type hints as below (#1577):

```python
def pandas_div(pdf) -> pandas.DataFrame['B': float, 'C': float]:
    return pdf[['B', 'C']] / pdf[['B', 'C']]
```

See also the guide in Koalas documentation (#1584) for more details.

Wider support of in-place update

Previously, in-place updates happened only within each DataFrame or Series; now the behavior follows pandas in-place updates, so updating one side also updates the other (#1592).

For example, each of the following updates kdf as well.

```python
kdf = ks.DataFrame({"x": [np.nan, 2, 3, 4, np.nan, 6]})
kser = kdf.x
kser.fillna(0, inplace=True)
```

```python
kdf = ks.DataFrame({"x": [np.nan, 2, 3, 4, np.nan, 6]})
kser = kdf.x
kser.loc[2] = 30
```

```python
kdf = ks.DataFrame({"x": [np.nan, 2, 3, 4, np.nan, 6]})
kser = kdf.x
kdf.loc[2, 'x'] = 30
```

If the DataFrame and Series are connected, the in-place updates update each other.

Less restriction on compute.ops_on_diff_frames

In Koalas 1.0.0, the restriction of compute.ops_on_diff_frames was loosened considerably (#1522, #1554). For example, operations such as the ones below can be performed without enabling compute.ops_on_diff_frames, which can be expensive due to the shuffle under the hood.

```python
df + df + df
df['foo'] = df['bar']['baz']
df[['x', 'y']] = df[['x', 'y']].fillna(0)
```

Other new features and improvements

DataFrame:

  • __bool__ (#1526)
  • explode (#1507)
  • spark.apply (#1536)
  • spark.schema (#1530)
  • spark.print_schema (#1530)
  • spark.frame (#1530)
  • spark.cache (#1530)
  • spark.persist (#1530)
  • spark.hint (#1530)
  • spark.to_table (#1530)
  • spark.to_spark_io (#1530)
  • spark.explain (#1530)
  • spark.apply (#1530)
  • mad (#1538)
  • __abs__ (#1561)

Series:

  • item (#1502, #1518)
  • divmod (#1397)
  • rdivmod (#1397)
  • unstack (#1501)
  • mad (#1503)
  • __bool__ (#1526)
  • to_markdown (#1510)
  • spark.apply (#1536)
  • spark.data_type (#1530)
  • spark.nullable (#1530)
  • spark.column (#1530)
  • spark.transform (#1530)
  • filter (#1511)
  • __abs__ (#1561)
  • bfill (#1580)
  • ffill (#1580)

Index:

  • __bool__ (#1526)
  • spark.data_type (#1530)
  • spark.column (#1530)
  • spark.transform (#1530)
  • get_level_values (#1517)
  • delete (#1165)
  • __abs__ (#1561)
  • holds_integer (#1547)

MultiIndex:

  • __bool__ (#1526)
  • spark.data_type (#1530)
  • spark.column (#1530)
  • spark.transform (#1530)
  • get_level_values (#1517)
  • delete (#1165)
  • __abs__ (#1561)
  • holds_integer (#1547)

Along with the following improvements:

  • Fix Series.clip not to create a new DataFrame. (#1525)
  • Fix combine_first to support tupled names. (#1534)
  • Add Spark accessors to usage logging. (#1540)
  • Implement multi-index support in DataFrame.filter (#1512)
  • Fix Series.fillna to avoid Spark jobs. (#1550)
  • Support DataFrame.spark.explain(extended: str) case. (#1563)
  • Support Series as repeats in Series.repeat. (#1573)
  • Fix fillna to handle NaN properly. (#1572)
  • Fix DataFrame.replace to avoid creating a new Spark DataFrame. (#1575)
  • Cache an internal pandas object to avoid run twice in Jupyter. (#1564)
  • Fix Series.div when div/floordiv np.inf by zero (#1463)
  • Fix Series.unstack to support non-numeric type and keep the names (#1527)
  • Fix hasnans to follow the modified column. (#1532)
  • Fix explode to use internal methods. (#1538)
  • Fix RollingGroupby and ExpandingGroupby to handle agg_columns. (#1546)
  • Fix reindex not to update internal. (#1582)

Backward Compatibility

  • Remove the deprecated pandas_wraps (#1529)
  • Remove compute function. (#1531)

- Python
Published by HyukjinKwon over 5 years ago

https://github.com/databricks/koalas - Version 0.33.0

apply and transform Improvements

We added support for positional/keyword arguments for apply, apply_batch, transform, and transform_batch in DataFrame, Series, and GroupBy. (#1484, #1485, #1486)

```py
>>> ks.range(10).apply(lambda a, b, c: a + b + c, args=(1,), c=3)
   id
0   4
1   5
2   6
3   7
4   8
5   9
6  10
7  11
8  12
9  13
```

```py
>>> ks.range(10).transform_batch(lambda pdf, a, b, c: pdf.id + a + b + c, 1, 2, c=3)
0     6
1     7
2     8
3     9
4    10
5    11
6    12
7    13
8    14
9    15
Name: id, dtype: int64
```

```py
>>> kdf = ks.DataFrame(
...     {"a": [1, 2, 3, 4, 5, 6], "b": [1, 1, 2, 3, 5, 8], "c": [1, 4, 9, 16, 25, 36]},
...     columns=["a", "b", "c"])
>>> kdf.groupby(["a", "b"]).apply(lambda x, y, z: x + x.min() + y + z, 1, z=2)
    a   b   c
0   5   5   5
1   7   5  11
2   9   7  21
3  11   9  35
4  13  13  53
5  15  19  75
```

Spark Schema

We added spark_schema and print_schema to inspect the underlying Spark schema. (#1446)

```py
>>> kdf = ks.DataFrame({'a': list('abc'),
...                     'b': list(range(1, 4)),
...                     'c': np.arange(3, 6).astype('i1'),
...                     'd': np.arange(4.0, 7.0, dtype='float64'),
...                     'e': [True, False, True],
...                     'f': pd.date_range('20130101', periods=3)},
...                    columns=['a', 'b', 'c', 'd', 'e', 'f'])

# Print the schema out in Spark's DDL formatted string
>>> kdf.spark_schema().simpleString()
'struct<a:string,b:bigint,c:tinyint,d:double,e:boolean,f:timestamp>'
>>> kdf.spark_schema(index_col='index').simpleString()
'struct<index:bigint,a:string,b:bigint,c:tinyint,d:double,e:boolean,f:timestamp>'

# Print out the schema the same as DataFrame.printSchema()
>>> kdf.print_schema()
root
 |-- a: string (nullable = false)
 |-- b: long (nullable = false)
 |-- c: byte (nullable = false)
 |-- d: double (nullable = false)
 |-- e: boolean (nullable = false)
 |-- f: timestamp (nullable = false)

>>> kdf.print_schema(index_col='index')
root
 |-- index: long (nullable = false)
 |-- a: string (nullable = false)
 |-- b: long (nullable = false)
 |-- c: byte (nullable = false)
 |-- d: double (nullable = false)
 |-- e: boolean (nullable = false)
 |-- f: timestamp (nullable = false)
```

GroupBy Improvements

We fixed many bugs of GroupBy as listed below.

  • Fix groupby when as_index=False. (#1457)
  • Make groupby.apply in pandas<0.25 run the function only once per group. (#1462)
  • Fix Series.groupby on the Series from different DataFrames. (#1460)
  • Fix GroupBy.head to recognize agg_columns. (#1474)
  • Fix GroupBy.filter to follow complex group keys. (#1471)
  • Fix GroupBy.transform to follow complex group keys. (#1472)
  • Fix GroupBy.apply to follow complex group keys. (#1473)
  • Fix GroupBy.fillna to use GroupBy._apply_series_op. (#1481)
  • Fix GroupBy.filter and apply to handle agg_columns. (#1480)
  • Fix GroupBy apply, filter, and head to ignore temp columns when ops from different DataFrames. (#1488)
  • Fix GroupBy functions which need natural orderings to follow the order when ops are from different DataFrames. (#1490)

Other new features and improvements

We added the following new feature:

SeriesGroupBy:

  • filter (#1483)

Other improvements

  • dtype for DateType should be np.dtype("object"). (#1447)
  • Make reset_index disallow the same name but allow it when drop=True. (#1455)
  • Fix named aggregation for MultiIndex (#1435)
  • Raise a ValueError that previously was not raised (#1461)
  • Fix get_dummies when the prefix parameter is a dict (#1478)
  • Simplify DataFrame.columns setter. (#1489)

- Python
Published by ueshin almost 6 years ago

https://github.com/databricks/koalas - Version 0.32.0

Koalas documentation redesign

The Koalas documentation was redesigned with a better theme, pydata-sphinx-theme. Please check out the new Koalas documentation site.

transform_batch and apply_batch

We added APIs that enable you to directly transform and apply a function against a Koalas Series or DataFrame. map_in_pandas is deprecated and renamed to apply_batch.

```python
>>> import databricks.koalas as ks
>>> kdf = ks.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
>>> def pandas_plus(pdf):
...     return pdf + 1  # should always return the same length as input.
...
>>> kdf.transform_batch(pandas_plus)
```

```python
>>> import databricks.koalas as ks
>>> kdf = ks.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
>>> def pandas_plus(pdf):
...     return pdf[pdf.a > 1]  # allow arbitrary length
...
>>> kdf.apply_batch(pandas_plus)
```

Please also check Transform and apply a function in Koalas documentation.

Other new features and improvements

We added the following new feature:

DataFrame:​

  • truncate (#1408)
  • hint (#1415)

SeriesGroupBy:

  • unique (#1426)

Index:

  • spark_column (#1438)

Series:

  • spark_column (#1438)

MultiIndex:

  • spark_column (#1438)

Other improvements

  • Fix from_pandas to handle the same index name as a column name. (#1419)
  • Add documentation about non-Koalas APIs (#1420)
  • Hotfix the missing keyword argument 'deep' in DataFrame.copy() (#1423)
  • Fix Series.div when divide by zero (#1412)
  • Support expand parameter if n is a positive integer in Series.str.split/rsplit. (#1432)
  • Make Series.astype(bool) follow the concept of "truthy" and "falsey". (#1431)
  • Fix incompatible behaviour with pandas for floordiv with np.nan (#1429)
  • Use mapInPandas for apply_batch API in Spark 3.0 (#1440)
  • Use F.datediff() for subtraction of dates as a workaround. (#1439)

- Python
Published by HyukjinKwon almost 6 years ago

https://github.com/databricks/koalas - Version 0.31.0

PyArrow>=0.15 support is back

We added PyArrow>=0.15 support back (#1110).

Note that, when working with pyarrow>=0.15 and pyspark<3.0, Koalas will set the environment variable ARROW_PRE_0_15_IPC_FORMAT=1 if it does not exist, as per the instruction in SPARK-29367, but it will NOT work if there is a Spark context already launched. In that case, you have to manage the environment variable yourself.
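As a minimal sketch of managing the variable yourself, setting it before any Spark context is created could look like the following (the placement in your startup code is an assumption that depends on your deployment):

```python
import os

# Per SPARK-29367: this must be set before any Spark context is launched.
# Koalas only sets it automatically when it is absent, and it has no
# effect on executors that are already running.
os.environ.setdefault("ARROW_PRE_0_15_IPC_FORMAT", "1")
```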

Spark specific improvements

Broadcast hint

We added broadcast function in namespace.py (#1360).

We can use it with merge, join, and update, which invoke a join operation in Spark, when we know one of the DataFrames is small enough to fit in memory; we can then expect much better performance than shuffle-based joins.

For example,

```py
>>> merged = df1.merge(ks.broadcast(df2), left_on='lkey', right_on='rkey')
>>> merged.explain()
== Physical Plan ==
...
...BroadcastHashJoin...
...
```

persist function and storage level

We added persist function to specify the storage level when caching (#1381), and also, we added storage_level property to check the current storage level (#1385).

```py
>>> with df.cache() as cached_df:
...     print(cached_df.storage_level)
...
Disk Memory Deserialized 1x Replicated

>>> with df.persist(pyspark.StorageLevel.MEMORY_ONLY) as cached_df:
...     print(cached_df.storage_level)
...
Memory Serialized 1x Replicated
```

Other new features and improvements

We added the following new feature:

DataFrame:

  • to_markdown (#1377)
  • squeeze (#1389)

Series:

  • squeeze (#1389)
  • asof (#1366)

Other improvements

  • Add a way to specify index column in I/O APIs (#1379)
  • Fix iloc.__setitem__ with the other Series from the same DataFrame. (#1388)
  • Add support Series from different DataFrames for loc/iloc.__setitem__. (#1391)
  • Refine __setitem__ for loc/iloc with DataFrame. (#1394)
  • Help misuse of options argument. (#1402)
  • Add blog posts in Koalas documentation (#1406)
  • Fix mod & rmod for matching with pandas. (#1399)

- Python
Published by ueshin almost 6 years ago

https://github.com/databricks/koalas - Version 0.30.0

Slice column selection support in loc

We continue to improve loc indexer and added the slice column selection support (#1351).

```python
>>> from databricks import koalas as ks
>>> df = ks.DataFrame({'a': list('abcdefghij'), 'b': list('abcdefghij'), 'c': range(10)})
>>> df.loc[:, "b":"c"]
   b  c
0  a  0
1  b  1
2  c  2
3  d  3
4  e  4
5  f  5
6  g  6
7  h  7
8  i  8
9  j  9
```

Slice row selection support in loc for multi-index

We also added the support of slice as row selection in loc indexer for multi-index (#1344).

```python
>>> from databricks import koalas as ks
>>> import pandas as pd
>>> df = ks.DataFrame({'a': range(3)},
...                   index=pd.MultiIndex.from_tuples([("a", "b"), ("a", "c"), ("b", "d")]))
>>> df.loc[("a", "c"): "b"]
     a
a c  1
b d  2
```

Slice row selection support in iloc

We continued to improve iloc indexer to support iterable indexes as row selection (#1338).

```python
>>> from databricks import koalas as ks
>>> df = ks.DataFrame({'a': list('abcdefghij'), 'b': list('abcdefghij')})
>>> df.iloc[[-1, 1, 2, 3]]
   a  b
1  b  b
2  c  c
3  d  d
9  j  j
```

Support of setting values via loc and iloc at Series

We added basic support for setting values via loc and iloc on a Series (#1367).

```python
>>> from databricks import koalas as ks
>>> kser = ks.Series([1, 2, 3], index=["cobra", "viper", "sidewinder"])
>>> kser.loc[kser % 2 == 1] = -kser
>>> kser
cobra        -1
viper         2
sidewinder   -3
```

Other new features and improvements

We added the following new feature:

DataFrame:

  • take (#1292)
  • eval (#1359)

Series:

  • dot (#1136)
  • take (#1357)
  • combine_first (#1290)

Index:

  • droplevel (#1340)
  • union (#1348)
  • take (#1357)
  • asof (#1350)

MultiIndex:

  • droplevel (#1340)
  • unique (#1342)
  • union (#1348)
  • take (#1357)

Other improvements

  • Compute Index.is_monotonic/Index.is_monotonic_decreasing in a distributed manner (#1354)
  • Fix SeriesGroupBy.apply() to respect various output (#1339)
  • Add the support for operations between different DataFrames in groupby() (#1321)
  • Explicitly don't support to disable numeric_only in stats APIs at DataFrame (#1343)
  • Fix index operator against Series and Frame to use iloc conditionally (#1336)
  • Make nunique in DataFrame return a Koalas DataFrame instead of pandas' (#1347)
  • Fix MultiIndex.drop() to follow renaming et al. (#1356)
  • Add column axis in ks.concat (#1349)
  • Fix iloc for Series when the series is modified. (#1368)
  • Support MultiIndex for duplicated, drop_duplicates. (#1363)

- Python
Published by HyukjinKwon almost 6 years ago

https://github.com/databricks/koalas - Version 0.29.0

Slice support in iloc

We improved iloc indexer to support slice as row selection. (#1335)

For example,

```py
>>> kdf = ks.DataFrame({'a': list('abcdefghij')})
>>> kdf
   a
0  a
1  b
2  c
3  d
4  e
5  f
6  g
7  h
8  i
9  j
>>> kdf.iloc[2:5]
   a
2  c
3  d
4  e
>>> kdf.iloc[2:-3:2]
   a
2  c
4  e
6  g
>>> kdf.iloc[5:]
   a
5  f
6  g
7  h
8  i
9  j
>>> kdf.iloc[5:2]
Empty DataFrame
Columns: [a]
Index: []
```

Documentation

We added links to the previous talks in our document. (#1319)

You can find a lot of useful talks from previous events, and we will keep the list updated.

https://koalas.readthedocs.io/en/latest/getting_started/videos.html

Other new features and improvements

We added the following new feature:

DataFrame: - stack (#1329)

Series:

  • repeat (#1328)

Index:

  • difference (#1325)
  • repeat (#1328)

MultiIndex:

  • difference (#1325)
  • repeat (#1328)

Other improvements

  • DataFrame.pivot should preserve the original index names. (#1316)
  • Fix _LocIndexerLike to handle a Series from index. (#1315)
  • Support MultiIndex in DataFrame.unstack. (#1322)
  • Support Spark UDT when converting from/to pandas DataFrame/Series. (#1324)
  • Allow negative numbers for head. (#1330)
  • Return a Koalas series instead of pandas' in stats APIs at Koalas DataFrame (#1333)

- Python
Published by ueshin almost 6 years ago

https://github.com/databricks/koalas - Version 0.28.0

pandas 1.0 support

We added pandas 1.0 support (#1197, #1299), and Koalas now can work with pandas 1.0.

map_in_pandas

We implemented the DataFrame.map_in_pandas API (#1276) so Koalas can run any arbitrary function that takes and returns a pandas DataFrame against a Koalas DataFrame. See the example below:

```python
>>> import databricks.koalas as ks
>>> df = ks.DataFrame({'A': range(2000), 'B': range(2000)})
>>> def query_func(pdf):
...     num = 1995
...     return pdf.query('A > @num')
...
>>> df.map_in_pandas(query_func)
         A     B
1996  1996  1996
1997  1997  1997
1998  1998  1998
1999  1999  1999
```

Standardize code style using Black

As a development only change, we added Black integration (#1301). Now, all code style is standardized automatically via running ./dev/reformat, and the style is checked as a part of ./dev/lint-python.

Other new features and improvements

We added the following new feature:

DataFrame:

  • query (#1273)
  • unstack (#1295)

Other improvements

  • Fix DataFrame.describe() to support multi-index columns. (#1279)
  • Add util function validate_bool_kwarg (#1281)
  • Rename data columns prior to filter to make sure the column names are as expected. (#1283)
  • Add an FAQ about Structured Streaming. (#1298)
  • Let extra options have higher priority to allow workarounds (#1296)
  • Implement 'keep' parameter for drop_duplicates (#1303)
  • Add a note when type hint is provided to DataFrame.apply (#1310)
  • Add a util method to verify temporary column names. (#1262)

- Python
Published by HyukjinKwon almost 6 years ago

https://github.com/databricks/koalas - Version 0.27.0

head ordering

Since Koalas doesn't guarantee the row ordering, head could return rows from any distributed partition, so the result is not deterministic, which might confuse users.

We added a configuration compute.ordered_head (#1231), and if it is set to True, Koalas performs natural ordering beforehand and the result will be the same as pandas'. The default value is False because the ordering will cause a performance overhead.

```py
>>> kdf = ks.DataFrame({'a': range(10)})
>>> pdf = kdf.to_pandas()
>>> pdf.head(3)
   a
0  0
1  1
2  2

>>> kdf.head(3)
   a
5  5
6  6
7  7
>>> kdf.head(3)
   a
0  0
1  1
2  2

>>> ks.options.compute.ordered_head = True
>>> kdf.head(3)
   a
0  0
1  1
2  2
>>> kdf.head(3)
   a
0  0
1  1
2  2
```

GitHub Actions

We started trying to use GitHub Actions for CI. (#1254, #1265, #1264, #1267, #1269)

Other new features and improvements

We added the following new feature:

DataFrame: - apply (#1259)

Other improvements

  • Fix identical and equals for the comparison between the same object. (#1220)
  • Select the series correctly in SeriesGroupBy APIs (#1224)
  • Fixes DataFrame/Series.clip function to preserve its index. (#1232)
  • Throw a better exception in DataFrame.sort_values when multi-index column is used (#1238)
  • Fix fillna not to change index values. (#1241)
  • Fix DataFrame.__setitem__ with tuple-named Series. (#1245)
  • Fix corr to support multi-index columns. (#1246)
  • Fix the output of print() for Series to match pandas' (#1250)
  • Fix fillna to support partial column index for multi-index columns. (#1244)
  • Add as_index check logic to groupby parameter (#1253)
  • Raise NotImplementedError for elements that are actually not implemented. (#1256)
  • Fix where to support multi-index columns. (#1249)

- Python
Published by ueshin about 6 years ago

https://github.com/databricks/koalas - Version 0.26.0

iat indexer

We continued to improve indexers. Now, iat indexer is supported too (#1062).

```python
>>> df = ks.DataFrame([[0, 2, 3], [0, 4, 1], [10, 20, 30]],
...                   columns=['A', 'B', 'C'])
>>> df
    A   B   C
0   0   2   3
1   0   4   1
2  10  20  30

>>> df.iat[1, 2]
1
```

Other new features and improvements

We added the following new features:

koalas.Index

  • equals (#1216)
  • identical (#1215)
  • is_all_dates (#1205)
  • append (#1163)
  • to_frame (#1187)

koalas.MultiIndex:

  • equals (#1216)
  • identical (#1215)
  • swaplevel (#1105)
  • is_all_dates (#1205)
  • is_monotonic_increasing (#1183)
  • is_monotonic_decreasing (#1183)
  • append (#1163)
  • to_frame (#1187)

koalas.DataFrameGroupBy

  • describe (#1168)

Other improvements

  • Change default write mode to overwrite to be consistent with pandas (#1209)
  • Prepare Spark 3 (#1211, #1181)
  • Fix DataFrame.idxmin/idxmax. (#1198)
  • Fix reset_index when the default index is "distributed-sequence". (#1193)
  • Fix column name as a tuple in multi column index (#1191)
  • Add favicon to doc (#1189)

- Python
Published by HyukjinKwon about 6 years ago

https://github.com/databricks/koalas - Version 0.25.0

loc and iloc indexers improvement

We improved loc and iloc indexers. Now, loc can support scalar values as indexers (#1172).

```python
>>> import databricks.koalas as ks

>>> df = ks.DataFrame([[1, 2], [4, 5], [7, 8]],
...                   index=['cobra', 'viper', 'sidewinder'],
...                   columns=['max_speed', 'shield'])
>>> df.loc['sidewinder']
max_speed    7
shield       8
Name: sidewinder, dtype: int64
>>> df.loc['sidewinder', 'max_speed']
7
```

In addition, Series derived from a different Frame can be used as indexers (#1155).

```python
>>> import databricks.koalas as ks

>>> ks.options.compute.ops_on_diff_frames = True

>>> df1 = ks.DataFrame({'A': [0, 1, 2, 3, 4], 'B': [100, 200, 300, 400, 500]},
...                    index=[20, 10, 30, 0, 50])
>>> df2 = ks.DataFrame({'A': [0, -1, -2, -3, -4], 'B': [-100, -200, -300, -400, -500]},
...                    index=[20, 10, 30, 0, 50])
>>> df1.A.loc[df2.A > -3].sort_index()
10    1
20    0
30    2
```

Lastly, loc now uses the natural order of the index, identically to pandas', when slicing (#1159, #1174, #1179). See the example below.

```python
>>> df = ks.DataFrame([[1, 2], [4, 5], [7, 8]],
...                   index=['cobra', 'viper', 'sidewinder'],
...                   columns=['max_speed', 'shield'])
>>> df.loc['cobra':'viper', 'max_speed']
cobra    1
viper    4
Name: max_speed, dtype: int64
```

Other new features and improvements

We added the following new features:

koalas.Series:

  • get (#1153)

koalas.Index

  • drop (#1117)
  • len (#1161)
  • set_names (#1134)
  • argmin (#1162)
  • argmax (#1162)

koalas.MultiIndex:

  • from_product (#1144)
  • drop (#1117)
  • len (#1161)
  • set_names (#1134)

Other improvements

  • Add support from_pandas for Index/MultiIndex. (#1170)
  • Add a hidden column __natural_order__. (#1146)
  • Introduce _LocIndexerLike and consolidate some logic. (#1149)
  • Refactor LocIndexerLike.__getitem__. (#1152)
  • Remove sort in GroupBy._reduce_for_stat_function. (#1147)
  • Randomize index in tests and fix some window-like functions. (#1151)
  • Explicitly don't support Index.duplicated (#1131)
  • Fix DataFrame._repr_html_(). (#1177)

- Python
Published by HyukjinKwon about 6 years ago

https://github.com/databricks/koalas - Version 0.24.0

NumPy's universal function (ufunc) compatibility

We added the compatibility of NumPy ufunc (#1127). Virtually all ufunc compatibilities in Koalas DataFrame were implemented. See the example below:

```python
>>> import databricks.koalas as ks
>>> import numpy as np
>>> kdf = ks.range(10)
>>> np.log(kdf)
         id
0       NaN
1  0.000000
2  0.693147
3  1.098612
4  1.386294
5  1.609438
6  1.791759
7  1.945910
8  2.079442
9  2.197225
```

Other new features and improvements

We added the following new features:

koalas:

  • to_numeric (#1060)

koalas.DataFrame:

  • idxmax (#1054)
  • idxmin (#1054)
  • pct_change (#1051)
  • info (#1124)

koalas.Index

  • fillna (#1102)
  • min (#1114)
  • max (#1114)
  • drop_duplicates (#1121)
  • nunique (#1132)
  • sort_values (#1120)

koalas.MultiIndex:

  • levshape (#1086)
  • min (#1114)
  • max (#1114)
  • sort_values (#1120)

koalas.SeriesGroupBy

  • head (#1050)

koalas.DataFrameGroupBy

  • head (#1050)

Other improvements

  • Setting index name / names for Series (#1079)
  • disable 'str' for 'SeriesGroupBy', disable 'DataFrame' for 'GroupBy' (#1097)
  • Support 'compute.ops_on_diff_frames' for NumPy ufunc compat in Series (#1128)
  • Support arithmetic and comparison APIs on same DataFrames (#1129)
  • Fix rename() for Index to support MultiIndex also (#1125)
  • Set the upper-bound for pandas. (#1137)
  • Fix _cum() for Series to work properly (#1113)
  • Fix value_counts() to work properly when dropna is True (#1116, #1142)

- Python
Published by HyukjinKwon about 6 years ago

https://github.com/databricks/koalas - Version 0.23.0

NumPy's universal function (ufunc) compatibility

We added the compatibility of NumPy ufunc (#1096, #1106). Virtually all ufunc compatibilities in Koalas Series were implemented. See the example below:

```python
>>> import databricks.koalas as ks
>>> import numpy as np
>>> kdf = ks.range(10)
>>> kser = np.sqrt(kdf.id)
>>> type(kser)
<class 'databricks.koalas.series.Series'>
>>> kser
0    0.000000
1    1.000000
2    1.414214
3    1.732051
4    2.000000
5    2.236068
6    2.449490
7    2.645751
8    2.828427
9    3.000000
```

Other new features and improvements

We added the following new features:

koalas:

  • option_context (#1077)

koalas.DataFrame:

  • where (#1018)
  • mask (#1018)
  • iterrows (#1070)

koalas.Series:

  • pop (#866)
  • first_valid_index (#1092)
  • pct_change (#1071)

koalas.Index

  • symmetric_difference (#953, #1059)
  • to_numpy (#1058)
  • transpose (#1056)
  • T (#1056)
  • dropna (#938)
  • shape (#1085)
  • value_counts (#949)

koalas.MultiIndex:

  • symmetric_difference (#953, #1059)
  • to_numpy (#1058)
  • transpose (#1056)
  • T (#1056)
  • dropna (#938)
  • shape (#1085)
  • value_counts (#949)

Other improvements

  • Fix comparison operators to treat NULL as False (#1029)
  • Make corr return koalas.DataFrame (#1069)
  • Include link to Help Thirsty Koalas Fund (#1082)
  • Add Null handling for different frames (#1083)
  • Allow Series.__getitem__ to take boolean Series (#1075)
  • Produce correct output against MultiIndex when 'compute.ops_on_diff_frames' is enabled (#1089)
  • Fix idxmax() / idxmin() for Series to work properly (#1078)

- Python
Published by HyukjinKwon about 6 years ago

https://github.com/databricks/koalas - Version 0.22.0

Enable Arrow 0.15.1+

Apache Arrow 0.15.0 did not work well with PySpark 2.4 so it was disabled in the previous version. With Arrow 0.15.1, now it works in Koalas (#902).

Expanding and Rolling

We also added expanding() and rolling() APIs to groupby(), Series, and DataFrame (#985, #991, #990, #1015, #996, #1034, #1037):

  • min
  • max
  • sum
  • mean
  • std
  • var
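These window APIs follow pandas semantics; a minimal sketch of the window behavior, shown here with pandas (whose semantics Koalas mirrors) and illustrative data:

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5])

# rolling(3).sum(): a sum over a sliding window of 3 rows;
# the first two windows are incomplete, hence NaN.
rolled = s.rolling(3).sum()
print(rolled.tolist())    # [nan, nan, 6.0, 9.0, 12.0]

# expanding(2).sum(): a cumulative sum once at least 2 rows are seen.
expanded = s.expanding(2).sum()
print(expanded.tolist())  # [nan, 3.0, 6.0, 10.0, 15.0]
```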

Multi-index columns support

We continue improving multi-index columns support. We made the following APIs support multi-index columns:

  • median (#995)
  • at (#1049)

Documentation

We added a "Best Practices" section in the documentation (#1041) so that Koalas users can read and follow it. Please see https://koalas.readthedocs.io/en/latest/user_guide/best_practices.html

Other new features and improvements

We added the following new features:

koalas.DataFrame:

  • quantile (#984)
  • explain (#1042)

koalas.Series:

  • between (#997)
  • update (#923)
  • mask (#1017)

koalas.MultiIndex:

  • from_tuples (#970)
  • from_arrays (#1001)

Along with the following improvements:

  • Introduce column_scols in InternalFrame as a substitute for data_columns. (#956)
  • Fix different index level assignment when 'compute.ops_on_diff_frames' is enabled (#1045)
  • Fix DataFrame.melt function and add a doctest case for melt (#987)
  • Enable creating Index from list like 'Index([1, 2, 3])' (#986)
  • Fix combine_frames to handle where the right hand side arguments are modified Series (#1020)
  • setup.py should support Python 2 to show a proper error message. (#1027)
  • Remove Series.schema. (#993)

- Python
Published by HyukjinKwon over 6 years ago

https://github.com/databricks/koalas - Version 0.21.0

Multi-index columns support

We continue improving multi-index columns support. We made the following APIs support multi-index columns:

  • nunique (#980)
  • to_csv (#983)

Documentation

Now we have an installation guide, design principles, and an FAQ in our public documentation (#914, #944, #963, #964).

Other new features and improvements

We added the following new features:

koalas

  • merge (#969)

koalas.DataFrame:

  • keys (#937)
  • ndim (#947)

koalas.Series:

  • keys (#935)
  • mode (#899)
  • truncate (#928)
  • xs (#921)
  • where (#922)
  • first_valid_index (#936)

koalas.Index:

  • copy (#939)
  • unique (#912)
  • ndim (#947)
  • has_duplicates (#946)
  • nlevels (#945)

koalas.MultiIndex:

  • copy (#939)
  • ndim (#947)
  • has_duplicates (#946)
  • nlevels (#945)

koalas.Expanding

  • count (#978)

Along with the following improvements:

  • Fix passing options as keyword arguments (#968)
  • Make is_monotonic* work properly for Index (#930)
  • Fix Series.__getitem__ to work properly (#934)
  • Fix reindex when all the given columns are included in the existing columns (#975)
  • Add datetime as the equivalent python type to TimestampType (#957)
  • Fix is_unique to respect the current Spark column (#981)
  • Fix a bug when assigning None as the name of an Index (#974)
  • Use name_like_string instead of str directly. (#942, #950)

- Python
Published by HyukjinKwon over 6 years ago

https://github.com/databricks/koalas - Version 0.20.0

Disable Arrow 0.15

Apache Arrow 0.15.0 was released on October 5, 2019. Koalas depends on Arrow to execute pandas UDFs, but the Spark community has reported an issue with PyArrow 0.15.

We decided to set an upper bound for pyarrow version to avoid such issues until we are sure that Koalas works fine with it.

  • Set an upper bound for pyarrow version. (#918)

Multi-index columns support

We continue improving multi-index columns support. We made the following APIs support multi-index columns:

  • pivot_table (#908)
  • melt (#920)

Other new features and improvements

We added the following new features:

koalas.DataFrame:

  • xs (#892)

koalas.Series:

  • drop_duplicates (#896)
  • replace (#903)

koalas.GroupBy:

  • shift (#910)

Along with the following improvements:

  • Implement nested renaming for groupby agg (#904)
  • Add 'index_col' parameter to DataFrame.to_spark (#906)
  • Add more options to read_csv (#916)
  • Add NamedAgg (#911)
  • Enable DataFrame setting value as list of labels (#905)
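The nested renaming for groupby agg and the NamedAgg addition above follow pandas' named-aggregation form; a minimal pandas sketch of the semantics Koalas mirrors (the column and key names here are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"key": ["a", "a", "b"], "val": [1, 2, 3]})

# Named aggregation: each keyword becomes an output column name, and
# the tuple names the source column and the aggregation function.
out = df.groupby("key").agg(total=("val", "sum"), biggest=("val", "max"))
print(out)
#      total  biggest
# key
# a        3        2
# b        3        3
```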

- Python
Published by ueshin over 6 years ago

https://github.com/databricks/koalas - Version 0.19.0

Koalas Logo

Now we have an official logo!

We can see the cute logo in our documents as well.

Documentation

Also we improved the documentation: https://koalas.readthedocs.io/en/latest/

  • Added the logo (#831)
  • Added a Jupyter notebook for 10 min tutorial (#843)
  • Added the tutorial to the documentation (#853)
  • Add some examples for plot implementations in their docstrings (#847)
  • Move contribution guide to the official documentation site (#841)

Binder integration for the 10 min tutorial

You can run a live Jupyter notebook for 10 min tutorial from Binder.

Multi-index columns support

We continue improving multi-index columns support. We made the following APIs support multi-index columns:

  • transform (#800)
  • round (#802)
  • unique (#809)
  • duplicated (#803)
  • assign (#811)
  • merge (#825)
  • plot (#830)
  • groupby and its functions (#833)
  • update (#848)
  • join (#848)
  • drop_duplicates (#856)
  • dtype (#858)
  • filter (#859)
  • dropna (#857)
  • replace (#860)

Plots

We also continue adding plot APIs as follows:

For DataFrame:

  • plot.kde() (#784)

Other new features and improvements

We added the following new features:

koalas.DataFrame:

  • pop (#791)
  • __iter__ (#836)
  • rename (#806)
  • expanding (#840)
  • rolling (#840)

koalas.Series:

  • aggregate (#816)
  • agg (#816)
  • expanding (#840)
  • rolling (#840)
  • drop (#829)
  • copy (#869)

koalas.DataFrameGroupBy:

  • expanding (#840)
  • rolling (#840)

koalas.SeriesGroupBy:

  • expanding (#840)
  • rolling (#840)

Along with the following improvements:

  • Add squeeze argument to read_csv (#812)
  • Raise a more helpful error for duplicated columns in Join (#820)
  • Fix an issue with ks.merge to Series (#818)
  • Fix MultiIndex.to_pandas() and __repr__(). (#832)
  • Add unit and origin options for to_datetime (#839)
  • Fix wrong error raised in DataFrame.fillna (#844)
  • Allow str and list in aggfunc in DataFrameGroupby.agg (#828)
  • Add index_col argument to to_koalas(). (#863)

- Python
Published by ueshin over 6 years ago

https://github.com/databricks/koalas - Version 0.18.0

Multi-index columns support

We continue improving multi-index columns support (#793, #776). We made the following APIs support multi-index columns:

  • applymap (#793)
  • shift (#793)
  • diff (#793)
  • fillna (#793)
  • rank (#793)

Also, we can set tuple or None name for Series and Index. (#776)

```python
>>> import databricks.koalas as ks
>>> kser = ks.Series([1, 2, 3])
>>> kser.name = ('a', 'b')
>>> kser
0    1
1    2
2    3
Name: (a, b), dtype: int64
```

Plots

We also continue adding plot APIs as follows:

For Series:

  • plot.kde() (#767)

For DataFrame:

  • plot.hist() (#780)

Options

In addition, we added the support for namespace-access in options (#785).

```python
>>> import databricks.koalas as ks
>>> ks.options.display.max_rows
1000
>>> ks.options.display.max_rows = 10
>>> ks.options.display.max_rows
10
```

See also User Guide of our project docs.

Other new features and improvements

We added the following new features:

koalas.DataFrame:

  • aggregate (#796)
  • agg (#796)
  • items (#787)

koalas.indexes.Index/MultiIndex

  • is_boolean (#795)
  • is_categorical (#795)
  • is_floating (#795)
  • is_integer (#795)
  • is_interval (#795)
  • is_numeric (#795)
  • is_object (#795)

Along with the following improvements:

  • Add index_col for read_json (#797)
  • Add index_col for spark IO reads (#769, #775)
  • Add "sep" parameter for read_csv (#777)
  • Add axis parameter to dataframe.diff (#774)
  • Add read_json and let to_json use spark.write.json (#753)
  • Use spark.write.csv in to_csv of Series and DataFrame (#749)
  • Handle TimestampType separately when convert to pandas' dtype. (#798)
  • Fix spark_df when set_index(.., drop=False). (#792)

Backward compatibility

  • We removed some parameters in DataFrame.to_csv and DataFrame.to_json to allow distributed writing (#749, #753)

- Python
Published by HyukjinKwon over 6 years ago

https://github.com/databricks/koalas - Version 0.17.0

Options

We started using options to configure Koalas' behavior. Now we have the following options:

  • display.max_rows (#714, #742)
  • compute.max_rows (#721, #736)
  • compute.shortcut_limit (#717)
  • compute.ops_on_diff_frames (#725)
  • compute.default_index_type (#723)
  • plotting.max_rows (#728)
  • plotting.sample_ratio (#737)

We can also see the list and their descriptions in the User Guide of our project docs.
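Koalas' option machinery follows pandas' get_option/set_option/reset_option pattern; as an illustrative sketch of that interaction (shown here with pandas, since the call shape matches):

```python
import pandas as pd

# Illustrative pattern only: read, override, then restore an option.
# Koalas exposes the same get_option/set_option/reset_option shape for
# the options listed above (e.g. 'display.max_rows').
default = pd.get_option("display.max_rows")
pd.set_option("display.max_rows", 10)
assert pd.get_option("display.max_rows") == 10
pd.reset_option("display.max_rows")
assert pd.get_option("display.max_rows") == default
```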

Plots

We continue adding plot APIs as follows:

For Series:

  • plot.area() (#704)

For DataFrame:

  • plot.line() (#686)
  • plot.bar() (#695)
  • plot.barh() (#698)
  • plot.pie() (#703)
  • plot.area() (#696)
  • plot.scatter() (#719)

Multi-index columns support

We also continue improving multi-index columns support. We made the following APIs support multi-index columns:

  • koalas.concat() (#680)
  • koalas.get_dummies() (#695)
  • DataFrame.pivot_table() (#635)

Other new features and improvements

We added the following new features:

koalas:

  • read_sql_table() (#741)
  • read_sql_query() (#741)
  • read_sql() (#741)

koalas.DataFrame:

  • style (#712)

Along with the following improvements:

  • GroupBy.apply should return Koalas DataFrame instead of pandas DataFrame (#731)
  • Fix rpow and rfloordiv to use proper operators in Series (#735)
  • Fix rpow and rfloordiv to use proper operators in DataFrame (#740)
  • Add schema inference support at DataFrame.transform (#732)
  • Add Option class to support type check and value check in options (#739)
  • Added missing tests (#687, #692, #694, #709, #711, #730, #729, #733, #734)

Backward compatibility

  • We renamed two of the default index names from one-by-one and distributed-one-by-one to sequence and distributed-sequence respectively. (#679)
  • We moved the configuration for enabling operations on different DataFrames from the environment variable to the option. (#725)
  • We moved the configuration for the default index from the environment variable to the option. (#723)

- Python
Published by ueshin over 6 years ago

https://github.com/databricks/koalas - Version 0.16.0

Firstly, we introduced a new mode to enable operations on different DataFrames (#633). This mode can be enabled by setting the OPS_ON_DIFF_FRAMES environment variable to true as below:

```python
>>> import databricks.koalas as ks

>>> kdf1 = ks.range(5)
>>> kdf2 = ks.DataFrame({'id': [5, 4, 3]})
>>> (kdf1 - kdf2).sort_index()
    id
0 -5.0
1 -3.0
2 -1.0
3  NaN
4  NaN
```

```python
>>> import databricks.koalas as ks

>>> kdf = ks.range(5)
>>> kdf['new_col'] = ks.Series([1, 2, 3, 4])
>>> kdf
   id  new_col
0   0      1.0
1   1      2.0
3   3      4.0
2   2      3.0
4   4      NaN
```

Secondly, we also introduced a default index and disallowed Koalas DataFrames with no index internally (#639, #655). For example, if you create a Koalas DataFrame from a Spark DataFrame, the default index is used. The default index implementation can be configured by setting DEFAULT_INDEX to one of three types:

  • (default) one-by-one: It implements a one-by-one sequence using a Window function without specifying a partition. This index type should be avoided when the data is large.

    ```python
    >>> ks.range(3)
       id
    0   0
    1   1
    2   2
    ```

  • distributed-one-by-one: It implements a one-by-one sequence via a group-by and group-map approach, while still generating a globally sequential one-by-one index. If the default index must be a one-by-one sequence over a large dataset, this index can be used.

    ```python
    >>> ks.range(3)
       id
    0   0
    1   1
    2   2
    ```

  • distributed: It implements a monotonically increasing sequence simply by using Spark's monotonically_increasing_id function. If the index does not have to be a one-by-one sequence, this index can be used. Performance-wise, this index has almost no penalty compared to the other index types.

    ```python
    >>> ks.range(3)
                 id
    25769803776   0
    60129542144   1
    94489280512   2
    ```

Thirdly, we implemented many plot APIs in Series as follows:

  • plot.pie() (#669)
  • plot.area() (#670)
  • plot.line() (#671)
  • plot.barh() (#673)

See the example below:

```python
>>> import databricks.koalas as ks
>>> ks.range(10).to_pandas().id.plot.pie()
```

(pie chart image)

Fourthly, we continued to improve multi-index column support. Multi-index columns are now supported in multiple APIs:

  • DataFrame.sort_index() (#637)
  • GroupBy.diff() (#653)
  • GroupBy.rank() (#653)
  • Series.any() (#652)
  • Series.all() (#652)
  • DataFrame.any() (#652)
  • DataFrame.all() (#652)
  • DataFrame.assign() (#657)
  • DataFrame.drop() (#658)
  • DataFrame.reindex() (#659)
  • Series.quantile() (#663)
  • Series.transform() (#663)
  • DataFrame.select_dtypes() (#662)
  • DataFrame.transpose() (#664)

Lastly, we added new functionalities in the past weeks, especially groupby-related ones. We added the following features:

koalas.DataFrame

  • duplicated() (#569)
  • fillna() (#640)
  • bfill() (#640)
  • pad() (#640)
  • ffill() (#640)

koalas.groupby.GroupBy:

  • diff() (#622)
  • nunique() (#617)
  • nlargest() (#654)
  • nsmallest() (#654)
  • idxmax() (#649)
  • idxmin() (#649)

Along with the following improvements:

  • Add a basic infrastructure for configurations. (#645)
  • Always use column_index. (#648)
  • Allow omitting the type hint in GroupBy.transform, filter and apply (#646)

- Python
Published by HyukjinKwon over 6 years ago

https://github.com/databricks/koalas - Version 0.15.0

We rapidly improved and added new functionalities, especially for groupby-related functionalities, in the past weeks. We also added the following features:

koalas.groupby.GroupBy:

  • size() (#593)
  • filter() (#614)
  • cummax() (#610)
  • cummin() (#610)
  • cumsum() (#610)
  • cumprod() (#610)
  • rank() (#619)

koalas.groupby.SeriesGroupBy:

  • apply() (#609)
  • value_counts() (#613)

koalas.indexes.Index:

  • size() (#623)

Along with the following improvements:

  • Add multiple aggregations on a single column (#602)
  • Add axis=columns to count, var, std, max, sum, min, kurtosis, skew and mean in DataFrame (#605)
  • Add Spark DDL formatted string support in read_csv(names=...) (#604)
  • Support names of index levels (#621, #629)
  • Add as_index argument to groupby. (#627)
  • Fix issues related to multi-index column access (#594, #597, #606, #611, #612, #620)
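The first improvement above uses pandas' dict-of-lists aggregation form, which Koalas mirrors. A minimal pandas sketch of the same call (the frame and column names are illustrative):

```python
import pandas as pd

# Two aggregations on the single column 'B', grouped by 'A'
pdf = pd.DataFrame({'A': [1, 1, 2], 'B': [3, 4, 5]})
res = pdf.groupby('A').agg({'B': ['min', 'max']})
# The result has MultiIndex columns ('B', 'min') and ('B', 'max')
```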

- Python
Published by ueshin over 6 years ago

https://github.com/databricks/koalas - Version 0.14.0

We added basic multi-index support for columns (#590), as below. pandas multi-indexes can also be mapped.

```python
>>> import databricks.koalas as ks
>>> import numpy as np
>>> arrays = [np.array(['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux']),
...           np.array(['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two'])]
>>> kdf = ks.DataFrame(np.random.randn(3, 8), index=['A', 'B', 'C'], columns=arrays)
```

```python
>>> kdf
        bar                 baz                 foo                 qux
        one       two       one       two       one       two       one       two
A -1.574777  0.805108  0.139748  1.287946 -1.782297 -0.152292  0.680594  1.419407
B  0.076886 -1.560807  0.403807 -0.715029  1.236899 -0.364483 -1.548554  0.076003
C -0.575168  0.061539 -2.083615 -0.816090 -1.267440  0.745949 -1.194421  0.468818
```

```python
>>> kdf['bar']
        one       two
A -1.574777  0.805108
B  0.076886 -1.560807
C -0.575168  0.061539
```

```python
>>> kdf['bar']['two']
A    0.805108
B   -1.560807
C    0.061539
Name: two, dtype: float64
```

In addition, we are triaging APIs to explicitly support or unsupport (#574, #580). Some pandas APIs are explicitly unsupported as guardrails to prevent users from shooting themselves in the foot, or for other reasons such as the cost of the operations.

We also added the following features:

koalas.DataFrame:

  • ffill() (#571)
  • bfill() (#570)
  • filter() (#589)

koalas.Series:

  • idxmax() (#587)
  • idxmin() (#587)

koalas.indexes.Index:

  • Index.rename() (#581)

koalas.groupby.GroupBy:

  • apply() (#584)
  • transform() (#585)

Along with the following improvements:

  • pandas 0.25 support (#579)
  • method and limit parameter support in DataFrame.fillna() (#565)
  • Dots (.) in column names are allowed (#490)
  • Add support of level argument for DataFrame/Series.sort_index() (#583)
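The method and limit parameters of fillna (#565) follow pandas semantics. A minimal pandas sketch of the behavior (modern pandas spells the same forward-fill-with-limit as ffill(limit=...)):

```python
import numpy as np
import pandas as pd

# Forward-fill, but fill at most one consecutive NaN per gap.
# Equivalent to fillna(method='ffill', limit=1) as added in this release.
s = pd.Series([1.0, np.nan, np.nan, 4.0])
filled = s.ffill(limit=1)
# The first NaN in the gap becomes 1.0; the second stays NaN
```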

- Python
Published by HyukjinKwon over 6 years ago

https://github.com/databricks/koalas - Version 0.13.0

We rapidly improved and added new functionalities in the past week. We also added the following features:

koalas.DataFrame:

  • diff (#562)
  • shift (#562)
  • round (#537)
  • rank (#546)
  • any (#568)
  • all (#568)

koalas.Series:

  • diff (#564)
  • quantile (#566)
  • shift (#563)
  • is_monotonic (#560)
  • is_monotonic_increasing (#560)
  • is_monotonic_decreasing (#560)
  • round (#537)
  • rank (#546)

- Python
Published by HyukjinKwon over 6 years ago

https://github.com/databricks/koalas - Version 0.12.0

We rapidly improved and added new functionalities in the past week. We also added the following features:

koalas:

  • isna (#548)
  • isnull (#548)
  • notna (#548)
  • notnull (#548)

koalas.DataFrame:

  • bool (#533)
  • reindex (#493)
  • pivot (#532)
  • transform (#541)
  • median (#544)
  • cumprod (#545)

koalas.Series:

  • cummax (#534)
  • cummin (#534)
  • cumsum (#534)
  • bool (#533)
  • median (#540)
  • transpose (#543)
  • T (#543)
  • cumprod (#545)
  • hasnans (#547)

Along with the following improvements:

  • Fix DataFrame.replace to accept kdf.replace({0: 10, 1: 100}) (#527)
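The dict form of replace shown in the fix above maps old values to new ones across the whole frame; Koalas follows the pandas semantics. A minimal pandas sketch:

```python
import pandas as pd

# Replace 0 with 10 and 1 with 100, everywhere in the frame
pdf = pd.DataFrame({'a': [0, 1, 2]})
replaced = pdf.replace({0: 10, 1: 100})
# -> a: [10, 100, 2]
```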

- Python
Published by HyukjinKwon over 6 years ago

https://github.com/databricks/koalas - Version 0.11.0

We fixed a critical regression for pandas 0.23.x compatibility (#528, #529). Now pandas 0.23.x support is back.

- Python
Published by HyukjinKwon over 6 years ago

https://github.com/databricks/koalas - Version 0.10.0

We added infrastructure for usage logging (#494). It allows using a custom logger to handle the success and failure of each API call. Koalas also ships a built-in logger, databricks.koalas.usage_logging.usage_logger, based on Python's logging module.

In addition, Koalas experimentally introduced type hints for both Series and DataFrame (#453). The new type hints are used as below:

```python
def func(...) -> ks.Series[np.float]:
    ...

def func(...) -> ks.DataFrame[np.float, int, str]:
    ...
```

We also added the following features:

koalas.DataFrame:

  • update (#498)
  • pivot_table (#386)
  • pow (#503)
  • rpow (#503)
  • mod (#503)
  • rmod (#503)
  • floordiv (#503)
  • rfloordiv (#503)
  • T (#469)
  • transpose (#469)
  • select_dtypes (#510)
  • replace (#495)
  • cummin (#521)
  • cummax (#521)
  • cumsum (#521)

koalas.Series:

  • rank (#516)

Along with the following improvements:

  • Remaining Koalas Series.str functions (#496)
  • nunique in koalas.groupby.GroupBy.agg (#512)

- Python
Published by HyukjinKwon over 6 years ago

https://github.com/databricks/koalas - Version 0.9.0

We bumped the supported MLflow version to 1.0, and we can now use a URI pointing to the model. Please see the MLflow documentation for more details. Note that we no longer support older MLflow versions. (#477)

We also added the following features:

koalas:

  • melt (#474)

koalas.DataFrame:

  • eq (#476)
  • ne (#476)
  • gt (#476)
  • ge (#476)
  • lt (#476)
  • le (#476)
  • join (#473)
  • melt (#474)
  • get_dtype_counts (#480)

koalas.Series:

  • eq (#476)
  • ne (#476)
  • gt (#476)
  • ge (#476)
  • lt (#476)
  • le (#476)
  • get_dtype_counts (#480)
  • to_frame (#483)

koalas.groupby.GroupBy:

  • all (#485)
  • any (#485)

Along with the following improvements:

  • The Koalas DataFrame constructor can now take a Koalas Series. (#470)
  • Many missing properties and functions were added to the Series.dt property (#478)
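The Series.dt accessor mirrors pandas' datetime accessor. A minimal pandas sketch of the kind of properties it exposes (the dates are illustrative):

```python
import pandas as pd

# Extract datetime components from a datetime Series via the .dt accessor
s = pd.Series(pd.to_datetime(['2019-05-01', '2019-06-15']))
months = s.dt.month   # -> [5, 6]
days = s.dt.day       # -> [1, 15]
```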

- Python
Published by ueshin over 6 years ago

https://github.com/databricks/koalas - Version 0.8.0

We added new functionalities, improved the documentation and fixed some bugs in the past week. Also, koalas.sql was improved (#448): Koalas DataFrames and some regular Python types can now be used directly in SQL queries, for instance as below:

```python
>>> mydf = ks.range(10)
>>> x = range(4)
>>> ks.sql("SELECT * from {mydf} WHERE id IN {x}")
   id
0   0
1   1
2   2
3   3
```

We also added the following features:

koalas

  • read_spark_io (#447)
  • read_table (#449)
  • read_delta (#456)

koalas.DataFrame:

  • append (#388)
  • from_records (#436)
  • to_parquet (#443)
  • to_spark_io (#447)
  • to_table (#449)
  • cache (#397)
  • to_delta (#456)
  • drop_duplicates (#458)

koalas.Series:

  • append (#388)
  • str (#429)
  • plot (#294)
  • hist (#294)

Along with the following improvements:

  • mean, sum, skew, kurtosis, min, max, std and var in DataFrame and Series support the numeric_only argument (#422)

- Python
Published by HyukjinKwon over 6 years ago

https://github.com/databricks/koalas - Version 0.7.0

We refined the internal structure, improved the documentation and added new functionalities in the past week.

We also added the following features:

koalas:

  • read_clipboard (#430)
  • read_excel (#430)
  • read_html (#430)

koalas.DataFrame:

  • at (#384)
  • nunique (#346)
  • add_prefix (#414)
  • add_suffix (#414)
  • add (#427)
  • radd (#427)
  • div (#427)
  • divide (#427)
  • rdiv (#427)
  • truediv (#427)
  • rtruediv (#427)
  • mul (#427)
  • multiply (#427)
  • rmul (#427)
  • sub (#427)
  • subtract (#427)
  • rsub (#427)

koalas.Series:

  • at (#384)
  • nunique (#346)
  • add_prefix (#414)
  • add_suffix (#414)
  • transform (#428)

- Python
Published by HyukjinKwon over 6 years ago

https://github.com/databricks/koalas - Version 0.6.0

We added basic integration with MLflow, so that models with the pyfunc flavor (which is most of them) can be loaded as predictors. These predictors then work on both pandas and Koalas DataFrames with no code change. See the documentation example for details. (#353)

We also added the following features:

koalas.DataFrame:

  • sort_index (#380)
  • applymap (#390)
  • empty (#391)

koalas.Series:

  • sort_values (#366)
  • to_list (#379)
  • sort_index (#380)
  • pipe (#392)
  • map (#389)
  • empty (#391)
  • add (#401)
  • radd (#401)
  • div (#401)
  • divide (#401)
  • rdiv (#401)
  • truediv (#401)
  • rtruediv (#401)
  • mul (#401)
  • multiply (#401)
  • rmul (#401)
  • sub (#401)
  • subtract (#401)
  • rsub (#401)

Along with the following improvements:

  • DataFrame.merge function now supports left_on and right_on arguments. (#381)
  • DataFrame.describe function now supports percentiles argument. (#378)
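The left_on/right_on support in merge (#381) follows the pandas API for joining on differently named key columns. A minimal pandas sketch (frame and column names are illustrative):

```python
import pandas as pd

# Join two frames whose key columns have different names
left = pd.DataFrame({'lkey': ['a', 'b'], 'v1': [1, 2]})
right = pd.DataFrame({'rkey': ['b', 'a'], 'v2': [3, 4]})
merged = left.merge(right, left_on='lkey', right_on='rkey')
# Each lkey row is matched with the rkey row holding the same value
```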

- Python
Published by ueshin over 6 years ago

https://github.com/databricks/koalas - Version 0.5.0

We refined the packaging and pushed Koalas to conda-forge as well as PyPI. Now we can install Koalas with the conda package manager:

```sh
conda install koalas -c conda-forge
```

We also added the following features:

koalas:

  • concat (#348)

koalas.DataFrame:

  • astype (#349)
  • to_records (#298)
  • size (#356)
  • iloc (#364)
  • describe (#375)

koalas.Series:

  • to_json (#358)
  • to_csv (#358)
  • dtypes (#355)
  • size (#356)
  • to_excel (#361)
  • iloc (#364)
  • all (#359)
  • any (#359)
  • dt (#295, #372)
  • describe (#375)

Along with the following improvements:

  • Explicitly marked functions that are deprecated in pandas, which we won't support without a special reason. (#342)
  • Introduced Index/MultiIndex corresponding to pandas', instead of reusing Series. (#341)

- Python
Published by ueshin almost 7 years ago

https://github.com/databricks/koalas - Version 0.4.0

We rapidly improved Koalas in documentation and added new functionalities in the past week. As of this release, all functions are documented. We also added the following features:

koalas:

  • range (#254) - for generating a distributed sequence of data
  • sql (#256) - for running SQL queries

koalas.DataFrame:

  • merge (#264)
  • to_json (#238)
  • to_csv (#239)
  • to_excel (#288)
  • to_clipboard (#257)
  • clip (#297)
  • to_latex (#297)

koalas.Series:

  • unique (#249)
  • to_clipboard (#257)
  • to_latex (#297)
  • clip (#297)
  • fillna (#317)
  • is_unique (#325)
  • sample (#327)

Along with the following improvements:

  • Design Principles and Contribution Guide (#246, #255)
  • DataFrame.drop now supports columns parameter (#253)
  • __repr__ and _repr_html_ improvements (#258): only the top 1000 values/rows are shown when a DataFrame or Series exceeds 1000.
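The columns parameter added to DataFrame.drop mirrors the pandas keyword form. A minimal pandas sketch of the call it enables:

```python
import pandas as pd

# drop(columns=...) instead of the older drop('y', axis=1) spelling
pdf = pd.DataFrame({'x': [1, 2], 'y': [3, 4]})
dropped = pdf.drop(columns=['y'])
```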

- Python
Published by rxin almost 7 years ago

https://github.com/databricks/koalas - Version 0.3.0

We fixed a critical bug for Python 3.5 that was introduced in v0.2.0 (#241).

Also we have added the following features:

koalas.DataFrame:

  • isin
  • to_dict

koalas.Series:

  • isin
  • to_dict

and improvements:

koalas.Series:

  • __add__ and __radd__ now support string concatenation
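String concatenation through __add__ and __radd__ follows pandas semantics. A minimal pandas sketch of both directions:

```python
import pandas as pd

s = pd.Series(['a', 'b'])
suffixed = s + '!'    # __add__: appends to each element
prefixed = '>' + s    # __radd__: prepends to each element
```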

koalas.groupby.GroupBy:

  • agg() now preserves the group keys as indices

and a lot of code and document cleanups.

- Python
Published by ueshin almost 7 years ago

https://github.com/databricks/koalas - Version 0.2.0

We have implemented a lot of major functionalities in the past week. Here's a summary of what's new in release v0.2.0.

spark.DataFrame:

  • to_koalas is monkey-patched into Spark's DataFrame API when the koalas package is imported

koalas.DataFrame:

  • count
  • corr
  • dtypes
  • groupby
  • sort_values now supports the ascending, na_position, and inplace parameters
  • to_numpy
  • to_pandas (with toPandas as an alias for compatibility with Spark)
  • to_string
  • Allow direct literal assignment to create a new column
  • Various stats functions now work with boolean type
  • In notebooks or REPL, automatically display the content of the DataFrame, similar to pandas
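Direct literal assignment works as it does in pandas: the literal is broadcast to every row of a new column. A minimal pandas sketch:

```python
import pandas as pd

pdf = pd.DataFrame({'id': [0, 1, 2]})
pdf['flag'] = True   # literal broadcast to every row as a new column
```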

koalas.Series:

  • alias (as an alias for rename function)
  • count
  • groupby
  • to_numpy
  • to_pandas (with toPandas as an alias for compatibility with Spark)
  • to_string
  • fillna
  • Various stats functions now work with boolean type
  • In notebooks or REPL, automatically display the content of the Series, similar to pandas

Significantly improved documentation of the project.

Last but not least, we have done some major refactoring of the codebase and its infrastructure to make it more amenable to changes in the future, e.g.

  • Now koalas.DataFrame wraps around a Spark DataFrame, rather than directly monkey patching all methods.
  • Doctests are enabled and can be run directly in PyCharm
  • Mypy type hint linter is added
  • Switched from nose to pytest for test infrastructure.
  • Introduced utility methods to support older versions of pandas. #210
  • Code coverage report

- Python
Published by rxin almost 7 years ago

https://github.com/databricks/koalas - Version 0.1.0

We rewrote the internals of Koalas to make it more extensible for upcoming features. We also laid down the foundation for API reference docs in this release.

- Python
Published by rxin almost 7 years ago

https://github.com/databricks/koalas - Version 0.0.6

This version significantly expands the amount of functions available. It is still meant to be a technology preview, and users are encouraged to report issues that they encounter with their current pandas code.

Noteworthy features:

  • indexing is now supported
  • slicing and accessing columns is much improved
  • most of the methods are accessible as stubs
  • support for N/A (fillna, dropna, etc.) has been added

We thank all the contributors who have contributed to this release.

- Python
Published by thunterdb almost 7 years ago

https://github.com/databricks/koalas - Version 0.0.5

This is the initial release outside Databricks.

This release is meant to be a technology preview. See the README.md file for more information.

- Python
Published by thunterdb almost 7 years ago